## ia64/xen-unstable

### view docs/src/interface/hypercalls.tex @ 6979:f8e7af29daa1

merge?
author cl349@firebug.cl.cam.ac.uk Tue Sep 20 09:43:46 2005 +0000 (2005-09-20) 750ad97f37b0 06d84bf87159
\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\chapter{Xen Hypercalls}
\label{a:hypercalls}

Hypercalls represent the procedural interface to Xen; this appendix
categorizes and describes the current set of hypercalls.

\section{Invoking Hypercalls}

Hypercalls are invoked in a manner analogous to system calls in a
conventional operating system; a software interrupt is issued which
vectors to an entry point within Xen. On x86\_32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
that this may only be issued from within ring 1. The particular
hypercall to be invoked is contained in {\tt EAX}; a list
mapping these values to symbolic hypercall names can be found
in {\tt xen/include/public/xen.h}.

On some occasions a set of hypercalls will be required to carry
out a higher-level function; a good example is when a guest
operating system wishes to context switch to a new process, which
requires updating various privileged CPU state. As an optimization
for these cases, there is a generic mechanism to issue a set of
hypercalls as a batch:

\begin{quote}
\hypercall{multicall(void *call\_list, int nr\_calls)}

Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
call\_list}. Each entry contains the hypercall operation code followed
by up to 7 word-sized arguments.
\end{quote}

Note that multicalls are provided purely as an optimization; there is
no requirement to use them when first porting a guest operating
system.


\section{Virtual CPU Setup}

At start of day, a guest operating system needs to set up the virtual
CPU it is executing on. This includes installing vectors for the
virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However, the very first thing a guest OS must set up is a pair
of hypervisor callbacks: these are the entry points which Xen will
use when it wishes to notify the guest OS of an occurrence.

\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
  event\_address, unsigned long failsafe\_selector, unsigned long
  failsafe\_address) }

Register the normal (``event'') and failsafe callbacks for
event processing. In each case the code segment selector and
address within that segment are provided.
The selectors must have RPL 1; in XenLinux we simply use the
kernel's CS for both
{\tt event\_selector} and {\tt failsafe\_selector}.

The value {\tt event\_address} specifies the address of the guest OS's
event handling and dispatch routine; the {\tt failsafe\_address}
specifies a separate entry point which is used only if a fault occurs
when Xen attempts to use the normal callback.
\end{quote}


After installing the hypervisor callbacks, the guest OS can
install a `virtual IDT' by using the following hypercall:

\begin{quote}
\hypercall{set\_trap\_table(trap\_info\_t *table)}

Install one or more entries into the per-domain
trap handler table (essentially a software version of the IDT).
Each entry in the array pointed to by {\tt table} includes the
exception vector number with the corresponding segment selector
and entry point. Most guest OSes can use the same handlers on
Xen as when running on the real hardware; an exception is the
page fault handler (exception vector 14) where a modified
stack-frame layout is used.

\end{quote}


\section{Scheduling and Timer}

Domains are preemptively scheduled by Xen according to the
parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
In addition, however, a domain may choose to explicitly
control certain behavior with the following hypercall:

\begin{quote}
\hypercall{sched\_op(unsigned long op)}

Request scheduling operation from hypervisor. The options are: {\it
yield}, {\it block}, and {\it shutdown}. {\it yield} keeps the
calling domain runnable but may cause a reschedule if other domains
are runnable. {\it block} removes the calling domain from the run
queue and causes it to sleep until an event is delivered to it. {\it
shutdown} is used to end the domain's execution; the caller can
additionally specify whether the domain should reboot, halt or
suspend.
\end{quote}

To aid the implementation of a process scheduler within a guest OS,
Xen provides a virtual programmable timer:

\begin{quote}
\hypercall{set\_timer\_op(uint64\_t timeout)}

Request a timer event to be sent at the specified system time (time
in nanoseconds since system boot). The hypercall actually passes the
64-bit timeout value as a pair of 32-bit values.

\end{quote}

Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op()}
allows block-with-timeout semantics.


\section{Page Table Management}

Since guest operating systems have read-only access to their page
tables, Xen must be involved when making any changes. The following
multi-purpose hypercall can be used to modify page-table entries,
update the machine-to-physical mapping table, flush the TLB, install
a new page-table base pointer, and more.

\begin{quote}
\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}

Update the page table for the domain; a set of {\tt count} updates is
submitted for processing in a batch, with {\tt success\_count} being
updated to report the number of successful updates.

Each element of {\tt req[]} contains a pointer (address) and value;
the least significant two bits of the pointer are used to distinguish
the type of update requested as follows:
\begin{description}

\item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
page table entry to the associated value; Xen will check that the
update is safe, as described in Chapter~\ref{c:memory}.

\item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the
machine-to-physical table. The calling domain must own the machine
page in question (or be privileged).

\item[\it MMU\_EXTENDED\_COMMAND:] perform additional MMU operations.
The set of additional MMU operations is considerable, and includes
updating {\tt cr3} (or just re-installing it for a TLB flush),
flushing the cache, installing a new LDT, or pinning \& unpinning
page-table pages (to ensure their reference count doesn't drop to zero,
which would require a revalidation of all entries).

Further extended commands are used to deal with granting and
acquiring page ownership; see Section~\ref{s:idc}.


\end{description}

More details on the precise format of all commands can be
found in {\tt xen/include/public/xen.h}.


\end{quote}

Explicitly updating batches of page table entries is extremely
efficient, but can require a number of alterations to the guest
OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
recommended for new OS ports.

Regardless of which page table update mode is being used, however,
there are some occasions (notably handling a demand page fault) where
a guest OS will wish to modify exactly one PTE rather than a
batch. This is catered for by the following:

\begin{quote}
\hypercall{update\_va\_mapping(unsigned long page\_nr, unsigned long
val, \\ unsigned long flags)}

Update the currently installed PTE for the page {\tt page\_nr} to
{\tt val}. As with {\tt mmu\_update()}, Xen checks the modification
is safe before applying it. The {\tt flags} determine which kind
of TLB flush, if any, should follow the update.

\end{quote}

Finally, sufficiently privileged domains may occasionally wish to manipulate
the pages of others:
\begin{quote}

\hypercall{update\_va\_mapping\_otherdomain(unsigned long page\_nr,
unsigned long val, unsigned long flags, uint16\_t domid)}

Identical to {\tt update\_va\_mapping()} save that the pages being
mapped must belong to the domain {\tt domid}.
\end{quote}

This privileged operation is currently used by backend virtual device
drivers to safely map pages containing I/O data.


\section{Segmentation Support}

Xen allows guest OSes to install a custom GDT if they require it;
this is context switched transparently whenever a domain is
[de]scheduled. The following hypercall is effectively a
`safe' version of {\tt lgdt}:

\begin{quote}
\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}

Install a global descriptor table for a domain; {\tt frame\_list} is
an array of up to 16 machine page frames within which the GDT resides,
with {\tt entries} being the actual number of descriptor-entry
slots. All page frames must be mapped read-only within the guest's
address space, and the table must be large enough to contain Xen's
reserved entries (see {\tt xen/include/public/arch-x86\_32.h}).

\end{quote}

Many guest OSes will also wish to install LDTs; this is achieved by
using {\tt mmu\_update()} with an extended command, passing the
linear address of the LDT base along with the number of entries. No
special safety checks are required; Xen needs to perform this task
simply because {\tt lldt} requires CPL 0.


Xen also allows guest operating systems to update just an
individual segment descriptor in the GDT or LDT:

\begin{quote}
\hypercall{update\_descriptor(unsigned long ma, unsigned long word1,
unsigned long word2)}

Update the GDT/LDT entry at machine address {\tt ma}; the new
8-byte descriptor is stored in {\tt word1} and {\tt word2}.
Xen performs a number of checks to ensure the descriptor is
valid.

\end{quote}

Guest OSes can use the above in place of context switching entire
LDTs (or the GDT) when the number of changing descriptors is small.
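
As an illustration of how the two words passed to {\tt update\_descriptor()}
might be assembled, the following C sketch packs a base, limit, access byte
and flags nibble into the standard x86 8-byte segment descriptor layout. It
assumes that {\tt word1} carries the low 32 bits of the descriptor and
{\tt word2} the high 32 bits, which the text above does not state;
{\tt demo\_desc\_t} and {\tt demo\_pack\_descriptor()} are hypothetical names
and not part of the Xen interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for the two halves of an 8-byte descriptor.
 * Assumption: word1 = low 32 bits, word2 = high 32 bits. */
typedef struct {
    uint32_t word1;
    uint32_t word2;
} demo_desc_t;

/* Pack the fields of an x86 segment descriptor.
 *   base   : 32-bit segment base
 *   limit  : 20-bit segment limit
 *   access : access byte (P, DPL, S, type)
 *   flags  : upper nibble flags (G, D/B, L, AVL) as a 4-bit value */
demo_desc_t demo_pack_descriptor(uint32_t base, uint32_t limit,
                                 uint8_t access, uint8_t flags)
{
    demo_desc_t d;

    assert(limit <= 0xFFFFFu);  /* the limit field is only 20 bits wide */
    assert(flags <= 0xFu);      /* the flags field is only 4 bits wide  */

    /* Low word: limit[15:0] in bits 0-15, base[15:0] in bits 16-31. */
    d.word1 = (limit & 0xFFFFu) | ((base & 0xFFFFu) << 16);

    /* High word: base[23:16], access byte, limit[19:16], flags, base[31:24]. */
    d.word2 = ((base >> 16) & 0xFFu)
            | ((uint32_t)access << 8)
            | (limit & 0xF0000u)
            | ((uint32_t)flags << 20)
            | (base & 0xFF000000u);

    return d;
}
```

For example, a flat 4GB code segment with DPL 1 (base 0, limit
{\tt 0xFFFFF}, access byte {\tt 0xBA}, flags {\tt 0xC}) packs to
{\tt word1 = 0x0000FFFF} and {\tt word2 = 0x00CFBA00}. Xen would still apply
its own validity checks to any descriptor built this way.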
\section{Context Switching}

When a guest OS wishes to context switch between two processes,
it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition,
however, it will need to invoke Xen to switch the kernel (ring 1)
stack pointer:

\begin{quote}
\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}

Request kernel stack switch from hypervisor; {\tt ss} is the new
stack segment, while {\tt esp} is the new stack pointer.

\end{quote}

A final useful hypercall for context switching allows ``lazy''
save and restore of floating point state:

\begin{quote}
\hypercall{fpu\_taskswitch(void)}

This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
control register; this means that the next attempt to use floating
point will cause a trap which the guest OS can catch. Typically it will
then save/restore the FP state, and clear the {\tt TS} bit.
\end{quote}

This is provided as an optimization only; guest OSes can also choose
to save and restore FP state on all context switches for simplicity.


\section{Physical Memory Management}

As mentioned previously, each domain has a maximum and current
memory allocation. The maximum allocation, set at domain creation
time, cannot be modified. However a domain can choose to reduce
and subsequently grow its current allocation by using the
following call:

\begin{quote}
\hypercall{dom\_mem\_op(unsigned int op, unsigned long *extent\_list,
  unsigned long nr\_extents, unsigned int extent\_order)}

Increase or decrease current memory allocation (as determined by
the value of {\tt op}). Each invocation provides a list of
extents each of which is $2^s$ pages in size,
where $s$ is the value of {\tt extent\_order}.
\end{quote}

In addition to simply reducing or increasing the current memory
allocation via a `balloon driver', this call is also useful for
obtaining contiguous regions of machine memory when required (e.g.
for certain PCI devices, or if using superpages).
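
The extent arithmetic used by {\tt dom\_mem\_op()} can be sketched in C as
follows. The helper names, {\tt DEMO\_PAGE\_SHIFT} and {\tt DEMO\_MAX\_ORDER}
are hypothetical, and the 4KB page size is an assumption that matches
x86\_32 rather than something the interface fixes. The code only computes
sizes; it does not issue a hypercall.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4KB pages, as on x86_32. */
#define DEMO_PAGE_SHIFT 12

/* Hypothetical cap that keeps the arithmetic below within 64 bits
 * for any 32-bit extent count. */
#define DEMO_MAX_ORDER 19

/* Number of pages in one extent: 2^s, where s is extent_order. */
uint64_t demo_extent_pages(unsigned int extent_order)
{
    assert(extent_order <= DEMO_MAX_ORDER);
    return (uint64_t)1 << extent_order;
}

/* Total bytes covered by nr_extents extents of the given order.
 * nr_extents is taken as 32 bits here for simplicity; the hypercall
 * itself declares it as unsigned long. */
uint64_t demo_request_bytes(uint32_t nr_extents, unsigned int extent_order)
{
    uint64_t pages = (uint64_t)nr_extents * demo_extent_pages(extent_order);
    return pages << DEMO_PAGE_SHIFT;
}
```

Under these assumptions an order-0 extent is a single page, while a request
for 4 extents of order 10 covers 4096 pages, i.e. 16MB of machine memory in
four contiguous 4MB regions.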
\section{Inter-Domain Communication}
\label{s:idc}

Xen provides a simple asynchronous notification mechanism via
\emph{event channels}. Each domain has a set of end-points (or
\emph{ports}) which may be bound to an event source (e.g. a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
end-points in two different domains are bound together, then a `send'
operation on one will cause an event to be received by the destination
domain.

The control and use of event channels involves the following hypercall:

\begin{quote}
\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

Inter-domain event-channel management; {\tt op} is a discriminated
union which allows the following 7 operations:

\begin{description}

\item[\it alloc\_unbound:] allocate a free (unbound) local
  port and prepare for connection from a specified domain.
\item[\it bind\_virq:] bind a local port to a virtual
IRQ; any particular VIRQ can be bound to at most one port per domain.
\item[\it bind\_pirq:] bind a local port to a physical IRQ;
once more, a given PIRQ can be bound to at most one port per
domain. Furthermore the calling domain must be sufficiently
privileged.
\item[\it bind\_interdomain:] construct an interdomain event
channel; in general, the target domain must have previously allocated
an unbound port for this channel, although this can be bypassed by
privileged domains during domain setup.
\item[\it close:] close an interdomain event channel.
\item[\it send:] send an event to the remote end of an
interdomain event channel.
\item[\it status:] determine the current status of a local port.
\end{description}

For more details see
{\tt xen/include/public/event\_channel.h}.

\end{quote}
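
To make the phrase `discriminated union' concrete, the following C sketch
models a tagged operation structure and a toy handler for two of the seven
operations. Everything here is hypothetical: the {\tt demo\_} names, the
field layout, the table sizes and the return values are illustrative
assumptions, and the real definitions are those in
{\tt xen/include/public/event\_channel.h}. The handler enforces the rule
stated above that a VIRQ can be bound to at most one port per domain.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_VIRQS 8    /* hypothetical table sizes */
#define DEMO_NR_PORTS 16

/* Hypothetical tag values; only two of the seven operations are modelled. */
enum demo_cmd { DEMO_BIND_VIRQ, DEMO_STATUS };

typedef struct {
    enum demo_cmd cmd;                                       /* discriminator */
    union {
        struct { uint32_t virq; uint32_t port; } bind_virq;  /* in: virq, out: port  */
        struct { uint32_t port; int bound; } status;         /* in: port, out: bound */
    } u;
} demo_evtchn_op_t;

/* Toy per-domain state: 0 means "unbound"; port 0 is never allocated. */
static uint32_t demo_virq_to_port[DEMO_NR_VIRQS];
static uint32_t demo_port_in_use[DEMO_NR_PORTS];

/* Inspect the tag, then touch only the matching union member.
 * Returns 0 on success and -1 on failure (a convention of this sketch). */
int demo_evtchn_op(demo_evtchn_op_t *op)
{
    uint32_t p;

    switch (op->cmd) {
    case DEMO_BIND_VIRQ:
        if (op->u.bind_virq.virq >= DEMO_NR_VIRQS ||
            demo_virq_to_port[op->u.bind_virq.virq] != 0)
            return -1;                 /* unknown VIRQ, or already bound */
        for (p = 1; p < DEMO_NR_PORTS; p++) {
            if (!demo_port_in_use[p]) {
                demo_port_in_use[p] = 1;
                demo_virq_to_port[op->u.bind_virq.virq] = p;
                op->u.bind_virq.port = p;
                return 0;
            }
        }
        return -1;                     /* no free port */
    case DEMO_STATUS:
        if (op->u.status.port >= DEMO_NR_PORTS)
            return -1;
        op->u.status.bound = demo_port_in_use[op->u.status.port] != 0;
        return 0;
    }
    return -1;                         /* unrecognised tag */
}
```

Binding VIRQ 3 twice through this handler succeeds the first time (yielding
port 1 in the toy port table) and fails the second time; a subsequent status
query on port 1 reports it as bound.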

Event channels are the fundamental communication primitive between
Xen domains and seamlessly support SMP. However they provide little
bandwidth for communication {\sl per se}, and hence are typically
married with a piece of shared memory to produce effective and
high-performance inter-domain communication.

Safe sharing of memory pages between guest OSes is carried out by
granting access on a per-page basis to individual domains. This is
achieved by using the {\tt grant\_table\_op()} hypercall.

\begin{quote}
\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

Grant a particular domain access to a particular page, or remove that access.

\end{quote}

This is not currently widely in use by guest operating systems, but
we intend to integrate support more fully in the near future.
\section{PCI Configuration}

Domains with physical device access (i.e.\ driver domains) receive
limited access to certain PCI devices (bus address space and
interrupts). However many guest operating systems attempt to
determine the PCI configuration by directly accessing the PCI BIOS,
which cannot be allowed for safety.

Instead, Xen provides the following hypercall:

\begin{quote}
\hypercall{physdev\_op(void *physdev\_op)}

Perform a PCI configuration operation; depending on the value
of {\tt physdev\_op} this can be a PCI config read, a PCI config
write, or a small number of other queries.

\end{quote}


For examples of using {\tt physdev\_op()}, see the
Xen-specific PCI code in the Linux sparse tree.

\section{Administrative Operations}
\label{s:dom0ops}

A large number of control operations are available to a sufficiently
privileged domain (typically domain 0). These allow the creation and
management of new domains, for example. A complete list is given
below; for more details on any or all of these, please see
{\tt xen/include/public/dom0\_ops.h}.


\begin{quote}
\hypercall{dom0\_op(dom0\_op\_t *op)}

Administrative domain operations for domain management. The options are:

\begin{description}
\item [\it DOM0\_CREATEDOMAIN:] create a new domain

\item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
queue

\item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
once again

\item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated
with a domain

\item [\it DOM0\_GETMEMLIST:] get list of pages used by the domain

\item [\it DOM0\_SCHEDCTL:]

\item [\it DOM0\_BUILDDOMAIN:] do final guest OS setup for domain

\item [\it DOM0\_GETDOMAININFO:] get statistics about the domain

\item [\it DOM0\_GETPAGEFRAMEINFO:]

\item [\it DOM0\_GETPAGEFRAMEINFO2:]

\item [\it DOM0\_IOPL:] set I/O privilege level

\item [\it DOM0\_MSR:] read or write model specific registers

\item [\it DOM0\_DEBUG:] interactively invoke the debugger

\item [\it DOM0\_SETTIME:] set system time

\item [\it DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU

\item [\it DOM0\_GETTBUFS:] get information about the size and location of
the trace buffers (only on trace-buffer enabled builds)

\item [\it DOM0\_PHYSINFO:] get information about the host machine

\item [\it DOM0\_PCIDEV\_ACCESS:] modify PCI device access permissions

\item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler

\item [\it DOM0\_SETDOMAININITIALMEM:] set initial memory allocation of a domain

\item [\it DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain

\item [\it DOM0\_SETDOMAINVMASSIST:] set domain VM assist options
\end{description}
\end{quote}

Most of the above are best understood by looking at the code
implementing them (in {\tt xen/common/dom0\_ops.c}) and in
the user-space tools that use them (mostly in {\tt tools/libxc}).
\section{Debugging Hypercalls}

A few additional hypercalls are mainly useful for debugging:

\begin{quote}
\hypercall{console\_io(int cmd, int count, char *str)}

Use Xen to interact with the console; operations are:

{\it CONSOLEIO\_write}: Output {\tt count} characters from buffer {\tt str}.

{\it CONSOLEIO\_read}: Input at most {\tt count} characters into buffer {\tt str}.
\end{quote}

A pair of hypercalls allows access to the underlying debug registers:
\begin{quote}
\hypercall{set\_debugreg(int reg, unsigned long value)}

Set debug register {\tt reg} to {\tt value}.

\hypercall{get\_debugreg(int reg)}

Return the contents of the debug register {\tt reg}.
\end{quote}

And finally:
\begin{quote}
\hypercall{xen\_version(int cmd)}

Request Xen version number.
\end{quote}

This is useful to ensure that user-space tools are in sync
with the underlying hypervisor.
\section{Deprecated Hypercalls}

Xen is under constant development and refinement; as such there
are plans to improve the way in which various pieces of functionality
are exposed to guest OSes.

\begin{quote}
\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

Toggle various memory management modes (in particular writable page
tables and superpage support).

\end{quote}

This is likely to be replaced with mode values in the shared
information page since this is more resilient for resumption
after migration or checkpoint.