1 \documentclass[11pt,twoside,final,openright]{report}
2 \usepackage{a4,graphicx,html,setspace,times}
3 \usepackage{comment,parskip}
4 \setstretch{1.15}
6 % LIBRARY FUNCTIONS
8 \newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}
10 \begin{document}
12 % TITLE PAGE
13 \pagestyle{empty}
14 \begin{center}
15 \vspace*{\fill}
16 \includegraphics{figs/xenlogo.eps}
17 \vfill
18 \vfill
19 \vfill
20 \begin{tabular}{l}
21 {\Huge \bf Interface manual} \\[4mm]
22 {\huge Xen v3.0 for x86} \\[80mm]
24 {\Large Xen is Copyright (c) 2002-2005, The Xen Team} \\[3mm]
25 {\Large University of Cambridge, UK} \\[20mm]
26 \end{tabular}
27 \end{center}
29 {\bf DISCLAIMER: This documentation is always under active development
30 and as such there may be mistakes and omissions --- watch out for
31 these and please report any you find to the developer's mailing list.
32 The latest version is always available on-line. Contributions of
33 material, suggestions and corrections are welcome. }
35 \vfill
36 \cleardoublepage
38 % TABLE OF CONTENTS
39 \pagestyle{plain}
40 \pagenumbering{roman}
41 { \parskip 0pt plus 1pt
42 \tableofcontents }
43 \cleardoublepage
45 % PREPARE FOR MAIN TEXT
46 \pagenumbering{arabic}
47 \raggedbottom
48 \widowpenalty=10000
49 \clubpenalty=10000
50 \parindent=0pt
51 \parskip=5pt
52 \renewcommand{\topfraction}{.8}
53 \renewcommand{\bottomfraction}{.8}
54 \renewcommand{\textfraction}{.2}
55 \renewcommand{\floatpagefraction}{.8}
56 \setstretch{1.1}
58 \chapter{Introduction}
60 Xen allows the hardware resources of a machine to be virtualized and
61 dynamically partitioned, allowing multiple different {\em guest}
62 operating system images to be run simultaneously. Virtualizing the
63 machine in this manner provides considerable flexibility, for example
64 allowing different users to choose their preferred operating system
65 (e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
66 provides secure partitioning between virtual machines (known as
67 {\em domains} in Xen terminology), and enables better resource
68 accounting and QoS isolation than can be achieved with a conventional
69 operating system.
71 Xen essentially takes a `whole machine' virtualization approach as
72 pioneered by IBM VM/370. However, unlike VM/370 or more recent
73 efforts such as VMware and Virtual PC, Xen does not attempt to
74 completely virtualize the underlying hardware. Instead, parts of the
75 hosted guest operating systems are modified to work with the VMM; the
76 operating system is effectively ported to a new target architecture,
77 typically requiring changes in just the machine-dependent code. The
78 user-level API is unchanged, and so existing binaries and operating
79 system distributions work without modification.
81 In addition to exporting virtualized instances of CPU, memory, network
82 and block devices, Xen exposes a control interface to manage how these
83 resources are shared between the running domains. Access to the
84 control interface is restricted: it may only be used by one
85 specially-privileged VM, known as {\em domain 0}. This domain is a
86 required part of any Xen-based server and runs the application software
87 that manages the control-plane aspects of the platform. Running the
88 control software in {\it domain 0}, distinct from the hypervisor
89 itself, allows the Xen framework to separate the notions of
90 mechanism and policy within the system.
93 \chapter{Virtual Architecture}
95 In a Xen/x86 system, only the hypervisor runs with full processor
96 privileges ({\it ring 0} in the x86 four-ring model). It has full
97 access to the physical memory available in the system and is
98 responsible for allocating portions of it to running domains.
100 On a 32-bit x86 system, guest operating systems may use {\it rings 1},
101 {\it 2} and {\it 3} as they see fit. Segmentation is used to prevent
102 the guest OS from accessing the portion of the address space that is
103 reserved for Xen. We expect most guest operating systems will use
104 ring 1 for their own operation and place applications in ring 3.
106 On 64-bit systems it is not possible to protect the hypervisor from
107 untrusted guest code running in rings 1 and 2. Guests are therefore
108 restricted to run in ring 3 only. The guest kernel is protected from its
109 applications by context switching between the kernel and currently
110 running application.
112 In this chapter we consider the basic virtual architecture provided by
113 Xen: CPU state, exception and interrupt handling, and time.
114 Other aspects such as memory and device access are discussed in later
115 chapters.
118 \section{CPU state}
120 All privileged state must be handled by Xen. The guest OS has no
121 direct access to CR3 and is not permitted to update privileged bits in
122 EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
123 these are analogous to system calls but occur from ring 1 to ring 0.
125 A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.
128 \section{Exceptions}
130 A virtual IDT is provided --- a domain can submit a table of trap
131 handlers to Xen via the {\bf set\_trap\_table} hypercall. The
132 exception stack frame presented to a virtual trap handler is identical
133 to its native equivalent.
136 \section{Interrupts and events}
138 Interrupts are virtualized by mapping them to \emph{event channels},
139 which are delivered asynchronously to the target domain using a callback
140 supplied via the {\bf set\_callbacks} hypercall. A guest OS can map
141 these events onto its standard interrupt dispatch mechanisms. Xen is
142 responsible for determining the target domain that will handle each
143 physical interrupt source. For more details on the binding of event
144 sources to event channels, see Chapter~\ref{c:devices}.
147 \section{Time}
149 Guest operating systems need to be aware of the passage of both real
150 (or wallclock) time and their own `virtual time' (the time for which
151 they have been executing). Furthermore, Xen has a notion of time which
152 is used for scheduling. The following notions of time are provided:
154 \begin{description}
155 \item[Cycle counter time.]
157 This provides a fine-grained time reference. The cycle counter time
158 is used to accurately extrapolate the other time references. On SMP
159 machines it is currently assumed that the cycle counter time is
160 synchronized between CPUs. The current x86-based implementation
161 achieves this within inter-CPU communication latencies.
163 \item[System time.]
165 This is a 64-bit counter which holds the number of nanoseconds that
166 have elapsed since system boot.
168 \item[Wall clock time.]
170 This is the time of day in a Unix-style {\bf struct timeval}
171 (seconds and microseconds since 1 January 1970, adjusted by leap
172 seconds). An NTP client hosted by {\it domain 0} can keep this
173 value accurate.
175 \item[Domain virtual time.]
177 This progresses at the same pace as system time, but only while a
178 domain is executing --- it stops while a domain is de-scheduled.
179 Therefore the share of the CPU that a domain receives is indicated
180 by the rate at which its virtual time increases.
182 \end{description}
185 Xen exports timestamps for system time and wall-clock time to guest
186 operating systems through a shared page of memory. Xen also provides
187 the cycle counter time at the instant the timestamps were calculated,
188 and the CPU frequency in Hertz. This allows the guest to extrapolate
189 system and wall-clock times accurately based on the current cycle
190 counter time.
192 Since all time stamps need to be updated and read \emph{atomically},
193 a version number is also stored in the shared info page, which is
194 incremented before and after updating the timestamps. Thus a guest can
195 be sure that it read a consistent state by checking the two version
196 numbers are equal and even.
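As a minimal sketch of this consistent-read protocol (assuming {\tt si} points at the guest's mapping of the shared info page described later in this manual, and that {\tt rmb()} is a read memory barrier supplied by the guest rather than by Xen):
\scriptsize
\begin{verbatim}
/* Illustrative sketch: read the wallclock fields consistently.  Retry
 * while an update is in progress (odd version) or while the version
 * changed mid-read. */
void read_wallclock(shared_info_t *si, uint32_t *sec, uint32_t *nsec)
{
    uint32_t version;
    do {
        version = si->wc_version;
        rmb();
        *sec  = si->wc_sec;
        *nsec = si->wc_nsec;
        rmb();
    } while ((version & 1) || (version != si->wc_version));
}
\end{verbatim}
\normalsize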
198 Xen includes a periodic ticker which sends a timer event to the
199 currently executing domain every 10ms. The Xen scheduler also sends a
200 timer event whenever a domain is scheduled; this allows the guest OS
201 to adjust for the time that has passed while it has been inactive. In
202 addition, Xen allows each domain to request that it receive a timer
203 event sent at a specified system time by using the {\bf
204 set\_timer\_op} hypercall. Guest OSes may use this timer to
205 implement timeout values when they block.
209 %% % akw: demoting this to a section -- not sure if there is any point
210 %% % though, maybe just remove it.
212 % KAF: Remove these random sections!
213 \begin{comment}
214 \section{Xen CPU Scheduling}
216 Xen offers a uniform API for CPU schedulers. It is possible to choose
217 from a number of schedulers at boot and it should be easy to add more.
218 The BVT, Atropos and Round Robin schedulers are part of the normal Xen
219 distribution. BVT provides proportional fair shares of the CPU to the
220 running domains. Atropos can be used to reserve absolute shares of
221 the CPU for each domain. Round-robin is provided as an example of
222 Xen's internal scheduler API.
224 \paragraph*{Note: SMP host support}
225 Xen has always supported SMP host systems. Domains are statically
226 assigned to CPUs, either at creation time or when manually pinning to
227 a particular CPU. The current schedulers then run locally on each CPU
228 to decide which of the assigned domains should be run there. The
229 user-level control software can be used to perform coarse-grain
230 load-balancing between CPUs.
231 \end{comment}
234 %% More information on the characteristics and use of these schedulers
235 %% is available in {\bf Sched-HOWTO.txt}.
238 \section{Privileged operations}
240 Xen exports an extended interface to privileged domains (viz.\ {\it
241 Domain 0}). This allows such domains to build and boot other domains
242 on the server, and provides control interfaces for managing
243 scheduling, memory, networking, and block devices.
245 \chapter{Memory}
246 \label{c:memory}
248 Xen is responsible for managing the allocation of physical memory to
249 domains, and for ensuring safe use of the paging and segmentation
250 hardware.
253 \section{Memory Allocation}
255 As well as allocating a portion of physical memory for its own private
256 use, Xen also reserves a small fixed portion of every virtual address
257 space. This is located in the top 64MB on 32-bit systems, the top
258 168MB on PAE systems, and a larger portion in the middle of the
259 address space on 64-bit systems. Unreserved physical memory is
260 available for allocation to domains at a page granularity. Xen tracks
261 the ownership and use of each page, which allows it to enforce secure
262 partitioning between domains.
264 Each domain has a maximum and current physical memory allocation. A
265 guest OS may run a `balloon driver' to dynamically adjust its current
266 memory allocation up to its limit.
269 \section{Pseudo-Physical Memory}
271 Since physical memory is allocated and freed on a page granularity,
272 there is no guarantee that a domain will receive a contiguous stretch
273 of physical memory. However most operating systems do not have good
274 support for operating in a fragmented physical address space. To aid
275 porting such operating systems to run on top of Xen, we make a
276 distinction between \emph{machine memory} and \emph{pseudo-physical
277 memory}.
279 Put simply, machine memory refers to the entire amount of memory
280 installed in the machine, including that reserved by Xen, in use by
281 various domains, or currently unallocated. We consider machine memory
282 to comprise a set of 4kB \emph{machine page frames} numbered
283 consecutively starting from 0. Machine frame numbers mean the same
284 within Xen or any domain.
286 Pseudo-physical memory, on the other hand, is a per-domain
287 abstraction. It allows a guest operating system to consider its memory
288 allocation to consist of a contiguous range of physical page frames
289 starting at physical frame 0, despite the fact that the underlying
290 machine page frames may be sparsely allocated and in any order.
292 To achieve this, Xen maintains a globally readable {\it
293 machine-to-physical} table which records the mapping from machine
294 page frames to pseudo-physical ones. In addition, each domain is
295 supplied with a {\it physical-to-machine} table which performs the
296 inverse mapping. Clearly the machine-to-physical table has size
297 proportional to the amount of RAM installed in the machine, while each
298 physical-to-machine table has size proportional to the memory
299 allocation of the given domain.
301 Architecture dependent code in guest operating systems can then use
302 the two tables to provide the abstraction of pseudo-physical memory.
303 In general, only certain specialized parts of the operating system
304 (such as page table management) need to understand the difference
305 between machine and pseudo-physical addresses.
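As an illustrative sketch (the table pointers and helper names here are assumptions for exposition, not taken from a particular guest implementation), the two translations reduce to simple array lookups:
\scriptsize
\begin{verbatim}
/* Sketch only.  machine_to_phys_mapping is the globally readable
 * machine-to-physical table provided by Xen (indexed by machine frame
 * number); phys_to_machine_mapping is the per-domain physical-to-machine
 * table (indexed by pseudo-physical frame number). */
extern unsigned long *machine_to_phys_mapping;
extern unsigned long *phys_to_machine_mapping;

static inline unsigned long pfn_to_mfn(unsigned long pfn)
{
    return phys_to_machine_mapping[pfn];
}

static inline unsigned long mfn_to_pfn(unsigned long mfn)
{
    return machine_to_phys_mapping[mfn];
}
\end{verbatim}
\normalsize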
308 \section{Page Table Updates}
310 In the default mode of operation, Xen enforces read-only access to
311 page tables and requires guest operating systems to explicitly request
312 any modifications. Xen validates all such requests and only applies
313 updates that it deems safe. This is necessary to prevent domains from
314 adding arbitrary mappings to their page tables.
316 To aid validation, Xen associates a type and reference count with each
317 memory page. A page has one of the following mutually-exclusive types
318 at any point in time: page directory ({\sf PD}), page table ({\sf
319 PT}), local descriptor table ({\sf LDT}), global descriptor table
320 ({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
321 create readable mappings of its own memory regardless of its current
322 type.
324 %%% XXX: possibly explain more about ref count 'lifecyle' here?
325 This mechanism is used to maintain the invariants required for safety;
326 for example, a domain cannot have a writable mapping to any part of a
327 page table as this would require the page concerned to simultaneously
328 be of types {\sf PT} and {\sf RW}.
330 \hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count, domid\_t domid)}
332 This hypercall is used to make updates to either the domain's
333 pagetables or to the machine to physical mapping table. It supports
334 submitting a queue of updates, allowing batching for maximal
335 performance. Explicitly queuing updates using this interface will
336 cause any outstanding writable pagetable state to be flushed from the
337 system.
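As an example, a guest might batch two page-table-entry writes into a single hypercall along the following lines (a sketch: a Linux-style {\tt HYPERVISOR\_mmu\_update()} wrapper is assumed, and error handling is omitted):
\scriptsize
\begin{verbatim}
/* Sketch: ptr holds the machine address of the PTE to write (its low
 * bits select the update type); val holds the new PTE contents. */
mmu_update_t req[2];
int success_count = 0;

req[0].ptr = pte_maddr_0 | MMU_NORMAL_PT_UPDATE;
req[0].val = new_pte_0;
req[1].ptr = pte_maddr_1 | MMU_NORMAL_PT_UPDATE;
req[1].val = new_pte_1;

if (HYPERVISOR_mmu_update(req, 2, &success_count, DOMID_SELF) != 0)
    /* at least one update was rejected by Xen */ ;
\end{verbatim}
\normalsize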
339 \section{Writable Page Tables}
341 Xen also provides an alternative mode of operation in which guests
342 have the illusion that their page tables are directly writable. Of
343 course this is not really the case, since Xen must still validate
344 modifications to ensure secure partitioning. To this end, Xen traps
345 any write attempt to a memory page of type {\sf PT} (i.e., that is
346 currently part of a page table). If such an access occurs, Xen
347 temporarily allows write access to that page while at the same time
348 \emph{disconnecting} it from the page table that is currently in use.
349 This allows the guest to safely make updates to the page because the
350 newly-updated entries cannot be used by the MMU until Xen revalidates
351 and reconnects the page. Reconnection occurs automatically in a
352 number of situations: for example, when the guest modifies a different
353 page-table page, when the domain is preempted, or whenever the guest
354 uses Xen's explicit page-table update interfaces.
356 Writable pagetable functionality is enabled when the guest requests
357 it, using a {\bf vm\_assist} hypercall. Writable pagetables do {\em
358 not} provide full virtualisation of the MMU, so the memory management
359 code of the guest still needs to be aware that it is running on Xen.
360 Since the guest's page tables are used directly, it must translate
361 pseudo-physical addresses to real machine addresses when building page
362 table entries. The guest may not attempt to map its own pagetables
363 writably, since this would violate the memory type invariants; page
364 tables will automatically be made writable by the hypervisor, as
365 necessary.
367 \section{Shadow Page Tables}
369 Finally, Xen also supports a form of \emph{shadow page tables} in
370 which the guest OS uses an independent copy of page tables which are
371 unknown to the hardware (i.e.\ which are never pointed to by {\tt
372 cr3}). Instead Xen propagates changes made to the guest's tables to
373 the real ones, and vice versa. This is useful for logging page writes
374 (e.g.\ for live migration or checkpoint). A full version of the shadow
375 page tables also allows guest OS porting with less effort.
378 \section{Segment Descriptor Tables}
380 At start of day a guest is supplied with a default GDT, which does not reside
381 within its own memory allocation. If the guest wishes to use anything other
382 than the default `flat' ring-1 and ring-3 segments that this GDT
383 provides, it must register a custom GDT and/or LDT with Xen, allocated
384 from its own memory.
386 The following hypercall is used to specify a new GDT:
388 \begin{quote}
389 int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
390 entries})
392 \emph{frame\_list}: An array of up to 14 machine page frames within
393 which the GDT resides. Any frame registered as a GDT frame may only
394 be mapped read-only within the guest's address space (e.g., no
395 writable mappings, no use as a page-table page, and so on). Only 14
396 pages may be specified because pages 15 and 16 are reserved for
397 the hypervisor's GDT entries.
399 \emph{entries}: The number of descriptor-entry slots in the GDT.
400 \end{quote}
402 The LDT is updated via the generic MMU update mechanism (i.e., via the
403 {\bf mmu\_update} hypercall).
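A registration might look like the following sketch (a Linux-style {\tt HYPERVISOR\_set\_gdt()} wrapper and the variable {\tt gdt\_mfn} are assumptions for illustration):
\scriptsize
\begin{verbatim}
/* Sketch: register a one-page GDT with Xen.  gdt_mfn is the machine
 * frame number of the page holding the descriptors; the guest may only
 * map that page read-only. */
unsigned long frames[1] = { gdt_mfn };

if (HYPERVISOR_set_gdt(frames, 512) != 0)  /* 512 descriptors fit in 4kB */
    /* Xen rejected the GDT (e.g. a frame is mapped writably elsewhere) */ ;
\end{verbatim}
\normalsize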
405 \section{Start of Day}
407 The start-of-day environment for guest operating systems is rather
408 different to that provided by the underlying hardware. In particular,
409 the processor is already executing in protected mode with paging
410 enabled.
412 {\it Domain 0} is created and booted by Xen itself. For all subsequent
413 domains, the analogue of the boot-loader is the {\it domain builder},
414 user-space software running in {\it domain 0}. The domain builder is
415 responsible for building the initial page tables for a domain and
416 loading its kernel image at the appropriate virtual address.
418 \section{VM assists}
420 Xen provides a number of ``assists'' for guest memory management.
421 These are available on an ``opt-in'' basis to provide commonly-used
422 extra functionality to a guest.
424 \hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
426 The {\bf cmd} parameter describes the action to be taken, whilst the
427 {\bf type} parameter describes the kind of assist that is being
428 referred to. Available commands are as follows:
430 \begin{description}
431 \item[VMASST\_CMD\_enable] Enable a particular assist type
432 \item[VMASST\_CMD\_disable] Disable a particular assist type
433 \end{description}
435 And the available types are:
437 \begin{description}
438 \item[VMASST\_TYPE\_4gb\_segments] Provide emulated support for
439 instructions that rely on 4GB segments (such as the techniques used
440 by some TLS solutions).
441 \item[VMASST\_TYPE\_4gb\_segments\_notify] Provide a callback to the
442 guest if the above segment fixups are used: allows the guest to
443 display a warning message during boot.
444 \item[VMASST\_TYPE\_writable\_pagetables] Enable writable pagetable
445 mode, as described above.
446 \end{description}
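For example, a guest that wants writable pagetable mode might opt in during initialisation roughly as follows (a sketch; a Linux-style {\tt HYPERVISOR\_vm\_assist()} wrapper is assumed):
\scriptsize
\begin{verbatim}
/* Sketch: enable the writable-pagetables assist at boot time. */
if (HYPERVISOR_vm_assist(VMASST_CMD_enable,
                         VMASST_TYPE_writable_pagetables) != 0)
    /* assist unavailable; fall back to explicit mmu_update requests */ ;
\end{verbatim}
\normalsize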
449 \chapter{Xen Info Pages}
451 The {\bf Shared info page} is used to share various CPU-related state
452 between the guest OS and the hypervisor. This information includes VCPU
453 status, time information and event channel (virtual interrupt) state.
454 The {\bf Start info page} is used to pass build-time information to
455 the guest when it boots and when it is resumed from a suspended state.
456 This chapter documents the fields included in the {\bf
457 shared\_info\_t} and {\bf start\_info\_t} structures for use by the
458 guest OS.
460 \section{Shared info page}
462 The {\bf shared\_info\_t} is accessed at run time by both Xen and the
463 guest OS. It is used to pass information relating to the
464 virtual CPU and virtual machine state between the OS and the
465 hypervisor.
467 The structure is declared in {\bf xen/include/public/xen.h}:
469 \scriptsize
470 \begin{verbatim}
471 typedef struct shared_info {
472 vcpu_info_t vcpu_info[MAX_VIRT_CPUS];
474 /*
475 * A domain can create "event channels" on which it can send and receive
476 * asynchronous event notifications. There are three classes of event that
477 * are delivered by this mechanism:
478 * 1. Bi-directional inter- and intra-domain connections. Domains must
479 * arrange out-of-band to set up a connection (usually by allocating
480 * an unbound 'listener' port and advertising that via a storage service
481 * such as xenstore).
482 * 2. Physical interrupts. A domain with suitable hardware-access
483 * privileges can bind an event-channel port to a physical interrupt
484 * source.
485 * 3. Virtual interrupts ('events'). A domain can bind an event-channel
486 * port to a virtual interrupt source, such as the virtual-timer
487 * device or the emergency console.
488 *
489 * Event channels are addressed by a "port index". Each channel is
490 * associated with two bits of information:
491 * 1. PENDING -- notifies the domain that there is a pending notification
492 * to be processed. This bit is cleared by the guest.
493 * 2. MASK -- if this bit is clear then a 0->1 transition of PENDING
494 * will cause an asynchronous upcall to be scheduled. This bit is only
495 * updated by the guest. It is read-only within Xen. If a channel
496 * becomes pending while the channel is masked then the 'edge' is lost
497 * (i.e., when the channel is unmasked, the guest must manually handle
498 * pending notifications as no upcall will be scheduled by Xen).
499 *
500 * To expedite scanning of pending notifications, any 0->1 pending
501 * transition on an unmasked channel causes a corresponding bit in a
502 * per-vcpu selector word to be set. Each bit in the selector covers a
503 * 'C long' in the PENDING bitfield array.
504 */
505 unsigned long evtchn_pending[sizeof(unsigned long) * 8];
506 unsigned long evtchn_mask[sizeof(unsigned long) * 8];
508 /*
509 * Wallclock time: updated only by control software. Guests should base
510 * their gettimeofday() syscall on this wallclock-base value.
511 */
512 uint32_t wc_version; /* Version counter: see vcpu_time_info_t. */
513 uint32_t wc_sec; /* Secs 00:00:00 UTC, Jan 1, 1970. */
514 uint32_t wc_nsec; /* Nsecs 00:00:00 UTC, Jan 1, 1970. */
516 arch_shared_info_t arch;
518 } shared_info_t;
519 \end{verbatim}
520 \normalsize
522 \begin{description}
523 \item[vcpu\_info] An array of {\bf vcpu\_info\_t} structures, each of
524 which holds either runtime information about a virtual CPU, or is
525 ``empty'' if the corresponding VCPU does not exist.
526 \item[evtchn\_pending] Guest-global array, with one bit per event
527 channel. Bits are set if an event is currently pending on that
528 channel.
529 \item[evtchn\_mask] Guest-global array for masking notifications on
530 event channels.
531 \item[wc\_version] Version counter for current wallclock time.
532 \item[wc\_sec] Whole seconds component of current wallclock time.
533 \item[wc\_nsec] Nanoseconds component of current wallclock time.
534 \item[arch] Host architecture-dependent portion of the shared info
535 structure.
536 \end{description}
538 \subsection{vcpu\_info\_t}
540 \scriptsize
541 \begin{verbatim}
542 typedef struct vcpu_info {
543 /*
544 * 'evtchn_upcall_pending' is written non-zero by Xen to indicate
545 * a pending notification for a particular VCPU. It is then cleared
546 * by the guest OS /before/ checking for pending work, thus avoiding
547 * a set-and-check race. Note that the mask is only accessed by Xen
548 * on the CPU that is currently hosting the VCPU. This means that the
549 * pending and mask flags can be updated by the guest without special
550 * synchronisation (i.e., no need for the x86 LOCK prefix).
551 * This may seem suboptimal because if the pending flag is set by
552 * a different CPU then an IPI may be scheduled even when the mask
553 * is set. However, note:
554 * 1. The task of 'interrupt holdoff' is covered by the per-event-
555 * channel mask bits. A 'noisy' event that is continually being
556 * triggered can be masked at source at this very precise
557 * granularity.
558 * 2. The main purpose of the per-VCPU mask is therefore to restrict
559 * reentrant execution: whether for concurrency control, or to
560 * prevent unbounded stack usage. Whatever the purpose, we expect
561 * that the mask will be asserted only for short periods at a time,
562 * and so the likelihood of a 'spurious' IPI is suitably small.
563 * The mask is read before making an event upcall to the guest: a
564 * non-zero mask therefore guarantees that the VCPU will not receive
565 * an upcall activation. The mask is cleared when the VCPU requests
566 * to block: this avoids wakeup-waiting races.
567 */
568 uint8_t evtchn_upcall_pending;
569 uint8_t evtchn_upcall_mask;
570 unsigned long evtchn_pending_sel;
571 arch_vcpu_info_t arch;
572 vcpu_time_info_t time;
573 } vcpu_info_t; /* 64 bytes (x86) */
574 \end{verbatim}
575 \normalsize
577 \begin{description}
578 \item[evtchn\_upcall\_pending] This is set non-zero by Xen to indicate
579 that there are pending events to be received.
580 \item[evtchn\_upcall\_mask] This is set non-zero to disable all
581 interrupts for this CPU for short periods of time. If individual
582 event channels need to be masked, the {\bf evtchn\_mask} in the {\bf
583 shared\_info\_t} is used instead.
584 \item[evtchn\_pending\_sel] When an event is delivered to this VCPU, a
585 bit is set in this selector to indicate which word of the {\bf
586 evtchn\_pending} array in the {\bf shared\_info\_t} contains the
587 event in question.
588 \item[arch] Architecture-specific VCPU info. On x86 this contains the
589 virtualized CR2 register (page fault linear address) for this VCPU.
590 \item[time] Time values for this VCPU.
591 \end{description}
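These flags are typically used together in the guest's event upcall path. The following is a simplified sketch (the helpers {\tt xchg()}, {\tt first\_set\_bit()}, {\tt clear\_bit()}, {\tt handle\_event()} and the constant {\tt BITS\_PER\_LONG} are illustrative, not part of the Xen interface):
\scriptsize
\begin{verbatim}
/* Sketch of an event upcall handler for VCPU 0, invoked via the callback
 * registered with set_callbacks. */
void do_event_upcall(shared_info_t *si)
{
    vcpu_info_t *v = &si->vcpu_info[0];
    unsigned long sel, pending;
    unsigned int word, bit;

    v->evtchn_upcall_pending = 0;
    sel = xchg(&v->evtchn_pending_sel, 0);   /* atomically take selector */

    while (sel != 0) {
        word = first_set_bit(sel);
        sel &= ~(1UL << word);
        pending = si->evtchn_pending[word] & ~si->evtchn_mask[word];
        while (pending != 0) {
            bit = first_set_bit(pending);
            pending &= ~(1UL << bit);
            clear_bit(bit, &si->evtchn_pending[word]);
            handle_event(word * BITS_PER_LONG + bit);   /* port number */
        }
    }
}
\end{verbatim}
\normalsize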
593 \subsection{vcpu\_time\_info}
595 \scriptsize
596 \begin{verbatim}
597 typedef struct vcpu_time_info {
598 /*
599 * Updates to the following values are preceded and followed by an
600 * increment of 'version'. The guest can therefore detect updates by
601 * looking for changes to 'version'. If the least-significant bit of
602 * the version number is set then an update is in progress and the guest
603 * must wait to read a consistent set of values.
604 * The correct way to interact with the version number is similar to
605 * Linux's seqlock: see the implementations of read_seqbegin/read_seqretry.
606 */
607 uint32_t version;
608 uint32_t pad0;
609 uint64_t tsc_timestamp; /* TSC at last update of time vals. */
610 uint64_t system_time; /* Time, in nanosecs, since boot. */
611 /*
612 * Current system time:
613 * system_time + ((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul
614 * CPU frequency (Hz):
615 * ((10^9 << 32) / tsc_to_system_mul) >> tsc_shift
616 */
617 uint32_t tsc_to_system_mul;
618 int8_t tsc_shift;
619 int8_t pad1[3];
620 } vcpu_time_info_t; /* 32 bytes */
621 \end{verbatim}
622 \normalsize
624 \begin{description}
625 \item[version] Used to ensure the guest gets consistent time updates.
626 \item[tsc\_timestamp] Cycle counter timestamp of last time value;
627 can be used, for instance, to extrapolate between updates.
628 \item[system\_time] Time since boot (nanoseconds).
629 \item[tsc\_to\_system\_mul] Cycle counter to nanoseconds multiplier
630 (used in extrapolating current time).
631 \item[tsc\_shift] Cycle counter to nanoseconds shift (used in
632 extrapolating current time).
633 \end{description}
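Putting these fields together, a guest can extrapolate the current system time roughly as in the sketch below ({\tt rdtsc()} is an assumed helper returning the 64-bit cycle counter; the version-check loop and the wider arithmetic needed to avoid overflow for very large deltas are omitted for brevity):
\scriptsize
\begin{verbatim}
/* Sketch: extrapolate system time (in ns) following the formula in the
 * structure's comment. */
uint64_t get_system_time_ns(const vcpu_time_info_t *t)
{
    uint64_t delta = rdtsc() - t->tsc_timestamp;   /* cycles since update */

    if (t->tsc_shift >= 0)
        delta <<= t->tsc_shift;
    else
        delta >>= -t->tsc_shift;

    /* tsc_to_system_mul is a 32.32 fixed-point multiplier. */
    return t->system_time + ((delta * (uint64_t)t->tsc_to_system_mul) >> 32);
}
\end{verbatim}
\normalsize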
635 \subsection{arch\_shared\_info\_t}
637 On x86, the {\bf arch\_shared\_info\_t} is defined as follows (from
638 xen/public/arch-x86\_32.h):
640 \scriptsize
641 \begin{verbatim}
642 typedef struct arch_shared_info {
643 unsigned long max_pfn; /* max pfn that appears in table */
644 /* Frame containing list of mfns containing list of mfns containing p2m. */
645 unsigned long pfn_to_mfn_frame_list_list;
646 } arch_shared_info_t;
647 \end{verbatim}
648 \normalsize
650 \begin{description}
651 \item[max\_pfn] The maximum PFN listed in the physical-to-machine
652 mapping table (P2M table).
653 \item[pfn\_to\_mfn\_frame\_list\_list] Machine address of the frame
654 that contains the machine addresses of the P2M table frames.
655 \end{description}
657 \section{Start info page}
659 The start info structure is declared as follows (in {\bf
660 xen/include/public/xen.h}):
662 \scriptsize
663 \begin{verbatim}
664 #define MAX_GUEST_CMDLINE 1024
665 typedef struct start_info {
666 /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
667 char magic[32]; /* "Xen-<version>.<subversion>". */
668 unsigned long nr_pages; /* Total pages allocated to this domain. */
669 unsigned long shared_info; /* MACHINE address of shared info struct. */
670 uint32_t flags; /* SIF_xxx flags. */
671 unsigned long store_mfn; /* MACHINE page number of shared page. */
672 uint32_t store_evtchn; /* Event channel for store communication. */
673 unsigned long console_mfn; /* MACHINE address of console page. */
674 uint32_t console_evtchn; /* Event channel for console messages. */
675 /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
676 unsigned long pt_base; /* VIRTUAL address of page directory. */
677 unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
678 unsigned long mfn_list; /* VIRTUAL address of page-frame list. */
679 unsigned long mod_start; /* VIRTUAL address of pre-loaded module. */
680 unsigned long mod_len; /* Size (bytes) of pre-loaded module. */
681 int8_t cmd_line[MAX_GUEST_CMDLINE];
682 } start_info_t;
683 \end{verbatim}
684 \normalsize
686 The fields fall into two groups: the first group is always filled in
687 when a domain is booted or resumed; the second is only used at
688 boot time.
690 The always-available group is as follows:
692 \begin{description}
693 \item[magic] A text string identifying the Xen version to the guest.
694 \item[nr\_pages] The number of real machine pages available to the
695 guest.
696 \item[shared\_info] Machine address of the shared info structure,
697 allowing the guest to map it during initialisation.
698 \item[flags] Flags for describing optional extra settings to the
699 guest.
700 \item[store\_mfn] Machine address of the Xenstore communications page.
701 \item[store\_evtchn] Event channel to communicate with the store.
702 \item[console\_mfn] Machine address of the console data page.
703 \item[console\_evtchn] Event channel to notify the console backend.
704 \end{description}
706 The boot-only group may only be safely referred to during system boot:
708 \begin{description}
709 \item[pt\_base] Virtual address of the page directory created for us
710 by the domain builder.
711 \item[nr\_pt\_frames] Number of frames used by the builder's bootstrap
712 pagetables.
713 \item[mfn\_list] Virtual address of the list of machine frames this
714 domain owns.
715 \item[mod\_start] Virtual address of any pre-loaded modules
716 (e.g. a ramdisk).
717 \item[mod\_len] Size of pre-loaded module (if any).
718 \item[cmd\_line] Kernel command line passed by the domain builder.
719 \end{description}
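A guest's early boot code might consume this structure along the following lines (a sketch: {\tt map\_machine\_page()}, {\tt parse\_command\_line()} and the globals are illustrative; only the structure fields come from the Xen interface):
\scriptsize
\begin{verbatim}
shared_info_t *shared_info;
unsigned long  total_pages;

void early_setup(start_info_t *si)
{
    /* The magic string identifies the Xen version that booted us. */
    if (strncmp(si->magic, "Xen-", 4) != 0 &&
        strncmp(si->magic, "xen-", 4) != 0)
        for (;;) ;                          /* not started by a known Xen */

    total_pages = si->nr_pages;
    shared_info = map_machine_page(si->shared_info);
    parse_command_line((const char *)si->cmd_line);
}
\end{verbatim}
\normalsize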
722 % by Mark Williamson <mark.williamson@cl.cam.ac.uk>
724 \chapter{Event Channels}
725 \label{c:eventchannels}
727 Event channels are the basic primitive provided by Xen for event
728 notifications. An event is the Xen equivalent of a hardware
729 interrupt. They essentially store one bit of information; the event
730 of interest is signalled by transitioning this bit from 0 to 1.
732 Notifications are received by a guest via an upcall from Xen,
733 indicating when an event arrives (setting the bit). Further
734 notifications are masked until the bit is cleared again (therefore,
735 guests must check the value of the bit after re-enabling event
736 delivery to ensure no missed notifications).
738 Event notifications can be masked by setting a flag; this is
739 equivalent to disabling interrupts and can be used to ensure atomicity
740 of certain operations in the guest kernel.
742 \section{Hypercall interface}
744 \hypercall{event\_channel\_op(evtchn\_op\_t *op)}
746 The event channel operation hypercall is used for all operations on
747 event channels / ports. Operations are distinguished by the value of
748 the {\bf cmd} field of the {\bf op} structure. The possible commands
749 are described below:
751 \begin{description}
753 \item[EVTCHNOP\_alloc\_unbound]
754 Allocate a new event channel port, ready to be connected to by a
755 remote domain.
756 \begin{itemize}
757 \item Specified domain must exist.
758 \item A free port must exist in that domain.
759 \end{itemize}
760 Unprivileged domains may only allocate their own ports; privileged
761 domains may also allocate ports in other domains.
762 \item[EVTCHNOP\_bind\_interdomain]
763 Bind an event channel for interdomain communications.
764 \begin{itemize}
765 \item Caller domain must have a free port to bind.
766 \item Remote domain must exist.
767 \item Remote port must be allocated and currently unbound.
768 \item Remote port must be expecting the caller domain as the ``remote''.
769 \end{itemize}
770 \item[EVTCHNOP\_bind\_virq]
771 Allocate a port and bind a VIRQ to it.
772 \begin{itemize}
773 \item Caller domain must have a free port to bind.
774 \item VIRQ must be valid.
775 \item VCPU must exist.
776 \item VIRQ must not currently be bound to an event channel.
777 \end{itemize}
778 \item[EVTCHNOP\_bind\_ipi]
779 Allocate and bind a port for notifying other virtual CPUs.
780 \begin{itemize}
781 \item Caller domain must have a free port to bind.
782 \item VCPU must exist.
783 \end{itemize}
784 \item[EVTCHNOP\_bind\_pirq]
785 Allocate and bind a port to a real IRQ.
786 \begin{itemize}
787 \item Caller domain must have a free port to bind.
788 \item PIRQ must be within the valid range.
789 \item Another binding for this PIRQ must not exist for this domain.
790 \item Caller must have an available port.
791 \end{itemize}
792 \item[EVTCHNOP\_close]
793 Close an event channel (no more events will be received).
794 \begin{itemize}
795 \item Port must be valid (currently allocated).
796 \end{itemize}
797 \item[EVTCHNOP\_send] Send a notification on an event channel attached
798 to a port.
799 \begin{itemize}
800 \item Port must be valid.
801 \item Only valid for interdomain, IPI, or allocated-but-unbound ports.
802 \end{itemize}
803 \item[EVTCHNOP\_status] Query the status of a port; what kind of port,
804 whether it is bound, what remote domain is expected, what PIRQ or
805 VIRQ it is bound to, what VCPU will be notified, etc.
806 Unprivileged domains may only query the state of their own ports.
807 Privileged domains may query any port.
808 \item[EVTCHNOP\_bind\_vcpu] Bind event channel to a particular VCPU -
809 receive notification upcalls only on that VCPU.
810 \begin{itemize}
811 \item VCPU must exist.
812 \item Port must be valid.
813 \item Event channel must be either allocated but unbound, bound to
814 an interdomain event channel, or bound to a PIRQ.
815 \end{itemize}
817 \end{description}
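As an example of the usual interdomain handshake, domain A allocates an unbound port expecting domain B, advertises the port number out of band (typically via xenstore), and domain B then binds to it. The sketch below assumes a Linux-style {\tt HYPERVISOR\_event\_channel\_op()} wrapper:
\scriptsize
\begin{verbatim}
/* --- in domain A --- */
evtchn_op_t op = { .cmd = EVTCHNOP_alloc_unbound };
op.u.alloc_unbound.dom        = DOMID_SELF;
op.u.alloc_unbound.remote_dom = domB;
HYPERVISOR_event_channel_op(&op);
uint32_t listener = op.u.alloc_unbound.port;

/* --- in domain B, once it has learned 'listener' --- */
evtchn_op_t bind = { .cmd = EVTCHNOP_bind_interdomain };
bind.u.bind_interdomain.remote_dom  = domA;
bind.u.bind_interdomain.remote_port = listener;
HYPERVISOR_event_channel_op(&bind);
uint32_t local_port = bind.u.bind_interdomain.local_port;
\end{verbatim}
\normalsize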
819 %%
820 %% grant_tables.tex
821 %%
822 %% Made by Mark Williamson
823 %% Login <mark@maw48>
824 %%
826 \chapter{Grant tables}
827 \label{c:granttables}
829 Xen's grant tables provide a generic mechanism for memory sharing
830 between domains. This shared memory interface underpins the split
831 device drivers for block and network IO.
833 Each domain has its own {\bf grant table}. This is a data structure
834 that is shared with Xen; it allows the domain to tell Xen what kind of
835 permissions other domains have on its pages. Entries in the grant
836 table are identified by {\bf grant references}. A grant reference is
837 an integer, which indexes into the grant table. It acts as a
838 capability which the grantee can use to perform operations on the
839 granter's memory.
841 This capability-based system allows shared-memory communications
842 between unprivileged domains. A grant reference also encapsulates the
843 details of a shared page, removing the need for a domain to know the
844 real machine address of a page it is sharing. This makes it possible
845 to share memory correctly with domains running in fully virtualised
846 memory.
848 \section{Interface}
850 \subsection{Grant table manipulation}
852 Creating and destroying grant references is done by direct access to
853 the grant table. This removes the need to involve Xen when creating
854 grant references, modifying access permissions, etc. The grantee
855 domain will invoke hypercalls to use the grant references. Four main
856 operations can be accomplished by directly manipulating the table:
858 \begin{description}
859 \item[Grant foreign access] allocate a new entry in the grant table
860 and fill out the access permissions accordingly. The access
861 permissions will be looked up by Xen when the grantee attempts to
862 use the reference to map the granted frame.
863 \item[End foreign access] check that the grant reference is not
864 currently in use, then remove the mapping permissions for the frame.
865 This prevents further mappings from taking place but does not allow
866 forced revocations of existing mappings.
867 \item[Grant foreign transfer] allocate a new entry in the table
868 specifying transfer permissions for the grantee. Xen will look up
869 this entry when the grantee attempts to transfer a frame to the
870 granter.
871 \item[End foreign transfer] remove permissions to prevent a transfer
872 occurring in future. If the transfer is already committed,
873 modifying the grant table cannot prevent it from completing.
874 \end{description}
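A sketch of the ``grant foreign access'' case, assuming the version-1 {\tt grant\_entry\_t} layout ({\tt domid}, {\tt frame}, {\tt flags}) and a write barrier {\tt wmb()} supplied by the guest:
\scriptsize
\begin{verbatim}
/* Sketch: grant another domain access to one of our frames by filling in
 * a free grant-table entry directly.  flags is written last so the entry
 * only becomes valid once it is fully initialised. */
void grant_access(grant_entry_t *gnttab, grant_ref_t ref,
                  domid_t grantee, unsigned long mfn, int readonly)
{
    gnttab[ref].domid = grantee;
    gnttab[ref].frame = mfn;
    wmb();
    gnttab[ref].flags = GTF_permit_access | (readonly ? GTF_readonly : 0);
}
\end{verbatim}
\normalsize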
876 \subsection{Hypercalls}
878 Use of grant references is accomplished via a hypercall. The grant
879 table op hypercall takes three arguments:
881 \hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
883 {\bf cmd} indicates the grant table operation of interest. {\bf uop}
884 is a pointer to a structure (or an array of structures) describing the
885 operation to be performed. The {\bf count} field describes how many
886 grant table operations are being batched together.
888 The core logic is situated in {\bf xen/common/grant\_table.c}. The
889 grant table operation hypercall can be used to perform the following
890 actions:
892 \begin{description}
893 \item[GNTTABOP\_map\_grant\_ref] Given a grant reference from another
894 domain, map the referred page into the caller's address space.
895 \item[GNTTABOP\_unmap\_grant\_ref] Remove a mapping to a granted frame
896 from the caller's address space. This is used to voluntarily
897 relinquish a mapping to a granted page.
898 \item[GNTTABOP\_setup\_table] Setup grant table for caller domain.
899 \item[GNTTABOP\_dump\_table] Debugging operation.
900 \item[GNTTABOP\_transfer] Given a transfer reference from another
901 domain, transfer ownership of a page frame to that domain.
902 \end{description}
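For instance, the grantee maps a granted frame roughly as in the following sketch (field names follow the public grant-table interface; a Linux-style {\tt HYPERVISOR\_grant\_table\_op()} wrapper is assumed):
\scriptsize
\begin{verbatim}
gnttab_map_grant_ref_t map;

map.host_addr = map_addr;          /* where to establish the mapping   */
map.flags     = GNTMAP_host_map;   /* ordinary host mapping            */
map.ref       = gref;              /* grant reference from the granter */
map.dom       = granter;           /* domid of the granting domain     */

HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1);

if (map.status != GNTST_okay)
    /* mapping refused: bad reference, access revoked, ... */ ;
/* keep map.handle: it is needed later for GNTTABOP_unmap_grant_ref */
\end{verbatim}
\normalsize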
904 %%
905 %% xenstore.tex
906 %%
907 %% Made by Mark Williamson
908 %% Login <mark@maw48>
909 %%
911 \chapter{Xenstore}
913 Xenstore is the mechanism by which control-plane activities occur.
914 These activities include:
916 \begin{itemize}
917 \item Setting up shared memory regions and event channels for use with
918 the split device drivers.
919 \item Notifying the guest of control events (e.g. balloon driver
920 requests).
921 \item Reporting back status information from the guest
922 (e.g. performance-related statistics).
923 \end{itemize}
925 The store is arranged as a hierarchical collection of key-value pairs.
926 Each domain has a directory hierarchy containing data related to its
927 configuration. Domains are permitted to register for notifications
928 about changes in subtrees of the store, and to apply changes to the
929 store transactionally.
931 \section{Guidelines}
933 A few principles govern the operation of the store:
935 \begin{itemize}
936 \item Domains should only modify the contents of their own
937 directories.
938 \item The setup protocol for a device channel should simply consist of
939 entering the configuration data into the store.
940 \item The store should allow device discovery without requiring the
941 relevant device drivers to be loaded: a Xen ``bus'' should be
942 visible to probing code in the guest.
943 \item The store should be usable for inter-tool communications,
944 allowing the tools themselves to be decomposed into a number of
945 smaller utilities, rather than a single monolithic entity. This
946 also facilitates the development of alternate user interfaces to the
947 same functionality.
948 \end{itemize}
950 \section{Store layout}
952 There are three main paths in XenStore:
954 \begin{description}
955 \item[/vm] stores configuration information about a domain
956 \item[/local/domain] stores information about the domain on the local node (domid, etc.)
957 \item[/tool] stores information for the various tools
958 \end{description}
960 The {\bf /vm} path stores configuration information for a domain.
961 This information doesn't change and is indexed by the domain's UUID.
962 A {\bf /vm} entry contains the following information:
964 \begin{description}
965 \item[ssidref] ssid reference for domain
966 \item[uuid] uuid of the domain (somewhat redundant)
967 \item[on\_reboot] the action to take on a domain reboot request (destroy or restart)
968 \item[on\_poweroff] the action to take on a domain halt request (destroy or restart)
969 \item[on\_crash] the action to take on a domain crash (destroy or restart)
970 \item[vcpus] the number of allocated vcpus for the domain
971 \item[memory] the amount of memory (in megabytes) for the domain. Note: this sometimes appears to be empty for domain-0
972 \item[vcpu\_avail] the number of active vcpus for the domain (vcpus - number of disabled vcpus)
973 \item[name] the name of the domain
974 \end{description}
977 {\bf /vm/$<$uuid$>$/image/}
979 The image path is only available for Domain-Us and contains:
980 \begin{description}
981 \item[ostype] identifies the builder type (linux or vmx)
982 \item[kernel] path to kernel on domain-0
983 \item[cmdline] command line to pass to domain-U kernel
984 \item[ramdisk] path to ramdisk on domain-0
985 \end{description}
987 {\bf /local}
989 The {\tt /local} path currently only contains one directory, {\tt
990 /local/domain}, which is indexed by domain id. It contains the running
991 domain information. The reason for having two storage areas is that
992 during migration, the uuid doesn't change but the domain id does. The
993 {\tt /local/domain} directory can be created and populated before
994 finalizing the migration, enabling localhost-to-localhost migration.
996 {\bf /local/domain/$<$domid$>$}
998 This path contains:
1000 \begin{description}
1001 \item[cpu\_time] xend start time (this is only around for domain-0)
1002 \item[handle] private handle for xend
1003 \item[name] see /vm
1004 \item[on\_reboot] see /vm
1005 \item[on\_poweroff] see /vm
1006 \item[on\_crash] see /vm
1007 \item[vm] the path to the VM directory for the domain
1008 \item[domid] the domain id (somewhat redundant)
1009 \item[running] indicates that the domain is currently running
1010 \item[memory] the current memory in megabytes for the domain (empty for domain-0?)
1011 \item[maxmem\_KiB] the maximum memory for the domain (in kilobytes)
1012 \item[memory\_KiB] the memory allocated to the domain (in kilobytes)
1013 \item[cpu] the current CPU the domain is pinned to (empty for domain-0?)
1014 \item[cpu\_weight] the weight assigned to the domain
1015 \item[vcpu\_avail] a bitmap telling the domain whether it may use a given VCPU
1016 \item[online\_vcpus] how many vcpus are currently online
1017 \item[vcpus] the total number of vcpus allocated to the domain
1018 \item[console/] a directory for console information
1019 \begin{description}
1020 \item[ring-ref] the grant table reference of the console ring queue
1021 \item[port] the event channel being used for the console ring queue (local port)
1022 \item[tty] the tty on which the console data is currently being exposed
1023 \item[limit] the limit (in bytes) of console data to buffer
1024 \end{description}
1025 \item[backend/] a directory containing all backends the domain hosts
1026 \begin{description}
1027 \item[vbd/] a directory containing vbd backends
1028 \begin{description}
1029 \item[$<$domid$>$/] a directory containing vbds for domid
1030 \begin{description}
1031 \item[$<$virtual-device$>$/] a directory for a particular
1032 virtual-device on domid
1033 \begin{description}
1034 \item[frontend-id] domain id of frontend
1035 \item[frontend] the path to the frontend domain
1036 \item[physical-device] backend device number
1037 \item[sector-size] backend sector size
1038 \item[info] 0 read/write, 1 read-only (is this right?)
1039 \item[domain] name of frontend domain
1040 \item[params] parameters for device
1041 \item[type] the type of the device
1042 \item[dev] the virtual device (as given by the user)
1043 \item[node] output from block creation script
1044 \end{description}
1045 \end{description}
1046 \end{description}
1048 \item[vif/] a directory containing vif backends
1049 \begin{description}
1050 \item[$<$domid$>$/] a directory containing vifs for domid
1051 \begin{description}
1052 \item[$<$vif number$>$/] a directory for each vif
1053 \item[frontend-id] the domain id of the frontend
1054 \item[frontend] the path to the frontend
1055 \item[mac] the mac address of the vif
1056 \item[bridge] the bridge the vif is connected to
1057 \item[handle] the handle of the vif
1058 \item[script] the script used to create/stop the vif
1059 \item[domain] the name of the frontend
1060 \end{description}
1061 \end{description}
1063 \item[vtpm/] a directory containing vtpm backends
1064 \begin{description}
1065 \item[$<$domid$>$/] a directory containing vtpms for domid
1066 \begin{description}
1067 \item[$<$vtpm number$>$/] a directory for each vtpm
1068 \item[frontend-id] the domain id of the frontend
1069 \item[frontend] the path to the frontend
1070 \item[instance] the instance of the virtual TPM that is used
1071 \item[pref{\textunderscore}instance] the instance number as given in the VM configuration file;
1072 may be different from {\bf instance}
1073 \item[domain] the name of the domain of the frontend
1074 \end{description}
1075 \end{description}
1077 \end{description}
1079 \item[device/] a directory containing the frontend devices for the
1080 domain
1081 \begin{description}
1082 \item[vbd/] a directory containing vbd frontend devices for the
1083 domain
1084 \begin{description}
1085 \item[$<$virtual-device$>$/] a directory containing the vbd frontend for
1086 virtual-device
1087 \begin{description}
1088 \item[virtual-device] the device number of the frontend device
1089 \item[backend-id] the domain id of the backend
1090 \item[backend] the path of the backend in the store (/local/domain
1091 path)
1092 \item[ring-ref] the grant table reference for the block request
1093 ring queue
1094 \item[event-channel] the event channel used for the block request
1095 ring queue
1096 \end{description}
1098 \item[vif/] a directory containing vif frontend devices for the
1099 domain
1100 \begin{description}
1101 \item[$<$id$>$/] a directory for vif id frontend device for the domain
1102 \begin{description}
1103 \item[backend-id] the backend domain id
1104 \item[mac] the mac address of the vif
1105 \item[handle] the internal vif handle
1106 \item[backend] a path to the backend's store entry
1107 \item[tx-ring-ref] the grant table reference for the transmission ring queue
1108 \item[rx-ring-ref] the grant table reference for the receiving ring queue
1109 \item[event-channel] the event channel used for the two ring queues
1110 \end{description}
1111 \end{description}
1113 \item[vtpm/] a directory containing the vtpm frontend device for the
1114 domain
1115 \begin{description}
1116 \item[$<$id$>$] a directory for vtpm id frontend device for the domain
1117 \begin{description}
1118 \item[backend-id] the backend domain id
1119 \item[backend] a path to the backend's store entry
1120 \item[ring-ref] the grant table reference for the tx/rx ring
1121 \item[event-channel] the event channel used for the ring
1122 \end{description}
1123 \end{description}
1125 \item[device-misc/] miscellaneous information for devices
1126 \begin{description}
1127 \item[vif/] miscellaneous information for vif devices
1128 \begin{description}
1129 \item[nextDeviceID] the next device id to use
1130 \end{description}
1131 \end{description}
1132 \end{description}
1133 \end{description}
1135 \item[store/] per-domain information for the store
1136 \begin{description}
1137 \item[port] the event channel used for the store ring queue
1138 \item[ring-ref] the grant table reference used for the store's
1139 communication channel
1140 \end{description}
1142 \item[image] private xend information
1143 \end{description}
1146 \chapter{Devices}
1147 \label{c:devices}
1149 Virtual devices under Xen are provided by a {\bf split device driver}
1150 architecture. The illusion of the virtual device is provided by two
1151 co-operating drivers: the {\bf frontend}, which runs in the
1152 unprivileged domain, and the {\bf backend}, which runs in a domain with
1153 access to the real device hardware (often called a {\bf driver
1154 domain}; in practice domain 0 usually fulfills this function).
1156 The frontend driver appears to the unprivileged guest as if it were a
1157 real device, for instance a block or network device. It receives IO
1158 requests from its kernel as usual; however, since it does not have
1159 access to the physical hardware of the system it must then issue
1160 requests to the backend. The backend driver is responsible for
1161 receiving these IO requests, verifying that they are safe and then
1162 issuing them to the real device hardware. The backend driver appears
1163 to its kernel as a normal user of in-kernel IO functionality. When
1164 the IO completes the backend notifies the frontend that the data is
1165 ready for use; the frontend is then able to report IO completion to
1166 its own kernel.
1168 Frontend drivers are designed to be simple; most of the complexity is
1169 in the backend, which has responsibility for translating device
1170 addresses, verifying that requests are well-formed and do not violate
1171 isolation guarantees, etc.
1173 Split drivers exchange requests and responses in shared memory, with
1174 an event channel for asynchronous notifications of activity. When the
1175 frontend driver comes up, it uses Xenstore to set up a shared memory
1176 frame and an interdomain event channel for communications with the
1177 backend. Once this connection is established, the two can communicate
1178 directly by placing requests / responses into shared memory and then
1179 sending notifications on the event channel. This separation of
1180 notification from data transfer allows message batching, and results
1181 in very efficient device access.
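As a sketch of that setup step for a block (vbd) frontend ({\tt xenstore\_printf()} and the two allocation helpers are illustrative stand-ins, not a real API; the key names match the store layout described in the Xenstore chapter):
\scriptsize
\begin{verbatim}
grant_ref_t ring_ref = grant_ring_page_to(backend_id);     /* illustrative */
uint32_t    evtchn   = alloc_unbound_evtchn(backend_id);    /* illustrative */

xenstore_printf("device/vbd/768/ring-ref",      "%u", ring_ref);
xenstore_printf("device/vbd/768/event-channel", "%u", evtchn);
/* The backend watches these keys, maps the ring and binds the channel. */
\end{verbatim}
\normalsize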
1183 This chapter focuses on some individual split device interfaces
1184 available to Xen guests.
1187 \section{Network I/O}
1189 Virtual network device services are provided by shared memory
1190 communication with a backend domain. From the point of view of other
1191 domains, the backend may be viewed as a virtual ethernet switch
1192 element with each domain having one or more virtual network interfaces
1193 connected to it.
1195 From the point of view of the backend domain itself, the network
1196 backend driver consists of a number of ethernet devices. Each of
1197 these has a logical direct connection to a virtual network device in
1198 another domain. This allows the backend domain to route, bridge,
1199 firewall, etc.\ the traffic to/from the other domains using normal
1200 operating system mechanisms.
1202 \subsection{Backend Packet Handling}
1204 The backend driver is responsible for a variety of actions relating to
1205 the transmission and reception of packets from the physical device.
1206 With regard to transmission, the backend performs these key actions:
1208 \begin{itemize}
1209 \item {\bf Validation:} To ensure that domains do not attempt to
1210 generate invalid (e.g. spoofed) traffic, the backend driver may
1211 validate headers ensuring that source MAC and IP addresses match the
1212 interface that they have been sent from.
1214 Validation functions can be configured using standard firewall rules
1215 ({\small{\tt iptables}} in the case of Linux).
1217 \item {\bf Scheduling:} Since a number of domains can share a single
1218 physical network interface, the backend must mediate access when
1219 several domains each have packets queued for transmission. This
1220 general scheduling function subsumes basic shaping or rate-limiting
1221 schemes.
1223 \item {\bf Logging and Accounting:} The backend domain can be
1224 configured with classifier rules that control how packets are
1225 accounted or logged. For example, log messages might be generated
1226 whenever a domain attempts to send a TCP packet containing a SYN.
1227 \end{itemize}
1229 On receipt of incoming packets, the backend acts as a simple
1230 demultiplexer: Packets are passed to the appropriate virtual interface
1231 after any necessary logging and accounting have been carried out.
1233 \subsection{Data Transfer}
1235 Each virtual interface uses two ``descriptor rings'', one for
1236 transmit, the other for receive. Each descriptor identifies a block
1237 of contiguous machine memory allocated to the domain.
1239 The transmit ring carries packets to transmit from the guest to the
1240 backend domain. The return path of the transmit ring carries messages
1241 indicating that the contents have been physically transmitted and the
1242 backend no longer requires the associated pages of memory.
1244 To receive packets, the guest places descriptors of unused pages on
1245 the receive ring. The backend will return received packets by
1246 exchanging these pages in the domain's memory with new pages
1247 containing the received data, and passing back descriptors regarding
1248 the new packets on the ring. This zero-copy approach allows the
1249 backend to maintain a pool of free pages to receive packets into, and
1250 then deliver them to appropriate domains after examining their
1251 headers.
1253 % Real physical addresses are used throughout, with the domain
1254 % performing translation from pseudo-physical addresses if that is
1255 % necessary.
1257 If a domain does not keep its receive ring stocked with empty buffers
1258 then packets destined to it may be dropped. This provides some
1259 defence against receive livelock problems because an overloaded domain
1260 will cease to receive further data. Similarly, on the transmit path,
1261 it provides the application with feedback on the rate at which packets
1262 are able to leave the system.
1264 Flow control on rings is achieved by including a pair of producer
1265 indexes on the shared ring page. Each side will maintain a private
1266 consumer index indicating the next outstanding message. In this
1267 manner, the domains cooperate to divide the ring into two message
1268 lists, one in each direction. Notification is decoupled from the
1269 immediate placement of new messages on the ring; the event channel
1270 will be used to generate notification when {\em either} a certain
1271 number of outstanding messages are queued, {\em or} a specified number
1272 of nanoseconds have elapsed since the oldest message was placed on the
1273 ring.
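A much-simplified sketch of this scheme for the transmit direction is shown below (the real implementation uses the shared-ring macros in {\tt xen/include/public/io/ring.h}; {\tt RING\_SIZE}, {\tt wmb()} and {\tt notify\_evtchn()} are assumptions, and the notification-batching policy is omitted):
\scriptsize
\begin{verbatim}
/* The shared page carries the producer indices and the slots; each side
 * keeps its own consumer index privately.  Indices increase monotonically
 * and are reduced modulo the power-of-two ring size when indexing. */
struct tx_ring {
    unsigned int req_prod;                 /* written by the frontend */
    unsigned int rsp_prod;                 /* written by the backend  */
    netif_tx_request_t  req[RING_SIZE];
    netif_tx_response_t rsp[RING_SIZE];
};

void queue_tx(struct tx_ring *r, const netif_tx_request_t *req,
              uint32_t evtchn)
{
    r->req[r->req_prod % RING_SIZE] = *req;
    wmb();                                 /* publish the request first */
    r->req_prod++;                         /* then advance the producer */
    notify_evtchn(evtchn);
}
\end{verbatim}
\normalsize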
1275 %% Not sure if my version is any better -- here is what was here
1276 %% before: Synchronization between the backend domain and the guest is
1277 %% achieved using counters held in shared memory that is accessible to
1278 %% both. Each ring has associated producer and consumer indices
1279 %% indicating the area in the ring that holds descriptors that contain
1280 %% data. After receiving {\it n} packets or {\t nanoseconds} after
1281 %% receiving the first packet, the hypervisor sends an event to the
1282 %% domain.
1285 \subsection{Network ring interface}
1287 The network device uses two shared memory rings for communication: one
1288 for transmit, one for receive.
1290 Transmit requests are described by the following structure:
1292 \scriptsize
1293 \begin{verbatim}
1294 typedef struct netif_tx_request {
1295 grant_ref_t gref; /* Reference to buffer page */
1296 uint16_t offset; /* Offset within buffer page */
1297 uint16_t flags; /* NETTXF_* */
1298 uint16_t id; /* Echoed in response message. */
1299 uint16_t size; /* Packet size in bytes. */
1300 } netif_tx_request_t;
1301 \end{verbatim}
1302 \normalsize
1304 \begin{description}
1305 \item[gref] Grant reference for the network buffer
1306 \item[offset] Offset to data
1307 \item[flags] Transmit flags (currently only NETTXF\_csum\_blank is
1308 supported, to indicate that the protocol checksum field is
1309 incomplete).
1310 \item[id] Echoed to guest by the backend in the ring-level response so
1311 that the guest can match it to this request
1312 \item[size] Buffer size
1313 \end{description}
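
As a sketch (not taken from the Xen sources; the helper used to obtain a
grant reference and the variable names are placeholders), a frontend might
fill in a transmit request for a single-page packet as follows:

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: queue one single-page packet for transmission.   */
netif_tx_request_t tx;

tx.gref   = make_grant_ref(backend_domid, buffer_page); /* assumed helper */
tx.offset = pkt_offset;        /* offset of the packet within the page    */
tx.flags  = 0;                 /* or NETTXF_csum_blank if checksum unset  */
tx.id     = req_id;            /* echoed back in the transmit response    */
tx.size   = pkt_len;           /* packet length in bytes                  */
\end{verbatim}
\normalsize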
1315 Each transmit request is followed by a transmit response at some later
1316 date. This is part of the shared-memory communication protocol and
1317 allows the guest to (potentially) retire internal structures related
1318 to the request. It does not imply a network-level response. This
1319 structure is as follows:
1321 \scriptsize
1322 \begin{verbatim}
1323 typedef struct netif_tx_response {
1324 uint16_t id;
1325 int16_t status;
1326 } netif_tx_response_t;
1327 \end{verbatim}
1328 \normalsize
1330 \begin{description}
1331 \item[id] Echo of the ID field in the corresponding transmit request.
1332 \item[status] Success / failure status of the transmit request.
1333 \end{description}
1335 Receive requests must be queued by the frontend, accompanied by a
1336 donation of page-frames to the backend. The backend transfers page
1337 frames full of data back to the guest.
1339 \scriptsize
1340 \begin{verbatim}
1341 typedef struct {
1342 uint16_t id; /* Echoed in response message. */
1343 grant_ref_t gref; /* Reference to incoming granted frame */
1344 } netif_rx_request_t;
1345 \end{verbatim}
1346 \normalsize
1348 \begin{description}
1349 \item[id] Echoed by the backend in the response message so that the
1350 frontend can match the response to this request.
1351 \item[gref] Transfer reference - the backend will use this reference
1352 to transfer a frame of network data to us.
1353 \end{description}
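
As a sketch (the grant-transfer helper named below is a placeholder, not a
real Xen function), posting an empty page on the receive ring involves
little more than filling in these two fields:

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: donate one empty page frame to the backend.      */
netif_rx_request_t rx;

rx.id   = req_id;                                      /* echoed in response */
rx.gref = make_transfer_ref(backend_domid, empty_pfn); /* assumed helper     */
\end{verbatim}
\normalsize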
1355 Receive response descriptors are queued for each received frame. Note
1356 that these may only be queued in reply to an existing receive request,
1357 providing an in-built form of traffic throttling.
1359 \scriptsize
1360 \begin{verbatim}
1361 typedef struct {
1362 uint16_t id;
1363 uint16_t offset; /* Offset in page of start of received packet */
1364 uint16_t flags; /* NETRXF_* */
1365 int16_t status; /* -ve: NETIF_RSP_* ; +ve: Rx'ed pkt size. */
1366 } netif_rx_response_t;
1367 \end{verbatim}
1368 \normalsize
1370 \begin{description}
1371 \item[id] ID echoed from the original request, used by the guest to
1372 match this response to the original request.
1373 \item[offset] Offset to data within the transferred frame.
1374 \item[flags] Receive flags (currently only NETRXF\_csum\_valid is
1375 supported, to indicate that the protocol checksum field has already
1376 been validated).
1377 \item[status] Success / error status for this operation.
1378 \end{description}
1380 Note that the receive protocol includes a mechanism for guests to
1381 receive incoming memory frames but there is no explicit transfer of
1382 frames in the other direction. Guests are expected to return memory
1383 to the hypervisor in order to use the network interface. They {\em
1384 must} do this or they will exceed their maximum memory reservation and
1385 will not be able to receive incoming frame transfers. When necessary,
1386 the backend is able to replenish its pool of free network buffers by
1387 claiming some of this free memory from the hypervisor.
1389 \section{Block I/O}
1391 All guest OS disk access goes through the virtual block device VBD
1392 interface. This interface allows domains access to portions of block
1393 storage devices visible to the block backend device. The VBD
1394 interface is a split driver, similar to the network interface
1395 described above. A single shared memory ring is used between the
1396 frontend and backend drivers for each virtual device, across which
1397 IO requests and responses are sent.
1399 Any block device accessible to the backend domain, including
1400 network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices,
1401 can be exported as a VBD. Each VBD is mapped to a device node in the
1402 guest, specified in the guest's startup configuration.
1404 \subsection{Data Transfer}
1406 The per-(virtual)-device ring between the guest and the block backend
1407 supports two messages:
1409 \begin{description}
1410 \item [{\small {\tt READ}}:] Read data from the specified block
1411 device. The front end identifies the device and location to read
1412 from and attaches pages for the data to be copied to (typically via
1413 DMA from the device). The backend acknowledges completed read
1414 requests as they finish.
1416 \item [{\small {\tt WRITE}}:] Write data to the specified block
1417 device. This functions essentially as {\small {\tt READ}}, except
1418 that the data moves to the device instead of from it.
1419 \end{description}
1421 %% Rather than copying data, the backend simply maps the domain's
1422 %% buffers in order to enable direct DMA to them. The act of mapping
1423 %% the buffers also increases the reference counts of the underlying
1424 %% pages, so that the unprivileged domain cannot try to return them to
1425 %% the hypervisor, install them as page tables, or any other unsafe
1426 %% behaviour.
1427 %%
1428 %% % block API here
1430 \subsection{Block ring interface}
1432 The block interface is defined by the structures passed over the
1433 shared memory interface. These structures are either requests (from
1434 the frontend to the backend) or responses (from the backend to the
1435 frontend).
1437 The request structure is defined as follows:
1439 \scriptsize
1440 \begin{verbatim}
1441 typedef struct blkif_request {
1442 uint8_t operation; /* BLKIF_OP_??? */
1443 uint8_t nr_segments; /* number of segments */
1444 blkif_vdev_t handle; /* only for read/write requests */
1445 uint64_t id; /* private guest value, echoed in resp */
1446 blkif_sector_t sector_number;/* start sector idx on disk (r/w only) */
1447 struct blkif_request_segment {
1448 grant_ref_t gref; /* reference to I/O buffer frame */
1449 /* @first_sect: first sector in frame to transfer (inclusive). */
1450 /* @last_sect: last sector in frame to transfer (inclusive). */
1451 uint8_t first_sect, last_sect;
1452 } seg[BLKIF_MAX_SEGMENTS_PER_REQUEST];
1453 } blkif_request_t;
1454 \end{verbatim}
1455 \normalsize
1457 The fields are as follows:
1459 \begin{description}
1460 \item[operation] operation ID: one of the operations described above
1461 \item[nr\_segments] number of segments for scatter / gather IO
1462 described by this request
1463 \item[handle] identifier for a particular virtual device on this
1464 interface
1465 \item[id] this value is echoed in the response message for this IO;
1466 the guest may use it to identify the original request
1467 \item[sector\_number] start sector on the virtual device for this
1468 request
1469 \item[seg] This array contains structures encoding
1470 scatter-gather IO to be performed:
1471 \begin{description}
1472 \item[gref] The grant reference for the foreign I/O buffer page.
1473 \item[first\_sect] First sector to access within the buffer page (0 to 7).
1474 \item[last\_sect] Last sector to access within the buffer page (0 to 7).
1475 \end{description}
1476 Data will be transferred into frames at an offset determined by the
1477 value of {\tt first\_sect}.
1478 \end{description}
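
For illustration, a minimal sketch of a single-segment read request using
the structure above (the device handle, grant reference and request id
are placeholder variables):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: read eight sectors (one 4KB page) from sector 0. */
blkif_request_t req;

req.operation         = BLKIF_OP_READ;
req.nr_segments       = 1;              /* one scatter-gather segment       */
req.handle            = vdev;           /* virtual device on this interface */
req.id                = cookie;         /* echoed in the response           */
req.sector_number     = 0;              /* start sector on the virtual disk */
req.seg[0].gref       = data_gref;      /* grant ref for the I/O buffer page */
req.seg[0].first_sect = 0;              /* sectors 0..7 cover the whole page */
req.seg[0].last_sect  = 7;
\end{verbatim}
\normalsize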
1480 \section{Virtual TPM}
1482 Virtual TPM (VTPM) support provides TPM functionality to each virtual
1483 machine that requests this functionality in its configuration file.
1484 The interface enables domains to access their own private TPM as if it
1485 were a hardware TPM built into the machine.
1487 The virtual TPM interface is implemented as a split driver,
1488 similar to the network and block interfaces described above.
1489 The user domain hosting the frontend exports a character device /dev/tpm0
1490 to user-level applications for communicating with the virtual TPM.
1491 This is the same device interface that is also offered if a hardware TPM
1492 is available in the system. The backend provides a single interface
1493 /dev/vtpm where the virtual TPM is waiting for commands from all domains
1494 that have located their backend in a given domain.
1496 \subsection{Data Transfer}
1498 A single shared memory ring is used between the frontend and backend
1499 drivers. TPM requests and responses are sent in pages where a pointer
1500 to those pages and other information is placed into the ring such that
1501 the backend can map the pages into its memory space using the grant
1502 table mechanism.
1504 The backend driver has been implemented to only accept well-formed
1505 TPM requests. To meet this requirement, the length indicator in the
1506 TPM request must correctly indicate the length of the request.
1507 Otherwise an error message is automatically sent back by the device driver.
1509 The virtual TPM implementation listens for TPM requests on /dev/vtpm. Since
1510 it must be able to apply the TPM request packet to the virtual TPM instance
1511 associated with the virtual machine, a 4-byte virtual TPM instance
1512 identifier is prepended to each packet by the backend driver (in network
1513 byte order) for internal routing of the request.
1515 \subsection{Virtual TPM ring interface}
1517 The TPM protocol is a strict request/response protocol and therefore
1518 only one ring is used to send requests from the frontend to the backend
1519 and responses on the reverse path.
1521 The request/response structure is defined as follows:
1523 \scriptsize
1524 \begin{verbatim}
1525 typedef struct {
1526 unsigned long addr; /* Machine address of packet. */
1527 grant_ref_t ref; /* grant table access reference. */
1528 uint16_t unused; /* unused */
1529 uint16_t size; /* Packet size in bytes. */
1530 } tpmif_tx_request_t;
1531 \end{verbatim}
1532 \normalsize
1534 The fields are as follows:
1536 \begin{description}
1537 \item[addr] The machine address of the page associated with the TPM
1538 request/response; a request/response may span multiple
1539 pages
1540 \item[ref] The grant table reference associated with the address.
1541 \item[size] The size of the remaining packet; up to
1542 PAGE{\textunderscore}SIZE bytes can be found in the
1543 page referenced by 'addr'.
1544 \end{description}
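
A minimal sketch of how one page of a request might be described in the
ring (the variable names are placeholders):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: describe one page of a TPM request.           */
tpmif_tx_request_t tx;

tx.addr = page_maddr;     /* machine address of the packet page       */
tx.ref  = page_gref;      /* grant reference covering the same page   */
tx.size = request_len;    /* bytes of the request held in this page   */
\end{verbatim}
\normalsize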
1546 The frontend initially allocates several pages whose addresses
1547 are stored in the ring. Only these pages are used for exchange of
1548 requests and responses.
1551 \chapter{Further Information}
1553 If you have questions that are not answered by this manual, the
1554 sources of information listed below may be of interest to you. Note
1555 that bug reports, suggestions and contributions related to the
1556 software (or the documentation) should be sent to the Xen developers'
1557 mailing list (address below).
1560 \section{Other documentation}
1562 If you are mainly interested in using (rather than developing for)
1563 Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/}
1564 directory of the Xen source distribution.
1566 % Various HOWTOs are also available in {\tt docs/HOWTOS}.
1569 \section{Online references}
1571 The official Xen web site can be found at:
1572 \begin{quote} {\tt http://www.xensource.com}
1573 \end{quote}
1576 This contains links to the latest versions of all online
1577 documentation, including the latest version of the FAQ.
1579 Information regarding Xen is also available at the Xen Wiki at
1580 \begin{quote} {\tt http://wiki.xensource.com/xenwiki/}\end{quote}
1581 The Xen project uses Bugzilla as its bug tracking system. You'll find
1582 the Xen Bugzilla at {\tt http://bugzilla.xensource.com/bugzilla/}.
1585 \section{Mailing lists}
1587 There are several mailing lists that are used to discuss Xen related
1588 topics. The most widely relevant are listed below. An official page of
1589 mailing lists and subscription information can be found at \begin{quote}
1590 {\tt http://lists.xensource.com/} \end{quote}
1592 \begin{description}
1593 \item[xen-devel@lists.xensource.com] Used for development
1594 discussions and bug reports. Subscribe at: \\
1595 {\small {\tt http://lists.xensource.com/xen-devel}}
1596 \item[xen-users@lists.xensource.com] Used for installation and usage
1597 discussions and requests for help. Subscribe at: \\
1598 {\small {\tt http://lists.xensource.com/xen-users}}
1599 \item[xen-announce@lists.xensource.com] Used for announcements only.
1600 Subscribe at: \\
1601 {\small {\tt http://lists.xensource.com/xen-announce}}
1602 \item[xen-changelog@lists.xensource.com] Changelog feed
1603 from the unstable and 2.0 trees - developer oriented. Subscribe at: \\
1604 {\small {\tt http://lists.xensource.com/xen-changelog}}
1605 \end{description}
1607 \appendix
1610 \chapter{Xen Hypercalls}
1611 \label{a:hypercalls}
1613 Hypercalls represent the procedural interface to Xen; this appendix
1614 categorizes and describes the current set of hypercalls.
1616 \section{Invoking Hypercalls}
1618 Hypercalls are invoked in a manner analogous to system calls in a
1619 conventional operating system; a software interrupt is issued which
1620 vectors to an entry point within Xen. On x86/32 machines the
1621 instruction required is {\tt int \$0x82}; the (real) IDT is set up so
1622 that this may only be issued from within ring 1. The particular
1623 hypercall to be invoked is contained in {\tt EAX} --- a list
1624 mapping these values to symbolic hypercall names can be found
1625 in {\tt xen/include/public/xen.h}.
1627 On some occasions a set of hypercalls will be required to carry
1628 out a higher-level function; a good example is when a guest
1629 operating system wishes to context switch to a new process, which
1630 requires updating various privileged CPU state. As an optimization
1631 for these cases, there is a generic mechanism to issue a set of
1632 hypercalls as a batch:
1634 \begin{quote}
1635 \hypercall{multicall(void *call\_list, int nr\_calls)}
1637 Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
1638 the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
1639 call\_list}. Each entry contains the hypercall operation code followed
1640 by up to 7 word-sized arguments.
1641 \end{quote}
1643 Note that multicalls are provided purely as an optimization; there is
1644 no requirement to use them when first porting a guest operating
1645 system.
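
As a sketch of the batching mechanism (assuming the {\tt
multicall\_entry\_t} layout from {\tt xen/include/public/xen.h}; the
guest-side wrapper {\tt HYPERVISOR\_multicall()} and the request
variables are assumptions, not part of this manual):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: issue two mmu_update batches as one multicall. */
multicall_entry_t calls[2];

calls[0].op      = __HYPERVISOR_mmu_update;
calls[0].args[0] = (unsigned long)req_a;     /* mmu_update_t array      */
calls[0].args[1] = count_a;                  /* number of updates       */
calls[0].args[2] = (unsigned long)&done_a;   /* success count (out)     */

calls[1].op      = __HYPERVISOR_mmu_update;
calls[1].args[0] = (unsigned long)req_b;
calls[1].args[1] = count_b;
calls[1].args[2] = (unsigned long)&done_b;

HYPERVISOR_multicall(calls, 2);              /* assumed guest-side wrapper */
\end{verbatim}
\normalsize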
1648 \section{Virtual CPU Setup}
1650 At start of day, a guest operating system needs to set up the virtual
1651 CPU it is executing on. This includes installing vectors for the
1652 virtual IDT so that the guest OS can handle interrupts, page faults,
1653 etc. However the very first thing a guest OS must set up is a pair
1654 of hypervisor callbacks: these are the entry points which Xen will
1655 use when it wishes to notify the guest OS of an occurrence.
1657 \begin{quote}
1658 \hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
1659 event\_address, unsigned long failsafe\_selector, unsigned long
1660 failsafe\_address) }
1662 Register the normal (``event'') and failsafe callbacks for
1663 event processing. In each case the code segment selector and
1664 address within that segment are provided. The selectors must
1665 have RPL 1; in XenLinux we simply use the kernel's CS for both
1666 {\bf event\_selector} and {\bf failsafe\_selector}.
1668 The value {\bf event\_address} specifies the address of the guest OS's
1669 event handling and dispatch routine; the {\bf failsafe\_address}
1670 specifies a separate entry point which is used only if a fault occurs
1671 when Xen attempts to use the normal callback.
1673 \end{quote}
1675 On x86/64 systems the hypercall takes slightly different
1676 arguments. This is because the callback CS does not need to be specified
1677 (since the callbacks are entered via SYSRET), and also because an
1678 entry address needs to be specified for SYSCALLs from guest user
1679 space:
1681 \begin{quote}
1682 \hypercall{set\_callbacks(unsigned long event\_address, unsigned long
1683 failsafe\_address, unsigned long syscall\_address)}
1684 \end{quote}
1687 After installing the hypervisor callbacks, the guest OS can
1688 install a `virtual IDT' by using the following hypercall:
1690 \begin{quote}
1691 \hypercall{set\_trap\_table(trap\_info\_t *table)}
1693 Install one or more entries into the per-domain
1694 trap handler table (essentially a software version of the IDT).
1695 Each entry in the array pointed to by {\bf table} includes the
1696 exception vector number with the corresponding segment selector
1697 and entry point. Most guest OSes can use the same handlers on
1698 Xen as when running on the real hardware.
1701 \end{quote}
1703 A further hypercall is provided for the management of virtual CPUs:
1705 \begin{quote}
1706 \hypercall{vcpu\_op(int cmd, int vcpuid, void *extra\_args)}
1708 This hypercall can be used to bootstrap VCPUs, to bring them up and
1709 down and to test their current status.
1711 \end{quote}
1713 \section{Scheduling and Timer}
1715 Domains are preemptively scheduled by Xen according to the
1716 parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
1717 In addition, however, a domain may choose to explicitly
1718 control certain behavior with the following hypercall:
1720 \begin{quote}
1721 \hypercall{sched\_op\_new(int cmd, void *extra\_args)}
1723 Request scheduling operation from hypervisor. The following
1724 sub-commands are available:
1726 \begin{description}
1727 \item[SCHEDOP\_yield] voluntarily yields the CPU, but leaves the
1728 caller marked as runnable. No extra arguments are passed to this
1729 command.
1730 \item[SCHEDOP\_block] removes the calling domain from the run queue
1731 and causes it to sleep until an event is delivered to it. No extra
1732 arguments are passed to this command.
1733 \item[SCHEDOP\_shutdown] is used to end the calling domain's
1734 execution. The extra argument is a {\bf sched\_shutdown} structure
1735 which indicates the reason for the shutdown (e.g., reboot,
1736 halt, power-off).
1737 \item[SCHEDOP\_poll] allows a VCPU to wait on a set of event channels
1738 with an optional timeout (all of which are specified in the {\bf
1739 sched\_poll} extra argument). The semantics are similar to the UNIX
1740 {\bf poll} system call. The caller must have event-channel upcalls
1741 masked when executing this command.
1742 \end{description}
1743 \end{quote}
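
For example, a guest might request a clean reboot as follows (a sketch
only, assuming the {\bf sched\_shutdown} structure from the public
headers; the guest-side wrapper {\tt HYPERVISOR\_sched\_op\_new()} is an
assumption):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: shut the calling domain down for a reboot.    */
struct sched_shutdown shutdown = { .reason = SHUTDOWN_reboot };

HYPERVISOR_sched_op_new(SCHEDOP_shutdown, &shutdown);
\end{verbatim}
\normalsize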
1745 {\bf sched\_op\_new} was not available prior to Xen 3.0.2. Older versions
1746 provide only the following hypercall:
1748 \begin{quote}
1749 \hypercall{sched\_op(int cmd, unsigned long extra\_arg)}
1751 This hypercall supports the following subset of {\bf sched\_op\_new} commands:
1753 \begin{description}
1754 \item[SCHEDOP\_yield] (extra argument is 0).
1755 \item[SCHEDOP\_block] (extra argument is 0).
1756 \item[SCHEDOP\_shutdown] (extra argument is numeric reason code).
1757 \end{description}
1758 \end{quote}
1760 To aid the implementation of a process scheduler within a guest OS,
1761 Xen provides a virtual programmable timer:
1763 \begin{quote}
1764 \hypercall{set\_timer\_op(uint64\_t timeout)}
1766 Request a timer event to be sent at the specified system time (time
1767 in nanoseconds since system boot).
1769 \end{quote}
1771 Note that calling {\bf set\_timer\_op} prior to {\bf sched\_op}
1772 allows block-with-timeout semantics.
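
A sketch of this idiom (the time-reading helper and the hypercall
wrappers below are assumptions):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: block until an event arrives or 10ms elapse.  */
uint64_t now     = read_system_time_ns();     /* assumed helper       */
uint64_t timeout = now + 10000000ULL;         /* 10ms in nanoseconds  */

HYPERVISOR_set_timer_op(timeout);             /* arm the one-shot timer  */
HYPERVISOR_sched_op(SCHEDOP_block, 0);        /* sleep until event/timer */
\end{verbatim}
\normalsize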
1775 \section{Page Table Management}
1777 Since guest operating systems have read-only access to their page
1778 tables, Xen must be involved when making any changes. The following
1779 multi-purpose hypercall can be used to modify page-table entries,
1780 update the machine-to-physical mapping table, flush the TLB, install
1781 a new page-table base pointer, and more.
1783 \begin{quote}
1784 \hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}
1786 Update the page table for the domain; a set of {\bf count} updates are
1787 submitted for processing in a batch, with {\bf success\_count} being
1788 updated to report the number of successful updates.
1790 Each element of {\bf req[]} contains a pointer (address) and value;
1791 the least significant 2-bits of the pointer are used to distinguish
1792 the type of update requested as follows:
1793 \begin{description}
1795 \item[MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
1796 page table entry to the associated value; Xen will check that the
1797 update is safe, as described in Chapter~\ref{c:memory}.
1799 \item[MMU\_MACHPHYS\_UPDATE:] update an entry in the
1800 machine-to-physical table. The calling domain must own the machine
1801 page in question (or be privileged).
1802 \end{description}
1804 \end{quote}
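
A minimal sketch of a single page-table update (assuming the two-word
{\bf mmu\_update\_t} layout from the public headers; the wrapper {\tt
HYPERVISOR\_mmu\_update()} and the variable names are assumptions):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: write one PTE via MMU_NORMAL_PT_UPDATE.       */
mmu_update_t u;
int done = 0;

u.ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE;  /* low 2 bits = type */
u.val = new_pte_value;                            /* new PTE contents  */

HYPERVISOR_mmu_update(&u, 1, &done);
\end{verbatim}
\normalsize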
1806 Explicitly updating batches of page table entries is extremely
1807 efficient, but can require a number of alterations to the guest
1808 OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
1809 recommended for new OS ports.
1811 Regardless of which page table update mode is being used, however,
1812 there are some occasions (notably handling a demand page fault) where
1813 a guest OS will wish to modify exactly one PTE rather than a
1814 batch, and where that PTE is mapped into the current address space.
1815 This is catered for by the following:
1817 \begin{quote}
1818 \hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
1819 unsigned long flags)}
1821 Update the currently installed PTE that maps virtual address {\bf va}
1822 to new value {\bf val}. As with {\bf mmu\_update}, Xen checks the
1823 modification is safe before applying it. The {\bf flags} determine
1824 which kind of TLB flush, if any, should follow the update.
1826 \end{quote}
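
For instance, a demand-fault handler might patch a single mapping and
flush only the affected virtual address (a sketch; the wrapper name and
variables are assumptions, the {\tt UVMF\_INVLPG} flag is from the
public headers):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: update one PTE and flush just that VA.        */
HYPERVISOR_update_va_mapping(fault_va & PAGE_MASK,   /* page-aligned VA  */
                             new_pte_value,          /* new PTE contents */
                             UVMF_INVLPG);           /* invalidate one VA */
\end{verbatim}
\normalsize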
1828 Finally, sufficiently privileged domains may occasionally wish to manipulate
1829 the pages of others:
1831 \begin{quote}
1832 \hypercall{update\_va\_mapping(unsigned long va, uint64\_t val,
1833 unsigned long flags, domid\_t domid)}
1835 Identical to {\bf update\_va\_mapping} save that the pages being
1836 mapped must belong to the domain {\bf domid}.
1838 \end{quote}
1840 An additional MMU hypercall provides an ``extended command''
1841 interface. This provides additional functionality beyond the basic
1842 table updating commands:
1844 \begin{quote}
1846 \hypercall{mmuext\_op(struct mmuext\_op *op, int count, int *success\_count, domid\_t domid)}
1848 This hypercall is used to perform additional MMU operations. These
1849 include updating {\tt cr3} (or just re-installing it for a TLB flush),
1850 requesting various kinds of TLB flush, flushing the cache, installing
1851 a new LDT, or pinning \& unpinning page-table pages (to ensure their
1852 reference count doesn't drop to zero, which would otherwise require a
1853 revalidation of all entries). Some of the operations available are
1854 restricted to domains with sufficient system privileges.
1856 It is also possible for privileged domains to reassign page ownership
1857 via an extended MMU operation, although grant tables are used instead
1858 of this where possible; see Section~\ref{s:idc}.
1860 \end{quote}
1862 Finally, a hypercall interface is exposed to activate and deactivate
1863 various optional facilities provided by Xen for memory management.
1865 \begin{quote}
1866 \hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
1868 Toggle various memory management modes (in particular writable page
1869 tables).
1871 \end{quote}
1873 \section{Segmentation Support}
1875 Xen allows guest OSes to install a custom GDT if they require it;
1876 this is context switched transparently whenever a domain is
1877 [de]scheduled. The following hypercall is effectively a
1878 `safe' version of {\tt lgdt}:
1880 \begin{quote}
1881 \hypercall{set\_gdt(unsigned long *frame\_list, int entries)}
1883 Install a global descriptor table for a domain; {\bf frame\_list} is
1884 an array of up to 16 machine page frames within which the GDT resides,
1885 with {\bf entries} being the actual number of descriptor-entry
1886 slots. All page frames must be mapped read-only within the guest's
1887 address space, and the table must be large enough to contain Xen's
1888 reserved entries (see {\bf xen/include/public/arch-x86\_32.h}).
1890 \end{quote}
1892 Many guest OSes will also wish to install LDTs; this is achieved by
1893 using {\bf mmu\_update} with an extended command, passing the
1894 linear address of the LDT base along with the number of entries. No
1895 special safety checks are required; Xen needs to perform this task
1896 simply because {\tt lldt} requires CPL 0.
1899 Xen also allows guest operating systems to update just an
1900 individual segment descriptor in the GDT or LDT:
1902 \begin{quote}
1903 \hypercall{update\_descriptor(uint64\_t ma, uint64\_t desc)}
1905 Update the GDT/LDT entry at machine address {\bf ma}; the new
1906 8-byte descriptor is stored in {\bf desc}.
1907 Xen performs a number of checks to ensure the descriptor is
1908 valid.
1910 \end{quote}
1912 Guest OSes can use the above in place of context switching entire
1913 LDTs (or the GDT) when the number of changing descriptors is small.
1915 \section{Context Switching}
1917 When a guest OS wishes to context switch between two processes,
1918 it can use the page table and segmentation hypercalls described
1919 above to perform the bulk of the privileged work. In addition,
1920 however, it will need to invoke Xen to switch the kernel (ring 1)
1921 stack pointer:
1923 \begin{quote}
1924 \hypercall{stack\_switch(unsigned long ss, unsigned long esp)}
1926 Request kernel stack switch from hypervisor; {\bf ss} is the new
1927 stack segment and {\bf esp} is the new stack pointer.
1929 \end{quote}
1931 A useful hypercall for context switching allows ``lazy'' save and
1932 restore of floating point state:
1934 \begin{quote}
1935 \hypercall{fpu\_taskswitch(int set)}
1937 This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
1938 control register; this means that the next attempt to use floating
1939 point will cause a fault which the guest OS can catch. Typically it will
1940 then save/restore the FP state, and clear the {\tt TS} bit, using the
1941 same call.
1942 \end{quote}
1944 This is provided as an optimization only; guest OSes can also choose
1945 to save and restore FP state on all context switches for simplicity.
1947 Finally, a hypercall is provided for entering vm86 mode:
1949 \begin{quote}
1950 \hypercall{switch\_vm86}
1952 This allows the guest to run code in vm86 mode, which is needed for
1953 some legacy software.
1954 \end{quote}
1956 \section{Physical Memory Management}
1958 As mentioned previously, each domain has a maximum and current
1959 memory allocation. The maximum allocation, set at domain creation
1960 time, cannot be modified. However a domain can choose to reduce
1961 and subsequently grow its current allocation by using the
1962 following call:
1964 \begin{quote}
1965 \hypercall{memory\_op(unsigned int op, void *arg)}
1967 Increase or decrease current memory allocation (as determined by
1968 the value of {\bf op}). The available operations are:
1970 \begin{description}
1971 \item[XENMEM\_increase\_reservation] Request an increase in machine
1972 memory allocation; {\bf arg} must point to a {\bf
1973 xen\_memory\_reservation} structure.
1974 \item[XENMEM\_decrease\_reservation] Request a decrease in machine
1975 memory allocation; {\bf arg} must point to a {\bf
1976 xen\_memory\_reservation} structure.
1977 \item[XENMEM\_maximum\_ram\_page] Request the frame number of the
1978 highest-addressed frame of machine memory in the system. {\bf arg}
1979 must point to an {\bf unsigned long} where this value will be
1980 stored.
1981 \item[XENMEM\_current\_reservation] Returns current memory reservation
1982 of the specified domain.
1983 \item[XENMEM\_maximum\_reservation] Returns maximum memory reservation
1984 of the specified domain.
1985 \end{description}
1987 \end{quote}
1989 In addition to simply reducing or increasing the current memory
1990 allocation via a `balloon driver', this call is also useful for
1991 obtaining contiguous regions of machine memory when required (e.g.
1992 for certain PCI devices, or if using superpages).
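
As a sketch of the balloon-down case (the reservation structure is
simplified here to a plain frame-list pointer, and the wrapper {\tt
HYPERVISOR\_memory\_op()} is an assumption):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: return nr_pages machine frames to Xen.        */
struct xen_memory_reservation reservation;

reservation.extent_start = frame_list;   /* array of machine frame numbers  */
reservation.nr_extents   = nr_pages;     /* number of frames being returned */
reservation.extent_order = 0;            /* individual 4KB pages            */
reservation.domid        = DOMID_SELF;

HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
\end{verbatim}
\normalsize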
1995 \section{Inter-Domain Communication}
1996 \label{s:idc}
1998 Xen provides a simple asynchronous notification mechanism via
1999 \emph{event channels}. Each domain has a set of end-points (or
2000 \emph{ports}) which may be bound to an event source (e.g. a physical
2001 IRQ, a virtual IRQ, or a port in another domain). When a pair of
2002 end-points in two different domains are bound together, then a `send'
2003 operation on one will cause an event to be received by the destination
2004 domain.
2006 The control and use of event channels involves the following hypercall:
2008 \begin{quote}
2009 \hypercall{event\_channel\_op(evtchn\_op\_t *op)}
2011 Inter-domain event-channel management; {\bf op} is a discriminated
2012 union which allows the following 7 operations:
2014 \begin{description}
2016 \item[alloc\_unbound:] allocate a free (unbound) local
2017 port and prepare for connection from a specified domain.
2018 \item[bind\_virq:] bind a local port to a virtual
2019 IRQ; any particular VIRQ can be bound to at most one port per domain.
2020 \item[bind\_pirq:] bind a local port to a physical IRQ;
2021 once more, a given pIRQ can be bound to at most one port per
2022 domain. Furthermore the calling domain must be sufficiently
2023 privileged.
2024 \item[bind\_interdomain:] construct an interdomain event
2025 channel; in general, the target domain must have previously allocated
2026 an unbound port for this channel, although this can be bypassed by
2027 privileged domains during domain setup.
2028 \item[close:] close an interdomain event channel.
2029 \item[send:] send an event to the remote end of an
2030 interdomain event channel.
2031 \item[status:] determine the current status of a local port.
2032 \end{description}
2034 For more details see
2035 {\bf xen/include/public/event\_channel.h}.
2037 \end{quote}
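
A sketch of typical usage, first allocating an unbound port for a peer
domain and later signalling the channel (assuming the {\tt evtchn\_op\_t}
union from the header above; the wrapper {\tt
HYPERVISOR\_event\_channel\_op()} and the peer-domain variable are
assumptions):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: allocate an unbound port, then send an event. */
evtchn_op_t op;
uint32_t local_port;

op.cmd = EVTCHNOP_alloc_unbound;
op.u.alloc_unbound.dom        = DOMID_SELF;
op.u.alloc_unbound.remote_dom = peer_domid;     /* peer allowed to bind  */
HYPERVISOR_event_channel_op(&op);
local_port = op.u.alloc_unbound.port;           /* filled in by Xen      */

op.cmd = EVTCHNOP_send;
op.u.send.port = local_port;                    /* notify the remote end */
HYPERVISOR_event_channel_op(&op);
\end{verbatim}
\normalsize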
2039 Event channels are the fundamental communication primitive between
2040 Xen domains and seamlessly support SMP. However they provide little
2041 bandwidth for communication {\sl per se}, and hence are typically
2042 married with a piece of shared memory to produce effective and
2043 high-performance inter-domain communication.
2045 Safe sharing of memory pages between guest OSes is carried out by
2046 granting access on a per page basis to individual domains. This is
2047 achieved by using the {\tt grant\_table\_op} hypercall.
2049 \begin{quote}
2050 \hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
2052 Used to invoke operations on a grant reference, to setup the grant
2053 table and to dump the tables' contents for debugging.
2055 \end{quote}
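
For example, a backend might map a page that a frontend has granted it
(a sketch, assuming the {\tt gnttab\_map\_grant\_ref} structure from
{\tt xen/include/public/grant\_table.h}; the wrapper and the variables
holding the peer's domain id and grant reference are assumptions):

\scriptsize
\begin{verbatim}
/* Hypothetical sketch: map a foreign frame granted by another domain. */
struct gnttab_map_grant_ref map;

map.host_addr = (unsigned long)map_va;  /* where to place the mapping    */
map.flags     = GNTMAP_host_map;        /* map into our address space    */
map.ref       = foreign_gref;           /* grant reference from the peer */
map.dom       = foreign_domid;          /* domain that issued the grant  */

HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1);
/* On return, map.status and map.handle are filled in by Xen.          */
\end{verbatim}
\normalsize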
2057 \section{IO Configuration}
2059 Domains with physical device access (i.e.\ driver domains) receive
2060 limited access to certain PCI devices (bus address space and
2061 interrupts). However many guest operating systems attempt to
2062 determine the PCI configuration by directly accessing the PCI BIOS,
2063 which cannot be allowed for safety.
2065 Instead, Xen provides the following hypercall:
2067 \begin{quote}
2068 \hypercall{physdev\_op(void *physdev\_op)}
2070 Set and query IRQ configuration details, set the system IOPL, set the
2071 TSS IO bitmap.
2073 \end{quote}
2076 For examples of using {\tt physdev\_op}, see the
2077 Xen-specific PCI code in the linux sparse tree.
2079 \section{Administrative Operations}
2080 \label{s:dom0ops}
2082 A large number of control operations are available to a sufficiently
2083 privileged domain (typically domain 0). These allow the creation and
2084 management of new domains, for example. A complete list is given
2085 below: for more details on any or all of these, please see
2086 {\tt xen/include/public/dom0\_ops.h}
2089 \begin{quote}
2090 \hypercall{dom0\_op(dom0\_op\_t *op)}
2092 Administrative domain operations for domain management. The options are:
2094 \begin{description}
2095 \item [DOM0\_GETMEMLIST:] get list of pages used by the domain
2097 \item [DOM0\_SCHEDCTL:]
2099 \item [DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain
2101 \item [DOM0\_CREATEDOMAIN:] create a new domain
2103 \item [DOM0\_DESTROYDOMAIN:] deallocate all resources associated
2104 with a domain
2106 \item [DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
2107 queue.
2109 \item [DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
2110 once again.
2112 \item [DOM0\_GETDOMAININFO:] get statistics about the domain
2114 \item [DOM0\_SETDOMAININFO:] set VCPU-related attributes
2116 \item [DOM0\_MSR:] read or write model specific registers
2118 \item [DOM0\_DEBUG:] interactively invoke the debugger
2120 \item [DOM0\_SETTIME:] set system time
2122 \item [DOM0\_GETPAGEFRAMEINFO:]
2124 \item [DOM0\_READCONSOLE:] read console content from hypervisor buffer ring
2126 \item [DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU
2128 \item [DOM0\_TBUFCONTROL:] get and set trace buffer attributes
2130 \item [DOM0\_PHYSINFO:] get information about the host machine
2132 \item [DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler
2134 \item [DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes
2136 \item [DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain
2138 \item [DOM0\_GETPAGEFRAMEINFO2:] batched interface for getting
2139 page frame info
2141 \item [DOM0\_ADD\_MEMTYPE:] set MTRRs
2143 \item [DOM0\_DEL\_MEMTYPE:] remove a memory type range
2145 \item [DOM0\_READ\_MEMTYPE:] read MTRR
2147 \item [DOM0\_PERFCCONTROL:] control Xen's software performance
2148 counters
2150 \item [DOM0\_MICROCODE:] update CPU microcode
2152 \item [DOM0\_IOPORT\_PERMISSION:] modify domain permissions for an
2153 IO port range (enable / disable a range for a particular domain)
2155 \item [DOM0\_GETVCPUCONTEXT:] get context from a VCPU
2157 \item [DOM0\_GETVCPUINFO:] get current state for a VCPU
2158 \item [DOM0\_GETDOMAININFOLIST:] batched interface to get domain
2159 info
2161 \item [DOM0\_PLATFORM\_QUIRK:] inform Xen of a platform quirk it
2162 needs to handle (e.g. noirqbalance)
2164 \item [DOM0\_PHYSICAL\_MEMORY\_MAP:] get info about dom0's memory
2165 map
2167 \item [DOM0\_MAX\_VCPUS:] change max number of VCPUs for a domain
2169 \item [DOM0\_SETDOMAINHANDLE:] set the handle for a domain
2171 \end{description}
2172 \end{quote}
2174 Most of the above are best understood by looking at the code
2175 implementing them (in {\tt xen/common/dom0\_ops.c}) and in
2176 the user-space tools that use them (mostly in {\tt tools/libxc}).
2178 Hypercalls relating to the management of the Access Control Module are
2179 also restricted to domain 0 access for now:
2181 \begin{quote}
2183 \hypercall{acm\_op(struct acm\_op * u\_acm\_op)}
2185 This hypercall can be used to configure the state of the ACM, query
2186 that state, request access control decisions and dump additional
2187 information.
2189 \end{quote}
2192 \section{Debugging Hypercalls}
2194 A few additional hypercalls are mainly useful for debugging:
2196 \begin{quote}
2197 \hypercall{console\_io(int cmd, int count, char *str)}
2199 Use Xen to interact with the console; operations are:
2201 {CONSOLEIO\_write}: Output count characters from buffer str.
2203 {CONSOLEIO\_read}: Input at most count characters into buffer str.
2204 \end{quote}
2206 A pair of hypercalls allows access to the underlying debug registers:
2207 \begin{quote}
2208 \hypercall{set\_debugreg(int reg, unsigned long value)}
2210 Set debug register {\bf reg} to {\bf value}
2212 \hypercall{get\_debugreg(int reg)}
2214 Return the contents of the debug register {\bf reg}
2215 \end{quote}
2217 And finally:
2218 \begin{quote}
2219 \hypercall{xen\_version(int cmd)}
2221 Request Xen version number.
2222 \end{quote}
2224 This is useful to ensure that user-space tools are in sync
2225 with the underlying hypervisor.
2228 \end{document}