1 \documentclass[11pt,twoside,final,openright]{report}
2 \usepackage{a4,graphicx,html,setspace,times}
3 \usepackage{comment,parskip}
4 \setstretch{1.15}
6 \begin{document}
8 % TITLE PAGE
9 \pagestyle{empty}
10 \begin{center}
11 \vspace*{\fill}
12 \includegraphics{figs/xenlogo.eps}
13 \vfill
14 \vfill
15 \vfill
16 \begin{tabular}{l}
17 {\Huge \bf Interface manual} \\[4mm]
18 {\huge Xen v2.0 for x86} \\[80mm]
20 {\Large Xen is Copyright (c) 2002-2004, The Xen Team} \\[3mm]
21 {\Large University of Cambridge, UK} \\[20mm]
22 \end{tabular}
23 \end{center}
25 {\bf
26 DISCLAIMER: This documentation is currently under active development
27 and as such there may be mistakes and omissions --- watch out for
28 these and please report any you find to the developer's mailing list.
29 Contributions of material, suggestions and corrections are welcome.
30 }
32 \vfill
33 \cleardoublepage
35 % TABLE OF CONTENTS
36 \pagestyle{plain}
37 \pagenumbering{roman}
38 { \parskip 0pt plus 1pt
39 \tableofcontents }
40 \cleardoublepage
42 % PREPARE FOR MAIN TEXT
43 \pagenumbering{arabic}
44 \raggedbottom
45 \widowpenalty=10000
46 \clubpenalty=10000
47 \parindent=0pt
48 \parskip=5pt
49 \renewcommand{\topfraction}{.8}
50 \renewcommand{\bottomfraction}{.8}
51 \renewcommand{\textfraction}{.2}
52 \renewcommand{\floatpagefraction}{.8}
53 \setstretch{1.1}
55 \chapter{Introduction}
57 Xen allows the hardware resources of a machine to be virtualized and
58 dynamically partitioned, allowing multiple different {\em guest}
59 operating system images to be run simultaneously. Virtualizing the
60 machine in this manner provides considerable flexibility, for example
61 allowing different users to choose their preferred operating system
62 (e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
63 provides secure partitioning between virtual machines (known as
64 {\em domains} in Xen terminology), and enables better resource
65 accounting and QoS isolation than can be achieved with a conventional
66 operating system.
68 Xen essentially takes a `whole machine' virtualization approach as
69 pioneered by IBM VM/370. However, unlike VM/370 or more recent
70 efforts such as VMWare and Virtual PC, Xen does not attempt to
71 completely virtualize the underlying hardware. Instead parts of the
72 hosted guest operating systems are modified to work with the VMM; the
73 operating system is effectively ported to a new target architecture,
74 typically requiring changes in just the machine-dependent code. The
75 user-level API is unchanged, and so existing binaries and operating
76 system distributions work without modification.
78 In addition to exporting virtualized instances of CPU, memory, network
79 and block devices, Xen exposes a control interface to manage how these
80 resources are shared between the running domains. Access to the
81 control interface is restricted: it may only be used by one
82 specially-privileged VM, known as {\em domain 0}. This domain is a
83 required part of any Xen-based server and runs the application software
84 that manages the control-plane aspects of the platform. Running the
85 control software in {\it domain 0}, distinct from the hypervisor
86 itself, allows the Xen framework to separate the notions of
87 mechanism and policy within the system.
91 \chapter{Virtual Architecture}
93 On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It
94 has full access to the physical memory available in the system and is
95 responsible for allocating portions of it to the domains. Guest
96 operating systems run in and use {\it rings 1}, {\it 2} and {\it 3} as
97 they see fit. Segmentation is used to prevent the guest OS from
98 accessing the portion of the address space that is reserved for
99 Xen. We expect most guest operating systems will use ring 1 for their
100 own operation and place applications in ring 3.
102 In this chapter we consider the basic virtual architecture provided
103 by Xen: the basic CPU state, exception and interrupt handling, and
104 time. Other aspects such as memory and device access are discussed
105 in later chapters.
107 \section{CPU state}
109 All privileged state must be handled by Xen. The guest OS has no
110 direct access to CR3 and is not permitted to update privileged bits in
111 EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
112 these are analogous to system calls but occur from ring 1 to ring 0.
114 A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.
118 \section{Exceptions}
120 A virtual IDT is provided --- a domain can submit a table of trap
121 handlers to Xen via the {\tt set\_trap\_table()} hypercall. Most trap
122 handlers are identical to native x86 handlers, although the page-fault
123 handler is somewhat different.
126 \section{Interrupts and events}
128 Interrupts are virtualized by mapping them to \emph{events}, which are
129 delivered asynchronously to the target domain using a callback
130 supplied via the {\tt set\_callbacks()} hypercall. A guest OS can map
131 these events onto its standard interrupt dispatch mechanisms. Xen is
132 responsible for determining the target domain that will handle each
133 physical interrupt source. For more details on the binding of event
134 sources to events, see Chapter~\ref{c:devices}.
138 \section{Time}
140 Guest operating systems need to be aware of the passage of both real
141 (or wallclock) time and their own `virtual time' (the time for
142 which they have been executing). Furthermore, Xen has a notion of
143 time which is used for scheduling. The following notions of
144 time are provided:
146 \begin{description}
147 \item[Cycle counter time.]
149 This provides a fine-grained time reference. The cycle counter time is
150 used to accurately extrapolate the other time references. On SMP machines
151 it is currently assumed that the cycle counter time is synchronized between
152 CPUs. The current x86-based implementation achieves this within inter-CPU
153 communication latencies.
155 \item[System time.]
157 This is a 64-bit counter which holds the number of nanoseconds that
158 have elapsed since system boot.
161 \item[Wall clock time.]
163 This is the time of day in a Unix-style {\tt struct timeval} (seconds
164 and microseconds since 1 January 1970, adjusted by leap seconds). An
165 NTP client hosted by {\it domain 0} can keep this value accurate.
168 \item[Domain virtual time.]
170 This progresses at the same pace as system time, but only while a
171 domain is executing --- it stops while a domain is de-scheduled.
172 Therefore the share of the CPU that a domain receives is indicated by
173 the rate at which its virtual time increases.
175 \end{description}
178 Xen exports timestamps for system time and wall-clock time to guest
179 operating systems through a shared page of memory. Xen also provides
180 the cycle counter time at the instant the timestamps were calculated,
181 and the CPU frequency in Hertz. This allows the guest to extrapolate
182 system and wall-clock times accurately based on the current cycle
183 counter time.
185 Since all time stamps need to be updated and read \emph{atomically},
186 two version numbers are also stored in the shared info page. The
187 first is incremented prior to an update, while the second is only
188 incremented afterwards. Thus a guest can be sure that it has read a
189 consistent state by checking that the two version numbers are equal.
191 Xen includes a periodic ticker which sends a timer event to the
192 currently executing domain every 10ms. The Xen scheduler also sends a
193 timer event whenever a domain is scheduled; this allows the guest OS
194 to adjust for the time that has passed while it has been inactive. In
195 addition, Xen allows each domain to request that they receive a timer
196 event sent at a specified system time by using the {\tt
197 set\_timer\_op()} hypercall. Guest OSes may use this timer to
198 implement timeout values when they block.
202 %% % akw: demoting this to a section -- not sure if there is any point
203 %% % though, maybe just remove it.
205 \section{Xen CPU Scheduling}
207 Xen offers a uniform API for CPU schedulers. It is possible to choose
208 from a number of schedulers at boot and it should be easy to add more.
209 The BVT, Atropos and Round Robin schedulers are part of the normal
210 Xen distribution. BVT provides proportional fair shares of the CPU to
211 the running domains. Atropos can be used to reserve absolute shares
212 of the CPU for each domain. Round-robin is provided as an example of
213 Xen's internal scheduler API.
215 \paragraph*{Note: SMP host support}
216 Xen has always supported SMP host systems. Domains are statically assigned to
217 CPUs, either at creation time or when manually pinning to a particular CPU.
218 The current schedulers then run locally on each CPU to decide which of the
219 assigned domains should be run there. The user-level control software
220 can be used to perform coarse-grain load-balancing between CPUs.
223 %% More information on the characteristics and use of these schedulers is
224 %% available in {\tt Sched-HOWTO.txt}.
227 \section{Privileged operations}
229 Xen exports an extended interface to privileged domains (viz.\ {\it
230 Domain 0}). This allows such domains to build and boot other domains
231 on the server, and provides control interfaces for managing
232 scheduling, memory, networking, and block devices.
235 \chapter{Memory}
236 \label{c:memory}
238 Xen is responsible for managing the allocation of physical memory to
239 domains, and for ensuring safe use of the paging and segmentation
240 hardware.
243 \section{Memory Allocation}
246 Xen resides within a small fixed portion of physical memory; it also
247 reserves the top 64MB of every virtual address space. The remaining
248 physical memory is available for allocation to domains at a page
249 granularity. Xen tracks the ownership and use of each page, which
250 allows it to enforce secure partitioning between domains.
252 Each domain has a maximum and current physical memory allocation.
253 A guest OS may run a `balloon driver' to dynamically adjust its
254 current memory allocation up to its limit.
257 %% XXX SMH: I use machine and physical in the next section (which
258 %% is kinda required for consistency with code); wonder if this
259 %% section should use same terms?
260 %%
261 %% Probably.
262 %%
263 %% Merging this and below section at some point prob makes sense.
265 \section{Pseudo-Physical Memory}
267 Since physical memory is allocated and freed on a page granularity,
268 there is no guarantee that a domain will receive a contiguous stretch
269 of physical memory. However most operating systems do not have good
270 support for operating in a fragmented physical address space. To aid
271 porting such operating systems to run on top of Xen, we make a
272 distinction between \emph{machine memory} and \emph{pseudo-physical
273 memory}.
275 Put simply, machine memory refers to the entire amount of memory
276 installed in the machine, including that reserved by Xen, in use by
277 various domains, or currently unallocated. We consider machine memory
278 to comprise a set of 4K \emph{machine page frames} numbered
279 consecutively starting from 0. Machine frame numbers mean the same
280 within Xen or any domain.
282 Pseudo-physical memory, on the other hand, is a per-domain
283 abstraction. It allows a guest operating system to consider its memory
284 allocation to consist of a contiguous range of physical page frames
285 starting at physical frame 0, despite the fact that the underlying
286 machine page frames may be sparsely allocated and in any order.
288 To achieve this, Xen maintains a globally readable {\it
289 machine-to-physical} table which records the mapping from machine page
290 frames to pseudo-physical ones. In addition, each domain is supplied
291 with a {\it physical-to-machine} table which performs the inverse
292 mapping. Clearly the machine-to-physical table has size proportional
293 to the amount of RAM installed in the machine, while each
294 physical-to-machine table has size proportional to the memory
295 allocation of the given domain.
297 Architecture dependent code in guest operating systems can then use
298 the two tables to provide the abstraction of pseudo-physical
299 memory. In general, only certain specialized parts of the operating
300 system (such as page table management) need to understand the
301 difference between machine and pseudo-physical addresses.
303 \section{Page Table Updates}
305 In the default mode of operation, Xen enforces read-only access to
306 page tables and requires guest operating systems to explicitly request
307 any modifications. Xen validates all such requests and only applies
308 updates that it deems safe. This is necessary to prevent domains from
309 adding arbitrary mappings to their page tables.
311 To aid validation, Xen associates a type and reference count with each
312 memory page. A page has one of the following
313 mutually-exclusive types at any point in time: page directory ({\sf
314 PD}), page table ({\sf PT}), local descriptor table ({\sf LDT}),
315 global descriptor table ({\sf GDT}), or writable ({\sf RW}). Note that
316 a guest OS may always create readable mappings of its own memory
317 regardless of its current type.
318 %%% XXX: possibly explain more about ref count 'lifecyle' here?
319 This mechanism is used to
320 maintain the invariants required for safety; for example, a domain
321 cannot have a writable mapping to any part of a page table as this
322 would require the page concerned to simultaneously be of types {\sf
323 PT} and {\sf RW}.
326 %\section{Writable Page Tables}
328 Xen also provides an alternative mode of operation in which guests
329 have the illusion that their page tables are directly writable. Of
330 course this is not really the case, since Xen must still validate
331 modifications to ensure secure partitioning. To this end, Xen traps
332 any write attempt to a memory page of type {\sf PT} (i.e., that is
333 currently part of a page table). If such an access occurs, Xen
334 temporarily allows write access to that page while at the same time
335 {\em disconnecting} it from the page table that is currently in
336 use. This allows the guest to safely make updates to the page because
337 the newly-updated entries cannot be used by the MMU until Xen
338 revalidates and reconnects the page.
339 Reconnection occurs automatically in a number of situations: for
340 example, when the guest modifies a different page-table page, when the
341 domain is preempted, or whenever the guest uses Xen's explicit
342 page-table update interfaces.
344 Finally, Xen also supports a form of \emph{shadow page tables} in
345 which the guest OS uses an independent copy of page tables which are
346 unknown to the hardware (i.e.\ which are never pointed to by {\tt
347 cr3}). Instead Xen propagates changes made to the guest's tables to the
348 real ones, and vice versa. This is useful for logging page writes
349 (e.g.\ for live migration or checkpointing). A full version of the shadow
350 page tables also allows guest OS porting with less effort.
352 \section{Segment Descriptor Tables}
354 On boot a guest is supplied with a default GDT, which does not reside
355 within its own memory allocation. If the guest wishes to use anything
356 than the default `flat' ring-1 and ring-3 segments that this GDT
357 provides, it must register a custom GDT and/or LDT with Xen,
358 allocated from its own memory. Note that a number of GDT
359 entries are reserved by Xen -- any custom GDT must also include
360 sufficient space for these entries.
362 For example, the following hypercall is used to specify a new GDT:
364 \begin{quote}
365 int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})
367 {\em frame\_list}: An array of up to 16 machine page frames within
368 which the GDT resides. Any frame registered as a GDT frame may only
369 be mapped read-only within the guest's address space (e.g., no
370 writable mappings, no use as a page-table page, and so on).
372 {\em entries}: The number of descriptor-entry slots in the GDT. Note
373 that the table must be large enough to contain Xen's reserved entries;
374 thus we must have `{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}\ '.
375 Note also that, after registering the GDT, slots {\em FIRST\_} through
376 {\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and
377 may be overwritten by Xen.
378 \end{quote}
380 The LDT is updated via the generic MMU update mechanism (i.e., via
381 the {\tt mmu\_update()} hypercall).
383 \section{Start of Day}
385 The start-of-day environment for guest operating systems is rather
386 different from that provided by the underlying hardware. In particular,
387 the processor is already executing in protected mode with paging
388 enabled.
390 {\it Domain 0} is created and booted by Xen itself. For all subsequent
391 domains, the analogue of the boot-loader is the {\it domain builder},
392 user-space software running in {\it domain 0}. The domain builder
393 is responsible for building the initial page tables for a domain
394 and loading its kernel image at the appropriate virtual address.
398 \chapter{Devices}
399 \label{c:devices}
401 Devices such as network and disk are exported to guests using a
402 split device driver. The device driver domain, which accesses the
403 physical device directly, also runs a {\em backend} driver, serving
404 requests to that device from guests. Each guest will use a simple
405 {\em frontend} driver to access the backend. Communication between these
406 domains is composed of two parts: First, data is placed onto a shared
407 memory page between the domains. Second, an event channel between the
408 two domains is used to pass notification that data is outstanding.
409 This separation of notification from data transfer allows message
410 batching, and results in very efficient device access.
412 Event channels are used extensively in device virtualization; each
413 domain has a number of end-points or \emph{ports} each of which
414 may be bound to one of the following \emph{event sources}:
415 \begin{itemize}
416 \item a physical interrupt from a real device,
417 \item a virtual interrupt (callback) from Xen, or
418 \item a signal from another domain
419 \end{itemize}
421 Events are lightweight and do not carry much information beyond
422 the source of the notification. Hence when performing bulk data
423 transfer, events are typically used as synchronization primitives
424 over a shared memory transport. Event channels are managed via
425 the {\tt event\_channel\_op()} hypercall; for more details see
426 Section~\ref{s:idc}.
428 This chapter focuses on some individual device interfaces
429 available to Xen guests.
431 \section{Network I/O}
433 Virtual network device services are provided by shared memory
434 communication with a backend domain. From the point of view of
435 other domains, the backend may be viewed as a virtual ethernet switch
436 element with each domain having one or more virtual network interfaces
437 connected to it.
439 \subsection{Backend Packet Handling}
441 The backend driver is responsible for a variety of actions relating to
442 the transmission and reception of packets from the physical device.
443 With regard to transmission, the backend performs these key actions:
445 \begin{itemize}
446 \item {\bf Validation:} To ensure that domains do not attempt to
447 generate invalid (e.g. spoofed) traffic, the backend driver may
448 validate headers ensuring that source MAC and IP addresses match the
449 interface that they have been sent from.
451 Validation functions can be configured using standard firewall rules
452 ({\small{\tt iptables}} in the case of Linux).
454 \item {\bf Scheduling:} Since a number of domains can share a single
455 physical network interface, the backend must mediate access when
456 several domains each have packets queued for transmission. This
457 general scheduling function subsumes basic shaping or rate-limiting
458 schemes.
460 \item {\bf Logging and Accounting:} The backend domain can be
461 configured with classifier rules that control how packets are
462 accounted or logged. For example, log messages might be generated
463 whenever a domain attempts to send a TCP packet containing a SYN.
464 \end{itemize}
466 On receipt of incoming packets, the backend acts as a simple
467 demultiplexer: Packets are passed to the appropriate virtual
468 interface after any necessary logging and accounting have been carried
469 out.
471 \subsection{Data Transfer}
473 Each virtual interface uses two ``descriptor rings'', one for transmit,
474 the other for receive. Each descriptor identifies a block of contiguous
475 physical memory allocated to the domain.
477 The transmit ring carries packets to transmit from the guest to the
478 backend domain. The return path of the transmit ring carries messages
479 indicating that the contents have been physically transmitted and the
480 backend no longer requires the associated pages of memory.
482 To receive packets, the guest places descriptors of unused pages on
483 the receive ring. The backend will return received packets by
484 exchanging these pages in the domain's memory with new pages
485 containing the received data, and passing back descriptors regarding
486 the new packets on the ring. This zero-copy approach allows the
487 backend to maintain a pool of free pages to receive packets into, and
488 then deliver them to appropriate domains after examining their
489 headers.
491 %
492 %Real physical addresses are used throughout, with the domain performing
493 %translation from pseudo-physical addresses if that is necessary.
495 If a domain does not keep its receive ring stocked with empty buffers then
496 packets destined to it may be dropped. This provides some defence against
497 receive livelock problems because an overloaded domain will cease to receive
498 further data. Similarly, on the transmit path, it provides the application
499 with feedback on the rate at which packets are able to leave the system.
502 Flow control on rings is achieved by including a pair of producer
503 indexes on the shared ring page. Each side will maintain a private
504 consumer index indicating the next outstanding message. In this
505 manner, the domains cooperate to divide the ring into two message
506 lists, one in each direction. Notification is decoupled from the
507 immediate placement of new messages on the ring; the event channel
508 will be used to generate notification when {\em either} a certain
509 number of outstanding messages are queued, {\em or} a specified number
510 of nanoseconds have elapsed since the oldest message was placed on the
511 ring.
513 % Not sure if my version is any better -- here is what was here before:
514 %% Synchronization between the backend domain and the guest is achieved using
515 %% counters held in shared memory that is accessible to both. Each ring has
516 %% associated producer and consumer indices indicating the area in the ring
517 %% that holds descriptors that contain data. After receiving {\it n} packets
518 %% or {\t nanoseconds} after receiving the first packet, the hypervisor sends
519 %% an event to the domain.
521 \section{Block I/O}
523 All guest OS disk access goes through the virtual block device VBD
524 interface. This interface allows domains access to portions of block
525 storage devices visible to the block backend device. The VBD
526 interface is a split driver, similar to the network interface
527 described above. A single shared memory ring is used between the
528 frontend and backend drivers, across which read and write messages are
529 sent.
531 Any block device accessible to the backend domain, including
532 network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD devices,
533 can be exported as a VBD. Each VBD is mapped to a device node in the
534 guest, specified in the guest's startup configuration.
536 Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
537 similar functionality can be achieved using the more complete LVM
538 system, which is already in widespread use.
540 \subsection{Data Transfer}
542 The single ring between the guest and the block backend supports three
543 messages:
545 \begin{description}
546 \item [{\small {\tt PROBE}}:] Return a list of the VBDs available to this guest
547 from the backend. The request includes a descriptor of a free page
548 into which the reply will be written by the backend.
550 \item [{\small {\tt READ}}:] Read data from the specified block device. The
551 front end identifies the device and location to read from and
552 attaches pages for the data to be copied to (typically via DMA from
553 the device). The backend acknowledges completed read requests as
554 they finish.
556 \item [{\small {\tt WRITE}}:] Write data to the specified block device. This
557 functions essentially as {\small {\tt READ}}, except that the data moves to
558 the device instead of from it.
559 \end{description}
561 % um... some old text
562 %% In overview, the same style of descriptor-ring that is used for
563 %% network packets is used here. Each domain has one ring that carries
564 %% operation requests to the hypervisor and carries the results back
565 %% again.
567 %% Rather than copying data, the backend simply maps the domain's buffers
568 %% in order to enable direct DMA to them. The act of mapping the buffers
569 %% also increases the reference counts of the underlying pages, so that
570 %% the unprivileged domain cannot try to return them to the hypervisor,
571 %% install them as page tables, or any other unsafe behaviour.
572 %% %block API here
575 \chapter{Further Information}
578 If you have questions that are not answered by this manual, the
579 sources of information listed below may be of interest to you. Note
580 that bug reports, suggestions and contributions related to the
581 software (or the documentation) should be sent to the Xen developers'
582 mailing list (address below).
584 \section{Other documentation}
586 If you are mainly interested in using (rather than developing for)
587 Xen, the {\em Xen Users' Manual} is distributed in the {\tt docs/}
588 directory of the Xen source distribution.
590 % Various HOWTOs are also available in {\tt docs/HOWTOS}.
592 \section{Online references}
594 The official Xen web site is found at:
595 \begin{quote}
596 {\tt http://www.cl.cam.ac.uk/Research/SRG/netos/xen/}
597 \end{quote}
599 This contains links to the latest versions of all on-line
600 documentation.
602 \section{Mailing lists}
604 There are currently four official Xen mailing lists:
606 \begin{description}
607 \item[xen-devel@lists.xensource.com] Used for development
608 discussions and bug reports. Subscribe at: \\
609 {\small {\tt http://lists.xensource.com/xen-devel}}
610 \item[xen-users@lists.xensource.com] Used for installation and usage
611 discussions and requests for help. Subscribe at: \\
612 {\small {\tt http://lists.xensource.com/xen-users}}
613 \item[xen-announce@lists.xensource.com] Used for announcements only.
614 Subscribe at: \\
615 {\small {\tt http://lists.xensource.com/xen-announce}}
616 \item[xen-changelog@lists.xensource.com] Changelog feed
617 from the unstable and 2.0 trees --- developer oriented. Subscribe at: \\
618 {\small {\tt http://lists.xensource.com/xen-changelog}}
619 \end{description}
621 Of these, xen-devel is the most active.
626 \appendix
628 %\newcommand{\hypercall}[1]{\vspace{5mm}{\large\sf #1}}
634 \newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}
641 \chapter{Xen Hypercalls}
642 \label{a:hypercalls}
644 Hypercalls represent the procedural interface to Xen; this appendix
645 categorizes and describes the current set of hypercalls.
647 \section{Invoking Hypercalls}
649 Hypercalls are invoked in a manner analogous to system calls in a
650 conventional operating system; a software interrupt is issued which
651 vectors to an entry point within Xen. On x86\_32 machines the
652 instruction required is {\tt int \$0x82}; the (real) IDT is set up so
653 that this may only be issued from within ring 1. The particular
654 hypercall to be invoked is contained in {\tt EAX} --- a list
655 mapping these values to symbolic hypercall names can be found
656 in {\tt xen/include/public/xen.h}.
658 On some occasions a set of hypercalls will be required to carry
659 out a higher-level function; a good example is when a guest
660 operating system wishes to context switch to a new process, which
661 requires updating various privileged CPU state. As an optimization
662 for these cases, there is a generic mechanism to issue a set of
663 hypercalls as a batch:
665 \begin{quote}
666 \hypercall{multicall(void *call\_list, int nr\_calls)}
668 Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
669 the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
670 call\_list}. Each entry contains the hypercall operation code followed
671 by up to 7 word-sized arguments.
672 \end{quote}
674 Note that multicalls are provided purely as an optimization; there is
675 no requirement to use them when first porting a guest operating
676 system.
679 \section{Virtual CPU Setup}
681 At start of day, a guest operating system needs to set up the virtual
682 CPU it is executing on. This includes installing vectors for the
683 virtual IDT so that the guest OS can handle interrupts, page faults,
684 etc. However, the very first thing a guest OS must set up is a pair
685 of hypervisor callbacks: these are the entry points which Xen will
686 use when it wishes to notify the guest OS of an occurrence.
688 \begin{quote}
689 \hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
690 event\_address, unsigned long failsafe\_selector, unsigned long
691 failsafe\_address) }
693 Register the normal (``event'') and failsafe callbacks for
694 event processing. In each case the code segment selector and
695 address within that segment are provided. The selectors must
696 have RPL 1; in XenLinux we simply use the kernel's CS for both
697 {\tt event\_selector} and {\tt failsafe\_selector}.
699 The value {\tt event\_address} specifies the address of the guest OS's
700 event handling and dispatch routine; the {\tt failsafe\_address}
701 specifies a separate entry point which is used only if a fault occurs
702 when Xen attempts to use the normal callback.
703 \end{quote}
706 After installing the hypervisor callbacks, the guest OS can
707 install a `virtual IDT' by using the following hypercall:
709 \begin{quote}
710 \hypercall{set\_trap\_table(trap\_info\_t *table)}
712 Install one or more entries into the per-domain
713 trap handler table (essentially a software version of the IDT).
714 Each entry in the array pointed to by {\tt table} includes the
715 exception vector number with the corresponding segment selector
716 and entry point. Most guest OSes can use the same handlers on
717 Xen as when running on the real hardware; an exception is the
718 page fault handler (exception vector 14) where a modified
719 stack-frame layout is used.
722 \end{quote}
724 Finally, as an optimization it is possible for each guest OS
725 to install one ``fast trap'': this is a trap gate which will
726 allow direct transfer of control from ring 3 into ring 1 without
727 indirecting via Xen. In most cases this is suitable for use by
728 the guest OS system call mechanism, although it may be used for
729 any purpose.
732 \begin{quote}
733 \hypercall{set\_fast\_trap(int idx)}
735 Install the handler for exception vector {\tt idx} as the ``fast
736 trap'' for this domain. Note that this installs the current handler
737 (i.e. that which has been installed most recently via a call
738 to {\tt set\_trap\_table()}).
740 \end{quote}
744 \section{Scheduling and Timer}
746 Domains are preemptively scheduled by Xen according to the
747 parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
748 In addition, however, a domain may choose to explicitly
749 control certain behavior with the following hypercall:
751 \begin{quote}
752 \hypercall{sched\_op(unsigned long op)}
754 Request scheduling operation from hypervisor. The options are: {\it
755 yield}, {\it block}, and {\it shutdown}. {\it yield} keeps the
756 calling domain runnable but may cause a reschedule if other domains
757 are runnable. {\it block} removes the calling domain from the run
758 queue and causes it to sleep until an event is delivered to it. {\it
759 shutdown} is used to end the domain's execution; the caller can
760 additionally specify whether the domain should reboot, halt or
761 suspend.
762 \end{quote}
764 To aid the implementation of a process scheduler within a guest OS,
765 Xen provides a virtual programmable timer:
767 \begin{quote}
768 \hypercall{set\_timer\_op(uint64\_t timeout)}
770 Request a timer event to be sent at the specified system time (time
771 in nanoseconds since system boot). The hypercall actually passes the
772 64-bit timeout value as a pair of 32-bit values.
774 \end{quote}
776 Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op}
777 allows block-with-timeout semantics.
780 \section{Page Table Management}
782 Since guest operating systems have read-only access to their page
783 tables, Xen must be involved when making any changes. The following
784 multi-purpose hypercall can be used to modify page-table entries,
785 update the machine-to-physical mapping table, flush the TLB, install
786 a new page-table base pointer, and more.
788 \begin{quote}
789 \hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}
791 Update the page table for the domain; a set of {\tt count} updates are
792 submitted for processing in a batch, with {\tt success\_count} being
793 updated to report the number of successful updates.
795 Each element of {\tt req[]} contains a pointer (address) and value;
796 the least significant 2-bits of the pointer are used to distinguish
797 the type of update requested as follows:
798 \begin{description}
800 \item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
801 page table entry to the associated value; Xen will check that the
802 update is safe, as described in Chapter~\ref{c:memory}.
804 \item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the
805 machine-to-physical table. The calling domain must own the machine
806 page in question (or be privileged).
808 \item[\it MMU\_EXTENDED\_COMMAND:] perform additional MMU operations.
809 The set of additional MMU operations is considerable, and includes
810 updating {\tt cr3} (or just re-installing it for a TLB flush),
811 flushing the cache, installing a new LDT, or pinning \& unpinning
812 page-table pages (to ensure their reference count doesn't drop to zero
813 which would require a revalidation of all entries).
815 Further extended commands are used to deal with granting and
816 acquiring page ownership; see Section~\ref{s:idc}.
819 \end{description}
821 More details on the precise format of all commands can be
822 found in {\tt xen/include/public/xen.h}.
825 \end{quote}
827 Explicitly updating batches of page table entries is extremely
828 efficient, but can require a number of alterations to the guest
829 OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
830 recommended for new OS ports.
832 Regardless of which page table update mode is being used, however,
833 there are some occasions (notably handling a demand page fault) where
834 a guest OS will wish to modify exactly one PTE rather than a
835 batch. This is catered for by the following:
837 \begin{quote}
838 \hypercall{update\_va\_mapping(unsigned long page\_nr, unsigned long
839 val, \\ unsigned long flags)}
841 Update the currently installed PTE for the page {\tt page\_nr} to
842 {\tt val}. As with {\tt mmu\_update()}, Xen checks the modification
843 is safe before applying it. The {\tt flags} determine which kind
844 of TLB flush, if any, should follow the update.
846 \end{quote}
848 Finally, sufficiently privileged domains may occasionally wish to manipulate
849 the pages of others:
850 \begin{quote}
852 \hypercall{update\_va\_mapping\_otherdomain(unsigned long page\_nr,
853 unsigned long val, unsigned long flags, uint16\_t domid)}
855 Identical to {\tt update\_va\_mapping()} save that the pages being
856 mapped must belong to the domain {\tt domid}.
858 \end{quote}
860 This privileged operation is currently used by backend virtual device
861 drivers to safely map pages containing I/O data.
865 \section{Segmentation Support}
867 Xen allows guest OSes to install a custom GDT if they require it;
868 this is context switched transparently whenever a domain is
869 [de]scheduled. The following hypercall is effectively a
870 `safe' version of {\tt lgdt}:
872 \begin{quote}
873 \hypercall{set\_gdt(unsigned long *frame\_list, int entries)}
875 Install a global descriptor table for a domain; {\tt frame\_list} is
876 an array of up to 16 machine page frames within which the GDT resides,
877 with {\tt entries} being the actual number of descriptor-entry
878 slots. All page frames must be mapped read-only within the guest's
879 address space, and the table must be large enough to contain Xen's
880 reserved entries (see {\tt xen/include/public/arch-x86\_32.h}).
882 \end{quote}
884 Many guest OSes will also wish to install LDTs; this is achieved by
885 using {\tt mmu\_update()} with an extended command, passing the
886 linear address of the LDT base along with the number of entries. No
887 special safety checks are required; Xen needs to perform this task
888 simply because {\tt lldt} requires CPL 0.
891 Xen also allows guest operating systems to update just an
892 individual segment descriptor in the GDT or LDT:
894 \begin{quote}
895 \hypercall{update\_descriptor(unsigned long ma, unsigned long word1,
896 unsigned long word2)}
898 Update the GDT/LDT entry at machine address {\tt ma}; the new
899 8-byte descriptor is stored in {\tt word1} and {\tt word2}.
900 Xen performs a number of checks to ensure the descriptor is
901 valid.
903 \end{quote}
905 Guest OSes can use the above in place of context switching entire
906 LDTs (or the GDT) when the number of changing descriptors is small.
908 \section{Context Switching}
910 When a guest OS wishes to context switch between two processes,
911 it can use the page table and segmentation hypercalls described
912 above to perform the bulk of the privileged work. In addition,
913 however, it will need to invoke Xen to switch the kernel (ring 1)
914 stack pointer:
916 \begin{quote}
917 \hypercall{stack\_switch(unsigned long ss, unsigned long esp)}
919 Request kernel stack switch from hypervisor; {\tt ss} is the new
920 stack segment, and {\tt esp} is the new stack pointer.
922 \end{quote}
924 A final useful hypercall for context switching allows ``lazy''
925 save and restore of floating point state:
927 \begin{quote}
928 \hypercall{fpu\_taskswitch(void)}
930 This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
931 control register; this means that the next attempt to use floating
932 point will cause a fault which the guest OS can catch. Typically it will
933 then save/restore the FP state, and clear the {\tt TS} bit.
934 \end{quote}
936 This is provided as an optimization only; guest OSes can also choose
937 to save and restore FP state on all context switches for simplicity.
940 \section{Physical Memory Management}
942 As mentioned previously, each domain has a maximum and current
943 memory allocation. The maximum allocation, set at domain creation
944 time, cannot be modified. However a domain can choose to reduce
945 and subsequently grow its current allocation by using the
946 following call:
948 \begin{quote}
949 \hypercall{dom\_mem\_op(unsigned int op, unsigned long *extent\_list,
950 unsigned long nr\_extents, unsigned int extent\_order)}
952 Increase or decrease current memory allocation (as determined by
953 the value of {\tt op}). Each invocation provides a list of
954 extents each of which is $2^s$ pages in size,
955 where $s$ is the value of {\tt extent\_order}.
957 \end{quote}
959 In addition to simply reducing or increasing the current memory
960 allocation via a `balloon driver', this call is also useful for
961 obtaining contiguous regions of machine memory when required (e.g.
962 for certain PCI devices, or if using superpages).
965 \section{Inter-Domain Communication}
966 \label{s:idc}
968 Xen provides a simple asynchronous notification mechanism via
969 \emph{event channels}. Each domain has a set of end-points (or
970 \emph{ports}) which may be bound to an event source (e.g. a physical
971 IRQ, a virtual IRQ, or a port in another domain). When a pair of
972 end-points in two different domains are bound together, then a `send'
973 operation on one will cause an event to be received by the destination
974 domain.
976 The control and use of event channels involves the following hypercall:
978 \begin{quote}
979 \hypercall{event\_channel\_op(evtchn\_op\_t *op)}
981 Inter-domain event-channel management; {\tt op} is a discriminated
982 union which allows the following 7 operations:
984 \begin{description}
986 \item[\it alloc\_unbound:] allocate a free (unbound) local
987 port and prepare for connection from a specified domain.
988 \item[\it bind\_virq:] bind a local port to a virtual
989 IRQ; any particular VIRQ can be bound to at most one port per domain.
990 \item[\it bind\_pirq:] bind a local port to a physical IRQ;
991 once more, a given pIRQ can be bound to at most one port per
992 domain. Furthermore the calling domain must be sufficiently
993 privileged.
994 \item[\it bind\_interdomain:] construct an interdomain event
995 channel; in general, the target domain must have previously allocated
996 an unbound port for this channel, although this can be bypassed by
997 privileged domains during domain setup.
998 \item[\it close:] close an interdomain event channel.
999 \item[\it send:] send an event to the remote end of an
1000 interdomain event channel.
1001 \item[\it status:] determine the current status of a local port.
1002 \end{description}
1004 For more details see
1005 {\tt xen/include/public/event\_channel.h}.
1007 \end{quote}
1009 Event channels are the fundamental communication primitive between
1010 Xen domains and seamlessly support SMP. However they provide little
1011 bandwidth for communication {\sl per se}, and hence are typically
1012 married with a piece of shared memory to produce effective and
1013 high-performance inter-domain communication.
1015 Safe sharing of memory pages between guest OSes is carried out by
1016 granting access on a per page basis to individual domains. This is
1017 achieved by using the {\tt grant\_table\_op()} hypercall.
1019 \begin{quote}
1020 \hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
1022 Grant or remove access to a particular page to a particular domain.
1024 \end{quote}
1026 This is not currently widely in use by guest operating systems, but
1027 we intend to integrate support more fully in the near future.
1029 \section{PCI Configuration}
1031 Domains with physical device access (i.e.\ driver domains) receive
1032 limited access to certain PCI devices (bus address space and
1033 interrupts). However, many guest operating systems attempt to
1034 determine the PCI configuration by directly accessing the PCI BIOS,
1035 which cannot be allowed for safety reasons.
1037 Instead, Xen provides the following hypercall:
1039 \begin{quote}
1040 \hypercall{physdev\_op(void *physdev\_op)}
1042 Perform a PCI configuration operation; depending on the value
1043 of {\tt physdev\_op} this can be a PCI config read, a PCI config
1044 write, or a small number of other queries.
1046 \end{quote}
1049 For examples of using {\tt physdev\_op()}, see the
1050 Xen-specific PCI code in the Linux sparse tree.
1052 \section{Administrative Operations}
1053 \label{s:dom0ops}
1055 A large number of control operations are available to a sufficiently
1056 privileged domain (typically domain 0). These allow the creation and
1057 management of new domains, for example. A complete list is given
1058 below; for more details on any or all of these, please see
1059 {\tt xen/include/public/dom0\_ops.h}.
1062 \begin{quote}
1063 \hypercall{dom0\_op(dom0\_op\_t *op)}
1065 Administrative domain operations for domain management. The options are:
1067 \begin{description}
1068 \item [\it DOM0\_CREATEDOMAIN:] create a new domain
1070 \item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
1071 queue.
1073 \item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
1074 once again.
1076 \item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated
1077 with a domain
1079 \item [\it DOM0\_GETMEMLIST:] get list of pages used by the domain
1081 \item [\it DOM0\_SCHEDCTL:]
1083 \item [\it DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain
1085 \item [\it DOM0\_BUILDDOMAIN:] do final guest OS setup for domain
1087 \item [\it DOM0\_GETDOMAINFO:] get statistics about the domain
1089 \item [\it DOM0\_GETPAGEFRAMEINFO:]
1091 \item [\it DOM0\_GETPAGEFRAMEINFO2:]
1093 \item [\it DOM0\_IOPL:] set I/O privilege level
1095 \item [\it DOM0\_MSR:] read or write model specific registers
1097 \item [\it DOM0\_DEBUG:] interactively invoke the debugger
1099 \item [\it DOM0\_SETTIME:] set system time
1101 \item [\it DOM0\_READCONSOLE:] read console content from hypervisor buffer ring
1103 \item [\it DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU
1105 \item [\it DOM0\_GETTBUFS:] get information about the size and location of
1106 the trace buffers (only on trace-buffer enabled builds)
1108 \item [\it DOM0\_PHYSINFO:] get information about the host machine
1110 \item [\it DOM0\_PCIDEV\_ACCESS:] modify PCI device access permissions
1112 \item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler
1114 \item [\it DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes
1116 \item [\it DOM0\_SETDOMAININITIALMEM:] set initial memory allocation of a domain
1118 \item [\it DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain
1120 \item [\it DOM0\_SETDOMAINVMASSIST:] set domain VM assist options
1121 \end{description}
1122 \end{quote}
1124 Most of the above are best understood by looking at the code
1125 implementing them (in {\tt xen/common/dom0\_ops.c}) and in
1126 the user-space tools that use them (mostly in {\tt tools/libxc}).
1128 \section{Debugging Hypercalls}
1130 A few additional hypercalls are mainly useful for debugging:
1132 \begin{quote}
1133 \hypercall{console\_io(int cmd, int count, char *str)}
1135 Use Xen to interact with the console; operations are:
1137 {\it CONSOLEIO\_write}: Output {\tt count} characters from buffer {\tt str}.
1139 {\it CONSOLEIO\_read}: Input at most {\tt count} characters into buffer {\tt str}.
1140 \end{quote}
1142 A pair of hypercalls allows access to the underlying debug registers:
1143 \begin{quote}
1144 \hypercall{set\_debugreg(int reg, unsigned long value)}
1146 Set debug register {\tt reg} to {\tt value}
1148 \hypercall{get\_debugreg(int reg)}
1150 Return the contents of the debug register {\tt reg}
1151 \end{quote}
1153 And finally:
1154 \begin{quote}
1155 \hypercall{xen\_version(int cmd)}
1157 Request Xen version number.
1158 \end{quote}
1160 This is useful to ensure that user-space tools are in sync
1161 with the underlying hypervisor.
1163 \section{Deprecated Hypercalls}
1165 Xen is under constant development and refinement; as such there
1166 are plans to improve the way in which various pieces of functionality
1167 are exposed to guest OSes.
1169 \begin{quote}
1170 \hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
1172 Toggle various memory management modes (in particular writable page
1173 tables and superpage support).
1175 \end{quote}
1177 This is likely to be replaced with mode values in the shared
1178 information page since this is more resilient for resumption
1179 after migration or checkpoint.
1188 %%
1189 %% XXX SMH: not really sure how useful below is -- if it's still
1190 %% actually true, might be useful for someone wanting to write a
1191 %% new scheduler... not clear how many of them there are...
1192 %%
1194 \begin{comment}
1196 \chapter{Scheduling API}
1198 The scheduling API is used by both the schedulers described above and should
1199 also be used by any new schedulers. It provides a generic interface and also
1200 implements much of the ``boilerplate'' code.
1202 Schedulers conforming to this API are described by the following
1203 structure:
1205 \begin{verbatim}
1206 struct scheduler
1208 char *name; /* full name for this scheduler */
1209 char *opt_name; /* option name for this scheduler */
1210 unsigned int sched_id; /* ID for this scheduler */
1212 int (*init_scheduler) ();
1213 int (*alloc_task) (struct task_struct *);
1214 void (*add_task) (struct task_struct *);
1215 void (*free_task) (struct task_struct *);
1216 void (*rem_task) (struct task_struct *);
1217 void (*wake_up) (struct task_struct *);
1218 void (*do_block) (struct task_struct *);
1219 task_slice_t (*do_schedule) (s_time_t);
1220 int (*control) (struct sched_ctl_cmd *);
1221 int (*adjdom) (struct task_struct *,
1222 struct sched_adjdom_cmd *);
1223 s32 (*reschedule) (struct task_struct *);
1224 void (*dump_settings) (void);
1225 void (*dump_cpu_state) (int);
1226 void (*dump_runq_el) (struct task_struct *);
1227 };
1228 \end{verbatim}
1230 The only method that {\em must} be implemented is
1231 {\tt do\_schedule()}. However, if there is no implementation of the
1232 {\tt wake\_up()} method then waking tasks will not be put on the runqueue!
1234 The fields of the above structure are described in more detail below.
1236 \subsubsection{name}
1238 The name field should point to a descriptive ASCII string.
1240 \subsubsection{opt\_name}
1242 This field is the value of the {\tt sched=} boot-time option that will select
1243 this scheduler.
1245 \subsubsection{sched\_id}
1247 This is an integer that uniquely identifies this scheduler. There should be a
1248 macro corresponding to this scheduler ID in {\tt <xen/sched-if.h>}.
1250 \subsubsection{init\_scheduler}
1252 \paragraph*{Purpose}
1254 This is a function for performing any scheduler-specific initialisation. For
1255 instance, it might allocate memory for per-CPU scheduler data and initialise it
1256 appropriately.
1258 \paragraph*{Call environment}
1260 This function is called after the initialisation performed by the generic
1261 layer. The function is called exactly once, for the scheduler that has been
1262 selected.
1264 \paragraph*{Return values}
1266 This should return negative on failure --- this will cause an
1267 immediate panic and the system will fail to boot.
1269 \subsubsection{alloc\_task}
1271 \paragraph*{Purpose}
1272 Called when a {\tt task\_struct} is allocated by the generic scheduler
1273 layer. A particular scheduler implementation may use this method to
1274 allocate per-task data for this task. It may use the {\tt
1275 sched\_priv} pointer in the {\tt task\_struct} to point to this data.
1277 \paragraph*{Call environment}
1278 The generic layer guarantees that the {\tt sched\_priv} field will
1279 remain intact from the time this method is called until the task is
1280 deallocated (so long as the scheduler implementation does not change
1281 it explicitly!).
1283 \paragraph*{Return values}
1284 Negative on failure.
1286 \subsubsection{add\_task}
1288 \paragraph*{Purpose}
1290 Called when a task is initially added by the generic layer.
1292 \paragraph*{Call environment}
1294 The fields in the {\tt task\_struct} are now filled out and available for use.
1295 Schedulers should implement appropriate initialisation of any per-task private
1296 information in this method.
1298 \subsubsection{free\_task}
1300 \paragraph*{Purpose}
1302 Schedulers should free the space used by any associated private data
1303 structures.
1305 \paragraph*{Call environment}
1307 This is called when a {\tt task\_struct} is about to be deallocated.
1308 The generic layer will have done generic task removal operations and
1309 (if implemented) called the scheduler's {\tt rem\_task} method before
1310 this method is called.
1312 \subsubsection{rem\_task}
1314 \paragraph*{Purpose}
1316 This is called when a task is being removed from scheduling (but is
1317 not yet being freed).
1319 \subsubsection{wake\_up}
1321 \paragraph*{Purpose}
1323 Called when a task is woken up, this method should put the task on the runqueue
1324 (or do the scheduler-specific equivalent action).
1326 \paragraph*{Call environment}
1328 The task is already set to state RUNNING.
1330 \subsubsection{do\_block}
1332 \paragraph*{Purpose}
1334 This function is called when a task is blocked. This function should
1335 not remove the task from the runqueue.
1337 \paragraph*{Call environment}
1339 The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to
1340 TASK\_INTERRUPTIBLE on entry to this method. A call to the {\tt
1341 do\_schedule} method will be made after this method returns, in
1342 order to select the next task to run.
1344 \subsubsection{do\_schedule}
1346 This method must be implemented.
1348 \paragraph*{Purpose}
1350 The method is called each time a new task must be chosen for scheduling on the
1351 current CPU. The current time is passed as the single argument (the current
1352 task can be found using the {\tt current} macro).
1354 This method should select the next task to run on this CPU and set its minimum
1355 time to run, as well as returning the data described below.
1357 This method should also take the appropriate action if the previous
1358 task has blocked, e.g. removing it from the runqueue.
1360 \paragraph*{Call environment}
1362 The other fields in the {\tt task\_struct} are updated by the generic layer,
1363 which also performs all Xen-specific tasks and performs the actual task switch
1364 (unless the previous task has been chosen again).
1366 This method is called with the {\tt schedule\_lock} held for the current CPU
1367 and local interrupts disabled.
1369 \paragraph*{Return values}
1371 Must return a {\tt struct task\_slice} describing what task to run and how long
1372 for (at maximum).
1374 \subsubsection{control}
1376 \paragraph*{Purpose}
1378 This method is called for global scheduler control operations. It takes a
1379 pointer to a {\tt struct sched\_ctl\_cmd}, which it should either
1380 source data from or populate with data, depending on the value of the
1381 {\tt direction} field.
1383 \paragraph*{Call environment}
1385 The generic layer guarantees that when this method is called, the
1386 caller selected the correct scheduler ID, hence the scheduler's
1387 implementation does not need to sanity-check these parts of the call.
1389 \paragraph*{Return values}
1391 This function should return the value to be passed back to user space, hence it
1392 should either be 0 or an appropriate errno value.
1394 \subsubsection{sched\_adjdom}
1396 \paragraph*{Purpose}
1398 This method is called to adjust the scheduling parameters of a particular
1399 domain, or to query their current values. The function should check
1400 the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in
1401 order to determine which of these operations is being performed.
1403 \paragraph*{Call environment}
1405 The generic layer guarantees that the caller has specified the correct
1406 control interface version and scheduler ID and that the supplied {\tt
1407 task\_struct} will not be deallocated during the call (hence it is not
1408 necessary to {\tt get\_task\_struct}).
1410 \paragraph*{Return values}
1412 This function should return the value to be passed back to user space, hence it
1413 should either be 0 or an appropriate errno value.
1415 \subsubsection{reschedule}
1417 \paragraph*{Purpose}
1419 This method is called to determine if a reschedule is required as a result of a
1420 particular task.
1422 \paragraph*{Call environment}
1423 The generic layer will cause a reschedule if the current domain is the idle
1424 task or it has exceeded its minimum time slice before a reschedule. The
1425 generic layer guarantees that the task passed is not currently running but is
1426 on the runqueue.
1428 \paragraph*{Return values}
1430 Should return a mask of CPUs to cause a reschedule on.
1432 \subsubsection{dump\_settings}
1434 \paragraph*{Purpose}
1436 If implemented, this should dump any private global settings for this
1437 scheduler to the console.
1439 \paragraph*{Call environment}
1441 This function is called with interrupts enabled.
1443 \subsubsection{dump\_cpu\_state}
1445 \paragraph*{Purpose}
1447 This method should dump any private settings for the specified CPU.
1449 \paragraph*{Call environment}
1451 This function is called with interrupts disabled and the {\tt schedule\_lock}
1452 for the specified CPU held.
1454 \subsubsection{dump\_runq\_el}
1456 \paragraph*{Purpose}
1458 This method should dump any private settings for the specified task.
1460 \paragraph*{Call environment}
1462 This function is called with interrupts disabled and the {\tt schedule\_lock}
1463 for the task's CPU held.
1465 \end{comment}
1470 %%
1471 %% XXX SMH: we probably should have something in here on debugging
1472 %% etc; this is a kinda developers manual and many devs seem to
1473 %% like debugging support :^)
1474 %% Possibly sanitize below, else wait until new xendbg stuff is in
1475 %% (and/or kip's stuff?) and write about that instead?
1476 %%
1478 \begin{comment}
1480 \chapter{Debugging}
1482 Xen provides tools for debugging both Xen and guest OSes. Currently, the
1483 Pervasive Debugger provides a GDB stub, which provides facilities for symbolic
1484 debugging of Xen itself and of OS kernels running on top of Xen. The Trace
1485 Buffer provides a lightweight means to log data about Xen's internal state and
1486 behaviour at runtime, for later analysis.
1488 \section{Pervasive Debugger}
1490 Information on using the pervasive debugger is available in pdb.txt.
1493 \section{Trace Buffer}
1495 The trace buffer provides a means to observe Xen's operation from domain 0.
1496 Trace events, inserted at key points in Xen's code, record data that can be
1497 read by the {\tt xentrace} tool. Recording these events has a low overhead
1498 and hence the trace buffer may be useful for debugging timing-sensitive
1499 behaviours.
1501 \subsection{Internal API}
1503 To use the trace buffer functionality from within Xen, you must {\tt \#include
1504 <xen/trace.h>}, which contains definitions related to the trace buffer. Trace
1505 events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1,
1506 2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional
1507 (32-bit) data as their arguments. For trace buffer-enabled builds of Xen these
1508 will insert the event ID and data into the trace buffer, along with the current
1509 value of the CPU cycle-counter. For builds without the trace buffer enabled,
1510 the macros expand to no-ops and thus can be left in place without incurring
1511 overheads.
1513 \subsection{Trace-enabled builds}
1515 By default, the trace buffer is enabled only in debug builds (i.e. {\tt NDEBUG}
1516 is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER},
1517 either in {\tt <xen/config.h>} or on the gcc command line.
1519 The size (in pages) of the per-CPU trace buffers can be specified using the
1520 {\tt tbuf\_size=n } boot parameter to Xen. If the size is set to 0, the trace
1521 buffers will be disabled.
1523 \subsection{Dumping trace data}
1525 When running a trace buffer build of Xen, trace data are written continuously
1526 into the buffer data areas, with newer data overwriting older data. This data
1527 can be captured using the {\tt xentrace} program in domain 0.
1529 The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace
1530 buffers into its address space. It then periodically polls all the buffers for
1531 new data, dumping out any new records from each buffer in turn. As a result,
1532 for machines with multiple (logical) CPUs, the trace buffer output will not be
1533 in overall chronological order.
1535 The output from {\tt xentrace} can be post-processed using {\tt
1536 xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and
1537 {\tt xentrace\_format} (used to pretty-print trace data). For the predefined
1538 trace points, there is an example format file in {\tt tools/xentrace/formats }.
1540 For more information, see the manual pages for {\tt xentrace}, {\tt
1541 xentrace\_format} and {\tt xentrace\_cpusplit}.
1543 \end{comment}
1548 \end{document}