% docs/src/interface.tex @ 3256:054fad40be91
% bitkeeper revision 1.1159.170.62 (41b658f7CVacvwHL6JLDDIqJsW_sSQ)
% Update to Linux 2.6.10-rc3.
% author: cl349@arcadians.cl.cam.ac.uk
% date: Wed Dec 08 01:29:27 2004 +0000
\documentclass[11pt,twoside,final,openright]{report}
\usepackage{a4,graphicx,html,setspace,times}
\usepackage{comment,parskip}
\setstretch{1.15}

\begin{document}

\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{figs/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v2.0 for x86} \\[80mm]
{\Large Xen is Copyright (c) 2002-2004, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
\end{tabular}
\end{center}

{\bf
DISCLAIMER: This documentation is currently under active development
and as such there may be mistakes and omissions --- watch out for
these and please report any you find to the developer's mailing list.
Contributions of material, suggestions and corrections are welcome.
}

\vfill
\cleardoublepage

\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
\tableofcontents }
\cleardoublepage

\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\parskip=5pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.1}
\chapter{Introduction}

Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different {\em guest}
operating system images to be run simultaneously. Virtualizing the
machine in this manner provides considerable flexibility, for example
allowing different users to choose their preferred operating system
(e.g., Linux, NetBSD, or a custom operating system). Furthermore, Xen
provides secure partitioning between virtual machines (known as
{\em domains} in Xen terminology), and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.
Xen essentially takes a `whole machine' virtualization approach as
pioneered by IBM VM/370. However, unlike VM/370 or more recent
efforts such as VMware and Virtual PC, Xen does not attempt to
completely virtualize the underlying hardware. Instead, parts of the
hosted guest operating systems are modified to work with the VMM; the
operating system is effectively ported to a new target architecture,
typically requiring changes in just the machine-dependent code. The
user-level API is unchanged, and so existing binaries and operating
system distributions work without modification.
In addition to exporting virtualized instances of CPU, memory, network
and block devices, Xen exposes a control interface to manage how these
resources are shared between the running domains. Access to the
control interface is restricted: it may only be used by one
specially-privileged VM, known as {\em domain 0}. This domain is a
required part of any Xen-based server and runs the application software
that manages the control-plane aspects of the platform. Running the
control software in {\it domain 0}, distinct from the hypervisor
itself, allows the Xen framework to separate the notions of
mechanism and policy within the system.
\chapter{Virtual Architecture}

On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It
has full access to the physical memory available in the system and is
responsible for allocating portions of it to the domains. Guest
operating systems run in and use {\it rings 1}, {\it 2} and {\it 3} as
they see fit. Segmentation is used to prevent the guest OS from
accessing the portion of the address space that is reserved for
Xen. We expect most guest operating systems will use ring 1 for their
own operation and place applications in ring 3.

In this chapter we consider the basic virtual architecture provided
by Xen: the basic CPU state, exception and interrupt handling, and
time. Other aspects such as memory and device access are discussed
in later chapters.
\section{CPU state}

All privileged state must be handled by Xen. The guest OS has no
direct access to CR3 and is not permitted to update privileged bits in
EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen;
these are analogous to system calls but occur from ring 1 to ring 0.

A list of all hypercalls is given in Appendix~\ref{a:hypercalls}.
\section{Exceptions}

A virtual IDT is provided --- a domain can submit a table of trap
handlers to Xen via the {\tt set\_trap\_table()} hypercall. Most trap
handlers are identical to native x86 handlers, although the page-fault
handler is somewhat different.
\section{Interrupts and events}

Interrupts are virtualized by mapping them to \emph{events}, which are
delivered asynchronously to the target domain using a callback
supplied via the {\tt set\_callbacks()} hypercall. A guest OS can map
these events onto its standard interrupt dispatch mechanisms. Xen is
responsible for determining the target domain that will handle each
physical interrupt source. For more details on the binding of event
sources to events, see Chapter~\ref{c:devices}.
\section{Time}

Guest operating systems need to be aware of the passage of both real
(or wallclock) time and their own `virtual time' (the time for
which they have been executing). Furthermore, Xen has a notion of
time which is used for scheduling. The following notions of
time are provided:

\begin{description}
\item[Cycle counter time.]

This provides a fine-grained time reference. The cycle counter time is
used to accurately extrapolate the other time references. On SMP machines
it is currently assumed that the cycle counter time is synchronized between
CPUs. The current x86-based implementation achieves this within inter-CPU
communication latencies.

\item[System time.]

This is a 64-bit counter which holds the number of nanoseconds that
have elapsed since system boot.

\item[Wall clock time.]

This is the time of day in a Unix-style {\tt struct timeval} (seconds
and microseconds since 1 January 1970, adjusted by leap seconds). An
NTP client hosted by {\it domain 0} can keep this value accurate.

\item[Domain virtual time.]

This progresses at the same pace as system time, but only while a
domain is executing --- it stops while a domain is de-scheduled.
Therefore the share of the CPU that a domain receives is indicated by
the rate at which its virtual time increases.

\end{description}
Xen exports timestamps for system time and wall-clock time to guest
operating systems through a shared page of memory. Xen also provides
the cycle counter time at the instant the timestamps were calculated,
and the CPU frequency in Hertz. This allows the guest to extrapolate
system and wall-clock times accurately based on the current cycle
counter time.
Since all timestamps need to be updated and read \emph{atomically},
two version numbers are also stored in the shared info page. The
first is incremented prior to an update, while the second is only
incremented afterwards. Thus a guest can be sure that it read a consistent
state by checking that the two version numbers are equal.
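This two-counter protocol can be sketched in C. The struct layout and
field names below are purely illustrative (they are not Xen's actual
shared-info definitions); the point is the reader's retry loop and the
writer's increment ordering:

```c
#include <stdint.h>

/* Hypothetical layout of the time fields in the shared info page. */
struct time_info {
    volatile uint32_t version_pre;   /* incremented before an update */
    volatile uint32_t version_post;  /* incremented after an update  */
    uint64_t system_time_ns;         /* nanoseconds since boot       */
};

/* Writer (Xen's role): bump the first counter, update the payload,
 * then bump the second; readers see pre != post mid-update. */
static void write_system_time(struct time_info *t, uint64_t ns)
{
    t->version_pre++;
    t->system_time_ns = ns;
    t->version_post++;
}

/* Reader (guest's role): read the post-update counter, then the
 * payload; retry unless the pre-update counter still matches,
 * i.e. no update began while we were reading. */
static uint64_t read_system_time(const struct time_info *t)
{
    uint32_t v;
    uint64_t ns;
    do {
        v  = t->version_post;
        ns = t->system_time_ns;
    } while (v != t->version_pre);
    return ns;
}
```

A real guest would additionally need memory barriers between the
counter and payload accesses; they are omitted here for clarity.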
Xen includes a periodic ticker which sends a timer event to the
currently executing domain every 10ms. The Xen scheduler also sends a
timer event whenever a domain is scheduled; this allows the guest OS
to adjust for the time that has passed while it has been inactive. In
addition, Xen allows each domain to request that it receive a timer
event sent at a specified system time by using the {\tt
set\_timer\_op()} hypercall. Guest OSes may use this timer to
implement timeout values when they block.
%% % akw: demoting this to a section -- not sure if there is any point
%% % though, maybe just remove it.

\section{Xen CPU Scheduling}

Xen offers a uniform API for CPU schedulers. It is possible to choose
from a number of schedulers at boot and it should be easy to add more.
The BVT, Atropos and Round Robin schedulers are part of the normal
Xen distribution. BVT provides proportional fair shares of the CPU to
the running domains. Atropos can be used to reserve absolute shares
of the CPU for each domain. Round-robin is provided as an example of
Xen's internal scheduler API.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems. Domains are statically assigned to
CPUs, either at creation time or when manually pinning to a particular CPU.
The current schedulers then run locally on each CPU to decide which of the
assigned domains should be run there. The user-level control software
can be used to perform coarse-grain load-balancing between CPUs.
%% More information on the characteristics and use of these schedulers is
%% available in {\tt Sched-HOWTO.txt}.

\section{Privileged operations}

Xen exports an extended interface to privileged domains (viz.\ {\it
Domain 0}). This allows such domains to build and boot other domains
on the server, and provides control interfaces for managing
scheduling, memory, networking, and block devices.
\chapter{Memory}
\label{c:memory}

Xen is responsible for managing the allocation of physical memory to
domains, and for ensuring safe use of the paging and segmentation
hardware.

\section{Memory Allocation}

Xen resides within a small fixed portion of physical memory; it also
reserves the top 64MB of every virtual address space. The remaining
physical memory is available for allocation to domains at a page
granularity. Xen tracks the ownership and use of each page, which
allows it to enforce secure partitioning between domains.

Each domain has a maximum and current physical memory allocation.
A guest OS may run a `balloon driver' to dynamically adjust its
current memory allocation up to its limit.
%% XXX SMH: I use machine and physical in the next section (which
%% is kinda required for consistency with code); wonder if this
%% section should use same terms?
%%
%% Probably.
%%
%% Merging this and below section at some point prob makes sense.

\section{Pseudo-Physical Memory}

Since physical memory is allocated and freed on a page granularity,
there is no guarantee that a domain will receive a contiguous stretch
of physical memory. However, most operating systems do not have good
support for operating in a fragmented physical address space. To aid
porting such operating systems to run on top of Xen, we make a
distinction between \emph{machine memory} and \emph{pseudo-physical
memory}.
Put simply, machine memory refers to the entire amount of memory
installed in the machine, including that reserved by Xen, in use by
various domains, or currently unallocated. We consider machine memory
to comprise a set of 4K \emph{machine page frames} numbered
consecutively starting from 0. Machine frame numbers mean the same
within Xen or any domain.

Pseudo-physical memory, on the other hand, is a per-domain
abstraction. It allows a guest operating system to consider its memory
allocation to consist of a contiguous range of physical page frames
starting at physical frame 0, despite the fact that the underlying
machine page frames may be sparsely allocated and in any order.

To achieve this, Xen maintains a globally readable {\it
machine-to-physical} table which records the mapping from machine page
frames to pseudo-physical ones. In addition, each domain is supplied
with a {\it physical-to-machine} table which performs the inverse
mapping. Clearly the machine-to-physical table has size proportional
to the amount of RAM installed in the machine, while each
physical-to-machine table has size proportional to the memory
allocation of the given domain.
Architecture dependent code in guest operating systems can then use
the two tables to provide the abstraction of pseudo-physical
memory. In general, only certain specialized parts of the operating
system (such as page table management) need to understand the
difference between machine and pseudo-physical addresses.
\section{Page Table Updates}

In the default mode of operation, Xen enforces read-only access to
page tables and requires guest operating systems to explicitly request
any modifications. Xen validates all such requests and only applies
updates that it deems safe. This is necessary to prevent domains from
adding arbitrary mappings to their page tables.
To aid validation, Xen associates a type and reference count with each
memory page. A page has one of the following
mutually-exclusive types at any point in time: page directory ({\sf
PD}), page table ({\sf PT}), local descriptor table ({\sf LDT}),
global descriptor table ({\sf GDT}), or writable ({\sf RW}). Note that
a guest OS may always create readable mappings of its own memory
regardless of its current type.
%%% XXX: possibly explain more about ref count 'lifecyle' here?
This mechanism is used to
maintain the invariants required for safety; for example, a domain
cannot have a writable mapping to any part of a page table as this
would require the page concerned to simultaneously be of types {\sf
PT} and {\sf RW}.
%\section{Writable Page Tables}

Xen also provides an alternative mode of operation in which guests
have the illusion that their page tables are directly writable. Of
course this is not really the case, since Xen must still validate
modifications to ensure secure partitioning. To this end, Xen traps
any write attempt to a memory page of type {\sf PT} (i.e., one that is
currently part of a page table). If such an access occurs, Xen
temporarily allows write access to that page while at the same time
{\em disconnecting} it from the page table that is currently in
use. This allows the guest to safely make updates to the page because
the newly-updated entries cannot be used by the MMU until Xen
revalidates and reconnects the page.
Reconnection occurs automatically in a number of situations: for
example, when the guest modifies a different page-table page, when the
domain is preempted, or whenever the guest uses Xen's explicit
page-table update interfaces.
Finally, Xen also supports a form of \emph{shadow page tables} in
which the guest OS uses an independent copy of page tables which are
unknown to the hardware (i.e.\ which are never pointed to by {\tt
cr3}). Instead Xen propagates changes made to the guest's tables to the
real ones, and vice versa. This is useful for logging page writes
(e.g.\ for live migration or checkpointing). A full version of the shadow
page tables also allows guest OS porting with less effort.
\section{Segment Descriptor Tables}

On boot a guest is supplied with a default GDT, which does not reside
within its own memory allocation. If the guest wishes to use segments
other than the default `flat' ring-1 and ring-3 segments that this GDT
provides, it must register a custom GDT and/or LDT with Xen,
allocated from its own memory. Note that a number of GDT
entries are reserved by Xen -- any custom GDT must also include
sufficient space for these entries.
For example, the following hypercall is used to specify a new GDT:

\begin{quote}
int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})

{\em frame\_list}: An array of up to 16 machine page frames within
which the GDT resides. Any frame registered as a GDT frame may only
be mapped read-only within the guest's address space (e.g., no
writable mappings, no use as a page-table page, and so on).

{\em entries}: The number of descriptor-entry slots in the GDT. Note
that the table must be large enough to contain Xen's reserved entries;
thus we must have `{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}\ '.
Note also that, after registering the GDT, slots {\em FIRST\_} through
{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and
may be overwritten by Xen.
\end{quote}
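A guest might sanity-check its parameters before issuing {\tt
set\_gdt()}. The sketch below assumes illustrative values for the
reserved-entry constant and descriptor size (the real constants live in
Xen's public headers, and the value of {\tt LAST\_RESERVED\_GDT\_ENTRY}
here is invented for the example):

```c
/* Assumed values, for illustration only. */
#define LAST_RESERVED_GDT_ENTRY 263
#define GDT_ENTRIES_PER_FRAME   512   /* 4096-byte frame / 8-byte descriptor */
#define MAX_GDT_FRAMES           16

/* Check the two constraints stated above: the table must cover Xen's
 * reserved slots, and must fit within 16 machine page frames. */
static int gdt_params_valid(int entries)
{
    int frames = (entries + GDT_ENTRIES_PER_FRAME - 1) / GDT_ENTRIES_PER_FRAME;
    return entries > LAST_RESERVED_GDT_ENTRY && frames <= MAX_GDT_FRAMES;
}
```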
The LDT is updated via the generic MMU update mechanism (i.e., via
the {\tt mmu\_update()} hypercall).
\section{Start of Day}

The start-of-day environment for guest operating systems is rather
different to that provided by the underlying hardware. In particular,
the processor is already executing in protected mode with paging
enabled.

{\it Domain 0} is created and booted by Xen itself. For all subsequent
domains, the analogue of the boot-loader is the {\it domain builder},
user-space software running in {\it domain 0}. The domain builder
is responsible for building the initial page tables for a domain
and loading its kernel image at the appropriate virtual address.
\chapter{Devices}
\label{c:devices}

Devices such as network and disk are exported to guests using a
split device driver. The device driver domain, which accesses the
physical device directly, also runs a {\em backend} driver, serving
requests to that device from guests. Each guest will use a simple
{\em frontend} driver to access the backend. Communication between these
domains is composed of two parts: First, data is placed onto a shared
memory page between the domains. Second, an event channel between the
two domains is used to pass notification that data is outstanding.
This separation of notification from data transfer allows message
batching, and results in very efficient device access.
Event channels are used extensively in device virtualization; each
domain has a number of end-points or \emph{ports} each of which
may be bound to one of the following \emph{event sources}:
\begin{itemize}
\item a physical interrupt from a real device,
\item a virtual interrupt (callback) from Xen, or
\item a signal from another domain.
\end{itemize}
Events are lightweight and do not carry much information beyond
the source of the notification. Hence when performing bulk data
transfer, events are typically used as synchronization primitives
over a shared memory transport. Event channels are managed via
the {\tt event\_channel\_op()} hypercall; for more details see
Section~\ref{s:idc}.

This chapter focuses on some individual device interfaces
available to Xen guests.
\section{Network I/O}

Virtual network device services are provided by shared memory
communication with a backend domain. From the point of view of
other domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.
\subsection{Backend Packet Handling}

The backend driver is responsible for a variety of actions relating to
the transmission and reception of packets from the physical device.
With regard to transmission, the backend performs these key actions:

\begin{itemize}
\item {\bf Validation:} To ensure that domains do not attempt to
generate invalid (e.g.\ spoofed) traffic, the backend driver may
validate headers, ensuring that source MAC and IP addresses match the
interface from which they have been sent.

Validation functions can be configured using standard firewall rules
({\small{\tt iptables}} in the case of Linux).

\item {\bf Scheduling:} Since a number of domains can share a single
physical network interface, the backend must mediate access when
several domains each have packets queued for transmission. This
general scheduling function subsumes basic shaping or rate-limiting
schemes.

\item {\bf Logging and Accounting:} The backend domain can be
configured with classifier rules that control how packets are
accounted or logged. For example, log messages might be generated
whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}
On receipt of incoming packets, the backend acts as a simple
demultiplexer: Packets are passed to the appropriate virtual
interface after any necessary logging and accounting have been carried
out.
\subsection{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for transmit,
the other for receive. Each descriptor identifies a block of contiguous
physical memory allocated to the domain.

The transmit ring carries packets to transmit from the guest to the
backend domain. The return path of the transmit ring carries messages
indicating that the contents have been physically transmitted and the
backend no longer requires the associated pages of memory.

To receive packets, the guest places descriptors of unused pages on
the receive ring. The backend will return received packets by
exchanging these pages in the domain's memory with new pages
containing the received data, and passing back descriptors regarding
the new packets on the ring. This zero-copy approach allows the
backend to maintain a pool of free pages to receive packets into, and
then deliver them to appropriate domains after examining their
headers.

%
%Real physical addresses are used throughout, with the domain performing
%translation from pseudo-physical addresses if that is necessary.
If a domain does not keep its receive ring stocked with empty buffers then
packets destined for it may be dropped. This provides some defence against
receive livelock problems because an overloaded domain will cease to receive
further data. Similarly, on the transmit path, it provides the application
with feedback on the rate at which packets are able to leave the system.
Flow control on rings is achieved by including a pair of producer
indices on the shared ring page. Each side will maintain a private
consumer index indicating the next outstanding message. In this
manner, the domains cooperate to divide the ring into two message
lists, one in each direction. Notification is decoupled from the
immediate placement of new messages on the ring; the event channel
will be used to generate notification when {\em either} a certain
number of outstanding messages are queued, {\em or} a specified number
of nanoseconds have elapsed since the oldest message was placed on the
ring.
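The producer/consumer split can be sketched for one direction of such a
ring. This is an illustrative simplification, not Xen's actual ring
definitions: the shared producer index is published only after the slot
is filled, while the consumer index stays private to the consuming
domain:

```c
#include <stdint.h>

#define RING_SIZE 8   /* illustrative; a power of two */

/* One direction of a message ring. The producer index and the slots
 * live on the shared page; the consumer index is kept privately. */
struct ring {
    uint32_t prod;             /* shared: count of messages produced */
    uint32_t msgs[RING_SIZE];  /* shared: message payloads           */
};

static int ring_full(const struct ring *r, uint32_t cons)
{
    return r->prod - cons == RING_SIZE;
}

/* Producer side: fill the slot first, then publish it by advancing
 * the shared producer index. */
static int ring_put(struct ring *r, uint32_t cons, uint32_t msg)
{
    if (ring_full(r, cons))
        return -1;                      /* no space: back off */
    r->msgs[r->prod % RING_SIZE] = msg;
    r->prod++;
    return 0;
}

/* Consumer side: messages between the private consumer index and the
 * shared producer index are outstanding. */
static int ring_get(const struct ring *r, uint32_t *cons, uint32_t *msg)
{
    if (*cons == r->prod)
        return -1;                      /* ring empty */
    *msg = r->msgs[*cons % RING_SIZE];
    (*cons)++;
    return 0;
}
```

In the real transport, advancing the producer index would be paired
with the event-channel notification policy described above, and memory
barriers would order the slot write against the index update.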
% Not sure if my version is any better -- here is what was here before:
%% Synchronization between the backend domain and the guest is achieved using
%% counters held in shared memory that is accessible to both. Each ring has
%% associated producer and consumer indices indicating the area in the ring
%% that holds descriptors that contain data. After receiving {\it n} packets
%% or {\t nanoseconds} after receiving the first packet, the hypervisor sends
%% an event to the domain.
\section{Block I/O}

All guest OS disk access goes through the virtual block device (VBD)
interface. This interface allows domains access to portions of block
storage devices visible to the block backend device. The VBD
interface is a split driver, similar to the network interface
described above. A single shared memory ring is used between the
frontend and backend drivers, across which read and write messages are
sent.
Any block device accessible to the backend domain, including
network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD
devices, can be exported as a VBD. Each VBD is mapped to a device node
in the guest, specified in the guest's startup configuration.

Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
similar functionality can be achieved using the more complete LVM
system, which is already in widespread use.
\subsection{Data Transfer}

The single ring between the guest and the block backend supports three
messages:

\begin{description}
\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to this guest
from the backend. The request includes a descriptor of a free page
into which the reply will be written by the backend.

\item [{\small {\tt READ}}:] Read data from the specified block device. The
frontend identifies the device and location to read from and
attaches pages for the data to be copied to (typically via DMA from
the device). The backend acknowledges completed read requests as
they finish.

\item [{\small {\tt WRITE}}:] Write data to the specified block device. This
functions essentially as {\small {\tt READ}}, except that the data moves to
the device instead of from it.
\end{description}
% um... some old text
%% In overview, the same style of descriptor-ring that is used for
%% network packets is used here. Each domain has one ring that carries
%% operation requests to the hypervisor and carries the results back
%% again.

%% Rather than copying data, the backend simply maps the domain's buffers
%% in order to enable direct DMA to them. The act of mapping the buffers
%% also increases the reference counts of the underlying pages, so that
%% the unprivileged domain cannot try to return them to the hypervisor,
%% install them as page tables, or any other unsafe behaviour.
%% %block API here
\chapter{Further Information}

If you have questions that are not answered by this manual, the
sources of information listed below may be of interest to you. Note
that bug reports, suggestions and contributions related to the
software (or the documentation) should be sent to the Xen developers'
mailing list (address below).

\section{Other documentation}

If you are mainly interested in using (rather than developing for)
Xen, the {\em Xen Users' Manual} is distributed in the {\tt docs/}
directory of the Xen source distribution.

% Various HOWTOs are also available in {\tt docs/HOWTOS}.
\section{Online references}

The official Xen web site is found at:
\begin{quote}
{\tt http://www.cl.cam.ac.uk/Research/SRG/netos/xen/}
\end{quote}

This contains links to the latest versions of all on-line
documentation.
\section{Mailing lists}

There are currently three official Xen mailing lists:

\begin{description}
\item[xen-devel@lists.sourceforge.net] Used for development
discussions and requests for help. Subscribe at: \\
{\small {\tt http://lists.sourceforge.net/mailman/listinfo/xen-devel}}
\item[xen-announce@lists.sourceforge.net] Used for announcements only.
Subscribe at: \\
{\small {\tt http://lists.sourceforge.net/mailman/listinfo/xen-announce}}
\item[xen-changelog@lists.sourceforge.net] Changelog feed
from the unstable and 2.0 trees; developer oriented. Subscribe at: \\
{\small {\tt http://lists.sourceforge.net/mailman/listinfo/xen-changelog}}
\end{description}

Of these, xen-devel is the most active; it is currently used for
both developer and user-related discussions.
\appendix

%\newcommand{\hypercall}[1]{\vspace{5mm}{\large\sf #1}}
\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}}

\chapter{Xen Hypercalls}
\label{a:hypercalls}

Hypercalls represent the procedural interface to Xen; this appendix
categorizes and describes the current set of hypercalls.
\section{Invoking Hypercalls}

Hypercalls are invoked in a manner analogous to system calls in a
conventional operating system; a software interrupt is issued which
vectors to an entry point within Xen. On x86\_32 machines the
instruction required is {\tt int \$0x82}; the (real) IDT is set up so
that this may only be issued from within ring 1. The particular
hypercall to be invoked is contained in {\tt EAX} --- a list
mapping these values to symbolic hypercall names can be found
in {\tt xen/include/public/xen.h}.
On some occasions a set of hypercalls will be required to carry
out a higher-level function; a good example is when a guest
operating system wishes to context switch to a new process, which
requires updating various privileged CPU state. As an optimization
for these cases, there is a generic mechanism to issue a set of
hypercalls as a batch:
\begin{quote}
\hypercall{multicall(void *call\_list, int nr\_calls)}

Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
call\_list}. Each entry contains the hypercall operation code followed
by up to 7 word-sized arguments.
\end{quote}
Note that multicalls are provided purely as an optimization; there is
no requirement to use them when first porting a guest operating
system.
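Building such a batch might look like the following sketch. The struct
here mirrors the description above (an operation code plus up to seven
word-sized arguments) but is not Xen's actual {\tt multicall\_entry\_t}
definition, and the operation codes used in the usage example are
hypothetical:

```c
#include <stdint.h>

/* Illustrative mirror of a multicall entry; see
 * xen/include/public/xen.h for the real definition. */
struct multicall_entry {
    unsigned long op;        /* hypercall operation code */
    unsigned long args[7];   /* up to 7 word-sized arguments */
};

/* Append a two-argument call to a batch; returns the new count.
 * Unused argument slots are left untouched. */
static int multicall_add(struct multicall_entry *batch, int n,
                         unsigned long op,
                         unsigned long arg0, unsigned long arg1)
{
    batch[n].op = op;
    batch[n].args[0] = arg0;
    batch[n].args[1] = arg1;
    return n + 1;
}
```

A guest would accumulate the privileged-state updates for a context
switch this way and then issue a single {\tt multicall(batch, n)}.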
\section{Virtual CPU Setup}

At start of day, a guest operating system needs to set up the virtual
CPU it is executing on. This includes installing vectors for the
virtual IDT so that the guest OS can handle interrupts, page faults,
etc. However the very first thing a guest OS must set up is a pair
of hypervisor callbacks: these are the entry points which Xen will
use when it wishes to notify the guest OS of an occurrence.
\begin{quote}
\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
event\_address, unsigned long failsafe\_selector, unsigned long
failsafe\_address) }

Register the normal (``event'') and failsafe callbacks for
event processing. In each case the code segment selector and
address within that segment are provided. The selectors must
have RPL 1; in XenLinux we simply use the kernel's CS for both
{\tt event\_selector} and {\tt failsafe\_selector}.

The value {\tt event\_address} specifies the address of the guest OS's
event handling and dispatch routine; the {\tt failsafe\_address}
specifies a separate entry point which is used only if a fault occurs
when Xen attempts to use the normal callback.
\end{quote}
After installing the hypervisor callbacks, the guest OS can
install a `virtual IDT' by using the following hypercall:

\begin{quote}
\hypercall{set\_trap\_table(trap\_info\_t *table)}

Install one or more entries into the per-domain
trap handler table (essentially a software version of the IDT).
Each entry in the array pointed to by {\tt table} includes the
exception vector number with the corresponding segment selector
and entry point. Most guest OSes can use the same handlers on
Xen as when running on the real hardware; an exception is the
page fault handler (exception vector 14) where a modified
stack-frame layout is used.

\end{quote}
Finally, as an optimization it is possible for each guest OS
to install one ``fast trap'': this is a trap gate which will
allow direct transfer of control from ring 3 into ring 1 without
indirecting via Xen. In most cases this is suitable for use by
the guest OS system call mechanism, although it may be used for
any purpose.
\begin{quote}
\hypercall{set\_fast\_trap(int idx)}

Install the handler for exception vector {\tt idx} as the ``fast
trap'' for this domain. Note that this installs the current handler
(i.e.\ that which has been installed most recently via a call
to {\tt set\_trap\_table()}).

\end{quote}
\section{Scheduling and Timer}

Domains are preemptively scheduled by Xen according to the
parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
In addition, however, a domain may choose to explicitly
control certain behavior with the following hypercall:

\begin{quote}
\hypercall{sched\_op(unsigned long op)}

Request a scheduling operation from the hypervisor. The options are: {\it
yield}, {\it block}, and {\it shutdown}. {\it yield} keeps the
calling domain runnable but may cause a reschedule if other domains
are runnable. {\it block} removes the calling domain from the run
queue and causes it to sleep until an event is delivered to it. {\it
shutdown} is used to end the domain's execution; the caller can
additionally specify whether the domain should reboot, halt or
suspend.
\end{quote}
To aid the implementation of a process scheduler within a guest OS,
Xen provides a virtual programmable timer:

\begin{quote}
\hypercall{set\_timer\_op(uint64\_t timeout)}

Request a timer event to be sent at the specified system time (time
in nanoseconds since system boot). The hypercall actually passes the
64-bit timeout value as a pair of 32-bit values.

\end{quote}

Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op}
allows block-with-timeout semantics.
\section{Page Table Management}

Since guest operating systems have read-only access to their page
tables, Xen must be involved when making any changes. The following
multi-purpose hypercall can be used to modify page-table entries,
update the machine-to-physical mapping table, flush the TLB, install
a new page-table base pointer, and more.
\begin{quote}
\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)}

Update the page table for the domain; a set of {\tt count} updates are
submitted for processing in a batch, with {\tt success\_count} being
updated to report the number of successful updates.

Each element of {\tt req[]} contains a pointer (address) and value;
the two least significant bits of the pointer are used to distinguish
the type of update requested as follows:
\begin{description}

\item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or
page table entry to the associated value; Xen will check that the
update is safe, as described in Chapter~\ref{c:memory}.

\item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the
machine-to-physical table. The calling domain must own the machine
page in question (or be privileged).

\item[\it MMU\_EXTENDED\_COMMAND:] perform additional MMU operations.
The set of additional MMU operations is considerable, and includes
updating {\tt cr3} (or just re-installing it for a TLB flush),
flushing the cache, installing a new LDT, or pinning \& unpinning
page-table pages (to ensure their reference count doesn't drop to zero,
which would require a revalidation of all entries).

Further extended commands are used to deal with granting and
acquiring page ownership; see Section~\ref{s:idc}.

\end{description}

More details on the precise format of all commands can be
found in {\tt xen/include/public/xen.h}.

\end{quote}
Explicitly updating batches of page table entries is extremely
efficient, but can require a number of alterations to the guest
OS. Using the writable page table mode (Chapter~\ref{c:memory}) is
recommended for new OS ports.

Regardless of which page table update mode is being used, however,
there are some occasions (notably handling a demand page fault) where
a guest OS will wish to modify exactly one PTE rather than a
batch. This is catered for by the following:
\begin{quote}
\hypercall{update\_va\_mapping(unsigned long page\_nr, unsigned long
val, \\ unsigned long flags)}

Update the currently installed PTE for the page {\tt page\_nr} to
{\tt val}. As with {\tt mmu\_update()}, Xen checks the modification
is safe before applying it. The {\tt flags} determine which kind
of TLB flush, if any, should follow the update.

\end{quote}
Finally, sufficiently privileged domains may occasionally wish to manipulate
the pages of others:

\begin{quote}
\hypercall{update\_va\_mapping\_otherdomain(unsigned long page\_nr,
unsigned long val, unsigned long flags, uint16\_t domid)}

Identical to {\tt update\_va\_mapping()} save that the pages being
mapped must belong to the domain {\tt domid}.

\end{quote}

This privileged operation is currently used by backend virtual device
drivers to safely map pages containing I/O data.
\section{Segmentation Support}

Xen allows guest OSes to install a custom GDT if they require it;
this is context switched transparently whenever a domain is
[de]scheduled. The following hypercall is effectively a
`safe' version of {\tt lgdt}:

\begin{quote}
\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}

Install a global descriptor table for a domain; {\tt frame\_list} is
an array of up to 16 machine page frames within which the GDT resides,
with {\tt entries} being the actual number of descriptor-entry
slots. All page frames must be mapped read-only within the guest's
address space, and the table must be large enough to contain Xen's
reserved entries (see {\tt xen/include/public/arch-x86\_32.h}).

\end{quote}
Many guest OSes will also wish to install LDTs; this is achieved by
using {\tt mmu\_update()} with an extended command, passing the
linear address of the LDT base along with the number of entries. No
special safety checks are required; Xen needs to perform this task
simply because {\tt lldt} requires CPL 0.
Xen also allows guest operating systems to update just an
individual segment descriptor in the GDT or LDT:

\begin{quote}
\hypercall{update\_descriptor(unsigned long ma, unsigned long word1,
unsigned long word2)}

Update the GDT/LDT entry at machine address {\tt ma}; the new
8-byte descriptor is stored in {\tt word1} and {\tt word2}.
Xen performs a number of checks to ensure the descriptor is
valid.

\end{quote}
Guest OSes can use the above in place of context switching entire
LDTs (or the GDT) when the number of changing descriptors is small.
\section{Context Switching}

When a guest OS wishes to context switch between two processes,
it can use the page table and segmentation hypercalls described
above to perform the bulk of the privileged work. In addition,
however, it will need to invoke Xen to switch the kernel (ring 1)
stack pointer:

\begin{quote}
\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}

Request a kernel stack switch from the hypervisor; {\tt ss} is the new
stack segment, while {\tt esp} is the new stack pointer.

\end{quote}
A final useful hypercall for context switching allows ``lazy''
save and restore of floating point state:

\begin{quote}
\hypercall{fpu\_taskswitch(void)}

This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
control register; this means that the next attempt to use floating
point will cause a fault which the guest OS can catch. Typically it will
then save/restore the FP state, and clear the {\tt TS} bit.
\end{quote}

This is provided as an optimization only; guest OSes can also choose
to save and restore FP state on all context switches for simplicity.
\section{Physical Memory Management}

As mentioned previously, each domain has a maximum and current
memory allocation. The maximum allocation, set at domain creation
time, cannot be modified. However a domain can choose to reduce
and subsequently grow its current allocation by using the
following call:

\begin{quote}
\hypercall{dom\_mem\_op(unsigned int op, unsigned long *extent\_list,
unsigned long nr\_extents, unsigned int extent\_order)}

Increase or decrease current memory allocation (as determined by
the value of {\tt op}). Each invocation provides a list of
extents each of which is $2^s$ pages in size,
where $s$ is the value of {\tt extent\_order}.

\end{quote}

In addition to simply reducing or increasing the current memory
allocation via a `balloon driver', this call is also useful for
obtaining contiguous regions of machine memory when required (e.g.
for certain PCI devices, or if using superpages).
\section{Inter-Domain Communication}
\label{s:idc}

Xen provides a simple asynchronous notification mechanism via
\emph{event channels}. Each domain has a set of end-points (or
\emph{ports}) which may be bound to an event source (e.g.\ a physical
IRQ, a virtual IRQ, or a port in another domain). When a pair of
end-points in two different domains are bound together, a `send'
operation on one will cause an event to be received by the destination
domain.
The control and use of event channels involves the following hypercall:

\begin{quote}
\hypercall{event\_channel\_op(evtchn\_op\_t *op)}

Inter-domain event-channel management; {\tt op} is a discriminated
union which allows the following 7 operations:

\begin{description}

\item[\it alloc\_unbound:] allocate a free (unbound) local
port and prepare for connection from a specified domain.
\item[\it bind\_virq:] bind a local port to a virtual
IRQ; any particular VIRQ can be bound to at most one port per domain.
\item[\it bind\_pirq:] bind a local port to a physical IRQ;
once more, a given pIRQ can be bound to at most one port per
domain. Furthermore the calling domain must be sufficiently
privileged.
\item[\it bind\_interdomain:] construct an interdomain event
channel; in general, the target domain must have previously allocated
an unbound port for this channel, although this can be bypassed by
privileged domains during domain setup.
\item[\it close:] close an interdomain event channel.
\item[\it send:] send an event to the remote end of an
interdomain event channel.
\item[\it status:] determine the current status of a local port.
\end{description}

For more details see
{\tt xen/include/public/event\_channel.h}.

\end{quote}
Event channels are the fundamental communication primitive between
Xen domains and seamlessly support SMP. However they provide little
bandwidth for communication {\sl per se}, and hence are typically
married with a piece of shared memory to produce effective and
high-performance inter-domain communication.
Safe sharing of memory pages between guest OSes is carried out by
granting access on a per-page basis to individual domains. This is
achieved by using the {\tt grant\_table\_op()} hypercall.

\begin{quote}
\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}

Grant or remove access to a particular page to a particular domain.

\end{quote}

This is not currently in wide use by guest operating systems, but
we intend to integrate support more fully in the near future.
\section{PCI Configuration}

Domains with physical device access (i.e.\ driver domains) receive
limited access to certain PCI devices (bus address space and
interrupts). However many guest operating systems attempt to
determine the PCI configuration by directly accessing the PCI BIOS,
which cannot be allowed for safety.

Instead, Xen provides the following hypercall:

\begin{quote}
\hypercall{physdev\_op(void *physdev\_op)}

Perform a PCI configuration operation; depending on the value
of {\tt physdev\_op} this can be a PCI config read, a PCI config
write, or a small number of other queries.

\end{quote}

For examples of using {\tt physdev\_op()}, see the
Xen-specific PCI code in the Linux sparse tree.
\section{Administrative Operations}
\label{s:dom0ops}

A large number of control operations are available to a sufficiently
privileged domain (typically domain 0). These allow the creation and
management of new domains, for example. A complete list is given
below; for more details on any or all of these, please see
{\tt xen/include/public/dom0\_ops.h}.

\begin{quote}
\hypercall{dom0\_op(dom0\_op\_t *op)}

Administrative domain operations for domain management. The options are:

\begin{description}
\item [\it DOM0\_CREATEDOMAIN:] create a new domain.

\item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run
queue.

\item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable
once again.

\item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated
with a domain.

\item [\it DOM0\_GETMEMLIST:] get the list of pages used by a domain.

\item [\it DOM0\_SCHEDCTL:]

\item [\it DOM0\_ADJUSTDOM:] adjust scheduling priorities for a domain.

\item [\it DOM0\_BUILDDOMAIN:] do final guest OS setup for a domain.

\item [\it DOM0\_GETDOMAINFO:] get statistics about a domain.

\item [\it DOM0\_GETPAGEFRAMEINFO:]

\item [\it DOM0\_GETPAGEFRAMEINFO2:]

\item [\it DOM0\_IOPL:] set the I/O privilege level.

\item [\it DOM0\_MSR:] read or write model-specific registers.

\item [\it DOM0\_DEBUG:] interactively invoke the debugger.

\item [\it DOM0\_SETTIME:] set the system time.

\item [\it DOM0\_READCONSOLE:] read console content from the hypervisor ring buffer.

\item [\it DOM0\_PINCPUDOMAIN:] pin a domain to a particular CPU.

\item [\it DOM0\_GETTBUFS:] get information about the size and location of
the trace buffers (only on trace-buffer enabled builds).

\item [\it DOM0\_PHYSINFO:] get information about the host machine.

\item [\it DOM0\_PCIDEV\_ACCESS:] modify PCI device access permissions.

\item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler.

\item [\it DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes.

\item [\it DOM0\_SETDOMAININITIALMEM:] set the initial memory allocation of a domain.

\item [\it DOM0\_SETDOMAINMAXMEM:] set the maximum memory allocation of a domain.

\item [\it DOM0\_SETDOMAINVMASSIST:] set domain VM assist options.
\end{description}
\end{quote}

Most of the above are best understood by looking at the code
implementing them (in {\tt xen/common/dom0\_ops.c}) and in
the user-space tools that use them (mostly in {\tt tools/libxc}).
\section{Debugging Hypercalls}

A few additional hypercalls are mainly useful for debugging:

\begin{quote}
\hypercall{console\_io(int cmd, int count, char *str)}

Use Xen to interact with the console; operations are:

{\it CONSOLEIO\_write}: output {\tt count} characters from buffer {\tt str}.

{\it CONSOLEIO\_read}: input at most {\tt count} characters into buffer {\tt str}.
\end{quote}

A pair of hypercalls allows access to the underlying debug registers:
\begin{quote}
\hypercall{set\_debugreg(int reg, unsigned long value)}

Set debug register {\tt reg} to {\tt value}.

\hypercall{get\_debugreg(int reg)}

Return the contents of the debug register {\tt reg}.
\end{quote}
And finally:
\begin{quote}
\hypercall{xen\_version(int cmd)}

Request the Xen version number.
\end{quote}

This is useful to ensure that user-space tools are in sync
with the underlying hypervisor.
\section{Deprecated Hypercalls}

Xen is under constant development and refinement; as such there
are plans to improve the way in which various pieces of functionality
are exposed to guest OSes.

\begin{quote}
\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}

Toggle various memory management modes (in particular writable page
tables and superpage support).

\end{quote}

This is likely to be replaced with mode values in the shared
information page since this is more resilient for resumption
after migration or checkpoint.
%%
%% XXX SMH: not really sure how useful below is -- if it's still
%% actually true, might be useful for someone wanting to write a
%% new scheduler... not clear how many of them there are...
%%

\begin{comment}
\chapter{Scheduling API}

The scheduling API is used by both the schedulers described above and should
also be used by any new schedulers. It provides a generic interface and also
implements much of the ``boilerplate'' code.

Schedulers conforming to this API are described by the following
structure:
\begin{verbatim}
struct scheduler
{
    char *name;             /* full name for this scheduler   */
    char *opt_name;         /* option name for this scheduler */
    unsigned int sched_id;  /* ID for this scheduler          */

    int          (*init_scheduler) ();
    int          (*alloc_task)     (struct task_struct *);
    void         (*add_task)       (struct task_struct *);
    void         (*free_task)      (struct task_struct *);
    void         (*rem_task)       (struct task_struct *);
    void         (*wake_up)        (struct task_struct *);
    void         (*do_block)       (struct task_struct *);
    task_slice_t (*do_schedule)    (s_time_t);
    int          (*control)        (struct sched_ctl_cmd *);
    int          (*adjdom)         (struct task_struct *,
                                    struct sched_adjdom_cmd *);
    s32          (*reschedule)     (struct task_struct *);
    void         (*dump_settings)  (void);
    void         (*dump_cpu_state) (int);
    void         (*dump_runq_el)   (struct task_struct *);
};
\end{verbatim}
The only method that {\em must} be implemented is
{\tt do\_schedule()}. However, if there is no implementation of the
{\tt wake\_up()} method then waking tasks will not get put on the runqueue!

The fields of the above structure are described in more detail below.

\subsubsection{name}

The name field should point to a descriptive ASCII string.

\subsubsection{opt\_name}

This field is the value of the {\tt sched=} boot-time option that will select
this scheduler.

\subsubsection{sched\_id}

This is an integer that uniquely identifies this scheduler. There should be a
macro corresponding to this scheduler ID in {\tt <xen/sched-if.h>}.
\subsubsection{init\_scheduler}

\paragraph*{Purpose}

This is a function for performing any scheduler-specific initialisation. For
instance, it might allocate memory for per-CPU scheduler data and initialise it
appropriately.

\paragraph*{Call environment}

This function is called after the initialisation performed by the generic
layer. The function is called exactly once, for the scheduler that has been
selected.

\paragraph*{Return values}

This should return negative on failure --- this will cause an
immediate panic and the system will fail to boot.
\subsubsection{alloc\_task}

\paragraph*{Purpose}
Called when a {\tt task\_struct} is allocated by the generic scheduler
layer. A particular scheduler implementation may use this method to
allocate per-task data for this task. It may use the {\tt
sched\_priv} pointer in the {\tt task\_struct} to point to this data.

\paragraph*{Call environment}
The generic layer guarantees that the {\tt sched\_priv} field will
remain intact from the time this method is called until the task is
deallocated (so long as the scheduler implementation does not change
it explicitly!).

\paragraph*{Return values}
Negative on failure.
\subsubsection{add\_task}

\paragraph*{Purpose}

Called when a task is initially added by the generic layer.

\paragraph*{Call environment}

The fields in the {\tt task\_struct} are now filled out and available for use.
Schedulers should implement appropriate initialisation of any per-task private
information in this method.

\subsubsection{free\_task}

\paragraph*{Purpose}

Schedulers should free the space used by any associated private data
structures.

\paragraph*{Call environment}

This is called when a {\tt task\_struct} is about to be deallocated.
The generic layer will have done generic task removal operations and
(if implemented) called the scheduler's {\tt rem\_task} method before
this method is called.

\subsubsection{rem\_task}

\paragraph*{Purpose}

This is called when a task is being removed from scheduling (but is
not yet being freed).
\subsubsection{wake\_up}

\paragraph*{Purpose}

Called when a task is woken up, this method should put the task on the runqueue
(or do the scheduler-specific equivalent action).

\paragraph*{Call environment}

The task is already set to state RUNNING.

\subsubsection{do\_block}

\paragraph*{Purpose}

This function is called when a task is blocked. This function should
not remove the task from the runqueue.

\paragraph*{Call environment}

The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to
TASK\_INTERRUPTIBLE on entry to this method. A call to the {\tt
do\_schedule} method will be made after this method returns, in
order to select the next task to run.
\subsubsection{do\_schedule}

This method must be implemented.

\paragraph*{Purpose}

The method is called each time a new task must be chosen for scheduling on the
current CPU. The current time is passed as the single argument (the current
task can be found using the {\tt current} macro).

This method should select the next task to run on this CPU and set its minimum
time to run, as well as returning the data described below.

This method should also take the appropriate action if the previous
task has blocked, e.g.\ removing it from the runqueue.

\paragraph*{Call environment}

The other fields in the {\tt task\_struct} are updated by the generic layer,
which also performs all Xen-specific tasks and performs the actual task switch
(unless the previous task has been chosen again).

This method is called with the {\tt schedule\_lock} held for the current CPU
and local interrupts disabled.

\paragraph*{Return values}

Must return a {\tt struct task\_slice} describing what task to run and how long
for (at maximum).
\subsubsection{control}

\paragraph*{Purpose}

This method is called for global scheduler control operations. It takes a
pointer to a {\tt struct sched\_ctl\_cmd}, which it should either
source data from or populate with data, depending on the value of the
{\tt direction} field.

\paragraph*{Call environment}

The generic layer guarantees that when this method is called, the
caller selected the correct scheduler ID, hence the scheduler's
implementation does not need to sanity-check these parts of the call.

\paragraph*{Return values}

This function should return the value to be passed back to user space, hence it
should either be 0 or an appropriate errno value.

\subsubsection{sched\_adjdom}

\paragraph*{Purpose}

This method is called to adjust the scheduling parameters of a particular
domain, or to query their current values. The function should check
the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in
order to determine which of these operations is being performed.

\paragraph*{Call environment}

The generic layer guarantees that the caller has specified the correct
control interface version and scheduler ID and that the supplied {\tt
task\_struct} will not be deallocated during the call (hence it is not
necessary to {\tt get\_task\_struct}).

\paragraph*{Return values}

This function should return the value to be passed back to user space, hence it
should either be 0 or an appropriate errno value.
\subsubsection{reschedule}

\paragraph*{Purpose}

This method is called to determine if a reschedule is required as a result of a
particular task.

\paragraph*{Call environment}
The generic layer will cause a reschedule if the current domain is the idle
task or it has exceeded its minimum time slice before a reschedule. The
generic layer guarantees that the task passed is not currently running but is
on the runqueue.

\paragraph*{Return values}

Should return a mask of CPUs to cause a reschedule on.
\subsubsection{dump\_settings}

\paragraph*{Purpose}

If implemented, this should dump any private global settings for this
scheduler to the console.

\paragraph*{Call environment}

This function is called with interrupts enabled.

\subsubsection{dump\_cpu\_state}

\paragraph*{Purpose}

This method should dump any private settings for the specified CPU.

\paragraph*{Call environment}

This function is called with interrupts disabled and the {\tt schedule\_lock}
for the specified CPU held.

\subsubsection{dump\_runq\_el}

\paragraph*{Purpose}

This method should dump any private settings for the specified task.

\paragraph*{Call environment}

This function is called with interrupts disabled and the {\tt schedule\_lock}
for the task's CPU held.

\end{comment}
%%
%% XXX SMH: we probably should have something in here on debugging
%% etc; this is a kinda developers manual and many devs seem to
%% like debugging support :^)
%% Possibly sanitize below, else wait until new xendbg stuff is in
%% (and/or kip's stuff?) and write about that instead?
%%

\begin{comment}
\chapter{Debugging}

Xen provides tools for debugging both Xen and guest OSes. Currently, the
Pervasive Debugger provides a GDB stub, which provides facilities for symbolic
debugging of Xen itself and of OS kernels running on top of Xen. The Trace
Buffer provides a lightweight means to log data about Xen's internal state and
behaviour at runtime, for later analysis.

\section{Pervasive Debugger}

Information on using the pervasive debugger is available in pdb.txt.

\section{Trace Buffer}

The trace buffer provides a means to observe Xen's operation from domain 0.
Trace events, inserted at key points in Xen's code, record data that can be
read by the {\tt xentrace} tool. Recording these events has a low overhead
and hence the trace buffer may be useful for debugging timing-sensitive
behaviours.
\subsection{Internal API}

To use the trace buffer functionality from within Xen, you must {\tt \#include
<xen/trace.h>}, which contains definitions related to the trace buffer. Trace
events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1,
2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional
(32-bit) data words as their arguments. For trace buffer-enabled builds of Xen
these will insert the event ID and data into the trace buffer, along with the
current value of the CPU cycle-counter. For builds without the trace buffer
enabled, the macros expand to no-ops and thus can be left in place without
incurring overheads.

\subsection{Trace-enabled builds}

By default, the trace buffer is enabled only in debug builds (i.e.\ {\tt NDEBUG}
is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER},
either in {\tt <xen/config.h>} or on the gcc command line.

The size (in pages) of the per-CPU trace buffers can be specified using the
{\tt tbuf\_size=n} boot parameter to Xen. If the size is set to 0, the trace
buffers will be disabled.
\subsection{Dumping trace data}

When running a trace buffer build of Xen, trace data are written continuously
into the buffer data areas, with newer data overwriting older data. This data
can be captured using the {\tt xentrace} program in domain 0.

The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace
buffers into its address space. It then periodically polls all the buffers for
new data, dumping out any new records from each buffer in turn. As a result,
for machines with multiple (logical) CPUs, the trace buffer output will not be
in overall chronological order.

The output from {\tt xentrace} can be post-processed using {\tt
xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and
{\tt xentrace\_format} (used to pretty-print trace data). For the predefined
trace points, there is an example format file in {\tt tools/xentrace/formats}.

For more information, see the manual pages for {\tt xentrace}, {\tt
xentrace\_format} and {\tt xentrace\_cpusplit}.

\end{comment}
\end{document}