## ia64/xen-unstable

### view docs/interface.tex @ 2570:30a9b33481dc

bitkeeper revision 1.1159.90.2 (415be9d5hTw1zLV9fA-AYcekmwhMwg)

Discard devices early for local migrate.
author mjw@wray-m-3.hpl.hp.com Thu Sep 30 11:11:17 2004 +0000 (2004-09-30) 11be1dfb262b 1cec0bdb4c6f 1c21b245b050
line source
1 \documentclass[11pt,twoside,final,openright]{xenstyle}
2 \usepackage{a4,graphicx,setspace}
3 \setstretch{1.15}
4 \input{style.tex}
6 \begin{document}
8 % TITLE PAGE
9 \pagestyle{empty}
10 \begin{center}
11 \vspace*{\fill}
12 \includegraphics{eps/xenlogo.eps}
13 \vfill
14 \vfill
15 \vfill
16 \begin{tabular}{l}
17 {\Huge \bf Interface manual} \\[4mm]
18 {\huge Xen v2.0 for x86} \\[80mm]
20 {\Large Xen is Copyright (c) 2004, The Xen Team} \\[3mm]
21 {\Large University of Cambridge, UK} \\[20mm]
22 {\large Last updated on 11th March, 2004}
23 \end{tabular}
24 \vfill
25 \end{center}
26 \cleardoublepage
29 \pagestyle{plain}
30 \pagenumbering{roman}
31 { \parskip 0pt plus 1pt
32 \tableofcontents }
33 \cleardoublepage
35 % PREPARE FOR MAIN TEXT
36 \pagenumbering{arabic}
37 \raggedbottom
38 \widowpenalty=10000
39 \clubpenalty=10000
40 \parindent=0pt
41 \renewcommand{\topfraction}{.8}
42 \renewcommand{\bottomfraction}{.8}
43 \renewcommand{\textfraction}{.2}
44 \renewcommand{\floatpagefraction}{.8}
45 \setstretch{1.15}
47 \chapter{Introduction}
48 Xen allows the hardware resources of a machine to be virtualized and
49 dynamically partitioned so as to allow multiple different `guest'
50 operating system images to be run simultaneously.
52 Virtualizing the machine in this manner provides flexibility, allowing
53 different users to choose their preferred operating system (Windows,
54 Linux, FreeBSD, or a custom operating system). Furthermore, Xen provides
55 secure partitioning between these `domains', and enables better resource
56 accounting and QoS isolation than can be achieved with a conventional
57 operating system.
59 The hypervisor runs directly on server hardware and dynamically partitions
60 it between a number of {\it domains}, each of which hosts an instance
61 of a {\it guest operating system}. The hypervisor provides just enough
62 abstraction of the machine to allow effective isolation and resource
63 management between these domains.
65 Xen essentially takes a virtual machine approach as pioneered by IBM VM/370.
66 However, unlike VM/370 or more recent efforts such as VMware and Virtual PC,
67 Xen does not attempt to completely virtualize the underlying hardware. Instead,
68 parts of the hosted guest operating systems are modified to work with the hypervisor; the
69 operating system is effectively ported to a new target architecture, typically
70 requiring changes in just the machine-dependent code. The user-level API is
71 unchanged, thus existing binaries and operating system distributions can work
72 unmodified.
74 In addition to exporting virtualized instances of CPU, memory, network and
75 block devices, Xen exposes a control interface to set how these resources
76 are shared between the running domains. The control interface is privileged
77 and may only be accessed by one particular virtual machine: {\it domain0}.
78 This domain is a required part of any Xen-based server and runs the application
79 software that manages the control-plane aspects of the platform. Running the
80 control software in {\it domain0}, distinct from the hypervisor itself, allows
81 the Xen framework to separate the notions of {\it mechanism} and {\it policy}
82 within the system.
85 \chapter{CPU state}
87 All privileged state must be handled by Xen. The guest OS has no direct access
88 to CR3 and is not permitted to update privileged bits in EFLAGS.
90 \chapter{Exceptions}
91 The IDT is virtualised by submitting a virtual `trap
92 table' to Xen. Most trap handlers are identical to native x86
93 handlers. The page-fault handler is a notable exception.
95 \chapter{Interrupts and events}
96 Interrupts are virtualized by mapping them to events, which are delivered
97 asynchronously to the target domain. A guest OS can map these events onto
98 its standard interrupt dispatch mechanisms, such as a simple vectoring
99 scheme. Each physical interrupt source controlled by the hypervisor, including
100 network devices, disks, or the timer subsystem, is responsible for identifying
101 the target for an incoming interrupt and sending an event to that domain.
103 This demultiplexing mechanism also provides a device-specific mechanism for
104 event coalescing or hold-off. For example, a guest OS may request that it
105 receive an event only after {\it n} packets are queued ready for delivery
106 to it, or {\it t} nanoseconds after the first packet arrived (whichever is true
107 first). This allows latency and throughput requirements to be addressed on a
108 domain-specific basis.
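The hold-off rule in the example above can be sketched in C as follows. This is
only an illustration: the structure {\tt holdoff\_sketch} and the function
{\tt should\_send\_event()} are hypothetical names and are not part of Xen's
interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-domain hold-off parameters (names are illustrative). */
struct holdoff_sketch {
    unsigned int n_packets;   /* deliver once this many packets are queued */
    uint64_t     t_ns;        /* ...or this long after the first packet    */
};

/* Return 1 if an event should be sent now, 0 if delivery is held off. */
int should_send_event(const struct holdoff_sketch *h, unsigned int queued,
                      uint64_t first_arrival_ns, uint64_t now_ns)
{
    if (queued == 0)
        return 0;                       /* nothing to report */
    if (queued >= h->n_packets)
        return 1;                       /* packet threshold reached */
    return (now_ns - first_arrival_ns) >= h->t_ns;   /* timeout reached? */
}
```

A small {\it n} or {\it t} favours latency; larger values favour throughput by
batching more packets per event.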
110 \chapter{Time}
111 Guest operating systems need to be aware of the passage of real time and their
112 own ``virtual time'', i.e. the time they have been executing. Furthermore, a
113 notion of time is required in the hypervisor itself for scheduling and the
114 activities that relate to it. To this end the hypervisor provides four notions
115 of time: cycle counter time, system time, wall clock time, and domain virtual
116 time.
119 \section{Cycle counter time}
120 This provides the finest-grained, free-running time reference, with the
121 approximate frequency being publicly accessible. The cycle counter time is
122 used to accurately extrapolate the other time references. On SMP machines
123 it is currently assumed that the cycle counter time is synchronised between
124 CPUs. The current x86-based implementation achieves this within inter-CPU
125 communication latencies.
127 \section{System time}
128 This is a 64-bit value containing the nanoseconds elapsed since boot
129 time. Unlike cycle counter time, system time accurately reflects the
130 passage of real time, i.e. it is adjusted several times a second for timer
131 drift. This is done by running an NTP client in {\it domain0} on behalf of
132 the machine, feeding updates to the hypervisor. Intermediate values can be
133 extrapolated using the cycle counter.
135 \section{Wall clock time}
136 This is the actual ``time of day'' Unix style struct timeval (i.e. seconds and
137 microseconds since 1 January 1970, adjusted by leap seconds etc.). Again, an
138 NTP client hosted by {\it domain0} can help maintain this value. To guest
139 operating systems this value will be reported instead of the hardware RTC
140 clock value and they can use the system time and cycle counter times to start
141 and remain perfectly in time.
144 \section{Domain virtual time}
145 This progresses at the same pace as cycle counter time, but only while a
146 domain is executing. It stops while a domain is de-scheduled. Therefore the
147 share of the CPU that a domain receives is indicated by the rate at which
148 its domain virtual time increases, relative to the rate at which cycle
149 counter time does so.
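As a worked illustration of this ratio, the following C sketch estimates a
domain's CPU share from two samples of each counter. The function name
{\tt cpu\_share()} and its parameters are assumptions made for the example; Xen
does not export such a function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the fraction of the CPU a domain received over an interval,
 * given domain virtual time and cycle counter time sampled at the start
 * and end of that interval (both in the same units).
 */
double cpu_share(uint64_t virt_start, uint64_t virt_end,
                 uint64_t cyc_start, uint64_t cyc_end)
{
    uint64_t cyc_delta = cyc_end - cyc_start;

    if (cyc_delta == 0)
        return 0.0;                     /* empty interval: no estimate */
    return (double)(virt_end - virt_start) / (double)cyc_delta;
}
```

For instance, a domain whose virtual time advances by 250 units while cycle
counter time advances by 1000 units has received a quarter of the CPU.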
151 \section{Time interface}
152 Xen exports some timestamps to guest operating systems through their shared
153 info page. Timestamps are provided for system time and wall-clock time. Xen
154 also provides the cycle counter values at the time of the last update
155 allowing guests to calculate the current values. The CPU frequency and a
156 scaling factor are provided for guests to convert cycle counter values to
157 real time. Since all time stamps need to be updated and read
158 \emph{atomically}, two version numbers are also stored in the shared info
159 page.
161 Xen will ensure that the time stamps are updated frequently enough to avoid
162 an overflow of the cycle counter values. A guest can check whether its notion of
163 time is up-to-date by comparing the version numbers.
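One way a guest might use the two version numbers is sketched below in C. This
is a minimal sketch under stated assumptions, not Xen's actual shared info
layout: the structure {\tt time\_info\_sketch}, its field names, and the
convention that the updater increments {\tt version1} before writing and
{\tt version2} afterwards are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the time fields in the shared info page. */
struct time_info_sketch {
    volatile uint32_t version1;  /* assumed: incremented before an update */
    volatile uint32_t version2;  /* assumed: incremented after an update  */
    uint64_t system_time_ns;     /* system time at the last update        */
    uint64_t cycles_at_update;   /* cycle counter at the last update      */
    uint64_t cycles_per_sec;     /* cycle counter frequency (non-zero)    */
};

/* Take a consistent snapshot, then extrapolate the current system time. */
uint64_t current_system_time_ns(const struct time_info_sketch *t,
                                uint64_t cycles_now)
{
    uint32_t seen;
    uint64_t base_ns, stamp, freq, delta;

    do {
        seen    = t->version2;       /* read the "after" counter first */
        /* a real guest would need a read memory barrier here */
        base_ns = t->system_time_ns;
        stamp   = t->cycles_at_update;
        freq    = t->cycles_per_sec;
        /* ...and another read memory barrier here */
    } while (seen != t->version1);   /* retry if an update intervened */

    delta = cycles_now - stamp;
    /* Split into whole seconds and remainder to limit overflow risk. */
    return base_ns + (delta / freq) * 1000000000ULL
                   + (delta % freq) * 1000000000ULL / freq;
}
```

If the two counters differ, an update was in progress during the read and the
loop simply takes a fresh snapshot.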
165 \section{Timer events}
167 Xen maintains a periodic timer (currently with a 10ms period) which sends a
168 timer event to the currently executing domain. This allows Guest OSes to
169 keep track of the passing of time when executing. The scheduler also
170 arranges for a newly activated domain to receive a timer event when
171 scheduled so that the Guest OS can adjust to the passage of time while it
172 has been inactive.
174 In addition, Xen exports a hypercall interface to each domain which allows
175 them to request a timer event to be sent to them at the specified system
176 time. Guest OSes may use this timer to implement timeout values when they
177 block.
179 \chapter{Memory}
181 The hypervisor is responsible for providing memory to each of the domains running
182 over it. However, the Xen hypervisor's duty is restricted to managing physical
183 memory and to policing page table updates. All other memory management functions
184 are handled externally. Start-of-day issues such as building initial page tables
185 for a domain, loading its kernel image and so on are done by the {\it domain builder}
186 running in user-space within {\it domain0}. Paging to disk and swapping are handled
187 by the guest operating systems themselves, if they need it.
189 On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It has full
190 access to the physical memory available in the system and is responsible for
191 allocating portions of it to the domains. Guest operating systems run in and use
192 {\it rings 1}, {\it 2} and {\it 3} as they see fit, aside from the fact that
193 segmentation is used to prevent the guest OS from accessing a portion of the
194 linear address space that is reserved for use by the hypervisor. This approach
195 allows transitions between the guest OS and hypervisor without flushing the TLB.
196 We expect most guest operating systems will use ring 1 for their own operation
197 and place applications (if they support such a notion) in ring 3.
199 \section{Physical Memory Allocation}
200 The hypervisor reserves a small fixed portion of physical memory at system boot
201 time. This special memory region is located at the beginning of physical memory
202 and is mapped at the very top of every virtual address space.
204 Any physical memory that is not used directly by the hypervisor is divided into
205 pages and is available for allocation to domains. The hypervisor tracks which
206 pages are free and which pages have been allocated to each domain. When a new
207 domain is initialized, the hypervisor allocates it pages drawn from the free
208 list. The amount of memory required by the domain is passed to the hypervisor
209 as one of the parameters for new domain initialization by the domain builder.
211 Domains can never be allocated further memory beyond that which was requested
212 for them on initialization. However, a domain can return pages to the hypervisor
213 if it discovers that its memory requirements have diminished.
215 % put reasons for why pages might be returned here.
217 In addition to managing physical memory allocation, the hypervisor is also in
218 charge of performing page table updates on behalf of the domains. This is
219 necessary to prevent domains from adding arbitrary mappings to their page
220 tables or introducing mappings to other domains' page tables.
222 \section{Segment Descriptor Tables}
224 On boot a guest is supplied with a default GDT, which is {\em not}
225 taken from its own memory allocation. If the guest wishes to use other
226 than the default `flat' ring-1 and ring-3 segments that this default
227 table provides, it must register a custom GDT and/or LDT with Xen,
228 allocated from its own memory.
230 int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})
232 {\em frame\_list}: An array of up to 16 page frames within which the GDT
233 resides. Any frame registered as a GDT frame may only be mapped
234 read-only within the guest's address space (e.g., no writeable
235 mappings, no use as a page-table page, and so on).
237 {\em entries}: The number of descriptor-entry slots in the GDT. Note that
238 the table must be large enough to contain Xen's reserved entries; thus
239 we must have `{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}'. Note also that,
240 after registering the GDT, slots {\em FIRST\_} through
241 {\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and may be
242 overwritten by Xen.
244 \section{Pseudo-Physical Memory}
245 The usual problem of external fragmentation means that a domain is unlikely to
246 receive a contiguous stretch of physical memory. However, most guest operating
247 systems do not have built-in support for operating in a fragmented physical
248 address space; e.g., Linux has to have a one-to-one mapping for its physical
249 memory. Therefore a notion of {\it pseudo physical memory} is introduced.
250 Once a domain is allocated a number of pages, at its start of day, one of
251 the first things it needs to do is build its own {\it real physical} to
252 {\it pseudo physical} mapping. From that moment onwards {\it pseudo physical}
253 addresses are used instead of discontiguous {\it real physical} addresses. Thus,
254 the rest of the guest OS code has an impression of operating in a contiguous
255 address space. Guest OS page tables contain real physical addresses. Mapping
256 {\it pseudo physical} to {\it real physical} addresses is needed on page
257 table updates and also on remapping memory regions with the guest OS.
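The two directions of this mapping can be sketched in C as a pair of lookup
tables. All names here ({\tt build\_mappings()}, {\tt real\_to\_pseudo},
{\tt NR\_REAL\_FRAMES}) are hypothetical and chosen for illustration; a real
guest would size and store these tables differently.

```c
#include <assert.h>
#include <stddef.h>

#define NR_REAL_FRAMES 16u      /* hypothetical machine size, in page frames */
#define INVALID_FRAME  (~0UL)   /* marks real frames not owned by the domain */

/* Real frame number -> pseudo-physical frame number. */
unsigned long real_to_pseudo[NR_REAL_FRAMES];

/*
 * Given the (possibly discontiguous) real frames allocated to a domain,
 * number them 0..nr-1 in pseudo-physical space and record both directions.
 */
void build_mappings(const unsigned long *allocated, size_t nr,
                    unsigned long *pseudo_to_real)
{
    size_t i;

    for (i = 0; i < NR_REAL_FRAMES; i++)
        real_to_pseudo[i] = INVALID_FRAME;

    for (i = 0; i < nr; i++) {
        if (allocated[i] >= NR_REAL_FRAMES)
            continue;                        /* ignore out-of-range frames */
        pseudo_to_real[i] = allocated[i];    /* used when writing PTEs     */
        real_to_pseudo[allocated[i]] = i;    /* used when reading PTEs     */
    }
}
```

The guest indexes {\tt pseudo\_to\_real} whenever it constructs a page table
entry, since page tables must contain real physical addresses.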
261 \chapter{Network I/O}
263 Virtual network device services are provided by shared memory
264 communications with a `backend' domain. From the point of view of
265 other domains, the backend may be viewed as a virtual ethernet switch
266 element with each domain having one or more virtual network interfaces
267 connected to it.
269 \section{Backend Packet Handling}
270 The backend driver is responsible primarily for {\it data-path} operations.
271 In terms of networking this means packet transmission and reception.
273 On the transmission side, the backend needs to perform three key actions:
274 \begin{itemize}
275 \item {\tt Validation:} A domain is only allowed to emit packets
276 matching a certain specification; for example, ones in which the
277 source IP address matches one assigned to the virtual interface over
278 which it is sent. The backend is responsible for ensuring any such
279 requirements are met, either by checking or by stamping outgoing
280 packets with prescribed values for certain fields.
282 Validation functions can be configured using standard firewall rules
283 (i.e. IP Tables, in the case of Linux).
285 \item {\tt Scheduling:} Since a number of domains can share a single
286 ``real'' network interface, the hypervisor must mediate access when
287 several domains each have packets queued for transmission. Of course,
288 this general scheduling function subsumes basic shaping or
289 rate-limiting schemes.
291 \item {\tt Logging and Accounting:} The hypervisor can be configured
292 with classifier rules that control how packets are accounted or
293 logged. For example, {\it domain0} could request that it receives a
294 log message or copy of the packet whenever another domain attempts to
295 send a TCP packet containing a SYN.
296 \end{itemize}
298 On the receive side, the backend's role is relatively straightforward:
299 once a packet is received, it just needs to determine the virtual interface(s)
300 to which it must be delivered and deliver it via page-flipping.
303 \section{Data Transfer}
305 Each virtual interface uses two ``descriptor rings'', one for transmit,
306 the other for receive. Each descriptor identifies a block of contiguous
307 physical memory allocated to the domain. There are four cases:
309 \begin{itemize}
311 \item The transmit ring carries packets to transmit from the domain to the
312 hypervisor.
314 \item The return path of the transmit ring carries ``empty'' descriptors
315 indicating that the contents have been transmitted and the memory can be
316 re-used.
318 \item The receive ring carries empty descriptors from the domain to the
319 hypervisor; these provide storage space for that domain's received packets.
321 \item The return path of the receive ring carries packets that have been
322 received.
323 \end{itemize}
325 Real physical addresses are used throughout, with the domain performing
326 translation from pseudo-physical addresses if that is necessary.
328 If a domain does not keep its receive ring stocked with empty buffers then
329 packets destined to it may be dropped. This provides some defense against
330 receive livelock problems, because an overloaded domain will cease to receive
331 further data. Similarly, on the transmit path, it provides the application
332 with feedback on the rate at which packets are able to leave the system.
334 Synchronization between the hypervisor and the domain is achieved using
335 counters held in shared memory that is accessible to both. Each ring has
336 associated producer and consumer indices indicating the area in the ring
337 that holds descriptors that contain data. After receiving {\it n} packets
338 or {\it t} nanoseconds after receiving the first packet, the hypervisor sends
339 an event to the domain.
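The producer and consumer indices described above can be sketched in C as
follows. This is a simplified, single-threaded illustration: the names
{\tt ring\_sketch}, {\tt ring\_put()} and {\tt ring\_get()} are hypothetical, the
descriptor is reduced to a single word, and the memory barriers that real
shared-memory code needs are omitted.

```c
#include <assert.h>

#define RING_SIZE 8u   /* hypothetical size; must be a power of two */

struct ring_sketch {
    unsigned int  prod;              /* free-running producer index */
    unsigned int  cons;              /* free-running consumer index */
    unsigned long desc[RING_SIZE];   /* descriptors (simplified)    */
};

/* Number of descriptors currently holding data. */
unsigned int ring_pending(const struct ring_sketch *r)
{
    return r->prod - r->cons;
}

/* Producer side: returns 0 on success, -1 if the ring is full. */
int ring_put(struct ring_sketch *r, unsigned long d)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;
    r->desc[r->prod & (RING_SIZE - 1)] = d;
    r->prod++;                       /* publish after writing the slot */
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int ring_get(struct ring_sketch *r, unsigned long *d)
{
    if (r->prod == r->cons)
        return -1;
    *d = r->desc[r->cons & (RING_SIZE - 1)];
    r->cons++;                       /* release the slot for reuse */
    return 0;
}
```

Because each index is written by only one side, neither party needs a lock;
a full ring is what causes packets to be dropped or the sender to be throttled.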
341 \chapter{Block I/O}
343 \section{Virtual Block Devices (VBDs)}
345 All guest OS disk access goes through the VBD interface. The VBD
346 interface provides the administrator with the ability to selectively
347 grant domains access to portions of block storage devices visible to
348 the block backend device (usually domain 0).
350 VBDs can literally be backed by any block device accessible to the
351 backend domain, including network-based block devices (iSCSI, *NBD,
352 etc), loopback devices and LVM / MD devices.
354 Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
355 similar functionality can be achieved using the (more advanced) LVM
356 system, which is already in widespread use.
358 \subsection{Data Transfer}
359 Domains which have been granted access to a logical block device are permitted
360 to read and write it by shared memory communications with the backend domain.
362 In overview, the same style of descriptor-ring that is used for
363 network packets is used here. Each domain has one ring that carries
364 operation requests to the hypervisor and carries the results back
365 again.
367 Rather than copying data, the backend simply maps the domain's buffers
368 in order to enable direct DMA to them. The act of mapping the buffers
369 also increases the reference counts of the underlying pages, so that
370 the unprivileged domain cannot try to return them to the hypervisor,
371 install them as page tables, or attempt any other unsafe behaviour.
372 %block API here
374 \chapter{Privileged operations}
375 {\it Domain0} is responsible for building all other domains on the server
376 and providing control interfaces for managing scheduling, networking, and
377 blocks.
379 \chapter{CPU Scheduler}
381 Xen offers a uniform API for CPU schedulers. It is possible to choose
382 from a number of schedulers at boot and it should be easy to add more.
384 \paragraph*{Note: SMP host support}
385 Xen has always supported SMP host systems. Domains are statically assigned to
386 CPUs, either at creation time or when manually pinning to a particular CPU.
387 The current schedulers then run locally on each CPU to decide which of the
388 assigned domains should be run there.
390 \section{Standard Schedulers}
392 The BVT, Atropos and Round Robin schedulers are part of the normal
393 Xen distribution. BVT provides proportional fair shares of the CPU to
394 the running domains. Atropos can be used to reserve absolute shares
395 of the CPU for each domain. Round-robin is provided as an example of
396 Xen's internal scheduler API.
398 More information on the characteristics and use of these schedulers is
399 available in { \tt Sched-HOWTO.txt }.
401 \section{Scheduling API}
403 The scheduling API is used by the schedulers described above and should
404 also be used by any new schedulers. It provides a generic interface and also
405 implements much of the ``boilerplate'' code.
407 Schedulers conforming to this API are described by the following
408 structure:
410 \begin{verbatim}
411 struct scheduler
412 {
413 char *name; /* full name for this scheduler */
414 char *opt_name; /* option name for this scheduler */
415 unsigned int sched_id; /* ID for this scheduler */
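417 int (*init_scheduler) ();
418 int (*alloc_task) (struct task_struct *);
419 void (*add_task) (struct task_struct *);
420 void (*free_task) (struct task_struct *);
421 void (*rem_task) (struct task_struct *);
422 void (*wake_up) (struct task_struct *);
423 void (*do_block) (struct task_struct *);
424 task_slice_t (*do_schedule) (s_time_t);
425 int (*control) (struct sched_ctl_cmd *);
426 int (*adjdom) (struct task_struct *,
427 struct sched_adjdom_cmd *);
428 s32 (*reschedule) (struct task_struct *);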
429 void (*dump_settings) (void);
430 void (*dump_cpu_state) (int);
431 void (*dump_runq_el) (struct task_struct *);
432 };
433 \end{verbatim}
435 The only method that {\em must} be implemented is
436 {\tt do\_schedule()}. However, if there is not some implementation for the
437 {\tt wake\_up()} method then waking tasks will not get put on the runqueue!
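To illustrate the idea of mandatory and optional methods, the following C
sketch defines a trivial scheduler that supplies only {\tt do\_schedule()}. It
is not Xen code: the types are simplified stand-ins, the structure
{\tt scheduler\_sketch} carries only a subset of the fields, and the helpers
{\tt generic\_init()} and {\tt generic\_wake()} are hypothetical examples of how
a generic layer might skip methods that are left unset.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for Xen's types; the real definitions differ. */
struct task_struct { int id; void *sched_priv; };
struct task_slice  { struct task_struct *task; long long time_ns; };

struct scheduler_sketch {
    const char  *name;
    const char  *opt_name;
    unsigned int sched_id;
    int  (*init_scheduler)(void);                      /* optional */
    void (*wake_up)(struct task_struct *);             /* optional */
    struct task_slice (*do_schedule)(long long now);   /* required */
};

struct task_struct idle_task = { 0, NULL };

/* Always pick the idle task and let it run for at most 10ms. */
struct task_slice trivial_do_schedule(long long now)
{
    struct task_slice s;
    (void)now;                      /* unused in this trivial policy */
    s.task    = &idle_task;
    s.time_ns = 10000000LL;
    return s;
}

struct scheduler_sketch trivial_sched = {
    .name        = "Trivial example scheduler",
    .opt_name    = "trivial",       /* hypothetical sched= value */
    .sched_id    = 99,              /* hypothetical ID */
    .do_schedule = trivial_do_schedule,
};

/* Hypothetical generic-layer helpers: call optional hooks only if set. */
int generic_init(const struct scheduler_sketch *s)
{
    if (s->do_schedule == NULL)
        return -1;                  /* the one mandatory method */
    return s->init_scheduler ? s->init_scheduler() : 0;
}

void generic_wake(const struct scheduler_sketch *s, struct task_struct *t)
{
    if (s->wake_up)
        s->wake_up(t);              /* without this, t is never queued */
}
```

The sketch also shows why a missing {\tt wake\_up()} is harmless to compile but
harmful in practice: the generic helper silently does nothing for woken tasks.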
439 The fields of the above structure are described in more detail below.
441 \subsubsection{name}
443 The name field should point to a descriptive ASCII string.
445 \subsubsection{opt\_name}
447 This field is the value of the {\tt sched=} boot-time option that will select
448 this scheduler.
450 \subsubsection{sched\_id}
452 This is an integer that uniquely identifies this scheduler. There should be a
453 macro corresponding to this scheduler ID in {\tt <hypervisor-ifs/sched-if.h>}.
455 \subsubsection{init\_scheduler}
457 \paragraph*{Purpose}
459 This is a function for performing any scheduler-specific initialisation. For
460 instance, it might allocate memory for per-CPU scheduler data and initialise it
461 appropriately.
463 \paragraph*{Call environment}
465 This function is called after the initialisation performed by the generic
466 layer. The function is called exactly once, for the scheduler that has been
467 selected.
469 \paragraph*{Return values}
471 This should return negative on failure --- this will cause an
472 immediate panic and the system will fail to boot.
474 \subsubsection{alloc\_task}
476 \paragraph*{Purpose}
477 Called when a {\tt task\_struct} is allocated by the generic scheduler
478 layer. A particular scheduler implementation may use this method to
479 allocate per-task data for this task. It may use the {\tt
480 sched\_priv} pointer in the {\tt task\_struct} to point to this data.
482 \paragraph*{Call environment}
483 The generic layer guarantees that the {\tt sched\_priv} field will
484 remain intact from the time this method is called until the task is
485 deallocated (so long as the scheduler implementation does not change
486 it explicitly!).
488 \paragraph*{Return values}
489 Negative on failure.
491 \subsubsection{add\_task}
493 \paragraph*{Purpose}
495 Called when a task is initially added by the generic layer.
497 \paragraph*{Call environment}
499 The fields in the {\tt task\_struct} are now filled out and available for use.
500 Schedulers should implement appropriate initialisation of any per-task private
501 information in this method.
503 \subsubsection{free\_task}
505 \paragraph*{Purpose}
507 Schedulers should free the space used by any associated private data
508 structures.
510 \paragraph*{Call environment}
512 This is called when a {\tt task\_struct} is about to be deallocated.
513 The generic layer will have done generic task removal operations and
514 (if implemented) called the scheduler's {\tt rem\_task} method before
515 this method is called.
517 \subsubsection{rem\_task}
519 \paragraph*{Purpose}
521 This is called when a task is being removed from scheduling (but is
522 not yet being freed).
524 \subsubsection{wake\_up}
526 \paragraph*{Purpose}
528 Called when a task is woken up, this method should put the task on the runqueue
529 (or do the scheduler-specific equivalent action).
531 \paragraph*{Call environment}
535 \subsubsection{do\_block}
537 \paragraph*{Purpose}
539 This function is called when a task is blocked. This function should
540 not remove the task from the runqueue.
542 \paragraph*{Call environment}
544 The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to
545 TASK\_INTERRUPTIBLE on entry to this method. A call to the {\tt
546 do\_schedule} method will be made after this method returns, in
547 order to select the next task to run.
549 \subsubsection{do\_schedule}
551 This method must be implemented.
553 \paragraph*{Purpose}
555 The method is called each time a new task must be chosen for scheduling on the
556 current CPU. The current time is passed as the single argument (the current
557 task can be found using the {\tt current} macro).
559 This method should select the next task to run on this CPU and set its minimum
560 time to run as well as returning the data described below.
562 This method should also take the appropriate action if the previous
563 task has blocked, e.g. removing it from the runqueue.
565 \paragraph*{Call environment}
567 The other fields in the {\tt task\_struct} are updated by the generic layer,
568 which also performs all Xen-specific tasks and performs the actual task switch
569 (unless the previous task has been chosen again).
571 This method is called with the {\tt schedule\_lock} held for the current CPU
572 and local interrupts disabled.
574 \paragraph*{Return values}
576 Must return a {\tt struct task\_slice} describing what task to run and how long
577 for (at maximum).
579 \subsubsection{control}
581 \paragraph*{Purpose}
583 This method is called for global scheduler control operations. It takes a
584 pointer to a {\tt struct sched\_ctl\_cmd}, which it should either
585 source data from or populate with data, depending on the value of the
586 {\tt direction} field.
588 \paragraph*{Call environment}
590 The generic layer guarantees that when this method is called, the
591 caller has specified the correct control interface version and scheduler ID, hence
592 the scheduler's implementation does not need to sanity-check these
593 parts of the call.
595 \paragraph*{Return values}
597 This function should return the value to be passed back to user space, hence it
598 should either be 0 or an appropriate errno value.
600 \subsubsection{adjdom}
602 \paragraph*{Purpose}
604 This method is called to adjust the scheduling parameters of a particular
605 domain, or to query their current values. The function should check
606 the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in
607 order to determine which of these operations is being performed.
609 \paragraph*{Call environment}
611 The generic layer guarantees that the caller has specified the correct
612 control interface version and scheduler ID and that the supplied {\tt
613 task\_struct} will not be deallocated during the call (hence it is not
616 \paragraph*{Return values}
618 This function should return the value to be passed back to user space, hence it
619 should either be 0 or an appropriate errno value.
621 \subsubsection{reschedule}
623 \paragraph*{Purpose}
625 This method is called to determine if a reschedule is required as a result of a
626 particular task.
628 \paragraph*{Call environment}
629 The generic layer will cause a reschedule if the current domain is the idle
630 task or it has exceeded its minimum time slice before a reschedule. The
631 generic layer guarantees that the task passed is not currently running but is
632 on the runqueue.
634 \paragraph*{Return values}
636 Should return a mask of CPUs to cause a reschedule on.
638 \subsubsection{dump\_settings}
640 \paragraph*{Purpose}
642 If implemented, this should dump any private global settings for this
643 scheduler to the console.
645 \paragraph*{Call environment}
647 This function is called with interrupts enabled.
649 \subsubsection{dump\_cpu\_state}
651 \paragraph*{Purpose}
653 This method should dump any private settings for the specified CPU.
655 \paragraph*{Call environment}
657 This function is called with interrupts disabled and the {\tt schedule\_lock}
658 for the specified CPU held.
660 \subsubsection{dump\_runq\_el}
662 \paragraph*{Purpose}
664 This method should dump any private settings for the specified task.
666 \paragraph*{Call environment}
668 This function is called with interrupts disabled and the {\tt schedule\_lock}
669 for the task's CPU held.
672 \chapter{Debugging}
674 Xen provides tools for debugging both Xen and guest OSes. Currently, the
675 Pervasive Debugger provides a GDB stub, which provides facilities for symbolic
676 debugging of Xen itself and of OS kernels running on top of Xen. The Trace
677 Buffer provides a lightweight means to log data about Xen's internal state and
678 behaviour at runtime, for later analysis.
680 \section{Pervasive Debugger}
682 Information on using the pervasive debugger is available in pdb.txt.
685 \section{Trace Buffer}
687 The trace buffer provides a means to observe Xen's operation from domain 0.
688 Trace events, inserted at key points in Xen's code, record data that can be
689 read by the {\tt xentrace} tool. Recording these events has a low overhead
690 and hence the trace buffer may be useful for debugging timing-sensitive
691 behaviours.
693 \subsection{Internal API}
695 To use the trace buffer functionality from within Xen, you must {\tt \#include
696 <xen/trace.h>}, which contains definitions related to the trace buffer. Trace
697 events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1,
698 2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional
699 (32-bit) data as their arguments. For trace buffer-enabled builds of Xen these
700 will insert the event ID and data into the trace buffer, along with the current
701 value of the CPU cycle-counter. For builds without the trace buffer enabled,
702 the macros expand to no-ops and thus can be left in place without incurring
703 overheads.
705 \subsection{Trace-enabled builds}
707 By default, the trace buffer is enabled only in debug builds (i.e. {\tt NDEBUG}
708 is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER},
709 either in {\tt <xen/config.h>} or on the gcc command line.
711 The size (in pages) of the per-CPU trace buffers can be specified using the
712 {\tt tbuf\_size=n } boot parameter to Xen. If the size is set to 0, the trace
713 buffers will be disabled.
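The compile-time switch described in this section can be sketched in C as
follows. This is not the contents of {\tt <xen/trace.h>}: the macro
{\tt SKETCH\_TRACE\_2D}, the switch {\tt TRACE\_BUFFER\_SKETCH}, the record
layout and the stubbed cycle counter are all hypothetical, and a small
fixed-size buffer that wraps around is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_BUFFER_SKETCH 1   /* set to 0 to compile trace points out */
#define NR_RECS 4u              /* hypothetical buffer size, in records */

struct trace_rec {
    uint64_t cycles;            /* cycle counter when the event occurred */
    uint32_t event;             /* event number                          */
    uint32_t d1, d2;            /* two 32-bit data words                 */
};

struct trace_rec trace_recs[NR_RECS];
unsigned int trace_head;        /* total records written so far */

uint64_t fake_cycle_counter;    /* stand-in for the CPU cycle counter */

void trace_insert(uint32_t event, uint32_t d1, uint32_t d2)
{
    struct trace_rec *r = &trace_recs[trace_head % NR_RECS];

    r->cycles = fake_cycle_counter++;
    r->event  = event;
    r->d1     = d1;
    r->d2     = d2;
    trace_head++;               /* newer records overwrite older ones */
}

#if TRACE_BUFFER_SKETCH
#define SKETCH_TRACE_2D(e, a, b) trace_insert((e), (a), (b))
#else
#define SKETCH_TRACE_2D(e, a, b) ((void)0)   /* expands to a no-op */
#endif
```

With the switch set to 0 the call sites remain valid C but generate no code,
which is why trace points can be left in place in non-tracing builds.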
715 \subsection{Dumping trace data}
717 When running a trace buffer build of Xen, trace data are written continuously
718 into the buffer data areas, with newer data overwriting older data. This data
719 can be captured using the {\tt xentrace} program in Domain 0.
721 The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace
722 buffers into its address space. It then periodically polls all the buffers for
723 new data, dumping out any new records from each buffer in turn. As a result,
724 for machines with multiple (logical) CPUs, the trace buffer output will not be
725 in overall chronological order.
727 The output from {\tt xentrace} can be post-processed using {\tt
728 xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and
729 {\tt xentrace\_format} (used to pretty-print trace data). For the predefined
730 trace points, there is an example format file in {\tt tools/xentrace/formats }.
732 For more information, see the manual pages for {\tt xentrace}, {\tt
733 xentrace\_format} and {\tt xentrace\_cpusplit}.
736 \chapter{Hypervisor calls}
738 \section{ set\_trap\_table(trap\_info\_t *table)}
740 Install trap handler table.
742 \section{ mmu\_update(mmu\_update\_t *req, int count)}
743 Update the page table for the domain. Updates can be batched.
744 The update types are:
746 {\it MMU\_NORMAL\_PT\_UPDATE}:
748 {\it MMU\_UNCHECKED\_PT\_UPDATE}:
750 {\it MMU\_MACHPHYS\_UPDATE}:
752 {\it MMU\_EXTENDED\_COMMAND}:
754 \section{ console\_write(const char *str, int count)}
755 Output buffer str to the console.
757 \section{ set\_gdt(unsigned long *frame\_list, int entries)}
758 Set the global descriptor table - virtualization for lgdt.
760 \section{ stack\_switch(unsigned long ss, unsigned long esp)}
761 Request context switch from hypervisor.
763 \section{ set\_callbacks(unsigned long event\_selector, unsigned long event\_address,
764 unsigned long failsafe\_selector, unsigned long failsafe\_address) }
765 Register OS event processing routine. In Linux both the event\_selector and
766 failsafe\_selector are the kernel's CS. The value event\_address specifies the address for
767 an interrupt handler dispatch routine and failsafe\_address specifies a handler for
768 application faults.
770 \section{ net\_io\_op(netop\_t *op)}
771 Notify hypervisor of updates to transmit and/or receive descriptor rings.
773 \section{ fpu\_taskswitch(void)}
774 Notify hypervisor that FPU registers need to be saved on context switch.
776 \section{ sched\_op(unsigned long op)}
777 Request scheduling operation from hypervisor. The options are: {\it yield},
778 {\it block}, {\it stop}, and {\it exit}. {\it yield} keeps the calling
779 domain run-able but may cause a reschedule if other domains are
780 run-able. {\it block} removes the calling domain from the run queue and the
781 domain sleeps until an event is delivered to it. {\it stop} and {\it exit}
782 should be self-explanatory.
784 \section{ set\_dom\_timer(dom\_timer\_arg\_t *timer\_arg)}
785 Request a timer event to be sent at the specified system time.
787 \section{ dom0\_op(dom0\_op\_t *op)}
788 Administrative domain operations for domain management. The options are:
790 {\it DOM0\_CREATEDOMAIN}: create new domain, specifying the name and memory usage
791 in kilobytes.
793 {\it DOM0\_STARTDOMAIN}: make domain schedulable
795 {\it DOM0\_STOPDOMAIN}: mark domain as unschedulable
797 {\it DOM0\_DESTROYDOMAIN}: deallocate resources associated with the domain
799 {\it DOM0\_GETMEMLIST}: get list of pages used by the domain
801 {\it DOM0\_BUILDDOMAIN}: do final guest OS setup for domain
803 {\it DOM0\_BVTCTL}: adjust scheduler context switch time
807 {\it DOM0\_GETDOMAINFO}: get statistics about the domain
809 {\it DOM0\_GETPAGEFRAMEINFO}:
811 {\it DOM0\_IOPL}: set IO privilege level
813 {\it DOM0\_DEBUG}: interactively call pervasive debugger
815 {\it DOM0\_SETTIME}: set system time
819 {\it DOM0\_PINCPUDOMAIN}: pin domain to a particular CPU
821 {\it DOM0\_GETTBUFS}: get information about the size and location of
822 the trace buffers (only on trace-buffer enabled builds)
824 {\it DOM0\_PHYSINFO}: get information about the host machine
826 {\it DOM0\_PCIDEV\_ACCESS}: modify PCI device access permissions
828 {\it DOM0\_SCHED\_ID}: get the ID of the current Xen scheduler
830 {\it DOM0\_SETDOMAINNAME}: set the name of a domain
832 {\it DOM0\_SETDOMAININITIALMEM}: set initial memory allocation of a domain
834 {\it DOM0\_GETPAGEFRAMEINFO2}:
836 \section{ set\_debugreg(int reg, unsigned long value)}
837 Set debug register reg to value.
839 \section{ get\_debugreg(int reg)}
840 Get the value of debug register reg.
842 \section{ update\_descriptor(unsigned long pa, unsigned long word1, unsigned long word2)}
844 \section{ set\_fast\_trap(int idx)}
845 Install traps to allow the guest OS to bypass the hypervisor.
847 \section{ dom\_mem\_op(unsigned int op, void *pages, unsigned long nr\_pages)}
848 Increase or decrease memory reservations for the guest OS.
850 \section{ multicall(multicall\_entry\_t *call\_list, int nr\_calls)}
851 Execute a series of hypervisor calls.
853 \section{ kbd\_op(unsigned char op, unsigned char val)}
855 \section{update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}
857 \section{ event\_channel\_op(unsigned int cmd, unsigned int id)}
858 Inter-domain event-channel management. The options are: open, close, send, and status.
860 \end{document}