\documentclass[11pt,twoside,final,openright]{xenstyle}
\usepackage{a4,graphicx,setspace}
\setstretch{1.15}
\input{style.tex}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{eps/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v1.3 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2004, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
{\large Last updated on 11th March, 2004}
\end{tabular}
\vfill
\end{center}
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
\tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.15}
\chapter{Introduction}
Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned so that multiple different `guest' operating
system images can be run simultaneously.

Virtualizing the machine in this manner provides flexibility, allowing
different users to choose their preferred operating system (Windows,
Linux, FreeBSD, or a custom operating system). Furthermore, Xen provides
secure partitioning between these `domains', and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.

The hypervisor runs directly on server hardware and dynamically partitions
it between a number of {\it domains}, each of which hosts an instance
of a {\it guest operating system}. The hypervisor provides just enough
abstraction of the machine to allow effective isolation and resource
management between these domains.

Xen essentially takes a virtual machine approach as pioneered by IBM VM/370.
However, unlike VM/370 or more recent efforts such as VMWare and Virtual PC,
Xen does not attempt to completely virtualize the underlying hardware. Instead
parts of the hosted guest operating systems are modified to work with the
hypervisor; the operating system is effectively ported to a new target
architecture, typically requiring changes in just the machine-dependent code.
The user-level API is unchanged, thus existing binaries and operating system
distributions can work unmodified.

In addition to exporting virtualized instances of CPU, memory, network and
block devices, Xen exposes a control interface to set how these resources
are shared between the running domains. The control interface is privileged
and may only be accessed by one particular virtual machine: {\it domain0}.
This domain is a required part of any Xen-based server and runs the application
software that manages the control-plane aspects of the platform. Running the
control software in {\it domain0}, distinct from the hypervisor itself, allows
the Xen framework to separate the notions of {\it mechanism} and {\it policy}
within the system.
\chapter{CPU state}

All privileged state must be handled by Xen. The guest OS has no direct access
to CR3 and is not permitted to update privileged bits in EFLAGS.
\chapter{Exceptions}
The IDT is virtualised by submitting a virtual `trap
table' to Xen. Most trap handlers are identical to native x86
handlers. The page-fault handler is a notable exception.
\chapter{Interrupts and events}
Interrupts are virtualized by mapping them to events, which are delivered
asynchronously to the target domain. A guest OS can map these events onto
its standard interrupt dispatch mechanisms, such as a simple vectoring
scheme. Each physical interrupt source controlled by the hypervisor, including
network devices, disks, or the timer subsystem, is responsible for identifying
the target for an incoming interrupt and sending an event to that domain.

This demultiplexing mechanism also provides a device-specific mechanism for
event coalescing or hold-off. For example, a guest OS may request to receive
an event only after {\it n} packets are queued ready for delivery to it, or
{\it t} nanoseconds after the first packet arrived (whichever occurs
first). This allows latency and throughput requirements to be addressed on a
domain-specific basis.
\chapter{Time}
Guest operating systems need to be aware of the passage of real time and their
own ``virtual time'', i.e. the time they have been executing. Furthermore, a
notion of time is required in the hypervisor itself for scheduling and the
activities that relate to it. To this end the hypervisor provides the following
notions of time: cycle counter time, system time, wall clock time, and domain
virtual time.
\section{Cycle counter time}
This provides the finest-grained, free-running time reference, with the
approximate frequency being publicly accessible. The cycle counter time is
used to accurately extrapolate the other time references. On SMP machines
it is currently assumed that the cycle counter time is synchronised between
CPUs. The current x86-based implementation achieves this within inter-CPU
communication latencies.
\section{System time}
This is a 64-bit value containing the nanoseconds elapsed since boot
time. Unlike cycle counter time, system time accurately reflects the
passage of real time, i.e. it is adjusted several times a second for timer
drift. This is done by running an NTP client in {\it domain0} on behalf of
the machine, feeding updates to the hypervisor. Intermediate values can be
extrapolated using the cycle counter.
\section{Wall clock time}
This is the actual ``time of day'', stored as a Unix-style struct timeval
(i.e. seconds and microseconds since 1 January 1970, adjusted by leap seconds
etc.). Again, an NTP client hosted by {\it domain0} can help maintain this
value. Guest operating systems are given this value instead of the hardware
RTC value, and can use the system time and cycle counter time to start and
remain perfectly in time.
\section{Domain virtual time}
This progresses at the same pace as cycle counter time, but only while a
domain is executing. It stops while a domain is de-scheduled. Therefore the
share of the CPU that a domain receives is indicated by the rate at which
its domain virtual time increases, relative to the rate at which cycle
counter time does so.
\section{Time interface}
Xen exports some timestamps to guest operating systems through their shared
info page. Timestamps are provided for system time and wall-clock time. Xen
also provides the cycle counter values at the time of the last update,
allowing guests to calculate the current values. The CPU frequency and a
scaling factor are provided for guests to convert cycle counter values to
real time. Since all time stamps need to be updated and read
\emph{atomically}, two version numbers are also stored in the shared info
page.

Xen will ensure that the time stamps are updated frequently enough to avoid
an overflow of the cycle counter values. A guest can check whether its notion
of time is up-to-date by comparing the version numbers.
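
As an illustration only, the following sketch shows how a guest might read a
consistent snapshot of the exported timestamps using the two version numbers
and then extrapolate system time with the cycle counter. The field and helper
names ({\tt time\_version1}, {\tt time\_version2}, {\tt tsc\_timestamp},
{\tt system\_time}, {\tt rdtsc()}, {\tt cycles\_to\_ns()}) are hypothetical
and not part of the interface described above.

\begin{verbatim}
/* Hypothetical layout of the time fields in the shared info page. */
struct time_info {
    unsigned long time_version1;      /* incremented before an update    */
    unsigned long time_version2;      /* incremented after an update     */
    unsigned long long tsc_timestamp; /* cycle counter at last update    */
    unsigned long long system_time;   /* nanoseconds since boot          */
};

/* Provided elsewhere: read the cycle counter and convert cycles to ns
 * using the exported frequency and scaling factor. */
extern unsigned long long rdtsc(void);
extern unsigned long long cycles_to_ns(unsigned long long cycles);

/* Read a consistent snapshot and extrapolate current system time. */
static unsigned long long get_system_time_ns(volatile struct time_info *t)
{
    unsigned long v1, v2;
    unsigned long long tsc, stime;

    do {
        v1    = t->time_version1;
        tsc   = t->tsc_timestamp;
        stime = t->system_time;
        v2    = t->time_version2;
    } while (v1 != v2);               /* retry if Xen updated mid-read */

    return stime + cycles_to_ns(rdtsc() - tsc);
}
\end{verbatim}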
\section{Timer events}

Xen maintains a periodic timer (currently with a 10ms period) which sends a
timer event to the currently executing domain. This allows Guest OSes to
keep track of the passing of time when executing. The scheduler also
arranges for a newly activated domain to receive a timer event when
scheduled so that the Guest OS can adjust to the passage of time while it
has been inactive.

In addition, Xen exports a hypercall interface to each domain which allows
them to request a timer event sent to them at a specified system
time. Guest OSes may use this timer to implement timeout values when they
block.
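
A minimal sketch of how a guest might use this interface when blocking with a
timeout is shown below. The {\tt HYPERVISOR\_set\_dom\_timer()} and
{\tt HYPERVISOR\_sched\_op()} wrappers correspond to the {\tt set\_dom\_timer}
and {\tt sched\_op} hypercalls described later in this manual; the layout of
{\tt dom\_timer\_arg\_t} and the {\tt SCHEDOP\_block} value shown here are
assumptions for illustration only.

\begin{verbatim}
/* Assumed argument structure: the absolute system time (ns since boot)
 * at which the timer event should be delivered. */
typedef struct { unsigned long long expiry_system_time; } dom_timer_arg_t;

/* Hypercall wrappers, assumed to be provided by the guest OS port. */
extern int HYPERVISOR_set_dom_timer(dom_timer_arg_t *arg);
extern int HYPERVISOR_sched_op(unsigned long op);
#define SCHEDOP_block 1   /* illustrative value only */

/* Block the calling domain for at most 'timeout_ns' nanoseconds. */
static void block_with_timeout(unsigned long long now_ns,
                               unsigned long long timeout_ns)
{
    dom_timer_arg_t t = { .expiry_system_time = now_ns + timeout_ns };

    HYPERVISOR_set_dom_timer(&t);       /* request a wake-up event      */
    HYPERVISOR_sched_op(SCHEDOP_block); /* sleep until an event arrives */
}
\end{verbatim}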
\chapter{Memory}

The hypervisor is responsible for providing memory to each of the domains
running over it. However, the Xen hypervisor's duty is restricted to managing
physical memory and to policing page table updates. All other memory management
functions are handled externally. Start-of-day issues such as building initial
page tables for a domain, loading its kernel image and so on are done by the
{\it domain builder} running in user space within {\it domain0}. Paging to disk
and swapping are handled by the guest operating systems themselves, if they
need them.

On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It has full
access to the physical memory available in the system and is responsible for
allocating portions of it to the domains. Guest operating systems run in and use
{\it rings 1}, {\it 2} and {\it 3} as they see fit, aside from the fact that
segmentation is used to prevent the guest OS from accessing a portion of the
linear address space that is reserved for use by the hypervisor. This approach
allows transitions between the guest OS and hypervisor without flushing the TLB.
We expect most guest operating systems will use ring 1 for their own operation
and place applications (if they support such a notion) in ring 3.
\section{Physical Memory Allocation}
The hypervisor reserves a small fixed portion of physical memory at system boot
time. This special memory region is located at the beginning of physical memory
and is mapped at the very top of every virtual address space.

Any physical memory that is not used directly by the hypervisor is divided into
pages and is available for allocation to domains. The hypervisor tracks which
pages are free and which pages have been allocated to each domain. When a new
domain is initialized, the hypervisor allocates it pages drawn from the free
list. The amount of memory required by the domain is passed to the hypervisor
as one of the parameters for new domain initialization by the domain builder.

Domains can never be allocated further memory beyond that which was requested
for them on initialization. However, a domain can return pages to the hypervisor
if it discovers that its memory requirements have diminished.
% put reasons for why pages might be returned here.
\section{Page Table Updates}
In addition to managing physical memory allocation, the hypervisor is also in
charge of performing page table updates on behalf of the domains. This is
necessary to prevent domains from adding arbitrary mappings to their own page
tables or introducing mappings to other domains' page tables.
\section{Segment Descriptor Tables}

On boot a guest is supplied with a default GDT, which is {\em not}
taken from its own memory allocation. If the guest wishes to use segments
other than the default `flat' ring-1 and ring-3 segments that this default
table provides, it must register a custom GDT and/or LDT with Xen,
allocated from its own memory.

int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})

{\em frame\_list}: An array of up to 16 page frames within which the GDT
resides. Any frame registered as a GDT frame may only be mapped
read-only within the guest's address space (e.g., no writeable
mappings, no use as a page-table page, and so on).

{\em entries}: The number of descriptor-entry slots in the GDT. Note that
the table must be large enough to contain Xen's reserved entries; thus
we must have `{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}'. Note also that,
after registering the GDT, slots {\em FIRST\_} through
{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and may be
overwritten by Xen.
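
The sketch below illustrates how a guest OS might register a custom GDT using
the call shown above. The {\tt HYPERVISOR\_set\_gdt()} stub and the
{\tt virt\_to\_frame()} helper are assumptions for illustration; a real guest
port would use whatever hypercall stubs and address translation it provides.

\begin{verbatim}
#define GDT_ENTRIES 512   /* illustrative size; must exceed Xen's
                             LAST_RESERVED_GDT_ENTRY                */

/* Assumed hypercall stub corresponding to set_gdt() above. */
extern int HYPERVISOR_set_gdt(unsigned long *frame_list, int entries);

/* Assumed helper: map a guest virtual address to its page frame number. */
extern unsigned long virt_to_frame(void *va);

/* The guest's GDT, page-aligned and allocated from its own memory. */
static unsigned long long gdt[GDT_ENTRIES]
        __attribute__((aligned(4096)));

static int register_custom_gdt(void)
{
    unsigned long frames[1];

    /* One page holds 512 eight-byte descriptors. */
    frames[0] = virt_to_frame(gdt);

    /* After this call the GDT page becomes read-only to the guest and
     * Xen's reserved slots must not be touched. */
    return HYPERVISOR_set_gdt(frames, GDT_ENTRIES);
}
\end{verbatim}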
\section{Pseudo-Physical Memory}
The usual problem of external fragmentation means that a domain is unlikely to
receive a contiguous stretch of physical memory. However, most guest operating
systems do not have built-in support for operating in a fragmented physical
address space; e.g. Linux has to have a one-to-one mapping for its physical
memory. Therefore a notion of {\it pseudo-physical memory} is introduced.
Once a domain is allocated a number of pages, one of the first things it needs
to do at start of day is build its own {\it real physical} to
{\it pseudo-physical} mapping. From that moment onwards {\it pseudo-physical}
addresses are used instead of discontiguous {\it real physical} addresses. Thus,
the rest of the guest OS code has the impression of operating in a contiguous
address space. Guest OS page tables contain real physical addresses. Mapping
{\it pseudo-physical} to {\it real physical} addresses is needed on page
table updates and also when remapping memory regions within the guest OS.
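
A minimal sketch of the two lookup tables a guest might maintain for this
purpose is shown below. The table and function names are hypothetical; the
interface only requires that the guest can translate in both directions.

\begin{verbatim}
/* Hypothetical per-domain translation tables, one entry per page frame.
 * phys_to_machine[] : pseudo-physical frame -> real (machine) frame
 * machine_to_phys[] : real (machine) frame  -> pseudo-physical frame
 */
extern unsigned long phys_to_machine[];
extern unsigned long machine_to_phys[];

/* Convert a pseudo-physical address to the real physical address that
 * must be placed in a page table entry. */
static unsigned long pseudo_to_real(unsigned long pseudo_addr)
{
    unsigned long frame  = pseudo_addr >> 12;      /* 4KB pages */
    unsigned long offset = pseudo_addr & 0xfff;
    return (phys_to_machine[frame] << 12) | offset;
}

/* Convert a real physical address (e.g. read back from a page table)
 * into the contiguous pseudo-physical view used by the rest of the OS. */
static unsigned long real_to_pseudo(unsigned long real_addr)
{
    unsigned long frame  = real_addr >> 12;
    unsigned long offset = real_addr & 0xfff;
    return (machine_to_phys[frame] << 12) | offset;
}
\end{verbatim}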
\chapter{Network I/O}
Since the hypervisor must multiplex network resources, its network subsystem
may be viewed as a virtual network switching element with each domain having
one or more virtual network interfaces to this network.

The hypervisor acts conceptually as an IP router, forwarding each domain's
traffic according to a set of rules.

\section{Hypervisor Packet Handling}
The hypervisor is responsible primarily for {\it data-path} operations.
In terms of networking this means packet transmission and reception.

On the transmission side, the hypervisor needs to perform the following key
actions:
\begin{itemize}
\item {\tt Validation:} A domain is only allowed to emit packets matching a certain
specification; for example, ones in which the source IP address matches
one assigned to the virtual interface over which it is sent. The hypervisor
is responsible for ensuring any such requirements are met, either by checking
or by stamping outgoing packets with prescribed values for certain fields.

\item {\tt Scheduling:} Since a number of domains can share a single ``real'' network
interface, the hypervisor must mediate access when several domains each
have packets queued for transmission. Of course, this general scheduling
function subsumes basic shaping or rate-limiting schemes.

\item {\tt Logging and Accounting:} The hypervisor can be configured with classifier
rules that control how packets are accounted or logged. For example,
{\it domain0} could request that it receives a log message or copy of the
packet whenever another domain attempts to send a TCP packet containing a
SYN.
\end{itemize}
On the receive side, the hypervisor's role is relatively straightforward:
once a packet is received, it just needs to determine the virtual interface(s)
to which it must be delivered and deliver it via page-flipping.
\section{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for transmit,
the other for receive. Each descriptor identifies a block of contiguous
physical memory allocated to the domain. There are four cases:

\begin{itemize}

\item The transmit ring carries packets to transmit from the domain to the
hypervisor.

\item The return path of the transmit ring carries ``empty'' descriptors
indicating that the contents have been transmitted and the memory can be
re-used.

\item The receive ring carries empty descriptors from the domain to the
hypervisor; these provide storage space for that domain's received packets.

\item The return path of the receive ring carries packets that have been
received.
\end{itemize}
Real physical addresses are used throughout, with the domain performing
translation from pseudo-physical addresses if that is necessary.

If a domain does not keep its receive ring stocked with empty buffers then
packets destined for it may be dropped. This provides some defense against
receiver-livelock problems because an overloaded domain will cease to receive
further data. Similarly, on the transmit path, it provides the application
with feedback on the rate at which packets are able to leave the system.

Synchronization between the hypervisor and the domain is achieved using
counters held in shared memory that is accessible to both. Each ring has
associated producer and consumer indices indicating the area in the ring
that holds descriptors that contain data. After receiving {\it n} packets
or {\it t} nanoseconds after receiving the first packet, the hypervisor sends
an event to the domain.
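
The producer/consumer discipline described above might look like the following
sketch. The structure layout and names ({\tt RING\_SIZE}, {\tt rx\_ring},
{\tt req\_prod}, {\tt resp\_prod}, {\tt resp\_cons}) are purely illustrative
assumptions, not the actual shared-memory layout used by Xen.

\begin{verbatim}
#define RING_SIZE 256                 /* illustrative ring size */

/* Hypothetical shared-memory view of one receive ring. */
struct rx_ring {
    unsigned int req_prod;            /* domain: empty buffers produced    */
    unsigned int resp_prod;           /* hypervisor: packets delivered     */
    unsigned int resp_cons;           /* domain: delivered packets consumed */
    unsigned long buffer_frame[RING_SIZE]; /* one descriptor per slot      */
};

/* Domain side: post an empty buffer so that packets are not dropped. */
static int post_rx_buffer(struct rx_ring *r, unsigned long frame)
{
    if (r->req_prod - r->resp_cons >= RING_SIZE)
        return -1;                    /* ring full                         */
    r->buffer_frame[r->req_prod % RING_SIZE] = frame;
    r->req_prod++;                    /* hypervisor sees the new buffer    */
    return 0;
}

/* Domain side: consume any packets returned by the hypervisor. */
static void consume_rx(struct rx_ring *r, void (*deliver)(unsigned long))
{
    while (r->resp_cons != r->resp_prod) {
        deliver(r->buffer_frame[r->resp_cons % RING_SIZE]);
        r->resp_cons++;
    }
}
\end{verbatim}

A real implementation would additionally need memory barriers around the index
updates and would rely on the event mechanism described above, rather than
polling, to learn when new responses are available.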
\chapter{Block I/O}

\section{Virtual Block Devices (VBDs)}

All guest OS disk access goes through the VBD interface. The VBD interface
provides the administrator with the ability to selectively grant domains
access to portions of block storage devices visible to the system.

A VBD can also be composed of a set of extents from multiple storage devices.
This provides the same functionality as a concatenated disk driver.

\section{Virtual Disks (VDs)}

VDs are an abstraction built on top of the VBD interface. One can reserve disk
space for use by the VD layer. This space is then managed as a pool of free extents.
The VD tools can automatically allocate collections of extents from this pool to
create ``virtual disks'' on demand.
\subsection{Virtual Disk Management}
The VD management code consists of a set of Python libraries. It can therefore
be accessed by custom scripts as well as the convenience scripts provided. The
VD database is a SQLite database in /var/db/xen\_vdisks.sqlite.

The VD scripts and general VD usage are documented in VBD-HOWTO.txt.
\subsection{Data Transfer}
Domains which have been granted access to a logical block device are permitted
to read and write it directly through the hypervisor, rather than requiring
{\it domain0} to mediate every data access.

In overview, the same style of descriptor ring that is used for network
packets is used here. Each domain has one ring that carries operation requests to the
hypervisor and carries the results back again.

Rather than copying data in and out of the hypervisor, we use page pinning to
enable DMA transfers directly between the physical device and the domain's
buffers. Disk read operations are straightforward; the hypervisor just needs
to know which pages have pending DMA transfers, and to prevent the guest OS from
giving those pages back to the hypervisor or using them to store page tables.

%block API here
\chapter{Privileged operations}
{\it Domain0} is responsible for building all other domains on the server
and providing control interfaces for managing scheduling, networking, and
block devices.
\chapter{CPU Scheduler}

Xen offers a uniform API for CPU schedulers. It is possible to choose
from a number of schedulers at boot and it should be easy to add more.

\paragraph*{Note: SMP host support}
Xen has always supported SMP host systems. Domains are statically assigned to
CPUs, either at creation time or when manually pinning to a particular CPU.
The current schedulers then run locally on each CPU to decide which of the
assigned domains should be run there.
\section{Standard Schedulers}

The BVT, Atropos and Round Robin schedulers are part of the normal
Xen distribution. BVT provides proportional fair shares of the CPU to
the running domains. Atropos can be used to reserve absolute shares
of the CPU for each domain. Round-robin is provided as an example of
Xen's internal scheduler API.

More information on the characteristics and use of these schedulers is
available in {\tt Sched-HOWTO.txt}.
\section{Scheduling API}

The scheduling API is used by both the schedulers described above and should
also be used by any new schedulers. It provides a generic interface and also
implements much of the ``boilerplate'' code.

Schedulers conforming to this API are described by the following
structure:
\begin{verbatim}
struct scheduler
{
    char *name;             /* full name for this scheduler   */
    char *opt_name;         /* option name for this scheduler */
    unsigned int sched_id;  /* ID for this scheduler          */

    int          (*init_scheduler) ();
    int          (*alloc_task)     (struct task_struct *);
    void         (*add_task)       (struct task_struct *);
    void         (*free_task)      (struct task_struct *);
    void         (*rem_task)       (struct task_struct *);
    void         (*wake_up)        (struct task_struct *);
    void         (*do_block)       (struct task_struct *);
    task_slice_t (*do_schedule)    (s_time_t);
    int          (*control)        (struct sched_ctl_cmd *);
    int          (*adjdom)         (struct task_struct *,
                                    struct sched_adjdom_cmd *);
    s32          (*reschedule)     (struct task_struct *);
    void         (*dump_settings)  (void);
    void         (*dump_cpu_state) (int);
    void         (*dump_runq_el)   (struct task_struct *);
};
\end{verbatim}
The only method that {\em must} be implemented is
{\tt do\_schedule()}. However, if there is no implementation of the
{\tt wake\_up()} method then waking tasks will not be put on the runqueue!

The fields of the above structure are described in more detail below.
\subsubsection{name}

The name field should point to a descriptive ASCII string.

\subsubsection{opt\_name}

This field is the value of the {\tt sched=} boot-time option that will select
this scheduler.
\subsubsection{sched\_id}

This is an integer that uniquely identifies this scheduler. There should be a
macro corresponding to this scheduler ID in {\tt <hypervisor-ifs/sched-if.h>}.
\subsubsection{init\_scheduler}

\paragraph*{Purpose}

This is a function for performing any scheduler-specific initialisation. For
instance, it might allocate memory for per-CPU scheduler data and initialise it
appropriately.

\paragraph*{Call environment}

This function is called after the initialisation performed by the generic
layer. The function is called exactly once, for the scheduler that has been
selected.

\paragraph*{Return values}

This should return negative on failure --- this will cause an
immediate panic and the system will fail to boot.

\subsubsection{alloc\_task}

\paragraph*{Purpose}
Called when a {\tt task\_struct} is allocated by the generic scheduler
layer. A particular scheduler implementation may use this method to
allocate per-task data for this task. It may use the {\tt
sched\_priv} pointer in the {\tt task\_struct} to point to this data.

\paragraph*{Call environment}
The generic layer guarantees that the {\tt sched\_priv} field will
remain intact from the time this method is called until the task is
deallocated (so long as the scheduler implementation does not change
it explicitly!).

\paragraph*{Return values}
Negative on failure.

\subsubsection{add\_task}

\paragraph*{Purpose}

Called when a task is initially added by the generic layer.

\paragraph*{Call environment}

The fields in the {\tt task\_struct} are now filled out and available for use.
Schedulers should implement appropriate initialisation of any per-task private
information in this method.

\subsubsection{free\_task}

\paragraph*{Purpose}

Schedulers should free the space used by any associated private data
structures.

\paragraph*{Call environment}

This is called when a {\tt task\_struct} is about to be deallocated.
The generic layer will have done generic task removal operations and
(if implemented) called the scheduler's {\tt rem\_task} method before
this method is called.

\subsubsection{rem\_task}

\paragraph*{Purpose}

This is called when a task is being removed from scheduling (but is
not yet being freed).

\subsubsection{wake\_up}

\paragraph*{Purpose}

Called when a task is woken up, this method should put the task on the runqueue
(or do the scheduler-specific equivalent action).

\paragraph*{Call environment}

The task is already set to state RUNNING.

\subsubsection{do\_block}

\paragraph*{Purpose}

This function is called when a task is blocked. This function should
not remove the task from the runqueue.

\paragraph*{Call environment}

The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to
TASK\_INTERRUPTIBLE on entry to this method. A call to the {\tt
do\_schedule} method will be made after this method returns, in
order to select the next task to run.
\subsubsection{do\_schedule}

This method must be implemented.

\paragraph*{Purpose}

The method is called each time a new task must be chosen for scheduling on the
current CPU. The current time is passed as the single argument (the current
task can be found using the {\tt current} macro).

This method should select the next task to run on this CPU and set its minimum
time to run as well as returning the data described below.

This method should also take the appropriate action if the previous
task has blocked, e.g. removing it from the runqueue.

\paragraph*{Call environment}

The other fields in the {\tt task\_struct} are updated by the generic layer,
which also performs all Xen-specific tasks and performs the actual task switch
(unless the previous task has been chosen again).

This method is called with the {\tt schedule\_lock} held for the current CPU
and local interrupts disabled.
\paragraph*{Return values}

Must return a {\tt struct task\_slice} describing what task to run and how long
for (at maximum).
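
As an illustration of these requirements, a minimal round-robin style
{\tt do\_schedule()} might look like the sketch below. The layout of
{\tt task\_slice\_t} (a task pointer plus a time slice), the runqueue helpers
({\tt task\_runnable()}, {\tt runq\_remove()}, {\tt runq\_next()}) and the
fixed 10ms slice are assumptions for illustration only.

\begin{verbatim}
/* Assumed helpers provided by the scheduler implementation. */
extern int runq_remove(struct task_struct *t);
extern struct task_struct *runq_next(void);   /* this CPU's runqueue */
extern int task_runnable(struct task_struct *t);

static task_slice_t example_do_schedule(s_time_t now)
{
    struct task_struct *prev = current;   /* 'current' macro, as above */
    task_slice_t ret;

    /* If the previously running task has blocked, take it off this
     * CPU's runqueue (do_block() must not do this itself). */
    if (!task_runnable(prev))
        runq_remove(prev);

    /* Pick the next runnable task for this CPU; an implementation
     * would fall back to the idle task if the runqueue is empty. */
    ret.task = runq_next();
    ret.time = (s_time_t)10000000;        /* 10ms maximum slice, in ns */

    return ret;
}
\end{verbatim}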
\subsubsection{control}

\paragraph*{Purpose}

This method is called for global scheduler control operations. It takes a
pointer to a {\tt struct sched\_ctl\_cmd}, which it should either
source data from or populate with data, depending on the value of the
{\tt direction} field.
\paragraph*{Call environment}

The generic layer guarantees that when this method is called, the
caller selected the correct scheduler ID, hence
the scheduler's implementation does not need to sanity-check these
parts of the call.
\paragraph*{Return values}

This function should return the value to be passed back to user space, hence it
should either be 0 or an appropriate errno value.
\subsubsection{sched\_adjdom}

\paragraph*{Purpose}

This method is called to adjust the scheduling parameters of a particular
domain, or to query their current values. The function should check
the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in
order to determine which of these operations is being performed.

\paragraph*{Call environment}

The generic layer guarantees that the caller has specified the correct
control interface version and scheduler ID and that the supplied {\tt
task\_struct} will not be deallocated during the call (hence it is not
necessary to {\tt get\_task\_struct}).

\paragraph*{Return values}

This function should return the value to be passed back to user space, hence it
should either be 0 or an appropriate errno value.
\subsubsection{reschedule}

\paragraph*{Purpose}

This method is called to determine if a reschedule is required as a result of a
particular task.

\paragraph*{Call environment}
The generic layer will cause a reschedule if the current domain is the idle
task or it has exceeded its minimum time slice before a reschedule. The
generic layer guarantees that the task passed is not currently running but is
on the runqueue.

\paragraph*{Return values}

Should return a mask of CPUs to cause a reschedule on.
\subsubsection{dump\_settings}

\paragraph*{Purpose}

If implemented, this should dump any private global settings for this
scheduler to the console.

\paragraph*{Call environment}

This function is called with interrupts enabled.

\subsubsection{dump\_cpu\_state}

\paragraph*{Purpose}

This method should dump any private settings for the specified CPU.

\paragraph*{Call environment}

This function is called with interrupts disabled and the {\tt schedule\_lock}
for the specified CPU held.

\subsubsection{dump\_runq\_el}

\paragraph*{Purpose}

This method should dump any private settings for the specified task.

\paragraph*{Call environment}

This function is called with interrupts disabled and the {\tt schedule\_lock}
for the task's CPU held.
\chapter{Debugging}

Xen provides tools for debugging both Xen and guest OSes. Currently, the
Pervasive Debugger provides a GDB stub, which provides facilities for symbolic
debugging of Xen itself and of OS kernels running on top of Xen. The Trace
Buffer provides a lightweight means to log data about Xen's internal state and
behaviour at runtime, for later analysis.

\section{Pervasive Debugger}

Information on using the pervasive debugger is available in pdb.txt.

\section{Trace Buffer}

The trace buffer provides a means to observe Xen's operation from domain 0.
Trace events, inserted at key points in Xen's code, record data that can be
read by the {\tt xentrace} tool. Recording these events has a low overhead
and hence the trace buffer may be useful for debugging timing-sensitive
behaviours.
\subsection{Internal API}

To use the trace buffer functionality from within Xen, you must {\tt \#include
<xen/trace.h>}, which contains definitions related to the trace buffer. Trace
events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1,
2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional
(32-bit) data values as their arguments. For trace buffer-enabled builds of Xen
these will insert the event ID and data into the trace buffer, along with the
current value of the CPU cycle-counter. For builds without the trace buffer
enabled, the macros expand to no-ops and thus can be left in place without
incurring overheads.
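
For example, a trace point recording two data values might look like the
following sketch; the event number, function and values traced here are made
up purely for illustration.

\begin{verbatim}
#include <xen/trace.h>

#define TRC_EXAMPLE_SCHED_IN  0x0001   /* illustrative event number */

void example_context_switch(unsigned int domid, unsigned int cpu)
{
    /* Record the domain being switched in and the CPU it runs on.
     * Expands to a no-op in builds without the trace buffer. */
    TRACE_2D(TRC_EXAMPLE_SCHED_IN, domid, cpu);
}
\end{verbatim}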
\subsection{Trace-enabled builds}

By default, the trace buffer is enabled only in debug builds (i.e. {\tt NDEBUG}
is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER},
either in {\tt <xen/config.h>} or on the gcc command line.

The size (in pages) of the per-CPU trace buffers can be specified using the
{\tt tbuf\_size=n} boot parameter to Xen. If the size is set to 0, the trace
buffers will be disabled.
\subsection{Dumping trace data}

When running a trace buffer build of Xen, trace data are written continuously
into the buffer data areas, with newer data overwriting older data. This data
can be captured using the {\tt xentrace} program in Domain 0.

The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace
buffers into its address space. It then periodically polls all the buffers for
new data, dumping out any new records from each buffer in turn. As a result,
for machines with multiple (logical) CPUs, the trace buffer output will not be
in overall chronological order.

The output from {\tt xentrace} can be post-processed using {\tt
xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and
{\tt xentrace\_format} (used to pretty-print trace data). For the predefined
trace points, there is an example format file in {\tt tools/xentrace/formats}.

For more information, see the manual pages for {\tt xentrace}, {\tt
xentrace\_format} and {\tt xentrace\_cpusplit}.
\chapter{Hypervisor calls}

\section{ set\_trap\_table(trap\_info\_t *table)}

Install trap handler table.

\section{ mmu\_update(mmu\_update\_t *req, int count)}
Update the page table for the domain. Updates can be batched.
The update types are:

{\it MMU\_NORMAL\_PT\_UPDATE}:

{\it MMU\_UNCHECKED\_PT\_UPDATE}:

{\it MMU\_MACHPHYS\_UPDATE}:

{\it MMU\_EXTENDED\_COMMAND}:
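
The batching aspect of this call might be used as in the sketch below. The
{\tt HYPERVISOR\_mmu\_update()} stub and the assumption that each
{\tt mmu\_update\_t} carries a ({\tt ptr}, {\tt val}) pair are for illustration
only; consult the hypervisor interface headers for the real layout.

\begin{verbatim}
/* Assumed request layout: 'ptr' identifies the page-table entry to
 * update (plus type bits) and 'val' is the new entry contents. */
typedef struct { unsigned long ptr, val; } mmu_update_t;

extern int HYPERVISOR_mmu_update(mmu_update_t *req, int count);

#define BATCH 16

/* Queue updates locally and submit them in one hypercall. */
static mmu_update_t queue[BATCH];
static int queued;

static void queue_pt_update(unsigned long ptr, unsigned long val)
{
    queue[queued].ptr = ptr;
    queue[queued].val = val;
    if (++queued == BATCH) {
        HYPERVISOR_mmu_update(queue, queued);  /* flush the batch */
        queued = 0;
    }
}
\end{verbatim}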
\section{ console\_write(const char *str, int count)}
Output buffer str to the console.

\section{ set\_gdt(unsigned long *frame\_list, int entries)}
Set the global descriptor table --- the virtualization of lgdt.

\section{ stack\_switch(unsigned long ss, unsigned long esp)}
Request context switch from hypervisor.

\section{ set\_callbacks(unsigned long event\_selector, unsigned long event\_address,
unsigned long failsafe\_selector, unsigned long failsafe\_address) }
Register OS event processing routine. In Linux both the event\_selector and
failsafe\_selector are the kernel's CS. The value event\_address specifies the address for
an interrupt handler dispatch routine and failsafe\_address specifies a handler for
application faults.

\section{ net\_io\_op(netop\_t *op)}
Notify hypervisor of updates to transmit and/or receive descriptor rings.

\section{ fpu\_taskswitch(void)}
Notify hypervisor that the FPU registers need to be saved on context switch.
\section{ sched\_op(unsigned long op)}
Request scheduling operation from hypervisor. The options are: {\it yield},
{\it block}, {\it stop}, and {\it exit}. {\it yield} keeps the calling
domain runnable but may cause a reschedule if other domains are
runnable. {\it block} removes the calling domain from the run queue and the
domain sleeps until an event is delivered to it. {\it stop} and {\it exit}
should be self-explanatory.

\section{ set\_dom\_timer(dom\_timer\_arg\_t *timer\_arg)}
Request a timer event to be sent at the specified system time.
\section{ dom0\_op(dom0\_op\_t *op)}
Administrative domain operations for domain management. The options are:

{\it DOM0\_CREATEDOMAIN}: create new domain, specifying the name and memory usage
in kilobytes.

{\it DOM0\_STARTDOMAIN}: make domain schedulable

{\it DOM0\_STOPDOMAIN}: mark domain as unschedulable

{\it DOM0\_DESTROYDOMAIN}: deallocate resources associated with the domain

{\it DOM0\_GETMEMLIST}: get list of pages used by the domain

{\it DOM0\_BUILDDOMAIN}: do final guest OS setup for domain

{\it DOM0\_BVTCTL}: adjust scheduler context switch time

{\it DOM0\_ADJUSTDOM}: adjust scheduling priorities for domain

{\it DOM0\_GETDOMAINFO}: get statistics about the domain

{\it DOM0\_GETPAGEFRAMEINFO}:

{\it DOM0\_IOPL}:

{\it DOM0\_MSR}:

{\it DOM0\_DEBUG}: interactively call pervasive debugger

{\it DOM0\_SETTIME}: set system time

{\it DOM0\_READCONSOLE}: read console content from hypervisor buffer ring

{\it DOM0\_PINCPUDOMAIN}: pin domain to a particular CPU

{\it DOM0\_GETTBUFS}: get information about the size and location of
the trace buffers (only on trace-buffer enabled builds)

{\it DOM0\_PHYSINFO}: get information about the host machine

{\it DOM0\_PCIDEV\_ACCESS}: modify PCI device access permissions
\section{network\_op(network\_op\_t *op)}
Update the network ruleset.

\section{ block\_io\_op(block\_io\_op\_t *op)}

\section{ set\_debugreg(int reg, unsigned long value)}
Set debug register reg to value.

\section{ get\_debugreg(int reg)}
Get the debug register reg.

\section{ update\_descriptor(unsigned long pa, unsigned long word1, unsigned long word2)}

\section{ set\_fast\_trap(int idx)}
Install traps to allow the guest OS to bypass the hypervisor.

\section{ dom\_mem\_op(dom\_mem\_op\_t *op)}
Increase or decrease memory reservations for the guest OS.

\section{ multicall(multicall\_entry\_t *call\_list, int nr\_calls)}
Execute a series of hypervisor calls.

\section{ kbd\_op(unsigned char op, unsigned char val)}

\section{update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}

\section{ event\_channel\_op(unsigned int cmd, unsigned int id)}
Inter-domain event-channel management; options are: open, close, send, and status.

\end{document}