\documentclass[11pt,twoside,final,openright]{xenstyle}
\usepackage{a4,graphicx,setspace}
\setstretch{1.15}
\input{style.tex}

\begin{document}

% TITLE PAGE
\pagestyle{empty}
\begin{center}
\vspace*{\fill}
\includegraphics{eps/xenlogo.eps}
\vfill
\vfill
\vfill
\begin{tabular}{l}
{\Huge \bf Interface manual} \\[4mm]
{\huge Xen v1.3 for x86} \\[80mm]

{\Large Xen is Copyright (c) 2004, The Xen Team} \\[3mm]
{\Large University of Cambridge, UK} \\[20mm]
{\large Last updated on 11th March, 2004}
\end{tabular}
\vfill
\end{center}
\cleardoublepage

% TABLE OF CONTENTS
\pagestyle{plain}
\pagenumbering{roman}
{ \parskip 0pt plus 1pt
\tableofcontents }
\cleardoublepage

% PREPARE FOR MAIN TEXT
\pagenumbering{arabic}
\raggedbottom
\widowpenalty=10000
\clubpenalty=10000
\parindent=0pt
\renewcommand{\topfraction}{.8}
\renewcommand{\bottomfraction}{.8}
\renewcommand{\textfraction}{.2}
\renewcommand{\floatpagefraction}{.8}
\setstretch{1.15}
\chapter{Introduction}
Xen allows the hardware resources of a machine to be virtualized and
dynamically partitioned, allowing multiple different `guest' operating
system images to be run simultaneously.

Virtualizing the machine in this manner provides flexibility, allowing
different users to choose their preferred operating system (Windows,
Linux, FreeBSD, or a custom operating system). Furthermore, Xen provides
secure partitioning between these `domains', and enables better resource
accounting and QoS isolation than can be achieved with a conventional
operating system.

The hypervisor runs directly on server hardware and dynamically partitions
it between a number of {\it domains}, each of which hosts an instance
of a {\it guest operating system}. The hypervisor provides just enough
abstraction of the machine to allow effective isolation and resource
management between these domains.

Xen essentially takes a virtual machine approach as pioneered by IBM VM/370.
However, unlike VM/370 or more recent efforts such as VMWare and Virtual PC,
Xen does not attempt to completely virtualize the underlying hardware. Instead,
parts of the hosted guest operating systems are modified to work with the
hypervisor; the operating system is effectively ported to a new target
architecture, typically requiring changes in just the machine-dependent code.
The user-level API is unchanged, so existing binaries and operating system
distributions work unmodified.

In addition to exporting virtualized instances of CPU, memory, network and
block devices, Xen exposes a control interface to set how these resources
are shared between the running domains. The control interface is privileged
and may only be accessed by one particular virtual machine: {\it domain0}.
This domain is a required part of any Xen-based server and runs the application
software that manages the control-plane aspects of the platform. Running the
control software in {\it domain0}, distinct from the hypervisor itself, allows
the Xen framework to separate the notions of {\it mechanism} and {\it policy}
within the system.
\chapter{CPU state}

All privileged state must be handled by Xen. The guest OS has no direct access
to CR3 and is not permitted to update privileged bits in EFLAGS.

\chapter{Exceptions}
The IDT is virtualized by submitting a virtual `trap
table' to Xen. Most trap handlers are identical to native x86
handlers. The page-fault handler is a notable exception.
\chapter{Interrupts and events}
Interrupts are virtualized by mapping them to events, which are delivered
asynchronously to the target domain. A guest OS can map these events onto
its standard interrupt dispatch mechanisms, such as a simple vectoring
scheme. For each physical interrupt source controlled by the hypervisor
(network devices, disks, the timer subsystem, and so on), the hypervisor is
responsible for identifying the target domain for an incoming interrupt and
sending an event to that domain.

This demultiplexing mechanism also provides a device-specific mechanism for
event coalescing or hold-off. For example, a guest OS may request to receive
an event only after {\it n} packets are queued ready for delivery to it, or
{\it t} nanoseconds after the first packet arrived, whichever occurs first.
This allows latency and throughput requirements to be addressed on a
domain-specific basis.
\chapter{Time}
Guest operating systems need to be aware of the passage of real time and their
own ``virtual time'', i.e.\ the time they have been executing. Furthermore, a
notion of time is required in the hypervisor itself for scheduling and the
activities that relate to it. To this end the hypervisor provides four notions
of time: cycle counter time, system time, wall clock time, and domain virtual
time.
\section{Cycle counter time}
This provides the finest-grained, free-running time reference, with the
approximate frequency being publicly accessible. The cycle counter time is
used to accurately extrapolate the other time references. On SMP machines
it is currently assumed that the cycle counter time is synchronised between
CPUs. The current x86-based implementation achieves this within inter-CPU
communication latencies.
\section{System time}
This is a 64-bit value containing the nanoseconds elapsed since boot
time. Unlike cycle counter time, system time accurately reflects the
passage of real time, i.e.\ it is adjusted several times a second to correct
for timer drift. This is done by running an NTP client in {\it domain0} on
behalf of the machine, feeding updates to the hypervisor. Intermediate values
can be extrapolated using the cycle counter.
\section{Wall clock time}
This is the actual ``time of day'', in the style of a Unix {\tt struct
timeval} (i.e.\ seconds and microseconds since 1 January 1970, adjusted by
leap seconds etc.). Again, an NTP client hosted by {\it domain0} can help
maintain this value. Guest operating systems are given this value in place of
the hardware RTC value, and can combine it with the system time and cycle
counter to obtain, and subsequently maintain, an accurate time of day.
\section{Domain virtual time}
This progresses at the same pace as cycle counter time, but only while a
domain is executing. It stops while a domain is de-scheduled. Therefore the
share of the CPU that a domain receives is indicated by the rate at which
its domain virtual time increases, relative to the rate at which cycle
counter time does so.
\section{Time interface}
Xen exports some timestamps to guest operating systems through their shared
info page. Timestamps are provided for system time and wall-clock time. Xen
also provides the cycle counter values at the time of the last update,
allowing guests to calculate the current values. The CPU frequency and a
scaling factor are provided for guests to convert cycle counter values to
real time. Since all time stamps need to be updated and read
\emph{atomically}, two version numbers are also stored in the shared info
page.

Xen will ensure that the time stamps are updated frequently enough to avoid
an overflow of the cycle counter values. A guest can check whether its notion
of time is up-to-date by comparing the version numbers.
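As an illustration, the snippet below shows how a guest might use the two
version numbers to obtain a consistent snapshot of the exported timestamps.
The structure layout and field names used here ({\tt time\_version1},
{\tt tsc\_timestamp}, {\tt system\_time} and so on) are placeholders assumed
for the sketch rather than the actual shared info page layout; only the
read--retry pattern matters.

\begin{verbatim}
/* Illustrative sketch only: field names are placeholders, not the
 * actual shared info layout. The idea is to re-read the block until
 * both version numbers match, proving no update raced with the read. */
typedef struct {
    volatile unsigned long v1;              /* bumped before an update */
    volatile unsigned long v2;              /* bumped after an update  */
    volatile unsigned long long tsc_timestamp; /* cycles at last update */
    volatile unsigned long long system_time;   /* ns since boot at last update */
} shared_time_t;

static void read_time_snapshot(shared_time_t *s,
                               unsigned long long *tsc,
                               unsigned long long *stime)
{
    unsigned long v1, v2;
    do {
        v1     = s->v1;
        *tsc   = s->tsc_timestamp;
        *stime = s->system_time;
        v2     = s->v2;
    } while (v1 != v2);   /* retry if an update overlapped the read */
}
\end{verbatim}
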
\section{Timer events}

Xen maintains a periodic timer (currently with a 10ms period) which sends a
timer event to the currently executing domain. This allows guest OSes to
keep track of the passage of time while they are executing. The scheduler also
arranges for a newly activated domain to receive a timer event when it is
scheduled, so that the guest OS can adjust to the passage of time while it
has been inactive.

In addition, Xen exports a hypercall interface to each domain which allows
a domain to request that a timer event be sent to it at a specified system
time. Guest OSes may use this timer to implement timeout values when they
block.
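For example, a guest might combine this hypercall with the {\it block}
scheduling operation (see the chapter on hypervisor calls) to sleep with a
timeout. The wrapper names and the contents of {\tt dom\_timer\_arg\_t} in
the sketch below are assumptions made for illustration, not definitions taken
from the Xen headers.

\begin{verbatim}
/* Sketch of a timed block, assuming hypothetical guest-side wrappers
 * HYPERVISOR_set_dom_timer() and HYPERVISOR_sched_op() around the
 * hypercalls described later in this manual, and assuming that
 * dom_timer_arg_t simply carries the target system time in ns. */
#define SCHEDOP_block 1   /* placeholder value, not the real encoding */

typedef struct { unsigned long long expires_ns; } dom_timer_arg_t;

extern int HYPERVISOR_set_dom_timer(dom_timer_arg_t *arg);
extern int HYPERVISOR_sched_op(unsigned long op);
extern unsigned long long read_system_time_ns(void); /* e.g. shared info */

static void block_with_timeout(unsigned long long timeout_ns)
{
    dom_timer_arg_t t = { read_system_time_ns() + timeout_ns };
    HYPERVISOR_set_dom_timer(&t);       /* timer event at that system time */
    HYPERVISOR_sched_op(SCHEDOP_block); /* sleep until any event arrives   */
}
\end{verbatim}
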
\chapter{Memory}

The hypervisor is responsible for providing memory to each of the domains
running over it. However, the Xen hypervisor's duty is restricted to managing
physical memory and to policing page table updates. All other memory
management functions are handled externally. Start-of-day issues such as
building initial page tables for a domain, loading its kernel image and so on
are done by the {\it domain builder} running in user space within
{\it domain0}. Paging to disk and swapping are handled by the guest operating
systems themselves, if they need them.

On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It has full
access to the physical memory available in the system and is responsible for
allocating portions of it to the domains. Guest operating systems run in and use
{\it rings 1}, {\it 2} and {\it 3} as they see fit, aside from the fact that
segmentation is used to prevent the guest OS from accessing a portion of the
linear address space that is reserved for use by the hypervisor. This approach
allows transitions between the guest OS and hypervisor without flushing the TLB.
We expect most guest operating systems will use ring 1 for their own operation
and place applications (if they support such a notion) in ring 3.
\section{Physical Memory Allocation}
The hypervisor reserves a small fixed portion of physical memory at system boot
time. This special memory region is located at the beginning of physical memory
and is mapped at the very top of every virtual address space.

Any physical memory that is not used directly by the hypervisor is divided into
pages and is available for allocation to domains. The hypervisor tracks which
pages are free and which pages have been allocated to each domain. When a new
domain is initialized, the hypervisor allocates it pages drawn from the free
list. The amount of memory required by the domain is passed to the hypervisor
as one of the parameters for new domain initialization by the domain builder.

Domains can never be allocated further memory beyond that which was requested
for them on initialization. However, a domain can return pages to the hypervisor
if it discovers that its memory requirements have diminished.
% put reasons for why pages might be returned here.
\section{Page Table Updates}
In addition to managing physical memory allocation, the hypervisor is also in
charge of performing page table updates on behalf of the domains. This is
necessary to prevent domains from adding arbitrary mappings to their own page
tables or introducing mappings into other domains' page tables.
\section{Segment Descriptor Tables}

At boot time a guest is supplied with a default GDT, which is {\em not}
taken from its own memory allocation. If the guest wishes to use segments
other than the default `flat' ring-1 and ring-3 segments that this default
table provides, it must register a custom GDT and/or LDT with Xen,
allocated from its own memory.

int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})

{\em frame\_list}: An array of up to 16 page frames within which the GDT
resides. Any frame registered as a GDT frame may only be mapped
read-only within the guest's address space (e.g., no writeable
mappings, no use as a page-table page, and so on).

{\em entries}: The number of descriptor-entry slots in the GDT. Note that
the table must be large enough to contain Xen's reserved entries; thus
we must have {\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}. Note also that,
after registering the GDT, slots {\em FIRST\_} through
{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and may be
overwritten by Xen.
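A minimal sketch of registering a custom GDT is shown below. The wrapper
{\tt HYPERVISOR\_set\_gdt} and the helper {\tt virt\_to\_machine\_frame} are
hypothetical guest-side names assumed for the example; the frames in the list
must already be mapped read-only, as required above.

\begin{verbatim}
/* Sketch only: HYPERVISOR_set_gdt() and virt_to_machine_frame() are
 * hypothetical guest-side helpers. LAST_RESERVED_GDT_ENTRY is whatever
 * constant the Xen headers define; its value is not assumed here. */
#define GDT_ENTRIES 512            /* must exceed LAST_RESERVED_GDT_ENTRY */

static unsigned long long gdt[GDT_ENTRIES]
    __attribute__((aligned(4096)));     /* 8-byte descriptors, one page */

extern int HYPERVISOR_set_gdt(unsigned long *frame_list, int entries);
extern unsigned long virt_to_machine_frame(void *va);

static int install_custom_gdt(void)
{
    unsigned long frames[1];

    /* Fill gdt[] with descriptors, leaving Xen's reserved slots alone. */

    frames[0] = virt_to_machine_frame(gdt);
    /* The GDT frames must be mapped read-only before Xen accepts them. */
    return HYPERVISOR_set_gdt(frames, GDT_ENTRIES);
}
\end{verbatim}
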
\section{Pseudo-Physical Memory}
The usual problem of external fragmentation means that a domain is unlikely to
receive a contiguous stretch of physical memory. However, most guest operating
systems do not have built-in support for operating in a fragmented physical
address space; e.g.\ Linux has to have a one-to-one mapping for its physical
memory. Therefore a notion of {\it pseudo-physical memory} is introduced.
Once a domain is allocated a number of pages, at start of day one of the first
things it needs to do is build its own mapping between {\it real physical} and
{\it pseudo-physical} addresses. From that moment onwards {\it pseudo-physical}
addresses are used instead of the discontiguous {\it real physical} addresses.
Thus, the rest of the guest OS code has the impression of operating in a
contiguous address space. Guest OS page tables contain real physical
addresses. Mapping {\it pseudo-physical} to {\it real physical} addresses is
needed on page table updates and also when remapping memory regions within the
guest OS.
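As a sketch, the forward mapping might be kept as a simple array indexed by
pseudo-physical frame number, built from the list of machine frames handed to
the domain at start of day. The names below are illustrative assumptions, not
an interface defined by this manual.

\begin{verbatim}
/* Illustrative mapping from pseudo-physical to real physical addresses.
 * 'allocated_frames' is assumed to be the start-of-day list of machine
 * frames given to the domain, indexed by pseudo-physical frame number. */
#define PAGE_SHIFT 12

static unsigned long *pseudo_to_real;  /* pseudo-phys frame -> machine frame */
static unsigned long  nr_frames;

static void build_pseudo_phys_map(unsigned long *allocated_frames,
                                  unsigned long count)
{
    pseudo_to_real = allocated_frames; /* frame i backs pseudo-phys frame i */
    nr_frames = count;
}

/* Translate a pseudo-physical address into the real physical address that
 * must appear in page table entries handed to the hypervisor. */
static unsigned long pseudo_to_real_addr(unsigned long pseudo_addr)
{
    unsigned long pfn = pseudo_addr >> PAGE_SHIFT;
    return (pseudo_to_real[pfn] << PAGE_SHIFT) |
           (pseudo_addr & ((1UL << PAGE_SHIFT) - 1));
}
\end{verbatim}
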
\chapter{Network I/O}
Since the hypervisor must multiplex network resources, its network subsystem
may be viewed as a virtual network switching element with each domain having
one or more virtual network interfaces to this network.

The hypervisor acts conceptually as an IP router, forwarding each domain's
traffic according to a set of rules.
\section{Hypervisor Packet Handling}
The hypervisor is responsible primarily for {\it data-path} operations.
In terms of networking this means packet transmission and reception.

On the transmission side, the hypervisor needs to perform the following key actions:
\begin{itemize}
\item {\tt Validation:} A domain is only allowed to emit packets matching a certain
specification; for example, ones in which the source IP address matches
one assigned to the virtual interface over which it is sent. The hypervisor
is responsible for ensuring any such requirements are met, either by checking
or by stamping outgoing packets with prescribed values for certain fields.

\item {\tt Scheduling:} Since a number of domains can share a single ``real'' network
interface, the hypervisor must mediate access when several domains each
have packets queued for transmission. Of course, this general scheduling
function subsumes basic shaping or rate-limiting schemes.

\item {\tt Logging and Accounting:} The hypervisor can be configured with classifier
rules that control how packets are accounted or logged. For example,
{\it domain0} could request that it receives a log message or a copy of the
packet whenever another domain attempts to send a TCP packet containing a
SYN.
\end{itemize}
On the receive side, the hypervisor's role is relatively straightforward:
once a packet is received, it just needs to determine the virtual interface(s)
to which it must be delivered and deliver it via page-flipping.
\section{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for transmit,
the other for receive. Each descriptor identifies a block of contiguous
physical memory allocated to the domain. There are four cases:

\begin{itemize}

\item The transmit ring carries packets to transmit from the domain to the
hypervisor.

\item The return path of the transmit ring carries ``empty'' descriptors
indicating that the contents have been transmitted and the memory can be
re-used.

\item The receive ring carries empty descriptors from the domain to the
hypervisor; these provide storage space for that domain's received packets.

\item The return path of the receive ring carries packets that have been
received.
\end{itemize}

Real physical addresses are used throughout, with the domain performing
translation from pseudo-physical addresses if that is necessary.

If a domain does not keep its receive ring stocked with empty buffers then
packets destined for it may be dropped. This provides some defense against
receiver-livelock problems because an overloaded domain will cease to receive
further data. Similarly, on the transmit path, it provides the application
with feedback on the rate at which packets are able to leave the system.

Synchronization between the hypervisor and the domain is achieved using
counters held in shared memory that is accessible to both. Each ring has
associated producer and consumer indices indicating the area in the ring
that holds descriptors that contain data. After receiving {\it n} packets
or {\it t} nanoseconds after receiving the first packet, the hypervisor sends
an event to the domain.
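The following sketch illustrates the producer--consumer arrangement. The
structure layout, field names and ring size are assumptions made for the
example rather than the actual Xen ring format, and a real implementation
would also need appropriate memory barriers around the index updates.

\begin{verbatim}
/* Illustrative lock-free ring: one side only advances 'prod', the other
 * only advances 'cons'. Layout and names are assumptions for the sketch,
 * not the real Xen 1.3 ring ABI. */
#define RING_SIZE 256    /* power of two, so masking works below */

typedef struct {
    unsigned long addr;  /* real physical address of the buffer */
    unsigned int  len;   /* length of the packet / buffer       */
} net_desc_t;

typedef struct {
    volatile unsigned int prod;   /* advanced by the producer */
    volatile unsigned int cons;   /* advanced by the consumer */
    net_desc_t ring[RING_SIZE];
} net_ring_t;

static int ring_put(net_ring_t *r, net_desc_t d)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;                           /* ring full */
    r->ring[r->prod & (RING_SIZE - 1)] = d;
    r->prod++;                               /* publish after the descriptor */
    return 0;
}

static int ring_get(net_ring_t *r, net_desc_t *d)
{
    if (r->cons == r->prod)
        return -1;                           /* ring empty */
    *d = r->ring[r->cons & (RING_SIZE - 1)];
    r->cons++;
    return 0;
}
\end{verbatim}
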
\chapter{Block I/O}

\section{Virtual Block Devices (VBDs)}

All guest OS disk access goes through the VBD interface. The VBD interface
provides the administrator with the ability to selectively grant domains
access to portions of the block storage devices visible to the system.

A VBD can also be composed of a set of extents from multiple storage devices.
This provides the same functionality as a concatenated disk driver.
\section{Virtual Disks (VDs)}

VDs are an abstraction built on top of the VBD interface. One can reserve disk
space for use by the VD layer. This space is then managed as a pool of free
extents. The VD tools can automatically allocate collections of extents from
this pool to create ``virtual disks'' on demand.

\subsection{Virtual Disk Management}
The VD management code consists of a set of Python libraries. It can therefore
be accessed by custom scripts as well as by the convenience scripts provided.
The VD database is a SQLite database in {\tt /var/db/xen\_vdisk.sqlite}.

The VD scripts and general VD usage are documented in VBD-HOWTO.txt.
\subsection{Data Transfer}
Domains which have been granted access to a logical block device are permitted
to read and write it directly through the hypervisor, rather than requiring
{\it domain0} to mediate every data access.

In overview, the same style of descriptor ring that is used for network
packets is used here. Each domain has one ring that carries operation requests
to the hypervisor and carries the results back again.

Rather than copying data in and out of the hypervisor, we use page pinning to
enable DMA transfers directly between the physical device and the domain's
buffers. Disk read operations are straightforward; the hypervisor just needs
to know which pages have pending DMA transfers, and prevent the guest OS from
giving those pages back to the hypervisor or using them to store page tables.

%block API here
\chapter{Privileged operations}
{\it Domain0} is responsible for building all other domains on the server
and providing control interfaces for managing scheduling, networking, and
block devices.
\chapter{Debugging}

Xen provides tools for debugging both Xen and guest OSes. Currently, the
Pervasive Debugger provides a GDB stub, which provides facilities for symbolic
debugging of Xen itself and of OS kernels running on top of Xen. The Trace
Buffer provides a lightweight means to log data about Xen's internal state and
behaviour at runtime, for later analysis.

\section{Pervasive Debugger}

Information on using the pervasive debugger is available in pdb.txt.

\section{Trace Buffer}

The trace buffer provides a means to observe Xen's operation from domain 0.
Trace events, inserted at key points in Xen's code, record data that can be
read by the {\tt xentrace} tool. Recording these events has a low overhead
and hence the trace buffer may be useful for debugging timing-sensitive
behaviours.

\subsection{Internal API}

To use the trace buffer functionality from within Xen, you must {\tt \#include
<xeno/trace.h>}, which contains definitions related to the trace buffer. Trace
events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1,
2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional
32-bit data words, as their arguments. For trace-buffer-enabled builds of Xen
these will insert the event ID and data into the trace buffer, along with the
current value of the CPU cycle counter. For builds without the trace buffer
enabled, the macros expand to no-ops and thus can be left in place without
incurring overheads.
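A typical call site might therefore look like the following. The event number
used is made up for illustration and is not one of Xen's real event IDs.

\begin{verbatim}
/* Hypothetical call site inside Xen: the event number 0x0101 is made up.
 * With the trace buffer enabled this records the event ID, the two data
 * words and the current cycle counter; otherwise it compiles to nothing. */
#include <xeno/trace.h>

void do_something(unsigned long domain_id, unsigned long pages)
{
    TRACE_2D(0x0101, domain_id, pages);
    /* ... rest of the operation ... */
}
\end{verbatim}
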
\subsection{Enabling tracing}

By default, the trace buffer is enabled only in debug builds (i.e.\ {\tt NDEBUG}
is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER},
either in {\tt <xeno/config.h>} or on the gcc command line.

\subsection{Dumping trace data}

When running a trace buffer build of Xen, trace data are written continuously
into the buffer data areas, with newer data overwriting older data. This data
can be captured using the {\tt xentrace} program in Domain 0.

The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace
buffers into its address space. It then periodically polls all the buffers for
new data, dumping out any new records from each buffer in turn. As a result,
for machines with multiple (logical) CPUs, the trace buffer output will not be
in overall chronological order.

The output from {\tt xentrace} can be post-processed using {\tt
xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and
{\tt xentrace\_format} (used to pretty-print trace data).

For more information, see the manual pages for {\tt xentrace},
{\tt xentrace\_format} and {\tt xentrace\_cpusplit}.
\chapter{Hypervisor calls}

\section{set\_trap\_table(trap\_info\_t *table)}

Install the trap handler table.

\section{mmu\_update(mmu\_update\_t *req, int count)}
Update the domain's page tables. Updates can be batched.
The update types are:

{\it MMU\_NORMAL\_PT\_UPDATE}:

{\it MMU\_UNCHECKED\_PT\_UPDATE}:

{\it MMU\_MACHPHYS\_UPDATE}:

{\it MMU\_EXTENDED\_COMMAND}:
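The batching might be used as sketched below. The two-word layout assumed for
{\tt mmu\_update\_t} and the wrapper name {\tt HYPERVISOR\_mmu\_update} are
illustrative assumptions; the real request encoding and the meaning of the
update types are defined by the Xen headers, not by this sketch.

\begin{verbatim}
/* Sketch of batching page-table updates. For illustration only,
 * mmu_update_t is assumed to be a pair of words naming the page-table
 * entry to modify and the new value to write. */
typedef struct {
    unsigned long ptr;   /* which page-table entry to update        */
    unsigned long val;   /* the new (real physical) entry contents  */
} mmu_update_t;

extern int HYPERVISOR_mmu_update(mmu_update_t *req, int count);

#define BATCH 16
static mmu_update_t queue[BATCH];
static int queued;

static void queue_pt_update(unsigned long ptr, unsigned long val)
{
    queue[queued].ptr = ptr;
    queue[queued].val = val;
    if (++queued == BATCH) {              /* flush when the batch is full */
        HYPERVISOR_mmu_update(queue, queued);
        queued = 0;
    }
}

static void flush_pt_updates(void)  /* call before relying on the updates */
{
    if (queued) {
        HYPERVISOR_mmu_update(queue, queued);
        queued = 0;
    }
}
\end{verbatim}
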
\section{console\_write(const char *str, int count)}
Output the buffer {\em str} to the console.

\section{set\_gdt(unsigned long *frame\_list, int entries)}
Set the global descriptor table; the virtualization of {\tt lgdt}.

\section{stack\_switch(unsigned long ss, unsigned long esp)}
Request a stack switch from the hypervisor.

\section{set\_callbacks(unsigned long event\_selector, unsigned long event\_address,
unsigned long failsafe\_selector, unsigned long failsafe\_address) }
Register OS event processing routines. In Linux both event\_selector and
failsafe\_selector are the kernel's CS. The value event\_address specifies the
address of an interrupt handler dispatch routine and failsafe\_address
specifies a handler for application faults.
\section{net\_io\_op(netop\_t *op)}
Notify the hypervisor of updates to the transmit and/or receive descriptor rings.

\section{fpu\_taskswitch(void)}
Notify the hypervisor that the FPU registers need to be saved on the next
context switch.

\section{sched\_op(unsigned long op)}
Request a scheduling operation from the hypervisor. The options are: {\it yield},
{\it block}, {\it stop}, and {\it exit}. {\it yield} keeps the calling
domain runnable but may cause a reschedule if other domains are
runnable. {\it block} removes the calling domain from the run queue and the
domain sleeps until an event is delivered to it. {\it stop} and {\it exit}
should be self-explanatory.
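As an illustration, a guest's idle loop might be built on {\it block} as
sketched below; the operation encoding and the wrapper name are placeholders
assumed for the example, not the real values from the Xen headers.

\begin{verbatim}
/* Sketch of an idle loop built on sched_op. The blocked domain is woken
 * again by any event, including the periodic timer event. */
#define SCHEDOP_block 1   /* placeholder encoding */

extern int HYPERVISOR_sched_op(unsigned long op);
extern int work_pending(void);   /* guest-specific check, assumed here */

static void idle_loop(void)
{
    for (;;) {
        if (work_pending())
            return;                          /* go back to doing work */
        HYPERVISOR_sched_op(SCHEDOP_block);  /* sleep until an event arrives */
    }
}
\end{verbatim}
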
\section{set\_dom\_timer(dom\_timer\_arg\_t *timer\_arg)}
Request a timer event to be sent at the specified system time.

\section{dom0\_op(dom0\_op\_t *op)}
Administrative domain operations for domain management. The options are:

{\it DOM0\_CREATEDOMAIN}: create a new domain, specifying its name and memory
usage in kilobytes.

{\it DOM0\_STARTDOMAIN}: make a domain schedulable

{\it DOM0\_STOPDOMAIN}: mark a domain as unschedulable

{\it DOM0\_DESTROYDOMAIN}: deallocate the resources associated with a domain

{\it DOM0\_GETMEMLIST}: get the list of pages used by a domain

{\it DOM0\_BUILDDOMAIN}: do final guest OS setup for a domain

{\it DOM0\_BVTCTL}: adjust the scheduler's context switch time

{\it DOM0\_ADJUSTDOM}: adjust the scheduling priorities of a domain

{\it DOM0\_GETDOMAINFO}: get statistics about a domain

{\it DOM0\_GETPAGEFRAMEINFO}:

{\it DOM0\_IOPL}:

{\it DOM0\_MSR}:

{\it DOM0\_DEBUG}: interactively invoke the pervasive debugger

{\it DOM0\_SETTIME}: set the system time

{\it DOM0\_READCONSOLE}: read console content from the hypervisor buffer ring

{\it DOM0\_PINCPUDOMAIN}: pin a domain to a particular CPU

{\it DOM0\_GETTBUFS}: get information about the size and location of
the trace buffers (only on trace-buffer enabled builds)

{\it DOM0\_PHYSINFO}: get information about the host machine

{\it DOM0\_PCIDEV\_ACCESS}: modify PCI device access permissions
\section{network\_op(network\_op\_t *op)}
Update the network ruleset.

\section{block\_io\_op(block\_io\_op\_t *op)}

\section{set\_debugreg(int reg, unsigned long value)}
Set debug register {\em reg} to {\em value}.

\section{get\_debugreg(int reg)}
Return the value of debug register {\em reg}.

\section{update\_descriptor(unsigned long pa, unsigned long word1, unsigned long word2)}

\section{set\_fast\_trap(int idx)}
Install a fast trap handler, allowing the guest OS to bypass the hypervisor for
the specified trap.

\section{dom\_mem\_op(dom\_mem\_op\_t *op)}
Increase or decrease the memory reservation of the guest OS.

\section{multicall(multicall\_entry\_t *call\_list, int nr\_calls)}
Execute a series of hypervisor calls in one batch.
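A sketch of such a batch is shown below; the layout assumed for
{\tt multicall\_entry\_t}, the wrapper name and the hypercall numbers are
purely illustrative, not the definitions from the Xen headers.

\begin{verbatim}
/* Sketch of batching hypercalls with multicall. For illustration only,
 * multicall_entry_t is assumed to hold a hypercall number plus argument
 * words, and HYPERVISOR_multicall() is an assumed guest-side wrapper. */
typedef struct {
    unsigned long op;       /* hypercall number            */
    unsigned long args[5];  /* arguments to that hypercall */
} multicall_entry_t;

extern int HYPERVISOR_multicall(multicall_entry_t *call_list, int nr_calls);

/* Hypothetical hypercall numbers, for the shape of the example only. */
#define HYPERCALL_stack_switch   3
#define HYPERCALL_fpu_taskswitch 7

static void example_batch(unsigned long ss, unsigned long esp)
{
    multicall_entry_t calls[2] = {
        { HYPERCALL_stack_switch,   { ss, esp } },
        { HYPERCALL_fpu_taskswitch, { 0 } },
    };
    /* One trap into the hypervisor executes both calls in order. */
    HYPERVISOR_multicall(calls, 2);
}
\end{verbatim}
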
\section{kbd\_op(unsigned char op, unsigned char val)}

\section{update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}

\section{event\_channel\_op(unsigned int cmd, unsigned int id)}
Inter-domain event-channel management; the options are: open, close, send, and
status.

\end{document}