changeset 1054:8fbde9a89a05

bitkeeper revision 1.693 (40132757EiZr-olEQuLGDqktxmfp6g)

Documentation upgrade - interface document filled in by Kip Macy.
author kaf24@scramble.cl.cam.ac.uk
date Sun Jan 25 02:17:59 2004 +0000 (2004-01-25)
parents 4d8a0cc41eb6
children aadd0dc51c45
files docs/interface.tex
line diff
     1.1 --- a/docs/interface.tex	Sat Jan 24 07:22:03 2004 +0000
     1.2 +++ b/docs/interface.tex	Sun Jan 25 02:17:59 2004 +0000
     1.3 @@ -15,10 +15,11 @@
     1.4  \vfill
     1.5  \begin{tabular}{l}
     1.6  {\Huge \bf Interface manual} \\[4mm]
     1.7 -{\huge Xen v1.1 for x86} \\[80mm]
     1.8 -{\Large Copyright (c) 2003, The Xen Team} \\[3mm]
     1.9 +{\huge Xen v1.3 for x86} \\[80mm]
    1.10 +
    1.11 +{\Large Xen is Copyright (c) 2004, The Xen Team} \\[3mm]
    1.12  {\Large University of Cambridge, UK} \\[20mm]
    1.13 -{\large Last updated on 28th October, 2003}
    1.14 +{\large Last updated on 18th January, 2004}
    1.15  \end{tabular}
    1.16  \vfill
    1.17  \end{center}
    1.18 @@ -44,19 +45,396 @@
    1.19  \setstretch{1.15}
    1.21  \chapter{Introduction}
    1.22 +Xen allows the hardware resources of a machine to be virtualized and
    1.23 +dynamically partitioned, allowing multiple different ``guest''
    1.24 +operating system images to be run simultaneously.
    1.25 +
    1.26 +Virtualizing the machine in this manner provides flexibility, allowing
    1.27 +different users to choose their preferred operating system (Windows,
    1.28 +Linux, FreeBSD, or a custom operating system). Furthermore, Xen provides
    1.29 +secure partitioning between these ``domains'', and enables better resource
    1.30 +accounting and QoS isolation than can be achieved with a conventional
    1.31 +operating system.
    1.32 +
    1.33 +The hypervisor runs directly on server hardware and dynamically partitions
    1.34 +it between a number of {\it domains}, each of which hosts an instance
    1.35 +of a {\it guest operating system}. The hypervisor provides just enough
    1.36 +abstraction of the machine to allow effective isolation and resource 
    1.37 +management between these domains.
    1.38 +
    1.39 +Xen essentially takes a virtual machine approach as pioneered by IBM VM/370.
    1.40 +However, unlike VM/370 or more recent efforts such as VMware and Virtual PC,
    1.41 +Xen does not attempt to completely virtualize the underlying hardware. Instead,
    1.42 +parts of the hosted guest operating systems are modified to work with the
    1.43 +hypervisor; the operating system is effectively ported to a new target
    1.44 +architecture, typically requiring changes in just the machine-dependent code.
    1.45 +The user-level API is unchanged, so existing binaries and operating system
    1.46 +distributions work unmodified.
    1.47 +
    1.48 +In addition to exporting virtualized instances of CPU, memory, network and
    1.49 +block devices, Xen exposes a control interface to set how these resources
    1.50 +are shared between the running domains. The control interface is privileged
    1.51 +and may only be accessed by one particular virtual machine: {\it domain0}.
    1.52 +This domain is a required part of any Xen-based server and runs the application
    1.53 +software that manages the control-plane aspects of the platform. Running the
    1.54 +control software in {\it domain0}, distinct from the hypervisor itself, allows
    1.55 +the Xen framework to separate the notions of {\it mechanism} and {\it policy}
    1.56 +within the system.
    1.57 +
    1.59  \chapter{CPU state}
    1.61 +All privileged state must be handled by Xen. The guest OS has no direct access
    1.62 +to CR3 and is not permitted to update privileged bits in EFLAGS.
    1.63 +
    1.64  \chapter{Exceptions}
    1.65 +The IDT is virtualized by submitting a virtual ``trap
    1.66 +table'' to Xen. Most trap handlers are identical to native x86
    1.67 +handlers. The page-fault handler is a notable exception.
    1.69  \chapter{Interrupts and events}
    1.70 +Interrupts are virtualized by mapping them to events, which are delivered 
    1.71 +asynchronously to the target domain. A guest OS can map these events onto
    1.72 +its standard interrupt dispatch mechanisms, such as a simple vectoring 
    1.73 +scheme. The hypervisor controls each physical interrupt source, including
    1.74 +network devices, disks, and the timer subsystem; it is responsible for
    1.75 +identifying the target domain for an incoming interrupt and sending it an event.
    1.76 +
    1.77 +This demultiplexing mechanism also provides a device-specific means of
    1.78 +event coalescing or hold-off. For example, a guest OS may request to
    1.79 +receive an event only after {\it n} packets are queued ready for delivery
    1.80 +to it, or {\it t} nanoseconds after the first packet arrived, whichever
    1.81 +occurs first. This allows latency and throughput requirements to be addressed on a
    1.82 +domain-specific basis.
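As a minimal sketch of the hold-off rule described above (the structure and field names are illustrative, not Xen's actual interface):

```c
#include <stdint.h>

/* Illustrative per-domain hold-off state; not Xen's actual data layout. */
struct holdoff {
    uint32_t n_threshold;       /* deliver after this many queued packets */
    uint64_t t_ns;              /* ...or this many ns after the first one */
    uint32_t queued;            /* packets currently queued for the domain */
    uint64_t first_arrival_ns;  /* arrival time of the oldest queued packet */
};

/* Returns 1 if an event should be sent to the domain now. */
int should_deliver(const struct holdoff *h, uint64_t now_ns)
{
    if (h->queued == 0)
        return 0;
    return h->queued >= h->n_threshold ||
           now_ns - h->first_arrival_ns >= h->t_ns;
}
```

Either condition firing triggers an event, so a latency-sensitive domain can request a small {\it t} while a throughput-oriented one raises {\it n}.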
    1.84  \chapter{Time}
    1.85 +Guest operating systems need to be aware of the passage of real time and their
    1.86 +own ``virtual time'', i.e. the time they have been executing. Furthermore, a
    1.87 +notion of time is required in the hypervisor itself for scheduling and the
    1.88 +activities that relate to it. To this end the hypervisor provides four notions
    1.89 +of time: cycle counter time, system time, wall clock time, and domain virtual
    1.90 +time.
    1.91 +
    1.92 +
    1.93 +\section{Cycle counter time}
    1.94 +This provides the finest-grained, free-running time reference, with the approximate
    1.95 +frequency being publicly accessible. The cycle counter time is used to accurately
    1.96 +extrapolate the other time references. On SMP machines it is currently assumed
    1.97 +that the cycle counter time is synchronised between CPUs. The current x86-based
    1.98 +implementation achieves this to within inter-CPU communication latencies.
    1.99 +
   1.100 +\section{System time}
   1.101 +This is a 64-bit value containing the nanoseconds elapsed since boot time. Unlike
   1.102 +cycle counter time, system time accurately reflects the passage of real time, i.e.
   1.103 +it is adjusted several times a second for timer drift. This is done by running an
   1.104 +NTP client in {\it domain0} on behalf of the machine, feeding updates to the 
   1.105 +hypervisor. Intermediate values can be extrapolated using the cycle counter. 
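The extrapolation can be sketched as follows; the snapshot structure is hypothetical and stands in for whatever time information the hypervisor exports to guests:

```c
#include <stdint.h>

/* Hypothetical snapshot of hypervisor-provided time information; the field
 * names are illustrative, not Xen's actual shared-memory layout. */
struct time_snapshot {
    uint64_t system_time_ns;   /* system time when the snapshot was taken */
    uint64_t tsc_at_snapshot;  /* cycle counter value at the same instant */
    uint64_t cycles_per_sec;   /* approximate cycle counter frequency */
};

/* Extrapolate the current system time from the free-running cycle counter. */
uint64_t extrapolate_system_time(const struct time_snapshot *s,
                                 uint64_t tsc_now)
{
    uint64_t delta = tsc_now - s->tsc_at_snapshot;
    /* elapsed ns = cycles elapsed / frequency; this simple form can
     * overflow if snapshots are too far apart */
    return s->system_time_ns + (delta * 1000000000ULL) / s->cycles_per_sec;
}
```

Periodic NTP-driven updates from {\it domain0} replace the snapshot, bounding the drift that the extrapolation can accumulate.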
   1.106 +
   1.107 +\section{Wall clock time}
   1.108 +This is the actual ``time of day'', represented as a Unix-style struct timeval
   1.109 +(i.e. seconds and microseconds since 1 January 1970, adjusted by leap seconds
   1.110 +etc.). Again, an NTP client hosted by {\it domain0} can help maintain this
   1.111 +value. Guest operating systems are given this value instead of the hardware
   1.112 +RTC value, and can use the system time and cycle counter to keep
   1.113 +accurate time thereafter.
   1.114 +
   1.115 +
   1.116 +\section{Domain virtual time}
   1.117 +This progresses at the same pace as cycle counter time, but only while a domain
   1.118 +is executing. It stops while a domain is de-scheduled. Therefore the share of the 
   1.119 +CPU that a domain receives is indicated by the rate at which its domain virtual
   1.120 +time increases, relative to the rate at which cycle counter time does so.
   1.122  \chapter{Memory}
   1.124 -\chapter{I/O}
   1.125 +The hypervisor is responsible for providing memory to each of the domains running 
   1.126 +over it. However, the Xen hypervisor's duty is restricted to managing physical
   1.127 +memory and to policing page table updates. All other memory management functions
   1.128 +are handled externally. Start-of-day issues such as building initial page tables
   1.129 +for a domain, loading its kernel image and so on are done by the {\it domain builder}
   1.130 +running in user-space within {\it domain0}. Paging to disk and swapping is handled
   1.131 +by the guest operating systems themselves, if they need it.
   1.132 +
   1.133 +On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It has full
   1.134 +access to the physical memory available in the system and is responsible for 
   1.135 +allocating portions of it to the domains. Guest operating systems run in and use
   1.136 +{\it rings 1}, {\it 2} and {\it 3} as they see fit, aside from the fact that
   1.137 +segmentation is used to prevent the guest OS from accessing a portion of the 
   1.138 +linear address space that is reserved for use by the hypervisor. This approach
   1.139 +allows transitions between the guest OS and hypervisor without flushing the TLB.
   1.140 +We expect most guest operating systems will use ring 1 for their own operation
   1.141 +and place applications (if they support such a notion) in ring 3.
   1.142 +
   1.143 +\section{Physical Memory Allocation}
   1.144 +The hypervisor reserves a small fixed portion of physical memory at system boot
   1.145 +time. This special memory region is located at the beginning of physical memory
   1.146 +and is mapped at the very top of every virtual address space. 
   1.147 +
   1.148 +Any physical memory that is not used directly by the hypervisor is divided into
   1.149 +pages and is available for allocation to domains. The hypervisor tracks which
   1.150 +pages are free and which pages have been allocated to each domain. When a new
   1.151 +domain is initialized, the hypervisor allocates it pages drawn from the free 
   1.152 +list. The amount of memory required by the domain is passed to the hypervisor
   1.153 +as one of the parameters for new domain initialization by the domain builder.
   1.154 +
   1.155 +Domains can never be allocated further memory beyond that which was requested
   1.156 +for them on initialization. However, a domain can return pages to the hypervisor
   1.157 +if it discovers that its memory requirements have diminished.
   1.158 +
   1.159 +% put reasons for why pages might be returned here.
   1.160 +\section{Page Table Updates}
   1.161 +In addition to managing physical memory allocation, the hypervisor is also in
   1.162 +charge of performing page table updates on behalf of the domains. This is 
   1.163 +necessary to prevent domains from adding arbitrary mappings to their own page
   1.164 +tables or introducing mappings into other domains' page tables.
   1.165 +
   1.166 +
   1.167 +
   1.168 +
   1.169 +\section{Pseudo-Physical Memory}
   1.170 +The usual problem of external fragmentation means that a domain is unlikely to
   1.171 +receive a contiguous stretch of physical memory. However, most guest operating
   1.172 +systems do not have built-in support for operating in a fragmented physical
   1.173 +address space; e.g. Linux has to have a one-to-one mapping for its physical
   1.174 +memory. Therefore a notion of {\it pseudo physical memory} is introduced.
   1.175 +Once a domain is allocated a number of pages, one of the first things it
   1.176 +needs to do at start of day is build its own {\it real physical} to
   1.177 +{\it pseudo physical} mapping. From that moment onwards {\it pseudo physical}
   1.178 +addresses are used instead of discontiguous {\it real physical} addresses. Thus,
   1.179 +the rest of the guest OS code has the impression of operating in a contiguous
   1.180 +address space. Guest OS page tables, however, contain real physical addresses.
   1.181 +Mapping {\it pseudo physical} to {\it real physical} addresses is needed on page
   1.182 +table updates and also on remapping memory regions within the guest OS.
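The start-of-day translation table can be sketched as follows; the names and sizes are illustrative only:

```c
#include <stdint.h>
#include <stddef.h>

#define NR_GUEST_PAGES 4  /* a toy domain size for illustration */

/* Pseudo-physical frame number -> real (machine) frame number. A real guest
 * builds this at start of day from the page list it was allocated. */
static uint64_t pseudo_to_machine[NR_GUEST_PAGES];

void build_p2m(const uint64_t *allocated_frames, size_t n)
{
    /* The allocation may be discontiguous; indexing by position gives the
     * guest a contiguous pseudo-physical view over it. */
    for (size_t i = 0; i < n; i++)
        pseudo_to_machine[i] = allocated_frames[i];
}

/* Translate when constructing a page-table entry, which must hold a real
 * physical address rather than a pseudo-physical one. */
uint64_t p2m(uint64_t pseudo_pfn)
{
    return pseudo_to_machine[pseudo_pfn];
}
```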
   1.183 +
   1.184 +
   1.185 +
   1.186 +\chapter{Network I/O}
   1.187 +Since the hypervisor must multiplex network resources, its network subsystem
   1.188 +may be viewed as a virtual network switching element with each domain having
   1.189 +one or more virtual network interfaces to this network.
   1.190 +
   1.191 +The hypervisor acts conceptually as an IP router, forwarding each domain's
   1.192 +traffic according to a set of rules.
   1.193 +
   1.194 +\section{Hypervisor Packet Handling}
   1.195 +The hypervisor is responsible primarily for {\it data-path} operations.
   1.196 +In terms of networking this means packet transmission and reception.
   1.197 +
   1.198 +On the transmission side, the hypervisor needs to perform two key actions:
   1.199 +\begin{itemize}
   1.200 +\item {\tt Validation:} A domain is only allowed to emit packets matching a certain
   1.201 +specification; for example, ones in which the source IP address matches
   1.202 +one assigned to the virtual interface over which it is sent. The hypervisor
   1.203 +is responsible for ensuring any such requirements are met, either by checking
   1.204 +or by stamping outgoing packets with prescribed values for certain fields.
   1.205 +
   1.206 +\item {\tt Scheduling:} Since a number of domains can share a single ``real'' network 
   1.207 +interface, the hypervisor must mediate access when several domains each 
   1.208 +have packets queued for transmission. Of course, this general scheduling
   1.209 +function subsumes basic shaping or rate-limiting schemes.
   1.210 +
   1.211 +\item {\tt Logging and Accounting:} The hypervisor can be configured with classifier 
   1.212 +rules that control how packets are accounted or logged. For example, 
   1.213 +{\it domain0} could request that it receives a log message or copy of the
   1.214 +packet whenever another domain attempts to send a TCP packet containing a
   1.215 +SYN.
   1.216 +\end{itemize}
   1.217 +On the receive side, the hypervisor's role is relatively straightforward:
   1.218 +once a packet is received, it just needs to determine the virtual interface(s)
   1.219 +to which it must be delivered and deliver it via page-flipping. 
   1.220 +
   1.221 +
   1.222 +\section{Data Transfer}
   1.223 +
   1.224 +Each virtual interface uses two ``descriptor rings'', one for transmit,
   1.225 +the other for receive. Each descriptor identifies a block of contiguous
   1.226 +physical memory allocated to the domain. There are four cases:
   1.227 +
   1.228 +\begin{itemize}
   1.229 +
   1.230 +\item The transmit ring carries packets to transmit from the domain to the
   1.231 +hypervisor.
   1.232 +
   1.233 +\item The return path of the transmit ring carries ``empty'' descriptors
   1.234 +indicating that the contents have been transmitted and the memory can be
   1.235 +re-used.
   1.236 +
   1.237 +\item The receive ring carries empty descriptors from the domain to the 
   1.238 +hypervisor; these provide storage space for that domain's received packets.
   1.239 +
   1.240 +\item The return path of the receive ring carries packets that have been
   1.241 +received.
   1.242 +\end{itemize}
   1.243 +
   1.244 +Real physical addresses are used throughout, with the domain performing 
   1.245 +translation from pseudo-physical addresses if that is necessary.
   1.246 +
   1.247 +If a domain does not keep its receive ring stocked with empty buffers then 
   1.248 +packets destined for it may be dropped. This provides some defense against
   1.249 +receiver-livelock problems because an overloaded domain will cease to receive
   1.250 +further data. Similarly, on the transmit path, it provides the application
   1.251 +with feedback on the rate at which packets are able to leave the system.
   1.252 +
   1.253 +Synchronization between the hypervisor and the domain is achieved using 
   1.254 +counters held in shared memory that is accessible to both. Each ring has
   1.255 +associated producer and consumer indices indicating the area in the ring
   1.256 +that holds descriptors that contain data. After receiving {\it n} packets,
   1.257 +or {\it t} nanoseconds after receiving the first packet, the hypervisor sends
   1.258 +an event to the domain.
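The producer/consumer scheme can be sketched as a single-producer, single-consumer ring; the descriptor contents and sizes here are illustrative, and a real implementation would add memory barriers around the index updates:

```c
#include <stdint.h>

#define RING_SIZE 8  /* power of two, so free-running indices wrap cleanly */

/* Simplified descriptor: real rings carry richer, device-specific fields. */
struct desc { uint64_t addr; uint32_t len; };

struct ring {
    struct desc slots[RING_SIZE];
    volatile uint32_t prod;  /* advanced only by the producer */
    volatile uint32_t cons;  /* advanced only by the consumer */
};

int ring_put(struct ring *r, struct desc d)
{
    if (r->prod - r->cons == RING_SIZE)
        return -1;                       /* full: e.g. drop incoming packet */
    r->slots[r->prod % RING_SIZE] = d;
    r->prod++;                           /* publish after the slot is filled */
    return 0;
}

int ring_get(struct ring *r, struct desc *out)
{
    if (r->cons == r->prod)
        return -1;                       /* empty: nothing to consume */
    *out = r->slots[r->cons % RING_SIZE];
    r->cons++;
    return 0;
}
```

Because each index is written by exactly one side, both parties can poll the shared counters without locks; the event described above merely tells the consumer that the ring is worth inspecting.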
   1.259 +
   1.260 +\chapter{Block I/O}
   1.261 +
   1.262 +\section{Virtual Block Devices (VBDs)}
   1.263 +
   1.264 +All guest OS disk access goes through the VBD interface. The VBD interface
   1.265 +provides the administrator with the ability to selectively grant domains 
   1.266 +access to portions of block storage devices visible to the system.
   1.267 +
   1.268 +A VBD can also be comprised of a set of extents from multiple storage devices.
   1.269 +This provides the same functionality as a concatenated disk driver.
   1.270 +
   1.271 +\section{Virtual Disks (VDs)}
   1.272 +
   1.273 +VDs are an abstraction built on top of the VBD interface. One can reserve disk
   1.274 +space for use by the VD layer. This space is then managed as a pool of free extents.
   1.275 +The VD tools can automatically allocate collections of extents from this pool to
   1.276 +create ``virtual disks'' on demand. 
   1.277 +
   1.278 +\subsection{Virtual Disk Management}
   1.279 +The VD management code consists of a set of Python libraries. It can therefore
   1.280 +be accessed by custom scripts as well as the convenience scripts provided. The
   1.281 +VD database is a SQLite database in /var/db/xen\_vdisk.sqlite.
   1.282 +
   1.283 +The VD scripts and general VD usage are documented in the VBD-HOWTO.txt.
   1.284 +
   1.285 +\subsection{Data Transfer}
   1.286 +Domains which have been granted access to a logical block device are permitted
   1.287 +to read and write it directly through the hypervisor, rather than requiring
   1.288 +{\it domain0} to mediate every data access. 
   1.289 +
   1.290 +In overview, the same style of descriptor-ring that is used for network
   1.291 +packets is used here. Each domain has one ring that carries operation requests to the 
   1.292 +hypervisor and carries the results back again. 
   1.293 +
   1.294 +Rather than copying data in and out of the hypervisor, we use page pinning to
   1.295 +enable DMA transfers directly between the physical device and the domain's 
   1.296 +buffers. Disk read operations are straightforward; the hypervisor just needs
   1.297 +to know which pages have pending DMA transfers, and prevent the guest OS from
   1.298 +giving those pages back to the hypervisor or using them to store page tables.
   1.299 +
   1.300 +%block API here 
   1.302  \chapter{Privileged operations}
   1.303 +{\it Domain0} is responsible for building all other domains on the server
   1.304 +and providing control interfaces for managing scheduling, networking, and
   1.305 +block devices.
   1.306 +
   1.307 +
   1.308 +\chapter{Hypervisor calls}
   1.309 +
   1.310 +\section{ set\_trap\_table(trap\_info\_t *table)} 
   1.311 +
   1.312 +Install trap handler table.
   1.313 +
   1.314 +\section{ mmu\_update(mmu\_update\_t *req, int count)} 
   1.315 +Update the page table for the domain. Updates can be batched.
   1.316 +The update types are: 
   1.317 +
   1.318 +{\it MMU\_NORMAL\_PT\_UPDATE}:
   1.319 +
   1.320 +{\it MMU\_UNCHECKED\_PT\_UPDATE}:
   1.321 +
   1.322 +{\it MMU\_MACHPHYS\_UPDATE}:
   1.323 +
   1.324 +{\it MMU\_EXTENDED\_COMMAND}:
   1.325 +
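Batching might look like the following sketch; the request layout is illustrative (the real mmu\_update\_t encoding is defined by Xen's public headers) and the hypercall itself is not shown:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative request: a pointer to the PTE to update and its new value.
 * The real mmu_update_t encoding is defined by Xen's headers. */
typedef struct { uint64_t ptr; uint64_t val; } mmu_update_t;

#define MAX_BATCH 16

static mmu_update_t batch[MAX_BATCH];
static size_t batch_len;

/* Queue one page-table update; the guest issues a single mmu_update
 * hypercall with the whole batch when it fills or at a flush point. */
size_t queue_pt_update(uint64_t pte_addr, uint64_t new_pte)
{
    if (batch_len == MAX_BATCH)
        return batch_len;  /* full: caller must flush via the hypercall */
    batch[batch_len].ptr = pte_addr;
    batch[batch_len].val = new_pte;
    return ++batch_len;    /* becomes the 'count' hypercall argument */
}
```

Batching amortizes the cost of entering the hypervisor over many page-table updates, which matters because every update must be validated.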
   1.326 +\section{ console\_write(const char *str, int count)}
   1.327 +Output buffer str to the console.
   1.328 +
   1.329 +\section{ set\_gdt(unsigned long *frame\_list, int entries)} 
   1.330 +Set the global descriptor table; this is the virtualized replacement for {\tt lgdt}.
   1.331 +
   1.332 +\section{ stack\_switch(unsigned long ss, unsigned long esp)} 
   1.333 +Request context switch from hypervisor.
   1.334 +
   1.335 +\section{ set\_callbacks(unsigned long event\_selector, unsigned long event\_address,
   1.336 +                        unsigned long failsafe\_selector, unsigned long failsafe\_address) } 
   1.337 +Register the OS event processing routines. In Linux both the event\_selector and
   1.338 +failsafe\_selector are the kernel's CS. The value event\_address specifies the address for
   1.339 +an interrupt handler dispatch routine and failsafe\_address specifies a handler for 
   1.340 +application faults.
   1.341 +
   1.342 +\section{ net\_io\_op(netop\_t *op)}  
   1.343 +Notify hypervisor of updates to transmit and/or receive descriptor rings.
   1.344 +
   1.345 +\section{ fpu\_taskswitch(void)} 
   1.346 +Notify the hypervisor that FPU registers need to be saved on context switch.
   1.347 +
   1.348 +\section{ sched\_op(unsigned long op)} 
   1.349 +Request scheduling operation from hypervisor. The options are: yield, stop, and exit.
   1.350 +
   1.351 +\section{ dom0\_op(dom0\_op\_t *op)} 
   1.352 +Administrative domain operations for domain management. The options are:
   1.353 +
   1.354 +{\it DOM0\_CREATEDOMAIN}: create new domain, specifying the name and memory usage
   1.355 +in kilobytes.
   1.356 +
   1.357 +{\it DOM0\_STARTDOMAIN}: make domain schedulable
   1.358 +
   1.359 +{\it DOM0\_STOPDOMAIN}: mark domain as unschedulable
   1.360 +
   1.361 +{\it DOM0\_DESTROYDOMAIN}: deallocate resources associated with the domain
   1.362 +
   1.363 +{\it DOM0\_GETMEMLIST}: get list of pages used by the domain
   1.364 +
   1.365 +{\it DOM0\_BUILDDOMAIN}: do final guest OS setup for domain
   1.366 +
   1.367 +{\it DOM0\_BVTCTL}: adjust scheduler context switch time
   1.368 +
   1.369 +{\it DOM0\_ADJUSTDOM}: adjust scheduling priorities for domain
   1.370 +
   1.371 +{\it DOM0\_GETDOMAINFO}: get statistics about the domain
   1.372 +
   1.373 +{\it DOM0\_IOPL}:
   1.374 +
   1.375 +{\it DOM0\_MSR}:
   1.376 +
   1.377 +{\it DOM0\_DEBUG}:
   1.378 +
   1.379 +{\it DOM0\_SETTIME}: set system time
   1.380 +
   1.381 +{\it DOM0\_READCONSOLE}: read console content from hypervisor buffer ring
   1.382 +
   1.383 +{\it DOM0\_PINCPUDOMAIN}: pin domain to a particular CPU
   1.384 +
   1.385 +
   1.386 +\section{network\_op(network\_op\_t *op)} 
   1.387 +Update the network ruleset.
   1.388 +
   1.389 +\section{ block\_io\_op(block\_io\_op\_t *op)}
   1.390 +
   1.391 +\section{ set\_debugreg(int reg, unsigned long value)}
   1.392 +Set debug register reg to value.
   1.393 +
   1.394 +\section{ get\_debugreg(int reg)}
   1.395 +Get the value of debug register reg.
   1.396 +
   1.397 +\section{ update\_descriptor(unsigned long pa, unsigned long word1, unsigned long word2)} 
   1.398 +
   1.399 +\section{ set\_fast\_trap(int idx)}
   1.400 +Install traps to allow the guest OS to bypass the hypervisor.
   1.401 +
   1.402 +\section{ dom\_mem\_op(dom\_mem\_op\_t *op)}
   1.403 +Increase or decrease the memory reservation for the guest OS.
   1.404 +
   1.405 +\section{ multicall(multicall\_entry\_t *call\_list, int nr\_calls)}
   1.406 +Execute a series of hypervisor calls in one batch.
   1.407 +
   1.408 +\section{ kbd\_op(unsigned char op, unsigned char val)}
   1.409 +
   1.410 +\section{update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}
   1.411 +
   1.412 +\section{ event\_channel\_op(unsigned int cmd, unsigned int id)} 
   1.413 +Inter-domain event-channel management; options are: open, close, send, and status.
   1.415  \end{document}