ia64/xen-unstable

changeset 2672:1cec0bdb4c6f

bitkeeper revision 1.1159.117.4 (4176f32as2THW4beHDnUYVrng1zIzw)

Doc update.
author cl349@freefall.cl.cam.ac.uk
date Wed Oct 20 23:22:18 2004 +0000 (2004-10-20)
parents 3a1793ba6859
children 1c21b245b050
files docs/interface.tex
line diff
     1.1 --- a/docs/interface.tex	Wed Oct 20 21:12:59 2004 +0000
     1.2 +++ b/docs/interface.tex	Wed Oct 20 23:22:18 2004 +0000
     1.3 @@ -51,32 +51,33 @@ operating system images to be run simult
     1.4  
     1.5  Virtualizing the machine in this manner provides flexibility allowing
     1.6  different users to choose their preferred operating system (Windows,
     1.7 -Linux, FreeBSD, or a custom operating system). Furthermore, Xen provides
     1.8 +Linux, NetBSD, or a custom operating system).  Furthermore, Xen provides
     1.9  secure partitioning between these 'domains', and enables better resource
    1.10  accounting and QoS isolation than can be achieved with a conventional
    1.11  operating system.
    1.12  
    1.13  The hypervisor runs directly on server hardware and dynamically partitions
    1.14  it between a number of {\it domains}, each of which hosts an instance
    1.15 -of a {\it guest operating system}. The hypervisor provides just enough
    1.16 +of a {\it guest operating system}.  The hypervisor provides just enough
    1.17  abstraction of the machine to allow effective isolation and resource 
    1.18  management between these domains.
    1.19  
    1.20 -Xen essentially takes a virtual machine approach as pioneered by IBM VM/370.
    1.21 -However, unlike VM/370 or more recent efforts such as VMWare and Virtual PC,
    1.22 -Xen doesn not attempt to completely virtualize the underlying hardware. Instead
    1.23 -parts of the hosted guest operating systems to work with the hypervisor; the
    1.24 -operating system is effectively ported to a new target architecture, typically
    1.25 -requiring changes in just the machine-dependent code. The user-level API is
    1.26 -unchanged, thus existing binaries and operating system distributions can work
    1.27 -unmodified.
    1.28 +Xen essentially takes a virtual machine approach as pioneered by IBM
    1.29 +VM/370.  However, unlike VM/370 or more recent efforts such as VMWare
     1.30 +and Virtual PC, Xen does not attempt to completely virtualize the
     1.31 +underlying hardware.  Instead, parts of the hosted guest operating
    1.32 +systems are modified to work with the hypervisor; the operating system
    1.33 +is effectively ported to a new target architecture, typically
    1.34 +requiring changes in just the machine-dependent code.  The user-level
     1.35 +API is unchanged; thus existing binaries and operating system
    1.36 +distributions can work unmodified.
    1.37  
    1.38  In addition to exporting virtualized instances of CPU, memory, network and
     1.39  block devices, Xen exposes a control interface to set how these resources
    1.40 -are shared between the running domains. The control interface is privileged
    1.41 +are shared between the running domains.  The control interface is privileged
    1.42  and may only be accessed by one particular virtual machine: {\it domain0}.
     1.43  This domain is a required part of any Xen-based server and runs the application
    1.44 -software that manages the control-plane aspects of the platform. Running the
    1.45 +software that manages the control-plane aspects of the platform.  Running the
    1.46  control software in {\it domain0}, distinct from the hypervisor itself, allows
    1.47  the Xen framework to separate the notions of {\it mechanism} and {\it policy}
    1.48  within the system.
    1.49 @@ -84,58 +85,59 @@ within the system.
    1.50  
    1.51  \chapter{CPU state}
    1.52  
    1.53 -All privileged state must be handled by Xen. The guest OS has no direct access
    1.54 -to CR3 and is not permitted to update privileged bits in EFLAGS.
    1.55 +All privileged state must be handled by Xen.  The guest OS has no
    1.56 +direct access to CR3 and is not permitted to update privileged bits in
    1.57 +EFLAGS.
    1.58  
    1.59  \chapter{Exceptions}
    1.60  The IDT is virtualised by submitting a virtual 'trap
    1.61 -table' to Xen. Most trap handlers are identical to native x86
    1.62 -handlers. The page-fault handler is a noteable exception.
    1.63 +table' to Xen.  Most trap handlers are identical to native x86
     1.64 +handlers.  The page-fault handler is a notable exception.
    1.65  
    1.66  \chapter{Interrupts and events}
    1.67  Interrupts are virtualized by mapping them to events, which are delivered 
    1.68 -asynchronously to the target domain. A guest OS can map these events onto
    1.69 +asynchronously to the target domain.  A guest OS can map these events onto
    1.70  its standard interrupt dispatch mechanisms, such as a simple vectoring 
    1.71 -scheme. Each physical interrupt source controlled by the hypervisor, including
    1.72 +scheme.  Each physical interrupt source controlled by the hypervisor, including
    1.73  network devices, disks, or the timer subsystem, is responsible for identifying
    1.74  the target for an incoming interrupt and sending an event to that domain.
    1.75  
    1.76  This demultiplexing mechanism also provides a device-specific mechanism for 
    1.77 -event coalescing or hold-off. For example, a guest OS may request to only 
    1.78 +event coalescing or hold-off.  For example, a guest OS may request to only 
    1.79  actually receive an event after {\it n} packets are queued ready for delivery
     1.80  to it, {\it t} nanoseconds after the first packet arrived (whichever is true
    1.81 -first). This allows latency and throughput requirements to be addressed on a
    1.82 +first).  This allows latency and throughput requirements to be addressed on a
    1.83  domain-specific basis.
    1.84  
    1.85  \chapter{Time}
    1.86  Guest operating systems need to be aware of the passage of real time and their
    1.87 -own ``virtual time'', i.e. the time they have been executing. Furthermore, a
    1.88 +own ``virtual time'', i.e. the time they have been executing.  Furthermore, a
    1.89  notion of time is required in the hypervisor itself for scheduling and the
    1.90 -activities that relate to it. To this end the hypervisor provides for notions
    1.91 -of time: cycle counter time, system time, wall clock time, domain virtual 
     1.92 +activities that relate to it.  To this end the hypervisor provides the
     1.93 +following notions of time: cycle counter time, system time, wall clock time, and domain virtual 
    1.94  time.
    1.95  
    1.96  
    1.97  \section{Cycle counter time}
    1.98  This provides the finest-grained, free-running time reference, with the
    1.99 -approximate frequency being publicly accessible. The cycle counter time is
   1.100 -used to accurately extrapolate the other time references. On SMP machines
   1.101 +approximate frequency being publicly accessible.  The cycle counter time is
   1.102 +used to accurately extrapolate the other time references.  On SMP machines
   1.103  it is currently assumed that the cycle counter time is synchronised between
   1.104 -CPUs. The current x86-based implementation achieves this within inter-CPU
   1.105 +CPUs.  The current x86-based implementation achieves this within inter-CPU
   1.106  communication latencies.
   1.107  
   1.108  \section{System time}
   1.109  This is a 64-bit value containing the nanoseconds elapsed since boot
   1.110 -time. Unlike cycle counter time, system time accurately reflects the
   1.111 +time.  Unlike cycle counter time, system time accurately reflects the
   1.112  passage of real time, i.e.  it is adjusted several times a second for timer
   1.113 -drift. This is done by running an NTP client in {\it domain0} on behalf of
   1.114 -the machine, feeding updates to the hypervisor. Intermediate values can be
   1.115 +drift.  This is done by running an NTP client in {\it domain0} on behalf of
   1.116 +the machine, feeding updates to the hypervisor.  Intermediate values can be
   1.117  extrapolated using the cycle counter.
   1.118  
   1.119  \section{Wall clock time}
   1.120  This is the actual ``time of day'' Unix style struct timeval (i.e. seconds and
   1.121 -microseconds since 1 January 1970, adjusted by leap seconds etc.). Again, an 
   1.122 -NTP client hosted by {\it domain0} can help maintain this value. To guest 
   1.123 +microseconds since 1 January 1970, adjusted by leap seconds etc.).  Again, an 
   1.124 +NTP client hosted by {\it domain0} can help maintain this value.  To guest 
   1.125  operating systems this value will be reported instead of the hardware RTC
   1.126  clock value and they can use the system time and cycle counter times to start
   1.127  and remain perfectly in time.
   1.128 @@ -143,118 +145,136 @@ and remain perfectly in time.
   1.129  
   1.130  \section{Domain virtual time}
   1.131  This progresses at the same pace as cycle counter time, but only while a
   1.132 -domain is executing. It stops while a domain is de-scheduled. Therefore the
   1.133 +domain is executing.  It stops while a domain is de-scheduled.  Therefore the
   1.134  share of the CPU that a domain receives is indicated by the rate at which
   1.135  its domain virtual time increases, relative to the rate at which cycle
   1.136  counter time does so.
   1.137  
   1.138  \section{Time interface}
   1.139  Xen exports some timestamps to guest operating systems through their shared
   1.140 -info page. Timestamps are provided for system time and wall-clock time. Xen
   1.141 +info page.  Timestamps are provided for system time and wall-clock time.  Xen
   1.142  also provides the cycle counter values at the time of the last update
   1.143 -allowing guests to calculate the current values. The cpu frequency and a
    1.144 +allowing guests to calculate the current values.  The CPU frequency and a
   1.145  scaling factor are provided for guests to convert cycle counter values to
   1.146 -real time. Since all time stamps need to be updated and read
   1.147 +real time.  Since all time stamps need to be updated and read
    1.148  \emph{atomically}, two version numbers are also stored in the shared info
   1.149  page.
   1.150  
   1.151  Xen will ensure that the time stamps are updated frequently enough to avoid
   1.152 -an overflow of the cycle counter values. Guest can check if its notion of
   1.153 +an overflow of the cycle counter values.  A guest can check if its notion of
   1.154  time is up-to-date by comparing the version numbers.
   1.155  
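
As a rough C sketch of the version-number check described above (the structure layout and field names here are illustrative, not the actual shared-info format):

    /* Illustrative shared-info time fields; the real layout differs. */
    struct shared_time_info {
        unsigned long      version1;       /* bumped before Xen updates the fields */
        unsigned long      version2;       /* bumped after the update completes    */
        unsigned long long tsc_at_update;  /* cycle counter at the last update     */
        unsigned long long system_time;    /* nanoseconds since boot               */
    };

    /* Read a consistent snapshot: retry if Xen updated the fields mid-read. */
    unsigned long long read_system_time(volatile struct shared_time_info *t)
    {
        unsigned long v1, v2;
        unsigned long long stime;

        do {
            v1    = t->version1;
            stime = t->system_time;   /* could also extrapolate using the TSC */
            v2    = t->version2;
        } while (v1 != v2);

        return stime;
    }
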
   1.156  \section{Timer events}
   1.157  
   1.158  Xen maintains a periodic timer (currently with a 10ms period) which sends a
   1.159 -timer event to the currently executing domain. This allows Guest OSes to
   1.160 -keep track of the passing of time when executing. The scheduler also
   1.161 +timer event to the currently executing domain.  This allows Guest OSes to
   1.162 +keep track of the passing of time when executing.  The scheduler also
   1.163  arranges for a newly activated domain to receive a timer event when
   1.164  scheduled so that the Guest OS can adjust to the passage of time while it
   1.165  has been inactive.
   1.166  
   1.167  In addition, Xen exports a hypercall interface to each domain which allows
   1.168 -them to request a timer event send to them at the specified system
   1.169 -time. Guest OSes may use this timer to implemented timeout values when they
    1.170 +them to request a timer event to be sent to them at the specified system
   1.171 +time.  Guest OSes may use this timer to implement timeout values when they
   1.172  block.
   1.173  
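
A small sketch of the timeout pattern mentioned above, using the set\_timer\_op and sched\_op hypercalls listed later in this document; the HYPERVISOR\_* wrapper names and the SCHEDOP\_block value are assumptions, since each guest port defines its own hypercall stubs:

    extern int HYPERVISOR_set_timer_op(unsigned long long timeout); /* assumed wrapper */
    extern int HYPERVISOR_sched_op(unsigned long op);               /* assumed wrapper */
    #define SCHEDOP_block 1                                         /* assumed value   */

    /* Arm a timer event at an absolute system time, then block; either the
     * timer event or any other pending event will wake the domain again. */
    void block_with_timeout(unsigned long long now_ns, unsigned long long timeout_ns)
    {
        HYPERVISOR_set_timer_op(now_ns + timeout_ns);
        HYPERVISOR_sched_op(SCHEDOP_block);
    }
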
   1.174  \chapter{Memory}
   1.175  
   1.176 -The hypervisor is responsible for providing memory to each of the domains running 
   1.177 -over it. However, the Xen hypervisor's duty is restricted to managing physical
   1.178 -memory and to policing page table updates. All other memory management functions
   1.179 -are handly externally. Start-of-day issues such as building initial page tables
   1.180 -for a domain, loading its kernel image and so on are done by the {\it domain builder}
   1.181 -running in user-space with {\it domain0}. Paging to disk and swapping is handled
   1.182 -by the guest operating systems themselves, if they need it.
   1.183 +The hypervisor is responsible for providing memory to each of the
   1.184 +domains running over it.  However, the Xen hypervisor's duty is
    1.185 +restricted to managing physical memory and to policing page table
   1.186 +updates.  All other memory management functions are handled
   1.187 +externally.  Start-of-day issues such as building initial page tables
   1.188 +for a domain, loading its kernel image and so on are done by the {\it
   1.189 +domain builder} running in user-space in {\it domain0}.  Paging to
    1.190 +disk and swapping are handled by the guest operating systems
   1.191 +themselves, if they need it.
   1.192  
   1.193 -On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It has full
   1.194 -access to the physical memory available in the system and is responsible for 
   1.195 -allocating portions of it to the domains. Guest operating systems run in and use
   1.196 -{\it rings 1}, {\it 2} and {\it 3} as they see fit, aside from the fact that
   1.197 -segmentation is used to prevent the guest OS from accessing a portion of the 
   1.198 -linear address space that is reserved for use by the hypervisor. This approach
   1.199 -allows transitions between the guest OS and hypervisor without flushing the TLB.
   1.200 -We expect most guest operating systems will use ring 1 for their own operation
   1.201 -and place applications (if they support such a notion) in ring 3.
   1.202 +On a Xen-based system, the hypervisor itself runs in {\it ring 0}.  It
   1.203 +has full access to the physical memory available in the system and is
   1.204 +responsible for allocating portions of it to the domains.  Guest
   1.205 +operating systems run in and use {\it rings 1}, {\it 2} and {\it 3} as
   1.206 +they see fit, aside from the fact that segmentation is used to prevent
   1.207 +the guest OS from accessing a portion of the linear address space that
   1.208 +is reserved for use by the hypervisor.  This approach allows
   1.209 +transitions between the guest OS and hypervisor without flushing the
   1.210 +TLB.  We expect most guest operating systems will use ring 1 for their
   1.211 +own operation and place applications (if they support such a notion)
   1.212 +in ring 3.
   1.213  
   1.214  \section{Physical Memory Allocation}
   1.215 -The hypervisor reserves a small fixed portion of physical memory at system boot
   1.216 -time. This special memory region is located at the beginning of physical memory
   1.217 -and is mapped at the very top of every virtual address space. 
   1.218 +The hypervisor reserves a small fixed portion of physical memory at
   1.219 +system boot time.  This special memory region is located at the
   1.220 +beginning of physical memory and is mapped at the very top of every
   1.221 +virtual address space.
   1.222  
   1.223  Any physical memory that is not used directly by the hypervisor is divided into
   1.224 -pages and is available for allocation to domains. The hypervisor tracks which
   1.225 -pages are free and which pages have been allocated to each domain. When a new
   1.226 +pages and is available for allocation to domains.  The hypervisor tracks which
   1.227 +pages are free and which pages have been allocated to each domain.  When a new
   1.228  domain is initialized, the hypervisor allocates it pages drawn from the free 
   1.229 -list. The amount of memory required by the domain is passed to the hypervisor
   1.230 +list.  The amount of memory required by the domain is passed to the hypervisor
   1.231  as one of the parameters for new domain initialization by the domain builder.
   1.232  
   1.233 -Domains can never be allocated further memory beyond that which was requested
   1.234 -for them on initialization. However, a domain can return pages to the hypervisor
   1.235 -if it discovers that its memory requirements have diminished.
   1.236 +Domains can never be allocated further memory beyond that which was
   1.237 +requested for them on initialization.  However, a domain can return
   1.238 +pages to the hypervisor if it discovers that its memory requirements
   1.239 +have diminished.
   1.240  
   1.241  % put reasons for why pages might be returned here.
   1.242  \section{Page Table Updates}
   1.243  In addition to managing physical memory allocation, the hypervisor is also in
   1.244 -charge of performing page table updates on behalf of the domains. This is 
   1.245 +charge of performing page table updates on behalf of the domains.  This is 
    1.246  necessary to prevent domains from adding arbitrary mappings to their page
    1.247  tables or introducing mappings to others' page tables.
   1.248  
    1.249 +\section{Writable Page Tables}
    1.250 +A domain can also request write access to its page tables.  In this
    1.251 +mode, Xen traps write attempts to page table pages and makes the page
    1.252 +temporarily writable.  In-use page table pages are also disconnected
    1.253 +from the page directory.  The domain can now update entries in these
    1.254 +page table pages without the assistance of Xen.  As soon as the
    1.255 +writable page table pages are used as page tables again, Xen makes the
    1.256 +pages read-only and revalidates the entries they contain.
   1.257 +
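
As a sketch of the difference this makes to a guest, assuming writable page tables are enabled through the vm\_assist hypercall listed later in this document (the VMASST\_* constants and the hypercall wrapper are assumed names):

    extern int HYPERVISOR_vm_assist(unsigned int cmd, unsigned int type); /* assumed wrapper */
    #define VMASST_CMD_enable               0   /* assumed value */
    #define VMASST_TYPE_writable_pagetables 2   /* assumed value */

    typedef unsigned long pte_t;

    void set_pte_directly(pte_t *pte, pte_t new_val)
    {
        /* Enable the mode once; Xen will then trap the first write to each
         * page-table page, unhook it and make it temporarily writable. */
        HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_writable_pagetables);

        /* Plain store: no mmu_update hypercall is needed while the page is
         * unhooked; Xen revalidates the entries when the page is reused. */
        *pte = new_val;
    }
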
   1.258  \section{Segment Descriptor Tables}
   1.259  
   1.260  On boot a guest is supplied with a default GDT, which is {\em not}
   1.261 -taken from its own memory allocation. If the guest wishes to use other
   1.262 +taken from its own memory allocation.  If the guest wishes to use other
   1.263  than the default `flat' ring-1 and ring-3 segments that this default
   1.264  table provides, it must register a custom GDT and/or LDT with Xen,
   1.265  allocated from its own memory.
   1.266  
   1.267  int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries})
   1.268  
   1.269 -{\em frame\_list}: An array of up to 16 page frames within which the GDT
   1.270 -resides. Any frame registered as a GDT frame may only be mapped
   1.271 -read-only within the guest's address space (e.g., no writeable
   1.272 +{\em frame\_list}: An array of up to 16 page frames within which the
   1.273 +GDT resides.  Any frame registered as a GDT frame may only be mapped
   1.274 +read-only within the guest's address space (e.g., no writable
   1.275  mappings, no use as a page-table page, and so on).
   1.276  
   1.277 -{\em entries}: The number of descriptor-entry slots in the GDT. Note that
   1.278 -the table must be large enough to contain Xen's reserved entries; thus
   1.279 -we must have '{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}'. Note also that,
   1.280 -after registering the GDT, slots {\em FIRST\_} through
   1.281 -{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and may be
   1.282 -overwritten by Xen.
   1.283 +{\em entries}: The number of descriptor-entry slots in the GDT.  Note
   1.284 +that the table must be large enough to contain Xen's reserved entries;
   1.285 +thus we must have '{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}'.
   1.286 +Note also that, after registering the GDT, slots {\em FIRST\_} through
   1.287 +{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and
   1.288 +may be overwritten by Xen.
   1.289  
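
A short C sketch of registering a custom GDT as described above; the HYPERVISOR\_set\_gdt wrapper name and the source of the frame number are illustrative:

    extern int HYPERVISOR_set_gdt(unsigned long *frame_list, int entries); /* assumed wrapper */

    /* gdt_frame: machine frame holding the GDT, already mapped read-only in
     * the guest; nr_entries must exceed Xen's reserved GDT entries. */
    int install_custom_gdt(unsigned long gdt_frame, int nr_entries)
    {
        unsigned long frames[16];   /* up to 16 frames may be registered */

        frames[0] = gdt_frame;
        return HYPERVISOR_set_gdt(frames, nr_entries);
    }
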
   1.290  \section{Pseudo-Physical Memory}
   1.291 -The usual problem of external fragmentation means that a domain is unlikely to
   1.292 -receive a contiguous stretch of physical memory. However, most guest operating
   1.293 -systems do not have built-in support for operating in a fragmented physical
   1.294 -address space e.g. Linux has to have a one-to-one mapping for it physical
   1.295 -memory. There a notion of {\it pseudo physical memory} is introdouced. 
   1.296 -Once a domain is allocated a number of pages, at its start of the day, one of
   1.297 -the first things it needs to do is build its own {\it real physical} to 
   1.298 -{\it pseudo physical} mapping. From that moment onwards {\it pseudo physical}
   1.299 -address are used instead of discontiguous {\it real physical} addresses. Thus,
   1.300 -the rest of the guest OS code has an impression of operating in a contiguous
   1.301 -address space. Guest OS page tables contain real physical addresses. Mapping
   1.302 -{\it pseudo physical} to {\it real physical} addresses is need on page
   1.303 -table updates and also on remapping memory regions with the guest OS.
   1.304 +The usual problem of external fragmentation means that a domain is
   1.305 +unlikely to receive a contiguous stretch of physical memory.  However,
   1.306 +most guest operating systems do not have built-in support for
    1.307 +operating in a fragmented physical address space, e.g. Linux has to
    1.308 +have a one-to-one mapping for its physical memory.  Therefore a notion of
    1.309 +{\it pseudo physical memory} is introduced.  Xen maintains a {\it
   1.310 +real physical} to {\it pseudo physical} mapping which can be consulted
   1.311 +by every domain.  Additionally, at its start of day, a domain is
   1.312 +supplied a {\it pseudo physical} to {\it real physical} mapping which
   1.313 +it needs to keep updated itself.  From that moment onwards {\it pseudo
   1.314 +physical} addresses are used instead of discontiguous {\it real
   1.315 +physical} addresses.  Thus, the rest of the guest OS code has an
   1.316 +impression of operating in a contiguous address space.  Guest OS page
   1.317 +tables contain real physical addresses.  Mapping {\it pseudo physical}
   1.318 +to {\it real physical} addresses is needed on page table updates and
    1.319 +also on remapping memory regions within the guest OS.
   1.320  
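
The following sketch shows the translation step described above when building a page-table entry; the table names and the 4KB page-size assumption are illustrative:

    extern unsigned long *machine_to_phys;   /* maintained by Xen, readable by all domains */
    extern unsigned long *phys_to_machine;   /* maintained by the guest itself             */

    /* Page tables must hold real (machine) frame numbers, so a pseudo-physical
     * frame number is translated before the entry is installed. */
    unsigned long pte_for_pseudo_frame(unsigned long pseudo_pfn, unsigned long flags)
    {
        unsigned long machine_pfn = phys_to_machine[pseudo_pfn];
        return (machine_pfn << 12) | flags;   /* assumes 4KB pages */
    }
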
   1.321  
   1.322  
   1.323 @@ -272,11 +292,11 @@ In terms of networking this means packet
   1.324  
   1.325  On the transmission side, the backend needs to perform two key actions:
   1.326  \begin{itemize}
   1.327 -\item {\tt Validation:} A domain is only allowed to emit packets
   1.328 +\item {\tt Validation:} A domain may only be allowed to emit packets
   1.329  matching a certain specification; for example, ones in which the
   1.330  source IP address matches one assigned to the virtual interface over
   1.331 -which it is sent. The backend is responsible for ensuring any such
   1.332 -requirements are met, either by checking or by stamping outgoing
   1.333 +which it is sent.  The backend would be responsible for ensuring any
   1.334 +such requirements are met, either by checking or by stamping outgoing
   1.335  packets with prescribed values for certain fields.
   1.336  
   1.337  Validation functions can be configured using standard firewall rules
   1.338 @@ -284,13 +304,13 @@ Validation functions can be configured u
   1.339  
   1.340  \item {\tt Scheduling:} Since a number of domains can share a single
   1.341  ``real'' network interface, the hypervisor must mediate access when
   1.342 -several domains each have packets queued for transmission. Of course,
   1.343 +several domains each have packets queued for transmission.  Of course,
   1.344  this general scheduling function subsumes basic shaping or
   1.345  rate-limiting schemes.
   1.346  
   1.347  \item {\tt Logging and Accounting:} The hypervisor can be configured
   1.348  with classifier rules that control how packets are accounted or
   1.349 -logged. For example, {\it domain0} could request that it receives a
   1.350 +logged.  For example, {\it domain0} could request that it receives a
   1.351  log message or copy of the packet whenever another domain attempts to
    1.352  send a TCP packet containing a SYN.
   1.353  \end{itemize}
   1.354 @@ -303,8 +323,8 @@ to which it must be delivered and delive
   1.355  \section{Data Transfer}
   1.356  
   1.357  Each virtual interface uses two ``descriptor rings'', one for transmit,
   1.358 -the other for receive. Each descriptor identifies a block of contiguous
   1.359 -physical memory allocated to the domain. There are four cases:
   1.360 +the other for receive.  Each descriptor identifies a block of contiguous
   1.361 +physical memory allocated to the domain.  There are four cases:
   1.362  
   1.363  \begin{itemize}
   1.364  
   1.365 @@ -326,15 +346,15 @@ Real physical addresses are used through
   1.366  translation from pseudo-physical addresses if that is necessary.
   1.367  
   1.368  If a domain does not keep its receive ring stocked with empty buffers then 
   1.369 -packets destined to it may be dropped. This provides some defense against 
   1.370 +packets destined to it may be dropped.  This provides some defense against 
    1.371  receiver-livelock problems because an overloaded domain will cease to receive
   1.372 -further data. Similarly, on the transmit path, it provides the application
   1.373 +further data.  Similarly, on the transmit path, it provides the application
   1.374  with feedback on the rate at which packets are able to leave the system.
   1.375  
   1.376  Synchronization between the hypervisor and the domain is achieved using 
   1.377 -counters held in shared memory that is accessible to both. Each ring has
   1.378 +counters held in shared memory that is accessible to both.  Each ring has
   1.379  associated producer and consumer indices indicating the area in the ring
   1.380 -that holds descriptors that contain data. After receiving {\it n} packets
   1.381 +that holds descriptors that contain data.  After receiving {\it n} packets
    1.382  or {\it t} nanoseconds after receiving the first packet, the hypervisor sends
   1.383  an event to the domain. 
   1.384  
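
A minimal sketch of the producer/consumer arrangement described above, shown on the receive side; the ring layout and size are illustrative rather than Xen's actual descriptor format:

    #define RING_SIZE 256   /* assumed power-of-two ring size */

    struct rx_ring {
        volatile unsigned int prod;   /* advanced by the backend as packets arrive */
        volatile unsigned int cons;   /* advanced by the guest as it drains them   */
        struct { unsigned long addr; unsigned int len; } desc[RING_SIZE];
    };

    extern void handle_packet(unsigned long addr, unsigned int len);  /* guest-defined */

    /* On receiving the event, drain every descriptor published since the
     * last look; the gap between the indices bounds the work to do. */
    void rx_event_handler(struct rx_ring *r)
    {
        while (r->cons != r->prod) {
            unsigned int i = r->cons % RING_SIZE;
            handle_packet(r->desc[i].addr, r->desc[i].len);
            r->cons++;
        }
    }
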
   1.385 @@ -342,7 +362,7 @@ an event to the domain.
   1.386  
   1.387  \section{Virtual Block Devices (VBDs)}
   1.388  
   1.389 -All guest OS disk access goes through the VBD interface. The VBD
   1.390 +All guest OS disk access goes through the VBD interface.  The VBD
   1.391  interface provides the administrator with the ability to selectively
   1.392  grant domains access to portions of block storage devices visible to
    1.393  the block backend device (usually domain 0).
   1.394 @@ -360,7 +380,7 @@ Domains which have been granted access t
   1.395  to read and write it by shared memory communications with the backend domain. 
   1.396  
   1.397  In overview, the same style of descriptor-ring that is used for
   1.398 -network packets is used here. Each domain has one ring that carries
   1.399 +network packets is used here.  Each domain has one ring that carries
   1.400  operation requests to the hypervisor and carries the results back
   1.401  again.
   1.402  
   1.403 @@ -390,7 +410,7 @@ assigned domains should be run there.
   1.404  \section{Standard Schedulers}
   1.405  
    1.406  The BVT, Atropos and Round Robin schedulers are part of the normal
   1.407 -Xen distribution.  BVT provides porportional fair shares of the CPU to
   1.408 +Xen distribution.  BVT provides proportional fair shares of the CPU to
   1.409  the running domains.  Atropos can be used to reserve absolute shares
   1.410  of the CPU for each domain.  Round-robin is provided as an example of
   1.411  Xen's internal scheduler API.
   1.412 @@ -569,7 +589,7 @@ which also performs all Xen-specific tas
   1.413  (unless the previous task has been chosen again).
   1.414  
   1.415  This method is called with the {\tt schedule\_lock} held for the current CPU
   1.416 -and local interrupts interrupts disabled.
   1.417 +and local interrupts disabled.
   1.418  
   1.419  \paragraph*{Return values}
   1.420  
   1.421 @@ -588,9 +608,8 @@ source data from or populate with data, 
   1.422  \paragraph*{Call environment}
   1.423  
   1.424  The generic layer guarantees that when this method is called, the
   1.425 -caller was using the caller selected the correct scheduler ID, hence
   1.426 -the scheduler's implementation does not need to sanity-check these
   1.427 -parts of the call.
   1.428 +caller selected the correct scheduler ID, hence the scheduler's
   1.429 +implementation does not need to sanity-check these parts of the call.
   1.430  
   1.431  \paragraph*{Return values}
   1.432  
   1.433 @@ -739,21 +758,17 @@ xentrace\_format} and {\tt xentrace\_cpu
   1.434  
   1.435  Install trap handler table.
   1.436  
   1.437 -\section{ mmu\_update(mmu\_update\_t *req, int count)} 
    1.438 +\section{ mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} 
   1.439  Update the page table for the domain. Updates can be batched.
   1.440 -The update types are: 
    1.441 +success\_count will be updated to report the number of successful
   1.442 +updates.  The update types are:
   1.443  
   1.444  {\it MMU\_NORMAL\_PT\_UPDATE}:
   1.445  
   1.446 -{\it MMU\_UNCHECKED\_PT\_UPDATE}:
   1.447 -
   1.448  {\it MMU\_MACHPHYS\_UPDATE}:
   1.449  
   1.450  {\it MMU\_EXTENDED\_COMMAND}:
   1.451  
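
A sketch of batching updates through this call; the ptr/val request layout and the wrapper name are assumptions consistent with the description above:

    typedef struct {
        unsigned long ptr;   /* machine address of the entry; low bits select the update type */
        unsigned long val;   /* new value for the entry                                        */
    } mmu_update_t;

    extern int HYPERVISOR_mmu_update(mmu_update_t *req, int count,
                                     int *success_count);   /* assumed wrapper */

    /* Apply two page-table updates with a single hypercall; the returned
     * count reports how many of them Xen accepted. */
    int update_two_entries(unsigned long ma0, unsigned long val0,
                           unsigned long ma1, unsigned long val1)
    {
        mmu_update_t req[2] = { { ma0, val0 }, { ma1, val1 } };
        int done = 0;

        HYPERVISOR_mmu_update(req, 2, &done);
        return done;
    }
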
   1.452 -\section{ console\_write(const char *str, int count)}
   1.453 -Output buffer str to the console.
   1.454 -
   1.455  \section{ set\_gdt(unsigned long *frame\_list, int entries)} 
   1.456  Set the global descriptor table - virtualization for lgdt.
   1.457  
   1.458 @@ -761,28 +776,24 @@ Set the global descriptor table - virtua
   1.459  Request context switch from hypervisor.
   1.460  
   1.461  \section{ set\_callbacks(unsigned long event\_selector, unsigned long event\_address,
   1.462 -                        unsigned long failsafe\_selector, unsigned long failsafe\_address) } 
   1.463 - Register OS event processing routine. In Linux both the event\_selector and 
   1.464 -failsafe\_selector are the kernel's CS. The value event\_address specifies the address for
   1.465 -an interrupt handler dispatch routine and failsafe\_address specifies a handler for 
   1.466 -application faults.
   1.467 -
   1.468 -\section{ net\_io\_op(netop\_t *op)}  
   1.469 -Notify hypervisor of updates to transmit and/or receive descriptor rings.
   1.470 +                        unsigned long failsafe\_selector, unsigned
   1.471 + long failsafe\_address) } Register OS event processing routine.  In
   1.472 + Linux both the event\_selector and failsafe\_selector are the
   1.473 + kernel's CS.  The value event\_address specifies the address for an
   1.474 + interrupt handler dispatch routine and failsafe\_address specifies a
   1.475 + handler for application faults.
   1.476  
   1.477  \section{ fpu\_taskswitch(void)} 
    1.478  Notify hypervisor that fpu registers need to be saved on context switch.
   1.479  
   1.480  \section{ sched\_op(unsigned long op)} 
   1.481 -Request scheduling operation from hypervisor. The options are: {\it yield},
   1.482 -{\it block}, {\it stop}, and {\it exit}. {\it yield} keeps the calling
   1.483 -domain run-able but may cause a reschedule if other domains are
   1.484 -run-able. {\it block} removes the calling domain from the run queue and the
   1.485 -domains sleeps until an event is delivered to it. {\it stop} and {\it exit}
   1.486 -should be self-explanatory.
   1.487 -
   1.488 -\section{ set\_dom\_timer(dom\_timer\_arg\_t *timer\_arg)} 
   1.489 -Request a timer event to be sent at the specified system time.
   1.490 +Request scheduling operation from hypervisor. The options are: {\it
   1.491 +yield}, {\it block}, and {\it shutdown}.  {\it yield} keeps the
   1.492 +calling domain run-able but may cause a reschedule if other domains
   1.493 +are run-able.  {\it block} removes the calling domain from the run
    1.494 +queue and the domain sleeps until an event is delivered to it.  {\it
    1.495 +shutdown} is used to end the domain's execution and allows the caller
    1.496 +to specify whether the domain should reboot, halt or suspend.
   1.497  
   1.498  \section{ dom0\_op(dom0\_op\_t *op)} 
   1.499  Administrative domain operations for domain management. The options are:
   1.500 @@ -790,26 +801,30 @@ Administrative domain operations for dom
   1.501  {\it DOM0\_CREATEDOMAIN}: create new domain, specifying the name and memory usage
   1.502  in kilobytes.
   1.503  
   1.504 -{\it DOM0\_STARTDOMAIN}: make domain schedulable
   1.505 +{\it DOM0\_CREATEDOMAIN}: create domain
   1.506  
   1.507 -{\it DOM0\_STOPDOMAIN}: mark domain as unschedulable
   1.508 +{\it DOM0\_PAUSEDOMAIN}: mark domain as unschedulable
   1.509 +
   1.510 +{\it DOM0\_UNPAUSEDOMAIN}: mark domain as schedulable
   1.511  
   1.512  {\it DOM0\_DESTROYDOMAIN}: deallocate resources associated with the domain
   1.513  
   1.514  {\it DOM0\_GETMEMLIST}: get list of pages used by the domain
   1.515  
   1.516 -{\it DOM0\_BUILDDOMAIN}: do final guest OS setup for domain
   1.517 -
   1.518 -{\it DOM0\_BVTCTL}: adjust scheduler context switch time
   1.519 +{\it DOM0\_SCHEDCTL}:
   1.520  
   1.521  {\it DOM0\_ADJUSTDOM}: adjust scheduling priorities for domain
   1.522  
   1.523 +{\it DOM0\_BUILDDOMAIN}: do final guest OS setup for domain
   1.524 +
   1.525  {\it DOM0\_GETDOMAINFO}: get statistics about the domain
   1.526  
   1.527  {\it DOM0\_GETPAGEFRAMEINFO}:
   1.528  
   1.529  {\it DOM0\_IOPL}: set IO privilege level
   1.530  
   1.531 +{\it DOM0\_MSR}:
   1.532 +
   1.533  {\it DOM0\_DEBUG}: interactively call pervasive debugger
   1.534  
   1.535  {\it DOM0\_SETTIME}: set system time
   1.536 @@ -827,34 +842,60 @@ in kilobytes.
   1.537  
   1.538  {\it DOM0\_SCHED\_ID}: get the ID of the current Xen scheduler
   1.539  
   1.540 +{\it DOM0\_SHADOW\_CONTROL}:
   1.541 +
   1.542  {\it DOM0\_SETDOMAINNAME}: set the name of a domain
   1.543  
   1.544  {\it DOM0\_SETDOMAININITIALMEM}: set initial memory allocation of a domain
   1.545  
   1.546 +{\it DOM0\_SETDOMAINMAXMEM}: set maximum memory allocation of a domain
   1.547 +
   1.548  {\it DOM0\_GETPAGEFRAMEINFO2}:
   1.549  
   1.550 +{\it DOM0\_SETDOMAINVMASSIST}: set domain VM assist options
   1.551 +
   1.552 +
   1.553  \section{ set\_debugreg(int reg, unsigned long value)}
   1.554  set debug register reg to value
   1.555  
   1.556  \section{ get\_debugreg(int reg)}
   1.557   get the debug register reg
   1.558  
   1.559 -\section{ update\_descriptor(unsigned long pa, unsigned long word1, unsigned long word2)} 
   1.560 +\section{ update\_descriptor(unsigned long ma, unsigned long word1, unsigned long word2)} 
   1.561  
   1.562  \section{ set\_fast\_trap(int idx)}
   1.563   install traps to allow guest OS to bypass hypervisor
   1.564  
   1.565 -\section{ dom\_mem\_op(unsigned int op, void *pages, unsigned long nr\_pages)}
   1.566 - increase or decrease memory reservations for guest OS
    1.567 +\section{ dom\_mem\_op(unsigned int op, unsigned long *extent\_list, unsigned long nr\_extents, unsigned int extent\_order)}
    1.568 +Increase or decrease memory reservations for the guest OS.
   1.569 +
   1.570 +\section{ multicall(void *call\_list, int nr\_calls)}
   1.571 +Execute a series of hypervisor calls
   1.572  
   1.573 -\section{ multicall(multicall\_entry\_t *call\_list, int nr\_calls)}
   1.574 - execute a series of hypervisor calls
   1.575 +\section{ update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}
   1.576 +
    1.577 +\section{ set\_timer\_op(uint64\_t timeout)} 
   1.578 +Request a timer event to be sent at the specified system time.
   1.579 +
   1.580 +\section{ event\_channel\_op(void *op)} 
    1.581 +Inter-domain event-channel management.
   1.582  
   1.583 -\section{ kbd\_op(unsigned char op, unsigned char val)}
   1.584 +\section{ xen\_version(int cmd)}
   1.585 +Request Xen version number.
   1.586 +
   1.587 +\section{ console\_io(int cmd, int count, char *str)}
    1.588 +Interact with the console.  Operations are:
   1.589 +
   1.590 +{\it CONSOLEIO\_write}: Output count characters from buffer str.
   1.591  
   1.592 -\section{update\_va\_mapping(unsigned long page\_nr, unsigned long val, unsigned long flags)}
   1.593 +{\it CONSOLEIO\_read}: Input at most count characters into buffer str.
   1.594 +
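
A small sketch of emitting a message with CONSOLEIO\_write; the wrapper name and command value are assumptions:

    #include <string.h>

    extern int HYPERVISOR_console_io(int cmd, int count, char *str);  /* assumed wrapper */
    #define CONSOLEIO_write 0                                         /* assumed value   */

    void console_print(char *msg)
    {
        HYPERVISOR_console_io(CONSOLEIO_write, (int)strlen(msg), msg);
    }
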
   1.595 +\section{ physdev\_op(void *physdev\_op)}
   1.596  
   1.597 -\section{ event\_channel\_op(unsigned int cmd, unsigned int id)} 
   1.598 -inter-domain event-channel management, options are: open, close, send, and status.
   1.599 +\section{ grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)}
   1.600 +
   1.601 +\section{ vm\_assist(unsigned int cmd, unsigned int type)}
   1.602 +
    1.603 +\section{ update\_va\_mapping\_otherdomain(unsigned long page\_nr, unsigned long val, unsigned long flags, uint16\_t domid)}
   1.604  
   1.605  \end{document}