ia64/xen-unstable

changeset 2860:61d139354129

bitkeeper revision 1.1159.157.1 (418934b0-qzq3Mn8ZcEAFEUYycHb2Q)

Device section fixes.
author akw27@labyrinth.cl.cam.ac.uk
date Wed Nov 03 19:42:40 2004 +0000 (2004-11-03)
parents 82a6612c741f
children 2aafe0f9a4dc
files docs/src/interface.tex
line diff
     1.1 --- a/docs/src/interface.tex	Wed Nov 03 18:02:06 2004 +0000
     1.2 +++ b/docs/src/interface.tex	Wed Nov 03 19:42:40 2004 +0000
     1.3 @@ -352,120 +352,171 @@ and loading its kernel image at the appr
     1.4  
     1.5  
     1.6  
     1.7 -\chapter{Network I/O}
     1.8 +\chapter{Devices}
     1.9 +
    1.10 +Devices such as network and disk are exported to guests using a
    1.11 +split device driver.  The device driver domain, which accesses the
    1.12 +physical device directly, also runs a {\em backend} driver, serving
    1.13 +requests to that device from guests.  Each guest uses a simple
    1.14 +{\em frontend} driver to access the backend.  Communication between these
    1.15 +domains is composed of two parts:  First, data is placed onto a shared
    1.16 +memory page between the domains.  Second, an event channel between the
    1.17 +two domains is used to pass notification that data is outstanding.
    1.18 +This separation of notification from data transfer allows message
    1.19 +batching, and results in very efficient device access.  Specific
    1.20 +details on inter-domain communication are available in
    1.21 +Appendix~\ref{s:idc}.
    1.22 +
    1.23 +This chapter provides details on some individual device interfaces
    1.24 +available to Xen guests. 
    1.25 +
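The shared-page-plus-event-channel arrangement described above can be sketched in C.  This is a minimal illustration under stated assumptions, not the actual Xen header: every type and field name here (`ring_msg_t`, `req_prod`, and so on) is invented for the example, and the ring size is an assumed power of two so that index masking works.

```c
/* Minimal sketch of a split-driver shared ring page.  All names and
 * sizes are illustrative, not the real Xen definitions. */
#include <stdint.h>

#define RING_SIZE 64u                 /* assumed power of two */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
    uint32_t id;                      /* request id, echoed in the response */
    uint32_t frame;                   /* page frame carrying the data */
} ring_msg_t;

typedef struct {
    /* Producer indices live on the shared page; each side keeps its
     * consumer index private. */
    uint32_t req_prod;                /* advanced by the frontend */
    uint32_t rsp_prod;                /* advanced by the backend */
    ring_msg_t msgs[RING_SIZE];
} shared_ring_t;

/* Indices increase without wrapping; masking maps an index to a slot. */
static inline uint32_t ring_slot(uint32_t idx) { return idx & RING_MASK; }
```

Because notification travels separately over the event channel, a sender can place several messages on such a ring before raising a single event, which is the batching behaviour the text refers to.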
    1.26 +\section{Network I/O}
    1.27  
    1.28  Virtual network device services are provided by shared memory
    1.29 -communications with a `backend' domain.  From the point of view of
    1.30 +communications with a backend domain.  From the point of view of
    1.31  other domains, the backend may be viewed as a virtual ethernet switch
    1.32  element with each domain having one or more virtual network interfaces
    1.33  connected to it.
    1.34  
    1.35 -\section{Backend Packet Handling}
    1.36 -The backend driver is responsible primarily for {\it data-path} operations.
    1.37 -In terms of networking this means packet transmission and reception.
    1.38 +\subsection{Backend Packet Handling}
    1.39  
    1.40 -On the transmission side, the backend needs to perform two key actions:
    1.41 +The backend driver is responsible for a variety of actions relating to
    1.42 +the transmission and reception of packets from the physical device.
    1.43 +With regard to transmission, the backend performs these key actions:
    1.44 +
    1.45  \begin{itemize}
    1.46 -\item {\tt Validation:} A domain may only be allowed to emit packets
    1.47 -matching a certain specification; for example, ones in which the
    1.48 -source IP address matches one assigned to the virtual interface over
    1.49 -which it is sent.  The backend would be responsible for ensuring any
    1.50 -such requirements are met, either by checking or by stamping outgoing
    1.51 -packets with prescribed values for certain fields.
    1.52 +\item {\bf Validation:} To ensure that domains do not attempt to
    1.53 +  generate invalid (e.g. spoofed) traffic, the backend driver may
    1.54 +  validate headers, ensuring that source MAC and IP addresses match the
    1.55 +  interface from which they were sent.
    1.56  
    1.57 -Validation functions can be configured using standard firewall rules
    1.58 -(i.e. IP Tables, in the case of Linux).
    1.59 -
    1.60 -\item {\tt Scheduling:} Since a number of domains can share a single
    1.61 -``real'' network interface, the hypervisor must mediate access when
    1.62 -several domains each have packets queued for transmission.  Of course,
    1.63 -this general scheduling function subsumes basic shaping or
    1.64 -rate-limiting schemes.
    1.65 -
    1.66 -\item {\tt Logging and Accounting:} The hypervisor can be configured
    1.67 -with classifier rules that control how packets are accounted or
    1.68 -logged.  For example, {\it domain0} could request that it receives a
    1.69 -log message or copy of the packet whenever another domain attempts to
    1.70 -send a TCP packet containg a SYN.
    1.71 +  Validation functions can be configured using standard firewall rules
    1.72 +  ({\small{\tt iptables}} in the case of Linux).
    1.73 +  
    1.74 +\item {\bf Scheduling:} Since a number of domains can share a single
    1.75 +  physical network interface, the backend must mediate access when
    1.76 +  several domains each have packets queued for transmission.  This
    1.77 +  general scheduling function subsumes basic shaping or rate-limiting
    1.78 +  schemes.
    1.79 +  
    1.80 +\item {\bf Logging and Accounting:} The backend domain can be
    1.81 +  configured with classifier rules that control how packets are
    1.82 +  accounted or logged.  For example, log messages might be generated
    1.83 +  whenever a domain attempts to send a TCP packet containing a SYN.
    1.84  \end{itemize}
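The validation step above is typically expressed as ordinary firewall rules.  The rule below is a hypothetical sketch only: the interface name ({\tt vif1.0}) and the guest's address are invented placeholders, and the exact {\tt iptables} syntax for bridged traffic may vary between versions.

```shell
# Drop any bridged packet entering from the guest's virtual interface
# whose source address is not the one assigned to that guest.
# (vif1.0 and 192.0.2.2 are illustrative placeholders.)
iptables -A FORWARD -m physdev --physdev-in vif1.0 \
         ! -s 192.0.2.2 -j DROP
```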
    1.85  
    1.86 -On the recive side, the backend's role is relatively straightforward:
    1.87 -once a packet is received, it just needs to determine the virtual interface(s)
    1.88 -to which it must be delivered and deliver it via page-flipping. 
    1.89 +On receipt of incoming packets, the backend acts as a simple
    1.90 +demultiplexer: packets are passed to the appropriate virtual
    1.91 +interface after any necessary logging and accounting have been carried
    1.92 +out.
    1.93  
    1.94 -
    1.95 -\section{Data Transfer}
    1.96 +\subsection{Data Transfer}
    1.97  
    1.98  Each virtual interface uses two ``descriptor rings'', one for transmit,
    1.99  the other for receive.  Each descriptor identifies a block of contiguous
   1.100 -physical memory allocated to the domain.  There are four cases:
   1.101 +physical memory allocated to the domain.  
   1.102  
   1.103 -\begin{itemize}
   1.104 -
   1.105 -\item The transmit ring carries packets to transmit from the domain to the
   1.106 -hypervisor.
   1.107 +The transmit ring carries packets to transmit from the guest to the
   1.108 +backend domain.  The return path of the transmit ring carries messages
   1.109 +indicating that the contents have been physically transmitted and the
   1.110 +backend no longer requires the associated pages of memory.
   1.111  
   1.112 -\item The return path of the transmit ring carries ``empty'' descriptors
   1.113 -indicating that the contents have been transmitted and the memory can be
   1.114 -re-used.
   1.115 -
   1.116 -\item The receive ring carries empty descriptors from the domain to the 
   1.117 -hypervisor; these provide storage space for that domain's received packets.
   1.118 +To receive packets, the guest places descriptors of unused pages on
   1.119 +the receive ring.  The backend will return received packets by
   1.120 +exchanging these pages in the domain's memory with new pages
   1.121 +containing the received data, and passing back descriptors regarding
   1.122 +the new packets on the ring.  This zero-copy approach allows the
   1.123 +backend to maintain a pool of free pages to receive packets into, and
   1.124 +then deliver them to appropriate domains after examining their
   1.125 +headers.
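The page exchange described above can be modelled with a toy example.  This is a sketch of the idea only; the structures and function names are invented, and the real mechanism operates on machine page mappings rather than C pointers.

```c
/* Toy model of the zero-copy receive path: the backend fills a page
 * from its own free pool, then exchanges it with an empty page the
 * guest posted on its receive ring.  All names are illustrative. */

typedef struct { int len; char data[16]; } page_t;

/* Swap the two mappings: the filled pool page becomes the guest's, and
 * the guest's empty page refills the backend's pool. */
static void exchange_pages(page_t **guest_slot, page_t **pool_slot) {
    page_t *filled = *pool_slot;
    *pool_slot = *guest_slot;   /* pool regains a blank page */
    *guest_slot = filled;       /* guest now maps the packet data */
}

/* Tiny self-check: returns 1 if the exchange swapped the two pages. */
static int demo_exchange(void) {
    page_t filled = { .len = 5, .data = "hello" };
    page_t empty  = { .len = 0, .data = "" };
    page_t *guest_slot = &empty, *pool_slot = &filled;
    exchange_pages(&guest_slot, &pool_slot);
    return guest_slot == &filled && pool_slot == &empty;
}
```

The point of the exchange, as the text notes, is that the backend never copies packet payloads: it only decides, after examining headers, which guest each already-filled page should be handed to.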
   1.126  
   1.127 -\item The return path of the receive ring carries packets that have been
   1.128 -received.
   1.129 -\end{itemize}
   1.130 -
   1.131 -Real physical addresses are used throughout, with the domain performing 
   1.132 -translation from pseudo-physical addresses if that is necessary.
   1.133 +%
   1.134 +%Real physical addresses are used throughout, with the domain performing 
   1.135 +%translation from pseudo-physical addresses if that is necessary.
   1.136  
   1.137  If a domain does not keep its receive ring stocked with empty buffers then 
   1.138 -packets destined to it may be dropped.  This provides some defense against 
   1.139 -receiver-livelock problems because an overload domain will cease to receive
    1.140 +packets destined for it may be dropped.  This provides some defence against
    1.141 +receive livelock problems because an overloaded domain will cease to receive
   1.142  further data.  Similarly, on the transmit path, it provides the application
   1.143  with feedback on the rate at which packets are able to leave the system.
   1.144  
   1.145 -Synchronization between the hypervisor and the domain is achieved using 
   1.146 -counters held in shared memory that is accessible to both.  Each ring has
   1.147 -associated producer and consumer indices indicating the area in the ring
   1.148 -that holds descriptors that contain data.  After receiving {\it n} packets
   1.149 -or {\t nanoseconds} after receiving the first packet, the hypervisor sends
   1.150 -an event to the domain. 
   1.151  
   1.152 -\chapter{Block I/O}
    1.153 +Flow control on rings is achieved by including a pair of producer
    1.154 +indices on the shared ring page.  Each side maintains a private
   1.155 +consumer index indicating the next outstanding message.  In this
   1.156 +manner, the domains cooperate to divide the ring into two message
   1.157 +lists, one in each direction.  Notification is decoupled from the
   1.158 +immediate placement of new messages on the ring; the event channel
   1.159 +will be used to generate notification when {\em either} a certain
   1.160 +number of outstanding messages are queued, {\em or} a specified number
   1.161 +of nanoseconds have elapsed since the oldest message was placed on the
   1.162 +ring.
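The flow-control and notification scheme above can be sketched in C.  The structure, field names and batching threshold below are all illustrative assumptions, not Xen's real definitions; a timer (not shown) would cover the elapsed-nanoseconds trigger.

```c
/* Sketch of ring flow control: both producer indices sit on the shared
 * page, each side keeps a private consumer index, and an event is only
 * raised once enough messages have accumulated.  Names and the
 * threshold are illustrative. */
#include <stdint.h>

#define NOTIFY_BATCH 8u   /* assumed batching threshold */

typedef struct {
    uint32_t req_prod;    /* shared: advanced by the frontend */
    uint32_t rsp_prod;    /* shared: advanced by the backend */
} shared_idx_t;

/* Requests the backend has not yet consumed; req_cons is the backend's
 * private consumer index. */
static inline uint32_t unconsumed(const shared_idx_t *s, uint32_t req_cons) {
    return s->req_prod - req_cons;
}

/* Raise an event over the event channel only when a batch is queued. */
static inline int should_notify(const shared_idx_t *s, uint32_t req_cons) {
    return unconsumed(s, req_cons) >= NOTIFY_BATCH;
}

/* Self-check: with req_prod at 10, seven requests remain unconsumed
 * from index 3 (below threshold) and eight from index 2 (at it). */
static int demo_flow(void) {
    shared_idx_t s = { .req_prod = 10, .rsp_prod = 0 };
    return unconsumed(&s, 3) == 7
        && !should_notify(&s, 3)
        && should_notify(&s, 2);
}
```

Because only producer indices are shared, each side can advance its consumer index without touching the shared page, which keeps the two message lists independent.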
   1.163  
   1.164 -\section{Virtual Block Devices (VBDs)}
   1.165 +% Not sure if my version is any better -- here is what was here before:
   1.166 +%% Synchronization between the backend domain and the guest is achieved using 
   1.167 +%% counters held in shared memory that is accessible to both.  Each ring has
   1.168 +%% associated producer and consumer indices indicating the area in the ring
   1.169 +%% that holds descriptors that contain data.  After receiving {\it n} packets
   1.170 +%% or {\t nanoseconds} after receiving the first packet, the hypervisor sends
   1.171 +%% an event to the domain. 
   1.172 +
   1.173 +\section{Block I/O}
   1.174  
   1.175 -All guest OS disk access goes through the VBD interface.  The VBD
   1.176 -interface provides the administrator with the ability to selectively
   1.177 -grant domains access to portions of block storage devices visible to
   1.178 -the the block backend device (usually domain 0).
   1.179 +All guest OS disk access goes through the virtual block device (VBD)
   1.180 +interface.  This interface allows domains access to portions of block
   1.181 +storage devices visible to the block backend device.  The VBD
   1.182 +interface is a split driver, similar to the network interface
   1.183 +described above.  A single shared memory ring is used between the
   1.184 +frontend and backend drivers, across which read and write messages are
   1.185 +sent.
   1.186  
   1.187 -VBDs can literally be backed by any block device accessible to the
   1.188 -backend domain, including network-based block devices (iSCSI, *NBD,
   1.189 -etc), loopback devices and LVM / MD devices.
   1.190 +Any block device accessible to the backend domain, including
   1.191 +network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD devices,
   1.192 +can be exported as a VBD.  Each VBD is mapped to a device node in the
   1.193 +guest, specified in the guest's startup configuration.
   1.194  
   1.195  Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
   1.196  similar functionality can be achieved using the (more advanced) LVM
   1.197  system, which is already in widespread use.
   1.198  
   1.199  \subsection{Data Transfer}
   1.200 -Domains which have been granted access to a logical block device are permitted
   1.201 -to read and write it by shared memory communications with the backend domain. 
   1.202 +
   1.203 +The single ring between the guest and the block backend supports three
   1.204 +messages:
   1.205  
   1.206 -In overview, the same style of descriptor-ring that is used for
   1.207 -network packets is used here.  Each domain has one ring that carries
   1.208 -operation requests to the hypervisor and carries the results back
   1.209 -again.
   1.210 +\begin{description}
   1.211 +\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to this guest
   1.212 +  from the backend.  The request includes a descriptor of a free page
   1.213 +  into which the reply will be written by the backend.
   1.214 +
   1.215 +\item [{\small {\tt READ}}:] Read data from the specified block device.  The
   1.216 +  frontend identifies the device and location to read from and
   1.217 +  attaches pages for the data to be copied to (typically via DMA from
   1.218 +  the device).  The backend acknowledges completed read requests as
   1.219 +  they finish.
   1.220  
   1.221 -Rather than copying data, the backend simply maps the domain's buffers
   1.222 -in order to enable direct DMA to them.  The act of mapping the buffers
   1.223 -also increases the reference counts of the underlying pages, so that
   1.224 -the unprivileged domain cannot try to return them to the hypervisor,
   1.225 -install them as page tables, or any other unsafe behaviour.
   1.226 -%block API here 
   1.227 +\item [{\small {\tt WRITE}}:] Write data to the specified block device.  This
   1.228 +  functions essentially as {\small {\tt READ}}, except that the data moves to
   1.229 +  the device instead of from it.
   1.230 +\end{description}
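The three messages above might be represented by a request structure along the following lines.  This layout is purely illustrative, mirroring the description in the text; the real Xen structures differ.

```c
/* Illustrative request layout for the block-device ring; field names
 * and types are invented to mirror the PROBE/READ/WRITE description. */
#include <stdint.h>

typedef enum { BLK_PROBE, BLK_READ, BLK_WRITE } blk_op_t;

typedef struct {
    blk_op_t op;        /* PROBE, READ or WRITE */
    uint32_t id;        /* echoed by the backend in its response */
    uint16_t device;    /* which VBD (unused for PROBE) */
    uint64_t sector;    /* starting sector for READ/WRITE */
    uint32_t frame;     /* data page, or the PROBE reply page */
} blk_request_t;

/* A PROBE carries only a reply page, so device and sector stay zero. */
static blk_request_t make_probe(uint32_t id, uint32_t reply_frame) {
    blk_request_t r = { .op = BLK_PROBE, .id = id, .frame = reply_frame };
    return r;
}

/* Self-check: a PROBE request carries its reply frame and nothing else. */
static int demo_probe(void) {
    blk_request_t r = make_probe(1, 42);
    return r.op == BLK_PROBE && r.frame == 42
        && r.device == 0 && r.sector == 0;
}
```

Since READ and WRITE differ only in the direction the data moves, a single structure like this suffices for all three operations, with the backend acknowledging each request as it completes.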
   1.231 +
   1.232 +% um... some old text
   1.233 +%% In overview, the same style of descriptor-ring that is used for
   1.234 +%% network packets is used here.  Each domain has one ring that carries
   1.235 +%% operation requests to the hypervisor and carries the results back
   1.236 +%% again.
   1.237  
   1.238 -\chapter{Privileged operations}
   1.239 +%% Rather than copying data, the backend simply maps the domain's buffers
   1.240 +%% in order to enable direct DMA to them.  The act of mapping the buffers
   1.241 +%% also increases the reference counts of the underlying pages, so that
   1.242 +%% the unprivileged domain cannot try to return them to the hypervisor,
   1.243 +%% install them as page tables, or any other unsafe behaviour.
   1.244 +%% %block API here 
   1.245 +
   1.246 +
   1.247 +% akw: demoting this to a section -- not sure if there is any point
   1.248 +% though, maybe just remove it.
   1.249 +\section{Privileged operations}
   1.250  {\it Domain0} is responsible for building all other domains on the server
   1.251  and providing control interfaces for managing scheduling, networking, and
   1.252  blocks.