% docs/src/interface/devices.tex @ 6979:f8e7af29daa1
\chapter{Devices}
\label{c:devices}

Devices such as network and disk are exported to guests using a split
device driver.  The device driver domain, which accesses the physical
device directly, also runs a \emph{backend} driver, serving requests to
that device from guests.  Each guest uses a simple \emph{frontend}
driver to access the backend.  Communication between these domains is
composed of two parts: first, data is placed onto a shared memory page
between the domains; second, an event channel between the two domains
is used to pass notification that data is outstanding.  This
separation of notification from data transfer allows message batching,
and results in very efficient device access.

Event channels are used extensively in device virtualization; each
domain has a number of end-points or \emph{ports}, each of which may be
bound to one of the following \emph{event sources}:
\begin{itemize}
\item a physical interrupt from a real device,
\item a virtual interrupt (callback) from Xen, or
\item a signal from another domain.
\end{itemize}

Events are lightweight and do not carry much information beyond the
source of the notification.  Hence when performing bulk data transfer,
events are typically used as synchronization primitives over a shared
memory transport.  Event channels are managed via the {\tt
event\_channel\_op()} hypercall; for more details see
Section~\ref{s:idc}.
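
As a concrete (but purely illustrative) picture of this pattern, the
sketch below shows a frontend placing a request on a shared page and
then signalling the backend.  The structure layout, the ring size and
the {\tt notify\_via\_evtchn()} helper are hypothetical stand-ins
rather than the real Xen definitions; an actual driver would use the
shared-memory and event-channel interfaces exported in the public
headers.
\begin{verbatim}
#include <stdint.h>

#define RING_SIZE 32      /* slots on the shared page (assumed size) */

struct request { uint64_t id; uint64_t sector; };  /* hypothetical */

struct shared_page {
    uint32_t req_prod;    /* advanced by the frontend */
    uint32_t req_cons;    /* advanced by the backend  */
    struct request ring[RING_SIZE];
};

/* Stand-in for sending an event on a bound port, e.g. by issuing
 * the event_channel_op() hypercall from kernel context. */
static void notify_via_evtchn(int port) { (void)port; }

/* Queue one request, then tell the backend that work is outstanding. */
void send_request(struct shared_page *sp, int port,
                  const struct request *r)
{
    uint32_t prod = sp->req_prod;
    sp->ring[prod % RING_SIZE] = *r;
    __sync_synchronize();          /* publish the data first ...     */
    sp->req_prod = prod + 1;       /* ... then the producer index    */
    notify_via_evtchn(port);       /* batching could defer this call */
}
\end{verbatim}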

This chapter focuses on some individual device interfaces available to
Xen guests.

\section{Network I/O}

Virtual network device services are provided by shared memory
communication with a backend domain.  From the point of view of other
domains, the backend may be viewed as a virtual ethernet switch
element with each domain having one or more virtual network interfaces
connected to it.

\subsection{Backend Packet Handling}

The backend driver is responsible for a variety of actions relating to
the transmission and reception of packets via the physical device.
With regard to transmission, the backend performs these key actions:

\begin{itemize}
\item {\bf Validation:} To ensure that domains do not attempt to
generate invalid (e.g.\ spoofed) traffic, the backend driver may
validate headers, ensuring that source MAC and IP addresses match
the interface from which they have been sent.

Validation functions can be configured using standard firewall rules
({\small{\tt iptables}} in the case of Linux).

\item {\bf Scheduling:} Since a number of domains can share a single
physical network interface, the backend must mediate access when
several domains each have packets queued for transmission.  This
general scheduling function subsumes basic shaping or rate-limiting
schemes; a sketch of one simple policy follows this list.

\item {\bf Logging and Accounting:} The backend domain can be
configured with classifier rules that control how packets are
accounted or logged.  For example, log messages might be generated
whenever a domain attempts to send a TCP packet containing a SYN.
\end{itemize}
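
The scheduling step can be as simple as a round-robin sweep over
per-domain transmit queues, sending at most one packet per domain per
pass so that no guest can monopolise the physical interface.  The
sketch below is one such policy under invented names ({\tt
domain\_txq}, {\tt dequeue\_and\_send}); it is not the scheduler the
backend actually implements.
\begin{verbatim}
#include <stddef.h>

#define MAX_DOMAINS 16                  /* illustrative upper bound */

/* Hypothetical per-domain transmit state kept by the backend. */
struct domain_txq {
    int has_packet;                     /* non-zero if a packet is queued */
    int (*dequeue_and_send)(int domid); /* hands one packet to the NIC    */
};

static struct domain_txq txq[MAX_DOMAINS];
static int next_dom;                    /* rotate the starting point */

/* One sweep: transmit at most one packet for each domain with work. */
void tx_round_robin(void)
{
    for (int i = 0; i < MAX_DOMAINS; i++) {
        int d = (next_dom + i) % MAX_DOMAINS;
        if (txq[d].has_packet && txq[d].dequeue_and_send != NULL)
            txq[d].dequeue_and_send(d);
    }
    next_dom = (next_dom + 1) % MAX_DOMAINS;
}
\end{verbatim}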

On receipt of incoming packets, the backend acts as a simple
demultiplexer: packets are passed to the appropriate virtual interface
after any necessary logging and accounting have been carried out.

\subsection{Data Transfer}

Each virtual interface uses two ``descriptor rings'', one for
transmit, the other for receive.  Each descriptor identifies a block
of contiguous physical memory allocated to the domain.

The transmit ring carries packets to transmit from the guest to the
backend domain.  The return path of the transmit ring carries messages
indicating that the contents have been physically transmitted and that
the backend no longer requires the associated pages of memory.

To receive packets, the guest places descriptors of unused pages on
the receive ring.  The backend will return received packets by
exchanging these pages in the domain's memory with new pages
containing the received data, and passing back descriptors regarding
the new packets on the ring.  This zero-copy approach allows the
backend to maintain a pool of free pages to receive packets into, and
then deliver them to appropriate domains after examining their
headers.
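
The page exchange can be pictured as below.  This is a minimal sketch
under assumed names ({\tt mfn\_t}, {\tt rx\_slot} and {\tt
recycle\_to\_free\_pool()} are invented for illustration); a real
backend must also transfer page ownership through the hypervisor,
which this sketch omits.
\begin{verbatim}
#include <stdint.h>

typedef uint64_t mfn_t;     /* machine frame number (illustrative type) */

struct rx_slot {            /* one receive descriptor on the ring */
    mfn_t    guest_page;    /* page currently owned by the guest  */
    uint16_t len;           /* length of the delivered packet     */
};

/* Return a donated page to the backend's private free pool (stub). */
static void recycle_to_free_pool(mfn_t page) { (void)page; }

/* Deliver one packet: the pool page the NIC wrote into is handed to
 * the guest, and the empty page the guest posted takes its place. */
void deliver_packet(struct rx_slot *slot, mfn_t filled_page, uint16_t len)
{
    mfn_t donated = slot->guest_page;  /* empty page posted by the guest */
    slot->guest_page = filled_page;    /* guest now owns the packet data */
    slot->len = len;
    recycle_to_free_pool(donated);     /* keep the free pool topped up   */
}
\end{verbatim}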

% Real physical addresses are used throughout, with the domain
% performing translation from pseudo-physical addresses if that is
% necessary.

If a domain does not keep its receive ring stocked with empty buffers
then packets destined to it may be dropped.  This provides some
defence against receive livelock problems because an overloaded domain
will cease to receive further data.  Similarly, on the transmit path,
it provides the application with feedback on the rate at which packets
are able to leave the system.

Flow control on rings is achieved by including a pair of producer
indices on the shared ring page.  Each side will maintain a private
consumer index indicating the next outstanding message.  In this
manner, the domains cooperate to divide the ring into two message
lists, one in each direction.  Notification is decoupled from the
immediate placement of new messages on the ring; the event channel
will be used to generate notification when {\em either} a certain
number of outstanding messages are queued, {\em or} a specified number
of nanoseconds have elapsed since the oldest message was placed on the
ring.
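
The sketch below models this arrangement for one ring.  All names are
illustrative (the real Xen network-interface ring layout differs in
detail); the point is only that the producer indices live on the
shared page while each consumer index stays private to its owner.
\begin{verbatim}
#include <stdint.h>

#define RING_LEN 256u   /* power of two so indices may wrap (assumed) */

struct msg { uint64_t data; };         /* placeholder message payload */

/* Shared between the two domains: producer indices only. */
struct shared_ring {
    uint32_t   tx_prod;                /* written by the guest   */
    uint32_t   rx_prod;                /* written by the backend */
    struct msg tx[RING_LEN];
    struct msg rx[RING_LEN];
};

/* Private to the guest: its consumer index for the rx direction. */
static uint32_t rx_cons;

/* Consume every response the backend has produced since last time. */
void guest_poll_rx(struct shared_ring *r,
                   void (*handle)(const struct msg *))
{
    while (rx_cons != r->rx_prod) {
        handle(&r->rx[rx_cons % RING_LEN]);
        rx_cons++;
    }
}

/* Publish one new transmit message.  A real driver would also bound
 * the number of outstanding messages and only send an event once a
 * batch has built up or a timeout has expired, as described above. */
void guest_put_tx(struct shared_ring *r, const struct msg *m)
{
    r->tx[r->tx_prod % RING_LEN] = *m;
    __sync_synchronize();              /* message before index update */
    r->tx_prod++;
}
\end{verbatim}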

%% Not sure if my version is any better -- here is what was here
%% before: Synchronization between the backend domain and the guest is
%% achieved using counters held in shared memory that is accessible to
%% both.  Each ring has associated producer and consumer indices
%% indicating the area in the ring that holds descriptors that contain
%% data.  After receiving {\it n} packets or {\tt nanoseconds} after
%% receiving the first packet, the hypervisor sends an event to the
%% domain.

\section{Block I/O}

All guest OS disk access goes through the virtual block device (VBD)
interface.  This interface allows domains access to portions of block
storage devices visible to the block backend device.  The VBD
interface is a split driver, similar to the network interface
described above.  A single shared memory ring is used between the
frontend and backend drivers, across which read and write messages are
sent.

Any block device accessible to the backend domain, including
network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD
devices, can be exported as a VBD.  Each VBD is mapped to a device
node in the guest, specified in the guest's startup configuration.

Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
similar functionality can be achieved using the more complete LVM
system, which is already in widespread use.

\subsection{Data Transfer}

The single ring between the guest and the block backend supports three
messages (an illustrative request layout is sketched after the list):

\begin{description}
\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to
this guest from the backend.  The request includes a descriptor of a
free page into which the reply will be written by the backend.

\item [{\small {\tt READ}}:] Read data from the specified block
device.  The frontend identifies the device and location to read
from and attaches pages for the data to be copied to (typically via
DMA from the device).  The backend acknowledges completed read
requests as they finish.

\item [{\small {\tt WRITE}}:] Write data to the specified block
device.  This functions essentially as {\small {\tt READ}}, except
that the data moves to the device instead of from it.
\end{description}
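
To make the message format more concrete, the sketch below shows one
plausible layout for a request and its response.  The field names,
sizes and the segment limit are illustrative assumptions and do not
reproduce the real {\tt blkif} ring definitions.
\begin{verbatim}
#include <stdint.h>

/* Illustrative layout only; the real blkif definitions differ. */
enum blk_op { OP_PROBE, OP_READ, OP_WRITE };

#define MAX_SEGMENTS 11    /* pages attachable to one request (assumed) */

struct blk_segment {
    uint64_t frame;        /* page to DMA into (READ) or out of (WRITE) */
    uint8_t  first_sect;   /* first sector used within that page        */
    uint8_t  last_sect;    /* last sector used within that page         */
};

struct blk_request {
    uint8_t  operation;    /* one of enum blk_op             */
    uint8_t  nr_segments;  /* valid entries in seg[]         */
    uint16_t device;       /* which VBD this request targets */
    uint64_t id;           /* echoed back in the response    */
    uint64_t sector_number;/* starting sector on the VBD     */
    struct blk_segment seg[MAX_SEGMENTS];
};

struct blk_response {
    uint64_t id;           /* matches blk_request.id       */
    uint8_t  operation;    /* operation being acknowledged */
    int16_t  status;       /* success or an error code     */
};
\end{verbatim}
Under these assumptions, a {\small {\tt READ}} would be issued by
filling in {\tt operation}, the target {\tt device} and {\tt
sector\_number}, attaching one segment per destination page, and
placing the structure on the shared ring as in the network case.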

%% um... some old text: In overview, the same style of descriptor-ring
%% that is used for network packets is used here.  Each domain has one
%% ring that carries operation requests to the hypervisor and carries
%% the results back again.

%% Rather than copying data, the backend simply maps the domain's
%% buffers in order to enable direct DMA to them.  The act of mapping
%% the buffers also increases the reference counts of the underlying
%% pages, so that the unprivileged domain cannot try to return them to
%% the hypervisor, install them as page tables, or any other unsafe
%% behaviour.
%%
%% % block API here