=== How the Blkif Drivers Work ===
Andrew Warfield
andrew.warfield@cl.cam.ac.uk

The intent of this is to explain at a fairly detailed level how the
split device drivers work in Xen 1.3 (aka 2.0beta). The intended
audience for this, I suppose, is anyone who intends to work with the
existing blkif interfaces and wants something to help them get up to
speed with the code in a hurry. Secondly though, I hope to break out
the general mechanisms that are used in the drivers that are likely to
be necessary to implement other driver interfaces.

As a point of warning before starting, it is worth mentioning that I
anticipate many of the specifics described here changing in the near
future. There has been talk about making the blkif protocol
a bit more efficient than it currently is. Keir's addition of grant
tables will change the current remapping code that is used when shared
pages are initially set up.

Also, writing other control interface types will likely need support
from Xend, which at the moment has a steep learning curve... this
should be addressed in the future.

For more information on the driver model as a whole, read the
"Reconstructing I/O" technical report
(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).

==== High-level structure of a split-driver interface ====

Why would you want to write a split driver in the first place? As Xen
is a virtual machine manager and focuses on isolation as an initial
design principle, it is generally considered unwise to share physical
access to devices across domains. The reasons for this are obvious:
when device resources are shared, misbehaving code or hardware can
result in the failure of all of the client applications. Moreover, as
virtual machines in Xen are entire OSs, standard device drivers that
they might use cannot have multiple instantiations for a single piece
of hardware. In light of all this, the general approach in Xen is to
give a single virtual machine hardware access to a device, and where
other VMs want to share the device, export a higher-level interface to
facilitate that sharing. If you don't want to share, that's fine.
There are currently Xen users actively exploring running two
completely isolated X-Servers on a Xen host, each with its own video
card, keyboard, and mouse. In these situations, the guests need only
be given physical access to the necessary devices and left to go on
their own. However, for devices such as disks and network interfaces,
where sharing is required, the split driver approach is a good
solution.

The structure is like this:

    +--------------------------+  +--------------------------+
    | Domain 0 (privileged)    |  | Domain 1 (unprivileged)  |
    |                          |  |                          |
    | Xend ( Application )     |  |                          |
    | Blkif Backend Driver     |  | Blkif Frontend Driver    |
    | Physical Device Driver   |  |                          |
    +--------------------------+  +--------------------------+
    +--------------------------------------------------------+
    |                       X  E  N                          |
    +--------------------------------------------------------+

The Blkif driver is in two parts, which we refer to as a frontend (FE)
and a backend (BE). Together, they serve to proxy device requests
between the guest operating system in an unprivileged domain and the
physical device driver in the privileged domain. An additional benefit
to this approach is that the FE driver can provide a single interface
for a whole class of physical devices. The blkif interface mounts
IDE, SCSI, and our own VBD-structured disks, independent of the
physical driver underneath. Moreover, supporting additional OSs only
requires that a new FE driver be written to connect to the existing
backend.

==== Inter-Domain Communication Mechanisms ====

===== Event Channels =====

Before getting into the specifics of the block interface driver, it is
worth discussing the mechanisms that are used to communicate between
domains. Two mechanisms are used to allow the construction of
high-performance drivers: event channels and shared-memory rings.

Event channels are an asynchronous interdomain notification
mechanism. Xen allows channels to be instantiated between two
domains, and domains can request that a virtual irq be attached to
notifications on a given channel. The result of this is that the
frontend domain can send a notification on an event channel, resulting
in an interrupt entry into the backend at a later time.

The event channel between two domains is instantiated in the Xend code
during driver startup (described later). Xend's channel.py
(tools/python/xen/xend/server/channel.py) defines the function

{{{
def eventChannel(dom1, dom2):
    return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)
}}}

which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
which in turn generates a hypercall to Xen to patch the event channel
between the domains. Only a privileged domain can request the
creation of an event channel.

Once the event channel is created in Xend, its ends are passed to both the
front and backend domains over the control channel. The end that is
passed to a domain is just an integer "port" uniquely identifying the
event channel's local connection to that domain. An example of this
setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
blkif_connect(), which receives several status change events as
the driver starts up. It is passed an event channel end in a
BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:

{{{
blkif_evtchn = status->evtchn;
blkif_irq = bind_evtchn_to_irq(blkif_evtchn);
if ( (rc = request_irq(blkif_irq, blkif_int,
                       SA_SAMPLE_RANDOM, "blkif", NULL)) )
    printk(KERN_ALERT "blkfront request_irq failed (%ld)\n", rc);
}}}

This code associates a virtual irq with the event channel, and
attaches the function blkif_int() as an interrupt handler for that
irq. blkif_int() simply handles the notification and returns; it does
not need to interact with the channel at all.

An example of generating a notification can also be seen in blkfront.c:

{{{
static inline void flush_requests(void)
{
    wmb(); /* Ensure that the backend can see the requests. */
    blk_ring->req_prod = req_prod;
    notify_via_evtchn(blkif_evtchn);
}
}}}

notify_via_evtchn() issues a hypercall to set the event waiting flag on
the other domain's end of the channel.

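To make the "event waiting flag" idea a little more concrete, here is
a minimal user-space sketch that models one end of an event channel as
a pending flag plus a bound handler. Every name in it (toy_evtchn,
toy_notify, toy_dispatch) is hypothetical, and it makes no attempt to
reproduce the real hypercalls or virtual irq plumbing. The point it
illustrates is that notifications are asynchronous and may coalesce,
so a handler should look at the shared ring to find out how much work
there is rather than count interrupts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one end of an event channel. */
typedef void (*toy_handler_t)(void *ctx);

struct toy_evtchn {
    bool          pending;   /* set when the remote end notifies */
    toy_handler_t handler;   /* stands in for the bound irq handler */
    void         *ctx;       /* opaque argument passed to the handler */
};

/* Sender side: similar in spirit to notify_via_evtchn(). */
static void toy_notify(struct toy_evtchn *ch)
{
    ch->pending = true;      /* repeated notifications coalesce */
}

/* Receiver side: runs at some later time.
 * Returns the number of handler invocations (0 or 1). */
static int toy_dispatch(struct toy_evtchn *ch)
{
    if (!ch->pending || ch->handler == NULL)
        return 0;
    ch->pending = false;
    ch->handler(ch->ctx);
    return 1;
}

/* Example handler: counts how many times it was entered. */
static void toy_count_handler(void *ctx)
{
    ++*(int *)ctx;
}
```

In this model, two calls to toy_notify() followed by one call to
toy_dispatch() enter the handler only once.
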
===== Communication Rings =====

Event channels are strictly a notification mechanism between domains.
To move large chunks of data back and forth, Xen allows domains to
share pages of memory. We use communication rings as a means of
managing access to a shared memory page for message passing between
domains. These rings are not explicitly a mechanism of Xen, which is
only concerned with the actual sharing of the page and not how it is
used. They are, however, worth discussing as they are used in many
places in the current code and are a useful model for communicating
across a shared page.

A shared page is set up by a frontend guest first allocating a page
in its own address space and passing its address to the backend driver.

Consider the following code, also from blkfront.c. Note: this code
is in blkif_disconnect(). The driver transitions from STATE_CLOSED
to STATE_DISCONNECTED before becoming CONNECTED. The state machine
is in blkif_status().

{{{
blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
...
/* Construct an interface-CONNECT message for the domain controller. */
cmsg.type      = CMSG_BLKIF_FE;
cmsg.length    = sizeof(blkif_fe_interface_connect_t);
up.handle      = 0;
up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
memcpy(cmsg.msg, &up, sizeof(up));
}}}

blk_ring will be the shared page. The producer and consumer pointers
are then initialised (these will be discussed soon), and then the
machine address of the page is sent to the backend via a control
channel to Xend. This control channel itself uses the notification
and shared memory mechanisms described here, but is set up for each
domain automatically at startup.

The backend, which is a privileged domain, then takes the page address
and maps it into its own address space (in
linux26/drivers/xen/blkback/interface.c:blkif_connect()):

{{{
void blkif_connect(blkif_be_connect_t *connect)
{
    ...
    unsigned long shmem_frame = connect->shmem_frame;
    ...

    if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
    {
        connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
        return;
    }

    prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
    error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
                                    shmem_frame<<PAGE_SHIFT, PAGE_SIZE,
                                    prot, domid);

    ...

    blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
}
}}}

The machine frame number of the page (its machine address shifted
right by PAGE_SHIFT) is passed in the shmem_frame field of the connect
message. The page is then mapped into the virtual address space of the
backend domain, and the mapping is saved in the blkif structure
representing this particular backend connection.

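The shifting in the two snippets above is just a conversion between a
page-aligned machine address and a frame number. As a small sketch,
assuming 4 KiB pages (a PAGE_SHIFT of 12, which is an assumption here
and not something the code above states), and using hypothetical toy_
names:

```c
#define TOY_PAGE_SHIFT 12                      /* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

/* Frontend side: machine address of the ring page -> frame number. */
static unsigned long toy_addr_to_frame(unsigned long maddr)
{
    return maddr >> TOY_PAGE_SHIFT;
}

/* Backend side: frame number -> machine address to remap. */
static unsigned long toy_frame_to_addr(unsigned long frame)
{
    return frame << TOY_PAGE_SHIFT;
}
```

Any offset within the page is discarded by the round trip. That is
harmless for the ring, since a page returned by __get_free_page() is
page-aligned.
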
NOTE: New mechanisms will be added very shortly to allow domains to
explicitly grant access to their pages to other domains. This "grant
table" support is in the process of being added to the tree, and will
change the way a shared page is set up. In particular, it will remove
the need for the remapping domain to be privileged.

Sending data across shared rings:

Shared rings avoid the potential for write interference between
domains in a very cunning way. A ring is partitioned into a request
and a response region, and domains only work within their own space.
This can be thought of as a double producer-consumer ring -- the ring
is described by four pointers into a circular buffer of fixed-size
records. Pointers may only advance, and may not pass one another.

                         resp_cons----+
                                      V
           +----+----+----+----+----+----+----+
           |    |    |    free(A)   |RSP1|RSP2|
           +----+----+----+----+----+----+----+
 req_prod->|    |       -------->        |RSP3|
           +----+                        +----+
           |REQ8|                        |    |<-resp_prod
           +----+                        +----+
           |REQ7|                        |    |
           +----+                        +----+
           |REQ6|       <--------        |    |
           +----+----+----+----+----+----+----+
           |REQ5|REQ4|    free(B)   |    |    |
           +----+----+----+----+----+----+----+
 req_cons---------^

By adopting the convention that every request will receive a response,
not all four pointers need be shared and flow control on the ring
becomes very easy to manage. Each domain manages its own
consumer pointer, and the two producer pointers are visible to both
(xen/include/public/io/blkif.h):

{{{
/* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
#define BLKIF_RING_SIZE        64

...

/*
 * We use a special capitalised type name because it is _essential_ that all
 * arithmetic on indexes is done on an integer type of the correct size.
 */
typedef u32 BLKIF_RING_IDX;

/*
 * Ring indexes are 'free running'. That is, they are not stored modulo the
 * size of the ring buffer. The following macro converts a free-running counter
 * into a value that can directly index a ring-buffer array.
 */
#define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))

typedef struct {
    BLKIF_RING_IDX req_prod;  /*  0: Request producer. Updated by front-end. */
    BLKIF_RING_IDX resp_prod; /*  4: Response producer. Updated by back-end. */
    union {                   /*  8 */
        blkif_request_t  req;
        blkif_response_t resp;
    } PACKED ring[BLKIF_RING_SIZE];
} PACKED blkif_ring_t;
}}}

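The comment about free-running indexes is worth a quick illustration.
The sketch below uses hypothetical names (TOY_RING_SIZE, TOY_MASK_IDX,
toy_ring_used) and assumes, as the masking macro above requires, that
the ring size is a power of two. It shows that masking gives a valid
slot number for any counter value, and that unsigned subtraction still
gives the right distance between two indexes after the 32-bit counters
wrap.

```c
#include <stdint.h>

#define TOY_RING_SIZE 64u                      /* must be a power of two */
#define TOY_MASK_IDX(i) ((i) & (TOY_RING_SIZE - 1))

/* Number of entries between a consumer index and a producer index.
 * Unsigned arithmetic wraps modulo 2^32, so the difference is still
 * correct after the producer counter has overflowed. */
static uint32_t toy_ring_used(uint32_t prod, uint32_t cons)
{
    return prod - cons;
}
```

For example, a producer at 2 and a consumer at 0xFFFFFFFE are four
entries apart, which is why the index type must be exactly 32 bits
wide on both sides of the interface.
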
As shown in the diagram above, the rules for using a shared memory
ring are simple.

 1. A ring is full when a domain's producer and consumer pointers
    refer to the same slot with the producer a full lap ahead
    (e.g. req_prod - resp_cons == BLKIF_RING_SIZE, since the indexes
    are free-running). In this situation, the consumer pointer must
    be advanced before anything more can be produced. Furthermore,
    if the consumer pointer is equal to the other domain's producer
    pointer (e.g. resp_cons == resp_prod), then the other domain has
    all the buffers.

 2. Producer pointers point to the next buffer that will be written to.
    (So blk_ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)

 3. Consumer pointers point to a valid message, so long as they are not
    equal to the associated producer pointer.

 4. A domain should only ever write to the message pointed
    to by its producer index, and read from the message at its
    consumer. More generally, the domain may be thought of as having
    exclusive access to the messages between its consumer and producer,
    and should absolutely not read or write outside this region.

    Thus the front end has exclusive access to the free(A) region
    in the figure above, and the back end driver has exclusive
    access to the free(B) region.

In general, drivers keep a private copy of their producer pointer and
then set the shared version when they are ready for the other end to
process a set of messages. Additionally, it is worth paying attention
to the use of memory barriers (rmb/wmb) in the code, to ensure that
rings that are shared across processors behave as expected.

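The rules above can be sketched as a small single-threaded model in
user space. This is only an illustration under stated assumptions: the
toy_ names, the eight-slot ring, and the use of a plain int as the
message are all made up for the example, and the memory barriers that
a real driver needs are reduced to a comment.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u                       /* power of two */
#define TOY_MASK(i)   ((i) & (TOY_RING_SIZE - 1))

/* Contents of the shared page (hypothetical, much simplified). */
struct toy_ring {
    uint32_t req_prod;                         /* written by frontend */
    uint32_t resp_prod;                        /* written by backend  */
    union { int req; int resp; } slot[TOY_RING_SIZE];
};

/* Private indexes, one set per side. */
struct toy_front { uint32_t req_prod, resp_cons; };
struct toy_back  { uint32_t req_cons, resp_prod; };

/* Frontend: queue a request privately; false if the ring is full. */
static bool toy_front_queue(struct toy_ring *r, struct toy_front *f, int req)
{
    if (f->req_prod - f->resp_cons == TOY_RING_SIZE)
        return false;
    r->slot[TOY_MASK(f->req_prod)].req = req;
    f->req_prod++;
    return true;
}

/* Frontend: publish queued requests (a real driver issues wmb() first). */
static void toy_front_flush(struct toy_ring *r, struct toy_front *f)
{
    r->req_prod = f->req_prod;
}

/* Backend: consume one request and write its response into the ring. */
static bool toy_back_service(struct toy_ring *r, struct toy_back *b)
{
    if (b->req_cons == r->req_prod)
        return false;                          /* nothing published */
    int req = r->slot[TOY_MASK(b->req_cons)].req;
    b->req_cons++;
    r->slot[TOY_MASK(b->resp_prod)].resp = -req;   /* toy "result" */
    b->resp_prod++;
    r->resp_prod = b->resp_prod;
    return true;
}

/* Frontend: consume one response; false if none is available. */
static bool toy_front_poll(struct toy_ring *r, struct toy_front *f, int *resp)
{
    if (f->resp_cons == r->resp_prod)
        return false;
    *resp = r->slot[TOY_MASK(f->resp_cons)].resp;
    f->resp_cons++;
    return true;
}
```

Because every request gets exactly one response, the backend's
response producer can never overtake its own request consumer, so a
response always lands in a slot whose request has already been read.
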
==== Structure of the Blkif Drivers ====

Now that the communications primitives have been discussed, I'll
quickly cover the general structure of the blkif driver. This is
intended to give a high-level idea of what is going on, in an effort
to make reading the code a more approachable task.

There are three key software components that are involved in the blkif
drivers (not counting Xen itself): the frontend driver, the backend
driver, and Xend, which coordinates their initial connection. Xend may
also be involved in control-channel signalling in some cases after
startup, for instance to manage reconnection if the backend is restarted.

===== Frontend Driver Structure =====

The frontend domain uses a single event channel and a shared memory
ring to trade control messages with the backend. These are both set up
during domain startup, which will be discussed shortly. The shared
memory ring is called blk_ring, and the private ring indexes are
resp_cons and req_prod. The ring is protected by blkif_io_lock.
Additionally, the frontend keeps a list of outstanding requests in
rec_ring[]. These are uniquely identified by a guest-local id number,
which is associated with each request sent to the backend, and
returned with the matching responses. Information about the actual
disks is stored in major_info[], of which only the first nr_vbds
entries are valid. Finally, the global 'recovery' indicates that the
connection between the backend and frontend drivers has been broken
(possibly due to a backend driver crash) and that the frontend is in
recovery mode, in which case it will attempt to reconnect and reissue
outstanding requests.

The frontend driver is single-threaded and after setup is entered only
through three points: (1) read/write requests from the XenLinux guest
that it is a part of, (2) interrupts from the backend driver on its
event channel (blkif_int()), and (3) control messages from Xend
(blkif_ctrlif_rx).

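As an illustration of the bookkeeping that a table like rec_ring[]
implies, here is a minimal sketch of a fixed-size table of outstanding
requests whose index doubles as the request id. Threading a free list
through the unused entries is an assumption of this sketch, as are all
of the toy_ names and the choice of a sector number as the only saved
field.

```c
#include <stddef.h>

#define TOY_NR_RECS 4

/* Hypothetical shadow record for one in-flight request. */
struct toy_rec {
    int    in_use;
    long   sector;        /* enough information to reissue the request */
    size_t next_free;     /* free-list link, meaningful when !in_use */
};

struct toy_rec_table {
    struct toy_rec rec[TOY_NR_RECS];
    size_t free_head;     /* TOY_NR_RECS means "no free entry" */
};

static void toy_rec_init(struct toy_rec_table *t)
{
    for (size_t i = 0; i < TOY_NR_RECS; i++) {
        t->rec[i].in_use = 0;
        t->rec[i].sector = 0;
        t->rec[i].next_free = i + 1;
    }
    t->free_head = 0;
}

/* Record a request; returns its id, or -1 if the table is full. */
static int toy_rec_get(struct toy_rec_table *t, long sector)
{
    size_t id = t->free_head;
    if (id == TOY_NR_RECS)
        return -1;
    t->free_head = t->rec[id].next_free;
    t->rec[id].in_use = 1;
    t->rec[id].sector = sector;
    return (int)id;
}

/* A response carrying this id has arrived: release the record. */
static void toy_rec_put(struct toy_rec_table *t, int id)
{
    t->rec[id].in_use = 0;
    t->rec[id].next_free = t->free_head;
    t->free_head = (size_t)id;
}

/* Recovery: count the requests that would need to be reissued. */
static int toy_rec_outstanding(const struct toy_rec_table *t)
{
    int n = 0;
    for (size_t i = 0; i < TOY_NR_RECS; i++)
        n += t->rec[i].in_use;
    return n;
}
```

A recovery path would walk the in-use records and put each one back on
the new ring, which is why the record has to hold everything needed to
rebuild the request.
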
===== Backend Driver Structure =====

The backend driver is slightly more complex as it must manage any
number of concurrent frontend connections. For each domain it
manages, the backend driver maintains a blkif structure, which
describes all the connection and disk information associated with that
particular domain. This structure is associated with the interrupt
registration, and allows the backend driver to have immediate context
when it takes a notification from some domain.

All of the blkif structures are stored in a hash table (blkif_hash),
which is indexed by a hash of the domain id and a "handle", really a
per-domain blkif identifier, in case a domain wants to have multiple
connections.

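A lookup keyed on (domain id, handle) might look something like the
following sketch. The bucket count, the hash function, and all of the
toy_ names are assumptions made for the example and are not taken from
the backend source.

```c
#include <stddef.h>

#define TOY_HASH_SZ 16                         /* power of two; arbitrary */

struct toy_blkif {
    unsigned int      domid;
    unsigned int      handle;
    struct toy_blkif *hash_next;               /* chain within a bucket */
};

static struct toy_blkif *toy_hash[TOY_HASH_SZ];

/* Assumption: any cheap mixing of the two keys will do here. */
static unsigned int toy_hash_fn(unsigned int domid, unsigned int handle)
{
    return (domid ^ handle) & (TOY_HASH_SZ - 1);
}

/* Add a connection at the head of its bucket's chain. */
static void toy_blkif_insert(struct toy_blkif *b)
{
    unsigned int h = toy_hash_fn(b->domid, b->handle);
    b->hash_next = toy_hash[h];
    toy_hash[h] = b;
}

/* Returns the matching connection, or NULL if there is none. */
static struct toy_blkif *toy_blkif_find(unsigned int domid,
                                        unsigned int handle)
{
    struct toy_blkif *b = toy_hash[toy_hash_fn(domid, handle)];
    while (b != NULL && (b->domid != domid || b->handle != handle))
        b = b->hash_next;
    return b;
}
```

Both keys are compared on lookup, so two connections that land in the
same bucket (or two handles belonging to the same domain) are still
told apart.
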
The per-connection blkif structure is of type blkif_t. It contains
all of the communication details (event channel, irq, shared memory
ring and indexes), and blk_ring_lock, which is the backend mutex on
the shared ring. The structure also contains vbd_rb, which is a
red-black tree, containing an entry for each device/partition that is
assigned to that domain. This structure is filled by Xend passing
disk information to the backend at startup, and is protected by
vbd_lock. Finally, the blkif struct contains a status field, which
describes the state of the connection.

The backend driver spawns a kernel thread at startup
(blkio_schedule()), which handles requests to and from the actual disk
device drivers. This scheduler thread maintains a list of blkif
structures that have pending requests, and services them round-robin
with a maximum per-round request limit. blkifs are added to the list
in the interrupt handler (blkif_be_int()) using
add_to_blkdev_list_tail(), and removed in the scheduler loop after
calling do_block_io_op(), which processes a batch of requests. The
scheduler thread is explicitly activated at several points in the code
using maybe_trigger_blkio_schedule().

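The effect of round-robin service with a per-round limit can be seen
in a very small model. The limit of two requests per visit, the array
of connections in place of a linked list, and the toy_ names are all
assumptions of this sketch.

```c
#include <stddef.h>

#define TOY_MAX_PER_ROUND 2    /* assumption: per-visit batch limit */

struct toy_conn {
    int pending;               /* requests waiting on this connection */
    int served;                /* requests issued so far */
};

/* Serve up to TOY_MAX_PER_ROUND requests from one connection.
 * Returns nonzero if the connection still has work queued. */
static int toy_do_batch(struct toy_conn *c)
{
    int n = c->pending < TOY_MAX_PER_ROUND ? c->pending : TOY_MAX_PER_ROUND;
    c->pending -= n;
    c->served  += n;
    return c->pending > 0;
}

/* One pass over all connections; returns how many still have work. */
static int toy_schedule_pass(struct toy_conn *conns, size_t nr)
{
    int busy = 0;
    for (size_t i = 0; i < nr; i++)
        if (conns[i].pending > 0)
            busy += toy_do_batch(&conns[i]);
    return busy;
}
```

With one connection holding five requests and another holding one, the
second is fully served in the first pass even though the first still
has work queued. Stopping a busy frontend from starving the others is
the point of the per-round limit.
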
Pending requests between the backend driver and the physical device
drivers use another ring, pending_ring. Requests are placed in this
ring in the scheduler thread and issued to the device. A completion
callback, end_block_io_op, indicates that requests have been serviced
and generates a response on the appropriate blkif ring.
pending_reqs[] stores a list of outstanding requests with the physical
drivers.

So, control entries to the backend are (1) the blkio scheduler thread,
which sends requests to the real device drivers, (2) end_block_io_op,
which is called as serviced requests complete, (3) blkif_be_int(),
which handles notifications from the frontend drivers in other domains,
and (4) blkif_ctrlif_rx(), which handles control messages from Xend.

==== Driver Startup ====

Prior to starting a new guest using the frontend driver, the backend
will have been started in a privileged domain. The backend
initialisation code initialises all of its data structures, such as
the blkif hash table, and starts the scheduler thread as a kernel
thread. It then sends a driver status up message to let Xend know it
is ready to take frontend connections.

When a new domain that uses the blkif frontend driver is started,
there is a series of interactions between it, Xend, and the specified
backend driver. These interactions are as follows:

The domain configuration given to Xend will specify the backend domain
and disks that the new guest is to use. Prior to actually running the
domain, Xend and the backend driver interact to set up the initial
blkif record in the backend.

(1) Xend sends a BLKIF_BE_CREATE message to the backend.

  Backend does blkif_create(), having been passed FE domid and handle.
  It creates and initialises a new blkif struct, and puts it in the
  hash table.
  It then returns a STATUS_OK response to Xend.

(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.

  Backend adds a vbd entry in the red-black tree for the
  specified (dom, handle) blkif entry.
  Sends a STATUS_OK response.

(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.

  Backend takes the physical device information passed in the
  message and assigns it to the newly created vbd struct.

(2) and (3) repeat as any additional devices are added to the domain.

At this point, the backend has enough state to allow the frontend
domain to start. The domain is run, and eventually gets to the
frontend driver initialisation code. After setting up the frontend
data structures, this code continues the communications with Xend and
the backend to negotiate a connection:

(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.

  This message tells Xend that the driver is up. The init function
  now spin-waits until driver setup is complete in order to prevent
  Linux from attempting to boot before the disks are connected.

(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.

  This message specifies that the interface is now disconnected
  (instead of closed).
  The domain updates its state, and allocates the shared blk_ring
  page.

(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message.

  This message specifies the domain and handle, and includes the
  address of the newly created page.

(7) Xend sends the backend a BLKIF_BE_CONNECT message.

  The backend fills in the blkif connection information, maps the
  shared page, and binds an irq to the event channel.

(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.

  This message takes the frontend driver to a CONNECTED state, at
  which point it binds an irq to the event channel and calls
  xlvbd_init to initialise the individual block devices.

The frontend Linux is still spin-waiting at this point, until all of
the disks have been probed. Messaging now is directly between the
front and backend domain using the new shared ring and event channel.

(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.

  This message includes a reference to an additional page that the
  backend can use for its reply. The backend responds with an array
  of the domain's disks (as vdisk_t structs) on the provided page.

The frontend now initialises each disk, calling xlvbd_init_device()
for each one.
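
Seen from the frontend, steps (5) and (8) above amount to a small
state machine driven by interface status messages. The sketch below is
a deliberately simplified model of it: the enum values, the toy_ names,
and the two flags are hypothetical, and the recovery path (in which a
connected frontend is taken back to a disconnected state after a
backend restart) is left out and simply rejected.

```c
enum toy_state  { TOY_CLOSED, TOY_DISCONNECTED, TOY_CONNECTED };
enum toy_status { TOY_STATUS_CLOSED, TOY_STATUS_DISCONNECTED,
                  TOY_STATUS_CONNECTED };

struct toy_fe {
    enum toy_state state;
    int ring_allocated;     /* step (5): shared page has been set up */
    int irq_bound;          /* step (8): event channel has been bound */
};

/* Handle one interface-status message from Xend.
 * Returns 0 on success, -1 for a transition this sketch rejects. */
static int toy_fe_status(struct toy_fe *fe, enum toy_status st)
{
    switch (st) {
    case TOY_STATUS_DISCONNECTED:
        if (fe->state != TOY_CLOSED)
            return -1;               /* recovery is not modelled */
        fe->ring_allocated = 1;      /* then send the connect message */
        fe->state = TOY_DISCONNECTED;
        return 0;
    case TOY_STATUS_CONNECTED:
        if (fe->state != TOY_DISCONNECTED)
            return -1;
        fe->irq_bound = 1;           /* then probe for disks */
        fe->state = TOY_CONNECTED;
        return 0;
    case TOY_STATUS_CLOSED:
        fe->ring_allocated = 0;
        fe->irq_bound = 0;
        fe->state = TOY_CLOSED;
        return 0;
    }
    return -1;
}
```

A CONNECTED message that arrives while the model is still CLOSED is
rejected, which mirrors the ordering in the list above: the shared
page must exist before the event channel is bound to it.
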