=== How the Blkif Drivers Work ===
Andrew Warfield
andrew.warfield@cl.cam.ac.uk

The intent of this document is to explain at a fairly detailed level
how the split device drivers work in Xen 1.3 (aka 2.0beta). The
intended audience, I suppose, is anyone who intends to work with the
existing blkif interfaces and wants something to help them get up to
speed with the code in a hurry. Secondly, I hope to break out the
general mechanisms used in the drivers that are likely to be necessary
to implement other driver interfaces.

As a point of warning before starting, it is worth mentioning that I
anticipate much of the specifics described here changing in the near
future. There has been talk about making the blkif protocol a bit
more efficient than it currently is. Keir's addition of grant tables
will change the current remapping code that is used when shared pages
are initially set up.

Also, writing other control interface types will likely need support
from Xend, which at the moment has a steep learning curve... this
should be addressed in the future.

For more information on the driver model as a whole, read the
"Reconstructing I/O" technical report
(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).

==== High-level structure of a split-driver interface ====

Why would you want to write a split driver in the first place? As Xen
is a virtual machine manager and focuses on isolation as an initial
design principle, it is generally considered unwise to share physical
access to devices across domains. The reason is simple: when device
resources are shared, misbehaving code or hardware can result in the
failure of all of the client applications. Moreover, as virtual
machines in Xen are entire OSs, the standard device drivers that they
might use cannot have multiple instantiations for a single piece of
hardware. In light of all this, the general approach in Xen is to
give a single virtual machine hardware access to a device, and where
other VMs want to share the device, export a higher-level interface to
facilitate that sharing. If you don't want to share, that's fine.
There are currently Xen users actively exploring running two
completely isolated X-Servers on a Xen host, each with its own video
card, keyboard, and mouse. In these situations, the guests need only
be given physical access to the necessary devices and left to go on
their own. However, for devices such as disks and network interfaces,
where sharing is required, the split driver approach is a good
solution.

The structure is like this:

    +--------------------------+      +--------------------------+
    | Domain 0 (privileged)    |      | Domain 1 (unprivileged)  |
    |                          |      |                          |
    | Xend ( Application )     |      |                          |
    | Blkif Backend Driver     |      | Blkif Frontend Driver    |
    | Physical Device Driver   |      |                          |
    +--------------------------+      +--------------------------+
    +------------------------------------------------------------+
    |                           X E N                            |
    +------------------------------------------------------------+

The Blkif driver is in two parts, which we refer to as a frontend (FE)
and a backend (BE). Together, they serve to proxy device requests
between the guest operating system in an unprivileged domain, and the
physical device driver in the privileged domain. An additional
benefit to this approach is that the FE driver can provide a single
interface for a whole class of physical devices. The blkif interface
mounts IDE, SCSI, and our own VBD-structured disks, independent of the
physical driver underneath. Moreover, supporting additional OSs only
requires that a new FE driver be written to connect to the existing
backend.

==== Inter-Domain Communication Mechanisms ====

===== Event Channels =====

Before getting into the specifics of the block interface driver, it is
worth discussing the mechanisms that are used to communicate between
domains. Two mechanisms are used to allow the construction of
high-performance drivers: event channels and shared-memory rings.

Event channels are an asynchronous interdomain notification
mechanism. Xen allows channels to be instantiated between two
domains, and domains can request that a virtual irq be attached to
notifications on a given channel. The result of this is that the
frontend domain can send a notification on an event channel, resulting
in an interrupt entry into the backend at a later time.

The event channel between two domains is instantiated in the Xend code
during driver startup (described later). Xend's channel.py
(tools/python/xen/xend/server/channel.py) defines the function

    def eventChannel(dom1, dom2):
        return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)

which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
which in turn generates a hypercall to Xen to patch the event channel
between the domains. Only a privileged domain can request the
creation of an event channel.

Once the event channel is created in Xend, its ends are passed to both
the front and backend domains over the control channel. The end that
is passed to a domain is just an integer "port" uniquely identifying
the event channel's local connection to that domain. An example of
this setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
blkif_status_change, which receives several status change events as
the driver starts up. It is passed an event channel end in a
BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:

    blkif_evtchn = status->evtchn;
    blkif_irq = bind_evtchn_to_irq(blkif_evtchn);
    if ( (rc = request_irq(blkif_irq, blkif_int,
                           SA_SAMPLE_RANDOM, "blkif", NULL)) )
        printk(KERN_ALERT "blkfront request_irq failed (%ld)\n", rc);

This code associates a virtual irq with the event channel, and
attaches the function blkif_int() as an interrupt handler for that
irq. blkif_int() simply handles the notification and returns; it does
not need to interact with the channel at all.

An example of generating a notification can also be seen in blkfront.c:

    static inline void flush_requests(void)
    {
        wmb(); /* Ensure that the backend can see the requests. */
        blk_ring->req_prod = req_prod;
        notify_via_evtchn(blkif_evtchn);
    }

The write barrier makes sure the request contents are visible before
the shared producer index is updated. notify_via_evtchn then issues a
hypercall to set the event waiting flag on the other domain's end of
the channel.

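To make the notify-then-interrupt behaviour concrete, here is a toy,
single-process model of one end of an event channel. It is only a
sketch: evtchn_end, evtchn_bind, evtchn_notify and evtchn_deliver are
names I have made up for illustration and are not part of Xen or
XenLinux. The model assumes that a notification just sets a pending
flag, so several notifications sent before the handler runs collapse
into a single handler call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one end of an event channel. All names here are
 * hypothetical; this is not the Xen or XenLinux API. */
typedef void (*evt_handler_t)(void *ctx);

struct evtchn_end {
    int           pending;  /* "event waiting" flag set by the remote side */
    evt_handler_t handler;  /* handler bound to this end, like an irq */
    void         *ctx;      /* context handed to the handler */
};

/* Rough analogue of bind_evtchn_to_irq() followed by request_irq(). */
static void evtchn_bind(struct evtchn_end *e, evt_handler_t h, void *ctx)
{
    assert(h != NULL);
    e->pending = 0;
    e->handler = h;
    e->ctx = ctx;
}

/* Rough analogue of notify_via_evtchn(): only marks the remote end
 * pending; the handler does not run here. */
static void evtchn_notify(struct evtchn_end *remote)
{
    remote->pending = 1;
}

/* Runs "at a later time", when the notified side gets to take the
 * interrupt. Returns the number of handler calls made (0 or 1). */
static int evtchn_deliver(struct evtchn_end *e)
{
    if (!e->pending)
        return 0;
    e->pending = 0;
    e->handler(e->ctx);
    return 1;
}

/* Example handler: counts how many times it has been invoked. */
static void count_handler(void *ctx)
{
    ++*(int *)ctx;
}
```

Binding count_handler to an end, notifying it twice and then calling
evtchn_deliver once runs the handler exactly once. This is why the
real handler re-reads the shared ring rather than assuming one
notification per message.
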
===== Communication Rings =====

Event channels are strictly a notification mechanism between domains.
To move large chunks of data back and forth, Xen allows domains to
share pages of memory. We use communication rings as a means of
managing access to a shared memory page for message passing between
domains. These rings are not explicitly a mechanism of Xen, which is
only concerned with the actual sharing of the page and not how it is
used. They are, however, worth discussing as they are used in many
places in the current code and are a useful model for communicating
across a shared page.

A shared page is set up by a guest first allocating and passing the
address of a page in its own address space to the backend driver.

    blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
    blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
    ...
    /* Construct an interface-CONNECT message for the domain controller. */
    cmsg.type      = CMSG_BLKIF_FE;
    cmsg.length    = sizeof(blkif_fe_interface_connect_t);
    up.handle      = 0;
    up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
    memcpy(cmsg.msg, &up, sizeof(up));

blk_ring will be the shared page. The producer and consumer pointers
are then initialised (these will be discussed soon), and then the
machine address of the page is sent to the backend via a control
channel to Xend. This control channel itself uses the notification
and shared memory mechanisms described here, but is set up for each
domain automatically at startup.

The backend, which runs in a privileged domain, then takes the page
address and maps it into its own address space (in
linux26/drivers/xen/blkback/interface.c:blkif_connect()):

    void blkif_connect(blkif_be_connect_t *connect)
    {
        ...
        unsigned long shmem_frame = connect->shmem_frame;
        ...

        if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
        {
            connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
            return;
        }

        prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
        error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
                                        shmem_frame<<PAGE_SHIFT, PAGE_SIZE,
                                        prot, domid);

        ...

        blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
    }

The machine frame number of the page is passed in the shmem_frame
field of the connect message. The corresponding machine address is
then mapped into the virtual address space of the backend domain, and
saved in the blkif structure representing this particular backend
connection.

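The frontend shifts the machine address right by PAGE_SHIFT to get a
frame number, and the backend shifts it back before remapping. A
minimal sketch of that round trip is below. EX_PAGE_SHIFT,
addr_to_frame and frame_to_addr are hypothetical names, and the 4 KiB
page size is my assumption.

```c
#include <assert.h>

#define EX_PAGE_SHIFT 12                      /* assumed: 4 KiB pages */
#define EX_PAGE_SIZE  (1ul << EX_PAGE_SHIFT)

/* Frame number of a page-aligned machine address, in the same form
 * as the value the frontend stores in shmem_frame. */
static unsigned long addr_to_frame(unsigned long maddr)
{
    assert((maddr & (EX_PAGE_SIZE - 1)) == 0);  /* ring page is page-aligned */
    return maddr >> EX_PAGE_SHIFT;
}

/* Inverse conversion, like shmem_frame << PAGE_SHIFT in the backend. */
static unsigned long frame_to_addr(unsigned long frame)
{
    return frame << EX_PAGE_SHIFT;
}
```

Because the ring page is page-aligned, no information is lost in the
shift, so the backend recovers exactly the address that the frontend
started from.
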
NOTE: New mechanisms will be added very shortly to allow domains to
explicitly grant access to their pages to other domains. This "grant
table" support is in the process of being added to the tree, and will
change the way a shared page is set up. In particular, it will remove
the need for the remapping domain to be privileged.

Sending data across shared rings:

Shared rings avoid the potential for write interference between
domains in a very cunning way. A ring is partitioned into a request
and a response region, and domains only work within their own space.
This can be thought of as a double producer-consumer ring -- the ring
is described by four pointers into a circular buffer of fixed-size
records. Pointers may only advance, and may not pass one another.

                              rsp_cons----+
                                          V
               +----+----+----+----+----+----+----+
               |    |    |    free      |RSP1|RSP2|
               +----+----+----+----+----+----+----+
     req_prod->|    |       -------->        |RSP3|
               +----+                        +----+
               |REQ8|                        |    |<-rsp_prod
               +----+                        +----+
               |REQ7|                        |    |
               +----+                        +----+
               |REQ6|       <--------        |    |
               +----+----+----+----+----+----+----+
               |REQ5|REQ4|    free      |    |    |
               +----+----+----+----+----+----+----+
      req_cons---------^

By adopting the convention that every request will receive a response,
not all four pointers need be shared and flow control on the ring
becomes very easy to manage. Each domain manages its own consumer
pointer, and the two producer pointers are visible to both
(Xen/include/hypervisor-ifs/io/blkif.h):

    /* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
    #define BLKIF_RING_SIZE        64

    ...

    /*
     * We use a special capitalised type name because it is _essential_ that all
     * arithmetic on indexes is done on an integer type of the correct size.
     */
    typedef u32 BLKIF_RING_IDX;

    /*
     * Ring indexes are 'free running'. That is, they are not stored modulo the
     * size of the ring buffer. The following macro converts a free-running counter
     * into a value that can directly index a ring-buffer array.
     */
    #define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))

    typedef struct {
        BLKIF_RING_IDX req_prod;  /*  0: Request producer. Updated by front-end. */
        BLKIF_RING_IDX resp_prod; /*  4: Response producer. Updated by back-end. */
        union {                   /*  8 */
            blkif_request_t  req;
            blkif_response_t resp;
        } PACKED ring[BLKIF_RING_SIZE];
    } PACKED blkif_ring_t;

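The comment about free-running indexes is easy to skim past, so here
is a small sketch of the arithmetic it implies. ring_idx_t, RING_SIZE,
ring_used and ring_slot are stand-in names of my own, chosen to
mirror BLKIF_RING_IDX, BLKIF_RING_SIZE and MASK_BLKIF_IDX without
needing the Xen headers. The sketch assumes the ring size is a power
of two, which the mask requires.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 64u                        /* must be a power of two */
#define MASK_IDX(i) ((i) & (RING_SIZE - 1))  /* same shape as MASK_BLKIF_IDX */

typedef uint32_t ring_idx_t;                 /* stand-in for BLKIF_RING_IDX */

/* Number of entries produced but not yet consumed. This stays correct
 * when the counters wrap, because unsigned subtraction is performed
 * modulo 2^32. */
static ring_idx_t ring_used(ring_idx_t prod, ring_idx_t cons)
{
    ring_idx_t used = prod - cons;
    assert(used <= RING_SIZE);  /* a producer may never lap its consumer */
    return used;
}

/* Array slot that a free-running index refers to. */
static unsigned ring_slot(ring_idx_t idx)
{
    return MASK_IDX(idx);
}
```

For example, ring_slot(64) is 0 and ring_slot(130) is 2, while
ring_used(3, 0xFFFFFFFE) is 5 even though the producer counter has
wrapped past zero. This is the reason the header insists on a
fixed-width unsigned type for every index.
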
As shown in the diagram above, the rules for using a shared memory
ring are simple.

 1. A ring is full when a domain's producer pointer has advanced a
    whole ring ahead of its consumer pointer, so that both refer to
    the same slot (with free-running indexes, e.g.
    req_prod - resp_cons == BLKIF_RING_SIZE). In this situation, the
    consumer pointer must be advanced. Furthermore, if the consumer
    pointer is equal to the other domain's producer pointer
    (e.g. resp_cons == resp_prod), then the other domain has all the
    buffers.

 2. Producer pointers point to the next buffer that will be written to.
    (So blk_ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)

 3. Consumer pointers point to a valid message, so long as they are not
    equal to the associated producer pointer.

 4. A domain should only ever write to the message pointed
    to by its producer index, and read from the message at its
    consumer. More generally, the domain may be thought of as having
    exclusive access to the messages between its consumer and producer,
    and should absolutely not read or write outside this region.

In general, drivers keep a private copy of their producer pointer and
then set the shared version when they are ready for the other end to
process a set of messages. Additionally, it is worth paying attention
to the use of memory barriers (rmb/wmb) in the code, to ensure that
rings that are shared across processors behave as expected.

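The rules above can be sketched as a small user-space simulation with
both ends living in one process. This is not the driver code: the
types and functions (ring_t, front_t, back_t, fe_queue, fe_flush,
be_service, fe_poll) are hypothetical, the ring is shrunk to 8 slots,
and C11 fences stand in for the kernel's wmb()/rmb(). The backend here
also answers every request immediately, which the real backend does
not do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u                          /* toy size; power of two */
#define MASK(i)   ((i) & (RING_SIZE - 1))

typedef struct { uint32_t id; uint32_t sector; } req_t;
typedef struct { uint32_t id; int32_t status; } resp_t;

/* The shared page: only the two producer indexes are shared. */
typedef struct {
    uint32_t req_prod;
    uint32_t resp_prod;
    union { req_t req; resp_t resp; } ring[RING_SIZE];
} ring_t;

/* Each end keeps its consumer index and a private producer copy. */
typedef struct { ring_t *sh; uint32_t req_prod, resp_cons; } front_t;
typedef struct { ring_t *sh; uint32_t req_cons, resp_prod; } back_t;

/* Frontend: queue a request privately. Returns 0 if the ring is full. */
static int fe_queue(front_t *fe, uint32_t id, uint32_t sector)
{
    if (fe->req_prod - fe->resp_cons == RING_SIZE)
        return 0;                             /* rule 1: full */
    req_t *r = &fe->sh->ring[MASK(fe->req_prod)].req;
    r->id = id;
    r->sector = sector;
    fe->req_prod++;
    return 1;
}

/* Frontend: publish everything queued so far. */
static void fe_flush(front_t *fe)
{
    atomic_thread_fence(memory_order_release); /* stand-in for wmb() */
    fe->sh->req_prod = fe->req_prod;
}

/* Backend: consume each published request and answer it in a slot it
 * has already consumed. Returns the number of requests handled. */
static int be_service(back_t *be)
{
    int n = 0;
    uint32_t prod = be->sh->req_prod;
    atomic_thread_fence(memory_order_acquire); /* stand-in for rmb() */
    while (be->req_cons != prod) {
        uint32_t id = be->sh->ring[MASK(be->req_cons)].req.id;
        be->req_cons++;
        resp_t *rsp = &be->sh->ring[MASK(be->resp_prod)].resp;
        rsp->id = id;
        rsp->status = 0;
        be->resp_prod++;
        n++;
    }
    atomic_thread_fence(memory_order_release);
    be->sh->resp_prod = be->resp_prod;
    return n;
}

/* Frontend: collect one response. Returns 0 if none is waiting. */
static int fe_poll(front_t *fe, uint32_t *id_out)
{
    if (fe->resp_cons == fe->sh->resp_prod)
        return 0;                             /* rule 3: nothing valid */
    atomic_thread_fence(memory_order_acquire);
    *id_out = fe->sh->ring[MASK(fe->resp_cons)].resp.id;
    fe->resp_cons++;
    return 1;
}
```

Queuing eight requests fills the toy ring, and a ninth fe_queue call
fails until fe_poll has consumed at least one response. Note also that
be_service sees nothing until fe_flush has copied the private producer
index into the shared page.
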
==== Structure of the Blkif Drivers ====

Now that the communications primitives have been discussed, I'll
quickly cover the general structure of the blkif driver. This is
intended to give a high-level idea of what is going on, in an effort
to make reading the code a more approachable task.

There are three key software components that are involved in the blkif
drivers (not counting Xen itself): the frontend driver, the backend
driver, and Xend, which coordinates their initial connection. Xend
may also be involved in control-channel signalling in some cases after
startup, for instance to manage reconnection if the backend is
restarted.

===== Frontend Driver Structure =====

The frontend domain uses a single event channel and a shared memory
ring to trade control messages with the backend. These are both set
up during domain startup, which will be discussed shortly. The shared
memory ring is called blk_ring, and the private ring indexes are
resp_cons and req_prod. The ring is protected by blkif_io_lock.
Additionally, the frontend keeps a list of outstanding requests in
rec_ring[]. These are uniquely identified by a guest-local id number,
which is associated with each request sent to the backend, and
returned with the matching responses. Information about the actual
disks is stored in major_info[], of which only the first nr_vbds
entries are valid. Finally, the global 'recovery' indicates that the
connection between the backend and frontend drivers has been broken
(possibly due to a backend driver crash) and that the frontend is in
recovery mode, in which case it will attempt to reconnect and reissue
outstanding requests.

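To show how a guest-local id can tie a response back to its request,
here is a minimal sketch of an outstanding-request table in the spirit
of rec_ring[]. I have not copied the layout from blkfront.c: rec_t,
rec_init, rec_get_id, rec_complete and rec_count_in_flight are
hypothetical, and the free-list scheme is simply one reasonable way to
hand out ids.

```c
#include <assert.h>
#include <stdint.h>

#define NREC    8u                 /* toy table size */
#define ID_NONE 0xFFFFFFFFu        /* "no id available" marker */

typedef struct {
    uint32_t sector;      /* copy of the in-flight request's payload */
    uint32_t next_free;   /* free-list link while the slot is unused */
    int      in_flight;   /* nonzero while awaiting a response */
} rec_t;

static rec_t    rec[NREC];
static uint32_t free_head;

/* Chain every slot onto the free list. */
static void rec_init(void)
{
    for (uint32_t i = 0; i < NREC; i++) {
        rec[i].in_flight = 0;
        rec[i].next_free = (i + 1 < NREC) ? i + 1 : ID_NONE;
    }
    free_head = 0;
}

/* Record a request about to be sent and return the id to put in it,
 * or ID_NONE if every slot is in use. */
static uint32_t rec_get_id(uint32_t sector)
{
    uint32_t id = free_head;
    if (id == ID_NONE)
        return ID_NONE;
    free_head = rec[id].next_free;
    rec[id].sector = sector;
    rec[id].in_flight = 1;
    return id;
}

/* A response carrying this id arrived: release the slot and return
 * the sector that the original request referred to. */
static uint32_t rec_complete(uint32_t id)
{
    assert(id < NREC && rec[id].in_flight);
    rec[id].in_flight = 0;
    rec[id].next_free = free_head;
    free_head = id;
    return rec[id].sector;
}

/* In recovery, these are the requests that would need reissuing. */
static int rec_count_in_flight(void)
{
    int n = 0;
    for (uint32_t i = 0; i < NREC; i++)
        n += rec[i].in_flight;
    return n;
}
```

Keeping a copy of each request on the frontend side is what makes
recovery possible: after a backend crash the shared ring contents are
gone, but every slot still marked in flight can be sent again.
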
The frontend driver is single-threaded and after setup is entered only
through three points: (1) read/write requests from the XenLinux guest
that it is a part of, (2) interrupts from the backend driver on its
event channel (blkif_int()), and (3) control messages from Xend
(blkif_ctrlif_rx).

===== Backend Driver Structure =====

The backend driver is slightly more complex as it must manage any
number of concurrent frontend connections. For each domain it
manages, the backend driver maintains a blkif structure, which
describes all the connection and disk information associated with that
particular domain. This structure is associated with the interrupt
registration, and allows the backend driver to have immediate context
when it takes a notification from some domain.

All of the blkif structures are stored in a hash table (blkif_hash),
which is indexed by a hash of the domain id and a "handle". The
handle is really a per-domain blkif identifier, in case a domain wants
to have multiple connections.

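A lookup keyed on (domain id, handle) might look roughly like the
sketch below. The table size, the XOR hash and the names conn_t,
conn_hash, conn_find and conn_insert are my own assumptions for
illustration; the real blkif_hash code may differ in all of these.

```c
#include <assert.h>
#include <stddef.h>

#define HASH_SZ 64u                          /* assumed; power of two */
#define HASH(dom, handle) (((dom) ^ (handle)) & (HASH_SZ - 1))

typedef struct conn {
    unsigned     domid;      /* frontend domain id */
    unsigned     handle;     /* per-domain connection identifier */
    struct conn *hash_next;  /* chain of entries in the same bucket */
} conn_t;

static conn_t *conn_hash[HASH_SZ];

/* Find the connection for (domid, handle), or NULL if there is none. */
static conn_t *conn_find(unsigned domid, unsigned handle)
{
    conn_t *c = conn_hash[HASH(domid, handle)];
    while (c != NULL && (c->domid != domid || c->handle != handle))
        c = c->hash_next;
    return c;
}

/* Insert a new connection. Returns 0 if the key already exists. */
static int conn_insert(conn_t *c)
{
    assert(c != NULL);
    if (conn_find(c->domid, c->handle) != NULL)
        return 0;
    conn_t **head = &conn_hash[HASH(c->domid, c->handle)];
    c->hash_next = *head;
    *head = c;
    return 1;
}
```

With this hash, (1, 0) and (0, 1) land in the same bucket, so the
lookup has to compare both fields while walking the chain rather than
trusting the bucket alone.
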
The per-connection blkif structure is of type blkif_t. It contains
all of the communication details (event channel, irq, shared memory
ring and indexes), and blk_ring_lock, which is the backend mutex on
the shared ring. The structure also contains vbd_rb, which is a
red-black tree, containing an entry for each device/partition that is
assigned to that domain. This structure is filled by xend passing
disk information to the backend at startup, and is protected by
vbd_lock. Finally, the blkif struct contains a status field, which
describes the state of the connection.

The backend driver spawns a kernel thread at startup
(blkio_schedule()), which handles requests to and from the actual disk
device drivers. This scheduler thread maintains a list of blkif
structures that have pending requests, and services them round-robin
with a maximum per-round request limit. blkifs are added to the list
in the interrupt handler (blkif_be_int()) using
add_to_blkdev_list_tail(), and removed in the scheduler loop after
calling do_block_io_op(), which processes a batch of requests. The
scheduler thread is explicitly activated at several points in the code
using maybe_trigger_blkio_schedule().

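The round-robin policy can be sketched as follows. This is a
simplified user-space model rather than the blkback scheduler itself:
conn_t, run_queue, add_to_list_tail, do_io and schedule_round are
hypothetical names, the per-round limit of 4 is an arbitrary value,
and "servicing" a request just decrements a counter.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PER_ROUND 4     /* assumed per-round request limit */
#define MAX_QUEUE     16u   /* toy run-queue capacity; power of two */

typedef struct {
    int pending;    /* requests waiting on this connection's ring */
    int on_list;    /* already queued for the scheduler? */
    int serviced;   /* total requests handled so far */
} conn_t;

static conn_t  *run_queue[MAX_QUEUE];
static unsigned rq_head, rq_tail;

/* Analogue of add_to_blkdev_list_tail(): queue a connection at most once. */
static void add_to_list_tail(conn_t *c)
{
    if (c->on_list)
        return;
    assert(rq_tail - rq_head < MAX_QUEUE);
    run_queue[rq_tail++ % MAX_QUEUE] = c;
    c->on_list = 1;
}

/* Analogue of do_block_io_op(): handle at most max requests and
 * report whether more remain. */
static int do_io(conn_t *c, int max)
{
    int n = c->pending < max ? c->pending : max;
    c->pending -= n;
    c->serviced += n;
    return c->pending > 0;
}

/* One scheduler pass: service each connection that was queued when
 * the pass started, requeueing any that still have work. Returns the
 * number of connections serviced. */
static int schedule_round(void)
{
    unsigned end = rq_tail;
    int n = 0;
    while (rq_head != end) {
        conn_t *c = run_queue[rq_head++ % MAX_QUEUE];
        c->on_list = 0;
        if (do_io(c, MAX_PER_ROUND))
            add_to_list_tail(c);
        n++;
    }
    return n;
}
```

A connection with 10 pending requests therefore needs three passes,
while one with only 2 is finished after the first. The limit is what
stops a busy domain from starving the others.
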
Pending requests between the backend driver and the physical device
drivers use another ring, pending_ring. Requests are placed in this
ring in the scheduler thread and issued to the device. A completion
callback, end_block_io_op, indicates that requests have been serviced
and generates a response on the appropriate blkif ring.
pending_reqs[] stores a list of outstanding requests with the physical
drivers.

So, control entries to the backend are (1) the blkio scheduler thread,
which sends requests to the real device drivers, (2) end_block_io_op,
which is called as serviced requests complete, (3) blkif_be_int(),
which handles notifications from the frontend drivers in other
domains, and (4) blkif_ctrlif_rx(), which handles control messages
from xend.

==== Driver Startup ====

Prior to starting a new guest using the frontend driver, the backend
will have been started in a privileged domain. The backend
initialisation code initialises all of its data structures, such as
the blkif hash table, and starts the scheduler thread as a kernel
thread. It then sends a driver status up message to let xend know it
is ready to take frontend connections.

When a new domain that uses the blkif frontend driver is started,
there are a series of interactions between it, xend, and the specified
backend driver. These interactions are as follows:

The domain configuration given to xend will specify the backend domain
and disks that the new guest is to use. Prior to actually running the
domain, xend and the backend driver interact to set up the initial
blkif record in the backend.

(1) Xend sends a BLKIF_BE_CREATE message to the backend.

    Backend does blkif_create(), having been passed FE domid and handle.
    It creates and initialises a new blkif struct, and puts it in the
    hash table.
    It then returns a STATUS_OK response to xend.

(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.

    Backend adds a vbd entry in the red-black tree for the
    specified (dom, handle) blkif entry.
    Sends a STATUS_OK response.

(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.

    Backend takes the physical device information passed in the
    message and assigns it to the newly created vbd struct.

(2) and (3) repeat as any additional devices are added to the domain.

At this point, the backend has enough state to allow the frontend
domain to start. The domain is run, and eventually gets to the
frontend driver initialisation code. After setting up the frontend
data structures, this code continues the communications with xend and
the backend to negotiate a connection:

(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.

    This message tells xend that the driver is up. The init function
    now spin-waits until driver setup is complete in order to prevent
    Linux from attempting to boot before the disks are connected.

(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.

    This message specifies that the interface is now disconnected
    (instead of closed).
    The domain updates its state, and allocates the shared blk_ring
    page.

(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message.

    This message specifies the domain and handle, and includes the
    address of the newly created page.

(7) Xend sends the backend a BLKIF_BE_CONNECT message.

    The backend fills in the blkif connection information, maps the
    shared page, and binds an irq to the event channel.

(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message.

    This message takes the frontend driver to a CONNECTED state, at
    which point it binds an irq to the event channel and calls
    xlvbd_init to initialise the individual block devices.

The frontend Linux is still spin-waiting at this point, until all of
the disks have been probed. Messaging now is directly between the
front and backend domain using the new shared ring and event channel.

(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.

    This message includes a reference to an additional page that the
    backend can use for its reply. The backend responds with an array
    of the domain's disks (as vdisk_t structs) on the provided page.

The frontend now initialises each disk, calling xlvbd_init_device()
for each one.

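Seen from the frontend, steps (5) and (8) amount to a small state
machine driven by INTERFACE_STATUS_CHANGED messages. The sketch below
captures only that happy path. The enum values, frontend_t and
fe_status_change are hypothetical names, and recovery (a disconnect
arriving while connected) is deliberately left out.

```c
#include <assert.h>

typedef enum { FE_CLOSED, FE_DISCONNECTED, FE_CONNECTED } fe_state_t;
typedef enum { IF_CLOSED, IF_DISCONNECTED, IF_CONNECTED } if_status_t;

typedef struct {
    fe_state_t state;
    int ring_allocated;   /* shared ring page has been set up */
    int irq_bound;        /* irq bound to the event channel */
} frontend_t;

/* Handle an INTERFACE_STATUS_CHANGED message. Returns 1 if the
 * message was accepted, 0 if it is unexpected in the current state. */
static int fe_status_change(frontend_t *fe, if_status_t status)
{
    switch (status) {
    case IF_DISCONNECTED:               /* step (5) */
        if (fe->state != FE_CLOSED)
            return 0;
        fe->ring_allocated = 1;         /* then send CONNECT, step (6) */
        fe->state = FE_DISCONNECTED;
        return 1;
    case IF_CONNECTED:                  /* step (8) */
        if (fe->state != FE_DISCONNECTED)
            return 0;
        assert(fe->ring_allocated);     /* ring must exist before use */
        fe->irq_bound = 1;              /* then probe disks, step (9) */
        fe->state = FE_CONNECTED;
        return 1;
    default:
        return 0;
    }
}
```

Rejecting a CONNECTED message while still CLOSED reflects the ordering
above: the frontend cannot bind to an event channel for a ring it has
not yet allocated and advertised.
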