annotate Documentation/DMA-mapping.txt @ 524:7f8b544237bf

author Keir Fraser <keir.fraser@citrix.com>
date Tue Apr 15 15:18:58 2008 +0100 (2008-04-15)
parents 831230e53067
rev   line source
ian@0 1 Dynamic DMA mapping
ian@0 2 ===================
ian@0 3
ian@0 4 David S. Miller <davem@redhat.com>
ian@0 5 Richard Henderson <rth@cygnus.com>
ian@0 6 Jakub Jelinek <jakub@redhat.com>
ian@0 7
ian@0 8 This document describes the DMA mapping system in terms of the pci_
ian@0 9 API. For a similar API that works for generic devices, see
ian@0 10 DMA-API.txt.
ian@0 11
ian@0 12 Most of the 64bit platforms have special hardware that translates bus
ian@0 13 addresses (DMA addresses) into physical addresses. This is similar to
ian@0 14 how page tables and/or a TLB translate virtual addresses to physical
ian@0 15 addresses on a CPU. This is needed so that e.g. PCI devices can
ian@0 16 access with a Single Address Cycle (32bit DMA address) any page in the
ian@0 17 64bit physical address space. Previously in Linux those 64bit
ian@0 18 platforms had to set artificial limits on the maximum RAM size in the
ian@0 19 system, so that the virt_to_bus() static scheme works (the DMA address
ian@0 20 translation tables were simply filled on bootup to map each bus
ian@0 21 address to the physical page __pa(bus_to_virt())).
ian@0 22
ian@0 23 So that Linux can use the dynamic DMA mapping, it needs some help from the
ian@0 24 drivers, namely it has to take into account that DMA addresses should be
ian@0 25 mapped only for the time they are actually used and unmapped after the DMA
ian@0 26 transfer.
ian@0 27
ian@0 28 The following API will work of course even on platforms where no such
ian@0 29 hardware exists, see e.g. include/asm-i386/pci.h for how it is implemented on
ian@0 30 top of the virt_to_bus interface.
ian@0 31
ian@0 32 First of all, you should make sure
ian@0 33
ian@0 34 #include <linux/pci.h>
ian@0 35
ian@0 36 is in your driver. This file will obtain for you the definition of the
ian@0 37 dma_addr_t (which can hold any valid DMA address for the platform)
ian@0 38 type which should be used everywhere you hold a DMA (bus) address
ian@0 39 returned from the DMA mapping functions.
ian@0 40
ian@0 41 What memory is DMA'able?
ian@0 42
ian@0 43 The first piece of information you must know is what kernel memory can
ian@0 44 be used with the DMA mapping facilities. There has been an unwritten
ian@0 45 set of rules regarding this, and this text is an attempt to finally
ian@0 46 write them down.
ian@0 47
ian@0 48 If you acquired your memory via the page allocator
ian@0 49 (i.e. __get_free_page*()) or the generic memory allocators
ian@0 50 (i.e. kmalloc() or kmem_cache_alloc()) then you may DMA to/from
ian@0 51 that memory using the addresses returned from those routines.
ian@0 52
ian@0 53 This means specifically that you may _not_ use the memory/addresses
ian@0 54 returned from vmalloc() for DMA. It is possible to DMA to the
ian@0 55 _underlying_ memory mapped into a vmalloc() area, but this requires
ian@0 56 walking page tables to get the physical addresses, and then
ian@0 57 translating each of those pages back to a kernel address using
ian@0 58 something like __va(). [ EDIT: Update this when we integrate
ian@0 59 Gerd Knorr's generic code which does this. ]
ian@0 60
ian@0 61 This rule also means that you may use neither kernel image addresses
ian@0 62 (items in data/text/bss segments), nor module image addresses, nor
ian@0 63 stack addresses for DMA. These could all be mapped somewhere entirely
ian@0 64 different than the rest of physical memory. Even if those classes of
ian@0 65 memory could physically work with DMA, you'd need to ensure the I/O
ian@0 66 buffers were cacheline-aligned. Without that, you'd see cacheline
ian@0 67 sharing problems (data corruption) on CPUs with DMA-incoherent caches.
ian@0 68 (The CPU could write to one word, DMA would write to a different one
ian@0 69 in the same cache line, and one of them could be overwritten.)
ian@0 70
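The cacheline-sharing concern can be checked with simple arithmetic. The
helpers below are a sketch with hypothetical names; the cache line size is
passed in as a parameter because it is platform specific, and it is assumed
to be a power of two.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if [addr, addr + len) starts and ends on cache-line
 * boundaries, so no unrelated data shares a line with the buffer.
 * line_size must be a power of two. */
static int buffer_owns_its_cache_lines(uintptr_t addr, size_t len,
				       size_t line_size)
{
	uintptr_t mask = (uintptr_t)line_size - 1;

	return (addr & mask) == 0 && (len & mask) == 0;
}

/* Round a length up to a whole number of cache lines. */
static size_t round_up_to_cache_line(size_t len, size_t line_size)
{
	return (len + line_size - 1) & ~(line_size - 1);
}
```

For example, with 64-byte lines a 100-byte buffer would be padded to 128
bytes so that a neighbouring variable cannot land in its last cache line.
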
ian@0 71 Also, this means that you cannot take the return of a kmap()
ian@0 72 call and DMA to/from that. This is similar to vmalloc().
ian@0 73
ian@0 74 What about block I/O and networking buffers? The block I/O and
ian@0 75 networking subsystems make sure that the buffers they use are valid
ian@0 76 for you to DMA from/to.
ian@0 77
ian@0 78 DMA addressing limitations
ian@0 79
ian@0 80 Does your device have any DMA addressing limitations? For example, is
ian@0 81 your device only capable of driving the low order 24-bits of address
ian@0 82 on the PCI bus for SAC DMA transfers? If so, you need to inform the
ian@0 83 PCI layer of this fact.
ian@0 84
ian@0 85 By default, the kernel assumes that your device can address the full
ian@0 86 32-bits in a SAC cycle. For a 64-bit DAC capable device, this needs
ian@0 87 to be increased. And for a device with limitations, as discussed in
ian@0 88 the previous paragraph, it needs to be decreased.
ian@0 89
ian@0 90 pci_alloc_consistent() by default will return 32-bit DMA addresses.
ian@0 91 PCI-X specification requires PCI-X devices to support 64-bit
ian@0 92 addressing (DAC) for all transactions. And at least one platform (SGI
ian@0 93 SN2) requires 64-bit consistent allocations to operate correctly when
ian@0 94 the IO bus is in PCI-X mode. Therefore, like with pci_set_dma_mask(),
ian@0 95 it's good practice to call pci_set_consistent_dma_mask() to set the
ian@0 96 appropriate mask even if your device only supports 32-bit DMA
ian@0 97 (default) and especially if it's a PCI-X device.
ian@0 98
ian@0 99 For correct operation, you must interrogate the PCI layer in your
ian@0 100 device probe routine to see if the PCI controller on the machine can
ian@0 101 properly support the DMA addressing limitation your device has. It is
ian@0 102 good style to do this even if your device holds the default setting,
ian@0 103 because this shows that you did think about these issues wrt. your
ian@0 104 device.
ian@0 105
ian@0 106 The query is performed via a call to pci_set_dma_mask():
ian@0 107
ian@0 108 int pci_set_dma_mask(struct pci_dev *pdev, u64 device_mask);
ian@0 109
ian@0 110 The query for consistent allocations is performed via a call to
ian@0 111 pci_set_consistent_dma_mask():
ian@0 112
ian@0 113 int pci_set_consistent_dma_mask(struct pci_dev *pdev, u64 device_mask);
ian@0 114
ian@0 115 Here, pdev is a pointer to the PCI device struct of your device, and
ian@0 116 device_mask is a bit mask describing which bits of a PCI address your
ian@0 117 device supports. It returns zero if your card can perform DMA
ian@0 118 properly on the machine given the address mask you provided.
ian@0 119
ian@0 120 If it returns non-zero, your device cannot perform DMA properly on
ian@0 121 this platform, and attempting to do so will result in undefined
ian@0 122 behavior. You must either use a different mask, or not use DMA.
ian@0 123
ian@0 124 This means that in the failure case, you have three options:
ian@0 125
ian@0 126 1) Use another DMA mask, if possible (see below).
ian@0 127 2) Use some non-DMA mode for data transfer, if possible.
ian@0 128 3) Ignore this device and do not initialize it.
ian@0 129
ian@0 130 It is recommended that your driver print a kernel KERN_WARNING message
ian@0 131 when you end up performing either #2 or #3. In this manner, if a user
ian@0 132 of your driver reports that performance is bad or that the device is not
ian@0 133 even detected, you can ask them for the kernel messages to find out
ian@0 134 exactly why.
ian@0 135
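To make the meaning of device_mask concrete, here is a sketch of how a mask
relates to the number of address bits a device can drive. The function names
are hypothetical; only the resulting values (0x00ffffff for 24 bits,
0xffffffff for 32 bits, all ones for 64 bits) are taken from the masks used
in this document.

```c
#include <stdint.h>

/* Build a mask with the low `bits` address bits set (1 <= bits <= 64). */
static uint64_t toy_dma_bit_mask(unsigned bits)
{
	return bits >= 64 ? ~0ULL : (1ULL << bits) - 1;
}

/* Returns 1 if the device can drive every bit that is set in addr. */
static int toy_addr_fits_mask(uint64_t addr, uint64_t mask)
{
	return (addr & ~mask) == 0;
}
```

A device limited to 24 address bits can reach 0x00fffff0 but not 0x01000000,
which is the 16MB limit mentioned for ISA-derived hardware.
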
ian@0 136 The standard 32-bit addressing PCI device would do something like
ian@0 137 this:
ian@0 138
ian@0 139 if (pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
ian@0 140 printk(KERN_WARNING
ian@0 141 "mydev: No suitable DMA available.\n");
ian@0 142 goto ignore_this_device;
ian@0 143 }
ian@0 144
ian@0 145 Another common scenario is a 64-bit capable device. The approach
ian@0 146 here is to try for 64-bit DAC addressing, but back down to a
ian@0 147 32-bit mask should that fail. The PCI platform code may fail the
ian@0 148 64-bit mask not because the platform is not capable of 64-bit
ian@0 149 addressing. Rather, it may fail in this case simply because
ian@0 150 32-bit SAC addressing is done more efficiently than DAC addressing.
ian@0 151 Sparc64 is one platform which behaves in this way.
ian@0 152
ian@0 153 Here is how you would handle a 64-bit capable device which can drive
ian@0 154 all 64-bits when accessing streaming DMA:
ian@0 155
ian@0 156 int using_dac;
ian@0 157
ian@0 158 if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
ian@0 159 using_dac = 1;
ian@0 160 } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
ian@0 161 using_dac = 0;
ian@0 162 } else {
ian@0 163 printk(KERN_WARNING
ian@0 164 "mydev: No suitable DMA available.\n");
ian@0 165 goto ignore_this_device;
ian@0 166 }
ian@0 167
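The fallback logic can be exercised outside the kernel with a stand-in for
the platform. Everything below is hypothetical: struct mock_platform and its
min_mask/max_mask policy are assumptions chosen only to show the control
flow, and mock_set_dma_mask() merely copies the "zero means success"
convention of pci_set_dma_mask().

```c
#include <stdint.h>

/* Hypothetical platform model: it refuses any mask wider than max_mask
 * (as a platform preferring SAC might refuse a 64-bit mask) and any mask
 * narrower than min_mask (it cannot guarantee memory that low). */
struct mock_platform {
	uint64_t min_mask;
	uint64_t max_mask;
	uint64_t current_mask;	/* mask saved by the last successful call */
};

/* Returns 0 on success and non-zero on failure. */
static int mock_set_dma_mask(struct mock_platform *p, uint64_t mask)
{
	if (mask < p->min_mask || mask > p->max_mask)
		return -1;
	p->current_mask = mask;
	return 0;
}

/* Try 64-bit first, then fall back to 32-bit.
 * Returns 1 for DAC, 0 for SAC, -1 if no usable mask was accepted. */
static int mock_probe_dma(struct mock_platform *p)
{
	if (!mock_set_dma_mask(p, ~0ULL))
		return 1;
	if (!mock_set_dma_mask(p, 0xffffffffULL))
		return 0;
	return -1;
}
```

The return value plays the role of using_dac above, with -1 standing for the
ignore_this_device path.
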
ian@0 168 If a card is capable of using 64-bit consistent allocations as well,
ian@0 169 the case would look like this:
ian@0 170
ian@0 171 int using_dac, consistent_using_dac;
ian@0 172
ian@0 173 if (!pci_set_dma_mask(pdev, DMA_64BIT_MASK)) {
ian@0 174 using_dac = 1;
ian@0 175 consistent_using_dac = 1;
ian@0 176 pci_set_consistent_dma_mask(pdev, DMA_64BIT_MASK);
ian@0 177 } else if (!pci_set_dma_mask(pdev, DMA_32BIT_MASK)) {
ian@0 178 using_dac = 0;
ian@0 179 consistent_using_dac = 0;
ian@0 180 pci_set_consistent_dma_mask(pdev, DMA_32BIT_MASK);
ian@0 181 } else {
ian@0 182 printk(KERN_WARNING
ian@0 183 "mydev: No suitable DMA available.\n");
ian@0 184 goto ignore_this_device;
ian@0 185 }
ian@0 186
ian@0 187 pci_set_consistent_dma_mask() will always be able to set the same mask as
ian@0 188 pci_set_dma_mask(), or a smaller one. However, for the rare case that a
ian@0 189 device driver only uses consistent allocations, one would have to
ian@0 190 check the return value from pci_set_consistent_dma_mask().
ian@0 191
ian@0 192 If your 64-bit device is going to be an enormous consumer of DMA
ian@0 193 mappings, this can be problematic since the DMA mappings are a
ian@0 194 finite resource on many platforms. Please see the "DAC Addressing
ian@0 195 for Address Space Hungry Devices" section near the end of this
ian@0 196 document for how to handle this case.
ian@0 197
ian@0 198 Finally, if your device can only drive the low 24-bits of
ian@0 199 address during PCI bus mastering you might do something like:
ian@0 200
ian@0 201 if (pci_set_dma_mask(pdev, DMA_24BIT_MASK)) {
ian@0 202 printk(KERN_WARNING
ian@0 203 "mydev: 24-bit DMA addressing not available.\n");
ian@0 204 goto ignore_this_device;
ian@0 205 }
ian@0 206 [Better use DMA_24BIT_MASK instead of 0x00ffffff.
ian@0 207 See include/linux/dma-mapping.h for reference.]
ian@0 208
ian@0 209 When pci_set_dma_mask() is successful, and returns zero, the PCI layer
ian@0 210 saves away this mask you have provided. The PCI layer will use this
ian@0 211 information later when you make DMA mappings.
ian@0 212
ian@0 213 There is a case which we are aware of at this time, which is worth
ian@0 214 mentioning in this documentation. If your device supports multiple
ian@0 215 functions (for example a sound card provides playback and record
ian@0 216 functions) and the various different functions have _different_
ian@0 217 DMA addressing limitations, you may wish to probe each mask and
ian@0 218 only provide the functionality which the machine can handle. It
ian@0 219 is important that the last call to pci_set_dma_mask() be for the
ian@0 220 most specific mask.
ian@0 221
ian@0 222 Here is pseudo-code showing how this might be done:
ian@0 223
ian@0 224 #define PLAYBACK_ADDRESS_BITS DMA_32BIT_MASK
ian@0 225 #define RECORD_ADDRESS_BITS 0x00ffffff
ian@0 226
ian@0 227 struct my_sound_card *card;
ian@0 228 struct pci_dev *pdev;
ian@0 229
ian@0 230 ...
ian@0 231 if (!pci_set_dma_mask(pdev, PLAYBACK_ADDRESS_BITS)) {
ian@0 232 card->playback_enabled = 1;
ian@0 233 } else {
ian@0 234 card->playback_enabled = 0;
ian@0 235 printk(KERN_WARNING "%s: Playback disabled due to DMA limitations.\n",
ian@0 236 card->name);
ian@0 237 }
ian@0 238 if (!pci_set_dma_mask(pdev, RECORD_ADDRESS_BITS)) {
ian@0 239 card->record_enabled = 1;
ian@0 240 } else {
ian@0 241 card->record_enabled = 0;
ian@0 242 printk(KERN_WARNING "%s: Record disabled due to DMA limitations.\n",
ian@0 243 card->name);
ian@0 244 }
ian@0 245
ian@0 246 A sound card was used as an example here because this genre of PCI
ian@0 247 devices seems to be littered with ISA chips given a PCI front end,
ian@0 248 and thus retaining the 16MB DMA addressing limitations of ISA.
ian@0 249
ian@0 250 Types of DMA mappings
ian@0 251
ian@0 252 There are two types of DMA mappings:
ian@0 253
ian@0 254 - Consistent DMA mappings which are usually mapped at driver
ian@0 255 initialization, unmapped at the end and for which the hardware should
ian@0 256 guarantee that the device and the CPU can access the data
ian@0 257 in parallel and will see updates made by each other without any
ian@0 258 explicit software flushing.
ian@0 259
ian@0 260 Think of "consistent" as "synchronous" or "coherent".
ian@0 261
ian@0 262 The current default is to return consistent memory in the low 32
ian@0 263 bits of the PCI bus space. However, for future compatibility you
ian@0 264 should set the consistent mask even if this default is fine for your
ian@0 265 driver.
ian@0 266
ian@0 267 Good examples of what to use consistent mappings for are:
ian@0 268
ian@0 269 - Network card DMA ring descriptors.
ian@0 270 - SCSI adapter mailbox command data structures.
ian@0 271 - Device firmware microcode executed out of
ian@0 272 main memory.
ian@0 273
ian@0 274 The invariant these examples all require is that any CPU store
ian@0 275 to memory is immediately visible to the device, and vice
ian@0 276 versa. Consistent mappings guarantee this.
ian@0 277
ian@0 278 IMPORTANT: Consistent DMA memory does not preclude the usage of
ian@0 279 proper memory barriers. The CPU may reorder stores to
ian@0 280 consistent memory just as it may normal memory. Example:
ian@0 281 if it is important for the device to see the first word
ian@0 282 of a descriptor updated before the second, you must do
ian@0 283 something like:
ian@0 284
ian@0 285 desc->word0 = address;
ian@0 286 wmb();
ian@0 287 desc->word1 = DESC_VALID;
ian@0 288
ian@0 289 in order to get correct behavior on all platforms.
ian@0 290
ian@0 291 Also, on some platforms your driver may need to flush CPU write
ian@0 292 buffers in much the same way as it needs to flush write buffers
ian@0 293 found in PCI bridges (such as by reading a register's value
ian@0 294 after writing it).
ian@0 295
ian@0 296 - Streaming DMA mappings which are usually mapped for one DMA transfer,
ian@0 297 unmapped right after it (unless you use pci_dma_sync_* below) and for which
ian@0 298 hardware can optimize for sequential accesses.
ian@0 299
ian@0 300 Think of "streaming" as "asynchronous" or "outside the coherency
ian@0 301 domain".
ian@0 302
ian@0 303 Good examples of what to use streaming mappings for are:
ian@0 304
ian@0 305 - Networking buffers transmitted/received by a device.
ian@0 306 - Filesystem buffers written/read by a SCSI device.
ian@0 307
ian@0 308 The interfaces for using this type of mapping were designed in
ian@0 309 such a way that an implementation can make whatever performance
ian@0 310 optimizations the hardware allows. To this end, when using
ian@0 311 such mappings you must be explicit about what you want to happen.
ian@0 312
ian@0 313 Neither type of DMA mapping has alignment restrictions that come
ian@0 314 from PCI, although some devices may have such restrictions.
ian@0 315 Also, systems with caches that aren't DMA-coherent will work better
ian@0 316 when the underlying buffers don't share cache lines with other data.
ian@0 317
ian@0 318
ian@0 319 Using Consistent DMA mappings.
ian@0 320
ian@0 321 To allocate and map large (PAGE_SIZE or so) consistent DMA regions,
ian@0 322 you should do:
ian@0 323
ian@0 324 dma_addr_t dma_handle;
ian@0 325
ian@0 326 cpu_addr = pci_alloc_consistent(dev, size, &dma_handle);
ian@0 327
ian@0 328 where dev is a struct pci_dev *. You should pass NULL for PCI like buses
ian@0 329 where devices don't have struct pci_dev (like ISA, EISA). This may be
ian@0 330 called in interrupt context.
ian@0 331
ian@0 332 This argument is needed because the DMA translations may be bus
ian@0 333 specific (and often are private to the bus which the device is attached
ian@0 334 to).
ian@0 335
ian@0 336 Size is the length of the region you want to allocate, in bytes.
ian@0 337
ian@0 338 This routine will allocate RAM for that region, so it acts similarly to
ian@0 339 __get_free_pages (but takes size instead of a page order). If your
ian@0 340 driver needs regions sized smaller than a page, you may prefer using
ian@0 341 the pci_pool interface, described below.
ian@0 342
ian@0 343 The consistent DMA mapping interfaces, for non-NULL dev, will by
ian@0 344 default return a DMA address which is SAC (Single Address Cycle)
ian@0 345 addressable. Even if the device indicates (via PCI dma mask) that it
ian@0 346 may address the upper 32-bits and thus perform DAC cycles, consistent
ian@0 347 allocation will only return > 32-bit PCI addresses for DMA if the
ian@0 348 consistent dma mask has been explicitly changed via
ian@0 349 pci_set_consistent_dma_mask(). This is true of the pci_pool interface
ian@0 350 as well.
ian@0 351
ian@0 352 pci_alloc_consistent returns two values: the virtual address which you
ian@0 353 can use to access it from the CPU and dma_handle which you pass to the
ian@0 354 card.
ian@0 355
ian@0 356 The cpu return address and the DMA bus master address are both
ian@0 357 guaranteed to be aligned to the smallest PAGE_SIZE order which
ian@0 358 is greater than or equal to the requested size. This invariant
ian@0 359 exists (for example) to guarantee that if you allocate a chunk
ian@0 360 which is smaller than or equal to 64 kilobytes, the extent of the
ian@0 361 buffer you receive will not cross a 64K boundary.
ian@0 362
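The alignment guarantee can be worked through numerically. The sketch below
assumes 4 KiB pages and uses hypothetical helper names; it computes the
power-of-two span a request is rounded up to and tests whether a region
crosses a given boundary.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Smallest (TOY_PAGE_SIZE << order) that is >= size. */
static size_t toy_order_size(size_t size)
{
	size_t span = TOY_PAGE_SIZE;

	while (span < size)
		span <<= 1;
	return span;
}

/* Returns 1 if [addr, addr + size) crosses a multiple of boundary.
 * size must be non-zero. */
static int toy_crosses_boundary(uint64_t addr, size_t size, uint64_t boundary)
{
	return (addr / boundary) != ((addr + size - 1) / boundary);
}
```

A 20000-byte request is rounded up to a 32768-byte span. Any address aligned
to 32768 keeps that buffer inside one 64K window, whereas an 8192-byte buffer
at the unaligned address 0xF000 would straddle the 64K line.
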
ian@0 363 To unmap and free such a DMA region, you call:
ian@0 364
ian@0 365 pci_free_consistent(dev, size, cpu_addr, dma_handle);
ian@0 366
ian@0 367 where dev, size are the same as in the above call and cpu_addr and
ian@0 368 dma_handle are the values pci_alloc_consistent returned to you.
ian@0 369 This function may not be called in interrupt context.
ian@0 370
ian@0 371 If your driver needs lots of smaller memory regions, you can write
ian@0 372 custom code to subdivide pages returned by pci_alloc_consistent,
ian@0 373 or you can use the pci_pool API to do that. A pci_pool is like
ian@0 374 a kmem_cache, but it uses pci_alloc_consistent not __get_free_pages.
ian@0 375 Also, it understands common hardware constraints for alignment,
ian@0 376 like queue heads needing to be aligned on N byte boundaries.
ian@0 377
ian@0 378 Create a pci_pool like this:
ian@0 379
ian@0 380 struct pci_pool *pool;
ian@0 381
ian@0 382 pool = pci_pool_create(name, dev, size, align, alloc);
ian@0 383
ian@0 384 The "name" is for diagnostics (like a kmem_cache name); dev and size
ian@0 385 are as above. The device's hardware alignment requirement for this
ian@0 386 type of data is "align" (which is expressed in bytes, and must be a
ian@0 387 power of two). If your device has no boundary crossing restrictions,
ian@0 388 pass 0 for alloc; passing 4096 says memory allocated from this pool
ian@0 389 must not cross 4KByte boundaries (but at that time it may be better to
ian@0 390 go for pci_alloc_consistent directly instead).
ian@0 391
ian@0 392 Allocate memory from a pci pool like this:
ian@0 393
ian@0 394 cpu_addr = pci_pool_alloc(pool, flags, &dma_handle);
ian@0 395
ian@0 396 flags are SLAB_KERNEL if blocking is permitted (not in_interrupt nor
ian@0 397 holding SMP locks), SLAB_ATOMIC otherwise. Like pci_alloc_consistent,
ian@0 398 this returns two values, cpu_addr and dma_handle.
ian@0 399
ian@0 400 Free memory that was allocated from a pci_pool like this:
ian@0 401
ian@0 402 pci_pool_free(pool, cpu_addr, dma_handle);
ian@0 403
ian@0 404 where pool is what you passed to pci_pool_alloc, and cpu_addr and
ian@0 405 dma_handle are the values pci_pool_alloc returned. This function
ian@0 406 may be called in interrupt context.
ian@0 407
ian@0 408 Destroy a pci_pool by calling:
ian@0 409
ian@0 410 pci_pool_destroy(pool);
ian@0 411
ian@0 412 Make sure you've called pci_pool_free for all memory allocated
ian@0 413 from a pool before you destroy the pool. This function may not
ian@0 414 be called in interrupt context.
ian@0 415
ian@0 416 DMA Direction
ian@0 417
ian@0 418 The interfaces described in subsequent portions of this document
ian@0 419 take a DMA direction argument, which is an integer and takes on
ian@0 420 one of the following values:
ian@0 421
ian@0 422 PCI_DMA_BIDIRECTIONAL
ian@0 423 PCI_DMA_TODEVICE
ian@0 424 PCI_DMA_FROMDEVICE
ian@0 425 PCI_DMA_NONE
ian@0 426
ian@0 427 You should provide the exact DMA direction if you know it.
ian@0 428
ian@0 429 PCI_DMA_TODEVICE means "from main memory to the PCI device"
ian@0 430 PCI_DMA_FROMDEVICE means "from the PCI device to main memory"
ian@0 431 It is the direction in which the data moves during the DMA
ian@0 432 transfer.
ian@0 433
ian@0 434 You are _strongly_ encouraged to specify this as precisely
ian@0 435 as you possibly can.
ian@0 436
ian@0 437 If you absolutely cannot know the direction of the DMA transfer,
ian@0 438 specify PCI_DMA_BIDIRECTIONAL. It means that the DMA can go in
ian@0 439 either direction. The platform guarantees that you may legally
ian@0 440 specify this, and that it will work, but this may be at the
ian@0 441 cost of performance for example.
ian@0 442
ian@0 443 The value PCI_DMA_NONE is to be used for debugging. One can
ian@0 444 hold this in a data structure before you come to know the
ian@0 445 precise direction, and this will help catch cases where your
ian@0 446 direction tracking logic has failed to set things up properly.
ian@0 447
ian@0 448 Another advantage of specifying this value precisely (outside of
ian@0 449 potential platform-specific optimizations of such) is for debugging.
ian@0 450 Some platforms actually have a write permission boolean which DMA
ian@0 451 mappings can be marked with, much like page protections in the user
ian@0 452 program address space. Such platforms can and do report errors in the
ian@0 453 kernel logs when the PCI controller hardware detects violation of the
ian@0 454 permission setting.
ian@0 455
ian@0 456 Only streaming mappings specify a direction, consistent mappings
ian@0 457 implicitly have a direction attribute setting of
ian@0 458 PCI_DMA_BIDIRECTIONAL.
ian@0 459
ian@0 460 The SCSI subsystem tells you the direction to use in the
ian@0 461 'sc_data_direction' member of the SCSI command your driver is
ian@0 462 working on.
ian@0 463
ian@0 464 For Networking drivers, it's a rather simple affair. For transmit
ian@0 465 packets, map/unmap them with the PCI_DMA_TODEVICE direction
ian@0 466 specifier. For receive packets, just the opposite, map/unmap them
ian@0 467 with the PCI_DMA_FROMDEVICE direction specifier.
ian@0 468
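The direction rules above can be summarised in a few lines. This is a
user-space sketch: the toy_ enum and helpers are hypothetical stand-ins for
the PCI_DMA_* values, kept separate so the fragment compiles without kernel
headers.

```c
/* Hypothetical mirror of the four direction values named in this text. */
enum toy_dma_dir {
	TOY_DMA_BIDIRECTIONAL,
	TOY_DMA_TODEVICE,
	TOY_DMA_FROMDEVICE,
	TOY_DMA_NONE		/* debugging sentinel: "not decided yet" */
};

/* Networking rule: transmit goes to the device, receive comes from it. */
static enum toy_dma_dir toy_dir_for_packet(int is_transmit)
{
	return is_transmit ? TOY_DMA_TODEVICE : TOY_DMA_FROMDEVICE;
}

/* A mapping request is only sane once a real direction has been chosen. */
static int toy_dir_is_valid(enum toy_dma_dir dir)
{
	return dir != TOY_DMA_NONE;
}

/* Model of the "write permission boolean": may the device write memory? */
static int toy_device_may_write(enum toy_dma_dir dir)
{
	return dir == TOY_DMA_FROMDEVICE || dir == TOY_DMA_BIDIRECTIONAL;
}
```

A driver structure would start out holding the NONE value, and a check like
toy_dir_is_valid() just before mapping catches a path that forgot to set it.
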
ian@0 469 Using Streaming DMA mappings
ian@0 470
ian@0 471 The streaming DMA mapping routines can be called from interrupt
ian@0 472 context. There are two versions of each map/unmap, one which will
ian@0 473 map/unmap a single memory region, and one which will map/unmap a
ian@0 474 scatterlist.
ian@0 475
ian@0 476 To map a single region, you do:
ian@0 477
ian@0 478 struct pci_dev *pdev = mydev->pdev;
ian@0 479 dma_addr_t dma_handle;
ian@0 480 void *addr = buffer->ptr;
ian@0 481 size_t size = buffer->len;
ian@0 482
ian@0 483 dma_handle = pci_map_single(pdev, addr, size, direction);
ian@0 484
ian@0 485 and to unmap it:
ian@0 486
ian@0 487 pci_unmap_single(pdev, dma_handle, size, direction);
ian@0 488
ian@0 489 You should call pci_unmap_single when the DMA activity is finished, e.g.
ian@0 490 from the interrupt which told you that the DMA transfer is done.
ian@0 491
ian@0 492 Using cpu pointers like this for single mappings has a disadvantage,
ian@0 493 you cannot reference HIGHMEM memory in this way. Thus, there is a
ian@0 494 map/unmap interface pair akin to pci_{map,unmap}_single. These
ian@0 495 interfaces deal with page/offset pairs instead of cpu pointers.
ian@0 496 Specifically:
ian@0 497
ian@0 498 struct pci_dev *pdev = mydev->pdev;
ian@0 499 dma_addr_t dma_handle;
ian@0 500 struct page *page = buffer->page;
ian@0 501 unsigned long offset = buffer->offset;
ian@0 502 size_t size = buffer->len;
ian@0 503
ian@0 504 dma_handle = pci_map_page(pdev, page, offset, size, direction);
ian@0 505
ian@0 506 ...
ian@0 507
ian@0 508 pci_unmap_page(pdev, dma_handle, size, direction);
ian@0 509
ian@0 510 Here, "offset" means byte offset within the given page.
ian@0 511
ian@0 512 With scatterlists, you map a region gathered from several regions by:
ian@0 513
ian@0 514 int i, count = pci_map_sg(dev, sglist, nents, direction);
ian@0 515 struct scatterlist *sg;
ian@0 516
ian@0 517 for (i = 0, sg = sglist; i < count; i++, sg++) {
ian@0 518 hw_address[i] = sg_dma_address(sg);
ian@0 519 hw_len[i] = sg_dma_len(sg);
ian@0 520 }
ian@0 521
ian@0 522 where nents is the number of entries in the sglist.
ian@0 523
ian@0 524 The implementation is free to merge several consecutive sglist entries
ian@0 525 into one (e.g. if DMA mapping is done with PAGE_SIZE granularity, any
ian@0 526 consecutive sglist entries can be merged into one provided the first one
ian@0 527 ends and the second one starts on a page boundary - in fact this is a huge
ian@0 528 advantage for cards which either cannot do scatter-gather or have very
ian@0 529 limited number of scatter-gather entries) and returns the actual number
ian@0 530 of sg entries it mapped them to. On failure 0 is returned.
ian@0 531
ian@0 532 Then you should loop count times (note: this can be less than nents times)
ian@0 533 and use sg_dma_address() and sg_dma_len() macros where you previously
ian@0 534 accessed sg->address and sg->length as shown above.
ian@0 535
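Why count can be smaller than nents is easiest to see in a toy model of the
merging step. The sketch below is not the kernel implementation: struct
toy_sg is a hypothetical, simplified entry, and the bus address is assumed to
equal the CPU-side address so that only the merging logic is shown.

```c
#include <stdint.h>

/* Hypothetical, simplified scatterlist entry. */
struct toy_sg {
	uint64_t addr;		/* CPU-side address of the chunk (input) */
	uint32_t len;		/* CPU-side length (input) */
	uint64_t dma_addr;	/* filled in by toy_map_sg() */
	uint32_t dma_len;	/* filled in by toy_map_sg() */
};

/* Merge neighbouring entries that are contiguous. The results are written
 * to the dma_* fields of the first `count` entries; returns count, which
 * is <= nents. The input fields addr and len are left untouched. */
static int toy_map_sg(struct toy_sg *sg, int nents)
{
	int count = 0;

	for (int i = 0; i < nents; i++) {
		if (count > 0 &&
		    sg[count - 1].dma_addr + sg[count - 1].dma_len == sg[i].addr) {
			/* Contiguous with the previous segment: extend it. */
			sg[count - 1].dma_len += sg[i].len;
		} else {
			/* Start a new DMA segment. */
			sg[count].dma_addr = sg[i].addr;
			sg[count].dma_len = sg[i].len;
			count++;
		}
	}
	return count;
}
```

Three entries of which the first two are adjacent come back as two DMA
segments. The list itself still has three entries, which is why the original
nents value, and not count, describes it afterwards.
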
ian@0 536 To unmap a scatterlist, just call:
ian@0 537
ian@0 538 pci_unmap_sg(dev, sglist, nents, direction);
ian@0 539
ian@0 540 Again, make sure DMA activity has already finished.
ian@0 541
ian@0 542 PLEASE NOTE: The 'nents' argument to the pci_unmap_sg call must be
ian@0 543 the _same_ one you passed into the pci_map_sg call,
ian@0 544 it should _NOT_ be the 'count' value _returned_ from the
ian@0 545 pci_map_sg call.
ian@0 546
ian@0 547 Every pci_map_{single,sg} call should have its pci_unmap_{single,sg}
ian@0 548 counterpart, because the bus address space is a shared resource (although
ian@0 549 in some ports the mapping is per each BUS so fewer devices contend for the
ian@0 550 same bus address space) and you could render the machine unusable by eating
ian@0 551 all bus addresses.
ian@0 552
ian@0 553 If you need to use the same streaming DMA region multiple times and touch
ian@0 554 the data in between the DMA transfers, the buffer needs to be synced
ian@0 555 properly in order for the cpu and device to see the most up-to-date and
ian@0 556 correct copy of the DMA buffer.
ian@0 557
ian@0 558 So, firstly, just map it with pci_map_{single,sg}, and after each DMA
ian@0 559 transfer call either:
ian@0 560
ian@0 561 pci_dma_sync_single_for_cpu(dev, dma_handle, size, direction);
ian@0 562
ian@0 563 or:
ian@0 564
ian@0 565 pci_dma_sync_sg_for_cpu(dev, sglist, nents, direction);
ian@0 566
ian@0 567 as appropriate.
ian@0 568
ian@0 569 Then, if you wish to let the device get at the DMA area again,
ian@0 570 finish accessing the data with the cpu, and then before actually
ian@0 571 giving the buffer to the hardware call either:
ian@0 572
ian@0 573 pci_dma_sync_single_for_device(dev, dma_handle, size, direction);
ian@0 574
ian@0 575 or:
ian@0 576
ian@0 577 pci_dma_sync_sg_for_device(dev, sglist, nents, direction);
ian@0 578
ian@0 579 as appropriate.
ian@0 580
ian@0 581 After the last DMA transfer call one of the DMA unmap routines
ian@0 582 pci_unmap_{single,sg}. If you don't touch the data from the first pci_map_*
ian@0 583 call till pci_unmap_*, then you don't have to call the pci_dma_sync_*
ian@0 584 routines at all.
ian@0 585
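One way to reason about the sync calls is as transfers of ownership between
the device and the cpu. The state machine below is a sketch with hypothetical
names; it tracks only who may touch the buffer and performs no cache
maintenance.

```c
/* Who is currently allowed to touch the streaming buffer. */
enum toy_owner { TOY_OWNER_DEVICE, TOY_OWNER_CPU, TOY_OWNER_UNMAPPED };

struct toy_stream_buf {
	enum toy_owner owner;
};

/* Mapping hands the buffer to the device. */
static void toy_map(struct toy_stream_buf *b)
{
	b->owner = TOY_OWNER_DEVICE;
}

/* Corresponds to the *_for_cpu sync: the cpu may now look at the data. */
static void toy_sync_for_cpu(struct toy_stream_buf *b)
{
	b->owner = TOY_OWNER_CPU;
}

/* Corresponds to the *_for_device sync: hand the buffer back. */
static void toy_sync_for_device(struct toy_stream_buf *b)
{
	b->owner = TOY_OWNER_DEVICE;
}

/* Unmapping ends the device's access for good. */
static void toy_unmap(struct toy_stream_buf *b)
{
	b->owner = TOY_OWNER_UNMAPPED;
}

/* The cpu may touch the data only after a sync-for-cpu or an unmap. */
static int toy_cpu_may_access(const struct toy_stream_buf *b)
{
	return b->owner != TOY_OWNER_DEVICE;
}
```

Seen this way, a driver that never inspects the data between map and unmap
never needs the two sync transitions.
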
ian@0 586 Here is pseudo code which shows a situation in which you would need
ian@0 587 to use the pci_dma_sync_*() interfaces.
ian@0 588
ian@0 589 my_card_setup_receive_buffer(struct my_card *cp, char *buffer, int len)
ian@0 590 {
ian@0 591 dma_addr_t mapping;
ian@0 592
ian@0 593 mapping = pci_map_single(cp->pdev, buffer, len, PCI_DMA_FROMDEVICE);
ian@0 594
ian@0 595 cp->rx_buf = buffer;
ian@0 596 cp->rx_len = len;
ian@0 597 cp->rx_dma = mapping;
ian@0 598
ian@0 599 give_rx_buf_to_card(cp);
ian@0 600 }
ian@0 601
ian@0 602 ...
ian@0 603
ian@0 604 my_card_interrupt_handler(int irq, void *devid, struct pt_regs *regs)
ian@0 605 {
ian@0 606 struct my_card *cp = devid;
ian@0 607
ian@0 608 ...
ian@0 609 if (read_card_status(cp) == RX_BUF_TRANSFERRED) {
ian@0 610 struct my_card_header *hp;
ian@0 611
ian@0 612 /* Examine the header to see if we wish
ian@0 613 * to accept the data. But synchronize
ian@0 614 * the DMA transfer with the CPU first
ian@0 615 * so that we see updated contents.
ian@0 616 */
ian@0 617 pci_dma_sync_single_for_cpu(cp->pdev, cp->rx_dma,
ian@0 618 cp->rx_len,
ian@0 619 PCI_DMA_FROMDEVICE);
ian@0 620
ian@0 621 /* Now it is safe to examine the buffer. */
ian@0 622 hp = (struct my_card_header *) cp->rx_buf;
ian@0 623 if (header_is_ok(hp)) {
ian@0 624 pci_unmap_single(cp->pdev, cp->rx_dma, cp->rx_len,
ian@0 625 PCI_DMA_FROMDEVICE);
ian@0 626 pass_to_upper_layers(cp->rx_buf);
ian@0 627 make_and_setup_new_rx_buf(cp);
ian@0 628 } else {
ian@0 629 /* Just sync the buffer and give it back
ian@0 630 * to the card.
ian@0 631 */
ian@0 632 pci_dma_sync_single_for_device(cp->pdev,
ian@0 633 cp->rx_dma,
ian@0 634 cp->rx_len,
ian@0 635 PCI_DMA_FROMDEVICE);
ian@0 636 give_rx_buf_to_card(cp);
ian@0 637 }
ian@0 638 }
ian@0 639 }
ian@0 640
ian@0 641 Drivers converted fully to this interface should not use virt_to_bus any
ian@0 642 longer, nor should they use bus_to_virt. Some drivers have to be changed a
ian@0 643 little bit, because there is no longer an equivalent to bus_to_virt in the
ian@0 644 dynamic DMA mapping scheme - you have to always store the DMA addresses
ian@0 645 returned by the pci_alloc_consistent, pci_pool_alloc, and pci_map_single
ian@0 646 calls (pci_map_sg stores them in the scatterlist itself if the platform
ian@0 647 supports dynamic DMA mapping in hardware) in your driver structures and/or
ian@0 648 in the card registers.
ian@0 649
ian@0 650 All PCI drivers should be using these interfaces with no exceptions.
ian@0 651 It is planned to completely remove virt_to_bus() and bus_to_virt() as
ian@0 652 they are entirely deprecated. Some ports already do not provide these
ian@0 653 as it is impossible to correctly support them.
ian@0 654
ian@0 655 64-bit DMA and DAC cycle support
ian@0 656
ian@0 657 Do you understand all of the text above? Great, then you already
ian@0 658 know how to use 64-bit DMA addressing under Linux. Simply make
ian@0 659 the appropriate pci_set_dma_mask() calls based upon your card's
ian@0 660 capabilities, then use the mapping APIs above.
ian@0 661
ian@0 662 It is that simple.
ian@0 663
ian@0 664 Well, not for some odd devices. See the next section for information
ian@0 665 about that.
ian@0 666
ian@0 667 DAC Addressing for Address Space Hungry Devices
ian@0 668
ian@0 669 There exists a class of devices which do not mesh well with the PCI
ian@0 670 DMA mapping API. By definition these "mappings" are a finite
ian@0 671 resource. The number of total available mappings per bus is platform
ian@0 672 specific, but there will always be a reasonable amount.
ian@0 673
ian@0 674 What is "reasonable"? Reasonable means that networking and block I/O
ian@0 675 devices need not worry about using too many mappings.
ian@0 676
ian@0 677 As an example of a problematic device, consider compute cluster cards.
ian@0 678 They can potentially need to access gigabytes of memory at once via
ian@0 679 DMA. Dynamic mappings are unsuitable for this kind of access pattern.
ian@0 680
ian@0 681 To this end we've provided a small API by which a device driver
ian@0 682 may use DAC cycles to directly address all of physical memory.
ian@0 683 Not all platforms support this, but most do. It is easy to determine
ian@0 684 whether the platform will work properly at probe time.
ian@0 685
ian@0 686 First, understand that there may be a SEVERE performance penalty for
ian@0 687 using these interfaces on some platforms. Therefore, you MUST only
ian@0 688 use these interfaces if it is absolutely required. 99% of devices can
ian@0 689 use the normal APIs without any problems.
ian@0 690
ian@0 691 Note that for streaming type mappings you must either use these
ian@0 692 interfaces, or the dynamic mapping interfaces above. You may not mix
ian@0 693 usage of both for the same device. Such an act is illegal and is
ian@0 694 guaranteed to put a banana in your tailpipe.
ian@0 695
ian@0 696 However, consistent mappings may in fact be used in conjunction with
ian@0 697 these interfaces. Remember that, as defined, consistent mappings are
ian@0 698 always going to be SAC addressable.
ian@0 699
ian@0 700 The first thing your driver needs to do is query the PCI platform
ian@0 701 layer if it is capable of handling your device's DAC addressing
ian@0 702 capabilities:
ian@0 703
ian@0 704 int pci_dac_dma_supported(struct pci_dev *hwdev, u64 mask);
ian@0 705
ian@0 706 You may not use the following interfaces if this routine fails.
ian@0 707
ian@0 708 Next, DMA addresses using this API are kept track of using the
ian@0 709 dma64_addr_t type. It is guaranteed to be big enough to hold any
ian@0 710 DAC address the platform layer will give to you from the following
ian@0 711 routines. If you have consistent mappings as well, you still
ian@0 712 use plain dma_addr_t to keep track of those.
ian@0 713
ian@0 714 All mappings obtained here will be direct. The mappings are not
ian@0 715 translated, and this is the purpose of this dialect of the DMA API.
ian@0 716
ian@0 717 All routines work with page/offset pairs. This is the _ONLY_ way to
ian@0 718 portably refer to any piece of memory. If you have a cpu pointer
ian@0 719 (which may be validly DMA'd too) you may easily obtain the page
ian@0 720 and offset using something like this:
ian@0 721
ian@0 722 struct page *page = virt_to_page(ptr);
ian@0 723 unsigned long offset = offset_in_page(ptr);
ian@0 724
ian@0 725 Here are the interfaces:
ian@0 726
ian@0 727 dma64_addr_t pci_dac_page_to_dma(struct pci_dev *pdev,
ian@0 728 struct page *page,
ian@0 729 unsigned long offset,
ian@0 730 int direction);
ian@0 731
ian@0 732 The DAC address for the tuple PAGE/OFFSET is returned. The direction
ian@0 733 argument is the same as for pci_{map,unmap}_single(). The same rules
ian@0 734 for cpu/device access apply here as for the streaming mapping
ian@0 735 interfaces. To reiterate:
ian@0 736
ian@0 737 The cpu may touch the buffer before pci_dac_page_to_dma.
ian@0 738 The device may touch the buffer after pci_dac_page_to_dma
ian@0 739 is made, but the cpu may NOT.
ian@0 740
ian@0 741 When the DMA transfer is complete, invoke:
ian@0 742
ian@0 743 void pci_dac_dma_sync_single_for_cpu(struct pci_dev *pdev,
ian@0 744 dma64_addr_t dma_addr,
ian@0 745 size_t len, int direction);
ian@0 746
ian@0 747 This must be done before the CPU looks at the buffer again.
ian@0 748 This interface behaves identically to pci_dma_sync_{single,sg}_for_cpu().
ian@0 749
ian@0 750 And likewise, if you wish to let the device get back at the buffer after
ian@0 751 the cpu has read/written it, invoke:
ian@0 752
ian@0 753 void pci_dac_dma_sync_single_for_device(struct pci_dev *pdev,
ian@0 754 dma64_addr_t dma_addr,
ian@0 755 size_t len, int direction);
ian@0 756
ian@0 757 before letting the device access the DMA area again.
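
The ownership hand-offs described above can be modelled as a small state
machine. This is a user-space sketch for reasoning only: struct model_buf and
the model_* functions are hypothetical, call no kernel API, and merely record
who is allowed to touch the buffer at each step.

```c
#include <assert.h>
#include <stdbool.h>

/* Who is currently allowed to touch the buffer. */
enum model_owner { MODEL_OWNER_CPU, MODEL_OWNER_DEVICE };

struct model_buf {
	enum model_owner owner;
};

/* Models pci_dac_page_to_dma(): the buffer passes to the device. */
static void model_page_to_dma(struct model_buf *b)
{
	b->owner = MODEL_OWNER_DEVICE;
}

/* Models pci_dac_dma_sync_single_for_cpu(): the CPU may look again. */
static void model_sync_for_cpu(struct model_buf *b)
{
	b->owner = MODEL_OWNER_CPU;
}

/* Models pci_dac_dma_sync_single_for_device(): back to the device. */
static void model_sync_for_device(struct model_buf *b)
{
	b->owner = MODEL_OWNER_DEVICE;
}

/* True if a CPU access would be legal right now. */
static bool model_cpu_may_touch(const struct model_buf *b)
{
	return b->owner == MODEL_OWNER_CPU;
}
```

A driver following the rules only reads or writes the buffer while
model_cpu_may_touch() would return true.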
ian@0 758
ian@0 759 If you need to get back to the PAGE/OFFSET tuple from a dma64_addr_t
ian@0 760 the following interfaces are provided:
ian@0 761
ian@0 762 struct page *pci_dac_dma_to_page(struct pci_dev *pdev,
ian@0 763 dma64_addr_t dma_addr);
ian@0 764 unsigned long pci_dac_dma_to_offset(struct pci_dev *pdev,
ian@0 765 dma64_addr_t dma_addr);
ian@0 766
ian@0 767 This is possible with the DAC interfaces purely because they are
ian@0 768 not translated in any way.
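
Why the reverse lookup works can be seen with plain arithmetic. The sketch
below is a user-space model, not the platform implementation: it assumes
4 KiB pages and an identity relation between bus and physical addresses, and
the dac_model_* names and the use of a page frame number in place of a
struct page pointer are inventions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages and an untranslated (identity) bus address. */
#define DAC_MODEL_PAGE_SHIFT  12
#define DAC_MODEL_OFFSET_MASK ((UINT64_C(1) << DAC_MODEL_PAGE_SHIFT) - 1)

typedef uint64_t dac_model_dma64_addr_t;  /* stand-in for dma64_addr_t */

/* Forward direction: page frame number + offset -> DAC address. */
static dac_model_dma64_addr_t dac_model_page_to_dma(uint64_t pfn,
						     unsigned long offset)
{
	return (pfn << DAC_MODEL_PAGE_SHIFT) + offset;
}

/* Reverse direction: recover the page frame number. */
static uint64_t dac_model_dma_to_pfn(dac_model_dma64_addr_t dma)
{
	return dma >> DAC_MODEL_PAGE_SHIFT;
}

/* Reverse direction: recover the offset within the page. */
static unsigned long dac_model_dma_to_offset(dac_model_dma64_addr_t dma)
{
	return (unsigned long)(dma & DAC_MODEL_OFFSET_MASK);
}
```

With a translated (IOMMU) mapping the address bits carry no such relation to
the page, which is why the dynamic mapping interfaces offer no equivalent.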
ian@0 769
ian@0 770 Optimizing Unmap State Space Consumption
ian@0 771
ian@0 772 On many platforms, pci_unmap_{single,page}() is simply a nop.
ian@0 773 Therefore, keeping track of the mapping address and length is a waste
ian@0 774 of space. Instead of filling your drivers up with ifdefs and the like
ian@0 775 to "work around" this (which would defeat the whole purpose of a
ian@0 776 portable API) the following facilities are provided.
ian@0 777
ian@0 778 Actually, instead of describing the macros one by one, we'll
ian@0 779 transform some example code.
ian@0 780
ian@0 781 1) Use DECLARE_PCI_UNMAP_{ADDR,LEN} in state saving structures.
ian@0 782 Example, before:
ian@0 783
ian@0 784 struct ring_state {
ian@0 785 struct sk_buff *skb;
ian@0 786 dma_addr_t mapping;
ian@0 787 __u32 len;
ian@0 788 };
ian@0 789
ian@0 790 after:
ian@0 791
ian@0 792 struct ring_state {
ian@0 793 struct sk_buff *skb;
ian@0 794 DECLARE_PCI_UNMAP_ADDR(mapping)
ian@0 795 DECLARE_PCI_UNMAP_LEN(len)
ian@0 796 };
ian@0 797
ian@0 798 NOTE: DO NOT put a semicolon at the end of the DECLARE_*()
ian@0 799 macro.
ian@0 800
ian@0 801 2) Use pci_unmap_{addr,len}_set to set these values.
ian@0 802 Example, before:
ian@0 803
ian@0 804 ringp->mapping = FOO;
ian@0 805 ringp->len = BAR;
ian@0 806
ian@0 807 after:
ian@0 808
ian@0 809 pci_unmap_addr_set(ringp, mapping, FOO);
ian@0 810 pci_unmap_len_set(ringp, len, BAR);
ian@0 811
ian@0 812 3) Use pci_unmap_{addr,len} to access these values.
ian@0 813 Example, before:
ian@0 814
ian@0 815 pci_unmap_single(pdev, ringp->mapping, ringp->len,
ian@0 816 PCI_DMA_FROMDEVICE);
ian@0 817
ian@0 818 after:
ian@0 819
ian@0 820 pci_unmap_single(pdev,
ian@0 821 pci_unmap_addr(ringp, mapping),
ian@0 822 pci_unmap_len(ringp, len),
ian@0 823 PCI_DMA_FROMDEVICE);
ian@0 824
ian@0 825 It really should be self-explanatory. We treat the ADDR and LEN
ian@0 826 separately, because it is possible for an implementation to only
ian@0 827 need the address in order to perform the unmap operation.
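
To make the mechanism concrete, here is a sketch of how a platform header
could define such macros. It is a user-space model under stated assumptions:
the MODEL_ and model_ names, the MODEL_UNMAP_IS_NOP switch and the stand-in
types are hypothetical, chosen only to mirror the shape of the real
DECLARE_PCI_UNMAP_* and pci_unmap_* macros.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t model_dma_addr_t;   /* stand-in for dma_addr_t */

/* Hypothetical switch: set to 1 to model a platform whose unmap is a nop. */
#ifndef MODEL_UNMAP_IS_NOP
#define MODEL_UNMAP_IS_NOP 0
#endif

#if MODEL_UNMAP_IS_NOP
/* Nothing is stored, so the declarations vanish and reads yield 0. */
#define MODEL_DECLARE_UNMAP_ADDR(name)
#define MODEL_DECLARE_UNMAP_LEN(name)
#define model_unmap_addr(ptr, name)            (0)
#define model_unmap_addr_set(ptr, name, val)   do { } while (0)
#define model_unmap_len(ptr, name)             (0)
#define model_unmap_len_set(ptr, name, val)    do { } while (0)
#else
/* The macro supplies its own semicolon, hence none at the use site. */
#define MODEL_DECLARE_UNMAP_ADDR(name)         model_dma_addr_t name;
#define MODEL_DECLARE_UNMAP_LEN(name)          uint32_t name;
#define model_unmap_addr(ptr, name)            ((ptr)->name)
#define model_unmap_addr_set(ptr, name, val)   (((ptr)->name) = (val))
#define model_unmap_len(ptr, name)             ((ptr)->name)
#define model_unmap_len_set(ptr, name, val)    (((ptr)->name) = (val))
#endif

struct model_ring_state {
	void *skb;                       /* stand-in for struct sk_buff * */
	MODEL_DECLARE_UNMAP_ADDR(mapping)
	MODEL_DECLARE_UNMAP_LEN(len)
};
```

In the nop variant the declaration macros expand to nothing, so a semicolon
written after them would leave a stray empty declaration inside the struct.
That is the reason for the "no semicolon" rule above.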
ian@0 828
ian@0 829 Platform Issues
ian@0 830
ian@0 831 If you are just writing drivers for Linux and do not maintain
ian@0 832 an architecture port for the kernel, you can safely skip down
ian@0 833 to "Closing".
ian@0 834
ian@0 835 1) Struct scatterlist requirements.
ian@0 836
ian@0 837 Struct scatterlist must contain, at a minimum, the following
ian@0 838 members:
ian@0 839
ian@0 840 struct page *page;
ian@0 841 unsigned int offset;
ian@0 842 unsigned int length;
ian@0 843
ian@0 844 The base address is specified by a "page+offset" pair.
ian@0 845
ian@0 846 Previous versions of struct scatterlist contained a "void *address"
ian@0 847 field that was sometimes used instead of page+offset. As of Linux
ian@0 848 2.5, page+offset is always used, and the "address" field has been
ian@0 849 deleted.
ian@0 850
ian@0 851 2) More to come...
ian@0 852
ian@0 853 Handling Errors
ian@0 854
ian@0 855 DMA address space is limited on some architectures and an allocation
ian@0 856 failure can be determined by:
ian@0 857
ian@0 858 - checking if pci_alloc_consistent returns NULL or pci_map_sg returns 0
ian@0 859
ian@0 860 - checking the returned dma_addr_t of pci_map_single and pci_map_page
ian@0 861 by using pci_dma_mapping_error():
ian@0 862
ian@0 863 dma_addr_t dma_handle;
ian@0 864
ian@0 865 dma_handle = pci_map_single(dev, addr, size, direction);
ian@0 866 if (pci_dma_mapping_error(dma_handle)) {
ian@0 867 /*
ian@0 868 * reduce current DMA mapping usage,
ian@0 869 * delay and try again later or
ian@0 870 * reset driver.
ian@0 871 */
ian@0 872 }
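
One way to act on such a failure when several buffers are being mapped is to
unwind the mappings already made. The sketch below runs in user space:
mock_map_single, mock_mapping_error, mock_unmap_single, MOCK_DMA_ERROR and
the slot limit are all hypothetical stand-ins for pci_map_single(),
pci_dma_mapping_error() and pci_unmap_single(), invented here only so that
the control flow can be compiled and exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mock_dma_addr_t;

#define MOCK_DMA_ERROR ((mock_dma_addr_t)~0ULL)  /* hypothetical error cookie */
#define MOCK_SLOTS     2                         /* pretend mapping capacity */

static int mock_slots_used;

/* Stand-in for pci_map_single(): fails once all slots are taken. */
static mock_dma_addr_t mock_map_single(void *addr)
{
	if (mock_slots_used >= MOCK_SLOTS)
		return MOCK_DMA_ERROR;
	mock_slots_used++;
	return (mock_dma_addr_t)(uintptr_t)addr;
}

/* Stand-in for pci_dma_mapping_error(). */
static int mock_mapping_error(mock_dma_addr_t dma)
{
	return dma == MOCK_DMA_ERROR;
}

/* Stand-in for pci_unmap_single(): releases one slot. */
static void mock_unmap_single(mock_dma_addr_t dma)
{
	(void)dma;
	mock_slots_used--;
}

/* Map n buffers; on failure undo the ones already mapped. Returns 0 or -1. */
static int map_all_or_none(void **bufs, mock_dma_addr_t *handles, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		handles[i] = mock_map_single(bufs[i]);
		if (mock_mapping_error(handles[i]))
			goto unwind;
	}
	return 0;

unwind:
	/* handles[i] is the failed one; release only those before it. */
	while (i-- > 0)
		mock_unmap_single(handles[i]);
	return -1;
}
```

After a -1 return the caller holds no mappings and is free to delay and
retry, or to give up, as the comment in the example above suggests.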
ian@0 873
ian@0 874 Closing
ian@0 875
ian@0 876 This document, and the API itself, would not be in its current
ian@0 877 form without the feedback and suggestions from numerous individuals.
ian@0 878 We would like to specifically mention, in no particular order, the
ian@0 879 following people:
ian@0 880
ian@0 881 Russell King <rmk@arm.linux.org.uk>
ian@0 882 Leo Dagum <dagum@barrel.engr.sgi.com>
ian@0 883 Ralf Baechle <ralf@oss.sgi.com>
ian@0 884 Grant Grundler <grundler@cup.hp.com>
ian@0 885 Jay Estabrook <Jay.Estabrook@compaq.com>
ian@0 886 Thomas Sailer <sailer@ife.ee.ethz.ch>
ian@0 887 Andrea Arcangeli <andrea@suse.de>
ian@0 888 Jens Axboe <axboe@suse.de>
ian@0 889 David Mosberger-Tang <davidm@hpl.hp.com>