view Documentation/io_ordering.txt @ 897:329ea0ccb344

author Keir Fraser <keir.fraser@citrix.com>
date Fri Jun 05 14:01:20 2009 +0100 (2009-06-05)
parents 831230e53067
On some platforms, so-called memory-mapped I/O is weakly ordered. On such
platforms, driver writers are responsible for ensuring that I/O writes to
memory-mapped addresses on their device arrive in the order intended. This is
typically done by reading a 'safe' device or bridge register, causing the I/O
chipset to flush pending writes to the device before any reads are posted. A
driver would usually use this technique immediately prior to the exit of a
critical section of code protected by spinlocks. This would ensure that
subsequent writes to I/O space arrived only after all prior writes (much like a
memory barrier op, mb(), only with respect to I/O).

A more concrete example from a hypothetical device driver:

        ...
CPU A:  spin_lock_irqsave(&dev_lock, flags);
CPU A:  val = readl(my_status);
CPU A:  ...
CPU A:  writel(newval, ring_ptr);
CPU A:  spin_unlock_irqrestore(&dev_lock, flags);
        ...
CPU B:  spin_lock_irqsave(&dev_lock, flags);
CPU B:  val = readl(my_status);
CPU B:  ...
CPU B:  writel(newval2, ring_ptr);
CPU B:  spin_unlock_irqrestore(&dev_lock, flags);
        ...

In the case above, the device may receive newval2 before it receives newval,
leaving it with the older value written last, which could cause problems.
Fixing it is easy enough though:

        ...
CPU A:  spin_lock_irqsave(&dev_lock, flags);
CPU A:  val = readl(my_status);
CPU A:  ...
CPU A:  writel(newval, ring_ptr);
CPU A:  (void)readl(safe_register); /* maybe a config register? */
CPU A:  spin_unlock_irqrestore(&dev_lock, flags);
        ...
CPU B:  spin_lock_irqsave(&dev_lock, flags);
CPU B:  val = readl(my_status);
CPU B:  ...
CPU B:  writel(newval2, ring_ptr);
CPU B:  (void)readl(safe_register); /* maybe a config register? */
CPU B:  spin_unlock_irqrestore(&dev_lock, flags);

Here, the reads from safe_register will cause the I/O chipset to flush any
pending writes before actually posting the read, preventing possible data
corruption.
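To make the reordering easier to see, here is a small userspace toy model of
the two cases. It is only a sketch, not kernel code. It assumes, purely for
illustration, that each CPU has its own queue of posted writes and that the
chipset may drain those queues in any order. All of the sim_* names, NEWVAL
and NEWVAL2 are hypothetical and do not correspond to any real kernel or
chipset interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 8
#define MAX_SEEN    16
#define NEWVAL      0x1111u   /* stands in for newval  (CPU A) */
#define NEWVAL2     0x2222u   /* stands in for newval2 (CPU B) */

/* Records values in the order the (simulated) device received them. */
struct sim_device {
	uint32_t seen[MAX_SEEN];
	size_t nseen;
};

/* Assumption: one queue of not-yet-delivered posted writes per CPU. */
struct sim_cpu {
	struct sim_device *dev;
	uint32_t pending[MAX_PENDING];
	size_t npending;
};

/* Models writel(): the write is posted, not yet visible to the device. */
static void sim_writel(struct sim_cpu *cpu, uint32_t val)
{
	assert(cpu->npending < MAX_PENDING);
	cpu->pending[cpu->npending++] = val;
}

/* Delivers this CPU's posted writes to the device in program order. */
static void sim_drain(struct sim_cpu *cpu)
{
	for (size_t i = 0; i < cpu->npending; i++) {
		assert(cpu->dev->nseen < MAX_SEEN);
		cpu->dev->seen[cpu->dev->nseen++] = cpu->pending[i];
	}
	cpu->npending = 0;
}

/* Models (void)readl(safe_register): pending writes are flushed first. */
static uint32_t sim_readl_safe(struct sim_cpu *cpu)
{
	sim_drain(cpu);
	return 0;
}

/* First example: no flush, so the chipset is free to deliver B before A. */
static void run_unflushed(struct sim_device *dev)
{
	struct sim_cpu a = { .dev = dev }, b = { .dev = dev };

	sim_writel(&a, NEWVAL);    /* CPU A's critical section */
	sim_writel(&b, NEWVAL2);   /* CPU B's critical section, after A unlocks */
	sim_drain(&b);             /* one permitted delivery order */
	sim_drain(&a);
}

/* Second example: each CPU reads the safe register before unlocking. */
static void run_flushed(struct sim_device *dev)
{
	struct sim_cpu a = { .dev = dev }, b = { .dev = dev };

	sim_writel(&a, NEWVAL);
	(void)sim_readl_safe(&a);  /* NEWVAL reaches the device here */
	sim_writel(&b, NEWVAL2);
	(void)sim_readl_safe(&b);  /* NEWVAL2 reaches the device here */
	sim_drain(&b);             /* nothing is left to reorder */
	sim_drain(&a);
}
```

Calling run_unflushed() on a zeroed struct sim_device leaves NEWVAL2 recorded
ahead of NEWVAL, while run_flushed() records them in lock order. In the second
case the queues are already empty when the lock is released, so the delivery
order no longer depends on what the chipset chooses to do.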