## ia64/linux-2.6.18-xen.hg

### view Documentation/prio_tree.txt @ 897:329ea0ccb344

find changesets by author, revision,
files, or words in the commit message

balloon: try harder to balloon up under memory pressure.

Currently if the balloon driver is unable to increase the guest's

reservation it assumes the failure was due to reaching its full

allocation, gives up on the ballooning operation and records the limit

it reached as the "hard limit". The driver will not try again until

the target is set again (even to the same value).

However it is possible that ballooning has in fact failed due to

memory pressure in the host and therefore it is desirable to keep

attempting to reach the target in case memory becomes available. The

most likely scenario is that some guests are ballooning down while

others are ballooning up and therefore there is temporary memory

pressure while things stabilise. You would not expect a well behaved

toolstack to ask a domain to balloon to more than its allocation nor

would you expect it to deliberately over-commit memory by setting

balloon targets which exceed the total host memory.

This patch drops the concept of a hard limit and causes the balloon

driver to retry increasing the reservation on a timer in the same

manner as when decreasing the reservation.

Also if we partially succeed in increasing the reservation

(i.e. receive less pages than we asked for) then we may as well keep

those pages rather than returning them to Xen.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

Currently if the balloon driver is unable to increase the guest's

reservation it assumes the failure was due to reaching its full

allocation, gives up on the ballooning operation and records the limit

it reached as the "hard limit". The driver will not try again until

the target is set again (even to the same value).

However it is possible that ballooning has in fact failed due to

memory pressure in the host and therefore it is desirable to keep

attempting to reach the target in case memory becomes available. The

most likely scenario is that some guests are ballooning down while

others are ballooning up and therefore there is temporary memory

pressure while things stabilise. You would not expect a well behaved

toolstack to ask a domain to balloon to more than its allocation nor

would you expect it to deliberately over-commit memory by setting

balloon targets which exceed the total host memory.

This patch drops the concept of a hard limit and causes the balloon

driver to retry increasing the reservation on a timer in the same

manner as when decreasing the reservation.

Also if we partially succeed in increasing the reservation

(i.e. receive less pages than we asked for) then we may as well keep

those pages rather than returning them to Xen.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

author | Keir Fraser <keir.fraser@citrix.com> |
---|---|

date | Fri Jun 05 14:01:20 2009 +0100 (2009-06-05) |

parents | 831230e53067 |

children |

line source

1 The prio_tree.c code indexes vmas using 3 different indexes:

2 * heap_index = vm_pgoff + vm_size_in_pages : end_vm_pgoff

3 * radix_index = vm_pgoff : start_vm_pgoff

4 * size_index = vm_size_in_pages

6 A regular radix-priority-search-tree indexes vmas using only heap_index and

7 radix_index. The conditions for indexing are:

8 * ->heap_index >= ->left->heap_index &&

9 ->heap_index >= ->right->heap_index

10 * if (->heap_index == ->left->heap_index)

11 then ->radix_index < ->left->radix_index;

12 * if (->heap_index == ->right->heap_index)

13 then ->radix_index < ->right->radix_index;

14 * nodes are hashed to left or right subtree using radix_index

15 similar to a pure binary radix tree.

17 A regular radix-priority-search-tree helps to store and query

18 intervals (vmas). However, a regular radix-priority-search-tree is only

19 suitable for storing vmas with different radix indices (vm_pgoff).

21 Therefore, the prio_tree.c extends the regular radix-priority-search-tree

22 to handle many vmas with the same vm_pgoff. Such vmas are handled in

23 2 different ways: 1) All vmas with the same radix _and_ heap indices are

24 linked using vm_set.list, 2) if there are many vmas with the same radix

25 index, but different heap indices and if the regular radix-priority-search

26 tree cannot index them all, we build an overflow-sub-tree that indexes such

27 vmas using heap and size indices instead of heap and radix indices. For

28 example, in the figure below some vmas with vm_pgoff = 0 (zero) are

29 indexed by regular radix-priority-search-tree whereas others are pushed

30 into an overflow-subtree. Note that all vmas in an overflow-sub-tree have

31 the same vm_pgoff (radix_index) and if necessary we build different

32 overflow-sub-trees to handle each possible radix_index. For example,

33 in figure we have 3 overflow-sub-trees corresponding to radix indices

34 0, 2, and 4.

36 In the final tree the first few (prio_tree_root->index_bits) levels

37 are indexed using heap and radix indices whereas the overflow-sub-trees below

38 those levels (i.e. levels prio_tree_root->index_bits + 1 and higher) are

39 indexed using heap and size indices. In overflow-sub-trees the size_index

40 is used for hashing the nodes to appropriate places.

42 Now, an example prio_tree:

44 vmas are represented [radix_index, size_index, heap_index]

45 i.e., [start_vm_pgoff, vm_size_in_pages, end_vm_pgoff]

47 level prio_tree_root->index_bits = 3

48 -----

49 _

50 0 [0,7,7] |

51 / \ |

52 ------------------ ------------ | Regular

53 / \ | radix priority

54 1 [1,6,7] [4,3,7] | search tree

55 / \ / \ |

56 ------- ----- ------ ----- | heap-and-radix

57 / \ / \ | indexed

58 2 [0,6,6] [2,5,7] [5,2,7] [6,1,7] |

59 / \ / \ / \ / \ |

60 3 [0,5,5] [1,5,6] [2,4,6] [3,4,7] [4,2,6] [5,1,6] [6,0,6] [7,0,7] |

61 / / / _

62 / / / _

63 4 [0,4,4] [2,3,5] [4,1,5] |

64 / / / |

65 5 [0,3,3] [2,2,4] [4,0,4] | Overflow-sub-trees

66 / / |

67 6 [0,2,2] [2,1,3] | heap-and-size

68 / / | indexed

69 7 [0,1,1] [2,0,2] |

70 / |

71 8 [0,0,0] |

72 _

74 Note that we use prio_tree_root->index_bits to optimize the height

75 of the heap-and-radix indexed tree. Since prio_tree_root->index_bits is

76 set according to the maximum end_vm_pgoff mapped, we are sure that all

77 bits (in vm_pgoff) above prio_tree_root->index_bits are 0 (zero). Therefore,

78 we only use the first prio_tree_root->index_bits as radix_index.

79 Whenever index_bits is increased in prio_tree_expand, we shuffle the tree

80 to make sure that the first prio_tree_root->index_bits levels of the tree

81 is indexed properly using heap and radix indices.

83 We do not optimize the height of overflow-sub-trees using index_bits.

84 The reason is: there can be many such overflow-sub-trees and all of

85 them have to be suffled whenever the index_bits increases. This may involve

86 walking the whole prio_tree in prio_tree_insert->prio_tree_expand code

87 path which is not desirable. Hence, we do not optimize the height of the

88 heap-and-size indexed overflow-sub-trees using prio_tree->index_bits.

89 Instead the overflow sub-trees are indexed using full BITS_PER_LONG bits

90 of size_index. This may lead to skewed sub-trees because most of the

91 higher significant bits of the size_index are likely to be be 0 (zero). In

92 the example above, all 3 overflow-sub-trees are skewed. This may marginally

93 affect the performance. However, processes rarely map many vmas with the

94 same start_vm_pgoff but different end_vm_pgoffs. Therefore, we normally

95 do not require overflow-sub-trees to index all vmas.

97 From the above discussion it is clear that the maximum height of

98 a prio_tree can be prio_tree_root->index_bits + BITS_PER_LONG.

99 However, in most of the common cases we do not need overflow-sub-trees,

100 so the tree height in the common cases will be prio_tree_root->index_bits.

102 It is fair to mention here that the prio_tree_root->index_bits

103 is increased on demand, however, the index_bits is not decreased when

104 vmas are removed from the prio_tree. That's tricky to do. Hence, it's

105 left as a home work problem.