view Documentation/networking/fib_trie.txt @ 897:329ea0ccb344

balloon: try harder to balloon up under memory pressure.

Currently if the balloon driver is unable to increase the guest's
reservation it assumes the failure was due to reaching its full
allocation, gives up on the ballooning operation and records the limit
it reached as the "hard limit". The driver will not try again until
the target is set again (even to the same value).

However it is possible that ballooning has in fact failed due to
memory pressure in the host and therefore it is desirable to keep
attempting to reach the target in case memory becomes available. The
most likely scenario is that some guests are ballooning down while
others are ballooning up and therefore there is temporary memory
pressure while things stabilise. You would not expect a well behaved
toolstack to ask a domain to balloon to more than its allocation nor
would you expect it to deliberately over-commit memory by setting
balloon targets which exceed the total host memory.

This patch drops the concept of a hard limit and causes the balloon
driver to retry increasing the reservation on a timer in the same
manner as when decreasing the reservation.

Also if we partially succeed in increasing the reservation
(i.e. receive less pages than we asked for) then we may as well keep
those pages rather than returning them to Xen.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
author Keir Fraser <keir.fraser@citrix.com>
date Fri Jun 05 14:01:20 2009 +0100 (2009-06-05)
parents 831230e53067
line source
1 LC-trie implementation notes.
3 Node types
4 ----------
5 leaf
6 An end node with data. This has a copy of the relevant key, along
7 with 'hlist' with routing table entries sorted by prefix length.
8 See struct leaf and struct leaf_info.
10 trie node or tnode
11 An internal node, holding an array of child (leaf or tnode) pointers,
12 indexed through a subset of the key. See Level Compression.
14 A few concepts explained
15 ------------------------
16 Bits (tnode)
17 The number of bits in the key segment used for indexing into the
18 child array - the "child index". See Level Compression.
20 Pos (tnode)
21 The position (in the key) of the key segment used for indexing into
22 the child array. See Path Compression.
24 Path Compression / skipped bits
25 Any given tnode is linked to from the child array of its parent, using
26 a segment of the key specified by the parent's "pos" and "bits"
27 In certain cases, this tnode's own "pos" will not be immediately
28 adjacent to the parent (pos+bits), but there will be some bits
29 in the key skipped over because they represent a single path with no
30 deviations. These "skipped bits" constitute Path Compression.
31 Note that the search algorithm will simply skip over these bits when
32 searching, making it necessary to save the keys in the leaves to
33 verify that they actually do match the key we are searching for.
35 Level Compression / child arrays
36 the trie is kept level balanced moving, under certain conditions, the
37 children of a full child (see "full_children") up one level, so that
38 instead of a pure binary tree, each internal node ("tnode") may
39 contain an arbitrarily large array of links to several children.
40 Conversely, a tnode with a mostly empty child array (see empty_children)
41 may be "halved", having some of its children moved downwards one level,
42 in order to avoid ever-increasing child arrays.
44 empty_children
45 the number of positions in the child array of a given tnode that are
46 NULL.
48 full_children
49 the number of children of a given tnode that aren't path compressed.
50 (in other words, they aren't NULL or leaves and their "pos" is equal
51 to this tnode's "pos"+"bits").
53 (The word "full" here is used more in the sense of "complete" than
54 as the opposite of "empty", which might be a tad confusing.)
57 ---------
59 We have tried to keep the structure of the code as close to fib_hash as
60 possible to allow verification and help up reviewing.
62 fib_find_node()
63 A good start for understanding this code. This function implements a
64 straightforward trie lookup.
66 fib_insert_node()
67 Inserts a new leaf node in the trie. This is bit more complicated than
68 fib_find_node(). Inserting a new node means we might have to run the
69 level compression algorithm on part of the trie.
71 trie_leaf_remove()
72 Looks up a key, deletes it and runs the level compression algorithm.
74 trie_rebalance()
75 The key function for the dynamic trie after any change in the trie
76 it is run to optimize and reorganize. Tt will walk the trie upwards
77 towards the root from a given tnode, doing a resize() at each step
78 to implement level compression.
80 resize()
81 Analyzes a tnode and optimizes the child array size by either inflating
82 or shrinking it repeatedly until it fullfills the criteria for optimal
83 level compression. This part follows the original paper pretty closely
84 and there may be some room for experimentation here.
86 inflate()
87 Doubles the size of the child array within a tnode. Used by resize().
89 halve()
90 Halves the size of the child array within a tnode - the inverse of
91 inflate(). Used by resize();
93 fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
94 The route manipulation functions. Should conform pretty closely to the
95 corresponding functions in fib_hash.
97 fn_trie_flush()
98 This walks the full trie (using nextleaf()) and searches for empty
99 leaves which have to be removed.
101 fn_trie_dump()
102 Dumps the routing table ordered by prefix length. This is somewhat
103 slower than the corresponding fib_hash function, as we have to walk the
104 entire trie for each prefix length. In comparison, fib_hash is organized
105 as one "zone"/hash per prefix length.
107 Locking
108 -------
110 fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
111 However, the functions are somewhat separated for other possible locking
112 scenarios. It might conceivably be possible to run trie_rebalance via RCU
113 to avoid read_lock in the fn_trie_lookup() function.
115 Main lookup mechanism
116 ---------------------
117 fn_trie_lookup() is the main lookup function.
119 The lookup is in its simplest form just like fib_find_node(). We descend the
120 trie, key segment by key segment, until we find a leaf. check_leaf() does
121 the fib_semantic_match in the leaf's sorted prefix hlist.
123 If we find a match, we are done.
125 If we don't find a match, we enter prefix matching mode. The prefix length,
126 starting out at the same as the key length, is reduced one step at a time,
127 and we backtrack upwards through the trie trying to find a longest matching
128 prefix. The goal is always to reach a leaf and get a positive result from the
129 fib_semantic_match mechanism.
131 Inside each tnode, the search for longest matching prefix consists of searching
132 through the child array, chopping off (zeroing) the least significant "1" of
133 the child index until we find a match or the child index consists of nothing but
134 zeros.
136 At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
137 chop off part of the key in order to find the longest matching prefix.
139 At this point we will repeatedly descend subtries to look for a match, and there
140 are some optimizations available that can provide us with "shortcuts" to avoid
141 descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
143 To alleviate any doubts about the correctness of the route selection process,
144 a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
145 gives userland access to fib_lookup().