view Documentation/networking/xfrm_sync.txt @ 897:329ea0ccb344

author Keir Fraser <keir.fraser@citrix.com>
date Fri Jun 05 14:01:20 2009 +0100 (2009-06-05)
parents 831230e53067

The sync patches work is based on initial patches from
Krisztian <hidden@balabit.hu> and others and additional patches
from Jamal <hadi@cyberus.ca>.

The end goal for syncing is to be able to insert attributes and generate
events so that an SA can be safely moved from one machine to another
for HA purposes.
The idea is to synchronize the SA so that the takeover machine can
process the SA as accurately as possible if it has access to it.

We already have the ability to generate SA add/del/upd events.
These patches add the ability to sync and to keep accurate lifetime byte
counters (to ensure proper decay of SAs) and replay counters (to avoid
replay attacks), with minimal loss at failover time.
This way a backup stays as closely up to date as an active member.

Because the above items change for every packet the SA receives,
a large number of events can be generated.
For this reason, we also add a nagle-like algorithm to restrict
the events, i.e. we are going to set thresholds to say "let me
know if the replay sequence threshold is reached or 10 secs have passed".
These thresholds are set system-wide via sysctls or can be updated
per SA.
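
The nagle-like rule above can be sketched roughly as follows. This is a
simplified model, not the kernel's code: the struct, its field names and
the function name are all hypothetical, and the sketch assumes an event
is due as soon as either threshold is reached.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-SA throttle state; all names are illustrative. */
struct aevent_throttle {
    uint32_t replay_thresh;   /* packets allowed between events */
    uint32_t timer_thresh_ms; /* maximum time between events */
    uint32_t last_seq;        /* replay sequence at the last event */
    uint64_t last_event_ms;   /* time of the last event */
};

/*
 * Returns true when an event should be sent: either the replay sequence
 * advanced by at least replay_thresh, or timer_thresh_ms has elapsed.
 * On true, the state is reset so the next event is throttled again.
 */
static bool aevent_should_notify(struct aevent_throttle *t,
                                 uint32_t seq, uint64_t now_ms)
{
    bool replay_hit = (uint32_t)(seq - t->last_seq) >= t->replay_thresh;
    bool timer_hit = now_ms - t->last_event_ms >= t->timer_thresh_ms;

    if (!replay_hit && !timer_hit)
        return false;

    t->last_seq = seq;
    t->last_event_ms = now_ms;
    return true;
}
```

With replay_thresh = 2 and timer_thresh_ms = 10000, a caller would see
one event for every second packet, or one every 10 seconds on a slow SA.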

The identified items that need to be synchronized are:
 - the lifetime byte counter
   Note that the lifetime time limit is not important if you assume the
   failover machine is known ahead of time, since the decay of the time
   countdown is not driven by packet arrival.
 - the replay sequence for both inbound and outbound

1) Message Structure
--------------------

nlmsghdr:aevent_id:optional-TLVs.

The netlink message types are:

XFRM_MSG_NEWAE and XFRM_MSG_GETAE.

An XFRM_MSG_GETAE does not have TLVs.
An XFRM_MSG_NEWAE will have at least two TLVs (as is
discussed further below).

aevent_id structure looks like:

   struct xfrm_aevent_id {
             struct xfrm_usersa_id           sa_id;   /* identifies the SA */
             __u32                           flags;   /* XFRM_AE_* bits */
   };

xfrm_usersa_id in this message layout identifies the SA.

flags are used to indicate different things. The possible
flags are:
        XFRM_AE_RTHR=1, /* replay threshold */
        XFRM_AE_RVAL=2, /* replay value */
        XFRM_AE_LVAL=4, /* lifetime value */
        XFRM_AE_ETHR=8, /* expiry timer threshold */
        XFRM_AE_CR=16, /* Event cause is replay update */
        XFRM_AE_CE=32, /* Event cause is timer expiry */
        XFRM_AE_CU=64, /* Event cause is policy update */

How these flags are used is dependent on the direction of the
message (kernel<->user) as well as the cause (config, query or event).
This is described below in the different messages.

The pid will be set appropriately in netlink to recognize direction
(0 to the kernel and pid = process ID that created the event
when going from kernel to user space).

A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS
to get notified of these events.

2) TLVs reflect the different parameters:
-----------------------------------------

a) byte value (XFRMA_LTIME_VAL)
This TLV carries the running/current counter for byte lifetime since
last event.

b) replay value (XFRMA_REPLAY_VAL)
This TLV carries the running/current counter for replay sequence since
last event.

c) replay threshold (XFRMA_REPLAY_THRESH)
This TLV carries the replay sequence threshold used by the kernel to
trigger events when it is exceeded.

d) expiry timer (XFRMA_ETIMER_THRESH)
This is a timer value in milliseconds which is used as the nagle
value to rate limit the events.

3) Default configurations for the parameters:
---------------------------------------------

By default these events should be turned off unless there is
at least one listener registered to listen to the multicast
group XFRMNLGRP_AEVENTS.

Programs installing SAs will need to specify the two thresholds; however,
in order to not change existing applications such as racoon,
we also provide default threshold values for these different parameters
in case they are not specified.

The two sysctls/proc entries are:
a) /proc/sys/net/core/sysctl_xfrm_aevent_etime
used to provide default values for the XFRMA_ETIMER_THRESH in incremental
units of time of 100ms. The default is 10 (1 second).

b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth
used to provide default values for the XFRMA_REPLAY_THRESH parameter
in incremental packet count. The default is two packets.
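
The unit arithmetic and the fallback to defaults can be sketched as
below. The macro and function names are hypothetical, and treating a
per-SA value of 0 as "not specified" is an assumption made for this
example only.

```c
#include <stdint.h>

#define AEVENT_ETIME_UNIT_MS   100u /* one sysctl unit is 100ms */
#define AEVENT_ETIME_DEFAULT   10u  /* documented default: 10 units */
#define AEVENT_RSEQTH_DEFAULT  2u   /* documented default: two packets */

/* Convert a timer threshold in sysctl units to milliseconds. */
static uint32_t aevent_etime_to_ms(uint32_t units)
{
    return units * AEVENT_ETIME_UNIT_MS;
}

/*
 * Pick the value to use for an SA: the per-SA setting when one was
 * given (non-zero, by assumption), otherwise the system-wide default.
 */
static uint32_t aevent_effective(uint32_t per_sa, uint32_t sys_default)
{
    return per_sa ? per_sa : sys_default;
}
```

For example, aevent_etime_to_ms(AEVENT_ETIME_DEFAULT) gives the 1 second
default expressed in milliseconds.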

4) Message types
----------------

a) XFRM_MSG_GETAE issued by user-->kernel.
XFRM_MSG_GETAE does not carry any TLVs.
The response is an XFRM_MSG_NEWAE which is formatted based on what
XFRM_MSG_GETAE queried for.
The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
* if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
* if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
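
The rule for which TLVs a XFRM_MSG_GETAE response carries can be written
down as a small function. The AE_* values repeat the flag values listed
earlier in this document; the TLV_* bits and the function name are
hypothetical and are not the kernel's attribute numbers.

```c
#include <stdint.h>

/* Flag values as listed for XFRM_AE_RTHR and XFRM_AE_ETHR. */
enum { AE_RTHR = 1, AE_ETHR = 8 };

/* Illustrative one-bit-per-TLV set; NOT the XFRMA_* attribute numbers. */
enum {
    TLV_LTIME_VAL     = 1 << 0,
    TLV_REPLAY_VAL    = 1 << 1,
    TLV_REPLAY_THRESH = 1 << 2,
    TLV_ETIMER_THRESH = 1 << 3
};

/* Returns the set of TLVs expected in the response to a GETAE query. */
static unsigned getae_response_tlvs(uint32_t req_flags)
{
    unsigned tlvs = TLV_LTIME_VAL | TLV_REPLAY_VAL; /* always present */

    if (req_flags & AE_RTHR)
        tlvs |= TLV_REPLAY_THRESH;
    if (req_flags & AE_ETHR)
        tlvs |= TLV_ETIMER_THRESH;
    return tlvs;
}
```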

b) XFRM_MSG_NEWAE is issued by either user space to configure
or kernel to announce events or respond to an XFRM_MSG_GETAE.

i) user --> kernel to configure a specific SA.
Any of the values or threshold parameters can be updated by passing the
appropriate TLV.
A response is issued back to the sender in user space to indicate success
or failure.
In the case of success, an event with XFRM_MSG_NEWAE is additionally
issued to any listeners as described in iii).

ii) kernel->user direction as a response to XFRM_MSG_GETAE
The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
The threshold TLVs will be included if explicitly requested in
the XFRM_MSG_GETAE message.

iii) kernel->user to report as event if someone sets any values or
thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
In such a case XFRM_AE_CU flag is set to inform the user that
the change happened as a result of an update.
The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.

iv) kernel->user to report event when replay threshold or a timeout
is exceeded.
In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout
happened) is set to inform the user what happened.
Note the two flags are mutually exclusive.
The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
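
A listener could classify an incoming XFRM_MSG_NEWAE by its cause flags
along these lines. The enum and function names are hypothetical, the
AE_* values repeat the flag values listed earlier, and the order in
which the flags are tested is an assumption: the text only states that
XFRM_AE_CR and XFRM_AE_CE are mutually exclusive.

```c
#include <stdint.h>

/* Flag values as listed for XFRM_AE_CR, XFRM_AE_CE and XFRM_AE_CU. */
enum { AE_CR = 16, AE_CE = 32, AE_CU = 64 };

enum aevent_cause {
    AEV_NONE,    /* no cause flag, e.g. a reply to XFRM_MSG_GETAE */
    AEV_REPLAY,  /* replay threshold exceeded */
    AEV_TIMER,   /* timeout happened */
    AEV_UPDATE,  /* values or thresholds were set via XFRM_MSG_NEWAE */
    AEV_INVALID  /* CR and CE set together */
};

/* Map the flags field of xfrm_aevent_id to a single cause. */
static enum aevent_cause aevent_classify(uint32_t flags)
{
    if ((flags & AE_CR) && (flags & AE_CE))
        return AEV_INVALID; /* documented as mutually exclusive */
    if (flags & AE_CU)
        return AEV_UPDATE;
    if (flags & AE_CR)
        return AEV_REPLAY;
    if (flags & AE_CE)
        return AEV_TIMER;
    return AEV_NONE;
}
```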

Exceptions to threshold settings
--------------------------------

If you have an SA that is getting hit by traffic in bursts such that
there is a period where the timer threshold expires with no packets
seen, then an odd behavior is seen as follows:
The first packet arrival after a timer expiry will trigger a timeout
aevent; i.e. we don't wait for a timeout period or a packet threshold
to be reached. This is done for simplicity and efficiency reasons.
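
The burst behavior can be modelled with two small handlers. This is only
a sketch of the behavior described above, not the kernel's code: the
names are hypothetical, and it assumes that a timer expiry with no
packets seen emits nothing and is simply remembered.

```c
#include <stdbool.h>

/* Hypothetical model of the burst exception; names are illustrative. */
struct aevent_timer_state {
    bool pkts_since_timer; /* any packet seen in the current period? */
    bool deferred;         /* timer expired while the SA was idle */
};

/* Timer expiry: returns true if a timeout aevent is emitted now. */
static bool on_timer_expiry(struct aevent_timer_state *s)
{
    if (!s->pkts_since_timer) {
        s->deferred = true; /* nothing to report; remember the expiry */
        return false;
    }
    s->pkts_since_timer = false;
    return true;
}

/* Packet arrival: returns true if a timeout aevent is emitted now. */
static bool on_packet(struct aevent_timer_state *s)
{
    if (s->deferred) {
        s->deferred = false; /* first packet after an idle expiry */
        return true;         /* fires at once, no threshold wait */
    }
    s->pkts_since_timer = true;
    return false;
}
```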

-JHS