ia64/xen-unstable

view tools/python/xen/xend/balloon.py @ 18894:b1b9cf7a2d36

xend: Fix memory allocation bug after hvm reboot in numa system

Recently we found a bug on a Nehalem machine (two nodes, 6G memory
in total, 3G in each node):
- Start an HVM guest with all its VCPUs pinned to node1, so all its
memory is allocated from node1.
- Reboot the HVM guest.
- Some memory is now allocated from node0, even though there is enough
free memory on node1.

Reason: For security reasons, Xen does not return the pages of a dying
HVM guest to the domheap directly; it puts them on the scrub list,
where they wait to be handled by page_scrub_softirq(). If the dying
guest has a lot of memory, page_scrub_softirq() will not have handled
all of it before the new guest starts. Some pages belonging to node1
are still on the scrub list, and the new guest cannot use them, so it
ends up with a different memory distribution than before. Before
changeset 18304, page_scrub_softirq() could be executed in parallel
across all CPUs. Changeset 18305 serialised page_scrub_softirq(), and
changeset 18307 serialised page_scrub_softirq() with a new lock to
avoid holding up the acquisition of page_scrub_lock in
free_domheap_pages(). Those changesets slowed the draining of the
scrub list, so the bug became more visible after them.

Patch: This patch modifies balloon.free to avoid this bug. After the
patch, balloon.free checks whether the current machine is a NUMA
system and whether the newly created HVM guest has all its VCPUs in
the same node. If both conditions hold, we wait until all the pages
on the scrub list have been freed (giving up once the waiting time
goes beyond 20s).
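
In outline, the new logic reads roughly as follows (a condensed
paraphrase of the balloon.free changes shown in the listing below,
not the literal diff):

    # Condensed paraphrase of the new check in balloon.free.
    physinfo = xc.physinfo()
    if physinfo['nr_nodes'] > 1:
        # Map each CPU the guest is pinned to onto its node.
        nodes = set()
        for vcpu in self.info['cpus'][0]:
            for nodenum, cpus in enumerate(physinfo['node_to_cpu']):
                if vcpu in cpus:
                    nodes.add(nodenum)
        if len(nodes) == 1:
            # Everything sits in one node: wait (bounded, roughly 20s
            # in total) for the scrub list to drain before measuring
            # free memory.
            while physinfo['scrub_memory'] > 0 and retries < rlimit:
                time.sleep(sleep_time)
                retries += 1
                sleep_time += SLEEP_TIME_GROWTH
                physinfo = xc.physinfo()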

This may seem too restrictive at first glance. We used to wait only
until the amount of free memory on the pinned node was larger than the
amount required. But HVM memory allocation granularity is 2M, so even
when that condition is satisfied, we still may not find enough
2M-sized chunks of memory on that node.
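
A hypothetical example (the numbers here are made up purely for
illustration) shows why the old check was not sufficient:

    # Hypothetical numbers, for illustration only.
    need_kib  = 1024 * 1024    # the guest wants 1 GiB from the pinned node
    free_kib  = 1100 * 1024    # the node reports 1.1 GiB free
    chunk_kib = 2 * 1024       # HVM allocation granularity: 2 MiB

    assert free_kib >= need_kib             # the old check would pass here
    needed_chunks = need_kib // chunk_kib   # 512 chunks of 2 MiB
    # If the free pages are fragmented so that, say, only 400 full 2 MiB
    # chunks can be formed on this node, the allocator falls back to the
    # other node for the remaining 112 chunks -- the bug described above.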

Signed-off-by: Ting Zhou <ting.g.zhou@intel.com>
Signed-off-by: Xiaowei Yang <Xiaowei.yang@intel.com>
author Keir Fraser <keir.fraser@citrix.com>
date Tue Dec 09 12:44:32 2008 +0000 (2008-12-09)
parents 0bf73f557f41
children d8267d3d2665
line source
#============================================================================
# This library is free software; you can redistribute it and/or
# modify it under the terms of version 2.1 of the GNU Lesser General Public
# License as published by the Free Software Foundation.
#
# This library is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with this library; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#============================================================================
# Copyright (C) 2004, 2005 Mike Wray <mike.wray@hp.com>
# Copyright (C) 2005 XenSource Ltd
#============================================================================

import time

import xen.lowlevel.xc

import XendDomain
import XendOptions
from XendLogging import log
from XendError import VmError
import osdep

RETRY_LIMIT = 20
RETRY_LIMIT_INCR = 5
##
# The time to sleep between retries grows linearly, using this value (in
# seconds).  When the system is lightly loaded, memory should be scrubbed and
# returned to the system very quickly, whereas when it is loaded, the system
# needs idle time to get the scrubbing done.  This linear growth accommodates
# such requirements.
SLEEP_TIME_GROWTH = 0.1

# A mapping between easy-to-remember labels and the more verbose
# label actually shown in the PROC_XEN_BALLOON file.
#labels = { 'current'      : 'Current allocation',
#           'target'       : 'Requested target',
#           'low-balloon'  : 'Low-mem balloon',
#           'high-balloon' : 'High-mem balloon',
#           'limit'        : 'Xen hard limit' }

def _get_proc_balloon(label):
    """Returns the value for the named label.  Returns None if the label was
    not found or the value was non-numeric."""

    return osdep.lookup_balloon_stat(label)

def get_dom0_current_alloc():
    """Returns the current memory allocation (in KiB) of dom0."""

    kb = _get_proc_balloon('current')
    if kb == None:
        raise VmError('Failed to query current memory allocation of dom0.')
    return kb

def get_dom0_target_alloc():
    """Returns the target memory allocation (in KiB) of dom0."""

    kb = _get_proc_balloon('target')
    if kb == None:
        raise VmError('Failed to query target memory allocation of dom0.')
    return kb

def free(need_mem, self):
    """Balloon out memory from the privileged domain so that there is the
    specified required amount (in KiB) free.
    """

    # We check whether there is enough free memory, and if not, instruct dom0
    # to balloon out to free some up.  Memory freed by a destroyed domain may
    # not appear in the free_memory field immediately, because it needs to be
    # scrubbed before it can be released to the free list, which is done
    # asynchronously by Xen; ballooning is asynchronous also.  Such memory
    # does, however, need to be accounted for when calculating how much dom0
    # needs to balloon.  No matter where we expect the free memory to come
    # from, we need to wait for it to become available.
    #
    # We are not allowed to balloon below dom0_min_mem, or if dom0_ballooning
    # is False, we cannot balloon at all.  Memory can still become available
    # through a rebooting domain, however.
    #
    # Eventually, we time out (presumably because there really isn't enough
    # free memory).
    #
    # We don't want to set the memory target (triggering a watch) when that
    # has already been done, but we do want to respond to changing memory
    # usage, so we recheck the required alloc each time around the loop, but
    # track the last used value so that we don't trigger too many watches.

    xoptions = XendOptions.instance()
    dom0 = XendDomain.instance().privilegedDomain()
    xc = xen.lowlevel.xc.xc()

    try:
        dom0_min_mem = xoptions.get_dom0_min_mem() * 1024
        dom0_ballooning = xoptions.get_enable_dom0_ballooning()
        dom0_alloc = get_dom0_current_alloc()

        retries = 0
        sleep_time = SLEEP_TIME_GROWTH
        new_alloc = 0
        last_new_alloc = None
        last_free = None
        rlimit = RETRY_LIMIT

        # If an unreasonable memory size is required, we give up waiting
        # for ballooning or scrubbing, as if we had already retried.
        physinfo = xc.physinfo()
        free_mem = physinfo['free_memory']
        scrub_mem = physinfo['scrub_memory']
        total_mem = physinfo['total_memory']
        if dom0_ballooning:
            max_free_mem = total_mem - dom0_min_mem
        else:
            max_free_mem = total_mem - dom0_alloc
        if need_mem >= max_free_mem:
            retries = rlimit

        # Check whether the current machine is a NUMA system and whether
        # the newly created HVM guest has all its VCPUs in the same node.
        # If both conditions hold, wait until all the pages on the scrub
        # list are freed (but stop waiting once ~20s have passed).
        if physinfo['nr_nodes'] > 1 and retries == 0:
            oldnode = -1
            waitscrub = 1
            vcpus = self.info['cpus'][0]
            for vcpu in vcpus:
                nodenum = 0
                for node in physinfo['node_to_cpu']:
                    for cpu in node:
                        if vcpu == cpu:
                            if oldnode == -1:
                                oldnode = nodenum
                            elif oldnode != nodenum:
                                waitscrub = 0
                    nodenum = nodenum + 1
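
            # waitscrub is still 1 only if every pinned VCPU was found in
            # the same node; a guest that spans nodes skips the wait below.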
            if waitscrub == 1 and scrub_mem > 0:
                log.debug("wait for scrub %s", scrub_mem)
                while scrub_mem > 0 and retries < rlimit:
                    time.sleep(sleep_time)
                    physinfo = xc.physinfo()
                    free_mem = physinfo['free_memory']
                    scrub_mem = physinfo['scrub_memory']
                    retries += 1
                    sleep_time += SLEEP_TIME_GROWTH
                log.debug("scrub for %d times", retries)

            retries = 0
            sleep_time = SLEEP_TIME_GROWTH

        while retries < rlimit:
            physinfo = xc.physinfo()
            free_mem = physinfo['free_memory']
            scrub_mem = physinfo['scrub_memory']

            if free_mem >= need_mem:
                log.debug("Balloon: %d KiB free; need %d; done.",
                          free_mem, need_mem)
                return
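
            # Grow the retry budget with the shortfall: RETRY_LIMIT_INCR
            # extra retries for each GiB still needed.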
            if retries == 0:
                rlimit += ((need_mem - free_mem)/1024/1024) * RETRY_LIMIT_INCR
                log.debug("Balloon: %d KiB free; %d to scrub; need %d; retries: %d.",
                          free_mem, scrub_mem, need_mem, rlimit)

            if dom0_ballooning:
                dom0_alloc = get_dom0_current_alloc()
                new_alloc = dom0_alloc - (need_mem - free_mem - scrub_mem)

                if free_mem + scrub_mem >= need_mem:
                    if last_new_alloc == None:
                        log.debug("Balloon: waiting on scrubbing")
                        last_new_alloc = dom0_alloc
                else:
                    if (new_alloc >= dom0_min_mem and
                        new_alloc != last_new_alloc):
                        new_alloc_mb = new_alloc / 1024  # Round down
                        log.debug("Balloon: setting dom0 target to %d MiB.",
                                  new_alloc_mb)
                        dom0.setMemoryTarget(new_alloc_mb)
                        last_new_alloc = new_alloc
                # Continue to retry, waiting for ballooning or scrubbing.

            time.sleep(sleep_time)
            if retries < 2 * RETRY_LIMIT:
                sleep_time += SLEEP_TIME_GROWTH
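            # Count a retry only when no progress was made, i.e. the sum of
            # free and scrub memory did not grow since the last iteration.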
            if last_free != None and last_free >= free_mem + scrub_mem:
                retries += 1
            last_free = free_mem + scrub_mem

        # Not enough memory; diagnose the problem.
        if not dom0_ballooning:
            raise VmError(('Not enough free memory and enable-dom0-ballooning '
                           'is False, so I cannot release any more.  '
                           'I need %d KiB but only have %d.') %
                          (need_mem, free_mem))
        elif new_alloc < dom0_min_mem:
            raise VmError(
                ('I need %d KiB, but dom0_min_mem is %d and shrinking to '
                 '%d KiB would leave only %d KiB free.') %
                (need_mem, dom0_min_mem, dom0_min_mem,
                 free_mem + scrub_mem + dom0_alloc - dom0_min_mem))
        else:
            dom0_start_alloc_mb = get_dom0_current_alloc() / 1024
            dom0.setMemoryTarget(dom0_start_alloc_mb)
            raise VmError(
                ('Not enough memory is available, and dom0 cannot'
                 ' be shrunk any further'))

    finally:
        del xc
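
Note the new second parameter to free(): a caller must now pass the
domain object so that the function can inspect the guest's VCPU pinning
through self.info['cpus']. A hypothetical call site (the actual callers
in xend are not part of this view) would look like:

    # Hypothetical call site; 'dominfo' stands for the XendDomainInfo
    # object of the domain being started, which is not shown here.
    balloon.free(need_mem_kib, dominfo)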