ia64/xen-unstable

changeset 18894:b1b9cf7a2d36

xend: Fix memory allocation bug after hvm reboot in numa system

We recently found a bug on a Nehalem machine (two nodes, 6G of memory
in total, 3G in each node):
- Start an HVM guest with all its VCPUs pinned to node1, so all its
memory is allocated from node1.
- Reboot the HVM.
- Some memory will be allocated from node0 even though there is enough
free memory on node1.

Reason: For security reasons, Xen does not return the pages of a dying
HVM to the domheap directly, but puts them on the scrub list to be
handled by page_scrub_softirq(). If the dying HVM has a lot of memory,
page_scrub_softirq() will not have handled all of it before the new
HVM starts. Some pages belonging to node1 are still on the scrub list,
and the new HVM cannot use them, so it gets a different memory
distribution than before. Before changeset 18304, page_scrub_softirq()
could be executed in parallel on all the CPUs. Changeset 18305
serialised page_scrub_softirq(), and changeset 18307 serialised
page_scrub_softirq() with a new lock to avoid holding up acquisition of
page_scrub_lock in free_domheap_pages(). Those changesets slowed down
the handling of pages on the scrub list, so the bug became more
obvious afterwards.

Patch: This patch modifies balloon.free to avoid this bug. With the
patch, balloon.free checks whether the current machine is a NUMA
system and the newly created HVM has all its VCPUs on the same node. If
both conditions hold, we wait until all the pages on the scrub list
are freed (if the wait goes beyond 20s, we stop waiting).
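
The check and the bounded wait described above can be sketched in
isolation as follows. This is a minimal illustration, not the code in
balloon.py: the helper names single_node_for_cpus and wait_for_scrub,
the injected get_physinfo and sleep callables, and the default
retry_limit and growth values are all assumptions made for the
example. Only the 'scrub_memory' key and the node-to-CPU list layout
are taken from the diff.

```python
def single_node_for_cpus(cpus, node_to_cpu):
    """Return the node index if every CPU in `cpus` is on one node.

    `node_to_cpu` is a list of CPU lists, one per node, in the same
    layout as physinfo['node_to_cpu'] in the diff. Returns None when
    the CPUs span several nodes or match no node at all (a choice made
    for this sketch).
    """
    nodes = set()
    for cpu in cpus:
        for nodenum, node_cpus in enumerate(node_to_cpu):
            if cpu in node_cpus:
                nodes.add(nodenum)
    if len(nodes) == 1:
        return nodes.pop()
    return None


def wait_for_scrub(get_physinfo, sleep, retry_limit=20, growth=0.1):
    """Poll until the scrub list is empty or the retry limit is hit.

    `get_physinfo` and `sleep` are injected (hypothetical parameters)
    so the loop can be exercised without a hypervisor. The sleep time
    grows by `growth` seconds on every retry. Returns True if the
    scrub list drained in time, False on timeout.
    """
    retries = 0
    sleep_time = growth
    scrub_mem = get_physinfo()['scrub_memory']
    while scrub_mem > 0 and retries < retry_limit:
        sleep(sleep_time)
        scrub_mem = get_physinfo()['scrub_memory']
        retries += 1
        sleep_time += growth
    return scrub_mem == 0
```

With the assumed defaults the total wait is bounded at roughly 20s
(0.1s + 0.2s + ... over 20 retries), which matches the limit stated
above.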

This seems too restrictive at first glance. We used to wait only
until the free memory on the pinned node was bigger than required.
But as we know, the HVM memory allocation granularity is 2M. Even if
the former condition is satisfied, we still may not find enough
2M-sized chunks of memory on that node.
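
A small worked example of why total free memory is not a sufficient
condition. It is a simplified model, not the Xen allocator: it assumes
4 KiB pages and that a 2M allocation needs a fully free, 2M-aligned
run of pages. The function name free_2m_chunks and the use of plain
page frame numbers are hypothetical.

```python
PAGES_PER_2M = 512  # 2 MiB / 4 KiB, assuming 4 KiB pages


def free_2m_chunks(free_pfns):
    """Count 2M-aligned chunks in which every 4 KiB page is free.

    `free_pfns` is an iterable of free page frame numbers on one node.
    """
    counts = {}
    for pfn in set(free_pfns):
        chunk = pfn // PAGES_PER_2M
        counts[chunk] = counts.get(chunk, 0) + 1
    return sum(1 for n in counts.values() if n == PAGES_PER_2M)


# 1024 free pages (4 MiB in total), but only every other page is free,
# so no complete 2M chunk is available.
fragmented = list(range(0, 2048, 2))

# 512 free pages (2 MiB in total) forming exactly one aligned chunk.
contiguous = list(range(512, 1024))
```

In this model the fragmented node has twice as much free memory as the
contiguous one, yet free_2m_chunks reports no usable 2M chunk for it.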

Signed-off-by: Ting Zhou <ting.g.zhou@intel.com>
Signed-off-by: Xiaowei Yang <Xiaowei.yang@intel.com>
author Keir Fraser <keir.fraser@citrix.com>
date Tue Dec 09 12:44:32 2008 +0000 (2008-12-09)
parents 628b3a76dbf4
children c0c113ab0be5
files tools/python/xen/xend/XendCheckpoint.py tools/python/xen/xend/XendDomainInfo.py tools/python/xen/xend/balloon.py
line diff
     1.1 --- a/tools/python/xen/xend/XendCheckpoint.py	Tue Dec 09 12:42:18 2008 +0000
     1.2 +++ b/tools/python/xen/xend/XendCheckpoint.py	Tue Dec 09 12:44:32 2008 +0000
     1.3 @@ -253,7 +253,7 @@ def restore(xd, fd, dominfo = None, paus
     1.4          # set memory limit
     1.5          xc.domain_setmaxmem(dominfo.getDomid(), maxmem)
     1.6  
     1.7 -        balloon.free(memory + shadow)
     1.8 +        balloon.free(memory + shadow, dominfo)
     1.9  
    1.10          shadow_cur = xc.shadow_mem_control(dominfo.getDomid(), shadow / 1024)
    1.11          dominfo.info['shadow_memory'] = shadow_cur
     2.1 --- a/tools/python/xen/xend/XendDomainInfo.py	Tue Dec 09 12:42:18 2008 +0000
     2.2 +++ b/tools/python/xen/xend/XendDomainInfo.py	Tue Dec 09 12:44:32 2008 +0000
     2.3 @@ -2105,7 +2105,7 @@ class XendDomainInfo:
     2.4          # overhead is greater for some types of domain than others. For
     2.5          # example, an x86 HVM domain will have a default shadow-pagetable
     2.6          # allocation of 1MB. We free up 2MB here to be on the safe side.
     2.7 -        balloon.free(2*1024) # 2MB should be plenty
     2.8 +        balloon.free(2*1024, self) # 2MB should be plenty
     2.9  
    2.10          ssidref = 0
    2.11          if security.on() == xsconstants.XS_POLICY_USE:
    2.12 @@ -2299,7 +2299,7 @@ class XendDomainInfo:
    2.13              vtd_mem = ((vtd_mem + 1023) / 1024) * 1024
    2.14  
    2.15              # Make sure there's enough RAM available for the domain
    2.16 -            balloon.free(memory + shadow + vtd_mem)
    2.17 +            balloon.free(memory + shadow + vtd_mem, self)
    2.18  
    2.19              # Set up the shadow memory
    2.20              shadow_cur = xc.shadow_mem_control(self.domid, shadow / 1024)
    2.21 @@ -2716,7 +2716,7 @@ class XendDomainInfo:
    2.22              # The domain might already have some shadow memory
    2.23              overhead_kb -= xc.shadow_mem_control(self.domid) * 1024
    2.24          if overhead_kb > 0:
    2.25 -            balloon.free(overhead_kb)
    2.26 +            balloon.free(overhead_kb, self)
    2.27  
    2.28      def _unwatchVm(self):
    2.29          """Remove the watch on the VM path, if any.  Idempotent.  Nothrow
     3.1 --- a/tools/python/xen/xend/balloon.py	Tue Dec 09 12:42:18 2008 +0000
     3.2 +++ b/tools/python/xen/xend/balloon.py	Tue Dec 09 12:44:32 2008 +0000
     3.3 @@ -67,7 +67,7 @@ def get_dom0_target_alloc():
     3.4          raise VmError('Failed to query target memory allocation of dom0.')
     3.5      return kb
     3.6  
     3.7 -def free(need_mem):
     3.8 +def free(need_mem ,self):
     3.9      """Balloon out memory from the privileged domain so that there is the
    3.10      specified required amount (in KiB) free.
    3.11      """
    3.12 @@ -122,6 +122,40 @@ def free(need_mem):
    3.13          if need_mem >= max_free_mem:
    3.14              retries = rlimit
    3.15  
    3.16 +        # Check whether the current machine is a numa system and the new
    3.17 +        # created hvm has all its vcpus in the same node. If all the
    3.18 +        # conditions above are met, we will wait until all the pages
    3.19 +        # in the scrub list are freed (if the waiting time goes beyond 20s,
    3.20 +        # we will stop waiting).
    3.21 +        if physinfo['nr_nodes'] > 1 and retries == 0:
    3.22 +            oldnode = -1
    3.23 +            waitscrub = 1
    3.24 +            vcpus = self.info['cpus'][0]
    3.25 +            for vcpu in vcpus:
    3.26 +                nodenum = 0
    3.27 +                for node in physinfo['node_to_cpu']:
    3.28 +                    for cpu in node:
    3.29 +                        if vcpu == cpu:
    3.30 +                            if oldnode == -1:
    3.31 +                                oldnode = nodenum
    3.32 +                            elif oldnode != nodenum:
    3.33 +                                waitscrub = 0
    3.34 +                    nodenum = nodenum + 1
    3.35 +
    3.36 +            if waitscrub == 1 and scrub_mem > 0:
    3.37 +                log.debug("wait for scrub %s", scrub_mem)
    3.38 +                while scrub_mem > 0 and retries < rlimit:
    3.39 +                    time.sleep(sleep_time)
    3.40 +                    physinfo = xc.physinfo()
    3.41 +                    free_mem = physinfo['free_memory']
    3.42 +                    scrub_mem = physinfo['scrub_memory']
    3.43 +                    retries += 1
    3.44 +                    sleep_time += SLEEP_TIME_GROWTH
    3.45 +                log.debug("scrub for %d times", retries)
    3.46 +
    3.47 +            retries = 0
    3.48 +            sleep_time = SLEEP_TIME_GROWTH
    3.49 +
    3.50          while retries < rlimit:
    3.51              physinfo = xc.physinfo()
    3.52              free_mem = physinfo['free_memory']