John Weekes [Tue, 11 Jan 2011 16:42:41 +0000 (16:42 +0000)]
stubdom: Fix stubdom-dm using "grep" improperly
stubdom-dm uses "grep" on "xm list" output to determine whether it is
already running. The existing behavior is to use "grep $domname-dm" but
this will result in a false-positive in the case of another domU running
whose name ends with the full new name; for instance, if "abctest-dm" is
running, a new "test-dm" will spin forever, waiting for it to end.
An easy fix is to have it use "grep -w" instead of "grep", searching
for the whole word only.
It also might be worth considering a switch to "xl list" from "xm list",
here and in other places.
Signed-off-by: John Weekes <lists.xen@nuclearfallout.net> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
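The false positive described above is easy to reproduce outside the script. The following is a minimal Python sketch, not the stubdom-dm code itself; the listing text is made up and only loosely resembles "xm list" output. It contrasts a plain substring search with a whole-word search comparable to "grep -w".

```python
import re

def substring_match(name, listing):
    """Mimics plain `grep name`: any line containing the text matches."""
    return [line for line in listing.splitlines() if name in line]

def whole_word_match(name, listing):
    """Approximates `grep -w name`: the match must not touch other
    word characters (letters, digits, underscore) on either side."""
    pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
    return [line for line in listing.splitlines() if pattern.search(line)]

# Hypothetical listing in which only "abctest-dm" is running.
listing = (
    "Domain-0      0  1024  4  r-----  100.0\n"
    "abctest-dm    5    32  1  -b----    1.2"
)

print(substring_match("test-dm", listing))   # the abctest-dm line: false positive
print(whole_word_match("test-dm", listing))  # empty list: no false positive
```

Note that whole-word matching treats "-" as a word boundary, so a domain named "abc-test-dm" would still match "test-dm"; the fix narrows the problem rather than eliminating it.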
Gianni Tedesco [Tue, 11 Jan 2011 16:31:47 +0000 (16:31 +0000)]
stubdom/minios: don't retrieve the address of void variable
Objects must not be declared to have type void. Declare shared_info
to have the appropriate type instead.
Author: Gianni Tedesco <gianni.tedesco@citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Samuel Thibault [Tue, 11 Jan 2011 16:30:15 +0000 (16:30 +0000)]
stubdom/minios: use correct sized types for software floating point
Replace long/int/short sizes with proper exact-size types for 64bit
architectures. As well as making the code correct, this eliminates a
compiler warning about an uninitialised variable.
Signed-off-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Keir Fraser [Tue, 11 Jan 2011 11:41:39 +0000 (11:41 +0000)]
xenctx: misc adjustments
- fix off-by-one errors during symbol insertion and lookup
- don't store the symbol type, as it wasn't needed at all so far and
is only needed now at parsing time
- don't insert certain kinds of symbols
Keir Fraser [Tue, 11 Jan 2011 11:40:50 +0000 (11:40 +0000)]
x86: restore x2apic pre-enabled check logic
c/s 22475 removed the early checking without replacement, neglecting
the fact that x2apic_enabled must be set early for APIC register
accesses done during second stage ACPI table parsing (rooted at
acpi_boot_init()) to work correctly. Without this, particularly
determination of the boot CPU won't work, resulting in an attempt to
bring up that CPU again as a secondary one (which fails).
Restore the functionality, now calling it from generic_apic_probe().
Keir Fraser [Tue, 11 Jan 2011 11:27:37 +0000 (11:27 +0000)]
xenpaging: update machine_to_phys_mapping[] during page deallocation
The machine_to_phys_mapping[] array needs updating during page
deallocation. If that page is allocated again, a call to
get_gpfn_from_mfn() will still return an old gfn from another guest.
This will cause trouble because this gfn number has no or different
meaning in the context of the current guest.
This happens when the entire guest ram is paged-out before
xen_vga_populate_vram() runs. Then XENMEM_populate_physmap is called
with gfn 0xff000. A new page is allocated with alloc_domheap_pages.
This new page does not have a gfn yet. However, in
guest_physmap_add_entry() the passed mfn still maps to an old gfn
(perhaps from another old guest). This old gfn is in paged-out state
in this guest's context and has no mfn anymore. As a result, the
ASSERT() triggers because p2m_is_ram() is true for p2m_ram_paging*
types. If the machine_to_phys_mapping[] array is updated properly,
both loops in guest_physmap_add_entry() turn into no-ops for the new
page and the mfn/gfn mapping will be done at the end of the function.
If XENMEM_add_to_physmap is used with XENMAPSPACE_gmfn,
get_gpfn_from_mfn() will return an apparently valid gfn. As a
result, guest_physmap_remove_page() is called. The ASSERT in
p2m_remove_page triggers because the passed mfn does not match the old
mfn for the passed gfn.
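The stale-entry problem can be illustrated with a toy model. This is a sketch only: the class, the dictionary standing in for machine_to_phys_mapping[], and the INVALID_GFN marker are assumptions made for illustration and do not mirror the hypervisor's data structures.

```python
INVALID_GFN = None  # stand-in for an "invalid entry" marker (assumption)

class ToyHost:
    """Toy model of a global mfn -> gfn table shared by all guests."""

    def __init__(self):
        self.m2p = {}    # models machine_to_phys_mapping[]
        self.free = []   # mfns available for allocation

    def map_page(self, mfn, gfn):
        self.m2p[mfn] = gfn

    def free_page(self, mfn, update_m2p):
        # With update_m2p=False the old gfn stays behind, which models
        # the behaviour before the fix.
        if update_m2p:
            self.m2p[mfn] = INVALID_GFN
        self.free.append(mfn)

    def alloc_page(self):
        return self.free.pop()

    def get_gpfn_from_mfn(self, mfn):
        return self.m2p.get(mfn, INVALID_GFN)

for update in (False, True):
    host = ToyHost()
    host.map_page(0x1000, 0x42)            # page owned by an old guest
    host.free_page(0x1000, update_m2p=update)
    mfn = host.alloc_page()                # same mfn handed to a new owner
    print(update, host.get_gpfn_from_mfn(mfn))
```

In the first iteration the lookup still reports the previous owner's gfn; in the second it reports the invalid marker, which is the state the later mapping code can safely treat as "no gfn yet".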
Keir Fraser [Tue, 11 Jan 2011 10:38:28 +0000 (10:38 +0000)]
xenpaging: drop paged pages in guest_remove_page
Simply drop paged-pages in guest_remove_page(), and notify xenpaging
to drop its reference to the gfn. If the ring is full, the page will
remain in paged-out state in xenpaging. This is not an issue, it just
means this gfn will not be nominated again.
Keir Fraser [Tue, 11 Jan 2011 10:32:59 +0000 (10:32 +0000)]
xenpaging: print page-in/page-out progress
Now that DPRINTF is triggered only when the environment variable
XENPAGING_DEBUG is found, make such a debug session actually useful by
printing the entire page-out/page-in process. The 'Got event from Xen'
message alone is not helpful.
Keir Fraser [Tue, 11 Jan 2011 10:32:05 +0000 (10:32 +0000)]
xenpaging: specify policy mru_size at runtime
The environment variable XENPAGING_POLICY_MRU_SIZE will change the
mru_size in the policy at runtime. Specifying the mru_size at runtime
allows the admin to keep more pages in memory so guests can make more
progress. It's also good for development to reduce the value to put
more pressure on the paging related code paths.
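As an illustration of overriding a policy value from the environment, here is a minimal Python sketch. Only the variable name XENPAGING_POLICY_MRU_SIZE comes from the commit message; the default value and the validation rules below are assumptions, and xenpaging itself is written in C.

```python
import os

DEFAULT_MRU_SIZE = 1024  # hypothetical default, not taken from the source

def mru_size_from_env(environ=os.environ):
    """Return the MRU size, honouring XENPAGING_POLICY_MRU_SIZE if usable."""
    raw = environ.get("XENPAGING_POLICY_MRU_SIZE")
    if raw is None:
        return DEFAULT_MRU_SIZE
    try:
        value = int(raw)
    except ValueError:
        # Assumed behaviour: ignore values that are not integers.
        return DEFAULT_MRU_SIZE
    # Assumed behaviour: a non-positive size falls back to the default.
    return value if value > 0 else DEFAULT_MRU_SIZE

print(mru_size_from_env({"XENPAGING_POLICY_MRU_SIZE": "4096"}))
```

A larger value keeps more recently used pages resident; a small value exercises the paging paths harder, as the commit message suggests for development.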
Keir Fraser [Tue, 11 Jan 2011 10:31:33 +0000 (10:31 +0000)]
xenpaging: remove domain_id and mfn from struct xenpaging_victim
Remove unused member 'mfn' from struct xenpaging_victim.
xenpaging operates on a single guest, so it needs only a single
domain_id. Remove domain_id from struct xenpaging_victim and use the
one from paging->mem_event where needed. It's not used in the policy.
Keir Fraser [Mon, 10 Jan 2011 08:45:19 +0000 (08:45 +0000)]
x86_64: don't use weak symbols on x86-64
Various gcc versions inline functions that are both weak and hidden,
without even giving a warning.
Certainly the risk exists that we'll see the problem again when
another weak function gets introduced, but I don't see a way to
protect us from that.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Just remove the weak attribute altogether. It's the only one in
non-ia64-specific code. We can get the same effect with ifdefs, which,
although a bit unsightly, is better than using compiler/linker features
we cannot trust.
Keir Fraser [Mon, 10 Jan 2011 08:40:32 +0000 (08:40 +0000)]
EPT/VT-d: bug fix for EPT/VT-d table sharing
This patch makes the following changes: 1) Moves EPT/VT-d sharing
initialization back to when it is actually needed, to make sure
vmx_ept_vpid_cap has been initialized. 2) Adds a page order parameter
to iommu_pte_flush() to tell VT-d what size of page to flush. 3)
Adds a hap_2mb flag to ease performance studies between the base 4KB EPT
size and when 2MB and 1GB page size support are enabled.
Keir Fraser [Sat, 8 Jan 2011 10:52:45 +0000 (10:52 +0000)]
Update AMD SVM feature flags
This patch updates AMD SVM feature flags (0x8000000A:EDX). It adds
several new feature bits, along with feature descriptions. The feature
names are changed to be consistent with the Linux kernel.
Keir Fraser [Sat, 8 Jan 2011 10:48:46 +0000 (10:48 +0000)]
Update AMD CPU feature flags 0x80000001:ECX for Xen Hypervisor
This patch syncs up AMD CPU feature flags 0x80000001:ECX with the
latest Linux kernel. Several new features are added. Some existing
feature names are changed as well.
Keir Fraser [Sat, 8 Jan 2011 10:09:44 +0000 (10:09 +0000)]
timer: Ensure that CPU field of a timer is read safely when lock-free.
Firstly, all updates must use atomic_write16(), and lock-free reads
must use atomic_read16(). Secondly, we ensure ->cpu is the only field
accessed without a lock. This requires us to place a special sentinel
value in that field when a timer is killed, to avoid needing to read
->status outside a locked critical section.
Keir Fraser [Sat, 8 Jan 2011 10:05:55 +0000 (10:05 +0000)]
x86: Fix atomic_write*() macros to correctly inform GCC that memory
it knows about is being written to.
The bug is a copy-and-paste error from inline asm that writes to I/O
memory. In that case, as with asm for accessing guest memory,
specifying memory as a read-only parameter is acceptable because the
memory cannot alias with anything that GCC reads directly.
Keir Fraser [Sat, 8 Jan 2011 09:29:11 +0000 (09:29 +0000)]
timer: Fix up timer-state teardown on CPU offline / online-failure.
The lock-free access to timer->cpu in timer_lock() is problematic, as
the per-cpu data that is then dereferenced could disappear under our
feet. Now that per-cpu data is freed via RCU, we simply add a RCU
read-side critical section to timer_lock(). It is then also
unnecessary to migrate timers on CPU_DYING (we can defer it to the
nicer CPU_DEAD state) and we can also safely migrate timers on
CPU_UP_CANCELED.
Keir Fraser [Sat, 8 Jan 2011 09:14:23 +0000 (09:14 +0000)]
x86: Free per-cpu area for offline cpu via RCU.
This allows other CPUs to reference per-cpu areas with less strict
locking. In particular, timer.c accesses a per-cpu lock with reference
to a per-timer cpu field which it accesses with no synchronisation.
One subtlety is that this prevents us bringing a cpu back online until
the RCU work is completed. In this case we return EBUSY and the
tool stack can report the (unlikely) error, or retry, as it sees fit.
Gianni Tedesco [Fri, 7 Jan 2011 18:01:18 +0000 (18:01 +0000)]
minios: use constant expression to size arrays
Fixes a compile error in gcc-4.5 which is the reason __CONST_RING_SIZE()
was introduced. Let's just use it in minios netfront.
Signed-off-by: Gianni Tedesco <gianni.tedesco@citrix.com> Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Keir Fraser [Fri, 7 Jan 2011 16:59:53 +0000 (16:59 +0000)]
x86 asid: Do not check for per-cpu asid state being already initialised.
It cannot be, since per-cpu data is re-allocated and zeroed across CPU
hotplug. The comment about resetting the per-cpu generation counter is
incorrect, since all vcpus must have been migrated to other cpus while
this cpu was offline, and that resets the per-vcpu generation stamp.
Keir Fraser [Fri, 7 Jan 2011 14:13:15 +0000 (14:13 +0000)]
ASID: Optimize hvm_flush_guest_tlbs
In our testing, we found that hvm_flush_guest_tlbs() is called very
frequently, and that it always forces asid recycling, resulting in an
immediate whole tlb flush whether or not free asids remain. In this
case, just increasing the core generation might be enough, and the
remaining asids can still be used until next_asid > max_asid.
Signed-off-by: Wei Wang <wei.wang2@amd.com> Reviewed-by: Wei Huang <wei.huang2@amd.com>
Simplify the logic and also fix a very minor bug in
hvm_asid_handle_vmenter(), in the case that hvm_asid_flush_core() sets
data->disabled.
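The optimisation can be sketched with a small generation-counter model. This is an illustration under stated assumptions, not the hypervisor code: the class and function names are hypothetical, and only the ideas of a core generation, next_asid and max_asid come from the commit message.

```python
class CoreAsidState:
    """Per-core ASID allocator state (hypothetical model)."""
    def __init__(self, max_asid):
        self.generation = 1
        self.next_asid = 1
        self.max_asid = max_asid

class VcpuAsid:
    """Per-vcpu ASID plus the generation it was allocated in."""
    def __init__(self):
        self.generation = 0  # never matches a live core generation
        self.asid = 0

def flush_guest_tlbs(core):
    # Cheap flush: bumping the generation invalidates every vcpu's ASID
    # without recycling the ASID space or flushing the TLB right away.
    core.generation += 1

def handle_vmenter(core, vcpu):
    """Assign an ASID if needed; return True when a full TLB flush is due."""
    if vcpu.generation == core.generation:
        return False                      # current ASID is still valid
    need_flush = False
    if core.next_asid > core.max_asid:    # ASID space exhausted: recycle
        core.generation += 1
        core.next_asid = 1
        need_flush = True
    vcpu.asid = core.next_asid
    core.next_asid += 1
    vcpu.generation = core.generation
    return need_flush

core, vcpu = CoreAsidState(max_asid=3), VcpuAsid()
flushes = []
for _ in range(4):
    flushes.append(handle_vmenter(core, vcpu))
    flush_guest_tlbs(core)
print(flushes)  # [False, False, False, True]
```

With three usable ASIDs, the first three flush requests are absorbed by handing out fresh ASIDs; only the fourth entry finds the space exhausted and pays for a full TLB flush.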
Joe Epstein [Fri, 7 Jan 2011 11:54:48 +0000 (11:54 +0000)]
mem_access: added INT3/CRx capture
* Allows a memory event listener to register for events on changes to
CR0, CR3, and CR4, as well as INT3 instructions, as a part of the
mem_access mechanism. These events can be either synchronous or
asynchronous.
* For INT3, the logic works independent of a debugger, and so both can
be supported.
* The presence and type of listener are stored and accessed through
HVM params.
* Changed the event mask handling to ensure that the right events are
captured based on the listeners.
* Added the ability to inject HW/SW traps into a VCPU when it next
resumes (rather than try to modify the existing IRQ injection
code paths). Only one trap to inject can be outstanding at a time.
Signed-off-by: Joe Epstein <jepstein98@gmail.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Tim Deegan <Tim.Deegan@citrix.com>
Joe Epstein [Fri, 7 Jan 2011 11:54:45 +0000 (11:54 +0000)]
mem_access: HVMOPs for setting mem access
* Creates HVMOPs for setting and getting memory access. The hypercalls
can set individual pages or the default access for new/refreshed
pages.
* Added functions to libxc to access these hypercalls.
Signed-off-by: Joe Epstein <jepstein98@gmail.com> Reviewed-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Tim Deegan <Tim.Deegan@citrix.com>
Joe Epstein [Fri, 7 Jan 2011 11:54:40 +0000 (11:54 +0000)]
mem_access: mem event additions for access
* Adds an ACCESS memory event type, with RESUME as the action.
* Refactors the bits in the memory event to store whether the memory event
was a read, write, or execute (for access memory events only). I used
bits sparingly to keep the structure somewhat the same size.
* Modified VMX to report the needed information in its nested page fault.
SVM is not implemented in this patch series.
Signed-off-by: Joe Epstein <jepstein98@gmail.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Tim Deegan <Tim.Deegan@citrix.com>
Joe Epstein [Fri, 7 Jan 2011 11:54:36 +0000 (11:54 +0000)]
mem_access: introduce P2M mem_access types
* Introduces access types for each page, giving independent read, write, and
execute permissions for each page. The permissions are restrictive from
what the page type gives: for example, a p2m_type_ro page with an access of
p2m_access_rw would have read-only permissions in total, as p2m_type_ro
removed write access and p2m_access_rw removed execute access.
* Implements the access flag storage for EPT, moving some bits from P2M type,
which had 10 bits of storage, to the four bits for access.
* Access flags are stored according to a loose consistency contract, where
pages can be reset to the default access permissions at any time. Right
now, that happens on page type changes, where one would want to reevaluate
whether permissions make sense for that page as they are anyway.
Signed-off-by: Joe Epstein <jepstein98@gmail.com> Acked-by: Tim Deegan <Tim.Deegan@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
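The "restrictive" combination of page type and access type amounts to an intersection of permission sets. The sketch below assumes, as the example in the message implies, that p2m_type_ro allows read and execute and that p2m_access_rw allows read and write; the Perm enum and the lookup tables are hypothetical helpers, not Xen definitions.

```python
from enum import Flag, auto

class Perm(Flag):
    NONE = 0
    R = auto()
    W = auto()
    X = auto()

# Permissions implied by the page type (only the type named in the text).
TYPE_PERMS = {"p2m_type_ro": Perm.R | Perm.X}

# Permissions implied by the access setting (only the one named in the text).
ACCESS_PERMS = {"p2m_access_rw": Perm.R | Perm.W}

def effective(p2m_type, p2m_access):
    """A permission survives only if both the type and the access allow it."""
    return TYPE_PERMS[p2m_type] & ACCESS_PERMS[p2m_access]

print(effective("p2m_type_ro", "p2m_access_rw"))  # read-only in total
```

The type removes write, the access removes execute, and only read remains, which matches the read-only result described in the message.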
Anthony PERARD [Thu, 6 Jan 2011 18:04:48 +0000 (18:04 +0000)]
libxl: Lists qdisk device in libxl_device_disk_list
As libxl switches to qdisk when blktap isn't available, this patch makes
libxl_device_disk_list also list qdisk devices, so that
libxl_build_device_model_args_new is able to add qdisk devices to
the command line options of Qemu.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Anthony PERARD [Thu, 6 Jan 2011 18:03:11 +0000 (18:03 +0000)]
libxl: Factorize function libxl_device_disk_list
This patch adds the function libxl_append_disk_list_of_type to get the disk
parameters for one backend type.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Thu, 6 Jan 2011 17:37:00 +0000 (17:37 +0000)]
ocaml: evtchn+xc bindings: use libxenctrl and libxenguest
Now that tools/libxc is licensed under LGPL I don't think there is any need for
an LGPL reimplementation under tools/ocaml.
For the most part the conversion to the up-to-date libxc API (xc_lib.c
essentially implemented the same interface as an older libxc) was pretty
automatic. There are some functions which appear to no longer exist in libxc
which I therefore simply removed the bindings for and a small number of
interfaces which had changed.
Many of the functions bound by the stubs have no in-tree users (which I think
is fine for a language binding) so I have no way to confirm correctness other
than by eye. I was however able to confirm that oxenstored still worked and to
build a XCP toolstack which could successfully start a PV guest.
Uses the new XC_OPENFLAG_NON_REENTRANT option to avoid potential conflicts
between pthreads and the ocaml runtime.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Vincent Hanquez <Vincent.Hanquez@eu.citrix.com> Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Christoph Egger [Thu, 6 Jan 2011 17:26:26 +0000 (17:26 +0000)]
libxc: portability fixes for NetBSD
Attached patch makes libxc build again on NetBSD after the recent rework.
[ Modified by iwj:
I changed the name of the new make variable from LIBDL to DLOPEN_LIBS.
The latter conforms to the naming scheme for similar variables found
in config/*.mk - PTHREAD_LIBS et al.
Also I moved the setting of the variable to -dl from Linux to StdGNU
(which makes it apply more widely) and also added it to SunOS.mk
(based on pure guesswork). ]
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tim Deegan [Thu, 6 Jan 2011 16:58:48 +0000 (16:58 +0000)]
mem_sharing: fix race condition of nominate and unshare
(1) When updating/checking p2m type for mem_sharing, we must hold shr_lock
(2) For nominate operation, if the page is already nominated, return the
handle from page_info->shr_handle
(3) For unshare operation, it is possible that multiple users unshare a
page via hvm_hap_nested_page_fault() at the same time. If the page
is already un-shared by someone else, simply return success.
Signed-off-by: Jui-Hao Chiang <juihaochiang@gmail.com> Signed-off-by: Han-Lin Li <Han-Lin.Li@itri.org.tw> Acked-by: Tim Deegan <Tim.Deegan@citrix.com>
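The three rules above can be sketched as a lock-protected, idempotent pair of operations. This is a toy model in Python, not the mem_sharing code: the class, the handle counter and the return value 0 for success are assumptions; only the name shr_lock and the nominate/unshare semantics come from the message.

```python
import threading

class ToySharing:
    def __init__(self):
        self.shr_lock = threading.Lock()
        self.handles = {}      # gfn -> handle for pages currently nominated
        self.next_handle = 1

    def nominate(self, gfn):
        # Rule (1): state is only checked or updated with shr_lock held.
        with self.shr_lock:
            # Rule (2): an already nominated page returns its handle.
            if gfn in self.handles:
                return self.handles[gfn]
            handle = self.next_handle
            self.next_handle += 1
            self.handles[gfn] = handle
            return handle

    def unshare(self, gfn):
        with self.shr_lock:
            # Rule (3): if someone else already unshared the page,
            # there is nothing left to do and that still counts as success.
            self.handles.pop(gfn, None)
            return 0

sharing = ToySharing()
first = sharing.nominate(7)
print(first == sharing.nominate(7))

results = []
threads = [threading.Thread(target=lambda: results.append(sharing.unshare(7)))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Whichever thread wins the lock performs the real unshare; the others find the page already private and report success rather than an error.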
Christoph Egger [Thu, 6 Jan 2011 14:25:10 +0000 (14:25 +0000)]
libxl: Implement libxl_basename()
This patch implements libxl_basename() as a portable replacement
for GNU vs. POSIX basename.
Signed-off-by: Christoph Egger <Christoph.Egger@amd.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Yang Z Zhang [Wed, 5 Jan 2011 23:37:32 +0000 (23:37 +0000)]
libxl: fix free of uninitialised "disks" variable
Reported-by: Wei Huang <wei.huang2@amd.com> Reported-by: Christoph Egger <Christoph.Egger@amd.com>
Author: Yang Z Zhang <yang.z.zhang@intel.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Wed, 5 Jan 2011 23:31:24 +0000 (23:31 +0000)]
tools/xend: drbd: fix things by reverting 20158
drbd's block-drbd script handles all of the details that c/s 20158
introduces within xend :-(. This c/s should be reverted as it causes
a regression. Jim Fehlig tested drbd without 20158 and it works fine.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Tested-by: Jim Fehlig <jfehlig@novell.com>
Gianni Tedesco [Wed, 5 Jan 2011 23:13:07 +0000 (23:13 +0000)]
xl: don't segfault parsing disk configs, support NULL physpath and ioemu:
Switch to a state machine parser since it's easier to handle all these
exotic cases without segfaulting. NULL physpaths are now allowed and a
dodgy hack is introduced to skip over the "ioemu:" prefix for a
virtpath. Also fixes a leak of buf2.
Signed-off-by: Gianni Tedesco <gianni.tedesco@citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Keir Fraser [Wed, 5 Jan 2011 09:57:15 +0000 (09:57 +0000)]
relax vCPU pinned checks
Both writing of certain MSRs and VCPUOP_get_physid make sense also for
dynamically (perhaps temporarily) pinned vcpus.
Likely a couple of other MSR writes (MSR_K8_HWCR, MSR_AMD64_NB_CFG,
MSR_FAM10H_MMIO_CONF_BASE) would make sense to be restricted by an
is_pinned() check too, possibly also some MSR reads.
Keir Fraser [Wed, 5 Jan 2011 09:52:18 +0000 (09:52 +0000)]
x86: Allow dom0 to write MSR IA32_ENERGY_PERF_BIAS
There is a new hardware feature which lets system software set the
Energy Performance Preference. This is an opaque knob in the form of the
IA32_ENERGY_PERF_BIAS MSR, which has a 4-bit Energy Performance
Preference Hint.
The support for this feature is indicated by CPUID.06H.ECX.bit3. Refer
to Intel Architectures Software Developer's Manual for more info.
Keir Fraser [Wed, 5 Jan 2011 09:48:43 +0000 (09:48 +0000)]
x86 amd: Revert 6382:b74c15e4dd4f (AMD flush filter configuration)
Flush filter is not reliably supported by any processor, and we already
have code to unconditionally disable the filter, so we don't need the
command-line config option. Remove it.
Ian Campbell [Tue, 4 Jan 2011 15:26:02 +0000 (15:26 +0000)]
tools/hotplug/Linux: only apply dummy MAC address to virtual devices.
Avoid applying to the bridge and physical network device.
This should un-break dom0 networking in the old xend-creates-bridge
setup (problem introduced in 22493:937488219719).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tim Deegan [Tue, 4 Jan 2011 11:32:20 +0000 (11:32 +0000)]
x86/mm: Add p2m_lock in set_shared_p2m_entry
This avoids the immediate problem (calling set_p2m_entry() without the
lock held) but leaves the underlying problem (no consistent locking
order between page-sharing and p2m code) for later.
Signed-off-by: Jui-Hao Chiang <juihaochiang@gmail.com> Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
Keir Fraser [Fri, 24 Dec 2010 10:23:08 +0000 (10:23 +0000)]
Xen MCE test: all test cases
Implement the test cases. Each case calls the common functions, then
calls the mce inject tool. Add a README for the Xen MCE test suite, covering
the framework and test instructions.
Keir Fraser [Fri, 24 Dec 2010 10:22:23 +0000 (10:22 +0000)]
Xen MCE test: common functions to be used for test cases
Implement common shell functions and variable definitions to be
used by the test cases. Verification functions cover the domain0 user
space tool mcelog, Xen dmesg and the guest kernel log.
Keir Fraser [Fri, 24 Dec 2010 10:21:27 +0000 (10:21 +0000)]
Xen MCE test: utilities to inject fake MCE for X86
A software MCE injection tool, based on the Xen MCE injection
mechanism. It fakes an MCE error and injects it at an assigned
Domain Physical Address. The Makefile makes sure the tool can be built on
Xen. A README explains the usage of this tool.
Keir Fraser [Fri, 24 Dec 2010 10:17:49 +0000 (10:17 +0000)]
libxc: Use .opic to build xenctrl_osdep_ENOSYS.so
Resolves build error:
/usr/bin/ld: xenctrl_osdep_ENOSYS.o: relocation R_X86_64_32
against `a local symbol' can not be used when making a shared
object; recompile with -fPIC
xenctrl_osdep_ENOSYS.o: could not read symbols: Bad value
collect2: ld returned 1 exit status
Clean up object files correctly too.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Keir Fraser [Fri, 24 Dec 2010 10:14:01 +0000 (10:14 +0000)]
VT-d: fix and improve print_vtd_entries()
Fix leaking of mapped domain pages (root_entry and ctxt_entry when
falling out of the level traversing loop). Do this by re-arranging
things slightly so that a mapping is retained only as long as it
really is needed.
Fix the failure to use map_domain_page() in the level traversing loop
of the function.
Add a missing return statement in one of the error paths.
Also I wonder whether not being able to call print_vtd_entries() from
iommu_page_fault_do_one() in ix86 is still correct, now that
map_domain_page() is IRQ safe.
Keir Fraser [Fri, 24 Dec 2010 10:12:58 +0000 (10:12 +0000)]
re-add calls accidentally deleted from run_all_nonirq_keyhandlers()
c/s 22538:a3a29e67aa7e, having got applied in a form different from
the one submitted, resulted in the calls to
console_{start,end}_log_everything() getting removed without
replacement. Add them back since, other than run_all_keyhandlers(),
this doesn't run with log-everything already in effect.
Keir Fraser [Fri, 24 Dec 2010 08:47:23 +0000 (08:47 +0000)]
x86-64: use PC-relative exception table entries
... thus allowing the entries to be made half their current size. Rather
than adjusting all instances to the new layout, abstract the
construction of the table entries via a macro (paralleling a similar one
in recent Linux).
Also change the name of the section (to allow easier detection of
missed cases) and merge the final resulting output sections into
.data.read_mostly.
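The size saving comes from replacing two 64-bit absolute addresses with two signed 32-bit offsets. The Python sketch below models that layout with the struct module; the field roles (fault address and fixup address), the choice of making each offset relative to its own field, and the sample addresses are assumptions for illustration, not the Xen definitions.

```python
import struct

ABS_ENTRY = struct.Struct("<QQ")  # two absolute 64-bit addresses
REL_ENTRY = struct.Struct("<ii")  # two signed 32-bit relative offsets

def encode_relative(entry_addr, fault_addr, fixup_addr):
    """Pack an entry located at entry_addr; each offset is relative to
    the address of the 4-byte field that stores it (assumed convention)."""
    return REL_ENTRY.pack(fault_addr - entry_addr,
                          fixup_addr - (entry_addr + 4))

def decode_relative(entry_addr, raw):
    """Recover the absolute addresses from a packed relative entry."""
    fault_off, fixup_off = REL_ENTRY.unpack(raw)
    return entry_addr + fault_off, entry_addr + 4 + fixup_off

# Made-up addresses; they only need to lie within 2GB of the table.
entry_addr = 0xFFFF82C480200000
fault_addr = 0xFFFF82C480100010
fixup_addr = 0xFFFF82C480100020

raw = encode_relative(entry_addr, fault_addr, fixup_addr)
print(ABS_ENTRY.size, REL_ENTRY.size)                     # 16 8
print(decode_relative(entry_addr, raw) == (fault_addr, fixup_addr))
```

The scheme works as long as code and table sit within a signed 32-bit displacement of each other; struct.pack raises struct.error if an offset does not fit.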
Keir Fraser [Fri, 24 Dec 2010 08:42:52 +0000 (08:42 +0000)]
blkif: add placeholder for packet extension to block interface
While the corresponding implementation has been in our trees for quite
a while, it's in a state that doesn't make it suitable for submission,
and the original author having left the company leaves open to find
someone to complete this work. Yet to prevent problems with other
interface extensions we'd like to keep the slot in the number space
reserved for the purpose it has been serving here.
Keir Fraser [Fri, 24 Dec 2010 08:39:42 +0000 (08:39 +0000)]
x86 xsave: supports xsave (CPUID:0xD) enumeration for all sub-leaves.
Specifically, it fixes the following issues:
1. The sub-leaves of CPUID:0x0000000D aren't contiguous. Hypervisor
shouldn't use register values to stop the enumeration. This patch
moves checking on XSAVE sub-leaves out of if-else statement. It also
bumps up sub-leaves to 63.
2. It creates a common function for xsave.
3. The main leaf 0 of CPUID:0x0000000D in current Xen is broken,
especially ECX and EBX registers. This patch cleans it up.
4. It adds support for detecting the EBX value of CPUID:0x0000000D main leaf
0 on-the-fly.
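Point 1 can be illustrated by comparing two ways of walking the sub-leaves. In this Python sketch the CPUID results are a made-up dictionary mapping sub-leaf index to (EAX, EBX, ECX, EDX); the function names are hypothetical, and treating EAX == 0 as "component not present" is the assumption used to detect empty sub-leaves.

```python
EMPTY = (0, 0, 0, 0)

def enumerate_until_empty(subleaves, limit=64):
    """Stops at the first empty sub-leaf: misses anything after a gap."""
    found = []
    for idx in range(2, limit):
        if subleaves.get(idx, EMPTY)[0] == 0:
            break
        found.append(idx)
    return found

def enumerate_all(subleaves, limit=64):
    """Visits every sub-leaf from 2 up to 63 and keeps the non-empty ones."""
    return [idx for idx in range(2, limit)
            if subleaves.get(idx, EMPTY)[0] != 0]

# Made-up register values with a gap between sub-leaf 2 and sub-leaf 62.
fake_cpuid_0xd = {2: (256, 576, 0, 0), 62: (128, 832, 0, 0)}

print(enumerate_until_empty(fake_cpuid_0xd))  # [2]
print(enumerate_all(fake_cpuid_0xd))          # [2, 62]
```

Because valid sub-leaves need not be contiguous, only the second loop reports the component behind sub-leaf 62.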
Keir Fraser [Fri, 24 Dec 2010 08:32:20 +0000 (08:32 +0000)]
credit2: Different unbalance tolerance for underloaded and overloaded queues
Allow the "unbalance tolerance" -- the amount of difference between
two runqueues that will be allowed before rebalancing -- to differ
depending on how busy the runqueue is. If it's less than 100%,
default to a difference of 1.0; if it's more than 100%, default to a
tolerance of 0.125.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
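The load-dependent tolerance can be written down in a few lines. This is a sketch with assumptions: load is expressed as a fraction of capacity (1.0 meaning 100%), a load of exactly 100% is treated as overloaded, and the helper names and the choice of which runqueue's load selects the tolerance are hypothetical. Only the two default values come from the message.

```python
UNDERLOAD_TOLERANCE = 1.0    # default when the runqueue is below 100% load
OVERLOAD_TOLERANCE = 0.125   # default when the runqueue is above 100% load

def unbalance_tolerance(load):
    """Pick the tolerance for a runqueue; load 1.0 means 100% busy.
    Treating exactly 100% as overloaded is an assumption."""
    return UNDERLOAD_TOLERANCE if load < 1.0 else OVERLOAD_TOLERANCE

def should_rebalance(our_load, other_load):
    """Rebalance only when the load difference exceeds our tolerance."""
    return abs(our_load - other_load) > unbalance_tolerance(our_load)

print(should_rebalance(0.5, 1.0))   # underloaded: a 0.5 gap is tolerated
print(should_rebalance(1.5, 1.0))   # overloaded: the same gap triggers work
```

An underloaded runqueue tolerates a large imbalance because nothing is waiting, while an overloaded one reacts to small differences since vcpus are already competing for time.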
Keir Fraser [Fri, 24 Dec 2010 08:31:54 +0000 (08:31 +0000)]
credit2: Introduce a loadavg-based load balancer
This is a first-cut at getting load balancing. I'm first working on
looking at behavior I want to get correct; then, once I know what kind
of behavior works well, then I'll work on getting it efficient.
The general idea is when balancing runqueues, look for the runqueue
whose loadavg is the most different from ours (higher or lower).
Then, look for a transaction which will bring the loads closest
together: either pushing a vcpu, pulling a vcpu, or swapping them.
Use the per-vcpu load to calculate the expected load after the
exchange.
The current algorithm looks at every combination, which is O(N^2).
That's not going to be suitable for workloads with large numbers of
vcpus (such as highly consolidated VDI deployments). I'll make a more
efficient algorithm once I've experimented and determined what I think
is the best load-balancing behavior.
At the moment, balance from a runqueue every time the credit resets.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
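The exhaustive search over push, pull and swap candidates can be sketched as follows. This Python model is an illustration, not the scheduler code: runqueues are plain dictionaries of per-vcpu load, and the function name and return shape are hypothetical.

```python
from itertools import product

def best_exchange(ours, theirs):
    """Try every push, pull and swap between two runqueues.

    ours/theirs map vcpu name -> per-vcpu load. Returns (push, pull, diff),
    where push/pull is the vcpu to move in each direction (or None) and
    diff is the expected load difference after the exchange.
    """
    load_ours = sum(ours.values())
    load_theirs = sum(theirs.values())
    best = (None, None, abs(load_ours - load_theirs))  # doing nothing
    # None stands for "move no vcpu in this direction"; the nested
    # combination of both lists is what makes the search O(N^2).
    for push, pull in product([None] + list(ours), [None] + list(theirs)):
        push_load = ours.get(push, 0.0)
        pull_load = theirs.get(pull, 0.0)
        new_ours = load_ours - push_load + pull_load
        new_theirs = load_theirs + push_load - pull_load
        diff = abs(new_ours - new_theirs)
        if diff < best[2]:
            best = (push, pull, diff)
    return best

ours = {"a": 1.0, "b": 0.5}
theirs = {"c": 0.25, "d": 0.125}
print(best_exchange(ours, theirs))  # ('b', None, 0.125)
```

In this example pushing "b" across leaves loads of 1.0 and 0.875, which is closer than any pull or swap, so the search picks a plain push.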
Keir Fraser [Fri, 24 Dec 2010 08:31:04 +0000 (08:31 +0000)]
credit2: Migrate request infrastructure
Put in infrastructure to allow a vcpu to request to migrate to a
specific runqueue. This will allow a load balancer to choose running
VMs to migrate, and know they will go where expected when the VM is
descheduled.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Keir Fraser [Fri, 24 Dec 2010 08:30:42 +0000 (08:30 +0000)]
credit2: Track expected load
As vcpus are migrated, track how we expect the load to change. This
helps smooth migrations when the balancing doesn't take immediate
effect on the load average. In theory, if vcpu activity remains
constant, then the measured avgload should converge to the balanced
avgload.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>