Keir Fraser [Tue, 26 May 2009 10:52:31 +0000 (11:52 +0100)]
blktap2: a completely rewritten blktap implementation
Benefits to blktap2 over the old version of blktap:
* Isolation from xenstore - Blktap devices are now created directly on
the linux dom0 command line, rather than being spawned in response
to XenStore events. This is handy for debugging, makes blktap
generally easier to work with, and is a step toward a generic
user-level block device implementation that is not Xen-specific.
* Improved tapdisk infrastructure: simpler request forwarding, new
request scheduler, request merging, more efficient use of AIO.
* Improved tapdisk error handling and memory management. No
allocations on the block data path, IO retry logic to protect
guests
transient block device failures. This has been tested and is known
to work on weird environments such as NFS soft mounts.
* Pause and snapshot of live virtual disks (see xmsnap script).
* VHD support. The VHD code in this release has been rigorously
tested, and represents a very mature implementation of the VHD
image
format.
* No more duplication of mechanism with blkback. The blktap kernel
module has changed dramatically from the original blktap. Blkback
is now always used to talk to Xen guests, blktap just presents a
Linux gendisk that blkback can export. This is done while
preserving the zero-copy data path from domU to physical device.
These patches deprecate the old blktap code, which can hopefully be
removed from the tree completely at some point in the future.
Keir Fraser [Tue, 26 May 2009 10:05:04 +0000 (11:05 +0100)]
Transcendent memory ("tmem") for Xen.
Tmem, when called from a tmem-capable (paravirtualized) guest, makes
use of otherwise unutilized ("fallow") memory to create and manage
pools of pages that can be accessed from the guest either as
"ephemeral" pages or as "persistent" pages. In either case, the pages
are not directly addressible by the guest, only copied to and fro via
the tmem interface. Ephemeral pages are a nice place for a guest to
put recently evicted clean pages that it might need again; these pages
can be reclaimed synchronously by Xen for other guests or other uses.
Persistent pages are a nice place for a guest to put "swap" pages to
avoid sending them to disk. These pages retain data as long as the
guest lives, but count against the guest memory allocation.
Tmem pages may optionally be compressed and, in certain cases, can be
shared between guests. Tmem also handles concurrency nicely and
provides limited QoS settings to combat malicious DoS attempts.
Save/restore and live migration support is not yet provided.
Tmem is primarily targeted for an x86 64-bit hypervisor. On a 32-bit
x86 hypervisor, it has limited functionality and testing due to
limitations of the xen heap. Nearly all of tmem is
architecture-independent; three routines remain to be ported to ia64
and it should work on that architecture too. It is also structured to
be portable to non-Xen environments.
Tmem defaults off (for now) and must be enabled with a "tmem" xen boot
option (and does nothing unless a tmem-capable guest is running). The
"tmem_compress" boot option enables compression which takes about 10x
more CPU but approximately doubles the number of pages that can be
stored.
Tmem can be controlled via several "xm" commands and many interesting
tmem statistics can be obtained. A README and internal specification
will follow, but lots of useful prose about tmem, as well as Linux
patches, can be found at http://oss.oracle.com/projects/tmem .
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Keir Fraser [Tue, 26 May 2009 09:03:09 +0000 (10:03 +0100)]
xend: Fix xm pci-detach for inactive devices
In the case where a device is attached to an inactive domain
and then removed before the domain is activated it won't have
a vslot assigned, but it should still be valid to remove it.
I don't think that there are any other cases where vslot can
be invalid. Signed-off-by: Simon Horman <horms@verge.net.au>
Keir Fraser [Tue, 26 May 2009 08:58:38 +0000 (09:58 +0100)]
Add support for superpages (hugepages) in PV domain
This patch adds the option "superpages" to the domain configuration
file. If it is set, the domain is populated using 2M pages.
This code does not support fallback to small pages. If the domain can
not be created with 2M pages, the create will fail.
The patch also includes support for saving and restoring domains with
the superpage flag set. However, if a domain has freed small pages
within its physical page array and then extended the array, the
restore will fill in those freed pages. It will then attempt to
allocate more than its memory limit and will fail. This is
significant because apparently Linux does this during boot, thus a
freshly booted Linux image can not be saved and restored successfully.
Keir Fraser [Tue, 26 May 2009 08:54:53 +0000 (09:54 +0100)]
minios: replace mktime implementation
In the efforts to clarify MiniOS license it came to my attention that
few portions of MiniOS were taken from other GPL projects, one of them
is the mktime implementation. This patch replaces the current GPL
licensed mktime implementation with a different and BSD licensed
version.
Keir Fraser [Tue, 26 May 2009 08:50:35 +0000 (09:50 +0100)]
stubdom: 'file' based disk sharing
Allow 'file' based disks, that are blkback based disks, to be shared
between the guest domain and the stubdom. It does so exploiting the
same exception introduced in the previous patch "stubdoms phy disks
sharing". Now we can remove the hack in stubdom-dm that forces "file"
disks to be opened using blktap instead of blkback.
Keir Fraser [Wed, 20 May 2009 15:02:50 +0000 (16:02 +0100)]
ACPI/NUMA: Improve SRAT parsing
This is to properly handle SRAT rev 2 extended proximity domain
values.
Also a first step to eliminate the redundant definitions of
ACPI provided table structures (Linux eliminated all of the duplicates
from include/linux/acpi.h in 2.6.21).
Portions based on a Linux patch from Kurt Garloff <garloff@suse.de>
and Alexey Starikovskiy <astarikovskiy@suse.de>.
Keir Fraser [Wed, 20 May 2009 14:38:34 +0000 (15:38 +0100)]
x86-64: also handle virtual aliases of Xen image pages
With the unification of the heaps, the pages freed from the Xen boot
image now can also end up being allocated to a domain, and hence the
respective aliases need handling when such pages get their
cacheability attributes changed.
Rather than establishing multiple mappings with non-WB attributes
(which temporarily still can cause aliasing issues), simply unmap
those pages from the Xen virtual space, and re-map them (to allow re-
establishing of eventual large page mappings) when the cachability
attribute for them gets restored to normal (WB).
Keir Fraser [Wed, 20 May 2009 14:35:32 +0000 (15:35 +0100)]
x86: don't map more than the allocated space for frame_table
This is to avoid undue virtual address aliases in case the over-mapped
pages happen to get allocated to a domain, and then get their
cacheability attributes changed.
At the same time, use 1Gb mappings if possible and reasonable.
Keir Fraser [Tue, 19 May 2009 22:28:25 +0000 (23:28 +0100)]
x86: Fix the P2M audit code.
It currently doesn't even compile; with this patch applied, it
compiles and didn't immediately explode as soon as I started a VM.
I've not given it much testing beyond that, though.
Signed-off-by: Steven Smith <steven.smith@citrix.com>
Keir Fraser [Tue, 19 May 2009 13:17:56 +0000 (14:17 +0100)]
stubdom: Rebuild the ocaml runtime libraries with the options needed
if they are to be linked with object files created by ocamlc and the minios
kernel.
This is needed to build stubdoms written in ocaml.
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Keir Fraser [Tue, 19 May 2009 01:18:48 +0000 (02:18 +0100)]
xend: Make hotplug script timeouts configurable
In some configurations, when dom0 is busy with I/O, it may take
several minutes to complete all hotplug scripts required when a new
domain is being created. As device create timeout is set to 100
seconds, users get "hotplug scripts not working" error instead of a
new domain.
This patch makes both DEVICE_CREATE_TIMEOUT and DEVICE_DESTROY_TIMEOUT
configurable in xend-config.sxp to allow users to easily adapt hotplug
timeouts to their environment.
Keir Fraser [Tue, 19 May 2009 00:37:19 +0000 (01:37 +0100)]
xend: solve issues with xm block-configure command.
In the case of inactive managed domains:
The following error occurs currently. We cannot change the
configuration of the VBD by using xm block-configure. Of course,
using xm block-detach and xm block-attach instead of xm
block-configure, we can change it. However, I'd like to change it by
using xm block-configure.
In the case of active domains:
Another problem occurs after a domain was rebooted. Even if we
change a configuration of a VBD in the domain by using xm
block-configure, the configuration of the VBD is reverted to previous
configuration after the domain was rebooted.
Keir Fraser [Tue, 19 May 2009 00:31:26 +0000 (01:31 +0100)]
x86, cpufreq: fix ondemand governor to take aperf/mperf feedback
APERF/MPERF MSRs provides feedback about actual freq in
eplased time, which could be different from requested freq by
governor. However currently ondemand governor only takes that
feedback at freq down path. We should do that for scale up too.
Keir Fraser [Fri, 15 May 2009 07:12:39 +0000 (08:12 +0100)]
vt-d: Fix interrupt remapping for multiple IOAPICs
Current IOAPIC interrupt remapping code assumes there is only one
IOAPIC in system. It brings problem when there are more than one
IOAPIC in system. This patch extends ioapic_pin_to_intremap_index[]
array to handle multiple IOAPICs case.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Keir Fraser [Thu, 14 May 2009 14:46:04 +0000 (15:46 +0100)]
xen public: make mmuext_op's vcpumask field const
Linux started to pass around pointers to 'const cpumask_t' a while ago,
and passing such a pointer to set_xen_guest_handle() requires that the
field be a handle for a constant type in order to avoid compiler
warnings.
Keir Fraser [Wed, 13 May 2009 09:39:44 +0000 (10:39 +0100)]
x86 vmx: Ensure debug-mode intercept for int3 and debug exceptions are
reinstated when resetting EXCEPTION_BIRTMAP entry in VMCS after
exiting real mode.
Keir Fraser [Wed, 13 May 2009 09:28:35 +0000 (10:28 +0100)]
passthrough: Fix PCI hot-plug option parsing
When a PCI function is passed-through extra options may be passed
through.
In the case of boot-time PCI pass-through the documented format is:
[dom:]bus:dev.slot[@vslot][[,opt]...]
e.g.
00:01.00.1@7,msitranslate=3D1
In the case of PCI hot-plug the xm pci-attach command take the
following arguments:
[-o opt[,opt]...] [dom:]bus:dev.slot [vslot]
e.g.
-o msitranslate=3D1 00:01.00.1 7
These xm ends up passing these to xem-qemu as:
[dom:]bus:dev.slot[[,opt]...][@vslot]
e.g.
00:01.00.1,msitranslate=3D1@7
Note that the option and the vslot have are transposed when
compared to the format used by boot-time PCI pass-through.
The parser inside qemu-xen can only handle the format used by
boot-time PCI pass-through and because of this ignores
any options passed by hot-plug.
This patch alters format used by hot-plug to match the parser.
Keir Fraser [Fri, 8 May 2009 10:50:12 +0000 (11:50 +0100)]
x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
context or with IRQs disabled. Ensure this by deferring to tasklet
(softirq) context if required.
Keir Fraser [Thu, 7 May 2009 18:32:10 +0000 (19:32 +0100)]
Permit user to suppress passing --prefix to setup.py
We change all invocations of setup.py as follows:
* use $(PYTHON) instead of `python' so that the user can specify
an alternative python version if they need to. If not set it
defaults to `python' in Config.mk.
* pass --prefix=$(PREFIX) via a new make variable
$(PYTHON_PREFIX_ARG). This allows a user to suppress the
--prefix=... argument entirely by setting PYTHON_PREFIX_ARG=''.
This will work around the bug described here
https://bugs.launchpad.net/ubuntu/+bug/362570
where passing --prefix=/usr/local (which ought to have no effect as
/usr/local is the default prefix) changes which subdirectory
distutils chooses, and results in the files being installed in
site-packages which is not on the default search path.
Users not affected by this python packaging bug should not set
PYTHON_PREFIX_ARG and their builds will not be affected. (Provided
PREFIX did not contain spaces. People who put spaces in PREFIX are
being quite optimistic.)
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
After the scheduler timer became suspended before entering cpu idle
state, the percpu timer_deadline is possible to be 0, i.e. no soft
timer in the queue. This case will cause unexpected large residency
percentage in C1 for the purely idle cpu.
Signed-off-by: Wei Gang <gang.wei@intel.com> Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
Update XEN_LINUX_GIT_REMOTEBRANCH to match changes made in upstream
repo. Needed if you want setting KERNELS=linux-2.6-pvops in
config/Linux.mk to work.
Signed-off-by: Alex Zeffert <alex.zeffert@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
op_pincpu method in SrvDomain.py does not currently work because
op_pincpu method gives string objects to a cpumap argument of
domain_pincpu method in XendDomain.py though the cpumap argument
expects list objects.
This patch solves the above problem as follows.
op_pincpu method gives string objects to the cpumap argument as is,
because op_pincpu method cannot give list objects to the cpumap
argument.
Instead, domain_pincpu method expects that the cpumap argument is
string objects, then domain_pincpu method converts the cpumap
argument into list objects.
Also, the patch modifies two methods (except for op_pincpu method)
calling domain_pincpu method. The methods give string objects to
the cpumap argument instead of list objects.
Network support is still provided the same way: using the tap
interface, created in qemu using netfront.
The lwip stack is still available to avoid additional compilation
issues.
However the stubdom is not going to have its own vif anymore,
this means that the only vnc server supported is the one in dom0.
You can still enable the vnc server in a stubdom at compile time, if
you want so.
Probably the most important change caused by this patch to xen users
is that you don't have to specify two vif in the stubdom config file
anymore, but just one:
Prior to changset 19510:5c69f98c348e - 'xm, xend: Replace "vslt" with
"vslot"', both vslt and vslot were used in the xm code, often fairly
arbitrarily.
However, in the dictionary that describes a pci function both vslt and
vslot were present. vslt stored the slot assigned to the function. And
vslot stored the slot the user requested for the function, or
AUTO_PHP_SLOT if no slot was requested.
With the renaming these two values got merged into a single entry.
This patch un-merges them by renaming the what was vslot to
requested_vslot.
So an out of chronological order list of name changes is:
xend: Do not overwrite xauthority and display with empty values
Display and xauthority vars are read from vmConfig['platform'] first,
then they are read again from dev_info.
However if the user does not set those variable in the config file,
dev_info won't contain them, hence we are going to overwrite the
current significant values with null.
This patch fixes the problem setting display and xauthority to the
current values if dev_info does not contain them.
This patch removes the need for a second configuration file for
stubdoms: it is going to be automatically generated by the script
stubdom-dm using command line options and xenstore to find any needed
information.
The configuration script will be placed under /etc/xen/stubdoms and
automatically removed when the domain is destroyed.
The only change needed in xend is not to write on xenstore sdl,
opengl and serial command line options for qemu, because stubdoms do
not support them.
It is safe to remove those two options from xenstore because qemu does
not use xenstore to read commans line options.
Finally this patch fixes blkfront disconnections from backends and
display and xauthority variables for pv guests.
If ${netdev} is bonding, brctl addif ${bridge} ${pdev} fails:
can't add ${pdev} to bridge ${bridge}: Invalid argument
Because ${pdev} has no slaves at this point.=20
# Notice that ifdown ${netdev} clears slaves of ${netdev}.
This patch restores slaves before add_to_bridge2 ${bridge} ${pdev}.
The following changeset broke booting xen-ia64 on some kinds of ia64 boxes.
http://xenbits.xensource.com/ext/ia64/xen-unstable.hg/rev/3fd8f9b34941
The tasklet_schedule call raise_softirq().
Because raise_softirq() use per_cpu, if we access per_cpu before cpu_init()
the behavior would be unexpected.
Event-channel setup: Re-bind if the connection becomes unbound (e.g.,
due to 'slow' domain suspend cancellation), even if the remote port
identifier has not changed.
Domain logging: Only open log file once (don't leak fds) and fix a
small memory leak.
Evtchn changes based on a patch by Jiri Denemark <jdenemar@redhat.com>
x86: fix next->vcpu_dirty_cpumask checking in context_switch()
There was a timing window where flush_tlb_mask() could be called with
an empty mask (triggering a WARN_ON() in send_IPI_mask_flat() along
with APIC errors) because rather than using the already taken snapshot
of next's vcpu_dirty_cpumask struct vcpu's field was used directly,
which can get its only bit cleared by remote CPUs.
Replacing the structure field's use by the local variable then made
the inner cpus_empty() check completely redundant with the one in the
surrounding if()'s condition.
This patch updates the Makefile to download the latest version of
tboot, which supports the interface changes made recently. This
should go into 3.4, since 3.4 supports the new tboot interface.
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
A few ioapic redirection entries are initialized by hypervisor before
enabling iommu hardware. This patch copies those entries from ioapic
redirection table into interrupt remapping table after interrupt
remapping table has been allocated.
cpuidle: Add support for Always Running APIC timer, CPUID_0x6_EAX_Bit2.
This bit means the APIC timer continues to run even when CPU is
in deep C-states.
The advantage is that we can use LAPIC timer on these CPUs
always, and there is no need for "slow to read and program"
external timers (HPET/PIT) and the timer broadcast logic
and related code in C-state entry and exit.
Refer to the latest Intel SDM Vol 2A
(http://www.intel.com/products/processor/manuals/index.htm)
x86: avoid EPT scanning errors when splitting superpages during live migration
Since Xen did not lock the p2m table for p2m table reading, when
splitting the large page during live migration, we should make sure
the path of EPT entries be modified are always there while other CPUs
may access the super entries at the same time.
xend: clean up qemu-dm related items on domain destroy
Some qemu-dm related stuffs might be left behind after the domain is
destroyed.
- xenstore entry, /local/domain/0/device-model/<domid>
- named pipes, /var/run/tap/qemu-{read,write}-<domid>