Keir Fraser [Fri, 29 May 2009 08:19:30 +0000 (09:19 +0100)]
[VTD] laying the ground work for ATS
These changes lay the groundwork for enabling ATS in Xen. They will be
followed by a patch which enables PCI MMCFG, which is needed to
actually enable ATS functionality.
Keir Fraser [Thu, 28 May 2009 10:07:19 +0000 (11:07 +0100)]
Serialize iptables calls in hotplug scripts
iptables cannot correctly handle situations when more than one command
is trying to set netfilter rules. In such situations, iptables may fail
with EAGAIN, which results in iptables: Unknown error 18446744073709551615.
Such a situation can easily happen when multiple network devices are
configured for a domain, as vif hotplug scripts are called in parallel
for all of the network devices.
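The serialization idea can be sketched with an advisory file lock. The real hotplug scripts are shell scripts, so this Python version only illustrates the technique. The names exclusive_lock and run_iptables, and whatever lock path a caller passes in, are hypothetical and not taken from the patch.

```python
import contextlib
import fcntl
import subprocess


@contextlib.contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    with open(lock_path, "w") as lock_file:
        # Blocks until no other holder has the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_iptables(args, lock_path):
    """Run one iptables command under the lock (hypothetical helper)."""
    with exclusive_lock(lock_path):
        return subprocess.call(["iptables"] + list(args))
```

With every vif script taking the same lock before touching netfilter, only one iptables command runs at a time, at the cost of the scripts no longer running that step in parallel.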
Keir Fraser [Thu, 28 May 2009 10:01:00 +0000 (11:01 +0100)]
blktap2: Fix build with gcc3. gcc3 cannot handle defining a function
which is passed a struct-by-value which is not yet fully defined. Thus
defining a request struct which contains a pointer to a function which
is passed-by-value an instance of that request structure is
impossible. We work around it by defining the function pointer as
void* and then casting in one place.
Keir Fraser [Thu, 28 May 2009 09:06:01 +0000 (10:06 +0100)]
xend: Update info['platform']['pci']
This patch updates info['platform']['pci'] for PCI device
assignment to domains.
When a domain is started, xend uses xc.test_assign_device to confirm
whether PCI devices can be assigned to the domain. For this check,
info['platform']['pci'] must hold an appropriate value.
However, the value may not be appropriate, because
info['platform']['pci'] is usually not updated even if the PCI
device configuration of the domain was changed by using xm
pci-attach/detach. This patch updates info['platform']['pci'] to the
appropriate value when domains are started.
Keir Fraser [Thu, 28 May 2009 09:03:29 +0000 (10:03 +0100)]
blktap2: fix tapdisk-channel.c
This patch fixes the following error.
cc1: warnings being treated as errors
In file included from usr/include/sys/resource.h:25,
from tapdisk-daemon.c:559:
usr/include/bits/resource.h: In function 'main':
usr/include/bits/resource.h:33: warning: ISO C90 forbids mixed
declarations and code
Keir Fraser [Thu, 28 May 2009 09:02:57 +0000 (10:02 +0100)]
blktap2: fix makefile of blktap2
- clean up to use SUBDIRS-y
- With parallel make, libvhd might not be built before the link
step. Guarantee that it is.
- use LDFLAGS, which is set by upper level makefiles, for the link.
Keir Fraser [Thu, 28 May 2009 08:51:43 +0000 (09:51 +0100)]
x86 vmx: Unrestricted guest (realmode) support
This feature allows fully virtualized guests to run real mode and
unpaged mode code natively in VMX mode when EPT is turned on. With
unrestricted guest there is no need to emulate the guest real mode
code in the vm86 container or in the emulator. Guest big real
mode code also works as it does natively.
This patch enhances Xen to use the unrestricted guest feature if
available on the processor. It also adds a new Xen parameter to
disable the unrestricted guest feature at boot time.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Keir Fraser [Thu, 28 May 2009 08:41:59 +0000 (09:41 +0100)]
minios: implement ffs, ffsl and ffsll.
The first function is compiled only when minios is built without
newlib, since newlib already provides an implementation of ffs.
On the other hand, ffsl and ffsll are always compiled because newlib
lacks those functions.
This patch also provides an implementation for __ffsti2 and __ffsdi2
because they are needed by gcc in order to successfully link ffsll.
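The semantics of the ffs family can be illustrated as follows. The real implementation is C inside minios. This Python sketch only models the behaviour (1-based index of the least significant set bit, 0 when no bit is set), and the bits parameter standing in for the width of int, long or long long is an assumption made for illustration.

```python
def ffs(value, bits=32):
    """Return the 1-based index of the lowest set bit, or 0 if none is set."""
    value &= (1 << bits) - 1      # model a fixed-width C integer
    if value == 0:
        return 0
    # value & -value isolates the lowest set bit; bit_length() gives its
    # 1-based position.
    return (value & -value).bit_length()


def ffsll(value):
    """Same as ffs, but for a 64-bit operand."""
    return ffs(value, bits=64)
```

For example, ffs(8) is 4 and ffs(0) is 0, while a value whose only set bit lies above bit 31 needs the 64-bit variant to be found.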
Keir Fraser [Wed, 27 May 2009 11:00:51 +0000 (12:00 +0100)]
[IA64] adjust ia64 xc_domain_restore() signature
This patch fixes the following error.
ia64/xc_ia64_linux_restore.c:546: error: conflicting types for
xc_domain_restore
./xenguest.h:49: error: previous declaration of xc_domain_restore was
here
make[4]: *** [ia64/xc_ia64_linux_restore.o] Error 1
Keir Fraser [Wed, 27 May 2009 11:00:32 +0000 (12:00 +0100)]
[IA64] add ia64 _raw_rw_is_write_locked
This patch fixes the following link error.
xen/common/built_in.o: In function `_rw_is_write_locked':
xen/common/spinlock.c:249: undefined reference to
`_raw_rw_is_write_locked'
make[3]: *** [xen/xen-syms] Error 1
Keir Fraser [Wed, 27 May 2009 10:29:38 +0000 (11:29 +0100)]
Fix up the synchronisation around grant table map track handles.
At present, we're not doing any at all, so if a domain e.g. tries to
do two map operations at the same time from different vcpus then you
could end up with both operations getting back the same maptrack
handle.
Fix this problem by just shoving an enormous lock around grant table
operations. This is unlikely to be heavily contended, because netback
and blkback both restrict themselves to mapping on a single vcpu at a
time (globally for netback, and per-device for blkback), and most of
the interesting bits are already protected by the remote domain's
grant table lock anyway.
The uncontended acquisition cost might be significant for some
workloads. If that were the case, it might be worth acquiring
the lock only for multi-vcpu domains, since we only manipulate the
maptrack table in the context of one of the domain's vcpus. I've not
done that optimisation here, because I didn't want to think about what
would happen if e.g. a cpu got hot-unplugged from a domain while it
was performing a map operation.
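The locking approach can be sketched as a handle allocator guarded by one coarse lock. This is a simplified Python model, not the hypervisor code: the class name MaptrackTable, its free-list layout and its method names are hypothetical.

```python
import threading


class MaptrackTable:
    """Toy maptrack handle allocator protected by a single lock."""

    def __init__(self, size):
        self._free = list(range(size))   # handles not currently in use
        self._lock = threading.Lock()    # the one big lock

    def get_handle(self):
        """Return a free handle, or None if the table is exhausted."""
        with self._lock:
            if not self._free:
                return None
            return self._free.pop()

    def put_handle(self, handle):
        """Return a handle to the free list."""
        with self._lock:
            self._free.append(handle)
```

Because both the emptiness test and the removal happen under the same lock, two callers running concurrently cannot be handed the same handle.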
Signed-off-by: Steven Smith <steven.smith@citrix.com>
Keir Fraser [Wed, 27 May 2009 10:28:45 +0000 (11:28 +0100)]
xend: Device duplicate check fix
I checked the duplicate-check code and found that the check is made
only within a single domain, not across domains. We should check
tap/vbd devices across domains, so that another guest is not allowed
to use the same disk image in circumstances that could corrupt a
VM's disk.
The included patch denies adding a disk image in the following
circumstances:
1. We're adding read-only disk that's already used as write-exclusive
2. We're adding write-shared disk that's already used as
write-exclusive
3. We're adding write-exclusive disk that's already used
4. We're adding read-only disk that's already used as write-shared*
(because of I/O caching issues etc.)
The vif device duplicate check remains as it was: it is made in the
context of the current domain only, so that behaviour has been
preserved.
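The four denial rules above can be written as one predicate. This is a sketch, not the xend code: the function name may_add_disk is hypothetical, and the mode strings 'r', 'w' and 'w!' for read-only, write-exclusive and write-shared are assumed here.

```python
READ_ONLY = "r"          # assumed mode strings
WRITE_EXCLUSIVE = "w"
WRITE_SHARED = "w!"


def may_add_disk(new_mode, existing_modes):
    """Return True if an image already used in existing_modes may be added."""
    for used in existing_modes:
        if new_mode == WRITE_EXCLUSIVE:
            return False                                   # rule 3
        if used == WRITE_EXCLUSIVE and new_mode in (READ_ONLY, WRITE_SHARED):
            return False                                   # rules 1 and 2
        if used == WRITE_SHARED and new_mode == READ_ONLY:
            return False                                   # rule 4
    return True
```

Under these rules an unused image can be added in any mode, and combinations not listed (for example read-only on top of read-only) are still accepted.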
Signed-off-by: Michal Novotny <minovotn@redhat.com>
Keir Fraser [Wed, 27 May 2009 10:27:13 +0000 (11:27 +0100)]
x86 svm: Add support for Pause Filtering to AMD SVM
New AMD processors will support the Pause Filter Feature.
This feature creates a new field in the VMCB called Pause
Filter Count. If Pause Filter Count is greater than 0 and
intercepting PAUSEs is enabled, the processor will increment
an internal counter when a PAUSE instruction occurs instead
of intercepting. When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.
This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.
Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand. Default the Pause Filter Counter to 3000 to
detect the contended spinlocks.
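The counting behaviour described above can be modelled in a few lines. This is a conceptual sketch of the description in this message, not of the hardware or the Xen code: the class name PauseFilter is hypothetical, PAUSE interception is assumed to be enabled, and resetting the counter after an intercept is an assumption the text does not state.

```python
class PauseFilter:
    """Toy model of a PAUSE filter with a configurable threshold."""

    def __init__(self, filter_count=3000):
        self.filter_count = filter_count
        self.counter = 0

    def on_pause(self):
        """Return True if this PAUSE instruction causes an intercept."""
        if self.filter_count <= 0:
            return True              # no filtering: every PAUSE intercepts
        self.counter += 1
        if self.counter >= self.filter_count:
            self.counter = 0         # assumption: restart after an intercept
            return True
        return False
```

With the default of 3000, a spin of fewer than 1000 PAUSEs never reaches the hypervisor, while a long spin on a lock held by a descheduled VCPU does.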
Processor support for this feature is indicated by a CPUID
bit.
On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 1%. Further performance
improvement may be possible with a more sophisticated
yield algorithm.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Keir Fraser [Wed, 27 May 2009 10:19:38 +0000 (11:19 +0100)]
rombios: fix trying to boot from next device
If boot="ndc", rombios cannot try to boot the next device.
Because rombios jumps to the boot vector without pushing a return
address, gPXE code and the like cannot return if it fails to boot.
Keir Fraser [Wed, 27 May 2009 10:15:08 +0000 (11:15 +0100)]
Pass cpumasks by reference always.
Rather than passing cpumasks by value in all cases (which is
problematic for large NR_CPUS configurations), pass them 'by
reference' (i.e. through a pointer to a const cpumask).
On x86 this changes send_IPI_mask() to always only send IPIs to remote
CPUs (meaning any caller needing to handle the current CPU as well has
to do so on its own).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
Keir Fraser [Wed, 27 May 2009 09:38:51 +0000 (10:38 +0100)]
x86: eliminate hard-coded NR_IRQS
... splitting it into global nr_irqs (determined at boot time) and
per-domain nr_pirqs (derived from nr_irqs and a possibly command line
specified value, which probably should later become a per-domain
config setting).
This has the (desirable imo) side effect of reducing the size of
struct hvm_irq_dpci from requiring an order-3 page to order-2 (on
x86-64), which nevertheless still is too large.
However, there is now a variable size bit array on the stack in
pt_irq_time_out() - while for the moment this probably is okay, it
certainly doesn't look nice. However, replacing this with a static
(pre-)allocation also seems less than ideal, because that would
require at least min(d->nr_pirqs, NR_VECTORS) bit arrays of
d->nr_pirqs bits, since this bit array is used outside of the
serialized code region in that function, and keeping the domain's
event lock acquired across pirq_guest_eoi() doesn't look like a good
idea either.
The IRQ- and vector-indexed arrays hanging off struct hvm_irq_dpci
could in fact be changed further to dynamically use the smaller of the
two ranges for indexing, since there are other assumptions about a
one-to-one relationship between IRQs and vectors here and elsewhere.
Additionally, it seems to me that struct hvm_mirq_dpci_mapping's
digl_list and gmsi fields could really be overlaid, which would yield
significant savings since this structure always gets instantiated in
the form of d->nr_pirqs (as per the above could also be the smaller of
this and NR_VECTORS) dimensioned arrays.
Keir Fraser [Tue, 26 May 2009 10:52:31 +0000 (11:52 +0100)]
blktap2: a completely rewritten blktap implementation
Benefits to blktap2 over the old version of blktap:
* Isolation from xenstore - Blktap devices are now created directly on
the linux dom0 command line, rather than being spawned in response
to XenStore events. This is handy for debugging, makes blktap
generally easier to work with, and is a step toward a generic
user-level block device implementation that is not Xen-specific.
* Improved tapdisk infrastructure: simpler request forwarding, new
request scheduler, request merging, more efficient use of AIO.
* Improved tapdisk error handling and memory management. No
allocations on the block data path, IO retry logic to protect
guests from transient block device failures. This has been tested
and is known to work on weird environments such as NFS soft mounts.
* Pause and snapshot of live virtual disks (see xmsnap script).
* VHD support. The VHD code in this release has been rigorously
tested, and represents a very mature implementation of the VHD
image format.
* No more duplication of mechanism with blkback. The blktap kernel
module has changed dramatically from the original blktap. Blkback
is now always used to talk to Xen guests, blktap just presents a
Linux gendisk that blkback can export. This is done while
preserving the zero-copy data path from domU to physical device.
These patches deprecate the old blktap code, which can hopefully be
removed from the tree completely at some point in the future.
Keir Fraser [Tue, 26 May 2009 10:05:04 +0000 (11:05 +0100)]
Transcendent memory ("tmem") for Xen.
Tmem, when called from a tmem-capable (paravirtualized) guest, makes
use of otherwise unutilized ("fallow") memory to create and manage
pools of pages that can be accessed from the guest either as
"ephemeral" pages or as "persistent" pages. In either case, the pages
are not directly addressable by the guest, only copied to and fro via
the tmem interface. Ephemeral pages are a nice place for a guest to
put recently evicted clean pages that it might need again; these pages
can be reclaimed synchronously by Xen for other guests or other uses.
Persistent pages are a nice place for a guest to put "swap" pages to
avoid sending them to disk. These pages retain data as long as the
guest lives, but count against the guest memory allocation.
Tmem pages may optionally be compressed and, in certain cases, can be
shared between guests. Tmem also handles concurrency nicely and
provides limited QoS settings to combat malicious DoS attempts.
Save/restore and live migration support is not yet provided.
Tmem is primarily targeted for an x86 64-bit hypervisor. On a 32-bit
x86 hypervisor, it has limited functionality and testing due to
limitations of the xen heap. Nearly all of tmem is
architecture-independent; three routines remain to be ported to ia64
and it should work on that architecture too. It is also structured to
be portable to non-Xen environments.
Tmem defaults off (for now) and must be enabled with a "tmem" xen boot
option (and does nothing unless a tmem-capable guest is running). The
"tmem_compress" boot option enables compression which takes about 10x
more CPU but approximately doubles the number of pages that can be
stored.
Tmem can be controlled via several "xm" commands and many interesting
tmem statistics can be obtained. A README and internal specification
will follow, but lots of useful prose about tmem, as well as Linux
patches, can be found at http://oss.oracle.com/projects/tmem .
Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
Keir Fraser [Tue, 26 May 2009 09:03:09 +0000 (10:03 +0100)]
xend: Fix xm pci-detach for inactive devices
In the case where a device is attached to an inactive domain
and then removed before the domain is activated it won't have
a vslot assigned, but it should still be valid to remove it.
I don't think that there are any other cases where vslot can
be invalid.
Signed-off-by: Simon Horman <horms@verge.net.au>
Keir Fraser [Tue, 26 May 2009 08:58:38 +0000 (09:58 +0100)]
Add support for superpages (hugepages) in PV domain
This patch adds the option "superpages" to the domain configuration
file. If it is set, the domain is populated using 2M pages.
This code does not support fallback to small pages. If the domain
cannot be created with 2M pages, creation will fail.
The patch also includes support for saving and restoring domains with
the superpage flag set. However, if a domain has freed small pages
within its physical page array and then extended the array, the
restore will fill in those freed pages. It will then attempt to
allocate more than its memory limit and will fail. This is
significant because apparently Linux does this during boot, so a
freshly booted Linux image cannot be saved and restored successfully.
Keir Fraser [Tue, 26 May 2009 08:54:53 +0000 (09:54 +0100)]
minios: replace mktime implementation
In the effort to clarify the MiniOS license, it came to my attention
that a few portions of MiniOS were taken from other GPL projects; one
of them is the mktime implementation. This patch replaces the current
GPL-licensed mktime implementation with a different, BSD-licensed
version.
Keir Fraser [Tue, 26 May 2009 08:50:35 +0000 (09:50 +0100)]
stubdom: 'file' based disk sharing
Allow 'file' based disks, which are blkback-based disks, to be shared
between the guest domain and the stubdom. It does so by exploiting the
same exception introduced in the previous patch "stubdoms phy disks
sharing". Now we can remove the hack in stubdom-dm that forces "file"
disks to be opened using blktap instead of blkback.
Keir Fraser [Wed, 20 May 2009 15:02:50 +0000 (16:02 +0100)]
ACPI/NUMA: Improve SRAT parsing
This is to properly handle SRAT rev 2 extended proximity domain
values.
Also a first step to eliminate the redundant definitions of
ACPI provided table structures (Linux eliminated all of the duplicates
from include/linux/acpi.h in 2.6.21).
Portions based on a Linux patch from Kurt Garloff <garloff@suse.de>
and Alexey Starikovskiy <astarikovskiy@suse.de>.
Keir Fraser [Wed, 20 May 2009 14:38:34 +0000 (15:38 +0100)]
x86-64: also handle virtual aliases of Xen image pages
With the unification of the heaps, the pages freed from the Xen boot
image now can also end up being allocated to a domain, and hence the
respective aliases need handling when such pages get their
cacheability attributes changed.
Rather than establishing multiple mappings with non-WB attributes
(which temporarily still can cause aliasing issues), simply unmap
those pages from the Xen virtual space, and re-map them (to allow
re-establishing of possible large page mappings) when the
cacheability attribute for them gets restored to normal (WB).
Keir Fraser [Wed, 20 May 2009 14:35:32 +0000 (15:35 +0100)]
x86: don't map more than the allocated space for frame_table
This is to avoid undue virtual address aliases in case the over-mapped
pages happen to get allocated to a domain, and then get their
cacheability attributes changed.
At the same time, use 1Gb mappings if possible and reasonable.
Keir Fraser [Tue, 19 May 2009 22:28:25 +0000 (23:28 +0100)]
x86: Fix the P2M audit code.
It currently doesn't even compile; with this patch applied, it
compiles and didn't immediately explode as soon as I started a VM.
I've not given it much testing beyond that, though.
Signed-off-by: Steven Smith <steven.smith@citrix.com>
Keir Fraser [Tue, 19 May 2009 13:17:56 +0000 (14:17 +0100)]
stubdom: Rebuild the ocaml runtime libraries with the options needed
if they are to be linked with object files created by ocamlc and the minios
kernel.
This is needed to build stubdoms written in ocaml.
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Keir Fraser [Tue, 19 May 2009 01:18:48 +0000 (02:18 +0100)]
xend: Make hotplug script timeouts configurable
In some configurations, when dom0 is busy with I/O, it may take
several minutes to complete all hotplug scripts required when a new
domain is being created. As the device create timeout is set to 100
seconds, users get a "hotplug scripts not working" error instead of a
new domain.
This patch makes both DEVICE_CREATE_TIMEOUT and DEVICE_DESTROY_TIMEOUT
configurable in xend-config.sxp to allow users to easily adapt hotplug
timeouts to their environment.
Keir Fraser [Tue, 19 May 2009 00:37:19 +0000 (01:37 +0100)]
xend: solve issues with xm block-configure command.
In the case of inactive managed domains:
Currently an error occurs, and we cannot change the configuration
of the VBD by using xm block-configure. Of course, we can change it
by using xm block-detach and xm block-attach instead of xm
block-configure. However, I'd like to be able to change it by
using xm block-configure.
In the case of active domains:
Another problem occurs after a domain is rebooted. Even if we
change the configuration of a VBD in the domain by using xm
block-configure, the configuration of the VBD reverts to the previous
configuration after the domain is rebooted.
Keir Fraser [Tue, 19 May 2009 00:31:26 +0000 (01:31 +0100)]
x86, cpufreq: fix ondemand governor to take aperf/mperf feedback
The APERF/MPERF MSRs provide feedback about the actual frequency over
the elapsed time, which can differ from the frequency requested by
the governor. However, the ondemand governor currently only takes
that feedback on the frequency-down path. We should do that for
scaling up too.
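The feedback in question is the ratio of the two counters over a sampling interval: APERF advances at the actual frequency and MPERF at a fixed reference frequency. A minimal sketch follows, with a hypothetical function name and with the fallback for an empty interval chosen purely for illustration. It does not reproduce the governor code.

```python
def measured_freq(max_freq_khz, aperf_delta, mperf_delta):
    """Estimate the average actual frequency over one sampling interval.

    aperf_delta and mperf_delta are the increases of the two counters
    during the interval.
    """
    if mperf_delta == 0:
        # Assumption: with no counter progress, report the reference value.
        return max_freq_khz
    return max_freq_khz * aperf_delta // mperf_delta
```

A governor that uses this value on both the scale-down and scale-up paths bases its decision on what the CPU actually ran at rather than on what was last requested.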
Keir Fraser [Fri, 15 May 2009 07:12:39 +0000 (08:12 +0100)]
vt-d: Fix interrupt remapping for multiple IOAPICs
The current IOAPIC interrupt remapping code assumes there is only one
IOAPIC in the system. This causes problems when there is more than
one IOAPIC in the system. This patch extends the
ioapic_pin_to_intremap_index[] array to handle the multiple-IOAPIC
case.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Keir Fraser [Thu, 14 May 2009 14:46:04 +0000 (15:46 +0100)]
xen public: make mmuext_op's vcpumask field const
Linux started to pass around pointers to 'const cpumask_t' a while ago,
and passing such a pointer to set_xen_guest_handle() requires that the
field be a handle for a constant type in order to avoid compiler
warnings.
Keir Fraser [Wed, 13 May 2009 09:39:44 +0000 (10:39 +0100)]
x86 vmx: Ensure debug-mode intercepts for int3 and debug exceptions are
reinstated when resetting the EXCEPTION_BITMAP entry in the VMCS after
exiting real mode.
Keir Fraser [Wed, 13 May 2009 09:28:35 +0000 (10:28 +0100)]
passthrough: Fix PCI hot-plug option parsing
When a PCI function is passed through, extra options may be passed
along with it.
In the case of boot-time PCI pass-through the documented format is:
[dom:]bus:dev.slot[@vslot][[,opt]...]
e.g.
00:01.00.1@7,msitranslate=1
In the case of PCI hot-plug the xm pci-attach command takes the
following arguments:
[-o opt[,opt]...] [dom:]bus:dev.slot [vslot]
e.g.
-o msitranslate=1 00:01.00.1 7
xm ends up passing these to qemu-xen as:
[dom:]bus:dev.slot[[,opt]...][@vslot]
e.g.
00:01.00.1,msitranslate=1@7
Note that the options and the vslot are transposed when
compared to the format used by boot-time PCI pass-through.
The parser inside qemu-xen can only handle the format used by
boot-time PCI pass-through and because of this ignores
any options passed by hot-plug.
This patch alters the format used by hot-plug to match the parser.
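The difference between the two orderings can be made concrete with two small formatters. This is only an illustration of the string layouts quoted above. The function names are hypothetical and the device string is treated as opaque.

```python
def boot_time_format(device, vslot=None, opts=()):
    """Build 'dev[@vslot][,opt...]', the ordering the parser understands."""
    text = device
    if vslot is not None:
        text += "@%s" % vslot
    for opt in opts:
        text += "," + opt
    return text


def old_hotplug_format(device, vslot=None, opts=()):
    """Build 'dev[,opt...][@vslot]', the ordering hot-plug used before."""
    text = device
    for opt in opts:
        text += "," + opt
    if vslot is not None:
        text += "@%s" % vslot
    return text
```

For the example device, the first function yields 00:01.00.1@7,msitranslate=1 and the second yields 00:01.00.1,msitranslate=1@7. Only the first matches what the parser expects.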
Keir Fraser [Fri, 8 May 2009 10:50:12 +0000 (11:50 +0100)]
x86 hvm: hvm_set_callback_irq_level() must not be called in IRQ
context or with IRQs disabled. Ensure this by deferring to tasklet
(softirq) context if required.
Keir Fraser [Thu, 7 May 2009 18:32:10 +0000 (19:32 +0100)]
Permit user to suppress passing --prefix to setup.py
We change all invocations of setup.py as follows:
* use $(PYTHON) instead of `python' so that the user can specify
an alternative python version if they need to. If not set it
defaults to `python' in Config.mk.
* pass --prefix=$(PREFIX) via a new make variable
$(PYTHON_PREFIX_ARG). This allows a user to suppress the
--prefix=... argument entirely by setting PYTHON_PREFIX_ARG=''.
This will work around the bug described here
https://bugs.launchpad.net/ubuntu/+bug/362570
where passing --prefix=/usr/local (which ought to have no effect as
/usr/local is the default prefix) changes which subdirectory
distutils chooses, and results in the files being installed in
site-packages which is not on the default search path.
Users not affected by this python packaging bug should not set
PYTHON_PREFIX_ARG and their builds will not be affected. (Provided
PREFIX did not contain spaces. People who put spaces in PREFIX are
being quite optimistic.)
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Now that the scheduler timer is suspended before entering the cpu
idle state, the percpu timer_deadline can be 0, i.e. no soft
timer in the queue. This case causes an unexpectedly large residency
percentage in C1 for a purely idle cpu.
Signed-off-by: Wei Gang <gang.wei@intel.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>
Update XEN_LINUX_GIT_REMOTEBRANCH to match changes made in upstream
repo. Needed if you want setting KERNELS=linux-2.6-pvops in
config/Linux.mk to work.
Signed-off-by: Alex Zeffert <alex.zeffert@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
The op_pincpu method in SrvDomain.py does not currently work, because
it passes string objects to the cpumap argument of the domain_pincpu
method in XendDomain.py, although the cpumap argument expects list
objects.
This patch solves the above problem as follows.
The op_pincpu method continues to pass string objects to the cpumap
argument as is, because it cannot pass list objects.
Instead, the domain_pincpu method now expects the cpumap argument to
be string objects and converts it into list objects itself.
The patch also modifies the two other methods that call the
domain_pincpu method. They now pass string objects to the cpumap
argument instead of list objects.
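The conversion that moves into domain_pincpu can be sketched as follows. The function name parse_cpumap is hypothetical, and the comma-separated string format is an assumption, since the message does not state what the string looks like.

```python
def parse_cpumap(cpumap):
    """Convert a string such as '0,2,3' into a list of CPU numbers.

    Assumes a comma-separated list of integers; empty fields are skipped.
    """
    return [int(cpu) for cpu in cpumap.split(",") if cpu.strip()]
```

Doing the conversion in one place means every caller can pass the same string form.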