Mfns for PV domains were not properly checked, potentially
allowing a buggy or malicious PV guest to crash Xen. Also,
use get_page/put_page to claim a reference to the pages
so they can't disappear out from under tmem's feet.
Revert 22186:7167d6dd5c7c "x86: Retry do_mmu_update() a few times"
It does not work reliably for a couple of reasons:
(1) page_lock() fails if a page is !PGT_validated, and a page can
remain in that state for unbounded time.
(2) in the kernel-side race that motivated this patch, pgd_pin() can
lose to vmalloc_sync_all() -- pgd_pin() can try to chaneg a pmd page's
type to l2_pagetable while
vmalloc_sync_all()->set_pmd()->do_mmu_update() has it temporarily
pinned as writable. This is hard to fix on the Xen side.
Hence I give up on this approach, revert the patch, and settle for
kernel-side patching only.
x86: Retry do_mmu_update() a few times when called on a pte whose type is in flux.
This can really happen -- all our PV Linux kernels have a race
between vmalloc_sync_all() and pgdir pinning/unpinning. The former is
protected by pgd_lock while the latter by mm->page_table_lock. Hence
they can happen concurrently, and vmalloc_sync_all() can attempt to
set_pmd() on a page directory which is in the process of being
pinned. This can confuse the hypervisor which may see a type change,
and hence fail do_mmu_update(). Until this patch. :-)
sched_credit: Raise bar for inter-socket migrations on mostly-idle systems
The credit scheduler ties to keep work balanced, even on a mostly idle
system. Unfortunately, if you have one VM burning cpu and another VM
idle, the effect is that the busy VM will flip back and forth between
sockets.
This patch addresses this, by only migrating to a different socket if
the number of idle processors is twice that of the socket the vcpu is
currently on.
This will only affect mostly-idle systems; as the system becomes more
busy, other load-balancing code will come into effect.
Several seconds of backward time drift per minute can be seen on a
RHEL6 HVM guest by switching the clocksource to 'acpi_pm' and then
running gettimeofday() in a loop. This is due to the accumulation
of small inaccuracies that are caused by shifting out the lower 32
bits when pmt_update_time() computes 'tmr_val'.
The patch makes sure that the lower 32 bits of the computed value
are not lost. They are saved in a new field 'not_accounted' in the
PMTState structure and are accounted the next time pmt_update_time()
is called.
C6 state with EOI issue fix for some Intel processors
There is an errata in some of Intel processors.
AAJ72. EOI Transaction May Not be Sent if Software Enters Core C6
During an Interrupt Service Routine
If core C6 is entered after the start of an interrupt service routine
but before a write to the APIC EOI register, the core may not send an
EOI transaction (if needed) and further interrupts from the same
priority level or lower may be blocked.
This patch fix this issue, by checking if ISR is pending before enter
deep Cx state. If so, it would use power->safe_state instead of deep
Cx state to prevent the above issue happen.
tmem (tools): move to new ABI version to handle long object-ids
After a great deal of discussion and review with linux
kernel developers, it appears there are "next-generation"
filesystems (such as btrfs, xfs, Lustre) that will not
be able to use tmem due to an ABI limitation... a field
that represents a unique file identifier is 64-bits in
the tmem ABI and may need to be as large as 192-bits.
So to support these guest filesystems, the tmem ABI must be
revised, from "v0" to "v1".
I *think* it is still the case that tmem is experimental
and is not used anywhere yet in production.
The tmem ABI is designed to support multiple revisions,
so the Xen tmem implementation could be updated to
handle both v0 and v1. However this is a bit
messy and would require data structures for both v0
and v1 to appear in public Xen header files.
I am inclined to update the Xen tmem implementation
to only support v1 and gracefully fail v0. This would
result in only a performance loss (as if tmem were
disabled) for newly launched tmem-v0-enabled guests,
but live-migration between old tmem-v0 Xen and new
tmem-v1 Xen machines would fail, and saved tmem-v0
guests will not be able to be restored on a tmem-v1
Xen machine. I would plan to update both pre-4.0.2
and unstable (future 4.1) to only support v1.
I believe these restrictions are reasonable at this
point in the tmem lifecycle, though they may not
be reasonable in the near future; should the tmem
ABI need to be revised from v1 to v2, I understand
backwards compatibility will be required.
tmem (hv): move to new ABI version to handle long object-ids
After a great deal of discussion and review with linux
kernel developers, it appears there are "next-generation"
filesystems (such as btrfs, xfs, Lustre) that will not
be able to use tmem due to an ABI limitation... a field
that represents a unique file identifier is 64-bits in
the tmem ABI and may need to be as large as 192-bits.
So to support these guest filesystems, the tmem ABI must be
revised, from "v0" to "v1".
I *think* it is still the case that tmem is experimental
and is not used anywhere yet in production.
The tmem ABI is designed to support multiple revisions,
so the Xen tmem implementation could be updated to
handle both v0 and v1. However this is a bit
messy and would require data structures for both v0
and v1 to appear in public Xen header files.
I am inclined to update the Xen tmem implementation
to only support v1 and gracefully fail v0. This would
result in only a performance loss (as if tmem were
disabled) for newly launched tmem-v0-enabled guests,
but live-migration between old tmem-v0 Xen and new
tmem-v1 Xen machines would fail, and saved tmem-v0
guests will not be able to be restored on a tmem-v1
Xen machine. I would plan to update both pre-4.0.2
and unstable (future 4.1) to only support v1.
I believe these restrictions are reasonable at this
point in the tmem lifecycle, though they may not
be reasonable in the near future; should the tmem
ABI need to be revised from v1 to v2, I understand
backwards compatibility will be required.
Keir Fraser [Mon, 30 Aug 2010 07:59:46 +0000 (08:59 +0100)]
ept: Put locks around ept_get_entry
There's a subtle race in ept_get_entry, such that if tries to read an
entry that ept_set_entry is modifying, it gets neither the old entry
nor the new entry, but empty. In the case of multi-cpu
populate-on-demand guests, this manifests as a guest crash when one
vcpu tries to read a page which another page is trying to populate,
and ept_get_entry returns p2m_mmio_dm.
This bug can also be fixed by making both ept_set_entry and
ept_next_level access-once (i.e., ept_next_level reads full ept_entry
and then works with local value; ept_set_entry construct the entry
locally and then sets it in one write). But there doesn't seem to be
any major performance implications of just making ept_get_entry use
locks; so the simpler, the better.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
xen-unstable changeset: 22071:c5aed2e049bc
xen-unstable date: Mon Aug 30 08:39:52 2010 +0100
Keir Fraser [Mon, 30 Aug 2010 07:50:52 +0000 (08:50 +0100)]
x2APIC: Improve x2APIC suspend/resume
x2apic depends on interrupt remapping, so it should disable interrupt
remapping behind x2apic disabling. And also this patch wraps
__enable_x2apic to get rid of duplicated code.
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 3cee41690fa2
xen-unstable date: Fri Aug 13 14:58:06 2010 +0100
Keir Fraser [Sun, 15 Aug 2010 20:48:06 +0000 (21:48 +0100)]
blktap2: make protocol specific usage of shared sring explicit
I don't think protocol specific data really belongs in this header
but since it is already there and we seem to be stuck with it let's at
least make the users explicit lest people get caught out by future new
fields moving the pad field around.
This is the Xen portion of this change. The kernel portion will be
sent separately. There is no dependency between the two.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Daniel Stodden <daniel.stodden@citrix.com> Cc: Dongxiao Xu <dongxiao.xu@intel.com>
xen-unstable changeset: feee0abed6aa
xen-unstable date: Fri Jul 02 18:58:02 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 14:06:24 +0000 (15:06 +0100)]
Fix IOAPIC S3 with interrupt remapping enabled
In ioapic_suspend, it reads and saves ioapic RTEs. But when interrupt
remapping is enabled, io_apic_read will call io_apic_read_remap_rte to
convert remapped format interrupt to compatible format, this results
in 'dest' field may be changed in remap_entry_to_ioapic_rte. When in
ioapic_resume, it will write the saved RTEs with incorrect 'dest' to
interrupt remapping table.
Actually it needn't to convert RTEs regardless interrupt remapping is
enabled or not. It just needs to save and restore RTE values
directly. This patch just uses __io_apic_read and __io_apic_write,
which won't call Interrupt remapping functions to convert, to save and
restore RTEs in ioapic_suspend and ioapic_resume. Thus fix this issue.
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 01d185dab39e
xen-unstable date: Fri Aug 13 14:57:35 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 08:05:07 +0000 (09:05 +0100)]
[Xen-devel] [PATCH] PoD: Fix domain build populate-on-demand cache
allocation Rather than trying to count the number of PoD entries we're
putting in, we simply pass the target # of pages - the vga hole, and
let the hypervisor do the calculation.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
xen-unstable changeset: 6f059a340cdf
xen-unstable date: Wed Aug 11 15:56:21 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 07:52:56 +0000 (08:52 +0100)]
msi: Avoid uninitialized msi descriptors
When __pci_enable_msix() returns early, output parameter (struct
msi_desc **desc) will not be initialized. On my machine, a Broadcom
BCM5709 nic has both MSI and MSIX capability blocks and when guest
tries to enable msix interrupts but __pci_enable_msix() returns early
for encountering a msi block, the whole system will crash for fatal
page fault immediately.
Signed-off-by: Wei Wang <wei.wang2@amd.com>
xen-unstable changeset: 786b163da49b
xen-unstable date: Wed Aug 11 17:01:02 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 07:52:08 +0000 (08:52 +0100)]
xc: fix segfault in pv domain create if kernel is an invalid image
If libelf calls elf_err() or elf_msg() before elf_set_log() has been
called then it could potentially read an uninitialised log handling
callback function pointer from struct elf_binary. Fix this in libxc by
zeroing the structure before calling elf_init().
Keir Fraser [Mon, 9 Aug 2010 15:51:30 +0000 (16:51 +0100)]
vt-d: Fix ioapic_rte_to_remap_entry error path.
When ioapic_rte_to_remap_entry fails, currently it just writes value
to ioapic. But the 'mask' bit may be changed if it writes to the upper
half of RTE. This patch ensures to recover the original value of
'mask' bit in this case.
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 21934:befd1814c0a2
xen-unstable date: Mon Aug 09 16:33:45 2010 +0100
Keir Fraser [Mon, 9 Aug 2010 15:51:03 +0000 (16:51 +0100)]
vt-d: Fix ioapic write order in io_apic_write_remap_rte
At the end of io_apic_write_remap_rte, it writes new entry (remapped
interrupt) to ioapic. But it writes low 32 bits before high 32 bits,
it unmasks interrupt before writing high 32 bits if 'mask' bit in low
32 bits is cleared. Thus it may result in issues. This patch fixes
this issue by writing high 32 bits before low 32 bits.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com> Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 21933:add40eb47868
xen-unstable date: Mon Aug 09 16:32:45 2010 +0100
Keir Fraser [Mon, 2 Aug 2010 16:17:55 +0000 (17:17 +0100)]
xenpaging: Add a check to Xen for EPT.
There isn't seem to be a way to directly check for EPT, so instead
check for HAP and an Intel processor. If EPT isn't enabled, then
return an error to the tool.
Keir Fraser [Mon, 2 Aug 2010 16:11:33 +0000 (17:11 +0100)]
Walking the page lists needs the page_alloc lock
There are a few places in Xen where we walk a domain's page lists
without holding the page_alloc lock. They race with updates to the
page lists, which are normally rare but can be quite common under PoD
when the domain is close to its memory limit and the PoD reclaimer is
busy. This patch protects those places by taking the page_alloc lock.
I think this is OK for the two debug-key printouts - they don't run
from irq context and look deadlock-free. The tboot change seems safe
too unless tboot shutdown functions are called from irq context or
with the page_alloc lock held. The p2m one is the scariest but there
are already code paths in PoD that take the page_alloc lock with the
p2m lock held so it's no worse than existing code.
iommu: New options iommu=dom-strict and iommu=dom0-passthrough
The former strips dom0 of its usual 1:1 mapping of all memory, and
only provides it with mappings of its own memory, like any other
domain. The latter is a new consistent name for iommu=passthrough.
xen: Send the debug VIRQ to guests after the rest of the domain dump is done.
Send the debug VIRQ to guests after the rest of the domain dump is
done. This stops all the 'q' debug-key output getting interleaved with
the debug-virq output from a pv-ops dom0 kernel.
xm: Do not check path of kernel if bootloader is specified
When create DomU, if bootloader is specified, 'kernel/ramdisk' will be
used by bootloader when boots DomU. So it is needless to check the
path is existent or not.
Newer version of gdb, version 7*, seems to have bug where it is not
parsing thread list from gdbsx properly. Getting rid of the space in
thread list works around it. It's ok with older gdb also.
tools/hotplug: locking.sh script: fix lock directory remains on error bug
_release_lock should be used instead of release_lock.
sigerr is introduced so that it can be redefined by
xen-hotplug-common.sh to a version which writes error status to
xenstore.
This matches similar checks done in Linux, since no good can come from
a domain trying to enable both MSI and MSI-X on the same device at the
same time.
This patch masks PIC and IOAPIC RTE's before x2APIC enabling, unmask
and restore them after x2APIC enabling. It also really enables
interrupt remapping before x2APIC enabling instead of just checking
interrupt remapping setting. This patch also handles all x2APIC
configuration including BIOS settings and command line
settings. Especially, it handles that BIOS hands over in x2APIC mode
(when there is apic id > 255). It checks if x2APIC is already enabled
by BIOS. If already enabled, it will disable interrupt remapping and
queued invalidation first, then enable them again.
x2APIC/VT-d: improve interrupt remapping and queued invalidation enabling and disabling
x2APIC depends on interrupt remapping, so interrupt remapping needs to
be enabled before x2APIC. Usually x2APIC is not enabled
(x2apic_enabled=0) when enable interrupt remapping, although x2APIC
will be enabled later. So it needs to pass a parameter to set
interrupt mode in intremap_enable, instead of checking
x2apic_enable. This patch adds a parameter "eim" to intremap_enable to
achieve it. Interrupt remapping and queued invalidation are already
enabled when enable x2apic, so it needn't to enable them again when
setup iommu. This patch checks if interrupt remapping and queued
invalidation are already enable or not, and won't enable them if
already enabled. It does the similar in disabling, that's to say don't
disable them if already disabled.
A drhd is created when parse ACPI DMAR table, but drhd->iommu is not
allocated until iommu setup. But iommu is needed by x2APIC which will
enable interrupt remapping before iommu setup. This patch allocates
iommu when create drhd. And then drhd->ecap can be removed because
it's the same as iommu->ecap.
Currently "make stubdom" on its own fails because it depends on files
being installed by the results of "make tools". This also means that
in some circumstances a parallel "make tools stubdom" (or "make all")
can fail due to races. So make "make stubdom" depend on "make tools"
having completed first.
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
xen-unstable changeset: 21760:84719437205c
xen-unstable date: Fri Jul 09 12:22:52 2010 +0100
The hardware CPUID-levelling features level the feature flags but
don't change the CPU family/model/stepping. Relax the HVM restore
check on family/model/stepping to printk but not veto the load, so
that VMs can be migrated between machines that have been
CPUID-levelled.
xen: allow HVM save/restore from different changesets
Allow HVM save/restore from different changesets of Xen. The HVM save
records are supposed to be backwards compatible; XenServer
live-migrates between versions of Xen during upgrades.