Keir Fraser [Fri, 10 Dec 2010 11:34:28 +0000 (11:34 +0000)]
x86 hvm: Add a new HVMOP to get the current Xen system time
Xen absolute system time, so that it can use SCHEDOP_poll in a
sensible fashion. HVM PV drivers can't use the normal PV clock
because they might have TSC offsets that they don't know about.
Keir Fraser [Thu, 9 Dec 2010 10:14:57 +0000 (10:14 +0000)]
x86/mm: change ASSERTs to BUG_ONs in mem_sharing.c
These two ASSERTs have important side-effects so make them into
BUG_ONs, consistent with the rest of the file.
Bug found by Jui-Hao Chiang <juihaochiang@gmail.com>.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
xen-unstable changeset: 22467:89116f28083f
xen-unstable date: Wed Dec 08 10:46:31 2010 +0000
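The hazard behind the ASSERT-to-BUG_ON change above can be sketched in a few lines of C. This is only an illustration: the ASSERT and BUG_ON macros below are simplified stand-ins, not the hypervisor's definitions, and take_ref() and the DEBUG_BUILD switch are hypothetical. It assumes that ASSERT compiles to nothing in non-debug builds, so a side effect inside it silently disappears.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins for the hypervisor macros (assumption, not Xen's code). */
#ifdef DEBUG_BUILD
#define ASSERT(expr) do { if (!(expr)) abort(); } while (0)
#else
#define ASSERT(expr) ((void)0)      /* expression is never evaluated */
#endif

/* BUG_ON always evaluates its argument, in every build. */
#define BUG_ON(expr) do { if (expr) abort(); } while (0)

static int refcount;

/* Hypothetical helper with a side effect: takes a reference, returns nonzero. */
static int take_ref(void)
{
    refcount++;
    return 1;
}

/* Reference count seen after wrapping the call in ASSERT. */
int refs_after_assert(void)
{
    refcount = 0;
    ASSERT(take_ref());
    return refcount;
}

/* Reference count seen after wrapping the call in BUG_ON. */
int refs_after_bug_on(void)
{
    refcount = 0;
    BUG_ON(!take_ref());
    return refcount;
}
```

In a build without DEBUG_BUILD the ASSERT variant never takes the reference, while the BUG_ON variant always does.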
Keir Fraser [Tue, 7 Dec 2010 18:37:31 +0000 (18:37 +0000)]
x86: remove BUG_ON() from QUIRK_IOAPIC_*_REGSEL handler
Since (non-pvops, 32-bit only up to 2.6.27) Linux would report "BAD"
unconditionally on all SiS chipset versions (it only looks for a PCI
device at 0000:00:00.0 with SiS as the vendor), we must not crash if
the report on a 64-bit hypervisor doesn't match the #define (which is
zero).
While we could honor the quirk indication even on 64-bit, it doesn't
seem worthwhile, as there's no evidence that newer SiS chipsets
(supporting 64-bit CPUs) are actually affected.
This should also address bug 1687 (mis-reported, however, afaict).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
xen-unstable changeset: 22466:bfd13358b8bf
xen-unstable date: Tue Dec 07 18:32:04 2010 +0000
Keir Fraser [Wed, 1 Dec 2010 20:14:56 +0000 (20:14 +0000)]
x86: fix IRQ migration when using directed EOI (broken with c/s 20465)
In directed-EOI mode, there is no chance to do the migration in
mask_and_ack_level_ioapic_irq(), as the remote IRR bit can't possibly
be clear after issuing the EOI to the LAPIC. Consequently, there's no
point to even try. Instead, migration must be done in
end_level_ioapic_irq(), and it requires masking the interrupt source
prior to issuing the EOI to the IO-APIC.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
xen-unstable changeset: 22452:62bf12040b0f
xen-unstable date: Wed Dec 01 20:10:27 2010 +0000
Keir Fraser [Tue, 30 Nov 2010 11:38:16 +0000 (11:38 +0000)]
x86 hvm: Do not overwrite boot-cpu capability data on VMX/SVM startup.
The overwrite was apparently required back in the earliest days of Xen,
but we now properly initialise CPU capabilities early during bootstrap.
Re-writing capability data later now causes problems if specific features
have been deliberately masked out.
Thanks to Weidong Han at Intel for finding such a bug where the XSAVE
feature is masked out by default, but then erroneously written back
during VMX initialisation. This would cause memory corruption problems
during boot for XSAVE-capable systems.
Keir Fraser [Mon, 29 Nov 2010 14:46:43 +0000 (14:46 +0000)]
x86: tighten filter on ptwr_do_page_fault()
Even not-so-recent Linux may, due to post-2.6.18 changes to the
process creation code, cause quite a number (depending on environment
and argument size) of faulting accesses to user space originating from
kernel mode. Generally those happen for non-present pages and would
lead to a nested page fault from guest_get_eff_l1e(). They can be
avoided by checking for PFEC_page_present as long as the guest isn't
running on shadow page tables.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Keir Fraser <keir@xen.org>
xen-unstable changeset: 22449:3afb5ecbf69f
xen-unstable date: Mon Nov 29 14:40:55 2010 +0000
Keir Fraser [Mon, 29 Nov 2010 14:46:01 +0000 (14:46 +0000)]
x86-64: don't crash Xen upon direct pv guest access to GDT/LDT mapping area
handle_gdt_ldt_mapping_fault() is intended to deal with indirect
accesses (i.e. those caused by descriptor loads) to the GDT/LDT
mapping area only. While on 32-bit, segment limits indeed prevent the
function being entered for direct accesses (i.e. a #GP fault will be
raised even before the address translation gets done), on 64-bit even
user mode accesses would lead to control reaching the BUG_ON() at the
beginning of that function.
Fortunately the fix is simple: Since the guest kernel runs in ring 3,
any guest direct access will have the "user mode" bit set, whereas
descriptor loads always do the translations to access the actual
descriptors as kernel mode ones.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Further, relax the BUG_ON() in handle_gdt_ldt_mapping_fault() to a
check-and-bail. This avoids any problems in future, if we don't
execute x86_64 guest kernels in ring 3 (e.g., because we use a
lightweight HVM container).
Keir Fraser [Wed, 10 Nov 2010 14:16:45 +0000 (14:16 +0000)]
hvmloader: fix off-by-one-bit error when initialising PCI devices
hvmloader is responsible for - amongst other things - initialising
the PCI device BARs prior to loading the guest BIOS. The previous
code only probed for devfn up to 128. The lower 3 bits are function
IDs so this meant that only devices in slots 0-15 were actually being
initialized.
Signed-off-by: Alex Zeffertt <alex.zeffertt@eu.citrix.com>
Acked-by: Gianni Tedesco <gianni.tedesco@citrix.com>
xen-unstable changeset: 22383:cba667fb80cf
xen-unstable date: Wed Nov 10 13:58:16 2010 +0000
hvmloader: Fix 22383:cba667fb80cf iterating over devfns 0..255
We need to declare devfn as wider than 8 bits for a loop 0<=devfn<256
to terminate.
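Both hvmloader fixes above come down to devfn arithmetic, which the following C sketch tries to make concrete. The macro and function names are illustrative, not hvmloader's own. It relies only on the layout stated above (lower 3 bits are the function, the remaining bits the slot) and on the fact that an 8-bit counter wraps from 255 back to 0.

```c
#include <assert.h>
#include <stdint.h>

/* devfn layout: upper 5 bits select the slot, lower 3 bits the function. */
#define DEVFN_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define DEVFN_FUNC(devfn) ((devfn) & 0x07)

/* Highest slot reached when probing every devfn in [0, limit). */
unsigned int highest_slot_probed(unsigned int limit)
{
    unsigned int devfn;          /* deliberately wider than 8 bits */
    unsigned int highest = 0;

    for (devfn = 0; devfn < limit; devfn++)
        if (DEVFN_SLOT(devfn) > highest)
            highest = DEVFN_SLOT(devfn);
    return highest;
}

/* With an 8-bit counter, 255 + 1 wraps to 0, so "devfn < 256" never fails. */
uint8_t next_devfn_8bit(uint8_t devfn)
{
    return (uint8_t)(devfn + 1);
}
```

Probing up to 128 stops at slot 15, probing up to 256 reaches slot 31, and the wraparound in next_devfn_8bit() is why the loop counter has to be wider than 8 bits.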
Keir Fraser [Mon, 8 Nov 2010 15:36:58 +0000 (15:36 +0000)]
Fix "Error: Device 51952 not connected" error when using pygrub
The following is the process of booting a DomU with 'mounted-blktap2'
(VHD, for example) and 'pygrub' as bootloader:
1. Connect the boot device to Dom0 as '/dev/xvdp'
2. Pygrub gets the info needed to load the DomU
3. Disconnect the boot device from Dom0
4. Boot the DomU
During step 3 the created device is disconnected from Dom0, but
its xenstore entries are not cleaned up after the disconnect, so you
get the following error:
"Error: Device /dev/xvdp (51952, tap2) is already connected."
During step 3 xend calls destroyDevice always with 'tap' as argument.
Keir Fraser [Mon, 8 Nov 2010 15:35:30 +0000 (15:35 +0000)]
tools/xenpaging: Add _XOPEN_SOURCE to fix build problems with recent gcc
This patch fixes compilation issues with
gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21).
Signed-off-by: Daniel Kiper <dkiper@net-space.pl>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
xen-unstable changeset: 22023:af6799abc6e9
xen-unstable date: Wed Aug 18 16:48:25 2010 +0100
Keir Fraser [Wed, 3 Nov 2010 08:28:36 +0000 (08:28 +0000)]
VT-d: fix device assignment failure (regression from Xen c/s 19805:2f1fa2215e60)
If the device at <secbus>:00.0 is the device the mapping operation was
initiated for, trying to map it a second time will fail, and hence
this second mapping attempt must be prevented (as was done prior to
said c/s).
While at it, simplify the code a little, too.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Acked-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 22348:2dfba250c50b
xen-unstable date: Wed Nov 03 08:18:51 2010 +0000
Keir Fraser [Sun, 24 Oct 2010 12:26:45 +0000 (13:26 +0100)]
x86/kexec: fix very old regression and make compatible with modern Linux
c/s 13829 lost the (32-bit only) cpu_has_pae argument passed to the
primary kernel's stub (in the 32-bit Xen case only), and Linux
2.6.27/.30 (32-/64-bit) introduced a new argument (for KEXEC_JUMP)
which for now simply gets passed a hardcoded value.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
xen-unstable changeset: 22280:d6e3cd10a9a6
xen-unstable date: Sun Oct 24 13:15:06 2010 +0100
Keir Fraser [Sun, 24 Oct 2010 12:26:17 +0000 (13:26 +0100)]
Allow max_pages to be set to less than tot_pages
The memory allocation code sometimes needs to enforce that a guest
that's been told to balloon down isn't going to expand further
(because it's still executing a previous balloon-up operation). That
means being able to set the desired max_pages even before the balloon
driver has brought tot_pages down to the right level.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
xen-unstable changeset: 22279:2208a036f8d9
xen-unstable date: Sun Oct 24 13:13:04 2010 +0100
Keir Fraser [Wed, 20 Oct 2010 12:34:36 +0000 (13:34 +0100)]
x86-64: workaround for BIOSes wrongly enabling LAHF_LM feature indicator
This workaround is taken from Linux, and the main motivation (besides
such workarounds indeed belonging in the hypervisor rather than each
kernel) is to suppress the warnings in the Xen log each Linux guest
would cause due to the disallowed wrmsr.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
xen-unstable changeset: 22232:eb964c4b4f31
xen-unstable date: Mon Oct 11 09:02:36 2010 +0100
Keir Fraser [Sat, 2 Oct 2010 14:13:01 +0000 (15:13 +0100)]
x86 shadow: reset up-pointers on all l3s when l3s stop being pinnable.
Walking the pinned-shadows list isn't enough: there could be an
unpinned (but still shadowed) l3 somewhere and if we later try to
unshadow it it'll have an up-pointer of PAGE_LIST_NULL:PAGE_LIST_NULL.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
xen-unstable changeset: 22224:a4016a257672
xen-unstable date: Sat Oct 02 15:05:50 2010 +0100
Keir Fraser [Sat, 2 Oct 2010 14:10:53 +0000 (15:10 +0100)]
Vt-d: fix dom0 graphics problem on Lenovo T410.
The patch is derived from a similar quirk in the Linux kernel by David
Woodhouse and Adam Jackson. It checks for the VT enabling bit in the IGD
GGC register. If VT is not enabled correctly in the IGD, Xen does not
enable VT-d translation for the IGD VT-d engine. If the iommu boot
parameter is set to force, Xen calls panic().
Signed-off-by: Allen Kay <allen.m.kay@intel.com>
xen-unstable changeset: 22223:4beee5779122
xen-unstable date: Sat Oct 02 15:04:21 2010 +0100
Keir Fraser [Sat, 2 Oct 2010 14:10:22 +0000 (15:10 +0100)]
x86: fix boot failure (regression from pre-4.0 IRQ handling changes)
With the change to index irq_desc[] by IRQ rather than by vector, the
prior implicit change of the used flow handler when altering the IRQ
routing path to go through the 8259A didn't work anymore, and hence
boards needing the ExtINT delivery workaround failed to boot.
Make make_8259A_irq() a real function again, thus allowing the flow
handler to be changed there.
Also eliminate the generally superfluous and (at least theoretically)
dangerous hard coded setting of the flow handler for IRQ0: Earlier
code should have set this already based on information coming from
ACPI/MPS, and non-standard systems may e.g. have this IRQ level
triggered.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Tested-by: Markus Schuster <ml@markus.schuster.name>
xen-unstable changeset: 22222:aed9fd361340
xen-unstable date: Sat Oct 02 15:03:15 2010 +0100
Keir Fraser [Sat, 2 Oct 2010 14:10:01 +0000 (15:10 +0100)]
Vt-d: fix feature boot messages
Changed vt-d feature boot messages from "supported" to "enabled" since
they reflect what is currently enabled in this Xen boot - not what is
supported by VT-d hardware.
Signed-off-by: Allen Kay <allen.m.kay@intel.com>
xen-unstable changeset: 22221:3518149c4d5d
xen-unstable date: Sat Oct 02 15:00:05 2010 +0100
While not as relevant after c/s 21894, it still seems safer to check
the CPUID level here, just like Linux does. This is particularly
relevant for the 4.0 tree (which doesn't have said c/s), but also
possibly for nested environments where writing MSR_IA32_MISC_ENABLE
may not actually take effect (Xen itself ignores such writes).
Mfns for PV domains were not properly checked, potentially
allowing a buggy or malicious PV guest to crash Xen. Also,
use get_page/put_page to claim a reference to the pages
so they can't disappear out from under tmem's feet.
Revert 22186:7167d6dd5c7c "x86: Retry do_mmu_update() a few times"
It does not work reliably for a couple of reasons:
(1) page_lock() fails if a page is !PGT_validated, and a page can
remain in that state for unbounded time.
(2) in the kernel-side race that motivated this patch, pgd_pin() can
lose to vmalloc_sync_all() -- pgd_pin() can try to change a pmd page's
type to l2_pagetable while
vmalloc_sync_all()->set_pmd()->do_mmu_update() has it temporarily
pinned as writable. This is hard to fix on the Xen side.
Hence I give up on this approach, revert the patch, and settle for
kernel-side patching only.
x86: Retry do_mmu_update() a few times when called on a pte whose type is in flux.
This can really happen -- all our PV Linux kernels have a race
between vmalloc_sync_all() and pgdir pinning/unpinning. The former is
protected by pgd_lock while the latter by mm->page_table_lock. Hence
they can happen concurrently, and vmalloc_sync_all() can attempt to
set_pmd() on a page directory which is in the process of being
pinned. This can confuse the hypervisor which may see a type change,
and hence fail do_mmu_update(). Until this patch. :-)
sched_credit: Raise bar for inter-socket migrations on mostly-idle systems
The credit scheduler tries to keep work balanced, even on a mostly idle
system. Unfortunately, if you have one VM burning cpu and another VM
idle, the effect is that the busy VM will flip back and forth between
sockets.
This patch addresses this, by only migrating to a different socket if
the number of idle processors is twice that of the socket the vcpu is
currently on.
This will only affect mostly-idle systems; as the system becomes more
busy, other load-balancing code will come into effect.
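The migration rule described above can be written as a one-line predicate. The sketch below is not the scheduler's code: the function name and MIGRATE_FACTOR are made up for illustration, and the strict comparison ("more than twice") is an assumption, since the description only says "twice".

```c
#include <assert.h>

/* Assumption: the "twice" threshold from the description, compared strictly. */
#define MIGRATE_FACTOR 2u

/*
 * Decide whether a vcpu should move to another socket.
 * idle_here:  idle processors on the socket the vcpu currently runs on
 * idle_there: idle processors on the candidate socket
 */
int should_migrate_socket(unsigned int idle_here, unsigned int idle_there)
{
    return idle_there > MIGRATE_FACTOR * idle_here;
}
```

With one idle processor on the current socket, a candidate socket with two idle processors is not enough to trigger a move under this assumption, but three are.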
Several seconds of backward time drift per minute can be seen on a
RHEL6 HVM guest by switching the clocksource to 'acpi_pm' and then
running gettimeofday() in a loop. This is due to the accumulation
of small inaccuracies that are caused by shifting out the lower 32
bits when pmt_update_time() computes 'tmr_val'.
The patch makes sure that the lower 32 bits of the computed value
are not lost. They are saved in a new field 'not_accounted' in the
PMTState structure and are accounted the next time pmt_update_time()
is called.
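The drift and its fix can be reproduced with a small fixed-point model. The struct below is a hypothetical stand-in for PMTState, assuming a 32.32 fixed-point scale factor. Only the field name not_accounted comes from the description above, and the update functions are sketches rather than the real pmt_update_time().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for PMTState. */
struct pmt_sketch {
    uint64_t scale;          /* timer ticks per time unit, 32.32 fixed point */
    uint32_t tmr_val;        /* accumulated whole ticks */
    uint32_t not_accounted;  /* fractional ticks carried to the next update */
};

/* Old behaviour: the lower 32 bits are shifted out and lost on every call. */
void pmt_update_lossy(struct pmt_sketch *s, uint64_t delta)
{
    s->tmr_val += (uint32_t)((delta * s->scale) >> 32);
}

/* Fixed behaviour: keep the lower 32 bits and add them in next time. */
void pmt_update_carry(struct pmt_sketch *s, uint64_t delta)
{
    uint64_t tmp = delta * s->scale + s->not_accounted;

    s->not_accounted = (uint32_t)tmp;
    s->tmr_val += (uint32_t)(tmp >> 32);
}

/* Run a number of one-unit updates and report the resulting tick count. */
uint32_t ticks_after(int carry, uint64_t scale, unsigned int updates)
{
    struct pmt_sketch s = { scale, 0, 0 };

    while (updates--) {
        if (carry)
            pmt_update_carry(&s, 1);
        else
            pmt_update_lossy(&s, 1);
    }
    return s.tmr_val;
}
```

With a scale of half a tick per time unit (0x80000000), four updates should add up to two ticks. The lossy version never advances, while the carrying version reaches two.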
C6 state with EOI issue fix for some Intel processors
There is an erratum in some Intel processors.
AAJ72. EOI Transaction May Not be Sent if Software Enters Core C6
During an Interrupt Service Routine
If core C6 is entered after the start of an interrupt service routine
but before a write to the APIC EOI register, the core may not send an
EOI transaction (if needed) and further interrupts from the same
priority level or lower may be blocked.
This patch fixes the issue by checking if an ISR is pending before
entering a deep Cx state. If so, it uses power->safe_state instead of the
deep Cx state to prevent the above issue from happening.
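A minimal sketch of that check, assuming "ISR pending" means that any bit is set in the local APIC's 256-bit in-service register (eight 32-bit words). The function names and the integer state identifiers are illustrative only and do not reflect the actual cpuidle code.

```c
#include <assert.h>
#include <stdint.h>

#define ISR_WORDS 8   /* 256 vectors, 32 bits per register */

/* Nonzero if any interrupt is still in service (no EOI written yet). */
static int isr_pending(const uint32_t isr[ISR_WORDS])
{
    int i;

    for (i = 0; i < ISR_WORDS; i++)
        if (isr[i])
            return 1;
    return 0;
}

/* Fall back to the safe state whenever an EOI is still outstanding. */
int choose_cstate(const uint32_t isr[ISR_WORDS], int deep_state, int safe_state)
{
    return isr_pending(isr) ? safe_state : deep_state;
}
```

The decision costs only a handful of register reads, and the deep state is skipped only for the one idle period in which an EOI is still outstanding.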
tmem (tools): move to new ABI version to handle long object-ids
After a great deal of discussion and review with linux
kernel developers, it appears there are "next-generation"
filesystems (such as btrfs, xfs, Lustre) that will not
be able to use tmem due to an ABI limitation... a field
that represents a unique file identifier is 64-bits in
the tmem ABI and may need to be as large as 192-bits.
So to support these guest filesystems, the tmem ABI must be
revised, from "v0" to "v1".
I *think* it is still the case that tmem is experimental
and is not used anywhere yet in production.
The tmem ABI is designed to support multiple revisions,
so the Xen tmem implementation could be updated to
handle both v0 and v1. However this is a bit
messy and would require data structures for both v0
and v1 to appear in public Xen header files.
I am inclined to update the Xen tmem implementation
to only support v1 and gracefully fail v0. This would
result in only a performance loss (as if tmem were
disabled) for newly launched tmem-v0-enabled guests,
but live-migration between old tmem-v0 Xen and new
tmem-v1 Xen machines would fail, and saved tmem-v0
guests will not be able to be restored on a tmem-v1
Xen machine. I would plan to update both pre-4.0.2
and unstable (future 4.1) to only support v1.
I believe these restrictions are reasonable at this
point in the tmem lifecycle, though they may not
be reasonable in the near future; should the tmem
ABI need to be revised from v1 to v2, I understand
backwards compatibility will be required.
tmem (hv): move to new ABI version to handle long object-ids
After a great deal of discussion and review with linux
kernel developers, it appears there are "next-generation"
filesystems (such as btrfs, xfs, Lustre) that will not
be able to use tmem due to an ABI limitation... a field
that represents a unique file identifier is 64-bits in
the tmem ABI and may need to be as large as 192-bits.
So to support these guest filesystems, the tmem ABI must be
revised, from "v0" to "v1".
I *think* it is still the case that tmem is experimental
and is not used anywhere yet in production.
The tmem ABI is designed to support multiple revisions,
so the Xen tmem implementation could be updated to
handle both v0 and v1. However this is a bit
messy and would require data structures for both v0
and v1 to appear in public Xen header files.
I am inclined to update the Xen tmem implementation
to only support v1 and gracefully fail v0. This would
result in only a performance loss (as if tmem were
disabled) for newly launched tmem-v0-enabled guests,
but live-migration between old tmem-v0 Xen and new
tmem-v1 Xen machines would fail, and saved tmem-v0
guests will not be able to be restored on a tmem-v1
Xen machine. I would plan to update both pre-4.0.2
and unstable (future 4.1) to only support v1.
I believe these restrictions are reasonable at this
point in the tmem lifecycle, though they may not
be reasonable in the near future; should the tmem
ABI need to be revised from v1 to v2, I understand
backwards compatibility will be required.
Keir Fraser [Mon, 30 Aug 2010 07:59:46 +0000 (08:59 +0100)]
ept: Put locks around ept_get_entry
There's a subtle race in ept_get_entry, such that if it tries to read an
entry that ept_set_entry is modifying, it gets neither the old entry
nor the new entry, but an empty one. In the case of multi-cpu
populate-on-demand guests, this manifests as a guest crash when one
vcpu tries to read a page which another vcpu is trying to populate,
and ept_get_entry returns p2m_mmio_dm.
This bug can also be fixed by making both ept_set_entry and
ept_next_level access-once (i.e., ept_next_level reads the full ept_entry
and then works with a local value; ept_set_entry constructs the entry
locally and then sets it in one write). But there don't seem to be
any major performance implications of just making ept_get_entry use
locks; so the simpler, the better.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
xen-unstable changeset: 22071:c5aed2e049bc
xen-unstable date: Mon Aug 30 08:39:52 2010 +0100
Keir Fraser [Mon, 30 Aug 2010 07:50:52 +0000 (08:50 +0100)]
x2APIC: Improve x2APIC suspend/resume
x2apic depends on interrupt remapping, so interrupt remapping should be
disabled only after x2apic has been disabled. This patch also wraps
__enable_x2apic to get rid of duplicated code.
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 3cee41690fa2
xen-unstable date: Fri Aug 13 14:58:06 2010 +0100
Keir Fraser [Sun, 15 Aug 2010 20:48:06 +0000 (21:48 +0100)]
blktap2: make protocol specific usage of shared sring explicit
I don't think protocol specific data really belongs in this header
but since it is already there and we seem to be stuck with it let's at
least make the users explicit lest people get caught out by future new
fields moving the pad field around.
This is the Xen portion of this change. The kernel portion will be
sent separately. There is no dependency between the two.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Daniel Stodden <daniel.stodden@citrix.com>
Cc: Dongxiao Xu <dongxiao.xu@intel.com>
xen-unstable changeset: feee0abed6aa
xen-unstable date: Fri Jul 02 18:58:02 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 14:06:24 +0000 (15:06 +0100)]
Fix IOAPIC S3 with interrupt remapping enabled
In ioapic_suspend, the ioapic RTEs are read and saved. But when interrupt
remapping is enabled, io_apic_read calls io_apic_read_remap_rte to
convert the remapped-format interrupt to compatible format, so the
'dest' field may be changed in remap_entry_to_ioapic_rte. ioapic_resume
then writes the saved RTEs with an incorrect 'dest' to the
interrupt remapping table.
There is actually no need to convert RTEs, regardless of whether
interrupt remapping is enabled. It is enough to save and restore the RTE
values directly. This patch uses __io_apic_read and __io_apic_write,
which do not call the interrupt remapping conversion functions, to save
and restore RTEs in ioapic_suspend and ioapic_resume, which fixes the issue.
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 01d185dab39e
xen-unstable date: Fri Aug 13 14:57:35 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 08:05:07 +0000 (09:05 +0100)]
[Xen-devel] [PATCH] PoD: Fix domain build populate-on-demand cache allocation
Rather than trying to count the number of PoD entries we're
putting in, we simply pass the target # of pages - the vga hole, and
let the hypervisor do the calculation.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
xen-unstable changeset: 6f059a340cdf
xen-unstable date: Wed Aug 11 15:56:21 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 07:52:56 +0000 (08:52 +0100)]
msi: Avoid uninitialized msi descriptors
When __pci_enable_msix() returns early, the output parameter (struct
msi_desc **desc) will not be initialized. On my machine, a Broadcom
BCM5709 NIC has both MSI and MSI-X capability blocks, and when a guest
tries to enable MSI-X interrupts but __pci_enable_msix() returns early
because it encounters an MSI block, the whole system crashes immediately
with a fatal page fault.
Signed-off-by: Wei Wang <wei.wang2@amd.com>
xen-unstable changeset: 786b163da49b
xen-unstable date: Wed Aug 11 17:01:02 2010 +0100
Keir Fraser [Fri, 13 Aug 2010 07:52:08 +0000 (08:52 +0100)]
xc: fix segfault in pv domain create if kernel is an invalid image
If libelf calls elf_err() or elf_msg() before elf_set_log() has been
called then it could potentially read an uninitialised log handling
callback function pointer from struct elf_binary. Fix this in libxc by
zeroing the structure before calling elf_init().
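The pattern behind this fix is sketched below. struct elf_sketch is a hypothetical two-field stand-in and does not mirror the real struct elf_binary, and report_error() only imitates what an error path that consults a log callback might do. The point is that zeroing first turns a garbage function pointer into a NULL that can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in; not the layout of struct elf_binary. */
struct elf_sketch {
    void (*log_cb)(const char *msg);  /* log callback, may not be set yet */
    int errors;
};

/* Error path that may run before any log callback has been installed. */
static void report_error(struct elf_sketch *e, const char *msg)
{
    e->errors++;
    if (e->log_cb != NULL)    /* only meaningful if the struct was zeroed */
        e->log_cb(msg);
}

/* Zero the structure before anything can reach the error path. */
int init_and_fail(struct elf_sketch *e)
{
    memset(e, 0, sizeof(*e));
    report_error(e, "invalid image");
    return e->errors;
}
```

Without the memset, log_cb would hold whatever bytes were in the caller's memory, and the NULL check would not protect the call.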
Keir Fraser [Mon, 9 Aug 2010 15:51:30 +0000 (16:51 +0100)]
vt-d: Fix ioapic_rte_to_remap_entry error path.
When ioapic_rte_to_remap_entry fails, currently it just writes the value
to the ioapic. But the 'mask' bit may be changed if it writes to the upper
half of the RTE. This patch ensures that the original value of the
'mask' bit is restored in this case.
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 21934:befd1814c0a2
xen-unstable date: Mon Aug 09 16:33:45 2010 +0100
Keir Fraser [Mon, 9 Aug 2010 15:51:03 +0000 (16:51 +0100)]
vt-d: Fix ioapic write order in io_apic_write_remap_rte
At the end of io_apic_write_remap_rte, the new entry (remapped
interrupt) is written to the ioapic. But the low 32 bits are written
before the high 32 bits, so if the 'mask' bit in the low 32 bits is
clear, the interrupt is unmasked before the high 32 bits are written,
which may cause problems. This patch fixes this by writing the high 32
bits before the low 32 bits.
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
xen-unstable changeset: 21933:add40eb47868
xen-unstable date: Mon Aug 09 16:32:45 2010 +0100
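The ordering problem in io_apic_write_remap_rte can be modelled with a fake two-word entry. Everything below is a toy model with made-up names: the only hardware fact it relies on is that the mask bit sits in bit 16 of the low 32 bits. The model records whether the entry was ever unmasked while the high half still held a stale value.

```c
#include <assert.h>
#include <stdint.h>

#define RTE_MASK_BIT (1u << 16)   /* mask bit lives in the low 32 bits */

/* Toy model of one redirection table entry (not a real register interface). */
struct fake_rte {
    uint32_t lo, hi;
    uint32_t expected_hi;          /* value the high half should end up with */
    int unmasked_with_stale_hi;    /* set if unmasked before hi was updated */
};

static void write_hi(struct fake_rte *r, uint32_t v)
{
    r->hi = v;
}

static void write_lo(struct fake_rte *r, uint32_t v)
{
    r->lo = v;
    /* Clearing the mask bit makes the entry live immediately. */
    if (!(v & RTE_MASK_BIT) && r->hi != r->expected_hi)
        r->unmasked_with_stale_hi = 1;
}

/* Returns 1 if the chosen write order exposes a stale high half. */
int stale_window(int hi_first, uint32_t lo, uint32_t hi)
{
    struct fake_rte r = { RTE_MASK_BIT, 0, hi, 0 };  /* starts masked */

    if (hi_first) {
        write_hi(&r, hi);
        write_lo(&r, lo);
    } else {
        write_lo(&r, lo);
        write_hi(&r, hi);
    }
    return r.unmasked_with_stale_hi;
}
```

Writing the low half first with the mask bit clear opens the window, and writing the high half first closes it. If the new low half keeps the entry masked, either order is harmless.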