x86/AMD: apply workaround for AMD F16h erratum 792
The workaround for this erratum will only be present in BIOSes spun from
January 2014 onwards, but initial production parts already shipped in
2013. Since this leaves a coverage hole, we should carry the fix in
software, in case the BIOS does not do the right thing or an old BIOS
is in use.
Description:
The processor does not ensure that the DRAM scrub read/write sequence is
atomic with respect to accesses to the CC6 save state area. Therefore, if
a concurrent scrub read/write access is made to the same address, the
entry may appear as if it were not written. This quirk applies to Fam16h
models 00h-0Fh.
See "Revision Guide" for AMD F16h models 00h-0fh, document 51810 rev.
3.04, Nov 2013.
Equivalent Linux patch link:
http://marc.info/?l=linux-kernel&m=139066012217149&w=2
Tested the patch on a Fam16h server platform; it works fine.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Reviewed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Corrected checking for boot CPU. Made warning message conditional.
Compacted warning message text. Moved comment to commit message.
Jan Beulich [Thu, 13 Feb 2014 09:16:13 +0000 (10:16 +0100)]
x86/domctl: don't ignore errors from vmce_restore_vcpu()
What started out as a simple cleanup patch (eliminating the redundant
check of domctl->cmd before copying back the output data) revealed a
bug in the handling of XEN_DOMCTL_set_ext_vcpucontext.
Fix this, retaining the cleanup.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: af172d655c3900822d1f710ac13ee38ee9d482d2
master date: 2014-02-04 09:22:12 +0100
Andrew Cooper [Wed, 22 Jan 2014 17:47:21 +0000 (17:47 +0000)]
libxc: Fix out-of-memory error handling in xc_cpupool_getinfo()
Avoid freeing info then returning it to the caller.
This is XSA-88.
Coverity-ID: 1056192
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d883c179a74111a6804baf8cb8224235242a88fc)
libvchan: Fix handling of invalid ring buffer indices
The remote (hostile) process can set the ring buffer indices to any value
at any time. If that happens, the computed "buffer space" (either for
writing data, or data ready for reading) can be negative or greater than
the buffer size. This ends up as a buffer overflow in the second memcpy
inside do_send/do_recv.
Fix this by introducing new available bytes accessor functions
raw_get_data_ready and raw_get_buffer_space which are robust against
mad ring states, and only return sanitised values.
Proof sketch of correctness:
Now {rd,wr}_{cons,prod} are only ever used in the raw available bytes
functions, and in do_send and do_recv.
The raw available bytes functions do unsigned arithmetic on the
returned values. If the result is "negative" or too big it will be
>ring_size (since we used unsigned arithmetic). Otherwise the result
is a positive in-range value representing a reasonable ring state, in
which case we can safely convert it to int (as the rest of the code
expects).
do_send and do_recv immediately mask the ring index value with the
ring size. The result is always going to be plausible. If the ring
state has become mad, the worst case is that our behaviour is
inconsistent with the peer's ring pointer. I.e. we read or write to
arguably-incorrect parts of the ring - but always parts of the ring.
And of course if a peer misoperates the ring they can achieve this
effect anyway.
So the security problem is fixed.
This is XSA-86.
(The patch is essentially Ian Jackson's work, although parts of the
commit message are by Marek.)
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 2efcb0193bf3916c8ce34882e845f5ceb1e511f7
master date: 2014-02-06 16:44:41 +0100
Jan Beulich [Thu, 6 Feb 2014 16:36:40 +0000 (17:36 +0100)]
flask: fix reading strings from guest memory
Since the string size is being specified by the guest, we must range
check it properly before doing allocations based on it. While for the
two cases that are exposed only to trusted guests (via policy
restriction) this just uses an arbitrary upper limit (PAGE_SIZE), for
the FLASK_[GS]ETBOOL case (which any guest can use) the upper limit
gets enforced based on the longest name across all boolean settings.
This is XSA-84.
Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 6c79e0ab9ac6042e60434c02e1d99b0cf0cc3470
master date: 2014-02-06 16:33:50 +0100
Frediano Ziglio [Thu, 30 Jan 2014 08:01:01 +0000 (09:01 +0100)]
mce: fix race condition in mctelem_xchg_head
The function (mctelem_xchg_head()) used to exchange mce telemetry
list heads is racy. It may write to the head twice, with the second
write linking to an element in the wrong state.
Consider two threads: T1 inserting onto the committed list, and T2
trying to consume it.
1. T1 starts inserting an element (A), sets prev pointer (mcte_prev).
2. T1 is interrupted after the cmpxchg succeeded.
3. T2 gets the list and changes element A and updates the commit list
head.
4. T1 resumes, reads the prev pointer again and compares it with the
result of the cmpxchg, which succeeded, but in the meantime prev changed
in memory.
5. T1 thinks the cmpxchg failed and goes around the loop again,
linking the head to A again.
To solve the race, use a temporary variable for the prev pointer.
*linkp (which points to a field in the element) must be updated before
the cmpxchg(), as after a successful cmpxchg the element might be
immediately removed and reinitialized.
The wmb() prior to the cmpxchgptr() call is not necessary, since the
latter is already a full memory barrier. This wmb() is thus removed.
Andrew Cooper [Thu, 30 Jan 2014 08:00:09 +0000 (09:00 +0100)]
dbg_rw_guest_mem: need to call put_gfn in error path
Using a 1G hvm domU (in grub) and gdbsx:
(gdb) set arch i8086
warning: A handler for the OS ABI "GNU/Linux" is not built into this configuration
of GDB. Attempting to continue with the default i8086 settings.
The target architecture is assumed to be i8086
(gdb) target remote localhost:9999
Remote debugging using localhost:9999
Remote debugging from host 127.0.0.1
0x0000d475 in ?? ()
(gdb) x/1xh 0x6ae9168b
Will reproduce this bug.
With a debug=y build you will get:
Assertion '!preempt_count()' failed at preempt.c:37
For a debug=n build you will get a dom0 VCPU hung (at some point) in:
Andrew Cooper [Fri, 17 Jan 2014 15:42:37 +0000 (16:42 +0100)]
kexec: prevent deadlock on reentry to the crash path
In some cases, such as suffering a queued-invalidation timeout while
performing an iommu_crash_shutdown(), Xen can end up reentering the crash
path. Previously, this would result in a deadlock in one_cpu_only(), as the
test_and_set_bit() would fail.
The crash path is not reentrant, and even if it could be made to be so, it is
almost certain that we would fall over the same reentry condition again.
The new code can distinguish a reentry case from multiple cpus racing down the
crash path. In the case that a reentry is detected, return back out to the
nested panic() call, which will maybe_reboot() on our behalf. This requires a
bit of return plumbing back up to kexec_crash().
While fixing this deadlock, also fix up a minor niggle seen recently in a
XenServer crash report. The report was from a Bank 8 MCE which had managed
to crash on all cpus at once. The result was a lot of stack traces with cpus
in kexec_common_shutdown(), which was in fact the inlined version of
one_cpu_only(). The kexec crash path is not a hotpath, so we can easily
afford to prevent inlining for the sake of clarity in the stack traces.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com>
master commit: 470f58c159410b280627c2ea7798ea12ad93bd7c
master date: 2013-11-27 15:13:48 +0100
Paul Durrant [Fri, 17 Jan 2014 15:41:38 +0000 (16:41 +0100)]
x86/VT-x: Disable MSR intercept for SHADOW_GS_BASE
Intercepting this MSR is pointless - The swapgs instruction does not cause a
vmexit, so the cached result of this is potentially stale after the next guest
instruction. It is correctly saved and restored on vcpu context switch.
Furthermore, 64bit Windows writes to this MSR on every thread context switch,
so interception causes a substantial performance hit.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Acked-by: Jun Nakajima <jun.nakajima@intel.com>
master commit: a82e98d473fd212316ea5aa078a7588324b020e5
master date: 2013-11-15 11:02:17 +0100
Andrew Cooper [Fri, 17 Jan 2014 15:39:08 +0000 (16:39 +0100)]
x86/ats: Fix parsing of 'ats' command line option
This is really a boolean_param() hidden inside a hand-coded attempt to
replicate boolean_param(), which misses the 'no-' prefix semantics
expected with Xen boolean parameters.
AMD/IOMMU: fix infinite loop due to ivrs_bdf_entries larger than 16-bit value
Certain AMD systems can have up to 0x10000 ivrs_bdf_entries.
However, the loop variable (bdf) is declared as u16, which causes an
infinite loop when parsing the IOMMU event log on an IO_PAGE_FAULT event.
This patch changes the variable to u32 instead.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81b1c7de2339d2788352b162057e70130803f3cf
master date: 2014-01-07 15:09:42 +0100
Andrew Cooper [Mon, 13 Jan 2014 15:00:28 +0000 (16:00 +0100)]
VTD/DMAR: free() correct pointer on error from acpi_parse_one_atsr()
Free the allocated structure rather than the ACPI table ATS entry.
On further analysis, there is another memory leak. acpi_parse_dev_scope()
could allocate scope->devices, and return with -ENOMEM. All callers of
acpi_parse_dev_scope() would then free the underlying structure, losing
the pointer.
These errors can only actually be reached through acpi_parse_dev_scope()
(which passes type = DMAR_TYPE), but I am quite surprised Coverity didn't spot
it.
Coverity-ID: 1146949
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62d33ca1048f4e08eaeb026c7b79239b4605b636
master date: 2014-01-07 14:59:31 +0100
If {copy_to,clear}_guest_offset() fails, we would leak the domain mappings
for l4 through l1.
Fixing this requires having conditional unmaps on the faulting path, which in
turn requires explicitly initialising the pointers to NULL because of the
early ENOMEM exit.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
master commit: 0725f326358cbb2ba7f9626976e346b963d74c37
master date: 2013-12-17 16:38:07 +0100
Jan Beulich [Mon, 13 Jan 2014 14:57:07 +0000 (15:57 +0100)]
x86: fix linear page table construction in alloc_l2_table()
Slot 0 got updated when slot 3 was meant. The mistake was hidden by
create_pae_xen_mappings() correcting things immediately afterwards
(i.e. before the new entries could get used the first time).
Reported-by: CHENG Yueqiang <yqcheng.2008@phdis.smu.edu.sg>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Fri, 10 Jan 2014 10:44:10 +0000 (11:44 +0100)]
x86/p2m: restrict auditing to debug builds
... since iterating through all of a guest's pages may take unduly
long.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
master commit: 4476d05cf5e8d3880f88ce16649766df67e0791e
master date: 2013-12-13 15:06:11 +0100
Jan Beulich [Fri, 10 Jan 2014 10:42:35 +0000 (11:42 +0100)]
x86/PV: don't commit debug register values early in arch_set_info_guest()
They're being taken care of later (via set_debugreg()), and temporarily
copying them into struct vcpu means that bad values may end up getting
loaded during context switch if the vCPU is already running and the
function errors out between the premature and real commit step, leading
to the same issue that XSA-12 dealt with.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 398c39b6c18d0b55acfc88f5ee75b3a793e6eeec
master date: 2013-12-11 10:33:19 +0100
Jan Beulich [Fri, 10 Jan 2014 10:41:53 +0000 (11:41 +0100)]
x86/cpuidle: publish new states only after fully initializing them
Since state information coming from Dom0 can arrive at any time, on
any CPU, we ought to make sure that a new state is fully initialized
before the target CPU might be using it.
Once touching that code, also do minor cleanup: A missing (but benign)
"break" and some white space adjustments.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Liu Jinsong <jinsong.liu@intel.com>
master commit: 4ca6f9f0377a30755a299cc60a6d44ab6c3b34d0
master date: 2013-12-11 10:30:02 +0100
Roger Pau Monne [Fri, 22 Nov 2013 11:54:09 +0000 (12:54 +0100)]
xl: fixes for do_daemonize
Fix usage of CHK_ERRNO in do_daemonize and also remove the usage of a
bogus for(;;).
Coverity-ID: 1130516 and 1130520
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
(cherry picked from commit ed8c9047f6fc6d28fc27d37576ec8c8c1be68efe)
Roger Pau Monne [Fri, 22 Nov 2013 11:54:08 +0000 (12:54 +0100)]
libxl: fix fd check in libxl__spawn_local_dm
Checking logfile_w against -1 to detect failure is no longer correct,
because libxl__create_qemu_logfile will now return ERROR_FAIL (which is
-3) on failure.
While there also add an error check for opening /dev/null.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3b88d95e9c0a5ff91d5b60e94d81f1982af57e7f)
In the original function, mmap() could be called with a length of -1 if the
second lseek failed and the caller had not provided max_size.
While fixing up this error, improve the logging of other error paths. I know
from personal experience that debugging failures of this function is rather
difficult given only "xc_dom_malloc_filemap: failed (on file <somefile>)" in
the logs.
Andrew Cooper [Mon, 25 Nov 2013 11:05:47 +0000 (11:05 +0000)]
tools/xc_restore: Initialise console and store mfns
If the console or store mfn chunks are not present in the migration stream,
stack junk gets reported for the mfns.
XenServer had a very hard to track down VM corruption issue caused by
exactly this. Xenconsoled would connect to a junk mfn and increment the
ring pointer if the junk happened to look like a valid gfn.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
CC: Keir Fraser <keir@xen.org>
CC: Jan Beulich <JBeulich@suse.com>
(cherry picked from commit 0792426b798fd3b39909d618cf8fe8bac30594f4)
Jan Beulich [Tue, 10 Dec 2013 15:21:57 +0000 (16:21 +0100)]
IOMMU: clear "don't flush" override on error paths
Both xenmem_add_to_physmap() and iommu_populate_page_table() have
an error path that fails to clear that flag, thus suppressing further
flushes on the respective pCPU.
In iommu_populate_page_table() also slightly re-arrange code to avoid
the false impression of the flag in question being guarded by a
domain's page_alloc_lock.
This is CVE-2013-6400 / XSA-80.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 552b7fcb9a70f1d4dd0e0cd5fb4d3d9da410104a
master date: 2013-12-10 16:10:37 +0100
Andrew Cooper [Mon, 9 Dec 2013 13:47:43 +0000 (14:47 +0100)]
x86/boot: fix BIOS memory corruption on certain IBM systems
IBM System x3530 M4 BIOSes (including the latest available at the time of this
patch) will corrupt a byte at physical address 0x105ff1 to the value of 0x86
if %esp has the value 0x00080000 when issuing an `int $0x15 (ax=0xec00)` to
inform the system about our intended operating mode.
Xen gets unhappy when the bootloader has placed its .text section over
this specific region of RAM.
After dropping into 16bit mode, clear all 32 bits of %esp, and for the BIOS
call already documented to be affected by BIOS bugs clear all GPRs.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
Release-acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1ed76797439e384de18fcd6810bd4743d4f38b1e
master date: 2013-12-06 11:28:00 +0100
Daniel Kiper [Mon, 9 Dec 2013 13:47:07 +0000 (14:47 +0100)]
x86: fix early boot command line parsing
There is no reliable way to encode the NUL character as a character, so
encode it as a number instead. Read:
http://sourceware.org/binutils/docs/as/Characters.html.
Octal and hex encoding do not work on at least one system (GNU assembler
version 2.22 (x86_64-linux-gnu) using BFD version (GNU Binutils for Debian) 2.22).
Without this fix, e.g. the no-real-mode option at the end of the xen.gz
command line is not detected.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: dc37e0bfffc673f4bdce1d69ad86098bfb0ab531
master date: 2013-12-04 13:26:37 +0100
Apart from the Coverity-detected lock order reversal (a domain's
page_alloc_lock taken with the heap lock already held), calling
put_page() with heap_lock is a bad idea too (as a possible descendant
from put_page() is free_heap_pages(), which wants to take this very
lock).
From all I can tell the region over which heap_lock was held was far
too large: All we need to protect are the call to mark_page_offline()
and reserve_heap_page() (and I'd even put under question the need for
the former). Hence by slightly re-arranging the if/else-if chain we
can drop the lock much earlier, at once no longer covering the two
put_page() invocations.
Once at it, do a little bit of other cleanup: Put the "pod_replace"
code path inline rather than at its own label, and drop the effectively
unused variable "ret".
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Acked-by: Keir Fraser <keir@xen.org>
master commit: d4837a56da4a59259dd0cf9f3bdc073159d81d7a
master date: 2013-12-03 12:40:57 +0100
The ptr calculation must take the offset into the page into account
when ptr is valid.
Reported regression on NetBSD's port-xen with last known working libxen
being rev 2.9. This corrupts the kernel symbol table when the table is
not loaded on a page boundary.
Issue was tracked down by FastIce and Jeff Rizzo. See also
http://mail-index.netbsd.org/port-xen/2013/10/16/msg008088.html
Signed-off-by: Jean-Yves Migeon <jym@NetBSD.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: cb08944a482a5e80a3ff1113f0735761cc4c6cb8
master date: 2013-11-29 11:07:01 +0000
Feng Wu [Mon, 9 Dec 2013 13:45:00 +0000 (14:45 +0100)]
x86: properly handle MSI-X unmask operation from guests
For a pass-through device with MSI-X capability, when the guest tries
to unmask an MSI-X interrupt for the passed-through device, Xen
doesn't clear the mask bit for MSI-X in hardware in the following
scenario, which causes network disconnection:
1. Guest masks the MSI-X interrupt
2. Guest updates the address and data for it
3. Guest unmasks the MSI-X interrupt (this is the problematic step)
In step 3 above, Xen doesn't handle it well. When the guest tries to
unmask the MSI-X interrupt, it traps into Xen; Xen just returns to qemu
if it notices that the address or data has been modified by the guest
before, and qemu then updates Xen with the latest value of address/data
by hypercall. However, in this whole process the MSI-X unmask operation
is missing, which means Xen doesn't clear the mask bit in hardware for
the MSI-X interrupt, so it remains disabled; that is why the network
connection is lost.
This patch fixes this issue.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Only latch the address if the guest really is unmasking the entry.
Clean up the entire change.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 74fd0036deb585a139b63b26db025805ecedc37a
master date: 2013-11-27 15:15:43 +0100
Liu Jinsong [Mon, 9 Dec 2013 13:43:34 +0000 (14:43 +0100)]
VMX: fix cr0.cd handling
This patch fixes the XSA-60 security hole:
1. For a guest without VT-d, and for a guest with VT-d but snooped, Xen
needs to do nothing, since the hardware snoop mechanism ensures cache
coherency.
2. For a guest with VT-d but non-snooped, cache coherency cannot be
guaranteed by the h/w snoop, therefore UC type needs to be emulated for
the guest:
2.1) if it runs with Intel EPT, set the guest IA32_PAT fields to UC so
that all guest memory types are UC.
2.2) if it runs with shadow, drop all shadows so that any new ones would
be created on demand with UC.
This patch also fixes a bug in the shadow cr0.cd handling. The current
shadow code has a small window between cache flush and TLB invalidation,
resulting in possible cache pollution. This patch pauses the vcpus so
that no vcpu context gets involved in that window.
This is CVE-2013-2212 / XSA-60.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jun Nakajima <jun.nakajima@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 62652c00efa55fb45374bcc92f7d96fc411aebb2
master date: 2013-11-06 10:12:36 +0100
Liu Jinsong [Mon, 9 Dec 2013 13:41:44 +0000 (14:41 +0100)]
VMX: remove the problematic set_uc_mode logic
The XSA-60 security hole comes from the problematic vmx_set_uc_mode.
This patch removes the vmx_set_uc_mode logic; it will be replaced by
the PAT approach in a later patch.
This is CVE-2013-2212 / XSA-60.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Acked-by: Jun Nakajima <jun.nakajima@intel.com>
master commit: 1c84d046735102e02d2df454ab07f14ac51f235d
master date: 2013-11-06 10:12:00 +0100
Liu Jinsong [Mon, 9 Dec 2013 13:40:51 +0000 (14:40 +0100)]
VMX: disable EPT when !cpu_has_vmx_pat
Recently Oracle developers found a DoS-class Xen security issue,
designated XSA-60; please refer to
http://xenbits.xen.org/xsa/advisory-60.html
Basically it concerns how the guest's cr0.cd setting is handled, which
in some environments consumes so much time that it results in DoS-like
behavior.
This is a preparatory patch for fixing XSA-60. A later patch will fix
XSA-60 via PAT in the Intel EPT case, which depends on cpu_has_vmx_pat.
This is CVE-2013-2212 / XSA-60.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Acked-by: Jun Nakajima <jun.nakajima@intel.com>
master commit: c13b0d65ddedd74508edef5cd66defffe30468fc
master date: 2013-11-06 10:11:18 +0100
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Tim Deegan <tim@xen.org>
Use _SEGMENT_* instead of plain numbers and adjust a comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6ed4bfbabd487b41021caa7ed03cee1f00ecbabf
master date: 2013-11-26 09:54:21 +0100
Liu Jinsong [Mon, 2 Dec 2013 14:56:09 +0000 (15:56 +0100)]
x86/xsave: fix nonlazy state handling
Nonlazy xstates should be xsaved each time vcpu_save_fpu() runs.
Operations on nonlazy xstates do not trigger a #NM exception, so they
must be restored whenever the vcpu is scheduled in and saved whenever
it is scheduled out.
Currently this bug affects the AMD LWP feature, and later the Intel MPX
feature. With the bugfix, both LWP and MPX work fine.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Furthermore, during restore we also need to set nonlazy_xstate_used
according to the incoming accumulated XCR0.
Also adjust the changes to i387.c such that there won't be a pointless
clts()/stts() pair.
David Vrabel [Mon, 2 Dec 2013 14:55:16 +0000 (15:55 +0100)]
x86/crash: disable the watchdog NMIs on the crashing cpu
nmi_shootdown_cpus() is called during a crash to park all the other
CPUs. This changes the NMI trap handlers which means there's no point
in having the watchdog still running.
This also disables the watchdog before executing any crash kexec image
and prevents the image from receiving unexpected NMIs.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
PVOps Linux as a kexec image shoots itself in the foot otherwise.
On a Core2 system, Linux declares a firmware bug and tries to invert some bits
in the performance counter register. It ends up setting the number of retired
instructions to generate another NMI to fewer instructions than the NMI
interrupt path itself, and ceases to make any useful progress.
The call to disable_lapic_nmi_watchdog() must be this late into the kexec path
to be sure that this cpu is the one which will execute the kexec image.
Otherwise there are race conditions where the NMIs might be disabled on the
wrong cpu, resulting in the kexec image still receiving NMIs.
x86/hvm: reset TSC to 0 after domain resume from S3
Host S3 implicitly resets the host TSC to 0, but the TSC offset for HVM
domains is not recalculated when they resume, causing it to go
negative. In a Linux guest using the tsc clocksource, this results in
a hang after wrapping back to positive values, since the tsc
clocksource implementation expects the TSC to have been reset.
Jan Beulich [Mon, 2 Dec 2013 14:53:50 +0000 (15:53 +0100)]
x86: consider modules when cutting off memory
The code in question runs after module ranges have already been removed
from the E820 table, so when determining the new maximum page/PDX we
need to explicitly take them into account.
Furthermore we need to round up the ending addresses here, in order to
fully cover eventual partial trailing pages.
Nate Studer [Mon, 2 Dec 2013 14:51:00 +0000 (15:51 +0100)]
credit: Update other parameters when setting tslice_ms
Add a utility function to update the rest of the timeslice
accounting fields when updating the timeslice of the
credit scheduler, so that capped CPUs behave correctly.
Before this patch changing the timeslice to a value higher
than the default would result in a domain not utilizing
its full capacity and changing the timeslice to a value
lower than the default would result in a domain exceeding
its capacity.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 19f027cc5daff4a37fd0a28bca2514c721852dd0
master date: 2013-11-27 09:00:41 +0100
Ian Jackson [Tue, 3 Sep 2013 12:41:46 +0000 (13:41 +0100)]
libxl: Do not generate short block in libxl__datacopier_prefixdata
libxl__datacopier_prefixdata would prepend a deliberately short block
(not just a half-full one, but one with a short buffer) to the
dc->bufs queue. However, this is wrong because datacopier_readable
will find it and try to continue to fill it up.
Jan Beulich [Mon, 18 Nov 2013 13:00:34 +0000 (14:00 +0100)]
VT-d: fix TLB flushing in dma_pte_clear_one()
The third parameter of __intel_iommu_iotlb_flush() indicates whether
the entry to be flushed was a present one. A few lines earlier, we
bailed if !dma_pte_present(*pte), so there's no need to check the flag
here again - we can simply always pass TRUE here.
This is XSA-78.
Suggested-by: Cheng Yueqiang <yqcheng.2008@phdis.smu.edu.sg>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 85c72f9fe764ed96f5c149efcdd69ab7c18bfe3d
master date: 2013-11-18 13:55:55 +0100
Jan Beulich [Fri, 15 Nov 2013 10:47:41 +0000 (11:47 +0100)]
x86: eliminate has_arch_mmios()
... as being generally insufficient: Either has_arch_pdevs() or
cache_flush_permitted() should be used (in particular, it is
insufficient to consider MMIO ranges alone - I/O port ranges have the
same requirements if available to a guest).
Jan Beulich [Fri, 15 Nov 2013 10:46:26 +0000 (11:46 +0100)]
VMX: don't crash processing 'd' debug key
There's a window during scheduling where "current" and the active VMCS
may disagree: The former gets set much earlier than the latter. Since
both vmx_vmcs_enter() and vmx_vmcs_exit() immediately return when the
subject vCPU is "current", accessing VMCS fields would, depending on
whether there is any currently active VMCS, either read wrong data, or
cause a crash.
Going forward we might want to consider reducing the window during
which vmx_vmcs_enter() might fail (e.g. doing a plain __vmptrld() when
v->arch.hvm_vmx.vmcs != this_cpu(current_vmcs) but arch_vmx->active_cpu
== -1), but that would add complexities (acquiring and - more
importantly - properly dropping v->arch.hvm_vmx.vmcs_lock) that don't
look worthwhile adding right now.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 58929248461ecadce13e92eb5a5d9ef718a7c88e
master date: 2013-11-12 11:52:19 +0100
Jan Beulich [Fri, 15 Nov 2013 10:44:17 +0000 (11:44 +0100)]
x86/EFI: make trampoline allocation more flexible
Certain UEFI implementations reserve all memory below 1Mb at boot time,
making it impossible to properly allocate the chunk necessary for the
trampoline. Fall back to simply grabbing a chunk from EfiBootServices*
regions immediately prior to calling ExitBootServices().
Jan Beulich [Fri, 15 Nov 2013 10:42:36 +0000 (11:42 +0100)]
x86/HVM: 32-bit IN result must be zero-extended to 64 bits
Just like for all other operations with 32-bit operand size.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
x86/HVM: 32-bit IN result must be zero-extended to 64 bits (part 2)
Just spotted a counterpart of what commit 9d89100b (same title) dealt
with.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Keir Fraser <keir@xen.org>
master commit: 9d89100ba8b7b02adb7c2e89ef7c81e734942e7c
master date: 2013-11-05 14:51:53 +0100
master commit: 1e521eddeb51a9f1bf0e4dd1d17efc873eafae41
master date: 2013-11-15 11:01:49 +0100
Jan Beulich [Fri, 15 Nov 2013 10:37:39 +0000 (11:37 +0100)]
x86/ACPI/x2APIC: guard against out of range ACPI or APIC IDs
Other than for the legacy APIC, the x2APIC MADT entries have valid
ranges possibly extending beyond what our internal arrays can handle,
and hence we need to guard ourselves against corrupting memory here.
Jan Beulich [Fri, 15 Nov 2013 10:36:39 +0000 (11:36 +0100)]
x86: refine address validity checks before accessing page tables
In commit 40d66baa ("x86: correct LDT checks") and d06a0d71 ("x86: add
address validity check to guest_map_l1e()") I didn't really pay
attention to the fact that these checks would better be done before the
paging_mode_translate() ones, as there's also no equivalent check down
the shadow code paths involved here (at least not up to the first use
of the address), and such generic checks shouldn't really be done by
particular backend functions anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 343cad8c70585c4dba8afc75e1ec1b7610605ab2
master date: 2013-10-28 12:00:36 +0100
David Vrabel [Fri, 15 Nov 2013 10:34:43 +0000 (11:34 +0100)]
sched: fix race between sched_move_domain() and vcpu_wake()
From: David Vrabel <david.vrabel@citrix.com>
sched_move_domain() changes v->processor for all the domain's VCPUs.
If another domain, softirq etc. triggers a simultaneous call to
vcpu_wake() (e.g., by setting an event channel as pending), then
vcpu_wake() may lock one schedule lock and try to unlock another.
vcpu_schedule_lock() attempts to handle this but only does so for the
window between reading the schedule_lock from the per-CPU data and the
spin_lock() call. This does not help with sched_move_domain()
changing v->processor between the calls to vcpu_schedule_lock() and
vcpu_schedule_unlock().
Fix the race by taking the schedule_lock for v->processor in
sched_move_domain().
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Use vcpu_schedule_lock_irq() (which now returns the lock) to properly
retry the locking should the lock to be used have changed in the course
of acquiring it (issue pointed out by George Dunlap).
Add a comment explaining the state after the v->processor adjustment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: ef55257bc81204e34691f1c2aa9e01f2d0768bdd
master date: 2013-10-14 08:58:31 +0200
Jan Beulich [Fri, 15 Nov 2013 10:34:01 +0000 (11:34 +0100)]
fix locking in cpu_disable_scheduler()
So commit eedd6039 ("scheduler: adjust internal locking interface")
uncovered - by now using proper spin lock constructs - a bug after all:
When bringing down a CPU, cpu_disable_scheduler() gets called with
interrupts disabled, and hence the use of vcpu_schedule_lock_irq() was
never really correct (i.e. the caller ended up with interrupts enabled
despite having disabled them explicitly).
Fixing this however surfaced another problem: The call path
vcpu_migrate() -> evtchn_move_pirqs() wants to acquire the event lock,
which however is a non-IRQ-safe one, and hence check_lock() doesn't
like this lock to be acquired when interrupts are already off. As we're
in stop-machine context here, getting things wrong wrt interrupt state
management during lock acquire/release is out of the question though, so
the simple solution to this appears to be to just suppress spin lock
debugging for the period of time while the stop machine callback gets
run.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 41a0cc9e26160a89245c9ba3233e3f70bf9cd4b4
master date: 2013-10-29 09:57:14 +0100
Jan Beulich [Fri, 15 Nov 2013 10:32:51 +0000 (11:32 +0100)]
scheduler: adjust internal locking interface
Make the locking functions return the lock pointers, so they can be
passed to the unlocking functions (which in turn can check that the
lock is still actually providing the intended protection, i.e. the
parameters determining which lock is the right one didn't change).
Further use proper spin lock primitives rather than open coded
local_irq_...() constructs, so that interrupts can be re-enabled as
appropriate while spinning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: eedd60391610629b4e8a2e8278b857ff884f750d
master date: 2013-10-14 08:57:56 +0200
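The returned-lock pattern described above can be sketched as follows; `spinlock_t`, the trivial lock ops, and `struct vcpu_sketch` are simplified stand-ins for Xen's real types, not the actual scheduler code:

```c
#include <stddef.h>

typedef struct { int held; } spinlock_t;

struct vcpu_sketch {
    spinlock_t *schedule_lock; /* may be re-pointed by sched_move_domain() */
};

static void spin_lock(spinlock_t *l)   { l->held = 1; }
static void spin_unlock(spinlock_t *l) { l->held = 0; }

/* Return the lock actually acquired, so the caller can hand the same
 * pointer to the unlock side even if v->schedule_lock changed while
 * we were spinning. */
static spinlock_t *vcpu_schedule_lock_sketch(struct vcpu_sketch *v)
{
    for ( ; ; )
    {
        spinlock_t *lock = v->schedule_lock;
        spin_lock(lock);
        if ( lock == v->schedule_lock ) /* still the right lock? */
            return lock;
        spin_unlock(lock);              /* raced with a move; retry */
    }
}
```

The re-check after `spin_lock()` is what closes the window exploited by the sched_move_domain() race fixed in the previous entry.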
Ian Jackson [Wed, 1 May 2013 15:56:54 +0000 (16:56 +0100)]
libxl: Fix bug in libxl_cdrom_insert, make more robust against bad xenstore data
libxl_cdrom_insert was failing to initialize the backend type,
resulting in the wrong default backend. The result was not only that
the CD was not inserted properly, but also that some improper xenstore
entries were created, causing further block commands to fail.
This patch fixes the bug by setting the disk backend type based on the
type of the existing device.
It also makes the system more robust by checking to see that it has
got a valid path before proceeding to write a partial xenstore entry.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit c3556e2a1aee3c9b7dda5d57e85e8867fff1b9da)
Conflicts:
tools/libxl/libxl.c Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Mon, 11 Nov 2013 08:18:59 +0000 (09:18 +0100)]
nested VMX: VMLAUNCH/VMRESUME emulation must check permission first thing
Otherwise uninitialized data may be used, leading to crashes.
This is CVE-2013-4551 / XSA-75.
Reported-and-tested-by: Jeff Zimmerman <Jeff_Zimmerman@McAfee.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-and-tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 4e87bc5b03e05123ba5c888f77969140c8ebd1bf
master date: 2013-11-11 09:15:04 +0100
Ian Jackson [Tue, 29 Oct 2013 15:45:53 +0000 (15:45 +0000)]
tools: xenstored: if the reply is too big then send E2BIG error
This fixes the issue for both C and ocaml xenstored, however only the ocaml
xenstored is vulnerable in its default configuration.
Adding a new error appears to be safe, since both libxenstore and the Linux
driver at least treat an unknown error code as EINVAL.
This is XSA-72 / CVE-2013-4416.
Original ocaml patch by Jerome Maloberti <jerome.maloberti@citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
(cherry picked from commit 8b2c441a1b53a43a38b3c517e28f239da3349872)
(cherry picked from commit d88ac91ef2f0a93ea9359a8133405dbd78abc89b)
Jan Beulich [Tue, 22 Oct 2013 10:07:40 +0000 (12:07 +0200)]
x86-64: check for canonical address before doing page walks
... as there doesn't really exist any valid mapping for them.
Particularly in the case of do_page_walk() this also avoids returning
non-NULL for such invalid input.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 6fd9b0361e2eb5a7f12bdd5cbf7e42c0d1937d26
master date: 2013-10-11 09:31:16 +0200
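For reference, the canonical-address rule on x86-64 with 48-bit virtual addressing is that bits 63:47 must all equal bit 47. A small self-contained predicate (the name is illustrative; Xen's actual helper may be written differently):

```c
#include <stdint.h>

/* An address is canonical iff bits 63:47 are all ones or all zeroes,
 * i.e. the 17 top bits are either 0 or 0x1ffff (48-bit addressing). */
static int is_canonical_address(uint64_t addr)
{
    uint64_t top = addr >> 47; /* bits 63:47 */
    return top == 0 || top == 0x1ffff;
}
```

Rejecting non-canonical inputs up front is what keeps do_page_walk() from fabricating a non-NULL result for an address that cannot have any valid mapping.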
Jan Beulich [Tue, 22 Oct 2013 10:06:43 +0000 (12:06 +0200)]
x86: add address validity check to guest_map_l1e()
Just like for guest_get_eff_l1e() this prevents accessing as page
tables (and with the wrong memory attribute) internal data inside Xen
happening to be mapped with 1Gb pages.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: d06a0d715ec1423b6c42141ab1b0ff69a3effb56
master date: 2013-10-11 09:29:43 +0200
Jan Beulich [Tue, 22 Oct 2013 10:05:45 +0000 (12:05 +0200)]
x86: correct LDT checks
- MMUEXT_SET_LDT should behave as similarly to the LLDT instruction as
possible: fail only if the base address is non-canonical
- instead LDT descriptor accesses should fault if the descriptor
address ends up being non-canonical (by ensuring this we at once
avoid reading an entry from the mach-to-phys table and consider it a
page table entry)
- fault propagation on using LDT selectors must distinguish #PF and #GP
(the latter must be raised for a non-canonical descriptor address,
which also applies to several other uses of propagate_page_fault(),
and hence the problem is being fixed there)
- map_ldt_shadow_page() should properly wrap addresses for 32-bit VMs
At once remove the odd invocation of map_ldt_shadow_page() from the
MMUEXT_SET_LDT handler: There's nothing really telling us that the
first LDT page is going to be preferred over others.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 40d66baa46ca8a9ffa6df3e063a967d08ec92bcf
master date: 2013-10-11 09:28:26 +0200
Andrew Cooper [Tue, 22 Oct 2013 10:04:01 +0000 (12:04 +0200)]
x86/percpu: Force INVALID_PERCPU_AREA into the non-canonical address region
This causes accidental uses of per_cpu() on a pcpu with an INVALID_PERCPU_AREA
to result in a #GP fault for attempting to access the middle of the non-canonical
virtual address region.
This is preferable to the current behaviour, where incorrect use of per_cpu()
will result in an effective NULL structure dereference which has security
implications in the context of PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 7cfb0053629c4dd1a6f01dc43cca7c0c25b8b7bf
master date: 2013-10-04 12:24:34 +0200
Andrew Cooper [Tue, 22 Oct 2013 10:03:03 +0000 (12:03 +0200)]
x86/idle: Fix get_cpu_idle_time()'s interaction with offline pcpus
Checking for "idle_vcpu[cpu] != NULL" is insufficient protection against
offline pcpus. From a hypercall, vcpu_runstate_get() will determine "v !=
current", and try to take the vcpu_schedule_lock(). This will try to look up
per_cpu(schedule_data, v->processor) and promptly suffer a NULL structure
dereference as v->processor's __per_cpu_offset is INVALID_PERCPU_AREA.
****************************************
Panic on CPU 11:
...
get_cpu_idle_time() has been updated to correctly deal with offline pcpus
itself by returning 0, in the same way as it would if it was missing the
idle_vcpu[] pointer.
In doing so, XENPF_getidletime needed updating to correctly retain its
described behaviour of clearing bits in the cpumap for offline pcpus.
As this crash can only be triggered with toolstack hypercalls, it is not a
security issue and just a simple bug.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 0aa27ce3351f7eb09d13e863a1d5f303086aa32a
master date: 2013-10-04 12:23:23 +0200
Matthew Daley [Thu, 10 Oct 2013 13:24:15 +0000 (15:24 +0200)]
x86: check segment descriptor read result in 64-bit OUTS emulation
When emulating such an operation from a 64-bit context (CS has long
mode set), and the data segment is overridden to FS/GS, the result of
reading the overridden segment's descriptor (read_descriptor) is not
checked. If it fails, data_base is left uninitialized.
This can lead to 8 bytes of Xen's stack being leaked to the guest
(implicitly, i.e. via the address given in a #PF).
Ignoring them generally implies using uninitialized data and, in all
but two of the cases dealt with here, potentially leaking hypervisor
stack contents to guests.
This is CVE-2013-4355 / XSA-63.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6bb838e7375f5b031e9ac346b353775c90de45dc
master date: 2013-09-30 14:17:46 +0200
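The shape of the fix is the standard one for an unchecked read: propagate the failure instead of consuming a value that was never written. A hedged sketch (the names and error convention are illustrative, not the actual Xen emulation code):

```c
#include <stdint.h>

/* Only assign the segment base on a successful descriptor read;
 * on failure, bail out so the caller can inject a fault rather
 * than use (and leak) uninitialized stack contents. */
static int read_seg_base(int read_ok, uint64_t read_base, uint64_t *data_base)
{
    if ( !read_ok )
        return -1;          /* propagate the failure upstream */
    *data_base = read_base; /* written only on success */
    return 0;
}
```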
Jan Beulich [Wed, 25 Sep 2013 08:55:42 +0000 (10:55 +0200)]
x86/xsave: initialize extended register state when guests enable it
Till now, when setting previously unset bits in XCR0 we wouldn't touch
the active register state, thus leaving in the newly enabled registers
whatever a prior user of it left there, i.e. potentially leaking
information between guests.
This is CVE-2013-1442 / XSA-62.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 63a75ba0de817d6f384f96d25427a05c313e2179
master date: 2013-09-25 10:41:25 +0200
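The core of the fix is identifying which XCR0 feature bits are transitioning from clear to set, since exactly those components' register state must be scrubbed rather than inherited from a prior user. A minimal sketch (the XSTATE constants match the architectural XCR0 bit assignments; the function name is illustrative):

```c
#include <stdint.h>

#define XSTATE_FP  (1ULL << 0) /* x87 state */
#define XSTATE_SSE (1ULL << 1) /* XMM registers */
#define XSTATE_YMM (1ULL << 2) /* upper halves of YMM registers */

/* Bits going 0 -> 1 in XCR0: the components whose register state
 * must be initialized before the guest can see them. */
static uint64_t xcr0_bits_needing_scrub(uint64_t old_xcr0, uint64_t new_xcr0)
{
    return new_xcr0 & ~old_xcr0;
}
```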
Jan Beulich [Thu, 12 Sep 2013 09:33:44 +0000 (11:33 +0200)]
x86/xsave: fix migration from xsave-capable to xsave-incapable host
With CPUID features suitably masked this is supposed to work, but was
completely broken (i.e. the case wasn't even considered when the
original xsave save/restore code was written).
First of all, xsave_enabled() wrongly returned the value of
cpu_has_xsave, i.e. not even taking into consideration attributes of
the vCPU in question. Instead this function ought to check whether the
guest ever enabled xsave support (by writing a [non-zero] value to
XCR0). As a result of this, a vCPU's xcr0 and xcr0_accum must no longer
be initialized to XSTATE_FP_SSE (since that's a valid value a guest
could write to XCR0), and the xsave/xrstor as well as the context
switch code need to suitably account for this (by always enforcing at
least this part of the state to be saved/loaded).
This involves undoing large parts of c/s 22945:13a7d1f7f62c ("x86: add
strictly sanity check for XSAVE/XRSTOR") - we need to cleanly
distinguish between hardware capabilities and vCPU used features.
Next both HVM and PV save code needed tweaking to not always save the
full state supported by the underlying hardware, but just the parts
that the guest actually used. Similarly the restore code should bail
not just on state being restored that the hardware cannot handle, but
also on inconsistent save state (inconsistent XCR0 settings or size of
saved state not in line with XCR0).
And finally the PV extended context get/set code needs to use slightly
different logic than the HVM one, as here we can't just key off of
xsave_enabled() (i.e. avoid doing anything if a guest doesn't use
xsave) because the tools use this function to determine host
capabilities as well as read/write vCPU state. The set operation in
particular needs to be capable of cleanly dealing with input that
consists of only the xcr0 and xcr0_accum values (if they're both zero
then no further data is required).
While for things to work correctly both sides (saving _and_ restoring
host) need to run with the fixed code, afaict no breakage should occur
if either side isn't up to date (other than the breakage that this
patch attempts to fix).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 4cc1344447a0458df5d222960f2adf1b65084fa8
master date: 2013-09-09 14:36:54 +0200
Jan Beulich [Thu, 12 Sep 2013 09:32:21 +0000 (11:32 +0200)]
x86/xsave: initialization improvements
- properly validate available feature set on APs
- also validate xsaveopt availability on APs
- properly indicate whether the initialization is on the BSP (we
shouldn't be using "cpu == 0" checks for this)
x86: allow guest to set/clear MSI-X mask bit (try 2)
Guest needs the ability to enable and disable MSI-X interrupts
by setting the MSI-X control bit, for a passed-through device.
Guest is allowed to write MSI-X mask bit only if Xen *thinks*
that mask is clear (interrupts enabled). If the mask is set by
Xen (interrupts disabled), writes to the mask bit by the guest are
ignored.
Currently, a write to MSI-X mask bit by the guest is silently
ignored.
A likely scenario is where we have a 82599 SR-IOV nic passed
through to a guest. From the guest if you do
ifconfig <ETH_DEV> down
ifconfig <ETH_DEV> up
the interrupts remain masked. On VF reset, the mask bit is set
by the controller. At this point, Xen is not aware that mask is set.
However, interrupts are enabled by VF driver by clearing the mask
bit by writing directly to BAR3 region containing the MSI-X table.
From dom0, we can verify that
interrupts are being masked using 'xl debug-keys M'.
Jan Beulich [Thu, 12 Sep 2013 09:30:36 +0000 (11:30 +0200)]
x86/EFI: properly handle run time memory regions outside the 1:1 map
Namely with PFN compression, MMIO ranges that the firmware may need
runtime access to can live in the holes that get shrunk/eliminated by
PFN compression, and hence no mappings would result from simply
copying Xen's direct mapping table's L3 page table entries. Build
mappings for this "manually" in the EFI runtime call 1:1 page tables.
Use the opportunity to also properly identify (via a forcibly undefined
manifest constant) all the disabled code regions associated with it not
being acceptable for us to call SetVirtualAddressMap().
Xi Xiong [Thu, 12 Sep 2013 09:28:03 +0000 (11:28 +0200)]
xend: fix file descriptor leak in pci utilities
A file descriptor leak was detected after creating multiple domUs with
pass-through PCI devices. This patch fixes the issue.
Signed-off-by: Xi Xiong <xixiong@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[msw: adjusted commit message] Signed-off-by: Matt Wilson <msw@amazon.com>
master commit: 749019afca4fd002d36856bad002cc11f7d0ddda
master date: 2013-09-03 16:36:52 +0100
Steven Noonan [Thu, 12 Sep 2013 09:27:27 +0000 (11:27 +0200)]
xend: handle extended PCI configuration space when saving state
Newer PCI standards (e.g., PCI-X 2.0 and PCIe) introduce extended
configuration space which is larger than 256 bytes. This patch uses
stat() to determine the amount of space used to correctly save all of
the PCI configuration space. Resets handled by the xen-pciback driver
don't have this problem, as that code correctly handles saving
extended configuration space.
Signed-off-by: Steven Noonan <snoonan@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[msw: adjusted commit message] Signed-off-by: Matt Wilson <msw@amazon.com>
master commit: 1893cf77992cc0ce9d827a8d345437fa2494b540
master date: 2013-09-03 16:36:47 +0100
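The size check xend performs can be sketched as below; xend itself is Python, so this C version only illustrates the idea. On Linux, the size of a device's sysfs `config` file distinguishes legacy 256-byte configuration space from the 4096-byte extended space of PCIe. The fallback value and behavior here are assumptions for the sketch, not xend's exact code:

```c
#include <sys/stat.h>

/* Determine how much PCI configuration space to save by stat()ing the
 * sysfs config file (e.g. /sys/bus/pci/devices/<BDF>/config): 256
 * bytes for legacy devices, 4096 for PCIe extended config space.
 * Falls back to 256 (an assumption) if the file cannot be stat()ed. */
static long pci_config_size(const char *path)
{
    struct stat st;

    if ( stat(path, &st) != 0 )
        return 256;
    return (long)st.st_size;
}
```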