In the original function, mmap() could be called with a length of -1 if the
second lseek failed and the caller had not provided max_size.
While fixing up this error, improve the logging of other error paths. I know
from personal experience that debugging failures function is rather difficult
given only "xc_dom_malloc_filemap: failed (on file <somefile>)" in the logs.
Andrew Cooper [Mon, 25 Nov 2013 11:05:47 +0000 (11:05 +0000)]
tools/xc_restore: Initialise console and store mfns
If the console or store mfn chunks are not present in the migration stream,
stack junk gets reported for the mfns.
XenServer had a very hard to track down VM corruption issue caused by exactly
this issue. Xenconsoled would connect to a junk mfn and incremented the ring
pointer if the junk happend to look like a valid gfn.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com>
(cherry picked from commit 0792426b798fd3b39909d618cf8fe8bac30594f4)
Jan Beulich [Tue, 10 Dec 2013 15:21:57 +0000 (16:21 +0100)]
IOMMU: clear "don't flush" override on error paths
Both xenmem_add_to_physmap() and iommu_populate_page_table() each have
an error path that fails to clear that flag, thus suppressing further
flushes on the respective pCPU.
In iommu_populate_page_table() also slightly re-arrange code to avoid
the false impression of the flag in question being guarded by a
domain's page_alloc_lock.
This is CVE-2013-6400 / XSA-80.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 552b7fcb9a70f1d4dd0e0cd5fb4d3d9da410104a
master date: 2013-12-10 16:10:37 +0100
Andrew Cooper [Mon, 9 Dec 2013 13:47:43 +0000 (14:47 +0100)]
x86/boot: fix BIOS memory corruption on certain IBM systems
IBM System x3530 M4 BIOSes (including the latest available at the time of this
patch) will corrupt a byte at physical address 0x105ff1 to the value of 0x86
if %esp has the value 0x00080000 when issuing an `int $0x15 (ax=0xec00)` to
inform the system about our intended operating mode.
Xen gets unhappy when the bootloader has placed it's .text section in over
this specific region of RAM.
After dropping into 16bit mode, clear all 32 bits of %esp, and for the BIOS
call already documented to be affected by BIOS bugs clear all GPRs.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Release-acked-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1ed76797439e384de18fcd6810bd4743d4f38b1e
master date: 2013-12-06 11:28:00 +0100
Daniel Kiper [Mon, 9 Dec 2013 13:47:07 +0000 (14:47 +0100)]
x86: fix early boot command line parsing
There is no reliable way to encode NUL character as a character so encode
it as a number. Read: http://sourceware.org/binutils/docs/as/Characters.html.
Octal and hex encoding do not work on at least one system (GNU assembler
version 2.22 (x86_64-linux-gnu) using BFD version (GNU Binutils for Debian) 2.22).
Without this fix e.g. no-real-mode option at the end of xen.gz command line
is not detected.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: dc37e0bfffc673f4bdce1d69ad86098bfb0ab531
master date: 2013-12-04 13:26:37 +0100
Apart from the Coverity-detected lock order reversal (a domain's
page_alloc_lock taken with the heap lock already held), calling
put_page() with heap_lock is a bad idea too (as a possible descendant
from put_page() is free_heap_pages(), which wants to take this very
lock).
From all I can tell the region over which heap_lock was held was far
too large: All we need to protect are the call to mark_page_offline()
and reserve_heap_page() (and I'd even put under question the need for
the former). Hence by slightly re-arranging the if/else-if chain we
can drop the lock much earlier, at once no longer covering the two
put_page() invocations.
Once at it, do a little bit of other cleanup: Put the "pod_replace"
code path inline rather than at its own label, and drop the effectively
unused variable "ret".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Keir Fraser <keir@xen.org>
master commit: d4837a56da4a59259dd0cf9f3bdc073159d81d7a
master date: 2013-12-03 12:40:57 +0100
The ptr calculation shall take the offset into the page into account
when ptr is valid.
Reported regression on NetBSD's port-xen with last known working libxen
being rev 2.9. This corrupts the kernel symbol table when the table is
not loaded on a page boundary.
Issue was tracked down by FastIce and Jeff Rizzo. See also
http://mail-index.netbsd.org/port-xen/2013/10/16/msg008088.html
Signed-off-by: Jean-Yves Migeon <jym@NetBSD.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: cb08944a482a5e80a3ff1113f0735761cc4c6cb8
master date: 2013-11-29 11:07:01 +0000
Feng Wu [Mon, 9 Dec 2013 13:45:00 +0000 (14:45 +0100)]
x86: properly handle MSI-X unmask operation from guests
For a pass-through device with MSI-x capability, when guest tries
to unmask the MSI-x interrupt for the passed through device, xen
doesn't clear the mask bit for MSI-x in hardware in the following
scenario, which will cause network disconnection:
1. Guest masks the MSI-x interrupt
2. Guest updates the address and data for it
3. Guest unmasks the MSI-x interrupt (This is the problematic step)
In the step #3 above, Xen doesn't handle it well. When guest tries
to unmask MSI-X interrupt, it traps to Xen, Xen just returns to Qemu
if it notices that address or data has been modified by guest before,
then Qemu will update Xen with the latest value of address/data by
hypercall. However, in this whole process, the MSI-X interrupt unmask
operation is missing, which means Xen doesn't clear the mask bit in
hardware for the MSI-X interrupt, so it remains disabled, that is why
it loses the network connection.
This patch fixes this issue.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Only latch the address if the guest really is unmasking the entry.
Clean up the entire change.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 74fd0036deb585a139b63b26db025805ecedc37a
master date: 2013-11-27 15:15:43 +0100
Liu Jinsong [Mon, 9 Dec 2013 13:43:34 +0000 (14:43 +0100)]
VMX: fix cr0.cd handling
This patch solves XSA-60 security hole:
1. For guest w/o VT-d, and for guest with VT-d but snooped, Xen need
do nothing, since hardware snoop mechanism has ensured cache coherency.
2. For guest with VT-d but non-snooped, cache coherency can not be
guaranteed by h/w snoop, therefore it need emulate UC type to guest:
2.1). if it works w/ Intel EPT, set guest IA32_PAT fields as UC so that
guest memory type are all UC.
2.2). if it works w/ shadow, drop all shadows so that any new ones would
be created on demand w/ UC.
This patch also fix a bug of shadow cr0.cd setting. Current shadow has a
small window between cache flush and TLB invalidation, resulting in possilbe
cache pollution. This patch pause vcpus so that no vcpus context involved
into the window.
This is CVE-2013-2212 / XSA-60.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jun Nakajima <jun.nakajima@intel.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 62652c00efa55fb45374bcc92f7d96fc411aebb2
master date: 2013-11-06 10:12:36 +0100
Liu Jinsong [Mon, 9 Dec 2013 13:41:44 +0000 (14:41 +0100)]
VMX: remove the problematic set_uc_mode logic
XSA-60 security hole comes from the problematic vmx_set_uc_mode.
This patch remove vmx_set_uc_mode logic, which will be replaced by
PAT approach at later patch.
This is CVE-2013-2212 / XSA-60.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Jun Nakajima <jun.nakajima@intel.com>
master commit: 1c84d046735102e02d2df454ab07f14ac51f235d
master date: 2013-11-06 10:12:00 +0100
Liu Jinsong [Mon, 9 Dec 2013 13:40:51 +0000 (14:40 +0100)]
VMX: disable EPT when !cpu_has_vmx_pat
Recently Oracle developers found a Xen security issue as DOS affecting,
named as XSA-60. Please refer http://xenbits.xen.org/xsa/advisory-60.html
Basically it involves how to handle guest cr0.cd setting, which under
some environment it consumes much time resulting in DOS-like behavior.
This is a preparing patch for fixing XSA-60. Later patch will fix XSA-60
via PAT under Intel EPT case, which depends on cpu_has_vmx_pat.
This is CVE-2013-2212 / XSA-60.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Jun Nakajima <jun.nakajima@intel.com>
master commit: c13b0d65ddedd74508edef5cd66defffe30468fc
master date: 2013-11-06 10:11:18 +0100
Reported-by: David Binderman <dcb314@hotmail.com> Signed-off-by: Tim Deegan <tim@xen.org>
Use _SEGMENT_* instead of plain numbers and adjust a comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6ed4bfbabd487b41021caa7ed03cee1f00ecbabf
master date: 2013-11-26 09:54:21 +0100
Liu Jinsong [Mon, 2 Dec 2013 14:56:09 +0000 (15:56 +0100)]
x86/xsave: fix nonlazy state handling
Nonlazy xstates should be xsaved each time when vcpu_save_fpu.
Operation to nonlazy xstates will not trigger #NM exception, so
whenever vcpu scheduled in it got restored and whenever scheduled
out it should get saved.
Currently this bug affects AMD LWP feature, and later Intel MPX
feature. With the bugfix both LWP and MPX will work fine.
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Furthermore, during restore we also need to set nonlazy_xstate_used
according to the incoming accumulated XCR0.
Also adjust the changes to i387.c such that there won't be a pointless
clts()/stts() pair.
David Vrabel [Mon, 2 Dec 2013 14:55:16 +0000 (15:55 +0100)]
x86/crash: disable the watchdog NMIs on the crashing cpu
nmi_shootdown_cpus() is called during a crash to park all the other
CPUs. This changes the NMI trap handlers which means there's no point
in having the watchdog still running.
This also disables the watchdog before executing any crash kexec image
and prevents the image from receiving unexpected NMIs.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
PVOps Linux as a kexec image shoots itself in the foot otherwise.
On a Core2 system, Linux declares a firmware bug and tries to invert some bits
in the performance counter register. It ends up setting the number of retired
instructions to generate another NMI to fewer instructions than the NMI
interrupt path itself, and ceases to make any useful progress.
The call to disable_lapic_nmi_watchdog() must be this late into the kexec path
to be sure that this cpu is the one which will execute the kexec image.
Otherwise there are race conditions where the NMIs might be disabled on the
wrong cpu, resulting in the kexec image still receiving NMIs.
x86/hvm: reset TSC to 0 after domain resume from S3
Host S3 implicitly resets the host TSC to 0, but the tsc offset for hvm
domains is not recalculated when they resume, causing it to go into
negative values. In Linux guest using tsc clocksource, this results in
a hang after wrap back to positive values since the tsc clocksource
implementation expects it reset.
Jan Beulich [Mon, 2 Dec 2013 14:53:50 +0000 (15:53 +0100)]
x86: consider modules when cutting off memory
The code in question runs after module ranges got already removed from
the E820 table, so when determining the new maximum page/PDX we need to
explicitly take them into account.
Furthermore we need to round up the ending addresses here, in order to
fully cover eventual partial trailing pages.
Nate Studer [Mon, 2 Dec 2013 14:51:00 +0000 (15:51 +0100)]
credit: Update other parameters when setting tslice_ms
Add a utility function to update the rest of the timeslice
accounting fields when updating the timeslice of the
credit scheduler, so that capped CPUs behave correctly.
Before this patch changing the timeslice to a value higher
than the default would result in a domain not utilizing
its full capacity and changing the timeslice to a value
lower than the default would result in a domain exceeding
its capacity.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 19f027cc5daff4a37fd0a28bca2514c721852dd0
master date: 2013-11-27 09:00:41 +0100
Ian Jackson [Tue, 3 Sep 2013 12:41:46 +0000 (13:41 +0100)]
libxl: Do not generate short block in libxl__datacopier_prefixdata
libxl__datacopier_prefixdata would prepend a deliberately short block
(not just a half-full one, but one with a short buffer) to the
dc->bufs queue. However, this is wrong because datacopier_readable
will find it and try to continue to fill it up.
Jan Beulich [Mon, 18 Nov 2013 13:00:34 +0000 (14:00 +0100)]
VT-d: fix TLB flushing in dma_pte_clear_one()
The third parameter of __intel_iommu_iotlb_flush() is to indicate
whether the to be flushed entry was a present one. A few lines before,
we bailed if !dma_pte_present(*pte), so there's no need to check the
flag here again - we can simply always pass TRUE here.
This is XSA-78.
Suggested-by: Cheng Yueqiang <yqcheng.2008@phdis.smu.edu.sg> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 85c72f9fe764ed96f5c149efcdd69ab7c18bfe3d
master date: 2013-11-18 13:55:55 +0100
Jan Beulich [Fri, 15 Nov 2013 10:47:41 +0000 (11:47 +0100)]
x86: eliminate has_arch_mmios()
... as being generally insufficient: Either has_arch_pdevs() or
cache_flush_permitted() should be used (in particular, it is
insufficient to consider MMIO ranges alone - I/O port ranges have the
same requirements if available to a guest).
Jan Beulich [Fri, 15 Nov 2013 10:46:26 +0000 (11:46 +0100)]
VMX: don't crash processing 'd' debug key
There's a window during scheduling where "current" and the active VMCS
may disagree: The former gets set much earlier than the latter. Since
both vmx_vmcs_enter() and vmx_vmcs_exit() immediately return when the
subject vCPU is "current", accessing VMCS fields would, depending on
whether there is any currently active VMCS, either read wrong data, or
cause a crash.
Going forward we might want to consider reducing the window during
which vmx_vmcs_enter() might fail (e.g. doing a plain __vmptrld() when
v->arch.hvm_vmx.vmcs != this_cpu(current_vmcs) but arch_vmx->active_cpu
== -1), but that would add complexities (acquiring and - more
importantly - properly dropping v->arch.hvm_vmx.vmcs_lock) that don't
look worthwhile adding right now.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 58929248461ecadce13e92eb5a5d9ef718a7c88e
master date: 2013-11-12 11:52:19 +0100
Jan Beulich [Fri, 15 Nov 2013 10:44:17 +0000 (11:44 +0100)]
x86/EFI: make trampoline allocation more flexible
Certain UEFI implementations reserve all memory below 1Mb at boot time,
making it impossible to properly allocate the chunk necessary for the
trampoline. Fall back to simply grabbing a chunk from EfiBootServices*
regions immediately prior to calling ExitBootServices().
Jan Beulich [Fri, 15 Nov 2013 10:42:36 +0000 (11:42 +0100)]
x86/HVM: 32-bit IN result must be zero-extended to 64 bits
Just like for all other operations with 32-bit operand size.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
x86/HVM: 32-bit IN result must be zero-extended to 64 bits (part 2)
Just spotted a counterpart of what commit 9d89100b (same title) dealt
with.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 9d89100ba8b7b02adb7c2e89ef7c81e734942e7c
master date: 2013-11-05 14:51:53 +0100
master commit: 1e521eddeb51a9f1bf0e4dd1d17efc873eafae41
master date: 2013-11-15 11:01:49 +0100
Jan Beulich [Fri, 15 Nov 2013 10:37:39 +0000 (11:37 +0100)]
x86/ACPI/x2APIC: guard against out of range ACPI or APIC IDs
Other than for the legacy APIC, the x2APIC MADT entries have valid
ranges possibly extending beyond what our internal arrays can handle,
and hence we need to guard ourselves against corrupting memory here.
Jan Beulich [Fri, 15 Nov 2013 10:36:39 +0000 (11:36 +0100)]
x86: refine address validity checks before accessing page tables
In commit 40d66baa ("x86: correct LDT checks") and d06a0d71 ("x86: add
address validity check to guest_map_l1e()") I didn't really pay
attention to the fact that these checks would better be done before the
paging_mode_translate() ones, as there's also no equivalent check down
the shadow code paths involved here (at least not up to the first use
of the address), and such generic checks shouldn't really be done by
particular backend functions anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 343cad8c70585c4dba8afc75e1ec1b7610605ab2
master date: 2013-10-28 12:00:36 +0100
David Vrabel [Fri, 15 Nov 2013 10:34:43 +0000 (11:34 +0100)]
sched: fix race between sched_move_domain() and vcpu_wake()
From: David Vrabel <david.vrabel@citrix.com>
sched_move_domain() changes v->processor for all the domain's VCPUs.
If another domain, softirq etc. triggers a simultaneous call to
vcpu_wake() (e.g., by setting an event channel as pending), then
vcpu_wake() may lock one schedule lock and try to unlock another.
vcpu_schedule_lock() attempts to handle this but only does so for the
window between reading the schedule_lock from the per-CPU data and the
spin_lock() call. This does not help with sched_move_domain()
changing v->processor between the calls to vcpu_schedule_lock() and
vcpu_schedule_unlock().
Fix the race by taking the schedule_lock for v->processor in
sched_move_domain().
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Use vcpu_schedule_lock_irq() (which now returns the lock) to properly
retry the locking should the to be used lock have changed in the course
of acquiring it (issue pointed out by George Dunlap).
Add a comment explaining the state after the v->processor adjustment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: ef55257bc81204e34691f1c2aa9e01f2d0768bdd
master date: 2013-10-14 08:58:31 +0200
Jan Beulich [Fri, 15 Nov 2013 10:34:01 +0000 (11:34 +0100)]
fix locking in cpu_disable_scheduler()
So commit eedd6039 ("scheduler: adjust internal locking interface")
uncovered - by now using proper spin lock constructs - a bug after all:
When bringing down a CPU, cpu_disable_scheduler() gets called with
interrupts disabled, and hence the use of vcpu_schedule_lock_irq() was
never really correct (i.e. the caller ended up with interrupts enabled
despite having disabled them explicitly).
Fixing this however surfaced another problem: The call path
vcpu_migrate() -> evtchn_move_pirqs() wants to acquire the event lock,
which however is a non-IRQ-safe once, and hence check_lock() doesn't
like this lock to be acquired when interrupts are already off. As we're
in stop-machine context here, getting things wrong wrt interrupt state
management during lock acquire/release is out of question though, so
the simple solution to this appears to be to just suppress spin lock
debugging for the period of time while the stop machine callback gets
run.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 41a0cc9e26160a89245c9ba3233e3f70bf9cd4b4
master date: 2013-10-29 09:57:14 +0100
Jan Beulich [Fri, 15 Nov 2013 10:32:51 +0000 (11:32 +0100)]
scheduler: adjust internal locking interface
Make the locking functions return the lock pointers, so they can be
passed to the unlocking functions (which in turn can check that the
lock is still actually providing the intended protection, i.e. the
parameters determining which lock is the right one didn't change).
Further use proper spin lock primitives rather than open coded
local_irq_...() constructs, so that interrupts can be re-enabled as
appropriate while spinning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: eedd60391610629b4e8a2e8278b857ff884f750d
master date: 2013-10-14 08:57:56 +0200
Ian Jackson [Wed, 1 May 2013 15:56:54 +0000 (16:56 +0100)]
libxl: Fix bug in libxl_cdrom_insert, make more robust against bad xenstore data
libxl_cdrom_insert was failing to initialize the backend type,
resulting in the wrong default backend. The result was not only that
the CD was not inserted properly, but also that some improper xenstore
entries were created, causing further block commands to fail.
This patch fixes the bug by setting the disk backend type based on the
type of the existing device.
It also makes the system more robust by checking to see that it has
got a valid path before proceeding to write a partial xenstore entry.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Committed-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit c3556e2a1aee3c9b7dda5d57e85e8867fff1b9da)
Conflicts:
tools/libxl/libxl.c Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Mon, 11 Nov 2013 08:18:59 +0000 (09:18 +0100)]
nested VMX: VMLANUCH/VMRESUME emulation must check permission first thing
Otherwise uninitialized data may be used, leading to crashes.
This is CVE-2013-4551 / XSA-75.
Reported-and-tested-by: Jeff Zimmerman <Jeff_Zimmerman@McAfee.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-and-tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 4e87bc5b03e05123ba5c888f77969140c8ebd1bf
master date: 2013-11-11 09:15:04 +0100
Ian Jackson [Tue, 29 Oct 2013 15:45:53 +0000 (15:45 +0000)]
tools: xenstored: if the reply is too big then send E2BIG error
This fixes the issue for both C and ocaml xenstored, however only the ocaml
xenstored is vulnerable in its default configuration.
Adding a new error appears to be safe, since bit libxenstore and the Linux
driver at least treat an unknown error code as EINVAL.
This is XSA-72 / CVE-2013-4416.
Original ocaml patch by Jerome Maloberti <jerome.maloberti@citrix.com> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
(cherry picked from commit 8b2c441a1b53a43a38b3c517e28f239da3349872)
(cherry picked from commit d88ac91ef2f0a93ea9359a8133405dbd78abc89b)
Jan Beulich [Tue, 22 Oct 2013 10:07:40 +0000 (12:07 +0200)]
x86-64: check for canonical address before doing page walks
... as there doesn't really exists any valid mapping for them.
Particularly in the case of do_page_walk() this also avoids returning
non-NULL for such invalid input.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 6fd9b0361e2eb5a7f12bdd5cbf7e42c0d1937d26
master date: 2013-10-11 09:31:16 +0200
Jan Beulich [Tue, 22 Oct 2013 10:06:43 +0000 (12:06 +0200)]
x86: add address validity check to guest_map_l1e()
Just like for guest_get_eff_l1e() this prevents accessing as page
tables (and with the wrong memory attribute) internal data inside Xen
happening to be mapped with 1Gb pages.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: d06a0d715ec1423b6c42141ab1b0ff69a3effb56
master date: 2013-10-11 09:29:43 +0200
Jan Beulich [Tue, 22 Oct 2013 10:05:45 +0000 (12:05 +0200)]
x86: correct LDT checks
- MMUEXT_SET_LDT should behave as similarly to the LLDT instruction as
possible: fail only if the base address is non-canonical
- instead LDT descriptor accesses should fault if the descriptor
address ends up being non-canonical (by ensuring this we at once
avoid reading an entry from the mach-to-phys table and consider it a
page table entry)
- fault propagation on using LDT selectors must distinguish #PF and #GP
(the latter must be raised for a non-canonical descriptor address,
which also applies to several other uses of propagate_page_fault(),
and hence the problem is being fixed there)
- map_ldt_shadow_page() should properly wrap addresses for 32-bit VMs
At once remove the odd invokation of map_ldt_shadow_page() from the
MMUEXT_SET_LDT handler: There's nothing really telling us that the
first LDT page is going to be preferred over others.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 40d66baa46ca8a9ffa6df3e063a967d08ec92bcf
master date: 2013-10-11 09:28:26 +0200
Andrew Cooper [Tue, 22 Oct 2013 10:04:01 +0000 (12:04 +0200)]
x86/percpu: Force INVALID_PERCPU_AREA into the non-canonical address region
This causes accidental uses of per_cpu() on a pcpu with an INVALID_PERCPU_AREA
to result in a #GF for attempting to access the middle of the non-canonical
virtual address region.
This is preferable to the current behaviour, where incorrect use of per_cpu()
will result in an effective NULL structure dereference which has security
implication in the context of PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 7cfb0053629c4dd1a6f01dc43cca7c0c25b8b7bf
master date: 2013-10-04 12:24:34 +0200
Andrew Cooper [Tue, 22 Oct 2013 10:03:03 +0000 (12:03 +0200)]
x86/idle: Fix get_cpu_idle_time()'s interaction with offline pcpus
Checking for "idle_vcpu[cpu] != NULL" is insufficient protection against
offline pcpus. From a hypercall, vcpu_runstate_get() will determine "v !=
current", and try to take the vcpu_schedule_lock(). This will try to look up
per_cpu(schedule_data, v->processor) and promptly suffer a NULL structure
deference as v->processors' __per_cpu_offset is INVALID_PERCPU_AREA.
****************************************
Panic on CPU 11:
...
get_cpu_idle_time() has been updated to correctly deal with offline pcpus
itself by returning 0, in the same way as it would if it was missing the
idle_vcpu[] pointer.
In doing so, XENPF_getidletime needed updating to correctly retain its
described behaviour of clearing bits in the cpumap for offline pcpus.
As this crash can only be triggered with toolstack hypercalls, it is not a
security issue and just a simple bug.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 0aa27ce3351f7eb09d13e863a1d5f303086aa32a
master date: 2013-10-04 12:23:23 +0200
Matthew Daley [Thu, 10 Oct 2013 13:24:15 +0000 (15:24 +0200)]
x86: check segment descriptor read result in 64-bit OUTS emulation
When emulating such an operation from a 64-bit context (CS has long
mode set), and the data segment is overridden to FS/GS, the result of
reading the overridden segment's descriptor (read_descriptor) is not
checked. If it fails, data_base is left uninitialized.
This can lead to 8 bytes of Xen's stack being leaked to the guest
(implicitly, i.e. via the address given in a #PF).
Ignoring them generally implies using uninitialized data and, in all
but two of the cases dealt with here, potentially leaking hypervisor
stack contents to guests.
This is CVE-2013-4355 / XSA-63.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6bb838e7375f5b031e9ac346b353775c90de45dc
master date: 2013-09-30 14:17:46 +0200
Jan Beulich [Wed, 25 Sep 2013 08:55:42 +0000 (10:55 +0200)]
x86/xsave: initialize extended register state when guests enable it
Till now, when setting previously unset bits in XCR0 we wouldn't touch
the active register state, thus leaving in the newly enabled registers
whatever a prior user of it left there, i.e. potentially leaking
information between guests.
This is CVE-2013-1442 / XSA-62.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 63a75ba0de817d6f384f96d25427a05c313e2179
master date: 2013-09-25 10:41:25 +0200
Jan Beulich [Thu, 12 Sep 2013 09:33:44 +0000 (11:33 +0200)]
x86/xsave: fix migration from xsave-capable to xsave-incapable host
With CPUID features suitably masked this is supposed to work, but was
completely broken (i.e. the case wasn't even considered when the
original xsave save/restore code was written).
First of all, xsave_enabled() wrongly returned the value of
cpu_has_xsave, i.e. not even taking into consideration attributes of
the vCPU in question. Instead this function ought to check whether the
guest ever enabled xsave support (by writing a [non-zero] value to
XCR0). As a result of this, a vCPU's xcr0 and xcr0_accum must no longer
be initialized to XSTATE_FP_SSE (since that's a valid value a guest
could write to XCR0), and the xsave/xrstor as well as the context
switch code need to suitably account for this (by always enforcing at
least this part of the state to be saved/loaded).
This involves undoing large parts of c/s 22945:13a7d1f7f62c ("x86: add
strictly sanity check for XSAVE/XRSTOR") - we need to cleanly
distinguish between hardware capabilities and vCPU used features.
Next both HVM and PV save code needed tweaking to not always save the
full state supported by the underlying hardware, but just the parts
that the guest actually used. Similarly the restore code should bail
not just on state being restored that the hardware cannot handle, but
also on inconsistent save state (inconsistent XCR0 settings or size of
saved state not in line with XCR0).
And finally the PV extended context get/set code needs to use slightly
different logic than the HVM one, as here we can't just key off of
xsave_enabled() (i.e. avoid doing anything if a guest doesn't use
xsave) because the tools use this function to determine host
capabilities as well as read/write vCPU state. The set operation in
particular needs to be capable of cleanly dealing with input that
consists of only the xcr0 and xcr0_accum values (if they're both zero
then no further data is required).
While for things to work correctly both sides (saving _and_ restoring
host) need to run with the fixed code, afaict no breakage should occur
if either side isn't up to date (other than the breakage that this
patch attempts to fix).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Keir Fraser <keir@xen.org>
master commit: 4cc1344447a0458df5d222960f2adf1b65084fa8
master date: 2013-09-09 14:36:54 +0200
Jan Beulich [Thu, 12 Sep 2013 09:32:21 +0000 (11:32 +0200)]
x86/xsave: initialization improvements
- properly validate available feature set on APs
- also validate xsaveopt availability on APs
- properly indicate whether the initialization is on the BSP (we
shouldn't be using "cpu == 0" checks for this)
x86: allow guest to set/clear MSI-X mask bit (try 2)
Guest needs the ability to enable and disable MSI-X interrupts
by setting the MSI-X control bit, for a passed-through device.
Guest is allowed to write MSI-X mask bit only if Xen *thinks*
that mask is clear (interrupts enabled). If the mask is set by
Xen (interrupts disabled), writes to mask bit by the guest is
ignored.
Currently, a write to MSI-X mask bit by the guest is silently
ignored.
A likely scenario is where we have a 82599 SR-IOV nic passed
through to a guest. From the guest if you do
ifconfig <ETH_DEV> down
ifconfig <ETH_DEV> up
the interrupts remain masked. On VF reset, the mask bit is set
by the controller. At this point, Xen is not aware that mask is set.
However, interrupts are enabled by VF driver by clearing the mask
bit by writing directly to BAR3 region containing the MSI-X table.
From dom0, we can verify that
interrupts are being masked using 'xl debug-keys M'.
Jan Beulich [Thu, 12 Sep 2013 09:30:36 +0000 (11:30 +0200)]
x86/EFI: properly handle run time memory regions outside the 1:1 map
Namely with PFN compression, MMIO ranges that the firmware may need
runtime access to can live in the holes that gets shrunk/eliminated by
PFN compression, and hence no mappings would result from simply
copying Xen's direct mapping table's L3 page table entries. Build
mappings for this "manually" in the EFI runtime call 1:1 page tables.
Use the opportunity to also properly identify (via a forcibly undefined
manifest constant) all the disabled code regions associated with it not
being acceptable for us to call SetVirtualAddressMap().
Xi Xiong [Thu, 12 Sep 2013 09:28:03 +0000 (11:28 +0200)]
xend: fix file descriptor leak in pci utilities
A file descriptor leak was detected after creating multiple domUs with
pass-through PCI devices. This patch fixes the issue.
Signed-off-by: Xi Xiong <xixiong@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[msw: adjusted commit message] Signed-off-by: Matt Wilson <msw@amazon.com>
master commit: 749019afca4fd002d36856bad002cc11f7d0ddda
master date: 2013-09-03 16:36:52 +0100
Steven Noonan [Thu, 12 Sep 2013 09:27:27 +0000 (11:27 +0200)]
xend: handle extended PCI configuration space when saving state
Newer PCI standards (e.g., PCI-X 2.0 and PCIe) introduce extended
configuration space which is larger than 256 bytes. This patch uses
stat() to determine the amount of space used to correctly save all of
the PCI configuration space. Resets handled by the xen-pciback driver
don't have this problem, as that code correctly handles saving
extended configuration space.
Signed-off-by: Steven Noonan <snoonan@amazon.com> Reviewed-by: Matt Wilson <msw@amazon.com>
[msw: adjusted commit message] Signed-off-by: Matt Wilson <msw@amazon.com>
master commit: 1893cf77992cc0ce9d827a8d345437fa2494b540
master date: 2013-09-03 16:36:47 +0100
Jan Beulich [Thu, 12 Sep 2013 09:26:40 +0000 (11:26 +0200)]
x86: AVX instruction emulation fixes
- we used the C4/C5 (first prefix) byte instead of the apparent ModR/M
one as the second prefix byte
- early decoding normalized vex.reg, thus corrupting it for the main
consumer (copy_REX_VEX()), resulting in #UD on the two-operand
instructions we emulate
Also add respective test cases to the testing utility plus
- fix get_fpu() (the fall-through order was inverted)
- add cpu_has_avx2, even if it's currently unused (as in the new test
cases I decided to refrain from using AVX2 instructions in order to
be able to actually run all the tests on the hardware I have)
- slightly tweak cpu_has_avx to more consistently express the outputs
we don't care about (sinking them all into the same variable)
Andrew Cooper [Thu, 12 Sep 2013 08:58:40 +0000 (10:58 +0200)]
x86: Special case __HYPERVISOR_iret rather more when writing hypercall pages
In all cases when a hypercall page is written, __HYPERVISOR_iret is first
written as a regular hypercall, then subsequently rewritten in its special
case.
For VMX and SVM, this means that following the ud2a instruction is 3 bytes of
an imm32 parameter. For a ring3 kernel, this means that following the syscall
instruction is the second half of 'pop %r11'.
For a ring1 kernel, the iret case ends up as the same number of bytes as the
rest of the hypercalls, but it is pointless writing it twice, and is changed
for consistency.
Therefore, skip the loop iteration which would write the incorrect
__HYPERVISOR_iret hypercall. This removes junk machine code from the tail and
makes disassemblers rather more happy when looking at the hypercall page.
Also, a miscellaneous whitespace fix in the comment for ring3 kernel.
Jan Beulich [Wed, 11 Sep 2013 06:23:29 +0000 (08:23 +0200)]
ACPI: fix acpi_os_map_memory()
It using map_domain_page() was entirely wrong. Use __acpi_map_table()
instead for the time being, with locking added as the mappings it
produces get replaced with subsequent invocations. Using locking in
this way is acceptable here since the only two runtime callers are
acpi_os_{read,write}_memory(), which don't leave mappings pending upon
returning to their callers.
Also fix __acpi_map_table()'s first parameter's type - while benign for
unstable, backports to pre-4.3 trees will need this.
Fix inactive timer list corruption on second S3 resume
init_timer cannot be safely called multiple times on same timer since it does memset(0)
on the structure, erasing the auxiliary member used by linked list code. This breaks
inactive timer list in common/timer.c.
Moved resume_timer initialisation to ns16550_init_postirq, so it's only done once.
Signed-off-by: Ian Campbell <ijc@hellion.org.uk> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 258d27a1d9fb33a490bef1381f52d522225c3dca)
Ian Jackson [Tue, 3 Sep 2013 10:55:48 +0000 (11:55 +0100)]
oxenstored: Protect oxenstored from malicious domains.
add check logic when read from IO ring, and if error happens,
then mark the reading connection as "bad", Unless vm reboot,
oxenstored will not handle message from this connection any more.
xs_ring_stubs.c: add a more strict check on ring reading
connection.ml, domain.ml: add getter and setter for bad flag
process.ml: if exception raised when reading from domain's ring,
mark this domain as "bad"
xenstored.ml: if a domain is marked as "bad", do not handle it.
Juergen Gross [Thu, 22 Aug 2013 09:30:01 +0000 (11:30 +0200)]
Correct X2-APIC HVM emulation
commit 6859874b61d5ddaf5289e72ed2b2157739b72ca5 ("x86/HVM: fix x2APIC
APIC_ID read emulation") introduced an error for the hvm emulation of
x2apic. Any try to write to APIC_ICR MSR will result in a GP fault.
Tim Deegan [Tue, 20 Aug 2013 13:40:38 +0000 (15:40 +0200)]
xen: Add stdbool.h workaround for BSD.
On *BSD, stdbool.h lives in /usr/include, but we don't want to have
that on the search path in case we pick up any headers from the build
host's C libraries.
Copy the equivalent hack already in place for stdarg.h: on all
supported compilers the contents of stdbool.h are trivial, so just
supply the things we need in a xen/stdbool.h header.
Signed-off-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Tested-by: Patrick Welche <prlw1@cam.ac.uk>
master commit: 7b9685ca4ed2fd723600ce66eb20a6d0c115b6cb
master date: 2013-08-15 22:00:45 +0100
Jan Beulich [Tue, 20 Aug 2013 13:38:24 +0000 (15:38 +0200)]
VT-d: protect against bogus information coming from BIOS
Add checks similar to those done by Linux: The DRHD address must not
be all zeros or all ones (Linux only checks for zero), and capabilities
as well as extended capabilities must not be all ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ben Guthro <benjamin.guthro@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Ben Guthro <benjamin.guthro@citrix.com>
Acked by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
master commit: e8e8b030ecf916fea19639f0b6a446c1c9dbe174
master date: 2013-08-14 11:18:24 +0200
x86/AMD: Inject #GP instead of #UD when unable to map vmcb
According to AMD Programmer's Manual vol2, vmrun, vmsave and vmload
should inject #GP instead of #UD when unable to access memory
location for vmcb. Also, the code should make sure that L1 guest
EFER.SVME is not zero. Otherwise, #UD should be injected.
x86/AMD: Fix nested svm crash due to assertion in __virt_to_maddr
Fix assertion in __virt_to_maddr when starting nested SVM guest
in debug mode. Investigation has shown that svm_vmsave/svm_vmload
make use of __pa() with invalid address.
Patrick Welche [Tue, 20 Aug 2013 13:33:14 +0000 (15:33 +0200)]
libelf: Fix typo in header guard macro
s/__LIBELF_PRIVATE_H_/__LIBELF_PRIVATE_H__/
Signed-off-by: Patrick Welche <prlw1@cam.ac.uk> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 0aec8823501f8ee058c1ba673d2ac3e0f3f2e8db
master date: 2013-08-08 12:47:38 +0100
Tim Deegan [Fri, 16 Aug 2013 10:07:40 +0000 (12:07 +0200)]
x86: explicit suffix in inline assembler (for clang).
This fixes the clang build, and has no effect on gcc's output.
Signed-off-by: Tim Deegan <tim@xen.org> Committed-by: Jan Beulich <jbeulich@suse.com>
master commit: 59a28b5f045331641cbf0c1fc8d5d67afe328939
master date: 2013-02-14 14:20:06 +0100
Note that this isn't just a build fix - if the "delta" input in the
64-bit variant ends up in memory, gas would default to 32-bit operand
size (and should really warn about the ambiguity).
Yang Zhang [Thu, 15 Aug 2013 07:14:11 +0000 (09:14 +0200)]
VTD: Remove the check for reserved device scope type
Though we only have four valid types now, the new type may be added in future.
It's better to remove the check and only deal with the type that we can
recognize.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@Intel.com> Acked-by: Keir Fraser <keir@xen.org>
Add log message for this case.