Ross Lagerwall [Thu, 18 Jun 2015 07:28:37 +0000 (09:28 +0200)]
x86: don't crash when mapping a page using EFI runtime page tables
When an interrupt is received during an EFI runtime service call, Xen
may call map_domain_page() while using the EFI runtime page tables.
This fails because, although the EFI runtime page tables are a
copy of the idle domain's page tables, current points at a different
domain's vCPU.
To fix this, return NULL from mapcache_current_vcpu() when using the EFI
runtime page tables which is treated equivalently to running in an idle
vCPU.
This issue can be reproduced by repeatedly calling GetVariable() from
dom0 while using VT-d, since VT-d frequently maps a page from interrupt
context.
Roger Pau Monné [Thu, 18 Jun 2015 07:28:02 +0000 (09:28 +0200)]
x86/pvh: disable posted interrupts
Enabling posted interrupts requires the virtual interrupt delivery feature,
which is disabled for PVH guests, so make sure posted interrupts are also
disabled or else vmlaunch will fail.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reported-and-Tested-by: Lars Eggert <lars@netapp.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: cf6b3ccf28faee01a078311fcfe671148c81ea75
master date: 2015-05-28 10:56:08 +0200
This is because, for free CPUs, -EBUSY were being returned
when trying to tear them down, making cpu_down() unhappy.
It is certainly unpractical to forbid shutting down or
suspenging if there are unassigned CPUs, so this change
fixes the above by just avoiding returning -EBUSY for those
CPUs. If shutting off, that does not matter much anyway. If
suspending, we make sure that the CPUs remain unassigned
when resuming.
While there, take the chance to:
- fix the doc comment of cpupool_cpu_remove() (it was
wrong);
- improve comments in general around and in cpupool_cpu_remove()
and cpupool_cpu_add();
- add a couple of ASSERT()-s for checking consistency.
x86_emulate: fix EFLAGS setting of CMPXCHG emulation
CMPXCHG sets CF, PF, AF, SF, and OF flags according to the results of the
comparison the rAX with the operand of the instruction.
rAX must be the first argument of the comparison (a minuend), the operand
must be the second one (a subtrahend).
Due to improper order of comparison arguments, CF, PF, AF, SF and OF flags were
set incorrectly in the case of inequality. Need to swap them.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
test_x86_emulate: extend EFLAGS check of CMPXCHG test
CMPXCHG: in the case of inequality of the rAX and the operand,
need to check CF, PF, AF, SF and OF flags as well.
This adjustment covers the fix of incorrect comparison during
CMPXCHG emulation.
Ross Lagerwall [Tue, 19 May 2015 09:55:46 +0000 (11:55 +0200)]
x86/efi: reserve SMBIOS table region when EFI booting
Some EFI firmware implementations may place the SMBIOS table in RAM
marked as BootServicesData, which Xen does not consider as reserved.
When dom0 tries to access the SMBIOS, the region is not contained in the
initial P2M and it crashes with a page fault. To fix this, reserve the
SMBIOS region.
Also, fix the memcmp checks for existence of the SMBIOS.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bf68adcadaa2b0885c5d2f1c8e2e068e209eb041
master date: 2015-04-17 10:44:48 +0200
Jan Beulich [Tue, 19 May 2015 09:54:12 +0000 (11:54 +0200)]
x86: don't change affinity with interrupt unmasked
With ->startup unmasking the IRQ, setting the affinity afterwards
without masking the IRQ again is invalid namely for MSI (address and
data can't be updated atomically and may - at least for MSI-X - be
cached while unmasked).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
AMD IOMMU: only translate remapped IO-APIC RTEs
1aeb1156fa ("x86 don't change affinity with interrupt unmasked")
introducing RTE reads prior to the respective interrupt having got
enabled for the first time uncovered a bug in 2ca9fbd739 ("AMD IOMMU:
allocate IRTE entries instead of using a static mapping"): We obviously
shouldn't be translating RTEs for which remapping didn't get set up
yet.
Jan Beulich [Tue, 21 Apr 2015 07:22:17 +0000 (09:22 +0200)]
VT-d: improve fault info logging
I got repeatedly annoyed by there not getting anything logged by
default on VT-d faults (and hence having to tell people to add extra
command line options), and hence I think it is time to redo this code:
Log basic fault information at guest-warning level (rate limited by
default), and show the page walk in verbose rather than only in debug
mode. Break up multi-line message so that each gets a proper log level
attached, at once splitting out the common part. Also don't log
"unknown" faults as interrupt-remapping ones.
As a minor cleanup fix the type of the involved "fault_type" variables.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: f0250f4b4818f5f4230995407ea2501de3485934
master date: 2015-03-27 15:23:25 +0100
Jan Beulich [Tue, 21 Apr 2015 07:21:41 +0000 (09:21 +0200)]
x86/MSI: fix error handling
__setup_msi_irq() needs to undo what it did before calling
write_msi_msg() in case that returned an error.
map_domain_pirq() needs to get rid of the MSI descriptor it
(implicitly) allocated. The case of a setup_msi_irq() failure on a
non-initial multi-vector-MSI interrupt needs special handling: While
the initial IRQ will get freed by the caller (who also passed it to
us), we need to undo the effect setup_msi_irq() had on it. (As a
benefit from the added call to msi_free_irq() we no longer need to
explicitly call destroy_irq() on the non-initial slots.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 29c1b7886c36d4e6aa03a779b2251b829d9689c3
master date: 2015-03-26 11:19:57 +0100
If the part of the compression data are corrupted, or the compression
data is totally fake, the memory access over the limit is possible.
This is the log from my system usning lz4 decompression.
[6502]data abort, halting
[6503]r0 0x00000000 r1 0x00000000 r2 0xdcea0ffc r3 0xdcea0ffc
[6509]r4 0xb9ab0bfd r5 0xdcea0ffc r6 0xdcea0ff8 r7 0xdce80000
[6515]r8 0x00000000 r9 0x00000000 r10 0x00000000 r11 0xb9a98000
[6522]r12 0xdcea1000 usp 0x00000000 ulr 0x00000000 pc 0x820149bc
[6528]spsr 0x400001f3
and the memory addresses of some variables at the moment are
ref:0xdcea0ffc, op:0xdcea0ffc, oend:0xdcea1000
As you can see, COPYLENGH is 8bytes, so @ref and @op can access the momory
over @oend.
Signed-off-by: JeHyeon Yeon <tom.yeon@windriver.com> Reviewed-by: David Sterba <dsterba@suse.cz>
[Linux commit d5e7cafd69da24e6d6cc988fab6ea313a2577efc] Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: fcc17f96c2776eb220e3dee79fd0ad6a624ffcd9
master date: 2015-03-26 11:19:10 +0100
Jan Beulich [Tue, 21 Apr 2015 07:20:34 +0000 (09:20 +0200)]
hvmloader: don't treat ROM BAR like other BARs
Its low 11 bits have different meaning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 73aa7fc2926c5ae30d8ebd049beadbb48e24d6c6
master date: 2015-03-26 11:17:51 +0100
Andrew Cooper [Tue, 21 Apr 2015 07:18:39 +0000 (09:18 +0200)]
domctl/sysctl: don't leak hypervisor stack to toolstacks
This is CVE-2015-3340 / XSA-132.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 4ff3449f0e9d175ceb9551d3f2aecb59273f639d
master date: 2015-04-21 09:03:15 +0200
Limit XEN_DOMCTL_memory_mapping hypercall to only process up to 64 GFNs (or less)
Said hypercall for large BARs can take quite a while. As such
we can require that the hypercall MUST break up the request
in smaller values.
Another approach is to add preemption to it - whether we do the
preemption using hypercall_create_continuation or returning
EAGAIN to userspace (and have it re-invocate the call) - either
way the issue we cannot easily solve is that in 'map_mmio_regions'
if we encounter an error we MUST call 'unmap_mmio_regions' for the
whole BAR region.
Since the preemption would re-use input fields such as nr_mfns,
first_gfn, first_mfn - we would lose the original values -
and only undo what was done in the current round (i.e. ignoring
anything that was done prior to earlier preemptions).
Unless we re-used the return value as 'EAGAIN|nr_mfns_done<<10' but
that puts a limit (since the return value is a long) on the amount
of nr_mfns that can provided.
This patch sidesteps this problem by:
- Setting an hard limit of nr_mfns having to be 64 or less.
- Toolstack adjusts correspondingly to the nr_mfn limit.
- If the there is an error when adding the toolstack will call the
remove operation to remove the whole region.
The need to break this hypercall down is for large BARs can take
more than the guest (initial domain usually) time-slice. This has
the negative result in that the guest is locked out for a long
duration and is unable to act on any pending events.
We also augment the code to return zero if nr_mfns instead
of trying to the hypercall.
This is XSA-125 / CVE-2015-2752.
Suggested-by: Jan Beulich <jbeulich@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ross Lagerwall [Thu, 26 Mar 2015 07:41:44 +0000 (08:41 +0100)]
x86: don't apply reboot quirks if reboot set by user
If reboot= is specified on the command-line, don't apply reboot quirks
to allow the command-line option to take precedence.
This is a port of Linux commit 5955633e91bf ("x86/reboot: Skip DMI
checks if reboot set by user").
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Leverage (and make apply on top of) c643fb110a ("x86/EFI: allow
reboot= overrides when running under EFI").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9832f5e8e3575f8affceb2751f7422704bf7b446
master date: 2015-03-13 12:41:51 +0100
At the point this patch calls domain_update_node_affinity(), the vcpu
hard affinities have not yet been updated; so calling it at this point
can in some circumstances trigger an ASSERT().
domain_update_node_affinity() is already called in
cpu_disable_scheduler(), so adding it to cpupool_unassign_cpu() is
redundant. Simply reverting the patch is sufficient.
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
x86/EFI: allow reboot= overrides when running under EFI
By default we will always use EFI reboot mechanism when
running under EFI platforms. However some EFI platforms
are buggy and need to use the ACPI mechanism to
reboot (such as Lenovo ThinkCentre M57). As such
respect the 'reboot=' override and DMI overrides
for EFI platforms.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
- BOOT_INVALID is just zero
- also consider acpi_disabled in BOOT_INVALID resolution
- duplicate BOOT_INVALID resolution in machine_restart()
- don't fall back from BOOT_ACPI to BOOT_EFI (if it was overridden, it
surely was for a reason)
- adjust doc change formatting
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/EFI: fix reboot after c643fb110a
acpi_disabled needs to be moved out of .init.data.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com>
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: c643fb110a51693e82a36ca9178d54f0b9744024
master date: 2015-03-13 11:25:52 +0100
master commit: 8ff330ec11e471919621bce97c069b83b0319d15
master date: 2015-03-23 18:01:51 +0100
Ross Lagerwall [Thu, 26 Mar 2015 07:38:35 +0000 (08:38 +0100)]
EFI: fix getting EFI variable list on some systems
Copy the entire output buffer to the guest because some firmwares update
size on successful calls (contrary to the spec) and the buffer may
contain data beyond the output size that the firmware requires on a
subsequent GetNextVariableName() call (e.g. a NULL character).
Note that this shouldn't change the amount of data copied because on success, a
compliant firmware does not change size and so the entire buffer is copied
anyway. If size is changed, Xen does not copy the buffer.
Without this change, the following (simplified) sequence would occur:
GetNextVariableName: in \0, size 1024 || out AdminPw\0, size 7
GetNextVariableName: in AdminPw\0, size 1024 || out UserPw\0, size 6
GetNextVariableName: in UserPww\0, size 1024 || NOT FOUND
This was seen on an Intel S1200RP_SE with firmware
S1200RP.86B.02.02.0005.102320140911, version 4.6, date 2014-10-23.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1f4eb9d27d0ebd62a0b6cdff8213726f5ae8f25c
master date: 2015-03-10 13:52:01 +0100
Jan Beulich [Thu, 26 Mar 2015 07:37:56 +0000 (08:37 +0100)]
VT-d: print_vtd_entries() should cope with superpages
Even if VT-d code alone (i.e. when not sharing tables with EPT) still
doesn't support superpages, this function - invoked upon DMA remapping
faults - needs to cope with such.
While at it also replace a few more plain numbers with suitable named
constants.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 92cf6c2456dc428694ed95b6b1dec5bb84319790
master date: 2015-03-09 14:00:19 +0100
Jan Beulich [Thu, 26 Mar 2015 07:37:08 +0000 (08:37 +0100)]
honor MEMF_no_refcount in alloc_heap_pages()
Non-anonymous allocations with this flag set should - for the purpose
of the availability check - be treated just like anonymous ones, as
they wouldn't lead to a reduction of ->outstanding_pages.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 17294e69c4cd299da7ba3ca8077e24be76bd61b1
master date: 2015-02-26 13:58:54 +0100
Ian Campbell [Fri, 13 Mar 2015 10:39:50 +0000 (10:39 +0000)]
xen: arm: correct arm64 version of gva_to_ma_par
The implementation was backwards and checked that the guest could
read when asked about write and vice versa.
This is an update to the fix for XSA-98.
Reported-by: Tamas K Lengyel <tklengyel@sec.in.tum.de> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit c1245e9d5bf311b5a3267ea4b077a16561fcf439)
Ian Campbell [Fri, 20 Feb 2015 14:41:09 +0000 (14:41 +0000)]
tools: libxl: Explicitly disable graphics backends on qemu cmdline
By default qemu will try to create some sort of backend for the
emulated VGA device, either SDL or VNC.
However when the user specifies sdl=0 and vnc=0 in their configuration
libxl was not explicitly disabling either backend, which could lead to
one unexpectedly running.
If either sdl=1 or vnc=1 is configured then both before and after this
change only the backends which are explicitly enabled are configured,
i.e. this issue only occurs when all backends are supposed to have
been disabled.
This affects qemu-xen and qemu-xen-traditional differently.
If qemu-xen was compiled with SDL support then this would result in an
SDL window being opened if $DISPLAY is valid, or a failure to start
the guest if not. Passing "-display none" to qemu before any further
-sdl options disables this default behaviour and ensures that SDL is
only started if the libxl configuration demands it.
If qemu-xen was compiled without SDL support then qemu would instead
start a VNC server listening on ::1 (IPv6 localhost) or 127.0.0.1
(IPv4 localhost) with IPv6 preferred if available. Explicitly pass
"-vnc none" when vnc is not enabled in the libxl configuration to
remove this possibility.
qemu-xen-traditional would never start a vnc backend unless asked.
However by default it will start an SDL backend, the way to disable
this is to pass a -vnc option. In other words passing "-vnc none" will
disable both vnc and sdl by default. sdl can then be reenabled if
configured by subsequent use of the -sdl option.
Tested with both qemu-xen and qemu-xen-traditional built with SDL
support and:
xl cr # defaults
xl cr sdl=0 vnc=0
xl cr sdl=1 vnc=0
xl cr sdl=0 vnc=1
xl cr sdl=0 vnc=0 vga=\"none\"
xl cr sdl=0 vnc=0 nographic=1
with both valid and invalid $DISPLAY.
Jan Beulich [Thu, 12 Mar 2015 13:20:52 +0000 (14:20 +0100)]
x86/tboot: invalidate FIX_TBOOT_MAP_ADDRESS mapping after use
In order for commit cbeeaa7d ("x86/nmi: fix shootdown of pcpus
running in VMX non-root mode")'s re-use of that fixmap entry to not
cause undesirable (in crash context) cross-CPU TLB flushes, invalidate
the fixmap entry right after use.
Jan Beulich [Tue, 10 Mar 2015 12:58:00 +0000 (13:58 +0100)]
x86emul: fully ignore segment override for register-only operations
For ModRM encoded instructions with register operands we must not
overwrite ea.mem.seg (if a - bogus in that case - segment override was
present) as it aliases with ea.reg.
This is CVE-2015-2151 / XSA-123.
Reported-by: Felix Wilhelm <fwilhelm@ernw.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Keir Fraser <keir@xen.org>
master commit: bcf92a5382b75fd964c1f8678b2d9a3abe6dec39
master date: 2015-03-10 13:45:51 +0100
Boris Ostrovsky [Tue, 25 Nov 2014 16:11:50 +0000 (11:11 -0500)]
pygrub: Fix regression from c/s d1b93ea, attempt 2
c/s d1b93ea causes substantial functional regressions in pygrub's ability to
parse bootloader configuration files.
c/s d1b93ea itself changed an an interface which previously used exclusively
integers, to using strings in the case of a grub configuration with explicit
default set, along with changing the code calling the interface to require a
string. The default value for "default" remained as an integer.
As a result, any Extlinux or Lilo configuration (which drives this interface
exclusively with integers), or Grub configuration which doesn't explicitly
declare a default will die with an AttributeError when attempting to call
"self.cf.default.isdigit()" where "default" is an integer.
Sadly, this AttributeError gets swallowed by the blanket ignore in the loop
which searches partitions for valid bootloader configurations, causing the
issue to be reported as "Unable to find partition containing kernel"
We should explicitly check type of "default" in image_index() and process it
appropriately.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 3b279811707dab4bab95c2e952e94ebf4d6badd9)
Simon Rowe [Mon, 27 Oct 2014 16:00:14 +0000 (16:00 +0000)]
pygrub: fix non-interactive parsing of grub1 config files
Changes to handle non-numeric default attributes for grub2 caused run_grub()
to attempt to index into the images list using a string. Pull out the code
that handles submenus into a new function and use that to ensure sel is
numeric.
Reported-by: David Scott <dave.scott@citrix.com> Signed-off-by: Simon Rowe <simon.rowe@eu.citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-and-tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 4ee393f9d6528640c29a0554fdc6cb3e795fb6e8)
Boris Ostrovsky [Fri, 27 Jun 2014 14:07:31 +0000 (10:07 -0400)]
tools/pygrub: Make pygrub understand default entry in string format
Currently pygrub can only correctly parse grub2's default attribute when it is
specified as a number. If it is set to ${saved_entry} or ${next_entry} then
the first image (i.e. entry number 0) is selected. If any other value is
specified (typically this would be the string in menuentry) pygrub will crash.
This patch will allow pygrub to interpret default attribute if it is specified
as a string (note that in case of submenus only the leaf string will be
considered).
Also issue a warning if default is set to ${saved_entry} or ${next_entry}.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit d1b93ea2615bd789ee28901f1f1c05ffb319cb61)
Andrew Cooper [Wed, 11 Jun 2014 18:31:55 +0000 (19:31 +0100)]
tools/pygrub: Fix extlinux when /boot is a separate partition from /
Grub and Grub2 already cope with this.
Reported-by: Joseph Hom <jhom@softlayer.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 0c12e5b7427b4dfd2dfabf21f6b0e6e24bc8e864)
Conflicts:
tools/pygrub/src/pygrub
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Wed, 18 Feb 2015 15:48:07 +0000 (16:48 +0100)]
x86/nmi: fix shootdown of pcpus running in VMX non-root mode
c/s 7dd3b06ff "vmx: fix handling of NMI VMEXIT" fixed one issue but
inadvertently introduced a regression when it came to the NMI shootdown. The
shootdown code worked by patching vector 2 in each IDT, but the introduced
direct call to do_nmi() bypassed this.
Instead of patching each IDT, take a different approach by updating the
existing dispatch table. This allows for the removal of the remote IDT
patching and the removal of the nmi_crash() entry point.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cbeeaa7da01bfa37c1fcdfe79e8f4f1400262ccb
master date: 2015-02-11 17:18:27 +0100
Boris Ostrovsky [Wed, 18 Feb 2015 15:45:14 +0000 (16:45 +0100)]
x86/VPMU: disable when NMI watchdog is on
NMI watchdog sets APIC_LVTPC register to generate an NMI when PMU counter
overflow occurs. This may be overwritten by VPMU code later, effectively
turning off the watchdog.
We should disable VPMU when NMI watchdog is running.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e5e09b5c46444b0ab8450c6d6ee8316d4016ac18
master date: 2015-02-03 11:30:09 +0100
Dan Carpenter [Tue, 3 Feb 2015 11:34:56 +0000 (12:34 +0100)]
bunzip2: off by one in get_next_block()
"origPtr" is used as an offset into the bd->dbuf[] array. That array is
allocated in start_bunzip() and has "bd->dbufSize" number of elements so
the test here should be >= instead of >.
Later we check "origPtr" again before using it as an offset so I don't
know if this bug can be triggered in real life.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Trivial adjustments to make the respective Linux commit b5c8afe5be51078a979d86ae5ae78c4ac948063d apply to Xen.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 39798e95a954eec660a3f5f21489c30ef78daf6d
master date: 2015-01-28 16:50:08 +0100
x86: vcpu_destroy_pagetables() must not return -EINTR
.. otherwise it has the side effect that: domain_relinquish_resources
will stop and will return to user-space with -EINTR which it is not
equipped to deal with that error code; or vcpu_reset - which will
ignore it and convert the error to -ENOMEM..
The preemption mechanism we have for domain destruction is to return
-EAGAIN (and then user-space calls the hypercall again) and as such we need
to catch the case of:
we need to return -ERESTART otherwise we end up returning -ENOMEM.
There are also other callers of vcpu_destroy_pagetables: arch_vcpu_reset
(vcpu_reset) are:
- hvm_s3_suspend (asserts on any return code),
- vlapic_init_sipi_one (asserts on any return code),
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: de4f284b3d7b47d3b9807f354552ecf3e0fff56b
master date: 2015-01-26 12:51:09 +0100
Jan Beulich [Tue, 3 Feb 2015 11:32:25 +0000 (12:32 +0100)]
x86: don't expose XSAVES capability to PV guests
As done by the recent Linux commit b65d6e17fe ("kvm: x86: mask out
XSAVES") for KVM, we should also mask out XSAVES from what PV guests
get to see as long as we don't emulate accesses to MSR_IA32_XSS.
Actually, go beyond that: Just like for leaf 7, switch from
blacklisting to whitelisting, i.e. only allow XSAVEOPT and XSAVEC for
the time being. And do these overrides consistently for both Dom0 and
DomU-s.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8d050ed1097ce5f4bf6a1d6806fb1e3471976adb
master date: 2015-01-22 12:47:56 +0100
Andrew Cooper [Tue, 3 Feb 2015 11:31:31 +0000 (12:31 +0100)]
xsm/evtchn: never pretend to have successfully created a Xen event channel
Xen event channels are not internal resources. They still have one end in a
domain, and are created at the request of privileged domains. This logic
which "successfully" creates a Xen event channel opens up undesirable failure
cases with ill-specified XSM policies.
If a domain is permitted to create ioreq servers or memevent listeners, but
not to create event channels, the ioreq/memevent creation will succeed but
attempting to bind the returned event channel will fail without any indication
of a permission error.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 09aa4759faa29c1fe735266de4c79f17329bd67b
master date: 2015-01-20 10:42:26 +0100
Jan Beulich [Tue, 3 Feb 2015 11:30:59 +0000 (12:30 +0100)]
common/memory: fix an XSM error path
XENMEM_{in,de}crease_reservation as well as XENMEM_populate_physmap
return the extent at which failure was detected, not error indicators.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Tim Deegan <tim@xen.org>
master commit: 76d4ff26d9647088353acaf4a56388a354a5d6e9
master date: 2015-01-19 11:59:05 +0100
Jan Beulich [Tue, 3 Feb 2015 11:30:12 +0000 (12:30 +0100)]
x86emul: tighten CLFLUSH emulation
While for us it's not as bad as it was for Linux, their commit 13e457e0ee ("KVM: x86: Emulator does not decode clflush well", by
Nadav Amit <namit@cs.technion.ac.il>) nevertheless points out two
shortcomings in our code: opcode 0F AE /7 is clflush only when it uses
a memory mode (otherwise it's SFENCE) and when there's no REP prefix
(an operand size prefix is fine, as that's CLFLUSHOPT).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d03db6b81d1880bf3aa4fc83a60346bf02be251
master date: 2015-01-12 15:41:12 +0100
Boris Ostrovsky [Tue, 3 Feb 2015 11:28:40 +0000 (12:28 +0100)]
x86/VPMU: Clear last_vcpu when destroying VPMU
We need to make sure that last_vcpu is not pointing to VCPU whose
VPMU is being destroyed. Otherwise we may try to dereference it in
the future, when VCPU is gone.
We have to do this via IPI since otherwise there is a (somewheat
theoretical) chance that between test and subsequent clearing
of last_vcpu the remote processor (i.e. vpmu->last_pcpu) might do
both vpmu_load() and then vpmu_save() for another VCPU. The former
will clear last_vcpu and the latter will set it to something else.
Performing this operation via IPI will guarantee that nothing can
happen on the remote processor between testing and clearing of
last_vcpu.
We should also check for VPMU_CONTEXT_ALLOCATED in vpmu_destroy() to
avoid unnecessary percpu tests and arch-specific destroy ops. Thus
checks in AMD and Intel routines are no longer needed.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: ed8017155607db1bbe1f6ca41eac696b7ef8082b
master date: 2015-01-07 11:12:27 +0100
Jan Beulich [Tue, 3 Feb 2015 11:28:00 +0000 (12:28 +0100)]
VT-d: don't crash when PTE bits 52 and up are non-zero
This can (and will) be legitimately the case when sharing page tables
with EPT (more of a problem before p2m_access_rwx became zero, but
still possible even now when other than that is the default for a
guest), leading to an unconditional crash (in print_vtd_entries())
when a DMA remapping fault occurs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 46e0baf59105200d43612cf0c59de216958b008d
master date: 2015-01-07 11:13:58 +0100
Jan Beulich [Tue, 3 Feb 2015 11:27:12 +0000 (12:27 +0100)]
domctl: fix IRQ permission granting/revocation
Commit 545607eb3c ("x86: fix various issues with handling guest IRQs")
wasn't really consistent in one respect: The granting of access to an
IRQ shouldn't assume the pIRQ->IRQ translation to be the same in both
domains. In fact it is wrong to assume that a translation is already/
still in place at the time access is being granted/revoked.
What is wanted is to translate the incoming pIRQ to an IRQ for
the invoking domain (as the pIRQ is the only notion the invoking
domain has of the IRQ), and grant the subject domain access to
the resulting IRQ.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 6fb3a07bc0ad656b5f76eb9fc961bcd1d3cace58
master date: 2015-01-05 10:20:24 -0500
Julien Grall [Mon, 19 Jan 2015 12:59:42 +0000 (12:59 +0000)]
xen/arm: vgic: message in the emulation code should be rate-limited
printk is not rated-limited by default. Therefore a malicious guest may
be able to flood the Xen console.
If we use gdprintk, unecessary information will be printed such as the
filename and the line. Instead use XENLOG_G_ERR combine with %pv.
This is XSA-118.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ relates to d0b2caa80fccafbb131b28b7b8488001d82ab4bf in master ]
Ian Campbell [Thu, 8 Jan 2015 11:53:55 +0000 (11:53 +0000)]
dt-uart: use ':' as separator between path and options
',' is a valid character in a device-tree path (see ePAPR v1.1 Table
2-1), in fact ',' is actually pretty common in node names.
Using ',' as a separator breaks for example on fast models. If you use
the full path (/smb/motherboard/iofpga@3,00000000/uart@090000) rather
than the alias then earlyprintk gives:
(XEN) Looking for UART console /smb/motherboard/iofpga@3
(XEN) Unable to find device "/smb/motherboard/iofpga@3"
(XEN) Bad console= option 'dtuart'
I actually noticed this on Jetson where the uart is
"/serial@0,70006300" and there happened to be no alias defined.
Instead use ':' as the separator, it is defined to terminate the path
in the context of /chosen/stdout-path (Table 3-4) which is pretty
closely analogous to the dtuart= option and so makes a pretty good
choice (especially since the next patch adds support for stdout-path).
Since no DT aware driver current supports any options there is no
point in retaining support for ',' for backwards compatibility.
Additionally, expand the buffer for the dtuart option, a path can be
far longer than 30 characters (in fact the maximum size of a single
node name is 31, so it's not even necessarily enough for an alias).
128 is completely arbitrary and allows for paths at least 8 deep even
with worst case node names.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit f01af57300cb60ab0fd8487fb5bbbe97bee234f0)
Conflicts:
docs/misc/xen-command-line.markdown
-- dropped since precursor patch not present.
Julien Grall [Wed, 14 Jan 2015 13:06:48 +0000 (13:06 +0000)]
libxl: Don't ignore error when we fail to give access to ioport/irq/iomem
If we fail to give the access, the domain will unlikely work correctly.
So we should bail out at the first error.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com>
(Based on commit 7070eec417934360bf3aed434191246dfe4f8091)
Ian Campbell [Thu, 6 Nov 2014 13:00:31 +0000 (13:00 +0000)]
tools: libxl: do not leak diskpath during local disk attach
libxl__device_disk_local_initiate_attach is assigning dls->diskpath with a
strdup of the device path. This is then passed to the callback, e.g.
parse_bootloader_result but bootloader_cleanup will not free it.
Since the callback is within the scope of the (e)gc and therefore doesn't need
to be malloc'd, a gc'd alloc will do. All other assignments to this field use
the gc.
Reported-by: Gedalya <gedalya@gedalya.net> Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 379b351889a8f02abe30a06e2ce9ba8b381b91ab)
Ian Campbell [Thu, 6 Nov 2014 13:59:43 +0000 (13:59 +0000)]
tools: libxl: do not overrun input buffer in libxl__parse_mac
Valgrind reports:
==7971== Invalid read of size 1
==7971== at 0x40877BE: libxl__parse_mac (libxl_internal.c:288)
==7971== by 0x405C5F8: libxl__device_nic_from_xs_be (libxl.c:3405)
==7971== by 0x4065542: libxl__append_nic_list_of_type (libxl.c:3484)
==7971== by 0x4065542: libxl_device_nic_list (libxl.c:3504)
==7971== by 0x406F561: libxl_retrieve_domain_configuration (libxl.c:6661)
==7971== by 0x805671C: reload_domain_config (xl_cmdimpl.c:2037)
==7971== by 0x8057F30: handle_domain_death (xl_cmdimpl.c:2116)
==7971== by 0x8057F30: create_domain (xl_cmdimpl.c:2580)
==7971== by 0x805B4B2: main_create (xl_cmdimpl.c:4652)
==7971== by 0x804EAB2: main (xl.c:378)
This is because on the final iteration the tok += 3 skips over the terminating
NUL to the next byte, and then *tok reads it. Fix this by using endptr as the
iterator.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 5a430eca0b27354456d1245ed3f637d5f2e17883)
libxc: check return values on mmap() and madvise() on xc_alloc_hypercall_buffer()
On a Thinkpad T4440p with OpenSUSE tumbleweed with v3.15-rc4
and today's latest xen tip from the git tree strace -f reveals
we end up on a never ending wait shortly after
This is right before we just wait on the qemu process which we
had mmap'd for. Without this you'll end up getting stuck on a
loop if mmap() worked but madvise() did not. While at it I noticed
even the mmap() error fail was not being checked, fix that too.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit e86539a388314cd3dca88f5e69d7873343197cd8)
Mihai Donțu [Tue, 6 Jan 2015 12:49:52 +0000 (12:49 +0000)]
x86/HVM: prevent use-after-free when destroying a domain
hvm_domain_relinquish_resources() can free certain domain resources
which can still be accessed, e.g. by HVMOP_set_param, while the domain
is being cleaned up.
This is CVE-2015-0361 / XSA-116.
Signed-off-by: Mihai Donțu <mdontu@bitdefender.com> Tested-by: Răzvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d3c151fd3a4365fc6107198bfc975807d40d157d)
Frediano Ziglio [Thu, 23 Oct 2014 16:18:52 +0000 (17:18 +0100)]
xen/arm: dump guest stack even if not the current VCPU
If show_guest_stack was called from Xen context (for instance hitting
'0' key on Xen console) get_page_from_gva was not able to get the
page returning NULL.
Detecting different domain and changing VTTBR register make
get_page_from_gva works for different domains.
Signed-off-by: Frediano Ziglio <frediano.ziglio@huawei.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 281c892e3bdb0a56cfd8ab6516cd1a33c095c857)
Julien Grall [Fri, 28 Nov 2014 15:17:06 +0000 (15:17 +0000)]
xen/arm: Handle platforms with edge-triggered virtual timer
Some platforms (such as Xgene and ARMv8 models) use an edge-triggered interrupt
for the virtual timer. Even if the timer output signal is masked in the
context switch, the GIC will keep track that of any interrupts raised
while IRQs are disabled. As soon as IRQs are re-enabled, the virtual
interrupt timer will be injected to Xen.
If an idle vVCPU was scheduled next then the interrupt handler doesn't
expect to the receive the IRQ and will crash:
The proper solution is to context switch the virtual interrupt state at
the GIC level. This would also avoid masking the output signal which
requires specific handling in the guest OS and more complex code in Xen
to deal with EOIs, and so is desirable for that reason too.
Sadly, this solution requires some refactoring which would not be
suitable for a freeze exception for the Xen 4.5 release.
For now implement a temporary solution which ignores the virtual timer
interrupt when the idle VCPU is running.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- tweaked some wording in the comment ]
(cherry picked from commit f688aec47c452d6aef382739d0781735672ef995)
Modifying ienable and calling vgic_enable_irqs or vgic_disable_irqs need
to be atomic. Move the calls to vgic_enable_irqs and vgic_disable_irqs
before releasing the rank lock.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- partial backport of just the locking part of 5b3a817ea33b
"xen/arm: observe itargets setting in vgic_enable_irqs and
vgic_disable_irqs"
]
xen/arm: domain_vgic_init: Avoid double free on shared_irqs
When the function domain_vgic_init is failing to initialize pending_irqs,
it will free shared_irqs. Few call later, domain_vgic_free will be called
an try to free a second time the same variable. This will result to a double
free.
Remove the free in domain_vgic_init and rely on domain_vgic_free to correctly
release the memory.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commit 7b41618f5a08145b0198af4a8a2ce361d7e677e6)
Jan Beulich [Wed, 10 Dec 2014 11:30:15 +0000 (12:30 +0100)]
x86/HVM: don't crash guest upon problems occurring in user mode
This extends commit 5283b310 ("x86/HVM: only kill guest when unknown VM
exit occurred in guest kernel mode") to a few more cases, including the
failed VM entry one that XSA-110 was needed to be issued for.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/HVM: prevent infinite VM entry retries
This reverts the VMX side of commit 28b4baac ("x86/HVM: don't crash
guest upon problems occurring in user mode") and gets SVM in line with
the resulting VMX behavior. This is because Andrew validly says
"A failed vmentry is overwhelmingly likely to be caused by corrupt
VMC[SB] state. As a result, injecting a fault and retrying the the
vmentry is likely to fail in the same way."
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 28b4baacd599e8c10e6dac055f6a939bb730fb8a
master date: 2014-11-25 10:08:57 +0100
master commit: 04ae2f6837b35bcfb689baf15f493da626929fb5
master date: 2014-12-02 12:48:01 +0100
Jan Beulich [Wed, 10 Dec 2014 11:29:05 +0000 (12:29 +0100)]
x86/cpuidle: don't count C1 multiple times
Commit 4ca6f9f0 ("x86/cpuidle: publish new states only after fully
initializing them") resulted in the state counter to be incremented
for C1 despite that using a fixed table entry (and the statically
initialized counter value already accounting for it and C0).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 0aabd10525326edfe5098c2ec5bfe05db7732c32
master date: 2014-11-25 10:05:29 +0100
Jan Beulich [Wed, 10 Dec 2014 11:26:56 +0000 (12:26 +0100)]
EFI: allow retry of ExitBootServices() call
The specification is kind of vague under what conditions
ExitBootServices() may legitimately fail, requiring the OS loader to
retry:
"If MapKey value is incorrect, ExitBootServices() returns
EFI_INVALID_PARAMETER and GetMemoryMap() with ExitBootServices() must
be called again. Firmware implementation may choose to do a partial
shutdown of the boot services during the first call to
ExitBootServices(). EFI OS loader should not make calls to any boot
service function other then GetMemoryMap() after the first call to
ExitBootServices()."
While our code guarantees the map key to be valid, there are systems
where a firmware internal notification sent while processing
ExitBootServices() reportedly results in changes to the memory map.
In that case, make a best effort second try: Avoid any boot service
calls other than the two named above, with the possible exception of
error paths. Those aren't a problem, since if we end up needing to
retry, we're hosed when something goes wrong as much as if we didn't
make the retry attempt.
For x86, a minimal adjustment to efi_arch_process_memory_map() is
needed for it to cope with potentially being called a second time.
For arm64, while efi_process_memory_map_bootinfo() is easy to verify
that it can safely be called more than once without violating spec
constraints, it's not so obvious for fdt_add_uefi_nodes(), hence a
step by step approach:
- deletion of memory nodes and memory reserve map entries: the 2nd pass
shouldn't find any as the 1st one deleted them all,
- a "chosen" node should be found as it got added in the 1st pass,
- the various "linux,uefi-*" nodes all got added during the 1st pass
and hence only their contents may get updated.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roy Franz <roy.franz@linaro.org>
master commit: 0540b854f6733759593e829bc3f13c9b45974e32
master date: 2014-11-17 15:07:03 +0100
Jan Beulich [Wed, 10 Dec 2014 11:22:00 +0000 (12:22 +0100)]
x86: (allow to) override LIST_POISON*
Having these point into space not controlled by the hypervisor provides
an unnecessary attack surface. Allow architectures to override them and
utilize that override to make them non-canonical addresses (thus
causing #GP rather than #PF when dereferenced).
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 404227138e1e49c8073e946649a8d4173b35625c
master date: 2014-11-17 15:05:53 +0100
Juergen Gross [Wed, 10 Dec 2014 11:20:40 +0000 (12:20 +0100)]
adjust number of domains in cpupools when destroying domain
Commit bac6334b51d9bcfe57ecf4a4cb5288348fcf044a (move domain to
cpupool0 before destroying it) introduced an error in the accounting
of cpupools regarding the number of domains. The number of domains
is nor adjusted when a domain is moved to cpupool0 in kill_domain().
Correct this by introducing a cpupool function doing the move
instead of open coding it by calling sched_move_domain().
Reported-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Reviewed-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: 934e7baa6c12d19cfaf24e8f8e27d6c6a8b8c5e4
master date: 2014-11-12 12:39:58 +0100
Keir Fraser [Mon, 8 Dec 2014 14:26:57 +0000 (15:26 +0100)]
switch to write-biased r/w locks
This is to improve fairness: A permanent flow of read acquires can
otherwise lock out eventual writers indefinitely.
This is CVE-2014-9065 / XSA-114.
Signed-off-by: Keir Fraser <keir@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2a549b9c8aa48dc39d7c97e5a93978b781b3a1db
master date: 2014-12-08 14:45:46 +0100
Jan Beulich [Thu, 27 Nov 2014 13:11:57 +0000 (14:11 +0100)]
x86/HVM: confine internally handled MMIO to solitary regions
While it is generally wrong to cross region boundaries when dealing
with MMIO accesses of repeated string instructions (currently only
MOVS) as that would do things a guest doesn't expect (leaving aside
that none of these regions would normally be accessed with repeated
string instructions in the first place), this is even more of a problem
for all virtual MSI-X page accesses (both msixtbl_{read,write}() can be
made dereference NULL "entry" pointers this way) as well as undersized
(1- or 2-byte) LAPIC writes (causing vlapic_read_aligned() to access
space beyond the one memory page set up for holding LAPIC register
values).
Since those functions validly assume to be called only with addresses
their respective checking functions indicated to be okay, it is generic
code that needs to be fixed to clip the repetition count.
To be on the safe side (and consistent), also do the same for buffered
I/O intercepts, even if their only client (stdvga) doesn't put the
hypervisor at risk (i.e. "only" guest misbehavior would result).
This is CVE-2014-8867 / XSA-112.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: c5397354b998d030b021810b8202de93b9526818
master date: 2014-11-27 14:01:40 +0100
Jan Beulich [Thu, 27 Nov 2014 13:10:52 +0000 (14:10 +0100)]
x86: limit checks in hypercall_xlat_continuation() to actual arguments
HVM/PVH guests can otherwise trigger the final BUG_ON() in that
function by entering 64-bit mode, setting the high halves of affected
registers to non-zero values, leaving 64-bit mode, and issuing a
hypercall that might get preempted and hence become subject to
continuation argument translation (HYPERVISOR_memory_op being the only
one possible for HVM, PVH also having the option of using
HYPERVISOR_mmuext_op). This issue got introduced when HVM code was
switched to use compat_memory_op() - neither that nor
hypercall_xlat_continuation() were originally intended to be used by
other than PV guests (which can't enter 64-bit mode and hence have no
way to alter the high halves of 64-bit registers).
This is CVE-2014-8866 / XSA-111.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 0ad715304b04739fd2fc9517ce8671d3947c7621
master date: 2014-11-27 14:00:23 +0100
Jan Beulich [Tue, 18 Nov 2014 13:28:45 +0000 (14:28 +0100)]
x86emul: enforce privilege level restrictions when loading CS
Privilege level checks were basically missing for the CS case, the
only check that was done (RPL == DPL for nonconforming segments)
was solely covering a single special case (return to non-conforming
segment).
Additionally in long mode the L bit set requires the D bit to be clear,
as was recently pointed out for KVM by Nadav Amit
<namit@cs.technion.ac.il>.
Finally we also need to force the loaded selector's RPL to CPL (at
least as long as lret/retf emulation doesn't support privilege level
changes).
This is CVE-2014-8595 / XSA-110.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 1d68c1a70e00ed95ef0889cfa005379dab27b37d
master date: 2014-11-18 14:16:23 +0100
Jan Beulich [Tue, 18 Nov 2014 13:27:46 +0000 (14:27 +0100)]
x86: don't allow page table updates on non-PV page tables in do_mmu_update()
paging_write_guest_entry() and paging_cmpxchg_guest_entry() aren't
consistently supported for non-PV guests (they'd deref NULL for PVH or
non-HAP HVM ones). Don't allow respective MMU_* operations on the
page tables of such domains.
This is CVE-2014-8594 / XSA-109.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: e4292c5aac41b80f33d4877104348d5ee7c95aa4
master date: 2014-11-18 14:15:21 +0100
Jan Beulich [Thu, 13 Nov 2014 08:50:13 +0000 (09:50 +0100)]
x86/PVH: replace bogus assertion with conditional
While PVH guests currently have to start in 64-bit mode, nothing keeps
them from entering compatibility mode via a suitable ring-0 code
selector and making a hypercall from there. Fail such attempts rather
than asserting they won't happen.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 01c019c843df313f62ca5c106544026867da9e9e
master date: 2014-11-04 13:07:23 +0100
Andrew Cooper [Thu, 13 Nov 2014 08:49:40 +0000 (09:49 +0100)]
process softirqs while dumping domains
Process softirqs once per domain, and once every 64 vcpus in a guest to avoid
being hit by the NMI watchdog. Discovered against a VM which had accidentally
been assigned 8192 vcpus.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9cf71226edabd8b9bc81a5eb57823dacbe8b4bd8
master date: 2014-10-31 11:28:36 +0100
Jan Beulich [Thu, 13 Nov 2014 08:48:55 +0000 (09:48 +0100)]
x86/HVM: only kill guest when unknown VM exit occurred in guest kernel mode
A recent KVM change by Nadav Amit <namit@cs.technion.ac.il> pointed out
that unconditional VM exits (like VMX'es ones for the INVEPT, INVVPID,
and XSETBV instructions) may result from guest user mode activity (in
the example cases, e.g. prior to a privilege level check being done).
Consequently convert the unconditional domain_crash() to a conditional
one (when guest is in kernel mode) with the alternative of injecting
#UD (when in user mode).
This is meant to be a precaution against in-guest security issues
introduced when any such VM exit becomes possible (on newer hardware)
without the hypervisor immediately being aware of it. There are no such
unhandled VM exits currently (and hence this is not an active security
issue), but old (no longer security maintained) versions exhibit issues
in the cases given as examples above.
Suggested-by: Tim Deegan <tim@xen.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5283b310e14884341f51be35253cdd59c4cb034c
master date: 2014-10-31 11:32:27 +0100
Jan Beulich [Thu, 13 Nov 2014 08:48:13 +0000 (09:48 +0100)]
VMX: values written to MSR_IA32_SYSENTER_E[IS]P should be canonical
A recent KVM change by Nadav Amit <namit@cs.technion.ac.il> helped spot
that we have the same issue as they did.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 93cc5c6f1641e90eb120826d42f103b7726efb8e
master date: 2014-10-31 11:31:11 +0100
Don Koch [Thu, 13 Nov 2014 08:47:30 +0000 (09:47 +0100)]
x86/HVM: sanity check xsave area when migrating or restoring from older Xen versions
Xen 4.3.0, 4.2.3 and older transferred a maximum sized xsave area (as
if all the available XCR0 bits were set); the new version only
transfers based on the actual XCR0 bits. This may result in a smaller
area if the last sections were missing (e.g., the LWP area from an AMD
machine). If the size doesn't match the XCR0 derived size, the size is
checked against the maximum size and the part of the xsave area
between the actual and maximum used size is checked for zero data. If
either the max size check or any part of the overflow area is
non-zero, we return with an error.
Signed-off-by: Don Koch <dkoch@verizon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d7bb8e88a087690feba63ef83c13ba067f041da0
master date: 2014-10-27 16:45:09 +0100
Jan Beulich [Thu, 13 Nov 2014 08:46:46 +0000 (09:46 +0100)]
EFI: allow to suppress the use of runtime services
On certain systems some of the memory map entries designated for use by
runtime services cannot be mapped (frequently due to firmware bugs). On
others, some of the memory map entries aren't even marked for runtime
services use, yet are being used by them. For both cases give people a
way to suppress use of runtime services altogether.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5680244040d26234a43c7f884ecf98fa4928d0da
master date: 2014-10-27 16:43:52 +0100
Jan Beulich [Thu, 13 Nov 2014 08:45:42 +0000 (09:45 +0100)]
x86: tolerate running on EFI runtime services page tables in map_domain_page()
In the event of a #PF while in an EFI runtime service function we
otherwise can't dump the page tables, making the analysis of the
problem more cumbersome.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e65436ba36be8f1b735573d8fc9af7d8a053ba5f
master date: 2014-10-27 16:43:12 +0100
Andrew Cooper [Thu, 13 Nov 2014 08:44:26 +0000 (09:44 +0100)]
hvm/load: correct length checks for zeroextended records
In the case that Xen is attempting to load a zeroextended HVM record where the
difference needing extending would overflow the data blob, _hvm_check_entry()
will incorrectly fail before working out that it would have been safe.
The "len + sizeof(*d)" check is wrong. Consider zeroextending a 16 byte
record into a 32 byte structure. "32 + hdr" will fail the overall context
length check even though the pre-extended record in the stream is 16 bytes.
The first condition is reduced to just a length check for hvm save header,
while the second condition is extended to include a check that the record in
the stream not exceeding the stream length.
The error messages are extended to include further useful information.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <Paul.Durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66d0c0aa1f3e57e873fd64d1d370e11758d25442
master date: 2014-10-27 16:41:50 +0100
Yang Zhang [Thu, 13 Nov 2014 08:43:33 +0000 (09:43 +0100)]
vmx: fix save/restore issue with apicv
This patch fixes two issues:
1. Interrupts on PIR are lost during save/restore. Syncing the PIR
into IRR during save will fix it.
2. EOI exit bitmap doesn't set up correctly after restore. Here we
will construct the eoi exit bitmap via (IRR | ISR). Though it may cause
unnecessary eoi exit of the interrupts that pending in IRR or ISR during
save/restore, each pending interrupt only causes one vmexit. The
subsequent interrupts will adjust the eoi exit bitmap correctly. So
the performance hurt can be ignored.
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 607e8494c42397fb249191904066cace6ac9a880
master date: 2014-10-27 16:40:18 +0100
Andrew Cooper [Thu, 13 Nov 2014 08:42:03 +0000 (09:42 +0100)]
fix listing of vcpus when domains lacking any vcpus exist
On a system which looks like this:
[root@st04 ~]# xl list
Name ID Mem VCPUs State Time(s)
Domain-0 0 752 4 r----- 46699.3
(null) 1 0 0 --p--- 0.0
(null) 2 0 0 --p--- 0.0
(null) 3 0 0 --p--- 0.0
badger 25 0 1 --p--- 0.0
`xl vcpu-list` failes as so:
[root@st04 ~]# xl vcpu-list
Name ID VCPU CPU State Time(s) CPU Affinity
Domain-0 0 0 0 -b- 12171.0 all
Domain-0 0 1 1 -b- 11779.6 all
Domain-0 0 2 2 -b- 11599.0 all
Domain-0 0 3 3 r-- 11007.0 all
libxl: critical: libxl__calloc: libxl: FATAL ERROR: memory allocation failure (libxl__calloc, 4294935299 x 40)
: Cannot allocate memory
libxl: FATAL ERROR: memory allocation failure (libxl__calloc, 4294935299 x 40)
The root cause of this is in Xen. getdomaininfo() has no way of expressing
"this domain has no vcpus". Previously, info->max_vcpu_id would be returned
uninitialised in such a case.
Unfortunately, setting it to 0 as a default is not appropriate. A max_vcpu_id
of 0 and nr_online_cpus of 0 is the valid state for a single vcpu domain which
is in the process of being destroyed.
As all components are required to add 1 to max_vcpu_id to get the number of
vcpus, an id of ~0U is not valid to be used. Explicitly define this as an
invalid max vcpu value, and use it to express "no vcpus" in getdomaininfo()
In libxl, the issue is seen as libxl_list_vcpu() attempts to use the
uninitialised domaininfo.max_vcpu_id for memory allocation.
Check domaininfo.max_vcpu_id against the new sentinel value
XEN_INVALID_MAX_VCPU_ID, and return early. This means that it is now valid
for libxl_list_vcpu() to return NULL for a domain which lacks any vcpus.
As part of this change, remove the pointless call to libxl_get_max_cpus(),
whose returned value is unconditionally clobbered in the for() loop.
Reported-by: Euan Harris <euan.harris@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Acked-by: Ian Campbell <Ian.Campbell@citrix.com>
tools/libxl: Fix libxl_list_vcpu() following c/s 93e52d52
My reasoning regarding nr_cpus_out was wrong, as I had confused nr_cpus_out
with nr_vcpus_out.
Dario pointed this out, but the patch (having gained appropriate acks) got
committed before I could post a correction.
Noticed-by: Dario Faggioli <dario.faggioli@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 93e52d5242804ff928131553599afa769f85481b
master date: 2014-10-23 10:19:53 +0200
master commit: c90a755f261b8ccb3dac7e1f695122cac95c6053
master date: 2014-10-23 12:29:00 +0100
Jan Beulich [Fri, 17 Oct 2014 13:57:42 +0000 (15:57 +0200)]
x86/paging: make log-dirty operations preemptible
Both the freeing and the inspection of the bitmap get done in (nested)
loops which - besides having a rather high iteration count in general,
albeit that would be covered by XSA-77 - have the number of non-trivial
iterations they need to perform (indirectly) controllable by both the
guest they are for and any domain controlling the guest (including the
one running qemu for it).
Note that the tying of the continuations to the invoking domain (which
previously [wrongly] used the invoking vCPU instead) implies that the
tools requesting such operations have to make sure they don't issue
multiple similar operations in parallel.
Note further that this breaks supervisor-mode kernel assumptions in
hypercall_create_continuation() (where regs->eip gets rewound to the
current hypercall stub beginning), but otoh
hypercall_cancel_continuation() doesn't work in that mode either.
Perhaps time to rip out all the remains of that feature?
This is part of CVE-2014-5146 / XSA-97.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 070493dfd2788e061b53f074b7ba97507fbcbf65
master date: 2014-10-06 11:22:04 +0200
Andrew Cooper [Fri, 17 Oct 2014 13:56:51 +0000 (15:56 +0200)]
AMD/guest_iommu: properly disable guest iommu support
AMD Guest IOMMU support was added to allow correct use of PASID and PRI
hardware support with an ATS-aware guest driver.
However, support cannot possibly function as guest_iommu_set_base() has no
callers. This means that its MMIO region's P2M pages are not set to
p2m_mmio_dm, preventing any invocation of the MMIO read/write handlers.
c/s fd186384 "x86/HVM: extend LAPIC shortcuts around P2M lookups" introduces a
path (via hvm_mmio_internal()) where iommu_mmio_handler claims its MMIO range,
and causes __hvm_copy() to fail with HVMCOPY_bad_gfn_to_mfn.
iommu->mmio_base defaults to 0, with a range of 8 pages, and is unilaterally
enabled in any HVM guests when the host IOMMU(s) supports any extended
features.
Unfortunately, HVMLoader's AP boot trampoline executes an `lmsw` instruction
at linear address 0x100c which unconditionally requires emulation. The
instruction fetch in turn fails as __hvm_copy() fails with
HVMCOPY_bad_gfn_to_mfn.
The result is that multi-vcpu HVM guests do not work on newer AMD hardware, if
IOMMU support is enabled in the BIOS.
Change the default mmio_base address to ~0ULL. This prevents
guest_iommu_mmio_range() from actually claiming any physical range
whatsoever, which allows the emulation of `lmsw` to succeed.
Reported-by: Roberto Luongo <rluongo@ready.it> Suggested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Roberto Luongo <rluongo@ready.it> Acked-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
master commit: 02e4c89b2cc3c1f19ef42ed4fcb1931d8d704bb5
master date: 2014-10-06 11:20:12 +0200
Jan Beulich [Fri, 17 Oct 2014 13:56:07 +0000 (15:56 +0200)]
don't allow Dom0 access to IOMMUs' MMIO pages
Just like for LAPIC, IO-APIC, MSI, and HT we shouldn't be granting Dom0
access to these. This implicitly results in these pages also getting
marked reserved in the machine memory map Dom0 uses to determine the
ranges where PCI devices can have their MMIO ranges placed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fdf30377fbc4fa6798bfda7d69e5d448c2b8e834
master date: 2014-10-06 11:15:01 +0200
Jan Beulich [Fri, 17 Oct 2014 13:55:27 +0000 (15:55 +0200)]
x86: restore reserving of IO-APIC pages in XENMEM_machine_memory_map output
Commit d1222afda4 ("x86: allow Dom0 read-only access to IO-APICs") had
an unintended side effect: By no longer adding IO-APIC pages to Dom0's
iomem_caps these also no longer get reported as reserved in the machine
memory map presented to it (which got added there intentionally by
commit b8a456caed ["x86: improve reporting through
XENMEM_machine_memory_map"] because many BIOSes fail to add these).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 9d8edc5a8b4a0937193f80da72abdb44c5aeaec6
master date: 2014-10-06 11:13:19 +0200
Jan Beulich [Fri, 17 Oct 2014 13:54:31 +0000 (15:54 +0200)]
x86/MSI: fix MSI-X case of freeing IRQ
Commit d1b6d0a024 ("x86: enable multi-vector MSI") went a little too
far with moving things around in msi_free_irqs() in order to streamline
the code: We shouldn't drop the MSI-X control page reference before
calling destroy_irq(), as the latter will call us back via
desc->handler->shutdown() (effectively invoking to msi_set_mask_bit()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 44d20c69516f8d5e71fc14bf216e230a9910d729
master date: 2014-10-06 11:11:28 +0200
Roy Franz [Fri, 17 Oct 2014 13:53:46 +0000 (15:53 +0200)]
x86/EFI: fix freeing of uninitialized pointer
The only valid response from the LocateHandle() call is EFI_BUFFER_TOO_SMALL,
so exit if we get anything else. We pass a 0 size/NULL pointer buffer, so the
only other returns we will get is an error. Return right away as there is
nothing to do. Also return if there is an error allocating the buffer, as the
previous code path also allowed for an undefined pointer to be freed.
Signed-off-by: Roy Franz <roy.franz@linaro.org>
Re-structure the change.