Jan Beulich [Thu, 17 Aug 2017 13:07:56 +0000 (15:07 +0200)]
gnttab: fix transitive grant handling
Processing of transitive grants must not use the fast path, or else
reference counting breaks due to the skipped recursive call to
__acquire_grant_for_copy() (its __release_grant_for_copy()
counterpart occurs independent of original pin count). Furthermore
after re-acquiring temporarily dropped locks we need to verify no grant
properties changed if the original pin count was non-zero; checking
just the pin counts is sufficient only for well-behaved guests. As a
result, __release_grant_for_copy() needs to mirror that new behavior.
Furthermore a __release_grant_for_copy() invocation was missing on the
retry path of __acquire_grant_for_copy(), and gnttab_set_version() also
needs to bail out upon encountering a transitive grant.
This is part of XSA-226.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad48fb963dbff02762d2db5396fa655ac0c432c7
master date: 2017-08-17 14:40:31 +0200
Jan Beulich [Thu, 17 Aug 2017 13:01:27 +0000 (15:01 +0200)]
gnttab: don't use possibly unbounded tail calls
There is no guarantee that the compiler would actually translate them
to branches instead of calls, so only ones with a known recursion limit
are okay:
- __release_grant_for_copy() can call itself only once, as
__acquire_grant_for_copy() won't permit use of multi-level transitive
grants,
- __acquire_grant_for_copy() is fine to call itself with the last
argument false, as that prevents further recursion,
- __acquire_grant_for_copy() must not call itself to recover from an
observed change to the active entry's pin count
This is part of XSA-226.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 999d2ccb7f73408aa22656e1ba2f98b077eaa1c2
master date: 2017-08-17 14:39:18 +0200
Jan Beulich [Tue, 15 Aug 2017 13:21:42 +0000 (15:21 +0200)]
gnttab: correct pin status fixup for copy
Regardless of copy operations only setting GNTPIN_hst*, GNTPIN_dev*
also need to be taken into account when deciding whether to clear
_GTF_{read,writ}ing. At least for consistency with code elsewhere the
read part better doesn't use any mask at all.
This is XSA-230.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6e2a4c73564ab907b732059adb317d6ca2d138a2
master date: 2017-08-15 15:08:03 +0200
Jan Beulich [Tue, 15 Aug 2017 13:21:02 +0000 (15:21 +0200)]
gnttab: split maptrack lock to make it fulfill its purpose again
The way the lock is currently being used in get_maptrack_handle(), it
protects only the maptrack limit: The function acts on current's list
only, so races on list accesses are impossible even without the lock.
Otoh list access races are possible between __get_maptrack_handle() and
put_maptrack_handle(), due to the invocation of the former for other
than current from steal_maptrack_handle(). Introduce a per-vCPU lock
for list accesses to become race free again. This lock will be
uncontended except when it becomes necessary to take the steal path,
i.e. in the common case there should be no meaningful performance
impact.
When in get_maptrack_handle adds a stolen entry to a fresh, empty,
freelist, we think that there is probably no concurrency. However,
this is not a fast path and adding the locking there makes the code
clearly correct.
Also, while we are here: the stolen maptrack_entry's tail pointer was
not properly set. Set it.
This is CVE-2017-12136 / XSA-228.
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: 02cbeeb6207508b0f04a2c6181445c8eb3f1e117
master date: 2017-08-15 15:07:25 +0200
Andrew Cooper [Tue, 15 Aug 2017 13:20:10 +0000 (15:20 +0200)]
x86/grant: disallow misaligned PTEs
Pagetable entries must be aligned to function correctly. Disallow attempts
from the guest to have a grant PTE created at a misaligned address, which
would result in corruption of the L1 table with largely-guest-controlled
values.
This is CVE-2017-12137 / XSA-227.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ce442926c2530da9376199dcc769436376ad2386
master date: 2017-08-15 15:06:45 +0200
Punit Agrawal [Fri, 26 May 2017 11:14:06 +0000 (12:14 +0100)]
arm: p2m: Prevent redundant icache flushes
When toolstack requests flushing the caches, flush_page_to_ram() is
called for each page of the requested domain. This needs to unnecessary
icache invalidation operations.
Let's take the responsibility of performing icache operations and use
the recently introduced flag to prevent redundant icache operations by
flush_page_to_ram().
Punit Agrawal [Fri, 26 May 2017 11:14:05 +0000 (12:14 +0100)]
Allow control of icache invalidations when calling flush_page_to_ram()
flush_page_to_ram() unconditionally drops the icache. In certain
situations this leads to execessive icache flushes when
flush_page_to_ram() ends up being repeatedly called in a loop.
Introduce a parameter to allow callers of flush_page_to_ram() to take
responsibility of synchronising the icache. This is in preparations for
adding logic to make the callers perform the necessary icache
maintenance operations.
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 54b8651066e82f04db9d9e5b0cc02c26d39ae763)
xen/arm: Properly map the FDT in the boot page table
Currently, Xen is assuming the FDT will always fit in a 2MB section.
Recently, I noticed an early crash on Xen when using GRUB with the
following call trace:
Indeed, the booting documentation for AArch32 and AArch64 only requires
the FDT to be placed on a 8-byte boundary. This means the Device-Tree can
cross a 2MB boundary.
Given that Xen limits the size of the FDT to 2MB, it will always fit in
a 4MB slot. So extend the fixmap slot for FDT from 2MB to 4MB.
The second 2MB superpage will only be mapped if the FDT is cross the 2MB
boundary.
xen/arm: Check if the FDT passed by the bootloader is valid
There is currently no sanity check on the FDT passed by the bootloader.
Whilst they are stricly not necessary, it will avoid us to spend hours
to try to find out why it does not work.
>From the booting documentation for AArch32 [1] and AArch64 [2] must :
- be placed on 8-byte boundary
- not exceed 2MB (only on AArch64)
Even if AArch32 does not seem to limit the size, Xen is not currently
able to support more the 2MB FDT. It is better to crash rather with a nice
error message than claiming we are supporting any size of FDT.
The checks are mostly borrowed from the Linux code (see fixmap_remap_fdt
in arch/arm64/mm/mmu.c).
[1] Section 2 in linux/Documentation/arm64/booting.txt
[2] Section 4b in linux/Documentation/arm/Booting
xen/arm: Move the code to map FDT in the boot tables from assembly to C
The FDT will not be accessed before start_xen (begining of C code) is
called and it will be easier to maintain as the code could be common
between AArch32 and AArch64.
A new function early_fdt_map is introduced to map the FDT in the boot
page table.
Jan Beulich [Fri, 23 Jun 2017 13:13:21 +0000 (15:13 +0200)]
memory: don't suppress P2M update in populate_physmap()
Commit d18627583d ("memory: don't hand MFN info to translated guests")
wrongly added a null-handle check there - just like stated in its
description for memory_exchange(), the array is also an input for
populate_physmap() (and hence can't reasonably be null). I have no idea
how I've managed to overlook this.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b964e3106d2cdaa11cc4524181ff14607d110ae4
master date: 2017-06-20 14:51:53 +0200
When determining Access Rights, Protection Keys only take effect when CR4.PKE
it set, and 4-level paging is active. All other circumstances (notibly, 32bit
PAE paging) skip the Protection Key control mechanism.
Therefore, we do not need to clear CR4.PKE behind the back of a guest which is
not using paging, as such a guest is necesserily running with EFER.LMA
disabled.
The {RD,WR}PKRU instructions are specified as being legal for use in any
operating mode, but only if CR4.PKE is set. By clearing CR4.PKE behind the
back of an unpaged guest, these instructions yield #UD despite the guest
correctly seeing PKE set if it reads CR4, and OSPKE being visible in CPUID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Huaitong Han <huaitong.han@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 224acdd04a9f6ffe44d2f716287cac74787899ec
master date: 2017-06-01 14:13:57 +0100
Andrew Cooper [Fri, 23 Jun 2017 13:11:19 +0000 (15:11 +0200)]
x86/pv: Fix the handling of `int $x` for vectors which alias exceptions
The claim at the top of c/s 2e426d6eecf "x86/traps: Drop use_error_code
parameter from do_{,guest_}trap()" is only actually true for hardware
exceptions. It is not true for `int $x` instructions (which never push error
code), irrespective of whether the vector aliases an exception or not.
Furthermore, c/s 6480cc6280e "x86/traps: Fix failed ASSERT() in
do_guest_trap()" really should have helped highlight that a regression had
been introduced.
Modify pv_inject_event() to understand event types other than
X86_EVENTTYPE_HW_EXCEPTION, and introduce pv_inject_sw_interrupt() for the
`int $x` handling code.
Add further assertions to pv_inject_event() concerning the type of events
passed in, which in turn requires that do_guest_trap() set its type
appropriately (which is now used exclusively for hardware exceptions).
Ian Jackson [Mon, 19 Jun 2017 14:04:08 +0000 (15:04 +0100)]
xen/test/Makefile: Fix clean target, broken by pattern rule
In "xen/test/livepatch: Regularise Makefiles" we reworked
xen/test/Makefile to use a pattern rule. However, there are two
problems with this. Both are related to the way that xen/Rules.mk is
implicitly part of this Makefile because of the way that Makefiles
under xen/ are invoked by their parent directory Makefiles.
Firstly, the Rules.mk `clean' target overrides the pattern rule in
xen/test/Makefile. The result is that `make -C xen clean' does not
actually run the livepatch clean target.
The Rules.mk clean target does have provision for recursing into
subdirectories, but that feature is tangled up with complex object
file iteration machinery which is not desirable here. However, we can
extend the Rules.mk clean target since it is a double-colon rule.
Sadly this involves duplicating the SUBDIR iteration boilerplate. (A
make function could be used but the cure would be worse than the
disease.)
Secondly, Rules.mk has a number of -include directives. make likes to
try to (re)build files mentioned in includes. With the % pattern
rule, this applies to those files too.
As a result, make -C xen clean would try to build `.*.d' (for example)
in xen/test. This would fail with an error message. The error would
be ignored because of the `-', but it's annoying and ugly.
Solve this by limiting the % pattern rule to the targets we expect it
to handle. These are those listed in the top-level Makefile help
message, apart from: those which are subdir- or component-qualified;
clean targets (which are handled specially, even distclean); and dist,
src-tarball-*, etc. (which are converted to install by an earlier
Makefile).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 592e834522086009975bd48d59386094771bd06b)
(cherry picked from commit b38b1479a532f08fedd7f3b761673bc78b66739d)
Jan Beulich [Tue, 20 Jun 2017 14:10:04 +0000 (16:10 +0200)]
x86: avoid leaking PKRU and BND* between vCPU-s
PKRU is explicitly "XSAVE-managed but not XSAVE-enabled", so guests
might access the register (via {RD,WR}PKRU) without setting XCR0.PKRU.
Force context switching as well as migrating the register as soon as
CR4.PKE is being set the first time.
For MPX (BND<n>, BNDCFGU, and BNDSTATUS) the situation is less clear,
and the SDM has not entirely consistent information for that case.
While experimentally the instructions don't change register state as
long as the two XCR0 bits aren't both 1, be on the safe side and enable
both if BNDCFGS.EN is being set the first time.
This is XSA-220.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: de20bb6c4f65c4161e0931402613f9ffac86302d
master date: 2017-06-20 14:36:51 +0200
Julien Grall [Tue, 20 Jun 2017 14:07:06 +0000 (16:07 +0200)]
xen/arm: vgic: Sanitize target mask used to send SGI
The current function vgic_to_sgi does not sanitize the target mask and
may therefore get an invalid vCPU ID. This will result to an out of
bound access of d->vcpu[...] as there is no check whether the vCPU ID is
within the maximum supported by the guest.
This was introduced by commit ea37fd2111 "xen/arm: split vgic driver
into generic and vgic-v2 driver".
Jan Beulich [Tue, 20 Jun 2017 14:06:32 +0000 (16:06 +0200)]
gnttab: __gnttab_unmap_common_complete() is all-or-nothing
All failures have to be detected in __gnttab_unmap_common(), the
completion function must not skip part of its processing. In particular
the GNTMAP_device_map related putting of page references and adjustment
of pin count must not occur if __gnttab_unmap_common() signaled an
error. Furthermore the function must not make adjustments to global
state (here: clearing GNTTAB_device_map) before all possibly failing
operations have been performed.
There's one exception for IOMMU related failures: As IOMMU manipulation
occurs after GNTMAP_*_map have been cleared already, the related page
reference and pin count adjustments need to be done nevertheless. A
fundamental requirement for the correctness of this is that
iommu_{,un}map_page() crash any affected DomU in case of failure.
The version check appears to be pointless (or could perhaps be a
BUG_ON() or ASSERT()), but for the moment also move it.
George Dunlap [Tue, 20 Jun 2017 14:06:06 +0000 (16:06 +0200)]
gnttab: correct logic to get page references during map requests
The rules for reference counting are somewhat complicated:
* Each of GNTTAB_host_map and GNTTAB_device_map need their own
reference count
* If the mapping is writeable:
- GNTTAB_host_map needs a type count under only some conditions
- GNTTAB_device_map always needs a type count
If the mapping succeeds, we need to keep all of these; if the mapping
fails, we need to release whatever references we have acquired so far.
Additionally, the code that does a lot of this calculation "inherits"
a reference as part of the process of finding out who the owner is.
Finally, if the grant is mapped as writeable (without the
GNTMAP_readonly flag), but the hypervisor cannot grab a
PGT_writeable_page type, the entire operation should fail.
Unfortunately, the current code has several logic holes:
* If a grant is mapped only GNTTAB_device_map, and with a writeable
mapping, but in conditions where a *host* type count is not
necessary, the code will fail to grab the necessary type count.
* If a grant is mapped both GNTTAB_device_map and GNTTAB_host_map,
with a writeable mapping, in conditions where the host type count is
not necessary, *and* where the page cannot be changed to type
PGT_writeable, the condition will not be detected.
In both cases, this means that on success, the type count will be
erroneously reduced when the grant is unmapped. In the second case,
the type count will be erroneously reduced on the failure path as
well. (In the first case the failure path logic has the same hole
as the reference grabbing logic.)
Additionally, the return value of get_page() is not checked; but this
may fail even if the first get_page() succeeded due to a reference
counting overflow.
First of all, simplify the restoration logic by explicitly counting
the reference and type references acquired.
Consider each mapping type separately, explicitly marking the
'incoming' reference as used so we know when we need to grab a second
one.
Finally, always check the return value of get_page[_type]() and go to
the failure path if appropriate.
This is part of XSA-224.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 75b384ece635adc55c2bafbdc2d8959c10542c31
master date: 2017-06-20 14:46:21 +0200
Jan Beulich [Tue, 20 Jun 2017 14:05:39 +0000 (16:05 +0200)]
gnttab: never create host mapping unless asked to
We shouldn't create a host mapping unless asked to even in the case of
mapping a granted MMIO page. In particular the mapping wouldn't be torn
down when processing the matching unmap request.
George Dunlap [Tue, 20 Jun 2017 14:05:14 +0000 (16:05 +0200)]
gnttab: fix handling of dev_bus_addr during unmap
If a grant has been mapped with the GNTTAB_device_map flag, calling
grant_unmap_ref() with dev_bus_addr set to zero should cause the
GNTTAB_device_map part of the mapping to be left alone.
Unfortunately, at the moment, op->dev_bus_addr is implicitly checked
before clearing the map and adjusting the pin count, but only the bits
above 12; and it is not checked at all before dropping page
references. This means a guest can repeatedly make such a call to
cause the reference count to drop to zero, causing the page to be
freed and re-used, even though it's still mapped in its pagetables.
To fix this, always check op->dev_bus_addr explicitly for being
non-zero, as well as op->flag & GNTMAP_device_map, before doing
operations on the device_map.
While we're here, make the logic a bit cleaner:
* Always initialize op->frame to zero and set it from act->frame, to reduce the
chance of untrusted input being used
* Explicitly check the full dev_bus_addr against act->frame <<
PAGE_SHIFT, rather than ignoring the lower 12 bits
This is part of XSA-224.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8fdfcb2b6bcd074776560e76843815f124d587f1
master date: 2017-06-20 14:45:33 +0200
Julien Grall [Tue, 20 Jun 2017 14:04:55 +0000 (16:04 +0200)]
arm: vgic: Don't update the LR when the IRQ is not enabled
gic_raise_inflight_irq will be called if the IRQ is already inflight
(i.e the IRQ is injected to the guest). If the IRQ is already already in
the LRs, then the associated LR will be updated.
To know if the interrupt is already in the LR, the function check if the
interrupt is queued. However, if the interrupt is not enabled then the
interrupt may not be queued nor in the LR. So gic_update_one_lr may be
called (if we inject on the current vCPU) and read the LR.
Because the interrupt is not in the LR, Xen will either read:
* LR 0 if the interrupt was never injected before
* LR 255 (GIC_INVALID_LR) if the interrupt was injected once. This
is because gic_update_one_lr will reset p->lr.
Reading LR 0 will result to potentially update the wrong interrupt and
not keep the LRs in sync with Xen.
Reading LR 255 will result to:
* Crash Xen on GICv3 as the LR index is bigger than supported (see
gicv3_ich_read_lr).
* Read/write always GICH_LR + 255 * 4 that is not part of the memory
mapped.
The problem can be prevented by checking whether the interrupt is
enabled in gic_raise_inflight_irq before calling gic_update_one_lr.
A follow-up of this patch is expected to mitigate the issue in the
future.
Jan Beulich [Tue, 20 Jun 2017 14:04:24 +0000 (16:04 +0200)]
guest_physmap_remove_page() needs its return value checked
Callers, namely such subsequently freeing the page, must not blindly
assume success - the function may namely fail when needing to shatter a
super page, but there not being memory available for the then needed
intermediate page table.
As it happens, guest_remove_page() callers now also all check the
return value.
Furthermore a missed put_gfn() on an error path in gnttab_transfer() is
also being taken care of.
This is part of XSA-222.
Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0cce6048d010a30ac82f8db7787bbf9aada64f4
master date: 2017-06-20 14:41:16 +0200
Andrew Cooper [Tue, 20 Jun 2017 14:03:11 +0000 (16:03 +0200)]
memory: fix return value handing of guest_remove_page()
Despite the description in mm.h, guest_remove_page() previously returned 0 for
paging errors.
Switch guest_remove_page() to having regular 0/-error semantics, and propagate
the return values from clear_mmio_p2m_entry() and mem_sharing_unshare_page()
to the callers (although decrease_reservation() is the only caller which
currently cares).
This is part of XSA-222.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b614f642c35da5184416787352f51a6379a92628
master date: 2017-06-20 14:39:56 +0200
Jan Beulich [Tue, 20 Jun 2017 14:02:45 +0000 (16:02 +0200)]
evtchn: avoid NULL derefs
Commit fbbd5009e6 ("evtchn: refactor low-level event channel port ops")
added a de-reference of the struct evtchn pointer for a port without
first making sure the bucket pointer is non-NULL. This de-reference is
actually entirely unnecessary, as all relevant callers (beyond the
problematic do_poll()) already hold the port number in their hands, and
the actual leaf functions need nothing else.
For FIFO event channels there's a second problem in that the ordering
of reads and updates to ->num_evtchns and ->event_array[] was so far
undefined (the read side isn't always holding the domain's event lock).
Add respective barriers.
Andrew Cooper [Tue, 20 Jun 2017 14:01:46 +0000 (16:01 +0200)]
x86/shadow: hold references for the duration of emulated writes
The (misnamed) emulate_gva_to_mfn() function translates a linear address to an
mfn, but releases its page reference before returning the mfn to its caller.
sh_emulate_map_dest() uses the results of one or two translations to construct
a virtual mapping to the underlying frames, completes an emulated
write/cmpxchg, then unmaps the virtual mappings.
The page references need holding until the mappings are unmapped, or the
frames can change ownership before the writes occurs.
This is XSA-219.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 26217aff67ae1538d4e1b2226afab6993cdbe772
master date: 2017-06-20 14:36:11 +0200
Jan Beulich [Tue, 20 Jun 2017 14:01:24 +0000 (16:01 +0200)]
gnttab: correct maptrack table accesses
In order to observe a consistent (limit,pointer-table) pair, the reader
needs to either hold the maptrack lock (in line with documentation) or
both sides need to order their accesses suitably (the writer side
barrier was removed by commit dff515dfea ["gnttab: use per-VCPU
maptrack free lists"], and a read side barrier has never been there).
Make the writer publish a new table page before limit (for bounds
checks to work), and new list head last (for racing maptrack_entry()
invocations to work). At the same time add read barriers to lockless
readers.
Additionally get_maptrack_handle() must not assume ->maptrack_head to
not change behind its back: Another handle may be put (updating only
->maptrack_tail) and then got or stolen (updating ->maptrack_head).
This is part of XSA-218.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4b78efa91c8ae3c42e14b8eaeaad773c5eb3b71a
master date: 2017-06-20 14:34:34 +0200
George Dunlap [Tue, 20 Jun 2017 14:01:01 +0000 (16:01 +0200)]
gnttab: Avoid potential double-put of maptrack entry
Each grant mapping for a particular domain is tracked by an in-Xen
"maptrack" entry. This entry is is referenced by a "handle", which is
given to the guest when it calls gnttab_map_grant_ref().
There are two types of mapping a particular handle can refer to:
GNTMAP_host_map and GNTMAP_device_map. A given
gnttab_unmap_grant_ref() call can remove either only one or both of
these entries. When a particular handle has no entries left, it must
be freed.
gnttab_unmap_grant_ref() loops through its grant unmap request list
twice. It first removes entries from any host pagetables and (if
appropraite) iommus; then it does a single domain TLB flush; then it
does the clean-up, including telling the granter that entries are no
longer being used (if appropriate).
At the moment, it's during the first pass that the maptrack flags are
cleared, but the second pass that the maptrack entry is freed.
Unfortunately this allows the following race, which results in a
double-free:
A: (pass 1) clear host_map
B: (pass 1) clear device_map
A: (pass 2) See that maptrack entry has no mappings, free it
B: (pass 2) See that maptrack entry has no mappings, free it #
Unfortunately, unlike the active entry pinning update, we can't simply
move the maptrack flag changes to the second half, because the
maptrack flags are used to determine if iommu entries need to be
added: a domain's iommu must never have fewer permissions than the
maptrack flags indicate, or a subsequent map_grant_ref() might fail to
add the necessary iommu entries.
Instead, free the maptrack entry in the first pass if there are no
further mappings.
This is part of XSA-218.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: b7f6cbb9d43f7384e1f38f8764b9a48216c8a525
master date: 2017-06-20 14:33:13 +0200
Jan Beulich [Tue, 20 Jun 2017 14:00:29 +0000 (16:00 +0200)]
gnttab: fix unmap pin accounting race
Once all {writable} mappings of a grant entry have been unmapped, the
hypervisor informs the guest that the grant entry has been released by
clearing the _GTF_{reading,writing} usage flags in the guest's grant
table as appropriate.
Unfortunately, at the moment, the code that updates the accounting
happens in a different critical section than the one which updates the
usage flags; this means that under the right circumstances, there may be
a window in time after the hypervisor reported the grant as being free
during which the grant referee still had access to the page.
Move the grant accounting code into the same critical section as the
reporting code to make sure this kind of race can't happen.
Jan Beulich [Tue, 20 Jun 2017 13:58:40 +0000 (15:58 +0200)]
x86/mm: disallow page stealing from HVM domains
The operation's success can't be controlled by the guest, as the device
model may have an active mapping of the page. If we nevertheless
permitted this operation, we'd have to add further TLB flushing to
prevent scenarios like
"Domains A (HVM), B (PV), C (PV); B->target==A
Steps:
1. B maps page X from A as writable
2. B unmaps page X without a TLB flush
3. A sends page X to C via GNTTABOP_transfer
4. C maps page X as pagetable (potentially causing a TLB flush in C,
but not in B)
At this point, X would be mapped as a pagetable in C while being
writable through a stale TLB entry in B."
A similar scenario could be constructed for A using XENMEM_exchange and
some arbitrary PV domain C then having this page allocated.
This is XSA-217.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: fae7d5be8bb8b7a5b7005c4f3b812a47661a721e
master date: 2017-06-20 14:29:51 +0200
Ian Jackson [Wed, 7 Jun 2017 14:05:44 +0000 (15:05 +0100)]
Makefile: Provide way to ship livepatch test files
In the toplevel Makefile, provide build-tests and install-tests
targets which descend into xen/test. (dist-tests is provided
automatically by the pattern rule, as is the convention here.)
We have to set BASEDIR ourselves, and use these curious runes, because
the convention in Makefiles under xen/ is to "make -f Rules.mk" with
BASEDIR set and to expect Rules.mk to reinvoke the per-directory
Makefile. (This is really very strange.) Normally this invocation
pattern is organised by the machinery in xen/Makefile (which sets
BASEDIR) and Rules.mk, but we need to invoke it from outside that
context.
In theory it would be nice to have a pattern rule %-tests. But this
is not the style in the rest of the toplevel Makefile; and doing that
might interfere with the dist-% pattern rule.
None of this is invoked by default. If install-tests or dist-tests is
requested, the livepatches (the only current output from xen/tests)
are shipped in DESTDIR/usr/lib/debug/xen-livepatch/.
This allows CI systems such as osstest which are trying to consume
this to arrange for the files to be built, and output, without them
having to have special knowledge of the details of Xen's build system.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit c55667bd0ad8f04688abfd5c6317709dc00f88ab)
(cherry picked from commit 2fb72b56bf0423e4b625cebc40e2dc7ea7734986)
Ian Jackson [Wed, 7 Jun 2017 14:09:57 +0000 (15:09 +0100)]
xen/test/livepatch: Add xen_nop.livepatch to .gitignore
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 5225e70ccbb02f786647eddfeb65d6db1e230782)
(cherry picked from commit c40237fb500be05aee6869f5c2da5aa6f1a86f76)
Ian Jackson [Wed, 7 Jun 2017 13:44:51 +0000 (14:44 +0100)]
xen/test/livepatch: Regularise Makefiles
In xen/test/livepatch/Makefile:
Provide a `build' target, as most of the
subdir-invoking Makefiles elsewhere expect.
In xen/test/Makefile:
Replace the two open-coded targets with a generalised pattern rule
which descends into each of SUBDIRS. This allows `install' to work
too (it is already supported by xen/test/livepatch/Makefile).
Provide an explicit default target of `tests', and an `all' target
(which is conventional).
Suppress entry into the xen/test/livepatch subdir when we are
building for i386, since the 32-bit hypervisor is not supported any
more and we can't build livepatches for it either.
After this, the xen/test subdirectory is somewhere were make can be
invoked in the way which is conventional for xen.git/xen/ subdirs.
None of this is yet invoked from the top-level Makefile.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit e541982dc21dcc5be61279d22d477ed5c0bc41c5)
(cherry picked from commit 8c60e5f4327583b1c90cf3641388ef618c1727ec)
Ian Jackson [Wed, 7 Jun 2017 14:00:17 +0000 (15:00 +0100)]
xen/test/livepatch/Makefile: Install in DESTDIR/usr/lib/debug/xen-livepatch
Dumping these patch files in /usr/lib/debug/xen-*.livepatch is a bit
ugly.
Also, refactor the Makefile to have a LIVEPATCHES variable, to reduce
repetition.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit a38d1af5fb02bee68c9a30e38b97c6129815f943)
(cherry picked from commit d32607042bdc944aa23b7eaf17cd68c8536d3a11)
Julien Grall [Fri, 19 May 2017 16:08:39 +0000 (17:08 +0100)]
xen/arm: p2m: Fix incorrect mapping of superpages
The same set of functions is used to set as well as to clean P2M
entries, except for clean operations (INVALID_MFN ~0UL) is passed as a
parameter. Unfortunately, when calculating an appropriate target order
for a particular mapping INVALID_MFN is taken into account which leads
to 4K page target order being set each time even for 2MB and 1GB
mappings.
This will result to break down the superpage into 4K mappings and leave
empty tables allocated.
This was introduced by commit 2ef3e36ec7 "xen/arm: p2m: Introduce
p2m_set_entry and __p2m_set_entry".
vgic: refuse irq migration when one is already in progress
When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.
This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.
It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.
arm: remove irq from inflight, then change physical affinity
This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.
When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.
Julien Grall [Fri, 5 May 2017 14:30:36 +0000 (15:30 +0100)]
xen/arm: Survive unknown traps from guests
Currently we crash Xen if we see an ESR_EL2.EC value we don't recognise.
As configurable disables/enables are added to the architecture
(controlled by RES1/RESO bits respectively), with associated synchronous
exceptions, it may be possible for a guest to trigger exceptions with
classes that we don't recognise.
While we can't service these exceptions in a manner useful to the guest,
we can avoid bringing down the host. Per ARM DDI 0487A.k_iss10775, page
D7-1937, EC values within the range 0x00 - 0x2c are reserved for future
use with synchronous exceptions, and EC within the range 0x2d - 0x3f may
be used for either synchronous or asynchronous exceptions.
The patch makes Xen handle any unknown EC by injecting an UNDEFINED
exception into the guest, with a corresponding (ratelimited) warning in
the log.
This patch is based on Linux commit f050fe7a9164 "arm: KVM: Survive unknown
traps from the guest".
Julien Grall [Fri, 5 May 2017 14:30:35 +0000 (15:30 +0100)]
xen/arm: do_trap_hypervisor: Separate hypervisor and guest traps
The function do_trap_hypervisor is currently handling both trap coming
from the hypervisor and the guest. This makes difficult to get specific
behavior when a trap is coming from either the guest or the hypervisor.
Split the function into two parts:
- do_trap_guest_sync to handle guest traps
- do_trap_hyp_sync to handle hypervisor traps
On AArch32, the Hyp Trap Exception provides the standard mechanism for
trapping Guest OS functions to the hypervisor (see B1.14.1 in ARM DDI
0406C.c). It cannot be generated when generated when the processor is in
Hyp Mode, instead other exception will be used. So it is fine to replace
the call to do_trap_hypervisor by do_trap_guest_sync.
For AArch64, there are two distincts exception depending whether the
exception was taken from the current level (hypervisor) or lower level
(guest).
Note that the unknown traps from guests will lead to panic Xen. This is
already behavior and is left unchanged for simplicy. A follow-up patch
will address that.
xen/arm: Save ESR_EL2 to avoid using mismatched value in syndrome check
Xen will do exception syndrome check while some types of exception
take place in EL2. The syndrome check code read the ESR_EL2 register
directly, but in some situation this register maybe overridden by
nested exception.
For example, if we re-enable IRQ before reading ESR_EL2 which means
Xen may enter in IRQ exception mode and return the processor with
clobbered ESR_EL2 (See ARM ARM DDI 0487A.j D7.2.25)
In this case the guest exception syndrome has been overridden, we will
check the syndrome for guest sync exception with an incorrect ESR_EL2
value. So we want to save ESR_EL2 to cpu_user_regs as soon as the
exception takes place in EL2 to avoid using an incorrect syndrome value.
In order to save ESR_EL2, we added a 32-bit member hsr to cpu_user_regs.
But while saving registers in trap entry, we use stp to save ELR and
CPSR at the same time through 64-bit general registers. If we keep this
code, the hsr will be overridden by upper 32-bit of CPSR. So adjust the
code to use str to save ELR in a separate instruction and use stp to
save CPSR and HSR at the same time through 32-bit general registers.
This change affects the registers restore in trap exit, we can't use the
ldp to restore ELR and CPSR from stack at the same time. We have to use
ldr to restore them separately.
Gregory Herrero [Fri, 9 Jun 2017 11:42:07 +0000 (13:42 +0200)]
stop_machine: fill fn_result only in case of error
When stop_machine_run() is called with NR_CPUS as last argument,
fn_result member must be filled only if an error happens since it is
shared across all cpus.
Assume CPU1 detects an error and set fn_result to -1, then CPU2 doesn't
detect an error and set fn_result to 0. The error detected by CPU1 will
be ignored.
Note that in case multiple failures occur on different CPUs, only the
last error will be reported.
Jan Beulich [Fri, 9 Jun 2017 11:41:44 +0000 (13:41 +0200)]
hvmloader: avoid tests when they would clobber used memory
First of all limit the memory range used for testing to 4Mb: There's no
point placing page tables right above 8Mb when they can equally well
live at the bottom of the chunk at 4Mb - rep_io_test() cares about the
5Mb...7Mb range only anyway. In a subsequent patch this will then also
allow simply looking for an unused 4Mb range (instead of using a build
time determined one).
Extend the "skip tests" condition beyond the "is there enough memory"
question.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Gary Lin <glin@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d6968635ce51a8ed7508ddcf17b3d13a462cb27
master date: 2017-05-19 16:04:38 +0200
Jan Beulich [Fri, 9 Jun 2017 11:41:08 +0000 (13:41 +0200)]
arm: fix build with gcc 7
The compiler dislikes duplicate "const", and the ones it complains
about look like they we in fact meant to be placed differently.
Also fix array_access_okay() (just like on x86), despite the construct
being unused on ARM: -Wint-in-bool-context, enabled by default in
gcc 7, doesn't like multiplication in conditional operators. "Hide" it,
at the risk of the next compiler version becoming smarter and
recognizing even that. (The hope is that added smartness then would
also better deal with legitimate cases like the one here.) The change
could have been done in access_ok(), but I think we better keep it at
the place the compiler is actually unhappy about.
Jan Beulich [Fri, 9 Jun 2017 11:40:28 +0000 (13:40 +0200)]
x86: fix build with gcc 7
-Wint-in-bool-context, enabled by default in gcc 7, doesn't like
multiplication in conditional operators. "Hide" them, at the risk of
the next compiler version becoming smarter and recognizing even those.
(The hope is that added smartness then would also better deal with
legitimate cases like the ones here.)
The change could have been done in access_ok(), but I think we better
keep it at the places the compiler is actually unhappy about.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f32400e90c046a9fd76c8917a60d34ade9c02ea2
master date: 2017-05-19 10:11:36 +0200
Igor Druzhinin [Fri, 9 Jun 2017 11:36:48 +0000 (13:36 +0200)]
x86/mm: fix incorrect unmapping of 2MB and 1GB pages
The same set of functions is used to set as well as to clean
P2M entries, except that for clean operations INVALID_MFN (~0UL)
is passed as a parameter. Unfortunately, when calculating an
appropriate target order for a particular mapping INVALID_MFN
is not taken into account which leads to 4K page target order
being set each time even for 2MB and 1GB mappings. This eventually
breaks down an EPT structure irreversibly into 4K mappings which
prevents consecutive high order mappings to this area.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
x86/NPT: deal with fallout from 2Mb/1Gb unmapping change
Commit efa9596e9d ("x86/mm: fix incorrect unmapping of 2MB and 1GB
pages") left the NPT code untouched, as there is no explicit alignment
check matching the one in EPT code. However, the now more widespread
storing of INVALID_MFN into PTEs requires adjustments:
- calculations when shattering large pages may spill into the p2m type
field (converting p2m_populate_on_demand to p2m_grant_map_rw) - use
OR instead of PLUS,
- the use of plain l{2,3}e_from_pfn() in p2m_pt_set_entry() results in
all upper (flag) bits being clobbered - introduce and use
p2m_l{2,3}e_from_pfn(), paralleling the existing L1 variant.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: efa9596e9d167c8fb7d1c4446c10f7ca30453646
master date: 2017-05-17 17:23:15 +0200
master commit: 83520cb4aa39ebeb4eb1a7cac2e85b413e75a336
master date: 2017-06-06 14:32:54 +0200
Andrew Cooper [Fri, 9 Jun 2017 11:36:08 +0000 (13:36 +0200)]
x86/pv: Align %rsp before pushing the failsafe stack frame
Architecturally, all 64bit stacks are aligned on a 16 byte boundary before an
exception frame is pushed. The failsafe frame should not special in this
regard.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cbcaccb5e991155a4ae85a032e990614c3dc6960
master date: 2017-05-09 19:00:20 +0100
Andrew Cooper [Fri, 9 Jun 2017 11:35:27 +0000 (13:35 +0200)]
x86/pv: Fix bugs with the handling of int80_bounce
Testing has revealed two issues:
1) Passing a NULL handle to set_trap_table() is intended to flush the entire
table. The 64bit guest case (and 32bit guest on 32bit Xen, when it
existed) called init_int80_direct_trap() to reset int80_bounce, but c/s cda335c279 which introduced the 32bit guest on 64bit Xen support omitted
this step. Previously therefore, it was impossible for a 32bit guest to
reset its registered int80_bounce details.
2) init_int80_direct_trap() doesn't honour the guests request to have
interrupts disabled on entry. PVops Linux requests that interrupts are
disabled, but Xen currently leaves them enabled when following the int80
fastpath.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 55ab172a1f286742d918947ecb9b257ce31cc253
master date: 2017-05-09 19:00:04 +0100
Mohit Gambhir [Fri, 9 Jun 2017 11:35:00 +0000 (13:35 +0200)]
x86/vpmu_intel: fix hypervisor crash by masking PC bit in MSR_P6_EVNTSEL
Setting Pin Control (PC) bit (19) in MSR_P6_EVNTSEL results in a General
Protection Fault and thus results in a hypervisor crash. This behavior has
been observed on two generations of Intel processors namely, Haswell and
Broadwell. Other Intel processor generations were not tested. However, it
does seem to be a possible erratum that hasn't yet been confirmed by Intel.
To fix the problem this patch masks PC bit and returns an error in
case any guest tries to write to it on any Intel processor. In addition
to the fact that setting this bit crashes the hypervisor on Haswell and
Broadwell, the PC flag bit toggles a hardware pin on the physical CPU
every time the programmed event occurs and the hardware behavior in
response to the toggle is undefined in the SDM, which makes this bit
unsafe to be used by guests and hence should be masked on all machines.
Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 8bf68dca65e2d61f4dfc6715cca51ad3dd5aadf1
master date: 2017-05-08 13:37:17 +0200
Jan Beulich [Fri, 9 Jun 2017 11:34:29 +0000 (13:34 +0200)]
hvm: fix hypervisor crash in hvm_save_one()
hvm_save_cpu_ctxt() returns success without writing any data into
hvm_domain_context_t when all VCPUs are offline. This can then crash
the hypervisor (with FATAL PAGE FAULT) in hvm_save_one() via the
"off < (ctxt.cur - sizeof(*desc))" for() test, where ctxt.cur remains 0,
causing an underflow which leads the hypervisor to go off the end of the
ctxt buffer.
This has been broken since Xen 4.4 (c/s e019c606f59).
It has happened in practice with an HVM Linux VM (Debian 8) queried around
shutdown:
Commit 407a3c00ff ("compat/memory: fix build with old gcc") "fixed" a
build issue by switching to the use of uninitialized data. Due to
- the bounding of the uninitialized data item
- the accessed area being outside of Xen space
- arguments being properly verified by the native hypercall function
this is not a security issue.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 144aec4140515c53bb1676df71a469f3e285c557
master date: 2017-04-26 09:48:45 +0200
Ian Jackson [Mon, 3 Apr 2017 11:34:13 +0000 (12:34 +0100)]
tools: ocaml: In configure, check for ocamlopt
If ocaml.m4 didn't find ocamlopt, disable all the ocaml builds.
Currently our Makefiles do not work properly when the native code
compiler (`ocamlopt') is not available. In principle this should be
fixed to fall back to bytecode, but this is not a task for this stage
of the Xen 4.9 release.
Without this change, we cannot build on systems with only ocamlc.
That includes Debian jessie ARM64, as used on the new ARM64 hardware
in the Xen Project CI test lab.
When the Makefiles are fixed, this commit should be reverted.
Committers: Please rerun autogen.sh.
CC: Julien Grall <julien.grall@arm.com> CC: Christian Lindig <christian.lindig@citrix.com> CC: Jonathan Ludlam <Jonathan.Ludlam@citrix.com> CC: David Scott <dave@recoil.org> CC: Wei Liu <wei.liu2@citrix.com> Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 4d0240e03349fd0715332eae65372e0a47b5a43b)
Andrew Cooper [Thu, 30 Mar 2017 16:32:31 +0000 (17:32 +0100)]
tools/libxc: Tolerate specific zero-content records in migration v2 streams
The migration v2 save code was written to avoid sending data records with no
content, as such records serve no purpose but come with a performance hit.
The restore code sanity checks this expectation.
Under some circumstances (most notably, on AMD hardware with Debug Extensions,
and a PV guest kernel which is not using the feature), the save code would
generate a record with no content, which trips the sanity check in the restore
code.
As the stream is otherwise fine, tolerate these records and avoid failing the
migration.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 119ee4d77377aa1fc62efdadc1cc87df4f1270bf)
Haozhong Zhang [Wed, 3 May 2017 15:06:33 +0000 (17:06 +0200)]
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()
mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere the current round of MC handling. Always re-initialize it to
clear the historical value.
Haozhong Zhang [Wed, 3 May 2017 15:06:07 +0000 (17:06 +0200)]
x86/mce: make 'severity_cpu' private to its users
The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use private 'severity_cpu' for each function to fix this issue.
Jan Beulich [Wed, 3 May 2017 15:05:39 +0000 (17:05 +0200)]
memory: don't hand MFN info to translated guests
We shouldn't hand MFN info back from increase-reservation for
translated domains, just like we don't for populate-physmap and
memory-exchange. For full symmetry also check for a NULL guest handle
in populate_physmap() (but note this makes no sense in
memory_exchange(), as there the array is also an input).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d18627583df28facd9af473ea1ac4a56e93e6ea9
master date: 2017-04-05 16:39:53 +0200
Jan Beulich [Wed, 3 May 2017 15:05:11 +0000 (17:05 +0200)]
memory: exit early from memory_exchange() upon write-back error
There's no point in continuing if in the end we'll return -EFAULT
anyway. It also seems wrong to report a chunk for which at least one
write-back failed as successfully exchanged (albeit the indication of
an error is also not fully correct, as the exchange happened in that
case at least partially - retrieving the GFN to assign the memory to
and/or handing back the information on the replacement memory didn't
work). In any case limiting the amount of damage done to the guest
can't be all that bad an idea.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1cf4d2ec0d7c0cb53729ca810e416793030f6f07
master date: 2017-04-05 16:39:16 +0200
Bhavesh Davda [Wed, 3 May 2017 15:03:34 +0000 (17:03 +0200)]
kexec: clear kexec_image slot when unloading kexec image
When kexec_do_unload calls kexec_swap_images to get the old kexec_image to
free, it passes NULL for the new kexec_image pointer. The new slot wasn't being
cleared in such a case, leading to a stale pointer being left behind in the
kexec_image array and Xen panics in subsequent load/unload operations.
Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5c5216e825332c83b1965b5a39a6100f9dde34da
master date: 2017-04-04 11:34:57 +0200
Jan Beulich [Tue, 2 May 2017 12:54:26 +0000 (14:54 +0200)]
x86: discard type information when stealing pages
While a page having just a single general reference left necessarily
has a zero type reference count too, its type may still be valid (and
in validated state; at present this is only possible and relevant for
PGT_seg_desc_page, as page tables have their type forcibly zapped when
their type reference count drops to zero, and
PGT_{writable,shared}_page pages don't require any validation). In
such a case when the page is being re-used with the same type again,
validation is being skipped. As validation criteria differ between
32- and 64-bit guests, pages to be transferred between guests need to
have their validation indicator zapped (and with it we zap all other
type information at once).
This is XSA-214.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eaf537342c909875c10f49b06e17493655410681
master date: 2017-05-02 14:46:58 +0200
Jan Beulich [Tue, 2 May 2017 12:52:54 +0000 (14:52 +0200)]
multicall: deal with early exit conditions
In particular changes to guest privilege level require the multicall
sequence to be aborted, as hypercalls are permitted from kernel mode
only. While likely not very useful in a multicall, also properly handle
the return value in the HYPERVISOR_iret case (which should be the guest
specified value).
This is XSA-213.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 22c096c99d8c05833c3c19870e36efb2dd4e8013
master date: 2017-05-02 14:45:02 +0200
parse_vwfi runs after init_traps on cpu0, potentially resulting in the
wrong HCR_EL2 for it. Secondary cpus boot after parse_vwfi, so in their
case init_traps will write the correct set of flags to HCR_EL2.
For cpu0, fix the issue by changing HCR_EL2 setting from a new
presmp_initcall.
Thomas Sanders [Tue, 28 Mar 2017 17:57:52 +0000 (18:57 +0100)]
oxenstored: trim history in the frequent_ops function
We were trimming the history of commits only at the end of each
transaction (regardless of how it ended).
Therefore if non-transactional writes were being made but no
transactions were being ended, the history would grow
indefinitely. Now we trim the history at regular intervals.
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Thomas Sanders [Mon, 27 Mar 2017 13:36:34 +0000 (14:36 +0100)]
oxenstored transaction conflicts: improve logging
For information related to transaction conflicts, potentially frequent
logging at "info" priority has been changed to "debug" priority, and
once per two minutes there is an "info" priority summary.
Additional detailed logging has been added at "debug" priority.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Thomas Sanders [Fri, 24 Mar 2017 19:55:03 +0000 (19:55 +0000)]
oxenstored: don't wake to issue no conflict-credit
In the main loop, when choosing the timeout for the select function
call, we were setting it so as to wake up to issue conflict-credit to
any domains that could accept it. When xenstore is idle, this would
mean waking up every 50ms (by default) to do no work. With this
commit, we check whether any domain is below its cap, and if not then
we set the timeout for longer (the same timeout as before the
conflict-protection feature was added).
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Thomas Sanders [Fri, 24 Mar 2017 16:16:10 +0000 (16:16 +0000)]
oxenstored: do not commit read-only transactions
The packet telling us to end the transaction has always carried an
argument telling us whether to commit.
If the transaction made no modifications to the tree, now we ignore
that argument and do not commit: it is just a waste of effort.
This makes read-only transactions immune to conflicts, and means that
we do not need to store any of their details in the history that is
used for assigning blame for conflicts.
We count a transaction as a read-only transaction only if it contains
no operations that modified the tree.
This means that (for example) a transaction that creates a new node
then deletes it would NOT count as read-only, even though it makes no
change overall. A more sophisticated algorithm could judge the
transaction based on comparison of its initial and final states, but
this would add complexity and computational cost.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Thomas Sanders [Thu, 23 Mar 2017 19:06:54 +0000 (19:06 +0000)]
oxenstored: allow self-conflicts
We already avoid inter-domain conflicts but now allow intra-domain
conflicts. Although there are no known practical examples of a domain
that might perform operations that conflict with its own transactions,
this is conceivable, so here we avoid changing those semantics
unnecessarily.
When a transaction commit fails with a conflict and we look through
the history of commits to see which connection(s) to blame, ignore
historical commits that were made by the same connection as the
failing commit.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Jonathan Davies [Thu, 23 Mar 2017 14:28:16 +0000 (14:28 +0000)]
oxenstored: blame the connection that caused a transaction conflict
Blame each connection found to have made a commit that would cause this
transaction to fail. Each blamed connection is penalised by having its
conflict-credit decremented.
Note the change in semantics for the replay function: we no longer stop after
finding the first operation that can't be replayed. This allows us to identify
all operations that conflicted with this transaction, not just the one that
conflicted first.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
v1 Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Changes since v1:
* use correct log levels for informational messages
Changes since v2:
* fix the blame algorithm and improve logging
(fix was reviewed by Jonathan Davies)
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Thomas Sanders [Thu, 23 Mar 2017 14:25:16 +0000 (14:25 +0000)]
oxenstored: discard old commit-history on txn end
The history of commits is to be used for working out which historical
commit(s) (including atomic writes) caused conflicts with a
currently-failing commit of a transaction. Any commit that was made
before the current transaction started cannot be relevant. Therefore
we never need to keep history from before the start of the
longest-running transaction that is open at any given time: whenever a
transaction ends (with or without a commit) then if it was the
longest-running open transaction we can delete history up until start
of the the next-longest-running open transaction.
Some transactions might stay open for a very long time, so if any
transaction exceeds conflict_max_history_seconds then we remove it
from consideration in this context, and will not guarantee to keep
remembering about historical commits made during such a transaction.
We implement this by keeping a list of all open transactions that have
not been open too long. When a transaction ends, we remove it from the
list, along with any that have been open longer than the maximum; then
we delete any history from before the start of the longest-running
transaction remaining in the list.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Jonathan Davies [Tue, 14 Mar 2017 12:17:38 +0000 (12:17 +0000)]
oxenstored: add transaction info relevant to history-tracking
Specifically:
* retain the original store (not just the root) in full transactions
* store commit count at the time of the start of the transaction
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: ignore domains with no conflict-credit
When processing connections, skip those from domains with no remaining
conflict-credit.
Also, issue a point of conflict-credit at regular intervals, the
period being set by the configuration option "conflict-max-history-
seconds". When issuing conflict-credit, we give a point either to
every domain at once (one each) or only to the single domain at the
front of the queue, depending on the configuration option
"conflict-rate-limit-is-aggregate".
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: handling of domain conflict-credit
This commit gives each domain a conflict-credit variable, which will
later be used for limiting how often a domain can cause other domain's
transaction-commits to fail.
This commit also provides functions and data for manipulating domains
and their conflict-credit, and checking whether they have credit.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: comments explaining some variables
It took a while of reading and reasoning to work out what these are
for, so here are comments to make life easier for everyone reading
this code in future.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Paul Durrant [Wed, 22 Feb 2017 13:27:34 +0000 (13:27 +0000)]
tools/libxenctrl: fix error check after opening libxenforeignmemory
Checking the value of xch->xcall is clearly incorrect. The code should be
checking xch->fmem (i.e. the return of the previously called function).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 80a7d04f532ddc3500acd7988917708a536ae15f)
Juergen Gross [Wed, 15 Feb 2017 11:11:12 +0000 (12:11 +0100)]
libxl: correct xenstore entry for empty cdrom
Specifying an empty cdrom device will result in a Xenstore entry
params = aio:(null)
as the physical device path isn't existing. This lets a domain booted
via OVMF hang as OVMF is checking for "aio:" only in order to detect
the empty cdrom case.
Use an empty string for the physical device path in this case. As a
cdrom device for HVM is always backed by qdisk we only need to cover this
backend.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 4 Apr 2017 12:55:00 +0000 (14:55 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling
The use of guest_handle_okay() here (as introduced by the XSA-29 fix)
is insufficient here, guest_handle_subrange_okay() needs to be used
instead.
Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
number-of-entries input (leaving aside the fact that this can be
called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
index zero.
This is CVE-2017-7228 / XSA-212.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 938fd2586eb081bcbd694f4c1f09ae6a263b0d90
master date: 2017-04-04 14:47:46 +0200
Dario Faggioli [Fri, 31 Mar 2017 06:33:20 +0000 (08:33 +0200)]
xen: sched: don't call hooks of the wrong scheduler via VCPU2OP
Within context_saved(), we call the context_saved hook,
and we use VCPU2OP() to determine from what scheduler.
VCPU2OP uses DOM2OP, which uses d->cpupool, which is
NULL when d is the idle domain. And in that case,
DOM2OP just returns ops, the scheduler of cpupool0.
Therefore, if:
- cpupool0's scheduler defines context_saved (like
Credit2 and RTDS do),
- we are not in cpupool0 (i.e., our scheduler is
not ops),
- we are context switching from idle,
we call VCPU2OP(idle_vcpu), which means
DOM2OP(idle->cpupool), which is ops.
Therefore, we both:
- check if context_saved is defined in the wrong
scheduler;
- if yes, call the wrong one.
When using Credit2 at boot, and also Credit2 in
the other cpupool, this is wrong but innocuous,
because it only involves the idle vcpus.
When using Credit2 at boot, and Credit1 in the
other cpupool, this is *totally* wrong, and
it's by chance it does not explode!
When using Credit2 and other schedulers I'm
developping, I hit the following assert (in
sched_credit2.c, on a CPU inside a cpupool that
does not use Credit2):
Jan Beulich [Fri, 31 Mar 2017 06:32:51 +0000 (08:32 +0200)]
x86/EFI: avoid Xen image when looking for module/kexec position
When booting straight from EFI, we don't further try to relocate Xen.
As a result, so far we also didn't avoid the area Xen uses when looking
for a location to put modules or the kexec area. Introduce a fake
module slot to deal with that without having to fiddle with a lot of
code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e22e1c47958a4778cd7baa3980f74e52f525ba28
master date: 2017-03-20 09:27:12 +0100
Jan Beulich [Fri, 31 Mar 2017 06:32:22 +0000 (08:32 +0200)]
x86/EFI: avoid IOMMU faults on [_end,__2M_rwdata_end)
Commit c9a4a1c419 ("x86/layout: Correct Xen's idea of its own memory
layout") didn't go far enough with the conversion, causing IOMMU faults
when memory from that range was handed to a domain. We must not make
this memory available for allocation (the change is benign to xen.gz at
this point in time).
Note that the change to tboot_shutdown() is fixing another issue at
once: As it looks, the function so far skipped all memory below the Xen
image.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d522571a408a7dd21a06705f6dd51cdafd2db4fc
master date: 2017-03-20 09:25:36 +0100
Roger Pau Monné [Fri, 31 Mar 2017 06:31:14 +0000 (08:31 +0200)]
build/clang: fix XSM dummy policy when using clang 4.0
There seems to be some weird bug in clang 4.0 that prevents xsm_pmu_op from
working as expected, and vpmu.o ends up with a reference to
__xsm_action_mismatch_detected which makes the build fail:
[...]
ld -melf_x86_64_fbsd -T xen.lds -N prelink.o \
xen/common/symbols-dummy.o -o xen/.xen-syms.0
prelink.o: In function `xsm_default_action':
xen/include/xsm/dummy.h:80: undefined reference to `__xsm_action_mismatch_detected'
xen/xen/include/xsm/dummy.h:80: relocation truncated to fit: R_X86_64_PC32 against undefined symbol `__xsm_action_mismatch_detected'
ld: xen/xen/.xen-syms.0: hidden symbol `__xsm_action_mismatch_detected' isn't defined
The current patch is the only way I've found to fix this so far, by simply
moving the XSM_PRIV check into the default case in xsm_pmu_op. This also fixes
the behavior of do_xenpmu_op, which will now return -EINVAL for unknown
XENPMU_* operations, instead of -EPERM when called by a privileged domain.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 9e4d116faff4545a7f21c2b01008e94d68e6db58
master date: 2017-03-14 18:19:29 +0100
Roger Pau Monné [Fri, 31 Mar 2017 06:28:49 +0000 (08:28 +0200)]
x86: drop unneeded __packed attributes
There where a couple of unneeded packed attributes in several x86-specific
structures, that are obviously aligned. The only non-trivial one is
vmcb_struct, which has been checked to have the same layout with and without
the packed attribute using pahole. In that case add a build-time size check to
be on the safe side.
No functional change is expected as a result of this commit.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 4036e7c592905c2292cdeba8269e969959427237
master date: 2017-03-07 17:11:06 +0100
This panic was triggered by the BUG(); in branch_insn_requires_update.
That's because in this case the alternative patching needs to update the
offset of the branch instruction. But the new target address of the branch
instruction could not pass the check of is_active_kernel_text();
The reason is that: When Xen is booting, it will call apply_alternatives_all
to do patching with alternative tables. In this progress, we should update
the offset of branch instructions if required. This means we should modify
the Xen text section. But Xen text section is marked as read-only and we
configure the hardware to not allow a region to be writable and executable at
the same time. So we re-map Xen in a temporary area for writing. In this case,
the calculation of the new target address of the branch instruction is based
on this re-mapped area. The new target address will point to a value in the
re-mapped area. But we haven't registered this area as an active kernel text.
So the check of is_active_kernel_text will always return false.
We have to register the re-mapped Xen area as a virtual region temporarily to
solve this problem.
We don't need a lock in vgic_get_target_vcpu anymore, solving the
following lock inversion bug: the rank lock should be taken first, then
the vgic lock. However, gic_update_one_lr is called with the vgic lock
held, and it calls vgic_get_target_vcpu, which tries to obtain the rank
lock.
Julien Grall [Wed, 8 Mar 2017 18:06:02 +0000 (18:06 +0000)]
xen/arm: p2m: Perform local TLB invalidation on vCPU migration
The ARM architecture allows an OS to have per-CPU page tables, as it
guarantees that TLBs never migrate from one CPU to another.
This works fine until this is done in a guest. Consider the following
scenario:
- vcpu-0 maps P to V
- vpcu-1 maps P' to V
If run on the same physical CPU, vcpu-1 can hit in TLBs generated by
vcpu-0 accesses, and access the wrong physical page.
The solution to this is to keep a per-p2m map of which vCPU ran the last
on each given pCPU and invalidate local TLBs if two vPCUs from the same
VM run on the same CPU.
Unfortunately it is not possible to allocate per-cpu variable on the
fly. So for now the size of the array is NR_CPUS, this is fine because
we still have space in the structure domain. We may want to add an
helper to allocate per-cpu variable in the future.
xen/arm: acpi: Relax hw domain mapping attributes to p2m_mmio_direct_c
Since the hardware domain is a trusted domain, we extend the
trust to include making final decisions on what attributes to
use when mapping memory regions.
For ACPI configured hardware domains, this patch relaxes the hardware
domains mapping attributes to p2m_mmio_direct_c. This will allow the
hardware domain to control the attributes via its S1 mappings.
Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>