Juergen Gross [Tue, 1 Dec 2020 16:36:41 +0000 (17:36 +0100)]
xen/events: access last_priority and last_vcpu_id together
The queue for a fifo event is depending on the vcpu_id and the
priority of the event. When sending an event it might happen the
event needs to change queues and the old queue needs to be kept for
keeping the links between queue elements intact. For this purpose
the event channel contains last_priority and last_vcpu_id values
elements for being able to identify the old queue.
In order to avoid races always access last_priority and last_vcpu_id
with a single atomic operation avoiding any inconsistencies.
With the event channel lock no longer disabling interrupts commit 52e1fc47abc3a0123 ("evtchn/Flask: pre-allocate node on send path") can
be reverted again.
Juergen Gross [Tue, 1 Dec 2020 16:35:30 +0000 (17:35 +0100)]
xen/evtchn: rework per event channel lock
Currently the lock for a single event channel needs to be taken with
interrupts off, which causes deadlocks in some cases.
Rework the per event channel lock to be non-blocking for the case of
sending an event and removing the need for disabling interrupts for
taking the lock.
The lock is needed for avoiding races between event channel state
changes (creation, closing, binding) against normal operations (set
pending, [un]masking, priority changes).
Use a rwlock, but with some restrictions:
- Changing the state of an event channel (creation, closing, binding)
needs to use write_lock(), with ASSERT()ing that the lock is taken as
writer only when the state of the event channel is either before or
after the locked region appropriate (either free or unbound).
- Sending an event needs to use read_trylock() mostly, in case of not
obtaining the lock the operation is omitted. This is needed as
sending an event can happen with interrupts off (at least in some
cases).
- Dumping the event channel state for debug purposes is using
read_trylock(), too, in order to avoid blocking in case the lock is
taken as writer for a long time.
Jan Beulich [Tue, 1 Dec 2020 16:34:38 +0000 (17:34 +0100)]
evtchn/fifo: use stable fields when recording "last queue" information
Both evtchn->priority and evtchn->notify_vcpu_id could change behind the
back of evtchn_fifo_set_pending(), as for it - in the case of
interdomain channels - only the remote side's per-channel lock is held.
Neither the queue's priority nor the vCPU's vcpu_id fields have similar
properties, so they seem better suited for the purpose. In particular
they reflect the respective evtchn fields' values at the time they were
used to determine queue and vCPU.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 6f6f07b64cbe90e54f8e62b4d6f2404cf5306536
master date: 2020-10-02 08:37:35 +0200
Andrew Cooper [Thu, 23 Apr 2020 12:05:46 +0000 (13:05 +0100)]
x86/msr: Disallow guest access to the RAPL MSRs
Researchers have demonstrated using the RAPL interface to perform a
differential power analysis attack to recover AES keys used by other cores in
the system.
Furthermore, even privileged guests cannot use this interface correctly, due
to MSR scope and vcpu scheduling issues. The interface would want to be
paravirtualised to be used sensibly.
Disallow access to the RAPL MSRs completely, as well as other MSRs which
potentially access fine grain power information.
This is part of XSA-351.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 6 Oct 2020 16:23:27 +0000 (18:23 +0200)]
x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL}
Currently a PV hardware domain can also be given control over the CPU
frequency, and such guest is allowed to write to MSR_IA32_PERF_CTL.
However since commit 322ec7c89f6 the default behavior has been changed
to reject accesses to not explicitly handled MSRs, preventing PV
guests that manage CPU frequency from reading
MSR_IA32_PERF_{STATUS/CTL}.
Additionally some HVM guests (Windows at least) will attempt to read
MSR_IA32_PERF_CTL and will panic if given back a #GP fault:
Move the handling of MSR_IA32_PERF_{STATUS/CTL} to the common MSR
handling shared between HVM and PV guests, and add an explicit case
for reads to MSR_IA32_PERF_{STATUS/CTL}.
Restore previous behavior and allow PV guests with the required
permissions to read the contents of the mentioned MSRs. Non privileged
guests will get 0 when trying to read those registers, as writes to
MSR_IA32_PERF_CTL by such guest will already be silently dropped.
Fixes: 322ec7c89f6 ('x86/pv: disallow access to unknown MSRs') Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3059178798a23ba870ff86ff54d442a07e6651fc)
Julien Grall [Mon, 9 Nov 2020 20:28:59 +0000 (20:28 +0000)]
xen/arm: Always trap AMU system registers
The Activity Monitors Unit (AMU) has been introduced by ARMv8.4. It is
considered to be unsafe to be expose to guests as they might expose
information about code executed by other guests or the host.
Arm provided a way to trap all the AMU system registers by setting
CPTR_EL2.TAM to 1.
Unfortunately, on older revision of the specification, the bit 30 (now
CPTR_EL1.TAM) was RES0. Because of that, Xen is setting it to 0 and
therefore the system registers would be exposed to the guest when it is
run on processors with AMU.
As the bit is mark as UNKNOWN at boot in Armv8.4, the only safe solution
for us is to always set CPTR_EL1.TAM to 1.
Guest trying to access the AMU system registers will now receive an
undefined instruction. Unfortunately, this means that even well-behaved
guest may fail to boot because we don't sanitize the ID registers.
This is a known issues with other Armv8.0+ features (e.g. SVE, Pointer
Auth). This will taken care separately.
Ian Jackson [Wed, 4 Nov 2020 08:38:53 +0000 (09:38 +0100)]
SUPPORT.md: Desupport qemu trad except stub dm
While investigating XSA-335 we discovered that many upstream security
fixes were missing. It is not practical to backport them. There is
no good reason to be running this very ancient version of qemu, except
that it is the only way to run a stub dm which is currently supported
by upstream.
Andrew Cooper [Mon, 19 Oct 2020 14:51:22 +0000 (15:51 +0100)]
x86/pv: Flush TLB in response to paging structure changes
With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.
However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.
Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings. Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.
For all L2 and higher modifications, flush the TLB. PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global. (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)
Express the necessary flushes as a set of booleans which accumulate across the
operation. Comment the flushing logic extensively.
This is XSA-286.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
Andrew Cooper [Thu, 22 Oct 2020 10:28:58 +0000 (11:28 +0100)]
x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI
c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().
However, this was unnecessary.
It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way. In this
case, an MMUEXT_OP hypercall will follow. The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.
There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.
Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed. Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes. Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.
Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not. Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.
This is (not really) XSA-286 (but necessary to simplify the logic).
Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
Jan Beulich [Tue, 20 Oct 2020 13:21:40 +0000 (15:21 +0200)]
AMD/IOMMU: ensure suitable ordering of DTE modifications
DMA and interrupt translation should be enabled only after other
applicable DTE fields have been written. Similarly when disabling
translation or when moving a device between domains, translation should
first be disabled, before other entry fields get modified. Note however
that the "moving" aspect doesn't apply to the interrupt remapping side,
as domain specifics are maintained in the IRTEs here, not the DTE. We
also never disable interrupt remapping once it got enabled for a device
(the respective argument passed is always the immutable iommu_intremap).
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 0514a3a25fb9ebff5d75cc8f00a9229385300858
master date: 2020-10-20 14:23:12 +0200
Jan Beulich [Tue, 20 Oct 2020 13:21:19 +0000 (15:21 +0200)]
AMD/IOMMU: update live PTEs atomically
Updating a live PTE bitfield by bitfield risks the compiler re-ordering
the individual updates as well as splitting individual updates into
multiple memory writes. Construct the new entry fully in a local
variable, do the check to determine the flushing needs on the thus
established new entry, and then write the new entry by a single insn.
Similarly using memset() to clear a PTE is unsafe, as the order of
writes the function does is, at least in principle, undefined.
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 3b055121c5410e2c3105d6d06aa24ca0d58868cd
master date: 2020-10-20 14:22:52 +0200
Jan Beulich [Tue, 20 Oct 2020 13:20:54 +0000 (15:20 +0200)]
IOMMU: hold page ref until after deferred TLB flush
When moving around a page via XENMAPSPACE_gmfn_range, deferring the TLB
flush for the "from" GFN range requires that the page remains allocated
to the guest until the TLB flush has actually occurred. Otherwise a
parallel hypercall to remove the page would only flush the TLB for the
GFN it has been moved to, but not the one is was mapped at originally.
This is part of XSA-346.
Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ") Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 5777a3742d88ff1c0ebc626ceb4fd47f9b3dc6d5
master date: 2020-10-20 14:21:32 +0200
Jan Beulich [Tue, 20 Oct 2020 13:20:31 +0000 (15:20 +0200)]
IOMMU: suppress "iommu_dont_flush_iotlb" when about to free a page
Deferring flushes to a single, wide range one - as is done when
handling XENMAPSPACE_gmfn_range - is okay only as long as
pages don't get freed ahead of the eventual flush. While the only
function setting the flag (xenmem_add_to_physmap()) suggests by its name
that it's only mapping new entries, in reality the way
xenmem_add_to_physmap_one() works means an unmap would happen not only
for the page being moved (but not freed) but, if the destination GFN is
populated, also for the page being displaced from that GFN. Collapsing
the two flushes for this GFN into just one (end even more so deferring
it to a batched invocation) is not correct.
This is part of XSA-346.
Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: dea460d86957bf1425a8a1572626099ac3f165a8
master date: 2020-10-20 14:21:09 +0200
Hongyan Xia [Tue, 20 Oct 2020 13:20:10 +0000 (15:20 +0200)]
x86/mm: Prevent some races in hypervisor mapping updates
map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency. This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.
Unfortunately, while some potential races are handled correctly,
others are not. These include:
1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.
Take the following example:
Suppose L3[N] points to L2. And suppose we have two processors, A and
B.
* A walks the pagetables, get a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B Frees L2
* A writes L2[M] #
This is race exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: If you call virt_xen_to_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; same
with virt_xen_to_l1 on a virtual address within either an L3 or L2
superpage.
So take the following example:
* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #
2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.
Take the following example:
Suppose L3[N] points to L2. And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.
* A walks the pagetables, get a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointre to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #
Fix this by grabbing a lock for the entirety of the mapping update
operation.
Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.
There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed). Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe. Thinking carefully about that is probably
something that wants to be done on public, not under time pressure.
Jan Beulich [Fri, 2 Oct 2020 10:38:17 +0000 (12:38 +0200)]
evtchn/Flask: pre-allocate node on send path
xmalloc() & Co may not be called with IRQs off, or else check_lock()
will have its assertion trigger about locks getting acquired
inconsistently. Re-arranging the locking in evtchn_send() doesn't seem
very reasonable, especially since the per-channel lock was introduced to
avoid acquiring the per-domain event lock on the send paths. Issue a
second call to xsm_evtchn_send() instead, before acquiring the lock, to
give XSM / Flask a chance to pre-allocate whatever it may need.
As these nodes are used merely for caching earlier decisions' results,
allocate just one node in AVC code despite two potentially being needed.
Things will merely be not as performant if a second allocation was
wanted, just like when the pre-allocation fails.
Fixes: c0ddc8634845 ("evtchn: convert per-channel lock to be IRQ-safe") Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: 52e1fc47abc3a0123d2b5bb7e9172e84fd571851
master date: 2020-10-02 08:36:21 +0200
Jan Beulich [Tue, 22 Sep 2020 15:35:56 +0000 (17:35 +0200)]
evtchn: arrange for preemption in evtchn_reset()
Like for evtchn_destroy() looping over all possible event channels to
close them can take a significant amount of time. Unlike done there, we
can't alter domain properties (i.e. d->valid_evtchns) here. Borrow, in a
lightweight form, the paging domctl continuation concept, redirecting
the continuations to different sub-ops. Just like there this is to be
able to allow for predictable overall results of the involved sub-ops:
Racing requests should either complete or be refused.
Note that a domain can't interfere with an already started (by a remote
domain) reset, due to being paused. It can prevent a remote reset from
happening by leaving a reset unfinished, but that's only going to affect
itself.
Jan Beulich [Tue, 22 Sep 2020 15:35:15 +0000 (17:35 +0200)]
evtchn: arrange for preemption in evtchn_destroy()
Especially closing of fully established interdomain channels can take
quite some time, due to the locking involved. Therefore we shouldn't
assume we can clean up still active ports all in one go. Besides adding
the necessary preemption check, also avoid pointlessly starting from
(or now really ending at) 0; 1 is the lowest numbered port which may
need closing.
Since we're now reducing ->valid_evtchns, free_xen_event_channel(),
and (at least to be on the safe side) notify_via_xen_event_channel()
need to cope with attempts to close / unbind from / send through already
closed (and no longer valid, as per port_is_valid()) ports.
Jan Beulich [Tue, 22 Sep 2020 15:34:52 +0000 (17:34 +0200)]
evtchn: address races with evtchn_reset()
Neither d->evtchn_port_ops nor max_evtchns(d) may be used in an entirely
lock-less manner, as both may change by a racing evtchn_reset(). In the
common case, at least one of the domain's event lock or the per-channel
lock needs to be held. In the specific case of the inter-domain sending
by evtchn_send() and notify_via_xen_event_channel() holding the other
side's per-channel lock is sufficient, as the channel can't change state
without both per-channel locks held. Without such a channel changing
state, evtchn_reset() can't complete successfully.
Lock-free accesses continue to be permitted for the shim (calling some
otherwise internal event channel functions), as this happens while the
domain is in effectively single-threaded mode. Special care also needs
taking for the shim's marking of in-use ports as ECS_RESERVED (allowing
use of such ports in the shim case is okay because switching into and
hence also out of FIFO mode is impossihble there).
As a side effect, certain operations on Xen bound event channels which
were mistakenly permitted so far (e.g. unmask or poll) will be refused
now.
Jan Beulich [Tue, 22 Sep 2020 15:33:56 +0000 (17:33 +0200)]
evtchn: evtchn_reset() shouldn't succeed with still-open ports
While the function closes all ports, it does so without holding any
lock, and hence racing requests may be issued causing new ports to get
opened. This would have been problematic in particular if such a newly
opened port had a port number above the new implementation limit (i.e.
when switching from FIFO to 2-level) after the reset, as prior to
"evtchn: relax port_is_valid()" this could have led to e.g.
evtchn_close()'s "BUG_ON(!port_is_valid(d2, port2))" to trigger.
Introduce a counter of active ports and check that it's (still) no
larger then the number of Xen internally used ones after obtaining the
necessary lock in evtchn_reset().
As to the access model of the new {active,xen}_evtchns fields - while
all writes get done using write_atomic(), reads ought to use
read_atomic() only when outside of a suitably locked region.
Note that as of now evtchn_bind_virq() and evtchn_bind_ipi() don't have
a need to call check_free_port().
Jan Beulich [Tue, 22 Sep 2020 15:33:25 +0000 (17:33 +0200)]
evtchn/x86: enforce correct upper limit for 32-bit guests
The recording of d->max_evtchns in evtchn_2l_init(), in particular with
the limited set of callers of the function, is insufficient. Neither for
PV nor for HVM guests the bitness is known at domain_create() time, yet
the upper bound in 2-level mode depends upon guest bitness. Recording
too high a limit "allows" x86 32-bit domains to open not properly usable
event channels, management of which (inside Xen) would then result in
corruption of the shared info and vCPU info structures.
Keep the upper limit dynamic for the 2-level case, introducing a helper
function to retrieve the effective limit. This helper is now supposed to
be private to the event channel code. The used in do_poll() and
domain_dump_evtchn_info() weren't consistent with port uses elsewhere
and hence get switched to port_is_valid().
Furthermore FIFO mode's setup_ports() gets adjusted to loop only up to
the prior ABI limit, rather than all the way up to the new one.
Finally a word on the change to do_poll(): Accessing ->max_evtchns
without holding a suitable lock was never safe, as it as well as
->evtchn_port_ops may change behind do_poll()'s back. Using
port_is_valid() instead widens some the window for potential abuse,
until we've dealt with the race altogether (see XSA-343).
This is XSA-342.
Reported-by: Julien Grall <jgrall@amazon.com> Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Julien Grall <jgrall@amazon.com>
xen/evtchn: Add missing barriers when accessing/allocating an event channel
While the allocation of a bucket is always performed with the per-domain
lock, the bucket may be accessed without the lock taken (for instance, see
evtchn_send()).
Instead such sites relies on port_is_valid() to return a non-zero value
when the port has a struct evtchn associated to it. The function will
mostly check whether the port is less than d->valid_evtchns as all the
buckets/event channels should be allocated up to that point.
Unfortunately a compiler is free to re-order the assignment in
evtchn_allocate_port() so it would be possible to have d->valid_evtchns
updated before the new bucket has finish to allocate.
Additionally on Arm, even if this was compiled "correctly", the
processor can still re-order the memory access.
Add a write memory barrier in the allocation side and a read memory
barrier when the port is valid to prevent any re-ordering issue.
Andrew Cooper [Tue, 22 Sep 2020 15:31:31 +0000 (17:31 +0200)]
x86/pv: Avoid double exception injection
There is at least one path (SYSENTER with NT set, Xen converts to #GP) which
ends up injecting the #GP fault twice, first in compat_sysenter(), and then a
second time in compat_test_all_events(), due to the stale TBF_EXCEPTION left
in TRAPBOUNCE_flags.
The guest kernel sees the second fault first, which is a kernel level #GP
pointing at the head of the #GP handler, and is therefore a userspace
trigger-able DoS.
This particular bug has bitten us several times before, so rearrange
{compat_,}create_bounce_frame() to clobber TRAPBOUNCE on success, rather than
leaving this task to one area of code which isn't used uniformly.
Other scenarios which might result in a double injection (e.g. two calls
directly to compat_create_bounce_frame) will now crash the guest, which is far
more obvious than letting the kernel run with corrupt state.
This is XSA-339
Fixes: fdac9515607b ("x86: clear EFLAGS.NT in SYSENTER entry path") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 22 Sep 2020 15:30:04 +0000 (17:30 +0200)]
evtchn: relax port_is_valid()
To avoid ports potentially becoming invalid behind the back of certain
other functions (due to ->max_evtchn shrinking) because of
- a guest invoking evtchn_reset() and from a 2nd vCPU opening new
channels in parallel (see also XSA-343),
- alloc_unbound_xen_event_channel() produced channels living above the
2-level range (see also XSA-342),
drop the max_evtchns check from port_is_valid(). For a port for which
the function once returned "true", the returned value may not turn into
"false" later on. The function's result may only depend on bounds which
can only ever grow (which is the case for d->valid_evtchns).
This also eliminates a false sense of safety, utilized by some of the
users (see again XSA-343): Without a suitable lock held, d->max_evtchns
may change at any time, and hence deducing that certain other operations
are safe when port_is_valid() returned true is not legitimate. The
opportunities to abuse this may get widened by the change here
(depending on guest and host configuration), but will be taken care of
by the other XSA.
This is XSA-338.
Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Tue, 22 Sep 2020 15:29:19 +0000 (17:29 +0200)]
x86/MSI-X: restrict reading of table/PBA bases from BARs
When assigned to less trusted or un-trusted guests, devices may change
state behind our backs (they may e.g. get reset by means we may not know
about). Therefore we should avoid reading BARs from hardware once a
device is no longer owned by Dom0. Furthermore when we can't read a BAR,
or when we read zero, we shouldn't instead use the caller provided
address unless that caller can be trusted.
Re-arrange the logic in msix_capability_init() such that only Dom0 (and
only if the device isn't DomU-owned yet) or calls through
PHYSDEVOP_prepare_msix will actually result in the reading of the
respective BAR register(s). Additionally do so only as long as in-use
table entries are known (note that invocation of PHYSDEVOP_prepare_msix
counts as a "pseudo" entry). In all other uses the value already
recorded will get used instead.
Clear the recorded values in _pci_cleanup_msix() as well as on the one
affected error path. (Adjust this error path to also avoid blindly
disabling MSI-X when it was enabled on entry to the function.)
While moving around variable declarations (in many cases to reduce their
scopes), also adjust some of their types.
This is part of XSA-337.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
It's safer and faster to just use the cached last written
(untranslated) MSI message stored in msi_desc for the single user that
calls read_msi_msg.
This also prevents relying on the data read from the device MSI
registers in order to figure out the index into the IOMMU interrupt
remapping table, which is not safe.
This is part of XSA-337.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/vpt: fix race when migrating timers between vCPUs
The current vPT code will migrate the emulated timers between vCPUs
(change the pt->vcpu field) while just holding the destination lock,
either from create_periodic_time or pt_adjust_global_vcpu_target if
the global target is adjusted. Changing the periodic_timer vCPU field
in this way creates a race where a third party could grab the lock in
the unlocked region of pt_adjust_global_vcpu_target (or before
create_periodic_time performs the vcpu change) and then release the
lock from a different vCPU, creating a locking imbalance.
Introduce a per-domain rwlock in order to protect periodic_time
migration between vCPU lists. Taking the lock in read mode prevents
any timer from being migrated to a different vCPU, while taking it in
write mode allows performing migration of timers across vCPUs. The
per-vcpu locks are still used to protect all the other fields from the
periodic_timer struct.
Note that such migration shouldn't happen frequently, and hence
there's no performance drop as a result of such locking.
This is XSA-336.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen: Check the alignment of the offset pased via VCPUOP_register_vcpu_info
Currently a guest is able to register any guest physical address to use
for the vcpu_info structure as long as the structure can fits in the
rest of the frame.
This means a guest can provide an address that is not aligned to the
natural alignment of the structure.
On Arm 32-bit, unaligned access are completely forbidden by the
hypervisor. This will result to a data abort which is fatal.
On Arm 64-bit, unaligned access are only forbidden when used for atomic
access. As the structure contains fields (such as evtchn_pending_self)
that are updated using atomic operations, any unaligned access will be
fatal as well.
While the misalignment is only fatal on Arm, a generic check is added
as an x86 guest shouldn't sensibly pass an unaligned address (this
would result to a split lock).
x86/ept: flush cache when modifying PTEs and sharing page tables
Modifications made to the page tables by EPT code need to be written
to memory when the page tables are shared with the IOMMU, as Intel
IOMMUs can be non-coherent and thus require changes to be written to
memory in order to be visible to the IOMMU.
In order to achieve this make sure data is written back to memory
after writing an EPT entry when the recalc bit is not set in
atomic_write_ept_entry. If such bit is set, the entry will be
adjusted and atomic_write_ept_entry will be called a second time
without the recalc bit set. Note that when splitting a super page the
new tables resulting of the split should also be written back.
Failure to do so can allow devices behind the IOMMU access to the
stale super page, or cause coherency issues as changes made by the
processor to the page tables are not visible to the IOMMU.
This allows to remove the VT-d specific iommu_pte_flush helper, since
the cache write back is now performed by atomic_write_ept_entry, and
hence iommu_iotlb_flush can be used to flush the IOMMU TLB. The newly
used method (iommu_iotlb_flush) can result in less flushes, since it
might sometimes be called rightly with 0 flags, in which case it
becomes a no-op.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c23274fd0412381bd75068ebc9f8f8c90a4be748
master date: 2020-07-07 14:40:11 +0200
Some VT-d IOMMUs are non-coherent, which requires a cache write back
in order for the changes made by the CPU to be visible to the IOMMU.
This cache write back was unconditionally done using clflush, but there are
other more efficient instructions to do so, hence implement support
for them using the alternative framework.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a64ea16522a73a13a0d66cfa4b66a9d3b95dd9d6
master date: 2020-07-07 14:39:54 +0200
vtd: don't assume addresses are aligned in sync_cache
Current code in sync_cache assume that the address passed in is
aligned to a cache line size. Fix the code to support passing in
arbitrary addresses not necessarily aligned to a cache line size.
This is part of XSA-321.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b6d9398144f21718d25daaf8d72669a75592abc5
master date: 2020-07-07 14:39:05 +0200
The hook is only implemented for VT-d and it uses the already existing
iommu_sync_cache function present in VT-d code. The new hook is
added so that the cache can be flushed by code outside of VT-d when
using shared page tables.
Note that alloc_pgtable_maddr must use the now locally defined
sync_cache function, because IOMMU ops are not yet setup the first
time the function gets called during IOMMU initialization.
No functional change intended.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 91526b460e5009fc56edbd6809e66c327281faba
master date: 2020-07-07 14:38:34 +0200
Rename __iommu_flush_cache to iommu_sync_cache and remove
iommu_flush_cache_page. Also remove the iommu_flush_cache_entry
wrapper and just use iommu_sync_cache instead. Note the _entry suffix
was meaningless as the wrapper was already taking a size parameter in
bytes. While there also constify the addr parameter.
No functional change intended.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62298825b9a44f45761acbd758138b5ba059ebd1
master date: 2020-07-07 14:38:13 +0200
Jan Beulich [Tue, 7 Jul 2020 13:29:44 +0000 (15:29 +0200)]
vtd: improve IOMMU TLB flush
Do not limit PSI flushes to order 0 pages, in order to avoid doing a
full TLB flush if the passed in page has an order greater than 0 and
is aligned. Should increase the performance of IOMMU TLB flushes when
dealing with page orders greater than 0.
This is part of XSA-321.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5fe515a0fede07543f2a3b049167b1fd8b873caf
master date: 2020-07-07 14:37:46 +0200
x86/ept: atomically modify entries in ept_next_level
ept_next_level was passing a live PTE pointer to ept_set_middle_entry,
which was then modified without taking into account that the PTE could
be part of a live EPT table. This wasn't a security issue because the
pages returned by p2m_alloc_ptp are zeroed, so adding such an entry
before actually initializing it didn't allow a guest to access
physical memory addresses it wasn't supposed to access.
This is part of XSA-328.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bc3d9f95d661372b059a5539ae6cb1e79435bb95
master date: 2020-07-07 14:37:12 +0200
Jan Beulich [Tue, 7 Jul 2020 13:28:53 +0000 (15:28 +0200)]
x86/EPT: ept_set_middle_entry() related adjustments
ept_split_super_page() wants to further modify the newly allocated
table, so have ept_set_middle_entry() return the mapped pointer rather
than tearing it down and then getting re-established right again.
Similarly ept_next_level() wants to hand back a mapped pointer of
the next level page, so re-use the one established by
ept_set_middle_entry() in case that path was taken.
Pull the setting of suppress_ve ahead of insertion into the higher level
table, and don't have ept_split_super_page() set the field a 2nd time.
This is part of XSA-328.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1104288186ee73a7f9bfa41cbaa5bb7611521028
master date: 2020-07-07 14:36:52 +0200
Jan Beulich [Tue, 7 Jul 2020 13:28:25 +0000 (15:28 +0200)]
x86/shadow: correct an inverted conditional in dirty VRAM tracking
This originally was "mfn_x(mfn) == INVALID_MFN". Make it like this
again, taking the opportunity to also drop the unnecessary nearby
braces.
This is XSA-319.
Fixes: 246a5a3377c2 ("xen: Use a typesafe to define INVALID_MFN") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23a216f99d40fbfbc2318ade89d8213eea6ba1f8
master date: 2020-07-07 14:36:24 +0200
xen/common: event_channel: Don't ignore error in get_free_port()
Currently, get_free_port() is assuming that the port has been allocated
when evtchn_allocate_port() is not return -EBUSY.
However, the function may return an error when:
- We exhausted all the event channels. This can happen if the limit
configured by the administrator for the guest ('max_event_channels'
in xl cfg) is higher than the ABI used by the guest. For instance,
if the guest is using 2L, the limit should not be higher than 4095.
- We cannot allocate memory (e.g Xen has not more memory).
Users of get_free_port() (such as EVTCHNOP_alloc_unbound) will validly
assuming the port was valid and will next call evtchn_from_port(). This
will result to a crash as the memory backing the event channel structure
is not present.
Fixes: 368ae9a05fe ("xen/pvshim: forward evtchn ops between L0 Xen and L2 DomU") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2e9c2bc292231823a3a021d2e0a9f1956bf00b3c
master date: 2020-07-07 14:35:36 +0200
Jan Beulich [Wed, 11 Dec 2019 14:34:51 +0000 (15:34 +0100)]
lz4: refine commit 9143a6c55ef7 for the 64-bit case
I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to
cpy = op + length - (STEPSIZE - 4);
where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.
Reported-by: Mark Pryor <pryorm09@gmail.com> Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Mark Pryor <pryorm09@gmail.com> Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2d7572cdfa4d481c1ca246aa1ce5239ccae7eb59
master date: 2019-12-09 14:01:25 +0100
Andrew Cooper [Wed, 10 Jun 2020 17:57:00 +0000 (18:57 +0100)]
x86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden
RDRAND/RDSEED can be hidden using cpuid= to mitigate SRBDS if microcode
isn't available.
This is part of XSA-320 / CVE-2020-0543.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 7028534d8482d25860c4d1aa8e45f0b911abfc5a)
CC xenctrl_stubs.o
xenctrl_stubs.c: In function 'failwith_xc':
xenctrl_stubs.c:65:17: error: 'snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
"%d: %s: %s", error->code,
^
xenctrl_stubs.c:64:4: note: 'snprintf' output 6 or more bytes (assuming 1029) into a destination of size 1028
snprintf(error_str, sizeof(error_str),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%d: %s: %s", error->code,
~~~~~~~~~~~~~~~~~~~~~~~~~~
xc_error_code_to_desc(error->code),
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error->message);
~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[8]: *** [/build/xen-git/src/xen/tools/ocaml/libs/xc/../../Makefile.rules:37: xenctrl_stubs.o] Error 1
m
Signed-off-by: John Thomson <git@johnthomson.fastmail.com.au> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 2adc90908fbb1e614c477e29f2d45eda94570795)
tools/misc: fix hypothetical buffer overflow in xen-lowmemd
gcc-8 complains:
xen-lowmemd.c: In function 'handle_low_mem':
xen-lowmemd.c:80:55: error: '%s' directive output may be truncated writing up to 511 bytes into a region of size 489 [-Werror=format-truncation=]
snprintf(error, BUFSZ,"Failed to write target %s to xenstore", data);
^~ ~~~~
xen-lowmemd.c:80:9: note: 'snprintf' output between 36 and 547 bytes into a destination of size 512
snprintf(error, BUFSZ,"Failed to write target %s to xenstore", data);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In practice it wouldn't happen, because 'data' contains string
representation of 64-bit unsigned number (20 characters at most).
But place a limit to mute gcc warning.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 27751d89248c8c5eef6d8b56eb8f7d2084145080)
tapdisk-vbd.c: In function 'tapdisk_vbd_resume_ring':
tapdisk-vbd.c:1671:53: error: 'snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
snprintf(params.name, sizeof(params.name) - 1, "%s", message);
^
tapdisk-vbd.c:1671:3: note: 'snprintf' output between 1 and 256 bytes into a destination of size 255
snprintf(params.name, sizeof(params.name) - 1, "%s", message);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The "- 1" in buffer size should be actually applied to message, to leave
place for terminating '\0', not the other way around (truncate '\0' even
if it would fit).
In function 'tapdisk_control_open_image',
inlined from 'tapdisk_control_handle_request' at tapdisk-control.c:660:10:
tapdisk-control.c:465:2: error: 'strncpy' specified bound 256 equals destination size [-Werror=stringop-truncation]
strncpy(params.name, vbd->name, BLKTAP2_MAX_MESSAGE_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function 'tapdisk_control_create_socket',
inlined from 'tapdisk_control_open' at tapdisk-control.c:836:9:
tapdisk-control.c:793:2: error: 'strncpy' specified bound 108 equals destination size [-Werror=stringop-truncation]
strncpy(saddr.sun_path, td_control.path, sizeof(saddr.sun_path));
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
block-qcow.c: In function 'qcow_create':
block-qcow.c:1216:5: error: 'strncpy' specified bound 4096 equals destination size [-Werror=stringop-truncation]
strncpy(backing_filename, backing_file,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sizeof(backing_filename));
~~~~~~~~~~~~~~~~~~~~~~~~~
I those cases, reduce size of copied string and make sure final '\0' is
added.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 850e89b3ef1a7be6b71fa7ae22333c884e08431a)
gx_main.c: In function 'prepare_stop_reply':
gx_main.c:385:9: error: 'strncpy' output truncated before terminating nul copying 6 bytes from a string of the same length [-Werror=stringop-truncation]
strncpy(buf, "watch:", 6);
^~~~~~~~~~~~~~~~~~~~~~~~~
Since terminating '\0' isn't needed here at all, switch to memcpy.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 7f601f7c341c80d554615556d60e3b8ed1e5ad4f)
gcc-8 warns about possible truncation of trailing '\0'.
Final character is overridden by '\0' anyway, so don't bother to copy
it.
This fixes compile failure:
xc_pm.c: In function 'xc_set_cpufreq_gov':
xc_pm.c:308:5: error: 'strncpy' specified bound 16 equals destination size [-Werror=stringop-truncation]
strncpy(scaling_governor, govname, CPUFREQ_NAME_LEN);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit fa7789ef18bd2e716997937af71b2e4b5b00a159)
| xentop.c: In function 'print':
| xentop.c:304:4: error: 'vwprintw' is deprecated [-Werror=deprecated-declarations]
| vwprintw(stdscr, (curses_str_t)fmt, args);
| ^~~~~~~~
vw_printw (note the underscore) is a non-deprecated alternative.
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 2b50cdbc444c637575580dcfa6c9525a84d5cc62)
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel
See patch documentation and comments.
This is part of XSA-320 / CVE-2020-0543
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a49b9a7920c82015381740905582b666160d955)
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling
This is part of XSA-320 / CVE-2020-0543
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit caab85ab58c0cdf74ab070a5de5c4df89f509ff3)
Andrew Cooper [Tue, 4 Feb 2020 20:29:38 +0000 (20:29 +0000)]
libxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)
It turns out that a bug (since forever) in Xen causes XSAVE records to have
non-architectural behaviour on xsave-capable hardware, when a PV guest has not
touched the state.
In such a case, the data record returned from Xen is 2*uint64_t, both claiming
the (illegitimate) state of %xcr0 and %xcr0_accum being 0.
Adjust the bound in handle_x86_pv_vcpu_blob() to cope with this.
Andrew Cooper [Wed, 18 Dec 2019 20:17:42 +0000 (20:17 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_info()
handle_x86_pv_info() has a subtle bug. It uses an 'else if' chain with a
clause in the middle which doesn't exit unconditionally. In practice, this
means that when restoring a 32bit PV guest, later sanity checks are skipped.
Rework the logic a little to be simpler. There are exactly two valid
combinations of fields in X86_PV_INFO, so factor this out and check them all
in one go, before making adjustments to the current domain.
Once adjustments have been completed successfully, sanity check the result
against the X86_PV_INFO settings in one go, rather than piece-wise.
tools/xenstore: fix a use after free problem in xenstored
Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object
twice") introduced a potential use after free problem in
domain_cleanup(): after calling talloc_unlink() for domain->conn
domain->conn is set to NULL. The problem is that domain is registered
as talloc child of domain->conn, so it might be freed by the
talloc_unlink() call.
With Xenstore being single threaded there are normally no concurrent
memory allocations running and freeing a virtual memory area normally
doesn't result in that area no longer being accessible. A problem
could occur only in case either a signal received results in some
memory allocation done in the signal handler (SIGHUP is a primary
candidate leading to reopening the log file), or in case the talloc
framework would do some internal memory allocation during freeing of
the memory (which would lead to clobbering of the freed domain
structure).
Andrew Cooper [Thu, 5 Mar 2020 10:35:17 +0000 (11:35 +0100)]
xen/pvh: Fix segment selector ABI
The written ABI states that %es will be set up, but libxc doesn't do so. In
practice, it breaks `rep movs` inside guests before they reload %es.
The written ABI doesn't mention %ss, but libxc does set it up. Having %ds
different to %ss is obnoxous to work with, as different registers have
different implicit segments.
Modify the spec to state that %ss is set up as a flat read/write segment.
This a) matches the Multiboot 1 spec, b) matches what is set up in practice,
and c) is the more sane behaviour for guests to use.
Fixes: 68e1183411b ('libxc: introduce a xc_dom_arch for hvm-3.0-x86_32 guests') Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/pvh: Adjust dom0's starting state
Fixes: b25fb1a04e "xen/pvh: Fix segment selector ABI" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b25fb1a04e99cc03359eade1affb56ef0eee766f
master date: 2020-02-10 15:26:09 +0000
master commit: 6ee10313623c1f41fc72fe12372e176e744463c1
master date: 2020-02-11 11:04:26 +0000
Jan Beulich [Tue, 14 Apr 2020 13:14:21 +0000 (15:14 +0200)]
gnttab: fix GNTTABOP_copy continuation handling
The XSA-226 fix was flawed - the backwards transformation on rc was done
too early, causing a continuation to not get invoked when the need for
preemption was determined at the very first iteration of the request.
This in particular means that all of the status fields of the individual
operations would be left untouched, i.e. set to whatever the caller may
or may not have initialized them to.
Ross Lagerwall [Tue, 14 Apr 2020 13:13:24 +0000 (15:13 +0200)]
xen/gnttab: Fix error path in map_grant_ref()
Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets,
changing the logic. If the _set_status() call fails, the grant_map hypercall
would fail with a status of 1 (rc != GNTST_okay) instead of the expected
negative GNTST_* error.
This error path can be taken due to bad guest state, and causes net/blk-back
in Linux to crash.
This is XSA-316.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: da0c66c8f48042a0186799014af69db0303b1da5
master date: 2020-04-14 14:41:02 +0200
xen/rwlock: Add missing memory barrier in the unlock path of rwlock
The rwlock unlock paths are using atomic_sub() to release the lock.
However the implementation of atomic_sub() rightfully doesn't contain a
memory barrier. On Arm, this means a processor is allowed to re-order
the memory access with the preceeding access.
In other words, the unlock may be seen by another processor before all
the memory accesses within the "critical" section.
The rwlock paths already contains barrier indirectly, but they are not
very useful without the counterpart in the unlock paths.
The memory barriers are not necessary on x86 because loads/stores are
not re-ordered with lock instructions.
So add arch_lock_release_barrier() in the unlock paths that will only
add memory barrier on Arm.
Take the opportunity to document each lock paths explaining why a
barrier is not necessary.
Jan Beulich [Tue, 14 Apr 2020 13:10:48 +0000 (15:10 +0200)]
xenoprof: limit consumption of shared buffer data
Since a shared buffer can be written to by the guest, we may only read
the head and tail pointers from there (all other fields should only ever
be written to). Furthermore, for any particular operation the two values
must be read exactly once, with both checks and consumption happening
with the thus read values. (The backtrace related xenoprof_buf_space()
use in xenoprof_log_event() is an exception: The values used there get
re-checked by every subsequent xenoprof_add_sample().)
Since that code needed touching, also fix the double increment of the
lost samples count in case the backtrace related xenoprof_add_sample()
invocation in xenoprof_log_event() fails.
Where code is being touched anyway, add const as appropriate, but take
the opportunity to entirely drop the now unused domain parameter of
xenoprof_buf_space().
This is part of XSA-313.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 50ef9a3cb26e2f9383f6fdfbed361d8f174bae9f
master date: 2020-04-14 14:33:19 +0200
Jan Beulich [Tue, 14 Apr 2020 13:08:26 +0000 (15:08 +0200)]
xenoprof: clear buffer intended to be shared with guests
alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen
internally used allocations, but buffers allocated to be shared with
(unpriviliged) guests need to be zapped of their prior content.
This is part of XSA-313.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0763a7ebfcdad66cf9e5475a1301eefb29bae9ed
master date: 2020-04-14 14:32:33 +0200
Julien Grall [Thu, 19 Dec 2019 08:12:21 +0000 (08:12 +0000)]
xen/arm: Place a speculation barrier sequence following an eret instruction
Some CPUs can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.
Newer CPUs may implement a new SB barrier instruction which acts
as an architected speculation barrier. For current CPUs, the sequence
DSB; ISB is known to prevent speculation.
The latter sequence is heavier than SB but it would never be executed
(this is speculation after all!).
Introduce a new macro 'sb' that could be used when a speculation barrier
is required. For now it is using dsb; isb but this could easily be
updated to cater SB in the future.
Andrew Cooper [Wed, 11 Dec 2019 14:44:09 +0000 (15:44 +0100)]
AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables
update_paging_mode() has multiple bugs:
1) Booting with iommu=debug will cause it to inform you that that it called
without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
table(s) in the case of a further error.
Furthermore, the choice of default level for a domain has issues:
1) All HVM guests grow from 2 to 3 levels during construction because of the
position of the VRAM just below the 4G boundary, so defaulting to 2 is a
waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
isn't dynamic at runtime like HVM guests. This means that a PV guest may
get RAM which it can't map in the IOMMU.
The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement. Remove
the complexity by removing the dynamic height.
PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.
HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.
The overhead of this extra level is not expected to be noticeable. It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.
This is XSA-311.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100
George Dunlap [Wed, 11 Dec 2019 14:43:44 +0000 (15:43 +0100)]
x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial
The PGT_partial bit in page->type_info holds both a type count and a
general ref count. During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial. When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():
As far as we can tell, this page underflow cannot be exploited any any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().
(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)
Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 66bdc16aeed8ddb2ae724adc5ea6bde0dea78c3d
master date: 2019-12-11 14:55:08 +0100
George Dunlap [Wed, 11 Dec 2019 14:43:09 +0000 (15:43 +0100)]
x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR
When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.
One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).
Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR. This turns out to be true for
alloc_lN_table() and free_lN_table, but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set. In these cases, the pages PGT_partial will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.
Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.
NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i. On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4e70f4476c0c543559f971faecdd5f1300cddb0a
master date: 2019-12-11 14:54:43 +0100
George Dunlap [Wed, 11 Dec 2019 14:42:33 +0000 (15:42 +0100)]
x86/mm: Set old_guest_table when destroying vcpu pagetables
Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set. If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.
One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set. The
result was that if such an operation were interrupted, Xen would hit a
BUG().
Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately
While here, do some refactoring:
- Move clearing of arch.cr3 to the top of the function
- Now that clearing is unconditional, move the unmap to the same
conditional as the l4tab mapping. This also allows us to reduce
the scope of the l4tab variable.
- Avoid code duplication by looping to drop references on
guest_table_user
This is part of XSA-310.
Reported-by: Sarah Newman <srn@prgmr.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ececa12b2c4c8e4433e4f9be83f5c668ae36fe08
master date: 2019-12-11 14:54:13 +0100
George Dunlap [Wed, 11 Dec 2019 14:41:54 +0000 (15:41 +0100)]
x86/mm: Don't reset linear_pt_count on partial validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to another page
table, to prevent both from happening at the same time.
Unfortunately, the original commit introducing this reset this count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.
On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:
Assertion 'oc > 0' failed at mm.c:874
Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other abitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.
This is XSA-309.
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7473efd12fb7a6548f5303f1f4c5cb521543a813
master date: 2019-12-11 14:10:27 +0100
Andrew Cooper [Wed, 11 Dec 2019 14:41:22 +0000 (15:41 +0100)]
x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures
See patch comment for technical details.
Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.
After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.
A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, and proving this to be a
security issue.
This is XSA-308
Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1d3eb8259804e5bec991a3462d69ba6bd80bb40e
master date: 2019-12-11 14:09:30 +0100
Jan Beulich [Wed, 11 Dec 2019 14:40:39 +0000 (15:40 +0100)]
x86+Arm32: make find_next_{,zero_}bit() have well defined behavior
These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.
Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.
Jan Beulich [Wed, 11 Dec 2019 14:39:07 +0000 (15:39 +0100)]
AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
Unmapping a page which has never been mapped should be a no-op (note how
it already is in case there was no root page table allocated). There's
in particular no need to grow the number of page table levels in use,
and there's also no need to allocate intermediate page tables except
when needing to split a large page.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad591454f069647c36a7daaa9ec23384c0263f0b
master date: 2019-11-12 11:08:34 +0100
Jan Beulich [Tue, 26 Nov 2019 17:03:41 +0000 (18:03 +0100)]
IOMMU: default to always quarantining PCI devices
XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.
Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.
Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).
This is XSA-306.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: ba2ab00bbb8c74e311a252d816d68dee47c779a0
master date: 2019-11-26 14:15:01 +0100
Andrew Cooper [Tue, 26 Nov 2019 17:01:17 +0000 (18:01 +0100)]
x86/vvmx: Fix livelock with XSA-304 fix
It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.
Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.
Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the oppotunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.
When nested virt is enabled on a domain, force it to using executable
superpages and rebuild the p2m.
Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 183f354e1430087879de071f0c7122e42703916e
master date: 2019-11-23 14:06:24 +0000
Andrew Cooper [Wed, 19 Jun 2019 17:16:03 +0000 (18:16 +0100)]
x86/tsx: Introduce tsx= to use MSR_TSX_CTRL when available
To protect against the TSX Async Abort speculative vulnerability, Intel have
released new microcode for affected parts which introduce the MSR_TSX_CTRL
control, which allows TSX to be turned off. This will be architectural on
future parts.
Introduce tsx= to provide a global on/off for TSX, including its enumeration
via CPUID. Provide stub virtualisation of this MSR, as it is not exposed to
guests at the moment.
VMs may have booted before microcode is loaded, or before hosts have rebooted,
and they still want to migrate freely. A VM which booted seeing TSX can
migrate safely to hosts with TSX disabled - TSX will start unconditionally
aborting, but still behave in a manner compatible with the ABI.
The guest-visible behaviour is equivalent to late loading the microcode and
setting the RTM_DISABLE bit in the course of live patching.
This is part of XSA-305 / CVE-2019-11135
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 8 Nov 2019 16:36:50 +0000 (16:36 +0000)]
x86/vtx: Allow runtime modification of the exec-sp setting
See patch for details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Thu, 20 Dec 2018 17:25:29 +0000 (17:25 +0000)]
x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207
CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation. HVM guest
kernels can trigger this to DoS the host.
To mitigate, in affected hardware, all EPT superpages are marked NX. When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored. This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.
This does come with a workload-dependent performance overhead, caused by
increased TLB pressure. Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.
This is part of XSA-304 / CVE-2018-12207
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 24 Oct 2019 13:09:01 +0000 (14:09 +0100)]
x86/vtd: Hide superpage support for SandyBridge IOMMUs
Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered. The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.
This is part of XSA-304 / CVE-2018-12207
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Mon, 4 Nov 2019 13:54:09 +0000 (14:54 +0100)]
xen/arm64: Don't blindly unmask interrupts on trap without a change of level
Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless the state of them in the
interrupted context.
One of the consequences is IRQ will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result to unexpected
behavior such as deadlock (if a lock was shared with interrupts).
In a nutshell, interrupts should only be unmasked when it is safe to
do. Xen only unmask IRQ and Abort interrupts, so the logic can stay
simple:
- hyp_error: All the interrupts are now kept masked. SError should
be pretty rare and if ever happen then we most likely want to
avoid any other interrupts to be generated. The potential main
"caller" is during virtual SError synchronization on the exit
path from the guest (see check_pending_vserror).
- hyp_sync: The interrupts state is inherited from the interrupted
context.
- hyp_irq: All the interrupts but IRQ state are inherited from the
interrupted context. IRQ is kept masked.
Julien Grall [Mon, 4 Nov 2019 13:53:46 +0000 (14:53 +0100)]
xen/arm32: Don't blindly unmask interrupts on trap without a change of level
Exception vectors will unmask interrupts regardless the state of them in
the interrupted context.
One of the consequences is IRQ will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result to unexpected behavior such as deadlock (if a lock was
shared with interrupts).
In a nutshell, interrupts should only be unmasked when it is safe to do.
Xen only unmask IRQ and Abort interrupts, so the logic can stay simple.
As vectors exceptions may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.
On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.
On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.
The interrupts will be kept unmasked when the vector cannot used by
either guest or hypervisor.
Note that each vector is not anymore preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.
Julien Grall [Mon, 4 Nov 2019 13:53:27 +0000 (14:53 +0100)]
xen/arm32: entry: Fold the macro SAVE_ALL in the macro vector
Follow-up rework will require the macro vector to distinguish between
a trap from a guest vs while in the hypervisor.
The macro SAVE_ALL already has code to distinguish between the two and
it is only called by the vector macro. So fold the former into the
latter. This will help to avoid duplicating the check.
Julien Grall [Mon, 4 Nov 2019 13:53:02 +0000 (14:53 +0100)]
xen/arm32: entry: Split __DEFINE_ENTRY_TRAP in two
The preprocessing macro __DEFINE_ENTRY_TRAP is used to generate trap
entry function. While the macro is fairly small today, follow-up patches
will increase the size signicantly.
In general, assembly macros are more readable as they allow you to name
parameters and avoid '\'. So the actual implementation of the trap is
now switched to an assembly macro.
Paul Durrant [Mon, 4 Nov 2019 13:52:31 +0000 (14:52 +0100)]
passthrough: quarantine PCI devices
When a PCI device is assigned to an untrusted domain, it is possible for
that domain to program the device to DMA to an arbitrary address. The
IOMMU is used to protect the host from malicious DMA by making sure that
the device addresses can only target memory assigned to the guest. However,
when the guest domain is torn down the device is assigned back to dom0,
thus allowing any in-flight DMA to potentially target critical host data.
This patch introduces a 'quarantine' for PCI devices using dom_io. When
the toolstack makes a device assignable (by binding it to pciback), it
will now also assign it to DOMID_IO and the device will only be assigned
back to dom0 when the device is made unassignable again. Whilst device is
assignable it will only ever transfer between dom_io and guest domains.
dom_io is actually only used as a sentinel domain for quarantining purposes;
it is not configured with any IOMMU mappings. Assignment to dom_io simply
means that the device's initiator (requestor) identifier is not present in
the IOMMU's device table and thus any DMA transactions issued will be
terminated with a fault condition.
In addition, a fix to assignment handling is made for VT-d. Failure
during the assignment step should not lead to a device still being
associated with its prior owner. Hand the device to DomIO temporarily,
until the assignment step has completed successfully. Remove the PI
hooks from the source domain then earlier as well.
Failure of the recovery reassign_device_ownership() may not go silent:
There e.g. may still be left over RMRR mappings in the domain assignment
to which has failed, and hence we can't allow that domain to continue
executing.
NOTE: This patch also includes one printk() cleanup; the
"XEN_DOMCTL_assign_device: " tag is dropped in iommu_do_pci_domctl(),
since similar printk()-s elsewhere also don't log such a tag.
This is XSA-302.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 319f9a0ba94c7db505cd5dd9cb0b037ab1aa8e12
master date: 2019-10-31 16:20:05 +0100
Julien Grall [Mon, 4 Nov 2019 13:52:06 +0000 (14:52 +0100)]
xen/arm: p2m: Don't check the return of p2m_get_root_pointer() with BUG_ON()
It turns out that the BUG_ON() was actually reachable with well-crafted
hypercalls. The BUG_ON() is here to prevent catch logical error, so
crashing Xen is a bit over the top.
While all the holes should now be fixed, it would be better to downgrade
the BUG_ON() to something less fatal to prevent any more DoS.
The BUG_ON() in p2m_get_entry() is now replaced by ASSERT_UNREACHABLE()
to catch mistake in debug build and return INVALID_MFN for production
build. The interface also requires to set page_order to give an idea of
the size of "hole". So 'level' is now set so we report a hole of size of
the an entry of the root page-table. This stays inline with what happen
when the GFN is higher than p2m->max_mapped_gfn.
The BUG_ON() in p2m_resolve_translation_fault() is now replaced by
ASSERT_UNREACHABLE() to catch mistake in debug build and just report a
fault for producion build.
Julien Grall [Mon, 4 Nov 2019 13:51:47 +0000 (14:51 +0100)]
xen/arm: p2m: Avoid off-by-one check on p2m->max_mapped_gfn
The code base is using inconsistently the field p2m->max_mapped_gfn.
Some of the useres expect that p2m->max_guest_gfn contain the highest
mapped GFN while others expect highest + 1.
p2m->max_guest_gfn is set as highest + 1, because of that the sanity
check on the GFN in p2m_resolved_translation_fault() and
p2m_get_entry() can be bypassed when GFN == p2m->max_guest_gfn.
p2m_get_root_pointer(p2m->max_guest_gfn) may return NULL if it is
outside of address range supported and therefore the BUG_ON() could be
hit.
The current value hold in p2m->max_mapped_gfn is inconsistent with the
expectation of the common code (see domain_get_maximum_gpfn()) and also
the documentation of the field.
Rather than changing the check in p2m_translation_fault() and
p2m_get_entry(), p2m->max_mapped_gfn is now containing the highest
mapped GFN and the callers assuming "highest + 1" are now adjusted.
Take the opportunity to use 1UL rather than 1 as page_order could
theoritically big enough to overflow a 32-bit integer.
Lastly, the documentation of the field max_guest_gfn to reflect how it
is computed.
Julien Grall [Mon, 4 Nov 2019 13:51:21 +0000 (14:51 +0100)]
xen/arm: p2m: Avoid aliasing guest physical frame
The P2M helpers implementation is quite lax and will end up to ignore
the unused top bits of a guest physical frame.
This effectively means that p2m_set_entry() will create a mapping for a
different frame (it is always equal to gfn & (mask unused bits)). Yet
p2m->max_mapped_gfn will be updated using the original frame.
At the moment, p2m_get_entry() and p2m_resolve_translation_fault()
assume that p2m_get_root_pointer() will always return a non-NULL pointer
when the GFN is smaller than p2m->max_mapped_gfn.
Unfortunately, because of the aliasing described above, it would be
possible to set p2m->max_mapped_gfn high enough so it covers frame that
would lead p2m_get_root_pointer() to return NULL.
As we don't sanity check the guest physical frame provided by a guest, a
malicious guest could craft a series of hypercalls that will hit the
BUG_ON() and therefore DoS Xen.
To prevent aliasing, the function p2m_get_root_pointer() is now reworked
to return NULL If any of the unused top bits are not zero. The caller
can then decide what's the appropriate action to do. Since the two paths
(i.e. P2M_ROOT_PAGES == 1 and P2M_ROOT_PAGES != 1) are now very
similarly, take the opportunity to consolidate them making the code a
bit simpler.
With this change, p2m_get_entry() will not try to insert a mapping as
the root pointer is invalid.
Note that root_table is now switch to unsigned long as unsigned int is
not enough to hold part of a GFN.
George Dunlap [Mon, 4 Nov 2019 13:50:47 +0000 (14:50 +0100)]
x86/mm: Don't drop a type ref unless you held a ref to begin with
Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible. This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately. Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count. During de-validation,
put_page_type() is called on partially validated entries.
Unfortunately, there are a number of issues with the current algorithm.
First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page. Some examples are listed in the
appendix.
The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.
What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated
Fix this by telling put_page_type() which of the two activities you
intend.
When cleaning up a partial de/validation, take no action unless you
find a page partially validated.
If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
In put_page_from_lNe, pass partial_flags on to _put_page_type().
old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries). Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
While here, delete stray trailing whitespace.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:
Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.
P1: PIN_L3_TABLE
A -> PGT_l3_table | 1 | valid
B -> PGT_l2_table | 1 | valid
P1: UNPIN_TABLE
> Arrange to interrupt after B has been de-validated
B:
type_info -> PGT_l2_table | 0
A:
type_info -> PGT_l3_table | 1 | partial
nr_validated_enties -> (less than x)
P2: mod_l4_entry to point to A
> Arrange for this to be interrupted while B is being validated
B:
type_info -> PGT_l2_table | 1 | partial
(nr_validated_entires &c set as appropriate)
A:
type_info -> PGT_l3_table | 1 | partial
nr_validated_entries -> x
partial_pte = 1
P3: mod_l3_entry some other unrelated l3 to point to B:
B:
type_info -> PGT_l2_table | 1
P1: Restart UNPIN_TABLE
At this point, since A.nr_validate_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3
A similar issue arises with old_guest_table. Consider the following
scenario:
Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.
V1: PIN_L2_TABLE(A)
<Validate until we try to validate [x], get -EINVAL>
A -> PGT_l2_table | 1 | PGT_partial
V1 -> old_guest_table = A
<delayed>
V2: PIN_L2_TABLE(A)
<Pick up where V1 left off, try to re-validate [x], get -EINVAL>
A -> PGT_l2_table | 1 | PGT_partial
V2 -> old_guest_table = A
<restart>
put_old_guest_table()
_put_page_type(A)
A -> PGT_l2_table | 0
Indeed, it is possible to engineer for old_guest_table for every vcpu
a guest has to point to the same page.
master commit: c40b33d72630dcfa506d6fd856532d6152cb97dc
master date: 2019-10-31 16:16:37 +0100
George Dunlap [Mon, 4 Nov 2019 13:50:22 +0000 (14:50 +0100)]
x86/mm: Fix nested de-validation on error
If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.
The invariant for partial pages should be:
* Entries [0, nr_validated_ptes) should be completely validated;
put_page_type() will de-validate these.
* If [nr_validated_ptes] is partially validated, partial_flags should
set PTF_partiaL_set. put_page_type() will be called on this page to
finish off devalidation, and the appropriate refcount adjustments
will be done.
alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.
Unfortunately, this is mishandled.
Take the case where validating lNe[x] returns an error.
First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be. nr_validated_ptes
is set to x; and since PFT_partial_set clear, de-validation resumes at
nr_validated_ptes-1. This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2. (Any other page type will
fail.)
Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1. When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if the put_page_type() is never called, but instead
get_page_type() is called, validation will pick up at [x+1],
neglecting to validate [x]. If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.
Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.
While here, add some safety catches:
- old_guest_table must point to the page contained in
[nr_validated_ptes].
- alloc_l1_page shouldn't set old_guest_table
If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question. If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop. Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3c15a2d8cc1981f369cc9542f028054d0dfb325b
master date: 2019-10-31 16:16:13 +0100
George Dunlap [Mon, 4 Nov 2019 13:50:00 +0000 (14:50 +0100)]
x86/mm: Properly handle linear pagetable promotion failures
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.
Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped. (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.) This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.
Fix this by setting page->partial_flags to the partial_flags local
variable.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix
Suppose A and B can both be promoted to L2 pages, and A[x] points to B.
V1: MOD_L3_ENTRY pointing something to A.
In the process of validating A[x], grab an extra type / ref on B:
B.type_count = 2 | PGT_validated
B.count = 3 | PGC_allocated
A.type_count = 1 | PGT_validated
A.count = 2 | PGC_allocated
V1: MOD_L3_ENTRY removing the reference to A.
De-validate A, down to A[x], which points to B.
Drop the final type on B. Arrange to be interrupted.
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = -1
V2: MOD_L3_ENTRY adds a reference to A.
At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
master commit: 2f126247ef49c2ba52bae29a2ab371059ede67c0
master date: 2019-10-31 16:15:48 +0100
George Dunlap [Mon, 4 Nov 2019 13:49:01 +0000 (14:49 +0100)]
x86/mm: Always retain a general ref on partial
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.
PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PFT_partial_set is set but the page in
question has been assigned to another domain. A sketch is provided in
the appendix.
Fix this by having the parent page table entry hold a general
reference count whenever PFT_partial_set is set. (For clarity of
change, keep two separate flags. These will be collapsed in a
subsequent changeset.)
This has two basic implications. On the put_page_from_lNe() side,
this mean that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.
Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.
(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)
On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.
NB that since l1 promotion can't be preempted, that get_page_from_l2e
can't return -ERESTART.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
foreign domain
Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B. B has
PGC_allocated set but no other general references.
V1: PIN_L3 A.
A is validated, B is validated.
A.type_count = 1 | PGT_validated | PGT_pinned
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated (A[x] holds a general ref)
V1: UNPIN A.
A begins de-validation.
Arrange to be interrupted when i < x
V1->old_guest_table = A
V1->old_guest_table_ref_held = false
A.type_count = 1 | PGT_partial
A.nr_validated_entries = i < x
B.type_count = 0
B.count = 1 | PGC_allocated
V2: MOD_L4_ENTRY to point some l4e to A.
Picks up re-validation of A.
Arrange to be interrupted halfway through B's validation
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = PTF_partial_set
V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
Validates B.
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated ("other l3e" holds a general ref)
V3: MOD_L3_ENTRY to clear l3e pointing to B.
Devalidates B.
B.type_count = 0
B.count = 1 | PGC_allocated
V3: decrease_reservation(B)
Clears PGC_allocated
B.count = 0 => B is freed
B gets assigned to a different domain
V1: Restarts UNPIN of A
put_old_guest_table(A)
...
free_l3_table(A)
Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.
If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happend on A[x].
master commit: 18b0ab697830a46ce3dacaf9210799322cb3732c
master date: 2019-10-31 16:14:36 +0100
George Dunlap [Mon, 4 Nov 2019 13:48:41 +0000 (14:48 +0100)]
x86/mm: Have alloc_l[23]_table clear partial_flags when preempting
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.
PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
Unfortunately, when alloc_l[23]_table check hypercall_preempt_check()
and return -ERESTART, they set nr_entries_validated, but don't clear
partial_flags.
If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated. This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.
Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].
In a sense, the real issue here is code duplication. Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.
Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ff0b9a5d69b744a99e8bbeac820a985db5a3bf8e
master date: 2019-10-31 16:14:14 +0100
Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.
The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.
No functional change intended.
NB that alloc_lN_table() mishandle the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2aab06d742e13d7a9d248f1fc7f0ec62b295ada1
master date: 2019-10-31 16:13:23 +0100