Anthony PERARD [Tue, 1 Jun 2021 10:28:03 +0000 (11:28 +0100)]
tools/firmware/ovmf: Use OvmfXen platform file if it exists
A platform introduced in EDK II named OvmfXen is now the one to use for
Xen instead of OvmfX64. It comes with PVH support.
Also, the Xen support in OvmfX64 is deprecated; see the
"deprecation notice: *dynamic* multi-VMM (QEMU vs. Xen) support in OvmfPkg":
https://edk2.groups.io/g/devel/message/75498
Jan Beulich [Wed, 25 Aug 2021 14:04:22 +0000 (16:04 +0200)]
gnttab: fix array capacity check in gnttab_get_status_frames()
The number of grant frames is of no interest here; converting the passed
in op.nr_frames this way means we allow for 8 times as many GFNs to be
written as actually fit in the array. We would corrupt xlat areas of
higher vCPU-s (after having faulted many times while trying to write to
the guard pages between any two areas) for 32-bit PV guests. For HVM
guests we'd simply crash as soon as we hit the first guard page, as
accesses to the xlat area there are done via plain memcpy().
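In sketch form (a simplified rendering of the fix, not the verbatim
gnttab_get_status_frames() hunk), the bound has to be applied to the raw
status frame count rather than to a converted one:

    /* Before (buggy, sketch): op.nr_frames already counts status frames,
     * so running it through a grant-to-status conversion shrinks it by
     * the factor of 8 mentioned above, making the bound far too lax. */
    if ( unlikely(limit_max < grant_to_status_frames(op.nr_frames)) )
        goto unlock;

    /* After (sketch): compare the raw status frame count to the capacity. */
    if ( unlikely(limit_max < op.nr_frames) )
        goto unlock;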
This is CVE-2021-28699 / XSA-382.
Fixes: 18b1be5e324b ("gnttab: make resource limits per domain")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: ec820035b875cdbedce5e73f481ce65963ede9ed
master date: 2021-08-25 14:19:09 +0200
Jan Beulich [Wed, 25 Aug 2021 14:03:49 +0000 (16:03 +0200)]
gnttab: replace mapkind()
mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.
To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.
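The encoding idea can be illustrated with a minimal sketch (names and the
exact bit split are assumptions for illustration, not the actual Xen
helpers): the low bit tags the word as a non-pointer, and the remaining
bits hold the two counters.

    /* Pack an r/o and an r/w mapping count into one pointer-sized word.
     * Bit 0 marks the value as "not a real pointer"; the remaining bits
     * split evenly, giving 2^31 - 1 per counter on 64-bit builds and
     * 2^15 - 1 on 32-bit ones. */
    #define CNT_BITS ((sizeof(unsigned long) * 8 - 1) / 2)
    #define CNT_MASK ((1UL << CNT_BITS) - 1)

    static inline void *counts_to_ptr(unsigned long ro, unsigned long rw)
    {
        return (void *)((((ro << CNT_BITS) | rw) << 1) | 1);
    }

    static inline void ptr_to_counts(void *p, unsigned long *ro,
                                     unsigned long *rw)
    {
        unsigned long v = (unsigned long)p >> 1;

        *rw = v & CNT_MASK;
        *ro = v >> CNT_BITS;
    }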
As a consequence, locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.
As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.
Jan Beulich [Wed, 25 Aug 2021 14:03:28 +0000 (16:03 +0200)]
gnttab: add preemption check to gnttab_release_mappings()
A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).
Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.
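As a sketch (a hypothetical simplification, not the exact
gnttab_release_mappings() code), the pattern looks like this, with the
maptrack limit doubling as the continuation cursor:

    /* Walk the maptrack table backwards, checking for preemption every
     * 32 entries. Recording progress in maptrack_limit means a hypercall
     * continuation can simply restart and resume where it stopped. */
    for ( handle = gt->maptrack_limit; handle; )
    {
        if ( !(handle & 31) && hypercall_preempt_check() )
        {
            gt->maptrack_limit = handle;   /* restart information */
            return -ERESTART;
        }
        --handle;
        /* ... release the grant mapping tracked by entry 'handle' ... */
    }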
In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.
This is part of CVE-2021-28698 / XSA-380.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: b1ee10be5625b7d502cef1e6ee3818610ab0d29c
master date: 2021-08-25 14:18:18 +0200
Jan Beulich [Wed, 25 Aug 2021 14:03:05 +0000 (16:03 +0200)]
x86/mm: widen locked region in xenmem_add_to_physmap_one()
For pages which can be made part of the P2M by the guest, but which can
also later be de-allocated (grant table v2 status pages being the
present example), it is imperative that they be mapped at no more than a
single GFN. We therefore need to make sure that of two parallel
XENMAPSPACE_grant_table requests for the same status page one completes
before the second checks at which other GFN the underlying MFN is
presently mapped.
Pull ahead the respective get_gfn() and push down the respective
put_gfn(). This leverages that gfn_lock() really aliases p2m_lock(), but
the function makes this assumption already anyway: In the
XENMAPSPACE_gmfn case lock nesting constraints for both involved GFNs
would otherwise need to be enforced to avoid ABBA deadlocks.
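A sketch of the resulting shape (illustrative; the real
xenmem_add_to_physmap_one() differs in details):

    /* Take the GFN (p2m) lock before looking up where the MFN is
     * currently mapped, and hold it until the new mapping is in place,
     * so two parallel requests for the same status page serialize. */
    (void)get_gfn(d, gfn_x(gpfn), &p2mt);       /* pulled ahead */

    old_gpfn = get_gpfn_from_mfn(mfn_x(mfn));   /* now stable under lock */
    if ( old_gpfn != gfn_x(gpfn) )
        guest_physmap_remove_page(d, _gfn(old_gpfn), mfn, PAGE_ORDER_4K);
    rc = guest_physmap_add_page(d, gpfn, mfn, PAGE_ORDER_4K);

    put_gfn(d, gfn_x(gpfn));                    /* pushed down */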
Jan Beulich [Wed, 25 Aug 2021 14:02:38 +0000 (16:02 +0200)]
x86/p2m: guard (in particular) identity mapping entries
Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).
As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.
Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.
Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.
This is CVE-2021-28694 / part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 753cb68e653002e89fdcd1c80e52905fdbfb78cb
master date: 2021-08-25 14:17:32 +0200
Jan Beulich [Wed, 25 Aug 2021 14:02:17 +0000 (16:02 +0200)]
x86/p2m: introduce p2m_is_special()
Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").
Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.
This is part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 0bf755e2c856628e11e93c76c3e12974e9964638
master date: 2021-08-25 14:17:07 +0200
Jan Beulich [Wed, 25 Aug 2021 14:01:59 +0000 (16:01 +0200)]
AMD/IOMMU: re-arrange exclusion range and unity map recording
The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)
First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.
With this various local variables become unnecessary and hence get
dropped at the same time.
With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.
Further don't make use of the exclusion range unless ACPI data says so.
Note that exclusion range registration in
register_range_for_all_devices() is on a best effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be done later imo.
Also adjust types where suitable without touching extra lines.
This is part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 8ea80530cd0dbb8ffa7ac92606a3ee29663fdc93
master date: 2021-08-25 14:16:46 +0200
Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.
De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.
This is CVE-2021-28696 / part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 899272539cbe1acda736a850015416fff653a1b6
master date: 2021-08-25 14:16:26 +0200
Jan Beulich [Wed, 25 Aug 2021 14:01:16 +0000 (16:01 +0200)]
IOMMU: generalize VT-d's tracking of mapped RMRR regions
In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.
Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).
Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().
This is part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: c0e19d7c6c42f0bfccccd96b4f7b03b5515e10fc
master date: 2021-08-25 14:15:57 +0200
Jan Beulich [Wed, 25 Aug 2021 14:00:28 +0000 (16:00 +0200)]
AMD/IOMMU: correct device unity map handling
Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
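The overlap test itself is simple; a minimal, self-contained sketch
(hypothetical helper name, inclusive bounds assumed):

    #include <stdbool.h>
    #include <stdint.h>

    /* Two inclusive address ranges [s1, e1] and [s2, e2] overlap iff
     * each one starts no later than the other ends. */
    static bool ranges_overlap(uint64_t s1, uint64_t e1,
                               uint64_t s2, uint64_t e2)
    {
        return s1 <= e2 && s2 <= e1;
    }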
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.
At this occasion also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.
This is part of XSA-378 / CVE-2021-28695.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 34750a3eb022462cdd1c36e8ef9049d3d73c824c
master date: 2021-08-25 14:15:11 +0200
Jan Beulich [Wed, 25 Aug 2021 14:00:06 +0000 (16:00 +0200)]
AMD/IOMMU: correct global exclusion range extending
Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.
Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.
Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).
This is part of XSA-378 / CVE-2021-28695.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: b02c5c88982411be11e3413159862f255f1f39dc
master date: 2021-08-25 14:12:13 +0200
Jan Beulich [Wed, 25 Aug 2021 13:59:28 +0000 (15:59 +0200)]
x86/p2m: don't assert that the passed in MFN matches for a remove
guest_physmap_remove_page() gets handed an MFN from the outside, yet
takes the necessary lock to prevent further changes to the GFN <-> MFN
mapping itself. While some callers, in particular guest_remove_page()
(by way of having called get_gfn_query()), hold the GFN lock already,
various others (most notably perhaps the 2nd instance in
xenmem_add_to_physmap_one()) don't. While it also is an option to fix
all the callers, deal with the issue in p2m_remove_page() instead:
Replace the ASSERT() by a conditional and split the loop into two, such
that all checking gets done before any modification would occur.
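A hypothetical, self-contained simplification of the split-loop idea (all
names invented for illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Pass 1 only checks, pass 2 only modifies: a stale caller-supplied
     * MFN can no longer leave behind a partially removed range. */
    static int remove_range(uint64_t *gfn_to_mfn, size_t count,
                            uint64_t first_mfn)
    {
        size_t i;

        for ( i = 0; i < count; i++ )
            if ( gfn_to_mfn[i] != first_mfn + i )
                return -1;               /* mismatch: nothing modified yet */

        for ( i = 0; i < count; i++ )
            gfn_to_mfn[i] = UINT64_MAX;  /* stand-in for INVALID_MFN */

        return 0;
    }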
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c65ea16dbcafbe4fe21693b18f8c2a3c5d14600e
master date: 2020-04-03 10:56:55 +0200
Jan Beulich [Wed, 25 Aug 2021 13:59:13 +0000 (15:59 +0200)]
x86/p2m: don't ignore p2m_remove_page()'s return value
It's not very nice to return from guest_physmap_add_entry() after
perhaps already having made some changes to the P2M, but this is pre-
existing practice in the function, and imo better than ignoring errors.
Take the liberty and replace an mfn_add() instance with a local variable
already holding the result (as proven by the check immediately ahead).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a6b051a87a586347969bfbaa6925ac0f0c845413
master date: 2020-04-03 10:56:10 +0200
Jan Beulich [Wed, 25 Aug 2021 13:58:58 +0000 (15:58 +0200)]
x86/p2m: fix PoD accounting in guest_physmap_add_entry()
The initial observation was that the mfn_valid() check comes too late:
Neither mfn_add() nor mfn_to_page() (let alone de-referencing the
result of the latter) are valid for MFNs failing this check. Move it up
and - noticing that there's no caller doing so - also add an assertion
that this should never produce "false" here.
In turn this would have meant that the "else" to that if() could now go
away, which didn't seem right at all. And indeed, considering callers
like memory_exchange() or various grant table functions, the PoD
accounting should have been outside of that if() from the very
beginning.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: aea270e3f7c0db696c88a0e94b1ece7abd339c84
master date: 2020-02-21 17:14:38 +0100
Jan Beulich [Wed, 25 Aug 2021 13:58:18 +0000 (15:58 +0200)]
x86: work around build issue with GNU ld 2.37
I suspect it is commit 40726f16a8d7 ("ld script expression parsing")
which broke the hypervisor build, by no longer accepting section names
with a dash in them inside ADDR() (and perhaps other script directives
expecting just a section name, not an expression): .note.gnu.build-id
is such a section.
Quoting all section names passed to ADDR() via DECL_SECTION() works
around the regression.
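The workaround boils down to stringifying the section name inside the
linker script macro so ld sees it quoted (a sketch along the lines of
xen.lds.S's DECL_SECTION(); the exact macro may differ):

    /* #x turns the bare section name into a quoted string, which GNU ld
     * 2.37 still accepts inside ADDR() even when the name contains '-'. */
    #define DECL_SECTION(x) x : AT(ADDR(#x) - __XEN_VIRT_START)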
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 58ad654ebce7ccb272a3f4f3482c03aaad850d31
master date: 2021-07-27 15:03:29 +0100
Jan Beulich [Wed, 25 Aug 2021 13:57:32 +0000 (15:57 +0200)]
x86: make hypervisor build with gcc11
Gcc 11 looks to make incorrect assumptions about valid ranges that
pointers may be used for addressing when they are derived from e.g. a
plain constant. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100680.
Utilize RELOC_HIDE() to work around the issue, which for x86 manifests
in at least
- mpparse.c:efi_check_config(),
- tboot.c:tboot_probe(),
- tboot.c:tboot_gen_frametable_integrity(),
- x86_emulate.c:x86_emulate() (at -O2 only).
The last case is particularly odd not just because it only triggers at
higher optimization levels, but also because it only affects one of at
least three similar constructs. Various "note" diagnostics claim the
valid index range to be [0, 2⁶³-1].
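RELOC_HIDE() is the Linux-inherited laundering macro; its definition is
along these lines:

    /* The empty asm makes the pointer's origin opaque to the optimizer,
     * so gcc can no longer derive object-size or index-range assumptions
     * from the constant the pointer was computed from. */
    #define RELOC_HIDE(ptr, off)                        \
        ({ unsigned long ptr_;                          \
           __asm__ ("" : "=r" (ptr_) : "0" (ptr));      \
           (typeof(ptr))(ptr_ + (off)); })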
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 722f59d38c710a940ab05e542a83020eb5546dea
master date: 2021-05-27 14:40:29 +0200
Andrew Cooper [Thu, 20 May 2021 00:21:39 +0000 (01:21 +0100)]
x86/spec-ctrl: Mitigate TAA after S3 resume
The user chosen setting for MSR_TSX_CTRL needs restoring after S3.
All APs get the correct setting via start_secondary(), but the BSP was missed
out.
This is XSA-377 / CVE-2021-28690.
Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8cf276cb2e0b99b96333865873f56b0b31555ff1)
Andrew Cooper [Thu, 11 Mar 2021 14:39:11 +0000 (14:39 +0000)]
x86/spec-ctrl: Protect against Speculative Code Store Bypass
Modern x86 processors have far-better-than-architecturally-guaranteed self
modifying code detection. Typically, when a write hits an instruction in
flight, a Machine Clear occurs to flush stale content in the frontend and
backend.
For self modifying code, before a write which hits an instruction in flight
retires, the frontend can speculatively decode and execute the old instruction
stream. Speculation of this form can suffer from type confusion in registers,
and potentially leak data.
Furthermore, updates are typically byte-wise, rather than atomic. Depending
on timing, speculation can race ahead multiple times between individual
writes, and execute the transiently-malformed instruction stream.
Xen has stubs which are used in certain cases for emulation purposes. Inhibit
speculation between updating the stub and executing it.
This is XSA-375 / CVE-2021-0089.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f59ed8865318bb0356954bad067f329677ce9e)
Jan Beulich [Tue, 8 Jun 2021 18:16:07 +0000 (19:16 +0100)]
AMD/IOMMU: drop command completion timeout
First and foremost - such timeouts were not signaled to callers, making
them believe they're fine to e.g. free previously unmapped pages.
Mirror VT-d's behavior: A fixed number of loop iterations is not a
suitable way to detect timeouts in a manner independent of the environment
(CPU and bus speeds) anyway. Furthermore, leaving an in-progress operation
pending when it appears to take too long is problematic: If a command
completed later, the signaling of its completion may instead be
understood to signal a subsequently started command's completion.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area. Allow callers to specify
a non-default timeout bias for this logging, using the same values as
VT-d does, which in particular means a (by default) much larger value
for device IO TLB invalidation.
Jan Beulich [Tue, 8 Jun 2021 18:15:57 +0000 (19:15 +0100)]
AMD/IOMMU: wait for command slot to be available
No caller cared about send_iommu_command() indicating unavailability of
a slot. Hence if a sufficient number of prior commands timed out, we did
blindly assume that the requested command was submitted to the IOMMU
when really it wasn't. This could mean both a hanging system (waiting
for a command to complete that was never seen by the IOMMU) or blindly
propagating success back to callers, making them believe they're fine
to e.g. free previously unmapped pages.
Fold the three involved functions into one, add spin waiting for an
available slot along the lines of VT-d's qinval_next_index(), and as a
consequence drop all error indicator return types/values.
Jan Beulich [Tue, 8 Jun 2021 18:15:39 +0000 (19:15 +0100)]
VT-d: eliminate flush related timeouts
Leaving an in-progress operation pending when it appears to take too
long is problematic: If e.g. a QI command completed later, the write to
the "poll slot" may instead be understood to signal a subsequently
started command's completion. Also our accounting of the timeout period
was actually wrong: We included the time it took for the command to
actually make it to the front of the queue, which could be heavily
affected by guests other than the one for which the flush is being
performed.
Do away with all timeout detection on all flush related code paths.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area.
Additionally log (once) if qinval_next_index() didn't immediately find
an available slot. Together with the earlier change sizing the queue(s)
dynamically, we should now have a guarantee that with our fully
synchronous model any demand for slots can actually be satisfied.
Jan Beulich [Tue, 8 Jun 2021 18:15:18 +0000 (19:15 +0100)]
AMD/IOMMU: size command buffer dynamically
With the present synchronous model, we need two slots for every
operation (the operation itself and a wait command). There can be one
such pair of commands pending per CPU. To ensure that under all normal
circumstances a slot is always available when one is requested, size the
command ring according to the number of present CPUs.
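A sketch of the sizing rule shared with the VT-d change below (hypothetical
helper; the real code additionally has to respect the ring sizes the
hardware supports):

    #include <stdint.h>

    /* Two slots (command + wait) can be in flight per present CPU, so
     * the ring needs at least 2 * nr_cpus entries, rounded up to a
     * power of two as rings typically require. */
    static uint32_t ring_entries(uint32_t nr_present_cpus)
    {
        uint32_t want = 2 * nr_present_cpus, size = 1;

        while ( size < want )
            size <<= 1;

        return size;
    }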
Jan Beulich [Tue, 8 Jun 2021 18:15:06 +0000 (19:15 +0100)]
VT-d: size qinval queue dynamically
With the present synchronous model, we need two slots for every
operation (the operation itself and a wait descriptor). There can be
one such pair of requests pending per CPU. To ensure that under all
normal circumstances a slot is always available when one is requested,
size the queue ring according to the number of present CPUs.
Ian Jackson [Tue, 9 Mar 2021 15:00:47 +0000 (15:00 +0000)]
SUPPORT.md: Document speculative attacks status of non-shim 32-bit PV
This documents, but does not fix, XSA-370.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Edwin Török [Fri, 15 Jan 2021 19:38:58 +0000 (19:38 +0000)]
tools/oxenstored: mkdir conflicts were sometimes missed
Due to how set_write_lowpath was used here it didn't detect create/delete
conflicts. When we create an entry we must mark our parent as modified
(this is what creating a new node via write does).
Otherwise we can have two transactions, one creating and another deleting a
node, both succeeding depending on timing. Or one transaction could read an
entry, conclude it doesn't exist, do some other work based on that
information, and successfully commit even if another transaction creates the
node via mkdir meanwhile.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit 45dee7d92b493bb531e7e77a6f9c0180ab152f87)
Edwin Török [Fri, 15 Jan 2021 19:28:37 +0000 (19:28 +0000)]
tools/oxenstored: Reject invalid watch paths early
Watches on invalid paths were accepted, but they would never trigger. The
client also got no notification that its watch is bad and would never trigger.
Found again by the structured fuzzer, due to an error on live update reload:
the invalid watch paths would get rejected during live update and the list of
watches would be different pre/post live update.
The testcase is watch on `//`, which is an invalid path.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit dc8caf214fb882546b0e93317b9828247a7c9da8)
Edwin Török [Fri, 15 Jan 2021 19:11:32 +0000 (19:11 +0000)]
tools/oxenstored: Fix quota calculation for mkdir EEXIST
We increment the domain's quota on mkdir even when the node already exists.
This results in a quota inconsistency after live update, where reconstructing
the tree from scratch results in a different quota.
Not a security issue because the domain uses up quota faster, so it will only
get a Quota error sooner than it should.
Found by the structured fuzzer.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit c8b96708252a436da44005307f7c195d699bd7c5)
Edwin Török [Fri, 8 Jan 2021 11:57:37 +0000 (11:57 +0000)]
tools/oxenstored: Trim txhistory on xenbus reconnect
There is a global history, containing transactions from the past 0.05s, which
gets trimmed whenever any transaction commits or aborts. Destroying a domain
will cause xenopsd to perform some transactions deleting the tree, so that is
fine. But I think that a domain can abuse the xenbus reconnect facility to
cause a large history to be recorded - provided that no one does any
transactions on the system in between, which may be difficult to achieve given
squeezed's constant pinging.
The theoretical situation is like this:
- a domain starts a transaction, creates as large a tree as it can, commits
it. Then repeatedly:
- start a transaction, do nothing with it, start a transaction, delete
part of the large tree, write some new unique data there, don't commit
- cause a xenbus reconnect (I think this can be done by writing something
to the ring). This causes all transactions/watches for the connection to
be cleared, but NOT the history; there were no commits, so nobody
trimmed the history, i.e. the history can contain transactions from
more than just the past 0.05s
- loop back and start more transactions, you can keep this up indefinitely
without hitting quotas
Now there is a periodic History.trim running every 0.05s, so I don't think you
can do much damage with it. But let's be safe and trim the transaction
history anyway on reconnect.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 2a47797d1f3b14aab4f0368ab833abd311f94a70)
Edwin Török [Fri, 15 Jan 2021 18:23:10 +0000 (18:23 +0000)]
tools/ocaml/libs/xb: Do not crash after xenbus is unmapped
Xenmmap.unmap sets the address to MAP_FAILED in xenmmap_stubs.c. If due to a
bug there were still references to the Xenbus and we attempt to use it then we
crash. Raise an exception instead of crashing.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 5e317896342d553f0b55f72948bbf93a0f1147d3)
Edwin Török [Wed, 15 Jul 2020 15:10:56 +0000 (16:10 +0100)]
oxenstored: fix ABI breakage introduced in Xen 4.9.0
dbc84d2983969bb47d294131ed9e6bbbdc2aec49 (Xen >= 4.9.0) deleted XS_RESTRICT
from oxenstored, which caused all the following opcodes to be shifted by 1:
reset_watches became off-by-one compared to the C version of xenstored.
Looking at the C code the opcode for reset watches needs:
XS_RESET_WATCHES = XS_SET_TARGET + 2
So add the placeholder `Invalid` in the OCaml<->C mapping list.
(Note that the code here doesn't simply convert the OCaml constructor to
an integer, so we don't need to introduce a dummy constructor).
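For reference, the relevant slice of the C protocol enum (paraphrased from
xen/include/public/io/xs_wire.h) shows why a placeholder is required:

    enum xsd_sockmsg_type
    {
        /* ... earlier opcodes ... */
        XS_SET_TARGET,
        XS_RESTRICT,      /* removed from oxenstored: keep a placeholder */
        XS_RESET_WATCHES, /* == XS_SET_TARGET + 2 */
        /* ... */
    };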
Igor says that with a suitably patched xenopsd to enable watch reset,
we now see `reset watches` during kdump of a guest in xenstored-access.log.
Julien Grall [Mon, 30 Mar 2020 19:21:52 +0000 (20:21 +0100)]
tools/libxc: misc: Mark const the parameter 'keys' of xc_send_debug_keys()
OCaml is using a string to describe the parameter 'keys' of
xc_send_debug_keys(). Since OCaml 4.06.1, String_val() will return a
const char * when using -safe-string. This will result in a build
failure because xc_send_debug_keys() expects a char *.
The function should never modify the parameter 'keys' and therefore the
parameter should be const. Unfortunately, this is not directly possible
because DECLARE_HYPERCALL_BOUNCE() is expecting a non-const variable.
A new macro DECLARE_HYPERCALL_BOUNCE_IN() is introduced and will take
care of const parameters. The first user will be xc_send_debug_keys(), but
this can be used in more places in the future.
Reported-by: Dario Faggioli <dfaggioli@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 2b8079610ec55413613ad071cc81cd9f97232a7e)
Julien Grall [Mon, 30 Mar 2020 17:50:08 +0000 (18:50 +0100)]
tools/ocaml: libxb: Avoid to use String_val() when value is bytes
Commit ec7d54dd1a "ocaml/libs/xb: Use bytes in place of strings for
mutable buffers" switch mutable buffers from string to bytes. However
the C code were still using String_Val() to access them.
While the underlying structure is the same between string and bytes, a
string is meant to be immutable. OCaml 4.06.1 and later will enforce it.
Therefore, it will not be possible to build the OCaml libs when using
-safe-string. This is because String_val() will return a const value.
To avoid plain cast in the code, the code is now switched to use
Bytes_val(). As the macro is not defined in older OCaml version, we need
to provide a stub.
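The compatibility stub can be as small as this (a sketch; upstream OCaml's
own macro is defined in terms of Bp_val() the same way):

    /* Bytes_val appeared in OCaml 4.06; on older headers, provide it.
     * A bytes value's payload is a mutable unsigned char array. */
    #ifndef Bytes_val
    #define Bytes_val(x) ((unsigned char *) Bp_val(x))
    #endif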
Take the opportunity to switch to const the buffer in
ml_interface_write() as it should not be modified.
Reported-by: Dario Faggioli <dfaggioli@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 78686437e949a85a207ae1a0d637efe2d3778bbe)
Julien Grall [Mon, 30 Mar 2020 14:14:23 +0000 (15:14 +0100)]
tools/ocaml: Fix stubs build when OCaml has been compiled with -safe-string
The OCaml code has been fixed to handle properly -safe-string in Xen
4.11, however the stubs part were missed.
On OCaml newer than 4.06.1, String_val() will return a const char *
when using -safe-string, leading to a build failure when this is used
in places where a char * is expected.
The main use in Xen code base is when a new string is allocated. The
suggested approach by the OCaml community [1] is to use the helper
caml_alloc_initialized_string() but it was introduced by OCaml 4.06.1.
The next best approach is to cast String_val() to (char *) as the helper
would have done. So use it when we need to update the new string using
memcpy().
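The pattern in question, as a short self-contained sketch (hypothetical
wrapper; the cast mirrors what caml_alloc_initialized_string() does
internally):

    #include <string.h>
    #include <caml/alloc.h>
    #include <caml/mlvalues.h>

    /* Allocate an OCaml string, then fill it through a (char *) cast on
     * String_val(), which is const under -safe-string. */
    static value alloc_filled_string(const char *src, size_t len)
    {
        value s = caml_alloc_string(len);

        memcpy((char *)String_val(s), src, len);
        return s;
    }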
Take the opportunity to remove the unnecessary cast of the source as
memcpy() is expecting a void *.
[1] https://github.com/ocaml/ocaml/pull/1274
Reported-by: Dario Faggioli <dfaggioli@suse.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 59b087e3954402c487e0abb4ad9bd05f43669436)
Roger Pau Monne [Thu, 18 Feb 2021 12:02:04 +0000 (13:02 +0100)]
x86/ept: fix missing IOMMU flush in atomic_write_ept_entry
Backport of XSA-321 missed a flush in atomic_write_ept_entry when
level was different than 0. Such omission will undermine the fix for
XSA-321, because page table entries cached in the IOMMU can get out
of sync and contain stale entries.
Fix this by slightly re-arranging the code to prevent the early return
when level is different than 0. Note that the early return is just an
optimization because foreign entries cannot have level > 0.
This is XSA-366.
Reported-by: M. Vefa Bicakci <m.v.b@runbox.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 16 Feb 2021 14:39:56 +0000 (15:39 +0100)]
xen/page_alloc: Only flush the page to RAM once we know they are scrubbed
At the moment, each page is flushed to RAM just after the allocator
has found some free pages. However, this happens before checking whether
the page was scrubbed.
As a consequence, on Arm, a guest may be able to access the old content
of the scrubbed pages if it has cache disabled (default at boot) and
the content didn't reach the Point of Coherency.
The flush is now moved after we know the content of the page will not
change. This also has the benefit of reducing the amount of work happening
with the heap_lock held.
This is XSA-364.
Fixes: 307c3be3ccb2 ("mm: Don't scrub pages while holding heap lock in alloc_heap_pages()")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3b1cc15f1931ba56d0ee256fe9bfe65509733b27
master date: 2021-02-16 15:32:08 +0100
Jan Beulich [Thu, 4 Feb 2021 14:41:12 +0000 (15:41 +0100)]
x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL} (again)
X86_VENDOR_* aren't bit masks in the older trees.
Reported-by: James Dingwall <james@dingwall.me.uk>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 15 Dec 2020 13:40:35 +0000 (14:40 +0100)]
evtchn/FIFO: add 2nd smp_rmb() to evtchn_fifo_word_from_port()
Besides with add_page_to_event_array() the function also needs to
synchronize with evtchn_fifo_init_control() setting both d->evtchn_fifo
and (subsequently) d->evtchn_port_ops.
Jan Beulich [Tue, 15 Dec 2020 13:40:12 +0000 (14:40 +0100)]
evtchn/FIFO: re-order and synchronize (with) map_control_block()
For evtchn_fifo_set_pending()'s check of the control block having been
set to be effective, ordering of respective reads and writes needs to be
ensured: The control block pointer needs to be recorded strictly after
the setting of all the queue heads, and it needs checking strictly
before any uses of them (this latter aspect was already guaranteed).
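A sketch of the required pairing (illustrative helper names; the real code
lives in map_control_block() and evtchn_fifo_set_pending()):

    /* Writer: publish the control block pointer only after everything
     * it guards (the queue heads) has been initialized. */
    init_queue_heads(v);
    smp_wmb();
    v->evtchn_fifo = cb;          /* publication */

    /* Reader: check the pointer first, then fence, then use what the
     * writer initialized before publishing. */
    if ( !v->evtchn_fifo )
        return;
    smp_rmb();                    /* pairs with the smp_wmb() above */
    use_queue_heads(v);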
Jan Beulich [Tue, 15 Dec 2020 13:39:33 +0000 (14:39 +0100)]
x86: avoid calling {svm,vmx}_do_resume()
These functions follow the following path: hvm_do_resume() ->
handle_hvm_io_completion() -> hvm_wait_for_io() ->
wait_on_xen_event_channel() -> do_softirq() -> schedule() ->
sched_context_switch() -> continue_running() and hence may
recursively invoke themselves. If this ends up happening a couple of
times, a stack overflow would result.
Prevent this by also resetting the stack at the
->arch.ctxt_switch->tail() invocations (in both places for consistency)
and thus jumping to the functions instead of calling them.
Edwin Török [Tue, 15 Dec 2020 13:38:47 +0000 (14:38 +0100)]
tools/ocaml/xenstored: only Dom0 can change node owner
Otherwise we can give quota away to another domain, either causing it to run
out of quota, or in case of Dom0 use unbounded amounts of memory and bypass
the quota system entirely.
This was fixed in the C version of xenstored in 2006 (c/s db34d2aaa5f5,
predating the XSA process by 5 years).
It was also fixed in the mirage version of xenstore in 2012, with a unit test
demonstrating the vulnerability, but possibly without realising that the
vulnerability still affected the
in-tree oxenstored (added c/s f44af660412 in 2010).
This is XSA-352.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:38:42 +0000 (14:38 +0100)]
tools/ocaml/xenstored: delete watch from trie too when resetting watches
c/s f8c72b526129 "oxenstored: implement XS_RESET_WATCHES" from Xen 4.6
introduced reset watches support in oxenstored by mirroring the change
in cxenstored.
However the OCaml version has some additional data structures to
optimize watch firing, and just resetting the watches in one of the data
structures creates a security bug where a malicious guest kernel can
exceed its watch quota, driving oxenstored into OOM:
* create watches
* reset watches (this still keeps the watches lingering in another data
structure, using memory)
* create some more watches
* loop until oxenstored dies
The guest kernel doesn't necessarily have to be malicious to trigger
this:
* if control/platform-feature-xs_reset_watches is set
* the guest kexecs (e.g. because it crashes)
* on boot more watches are set up
* this will slowly "leak" memory for watches in oxenstored, driving it
towards OOM.
This is XSA-330.
Fixes: f8c72b526129 ("oxenstored: implement XS_RESET_WATCHES")
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools/xenstore: Preserve bad clients until they are destroyed
XenStored will kill any connection that it thinks has misbehaved,
this is currently happening in two places:
* In `handle_input()` if the sanity check on the ring and the message
fails.
* In `handle_output()` when failing to write the response in the ring.
As the domain structure is a child of the connection, XenStored will
destroy its view of the domain when killing the connection. This will
result in sending @releaseDomain event to all the watchers.
As the watch event doesn't carry which domain has been released,
the watcher (such as XenStored) will generally go through the list of
registered domains and check if one of them is shutting down/dying.
In the case of a client misbehaving, the domain will likely be
running, so no action will be performed.
When the domain is effectively destroyed, XenStored will not be aware of
the domain anymore. So the watch event is not going to be sent.
As a consequence, the watchers of the event will not release mappings
they may have on the domain. This will result in a zombie domain.
In order to send @releaseDomain event at the correct time, we want
to keep the domain structure until the domain is effectively
shutting-down/dying.
We also want to keep the connection around so we could possibly revive
the connection in the future.
A new flag 'is_ignored' is added to mark whether a connection should be
ignored when checking if there is work to do. Additionally any
transactions, watches, and buffers associated with the connection will be
freed, as you can't do much with them (restarting the connection will
likely need a reset).
As a side note, when the device model was running in a stubdomain, a
guest would have been able to introduce a use-after-free because there
are two parents for a guest connection.
Juergen Gross [Tue, 15 Dec 2020 13:38:09 +0000 (14:38 +0100)]
tools/xenstore: drop watch event messages exceeding maximum size
By setting a watch with a very large tag it is possible to trick
xenstored into sending watch event messages exceeding the maximum allowed
payload size. This might in turn lead to a crash of xenstored as the
resulting error can cause dereferencing a NULL pointer in case there
is no active request being handled by the guest the watch event is
being sent to.
Fix that by just dropping such watch events. Additionally modify the
error handling to test that the pointer is not NULL before dereferencing
it.
Edwin Török [Tue, 15 Dec 2020 13:38:04 +0000 (14:38 +0100)]
tools/ocaml/xenstored: Fix path length validation
Currently, oxenstored checks the length of paths against 1024, then
prepends "/local/domain/$DOMID/" to relative paths. This allows a domU
to create paths which can't subsequently be read by anyone, even dom0.
This also interferes with listing directories, etc.
Define a new oxenstored.conf entry: quota-path-max, defaulting to 1024
as before. For paths that begin with "/local/domain/$DOMID/" check the
relative path length against this quota. For all other paths check the
entire path length.
This ensures that if the domid changes (and thus the length of a prefix
changes) a path that used to be valid stays valid (e.g. after a
live-migration). It also ensures that regardless how the client tries
to access a path (domid-relative or absolute) it will get consistent
results, since the limit is always applied on the final canonicalized
path.
Delete the unused Domain.get_path to avoid it being confused with
Connection.get_path (which differs by a trailing slash only).
Rewrite Util.path_validate to apply the appropriate length restriction
based on whether the path is relative or not. Remove the check for
connection_path being absolute, because it is not guest controlled data.
This is part of XSA-323.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:37:48 +0000 (14:37 +0100)]
tools/ocaml/xenstored: clean up permissions for dead domains
Domain ids are prone to wrapping (15 bits), and with a sufficient number
of VMs in a reboot loop it is possible to trigger it. Xenstore entries
may linger after a domain dies, until a toolstack cleans them up. During
this time there is a window where a wrapped domid could access these
xenstore keys (that belonged to another VM).
To prevent this do a cleanup when a domain dies:
* walk the entire xenstore tree and update permissions for all nodes
* if the dead domain had an ACL entry: remove it
* if the dead domain was the owner: change the owner to Dom0
This is done without quota checks or a transaction. Quota checks would
be a no-op (either the domain is dead, or it is Dom0 where they are not
enforced). Transactions are not needed, because this is all done
atomically by oxenstored's single thread.
The xenstore entries owned by the dead domain are not deleted, because
that could confuse a toolstack / backends that are still bound to it
(or generate unexpected watch events). It is the responsibility of a
toolstack to remove the xenstore entries themselves.
This is part of XSA-322.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Juergen Gross [Tue, 15 Dec 2020 13:37:43 +0000 (14:37 +0100)]
tools/xenstore: revoke access rights for removed domains
Access rights of Xenstore nodes are per domid. Unfortunately existing
granted access rights are not removed when a domain is being destroyed.
This means that a new domain created with the same domid will inherit
the access rights to Xenstore nodes from the previous domain(s) with
the same domid.
This can be avoided by adding a generation counter to each domain.
The generation counter of the domain is set to the global generation
counter when a domain structure is being allocated. When reading or
writing a node all permissions of domains which are younger than the
node itself are dropped. This is done by flagging the related entry
as invalid in order to avoid modifying permissions in a way the user
could detect.
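The core check is tiny; a hypothetical sketch (invented names):

    #include <stdbool.h>
    #include <stdint.h>

    /* A permission entry naming domid D is only honoured if D's current
     * incarnation is not younger than the node: a domain created after
     * the node cannot have been the original grantee. */
    static bool perm_entry_valid(uint64_t domain_generation,
                                 uint64_t node_generation)
    {
        return domain_generation <= node_generation;
    }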
A special case has to be considered: for a new domain the first
Xenstore entries are already written before the domain is officially
introduced in Xenstore. In order not to drop the permissions for the
new domain a domain struct is allocated even before introduction if
the hypervisor is aware of the domain. This requires adding another
bool "introduced" to struct domain in xenstored. In order to avoid
additional padding holes convert the shutdown flag to bool, too.
As verifying permissions has its price regarding runtime add a new
quota for limiting the number of permissions an unprivileged domain
can set for a node. The default for that new quota is 5.
Edwin Török [Tue, 15 Dec 2020 13:37:15 +0000 (14:37 +0100)]
tools/ocaml/xenstored: add xenstored.conf flag to turn off watch permission checks
There are flags to turn off quotas and the permission system, so add one
that turns off the newly introduced watch permission checks as well.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:37:10 +0000 (14:37 +0100)]
tools/ocaml/xenstored: avoid watch events for nodes without access
Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. set up a
watch for "/" in order to have a detailed record of all Xenstore
modifications.
Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).
Permissions for nodes are looked up either in the old (pre
transaction/command) or current trees (post transaction). If
permissions are changed multiple times in a transaction only the final
version is checked, because considering a transaction atomic the
individual permission changes would not be noticeable to an outside
observer.
Two trees are only needed for set_perms: here we can either notice the
node disappearing (if we lose permission), appearing
(if we gain permission), or changing (if we preserve permission).
RM needs to only look at the old tree: in the new tree the node would be
gone, or could have different permissions if it was recreated (the
recreation would get its own watch fired).
Inside a tree we lookup the watch path's parent, and then the watch path
child itself. This gets us 4 sets of permissions in worst case, and if
either of these allows a watch, then we permit it to fire. The
permission lookups are done without logging the failures, otherwise we'd
get confusing errors about permission denied for some paths, but a watch
still firing. The actual result is logged in xenstored-access log:
'w event ...' as usual if watch was fired
'w notfired...' if the watch was not fired, together with path and
permission set to help in troubleshooting
Adding a watch bypasses permission checks and always fires the watch
once immediately. This is consistent with the specification, and no
information is gained (the watch is fired whether the path exists or
doesn't, and whether you have access or not, i.e. it reflects the
path a domain gave it back to that domain).
There are some semantic changes here:
* Write+rm in a single transaction of the same path is unobservable
now via watches: both before and after a transaction the path
doesn't exist, thus both tree lookups come up with the empty
permission set, and no one, not even Dom0, can see this. This is
consistent with transaction atomicity though.
* Similar to above, if we temporarily grant and then revoke permission
on a path, any watches fired in between are ignored as well
* There is a new log event (w notfired) which shows the permission set
of the path, and the path.
* Watches on paths that a domain doesn't have access to are now not
seen, which is the purpose of the security fix.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:37:05 +0000 (14:37 +0100)]
tools/ocaml/xenstored: introduce permissions for special watches
The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow gaining information
about the presence of other guests on the host. So send watch events for
those watches via privileged connections only.
Start to address this by treating the special watches as regular nodes
in the tree, which gives them normal semantics for permissions. A later
change will restrict the handling, so that they can't be listed, etc.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:37:00 +0000 (14:37 +0100)]
tools/ocaml/xenstored: unify watch firing
This will make it easier to insert additional checks in a follow-up patch.
All watches are now fired from a single function.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:36:56 +0000 (14:36 +0100)]
tools/ocaml/xenstored: check privilege for XS_IS_DOMAIN_INTRODUCED
The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for privileged
domains only (the only user in the tree is the xenpaging daemon).
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:36:52 +0000 (14:36 +0100)]
tools/ocaml/xenstored: ignore transaction id for [un]watch
Today the transaction id for XS_WATCH and XS_UNWATCH commands is tested
for validity, instead of being ignored as documented in
docs/misc/xenstore.txt.
Really ignore the transaction id for XS_WATCH and XS_UNWATCH.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Thu, 11 Jun 2020 14:12:46 +0000 (16:12 +0200)]
tools/xenstore: avoid watch events for nodes without access
Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. set up a
watch for "/" in order to have a detailed record of all Xenstore
modifications.
Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).
This is part of XSA-115.
Signed-off-by: Juergen Gross <jgross@suse.com>
[julieng: Handle rebase conflict]
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Juergen Gross [Thu, 11 Jun 2020 14:12:45 +0000 (16:12 +0200)]
tools/xenstore: allow special watches for privileged callers only
The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow gaining information
about the presence of other guests on the host. So send watch events for
those watches via privileged connections only.
In order to allow for disaggregated setups where e.g. driver domains
need to make use of those special watches add support for calling
"set permissions" for those special nodes, too.
Juergen Gross [Thu, 11 Jun 2020 14:12:43 +0000 (16:12 +0200)]
tools/xenstore: fire watches only when removing a specific node
Instead of firing all watches for removing a subtree in one go, do so
only when the related node is being removed.
The watches for the top-most node being removed include all watches
including that node, while watches for nodes below that are only fired
if they are matching exactly. This avoids firing any watch more than
once when removing a subtree.
Juergen Gross [Thu, 11 Jun 2020 14:12:42 +0000 (16:12 +0200)]
tools/xenstore: rework node removal
Today a Xenstore node is being removed by deleting it from the parent
first and then deleting itself and all its children. This results in
stale entries remaining in the data base in case e.g. a memory
allocation is failing during processing. This would result in the
rather strange behavior to be able to read a node (as its still in the
data base) while not being visible in the tree view of Xenstore.
Fix that by deleting the nodes from the leaf side instead of starting
at the root.
As fire_watches() is now called from _rm() the ctx parameter needs a
const attribute.
Juergen Gross [Thu, 11 Jun 2020 14:12:40 +0000 (16:12 +0200)]
tools/xenstore: simplify and rename check_event_node()
There is no path which allows calling check_event_node() without an
event name. So don't let the result depend on the name being NULL, and
add an assert() covering that case.
Rename the function to check_special_event() to better match the
semantics.
Juergen Gross [Thu, 11 Jun 2020 14:12:39 +0000 (16:12 +0200)]
tools/xenstore: fix node accounting after failed node creation
When a node creation fails the number of nodes of the domain should be
the same as before the failed node creation. In case of failure when
trying to create a node requiring to create one or more intermediate
nodes as well (e.g. when /a/b/c/d is to be created, but /a/b isn't
existing yet) it might happen that the number of nodes of the creating
domain is not reset to the value it had before.
So move the quota accounting out of construct_node() and into the node
write loop in create_node() in order to be able to undo the accounting
in case of an error in the intermediate node destructor.
Juergen Gross [Thu, 11 Jun 2020 14:12:38 +0000 (16:12 +0200)]
tools/xenstore: ignore transaction id for [un]watch
Today the transaction id for XS_WATCH and XS_UNWATCH commands is tested
for validity, instead of being ignored as documented in
docs/misc/xenstore.txt.
Really ignore the transaction id for XS_WATCH and XS_UNWATCH.
Juergen Gross [Thu, 11 Jun 2020 14:12:37 +0000 (16:12 +0200)]
tools/xenstore: allow removing child of a node exceeding quota
An unprivileged user of Xenstore is not allowed to write nodes with a
size exceeding a global quota, while privileged users like dom0 are
allowed to write such nodes. The size of a node is the needed space
to store all node specific data, this includes the names of all
children of the node.
When deleting a node its parent has to be modified by removing the
name of the to be deleted child from it.
This results in the strange situation that an unprivileged owner of a
node might not succeed in deleting that node in case its parent is
exceeding the quota of that unprivileged user (it might have been
written by dom0), as the user is not allowed to write the updated
parent node.
Fix that by not checking the quota when writing a node for the
purpose of removing a child's name only.
The same applies to transaction handling: a node being read during a
transaction is written to the transaction specific area and it should
not be tested for exceeding the quota, as it might not be owned by
the reader and presumably the original write would have failed if the
node is owned by the reader.
Edwin Török [Tue, 15 Dec 2020 13:35:00 +0000 (14:35 +0100)]
tools/ocaml/xenstored: do permission checks on xenstore root
This was lacking in a disappointing number of places.
The xenstore root node is treated differently from all other nodes, because it
doesn't have a parent, and mutation requires changing the parent.
Unfortunately this lead to open-coding the special case for root into every
single xenstore operation, and out of all the xenstore operations only read
did a permission check when handling the root node.
This means that an unprivileged guest can:
* xenstore-chmod / to its liking and subsequently write new arbitrary nodes
there (subject to quota)
* xenstore-rm -r / deletes almost the entire xenstore tree (xenopsd quickly
refills some, but you are left with a broken system)
* DIRECTORY on / lists all children when called through python
bindings (xenstore-ls stops at /local because it tries to list recursively)
* get-perms on / works too, but that is just a minor information leak
Add the missing permission checks, but this should really be refactored to do
the root handling and permission checks on the node only once from a single
function, instead of getting it wrong nearly everywhere.
This is XSA-353.
Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Tue, 1 Dec 2020 16:07:03 +0000 (17:07 +0100)]
xen/events: rework fifo queue locking
Two cpus entering evtchn_fifo_set_pending() for the same event channel
can race in case the first one gets interrupted after setting
EVTCHN_FIFO_PENDING and when the other one manages to set
EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
lead to evtchn_check_pollers() being called before the event is put
properly into the queue, resulting eventually in the guest not seeing
the event pending and thus blocking forever afterwards.
Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
lock") made the race just more obvious, while the fifo event channel
implementation had this race forever since the introduction and use of
per-channel locks, when an unmask operation was running in parallel with
an event channel send operation.
Using a spinlock for the per event channel lock had turned out
problematic because some paths needing to take the lock are called with
interrupts off, so the lock would need to disable interrupts, which in
turn broke some use cases related to vm events.
For avoiding this race the queue locking in evtchn_fifo_set_pending()
needs to be reworked to cover the test of EVTCHN_FIFO_PENDING,
EVTCHN_FIFO_MASKED and EVTCHN_FIFO_LINKED, too. Additionally when an
event channel needs to change queues both queues need to be locked
initially, in order to avoid having a window with no lock held at all.
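In rough sketch form (pthread locks and made-up types standing in for Xen's internals), the two-queue locking idea looks like this:
    #include <pthread.h>

    struct queue { pthread_mutex_t lock; /* ... link head etc. ... */ };

    /* Lock old and new queue in a stable order so two cpus moving the
     * same event can't deadlock, and so there's no window with no lock
     * held while the event changes queues. */
    static void lock_queues(struct queue *q_old, struct queue *q_new)
    {
        if (q_old == q_new) {
            pthread_mutex_lock(&q_old->lock);
        } else if (q_old < q_new) {
            pthread_mutex_lock(&q_old->lock);
            pthread_mutex_lock(&q_new->lock);
        } else {
            pthread_mutex_lock(&q_new->lock);
            pthread_mutex_lock(&q_old->lock);
        }
        /* The PENDING/MASKED/LINKED tests now all happen under the
         * queue lock(s), closing the race described above. */
    }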
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock") Fixes: de6acb78bf0e137c ("evtchn: use a per-event channel lock for sending events") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71ac522909e9302350a88bc378be99affa87067c
master date: 2020-11-30 14:05:39 +0100
Juergen Gross [Tue, 1 Dec 2020 16:06:15 +0000 (17:06 +0100)]
xen/events: access last_priority and last_vcpu_id together
The queue for a fifo event depends on the vcpu_id and the priority
of the event. When sending an event it might happen that the event
needs to change queues, and the old queue needs to be remembered in
order to keep the links between queue elements intact. For this
purpose the event channel contains last_priority and last_vcpu_id
fields for identifying the old queue.
In order to avoid races, always access last_priority and last_vcpu_id
with a single atomic operation, avoiding any inconsistencies.
With the event channel lock no longer disabling interrupts, commit 52e1fc47abc3a0123 ("evtchn/Flask: pre-allocate node on send path") can
be reverted again.
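A minimal sketch of the packed-fields idea, using C11 atomics in place of Xen's write_atomic()/read_atomic() (layout and names are illustrative):
    #include <stdatomic.h>
    #include <stdint.h>

    union last_queue_info {
        struct {
            uint16_t last_vcpu_id;
            uint8_t  last_priority;
        };
        uint32_t word;
    };

    static _Atomic uint32_t last_info;

    static void record_last(uint16_t vcpu_id, uint8_t priority)
    {
        union last_queue_info i = { .word = 0 };

        i.last_vcpu_id = vcpu_id;
        i.last_priority = priority;
        /* One store updates both fields; readers can never observe a
         * mix of old and new values. */
        atomic_store(&last_info, i.word);
    }

    static union last_queue_info read_last(void)
    {
        union last_queue_info i = { .word = atomic_load(&last_info) };

        return i;
    }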
Juergen Gross [Tue, 1 Dec 2020 16:04:43 +0000 (17:04 +0100)]
xen/evtchn: rework per event channel lock
Currently the lock for a single event channel needs to be taken with
interrupts off, which causes deadlocks in some cases.
Rework the per event channel lock to be non-blocking for the case of
sending an event, removing the need to disable interrupts for taking
the lock.
The lock is needed for avoiding races between event channel state
changes (creation, closing, binding) against normal operations (set
pending, [un]masking, priority changes).
Use a rwlock, but with some restrictions:
- Changing the state of an event channel (creation, closing, binding)
needs to use write_lock(), ASSERT()ing that the lock is taken as
writer only when the event channel's state either before or after
the locked region is appropriate (either free or unbound).
- Sending an event mostly needs to use read_trylock(); if the lock
cannot be obtained, the operation is omitted. This is needed as
sending an event can happen with interrupts off (at least in some
cases).
- Dumping the event channel state for debug purposes is using
read_trylock(), too, in order to avoid blocking in case the lock is
taken as writer for a long time.
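Schematically, with pthread primitives standing in for Xen's rwlock (types and names are illustrative, not the actual code):
    #include <pthread.h>
    #include <stdbool.h>

    struct evtchn_demo {
        pthread_rwlock_t lock;
        /* ... channel state ... */
    };

    /* State changes (create/close/bind) take the lock as writer. */
    static void demo_close(struct evtchn_demo *chn)
    {
        pthread_rwlock_wrlock(&chn->lock);
        /* ... transition the channel back to "free" ... */
        pthread_rwlock_unlock(&chn->lock);
    }

    /* Sending only try-locks as reader and gives up on contention, so
     * it can never block, even when called with interrupts off. */
    static bool demo_send(struct evtchn_demo *chn)
    {
        if (pthread_rwlock_tryrdlock(&chn->lock))
            return false;            /* writer active: omit the event */
        /* ... set pending / wake pollers ... */
        pthread_rwlock_unlock(&chn->lock);
        return true;
    }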
Jan Beulich [Tue, 1 Dec 2020 16:03:12 +0000 (17:03 +0100)]
evtchn/fifo: use stable fields when recording "last queue" information
Both evtchn->priority and evtchn->notify_vcpu_id could change behind the
back of evtchn_fifo_set_pending(), as for it - in the case of
interdomain channels - only the remote side's per-channel lock is held.
Neither the queue's priority nor the vCPU's vcpu_id fields have similar
properties, so they seem better suited for the purpose. In particular
they reflect the respective evtchn fields' values at the time they were
used to determine queue and vCPU.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 6f6f07b64cbe90e54f8e62b4d6f2404cf5306536
master date: 2020-10-02 08:37:35 +0200
Andrew Cooper [Thu, 23 Apr 2020 12:05:46 +0000 (13:05 +0100)]
x86/msr: Disallow guest access to the RAPL MSRs
Researchers have demonstrated using the RAPL interface to perform a
differential power analysis attack to recover AES keys used by other cores in
the system.
Furthermore, even privileged guests cannot use this interface correctly, due
to MSR scope and vcpu scheduling issues. The interface would want to be
paravirtualised to be used sensibly.
Disallow access to the RAPL MSRs completely, as well as other MSRs which
potentially access fine grain power information.
This is part of XSA-351.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 6 Oct 2020 16:23:27 +0000 (18:23 +0200)]
x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL}
Currently a PV hardware domain can also be given control over the CPU
frequency, and such a guest is allowed to write to MSR_IA32_PERF_CTL.
However since commit 322ec7c89f6 the default behavior has been changed
to reject accesses to not explicitly handled MSRs, preventing PV
guests that manage CPU frequency from reading
MSR_IA32_PERF_{STATUS/CTL}.
Additionally some HVM guests (Windows at least) will attempt to read
MSR_IA32_PERF_CTL and will panic if given back a #GP fault.
Move the handling of MSR_IA32_PERF_{STATUS/CTL} to the common MSR
handling shared between HVM and PV guests, and add an explicit case
for reads to MSR_IA32_PERF_{STATUS/CTL}.
Restore previous behavior and allow PV guests with the required
permissions to read the contents of the mentioned MSRs. Non-privileged
guests will get 0 when trying to read those registers, as writes to
MSR_IA32_PERF_CTL by such guests are already silently dropped.
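A sketch of the resulting common read path (the helper name and privilege check are illustrative; the MSR indices are the architectural ones):
    #include <stdbool.h>
    #include <stdint.h>

    #define MSR_IA32_PERF_STATUS 0x00000198
    #define MSR_IA32_PERF_CTL    0x00000199

    /* Returns true if the MSR is handled here; *val gets the result. */
    static bool common_rdmsr(uint32_t idx, bool may_control_cpufreq,
                             uint64_t hw_value, uint64_t *val)
    {
        switch (idx) {
        case MSR_IA32_PERF_STATUS:
        case MSR_IA32_PERF_CTL:
            /* Guests allowed to manage CPU frequency see the hardware
             * value; everyone else reads 0 rather than taking #GP. */
            *val = may_control_cpufreq ? hw_value : 0;
            return true;
        }
        return false;
    }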
Fixes: 322ec7c89f6 ('x86/pv: disallow access to unknown MSRs') Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3059178798a23ba870ff86ff54d442a07e6651fc)
Julien Grall [Mon, 9 Nov 2020 20:28:59 +0000 (20:28 +0000)]
xen/arm: Always trap AMU system registers
The Activity Monitors Unit (AMU) has been introduced by ARMv8.4. It
is considered unsafe to expose to guests, as it might leak
information about code executed by other guests or the host.
Arm provided a way to trap all the AMU system registers by setting
CPTR_EL2.TAM to 1.
Unfortunately, on older revisions of the specification, bit 30 (now
CPTR_EL2.TAM) was RES0. Because of that, Xen was setting it to 0 and
therefore the system registers would be exposed to the guest when run
on processors with AMU.
As the bit is marked as UNKNOWN at boot on Armv8.4, the only safe
solution for us is to always set CPTR_EL2.TAM to 1.
Guests trying to access the AMU system registers will now receive an
undefined instruction exception. Unfortunately, this means that even
well-behaved guests may fail to boot because we don't sanitize the ID
registers. This is a known issue with other Armv8.0+ features (e.g.
SVE, Pointer Auth). It will be taken care of separately.
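Conceptually the change boils down to unconditionally OR-ing the bit into the value programmed into CPTR_EL2 (macro and helper names are illustrative, not Xen's):
    #include <stdint.h>

    #define CPTR_EL2_TAM (UINT64_C(1) << 30)  /* trap AMU registers */

    /* Always set TAM: the bit is UNKNOWN at boot on Armv8.4 and was
     * RES0 before, so it must be set explicitly to trap AMU accesses. */
    static uint64_t cptr_el2_init_value(uint64_t flags)
    {
        return flags | CPTR_EL2_TAM;
    }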
Ian Jackson [Wed, 4 Nov 2020 08:37:12 +0000 (09:37 +0100)]
SUPPORT.md: Desupport qemu trad except stub dm
While investigating XSA-335 we discovered that many upstream security
fixes were missing. It is not practical to backport them. There is
no good reason to be running this very ancient version of qemu, except
that it is the only way to run a stub dm which is currently supported
by upstream.
Andrew Cooper [Mon, 19 Oct 2020 14:51:22 +0000 (15:51 +0100)]
x86/pv: Flush TLB in response to paging structure changes
With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.
However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.
Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings. Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.
For all L2 and higher modifications, flush the TLB. PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global. (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)
Express the necessary flushes as a set of booleans which accumulate across the
operation. Comment the flushing logic extensively.
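The accumulate-then-flush pattern, in sketch form (types and helpers are stand-ins, not the actual mm code):
    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for Xen's flush machinery. */
    static void do_tlb_flush(void) { puts("flush TLB"); }

    struct mmu_update_state {
        bool flush_linear_pt;   /* an L2-or-higher entry changed */
    };

    static void note_pte_update(struct mmu_update_state *s,
                                unsigned int level)
    {
        /* Accumulate instead of flushing per update. */
        if (level >= 2)
            s->flush_linear_pt = true;
    }

    static void finish_updates(struct mmu_update_state *s)
    {
        /* One flush at the end covers all accumulated needs. */
        if (s->flush_linear_pt)
            do_tlb_flush();
    }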
This is XSA-286.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
Andrew Cooper [Thu, 22 Oct 2020 10:28:58 +0000 (11:28 +0100)]
x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI
c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().
However, this was unnecessary.
It is the guest's responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way. In this
case, an MMUEXT_OP hypercall will follow. The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.
There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.
Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed. Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes. Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new
mappings put in place before the other vCPU's MMUEXT_OP completes.
Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not. Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching to/from the XPTI "shadow" pagetables.
This is (not really) XSA-286 (but necessary to simplify the logic).
Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
Jan Beulich [Tue, 20 Oct 2020 13:17:11 +0000 (15:17 +0200)]
AMD/IOMMU: ensure suitable ordering of DTE modifications
DMA and interrupt translation should be enabled only after other
applicable DTE fields have been written. Similarly when disabling
translation or when moving a device between domains, translation should
first be disabled, before other entry fields get modified. Note however
that the "moving" aspect doesn't apply to the interrupt remapping side,
as domain specifics are maintained in the IRTEs here, not the DTE. We
also never disable interrupt remapping once it got enabled for a device
(the respective argument passed is always the immutable iommu_intremap).
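A sketch of the ordering rule (a real DTE is a 256-bit structure; a single 64-bit word and C11 seq-cst atomics stand in here, and the bit position is illustrative):
    #include <stdatomic.h>
    #include <stdint.h>

    #define DTE_V (UINT64_C(1) << 0)  /* valid / translation enable */

    static void retarget_dte(_Atomic uint64_t *dte, uint64_t new_fields)
    {
        uint64_t cur = atomic_load(dte);

        /* 1. Disable translation before changing anything else. */
        atomic_store(dte, cur & ~DTE_V);

        /* 2. Update the remaining fields while the entry is invalid. */
        atomic_store(dte, new_fields & ~DTE_V);

        /* 3. Enable translation only once everything else is in place,
         * so the IOMMU never acts on a half-updated entry. */
        atomic_store(dte, new_fields | DTE_V);
    }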
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 0514a3a25fb9ebff5d75cc8f00a9229385300858
master date: 2020-10-20 14:23:12 +0200
Jan Beulich [Tue, 20 Oct 2020 13:16:49 +0000 (15:16 +0200)]
AMD/IOMMU: update live PTEs atomically
Updating a live PTE bitfield by bitfield risks the compiler re-ordering
the individual updates as well as splitting individual updates into
multiple memory writes. Construct the new entry fully in a local
variable, do the check to determine the flushing needs on the thus
established new entry, and then write the new entry by a single insn.
Similarly using memset() to clear a PTE is unsafe, as the order of
writes the function does is, at least in principle, undefined.
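In sketch form, with C11 atomics in place of Xen's accessors (flag positions are illustrative):
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT (UINT64_C(1) << 0)
    #define PTE_IR      (UINT64_C(1) << 61)  /* read allowed */
    #define PTE_IW      (UINT64_C(1) << 62)  /* write allowed */

    static void set_pte(_Atomic uint64_t *pte, uint64_t mfn,
                        bool iw, bool ir)
    {
        /* Build the complete entry in a local variable first ... */
        uint64_t e = (mfn << 12) | PTE_PRESENT |
                     (iw ? PTE_IW : 0) | (ir ? PTE_IR : 0);

        /* ... then publish it with a single store: the IOMMU never
         * observes a mix of old and new fields. */
        atomic_store(pte, e);
    }

    static void clear_pte(_Atomic uint64_t *pte)
    {
        /* Not memset(): one single, well-ordered write. */
        atomic_store(pte, 0);
    }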
This is part of XSA-347.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 3b055121c5410e2c3105d6d06aa24ca0d58868cd
master date: 2020-10-20 14:22:52 +0200
Jan Beulich [Tue, 20 Oct 2020 13:16:22 +0000 (15:16 +0200)]
IOMMU: hold page ref until after deferred TLB flush
When moving around a page via XENMAPSPACE_gmfn_range, deferring the TLB
flush for the "from" GFN range requires that the page remains allocated
to the guest until the TLB flush has actually occurred. Otherwise a
parallel hypercall to remove the page would only flush the TLB for the
GFN it has been moved to, but not the one it was mapped at originally.
This is part of XSA-346.
Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ") Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 5777a3742d88ff1c0ebc626ceb4fd47f9b3dc6d5
master date: 2020-10-20 14:21:32 +0200
Jan Beulich [Tue, 20 Oct 2020 13:15:16 +0000 (15:15 +0200)]
IOMMU: suppress "iommu_dont_flush_iotlb" when about to free a page
Deferring flushes to a single, wide range one - as is done when
handling XENMAPSPACE_gmfn_range - is okay only as long as
pages don't get freed ahead of the eventual flush. While the only
function setting the flag (xenmem_add_to_physmap()) suggests by its name
that it's only mapping new entries, in reality the way
xenmem_add_to_physmap_one() works means an unmap would happen not only
for the page being moved (but not freed) but, if the destination GFN is
populated, also for the page being displaced from that GFN. Collapsing
the two flushes for this GFN into just one (and even more so deferring
it to a batched invocation) is not correct.
This is part of XSA-346.
Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: dea460d86957bf1425a8a1572626099ac3f165a8
master date: 2020-10-20 14:21:09 +0200
Hongyan Xia [Tue, 20 Oct 2020 13:14:58 +0000 (15:14 +0200)]
x86/mm: Prevent some races in hypervisor mapping updates
map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency. This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.
Unfortunately, while some potential races are handled correctly,
others are not. These include:
1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.
Take the following example:
Suppose L3[N] points to L2. And suppose we have two processors, A and
B.
* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2.
* A writes L2[M] #
This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: If you call virt_to_xen_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e on a virtual address within either an L3
or L2 superpage.
So take the following example:
* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #
2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.
Take the following example:
Suppose L3[N] points to L2. And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.
* A walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #
Fix this by grabbing a lock for the entirety of the mapping update
operation.
Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.
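A sketch of using such a bit as a lock, with C11 atomics standing in for Xen's (bit position and names are illustrative):
    #include <stdatomic.h>
    #include <stdint.h>

    #define PGT_locked (UINT32_C(1) << 31)  /* illustrative position */

    static void l3t_lock(_Atomic uint32_t *type_info)
    {
        uint32_t old = atomic_load(type_info);

        for (;;) {
            old &= ~PGT_locked;          /* expect the bit clear */
            if (atomic_compare_exchange_weak(type_info, &old,
                                             old | PGT_locked))
                return;                  /* we own this 512GiB chunk */
            /* CAS failure refreshed 'old' with the current value;
             * clear the bit again and retry. */
        }
    }

    static void l3t_unlock(_Atomic uint32_t *type_info)
    {
        atomic_fetch_and(type_info, ~PGT_locked);
    }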
There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed). Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe. Thinking carefully about that is probably
something that wants to be done in public, not under time pressure.
Jan Beulich [Fri, 2 Oct 2020 10:37:09 +0000 (12:37 +0200)]
evtchn/Flask: pre-allocate node on send path
xmalloc() & Co may not be called with IRQs off, or else check_lock()
will have its assertion trigger about locks getting acquired
inconsistently. Re-arranging the locking in evtchn_send() doesn't seem
very reasonable, especially since the per-channel lock was introduced to
avoid acquiring the per-domain event lock on the send paths. Issue a
second call to xsm_evtchn_send() instead, before acquiring the lock, to
give XSM / Flask a chance to pre-allocate whatever it may need.
As these nodes are used merely for caching earlier decisions' results,
allocate just one node in AVC code despite two potentially being needed.
Things will merely not be as performant if a second allocation is
wanted, just like when the pre-allocation fails.
Fixes: c0ddc8634845 ("evtchn: convert per-channel lock to be IRQ-safe") Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: 52e1fc47abc3a0123d2b5bb7e9172e84fd571851
master date: 2020-10-02 08:36:21 +0200
Jan Beulich [Tue, 22 Sep 2020 15:23:04 +0000 (17:23 +0200)]
evtchn: arrange for preemption in evtchn_reset()
Like for evtchn_destroy() looping over all possible event channels to
close them can take a significant amount of time. Unlike done there, we
can't alter domain properties (i.e. d->valid_evtchns) here. Borrow, in a
lightweight form, the paging domctl continuation concept, redirecting
the continuations to different sub-ops. Just like there this is to be
able to allow for predictable overall results of the involved sub-ops:
Racing requests should either complete or be refused.
Note that a domain can't interfere with an already started (by a remote
domain) reset, due to being paused. It can prevent a remote reset from
happening by leaving a reset unfinished, but that's only going to affect
itself.
Jan Beulich [Tue, 22 Sep 2020 15:22:28 +0000 (17:22 +0200)]
evtchn: arrange for preemption in evtchn_destroy()
Especially closing of fully established interdomain channels can take
quite some time, due to the locking involved. Therefore we shouldn't
assume we can clean up still active ports all in one go. Besides adding
the necessary preemption check, also avoid pointlessly starting from
(or now really ending at) 0; 1 is the lowest numbered port which may
need closing.
Since we're now reducing ->valid_evtchns, free_xen_event_channel(),
and (at least to be on the safe side) notify_via_xen_event_channel()
need to cope with attempts to close / unbind from / send through already
closed (and no longer valid, as per port_is_valid()) ports.
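One possible shape of such a restartable loop (stubs and the preemption interval are illustrative, not the actual code):
    #include <stdbool.h>

    #define ERESTART 85                 /* illustrative "come back later" */

    static bool preempt_check(void) { return false; }          /* stub */
    static void close_port(unsigned int port) { (void)port; }  /* stub */

    /* Iterate downwards so the recorded position alone conveys all the
     * restart information; port 0 never needs closing. */
    static int destroy_ports(unsigned int *next_port)
    {
        for (unsigned int port = *next_port; port > 0; port--) {
            if (!(port & 0x3f) && preempt_check()) {
                *next_port = port;      /* resume from here next time */
                return -ERESTART;
            }
            close_port(port);
        }
        return 0;
    }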
Jan Beulich [Tue, 22 Sep 2020 15:22:00 +0000 (17:22 +0200)]
evtchn: address races with evtchn_reset()
Neither d->evtchn_port_ops nor max_evtchns(d) may be used in an entirely
lock-less manner, as both may change by a racing evtchn_reset(). In the
common case, at least one of the domain's event lock or the per-channel
lock needs to be held. In the specific case of the inter-domain sending
by evtchn_send() and notify_via_xen_event_channel() holding the other
side's per-channel lock is sufficient, as the channel can't change state
without both per-channel locks held. Without such a channel changing
state, evtchn_reset() can't complete successfully.
Lock-free accesses continue to be permitted for the shim (calling some
otherwise internal event channel functions), as this happens while the
domain is in effectively single-threaded mode. Special care also needs
taking for the shim's marking of in-use ports as ECS_RESERVED (allowing
use of such ports in the shim case is okay because switching into and
hence also out of FIFO mode is impossible there).
As a side effect, certain operations on Xen bound event channels which
were mistakenly permitted so far (e.g. unmask or poll) will be refused
now.
Jan Beulich [Tue, 22 Sep 2020 15:20:56 +0000 (17:20 +0200)]
evtchn: evtchn_reset() shouldn't succeed with still-open ports
While the function closes all ports, it does so without holding any
lock, and hence racing requests may be issued causing new ports to get
opened. This would have been problematic in particular if such a newly
opened port had a port number above the new implementation limit (i.e.
when switching from FIFO to 2-level) after the reset, as prior to
"evtchn: relax port_is_valid()" this could have led to e.g.
evtchn_close()'s "BUG_ON(!port_is_valid(d2, port2))" to trigger.
Introduce a counter of active ports and check that it's (still) no
larger than the number of Xen internally used ones after obtaining the
necessary lock in evtchn_reset().
As to the access model of the new {active,xen}_evtchns fields - while
all writes get done using write_atomic(), reads ought to use
read_atomic() only when outside of a suitably locked region.
Note that as of now evtchn_bind_virq() and evtchn_bind_ipi() don't have
a need to call check_free_port().
Jan Beulich [Tue, 22 Sep 2020 15:20:23 +0000 (17:20 +0200)]
evtchn/x86: enforce correct upper limit for 32-bit guests
The recording of d->max_evtchns in evtchn_2l_init(), in particular with
the limited set of callers of the function, is insufficient. Neither for
PV nor for HVM guests the bitness is known at domain_create() time, yet
the upper bound in 2-level mode depends upon guest bitness. Recording
too high a limit "allows" x86 32-bit domains to open not properly usable
event channels, management of which (inside Xen) would then result in
corruption of the shared info and vCPU info structures.
Keep the upper limit dynamic for the 2-level case, introducing a helper
function to retrieve the effective limit. This helper is now supposed to
be private to the event channel code. The uses in do_poll() and
domain_dump_evtchn_info() weren't consistent with port uses elsewhere
and hence get switched to port_is_valid().
Furthermore FIFO mode's setup_ports() gets adjusted to loop only up to
the prior ABI limit, rather than all the way up to the new one.
Finally a word on the change to do_poll(): Accessing ->max_evtchns
without holding a suitable lock was never safe, as it as well as
->evtchn_port_ops may change behind do_poll()'s back. Using
port_is_valid() instead somewhat widens the window for potential abuse,
until we've dealt with the race altogether (see XSA-343).
This is XSA-342.
Reported-by: Julien Grall <jgrall@amazon.com> Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Julien Grall <jgrall@amazon.com>
xen/evtchn: Add missing barriers when accessing/allocating an event channel
While the allocation of a bucket is always performed with the per-domain
lock, the bucket may be accessed without the lock taken (for instance, see
evtchn_send()).
Instead such sites rely on port_is_valid() to return a non-zero value
when the port has a struct evtchn associated with it. The function will
mostly check whether the port is less than d->valid_evtchns as all the
buckets/event channels should be allocated up to that point.
Unfortunately a compiler is free to re-order the assignments in
evtchn_allocate_port(), so it would be possible for d->valid_evtchns
to be updated before the new bucket has finished being allocated.
Additionally on Arm, even if this was compiled "correctly", the
processor can still re-order the memory access.
Add a write memory barrier in the allocation side and a read memory
barrier when the port is valid to prevent any re-ordering issue.
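A sketch of the publish/consume pairing, with C11 release/acquire standing in for Xen's smp_wmb()/smp_rmb() (sizes and names are illustrative):
    #include <stdatomic.h>
    #include <stdbool.h>

    #define EVTCHNS_PER_BUCKET 64

    struct evtchn_demo { int state; };

    static struct evtchn_demo *buckets[64];
    static _Atomic unsigned int valid_evtchns;

    /* Writer: fill the bucket first, then publish with release
     * semantics (Xen's smp_wmb() plays this role). */
    static void publish_bucket(unsigned int i, struct evtchn_demo *b)
    {
        buckets[i] = b;
        atomic_store_explicit(&valid_evtchns,
                              (i + 1) * EVTCHNS_PER_BUCKET,
                              memory_order_release);
    }

    /* Reader: the acquire load (Xen's smp_rmb()) guarantees that a
     * valid port implies a visible, fully-allocated bucket. */
    static bool port_is_valid(unsigned int port)
    {
        return port < atomic_load_explicit(&valid_evtchns,
                                           memory_order_acquire);
    }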
Andrew Cooper [Tue, 22 Sep 2020 15:18:57 +0000 (17:18 +0200)]
x86/pv: Avoid double exception injection
There is at least one path (SYSENTER with NT set, Xen converts to #GP) which
ends up injecting the #GP fault twice, first in compat_sysenter(), and then a
second time in compat_test_all_events(), due to the stale TBF_EXCEPTION left
in TRAPBOUNCE_flags.
The guest kernel sees the second fault first, which is a kernel level #GP
pointing at the head of the #GP handler, and is therefore a userspace
trigger-able DoS.
This particular bug has bitten us several times before, so rearrange
{compat_,}create_bounce_frame() to clobber TRAPBOUNCE on success, rather than
leaving this task to one area of code which isn't used uniformly.
Other scenarios which might result in a double injection (e.g. two calls
directly to compat_create_bounce_frame) will now crash the guest, which is far
more obvious than letting the kernel run with corrupt state.
This is XSA-339.
Fixes: fdac9515607b ("x86: clear EFLAGS.NT in SYSENTER entry path") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 22 Sep 2020 15:18:21 +0000 (17:18 +0200)]
evtchn: relax port_is_valid()
To avoid ports potentially becoming invalid behind the back of certain
other functions (due to ->max_evtchns shrinking) because of
- a guest invoking evtchn_reset() and from a 2nd vCPU opening new
channels in parallel (see also XSA-343),
- alloc_unbound_xen_event_channel() produced channels living above the
2-level range (see also XSA-342),
drop the max_evtchns check from port_is_valid(). For a port for which
the function once returned "true", the returned value may not turn into
"false" later on. The function's result may only depend on bounds which
can only ever grow (which is the case for d->valid_evtchns).
This also eliminates a false sense of safety, utilized by some of the
users (see again XSA-343): Without a suitable lock held, d->max_evtchns
may change at any time, and hence deducing that certain other operations
are safe when port_is_valid() returned true is not legitimate. The
opportunities to abuse this may get widened by the change here
(depending on guest and host configuration), but will be taken care of
by the other XSA.
This is XSA-338.
Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Tue, 22 Sep 2020 15:17:47 +0000 (17:17 +0200)]
x86/MSI-X: restrict reading of table/PBA bases from BARs
When assigned to less trusted or un-trusted guests, devices may change
state behind our backs (they may e.g. get reset by means we may not know
about). Therefore we should avoid reading BARs from hardware once a
device is no longer owned by Dom0. Furthermore when we can't read a BAR,
or when we read zero, we shouldn't instead use the caller provided
address unless that caller can be trusted.
Re-arrange the logic in msix_capability_init() such that only Dom0 (and
only if the device isn't DomU-owned yet) or calls through
PHYSDEVOP_prepare_msix will actually result in the reading of the
respective BAR register(s). Additionally do so only as long as in-use
table entries are known (note that invocation of PHYSDEVOP_prepare_msix
counts as a "pseudo" entry). In all other uses the value already
recorded will get used instead.
Clear the recorded values in _pci_cleanup_msix() as well as on the one
affected error path. (Adjust this error path to also avoid blindly
disabling MSI-X when it was enabled on entry to the function.)
While moving around variable declarations (in many cases to reduce their
scopes), also adjust some of their types.
This is part of XSA-337.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
It's safer and faster to just use the cached last written
(untranslated) MSI message stored in msi_desc for the single user that
calls read_msi_msg.
This also prevents relying on the data read from the device MSI
registers in order to figure out the index into the IOMMU interrupt
remapping table, which is not safe.
This is part of XSA-337.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/vpt: fix race when migrating timers between vCPUs
The current vPT code will migrate the emulated timers between vCPUs
(change the pt->vcpu field) while just holding the destination lock,
either from create_periodic_time or pt_adjust_global_vcpu_target if
the global target is adjusted. Changing the periodic_timer vCPU field
in this way creates a race where a third party could grab the lock in
the unlocked region of pt_adjust_global_vcpu_target (or before
create_periodic_time performs the vcpu change) and then release the
lock from a different vCPU, creating a locking imbalance.
Introduce a per-domain rwlock in order to protect periodic_time
migration between vCPU lists. Taking the lock in read mode prevents
any timer from being migrated to a different vCPU, while taking it in
write mode allows performing migration of timers across vCPUs. The
per-vcpu locks are still used to protect all the other fields from the
periodic_timer struct.
Note that such migration shouldn't happen frequently, and hence
there's no performance drop as a result of such locking.
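Schematically, with pthread locks standing in for Xen's (types and names are illustrative, not the actual vpt code):
    #include <pthread.h>

    struct domain_demo { pthread_rwlock_t pt_migrate; };
    struct vcpu_demo   { struct domain_demo *d; pthread_mutex_t tm_lock; };

    /* Normal timer operations: the shared lock keeps pt->vcpu stable,
     * the per-vCPU lock protects the rest of the timer state. */
    static void timer_op(struct vcpu_demo *v)
    {
        pthread_rwlock_rdlock(&v->d->pt_migrate);
        pthread_mutex_lock(&v->tm_lock);
        /* ... operate on this vCPU's timer list ... */
        pthread_mutex_unlock(&v->tm_lock);
        pthread_rwlock_unlock(&v->d->pt_migrate);
    }

    /* Migration: the exclusive lock excludes all readers, so pt->vcpu
     * can change without any third party holding the old vCPU's lock. */
    static void migrate_timer(struct domain_demo *d)
    {
        pthread_rwlock_wrlock(&d->pt_migrate);
        /* ... move the timer between vCPU lists, update pt->vcpu ... */
        pthread_rwlock_unlock(&d->pt_migrate);
    }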
This is XSA-336.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>