tools/xenstore: create_node: Don't defer work to undo any changes on failure
XSA-115 extended destroy_node() to update the node accounting for the
connection. The implementation assumes the connection is the parent
of the node; however, all nodes are allocated using a separate context
(see process_message()). This can crash (or corrupt) xenstored, as the
pointer is used incorrectly.
In case of an error, any changes to the database or updates to the
accounting will now be reverted in create_node() by calling
destroy_node() directly. This has the nice advantage of removing the
loop to unset the destructors in case of success.
Take the opportunity to free the nodes right away, as they are not
going to be reachable (the function returns NULL) and are just wasting
resources.
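As a rough sketch of the reworked error path, assuming simplified
shapes for the existing xenstored helpers (the real change also covers
the quota accounting reverted inside destroy_node()):

    static struct node *create_node(struct connection *conn, const void *ctx,
                                    const char *name,
                                    void *data, unsigned int datalen)
    {
        struct node *node = construct_node(conn, ctx, name);

        if (!node)
            return NULL;
        /* ... set data, update accounting, write each new node ... */
        if (write_node(conn, node, false)) {
            /* Revert DB changes and accounting immediately ... */
            destroy_node(conn, node);
            /* ... and free the unreachable nodes right away. */
            talloc_free(node);
            return NULL;
        }
        return node;
    }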
Henry Wang [Tue, 25 Oct 2022 09:19:37 +0000 (09:19 +0000)]
xen/arm: p2m: Populate pages for GICv2 mapping in p2m_init()
Hardware using GICv2 needs a P2M mapping of the 8KB GICv2 area to be
created when the domain is created. Considering the worst case, this
requires 6 P2M pages, as the two pages will be consecutive but not
necessarily in the same L3 page table; to also keep a buffer, populate
16 pages as the default value of the P2M pages pool in p2m_init() at
the domain creation stage to satisfy the GICv2 requirement. For GICv3,
the above-mentioned P2M mapping is not necessary, but since the 16
pages allocated here would not be lost, populate them unconditionally.
With the default 16 P2M pages populated, failures can now happen
during domain creation with P2M pages already in use. To properly free
the P2M in this case, first support optional preemption of
p2m_teardown(), then call p2m_teardown() and
p2m_set_allocation(d, 0, NULL) non-preemptively in p2m_final_teardown().
As the non-preemptive p2m_teardown() should only return 0, use a
BUG_ON() to confirm that.
Since p2m_final_teardown() is called either after
domain_relinquish_resources(), where relinquish_p2m_mapping() has
already been called, or from the failure path of
domain_create()/arch_domain_create(), where mappings that require
p2m_put_l3_page() should never be created, relinquish_p2m_mapping() is
not added to p2m_final_teardown(); add in-code comments documenting
this.
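A minimal sketch of the resulting teardown sequence, assuming the
signatures implied above (not the exact Xen code):

    void p2m_final_teardown(struct domain *d)
    {
        /*
         * No preemption allowed here, so p2m_teardown() must run to
         * completion and can only return 0.
         */
        BUG_ON(p2m_teardown(d, false));
        /*
         * relinquish_p2m_mapping() is intentionally not needed here;
         * see the rationale above. Finally drain the pool, returning
         * the pages populated in p2m_init() back to the heap.
         */
        BUG_ON(p2m_set_allocation(d, 0, NULL));
    }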
Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool") Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit c7cff1188802646eaa38e918e5738da0e84949be)
Andrew Cooper [Tue, 25 Oct 2022 09:19:36 +0000 (09:19 +0000)]
arm/p2m: Rework p2m_init()
p2m_init() is mostly trivial initialisation, but has two fallible
operations which sit on either side of setting the backpointer that
triggers teardown actions.
p2m_free_vmid() is idempotent with a failed p2m_alloc_vmid(), so rearrange
p2m_init() to perform all trivial setup, then set the backpointer, then
perform all fallible setup.
This will simplify a future bugfix which needs to add a third fallible
operation.
No practical change.
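A sketch of the reworked ordering (the individual setup steps here are
illustrative, not the exact function body):

    int p2m_init(struct domain *d)
    {
        struct p2m_domain *p2m = p2m_get_hostp2m(d);

        /* 1) Trivial, infallible initialisation first. */
        rwlock_init(&p2m->lock);
        INIT_PAGE_LIST_HEAD(&p2m->pages);

        /* 2) Set the backpointer: teardown may act from now on. */
        p2m->domain = d;

        /*
         * 3) Fallible setup last. p2m_free_vmid() in the teardown
         *    path is idempotent with a failed p2m_alloc_vmid(), so
         *    this ordering is safe.
         */
        return p2m_alloc_vmid(d);
    }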
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit 3783e583319fa1ce75e414d851f0fde191a14753)
Jan Beulich [Wed, 12 Oct 2022 15:36:48 +0000 (17:36 +0200)]
libxl/Arm: correct xc_shadow_control() invocation to fix build
The backport didn't adapt to the earlier function prototype taking more
(unused here) arguments.
Fixes: c5215044578e ("xen/arm, libxl: Implement XEN_DOMCTL_shadow_op for Arm") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Henry Wang <Henry.Wang@arm.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Tue, 11 Oct 2022 13:53:28 +0000 (15:53 +0200)]
gnttab: correct locking on transitive grant copy error path
While the comment next to the lock dropping in preparation of
recursively calling acquire_grant_for_copy() mistakenly talks about the
rd == td case (excluded a few lines further up), the same concerns apply
to the calling of release_grant_for_copy() on a subsequent error path.
This is CVE-2022-33748 / XSA-411.
Fixes: ad48fb963dbf ("gnttab: fix transitive grant handling") Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e3aab858eef614a21a782a3b73acc88e74690ea
master date: 2022-10-11 14:29:30 +0200
Henry Wang [Tue, 11 Oct 2022 13:52:18 +0000 (15:52 +0200)]
xen/arm: Allocate and free P2M pages from the P2M pool
This commit sets up and tears down the p2m pages pool for
non-privileged Arm guests by calling `p2m_set_allocation` and
`p2m_teardown_allocation`.
- For dom0, P2M pages should come from heap directly instead of p2m
pool, so that the kernel may take advantage of the extended regions.
- For xl guests, the setting of the p2m pool is called in
`XEN_DOMCTL_shadow_op` and the p2m pool is destroyed in
`domain_relinquish_resources`. Note that domctl->u.shadow_op.mb is
updated with the new size when setting the p2m pool.
- For dom0less domUs, the setting of the p2m pool is done before
allocating memory during domain creation. Users can specify the p2m
pool size via the `xen,domain-p2m-mem-mb` dts property.
To actually allocate/free pages from the p2m pool, this commit adds
two helper functions namely `p2m_alloc_page` and `p2m_free_page` to
`struct p2m_domain`. By replacing the `alloc_domheap_page` and
`free_domheap_page` with these two helper functions, p2m pages can
be added/removed from the list of p2m pool rather than from the heap.
Since pages from `p2m_alloc_page` are cleaned, take the opportunity
to remove the redundant `clean_page` in `p2m_create_table`.
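A simplified sketch of the pool-backed allocator (field names follow
the earlier patch in this series; not the exact code):

    static struct page_info *p2m_alloc_page(struct domain *d)
    {
        struct page_info *pg;

        spin_lock(&d->arch.paging.lock);
        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
        spin_unlock(&d->arch.paging.lock);

        /* NULL if the pool is exhausted; pages come back pre-cleaned. */
        return pg;
    }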
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: cbea5a1149ca7fd4b7cdbfa3ec2e4f109b601ff7
master date: 2022-10-11 14:28:44 +0200
Henry Wang [Tue, 11 Oct 2022 13:52:02 +0000 (15:52 +0200)]
xen/arm, libxl: Implement XEN_DOMCTL_shadow_op for Arm
This commit implements the `XEN_DOMCTL_shadow_op` support in Xen
for Arm. The p2m pages pool size for xl guests is supposed to be
determined by `XEN_DOMCTL_shadow_op`. Hence, this commit:
- Introduces a function `p2m_domctl` and implements the subops
`XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION` and
`XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION` of `XEN_DOMCTL_shadow_op`.
- Adds the `XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION` support in libxl,
thereby enabling setting the shadow memory pool size when creating a
guest from xl, and getting the shadow memory pool size from Xen.
Note that the `XEN_DOMCTL_shadow_op` added in this commit is only
a dummy op; the functionality of setting/getting the p2m memory pool
size for xl guests will be added in the following commits.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: cf2a68d2ffbc3ce95e01449d46180bddb10d24a0
master date: 2022-10-11 14:28:42 +0200
Henry Wang [Tue, 11 Oct 2022 13:51:45 +0000 (15:51 +0200)]
xen/arm: Construct the P2M pages pool for guests
This commit constructs the p2m pages pool for guests from the
data structure and helper perspective.
This is implemented by:
- Adding a `struct paging_domain`, which contains a freelist, a
counter variable and a spinlock, to `struct arch_domain`, to track
the free p2m pages and the total number of p2m pages in the p2m
pages pool.
- Adding a helper `p2m_get_allocation` to get the p2m pool size.
- Adding a helper `p2m_set_allocation` to set the p2m pages pool
size. This helper should be called before allocating memory for
a guest.
- Adding a helper `p2m_teardown_allocation` to free the p2m pages
pool. This helper should be called during xl domain destruction.
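A simplified sketch of the set-allocation loop these helpers
implement (illustrative; the real helper also checks for preemption):

    int p2m_set_allocation(struct domain *d, unsigned long pages,
                           bool *preempted)
    {
        struct page_info *pg;

        while ( d->arch.paging.p2m_total_pages != pages )
        {
            if ( d->arch.paging.p2m_total_pages < pages )
            {
                /* Grow the pool: pull a page from the heap. */
                pg = alloc_domheap_page(NULL, 0);
                if ( pg == NULL )
                    return -ENOMEM;
                d->arch.paging.p2m_total_pages++;
                page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
            }
            else
            {
                /* Shrink the pool: return a free page to the heap. */
                pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
                d->arch.paging.p2m_total_pages--;
                free_domheap_page(pg);
            }
            /* Preemption checks elided for brevity. */
        }
        return 0;
    }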
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 55914f7fc91a468649b8a3ec3f53ae1c4aca6670
master date: 2022-10-11 14:28:39 +0200
Henry Wang [Tue, 11 Oct 2022 13:51:26 +0000 (15:51 +0200)]
libxl, docs: Use arch-specific default paging memory
The default paging memory (described by the `shadow_memory` entry in
the xl config) in libxl is used to determine the memory pool size for
xl guests. Currently this size is only used for x86, and contains a
part of RAM to shadow the resident processes. Since there are no
shadow-mode guests on Arm, that part of RAM is unnecessary. Therefore,
this commit splits the function `libxl_get_required_shadow_memory()`
into arch-specific helpers and renames the helper to
`libxl__arch_get_required_paging_memory()`.
On x86, this helper keeps the original calculation from
`libxl_get_required_shadow_memory()`, so no functional change is
intended.
On Arm, this helper returns 1MB per vcpu, plus 4KB per MiB of RAM
for the P2M map, plus an additional 512KB.
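Expressed as a sketch of the Arm helper (hypothetical shape; all
sizes in KiB):

    static unsigned long libxl__arch_get_required_paging_memory(
        unsigned long maxmem_kb, unsigned int smp_cpus)
    {
        /* 1MB per vcpu, plus 4KB per MiB of RAM for the P2M map,
         * plus an additional 512KB. */
        return 1024 * smp_cpus + (maxmem_kb / 1024) * 4 + 512;
    }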
Also update the xl.cfg documentation to add Arm documentation
matching the code changes, and correct the comment style to follow
the Xen coding style.
This is part of CVE-2022-33747 / XSA-409.
Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 156a239ea288972425f967ac807b3cb5b5e14874
master date: 2022-10-11 14:28:37 +0200
Julien Grall [Tue, 11 Oct 2022 13:50:28 +0000 (15:50 +0200)]
xen/x86: p2m: Add preemption in p2m_teardown()
The list p2m->pages contains all the pages used by the P2M. On a
large instance this list can be quite long, and the time spent calling
d->arch.paging.free_page() can exceed 1ms for an 80GB guest on a Xen
running in a nested environment on a c5.metal.
By extrapolation, it would take > 100ms for an 8TB guest (which we
currently security support). So add some preemption in p2m_teardown()
and propagate it to the callers. Note there are 3 places where
the preemption is not enabled:
- hap_final_teardown()/shadow_final_teardown(): We are
preventing updates to the P2M once the domain is dying (so
no more pages can be allocated) and most of the P2M pages
will be freed in a preemptive manner when relinquishing the
resources. So it is fine to disable preemption here.
- shadow_enable(): This is fine because it will undo the allocation
that may have been made by p2m_alloc_table() (so only the root
page table).
The preemption is arbitrarily checked every 1024 iterations.
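As a rough sketch of that check inside the freeing loop (simplified;
the surrounding function shape is assumed):

    unsigned int i = 0;
    struct page_info *pg;

    while ( (pg = page_list_remove_head(&p2m->pages)) )
    {
        d->arch.paging.free_page(d, pg);
        /* Arbitrary period: check for preemption every 1024 pages. */
        if ( preempted && !(++i % 1024) && general_preempt_check() )
        {
            *preempted = true;
            return;
        }
    }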
We now need to include <xen/event.h> in p2m-basic in order to
import the definition for local_events_need_delivery() used by
general_preempt_check(). Ideally, the inclusion should happen in
xen/sched.h but it opened a can of worms.
Note that with the current approach, Xen doesn't keep track of whether
the alt/nested P2Ms have been cleared, so there is some redundant work.
However, this is not expected to incur too much overhead (the P2M lock
shouldn't be contended during teardown), so this optimization is left
outside of the security event.
Roger Pau Monné [Tue, 11 Oct 2022 13:50:10 +0000 (15:50 +0200)]
x86/p2m: free the paging memory pool preemptively
The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the latter doesn't.
Attempt to move as much p2m related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptive way. In order to avoid causing issues to
existing callers leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown(). Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.
Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.
This is part of CVE-2022-33746 / XSA-410.
Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: e7aa55c0aab36d994bf627c92bd5386ae167e16e
master date: 2022-10-11 14:24:21 +0200
Roger Pau Monné [Tue, 11 Oct 2022 13:49:52 +0000 (15:49 +0200)]
x86/p2m: truly free paging pool memory for dying domains
Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).
Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.
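A sketch of the dying-domain special case (hypothetical field and
helper names; the real code also handles pool accounting):

    static void shadow_free_page(struct domain *d, struct page_info *pg)
    {
        if ( unlikely(d->is_dying) )
        {
            /*
             * Free outright instead of returning the page to the pool,
             * limiting the non-preemptable work left for final teardown.
             */
            d->arch.paging.shadow.total_pages--;
            free_domheap_page(pg);
        }
        else
            page_list_add_tail(pg, &d->arch.paging.shadow.freelist);
    }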
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: f50a2c0e1d057c00d6061f40ae24d068226052ad
master date: 2022-10-11 14:23:51 +0200
Roger Pau Monné [Tue, 11 Oct 2022 13:49:18 +0000 (15:49 +0200)]
x86/shadow: tolerate failure in shadow_prealloc()
Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false. Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that. Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.
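A minimal sketch of the changed wrapper, assuming the described
semantics and a shadow_size() per-type size helper:

    static bool shadow_prealloc(struct domain *d, unsigned int type,
                                unsigned int count)
    {
        bool ret = _shadow_prealloc(d, shadow_size(type) * count);

        /* Shadow can't operate normally past an allocation failure. */
        if ( !ret && !d->is_dying )
            domain_crash(d);

        return ret;
    }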
Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure; some of those will return with
null monitor page tables (and the domain likely crashed). This is no
different from the current error paths, but the newly introduced ones
are more likely to trigger.
The newly added failure points in sh_update_paging_modes() also require
that on some error return paths the previous structures are cleared,
and thus the monitor table is null.
While there adjust the 'type' parameter type of shadow_prealloc() to
unsigned int rather than u32.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: b7f93c6afb12b6061e2d19de2f39ea09b569ac68
master date: 2022-10-11 14:22:53 +0200
Jan Beulich [Tue, 11 Oct 2022 13:48:59 +0000 (15:48 +0200)]
x86/shadow: tolerate failure of sh_set_toplevel_shadow()
Subsequently sh_set_toplevel_shadow() will be adjusted to install a
blank entry in case prealloc fails. There are, in fact, pre-existing
error paths which would put in place a blank entry. The 4- and 2-level
code in sh_update_cr3(), however, assumes the top level entry to be
valid.
Hence bail from the function in the unlikely event that it's not. Note
that 3-level logic works differently: In particular a guest is free to
supply a PDPTR pointing at 4 non-present (or otherwise deemed invalid)
entries. The guest will crash, but we already cope with that.
Really mfn_valid() is likely wrong to use in sh_set_toplevel_shadow(),
and it should instead be !mfn_eq(gmfn, INVALID_MFN). Avoid such a change
in security context, but add a respective assertion.
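In code form, the added assertion amounts to:

    /* mfn_valid() is likely the wrong check in sh_set_toplevel_shadow();
     * assert the stronger condition without changing behaviour here. */
    ASSERT(!mfn_eq(gmfn, INVALID_MFN));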
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eac000978c1feb5a9ee3236ab0c0da9a477e5336
master date: 2022-10-11 14:22:24 +0200
Jan Beulich [Tue, 11 Oct 2022 13:48:23 +0000 (15:48 +0200)]
x86/HAP: adjust monitor table related error handling
hap_make_monitor_table() will return INVALID_MFN if it encounters an
error condition, but hap_update_paging_modes() wasn't handling this
value, resulting in an inappropriate value being stored in
monitor_table. This would subsequently misguide at least
hap_vcpu_teardown(). Avoid this by bailing early.
Further, when a domain has/was already crashed or (perhaps less
important as there's no such path known to lead here) is already dying,
avoid calling domain_crash() on it again - that's at best confusing.
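A sketch of the early bail covering both points (illustrative shape;
the monitor_table field name is assumed):

    mfn_t mfn = hap_make_monitor_table(v);

    if ( mfn_eq(mfn, INVALID_MFN) )
    {
        /* Don't crash an already crashed/dying domain again. */
        if ( !d->is_shutting_down && !d->is_dying )
            domain_crash(d);
        return;
    }
    v->arch.hvm.monitor_table = pagetable_from_mfn(mfn);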
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5b44a61180f4f2e4f490a28400c884dd357ff45d
master date: 2022-10-11 14:21:56 +0200
Roger Pau Monné [Tue, 11 Oct 2022 13:48:01 +0000 (15:48 +0200)]
x86/p2m: add option to skip root pagetable removal in p2m_teardown()
Add a new parameter to p2m_teardown() in order to select whether the
root page table should also be freed. Note that all users are
adjusted to pass the parameter to remove the root page tables, so
behavior is not modified.
No functional change intended.
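Sketched as a signature change (illustrative):

    /* New parameter selects whether the root page tables are freed too. */
    void p2m_teardown(struct p2m_domain *p2m, bool remove_root);

    /* All existing callers keep the previous behaviour: */
    p2m_teardown(p2m, true);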
This is part of CVE-2022-33746 / XSA-410.
Suggested-by: Julien Grall <julien@xen.org> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 1df52a270225527ae27bfa2fc40347bf93b78357
master date: 2022-10-11 14:21:23 +0200
Julien Grall [Tue, 11 Oct 2022 13:47:41 +0000 (15:47 +0200)]
xen/arm: p2m: Handle preemption when freeing intermediate page tables
At the moment the P2M page tables will be freed when the domain structure
is freed, without any preemption. As the P2M is quite large, iterating
through it may take more time than is reasonable without intermediate
preemption (to run softirqs and perhaps the scheduler).
Split p2m_teardown() in two parts: one preemptible and called when
relinquishing the resources, the other non-preemptible and called
when freeing the domain structure.
As we are now freeing the P2M pages early, we also need to prevent
further allocation if someone calls p2m_set_entry() past p2m_teardown()
(I wasn't able to prove this will never happen). This is done by the
domain->is_dying check added to p2m_set_entry() by the previous patch.
Similarly, we want to make sure that no one can access the freed
pages. Therefore the root is cleared before freeing pages.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Tested-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 3202084566bba0ef0c45caf8c24302f83d92f9c8
master date: 2022-10-11 14:20:56 +0200
Julien Grall [Tue, 11 Oct 2022 13:47:15 +0000 (15:47 +0200)]
xen/arm: p2m: Prevent adding mapping when domain is dying
During the domain destroy process, the domain will still be accessible
until it is fully destroyed. So will the P2M, because we don't bail
out early if is_dying is non-zero. If a domain has permission to
modify the other domain's P2M (i.e. dom0, or a stubdomain), then
foreign mappings can be added past relinquish_p2m_mapping().
Therefore, we need to prevent mappings from being added when the domain
is dying. This commit does so by adding the d->is_dying check to
p2m_set_entry(). Also enhance the check in relinquish_p2m_mapping() to
make sure that no mappings can be added in the P2M after the P2M lock
is released.
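A minimal sketch of the added check (the error value used here is an
assumption):

    /* At the top of p2m_set_entry(): refuse new mappings once the
     * domain has started dying. */
    if ( unlikely(p2m->domain->is_dying) )
        return -ENOMEM;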
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Tested-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 3ebe773293e3b945460a3d6f54f3b91915397bab
master date: 2022-10-11 14:20:18 +0200
Roger Pau Monné [Wed, 3 Aug 2022 12:19:37 +0000 (14:19 +0200)]
tools/libxl: env variable to signal whether disk/nic backend is trusted
Introduce support in libxl for fetching the default backend trusted
option for disk and nic devices.
Users can set LIBXL_{DISK,NIC}_BACKEND_UNTRUSTED environment variable
to notify libxl of whether the backends for disk and nic devices
should be trusted. Such information is passed into the frontend so it
can take the appropriate measures.
This is part of XSA-403.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Tue, 26 Jul 2022 13:05:08 +0000 (15:05 +0200)]
x86/mm: correct TLB flush condition in _get_page_type()
When this logic was moved, it was moved across the point where nx is
updated to hold the new type for the page. IOW originally it was
equivalent to using x (and perhaps x would better have been used), but
now it isn't anymore. Switch to using x, which then brings things in
line again with the slightly earlier comment there (now) talking about
transitions _from_ writable.
I have to confess though that I cannot make a direct connection between
the reported observed behavior of guests leaving several pages around
with pending general references and the change here. Repeated testing,
nevertheless, confirms the reported issue is no longer there.
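Sketched, the corrected check tests the pre-update type (simplified;
the computation of 'mask' is elided):

    /* 'nx' already holds the new type here; the transition of interest
     * is *from* writable, so test the old type 'x', not 'nx'. */
    if ( (x & PGT_type_mask) == PGT_writable_page )
        flush_tlb_mask(mask);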
This is CVE-2022-33745 / XSA-408.
Reported-by: Charles Arnold <carnold@suse.com> Fixes: 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a9949efb288fd6e21bbaf9d5826207c7c41cda27
master date: 2022-07-26 14:54:34 +0200
Andrew Cooper [Mon, 27 Jun 2022 18:29:40 +0000 (19:29 +0100)]
x86/spec-ctrl: Mitigate Branch Type Confusion when possible
Branch Type Confusion affects AMD/Hygon CPUs on Zen2 and earlier. To
mitigate, we require SMT safety (STIBP on Zen2, no-SMT on Zen1), and to issue
an IBPB on each entry to Xen, to flush the BTB.
Due to performance concerns, dom0 (which is trusted in most configurations) is
excluded from protections by default.
Therefore:
* Use STIBP by default on Zen2 too, which now means we want it on by default
on all hardware supporting STIBP.
* Break the current IBPB logic out into a new function, extending it with
IBPB-at-entry logic.
* Change the existing IBPB-at-ctxt-switch boolean to be tristate, and disable
it by default when IBPB-at-entry is providing sufficient safety.
If all PV guests on the system are trusted, then it is recommended to boot
with `spec-ctrl=ibpb-entry=no-pv`, as this will provide an additional marginal
perf improvement.
This is part of XSA-407 / CVE-2022-23825.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d8cb7e0f069e0f106d24941355b59b45a731eabe)
Andrew Cooper [Mon, 16 May 2022 14:48:24 +0000 (15:48 +0100)]
x86/cpuid: Enumeration for BTC_NO
BTC_NO indicates that hardware is not susceptible to Branch Type Confusion.
Zen3 CPUs don't suffer BTC.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 76cb04ad64f3ab9ae785988c40655a71dde9c319)
Andrew Cooper [Thu, 24 Feb 2022 13:44:33 +0000 (13:44 +0000)]
x86/spec-ctrl: Support IBPB-on-entry
We are going to need this to mitigate Branch Type Confusion on AMD/Hygon CPUs,
but as we've talked about using it in other cases too, arrange to support it
generally. However, this is also very expensive in some cases, so we're going
to want per-domain controls.
Introduce SCF_ist_ibpb and SCF_entry_ibpb controls, adding them to the IST and
DOM masks as appropriate. Also introduce X86_FEATURE_IBPB_ENTRY_{PV,HVM}
to patch the code blocks.
For SVM, the STGI is serialising enough to protect against Spectre-v1 attacks,
so no "else lfence" is necessary. VT-x will use the MSR host load list,
so doesn't need any code in the VMExit path.
For the IST path, we can't safely check CPL==0 to skip a flush, as we might
have hit an entry path before its IBPB. As IST hitting Xen is rare, flush
irrespective of CPL. A later path, SCF_ist_sc_msr, provides Spectre-v1
safety.
For the PV paths, we know we're interrupting CPL>0, while for the INTR paths,
we can safely check CPL==0. Only flush when interrupting guest context.
An "else lfence" is needed for safety, but we want to be able to skip it on
unaffected CPUs, so the block wants to be an alternative, which means the
lfence has to be inline rather than UNLIKELY() (the replacement block doesn't
have displacements fixed up for anything other than the first instruction).
As with SPEC_CTRL_ENTRY_FROM_INTR_IST, %rdx is 0 on entry so rely on this to
shrink the logic marginally. Update the comments to specify this new
dependency.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 53a570b285694947776d5190f591a0d5b9b18de7)
We are shortly going to add a conditional IBPB in this path.
Therefore, we cannot hold spec_ctrl_flags in %eax, and rely on only clobbering
it after we're done with its contents. %rbx is available for use, and is the
more normal register to hold preserved information in.
With %rax freed up, use it instead of %rdx for the RSB tmp register, and for
the adjustment to spec_ctrl_flags.
This leaves no use of %rdx, except as 0 for the upper half of WRMSR. In
practice, %rdx is 0 from SAVE_ALL on all paths and isn't likely to change in
the foreseeable future, so update the macro entry requirements to state this
dependency. This marginal optimisation can be revisited if circumstances
change.
No practical change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e9b8d31981f184c6539f91ec54bd9cae29cdae36)
Andrew Cooper [Mon, 4 Jul 2022 20:32:17 +0000 (21:32 +0100)]
x86/spec-ctrl: Rename opt_ibpb to opt_ibpb_ctxt_switch
We are about to introduce the use of IBPB at different points in Xen, making
opt_ibpb ambiguous. Rename it to opt_ibpb_ctxt_switch.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a8e5ef079d6f5c88c472e3e620db5a8d1402a50d)
Andrew Cooper [Tue, 28 Jun 2022 13:36:56 +0000 (14:36 +0100)]
x86/spec-ctrl: Rename SCF_ist_wrmsr to SCF_ist_sc_msr
We are about to introduce SCF_ist_ibpb, at which point SCF_ist_wrmsr becomes
ambiguous.
No functional change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 76d6a36f645dfdbad8830559d4d52caf36efc75e)
We are shortly going to need to context switch new bits in both the vcpu and
S3 paths. Introduce SCF_IST_MASK and SCF_DOM_MASK, and rework d->arch.verw
into d->arch.spec_ctrl_flags to accommodate.
No functional change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5796912f7279d9348a3166655588d30eae9f72cc)
Andrew Cooper [Fri, 8 Jul 2022 15:44:43 +0000 (16:44 +0100)]
x86/spec-ctrl: Add fine-grained cmdline suboptions for primitives
Support controlling the PV/HVM suboption of msr-sc/rsb/md-clear, which
previously wasn't possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 27357c394ba6e1571a89105b840ce1c6f026485c)
Andrew Cooper [Tue, 5 Jul 2022 18:19:01 +0000 (19:19 +0100)]
xen/cmdline: Extend parse_boolean() to signal a name match
This will help parsing a sub-option which has boolean and non-boolean options
available.
First, rework 'int val' into 'bool has_neg_prefix'. This inverts its value,
but the resulting logic is far easier to follow.
Second, reject anything of the form 'no-$FOO=' which excludes ambiguous
constructs such as 'no-$foo=yes' which have never been valid.
This just leaves the case where everything is otherwise fine, but parse_bool()
can't interpret the provided string.
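An illustrative caller, where s/ss delimit the sub-string being parsed
(the exact value used to signal "name matched, value not boolean" is an
assumption here, as is the sub-option helper name):

    int rc = parse_boolean("ibpb-entry", s, ss);

    if ( rc >= 0 )
        /* Plain boolean: apply to both PV and HVM. */
        opt_ibpb_entry_pv = opt_ibpb_entry_hvm = rc;
    else if ( rc == -2 )
        /* Name matched, but not a boolean: parse sub-options. */
        rc = parse_ibpb_entry_suboptions(s, ss); /* hypothetical */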
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 382326cac528dd1eb0d04efd5c05363c453e29f4)
Andrew Cooper [Wed, 16 Mar 2022 13:07:40 +0000 (13:07 +0000)]
x86/spec-ctrl: Knobs for STIBP and PSFD, and follow hardware STIBP hint
STIBP and PSFD are slightly weird bits, because they're both implied by other
bits in MSR_SPEC_CTRL. Add fine grain controls for them, and take the
implications into account when setting IBRS/SSBD.
Rearrange the IBPB text/variables/logic to keep all the MSR_SPEC_CTRL bits
together, for consistency.
However, AMD have a hardware hint CPUID bit recommending that STIBP be set
unilaterally. This is advertised on Zen3, so follow the recommendation.
Furthermore, in such cases, set STIBP behind the guest's back for now.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit fef244b179c06fcdfa581f7d57fa6e578c49ff50)
[Adjust for absence of INTEL_PSFD]
Andrew Cooper [Mon, 27 Jun 2022 10:54:27 +0000 (11:54 +0100)]
x86/spec-ctrl: Only adjust MSR_SPEC_CTRL for idle with legacy IBRS
Back at the time of the original Spectre-v2 fixes, it was recommended to clear
MSR_SPEC_CTRL when going idle. This is because of the side effects on the
sibling thread caused by the microcode IBRS and STIBP implementations which
were retrofitted to existing CPUs.
However, there are no relevant cross-thread impacts for the hardware
IBRS/STIBP implementations, so this logic should not be used on Intel CPUs
supporting eIBRS, or any AMD CPUs; doing so only adds unnecessary latency to
the idle path.
Furthermore, there's no point playing with MSR_SPEC_CTRL in the idle paths if
SMT is disabled for other reasons.
Fixes: 8d03080d2a33 ("x86/spec-ctrl: Cease using thunk=lfence on AMD") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit ffc7694e0c99eea158c32aa164b7d1e1bb1dc46b)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD
Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in
the system. This ceases to be true when using the common logic.
Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and
introduce an AMD specific block to control alternatives. Also update the
boot/resume paths to configure default_xen_spec_ctrl.
svm.h needs an adjustment to remove a dependency on include order.
For now, only activate alternatives for HVM - PV will require more work. No
functional change, as no alternatives are defined for HVM yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)
Andrew Cooper [Fri, 28 Jan 2022 12:03:42 +0000 (12:03 +0000)]
x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3
'idle' here refers to hlt/mwait. The S3 path isn't an idle path - it is a
platform reset.
We need to load default_xen_spec_ctrl unilaterally on the way back up.
Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next
return-to-guest, but that's fragile behaviour.
Conversely, there is no need to clear IBRS and flush the store buffers on the
way down; we're microseconds away from cutting power.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)
Andrew Cooper [Tue, 25 Jan 2022 17:14:48 +0000 (17:14 +0000)]
x86/spec-ctrl: Introduce new has_spec_ctrl boolean
Most MSR_SPEC_CTRL setup will be common between Intel and AMD. Instead of
opencoding an OR of two features everywhere, introduce has_spec_ctrl instead.
Reword the comment above the Intel specific alternatives block to highlight
that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS
out because it will want to be common.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)
Andrew Cooper [Tue, 25 Jan 2022 16:09:59 +0000 (16:09 +0000)]
x86/spec-ctrl: Drop use_spec_ctrl boolean
Several bugfixes have reduced the utility of this variable from its original
purpose, and now all it does is aid in the setup of SCF_ist_wrmsr.
Simplify the logic by dropping the variable and doubling up the setting of
SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL
support easier to follow. Leave a comment explaining why SCF_ist_wrmsr is
still necessary for the VMExit case.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)
Andrew Cooper [Fri, 15 Oct 2021 09:15:39 +0000 (11:15 +0200)]
x86/spec-ctrl: Print all AMD speculative hints/features
We already print Intel features that aren't yet implemented/used, so be
consistent on AMD too.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3d189f16a11d5a209fb47fa3635847608d43745c)
[Forward port over XSA-398]
Andrew Cooper [Fri, 15 Oct 2021 09:15:14 +0000 (11:15 +0200)]
x86/amd: Use newer SSBD mechanisms if they exist
The opencoded legacy Memory Disambiguation logic in init_amd() neglected
Fam19h for the Zen3 microarchitecture. Further more, all Zen2 based system
have the architectural MSR_SPEC_CTRL and the SSBD bit within it, so shouldn't
be using MSR_AMD64_LS_CFG.
Implement the algorithm given in AMD's SSBD whitepaper, and leave a
printk_once() behind in the case that no controls can be found.
This now means that a user explicitly choosing `spec-ctrl=ssbd` will properly
turn off Memory Disambiguation on Fam19h/Zen3 systems.
This still remains a single system-wide setting (for now), and is not context
switched between vCPUs. As such, it doesn't interact with Intel's use of
MSR_SPEC_CTRL and default_xen_spec_ctrl (yet).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a4e6c4e4bea2e0bb720418c331ee28ff9c7632e)
Andrew Cooper [Fri, 15 Oct 2021 09:14:46 +0000 (11:14 +0200)]
x86/amd: Enumeration for speculative features/hints
There is a step change in speculation protections between the Zen1 and Zen2
microarchitectures.
Zen1 and older have no special support. Control bits in non-architectural
MSRs are used to make lfence be dispatch-serialising (Spectre v1), and to
disable Memory Disambiguation (Speculative Store Bypass). IBPB was
retrofitted in a microcode update, and software methods are required for
Spectre v2 protections.
Because the bit controlling Memory Disambiguation is model specific,
hypervisors are expected to expose a MSR_VIRT_SPEC_CTRL interface which
abstracts the model specific details.
Zen2 and later implement the MSR_SPEC_CTRL interface in hardware, and
virtualise the interface for HVM guests to use. A number of hint bits are
specified too to help guide OS software to the most efficient mitigation
strategy.
Zen3 introduced a new feature, Predictive Store Forwarding, along with a
control to disable it in sensitive code.
Add CPUID and VMCB details for all the new functionality.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 747424c664bb164a04e7a9f2ffbf02d4a1630d7d)
Andrew Cooper [Fri, 15 Oct 2021 09:14:16 +0000 (11:14 +0200)]
x86/spec-ctrl: Split the "Hardware features" diagnostic line
Separate the read-only hints from the features requiring active actions on
Xen's behalf.
Also take the opportunity to split the IBRS/IBPB and IBPB mess. More features
with overlapping enumeration are on the way, and it is not useful to split
them like this.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 565ebcda976c05b0c6191510d5e32b621a2b1867)
[Forward ported over XSA-404]
Andrew Cooper [Tue, 25 Jan 2022 12:52:56 +0000 (13:52 +0100)]
x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM
These were written before Spectre/Meltdown went public, and there was large
uncertainty in how the protections would evolve. As it turns out, they're
very specific to Intel hardware, and not very suitable for AMD.
Drop the macros, opencoding the relevant subset of functionality, and leaving
grep-fodder to locate the logic. No change at all for VT-x.
For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB,
although we will soon be adding (different) logic to handle MSR_SPEC_CTRL.
This brings a marginal improvement of removing an unconditional pile of
long-nops from the vmentry/exit path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 95b13fa43e0753b7514bef13abe28253e8614f62)
[Forward port over XSA-404]
Andrew Cooper [Fri, 8 Jul 2022 15:11:40 +0000 (16:11 +0100)]
x86/spec-ctrl: Honour spec-ctrl=0 for unpriv-mmio sub-option
This was an oversight from when unpriv-mmio was introduced.
Fixes: 8c24b70fedcb ("x86/spec-ctrl: Add spec-ctrl=unpriv-mmio") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4cdb519d797c19ebb8fadc5938cdb47479d5a21b)
Andrew Cooper [Mon, 13 Jun 2022 18:18:32 +0000 (19:18 +0100)]
x86/spec-ctrl: Add spec-ctrl=unpriv-mmio
Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.
As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.
However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage. Such
users should enable this option.
On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.
On CPUs fixed against MDS but vulnerable to MMIO stale data leakage, this option:
* On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
using FB_CLEAR.
* On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
srb-lock, previously used to mitigate SRBDS.
Both mitigations require microcode from IPU 2022.1, May 2022.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 8c24b70fedcb52633b2370f834d8a2be3f7fa38e)
Andrew Cooper [Mon, 20 Sep 2021 17:47:49 +0000 (18:47 +0100)]
x86/spec-ctrl: Enumeration for MMIO Stale Data controls
The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.
FB_CLEAR indicates that the VERW instruction has re-gained its Fill Buffer
flushing side effect. This is only enumerated on parts where VERW had
previously lost its flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.
FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2ebe8fe9b7e0d36e9ec3cfe4552b2b197ef0dcec)
Andrew Cooper [Mon, 13 Jun 2022 15:19:01 +0000 (16:19 +0100)]
x86/spec-ctrl: Make VERW flushing runtime conditional
Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type. However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.
Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.
Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.
For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.
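Conceptually, the return-to-guest logic becomes (a C rendering of the
asm path; the field names are illustrative):

    /* Flush on return-to-guest only when this context needs it. */
    if ( info->spec_ctrl_flags & SCF_verw )
        asm volatile ( "verw %0" :: "m" (info->verw_sel) );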
No change in behaviour.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e06b95c1d44ab80da255219fc9f1e2fc423edcb6)
Jan Beulich [Fri, 10 Jun 2022 08:31:23 +0000 (10:31 +0200)]
x86/mm: account for PGT_pae_xen_l2 in recently added assertion
While PGT_pae_xen_l2 will be zapped once the type refcount of an L2 page
reaches zero, it'll be retained as long as the type refcount is non-
zero. Hence any checking against the requested type needs to either zap
the bit from the type or include it in the used mask.
Fixes: 9186e96b199e ("x86/pv: Clean up _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c2095ac76be0f4a1940346c9ffb49fb967345060
master date: 2022-06-10 10:21:06 +0200
Andrew Cooper [Thu, 9 Jun 2022 15:16:06 +0000 (17:16 +0200)]
x86/pv: Track and flush non-coherent mappings of RAM
There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with
devices that make non-coherent writes. The Linux sound subsystem makes
extensive use of this technique.
For such usecases, the guest's DMA buffer is mapped and consistently used as
WC, and Xen doesn't interact with the buffer.
However, a mischievous guest can use WC mappings to deliberately create
non-coherency between the cache and RAM, and use this to trick Xen into
validating a pagetable which isn't actually safe.
Allocate a new PGT_non_coherent to track the non-coherency of mappings. Set
it whenever a non-coherent writeable mapping is created. If the page is used
as anything other than PGT_writable_page, force a cache flush before
validation. Also force a cache flush before the page is returned to the heap.
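A sketch of the flush-before-use rule (simplified; the real logic sits
in the type-change and page-freeing paths, with atomic flag updates):

    if ( page->u.inuse.type_info & PGT_non_coherent )
    {
        /* Make RAM coherent with the cache before trusting contents. */
        void *va = __map_domain_page(page);

        cache_flush(va, PAGE_SIZE);
        unmap_domain_page(va);
        page->u.inuse.type_info &= ~PGT_non_coherent;
    }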
This is CVE-2022-26364, part of XSA-402.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c1c9cae3a9633054b177c5de21ad7268162b2f2c
master date: 2022-06-09 14:23:37 +0200
Andrew Cooper [Thu, 9 Jun 2022 15:15:39 +0000 (17:15 +0200)]
x86/amd: Work around CLFLUSH ordering on older parts
On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
including reads and writes to the address, and LFENCE/SFENCE instructions.
This creates a multitude of problematic corner cases, laid out in the manual.
Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering.
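Sketched (illustrative; the real code selects this sequence via
alternatives on the affected parts only):

    /* Force ordering on both sides of CLFLUSH on pre-CLFLUSHOPT AMD CPUs. */
    asm volatile ( "mfence; clflush %0; mfence"
                   :: "m" (*(const char *)addr) : "memory" );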
This is part of XSA-402.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 062868a5a8b428b85db589fa9a6d6e43969ffeb9
master date: 2022-06-09 14:23:07 +0200
Andrew Cooper [Thu, 9 Jun 2022 15:15:11 +0000 (17:15 +0200)]
x86: Split cache_flush() out of cache_writeback()
Subsequent changes will want a fully flushing version.
Use the new helper rather than opencoding it in flush_area_local(). This
resolves an outstanding issue where the conditional sfence is on the wrong
side of the clflushopt loop. clflushopt is ordered with respect to older
stores, not to younger stores.
Rename gnttab_cache_flush()'s helper to avoid colliding in name.
grant_table.c can see the prototype from cache.h so the build fails
otherwise.
This is part of XSA-402.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9a67ffee3371506e1cbfdfff5b90658d4828f6a2
master date: 2022-06-09 14:22:38 +0200
Andrew Cooper [Thu, 9 Jun 2022 15:14:39 +0000 (17:14 +0200)]
x86: Don't change the cacheability of the directmap
Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings
in response to guest mapping requests") attempted to keep the cacheability
consistent between different mappings of the same page.
The reason wasn't described in the changelog, but it is understood to be in
regards to a concern over machine check exceptions, owing to errata when using
mixed cacheabilities. It did this primarily by updating Xen's mapping of the
page in the direct map when the guest mapped a page with reduced cacheability.
Unfortunately, the logic didn't actually prevent mixed cacheability from
occurring:
* A guest could map a page normally, and then map the same page with
different cacheability; nothing prevented this.
* The cacheability of the directmap was always latest-takes-precedence in
terms of guest requests.
* Grant-mapped frames with lesser cacheability didn't adjust the page's
cacheattr settings.
* The map_domain_page() function still unconditionally created WB mappings,
irrespective of the page's cacheattr settings.
Additionally, update_xen_mappings() had a bug where the alias calculation was
wrong for mfns which were .init content, which should have been treated as
fully guest pages, not Xen pages.
Worse yet, the logic introduced a vulnerability whereby necessary
pagetable/segdesc adjustments made by Xen in the validation logic could become
non-coherent between the cache and main memory. The CPU could subsequently
operate on the stale value in the cache, rather than the safe value in main
memory.
The directmap contains primarily mappings of RAM. PAT/MTRR conflict
resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser
cacheability resolves to being coherent. The special case is WC mappings,
which are non-coherent against MTRR=WB regions (except for fully-coherent
CPUs).
Xen must not have any WC cacheability in the directmap, to prevent Xen's
actions from creating non-coherency. (Guest actions creating non-coherency is
dealt with in subsequent patches.) As all memory types for MTRR=WB ranges
inter-operate coherently, leave Xen's directmap mappings as WB.
Only PV guests with access to devices can use reduced-cacheability mappings to
begin with, and they're trusted not to mount DoSs against the system anyway.
Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them.
Shift the later PGC_* constants up, to gain 3 extra bits in the main reference
count. Retain the check in get_page_from_l1e() for special_pages() because a
guest has no business using reduced cacheability on these.
Andrew Cooper [Thu, 9 Jun 2022 15:13:54 +0000 (17:13 +0200)]
x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
6. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 6.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
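Sketched (simplified; the real function retries the cmpxchg() in a
loop, and 'mask' computation is elided):

    y = cmpxchg(&page->u.inuse.type_info, x, nx);
    if ( y == x )
    {
        /*
         * The typeref is held: concurrent users can no longer obtain
         * the old type, so it is now safe to perform the deferred
         * safety actions, e.g. flushing stale writeable TLB entries.
         */
        if ( (x & PGT_type_mask) == PGT_writable_page )
            flush_tlb_mask(mask);
    }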
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 8cc5036bc385112a82f1faff27a0970e6440dfed
master date: 2022-06-09 14:21:04 +0200
Andrew Cooper [Thu, 9 Jun 2022 15:13:30 +0000 (17:13 +0200)]
x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop being infinite is the compiler barrier hidden in cpu_relax().
* Replace one page_get_owner(page) with the already-calculated 'd' in scope.
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9186e96b199e4f7e52e033b238f9fe869afb69c7
master date: 2022-06-09 14:20:36 +0200
Jan Beulich [Fri, 8 Apr 2022 13:22:49 +0000 (15:22 +0200)]
VT-d: avoid infinite recursion on domain_context_mapping_one() error path
Despite the comment there, infinite recursion was still possible by
flip-flopping between two domains. This is because prev_dom is derived
from the DID found in the context entry, which was already updated by
the time error recovery is invoked. Simply introduce yet another mode
flag to prevent rolling back an in-progress roll-back of a prior
mapping attempt.
Also drop the existing recursion prevention for having been dead anyway:
Earlier in the function we already bail when prev_dom == domain.
Jan Beulich [Fri, 8 Apr 2022 13:21:33 +0000 (15:21 +0200)]
VT-d: avoid NULL deref on domain_context_mapping_one() error paths
First there's a printk() which actually wrongly uses pdev in the first
place: We want to log the coordinates of the (perhaps fake) device
acted upon, which may not be pdev.
Then it was quite pointless for eb19326a328d ("VT-d: prepare for per-
device quarantine page tables (part I)") to add a domid_t parameter to
domain_context_unmap_one(): It's only used to pass back here via
me_wifi_quirk() -> map_me_phantom_function(). Drop the parameter again.
Finally there's the invocation of domain_context_mapping_one(), which
needs to be passed the correct domain ID. Avoid taking that path when
pdev is NULL and the quarantine state is what would need restoring to.
This means we can't security-support non-PCI-Express devices with RMRRs
(if such exist in practice) any longer; note that as of the 1st of the
two commits referenced below, assigning them to DomU-s is unsupported
anyway.
Fixes: 8f41e481b485 ("VT-d: re-assign devices directly") Fixes: 14dd241aad8a ("IOMMU/x86: use per-device page tables for quarantining")
Coverity ID: 1503784 Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 608394b906e71587f02e6662597bc985bad33a5a
master date: 2022-04-07 12:30:19 +0200
Jan Beulich [Fri, 8 Apr 2022 13:20:21 +0000 (15:20 +0200)]
VT-d: don't needlessly look up DID
If get_iommu_domid() in domain_context_unmap_one() fails, we'd better
not clear the context entry in the first place, as we're then unable
to issue the corresponding flush. However, we have no need to look up the
DID in the first place: What needs flushing is very specifically the DID
that was in the context entry before our clearing of it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 445ab9852d69d8957467f0036098ebec75fec092
master date: 2022-04-07 12:29:03 +0200
Jan Beulich [Tue, 5 Apr 2022 13:27:36 +0000 (15:27 +0200)]
IOMMU/x86: use per-device page tables for quarantining
Devices with RMRRs / unity mapped regions, due to it being unspecified
how/when these memory regions may be accessed, may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to have mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.
This is CVE-2022-26361 / part of XSA-400.
While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.
Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.
In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 14dd241aad8af447680ac73e8579990e2c09c1e7
master date: 2022-04-05 14:24:18 +0200
Jan Beulich [Tue, 5 Apr 2022 13:26:41 +0000 (15:26 +0200)]
IOMMU/x86: drop TLB flushes from quarantine_init() hooks
The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: The warning in get_iommu_did() would
trigger.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 54c5cef49239e2f27ec3b3fc8804bf57aa4bf46d
master date: 2022-04-05 14:19:42 +0200
Jan Beulich [Tue, 5 Apr 2022 13:25:54 +0000 (15:25 +0200)]
IOMMU/x86: maintain a per-device pseudo domain ID
In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.
Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.
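An illustrative carving of the ID space (hypothetical constant and
helper; not the exact Xen code):

    /*
     * Reserve the upper half of the 16-bit domid_t space for per-device
     * pseudo IDs; real domain IDs only ever occupy the lower half.
     */
    #define PSEUDO_DID_FLAG (1u << 15)

    static domid_t make_pseudo_did(unsigned int idx)
    {
        return (domid_t)(idx | PSEUDO_DID_FLAG);
    }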
Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.
The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped, while
still avoiding the addition of a new parameter to domain_context_unmap().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 97af062b89d52c0ecf7af254b53345c97d438e33
master date: 2022-04-05 14:19:10 +0200
Jan Beulich [Tue, 5 Apr 2022 13:25:26 +0000 (15:25 +0200)]
VT-d: prepare for per-device quarantine page tables (part II)
Replace the passing of struct domain * by domid_t in preparation of
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.
No functional change intended (except for slightly adjusted log message
text).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7131163c4806e3c7de24873164d1a003d2a27dee
master date: 2022-04-05 14:18:48 +0200
Jan Beulich [Tue, 5 Apr 2022 13:24:54 +0000 (15:24 +0200)]
VT-d: prepare for per-device quarantine page tables (part I)
Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: eb19326a328d49a6a4dc3930391b340f3bcd8948
master date: 2022-04-05 14:18:26 +0200
Jan Beulich [Tue, 5 Apr 2022 13:24:23 +0000 (15:24 +0200)]
AMD/IOMMU: re-assign devices directly
Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, may not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.
This is CVE-2022-26360 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges. The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, availability of which we can't take for granted.
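Purely as an illustration of the atomic path (types and the helper
name below are assumptions, not the Xen code), a 16-byte DTE update
can be expressed with the compiler's 128-bit atomics:
    #include <stdbool.h>

    /* Hedged sketch: compiles to CMPXCHG16B with -mcx16 on x86-64. */
    static bool dte_write_atomic(volatile __uint128_t *dte,
                                 __uint128_t old, __uint128_t new)
    {
        /* Returns false if *dte no longer matches 'old'. */
        return __atomic_compare_exchange_n(dte, &old, new, false,
                                           __ATOMIC_SEQ_CST,
                                           __ATOMIC_SEQ_CST);
    }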
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1fa6e9aa36233fe9c29a204fcb2697e985b8345f
master date: 2022-04-05 14:18:04 +0200
Jan Beulich [Tue, 5 Apr 2022 13:23:57 +0000 (15:23 +0200)]
VT-d: re-assign devices directly
Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.
This is CVE-2022-26359 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for granted.
The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: this way, callers
passing NULL for pdev don't need to be touched.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8f41e481b4852173909363b88c1ab3da747d3a05
master date: 2022-04-05 14:17:42 +0200
Jan Beulich [Tue, 5 Apr 2022 13:23:26 +0000 (15:23 +0200)]
VT-d: drop ownership checking from domain_context_mapping_one()
Despite quite a bit of effort, it was not possible to establish why
exactly this code exists (beyond possibly sanity checking). Rather than
have a subsequent change further complicate this logic, simply get rid
of it.
Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a680b8134b2d1828bbbf443a97feea66e8a85c75
master date: 2022-04-05 14:17:21 +0200
Jan Beulich [Tue, 5 Apr 2022 13:22:31 +0000 (15:22 +0200)]
VT-d: fix add/remove ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.
Also switch to %pd in related log messages.
Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR") Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3221f270cf2eba0a22fd4f92319d664eacb92889
master date: 2022-04-05 14:16:10 +0200
Jan Beulich [Tue, 5 Apr 2022 13:21:21 +0000 (15:21 +0200)]
VT-d: fix (de)assign ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.
Also adjust a related log message.
This is CVE-2022-26358 / part of XSA-400.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 78a40f8b5dfa1a3aec43528663f39473d4429101
master date: 2022-04-05 14:15:33 +0200
Jan Beulich [Tue, 5 Apr 2022 13:20:42 +0000 (15:20 +0200)]
VT-d: correct ordering of operations in cleanup_domid_map()
The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so needs to
play safe wrt other accesses to domid_map[] and domid_bitmap[]. This
avoids context_set_domain_id()'s write to domid_map[] being reset to
zero right away when it races the freeing of a DID.
For the interaction with context_set_domain_id() and did_to_domain_id()
see the code comment.
{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.
domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.
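Schematically, the safe ordering amounts to the following hedged
sketch (field names as referenced above, but the helper is assumed):
    static void free_did(struct vtd_iommu *iommu, unsigned int did)
    {
        /* Invalidate the mapping before the DID can be re-allocated;
         * doing it the other way around would let a racing
         * context_set_domain_id() claim the DID and write domid_map[],
         * only for that write to be wiped out by this cleanup. */
        iommu->domid_map[did] = 0;
        smp_wmb();                              /* map[] update first */
        clear_bit(did, iommu->domid_bitmap);    /* then free the DID */
    }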
x86/hap: do not switch on log dirty for VRAM tracking
XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.
This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:
Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]----
CPU: 34
RIP: e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
[<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
[<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
[<ffff82d04031577f>] F paging_domctl+0x251/0xd41
[<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
[<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
[<ffff82d0403a729d>] F lstar_enter+0x12d/0x140
The assert triggered because the stubdom used
XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty became enabled while the
old structures were being retired, thus leading to new entries being
populated in already clear slots.
Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.
Note that paging_log_dirty_range (used to get the dirty bitmap for
VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn on the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.
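A hedged sketch of that accounting (gfn_p2m_type() is an assumed
helper standing in for the real p2m query):
    /* A gfn is dirty iff it has been faulted from p2m_ram_logdirty
     * back to p2m_ram_rw since VRAM tracking was last (re)armed. */
    static void vram_collect_dirty(struct domain *d,
                                   unsigned long begin_pfn,
                                   unsigned long nr,
                                   uint8_t *dirty_bitmap)
    {
        unsigned long i;

        for ( i = 0; i < nr; i++ )
            if ( gfn_p2m_type(d, begin_pfn + i) == p2m_ram_rw )
                dirty_bitmap[i / 8] |= 1 << (i % 8);
    }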
This is CVE-2022-26356 / XSA-397.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4f4db53784d912c4f409a451c36ebfd4754e0a42
master date: 2022-04-05 14:11:30 +0200
Andrew Cooper [Mon, 7 Mar 2022 16:35:52 +0000 (16:35 +0000)]
x86/spec-ctrl: Cease using thunk=lfence on AMD
AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe. AMD are recommending using retpoline everywhere.
Update the default heuristics to never select THUNK_LFENCE.
This is part of XSA-398 / CVE-2021-26401.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)
Bertrand Marquis [Thu, 17 Feb 2022 14:52:54 +0000 (14:52 +0000)]
xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3
Allow a guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported and create a fastpath in the code to handle a guest's request
to perform the workaround.
The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.
As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.
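The real fastpath lives in the exception entry assembly; at the C
level the idea reduces to something like this sketch (the helper name
is an assumption; 0x80003fff is the function ID the SMCCC spec gives
SMCCC_ARCH_WORKAROUND_3):
    #define ARM_SMCCC_ARCH_WORKAROUND_3_FID 0x80003fffU

    /* Hedged sketch: the BHB flush already ran on entry to Xen, so a
     * matching call can return immediately, before the full context
     * save/restore of the normal SMC handling path. */
    static inline bool smccc_wa3_fastpath(register_t fid)
    {
        return fid == ARM_SMCCC_ARCH_WORKAROUND_3_FID;
    }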
Rahul Singh [Mon, 14 Feb 2022 18:47:32 +0000 (18:47 +0000)]
xen/arm: Add Spectre BHB handling
This commit is adding Spectre BHB handling to Xen on Arm.
The commit is introducing new alternative code to be executed during
exception entry:
- SMCCC workaround 3 call
- loop workaround (with 8, 24 or 32 iterations)
- use of new clearbhb instruction
Cpuerrata is modified by this patch to apply the required workaround for
CPUs affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is
enabled.
To do this, the mechanism previously used to apply SMCCC workaround 1
is reused, and new alternative code to be copied into the exception
handler is introduced.
To define the type of workaround required by a processor, 4 new cpu
capabilities are introduced (one for each loop iteration count and one
for SMCCC workaround 3).
When a processor is affected, enable_spectre_bhb_workaround is called,
and unless the processor has CSV2 set to 3 or the ECBHB feature (either
of which would mean that the processor does what is required in
hardware), the proper code is enabled at exception entry.
In the case where workaround 3 is not supported by the firmware, we
enable workaround 1 when possible as it will also mitigate Spectre BHB
on systems without CSV2.
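Put together, the selection logic reads roughly like this hedged
sketch (every name below is an assumption, not the actual code):
    static void spectre_bhb_select_sketch(void)
    {
        if ( cpu_has_csv2_3 || cpu_has_ecbhb )
            return;                 /* hardware already does the work */

        if ( needs_smccc_workaround_3() )
        {
            if ( firmware_has_workaround_3() )
                install_vector(VEC_SMCCC_WA3);
            else if ( firmware_has_workaround_1() )
                install_vector(VEC_SMCCC_WA1);  /* fallback, see above */
        }
        else
            /* One of the loop-8/24/32 or clearbhb capabilities. */
            install_vector(loop_or_clearbhb_vector());
    }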
Bertrand Marquis [Wed, 23 Feb 2022 09:42:18 +0000 (09:42 +0000)]
xen/arm: Add ECBHB and CLEARBHB ID fields
Introduce ID coprocessor register ID_AA64ISAR2_EL1.
Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and
CLEARBHB in isar2 ID coprocessor registers.
Bertrand Marquis [Tue, 15 Feb 2022 10:39:47 +0000 (10:39 +0000)]
xen/arm: move errata CSV2 check earlier
The CSV2 availability check is done after printing to the user that
workaround 1 will be used. Move the check earlier to prevent telling
the user that workaround 1 is used when it is not, because it is not
needed.
This will also allow reusing the install_bp_hardening_vec function for
other use cases.
Code previously returning "true" now returns "0", to conform to
enable_smccc_arch_workaround_1 returning an int and to the surrounding
code doing a "return 0" if the workaround is not needed.
Julien Grall [Tue, 25 Jan 2022 13:44:47 +0000 (14:44 +0100)]
xen/grant-table: Only decrement the refcounter when grant is fully unmapped
The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a
simple revert of the changes done by the grant mapping hypercall
(GNTTABOP_map_grant_ref).
Instead, it is possible to partially (or even not at all) clear some
flags. This will leave the grant mapped until a future call where all
the flags would be cleared.
XSA-380 introduced a refcount that is meant to be dropped only when
the grant is fully unmapped. Unfortunately, unmap_common() will
decrement the refcount for every successful call.
As a consequence, a domain would be able to underflow the refcount and
trigger a BUG().
Looking at the code, it is not clear to me why a domain would
want to partially clear some flags in the grant-table. But as
this is part of the ABI, it is better to not change the behavior
for now.
Fix it by checking whether the maptrack handle has been released before
decrementing the refcount.
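In outline (a hedged sketch with assumed field and helper names, not
the actual patch):
    static void maybe_put_final_ref(struct gnttab_unmap_common *op)
    {
        /* The maptrack handle is released only once both map types
         * are gone; only then may the XSA-380 refcount be dropped. */
        bool put_handle = !(op->new_flags & (GNTMAP_device_map |
                                             GNTMAP_host_map));

        if ( put_handle )
        {
            release_maptrack_handle(op);    /* assumed helper */
            put_page(op->page);             /* the XSA-380 reference */
        }
    }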
Julien Grall [Tue, 25 Jan 2022 13:44:21 +0000 (14:44 +0100)]
xen/arm: p2m: Always clear the P2M entry when the mapping is removed
Commit 2148a125b73b ("xen/arm: Track page accessed between batch of
Set/Way operations") allowed an entry to be invalid from the CPU PoV
(lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful
to track which page is accessed and only perform an action on them
(e.g. clean & invalidate the cache after a set/way instruction).
Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when
lpae_is_valid() returns true. This means the entry will not be zeroed
if the entry was valid from Xen PoV but invalid from the CPU PoV for
tracking purpose.
As a consequence, this will allow a domain to continue to access the
page after it was removed.
Resolve the issue by always zeroing the entry if the LPAE valid bit is
set or the entry is about to be removed.
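A hedged sketch of the corrected condition in __p2m_set_entry(),
mirroring the description above (the exact code may differ):
    /* Zero the entry whenever the in-memory PTE still has its valid
     * bit set, or when the mapping is being removed - covering entries
     * that set/way tracking made invalid for the CPU but left valid
     * for Xen. */
    if ( lpae_is_valid(orig_pte) || removing_mapping )
        p2m_remove_pte(entry, p2m->clean_pte);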
Jan Beulich [Tue, 23 Nov 2021 12:33:33 +0000 (13:33 +0100)]
x86/P2M: deal with partial success of p2m_set_entry()
M2P and PoD stats need to remain in sync with P2M; if an update succeeds
only partially, respective adjustments need to be made. If updates get
made before the call, they may also need undoing upon complete failure
(i.e. including the single-page case).
Log-dirty state would better also be kept in sync.
Note that the change to set_typed_p2m_entry() may not be strictly
necessary (due to the order restriction enforced near the top of the
function), but is being kept here to be on the safe side.
This is CVE-2021-28705 and CVE-2021-28709 / XSA-389.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 74a11c43fd7e074b1f77631b446dd2115eacb9e8
master date: 2021-11-22 12:27:30 +0000
Jan Beulich [Tue, 23 Nov 2021 12:33:14 +0000 (13:33 +0100)]
x86/PoD: handle intermediate page orders in p2m_pod_cache_add()
p2m_pod_decrease_reservation() may pass pages to the function which
aren't 4k, 2M, or 1G. Handle all intermediate orders as well, to avoid
hitting the BUG() at the switch() statement's "default" case.
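Conceptually the fix amounts to the following hedged sketch (the PoD
cache field names are as in Xen, but the code is simplified):
    /* Split the incoming 2^order pages into 2M chunks where possible,
     * and 4k singles for any remainder, instead of BUG()ing on orders
     * other than 4k/2M/1G. */
    unsigned long i = 0, nr = 1UL << order;

    for ( ; i + (1UL << PAGE_ORDER_2M) <= nr; i += 1UL << PAGE_ORDER_2M )
        page_list_add_tail(page + i, &p2m->pod.super);
    for ( ; i < nr; i++ )
        page_list_add_tail(page + i, &p2m->pod.single);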
This is CVE-2021-28708 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ec13f68e0b026863d23e7f44f252d06478bc809
master date: 2021-11-22 12:27:30 +0000
Jan Beulich [Tue, 23 Nov 2021 12:32:54 +0000 (13:32 +0100)]
x86/PoD: deal with misaligned GFNs
Users of XENMEM_decrease_reservation and XENMEM_populate_physmap aren't
required to pass in order-aligned GFN values. (While I consider this
bogus, I don't think we can fix this there, as that might break existing
code, e.g. Linux's swiotlb, which - while affecting PV only - until
recently had been enforcing only page alignment on the original
allocation.) Only non-PoD code paths (guest_physmap_{add,remove}_page(),
p2m_set_entry()) look to be dealing with this properly (in part by being
implemented inefficiently, handling every 4k page separately).
Introduce wrappers taking care of splitting the incoming request into
aligned chunks, without putting much effort in trying to determine the
largest possible chunk at every iteration.
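A hedged sketch of such a wrapper (op_one_chunk() stands in for the
real worker; a real implementation would also cap the order at the
maximum the worker supports):
    static int op_in_aligned_chunks(struct domain *d, unsigned long gfn,
                                    unsigned long count)
    {
        int rc = 0;

        while ( count )
        {
            /* Largest order both the gfn's alignment and the
             * remaining count permit; no attempt to merge beyond. */
            unsigned int order = find_first_set_bit(gfn | count);

            rc = op_one_chunk(d, gfn, order);
            if ( rc )
                break;
            gfn += 1UL << order;
            count -= 1UL << order;
        }

        return rc;
    }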
Also "handle" p2m_set_entry() failure for non-order-0 requests by
crashing the domain in one more place. Alongside putting a log message
there, also add one to the other similar path.
Note regarding locking: This is left in the actual worker functions on
the assumption that callers aren't guaranteed atomicity wrt acting on
multiple pages at a time. For mis-aligned GFNs gfn_lock() wouldn't have
locked the correct GFN range anyway, if it didn't simply resolve to
p2m_lock(), and for well-behaved callers there continues to be only a
single iteration, i.e. behavior is unchanged for them. (FTAOD pulling
out just pod_lock() into p2m_pod_decrease_reservation() would result in
a lock order violation.)
This is CVE-2021-28704 and CVE-2021-28707 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 182c737b9ba540ebceb1433f3940fbed6eac4ea9
master date: 2021-11-22 12:27:30 +0000
Julien Grall [Tue, 23 Nov 2021 12:32:26 +0000 (13:32 +0100)]
xen/page_alloc: Harden assign_pages()
domain_tot_pages() and d->max_pages are 32-bit values. While the order
should always be quite small, it would still be possible to overflow
if domain_tot_pages() is near (2^32 - 1).
As this code may be called by a guest via XENMEM_increase_reservation
and XENMEM_populate_physmap, we want to make sure the guest is not going
to be able to allocate more than it is allowed.
Rework the allocation check to avoid any possible overflow. While the
check domain_tot_pages() < d->max_pages should technically not be
necessary, it is probably best to have it to catch any possible
inconsistencies in the future.
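Sketching the reworked check (hedged; the real code sits inside
assign_pages() and handles further conditions):
    static bool over_allocation(const struct domain *d, unsigned int nr)
    {
        /* Compare the request against the remaining headroom rather
         * than summing 32-bit quantities that might wrap. */
        return domain_tot_pages(d) > d->max_pages ||    /* paranoia */
               nr > d->max_pages - domain_tot_pages(d);
    }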
Jan Beulich [Fri, 1 Oct 2021 13:05:42 +0000 (15:05 +0200)]
VT-d: fix deassign of device with RMRR
Ignoring a specific error code here was not meant to short circuit
deassign to _just_ the unmapping of RMRRs. This bug was previously
hidden by the bogus (potentially indefinite) looping in
pci_release_devices(), until f591755823a7 ("IOMMU/PCI: don't let domain
cleanup continue when device de-assignment failed") fixed that loop.
This is CVE-2021-28702 / XSA-386.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Reported-by: Ivan Kardykov <kardykov@tabit.pro> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ivan Kardykov <kardykov@tabit.pro>
(cherry picked from commit 24ebe875a77833696bbe5c9372e9e1590a7e7101)
Andrew Cooper [Wed, 8 Sep 2021 15:39:42 +0000 (17:39 +0200)]
x86/PVH: Fix debug build following XSA-378 bugfix
Fixes: 8d8b4bde3e1c ("x86/PVH: de-duplicate mappings for first Mb of Dom0 memory") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 8 Sep 2021 12:57:31 +0000 (14:57 +0200)]
gnttab: deal with status frame mapping race
Once gnttab_map_frame() drops the grant table lock, the MFN it reports
back to its caller is open to other manipulation. In particular
gnttab_unpopulate_status_frames() might free it, by a racing request on
another CPU, thus resulting in a reference to a deallocated page getting
added to a domain's P2M.
Obtain a page reference in gnttab_map_frame() to prevent freeing of the
page until xenmem_add_to_physmap_one() has actually completed its acting
on the page. Do so uniformly, even if only strictly required for v2
status pages, to avoid extra conditionals (which then would all need to
be kept in sync going forward).
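In outline (a hedged sketch with error handling trimmed):
    static int pin_status_frame(struct domain *d, mfn_t mfn)
    {
        struct page_info *pg = mfn_to_page(mfn);

        /* Pin before the grant table lock is dropped, so a racing
         * gnttab_unpopulate_status_frames() cannot free the page
         * while it is still being inserted into the P2M. */
        if ( !get_page(pg, d) )
            return -EBUSY;          /* already being torn down */

        /* ... xenmem_add_to_physmap_one() acts on the MFN here ... */

        put_page(pg);               /* only now may the page go away */
        return 0;
    }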
Jan Beulich [Wed, 8 Sep 2021 12:56:47 +0000 (14:56 +0200)]
x86/p2m-pt: fix p2m_flags_to_access()
The initial if() was inverted, invalidating all output from this
function. Which in turn means the mirroring of P2M mappings into the
IOMMU didn't always work as intended: Mappings may have got updated when
there was no need to. There would not have been too few (un)mappings;
what saves us is that alongside the flags comparison MFNs also get
compared, with non-present entries always having an MFN of 0 or
INVALID_MFN while present entries always have MFNs different from these
two (0 in the table is also meant to cover INVALID_MFN):
         OLD                     NEW
   P W access  MFN         P W access  MFN
   0 0   r      0          0 0   n      0
   0 1   rw     0          0 1   n      0
   1 0   n    non-0        1 0   r    non-0
   1 1   n    non-0        1 1   rw   non-0
present <-> non-present transitions are fine because the MFNs differ.
present -> present transitions as well as non-present -> non-present
ones are potentially causing too many map/unmap operations, but never
too few, because in that case old (bogus) and new access differ.
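For reference, the corrected helper boils down to the following hedged
sketch (the real function may carry extra cases):
    static p2m_access_t p2m_flags_to_access(unsigned int flags)
    {
        /* Non-present entries must yield "no access"; only for
         * present ones are the R/W bits meaningful. The pre-fix code
         * had this condition inverted. */
        if ( !(flags & _PAGE_PRESENT) )
            return p2m_access_n;

        return flags & _PAGE_RW ? p2m_access_rw : p2m_access_r;
    }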
Fixes: d1bb6c97c31e ("IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e70a9a043a5ce6d4025420f729bc473f711bf5d1
master date: 2021-09-07 14:24:49 +0200
Jan Beulich [Wed, 8 Sep 2021 12:56:16 +0000 (14:56 +0200)]
x86/P2M: relax guarding of MMIO entries
One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. At least in
the case of PVH Dom0 hitting an RMRR covered by an E820 ACPI region,
this is too strict. Generally short-circuit requests establishing the
same kind of mapping (mfn, type), but allow permissions to differ.
While there, also add a log message to the other domain_crash()
invocation that did prevent PVH Dom0 from coming up after the XSA-378
changes.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 111469cc7b3f586c2335e70205320ed3c828b89e
master date: 2021-09-07 09:39:38 +0200
Jan Beulich [Wed, 8 Sep 2021 12:55:46 +0000 (14:55 +0200)]
x86/PVH: de-duplicate mappings for first Mb of Dom0 memory
One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. This means we
need to be more careful about the mappings put in place in this range -
mappings should be created exactly once:
- iommu_hwdom_init() comes first; it should avoid the first Mb,
- pvh_populate_p2m() should insert identity mappings only into ranges
not populated as RAM,
- pvh_setup_acpi() should again avoid the first Mb, which was already
dealt with at that point.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6b4f6a31ace125d658a581e8d10809e4fccdc272
master date: 2021-08-31 17:43:36 +0200
Jan Beulich [Wed, 8 Sep 2021 12:55:00 +0000 (14:55 +0200)]
gnttab: avoid triggering assertion in radix_tree_ulong_to_ptr()
Relevant quotes from the C11 standard:
"Except where explicitly stated otherwise, for the purposes of this
subclause unnamed members of objects of structure and union type do not
participate in initialization. Unnamed members of structure objects
have indeterminate value even after initialization."
"If there are fewer initializers in a brace-enclosed list than there are
elements or members of an aggregate, [...], the remainder of the
aggregate shall be initialized implicitly the same as objects that have
static storage duration."
"If an object that has static or thread storage duration is not
initialized explicitly, then:
[...]
— if it is an aggregate, every member is initialized (recursively)
according to these rules, and any padding is initialized to zero
bits;
[...]"
"A bit-field declaration with no declarator, but only a colon and a
width, indicates an unnamed bit-field." Footnote: "An unnamed bit-field
structure member is useful for padding to conform to externally imposed
layouts."
"There may be unnamed padding within a structure object, but not at its
beginning."
Which makes me conclude:
- Whether an unnamed bit-field member is an unnamed member or padding is
unclear, and hence also whether the last quote above would render the
big endian case of the structure declaration invalid.
- Whether the number of members of an aggregate includes unnamed ones is
also not really clear.
- The initializer in map_grant_ref() initializes all fields of the "cnt"
sub-structure of the union, so assuming the second quote above applies
here (indirectly), the compiler isn't required to implicitly
initialize the rest (i.e. in particular any padding) like would happen
for static storage duration objects.
Gcc 7.4.1 can be observed (apparently in debug builds only) to translate
aforementioned initializer to a read-modify-write operation of a stack
variable, leaving unchanged the top two bits of whatever was previously
in that stack slot. Clearly if either of the two bits were set,
radix_tree_ulong_to_ptr()'s assertion would trigger.
Therefore, to be on the safe side, add an explicit padding field for the
non-big-endian-bitfields case and give a dummy name to both padding
fields.
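Illustratively, the shape of the fix (a hedged sketch, not the
verbatim structure):
    union maptrack_node {
        struct {
    #ifdef __BIG_ENDIAN_BITFIELD
            /* The radix tree tag bits, now named so brace-enclosed
             * initializers are guaranteed to cover them. */
            unsigned long pad : 2;
    #endif
            unsigned long rd : BITS_PER_LONG / 2 - 1;
            unsigned long wr : BITS_PER_LONG / 2 - 1;
    #ifndef __BIG_ENDIAN_BITFIELD
            unsigned long pad : 2;    /* previously left implicit */
    #endif
        } cnt;
        unsigned long raw;
    };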
Fixes: 9781b51efde2 ("gnttab: replace mapkind()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6da9d0414d69c2682214ee3ecf9816fcac500d0
master date: 2021-08-27 10:54:46 +0200
Juergen Gross [Wed, 25 Aug 2021 13:28:49 +0000 (15:28 +0200)]
xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
With smt=0, during a suspend/resume cycle of the machine the threads
which had been parked before will briefly come up again. This can
result in problems, e.g. with the cpufreq driver being active, as this
will call into get_cpu_idle_time() for a cpu without initialized
scheduler data.
Fix that by letting get_cpu_idle_time() deal with this case. Drop a
redundant check in exchange.
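The gist, as a hedged sketch (field and helper names assumed):
    uint64_t get_cpu_idle_time_sketch(unsigned int cpu)
    {
        struct vcpu *v = idle_vcpu[cpu];

        /* A parked thread briefly coming up has no scheduler data
         * yet; report zero idle time rather than dereferencing it. */
        if ( !v || !get_sched_res(cpu) )
            return 0;

        return v->runstate.time[RUNSTATE_running];
    }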
Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 5293470a77ad980dce2af9b7e6c3f11eeebf1b64
master date: 2021-08-19 13:38:31 +0200
Jan Beulich [Wed, 25 Aug 2021 13:28:33 +0000 (15:28 +0200)]
VT-d: Tylersburg errata apply to further steppings
While for 5500 and 5520 chipsets only B3 and C2 are mentioned in the
spec update, X58's also mentions B2, and searching the internet suggests
systems with this stepping are actually in use. Even worse, for X58
erratum #69 is marked applicable even to C2. Split the check to cover
all applicable steppings and to also report applicable errata numbers in
the log message. The splitting requires using the DMI port instead of
the System Management Registers device, but that's then in line (also
revision checking wise) with the spec updates.
Fixes: 6890cebc6a98 ("VT-d: deal with 5500/5520/X58 errata") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 517a90d1ca09ce00e50d46ac25566cc3bd2eb34d
master date: 2021-08-18 09:44:14 +0200
Dario Faggioli [Wed, 25 Aug 2021 13:28:13 +0000 (15:28 +0200)]
credit2: avoid picking a spurious idle unit when caps are used
Commit 07b0eb5d0ef0 ("credit2: make sure we pick a runnable unit from the
runq if there is one") did not fix completely the problem of potentially
selecting a scheduling unit that will then not be able to run.
In fact, in case caps are used and the unit we are currently looking
at, during the runqueue scan, does not have enough budget to run, we
should continue looking instead of giving up and picking the idle
unit.
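Schematically (a hedged sketch; svc, snext, rqd and the helpers are
assumed names for the scan's local state):
    list_for_each_entry ( svc, &rqd->runq, runq_elem )
    {
        /* A capped unit that is out of budget is merely skipped;
         * treating it as the end of the scan is what led to picking
         * the idle unit spuriously. */
        if ( has_cap(svc) && !unit_grab_budget(svc) )
            continue;

        snext = svc;                 /* runnable candidate found */
        break;
    }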
Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0f742839ae57e10687e7a573070c37430f31068c
master date: 2021-08-10 09:29:10 +0200
Jane Malalane [Wed, 25 Aug 2021 13:27:58 +0000 (15:27 +0200)]
xen/lib: Fix strcmp() and strncmp()
The C standard requires that each character be compared as unsigned
char. Xen's current behaviour compares as signed char, which changes
the answer when chars with a value greater than 0x7f are used.
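The fix reduces to comparing through unsigned char, e.g. (a hedged
sketch mirroring the common implementation):
    int strcmp(const char *cs, const char *ct)
    {
        unsigned char c1, c2;

        do {
            c1 = *cs++;
            c2 = *ct++;
            if ( c1 != c2 )
                return c1 < c2 ? -1 : 1;    /* unsigned comparison */
        } while ( c1 );

        return 0;
    }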
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
master commit: 3747a2bb67daa5a8baeff6cda57dc98a5ef79c3e
master date: 2021-07-30 10:52:46 +0100