Andrew Cooper [Thu, 9 Jun 2022 13:50:16 +0000 (15:50 +0200)]
x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
6. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 6.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
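As a standalone illustration (a hedged sketch, not the Xen code; all names
here are mine), the pattern above reduces to a compare-exchange that cannot
distinguish "unchanged" from "changed and changed back":

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_ulong type_info = 0;                 /* count 0, writable type */
        unsigned long x = atomic_load(&type_info);  /* CPU A reads 'x' */

        atomic_fetch_add(&type_info, 1);            /* CPU B: steps 3-5 */
        atomic_fetch_sub(&type_info, 1);

        /* CPU A: succeeds even though the page was used in the meantime. */
        if ( atomic_compare_exchange_strong(&type_info, &x, x + 1) )
            printf("cmpxchg succeeded despite the intervening use\n");

        return 0;
    }

This is why the safety actions have to come after the successful cmpxchg():
only then is the window for a concurrent user closed.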
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 8cc5036bc385112a82f1faff27a0970e6440dfed
master date: 2022-06-09 14:21:04 +0200
Andrew Cooper [Thu, 9 Jun 2022 13:49:50 +0000 (15:49 +0200)]
x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop from being infinite is the compiler barrier hidden in cpu_relax()
(see the sketch below).
* Replace one page_get_owner(page) with 'd', which is already calculated and
in scope.
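For illustration, the wait-loop shape being described is roughly the
following (a sketch, not the Xen source; PGT_validated here is a stand-in
bit):

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
    #define PGT_validated  (1UL << 0)        /* stand-in for the real flag */

    static unsigned long type_info;

    static void wait_for_validation(void)
    {
        /*
         * Without a volatile read (or the compiler barrier hidden in
         * cpu_relax()), the compiler may hoist the load out of the loop
         * and spin forever on a stale value.
         */
        while ( !(ACCESS_ONCE(type_info) & PGT_validated) )
            ;
    }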
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9186e96b199e4f7e52e033b238f9fe869afb69c7
master date: 2022-06-09 14:20:36 +0200
Track whether symbols belong to ignored sections in order to avoid
applying relocations referencing those symbols. The address of such
symbols won't be resolved and thus the relocation will likely fail or
write garbage to the destination.
Return an error in that case, as leaving unresolved relocations would
lead to malfunctioning payload code.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 9120b5737f517fe9d2a3936c38d3a2211630323b
master date: 2022-04-08 10:27:11 +0200
A side effect of ignoring such sections is that symbols belonging to
them won't be resolved, and that could make relocations belonging to
other sections that reference those symbols fail.
For example, it is common to have an empty .altinstr_replacement section
with symbols pointing into it; marking the section as ignored will prevent
those symbols from being resolved, which in turn will cause any relocations
against them to fail.
In order to solve this, do not ignore sections with 0 size; only ignore
sections that don't have the SHF_ALLOC flag set.
Special case such empty sections in move_payload so they are not taken
into account in order to decide whether a livepatch can be safely
re-applied after a revert.
Fixes: 98b728a7b2 ('livepatch: Disallow applying after an revert')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Bjoern Doebel <doebel@amazon.de>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 0dc1f929e8fed681dec09ca3ea8de38202d5bf30
master date: 2022-04-08 10:24:10 +0200
Andrew Cooper [Fri, 8 Apr 2022 13:06:54 +0000 (15:06 +0200)]
x86/cpuid: Clobber CPUID leaves 0x800000{1d..20} in policies
c/s 1a914256dca5 increased the AMD max leaf from 0x8000001c to 0x80000021, but
did not adjust anything in the calculate_*_policy() chain. As a result, on
hardware supporting these leaves, we read the real hardware values into the
raw policy, then copy into host, and all the way into the PV/HVM default
policies.
All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
next by PQOS), so any software following the rules is fine and will leave them
alone. However, leaf 0x8000001d takes a subleaf input and at least two
userspace utilities have been observed to loop indefinitely under Xen (clearly
waiting for eax to report "no more cache levels").
Such userspace is buggy, but Xen's behaviour isn't great either.
In the short term, clobber all information in these leaves. This is a giant
bodge, but there are complexities with implementing all of these leaves
properly.
Fixes: 1a914256dca5 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Link: https://github.com/QubesOS/qubes-issues/issues/7392
Reported-by: fosslinux <fosslinux@aussies.space>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4012d50082c2eae2f3cbe7770be13b9227fbc3f
master date: 2022-04-07 11:36:45 +0100
Jan Beulich [Fri, 8 Apr 2022 13:06:26 +0000 (15:06 +0200)]
VT-d: avoid infinite recursion on domain_context_mapping_one() error path
Despite the comment there, infinite recursion was still possible, by
flip-flopping between two domains. This is because prev_dom is derived
from the DID found in the context entry, which was already updated by
the time error recovery is invoked. Simply introduce yet another mode
flag to prevent rolling back an in-progress roll-back of a prior
mapping attempt.
Also drop the existing recursion prevention for having been dead anyway:
Earlier in the function we already bail when prev_dom == domain.
Jan Beulich [Fri, 8 Apr 2022 13:05:56 +0000 (15:05 +0200)]
VT-d: avoid NULL deref on domain_context_mapping_one() error paths
First there's a printk() which actually wrongly uses pdev in the first
place: We want to log the coordinates of the (perhaps fake) device
acted upon, which may not be pdev.
Then it was quite pointless for eb19326a328d ("VT-d: prepare for per-
device quarantine page tables (part I)") to add a domid_t parameter to
domain_context_unmap_one(): It's only used to pass back here via
me_wifi_quirk() -> map_me_phantom_function(). Drop the parameter again.
Finally there's the invocation of domain_context_mapping_one(), which
needs to be passed the correct domain ID. Avoid taking that path when
pdev is NULL and the quarantine state is what would need restoring to.
This means we can't security-support non-PCI-Express devices with RMRRs
(if such exist in practice) any longer; note that as of the 1st of the
two commits referenced below assigning them to DomU-s is unsupported
anyway.
Fixes: 8f41e481b485 ("VT-d: re-assign devices directly")
Fixes: 14dd241aad8a ("IOMMU/x86: use per-device page tables for quarantining")
Coverity ID: 1503784
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 608394b906e71587f02e6662597bc985bad33a5a
master date: 2022-04-07 12:30:19 +0200
Jan Beulich [Fri, 8 Apr 2022 13:05:16 +0000 (15:05 +0200)]
VT-d: don't needlessly look up DID
If get_iommu_domid() in domain_context_unmap_one() fails, we had better not
clear the context entry in the first place, as we would then be unable to
issue the corresponding flush. However, there is no need to look up the DID
there at all: what needs flushing is very specifically the DID that was in
the context entry before our clearing of it.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 445ab9852d69d8957467f0036098ebec75fec092
master date: 2022-04-07 12:29:03 +0200
tools/firmware: do not add a .note.gnu.property section
Prevent the assembler from creating a .note.gnu.property section on
the output objects, as it's not useful for firmware related binaries,
and breaks the resulting rombios image.
This requires modifying the cc-option Makefile macro so it can test
assembler options (by replacing the usage of the -S flag with -c) and
also stripping the -Wa, prefix if present when checking for the test
output.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e270af94280e6a9610705ebc1fdd1d7a9b1f8a98
master date: 2022-04-04 12:30:07 +0100
Do so right in firmware/Rules.mk, like it's done for other compiler
flags.
Fixes: 3667f7f8f7 ('x86: Introduce support for CET-IBT')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7225f6e0cd3afd48b4d61c43dd8fead0f4c92193
master date: 2022-04-04 12:30:00 +0100
Jason Andryuk [Thu, 7 Apr 2022 07:11:08 +0000 (09:11 +0200)]
libxl: Re-scope qmp_proxy_spawn.ao usage
I've observed this failed assertion:
libxl_event.c:2057: libxl__ao_inprogress_gc: Assertion `ao' failed.
AFAICT, this is happening in qmp_proxy_spawn_outcome where
sdss->qmp_proxy_spawn.ao is NULL.
The out label of spawn_stub_launch_dm() calls qmp_proxy_spawn_outcome(),
but it is only in the success path that sdss->qmp_proxy_spawn.ao gets
set to the current ao.
qmp_proxy_spawn_outcome() should instead use sdss->dm.spawn.ao, which is
the already in-use ao when spawn_stub_launch_dm() is called. The same
is true for spawn_qmp_proxy().
With this, move sdss->qmp_proxy_spawn.ao initialization to
spawn_qmp_proxy() since its use is for libxl__spawn_spawn() and it can
be initialized along with the rest of sdss->qmp_proxy_spawn.
Fixes: 83c845033dc8 ("libxl: use vchan for QMP access with Linux stubdomain")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d62a34423a1a98aefd7c30e22d2d82d198f077c8
master date: 2022-04-01 17:01:57 +0100
Move dcs->console_xswait initialization into the callers of
initiate_domain_create(), namely do_domain_create() and
do_domain_soft_reset(), so it is initialized along with the other dcs state.
Jason Andryuk [Thu, 7 Apr 2022 07:10:15 +0000 (09:10 +0200)]
xl: Fix global pci options
commit babde47a3fed "introduce a 'passthrough' configuration option to
xl.cfg..." moved the pci list parsing ahead of the global pci option
parsing. This broke the global pci configuration options since they
need to be set first so that looping over the pci devices assigns their
values.
Move the global pci options ahead of the pci list to restore their
function.
Fixes: babde47a3fed ("introduce a 'passthrough' configuration option to xl.cfg...")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: e45ad0b1b0bd6a43f59aaf4a6f86d88783c630e5
master date: 2022-03-31 19:48:12 +0100
Map the PBA in order to access it from the MSI-X read and write
handlers. Note that previously the handlers would pass the physical
host address into the {read,write}{l,q} handlers, which is wrong as
those expect a linear address.
Map the PBA using ioremap when the first access is performed. Note
that 32bit arches might want to abstract the call to ioremap into a
vPCI arch handler, so they can use a fixmap range to map the PBA.
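A minimal sketch of that shape (illustrative only; msix_pba_read32 and
PBA_SIZE are made-up names, not the actual vPCI code):

    static void *pba;   /* mapped on first access */

    static uint32_t msix_pba_read32(paddr_t base, unsigned int off)
    {
        if ( !pba )
            pba = ioremap(base, PBA_SIZE);

        /* readl() must be handed a linear address, not a physical one. */
        return pba ? readl(pba + off) : ~0u;
    }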
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Alex Olson <Alex.Olson@starlab.io>
master commit: b4f21160601155762a4d014db9623af921fec959
master date: 2022-03-09 16:21:01 +0100
Lasse Collin [Thu, 7 Apr 2022 07:08:20 +0000 (09:08 +0200)]
xz: validate the value before assigning it to an enum variable
This might matter, for example, if the underlying type of enum xz_check
was a signed char. In such a case the validation wouldn't have caught an
unsupported header. I don't know if this problem can occur in the kernel
on any arch but it's still good to fix it because some people might copy
the XZ code to their own projects from Linux instead of the upstream
XZ Embedded repository.
This change may increase the code size by a few bytes. An alternative
would have been to use an unsigned int instead of enum xz_check but
using an enumeration looks cleaner.
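The essence of the change, simplified from the patch (field names as in
xz_dec_stream.c; treat this as a sketch):

    /* Validate the raw header byte first ... */
    if (s->temp.buf[HEADER_MAGIC_SIZE + 1] > XZ_CHECK_MAX)
        return XZ_OPTIONS_ERROR;

    /* ... and only then assign it to the enum-typed field. */
    s->check_type = s->temp.buf[HEADER_MAGIC_SIZE + 1];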
Link: https://lore.kernel.org/r/20211010213145.17462-3-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4f8d7abaa413
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0a21660515c24f09c4ee060ce0bb42e4b2e6b6fa
master date: 2022-03-07 09:08:54 +0100
Lasse Collin [Thu, 7 Apr 2022 07:07:43 +0000 (09:07 +0200)]
xz: avoid overlapping memcpy() with invalid input with in-place decompression
With valid files, the safety margin described in lib/decompress_unxz.c
ensures that these buffers cannot overlap. But if the uncompressed size
of the input is larger than the caller thought, which is possible when
the input file is invalid/corrupt, the buffers can overlap. Obviously
the result will then be garbage (and usually the decoder will return
an error too) but no other harm will happen when such an over-run occurs.
This change only affects uncompressed LZMA2 chunks and so this
should have no effect on performance.
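In essence (simplified; buffer names as in xz_dec_lzma2.c, so treat this as
a sketch), the in-place copy of an uncompressed chunk becomes:

    /* memmove() is well-defined for overlapping buffers; memcpy() is not. */
    memmove(b->out + b->out_pos, b->in + b->in_pos, copy_size);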
Link: https://lore.kernel.org/r/20211010213145.17462-2-xiang@kernel.org
Signed-off-by: Lasse Collin <lasse.collin@tukaani.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 83d3c4f22a36
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 10454f381f9157bce26d5db15e07e857b317b4af
master date: 2022-03-07 09:08:08 +0100
Prevent libxl from creating guests that attempt to use PoD together
with an IOMMU, even if no devices are actually assigned.
While the hypervisor could support using PoD together with an IOMMU as
long as no devices are assigned, such usage seems doubtful. There's no
guarantee that the guest will have stopped using PoD by then, and thus a
later assignment of a PCI device to such a domain could fail.
Preventing the usage of PoD together with an IOMMU at guest creation
avoids having to add checks for active PoD entries in the device
assignment paths.
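A hedged sketch of such a creation-time check (libxl-style, with approximate
field names; PoD is in use whenever target memory is below maximum memory):

    if (passthrough != LIBXL_PASSTHROUGH_DISABLED &&
        b_info->target_memkb < b_info->max_memkb) {
        LOG(ERROR, "IOMMU not supported together with PoD");
        return ERROR_INVAL;
    }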
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 07449ecfa42532495156fa342af2112e3e31dd3f
master date: 2022-02-18 09:03:08 +0100
x86/console: process softirqs between warning prints
Process softirqs while printing end of boot warnings. Each warning can
be several lines long, and on slow consoles printing multiple ones
without processing softirqs can result in the watchdog triggering, as in
the log excerpt below (a sketch of the fix follows the excerpt):
(XEN) [ 22.277806] ***************************************************
(XEN) [ 22.417802] WARNING: CONSOLE OUTPUT IS SYNCHRONOUS
(XEN) [ 22.556029] This option is intended to aid debugging of Xen by ensuring
(XEN) [ 22.696802] that all output is synchronously delivered on the serial line.
(XEN) [ 22.838024] However it can introduce SIGNIFICANT latencies and affect
(XEN) [ 22.978710] timekeeping. It is NOT recommended for production use!
(XEN) [ 23.119066] ***************************************************
(XEN) [ 23.258865] Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) [ 23.399560] enabled. Please assess your configuration and choose an
(XEN) [ 23.539925] explicit 'smt=<bool>' setting. See XSA-273.
(XEN) [ 23.678860] ***************************************************
(XEN) [ 23.818492] Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) [ 23.959811] enabled. Mitigations will not be fully effective. Please
(XEN) [ 24.100396] choose an explicit smt=<bool> setting. See XSA-297.
(XEN) [ 24.240254] *************************************************(XEN) [ 24.247302] Watchdog timer detects that CPU0 is stuck!
(XEN) [ 24.386785] ----[ Xen-4.17-unstable x86_64 debug=y Tainted: C ]----
(XEN) [ 24.527874] CPU: 0
(XEN) [ 24.662422] RIP: e008:[<ffff82d04025b84a>] drivers/char/ns16550.c#ns16550_tx_ready+0x3a/0x90
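A sketch of the fix's shape (warning_list and nr_warnings are illustrative
names; process_pending_softirqs() is the real Xen helper):

    unsigned int i;

    for ( i = 0; i < nr_warnings; i++ )
    {
        printk("%s", warning_list[i]);
        process_pending_softirqs();  /* let watchdog/timer work run */
    }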
Jan Beulich [Thu, 7 Apr 2022 07:06:00 +0000 (09:06 +0200)]
x86emul: fix VPBLENDMW with mask and memory operand
Element size for this opcode depends on EVEX.W, not the low opcode bit.
Make use of AVX512BW being a prereq to AVX512_BITALG and move the case
label there, adding an AVX512BW feature check.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eddf13b5e9401f6871dcce1ce61c80cff62079ed
master date: 2022-02-14 10:08:38 +0100
Anthony PERARD [Thu, 7 Apr 2022 07:05:23 +0000 (09:05 +0200)]
build: fix exported variable name CFLAGS_stack_boundary
Exporting a variable with a dash in its name doesn't work reliably; it may
be stripped from the environment when calling a sub-make or sub-shell.
CFLAGS-stack-boundary started to be removed from the environment in patch
"build: set ALL_OBJS in main Makefile; move prelink.o to main Makefile" when
running `make "ALL_OBJS=.."`, due to the addition of the quotes. At least in
my empirical tests.
Fixes: 2740d96efd ("xen/build: have the root Makefile generates the CFLAGS")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: aa390d513a67a6ec0a069eea7478e5ecd54a7ea6
master date: 2022-01-28 11:44:33 +0100
The size of the video memory of PVH guests should be set to 0 in case
no value has been specified.
Not doing so will leave it at -1, resulting in an additional 1 kB
of RAM being advertised in the memory map (here the output of a PVH
Mini-OS boot with 16 MB of RAM assigned):
Kevin Stefanov [Thu, 7 Apr 2022 07:03:55 +0000 (09:03 +0200)]
tools/libxl: Correctly align the ACPI tables
The memory allocator currently calculates alignment in libxl's virtual
address space, rather than guest physical address space. This results
in the FACS being commonly misaligned.
Furthermore, the allocator has several other bugs.
The opencoded align-up calculation is currently susceptible to a bug
that occurs in the corner case that the buffer is already aligned to
begin with. In that case, an align-sized memory hole is introduced.
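For illustration (not necessarily the exact expression used): an opencoded
align-up of the shape

    buf += align - (buf % align);

advances by a full 'align' bytes even when buf is already aligned, creating
exactly such a hole. The conventional form has no corner case:

    #define ALIGN_UP(x, a)  (((x) + ((a) - 1)) & ~((a) - 1))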
The while loop is dead logic because its effects are entirely and
unconditionally overwritten immediately after it.
Rework the memory allocator to align in guest physical address space
instead of libxl's virtual memory and improve the calculation, drop the
errant extra page in the buffer allocated for the ACPI tables, and give
some of the variables better names/types.
Fixes: 14c0d328da2b ("libxl/acpi: Build ACPI tables for HVMlite guests")
Signed-off-by: Kevin Stefanov <kevin.stefanov@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Jackson <iwj@xenproject.org>
master commit: dd6c062a7a4abdb662c18af03d1396325969d155
master date: 2021-09-24 11:07:50 +0100
Jan Beulich [Tue, 5 Apr 2022 13:14:50 +0000 (15:14 +0200)]
IOMMU/x86: use per-device page tables for quarantining
Devices with RMRRs / unity mapped regions, due to it being unspecified
how/when these memory regions may be accessed, may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to have mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.
This is CVE-2022-26361 / part of XSA-400.
While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.
Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.
In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 14dd241aad8af447680ac73e8579990e2c09c1e7
master date: 2022-04-05 14:24:18 +0200
Jan Beulich [Tue, 5 Apr 2022 13:14:15 +0000 (15:14 +0200)]
IOMMU/x86: drop TLB flushes from quarantine_init() hooks
The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: The warning in get_iommu_did() would
trigger.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 54c5cef49239e2f27ec3b3fc8804bf57aa4bf46d
master date: 2022-04-05 14:19:42 +0200
Jan Beulich [Tue, 5 Apr 2022 13:13:43 +0000 (15:13 +0200)]
IOMMU/x86: maintain a per-device pseudo domain ID
In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.
Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.
Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.
The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped while
still avoiding to add a new parameter to domain_context_unmap().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 97af062b89d52c0ecf7af254b53345c97d438e33
master date: 2022-04-05 14:19:10 +0200
Jan Beulich [Tue, 5 Apr 2022 13:13:20 +0000 (15:13 +0200)]
VT-d: prepare for per-device quarantine page tables (part II)
Replace the passing of struct domain * by domid_t in preparation of
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.
No functional change intended (except for slightly adjusted log message
text).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7131163c4806e3c7de24873164d1a003d2a27dee
master date: 2022-04-05 14:18:48 +0200
Jan Beulich [Tue, 5 Apr 2022 13:13:02 +0000 (15:13 +0200)]
VT-d: prepare for per-device quarantine page tables (part I)
Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: eb19326a328d49a6a4dc3930391b340f3bcd8948
master date: 2022-04-05 14:18:26 +0200
Jan Beulich [Tue, 5 Apr 2022 13:12:46 +0000 (15:12 +0200)]
AMD/IOMMU: re-assign devices directly
Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, may not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.
This is CVE-2022-26360 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges. The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, availability of which we can't take for granted.
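Illustrative only (not the Xen code): with GCC and -mcx16, a 16-byte
compare-and-swap of the kind needed here can be expressed as:

    #include <stdbool.h>

    typedef unsigned __int128 dte128_t;   /* a 128-bit device table entry */

    static bool update_dte_atomic(volatile dte128_t *dte,
                                  dte128_t old, dte128_t new)
    {
        /* Compiles to CMPXCHG16B when available (-mcx16). */
        return __sync_bool_compare_and_swap(dte, old, new);
    }

Without CMPXCHG16B no such single-instruction update exists, which is why
its availability has to be checked at runtime.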
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1fa6e9aa36233fe9c29a204fcb2697e985b8345f
master date: 2022-04-05 14:18:04 +0200
Jan Beulich [Tue, 5 Apr 2022 13:12:28 +0000 (15:12 +0200)]
VT-d: re-assign devices directly
Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.
This is CVE-2022-26359 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for granted.
The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need fiddling with.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8f41e481b4852173909363b88c1ab3da747d3a05
master date: 2022-04-05 14:17:42 +0200
Jan Beulich [Tue, 5 Apr 2022 13:12:11 +0000 (15:12 +0200)]
VT-d: drop ownership checking from domain_context_mapping_one()
Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.
Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a680b8134b2d1828bbbf443a97feea66e8a85c75
master date: 2022-04-05 14:17:21 +0200
Jan Beulich [Tue, 5 Apr 2022 13:10:10 +0000 (15:10 +0200)]
VT-d: fix add/remove ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.
Also switch to %pd in related log messages.
Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR")
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3221f270cf2eba0a22fd4f92319d664eacb92889
master date: 2022-04-05 14:16:10 +0200
Jan Beulich [Tue, 5 Apr 2022 13:09:48 +0000 (15:09 +0200)]
VT-d: fix (de)assign ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.
Also adjust a related log message.
This is CVE-2022-26358 / part of XSA-400.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 78a40f8b5dfa1a3aec43528663f39473d4429101
master date: 2022-04-05 14:15:33 +0200
Jan Beulich [Tue, 5 Apr 2022 13:09:28 +0000 (15:09 +0200)]
VT-d: correct ordering of operations in cleanup_domid_map()
The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so needs to
play safe wrt other accesses to domid_map[] and domid_bitmap[]. This is
to avoid context_set_domain_id()'s writing of domid_map[] being reset to
zero right away when it races the freeing of a DID.
For the interaction with context_set_domain_id() and did_to_domain_id()
see the code comment.
{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.
domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.
x86/hap: do not switch on log dirty for VRAM tracking
XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.
This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:
Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]----
CPU: 34
RIP: e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6
RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
[<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
[<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
[<ffff82d04031577f>] F paging_domctl+0x251/0xd41
[<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
[<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
[<ffff82d0403a729d>] F lstar_enter+0x12d/0x140
Such an assert triggered because the stubdom used
XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty became enabled while
retiring the old structures, thus leading to new entries being
populated in already-cleared slots.
Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.
Note that paging_log_dirty_range (used to get the dirty bitmap for
VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn on the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.
This is CVE-2022-26356 / XSA-397.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4f4db53784d912c4f409a451c36ebfd4754e0a42
master date: 2022-04-05 14:11:30 +0200
Jan Beulich [Mon, 4 Apr 2022 13:58:04 +0000 (15:58 +0200)]
livepatch: account for patch offset when applying NOP patch
While not triggered by the trivial xen_nop in-tree patch on
staging/master, that patch exposes a problem on the stable trees, where
all functions have ENDBR inserted. When NOP-ing out a range, we need to
account for this. Handle this right in livepatch_insn_len().
This requires livepatch_insn_len() to be called _after_ ->patch_offset
was set.
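The resulting shape is roughly (a sketch of the logic described, not
necessarily the exact code):

    static inline unsigned int
    livepatch_insn_len(const struct livepatch_func *func)
    {
        if ( !func->new_addr )   /* NOP-ing out a range */
            return func->new_size - func->patch_offset;

        return ARCH_PATCH_INSN_SIZE;   /* the 5-byte JMP on x86 */
    }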
Fixes: 6974c75180f1 ("xen/x86: Livepatch: support patching CET-enhanced functions")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8a87b9a0fb0564f9d68f0be0a0d1a17c34117b8b
master date: 2022-03-31 10:45:46 +0200
Bjoern Doebel [Wed, 9 Mar 2022 15:22:03 +0000 (16:22 +0100)]
livepatch: resolve old address before function verification
When verifying that a livepatch can be applied, we may as well want to
inspect the target function to be patched. To do so, we need to resolve
this function's address before running the arch-specific
livepatch_verify hook.
Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
(cherry picked from commit 5142dc5c25e317c208e3dc16d16b664b9f05dab5)
Andrew Cooper [Mon, 28 Feb 2022 19:31:00 +0000 (19:31 +0000)]
x86/cet: Remove XEN_SHSTK's dependency on EXPERT
CET-SS hardware is now available from multiple vendors, the feature has
downstream users, and was declared security supported in XSA-398.
Enable it by default.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit fc90d75c2b71ae15b75128e7d0d4dbe718164ecb)
Bjoern Doebel [Thu, 10 Mar 2022 07:35:36 +0000 (07:35 +0000)]
xen/x86: Livepatch: support patching CET-enhanced functions
Xen enabled CET for supporting architectures. The control flow aspect of
CET require functions that can be called indirectly (i.e., via function
pointers) to start with an ENDBR64 instruction. Otherwise a control flow
exception is raised.
This expectation breaks livepatching flows because we patch functions by
overwriting their first 5 bytes with a JMP + <offset>, thus breaking the
ENDBR64. We fix this by checking the start of a patched function for
being ENDBR64. In the positive case we move the livepatch JMP to start
behind the ENDBR64 instruction.
To avoid having to guess the ENDBR64 offset again on patch reversal
(which might race with other mechanisms adding/removing ENDBR
dynamically), use the livepatch metadata to store the computed offset
along with the saved bytes of the overwritten function.
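Sketched, the apply-side logic is (simplified; ENDBR64_LEN assumed to be
defined as the instruction's 4-byte length):

    if ( is_endbr64(func->old_addr) )
        func->patch_offset += ENDBR64_LEN;

    /*
     * The 5-byte JMP is then written at old_addr + patch_offset, and
     * patch_offset is stored with the saved bytes for use on revert.
     */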
Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Tested-by: Jiamei Xie <jiamei.xie@arm.com>
(cherry picked from commit 6974c75180f1aad44e5428eabf2396b2b50fb0e4)
Note: For backports to 4.14 thru 4.16, there is no endbr-clobbering, hence no
is_endbr64_poison() logic.
Andrew Cooper [Tue, 15 Mar 2022 12:07:18 +0000 (12:07 +0000)]
x86/cet: Remove writeable mapping of the BSPs shadow stack
An unintended consequence of the BSP using cpu0_stack[] is that writeable
mappings to the BSPs shadow stacks are retained in the bss. This renders
CET-SS almost useless, as an attacker can update both return addresses and the
ret will not fault.
We specifically don't want to shatter the superpage mapping .data and .bss, so
the only way to fix this is to not have the BSP stack in the main Xen image.
Break cpu_alloc_stack() out of cpu_smpboot_alloc(), and dynamically allocate
the BSP stack as early as reasonable in __start_xen(). As a consequence,
there is no need to delay the BSP's memguard_guard_stack() call.
Copy the top of cpu info block just before switching to use the new stack.
Fix a latent bug by setting %rsp to info->guest_cpu_user_regs rather than
->es; this would be buggy if reinit_bsp_stack() called schedule() (which
rewrites the GPR block) directly, but luckily it doesn't.
Finally, move cpu0_stack[] into .init, so it can be reclaimed after boot.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37786b23b027ab83051175cb8ce9ac86cacfc58e)
Andrew Cooper [Mon, 14 Mar 2022 10:30:46 +0000 (10:30 +0000)]
x86/cet: Clear IST supervisor token busy bits on S3 resume
Stacks are not freed across S3. Execution just stops, leaving supervisor
token busy bits active. Fixing this for the primary shadow stack was done
previously, but there is a (rare) risk that an IST token is left busy too, if
the platform power-off happens to intersect with an NMI/#MC arriving. This
will manifest as #DF next time the IST vector gets used.
Introduce rdssp() and wrss() helpers in a new shstk.h, cleaning up
fixup_exception_return() and explaining the trick with the literal 1.
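For reference, rdssp() has roughly this shape (a sketch per the
description): RDSSP leaves its operand unmodified when shadow stacks are
inactive, so seeding the register with 1 (never a valid, 8-byte-aligned
SSP) lets callers tell "no shadow stack" apart from a real pointer.

    static inline unsigned long rdssp(void)
    {
        unsigned long ssp;

        asm volatile ( "rdsspq %0" : "=r" (ssp) : "0" (1) );

        return ssp;   /* still 1 if CET-SS is inactive */
    }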
Then use this infrastructure to rewrite the IST tokens in load_system_tables()
when all the other IST details are being set up. In the case that an IST
token were left busy across S3, this will clear the busy bit before the stack
gets used.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e421ed0f68488863599532bda575c03c33cde0e0)
Andrew Cooper [Mon, 7 Mar 2022 20:19:18 +0000 (20:19 +0000)]
x86/kexec: Fix kexec-reboot with CET active
The kexec_reloc() asm has an indirect jump to relocate onto the identity
trampoline. While we clear CET in machine_crash_shutdown(), we fail to clear
CET for the non-crash path. This in turn highlights that the same is true of
resetting the CPUID masking/faulting.
Move both pieces of logic from machine_crash_shutdown() to machine_kexec(),
the latter being common for all kexec transitions. Adjust the condition for
CET being considered active to check in CR4, which is simpler and more robust.
Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT")
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in context_switch()")
Reported-by: David Vrabel <dvrabel@amazon.co.uk>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: David Vrabel <dvrabel@amazon.co.uk>
(cherry picked from commit 7f5b2448bd724f5f24426b2595a9bdceb1e5a346)
Andrew Cooper [Mon, 28 Feb 2022 19:26:37 +0000 (19:26 +0000)]
x86/spec-ctrl: Disable retpolines with CET-IBT
CET-IBT depends on executing indirect branches for protections to apply.
Extend the clobber for CET-SS to all of CET.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6e3f36387de566b09aa4145ea0e3bfe4814d68b4)
and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:
(XEN) d1v0 Unexpected vmexit: reason 3
(XEN) domain_crash called from vmx.c:4304
(XEN) Domain 1 (vcpu#0) crashed on cpu#0:
The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU, and will be addressed in a subsequent patch. It is a
consequence of the APs triple faulting.
The reason the APs triple fault is because we don't tear down the stacks on
suspend. The idle/play_dead loop is killed in the middle of running, meaning
that the supervisor token is left busy.
On resume, SETSSBSY finds busy bit set, suffers #CP and triple faults because
the IDT isn't configured this early.
Rework the AP bring-up path to (re)create the supervisor token. This ensures
the primary stack is non-busy before use.
Note: There are potential issues with the IST shadow stacks too, but fixing
those is more involved.
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: https://github.com/QubesOS/qubes-issues/issues/7283
Reported-by: Thiner Logoer <logoerthiner1@163.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Thiner Logoer <logoerthiner1@163.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7d9589239ec068c944190408b9838774d5ec1f8f)
Andrew Cooper [Mon, 1 Nov 2021 15:17:20 +0000 (15:17 +0000)]
x86: Enable CET Indirect Branch Tracking
With all the pieces now in place, turn CET-IBT on when available.
MSR_S_CET, like SMEP/SMAP, controls Ring1 meaning that ENDBR_EN can't be
enabled for Xen independently of PV32 kernels. As we already disable PV32 for
CET-SS, extend this to all CET, adjusting the documentation/comments as
appropriate.
Introduce a cet=no-ibt command line option to allow the admin to disable IBT
even when everything else is configured correctly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit cdbe2b0a1aecae946639ee080f14831429b184b6)
Andrew Cooper [Mon, 1 Nov 2021 21:54:26 +0000 (21:54 +0000)]
x86/EFI: Disable CET-IBT around Runtime Services calls
UEFI Runtime services, at the time of writing, aren't CET-IBT compatible.
Work is ongoing to address this. In the meantime, unconditionally disable IBT.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d37a8a067e62e3b6709d224c22f740fdda9d0078)
Andrew Cooper [Mon, 1 Nov 2021 16:13:29 +0000 (16:13 +0000)]
x86/setup: Rework MSR_S_CET handling for CET-IBT
CET-SS and CET-IBT can be independently controlled, so the configuration of
MSR_S_CET can't be constant any more.
Introduce xen_msr_s_cet_value(), mostly because I don't fancy
writing/maintaining that logic in assembly. Use this in the 3 paths which
alter MSR_S_CET when both features are potentially active.
To activate CET-IBT, we only need CR4.CET and MSR_S_CET.ENDBR_EN. This is
common with the CET-SS setup, so reorder the operations to set up CR4 and
MSR_S_CET for any nonzero result from xen_msr_s_cet_value(), and set up
MSR_PL0_SSP and SSP if SHSTK_EN was also set.
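The helper's logic is essentially the following (per the description above;
the cpu_has_* and CET_* names are assumptions, not quoted code):

    static unsigned int xen_msr_s_cet_value(void)
    {
        return ((cpu_has_xen_ibt   ? CET_ENDBR_EN : 0) |
                (cpu_has_xen_shstk ? CET_SHSTK_EN : 0));
    }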
Adjust the crash path to disable CET-IBT too.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 311434bfc9d10615adbd340d7fb08c05cd14f4c7)
Andrew Cooper [Mon, 1 Nov 2021 17:08:24 +0000 (17:08 +0000)]
x86/entry: Make IDT entrypoints CET-IBT compatible
Each IDT vector needs to land on an endbr64 instruction. This is especially
important for the #CP handler, which will recurse indefinitely if the endbr64
is missing, eventually escalating to #DF if guard pages are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e702e36d1d519f4b66086650c1c47d6bac96d4b9)
Also include the continue_pv_domain() change from c/s 954bb07fdb5fad which is
also in entry.S
Andrew Cooper [Mon, 1 Nov 2021 09:51:16 +0000 (09:51 +0000)]
x86/entry: Make syscall/sysenter entrypoints CET-IBT compatible
Each of MSR_{L,C}STAR and MSR_SYSENTER_EIP need to land on an endbr64
instruction. For sysenter, this is easy.
Unfortunately for syscall, the stubs are already 29 bytes long with a limit of
32. endbr64 is 4 bytes. Luckily, there is a 1 byte instruction which can
move from the stubs into the main handlers.
Move the push %rax out of the stub and into {l,c}star_entry(), allowing room
for the endbr64 instruction when appropriate. Update the comment describing
the entry state.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 17d77ec62a299f4299883ec79ab10cacafd0b2f5)
Andrew Cooper [Mon, 1 Nov 2021 10:09:59 +0000 (10:09 +0000)]
x86/emul: Update emulation stubs to be CET-IBT compatible
All indirect branches need to land on an endbr64 instruction.
For stub_selftests(), use endbr64 unconditionally for simplicity. For ioport
and instruction emulation, add endbr64 conditionally.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0d101568d29e8b4bfd33f20031fedec2652aa0cf)
Andrew Cooper [Fri, 26 Nov 2021 15:34:08 +0000 (15:34 +0000)]
x86: Introduce helpers/checks for endbr64 instructions
... to prevent the optimiser creating unsafe code. See the code comment for
full details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4046ba97446e3974a4411db227263a9f11e0aeb4)
Note: For the backport to 4.14 thru 4.16, we don't care for embedded endbr64
specifically, but place_endbr64() is a prerequisite for other parts of
the series.
Andrew Cooper [Mon, 1 Nov 2021 12:36:33 +0000 (12:36 +0000)]
x86/traps: Rework write_stub_trampoline() to not hardcode the jmp
For CET-IBT, we will need to optionally insert an endbr64 instruction at the
start of the stub. Don't hardcode the jmp displacement assuming that it
starts at byte 24 of the stub.
Also add extra comments describing what is going on. The mix of %rax and %rsp
is far from trivial to follow.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 809beac3e7fdfd20000386453c64a1e2a3d93075)
Andrew Cooper [Mon, 1 Nov 2021 10:17:59 +0000 (10:17 +0000)]
x86/alternatives: Clear CR4.CET when clearing CR0.WP
This allows us to have CET active much earlier in boot.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 48cdc15a424f9fadad7f9aed00e7dc8ef16a2196)
Andrew Cooper [Mon, 1 Nov 2021 10:19:57 +0000 (10:19 +0000)]
x86/setup: Read CR4 earlier in __start_xen()
This is necessary for read_cr4() to function correctly. Move the EFER caching
at the same time.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9851bc4939101828d2ad7634b93c0d9ccaef5b7e)
Andrew Cooper [Thu, 21 Oct 2021 17:38:50 +0000 (18:38 +0100)]
x86: Introduce support for CET-IBT
CET Indirect Branch Tracking is a hardware feature designed to provide
forward-edge control flow integrity, protecting against jump/call oriented
programming.
IBT requires the placement of endbr{32,64} instructions at the target of every
indirect call/jmp, and every entrypoint.
It is necessary to check for both compiler and assembler support, as the
notrack prefix can be emitted in certain cases.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3667f7f8f7c471e94e58cf35a95f09a0fe5c1290)
Note: For backports to 4.14 thru 4.16, we are deliberately not using
-mmanual-endbr as done in staging, as an intermediate approach which
is not too invasive to backport.
x86/cet: Force -fno-jump-tables for CET-IBT
Both GCC and Clang have a (mis)feature where, even with
-fcf-protection=branch, jump tables are created using a notrack jump rather
than using endbr's in each case statement.
This is incompatible with the safety properties we want in Xen, and enforced
by not setting MSR_S_CET.NOTRACK_EN. The consequence is a fatal #CP[endbr].
-fno-jump-tables is generally active as a side effect of
CONFIG_INDIRECT_THUNK (retpoline), but as of c/s 95d9ab461436 ("x86/Kconfig:
introduce option to select retpoline usage"), we explicitly support turning
retpoline off.
Fixes: 3667f7f8f7c4 ("x86: Introduce support for CET-IBT")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9d4a44380d273de22d5753883cbf5581795ff24d)
Andrew Cooper [Mon, 7 Mar 2022 16:35:52 +0000 (16:35 +0000)]
x86/spec-ctrl: Cease using thunk=lfence on AMD
AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe. AMD are recommending using retpoline everywhere.
Retpoline is incompatible with CET. All CET-capable hardware has efficient
IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
STIBP for consistency's sake).
This is a logical change on AMD, but not on Intel as the default calculations
would end up with these settings anyway. Leave behind a message if IBRS is
found to be missing.
Also update the default heuristics to never select THUNK_LFENCE. This causes
AMD CPUs to change their default to retpoline.
Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
STIBP now that we set it for consistency's sake.
This is part of XSA-398 / CVE-2021-26401.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)
Bertrand Marquis [Thu, 17 Feb 2022 14:52:54 +0000 (14:52 +0000)]
xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3
Allow a guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported, and create a fastpath in the code to handle a guest's request to
do the workaround.
The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.
As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.
Rahul Singh [Mon, 14 Feb 2022 18:47:32 +0000 (18:47 +0000)]
xen/arm: Add Spectre BHB handling
This commit is adding Spectre BHB handling to Xen on Arm.
The commit is introducing new alternative code to be executed during
exception entry:
- SMCC workaround 3 call
- loop workaround (with 8, 24 or 32 iterations)
- use of new clearbhb instruction
Cpuerrata is modified by this patch to apply the required workaround for
CPU affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is
enabled.
To do this, the system previously used to apply SMCC workaround 1 is
reused, and new alternative code to be copied into the exception handler is
introduced.
To define the type of workaround required by a processor, 4 new cpu
capabilities are introduced (one for each loop iteration count and one for
SMCC workaround 3).
When a processor is affected, enable_spectre_bhb_workaround is called
and if the processor does not have CSV2 set to 3 or ECBHB feature (which
would mean that the processor is doing what is required in hardware),
the proper code is enabled at exception entry.
In the case where workaround 3 is not supported by the firmware, we
enable workaround 1 when possible as it will also mitigate Spectre BHB
on systems without CSV2.
Bertrand Marquis [Wed, 23 Feb 2022 09:42:18 +0000 (09:42 +0000)]
xen/arm: Add ECBHB and CLEARBHB ID fields
Introduce ID coprocessor register ID_AA64ISAR2_EL1.
Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and
CLEARBHB in isar2 ID coprocessor registers.
Bertrand Marquis [Tue, 15 Feb 2022 10:39:47 +0000 (10:39 +0000)]
xen/arm: move errata CSV2 check earlier
The CSV2 availability check is done after printing to the user that
workaround 1 will be used. Move the check earlier to prevent telling the
user that workaround 1 is used when it is not, because it is not needed.
This will also allow reusing the install_bp_hardening_vec function for
other use cases.
Code previously returning "true", now returns "0" to conform to
enable_smccc_arch_workaround_1 returning an int and surrounding code
doing a "return 0" if workaround is not needed.
Andrew Cooper [Tue, 19 Oct 2021 20:22:27 +0000 (21:22 +0100)]
x86/spec-ctrl: Support Intel PSFD for guests
The Feb 2022 microcode from Intel retrofits AMD's MSR_SPEC_CTRL.PSFD interface
to Sunny Cove (IceLake) and later cores.
Update the MSR_SPEC_CTRL emulation, and expose it to guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 52ce1c97844db213de01c5300eaaa8cf101a285f)
Andrew Cooper [Thu, 27 Jan 2022 21:07:40 +0000 (21:07 +0000)]
x86/cpuid: Infrastructure for cpuid word 7:2.edx
While in principle it would be nice to keep leaf 7 in order, that would
involve having an extra 5 words of zeros in a featureset.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f3709b15fc86c6c6a0959cec8d97f21d0e9f9629)
Andrew Cooper [Wed, 16 Sep 2020 15:15:52 +0000 (16:15 +0100)]
x86/tsx: Cope with TSX deprecation on WHL-R/CFL-R
The February 2022 microcode is formally de-featuring TSX on the TAA-impacted
client CPUs. The backup TAA mitigation (VERW regaining its flushing side
effect) is being dropped, meaning that `smt=0 spec-ctrl=md-clear` no longer
protects against TAA on these parts.
The new functionality enumerates itself via the RTM_ALWAYS_ABORT CPUID
bit (the same as June 2021), but has its control in MSR_MCU_OPT_CTRL as
opposed to MSR_TSX_FORCE_ABORT.
TSX now defaults to being disabled on ucode load. Furthermore, if SGX is
enabled in the BIOS, TSX is locked and cannot be re-enabled. In this case,
override opt_tsx to 0, so the RTM/HLE CPUID bits get hidden by default.
While updating the command line documentation, take the opportunity to add a
paragraph explaining what TSX being disabled actually means, and how migration
compatibility works.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ad9f7c3b2e0df38ad6d54f4769d4dccf765fbcee)
Andrew Cooper [Wed, 23 Jun 2021 20:53:58 +0000 (21:53 +0100)]
x86/tsx: Move has_rtm_always_abort to an outer scope
We are about to introduce a second path which needs to conditionally force the
presence of RTM_ALWAYS_ABORT.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4116139131e93b4f075e5442e3c1b424280f6f1f)
Andrew Cooper [Wed, 19 May 2021 18:40:28 +0000 (19:40 +0100)]
x86/spec-ctrl: Clean up MSR_MCU_OPT_CTRL handling
Introduce cpu_has_srbds_ctrl as more users are going to appear shortly.
MSR_MCU_OPT_CTRL is gaining extra functionality, meaning that the current
default_xen_mcu_opt_ctrl is no longer a good fit.
Introduce two new helpers, update_mcu_opt_ctrl() which does a full RMW cycle
on the MSR, and set_in_mcu_opt_ctrl() which lets callers configure specific
bits at a time without clobbering each other's settings.
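A hedged sketch of the two helpers as described (signatures and MSR
plumbing approximate, not quoted from the patch):

    static uint64_t default_xen_mcu_opt_ctrl;

    void update_mcu_opt_ctrl(void)
    {
        /* Write the cached value back to the MSR (the RMW cycle). */
        wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl);
    }

    void set_in_mcu_opt_ctrl(uint64_t mask, uint64_t val)
    {
        /* Update only the caller's bits, preserving everyone else's. */
        default_xen_mcu_opt_ctrl &= ~mask;
        default_xen_mcu_opt_ctrl |= (val & mask);

        update_mcu_opt_ctrl();
    }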
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 39a40f3835efcc25c1b05a25c321a01d7e11cbd7)
Jan Beulich [Thu, 27 Jan 2022 12:54:42 +0000 (12:54 +0000)]
x86/cpuid: Infrastructure for leaf 7:1.ebx
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e1828e3032ebfe036023cd733adfd2d4ec856688)
Andrew Cooper [Thu, 27 Jan 2022 13:56:04 +0000 (13:56 +0000)]
x86/cpuid: Disentangle logic for new feature leaves
Adding a new feature leaf is a reasonable amount of boilerplate and for the
patch to build, at least one feature from the new leaf needs defining. This
typically causes two non-trivial changes to be merged together.
First, have gen-cpuid.py write out some extra placeholder defines.
This allows DECL_BITFIELD() to be added to struct cpuid_policy without
requiring a XEN_CPUFEATURE() declared for the leaf. The choice of 4 is
arbitrary, and allows us to add more than one leaf at a time if necessary.
Second, rework generic_identify() to not use specific feature names.
The choice of deriving the index from a feature was to avoid mismatches, but
its correctness depends on bugs like c/s 249e0f1d8f20 ("x86/cpuid: Fix
TSXLDTRK definition") not happening.
Switch to using FEATURESET_* just like the policy/featureset helpers. This
removes the cognitive overhead of needing to know which leaf a specifically
named feature resides in, and is shorter to write. It is also far
easier to identify as correct at a glance, given the correlation with the
CPUID leaf being read.
In addition, tidy up some other bits of generic_identify():
* Drop leading zeros from leaf numbers.
* Don't use a locked update for X86_FEATURE_APERFMPERF.
* Rework extended_cpuid_level calculation to avoid setting it twice.
* Use "leaf >= $N" consistently so $N matches with the CPUID input.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e3662437eb43cc8002bd39be077ef68b131649c5)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default
With all other pieces in place, MSR_SPEC_CTRL is fully working for HVM guests.
Update the CPUID derivation logic (both PV and HVM, to avoid losing subtle
changes), drop the MSR intercept, and explicitly enable the CPUID bits for HVM
guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a7e7c7260cde78a148810db5320cbf39686c3e09)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/msr: AMD MSR_SPEC_CTRL infrastructure
Fill in VMCB accessors for spec_ctrl in svm_{get,set}_reg(), and CPUID checks
for all supported bits in guest_{rd,wr}msr().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22b9add22b4a9af37305c8441fec12cb26bd142b)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL
Hardware maintains both host and guest versions of MSR_SPEC_CTRL, but guests
run with the logical OR of both values. Therefore, in principle we want to
clear Xen's value before entering the guest. However, for migration
compatibility (future work), and for performance reasons with SEV-SNP guests,
we want the ability to use a nonzero value behind the guest's back. Use
vcpu_msrs to hold this value, with the guest value in the VMCB.
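As a worked illustration of the OR semantics (bit layout per the architectural MSR definition; a standalone example, not Xen code):

    #include <stdint.h>
    #include <stdio.h>

    #define SPEC_CTRL_IBRS (1ull << 0)
    #define SPEC_CTRL_SSBD (1ull << 2)

    int main(void)
    {
        uint64_t host_val  = 0;               /* Xen's value, cleared   */
        uint64_t guest_val = SPEC_CTRL_SSBD;  /* guest's value, in VMCB */

        /* Hardware honours the OR of both while the guest runs. */
        printf("effective SPEC_CTRL = %#llx\n",
               (unsigned long long)(host_val | guest_val));
        return 0;
    }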
On the VMEntry path, adjusting MSR_SPEC_CTRL must be done after CLGI so as to
be atomic with respect to NMIs/etc.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 614cec7d79d76786f5638a6e4da0576b57732ca1)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD
Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in
the system. This ceases to be true when using the common logic.
Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and
introduce an AMD specific block to control alternatives. Also update the
boot/resume paths to configure default_xen_spec_ctrl.
svm.h needs an adjustment to remove a dependency on include order.
For now, only activate alternatives for HVM; PV will require more work. No
functional change, as no alternatives are defined for HVM yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)
Andrew Cooper [Fri, 28 Jan 2022 11:57:19 +0000 (11:57 +0000)]
x86/spec-ctrl: Record the last write to MSR_SPEC_CTRL
In some cases, writes to MSR_SPEC_CTRL do not have interesting side effects,
and we should implement lazy context switching like we do with other MSRs.
In the short term, this will be used by the SVM infrastructure, but I expect
to extend it to other contexts in due course.
Introduce cpu_info.last_spec_ctrl for the purpose, and cache writes made from
the boot/resume paths. The value can't live in regular per-cpu data, as it
will eventually be used for PV guests, where XPTI might be active.
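The lazy-write pattern itself is simple; as a sketch (identifiers are illustrative, and the MSR accessor is stubbed):

    #include <stdint.h>

    #define MSR_SPEC_CTRL 0x00000048u

    static uint64_t last_spec_ctrl;  /* per-CPU, in cpu_info in Xen */

    static void wrmsr(uint32_t msr, uint64_t val) { (void)msr; (void)val; }

    static void write_spec_ctrl(uint64_t val)
    {
        if ( val == last_spec_ctrl )
            return;  /* skip the WRMSR: nothing changed */

        wrmsr(MSR_SPEC_CTRL, val);
        last_spec_ctrl = val;
    }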
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 00f2992b6c7a9d4090443c1a85bf83224a87eeb9)
Andrew Cooper [Fri, 28 Jan 2022 12:03:42 +0000 (12:03 +0000)]
x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3
'idle' here refers to hlt/mwait. The S3 path isn't an idle path - it is a
platform reset.
We need to load default_xen_spec_ctrl unilaterally on the way back up.
Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next
return-to-guest, but that's fragile behaviour.
Conversely, there is no need to clear IBRS and flush the store buffers on the
way down; we're microseconds away from cutting power.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)
Andrew Cooper [Tue, 25 Jan 2022 17:14:48 +0000 (17:14 +0000)]
x86/spec-ctrl: Introduce new has_spec_ctrl boolean
Most MSR_SPEC_CTRL setup will be common between Intel and AMD. Instead of
opencoding an OR of two features everywhere, introduce has_spec_ctrl instead.
Reword the comment above the Intel specific alternatives block to highlight
that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS
out because it will want to be common.
No functional change.
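As introduced here, the predicate reduces to roughly the following fragment, with the AMD bit ORed in by the later AMD patch (feature names as in Xen's featureset; sketch only):

    bool has_spec_ctrl = boot_cpu_has(X86_FEATURE_IBRSB);  /* Intel */
    /* Later: ... || boot_cpu_has(X86_FEATURE_IBRS);          AMD   */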
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)
Andrew Cooper [Tue, 25 Jan 2022 16:09:59 +0000 (16:09 +0000)]
x86/spec-ctrl: Drop use_spec_ctrl boolean
Several bugfixes have reduced the utility of this variable from its original
purpose, and now all it does is aid in the setup of SCF_ist_wrmsr.
Simplify the logic by dropping the variable, and doubling up the setting of
SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL
support easier to follow. Leave a comment explaining why SCF_ist_wrmsr is
still necessary for the VMExit case.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)
Andrew Cooper [Thu, 27 Jan 2022 21:28:48 +0000 (21:28 +0000)]
x86/cpuid: Advertise SSB_NO to guests by default
This is a statement of hardware behaviour, and not related to controls for the
guest kernel to use. Pass it straight through from hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 15b7611efd497c4b65f350483857082cb70fc348)
Andrew Cooper [Wed, 19 Jan 2022 19:55:02 +0000 (19:55 +0000)]
x86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL
This bug existed in early 2018, between MSR_SPEC_CTRL arriving in microcode,
and SSBD arriving a few months later. It went unnoticed presumably because
everyone was busy rebooting everything.
The same bug will reappear when adding PSFD support.
Clamp the guest MSR_SPEC_CTRL value to that permitted by CPUID on migrate.
The guest is already playing with reserved bits at this point, and clamping
the value will prevent a migration to a less capable host from failing.
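The shape of the clamp, as a hedged standalone sketch (bit positions per the architectural MSR layout; the helper and its inputs are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define SPEC_CTRL_IBRS  (1ull << 0)
    #define SPEC_CTRL_STIBP (1ull << 1)
    #define SPEC_CTRL_SSBD  (1ull << 2)
    #define SPEC_CTRL_PSFD  (1ull << 7)

    static uint64_t valid_bits(bool ibrsb, bool ssbd, bool psfd)
    {
        return (ibrsb ? SPEC_CTRL_IBRS | SPEC_CTRL_STIBP : 0) |
               (ssbd  ? SPEC_CTRL_SSBD : 0) |
               (psfd  ? SPEC_CTRL_PSFD : 0);
    }

    /* On restore: val &= valid_bits(...); instead of failing the load. */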
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 969a57f73f6b011b2ebf4c0ab1715efc65837335)
Andrew Cooper [Tue, 1 Feb 2022 13:34:49 +0000 (13:34 +0000)]
x86/vmx: Drop spec_ctrl load in VMEntry path
This is not needed now that the VMEntry path is not responsible for loading
the guest's MSR_SPEC_CTRL value.
Fixes: 81f0eaadf84d ("x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9ce3ef20b4f085a7dc8ee41b0fec6fdeced3773e)
x86/cpuid: support LFENCE always serialising CPUID bit
AMD Milan (Zen3) CPUs have an LFENCE Always Serialising CPUID bit in
leaf 80000021.eax. Previous AMD versions used to have a user settable
bit in DE_CFG MSR to select whether LFENCE was dispatch serialising,
which Xen always attempts to set. The forcefully always on setting is
due to the addition of SEV-SNP so that a VMM cannot break the
confidentiality of a guest.
In order to support this new CPUID bit, move the LFENCE_DISPATCH
synthetic CPUID bit to map the hardware bit (leaving a hole in the
synthetic range) and either rely on the bit already being set by the
native CPUID output, or attempt to fake it in Xen by modifying the
DE_CFG MSR. This requires adding one more entry to the featureset to
support leaf 80000021.eax.
The bit is always exposed to guests by default even if the underlying
hardware doesn't support leaf 80000021. Note that Xen doesn't allow
guests to change the DE_CFG value, so once set by Xen, LFENCE will always
be serialising.
Note that the access to DE_CFG by guests is left as-is: reads will
unconditionally return LFENCE_SERIALISE bit set, while writes are
silently dropped.
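A hedged user-space sketch of the detection order (GCC/clang; the MSR fallback is only noted in a comment, as it needs ring 0):

    #include <cpuid.h>
    #include <stdbool.h>

    static bool lfence_always_serialising(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.80000021H:EAX[2] - LFENCE always serialising. */
        if ( __get_cpuid(0x80000021, &eax, &ebx, &ecx, &edx) &&
             (eax & (1u << 2)) )
            return true;

        /*
         * Otherwise Xen attempts to make LFENCE serialising itself by
         * setting bit 1 of the DE_CFG MSR (0xc0011029) - not possible
         * from user space, so merely noted here.
         */
        return false;
    }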
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Always expose to guests by default] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9b4fe26364950258c9f57f0f68eccb778eeadbb)
x86/cpuid: do not expand max leaves on restore
When restoring, limit the maximum leaves to the ones supported by Xen
4.12 in order not to expand the maximum leaves a guest sees. Note
this is unlikely to cause real issues.
Guests restored from Xen versions 4.13 or greater will contain CPUID
data on the stream that will override the values set by
xc_cpuid_apply_policy.
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 111c8c33a8a18588f3da3c5dbb7f5c63ddb98ce5)
tools/libxenguest: Fix max_extd_leaf calculation for legacy restore
0x1c is lower than any value which will actually be observed in
p->extd.max_leaf, but higher than the logical 9 leaves worth of extended data
on Intel systems, causing x86_cpuid_copy_to_buffer() to fail with -ENOBUFS.
Correct the calculation.
The problem was first noticed in c/s 34990446ca9 "libxl: don't ignore the
return value from xc_cpuid_apply_policy" but introduced earlier.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fa174cbf54cc625a023b8e7170e359dd150c072)
It was Xen 4.14 where CPUID data was added to the migration stream, and 4.13
that we need to worry about with regard to compatibility. Xen 4.12 isn't
relevant.
Expand and correct the commentary.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 820cc393434097f3b7976acdccbf1d96071d6d23)
x86/amd: split LFENCE dispatch serializing setup logic into helper
Split the logic to attempt to setup LFENCE to be dispatch serializing
on AMD into a helper, so it can be shared with Hygon.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3e9460ec93341fa6a80ecf99832aa5d9975339c9)
Roger Pau Monné [Wed, 26 Jan 2022 11:52:09 +0000 (12:52 +0100)]
x86/pvh: fix population of the low 1MB for dom0
RMRRs are set up ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.
Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.
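The shape of the relaxed check, as an illustrative fragment (not the exact code): an already-populated gfn in the low 1MB must either be RAM or an identity mapping, such as an RMRR put in place during IOMMU setup:

    if ( !mfn_eq(mfn, INVALID_MFN) && !p2m_is_ram(t) )
        ASSERT(mfn_x(mfn) == gfn);  /* identity map, e.g. an RMRR */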
Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d5fc9120d556ec3c4b1acf0ab5660a6d3f7ebeb
master date: 2022-01-25 10:52:24 +0000
Andrew Cooper [Wed, 26 Jan 2022 11:51:31 +0000 (12:51 +0100)]
x86: Fix build with the get/set_reg() infrastructure
I clearly messed up concluding that the stubs were safe to drop.
The is_{pv,hvm}_domain() predicates are not symmetrical with both CONFIG_PV
and CONFIG_HVM. As a result, logic of the form `if ( pv/hvm ) ... else ...`
will always have one side which can't be DCE'd.
While technically only the hvm stubs are needed, due to the use of the
is_pv_domain() predicate in guest_{rd,wr}msr(), sort out the pv stubs too to
avoid leaving a bear trap for future users.
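For illustration, the problem pattern (a fragment; accessor names as used by this series): with CONFIG_PV=n the compiler can only eliminate the PV arm if is_pv_domain() constant-folds to false, which it does not in all configurations, so both arms must at least link:

    if ( is_pv_domain(d) )
        val = pv_get_reg(v, reg);   /* needs at least a stub */
    else
        val = hvm_get_reg(v, reg);  /* likewise */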
Fixes: 88d3ff7ab15d ("x86/guest: Introduce {get,set}_reg() infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 13caa585791234fe3e3719c8376f7ea731012451
master date: 2022-01-21 12:42:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:53:14 +0000 (13:53 +0100)]
x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling
The logic was based on a mistaken understanding of how NMI blocking on vmexit
works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
and the guest's value will be clobbered before it is saved.
Switch to using MSR load/save lists. This causes the guest value to be saved
atomically with respect to NMIs/MCEs/etc.
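Each element of such a list has the SDM's MSR-load/store area entry format, which hardware walks atomically with VMEntry/VMExit:

    #include <stdint.h>

    struct vmx_msr_entry {
        uint32_t index;  /* MSR index, e.g. 0x48 for MSR_SPEC_CTRL */
        uint32_t mbz;    /* reserved, must be zero */
        uint64_t data;   /* value loaded/saved by hardware */
    };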
First, update vmx_cpuid_policy_changed() to configure the load/save lists at
the same time as configuring the intercepts. This function is always used in
remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
function, rather than having multiple remote acquisitions of the same VMCS.
Both of vmx_{add,del}_guest_msr() can fail. The -ESRCH delete case is fine,
but all others are fatal to the running of the VM, so handle them using
domain_crash() - this path is only used during domain construction anyway.
Second, update vmx_{get,set}_reg() to use the MSR load/save lists rather than
vcpu_msrs, and update the vcpu_msrs comment to describe the new state
location.
Finally, adjust the entry/exit asm.
Because the guest value is saved and loaded atomically, we do not need to
manually load the guest value, nor do we need to enable SCF_use_shadow. This
lets us remove the use of DO_SPEC_CTRL_EXIT_TO_GUEST. Additionally,
SPEC_CTRL_ENTRY_FROM_PV gets removed too, because on an early entry failure,
we're no longer in the guest MSR_SPEC_CTRL context needing to switch back to
Xen's context.
The only action remaining is to load Xen's MSR_SPEC_CTRL value on vmexit. We
could in principle use the host MSR list, but that is expected to complicate
future work. Delete DO_SPEC_CTRL_ENTRY_FROM_HVM entirely, and use a shorter
code sequence to simply reload Xen's setting from the top-of-stack block.
Adjust the comment at the top of spec_ctrl_asm.h in light of this bugfix.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f0eaadf84d273a6ff8df3660b874a02d0e7677
master date: 2022-01-20 16:32:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:52:56 +0000 (13:52 +0100)]
x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM
These were written before Spectre/Meltdown went public, and there was large
uncertainty in how the protections would evolve. As it turns out, they're
very specific to Intel hardware, and not very suitable for AMD.
Drop the macros, opencoding the relevant subset of functionality, and leaving
grep-fodder to locate the logic. No change at all for VT-x.
For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB,
although we will soon be adding (different) logic to handle MSR_SPEC_CTRL.
This brings a marginal improvement: removing an unconditional pile of long-nops
from the vmentry/exit path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 95b13fa43e0753b7514bef13abe28253e8614f62
master date: 2022-01-20 16:32:11 +0000
x86/guest: Introduce {get,set}_reg() infrastructure
Various registers have per-guest-type or per-vendor locations or access
requirements. To support their use from common code, provide accessors which
allow for per-guest-type behaviour.
For now, just infrastructure handling default cases and expectations.
Subsequent patches will start handling registers using this infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 88d3ff7ab15da277a85b39735797293fb541c718
master date: 2022-01-20 16:32:11 +0000
Jan Beulich [Tue, 25 Jan 2022 12:51:49 +0000 (13:51 +0100)]
x86/time: improve TSC / CPU freq calibration accuracy
While the problem report was for extreme errors, even smaller ones would
better be avoided: The calculated period to run calibration loops over
can (and usually will) be shorter than the actual time elapsed between
first and last platform timer and TSC reads. Adjust values returned from
the init functions accordingly.
On a Skylake system I've tested this on, accuracy (using HPET) went from
a value in some cases more than 220kHz too high, to about
±2kHz. On other systems (or on this system, but with PMTMR) the original
error range was much smaller, with less (in some cases only very little)
improvement.
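The adjustment boils down to deriving the frequency from the deltas actually observed rather than from the nominal calibration period; as a sketch (the real code must use an overflow-safe multiply-divide):

    #include <stdint.h>

    static uint64_t calibrated_tsc_hz(uint64_t tsc_delta,
                                      uint64_t plt_delta, uint64_t plt_hz)
    {
        /* TSC ticks per actually-elapsed second of platform time. */
        return tsc_delta * plt_hz / plt_delta;
    }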
Reported-by: James Dingwall <james-xen@dingwall.me.uk> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a5c9a80af34eefcd6e31d0ed2b083f452cd9076d
master date: 2022-01-13 14:31:52 +0100
Jan Beulich [Tue, 25 Jan 2022 12:51:36 +0000 (13:51 +0100)]
x86/time: use relative counts in calibration loops
Looping until reaching/exceeding a certain value is error prone: If the
target value is close enough to the wrapping point, the loop may not
terminate at all. Switch to using delta values, which then allows the
two loops each to be folded into just one.
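The wrap-safe idiom in question, as a standalone sketch (the counter is faked so the example runs anywhere):

    #include <stdint.h>

    static uint32_t read_counter(void)
    {
        static uint32_t fake;  /* stand-in for a free-running timer */
        return fake += 7;
    }

    static void wait_ticks(uint32_t ticks)
    {
        uint32_t start = read_counter();

        /*
         * Unsigned subtraction yields a correct delta even across a
         * counter wrap, so the loop terminates for any delta below the
         * counter's full range.
         */
        while ( (uint32_t)(read_counter() - start) < ticks )
            ;
    }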
Fixes: 93340297802b ("x86/time: calibrate TSC against platform timer") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 467191641d2a2fd2e43b3ae7b80399f89d339980
master date: 2022-01-13 14:30:18 +0100
Julien Grall [Tue, 25 Jan 2022 12:49:26 +0000 (13:49 +0100)]
xen/grant-table: Only decrement the refcounter when grant is fully unmapped
The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a
simple revert of the changes done by the grant mapping hypercall
(GNTTABOP_map_grant_ref).
Instead, it is possible to partially (or even not at all) clear some flags.
This will leave the grant mapped until a future call where all
the flags are cleared.
XSA-380 introduced a refcounting that is meant to only be dropped
when the grant is fully unmapped. Unfortunately, unmap_common() will
decrement the refcount for every successful call.
As a consequence, a domain would be able to underflow the refcount
and trigger a BUG().
Looking at the code, it is not clear to me why a domain would
want to partially clear some flags in the grant-table. But as
this is part of the ABI, it is better to not change the behavior
for now.
Fix it by checking whether the maptrack handle has been released before
decrementing the refcount.
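The shape of the fix, as a heavily hedged fragment (identifiers are illustrative, not the exact grant-table code):

    if ( handle_released )         /* grant now fully unmapped */
        drop_page_refcount(page);  /* the XSA-380 refcount */
    /* else: flags only partially cleared - keep the reference */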
Julien Grall [Tue, 25 Jan 2022 12:49:06 +0000 (13:49 +0100)]
xen/arm: p2m: Always clear the P2M entry when the mapping is removed
Commit 2148a125b73b ("xen/arm: Track page accessed between batch of
Set/Way operations") allowed an entry to be invalid from the CPU PoV
(lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful
to track which page is accessed and only perform an action on them
(e.g. clean & invalidate the cache after a set/way instruction).
Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when
lpae_is_valid() returns true. This means the entry will not be zeroed
if the entry was valid from Xen PoV but invalid from the CPU PoV for
tracking purpose.
As a consequence, this will allow a domain to continue to access the
page after it was removed.
Resolve the issue by always zeroing the entry if the LPAE valid bit is
set or the entry is about to be removed.
Andrew Cooper [Fri, 7 Jan 2022 07:54:38 +0000 (08:54 +0100)]
x86/spec-ctrl: Fix default calculation of opt_srb_lock
Since this logic was introduced, opt_tsx has become more complicated and
shouldn't be compared to 0 directly. While there are no buggy logic paths,
the correct expression is !(opt_tsx & 1) but the rtm_disabled boolean is
easier and clearer to use.
Fixes: 8fe24090d940 ("x86/cpuid: Rework HLE and RTM handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 31f3bc97f4508687215e459a5e35676eecf1772b
master date: 2022-01-05 09:44:26 +0000
While its description is correct from an abstract or real hardware PoV,
the range is special inside HVM guests. The range being UC in particular
gets in the way of OVMF, which places itself at [FFE00000,FFFFFFFF].
While this is benign to epte_get_entry_emt() as long as the IOMMU isn't
enabled for a guest, it becomes a very noticeable problem otherwise: It
takes about half a minute for OVMF to decompress itself into its
designated address range.
And even beyond OVMF there's no reason to have e.g. the ACPI memory
range marked UC.
Fixes: c22bd567ce22 ("hvmloader: PA range 0xfc000000-0xffffffff should be UC") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ea187c0b7a73c26258c0e91e4f3656989804555f
master date: 2021-12-17 08:56:15 +0100
Jan Beulich [Fri, 7 Jan 2022 07:52:48 +0000 (08:52 +0100)]
x86/HVM: permit CLFLUSH{,OPT} on execute-only code segments
Both SDM and PM explicitly permit this.
Fixes: 52dba7bd0b36 ("x86emul: generalize wbinvd() hook") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org>
master commit: df3e1a5efe700a9f59eced801cac73f9fd02a0e2
master date: 2021-12-10 14:03:56 +0100