Andrew Cooper [Tue, 15 Mar 2022 12:07:18 +0000 (12:07 +0000)]
x86/cet: Remove writeable mapping of the BSPs shadow stack
An unintended consequence of the BSP using cpu0_stack[] is that writeable
mappings to the BSP's shadow stacks are retained in the bss. This renders
CET-SS almost useless, as an attacker can update both return addresses and the
ret will not fault.
We specifically don't want to shatter the superpage mapping .data and .bss, so
the only way to fix this is to not have the BSP stack in the main Xen image.
Break cpu_alloc_stack() out of cpu_smpboot_alloc(), and dynamically allocate
the BSP stack as early as reasonable in __start_xen(). As a consequence,
there is no need to delay the BSP's memguard_guard_stack() call.
Copy the top of cpu info block just before switching to use the new stack.
Fix a latent bug by setting %rsp to info->guest_cpu_user_regs rather than
->es; this would be buggy if reinit_bsp_stack() called schedule() (which
rewrites the GPR block) directly, but luckily it doesn't.
Finally, move cpu0_stack[] into .init, so it can be reclaimed after boot.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37786b23b027ab83051175cb8ce9ac86cacfc58e)
Andrew Cooper [Mon, 14 Mar 2022 10:30:46 +0000 (10:30 +0000)]
x86/cet: Clear IST supervisor token busy bits on S3 resume
Stacks are not freed across S3. Execution just stops, leaving supervisor
token busy bits active. Fixing this for the primary shadow stack was done
previously, but there is a (rare) risk that an IST token is left busy too, if
the platform power-off happens to intersect with an NMI/#MC arriving. This
will manifest as #DF next time the IST vector gets used.
Introduce rdssp() and wrss() helpers in a new shstk.h, cleaning up
fixup_exception_return() and explaining the trick with the literal 1.
Then use this infrastructure to rewrite the IST tokens in load_system_tables()
when all the other IST details are being set up. In the case that an IST
token were left busy across S3, this will clear the busy bit before the stack
gets used.
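As a rough illustration of why an unconditional rewrite is enough, the sketch below models a supervisor shadow stack token as a 64-bit value holding its own linear address, with bit 0 as the busy bit. The helper names are hypothetical, and a plain store stands in for the wrss() helper; real shadow stack pages cannot be written with ordinary stores.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of a supervisor shadow stack token is the busy bit. */
#define SSTK_TOKEN_BUSY UINT64_C(1)

/*
 * Hypothetical helper: the value a non-busy token should hold, assuming the
 * token lives at the 8-byte aligned linear address token_va.
 */
static uint64_t idle_token(uint64_t token_va)
{
    return token_va & ~UINT64_C(7);
}

/* Hypothetical helper: test the busy bit of a token value. */
static bool token_is_busy(uint64_t token)
{
    return (token & SSTK_TOKEN_BUSY) != 0;
}

/*
 * Unconditionally rewriting the slot discards any stale busy bit.  The plain
 * store is an assumption standing in for a wrss-based write.
 */
static void reset_token(uint64_t *slot, uint64_t token_va)
{
    *slot = idle_token(token_va);
}
```

Because the rewrite does not inspect the old value, it behaves the same whether or not the token was left busy across S3.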
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e421ed0f68488863599532bda575c03c33cde0e0)
Andrew Cooper [Mon, 7 Mar 2022 20:19:18 +0000 (20:19 +0000)]
x86/kexec: Fix kexec-reboot with CET active
The kexec_reloc() asm has an indirect jump to relocate onto the identity
trampoline. While we clear CET in machine_crash_shutdown(), we fail to clear
CET for the non-crash path. This in turn highlights that the same is true of
resetting the CPUID masking/faulting.
Move both pieces of logic from machine_crash_shutdown() to machine_kexec(),
the latter being common for all kexec transitions. Adjust the condition for
CET being considered active to check in CR4, which is simpler and more robust.
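A minimal sketch of the CR4-based check, for illustration only. CR4.CET is architecturally bit 23; the helper names are hypothetical and the CR4 value is passed in by the caller rather than read from hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* CR4.CET is bit 23 in the architectural CR4 layout. */
#define X86_CR4_CET (UINT64_C(1) << 23)

/*
 * Hypothetical helper: decide whether CET needs turning off before the
 * kexec transition, looking only at a caller-supplied CR4 value.
 */
static bool cet_active_in_cr4(uint64_t cr4)
{
    return (cr4 & X86_CR4_CET) != 0;
}

/* Hypothetical helper: the CR4 value with CET cleared, other bits untouched. */
static uint64_t cr4_without_cet(uint64_t cr4)
{
    return cr4 & ~X86_CR4_CET;
}
```

Checking CR4 answers "is CET on right now" directly, without having to reason about which of CET-SS and CET-IBT were configured.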
Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT")
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in context_switch()")
Reported-by: David Vrabel <dvrabel@amazon.co.uk>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: David Vrabel <dvrabel@amazon.co.uk>
(cherry picked from commit 7f5b2448bd724f5f24426b2595a9bdceb1e5a346)
Andrew Cooper [Mon, 28 Feb 2022 19:26:37 +0000 (19:26 +0000)]
x86/spec-ctrl: Disable retpolines with CET-IBT
CET-IBT depends on executing indirect branches for protections to apply.
Extend the clobber for CET-SS to all of CET.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6e3f36387de566b09aa4145ea0e3bfe4814d68b4)
and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:
(XEN) d1v0 Unexpected vmexit: reason 3
(XEN) domain_crash called from vmx.c:4304
(XEN) Domain 1 (vcpu#0) crashed on cpu#0:
The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU, and will be addressed in a subsequent patch. It is a
consequence of the APs triple faulting.
The APs triple fault because we don't tear down the stacks on
suspend. The idle/play_dead loop is killed in the middle of running, meaning
that the supervisor token is left busy.
On resume, SETSSBSY finds the busy bit set, suffers #CP and triple faults because
the IDT isn't configured this early.
Rework the AP bring-up path to (re)create the supervisor token. This ensures
the primary stack is non-busy before use.
Note: There are potential issues with the IST shadow stacks too, but fixing
those is more involved.
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: https://github.com/QubesOS/qubes-issues/issues/7283
Reported-by: Thiner Logoer <logoerthiner1@163.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Thiner Logoer <logoerthiner1@163.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7d9589239ec068c944190408b9838774d5ec1f8f)
Andrew Cooper [Mon, 1 Nov 2021 15:17:20 +0000 (15:17 +0000)]
x86: Enable CET Indirect Branch Tracking
With all the pieces now in place, turn CET-IBT on when available.
MSR_S_CET, like SMEP/SMAP, controls Ring1 meaning that ENDBR_EN can't be
enabled for Xen independently of PV32 kernels. As we already disable PV32 for
CET-SS, extend this to all CET, adjusting the documentation/comments as
appropriate.
Introduce a cet=no-ibt command line option to allow the admin to disable IBT
even when everything else is configured correctly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit cdbe2b0a1aecae946639ee080f14831429b184b6)
Andrew Cooper [Mon, 1 Nov 2021 21:54:26 +0000 (21:54 +0000)]
x86/EFI: Disable CET-IBT around Runtime Services calls
UEFI Runtime services, at the time of writing, aren't CET-IBT compatible.
Work is ongoing to address this. In the meantime, unconditionally disable IBT.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d37a8a067e62e3b6709d224c22f740fdda9d0078)
Andrew Cooper [Mon, 1 Nov 2021 16:13:29 +0000 (16:13 +0000)]
x86/setup: Rework MSR_S_CET handling for CET-IBT
CET-SS and CET-IBT can be independently controlled, so the configuration of
MSR_S_CET can't be constant any more.
Introduce xen_msr_s_cet_value(), mostly because I don't fancy
writing/maintaining that logic in assembly. Use this in the 3 paths which
alter MSR_S_CET when both features are potentially active.
To activate CET-IBT, we only need CR4.CET and MSR_S_CET.ENDBR_EN. This is
common with the CET-SS setup, so reorder the operations to set up CR4 and
MSR_S_CET for any nonzero result from xen_msr_s_cet_value(), and set up
MSR_PL0_SSP and SSP if SHSTK_EN was also set.
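The ordering described above can be sketched roughly as follows. This is not the body of xen_msr_s_cet_value(); the two boolean parameters are assumptions standing in for whatever feature checks the real function performs, and the real value may carry further control bits. SHSTK_EN and ENDBR_EN are bits 0 and 2 of MSR_S_CET.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MSR_S_CET control bits used in this sketch. */
#define CET_SHSTK_EN (UINT64_C(1) << 0)
#define CET_ENDBR_EN (UINT64_C(1) << 2)

/*
 * Illustrative stand-in for xen_msr_s_cet_value().  use_shstk and use_ibt
 * are hypothetical inputs, not names taken from the patch.
 */
static uint64_t msr_s_cet_value(bool use_shstk, bool use_ibt)
{
    uint64_t val = 0;

    if ( use_shstk )
        val |= CET_SHSTK_EN;
    if ( use_ibt )
        val |= CET_ENDBR_EN;

    return val;
}

/* Any nonzero value means CR4.CET and MSR_S_CET need setting up. */
static bool needs_cet_setup(uint64_t val)
{
    return val != 0;
}

/* MSR_PL0_SSP and SSP only need setting up when SHSTK_EN is set. */
static bool needs_ssp_setup(uint64_t val)
{
    return (val & CET_SHSTK_EN) != 0;
}
```

With IBT alone, needs_cet_setup() is true while needs_ssp_setup() is false, which matches the reordering: the common CR4/MSR_S_CET step first, the shadow stack specifics second.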
Adjust the crash path to disable CET-IBT too.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 311434bfc9d10615adbd340d7fb08c05cd14f4c7)
Andrew Cooper [Mon, 1 Nov 2021 17:08:24 +0000 (17:08 +0000)]
x86/entry: Make IDT entrypoints CET-IBT compatible
Each IDT vector needs to land on an endbr64 instruction. This is especially
important for the #CP handler, which will recurse indefinitely if the endbr64
is missing, eventually escalating to #DF if guard pages are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e702e36d1d519f4b66086650c1c47d6bac96d4b9)
Also include the continue_pv_domain() change from c/s 954bb07fdb5fad which is
also in entry.S
Andrew Cooper [Mon, 1 Nov 2021 09:51:16 +0000 (09:51 +0000)]
x86/entry: Make syscall/sysenter entrypoints CET-IBT compatible
Each of MSR_{L,C}STAR and MSR_SYSENTER_EIP need to land on an endbr64
instruction. For sysenter, this is easy.
Unfortunately for syscall, the stubs are already 29 bytes long with a limit of
32. endbr64 is 4 bytes. Luckily, there is a 1 byte instruction which can
move from the stubs into the main handlers.
Move the push %rax out of the stub and into {l,c}star_entry(), allowing room
for the endbr64 instruction when appropriate. Update the comment describing
the entry state.
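The byte budget behind this move can be checked with a few lines of arithmetic. The sketch below uses only the sizes quoted above (29 byte stub, 32 byte limit, 4 byte endbr64, 1 byte push %rax); the function names are hypothetical.

```c
#include <stdbool.h>

enum {
    STUB_LIMIT   = 32, /* per-stub space limit quoted above */
    STUB_CURRENT = 29, /* current stub length, push %rax included */
    ENDBR64_LEN  = 4,  /* endbr64 is 4 bytes */
    PUSH_RAX_LEN = 1,  /* push %rax is a 1 byte instruction */
};

/* Stub length for a given layout choice. */
static int stub_len(bool with_endbr, bool push_in_stub)
{
    int len = STUB_CURRENT;

    if ( !push_in_stub )
        len -= PUSH_RAX_LEN;
    if ( with_endbr )
        len += ENDBR64_LEN;

    return len;
}

/* Whether that layout fits in the space available. */
static bool stub_fits(bool with_endbr, bool push_in_stub)
{
    return stub_len(with_endbr, push_in_stub) <= STUB_LIMIT;
}
```

Keeping the push in the stub gives 29 + 4 = 33 bytes, one over the limit; moving it out gives 28 + 4 = 32, which just fits.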
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 17d77ec62a299f4299883ec79ab10cacafd0b2f5)
Andrew Cooper [Mon, 1 Nov 2021 10:09:59 +0000 (10:09 +0000)]
x86/emul: Update emulation stubs to be CET-IBT compatible
All indirect branches need to land on an endbr64 instruction.
For stub_selftests(), use endbr64 unconditionally for simplicity. For ioport
and instruction emulation, add endbr64 conditionally.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0d101568d29e8b4bfd33f20031fedec2652aa0cf)
Andrew Cooper [Fri, 26 Nov 2021 15:34:08 +0000 (15:34 +0000)]
x86: Introduce helpers/checks for endbr64 instructions
... to prevent the optimiser creating unsafe code. See the code comment for
full details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4046ba97446e3974a4411db227263a9f11e0aeb4)
Note: For the backport to 4.14 thru 4.16, we don't care for embedded endbr64
specifically, but place_endbr64() is a prerequisite for other parts of
the series.
Andrew Cooper [Mon, 1 Nov 2021 12:36:33 +0000 (12:36 +0000)]
x86/traps: Rework write_stub_trampoline() to not hardcode the jmp
For CET-IBT, we will need to optionally insert an endbr64 instruction at the
start of the stub. Don't hardcode the jmp displacement assuming that it
starts at byte 24 of the stub.
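To illustrate why the displacement has to follow the jmp's real position, here is a heavily simplified sketch. It emits only an optional endbr64 followed by a jmp rel32, whereas the real stub contains other instructions ahead of the jmp. build_stub() and its parameters are hypothetical names, not the interface of write_stub_trampoline().

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Encoding of endbr64. */
static const uint8_t endbr64_insn[4] = { 0xf3, 0x0f, 0x1e, 0xfa };

/*
 * Hypothetical stub builder.  buf[0] is assumed to end up at address
 * stub_va.  Returns the number of bytes written.
 */
static unsigned int build_stub(uint8_t *buf, uint64_t stub_va,
                               uint64_t target, bool with_endbr)
{
    unsigned int off = 0;
    uint32_t disp;

    if ( with_endbr )
    {
        memcpy(buf, endbr64_insn, sizeof(endbr64_insn));
        off += sizeof(endbr64_insn);
    }

    /*
     * jmp rel32 is 5 bytes, and rel32 is relative to the address of the
     * following instruction, so it is derived from off, not a constant.
     */
    disp = (uint32_t)(target - (stub_va + off + 5));

    buf[off++] = 0xe9;                /* jmp rel32 opcode */
    buf[off++] = disp & 0xff;         /* rel32, little endian */
    buf[off++] = (disp >> 8) & 0xff;
    buf[off++] = (disp >> 16) & 0xff;
    buf[off++] = (disp >> 24) & 0xff;

    return off;
}
```

Inserting the endbr64 shifts the jmp by 4 bytes and shrinks the displacement by the same amount, which a hardcoded offset would silently get wrong.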
Also add extra comments describing what is going on. The mix of %rax and %rsp
is far from trivial to follow.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 809beac3e7fdfd20000386453c64a1e2a3d93075)
Andrew Cooper [Mon, 1 Nov 2021 10:17:59 +0000 (10:17 +0000)]
x86/alternatives: Clear CR4.CET when clearing CR0.WP
This allows us to have CET active much earlier in boot.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 48cdc15a424f9fadad7f9aed00e7dc8ef16a2196)
Andrew Cooper [Mon, 1 Nov 2021 10:19:57 +0000 (10:19 +0000)]
x86/setup: Read CR4 earlier in __start_xen()
This is necessary for read_cr4() to function correctly. Move the EFER caching
at the same time.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9851bc4939101828d2ad7634b93c0d9ccaef5b7e)
Andrew Cooper [Thu, 21 Oct 2021 17:38:50 +0000 (18:38 +0100)]
x86: Introduce support for CET-IBT
CET Indirect Branch Tracking is a hardware feature designed to provide
forward-edge control flow integrity, protecting against jump/call oriented
programming.
IBT requires the placement of endbr{32,64} instructions at the target of every
indirect call/jmp, and every entrypoint.
It is necessary to check for both compiler and assembler support, as the
notrack prefix can be emitted in certain cases.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3667f7f8f7c471e94e58cf35a95f09a0fe5c1290)
Note: For backports to 4.14 thru 4.16, we are deliberately not using
-mmanual-endbr as done in staging, as an intermediate approach which
is not too invasive to backport.
x86/cet: Force -fno-jump-tables for CET-IBT
Both GCC and Clang have a (mis)feature where, even with
-fcf-protection=branch, jump tables are created using a notrack jump rather
than using endbr's in each case statement.
This is incompatible with the safety properties we want in Xen, and enforced
by not setting MSR_S_CET.NOTRACK_EN. The consequence is a fatal #CP[endbr].
-fno-jump-tables is generally active as a side effect of
CONFIG_INDIRECT_THUNK (retpoline), but as of c/s 95d9ab461436 ("x86/Kconfig:
introduce option to select retpoline usage"), we explicitly support turning
retpoline off.
Fixes: 3667f7f8f7c4 ("x86: Introduce support for CET-IBT") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9d4a44380d273de22d5753883cbf5581795ff24d)
Andrew Cooper [Mon, 7 Mar 2022 16:35:52 +0000 (16:35 +0000)]
x86/spec-ctrl: Cease using thunk=lfence on AMD
AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe. AMD are recommending using retpoline everywhere.
Retpoline is incompatible with CET. All CET-capable hardware has efficient
IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
STIBP for consistency's sake).
This is a logical change on AMD, but not on Intel as the default calculations
would end up with these settings anyway. Leave behind a message if IBRS is
found to be missing.
Also update the default heuristics to never select THUNK_LFENCE. This causes
AMD CPUs to change their default to retpoline.
Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
STIBP now that we set it for consistency's sake.
This is part of XSA-398 / CVE-2021-26401.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)
Bertrand Marquis [Thu, 17 Feb 2022 14:52:54 +0000 (14:52 +0000)]
xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3
Allow guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported and create a fastpath in the code to handle guests request to
do the workaround.
The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.
As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.
Rahul Singh [Mon, 14 Feb 2022 18:47:32 +0000 (18:47 +0000)]
xen/arm: Add Spectre BHB handling
This commit is adding Spectre BHB handling to Xen on Arm.
The commit is introducing new alternative code to be executed during
exception entry:
- SMCCC workaround 3 call
- loop workaround (with 8, 24 or 32 iterations)
- use of new clearbhb instruction
Cpuerrata is modified by this patch to apply the required workaround for
CPU affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is
enabled.
To do this the system previously used to apply SMCCC workaround 1 is
reused and new alternative code to be copied in the exception handler is
introduced.
To define the type of workaround required by a processor, 4 new cpu
capabilities are introduced (for each number of loop and for SMCCC
workaround 3).
When a processor is affected, enable_spectre_bhb_workaround is called
and if the processor does not have CSV2 set to 3 or ECBHB feature (which
would mean that the processor is doing what is required in hardware),
the proper code is enabled at exception entry.
In the case where workaround 3 is not supported by the firmware, we
enable workaround 1 when possible as it will also mitigate Spectre BHB
on systems without CSV2.
Bertrand Marquis [Wed, 23 Feb 2022 09:42:18 +0000 (09:42 +0000)]
xen/arm: Add ECBHB and CLEARBHB ID fields
Introduce ID coprocessor register ID_AA64ISAR2_EL1.
Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and
CLEARBHB in isar2 ID coprocessor registers.
Bertrand Marquis [Tue, 15 Feb 2022 10:39:47 +0000 (10:39 +0000)]
xen/arm: move errata CSV2 check earlier
CSV2 availability check is done after printing to the user that
workaround 1 will be used. Move the check before to prevent saying to the
user that workaround 1 is used when it is not because it is not needed.
This will also allow reuse of the install_bp_hardening_vec function for
other use cases.
Code previously returning "true", now returns "0" to conform to
enable_smccc_arch_workaround_1 returning an int and surrounding code
doing a "return 0" if workaround is not needed.
Andrew Cooper [Tue, 19 Oct 2021 20:22:27 +0000 (21:22 +0100)]
x86/spec-ctrl: Support Intel PSFD for guests
The Feb 2022 microcode from Intel retrofits AMD's MSR_SPEC_CTRL.PSFD interface
to Sunny Cove (IceLake) and later cores.
Update the MSR_SPEC_CTRL emulation, and expose it to guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 52ce1c97844db213de01c5300eaaa8cf101a285f)
Andrew Cooper [Thu, 27 Jan 2022 21:07:40 +0000 (21:07 +0000)]
x86/cpuid: Infrastructure for cpuid word 7:2.edx
While in principle it would be nice to keep leaf 7 in order, that would
involve having an extra 5 words of zeros in a featureset.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f3709b15fc86c6c6a0959cec8d97f21d0e9f9629)
Andrew Cooper [Wed, 16 Sep 2020 15:15:52 +0000 (16:15 +0100)]
x86/tsx: Cope with TSX deprecation on WHL-R/CFL-R
The February 2022 microcode is formally de-featuring TSX on the TAA-impacted
client CPUs. The backup TAA mitigation (VERW regaining its flushing side
effect) is being dropped, meaning that `smt=0 spec-ctrl=md-clear` no longer
protects against TAA on these parts.
The new functionality enumerates itself via the RTM_ALWAYS_ABORT CPUID
bit (the same as June 2021), but has its control in MSR_MCU_OPT_CTRL as
opposed to MSR_TSX_FORCE_ABORT.
TSX now defaults to being disabled on ucode load. Furthermore, if SGX is
enabled in the BIOS, TSX is locked and cannot be re-enabled. In this case,
override opt_tsx to 0, so the RTM/HLE CPUID bits get hidden by default.
While updating the command line documentation, take the opportunity to add a
paragraph explaining what TSX being disabled actually means, and how migration
compatibility works.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ad9f7c3b2e0df38ad6d54f4769d4dccf765fbcee)
Andrew Cooper [Wed, 23 Jun 2021 20:53:58 +0000 (21:53 +0100)]
x86/tsx: Move has_rtm_always_abort to an outer scope
We are about to introduce a second path which needs to conditionally force the
presence of RTM_ALWAYS_ABORT.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4116139131e93b4f075e5442e3c1b424280f6f1f)
Andrew Cooper [Wed, 19 May 2021 18:40:28 +0000 (19:40 +0100)]
x86/spec-ctrl: Clean up MSR_MCU_OPT_CTRL handling
Introduce cpu_has_srbds_ctrl as more users are going to appear shortly.
MSR_MCU_OPT_CTRL is gaining extra functionality, meaning that the current
default_xen_mcu_opt_ctrl is no longer a good fit.
Introduce two new helpers, update_mcu_opt_ctrl() which does a full RMW cycle
on the MSR, and set_in_mcu_opt_ctrl() which lets callers configure specific
bits at a time without clobbering each other's settings.
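One possible shape for such a pair of helpers is sketched below. It is an assumption about the general technique (accumulate a mask/value pair, then read-modify-write), not the signatures or bodies from the patch; a plain variable stands in for the MSR, and all names are hypothetical.

```c
#include <stdint.h>

static uint32_t fake_msr;   /* stands in for the MSR contents */
static uint32_t ctrl_mask;  /* bits some caller has taken ownership of */
static uint32_t ctrl_val;   /* desired values for the owned bits */

/* Full read-modify-write cycle: only owned bits are changed. */
static void update_ctrl(void)
{
    uint32_t v = fake_msr;  /* read */

    v &= ~ctrl_mask;        /* modify */
    v |= ctrl_val;

    fake_msr = v;           /* write */
}

/* Configure specific bits without disturbing other callers' settings. */
static void set_in_ctrl(uint32_t mask, uint32_t val)
{
    ctrl_mask |= mask;
    ctrl_val = (ctrl_val & ~mask) | (val & mask);

    update_ctrl();
}
```

Two callers owning different bits can each call set_in_ctrl() in any order, and bits nobody has claimed keep whatever value was already in the register.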
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 39a40f3835efcc25c1b05a25c321a01d7e11cbd7)
Jan Beulich [Thu, 27 Jan 2022 12:54:42 +0000 (12:54 +0000)]
x86/cpuid: Infrastructure for leaf 7:1.ebx
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e1828e3032ebfe036023cd733adfd2d4ec856688)
Andrew Cooper [Thu, 27 Jan 2022 13:56:04 +0000 (13:56 +0000)]
x86/cpuid: Disentangle logic for new feature leaves
Adding a new feature leaf is a reasonable amount of boilerplate and for the
patch to build, at least one feature from the new leaf needs defining. This
typically causes two non-trivial changes to be merged together.
First, have gen-cpuid.py write out some extra placeholder defines:
This allows DECL_BITFIELD() to be added to struct cpuid_policy without
requiring a XEN_CPUFEATURE() declared for the leaf. The choice of 4 is
arbitrary, and allows us to add more than one leaf at a time if necessary.
Second, rework generic_identify() to not use specific feature names.
The choice of deriving the index from a feature was to avoid mismatches, but
its correctness depends on bugs like c/s 249e0f1d8f20 ("x86/cpuid: Fix
TSXLDTRK definition") not happening.
Switch to using FEATURESET_* just like the policy/featureset helpers. This
breaks the cognitive complexity of needing to know which leaf a specifically
named feature should reside in, and is shorter to write. It is also far
easier to identify as correct at a glance, given the correlation with the
CPUID leaf being read.
In addition, tidy up some other bits of generic_identify()
* Drop leading zeros from leaf numbers.
* Don't use a locked update for X86_FEATURE_APERFMPERF.
* Rework extended_cpuid_level calculation to avoid setting it twice.
* Use "leaf >= $N" consistently so $N matches with the CPUID input.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e3662437eb43cc8002bd39be077ef68b131649c5)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default
With all other pieces in place, MSR_SPEC_CTRL is fully working for HVM guests.
Update the CPUID derivation logic (both PV and HVM to avoid losing subtle
changes), drop the MSR intercept, and explicitly enable the CPUID bits for HVM
guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a7e7c7260cde78a148810db5320cbf39686c3e09)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/msr: AMD MSR_SPEC_CTRL infrastructure
Fill in VMCB accessors for spec_ctrl in svm_{get,set}_reg(), and CPUID checks
for all supported bits in guest_{rd,wr}msr().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22b9add22b4a9af37305c8441fec12cb26bd142b)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL
Hardware maintains both host and guest versions of MSR_SPEC_CTRL, but guests
run with the logical OR of both values. Therefore, in principle we want to
clear Xen's value before entering the guest. However, for migration
compatibility (future work), and for performance reasons with SEV-SNP guests,
we want the ability to use a nonzero value behind the guest's back. Use
vcpu_msrs to hold this value, with the guest value in the VMCB.
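The OR behaviour can be modelled in a couple of lines. This is a sketch of the semantics described above, with hypothetical names; the bit positions are the architectural MSR_SPEC_CTRL ones.

```c
#include <stdint.h>

/* Architectural MSR_SPEC_CTRL bits used in this sketch. */
#define SPEC_CTRL_IBRS (UINT32_C(1) << 0)
#define SPEC_CTRL_SSBD (UINT32_C(1) << 2)

/*
 * Hypothetical model: the value the guest actually runs with is the OR of
 * the host-side value and the guest's own value.
 */
static uint32_t effective_spec_ctrl(uint32_t host, uint32_t guest)
{
    return host | guest;
}
```

A zero host-side value leaves the guest running with exactly what it asked for, while a nonzero one adds protections the guest cannot switch off.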
On the VMEntry path, adjusting MSR_SPEC_CTRL must be done after CLGI so as to
be atomic with respect to NMIs/etc.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 614cec7d79d76786f5638a6e4da0576b57732ca1)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD
Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in
the system. This ceases to be true when using the common logic.
Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and
introduce an AMD specific block to control alternatives. Also update the
boot/resume paths to configure default_xen_spec_ctrl.
svm.h needs an adjustment to remove a dependency on include order.
For now, only activate alternatives for HVM - PV will require more work. No
functional change, as no alternatives are defined for HVM yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)
Andrew Cooper [Fri, 28 Jan 2022 11:57:19 +0000 (11:57 +0000)]
x86/spec-ctrl: Record the last write to MSR_SPEC_CTRL
In some cases, writes to MSR_SPEC_CTRL do not have interesting side effects,
and we should implement lazy context switching like we do with other MSRs.
In the short term, this will be used by the SVM infrastructure, but I expect
to extend it to other contexts in due course.
Introduce cpu_info.last_spec_ctrl for the purpose, and cache writes made from
the boot/resume paths. The value can't live in regular per-cpu data when it
is eventually used for PV guests when XPTI might be active.
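A sketch of how a recorded last-written value enables lazy switching, under the assumption that the cache is updated on every write. The structure and function names are hypothetical, and a counter stands in for the actual wrmsr.

```c
#include <stdint.h>

/* Hypothetical cut-down stand-in for the per-CPU info block. */
struct cpu_info_sketch {
    uint32_t last_spec_ctrl;  /* last value written to the MSR */
    unsigned int msr_writes;  /* counts simulated wrmsr operations */
};

/* Unconditional write: perform the (simulated) wrmsr and record the value. */
static void spec_ctrl_write(struct cpu_info_sketch *info, uint32_t val)
{
    info->msr_writes++;       /* stands in for the wrmsr */
    info->last_spec_ctrl = val;
}

/* Lazy write: skip the wrmsr when the register already holds val. */
static void spec_ctrl_write_lazy(struct cpu_info_sketch *info, uint32_t val)
{
    if ( info->last_spec_ctrl != val )
        spec_ctrl_write(info, val);
}
```

The lazy variant is only safe where the write has no side effect beyond changing the register contents, which is the distinction drawn at the top of this message.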
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 00f2992b6c7a9d4090443c1a85bf83224a87eeb9)
Andrew Cooper [Fri, 28 Jan 2022 12:03:42 +0000 (12:03 +0000)]
x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3
'idle' here refers to hlt/mwait. The S3 path isn't an idle path - it is a
platform reset.
We need to load default_xen_spec_ctrl unilaterally on the way back up.
Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next
return-to-guest, but that's fragile behaviour.
Conversely, there is no need to clear IBRS and flush the store buffers on the
way down; we're microseconds away from cutting power.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)
Andrew Cooper [Tue, 25 Jan 2022 17:14:48 +0000 (17:14 +0000)]
x86/spec-ctrl: Introduce new has_spec_ctrl boolean
Most MSR_SPEC_CTRL setup will be common between Intel and AMD. Instead of
opencoding an OR of two features everywhere, introduce has_spec_ctrl instead.
Reword the comment above the Intel specific alternatives block to highlight
that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS
out because it will want to be common.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)
Andrew Cooper [Tue, 25 Jan 2022 16:09:59 +0000 (16:09 +0000)]
x86/spec-ctrl: Drop use_spec_ctrl boolean
Several bugfixes have reduced the utility of this variable from its original
purpose, and now all it does is aid in the setup of SCF_ist_wrmsr.
Simplify the logic by dropping the variable, and doubling up the setting of
SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL
support easier to follow. Leave a comment explaining why SCF_ist_wrmsr is
still necessary for the VMExit case.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)
Andrew Cooper [Thu, 27 Jan 2022 21:28:48 +0000 (21:28 +0000)]
x86/cpuid: Advertise SSB_NO to guests by default
This is a statement of hardware behaviour, and not related to controls for the
guest kernel to use. Pass it straight through from hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 15b7611efd497c4b65f350483857082cb70fc348)
Andrew Cooper [Wed, 19 Jan 2022 19:55:02 +0000 (19:55 +0000)]
x86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL
This bug existed early in 2018 between MSR_SPEC_CTRL arriving in microcode,
and SSBD arriving a few months later. It went unnoticed presumably because
everyone was busy rebooting everything.
The same bug will reappear when adding PSFD support.
Clamp the guest MSR_SPEC_CTRL value to that permitted by CPUID on migrate.
The guest is already playing with reserved bits at this point, and clamping
the value will prevent a migration to a less capable host from failing.
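The clamp itself amounts to a mask. The sketch below assumes the set of bits permitted by CPUID has already been computed into valid_bits; how that mask is derived is not shown, and the function name is hypothetical. Bit positions are the architectural MSR_SPEC_CTRL ones.

```c
#include <stdint.h>

/* Architectural MSR_SPEC_CTRL bits. */
#define SPEC_CTRL_IBRS  (UINT64_C(1) << 0)
#define SPEC_CTRL_STIBP (UINT64_C(1) << 1)
#define SPEC_CTRL_SSBD  (UINT64_C(1) << 2)
#define SPEC_CTRL_PSFD  (UINT64_C(1) << 7)

/*
 * Hypothetical helper: drop any bits the guest's CPUID policy does not
 * permit before the value is put into the migration stream.
 */
static uint64_t clamp_spec_ctrl(uint64_t val, uint64_t valid_bits)
{
    return val & valid_bits;
}
```

For example, a guest which set SSBD without being offered it would otherwise carry a value that a host lacking SSBD has to reject on restore.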
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 969a57f73f6b011b2ebf4c0ab1715efc65837335)
Andrew Cooper [Tue, 1 Feb 2022 13:34:49 +0000 (13:34 +0000)]
x86/vmx: Drop spec_ctrl load in VMEntry path
This is not needed now that the VMEntry path is not responsible for loading
the guest's MSR_SPEC_CTRL value.
Fixes: 81f0eaadf84d ("x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9ce3ef20b4f085a7dc8ee41b0fec6fdeced3773e)
x86/cpuid: support LFENCE always serialising CPUID bit
AMD Milan (Zen3) CPUs have an LFENCE Always Serialising CPUID bit in
leaf 80000021.eax. Previous AMD versions used to have a user settable
bit in DE_CFG MSR to select whether LFENCE was dispatch serialising,
which Xen always attempts to set. The forcefully always on setting is
due to the addition of SEV-SNP so that a VMM cannot break the
confidentiality of a guest.
In order to support this new CPUID bit move the LFENCE_DISPATCH
synthetic CPUID bit to map the hardware bit (leaving a hole in the
synthetic range) and either rely on the bit already being set by the
native CPUID output, or attempt to fake it in Xen by modifying the
DE_CFG MSR. This requires adding one more entry to the featureset to
support leaf 80000021.eax.
The bit is always exposed to guests by default even if the underlying
hardware doesn't support leaf 80000021. Note that Xen doesn't allow
guests to change the DE_CFG value, so once set by Xen LFENCE will always
be serialising.
Note that the access to DE_CFG by guests is left as-is: reads will
unconditionally return LFENCE_SERIALISE bit set, while writes are
silently dropped.
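The guest-visible behaviour of DE_CFG described above could be modelled as below. The function names are hypothetical, and returning only the LFENCE_SERIALISE bit (all other bits zero) is an assumption of this sketch. LFENCE_SERIALISE is bit 1 of the register.

```c
#include <stdint.h>

/* DE_CFG bit making LFENCE dispatch serialising. */
#define AMD64_DE_CFG_LFENCE_SERIALISE (UINT64_C(1) << 1)

/* Hypothetical read handler: always report LFENCE as serialising. */
static uint64_t guest_read_de_cfg(void)
{
    return AMD64_DE_CFG_LFENCE_SERIALISE;
}

/*
 * Hypothetical write handler: accept the write without faulting, but
 * discard the value.  Returns 0 for success.
 */
static int guest_write_de_cfg(uint64_t val)
{
    (void)val;

    return 0;
}
```

A guest attempting to clear the bit therefore sees the write succeed and the bit still set on the next read.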
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Always expose to guests by default] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9b4fe26364950258c9f57f0f68eccb778eeadbb)
x86/cpuid: do not expand max leaves on restore
When restoring, limit the maximum leaves to the ones supported by Xen
4.12 in order to not expand the maximum leaves a guest sees. Note
this is unlikely to cause real issues.
Guests restored from Xen versions 4.13 or greater will contain CPUID
data on the stream that will override the values set by
xc_cpuid_apply_policy.
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 111c8c33a8a18588f3da3c5dbb7f5c63ddb98ce5)
tools/libxenguest: Fix max_extd_leaf calculation for legacy restore
0x1c is lower than any value which will actually be observed in
p->extd.max_leaf, but higher than the logical 9 leaves worth of extended data
on Intel systems, causing x86_cpuid_copy_to_buffer() to fail with -ENOBUFS.
Correct the calculation.
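To make the failure mode concrete, the sketch below contrasts two clamps. It assumes the faulty form compared against a bare 0x1c and the corrected form against 0x8000001c; neither expression is quoted from the patch, and the function names are hypothetical.

```c
#include <stdint.h>

static uint32_t min_u32(uint32_t a, uint32_t b)
{
    return a < b ? a : b;
}

/*
 * Assumed faulty form: extended leaves are numbered from 0x80000000, so a
 * bare 0x1c is always the smaller operand and always wins.
 */
static uint32_t clamp_extd_buggy(uint32_t max_leaf)
{
    return min_u32(max_leaf, 0x1cu);
}

/* Assumed corrected form: compare like with like. */
static uint32_t clamp_extd_fixed(uint32_t max_leaf)
{
    return min_u32(max_leaf, 0x8000001cu);
}
```

With a max_leaf of 0x80000008, the first form yields 0x1c while the second leaves 0x80000008 unchanged; only values above 0x8000001c are reduced.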
The problem was first noticed in c/s 34990446ca9 "libxl: don't ignore the
return value from xc_cpuid_apply_policy" but introduced earlier.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fa174cbf54cc625a023b8e7170e359dd150c072)
It was Xen 4.14 where CPUID data was added to the migration stream, and 4.13
that we need to worry about with regards to compatibility. Xen 4.12 isn't
relevant.
Expand and correct the commentary.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 820cc393434097f3b7976acdccbf1d96071d6d23)
x86/amd: split LFENCE dispatch serializing setup logic into helper
Split the logic to attempt to set up LFENCE to be dispatch serializing
on AMD into a helper, so it can be shared with Hygon.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3e9460ec93341fa6a80ecf99832aa5d9975339c9)
Roger Pau Monné [Wed, 26 Jan 2022 11:52:09 +0000 (12:52 +0100)]
x86/pvh: fix population of the low 1MB for dom0
RMRRs are setup ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.
Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.
Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d5fc9120d556ec3c4b1acf0ab5660a6d3f7ebeb
master date: 2022-01-25 10:52:24 +0000
Andrew Cooper [Wed, 26 Jan 2022 11:51:31 +0000 (12:51 +0100)]
x86: Fix build with the get/set_reg() infrastructure
I clearly messed up concluding that the stubs were safe to drop.
The is_{pv,hvm}_domain() predicates are not symmetrical with both CONFIG_PV
and CONFIG_HVM. As a result logic of the form `if ( pv/hvm ) ... else ...`
will always have one side which can't be DCE'd.
While technically only the hvm stubs are needed, due to the use of the
is_pv_domain() predicate in guest_{rd,wr}msr(), sort out the pv stubs too to
avoid leaving a bear trap for future users.
Fixes: 88d3ff7ab15d ("x86/guest: Introduce {get,set}_reg() infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 13caa585791234fe3e3719c8376f7ea731012451
master date: 2022-01-21 12:42:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:53:14 +0000 (13:53 +0100)]
x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling
The logic was based on a mistaken understanding of how NMI blocking on vmexit
works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
and the guest's value will be clobbered before it is saved.
Switch to using MSR load/save lists. This causes the guest value to be saved
atomically with respect to NMIs/MCEs/etc.
First, update vmx_cpuid_policy_changed() to configure the load/save lists at
the same time as configuring the intercepts. This function is always used in
remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
function, rather than having multiple remote acquisitions of the same VMCS.
Both of vmx_{add,del}_guest_msr() can fail. The -ESRCH delete case is fine,
but all others are fatal to the running of the VM, so handle them using
domain_crash() - this path is only used during domain construction anyway.
Second, update vmx_{get,set}_reg() to use the MSR load/save lists rather than
vcpu_msrs, and update the vcpu_msrs comment to describe the new state
location.
Finally, adjust the entry/exit asm.
Because the guest value is saved and loaded atomically, we do not need to
manually load the guest value, nor do we need to enable SCF_use_shadow. This
lets us remove the use of DO_SPEC_CTRL_EXIT_TO_GUEST. Additionally,
SPEC_CTRL_ENTRY_FROM_PV gets removed too, because on an early entry failure,
we're no longer in the guest MSR_SPEC_CTRL context needing to switch back to
Xen's context.
The only action remaining is to load Xen's MSR_SPEC_CTRL value on vmexit. We
could in principle use the host msr list, but that is expected to complicate
future work. Delete DO_SPEC_CTRL_ENTRY_FROM_HVM entirely, and use a shorter
code sequence to simply reload Xen's setting from the top-of-stack block.
Adjust the comment at the top of spec_ctrl_asm.h in light of this bugfix.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f0eaadf84d273a6ff8df3660b874a02d0e7677
master date: 2022-01-20 16:32:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:52:56 +0000 (13:52 +0100)]
x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM
These were written before Spectre/Meltdown went public, and there was large
uncertainty in how the protections would evolve. As it turns out, they're
very specific to Intel hardware, and not very suitable for AMD.
Drop the macros, opencoding the relevant subset of functionality, and leaving
grep-fodder to locate the logic. No change at all for VT-x.
For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB,
although we will soon be adding (different) logic to handle MSR_SPEC_CTRL.
This has a marginal improvement of removing an unconditional pile of long-nops
from the vmentry/exit path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 95b13fa43e0753b7514bef13abe28253e8614f62
master date: 2022-01-20 16:32:11 +0000
x86/guest: Introduce {get,set}_reg() infrastructure
Various registers have per-guest-type or per-vendor locations or access
requirements. To support their use from common code, provide accessors which
allow for per-guest-type behaviour.
For now, just infrastructure handling default cases and expectations.
Subsequent patches will start handling registers using this infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 88d3ff7ab15da277a85b39735797293fb541c718
master date: 2022-01-20 16:32:11 +0000
Jan Beulich [Tue, 25 Jan 2022 12:51:49 +0000 (13:51 +0100)]
x86/time: improve TSC / CPU freq calibration accuracy
While the problem report was for extreme errors, even smaller ones would
better be avoided: The calculated period to run calibration loops over
can (and usually will) be shorter than the actual time elapsed between
first and last platform timer and TSC reads. Adjust values returned from
the init functions accordingly.
On a Skylake system I've tested this on accuracy (using HPET) went from
detecting in some cases more than 220kHz too high a value to about
±2kHz. On other systems (or on this system, but with PMTMR) the original
error range was much smaller, with less (in some cases only very little)
improvement.
Reported-by: James Dingwall <james-xen@dingwall.me.uk> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a5c9a80af34eefcd6e31d0ed2b083f452cd9076d
master date: 2022-01-13 14:31:52 +0100
Jan Beulich [Tue, 25 Jan 2022 12:51:36 +0000 (13:51 +0100)]
x86/time: use relative counts in calibration loops
Looping until reaching/exceeding a certain value is error prone: If the
target value is close enough to the wrapping point, the loop may not
terminate at all. Switch to using delta values, which then allows
folding the two loops in each case into just one.
Fixes: 93340297802b ("x86/time: calibrate TSC against platform timer") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 467191641d2a2fd2e43b3ae7b80399f89d339980
master date: 2022-01-13 14:30:18 +0100
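Illustration for the "use relative counts in calibration loops" entry above (not part of the patch): a minimal sketch, assuming a 32-bit free-running counter, of why comparing against an absolute target can fail near the wrapping point while comparing elapsed deltas does not. The function names are hypothetical and are not Xen's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Absolute comparison: "has the counter reached 'target' yet?"
 * If 'target' is close to the wrap point, a poll can skip past it and wrap
 * to a small value, after which this never becomes true.
 */
static inline bool reached_absolute(uint32_t now, uint32_t target)
{
    return now >= target;
}

/*
 * Relative comparison: "have at least 'delta' ticks elapsed since 'start'?"
 * Unsigned subtraction is modulo 2^32, so a wrap between 'start' and 'now'
 * does not disturb the result (as long as fewer than 2^32 ticks elapsed).
 */
static inline bool reached_relative(uint32_t start, uint32_t now,
                                    uint32_t delta)
{
    return (uint32_t)(now - start) >= delta;
}
```

With start = 0xFFFFFF00 and a wait of 0xFF ticks, the absolute target is 0xFFFFFFFF. A poll that observes the counter at 0x00000005 (already wrapped) fails the absolute test but passes the relative one.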
Julien Grall [Tue, 25 Jan 2022 12:49:26 +0000 (13:49 +0100)]
xen/grant-table: Only decrement the refcounter when grant is fully unmapped
The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a
simple revert of the changes done by the grant mapping hypercall
(GNTTABOP_map_grant_ref).
Instead, it is possible to partially (or even not) clear some flags.
This will leave the grant mapped until a future call where all
the flags would be cleared.
XSA-380 introduced a refcounting that is meant to only be dropped
when the grant is fully unmapped. Unfortunately, unmap_common() will
decrement the refcount for every successful call.
A consequence is a domain would be able to underflow the refcount
and trigger a BUG().
Looking at the code, it is not clear to me why a domain would
want to partially clear some flags in the grant-table. But as
this is part of the ABI, it is better to not change the behavior
for now.
Fix it by checking if the maptrack handle has been released before
decrementing the refcounting.
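Illustration for the grant-table entry above (not part of the patch): a minimal sketch of the rule it describes, namely that a partial unmap clears some flags but the reference is dropped only once no flags remain. The structure, flag values and function name are hypothetical and do not reflect Xen's real maptrack layout or GNTMAP_* constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits describing which kinds of mapping are still live. */
#define MAP_HOST   (1u << 0)
#define MAP_DEVICE (1u << 1)

struct mapping {
    uint32_t flags;        /* remaining live mapping kinds */
    unsigned int *refcnt;  /* count shared with the granted page */
};

/*
 * Clear the requested flags. The reference is dropped only on the call that
 * takes the mapping from "partially mapped" to "fully unmapped".
 * Returns true if the reference was released by this call.
 */
static inline bool unmap_flags(struct mapping *m, uint32_t to_clear)
{
    bool was_mapped = m->flags != 0;

    m->flags &= ~to_clear;

    if ( was_mapped && m->flags == 0 )
    {
        assert(*m->refcnt > 0);   /* an underflow here is the reported bug */
        --*m->refcnt;
        return true;
    }

    return false;
}
```

Calling unmap_flags() again on an already released mapping leaves the count untouched, which is the behaviour the fix is after.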
Julien Grall [Tue, 25 Jan 2022 12:49:06 +0000 (13:49 +0100)]
xen/arm: p2m: Always clear the P2M entry when the mapping is removed
Commit 2148a125b73b ("xen/arm: Track page accessed between batch of
Set/Way operations") allowed an entry to be invalid from the CPU PoV
(lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful
to track which page is accessed and only perform an action on them
(e.g. clean & invalidate the cache after a set/way instruction).
Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when
lpae_is_valid() returns true. This means the entry will not be zeroed
if the entry was valid from Xen PoV but invalid from the CPU PoV for
tracking purpose.
As a consequence, this will allow a domain to continue to access the
page after it was removed.
Resolve the issue by always zeroing the entry if the LPAE bit is
set or the entry is about to be removed.
Andrew Cooper [Fri, 7 Jan 2022 07:54:38 +0000 (08:54 +0100)]
x86/spec-ctrl: Fix default calculation of opt_srb_lock
Since this logic was introduced, opt_tsx has become more complicated and
shouldn't be compared to 0 directly. While there are no buggy logic paths,
the correct expression is !(opt_tsx & 1) but the rtm_disabled boolean is
easier and clearer to use.
Fixes: 8fe24090d940 ("x86/cpuid: Rework HLE and RTM handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 31f3bc97f4508687215e459a5e35676eecf1772b
master date: 2022-01-05 09:44:26 +0000
While its description is correct from an abstract or real hardware pov,
the range is special inside HVM guests. The range being UC in particular
gets in the way of OVMF, which places itself at [FFE00000,FFFFFFFF].
While this is benign to epte_get_entry_emt() as long as the IOMMU isn't
enabled for a guest, it becomes a very noticeable problem otherwise: It
takes about half a minute for OVMF to decompress itself into its
designated address range.
And even beyond OVMF there's no reason to have e.g. the ACPI memory
range marked UC.
Fixes: c22bd567ce22 ("hvmloader: PA range 0xfc000000-0xffffffff should be UC") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ea187c0b7a73c26258c0e91e4f3656989804555f
master date: 2021-12-17 08:56:15 +0100
Jan Beulich [Fri, 7 Jan 2022 07:52:48 +0000 (08:52 +0100)]
x86/HVM: permit CLFLUSH{,OPT} on execute-only code segments
Both SDM and PM explicitly permit this.
Fixes: 52dba7bd0b36 ("x86emul: generalize wbinvd() hook") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org>
master commit: df3e1a5efe700a9f59eced801cac73f9fd02a0e2
master date: 2021-12-10 14:03:56 +0100
Jan Beulich [Fri, 7 Jan 2022 07:52:18 +0000 (08:52 +0100)]
x86: avoid wrong use of all-but-self IPI shorthand
With "nosmp" I did observe a flood of "APIC error on CPU0: 04(04), Send
accept error" log messages on an AMD system. And rightly so - nothing
excludes the use of the shorthand in send_IPI_mask() in this case. Set
"unaccounted_cpus" to "true" also when command line restrictions are the
cause.
Note that PV-shim mode is unaffected by this change, first and foremost
because "nosmp" and "maxcpus=" are ignored in this case.
Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7621880de0bb40bae6436a5b106babc0e4718f4d
master date: 2021-12-10 10:26:52 +0100
Jan Beulich [Fri, 7 Jan 2022 07:51:51 +0000 (08:51 +0100)]
x86/HVM: fail virt-to-linear conversion for insn fetches from non-code segments
Just like (in protected mode) reads may not go to exec-only segments and
writes may not go to non-writable ones, insn fetches may not access data
segments.
Fixes: 623e83716791 ("hvm: Support hardware task switching") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 311297f4216a4387bdae6df6cfbb1f5edb06618a
master date: 2021-12-06 14:15:05 +0100
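Illustration for the virt-to-linear entry above (not part of the patch): a minimal sketch of the three protected-mode rules the message lists, keyed on the code/data bit and the readable/writable bit of the segment descriptor type. The enum, macro and function names are hypothetical. Limit checks, system segments and long mode are deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum access { ACC_READ, ACC_WRITE, ACC_FETCH };

/* Relevant bits of the descriptor type field for code/data segments. */
#define SEG_TYPE_CODE (1u << 3)  /* set: code segment, clear: data segment */
#define SEG_TYPE_RW   (1u << 1)  /* code: readable, data: writable */

static inline bool access_permitted(uint8_t type, enum access acc)
{
    bool code = (type & SEG_TYPE_CODE) != 0;
    bool rw = (type & SEG_TYPE_RW) != 0;

    switch ( acc )
    {
    case ACC_READ:
        return !code || rw;   /* reads may not go to exec-only segments */
    case ACC_WRITE:
        return !code && rw;   /* writes need a writable data segment */
    case ACC_FETCH:
        return code;          /* insn fetches may not access data segments */
    }

    return false;
}
```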
Jan Beulich [Fri, 7 Jan 2022 07:51:27 +0000 (08:51 +0100)]
VT-d: don't leak domid mapping on error path
While domain_context_mapping() invokes domain_context_unmap() in a sub-
case of handling DEV_TYPE_PCI when encountering an error, thus avoiding
a leak, individual calls to domain_context_mapping_one() aren't
similarly covered. Such a leak might persist until domain destruction.
Leverage that these cases can be recognized by pdev being non-NULL.
Fixes: dec403cc668f ("VT-d: fix iommu_domid for PCI/PCIx devices assignment") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: e6252a51faf42c892eb5fc71f8a2617580832196
master date: 2021-11-24 11:07:11 +0100
Roger Pau Monné [Fri, 7 Jan 2022 07:50:22 +0000 (08:50 +0100)]
efi: fix alignment of function parameters in compat mode
Currently the max_store_size, remain_store_size and max_size in
compat_pf_efi_runtime_call are 4 byte aligned, which makes clang
13.0.0 complain with:
In file included from compat.c:30:
./runtime.c:646:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 2 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.max_store_size,
^
./runtime.c:647:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 3 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.remain_store_size,
^
./runtime.c:648:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 4 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.max_size);
^
Fix this by bouncing the variables on the stack in order for them to
be 8 byte aligned.
Note this could be done in a more selective manner to only apply to
compat code calls, but given the overhead of making an EFI call doing
an extra copy of 3 variables doesn't seem to warrant the special
casing.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Signed-off-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be12fcca8b784e456df3adedbffe657d753c5ff9
master date: 2021-11-19 17:01:24 +0000
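Illustration for the EFI compat alignment entry above (not part of the patch): a minimal sketch of "bouncing" under-aligned 64-bit fields through naturally aligned locals before handing pointers to a callee that expects 8-byte alignment. The typedef relies on the GCC/clang `aligned` attribute; the struct, the stand-in callee and its return values are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value with only 4-byte alignment, as in a 32-bit compat layout. */
typedef uint64_t __attribute__((aligned(4))) compat_u64;

struct compat_query_info {
    uint32_t attr;
    compat_u64 max_store_size;
    compat_u64 remain_store_size;
    compat_u64 max_size;
};

/* Hypothetical stand-in for a firmware call taking 8-byte aligned pointers. */
static inline int query_info(uint32_t attr, uint64_t *max_store,
                             uint64_t *remain, uint64_t *max)
{
    *max_store = 1000u + attr;
    *remain = 400;
    *max = 64;
    return 0;
}

/*
 * Bounce through locals: the callee only ever sees naturally aligned
 * uint64_t objects, and the results are copied back afterwards.
 */
static inline int query_info_bounced(struct compat_query_info *op)
{
    uint64_t max_store_size, remain_store_size, max_size;
    int rc = query_info(op->attr, &max_store_size, &remain_store_size,
                        &max_size);

    op->max_store_size = max_store_size;
    op->remain_store_size = remain_store_size;
    op->max_size = max_size;

    return rc;
}
```

The cost is three extra 64-bit copies per call, which matches the trade-off described in the commit message.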
xen/arm: Do not invalidate the P2M when the PT is shared with the IOMMU
Set/Way flushes never work correctly in a virtualized environment.
Our current implementation is based on clearing the valid bit in the p2m
pagetable to track guest memory accesses. This technique doesn't work
when the IOMMU is enabled for the domain and the pagetable is shared
between IOMMU and MMU because it triggers IOMMU faults.
Specifically, p2m_invalidate_root causes IOMMU faults if
iommu_use_hap_pt returns true for the domain.
Add a check in p2m_set_way_flush: if a set/way instruction is used
and iommu_use_hap_pt returns true, rather than failing with obscure
IOMMU faults, inject an undef exception straight away into the guest,
and print a verbose error message to explain the problem.
Also add an ASSERT in p2m_invalidate_root to make sure we don't
inadvertently stumble across this problem again in the future.
Ian Jackson [Mon, 6 Dec 2021 14:40:24 +0000 (14:40 +0000)]
MAINTAINERS: Resign from tools stable branch maintainership
Signed-off-by: Ian Jackson <iwj@xenproject.org> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit c623a84c2a4fda1cd25f5347a6298706218eb5fb)
Jan Beulich [Tue, 23 Nov 2021 12:30:09 +0000 (13:30 +0100)]
x86/P2M: deal with partial success of p2m_set_entry()
M2P and PoD stats need to remain in sync with P2M; if an update succeeds
only partially, respective adjustments need to be made. If updates get
made before the call, they may also need undoing upon complete failure
(i.e. including the single-page case).
Log-dirty state would better also be kept in sync.
Note that the change to set_typed_p2m_entry() may not be strictly
necessary (due to the order restriction enforced near the top of the
function), but is being kept here to be on the safe side.
This is CVE-2021-28705 and CVE-2021-28709 / XSA-389.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 74a11c43fd7e074b1f77631b446dd2115eacb9e8
master date: 2021-11-22 12:27:30 +0000
Jan Beulich [Tue, 23 Nov 2021 12:29:54 +0000 (13:29 +0100)]
x86/PoD: handle intermediate page orders in p2m_pod_cache_add()
p2m_pod_decrease_reservation() may pass pages to the function which
aren't 4k, 2M, or 1G. Handle all intermediate orders as well, to avoid
hitting the BUG() at the switch() statement's "default" case.
This is CVE-2021-28708 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ec13f68e0b026863d23e7f44f252d06478bc809
master date: 2021-11-22 12:27:30 +0000
Jan Beulich [Tue, 23 Nov 2021 12:29:41 +0000 (13:29 +0100)]
x86/PoD: deal with misaligned GFNs
Users of XENMEM_decrease_reservation and XENMEM_populate_physmap aren't
required to pass in order-aligned GFN values. (While I consider this
bogus, I don't think we can fix this there, as that might break existing
code, e.g. Linux's swiotlb, which - while affecting PV only - until
recently had been enforcing only page alignment on the original
allocation.) Only non-PoD code paths (guest_physmap_{add,remove}_page(),
p2m_set_entry()) look to be dealing with this properly (in part by being
implemented inefficiently, handling every 4k page separately).
Introduce wrappers taking care of splitting the incoming request into
aligned chunks, without putting much effort in trying to determine the
largest possible chunk at every iteration.
Also "handle" p2m_set_entry() failure for non-order-0 requests by
crashing the domain in one more place. Alongside putting a log message
there, also add one to the other similar path.
Note regarding locking: This is left in the actual worker functions on
the assumption that callers aren't guaranteed atomicity wrt acting on
multiple pages at a time. For mis-aligned GFNs gfn_lock() wouldn't have
locked the correct GFN range anyway, if it didn't simply resolve to
p2m_lock(), and for well-behaved callers there continues to be only a
single iteration, i.e. behavior is unchanged for them. (FTAOD pulling
out just pod_lock() into p2m_pod_decrease_reservation() would result in
a lock order violation.)
This is CVE-2021-28704 and CVE-2021-28707 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 182c737b9ba540ebceb1433f3940fbed6eac4ea9
master date: 2021-11-22 12:27:30 +0000
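Illustration for the "deal with misaligned GFNs" entry above (not part of the patch): a minimal sketch of splitting a request of 2^order pages starting at a possibly misaligned GFN into chunks that are each aligned to their own size. As the message says, no effort is made to pick the largest possible chunk at every step: one chunk order is derived up front from the lowest set bit of (gfn | 2^order). All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct chunk {
    uint64_t gfn;
    unsigned int order;
};

/* Index of the lowest set bit; 'v' must be non-zero. */
static inline unsigned int lowest_set_bit(uint64_t v)
{
    unsigned int bit = 0;

    assert(v != 0);
    while ( !(v & 1) )
    {
        v >>= 1;
        bit++;
    }

    return bit;
}

/*
 * Split [gfn, gfn + 2^order) into equally sized, self-aligned chunks.
 * Writes at most 'max' chunks to 'out' and returns how many were written.
 */
static inline unsigned int split_aligned(uint64_t gfn, unsigned int order,
                                         struct chunk *out, unsigned int max)
{
    uint64_t left;
    unsigned int chunk_order, n = 0;

    assert(order < 64);
    left = UINT64_C(1) << order;
    /* The 2^order bit bounds the result, so chunk_order <= order. */
    chunk_order = lowest_set_bit(gfn | left);

    do {
        assert(n < max);
        out[n].gfn = gfn;
        out[n].order = chunk_order;
        n++;
        gfn += UINT64_C(1) << chunk_order;
        left -= UINT64_C(1) << chunk_order;
    } while ( left );

    return n;
}
```

A well-aligned request yields exactly one chunk of the original order, so behaviour for well-behaved callers is unchanged.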
Julien Grall [Tue, 23 Nov 2021 12:29:09 +0000 (13:29 +0100)]
xen/page_alloc: Harden assign_pages()
domain_tot_pages() and d->max_pages are 32-bit values. While the order
should always be quite small, it would still be possible to overflow
if domain_tot_pages() is near to (2^32 - 1).
As this code may be called by a guest via XENMEM_increase_reservation
and XENMEM_populate_physmap, we want to make sure the guest is not going
to be able to allocate more than it is allowed.
Rework the allocation check to avoid any possible overflow. While the
check domain_tot_pages() < d->max_pages should technically not be
necessary, it is probably best to have it to catch any possible
inconsistencies in the future.
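Illustration for the assign_pages() entry above (not part of the patch): a minimal sketch of an allocation check that cannot overflow, assuming 32-bit page counts as the message describes. The naive form adds first and can wrap; the hardened form compares against the remaining headroom instead. Function names are hypothetical and this is not Xen's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Naive check: tot + nr is computed modulo 2^32 and may wrap to a small value. */
static inline bool fits_naive(uint32_t tot, uint32_t max, unsigned int order)
{
    uint32_t nr = UINT32_C(1) << order;

    return (uint32_t)(tot + nr) <= max;
}

/*
 * Hardened check: never forms tot + nr. The tot <= max test also guards the
 * subtraction against underflow should the two values ever be inconsistent.
 */
static inline bool fits_checked(uint32_t tot, uint32_t max, unsigned int order)
{
    uint32_t nr;

    assert(order < 32);
    nr = UINT32_C(1) << order;

    return tot <= max && nr <= max - tot;
}
```

With tot = max = 2^32 - 1 and order 0, the naive sum wraps to 0 and the request is wrongly accepted, while the hardened check rejects it.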
Jan Beulich [Fri, 19 Nov 2021 08:41:41 +0000 (09:41 +0100)]
x86/APIC: avoid iommu_supports_x2apic() on error path
The value it returns may change from true to false in case
iommu_enable_x2apic() fails and, as a side effect, clears iommu_intremap
(as can happen at least on AMD). Latch the return value from the first
invocation to replace the second one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0f50d1696b3c13cbf0b18fec817fc291d5a30a31
master date: 2021-11-04 14:44:43 +0100
Jan Beulich [Fri, 19 Nov 2021 08:41:09 +0000 (09:41 +0100)]
x86/IOMMU: mark IOMMU / intremap not in use when ACPI tables are missing
x2apic_bsp_setup() gets called ahead of iommu_setup(), and since x2APIC
mode (physical vs clustered) depends on iommu_intremap, that variable
needs to be set to off as soon as we know we can't / won't enable
interrupt remapping, i.e. in particular when parsing of the respective
ACPI tables failed. Move the turning off of iommu_intremap from AMD
specific code into acpi_iommu_init(), accompanying it by clearing of
iommu_enable.
Take the opportunity and also fully skip ACPI table parsing logic on
VT-d when both "iommu=off" and "iommu=no-intremap" are in effect anyway,
like was already the case for AMD.
The tag below only references the commit uncovering a pre-existing
anomaly.
Fixes: d8bd82327b0f ("AMD/IOMMU: obtain IVHD type to use earlier") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 46c4061cd2bf69e8039021af615c2bdb94e50088
master date: 2021-11-04 14:44:01 +0100
x86/xstate: reset cached register values on resume
set_xcr0() and set_msr_xss() use a cached value to avoid setting the
register to the same value over and over. But suspend/resume implicitly
reset the registers and since percpu areas are not deallocated on
suspend anymore, the cache gets stale.
Reset the cache on resume, to ensure the next write will really hit the
hardware. Choose value 0, as it will never be a legitimate write to
those registers - and so, will force write (and cache update).
Note the cache is used in get_xcr0() and get_msr_xss() too, but:
- set_xcr0() is called a few lines below in xstate_init(), so it will
update the cache with appropriate value
- get_msr_xss() is not used anywhere - and thus not before any
set_msr_xss() that will fill the cache
Fixes: aca2a985a55a "xen: don't free percpu areas during suspend" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f7f4a523927fa4c7598e4647a16bc3e3cf8009d0
master date: 2021-11-04 14:42:37 +0100
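Illustration for the xstate entry above (not part of the patch): a minimal sketch of a write-avoiding register cache going stale across suspend, and of resetting it to 0 on resume so the next write reaches the hardware. The "hardware" is modelled as a struct field, and every name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one CPU's register, its software cache, and a write counter. */
struct cpu_model {
    uint64_t hw_reg;        /* what the hardware currently holds */
    uint64_t cached;        /* last value software believes it wrote */
    unsigned int hw_writes; /* number of real register writes */
};

/* Skip the (expensive) register write when the cache says it is redundant. */
static inline void model_set_reg(struct cpu_model *c, uint64_t val)
{
    assert(val != 0);   /* the scheme relies on 0 never being written */

    if ( c->cached == val )
        return;

    c->hw_reg = val;
    c->hw_writes++;
    c->cached = val;
}

/* Suspend/resume resets the hardware register; the cache survives in memory. */
static inline void model_suspend_resume(struct cpu_model *c, uint64_t reset_val)
{
    c->hw_reg = reset_val;
}

/* The fix: invalidate the cache on resume so the next write is not skipped. */
static inline void model_reset_cache(struct cpu_model *c)
{
    c->cached = 0;
}
```

Without model_reset_cache(), a write of the pre-suspend value after resume is skipped and the register keeps its reset value.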
Andrew Cooper [Fri, 19 Nov 2021 08:40:19 +0000 (09:40 +0100)]
x86/traps: Fix typo in do_entry_CP()
The call to debugger_trap_entry() should pass the correct vector. The
break-for-gdbsx logic is in practice unreachable because PV guests can't
generate #CP, but it will interfere with anyone inserting custom debugging
into debugger_trap_entry().
Fixes: 5ad05b9c2490 ("x86/traps: Implement #CP handler and extend #PF for shadow stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 512863ed238d7390f74d43f0ba298b1dfa8f4803
master date: 2021-11-03 19:13:17 +0000
Andrew Cooper [Fri, 19 Nov 2021 08:39:46 +0000 (09:39 +0100)]
x86/shstk: Fix use of shadow stacks with XPTI active
The call to setup_cpu_root_pgt(0) in smp_prepare_cpus() is too early. It
clones the BSP's stack while the .data mapping is still in use, causing all
mappings to be fully read/write (and with no guard pages either). This
ultimately causes #DF when trying to enter the dom0 kernel for the first time.
Defer setting up the BSP's XPTI pagetable until reinit_bsp_stack() after we've set
up proper shadow stack permissions.
Fixes: 60016604739b ("x86/shstk: Rework the stack layout to support shadow stacks") Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b2851580b1f2ff121737a37cb25a370d7692ae3b
master date: 2021-11-03 13:08:42 +0000
Dongli Zhang [Fri, 19 Nov 2021 08:39:09 +0000 (09:39 +0100)]
update system time immediately when VCPUOP_register_vcpu_info
The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This is to borrow the idea of
VCPUOP_register_vcpu_time_memory_area, where the
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.
Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().
Jan Beulich [Fri, 19 Nov 2021 08:38:42 +0000 (09:38 +0100)]
x86/paging: restrict physical address width reported to guests
Modern hardware may report more than 48 bits of physical address width.
For paging-external guests our P2M implementation does not cope with
larger values. Telling the guest of more available bits means misleading
it into perhaps trying to actually put some page there (like was e.g.
intermediately done in OVMF for the shared info page).
While there also convert the PV check to a paging-external one (which in
our current code base are synonyms of one another anyway).
Fixes: 5dbd60e16a1f ("x86/shadow: Correct guest behaviour when creating PTEs above maxphysaddr") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b7635526acffbe4ad8ad16fd92812c57742e54c2
master date: 2021-10-19 10:08:30 +0200
Jan Beulich [Fri, 19 Nov 2021 08:38:09 +0000 (09:38 +0100)]
x86/AMD: make HT range dynamic for Fam17 and up
At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.
To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.
While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: d6e38eea2d806c53d976603717aebf6e5de30a1e
master date: 2021-10-19 10:04:13 +0200
Jan Beulich [Fri, 19 Nov 2021 08:37:37 +0000 (09:37 +0100)]
x86emul: de-duplicate scatters to the same linear address
The SDM specifically allows for earlier writes to fully overlapping
ranges to be dropped. If a guest did so, hvmemul_phys_mmio_access()
would crash it if varying data was written to the same address. Detect
overlaps early, as doing so in hvmemul_{linear,phys}_mmio_access() would
be quite a bit more difficult. To maintain proper faulting behavior,
instead of dropping earlier write instances of fully overlapping slots
altogether, write the data of the final of these slots multiple times.
(We also can't pull ahead the [single] write of the data of the last of
the slots, clearing all involved slots' op_mask bits together, as this
would yield incorrect results if there were intervening partially
overlapping ones.)
Note that due to cache slot use being linear address based, there's no
similar issue with multiple writes to the same physical address (mapped
through different linear addresses).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a8cddbac5051020bb4a59a7f0ea27500c51063fb
master date: 2021-10-19 10:02:39 +0200
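Illustration for the x86emul scatter entry above (not part of the patch): a minimal sketch of the de-duplication idea, assuming equally sized elements so that "fully overlapping" means "same linear address". Earlier elements are not dropped; they are given the data of the last element targeting the same address, so every write still happens (and can still fault) while the final memory contents are unchanged. The types and names are hypothetical and this is not the emulator's actual code.

```c
#include <assert.h>
#include <stdint.h>

struct scatter_elem {
    uint64_t addr;  /* linear address of the element */
    uint32_t data;  /* value to be written there */
};

/*
 * For every element that is fully overlapped by a later one, substitute the
 * data of the last such later element. The number of writes stays the same.
 */
static inline void dedup_scatter(struct scatter_elem *e, unsigned int n)
{
    for ( unsigned int i = 0; i < n; i++ )
    {
        /* Search from the end so the final writer wins. */
        for ( unsigned int j = n - 1; j > i; j-- )
        {
            if ( e[j].addr == e[i].addr )
            {
                e[i].data = e[j].data;
                break;
            }
        }
    }
}
```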
Jan Beulich [Fri, 15 Oct 2021 09:20:04 +0000 (11:20 +0200)]
x86/PV32: fix physdev_op_compat handling
The conversion of the original code failed to recognize that the 32-bit
compat variant of this (sorry, two different meanings of "compat" here)
needs to continue to invoke the compat handler, not the native one.
Arrange for this by adding yet another #define.
Affected functions (having existed prior to the introduction of the new
hypercall) are PHYSDEVOP_set_iobitmap and PHYSDEVOP_apic_{read,write}.
For all others the operand struct layout doesn't differ.
Fixes: 1252e2823117 ("x86/pv: Export pv_hypercall_table[] rather than working around it in several ways") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 834cb8761051f7d87816785c0d99fe9bd5f0ce30
master date: 2021-10-12 11:55:42 +0200
Jan Beulich [Fri, 15 Oct 2021 09:19:41 +0000 (11:19 +0200)]
AMD/IOMMU: consider hidden devices when flushing device I/O TLBs
Hidden devices are associated with DomXEN but usable by the
hardware domain. Hence they need flushing as well when all devices are
to have flushes invoked.
While there drop a redundant ATS-enabled check and constify the first
parameter of the involved function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 036432e8b27e1ef21e0f0204ba9b0e3972a031c2
master date: 2021-10-12 11:54:34 +0200
Jan Beulich [Fri, 15 Oct 2021 09:19:11 +0000 (11:19 +0200)]
x86/HVM: fix xsm_op for 32-bit guests
Like for PV, 32-bit guests need to invoke the compat handler, not the
native one.
Fixes: db984809d61b ("hvm: wire up domctl and xsm hypercalls") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6b672e8a925ff4b71a1a67bc7d213ef445af74f
master date: 2021-10-11 10:58:44 +0200
Jan Beulich [Fri, 15 Oct 2021 09:18:51 +0000 (11:18 +0200)]
x86/build: suppress EFI-related tool chain checks upon local $(MAKE) recursion
The xen-syms and xen.efi linking steps are serialized only when the
intermediate note.o file is necessary. Otherwise both may run in
parallel. This in turn means that the compiler / linker invocations to
create efi/check.o / efi/check.efi may also happen twice in parallel.
Obviously it's a bad idea to have multiple producers of the same output
race with one another - every once in a while one may e.g. observe
objdump: efi/check.efi: file format not recognized
We don't need this EFI related checking to occur when producing the
intermediate symbol and relocation table objects, and we have an easy
way of suppressing it: Simply pass in "efi-y=", overriding the
assignments done in the Makefile and thus forcing the tool chain checks
to be bypassed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 24b0ce9a5da2e648cde818055a085bcbcf24ecb0
master date: 2021-10-11 10:58:17 +0200
Jan Beulich [Fri, 15 Oct 2021 09:17:32 +0000 (11:17 +0200)]
VT-d: consider hidden devices when unmapping
Whether to clear an IOMMU's bit in the domain's bitmap should depend on
all devices the domain can control. For the hardware domain this
includes hidden devices, which are associated with DomXEN.
While touching related logic
- convert the "current device" exclusion check to a simple pointer
comparison,
- convert "found" to "bool",
- adjust style and correct a typo in an existing comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 75bfe6ec4844f83b300b9807bceaed1e2fe23270
master date: 2021-09-20 10:24:27 +0200
Roger Pau Monné [Fri, 15 Oct 2021 09:16:41 +0000 (11:16 +0200)]
x86: quote section names when defining them in linker script
LLVM ld seems to require section names to be quoted at both definition
and when referencing them for a match to happen, or else we get the
following errors:
The original fix for GNU ld 2.37 only quoted the section name when
referencing it in the ADDR function. Fix by also quoting the section
names when declaring them.
Fixes: 58ad654ebce7 ("x86: work around build issue with GNU ld 2.37") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6254920587c33bcc7ab884e6c9a11cfc0d5867ab
master date: 2021-09-15 11:02:21 +0200
Andrew Cooper [Fri, 15 Oct 2021 09:15:14 +0000 (11:15 +0200)]
x86/amd: Use newer SSBD mechanisms if they exist
The opencoded legacy Memory Disambiguation logic in init_amd() neglected
Fam19h for the Zen3 microarchitecture. Furthermore, all Zen2 based systems
have the architectural MSR_SPEC_CTRL and the SSBD bit within it, so shouldn't
be using MSR_AMD64_LS_CFG.
Implement the algorithm given in AMD's SSBD whitepaper, and leave a
printk_once() behind in the case that no controls can be found.
This now means that a user explicitly choosing `spec-ctrl=ssbd` will properly
turn off Memory Disambiguation on Fam19h/Zen3 systems.
This still remains a single system-wide setting (for now), and is not context
switched between vCPUs. As such, it doesn't interact with Intel's use of
MSR_SPEC_CTRL and default_xen_spec_ctrl (yet).
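The selection order can be sketched roughly as follows. This is an illustration only, not the code from the patch: the struct, the enum and pick_ssbd_mech() are hypothetical names, and the list of families handled by the legacy MSR_AMD64_LS_CFG path (15h, 16h, 17h) is an assumption to be checked against AMD's whitepaper.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum ssbd_mech {
    SSBD_NOT_NEEDED,          /* hardware reports it is not affected */
    SSBD_VIA_SPEC_CTRL,       /* architectural MSR_SPEC_CTRL.SSBD */
    SSBD_VIA_VIRT_SPEC_CTRL,  /* MSR_VIRT_SPEC_CTRL offered by a hypervisor */
    SSBD_VIA_LS_CFG,          /* legacy, model specific MSR_AMD64_LS_CFG bit */
    SSBD_NO_CONTROL,          /* nothing found; warn once */
};

struct cpu_caps {
    unsigned int family;
    bool ssb_no;      /* enumerated as not vulnerable */
    bool ssbd;        /* MSR_SPEC_CTRL.SSBD is available */
    bool virt_ssbd;   /* MSR_VIRT_SPEC_CTRL is available */
};

static enum ssbd_mech pick_ssbd_mech(const struct cpu_caps *c)
{
    if (c->ssb_no)
        return SSBD_NOT_NEEDED;
    if (c->ssbd)                   /* prefer the architectural control */
        return SSBD_VIA_SPEC_CTRL;
    if (c->virt_ssbd)
        return SSBD_VIA_VIRT_SPEC_CTRL;

    switch (c->family) {           /* assumed legacy families */
    case 0x15:
    case 0x16:
    case 0x17:
        return SSBD_VIA_LS_CFG;
    default:
        return SSBD_NO_CONTROL;    /* the printk_once() case */
    }
}
```

Under these assumptions a Fam19h part that enumerates the architectural bit takes the MSR_SPEC_CTRL path and never reaches the family switch.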
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2a4e6c4e4bea2e0bb720418c331ee28ff9c7632e
master date: 2021-09-08 14:16:19 +0100
Andrew Cooper [Fri, 15 Oct 2021 09:14:46 +0000 (11:14 +0200)]
x86/amd: Enumeration for speculative features/hints
There is a step change in speculation protections between the Zen1 and Zen2
microarchitectures.
Zen1 and older have no special support. Control bits in non-architectural
MSRs are used to make lfence be dispatch-serialising (Spectre v1), and to
disable Memory Disambiguation (Speculative Store Bypass). IBPB was
retrofitted in a microcode update, and software methods are required for
Spectre v2 protections.
Because the bit controlling Memory Disambiguation is model specific,
hypervisors are expected to expose a MSR_VIRT_SPEC_CTRL interface which
abstracts the model specific details.
Zen2 and later implement the MSR_SPEC_CTRL interface in hardware, and
virtualise the interface for HVM guests to use. A number of hint bits are
specified too to help guide OS software to the most efficient mitigation
strategy.
Zen3 introduced a new feature, Predictive Store Forwarding, along with a
control to disable it in sensitive code.
Add CPUID and VMCB details for all the new functionality.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 747424c664bb164a04e7a9f2ffbf02d4a1630d7d
master date: 2021-09-08 14:16:19 +0100
Andrew Cooper [Fri, 15 Oct 2021 09:14:16 +0000 (11:14 +0200)]
x86/spec-ctrl: Split the "Hardware features" diagnostic line
Separate the read-only hints from the features requiring active actions on
Xen's behalf.
Also take the opportunity to split the IBRS/IBPB and IBPB mess. More features
with overlapping enumeration are on the way, and it is not useful to split
them like this.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 565ebcda976c05b0c6191510d5e32b621a2b1867
master date: 2021-09-08 14:16:19 +0100
Jan Beulich [Fri, 1 Oct 2021 13:05:42 +0000 (15:05 +0200)]
VT-d: fix deassign of device with RMRR
Ignoring a specific error code here was not meant to short circuit
deassign to _just_ the unmapping of RMRRs. This bug was previously
hidden by the bogus (potentially indefinite) looping in
pci_release_devices(), until f591755823a7 ("IOMMU/PCI: don't let domain
cleanup continue when device de-assignment failed") fixed that loop.
This is CVE-2021-28702 / XSA-386.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Reported-by: Ivan Kardykov <kardykov@tabit.pro> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ivan Kardykov <kardykov@tabit.pro>
(cherry picked from commit 24ebe875a77833696bbe5c9372e9e1590a7e7101)
Jan Beulich [Wed, 8 Sep 2021 12:53:04 +0000 (14:53 +0200)]
gnttab: deal with status frame mapping race
Once gnttab_map_frame() drops the grant table lock, the MFN it reports
back to its caller is free to other manipulation. In particular
gnttab_unpopulate_status_frames() might free it, by a racing request on
another CPU, thus resulting in a reference to a deallocated page getting
added to a domain's P2M.
Obtain a page reference in gnttab_map_frame() to prevent freeing of the
page until xenmem_add_to_physmap_one() has actually finished acting
on the page. Do so uniformly, even if only strictly required for v2
status pages, to avoid extra conditionals (which then would all need to
be kept in sync going forward).
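The idea can be illustrated with a toy, single-threaded reference count model. The struct and the helpers below are hypothetical stand-ins and do not reproduce Xen's page reference API; map_frame() plays the role of gnttab_map_frame() running with the grant table lock held.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a refcounted page; not Xen's struct page_info. */
struct page {
    unsigned int refcount;
    bool freed;
};

/* Take a reference, unless the page is already gone. */
static bool get_page(struct page *pg)
{
    if (pg->freed || pg->refcount == 0)
        return false;
    pg->refcount++;
    return true;
}

/* Drop a reference; the last one frees the page. */
static void put_page(struct page *pg)
{
    if (--pg->refcount == 0)
        pg->freed = true;
}

/*
 * Stand-in for the lookup done under the grant table lock: hand the
 * frame back only together with a reference held for the caller.
 */
static struct page *map_frame(struct page *status_frame)
{
    return get_page(status_frame) ? status_frame : NULL;
}
```

With the extra reference, a put_page() issued by a racing unpopulate request cannot free the frame. The page goes away only once the caller drops its own reference after it is done with the page.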
Jan Beulich [Wed, 8 Sep 2021 12:52:13 +0000 (14:52 +0200)]
x86/p2m-pt: fix p2m_flags_to_access()
The initial if() was inverted, invalidating all output from this
function. Which in turn means the mirroring of P2M mappings into the
IOMMU didn't always work as intended: Mappings may have got updated when
there was no need to. There would not have been too few (un)mappings;
what saves us is that alongside the flags comparison MFNs also get
compared, with non-present entries always having an MFN of 0 or
INVALID_MFN while present entries always have MFNs different from these
two (0 in the table also meant to cover INVALID_MFN):
         OLD                     NEW
 P W  access  MFN        P W  access  MFN
 0 0    r      0         0 0    n      0
 0 1    rw     0         0 1    n      0
 1 0    n     non-0      1 0    r     non-0
 1 1    n     non-0      1 1    rw    non-0
present <-> non-present transitions are fine because the MFNs differ.
present -> present transitions as well as non-present -> non-present
ones are potentially causing too many map/unmap operations, but never
too few, because in that case old (bogus) and new access differ.
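The argument can be checked exhaustively with a small model. This is a sketch under simplifying assumptions: entries carry only the P and W bits plus an MFN, all present entries share one non-zero MFN (the worst case for the comparison), and every name below is hypothetical rather than taken from Xen.

```c
#include <stdbool.h>

enum access { ACC_N, ACC_R, ACC_RW };

struct entry {
    bool present, writable;
    unsigned long mfn;
};

/* Intended behaviour: non-present entries grant no access. */
static enum access access_fixed(const struct entry *e)
{
    if (!e->present)
        return ACC_N;
    return e->writable ? ACC_RW : ACC_R;
}

/* Behaviour with the initial if() inverted. */
static enum access access_buggy(const struct entry *e)
{
    if (e->present)
        return ACC_N;
    return e->writable ? ACC_RW : ACC_R;
}

static struct entry make_entry(unsigned int bits)
{
    struct entry e;

    e.present = (bits & 2) != 0;
    e.writable = (bits & 1) != 0;
    e.mfn = e.present ? 0x1000UL : 0UL;   /* non-present entries use MFN 0 */
    return e;
}

/*
 * Walk all 16 old -> new transitions. The old entry's access is derived
 * via the helper, the new entry's access is known directly. Count either
 * the missed updates (needed, but skipped with the bug) or the spurious
 * ones (done with the bug, but not needed).
 */
static unsigned int count_updates(bool want_missed)
{
    unsigned int count = 0;

    for (unsigned int o = 0; o < 4; o++) {
        for (unsigned int n = 0; n < 4; n++) {
            struct entry old = make_entry(o), new = make_entry(n);
            bool mfn_diff = old.mfn != new.mfn;
            bool fixed = mfn_diff || access_fixed(&old) != access_fixed(&new);
            bool buggy = mfn_diff || access_buggy(&old) != access_fixed(&new);

            if (want_missed ? (fixed && !buggy) : (buggy && !fixed))
                count++;
        }
    }

    return count;
}
```

In this model count_updates(true) returns 0 and count_updates(false) returns 6, matching the claim that the bug could cause too many map/unmap operations but never too few.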
Fixes: d1bb6c97c31e ("IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e70a9a043a5ce6d4025420f729bc473f711bf5d1
master date: 2021-09-07 14:24:49 +0200