]> xenbits.xensource.com Git - xen.git/log
xen.git
3 years agox86/cet: Remove writeable mapping of the BSPs shadow stack
Andrew Cooper [Tue, 15 Mar 2022 12:07:18 +0000 (12:07 +0000)]
x86/cet: Remove writeable mapping of the BSPs shadow stack

An unintended consequence of the BSP using cpu0_stack[] is that writeable
mappings to the BSPs shadow stacks are retained in the bss.  This renders
CET-SS almost useless, as an attacker can update both return addresses and the
ret will not fault.

We specifically don't want to shatter the superpage mapping .data and .bss, so
the only way to fix this is to not have the BSP stack in the main Xen image.

Break cpu_alloc_stack() out of cpu_smpboot_alloc(), and dynamically allocate
the BSP stack as early as reasonable in __start_xen().  As a consequence,
there is no need to delay the BSP's memguard_guard_stack() call.

Copy the top of cpu info block just before switching to use the new stack.
Fix a latent bug by setting %rsp to info->guest_cpu_user_regs rather than
->es; this would be buggy if reinit_bsp_stack() called schedule() (which
rewrites the GPR block) directly, but luckily it doesn't.

Finally, move cpu0_stack[] into .init, so it can be reclaimed after boot.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37786b23b027ab83051175cb8ce9ac86cacfc58e)

3 years agox86/cet: Clear IST supervisor token busy bits on S3 resume
Andrew Cooper [Mon, 14 Mar 2022 10:30:46 +0000 (10:30 +0000)]
x86/cet: Clear IST supervisor token busy bits on S3 resume

Stacks are not freed across S3.  Execution just stops, leaving supervisor
token busy bits active.  Fixing this for the primary shadow stack was done
previously, but there is a (rare) risk that an IST token is left busy too, if
the platform power-off happens to intersect with an NMI/#MC arriving.  This
will manifest as #DF next time the IST vector gets used.

Introduce rdssp() and wrss() helpers in a new shstk.h, cleaning up
fixup_exception_return() and explaining the trick with the literal 1.

Then this infrastructure to rewrite the IST tokens in load_system_tables()
when all the other IST details are being set up.  In the case that an IST
token were left busy across S3, this will clear the busy bit before the stack
gets used.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e421ed0f68488863599532bda575c03c33cde0e0)

3 years agox86/kexec: Fix kexec-reboot with CET active
Andrew Cooper [Mon, 7 Mar 2022 20:19:18 +0000 (20:19 +0000)]
x86/kexec: Fix kexec-reboot with CET active

The kexec_reloc() asm has an indirect jump to relocate onto the identity
trampoline.  While we clear CET in machine_crash_shutdown(), we fail to clear
CET for the non-crash path.  This in turn highlights that the same is true of
resetting the CPUID masking/faulting.

Move both pieces of logic from machine_crash_shutdown() to machine_kexec(),
the latter being common for all kexec transitions.  Adjust the condition for
CET being considered active to check in CR4, which is simpler and more robust.

Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT")
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in context_switch()")
Reported-by: David Vrabel <dvrabel@amazon.co.uk>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: David Vrabel <dvrabel@amazon.co.uk>
(cherry picked from commit 7f5b2448bd724f5f24426b2595a9bdceb1e5a346)

3 years agox86/spec-ctrl: Disable retpolines with CET-IBT
Andrew Cooper [Mon, 28 Feb 2022 19:26:37 +0000 (19:26 +0000)]
x86/spec-ctrl: Disable retpolines with CET-IBT

CET-IBT depend on executing indirect branches for protections to apply.
Extend the clobber for CET-SS to all of CET.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6e3f36387de566b09aa4145ea0e3bfe4814d68b4)

3 years agox86/CET: Fix S3 resume with shadow stacks active
Andrew Cooper [Thu, 24 Feb 2022 12:18:00 +0000 (12:18 +0000)]
x86/CET: Fix S3 resume with shadow stacks active

The original shadow stack support has an error on S3 resume with very bizarre
fallout.  The BSP comes back up, but APs fail with:

  (XEN) Enabling non-boot CPUs ...
  (XEN) Stuck ??
  (XEN) Error bringing CPU1 up: -5

and then later (on at least two Intel TigerLake platforms), the next HVM vCPU
to be scheduled on the BSP dies with:

  (XEN) d1v0 Unexpected vmexit: reason 3
  (XEN) domain_crash called from vmx.c:4304
  (XEN) Domain 1 (vcpu#0) crashed on cpu#0:

The VMExit reason is EXIT_REASON_INIT, which has nothing to do with the
scheduled vCPU, and will be addressed in a subsequent patch.  It is a
consequence of the APs triple faulting.

The reason the APs triple fault is because we don't tear down the stacks on
suspend.  The idle/play_dead loop is killed in the middle of running, meaning
that the supervisor token is left busy.

On resume, SETSSBSY finds busy bit set, suffers #CP and triple faults because
the IDT isn't configured this early.

Rework the AP bring-up path to (re)create the supervisor token.  This ensures
the primary stack is non-busy before use.

Note: There are potential issues with the IST shadow stacks too, but fixing
      those is more involved.

Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Link: https://github.com/QubesOS/qubes-issues/issues/7283
Reported-by: Thiner Logoer <logoerthiner1@163.com>
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Thiner Logoer <logoerthiner1@163.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7d9589239ec068c944190408b9838774d5ec1f8f)

3 years agox86: Enable CET Indirect Branch Tracking
Andrew Cooper [Mon, 1 Nov 2021 15:17:20 +0000 (15:17 +0000)]
x86: Enable CET Indirect Branch Tracking

With all the pieces now in place, turn CET-IBT on when available.

MSR_S_CET, like SMEP/SMAP, controls Ring1 meaning that ENDBR_EN can't be
enabled for Xen independently of PV32 kernels.  As we already disable PV32 for
CET-SS, extend this to all CET, adjusting the documentation/comments as
appropriate.

Introduce a cet=no-ibt command line option to allow the admin to disable IBT
even when everything else is configured correctly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit cdbe2b0a1aecae946639ee080f14831429b184b6)

3 years agox86/EFI: Disable CET-IBT around Runtime Services calls
Andrew Cooper [Mon, 1 Nov 2021 21:54:26 +0000 (21:54 +0000)]
x86/EFI: Disable CET-IBT around Runtime Services calls

UEFI Runtime services, at the time of writing, aren't CET-IBT compatible.
Work is ongoing to address this. In the meantime, unconditionally disable IBT.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d37a8a067e62e3b6709d224c22f740fdda9d0078)

3 years agox86/setup: Rework MSR_S_CET handling for CET-IBT
Andrew Cooper [Mon, 1 Nov 2021 16:13:29 +0000 (16:13 +0000)]
x86/setup: Rework MSR_S_CET handling for CET-IBT

CET-SS and CET-IBT can be independently controlled, so the configuration of
MSR_S_CET can't be constant any more.

Introduce xen_msr_s_cet_value(), mostly because I don't fancy
writing/maintaining that logic in assembly.  Use this in the 3 paths which
alter MSR_S_CET when both features are potentially active.

To active CET-IBT, we only need CR4.CET and MSR_S_CET.ENDBR_EN.  This is
common with the CET-SS setup, so reorder the operations to set up CR4 and
MSR_S_CET for any nonzero result from xen_msr_s_cet_value(), and set up
MSR_PL0_SSP and SSP if SHSTK_EN was also set.

Adjust the crash path to disable CET-IBT too.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 311434bfc9d10615adbd340d7fb08c05cd14f4c7)

3 years agox86/entry: Make IDT entrypoints CET-IBT compatible
Andrew Cooper [Mon, 1 Nov 2021 17:08:24 +0000 (17:08 +0000)]
x86/entry: Make IDT entrypoints CET-IBT compatible

Each IDT vector needs to land on an endbr64 instruction.  This is especially
important for the #CP handler, which will recurse indefinitely if the endbr64
is missing, eventually escalating to #DF if guard pages are active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e702e36d1d519f4b66086650c1c47d6bac96d4b9)

Also include the continue_pv_domain() change from c/s 954bb07fdb5fad which is
also in entry.S

3 years agox86/entry: Make syscall/sysenter entrypoints CET-IBT compatible
Andrew Cooper [Mon, 1 Nov 2021 09:51:16 +0000 (09:51 +0000)]
x86/entry: Make syscall/sysenter entrypoints CET-IBT compatible

Each of MSR_{L,C}STAR and MSR_SYSENTER_EIP need to land on an endbr64
instruction.  For sysenter, this is easy.

Unfortunately for syscall, the stubs are already 29 byte long with a limit of
32.  endbr64 is 4 bytes.  Luckily, there is a 1 byte instruction which can
move from the stubs into the main handlers.

Move the push %rax out of the stub and into {l,c}star_entry(), allowing room
for the endbr64 instruction when appropriate.  Update the comment describing
the entry state.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 17d77ec62a299f4299883ec79ab10cacafd0b2f5)

3 years agox86/emul: Update emulation stubs to be CET-IBT compatible
Andrew Cooper [Mon, 1 Nov 2021 10:09:59 +0000 (10:09 +0000)]
x86/emul: Update emulation stubs to be CET-IBT compatible

All indirect branches need to land on an endbr64 instruction.

For stub_selftests(), use endbr64 unconditionally for simplicity.  For ioport
and instruction emulation, add endbr64 conditionally.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0d101568d29e8b4bfd33f20031fedec2652aa0cf)

3 years agox86: Introduce helpers/checks for endbr64 instructions
Andrew Cooper [Fri, 26 Nov 2021 15:34:08 +0000 (15:34 +0000)]
x86: Introduce helpers/checks for endbr64 instructions

... to prevent the optimiser creating unsafe code.  See the code comment for
full details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4046ba97446e3974a4411db227263a9f11e0aeb4)

Note: For the backport to 4.14 thru 4.16, we don't care for embedded endbr64
      specifically, but place_endbr64() is a prerequisite for other parts of
      the series.

3 years agox86/traps: Rework write_stub_trampoline() to not hardcode the jmp
Andrew Cooper [Mon, 1 Nov 2021 12:36:33 +0000 (12:36 +0000)]
x86/traps: Rework write_stub_trampoline() to not hardcode the jmp

For CET-IBT, we will need to optionally insert an endbr64 instruction at the
start of the stub.  Don't hardcode the jmp displacement assuming that it
starts at byte 24 of the stub.

Also add extra comments describing what is going on.  The mix of %rax and %rsp
is far from trivial to follow.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 809beac3e7fdfd20000386453c64a1e2a3d93075)

3 years agox86/alternatives: Clear CR4.CET when clearing CR0.WP
Andrew Cooper [Mon, 1 Nov 2021 10:17:59 +0000 (10:17 +0000)]
x86/alternatives: Clear CR4.CET when clearing CR0.WP

This allows us to have CET active much earlier in boot.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 48cdc15a424f9fadad7f9aed00e7dc8ef16a2196)

3 years agox86/setup: Read CR4 earlier in __start_xen()
Andrew Cooper [Mon, 1 Nov 2021 10:19:57 +0000 (10:19 +0000)]
x86/setup: Read CR4 earlier in __start_xen()

This is necessary for read_cr4() to function correctly.  Move the EFER caching
at the same time.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9851bc4939101828d2ad7634b93c0d9ccaef5b7e)

3 years agox86: Introduce support for CET-IBT
Andrew Cooper [Thu, 21 Oct 2021 17:38:50 +0000 (18:38 +0100)]
x86: Introduce support for CET-IBT

CET Indirect Branch Tracking is a hardware feature designed to provide
forward-edge control flow integrity, protecting against jump/call oriented
programming.

IBT requires the placement of endbr{32,64} instructions at the target of every
indirect call/jmp, and every entrypoint.

It is necessary to check for both compiler and assembler support, as the
notrack prefix can be emitted in certain cases.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3667f7f8f7c471e94e58cf35a95f09a0fe5c1290)

Note: For backports to 4.14 thru 4.16, we are deliberately not using
      -mmanual-endbr as done in staging, as an intermediate approach which
      is not too invasive to backport.

x86/cet: Force -fno-jump-tables for CET-IBT

Both GCC and Clang have a (mis)feature where, even with
-fcf-protection=branch, jump tables are created using a notrack jump rather
than using endbr's in each case statement.

This is incompatible with the safety properties we want in Xen, and enforced
by not setting MSR_S_CET.NOTRACK_EN.  The consequence is a fatal #CP[endbr].

-fno-jump-tables is generally active as a side effect of
CONFIG_INDIRECT_THUNK (retpoline), but as of c/s 95d9ab461436 ("x86/Kconfig:
introduce option to select retpoline usage"), we explicitly support turning
retpoline off.

Fixes: 3667f7f8f7c4 ("x86: Introduce support for CET-IBT")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9d4a44380d273de22d5753883cbf5581795ff24d)

3 years agox86/spec-ctrl: Cease using thunk=lfence on AMD
Andrew Cooper [Mon, 7 Mar 2022 16:35:52 +0000 (16:35 +0000)]
x86/spec-ctrl: Cease using thunk=lfence on AMD

AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe.  AMD are recommending using retpoline everywhere.

Retpoline is incompatible with CET.  All CET-capable hardware has efficient
IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
STIBP for consistency sake).

This is a logical change on AMD, but not on Intel as the default calculations
would end up with these settings anyway.  Leave behind a message if IBRS is
found to be missing.

Also update the default heuristics to never select THUNK_LFENCE.  This causes
AMD CPUs to change their default to retpoline.

Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
STIBP now that we set it for consistency sake.

This is part of XSA-398 / CVE-2021-26401.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d)

3 years agoxen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3
Bertrand Marquis [Thu, 17 Feb 2022 14:52:54 +0000 (14:52 +0000)]
xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3

Allow guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported and create a fastpath in the code to handle guests request to
do the workaround.

The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.

As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Julien Grall <julien@xen.org>
(cherry picked from commit c0a56ea0fd92ecb471936b7355ddbecbaea3707c)

3 years agoxen/arm: Add Spectre BHB handling
Rahul Singh [Mon, 14 Feb 2022 18:47:32 +0000 (18:47 +0000)]
xen/arm: Add Spectre BHB handling

This commit is adding Spectre BHB handling to Xen on Arm.
The commit is introducing new alternative code to be executed during
exception entry:
- SMCC workaround 3 call
- loop workaround (with 8, 24 or 32 iterations)
- use of new clearbhb instruction

Cpuerrata is modified by this patch to apply the required workaround for
CPU affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is
enabled.

To do this the system previously used to apply smcc workaround 1 is
reused and new alternative code to be copied in the exception handler is
introduced.

To define the type of workaround required by a processor, 4 new cpu
capabilities are introduced (for each number of loop and for smcc
workaround 3).

When a processor is affected, enable_spectre_bhb_workaround is called
and if the processor does not have CSV2 set to 3 or ECBHB feature (which
would mean that the processor is doing what is required in hardware),
the proper code is enabled at exception entry.

In the case where workaround 3 is not supported by the firmware, we
enable workaround 1 when possible as it will also mitigate Spectre BHB
on systems without CSV2.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Signed-off-by: Rahul Singh <rahul.singh@arm.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 62c91eb66a2904eefb1d1d9642e3697a1e3c3a3c)

3 years agoxen/arm: Add ECBHB and CLEARBHB ID fields
Bertrand Marquis [Wed, 23 Feb 2022 09:42:18 +0000 (09:42 +0000)]
xen/arm: Add ECBHB and CLEARBHB ID fields

Introduce ID coprocessor register ID_AA64ISAR2_EL1.
Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and
CLEARBHB in isar2 ID coprocessor registers.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 4b68d12d98b8790d8002fcc2c25a9d713374a4d7)

3 years agoxen/arm: move errata CSV2 check earlier
Bertrand Marquis [Tue, 15 Feb 2022 10:39:47 +0000 (10:39 +0000)]
xen/arm: move errata CSV2 check earlier

CSV2 availability check is done after printing to the user that
workaround 1 will be used. Move the check before to prevent saying to the
user that workaround 1 is used when it is not because it is not needed.
This will also allow to reuse install_bp_hardening_vec function for
other use cases.

Code previously returning "true", now returns "0" to conform to
enable_smccc_arch_workaround_1 returning an int and surrounding code
doing a "return 0" if workaround is not needed.

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Julien Grall <julien@xen.org>
(cherry picked from commit 599616d70eb886b9ad0ef9d6b51693ce790504ba)

3 years agoxen/arm: Introduce new Arm processors
Bertrand Marquis [Tue, 15 Feb 2022 10:37:51 +0000 (10:37 +0000)]
xen/arm: Introduce new Arm processors

Add some new processor identifiers in processor.h and sync Xen
definitions with status of Linux 5.17 (declared in
arch/arm64/include/asm/cputype.h).

This is part of XSA-398 / CVE-2022-23960.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 35d1b85a6b43483f6bd007d48757434e54743e98)

3 years agox86/spec-ctrl: Support Intel PSFD for guests
Andrew Cooper [Tue, 19 Oct 2021 20:22:27 +0000 (21:22 +0100)]
x86/spec-ctrl: Support Intel PSFD for guests

The Feb 2022 microcode from Intel retrofits AMD's MSR_SPEC_CTRL.PSFD interface
to Sunny Cove (IceLake) and later cores.

Update the MSR_SPEC_CTRL emulation, and expose it to guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 52ce1c97844db213de01c5300eaaa8cf101a285f)

3 years agox86/cpuid: Infrastructure for cpuid word 7:2.edx
Andrew Cooper [Thu, 27 Jan 2022 21:07:40 +0000 (21:07 +0000)]
x86/cpuid: Infrastructure for cpuid word 7:2.edx

While in principle it would be nice to keep leaf 7 in order, that would
involve having an extra 5 words of zeros in a featureset.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f3709b15fc86c6c6a0959cec8d97f21d0e9f9629)

3 years agox86/tsx: Cope with TSX deprecation on WHL-R/CFL-R
Andrew Cooper [Wed, 16 Sep 2020 15:15:52 +0000 (16:15 +0100)]
x86/tsx: Cope with TSX deprecation on WHL-R/CFL-R

The February 2022 microcode is formally de-featuring TSX on the TAA-impacted
client CPUs.  The backup TAA mitigation (VERW regaining its flushing side
effect) is being dropped, meaning that `smt=0 spec-ctrl=md-clear` no longer
protects against TAA on these parts.

The new functionality enumerates itself via the RTM_ALWAYS_ABORT CPUID
bit (the same as June 2021), but has its control in MSR_MCU_OPT_CTRL as
opposed to MSR_TSX_FORCE_ABORT.

TSX now defaults to being disabled on ucode load.  Furthermore, if SGX is
enabled in the BIOS, TSX is locked and cannot be re-enabled.  In this case,
override opt_tsx to 0, so the RTM/HLE CPUID bits get hidden by default.

While updating the command line documentation, take the opportunity to add a
paragraph explaining what TSX being disabled actually means, and how migration
compatibility works.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ad9f7c3b2e0df38ad6d54f4769d4dccf765fbcee)

3 years agox86/tsx: Move has_rtm_always_abort to an outer scope
Andrew Cooper [Wed, 23 Jun 2021 20:53:58 +0000 (21:53 +0100)]
x86/tsx: Move has_rtm_always_abort to an outer scope

We are about to introduce a second path which needs to conditionally force the
presence of RTM_ALWAYS_ABORT.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4116139131e93b4f075e5442e3c1b424280f6f1f)

3 years agox86/spec-ctrl: Clean up MSR_MCU_OPT_CTRL handling
Andrew Cooper [Wed, 19 May 2021 18:40:28 +0000 (19:40 +0100)]
x86/spec-ctrl: Clean up MSR_MCU_OPT_CTRL handling

Introduce cpu_has_srbds_ctrl as more users are going to appear shortly.

MSR_MCU_OPT_CTRL is gaining extra functionality, meaning that the current
default_xen_mcu_opt_ctrl is no longer a good fit.

Introduce two new helpers, update_mcu_opt_ctrl() which does a full RMW cycle
on the MSR, and set_in_mcu_opt_ctrl() which lets callers configure specific
bits at a time without clobbering each others settings.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 39a40f3835efcc25c1b05a25c321a01d7e11cbd7)

3 years agox86/cpuid: Infrastructure for leaf 7:1.ebx
Jan Beulich [Thu, 27 Jan 2022 12:54:42 +0000 (12:54 +0000)]
x86/cpuid: Infrastructure for leaf 7:1.ebx

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e1828e3032ebfe036023cd733adfd2d4ec856688)

3 years agox86/cpuid: Disentangle logic for new feature leaves
Andrew Cooper [Thu, 27 Jan 2022 13:56:04 +0000 (13:56 +0000)]
x86/cpuid: Disentangle logic for new feature leaves

Adding a new feature leaf is a reasonable amount of boilerplate and for the
patch to build, at least one feature from the new leaf needs defining.  This
typically causes two non-trivial changes to be merged together.

First, have gen-cpuid.py write out some extra placeholder defines:

  #define CPUID_BITFIELD_11 bool :1, :1, lfence_dispatch:1, ...
  #define CPUID_BITFIELD_12 uint32_t :32 /* placeholder */
  #define CPUID_BITFIELD_13 uint32_t :32 /* placeholder */
  #define CPUID_BITFIELD_14 uint32_t :32 /* placeholder */
  #define CPUID_BITFIELD_15 uint32_t :32 /* placeholder */

This allows DECL_BITFIELD() to be added to struct cpuid_policy without
requiring a XEN_CPUFEATURE() declared for the leaf.  The choice of 4 is
arbitrary, and allows us to add more than one leaf at a time if necessary.

Second, rework generic_identify() to not use specific feature names.

The choice of deriving the index from a feature was to avoid mismatches, but
its correctness depends on bugs like c/s 249e0f1d8f20 ("x86/cpuid: Fix
TSXLDTRK definition") not happening.

Switch to using FEATURESET_* just like the policy/featureset helpers.  This
breaks the cognitive complexity of needing to know which leaf a specifically
named feature should reside in, and is shorter to write.  It is also far
easier to identify as correct at a glance, given the correlation with the
CPUID leaf being read.

In addition, tidy up some other bits of generic_identify()
 * Drop leading zeros from leaf numbers.
 * Don't use a locked update for X86_FEATURE_APERFMPERF.
 * Rework extended_cpuid_level calculation to avoid setting it twice.
 * Use "leaf >= $N" consistently so $N matches with the CPUID input.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e3662437eb43cc8002bd39be077ef68b131649c5)

3 years agox86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default

With all other pieces in place, MSR_SPEC_CTRL is fully working for HVM guests.

Update the CPUID derivation logic (both PV and HVM to avoid losing subtle
changes), drop the MSR intercept, and explicitly enable the CPUID bits for HVM
guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a7e7c7260cde78a148810db5320cbf39686c3e09)

3 years agox86/msr: AMD MSR_SPEC_CTRL infrastructure
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/msr: AMD MSR_SPEC_CTRL infrastructure

Fill in VMCB accessors for spec_ctrl in svm_{get,set}_reg(), and CPUID checks
for all supported bits in guest_{rd,wr}msr().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22b9add22b4a9af37305c8441fec12cb26bd142b)

3 years agox86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL

Hardware maintains both host and guest versions of MSR_SPEC_CTRL, but guests
run with the logical OR of both values.  Therefore, in principle we want to
clear Xen's value before entering the guest.  However, for migration
compatibility (future work), and for performance reasons with SEV-SNP guests,
we want the ability to use a nonzero value behind the guest's back.  Use
vcpu_msrs to hold this value, with the guest value in the VMCB.

On the VMEntry path, adjusting MSR_SPEC_CTRL must be done after CLGI so as to
be atomic with respect to NMIs/etc.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 614cec7d79d76786f5638a6e4da0576b57732ca1)

3 years agox86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD

Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in
the system.  This ceases to be true when using the common logic.

Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and
introduce an AMD specific block to control alternatives.  Also update the
boot/resume paths to configure default_xen_spec_ctrl.

svm.h needs an adjustment to remove a dependency on include order.

For now, only active alternatives for HVM - PV will require more work.  No
functional change, as no alternatives are defined yet for HVM yet.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)

3 years agox86/spec-ctrl: Record the last write to MSR_SPEC_CTRL
Andrew Cooper [Fri, 28 Jan 2022 11:57:19 +0000 (11:57 +0000)]
x86/spec-ctrl: Record the last write to MSR_SPEC_CTRL

In some cases, writes to MSR_SPEC_CTRL do not have interesting side effects,
and we should implement lazy context switching like we do with other MSRs.

In the short term, this will be used by the SVM infrastructure, but I expect
to extend it to other contexts in due course.

Introduce cpu_info.last_spec_ctrl for the purpose, and cache writes made from
the boot/resume paths.  The value can't live in regular per-cpu data when it
is eventually used for PV guests when XPTI might be active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 00f2992b6c7a9d4090443c1a85bf83224a87eeb9)

3 years agox86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3
Andrew Cooper [Fri, 28 Jan 2022 12:03:42 +0000 (12:03 +0000)]
x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3

'idle' here refers to hlt/mwait.  The S3 path isn't an idle path - it is a
platform reset.

We need to load default_xen_spec_ctrl unilaterally on the way back up.
Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next
return-to-guest, but that's fragile behaviour.

Conversely, there is no need to clear IBRS and flush the store buffers on the
way down; we're microseconds away from cutting power.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)

3 years agox86/spec-ctrl: Introduce new has_spec_ctrl boolean
Andrew Cooper [Tue, 25 Jan 2022 17:14:48 +0000 (17:14 +0000)]
x86/spec-ctrl: Introduce new has_spec_ctrl boolean

Most MSR_SPEC_CTRL setup will be common between Intel and AMD.  Instead of
opencoding an OR of two features everywhere, introduce has_spec_ctrl instead.

Reword the comment above the Intel specific alternatives block to highlight
that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS
out because it will want to be common.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)

3 years agox86/spec-ctrl: Drop use_spec_ctrl boolean
Andrew Cooper [Tue, 25 Jan 2022 16:09:59 +0000 (16:09 +0000)]
x86/spec-ctrl: Drop use_spec_ctrl boolean

Several bugfixes have reduced the utility of this variable from it's original
purpose, and now all it does is aid in the setup of SCF_ist_wrmsr.

Simplify the logic by drop the variable, and doubling up the setting of
SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL
support easier to follow.  Leave a comment explaining why SCF_ist_wrmsr is
still necessary for the VMExit case.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)

3 years agox86/cpuid: Advertise SSB_NO to guests by default
Andrew Cooper [Thu, 27 Jan 2022 21:28:48 +0000 (21:28 +0000)]
x86/cpuid: Advertise SSB_NO to guests by default

This is a statement of hardware behaviour, and not related to controls for the
guest kernel to use.  Pass it straight through from hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 15b7611efd497c4b65f350483857082cb70fc348)

3 years agox86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL
Andrew Cooper [Wed, 19 Jan 2022 19:55:02 +0000 (19:55 +0000)]
x86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL

This bug existed in early in 2018 between MSR_SPEC_CTRL arriving in microcode,
and SSBD arriving a few months later.  It went unnoticed presumably because
everyone was busy rebooting everything.

The same bug will reappear when adding PSFD support.

Clamp the guest MSR_SPEC_CTRL value to that permitted by CPUID on migrate.
The guest is already playing with reserved bits at this point, and clamping
the value will prevent a migration to a less capable host from failing.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 969a57f73f6b011b2ebf4c0ab1715efc65837335)

3 years agox86/vmx: Drop spec_ctrl load in VMEntry path
Andrew Cooper [Tue, 1 Feb 2022 13:34:49 +0000 (13:34 +0000)]
x86/vmx: Drop spec_ctrl load in VMEntry path

This is not needed now that the VMEntry path is not responsible for loading
the guest's MSR_SPEC_CTRL value.

Fixes: 81f0eaadf84d ("x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9ce3ef20b4f085a7dc8ee41b0fec6fdeced3773e)

3 years agox86/cpuid: support LFENCE always serialising CPUID bit
Roger Pau Monné [Thu, 15 Apr 2021 14:47:31 +0000 (16:47 +0200)]
x86/cpuid: support LFENCE always serialising CPUID bit

AMD Milan (Zen3) CPUs have an LFENCE Always Serialising CPUID bit in
leaf 80000021.eax. Previous AMD versions used to have a user settable
bit in DE_CFG MSR to select whether LFENCE was dispatch serialising,
which Xen always attempts to set. The forcefully always on setting is
due to the addition of SEV-SNP so that a VMM cannot break the
confidentiality of a guest.

In order to support this new CPUID bit move the LFENCE_DISPATCH
synthetic CPUID bit to map the hardware bit (leaving a hole in the
synthetic range) and either rely on the bit already being set by the
native CPUID output, or attempt to fake it in Xen by modifying the
DE_CFG MSR. This requires adding one more entry to the featureset to
support leaf 80000021.eax.

The bit is always exposed to guests by default even if the underlying
hardware doesn't support leaf 80000021. Note that Xen doesn't allow
guests to change the DE_CFG value, so once set by Xen LFENCE will always
be serialising.

Note that the access to DE_CFG by guests is left as-is: reads will
unconditionally return LFENCE_SERIALISE bit set, while writes are
silently dropped.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Always expose to guests by default]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9b4fe26364950258c9f57f0f68eccb778eeadbb)

x86/cpuid: do not expand max leaves on restore

When restoring limit the maximum leaves to the ones supported by Xen
4.12 in order to not expand the maximum leaves a guests sees. Note
this is unlikely to cause real issues.

Guests restored from Xen versions 4.13 or greater will contain CPUID
data on the stream that will override the values set by
xc_cpuid_apply_policy.

Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 111c8c33a8a18588f3da3c5dbb7f5c63ddb98ce5)

tools/libxenguest: Fix max_extd_leaf calculation for legacy restore

0x1c is lower than any value which will actually be observed in
p->extd.max_leaf, but higher than the logical 9 leaves worth of extended data
on Intel systems, causing x86_cpuid_copy_to_buffer() to fail with -ENOBUFS.

Correct the calculation.

The problem was first noticed in c/s 34990446ca9 "libxl: don't ignore the
return value from xc_cpuid_apply_policy" but introduced earlier.

Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore")
Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fa174cbf54cc625a023b8e7170e359dd150c072)

tools/guest: Fix comment regarding CPUID compatibility

It was Xen 4.14 where CPUID data was added to the migration stream, and 4.13
that we need to worry about with regards to compatibility.  Xen 4.12 isn't
relevant.

Expand and correct the commentary.

Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 820cc393434097f3b7976acdccbf1d96071d6d23)

3 years agox86/amd: split LFENCE dispatch serializing setup logic into helper
Roger Pau Monné [Thu, 15 Apr 2021 11:45:09 +0000 (13:45 +0200)]
x86/amd: split LFENCE dispatch serializing setup logic into helper

Split the logic to attempt to setup LFENCE to be dispatch serializing
on AMD into a helper, so it can be shared with Hygon.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3e9460ec93341fa6a80ecf99832aa5d9975339c9)

3 years agoupdate Xen version to 4.14.4 RELEASE-4.14.4
Jan Beulich [Mon, 31 Jan 2022 09:44:39 +0000 (10:44 +0100)]
update Xen version to 4.14.4

3 years agox86/pvh: fix population of the low 1MB for dom0
Roger Pau Monné [Wed, 26 Jan 2022 11:52:09 +0000 (12:52 +0100)]
x86/pvh: fix population of the low 1MB for dom0

RMRRs are setup ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.

Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.

Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d5fc9120d556ec3c4b1acf0ab5660a6d3f7ebeb
master date: 2022-01-25 10:52:24 +0000

3 years agox86: Fix build with the get/set_reg() infrastructure
Andrew Cooper [Wed, 26 Jan 2022 11:51:31 +0000 (12:51 +0100)]
x86: Fix build with the get/set_reg() infrastructure

I clearly messed up concluding that the stubs were safe to drop.

The is_{pv,hvm}_domain() predicates are not symmetrical with both CONFIG_PV
and CONFIG_HVM.  As a result logic of the form `if ( pv/hvm ) ... else ...`
will always have one side which can't be DCE'd.

While technically only the hvm stubs are needed, due to the use of the
is_pv_domain() predicate in guest_{rd,wr}msr(), sort out the pv stubs too to
avoid leaving a bear trap for future users.

Fixes: 88d3ff7ab15d ("x86/guest: Introduce {get,set}_reg() infrastructure")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 13caa585791234fe3e3719c8376f7ea731012451
master date: 2022-01-21 12:42:11 +0000

3 years agox86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling
Andrew Cooper [Tue, 25 Jan 2022 12:53:14 +0000 (13:53 +0100)]
x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling

The logic was based on a mistaken understanding of how NMI blocking on vmexit
works.  NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
and the guest's value will be clobbered before it is saved.

Switch to using MSR load/save lists.  This causes the guest value to be saved
atomically with respect to NMIs/MCEs/etc.

First, update vmx_cpuid_policy_changed() to configure the load/save lists at
the same time as configuring the intercepts.  This function is always used in
remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
function, rather than having multiple remote acquisitions of the same VMCS.

Both of vmx_{add,del}_guest_msr() can fail.  The -ESRCH delete case is fine,
but all others are fatal to the running of the VM, so handle them using
domain_crash() - this path is only used during domain construction anyway.

Second, update vmx_{get,set}_reg() to use the MSR load/save lists rather than
vcpu_msrs, and update the vcpu_msrs comment to describe the new state
location.

Finally, adjust the entry/exit asm.

Because the guest value is saved and loaded atomically, we do not need to
manually load the guest value, nor do we need to enable SCF_use_shadow.  This
lets us remove the use of DO_SPEC_CTRL_EXIT_TO_GUEST.  Additionally,
SPEC_CTRL_ENTRY_FROM_PV gets removed too, because on an early entry failure,
we're no longer in the guest MSR_SPEC_CTRL context needing to switch back to
Xen's context.

The only action remaining is to load Xen's MSR_SPEC_CTRL value on vmexit.  We
could in principle use the host msr list, but is expected to complicated
future work.  Delete DO_SPEC_CTRL_ENTRY_FROM_HVM entirely, and use a shorter
code sequence to simply reload Xen's setting from the top-of-stack block.

Adjust the comment at the top of spec_ctrl_asm.h in light of this bugfix.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f0eaadf84d273a6ff8df3660b874a02d0e7677
master date: 2022-01-20 16:32:11 +0000

3 years agox86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM
Andrew Cooper [Tue, 25 Jan 2022 12:52:56 +0000 (13:52 +0100)]
x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM

These were written before Spectre/Meltdown went public, and there was large
uncertainty in how the protections would evolve.  As it turns out, they're
very specific to Intel hardware, and not very suitable for AMD.

Drop the macros, opencoding the relevant subset of functionality, and leaving
grep-fodder to locate the logic.  No change at all for VT-x.

For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB,
although we will soon be adding (different) logic to handle MSR_SPEC_CTRL.

This has a marginal improvement of removing an unconditional pile of long-nops
from the vmentry/exit path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 95b13fa43e0753b7514bef13abe28253e8614f62
master date: 2022-01-20 16:32:11 +0000

3 years agox86/msr: Split MSR_SPEC_CTRL handling
Andrew Cooper [Tue, 25 Jan 2022 12:52:30 +0000 (13:52 +0100)]
x86/msr: Split MSR_SPEC_CTRL handling

In order to fix a VT-x bug, and support MSR_SPEC_CTRL on AMD, move
MSR_SPEC_CTRL handling into the new {pv,hvm}_{get,set}_reg() infrastructure.

Duplicate the msrs->spec_ctrl.raw accesses in the PV and VT-x paths for now.
The SVM path is currently unreachable because of the CPUID policy.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6536688439dbca1d08fd6db5be29c39e3917fb2f
master date: 2022-01-20 16:32:11 +0000

3 years agox86/guest: Introduce {get,set}_reg() infrastructure
Andrew Cooper [Tue, 25 Jan 2022 12:52:07 +0000 (13:52 +0100)]
x86/guest: Introduce {get,set}_reg() infrastructure

Various registers have per-guest-type or per-vendor locations or access
requirements.  To support their use from common code, provide accessors which
allow for per-guest-type behaviour.

For now, just infrastructure handling default cases and expectations.
Subsequent patches will start handling registers using this infrastructure.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 88d3ff7ab15da277a85b39735797293fb541c718
master date: 2022-01-20 16:32:11 +0000

3 years agox86/time: improve TSC / CPU freq calibration accuracy
Jan Beulich [Tue, 25 Jan 2022 12:51:49 +0000 (13:51 +0100)]
x86/time: improve TSC / CPU freq calibration accuracy

While the problem report was for extreme errors, even smaller ones would
better be avoided: The calculated period to run calibration loops over
can (and usually will) be shorter than the actual time elapsed between
first and last platform timer and TSC reads. Adjust values returned from
the init functions accordingly.

On a Skylake system I've tested this on accuracy (using HPET) went from
detecting in some cases more than 220kHz too high a value to about
±2kHz. On other systems (or on this system, but with PMTMR) the original
error range was much smaller, with less (in some cases only very little)
improvement.

Reported-by: James Dingwall <james-xen@dingwall.me.uk>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a5c9a80af34eefcd6e31d0ed2b083f452cd9076d
master date: 2022-01-13 14:31:52 +0100

3 years agox86/time: use relative counts in calibration loops
Jan Beulich [Tue, 25 Jan 2022 12:51:36 +0000 (13:51 +0100)]
x86/time: use relative counts in calibration loops

Looping until reaching/exceeding a certain value is error prone: If the
target value is close enough to the wrapping point, the loop may not
terminate at all. Switch to using delta values, which then allows to
fold the two loops each into just one.

Fixes: 93340297802b ("x86/time: calibrate TSC against platform timer")
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 467191641d2a2fd2e43b3ae7b80399f89d339980
master date: 2022-01-13 14:30:18 +0100

3 years agopassthrough/x86: stop pirq iteration immediately in case of error
Julien Grall [Tue, 25 Jan 2022 12:49:40 +0000 (13:49 +0100)]
passthrough/x86: stop pirq iteration immediately in case of error

pt_pirq_iterate() will iterate in batch over all the PIRQs. The outer
loop will bail out if 'rc' is non-zero but the inner loop will continue.

This means 'rc' will get clobbered and we may miss any errors (such as
-ERESTART in the case of the callback pci_clean_dpci_irq()).

This is CVE-2022-23035 / XSA-395.

Fixes: c24536b636f2 ("replace d->nr_pirqs sized arrays with radix tree")
Fixes: f6dd295381f4 ("dpci: replace tasklet with softirq")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 9480a1a519cf016623f657dc544cb372a82b5708
master date: 2022-01-25 13:27:02 +0100

3 years agoxen/grant-table: Only decrement the refcounter when grant is fully unmapped
Julien Grall [Tue, 25 Jan 2022 12:49:26 +0000 (13:49 +0100)]
xen/grant-table: Only decrement the refcounter when grant is fully unmapped

The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a
simple revert of the changes done by the grant mapping hypercall
(GNTTABOP_map_grant_ref).

Instead, it is possible to partially (or even not) clear some flags.
This will leave the grant is mapped until a future call where all
the flags would be cleared.

XSA-380 introduced a refcounting that is meant to only be dropped
when the grant is fully unmapped. Unfortunately, unmap_common() will
decrement the refcount for every successful call.

A consequence is a domain would be able to underflow the refcount
and trigger a BUG().

Looking at the code, it is not clear to me why a domain would
want to partially clear some flags in the grant-table. But as
this is part of the ABI, it is better to not change the behavior
for now.

Fix it by checking if the maptrack handle has been released before
decrementing the refcounting.

This is CVE-2022-23034 / XSA-394.

Fixes: 9781b51efde2 ("gnttab: replace mapkind()")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 975a8fb45ca186b3476e5656c6ad5dad1122dbfd
master date: 2022-01-25 13:25:49 +0100

3 years agoxen/arm: p2m: Always clear the P2M entry when the mapping is removed
Julien Grall [Tue, 25 Jan 2022 12:49:06 +0000 (13:49 +0100)]
xen/arm: p2m: Always clear the P2M entry when the mapping is removed

Commit 2148a125b73b ("xen/arm: Track page accessed between batch of
Set/Way operations") allowed an entry to be invalid from the CPU PoV
(lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful
to track which page is accessed and only perform an action on them
(e.g. clean & invalidate the cache after a set/way instruction).

Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when
lpae_is_valid() returns true. This means the entry will not be zeroed
if the entry was valid from Xen PoV but invalid from the CPU PoV for
tracking purpose.

As a consequence, this will allow a domain to continue to access the
page after it was removed.

Resolve the issue by always zeroing the entry if it the LPAE bit is
set or the entry is about to be removed.

This is CVE-2022-23033 / XSA-393.

Reported-by: Dmytro Firsov <Dmytro_Firsov@epam.com>
Fixes: 2148a125b73b ("xen/arm: Track page accessed between batch of Set/Way operations")
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Julien Grall <jgrall@amazon.com>
master commit: a428b913a002eb2b7425b48029c20a52eeee1b5a
master date: 2022-01-25 13:25:01 +0100

3 years agox86/spec-ctrl: Fix default calculation of opt_srb_lock
Andrew Cooper [Fri, 7 Jan 2022 07:54:38 +0000 (08:54 +0100)]
x86/spec-ctrl: Fix default calculation of opt_srb_lock

Since this logic was introduced, opt_tsx has become more complicated and
shouldn't be compared to 0 directly.  While there are no buggy logic paths,
the correct expression is !(opt_tsx & 1) but the rtm_disabled boolean is
easier and clearer to use.

Fixes: 8fe24090d940 ("x86/cpuid: Rework HLE and RTM handling")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 31f3bc97f4508687215e459a5e35676eecf1772b
master date: 2022-01-05 09:44:26 +0000

3 years agorevert "hvmloader: PA range 0xfc000000-0xffffffff should be UC"
Jan Beulich [Fri, 7 Jan 2022 07:54:21 +0000 (08:54 +0100)]
revert "hvmloader: PA range 0xfc000000-0xffffffff should be UC"

This reverts commit c22bd567ce22f6ad9bd93318ad0d7fd1c2eadb0d.

While its description is correct from an abstract or real hardware pov,
the range is special inside HVM guests. The range being UC in particular
gets in the way of OVMF, which places itself at [FFE00000,FFFFFFFF].
While this is benign to epte_get_entry_emt() as long as the IOMMU isn't
enabled for a guest, it becomes a very noticable problem otherwise: It
takes about half a minute for OVMF to decompress itself into its
designated address range.

And even beyond OVMF there's no reason to have e.g. the ACPI memory
range marked UC.

Fixes: c22bd567ce22 ("hvmloader: PA range 0xfc000000-0xffffffff should be UC")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ea187c0b7a73c26258c0e91e4f3656989804555f
master date: 2021-12-17 08:56:15 +0100

3 years agox86/cpuid: Fix TSXLDTRK definition
Andrew Cooper [Fri, 7 Jan 2022 07:53:12 +0000 (08:53 +0100)]
x86/cpuid: Fix TSXLDTRK definition

TSXLDTRK lives in CPUID leaf 7[0].edx, not 7[0].ecx.

Bit 16 in ecx is LA57.

Fixes: a6d1b558471f ("x86emul: support X{SUS,RES}LDTRK")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 249e0f1d8f203188ccdcced5a05c2149739e1566
master date: 2021-12-14 12:30:48 +0000

3 years agox86/HVM: permit CLFLUSH{,OPT} on execute-only code segments
Jan Beulich [Fri, 7 Jan 2022 07:52:48 +0000 (08:52 +0100)]
x86/HVM: permit CLFLUSH{,OPT} on execute-only code segments

Both SDM and PM explicitly permit this.

Fixes: 52dba7bd0b36 ("x86emul: generalize wbinvd() hook")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul@xen.org>
master commit: df3e1a5efe700a9f59eced801cac73f9fd02a0e2
master date: 2021-12-10 14:03:56 +0100

3 years agox86: avoid wrong use of all-but-self IPI shorthand
Jan Beulich [Fri, 7 Jan 2022 07:52:18 +0000 (08:52 +0100)]
x86: avoid wrong use of all-but-self IPI shorthand

With "nosmp" I did observe a flood of "APIC error on CPU0: 04(04), Send
accept error" log messages on an AMD system. And rightly so - nothing
excludes the use of the shorthand in send_IPI_mask() in this case. Set
"unaccounted_cpus" to "true" also when command line restrictions are the
cause.

Note that PV-shim mode is unaffected by this change, first and foremost
because "nosmp" and "maxcpus=" are ignored in this case.

Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7621880de0bb40bae6436a5b106babc0e4718f4d
master date: 2021-12-10 10:26:52 +0100

3 years agox86/HVM: fail virt-to-linear conversion for insn fetches from non-code segments
Jan Beulich [Fri, 7 Jan 2022 07:51:51 +0000 (08:51 +0100)]
x86/HVM: fail virt-to-linear conversion for insn fetches from non-code segments

Just like (in protected mode) reads may not go to exec-only segments and
writes may not go to non-writable ones, insn fetches may not access data
segments.

Fixes: 623e83716791 ("hvm: Support hardware task switching")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 311297f4216a4387bdae6df6cfbb1f5edb06618a
master date: 2021-12-06 14:15:05 +0100

3 years agoVT-d: don't leak domid mapping on error path
Jan Beulich [Fri, 7 Jan 2022 07:51:27 +0000 (08:51 +0100)]
VT-d: don't leak domid mapping on error path

While domain_context_mapping() invokes domain_context_unmap() in a sub-
case of handling DEV_TYPE_PCI when encountering an error, thus avoiding
a leak, individual calls to domain_context_mapping_one() aren't
similarly covered. Such a leak might persist until domain destruction.
Leverage that these cases can be recognized by pdev being non-NULL.

Fixes: dec403cc668f ("VT-d: fix iommu_domid for PCI/PCIx devices assignment")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: e6252a51faf42c892eb5fc71f8a2617580832196
master date: 2021-11-24 11:07:11 +0100

3 years agoVT-d: split domid map cleanup check into a function
Jan Beulich [Fri, 7 Jan 2022 07:51:05 +0000 (08:51 +0100)]
VT-d: split domid map cleanup check into a function

This logic will want invoking from elsewhere.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 9fdc10abe9457e4c9879a266f82372cb08e88ffb
master date: 2021-11-24 11:06:20 +0100

3 years agoefi: fix alignment of function parameters in compat mode
Roger Pau Monné [Fri, 7 Jan 2022 07:50:22 +0000 (08:50 +0100)]
efi: fix alignment of function parameters in compat mode

Currently the max_store_size, remain_store_size and max_size in
compat_pf_efi_runtime_call are 4 byte aligned, which makes clang
13.0.0 complain with:

In file included from compat.c:30:
./runtime.c:646:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 2 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
            &op->u.query_variable_info.max_store_size,
            ^
./runtime.c:647:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 3 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
            &op->u.query_variable_info.remain_store_size,
            ^
./runtime.c:648:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 4 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
            &op->u.query_variable_info.max_size);
            ^
Fix this by bouncing the variables on the stack in order for them to
be 8 byte aligned.

Note this could be done in a more selective manner to only apply to
compat code calls, but given the overhead of making an EFI call doing
an extra copy of 3 variables doesn't seem to warrant the special
casing.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ian Jackson <iwj@xenproject.org>
Signed-off-by: Ian Jackson <iwj@xenproject.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be12fcca8b784e456df3adedbffe657d753c5ff9
master date: 2021-11-19 17:01:24 +0000

3 years agoxen/arm: Do not invalidate the P2M when the PT is shared with the IOMMU
Stefano Stabellini [Wed, 4 Aug 2021 20:57:07 +0000 (13:57 -0700)]
xen/arm: Do not invalidate the P2M when the PT is shared with the IOMMU

Set/Way flushes never work correctly in a virtualized environment.

Our current implementation is based on clearing the valid bit in the p2m
pagetable to track guest memory accesses. This technique doesn't work
when the IOMMU is enabled for the domain and the pagetable is shared
between IOMMU and MMU because it triggers IOMMU faults.

Specifically, p2m_invalidate_root causes IOMMU faults if
iommu_use_hap_pt returns true for the domain.

Add a check in p2m_set_way_flush: if a set/way instruction is used
and iommu_use_hap_pt returns true, rather than failing with obscure
IOMMU faults, inject an undef exception straight away into the guest,
and print a verbose error message to explain the problem.

Also add an ASSERT in p2m_invalidate_root to make sure we don't
inadvertently stumble across this problem again in the future.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 2b45ff60301a988badec526846e77b538383ae63)

3 years agoMAINTAINERS: Resign from tools stable branch maintainership
Ian Jackson [Mon, 6 Dec 2021 14:40:24 +0000 (14:40 +0000)]
MAINTAINERS: Resign from tools stable branch maintainership

Signed-off-by: Ian Jackson <iwj@xenproject.org>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit c623a84c2a4fda1cd25f5347a6298706218eb5fb)

3 years agox86/P2M: deal with partial success of p2m_set_entry()
Jan Beulich [Tue, 23 Nov 2021 12:30:09 +0000 (13:30 +0100)]
x86/P2M: deal with partial success of p2m_set_entry()

M2P and PoD stats need to remain in sync with P2M; if an update succeeds
only partially, respective adjustments need to be made. If updates get
made before the call, they may also need undoing upon complete failure
(i.e. including the single-page case).

Log-dirty state would better also be kept in sync.

Note that the change to set_typed_p2m_entry() may not be strictly
necessary (due to the order restriction enforced near the top of the
function), but is being kept here to be on the safe side.

This is CVE-2021-28705 and CVE-2021-28709 / XSA-389.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 74a11c43fd7e074b1f77631b446dd2115eacb9e8
master date: 2021-11-22 12:27:30 +0000

3 years agox86/PoD: handle intermediate page orders in p2m_pod_cache_add()
Jan Beulich [Tue, 23 Nov 2021 12:29:54 +0000 (13:29 +0100)]
x86/PoD: handle intermediate page orders in p2m_pod_cache_add()

p2m_pod_decrease_reservation() may pass pages to the function which
aren't 4k, 2M, or 1G. Handle all intermediate orders as well, to avoid
hitting the BUG() at the switch() statement's "default" case.

This is CVE-2021-28708 / part of XSA-388.

Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ec13f68e0b026863d23e7f44f252d06478bc809
master date: 2021-11-22 12:27:30 +0000

3 years agox86/PoD: deal with misaligned GFNs
Jan Beulich [Tue, 23 Nov 2021 12:29:41 +0000 (13:29 +0100)]
x86/PoD: deal with misaligned GFNs

Users of XENMEM_decrease_reservation and XENMEM_populate_physmap aren't
required to pass in order-aligned GFN values. (While I consider this
bogus, I don't think we can fix this there, as that might break existing
code, e.g Linux'es swiotlb, which - while affecting PV only - until
recently had been enforcing only page alignment on the original
allocation.) Only non-PoD code paths (guest_physmap_{add,remove}_page(),
p2m_set_entry()) look to be dealing with this properly (in part by being
implemented inefficiently, handling every 4k page separately).

Introduce wrappers taking care of splitting the incoming request into
aligned chunks, without putting much effort in trying to determine the
largest possible chunk at every iteration.

Also "handle" p2m_set_entry() failure for non-order-0 requests by
crashing the domain in one more place. Alongside putting a log message
there, also add one to the other similar path.

Note regarding locking: This is left in the actual worker functions on
the assumption that callers aren't guaranteed atomicity wrt acting on
multiple pages at a time. For mis-aligned GFNs gfn_lock() wouldn't have
locked the correct GFN range anyway, if it didn't simply resolve to
p2m_lock(), and for well-behaved callers there continues to be only a
single iteration, i.e. behavior is unchanged for them. (FTAOD pulling
out just pod_lock() into p2m_pod_decrease_reservation() would result in
a lock order violation.)

This is CVE-2021-28704 and CVE-2021-28707 / part of XSA-388.

Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 182c737b9ba540ebceb1433f3940fbed6eac4ea9
master date: 2021-11-22 12:27:30 +0000

3 years agoxen/page_alloc: Harden assign_pages()
Julien Grall [Tue, 23 Nov 2021 12:29:09 +0000 (13:29 +0100)]
xen/page_alloc: Harden assign_pages()

domain_tot_pages() and d->max_pages are 32-bit values. While the order
should always be quite small, it would still be possible to overflow
if domain_tot_pages() is near to (2^32 - 1).

As this code may be called by a guest via XENMEM_increase_reservation
and XENMEM_populate_physmap, we want to make sure the guest is not going
to be able to allocate more than it is allowed.

Rework the allocation check to avoid any possible overflow. While the
check domain_tot_pages() < d->max_pages should technically not be
necessary, it is probably best to have it to catch any possible
inconsistencies in the future.

This is CVE-2021-28706 / part of XSA-385.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 143501861d48e1bfef495849fd68584baac05849
master date: 2021-11-22 11:11:05 +0000

3 years agopublic/gnttab: relax v2 recommendation
Jan Beulich [Fri, 19 Nov 2021 08:42:10 +0000 (09:42 +0100)]
public/gnttab: relax v2 recommendation

With there being a way to disable v2 support, telling new guests to use
v2 exclusively is not a good suggestion.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
master commit: 2d72d2784eb71d8532bfbd6462d261739c9e82e4
master date: 2021-11-16 17:34:06 +0100

3 years agox86/APIC: avoid iommu_supports_x2apic() on error path
Jan Beulich [Fri, 19 Nov 2021 08:41:41 +0000 (09:41 +0100)]
x86/APIC: avoid iommu_supports_x2apic() on error path

The value it returns may change from true to false in case
iommu_enable_x2apic() fails and, as a side effect, clears iommu_intremap
(as can happen at least on AMD). Latch the return value from the first
invocation to replace the second one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0f50d1696b3c13cbf0b18fec817fc291d5a30a31
master date: 2021-11-04 14:44:43 +0100

3 years agox86/IOMMU: mark IOMMU / intremap not in use when ACPI tables are missing
Jan Beulich [Fri, 19 Nov 2021 08:41:09 +0000 (09:41 +0100)]
x86/IOMMU: mark IOMMU / intremap not in use when ACPI tables are missing

x2apic_bsp_setup() gets called ahead of iommu_setup(), and since x2APIC
mode (physical vs clustered) depends on iommu_intremap, that variable
needs to be set to off as soon as we know we can't / won't enable
interrupt remapping, i.e. in particular when parsing of the respective
ACPI tables failed. Move the turning off of iommu_intremap from AMD
specific code into acpi_iommu_init(), accompanying it by clearing of
iommu_enable.

Take the opportunity and also fully skip ACPI table parsing logic on
VT-d when both "iommu=off" and "iommu=no-intremap" are in effect anyway,
like was already the case for AMD.

The tag below only references the commit uncovering a pre-existing
anomaly.

Fixes: d8bd82327b0f ("AMD/IOMMU: obtain IVHD type to use earlier")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 46c4061cd2bf69e8039021af615c2bdb94e50088
master date: 2021-11-04 14:44:01 +0100

3 years agox86/xstate: reset cached register values on resume
Marek Marczykowski-Górecki [Fri, 19 Nov 2021 08:40:44 +0000 (09:40 +0100)]
x86/xstate: reset cached register values on resume

set_xcr0() and set_msr_xss() use cached value to avoid setting the
register to the same value over and over. But suspend/resume implicitly
reset the registers and since percpu areas are not deallocated on
suspend anymore, the cache gets stale.
Reset the cache on resume, to ensure the next write will really hit the
hardware. Choose value 0, as it will never be a legitimate write to
those registers - and so, will force write (and cache update).

Note the cache is used io get_xcr0() and get_msr_xss() too, but:
- set_xcr0() is called few lines below in xstate_init(), so it will
  update the cache with appropriate value
- get_msr_xss() is not used anywhere - and thus not before any
  set_msr_xss() that will fill the cache

Fixes: aca2a985a55a "xen: don't free percpu areas during suspend"
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f7f4a523927fa4c7598e4647a16bc3e3cf8009d0
master date: 2021-11-04 14:42:37 +0100

3 years agox86/traps: Fix typo in do_entry_CP()
Andrew Cooper [Fri, 19 Nov 2021 08:40:19 +0000 (09:40 +0100)]
x86/traps: Fix typo in do_entry_CP()

The call to debugger_trap_entry() should pass the correct vector.  The
break-for-gdbsx logic is in practice unreachable because PV guests can't
generate #CP, but it will interfere with anyone inserting custom debugging
into debugger_trap_entry().

Fixes: 5ad05b9c2490 ("x86/traps: Implement #CP handler and extend #PF for shadow stacks")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 512863ed238d7390f74d43f0ba298b1dfa8f4803
master date: 2021-11-03 19:13:17 +0000

3 years agox86/shstk: Fix use of shadow stacks with XPTI active
Andrew Cooper [Fri, 19 Nov 2021 08:39:46 +0000 (09:39 +0100)]
x86/shstk: Fix use of shadow stacks with XPTI active

The call to setup_cpu_root_pgt(0) in smp_prepare_cpus() is too early.  It
clones the BSP's stack while the .data mapping is still in use, causing all
mappings to be fully read read/write (and with no guard pages either).  This
ultimately causes #DF when trying to enter the dom0 kernel for the first time.

Defer setting up BSPs XPTI pagetable until reinit_bsp_stack() after we've set
up proper shadow stack permissions.

Fixes: 60016604739b ("x86/shstk: Rework the stack layout to support shadow stacks")
Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b2851580b1f2ff121737a37cb25a370d7692ae3b
master date: 2021-11-03 13:08:42 +0000

3 years agoupdate system time immediately when VCPUOP_register_vcpu_info
Dongli Zhang [Fri, 19 Nov 2021 08:39:09 +0000 (09:39 +0100)]
update system time immediately when VCPUOP_register_vcpu_info

The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This is to borrow the idea of
VCPUOP_register_vcpu_time_memory_area, where the
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.

Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().

Reference: https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b67f09721f136cc3a9afcb6a82466d1bd27aa6c0
master date: 2021-11-03 10:19:06 +0100

3 years agox86/paging: restrict physical address width reported to guests
Jan Beulich [Fri, 19 Nov 2021 08:38:42 +0000 (09:38 +0100)]
x86/paging: restrict physical address width reported to guests

Modern hardware may report more than 48 bits of physical address width.
For paging-external guests our P2M implementation does not cope with
larger values. Telling the guest of more available bits means misleading
it into perhaps trying to actually put some page there (like was e.g.
intermediately done in OVMF for the shared info page).

While there also convert the PV check to a paging-external one (which in
our current code base are synonyms of one another anyway).

Fixes: 5dbd60e16a1f ("x86/shadow: Correct guest behaviour when creating PTEs above maxphysaddr")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b7635526acffbe4ad8ad16fd92812c57742e54c2
master date: 2021-10-19 10:08:30 +0200

3 years agox86/AMD: make HT range dynamic for Fam17 and up
Jan Beulich [Fri, 19 Nov 2021 08:38:09 +0000 (09:38 +0100)]
x86/AMD: make HT range dynamic for Fam17 and up

At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.

To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.

While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: d6e38eea2d806c53d976603717aebf6e5de30a1e
master date: 2021-10-19 10:04:13 +0200

3 years agox86emul: de-duplicate scatters to the same linear address
Jan Beulich [Fri, 19 Nov 2021 08:37:37 +0000 (09:37 +0100)]
x86emul: de-duplicate scatters to the same linear address

The SDM specifically allows for earlier writes to fully overlapping
ranges to be dropped. If a guest did so, hvmemul_phys_mmio_access()
would crash it if varying data was written to the same address. Detect
overlaps early, as doing so in hvmemul_{linear,phys}_mmio_access() would
be quite a bit more difficult. To maintain proper faulting behavior,
instead of dropping earlier write instances of fully overlapping slots
altogether, write the data of the final of these slots multiple times.
(We also can't pull ahead the [single] write of the data of the last of
the slots, clearing all involved slots' op_mask bits together, as this
would yield incorrect results if there were intervening partially
overlapping ones.)

Note that due to cache slot use being linear address based, there's no
similar issue with multiple writes to the same physical address (mapped
through different linear addresses).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a8cddbac5051020bb4a59a7f0ea27500c51063fb
master date: 2021-10-19 10:02:39 +0200

3 years agox86/HVM: correct cleanup after failed viridian_vcpu_init()
Jan Beulich [Fri, 19 Nov 2021 08:37:10 +0000 (09:37 +0100)]
x86/HVM: correct cleanup after failed viridian_vcpu_init()

This happens after nestedhvm_vcpu_initialise(), so its effects also need
to be undone.

Fixes: 40a4a9d72d16 ("viridian: add init hooks")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 66675056c6e59b8a8b651a29ef53c63e9e04f58d
master date: 2021-10-18 14:21:17 +0200

3 years agobuild: fix dependencies in arch/x86/boot
Anthony PERARD [Fri, 19 Nov 2021 08:36:36 +0000 (09:36 +0100)]
build: fix dependencies in arch/x86/boot

Temporary fix the list of headers that cmdline.c and reloc.c depends
on, until the next time the list is out of sync again.

Also, add the linker script to the list.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2f5f0a1b77161993c16c4cc243467d75e5b7633b
master date: 2021-10-14 12:35:42 +0200

3 years agox86/PV32: fix physdev_op_compat handling
Jan Beulich [Fri, 15 Oct 2021 09:20:04 +0000 (11:20 +0200)]
x86/PV32: fix physdev_op_compat handling

The conversion of the original code failed to recognize that the 32-bit
compat variant of this (sorry, two different meanings of "compat" here)
needs to continue to invoke the compat handler, not the native one.
Arrange for this by adding yet another #define.

Affected functions (having existed prior to the introduction of the new
hypercall) are PHYSDEVOP_set_iobitmap and PHYSDEVOP_apic_{read,write}.
For all others the operand struct layout doesn't differ.

Fixes: 1252e2823117 ("x86/pv: Export pv_hypercall_table[] rather than working around it in several ways")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 834cb8761051f7d87816785c0d99fe9bd5f0ce30
master date: 2021-10-12 11:55:42 +0200

3 years agoAMD/IOMMU: consider hidden devices when flushing device I/O TLBs
Jan Beulich [Fri, 15 Oct 2021 09:19:41 +0000 (11:19 +0200)]
AMD/IOMMU: consider hidden devices when flushing device I/O TLBs

Hidden devices are associated with DomXEN but usable by the
hardware domain. Hence they need flushing as well when all devices are
to have flushes invoked.

While there drop a redundant ATS-enabled check and constify the first
parameter of the involved function.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 036432e8b27e1ef21e0f0204ba9b0e3972a031c2
master date: 2021-10-12 11:54:34 +0200

3 years agox86/HVM: fix xsm_op for 32-bit guests
Jan Beulich [Fri, 15 Oct 2021 09:19:11 +0000 (11:19 +0200)]
x86/HVM: fix xsm_op for 32-bit guests

Like for PV, 32-bit guests need to invoke the compat handler, not the
native one.

Fixes: db984809d61b ("hvm: wire up domctl and xsm hypercalls")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6b672e8a925ff4b71a1a67bc7d213ef445af74f
master date: 2021-10-11 10:58:44 +0200

3 years agox86/build: suppress EFI-related tool chain checks upon local $(MAKE) recursion
Jan Beulich [Fri, 15 Oct 2021 09:18:51 +0000 (11:18 +0200)]
x86/build: suppress EFI-related tool chain checks upon local $(MAKE) recursion

The xen-syms and xen.efi linking steps are serialized only when the
intermediate note.o file is necessary. Otherwise both may run in
parallel. This in turn means that the compiler / linker invocations to
create efi/check.o / efi/check.efi may also happen twice in parallel.
Obviously it's a bad idea to have multiple producers of the same output
race with one another - every once in a while one may e.g. observe

objdump: efi/check.efi: file format not recognized

We don't need this EFI related checking to occur when producing the
intermediate symbol and relocation table objects, and we have an easy
way of suppressing it: Simply pass in "efi-y=", overriding the
assignments done in the Makefile and thus forcing the tool chain checks
to be bypassed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 24b0ce9a5da2e648cde818055a085bcbcf24ecb0
master date: 2021-10-11 10:58:17 +0200

3 years agopci: fix handling of PCI bridges with subordinate bus number 0xff
Igor Druzhinin [Fri, 15 Oct 2021 09:18:19 +0000 (11:18 +0200)]
pci: fix handling of PCI bridges with subordinate bus number 0xff

Bus number 0xff is valid according to the PCI spec. Using u8 typed sub_bus
and assigning 0xff to it will result in the following loop getting stuck.

    for ( ; sec_bus <= sub_bus; sec_bus++ ) {...}

Just change its type to unsigned int similarly to what is already done in
dmar_scope_add_buses().

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
master commit: 9c3b9800e2019c93ab22da69e4a0b22d6fb059ec
master date: 2021-09-28 16:04:50 +0200

3 years agoVT-d: PCI segment numbers are up to 16 bits wide
Jan Beulich [Fri, 15 Oct 2021 09:17:56 +0000 (11:17 +0200)]
VT-d: PCI segment numbers are up to 16 bits wide

We shouldn't silently truncate respective values.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a3dd33e63c2c8c51102f223e6efe2e3d3bdcbec1
master date: 2021-09-20 10:25:03 +0200

3 years agoVT-d: consider hidden devices when unmapping
Jan Beulich [Fri, 15 Oct 2021 09:17:32 +0000 (11:17 +0200)]
VT-d: consider hidden devices when unmapping

Whether to clear an IOMMU's bit in the domain's bitmap should depend on
all devices the domain can control. For the hardware domain this
includes hidden devices, which are associated with DomXEN.

While touching related logic
- convert the "current device" exclusion check to a simple pointer
  comparison,
- convert "found" to "bool",
- adjust style and correct a typo in an existing comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 75bfe6ec4844f83b300b9807bceaed1e2fe23270
master date: 2021-09-20 10:24:27 +0200

3 years agox86: quote section names when defining them in linker script
Roger Pau Monné [Fri, 15 Oct 2021 09:16:41 +0000 (11:16 +0200)]
x86: quote section names when defining them in linker script

LLVM ld seems to require section names to be quoted at both definition
and when referencing them for a match to happen, or else we get the
following errors:

ld: error: xen.lds:45: undefined section ".text"
ld: error: xen.lds:69: undefined section ".rodata"
ld: error: xen.lds:104: undefined section ".note.gnu.build-id"
[...]

The original fix for GNU ld 2.37 only quoted the section name when
referencing it in the ADDR function. Fix by also quoting the section
names when declaring them.

Fixes: 58ad654ebce7 ("x86: work around build issue with GNU ld 2.37")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6254920587c33bcc7ab884e6c9a11cfc0d5867ab
master date: 2021-09-15 11:02:21 +0200

3 years agotools/libacpi: Use 64-byte alignment for FACS
Kevin Stefanov [Fri, 15 Oct 2021 09:16:09 +0000 (11:16 +0200)]
tools/libacpi: Use 64-byte alignment for FACS

The spec requires 64-byte alignment, not 16.

Signed-off-by: Kevin Stefanov <kevin.stefanov@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c76cfada1cfad05aaf64ce3ad305c5467650e782
master date: 2021-09-10 13:27:08 +0100

3 years agox86/spec-ctrl: Print all AMD speculative hints/features
Andrew Cooper [Fri, 15 Oct 2021 09:15:39 +0000 (11:15 +0200)]
x86/spec-ctrl: Print all AMD speculative hints/features

We already print Intel features that aren't yet implemented/used, so be
consistent on AMD too.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3d189f16a11d5a209fb47fa3635847608d43745c
master date: 2021-09-09 12:36:17 +0100

3 years agox86/amd: Use newer SSBD mechanisms if they exist
Andrew Cooper [Fri, 15 Oct 2021 09:15:14 +0000 (11:15 +0200)]
x86/amd: Use newer SSBD mechanisms if they exist

The opencoded legacy Memory Disambiguation logic in init_amd() neglected
Fam19h for the Zen3 microarchitecture.  Further more, all Zen2 based system
have the architectural MSR_SPEC_CTRL and the SSBD bit within it, so shouldn't
be using MSR_AMD64_LS_CFG.

Implement the algorithm given in AMD's SSBD whitepaper, and leave a
printk_once() behind in the case that no controls can be found.

This now means that a user explicitly choosing `spec-ctrl=ssbd` will properly
turn off Memory Disambiguation on Fam19h/Zen3 systems.

This still remains a single system-wide setting (for now), and is not context
switched between vCPUs.  As such, it doesn't interact with Intel's use of
MSR_SPEC_CTRL and default_xen_spec_ctrl (yet).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2a4e6c4e4bea2e0bb720418c331ee28ff9c7632e
master date: 2021-09-08 14:16:19 +0100

3 years agox86/amd: Enumeration for speculative features/hints
Andrew Cooper [Fri, 15 Oct 2021 09:14:46 +0000 (11:14 +0200)]
x86/amd: Enumeration for speculative features/hints

There is a step change in speculation protections between the Zen1 and Zen2
microarchitectures.

Zen1 and older have no special support.  Control bits in non-architectural
MSRs are used to make lfence be dispatch-serialising (Spectre v1), and to
disable Memory Disambiguation (Speculative Store Bypass).  IBPB was
retrofitted in a microcode update, and software methods are required for
Spectre v2 protections.

Because the bit controlling Memory Disambiguation is model specific,
hypervisors are expected to expose a MSR_VIRT_SPEC_CTRL interface which
abstracts the model specific details.

Zen2 and later implement the MSR_SPEC_CTRL interface in hardware, and
virtualise the interface for HVM guests to use.  A number of hint bits are
specified too to help guide OS software to the most efficient mitigation
strategy.

Zen3 introduced a new feature, Predictive Store Forwarding, along with a
control to disable it in sensitive code.

Add CPUID and VMCB details for all the new functionality.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 747424c664bb164a04e7a9f2ffbf02d4a1630d7d
master date: 2021-09-08 14:16:19 +0100

3 years agox86/spec-ctrl: Split the "Hardware features" diagnostic line
Andrew Cooper [Fri, 15 Oct 2021 09:14:16 +0000 (11:14 +0200)]
x86/spec-ctrl: Split the "Hardware features" diagnostic line

Separate the read-only hints from the features requiring active actions on
Xen's behalf.

Also take the opportunity split the IBRS/IBPB and IBPB mess.  More features
with overlapping enumeration are on the way, and and it is not useful to split
them like this.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 565ebcda976c05b0c6191510d5e32b621a2b1867
master date: 2021-09-08 14:16:19 +0100

3 years agobuild: set policy filename on make command line
Anthony PERARD [Fri, 15 Oct 2021 09:13:39 +0000 (11:13 +0200)]
build: set policy filename on make command line

In order to avoid flask/Makefile.common calling `make xenversion`, we
override POLICY_FILENAME with the value we are going to use anyway.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: c81e7efe2146c8f381fbdbb037b9d46866a6451e
master date: 2021-09-08 14:40:00 +0200

3 years agoupdate Xen version to 4.14.4-pre
Jan Beulich [Fri, 15 Oct 2021 09:11:34 +0000 (11:11 +0200)]
update Xen version to 4.14.4-pre

3 years agoVT-d: fix deassign of device with RMRR
Jan Beulich [Fri, 1 Oct 2021 13:05:42 +0000 (15:05 +0200)]
VT-d: fix deassign of device with RMRR

Ignoring a specific error code here was not meant to short circuit
deassign to _just_ the unmapping of RMRRs. This bug was previously
hidden by the bogus (potentially indefinite) looping in
pci_release_devices(), until f591755823a7 ("IOMMU/PCI: don't let domain
cleanup continue when device de-assignment failed") fixed that loop.

This is CVE-2021-28702 / XSA-386.

Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling")
Reported-by: Ivan Kardykov <kardykov@tabit.pro>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Ivan Kardykov <kardykov@tabit.pro>
(cherry picked from commit 24ebe875a77833696bbe5c9372e9e1590a7e7101)

3 years agoupdate Xen version to 4.14.3 RELEASE-4.14.3
Jan Beulich [Fri, 10 Sep 2021 12:30:40 +0000 (14:30 +0200)]
update Xen version to 4.14.3

3 years agognttab: deal with status frame mapping race
Jan Beulich [Wed, 8 Sep 2021 12:53:04 +0000 (14:53 +0200)]
gnttab: deal with status frame mapping race

Once gnttab_map_frame() drops the grant table lock, the MFN it reports
back to its caller is free to other manipulation. In particular
gnttab_unpopulate_status_frames() might free it, by a racing request on
another CPU, thus resulting in a reference to a deallocated page getting
added to a domain's P2M.

Obtain a page reference in gnttab_map_frame() to prevent freeing of the
page until xenmem_add_to_physmap_one() has actually completed its acting
on the page. Do so uniformly, even if only strictly required for v2
status pages, to avoid extra conditionals (which then would all need to
be kept in sync going forward).

This is CVE-2021-28701 / XSA-384.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: eb6bbf7b30da5bae87932514d54d0e3c68b23757
master date: 2021-09-08 14:37:45 +0200

3 years agox86/p2m-pt: fix p2m_flags_to_access()
Jan Beulich [Wed, 8 Sep 2021 12:52:13 +0000 (14:52 +0200)]
x86/p2m-pt: fix p2m_flags_to_access()

The initial if() was inverted, invalidating all output from this
function. Which in turn means the mirroring of P2M mappings into the
IOMMU didn't always work as intended: Mappings may have got updated when
there was no need to. There would not have been too few (un)mappings;
what saves us is that alongside the flags comparison MFNs also get
compared, with non-present entries always having an MFN of 0 or
INVALID_MFN while present entries always have MFNs different from these
two (0 in the table also meant to cover INVALID_MFN):

OLD NEW
P W access MFN P W access MFN
0 0 r 0 0 0 n 0
0 1 rw 0 0 1 n 0
1 0 n non-0 1 0 r non-0
1 1 n non-0 1 1 rw non-0

present <-> non-present transitions are fine because the MFNs differ.
present -> present transitions as well as non-present -> non-present
ones are potentially causing too many map/unmap operations, but never
too few, because in that case old (bogus) and new access differ.

Fixes: d1bb6c97c31e ("IOMMU: also pass p2m_access_t to p2m_get_iommu_flags())
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e70a9a043a5ce6d4025420f729bc473f711bf5d1
master date: 2021-09-07 14:24:49 +0200