xenbits.xensource.com Git - people/andrewcoop/xen.git/log
7 months agoPV df stk tb-aem-qbs-dbg
Andrew Cooper [Sun, 22 Sep 2024 11:34:34 +0000 (12:34 +0100)]
PV df stk

Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 months agoxen/arch/x86/cpu/intel.c: Report SMX and TXT capabilities
Michał Żygowski [Sun, 15 Sep 2024 14:06:01 +0000 (16:06 +0200)]
xen/arch/x86/cpu/intel.c: Report SMX and TXT capabilities

Report the SMX and TXT capabilities so that dom0 can query the
Intel TXT support information using xl dmesg.

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
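
A minimal sketch of the kind of check involved, assuming hypothetical
helper names (CPUID.1:ECX bit 6 is the architectural SMX flag; TXT
additionally depends on chipset support, which is not probed here):

  #include <stdint.h>
  #include <stdio.h>

  /* Query leaf 1 and return ECX, where the VMX/SMX feature flags live. */
  static uint32_t cpuid_1_ecx(void)
  {
      uint32_t eax, ebx, ecx, edx;

      asm volatile ( "cpuid"
                     : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                     : "a" (1), "c" (0) );
      return ecx;
  }

  int main(void)
  {
      uint32_t ecx = cpuid_1_ecx();

      /* Bit 5: VMX.  Bit 6: SMX (Safer Mode Extensions, needed for TXT). */
      printf("VMX: %s, SMX: %s\n",
             (ecx & (1u << 5)) ? "yes" : "no",
             (ecx & (1u << 6)) ? "yes" : "no");
      return 0;
  }
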
7 months agoxen/arch/x86/intel_txt.c: Restore IA32_MISC_ENABLES
Michał Żygowski [Sun, 15 Sep 2024 12:32:40 +0000 (14:32 +0200)]
xen/arch/x86/intel_txt.c: Restore IA32_MISC_ENABLES

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
7 months agoxen/arch/x86/boot/head.S: Set CBnT support capability in MLE
Michał Żygowski [Sun, 15 Sep 2024 11:51:14 +0000 (13:51 +0200)]
xen/arch/x86/boot/head.S: Set CBnT support capability in MLE

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
7 months agoxen/arch/x86/boot/head.S: Use MAXPHYADDR for MTRR masks in MLE capabilities
Michał Żygowski [Sun, 15 Sep 2024 10:47:18 +0000 (12:47 +0200)]
xen/arch/x86/boot/head.S: Use MAXPHYADDR for MTRR masks in MLE capabilities

The bootloader should prepare the MTRR masks using MAXPHYADDR. On modern
Intel platforms, the SINIT ACM forces this capabilities bit to 1,
according to the TXT MLE Software Development Guide, Revision 017.4.

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
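
For reference, a hedged sketch of how an MTRR physmask incorporates
MAXPHYADDR (real code reads it from CPUID.80000008H:EAX[7:0]; bit 11 of
a PHYSMASK MSR is the valid bit; the helper is illustrative):

  #include <stdint.h>
  #include <stdio.h>

  #define MTRR_PHYSMASK_VALID (1ULL << 11)

  /* Build a PHYSMASK value covering a power-of-two `size` region, with
   * the mask bits extended up to the physical address width. */
  static uint64_t mtrr_physmask(uint64_t size, unsigned int maxphysaddr)
  {
      uint64_t addr_mask = (1ULL << maxphysaddr) - 1;

      return (~(size - 1) & addr_mask & ~0xfffULL) | MTRR_PHYSMASK_VALID;
  }

  int main(void)
  {
      /* 2MB region on a CPU with MAXPHYADDR = 46 (illustrative value). */
      printf("%#llx\n", (unsigned long long)mtrr_physmask(1ULL << 21, 46));
      return 0;
  }
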
7 months agoarch/x86/hvm/vmx/vmcs.c: Check for VMX in SMX while slaunch active
Michał Żygowski [Sun, 15 Sep 2024 10:40:16 +0000 (12:40 +0200)]
arch/x86/hvm/vmx/vmcs.c: Check for VMX in SMX while slaunch active

Check that IA32_FEATURE_CONTROL has the proper bits enabled to run
VMX in SMX when slaunch is active.

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
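
The check boils down to the following shape (a sketch; bit layout per
the SDM: bit 0 of IA32_FEATURE_CONTROL is the lock, bit 1 permits VMXON
inside SMX operation, bit 2 permits VMXON outside SMX):

  #include <stdbool.h>
  #include <stdint.h>

  #define FEAT_CTRL_LOCK         (1ULL << 0)
  #define FEAT_CTRL_VMX_IN_SMX   (1ULL << 1)
  #define FEAT_CTRL_VMX_OUT_SMX  (1ULL << 2)

  /* Given the IA32_FEATURE_CONTROL (MSR 0x3a) value, decide whether
   * VMXON is permitted, honouring SMX operation when slaunch is active. */
  static bool vmx_allowed(uint64_t feat_ctrl, bool slaunch_active)
  {
      uint64_t need = FEAT_CTRL_LOCK |
                      (slaunch_active ? FEAT_CTRL_VMX_IN_SMX
                                      : FEAT_CTRL_VMX_OUT_SMX);

      return (feat_ctrl & need) == need;
  }

  int main(void)
  {
      /* Example: locked with both enable bits set. */
      return !vmx_allowed(0x7, true);
  }
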
7 months agoxen/arch/x86/e820.c: Read the MTRR caps and default type after restoring
Michał Żygowski [Sat, 17 Aug 2024 17:41:02 +0000 (19:41 +0200)]
xen/arch/x86/e820.c: Read the MTRR caps and default type after restoring

The MTRR capabilities and default type were read before the MTRRs were
restored in the slaunch flow. The restoration itself updated the MTRR
default type MSR, so mtrr_top_of_ram() operated on stale values in the
mtrr_cap and mtrr_def variables. Move the reading of those MSRs to
after the MTRRs are restored in the slaunch flow.

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
7 months agoxen/arch/x86/intel_txt.c: Disable MTRRs before restoring them on BSP
Michał Żygowski [Sat, 17 Aug 2024 12:58:31 +0000 (14:58 +0200)]
xen/arch/x86/intel_txt.c: Disable MTRRs before restoring them on BSP

Previously the MTRRs were restored in an unsafe way: the MTRR enable
bit was still set and caching was not disabled. Mimic the generic Xen
MTRR driver's behaviour when changing MTRRs.

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
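
The generic behaviour being mimicked follows the SDM's MTRR update
sequence, roughly as below (a sketch with hypothetical helpers; the
real Xen driver differs in detail):

  #include <stdint.h>

  struct mtrr_state { uint64_t def_type; /* plus variable ranges */ };

  extern void cache_disable(void);   /* CR0.CD=1, WBINVD, TLB flush */
  extern void cache_enable(void);    /* CR0.CD=0 */
  extern uint64_t read_msr(uint32_t msr);
  extern void write_msr(uint32_t msr, uint64_t val);
  extern void write_variable_mtrrs(const struct mtrr_state *s);

  #define MSR_MTRRdefType      0x2ffU
  #define MTRR_DEF_TYPE_ENABLE (1ULL << 11)

  /* Disable caching and the MTRRs before touching them, instead of
   * rewriting them while both are live. */
  void restore_mtrrs(const struct mtrr_state *saved)
  {
      cache_disable();
      write_msr(MSR_MTRRdefType,
                read_msr(MSR_MTRRdefType) & ~MTRR_DEF_TYPE_ENABLE);

      write_variable_mtrrs(saved);

      write_msr(MSR_MTRRdefType, saved->def_type | MTRR_DEF_TYPE_ENABLE);
      cache_enable();
  }
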
7 months agoxen/arch/x86/slaunch.c: Map the TPM event log after TXT regions
Michał Żygowski [Thu, 23 May 2024 22:11:59 +0000 (00:11 +0200)]
xen/arch/x86/slaunch.c: Map the TPM event log after TXT regions

Map the TPM event log after the TXT regions are mapped to avoid
an early page fault when booting with slaunch.

Signed-off-by: Michał Żygowski <michal.zygowski@3mdeb.com>
12 months agoarch/x86/tpm.c: fix appending to event log of TPM1
Sergii Dmytruk [Fri, 5 Apr 2024 17:54:17 +0000 (20:54 +0300)]
arch/x86/tpm.c: fix appending to event log of TPM1

Just like in the TPM2 case, this code path needs extra handling on AMD,
because the TXT-compatible data prepared by SKL is stored inside the
vendor data field of the TCG header.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/boot: read APIC base from register
Krystian Hebel [Thu, 4 Apr 2024 15:30:25 +0000 (17:30 +0200)]
x86/boot: read APIC base from register

Some CPUs don't use the default APIC base. The address in the MSR is
always valid, and the MSR is already being read to test for x2APIC.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86: support slaunch with AMD SKINIT
Sergii Dmytruk [Sat, 16 Mar 2024 22:58:26 +0000 (00:58 +0200)]
arch/x86: support slaunch with AMD SKINIT

This mostly involves not running Intel-specific code when on AMD.
There are only a few new AMD-specific implementation details:
 - finding the SLB's start and size, then mapping and protecting it
 - managing the offset at which the next TPM log entry is added

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86: move generic memory mapping and protection to slaunch.c
Sergii Dmytruk [Thu, 21 Mar 2024 22:40:12 +0000 (00:40 +0200)]
arch/x86: move generic memory mapping and protection to slaunch.c

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/boot: find MBI and SLRT on AMD
Sergii Dmytruk [Thu, 21 Mar 2024 17:41:06 +0000 (19:41 +0200)]
x86/boot: find MBI and SLRT on AMD

On AMD with SKINIT, secure-kernel-loader passes the MBI as a parameter
to the Multiboot kernel.

Another thing of interest is the location of the SLRT, which is the
bootloader's data placed after SKL.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/boot: introduce slaunch_slrt
Sergii Dmytruk [Thu, 21 Mar 2024 17:35:10 +0000 (19:35 +0200)]
x86/boot: introduce slaunch_slrt

It holds the physical address of the SLRT. The value is produced by
slaunch_early (previously known as txt_early), gets set in assembly and
is then used by the main C code, which doesn't need to know how the
address was obtained (that part differs between CPU vendors).

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86: map TPM register space for Intel TXT
Sergii Dmytruk [Tue, 19 Mar 2024 23:03:51 +0000 (01:03 +0200)]
arch/x86: map TPM register space for Intel TXT

This also happened to work without the explicit mapping, thanks to the
fact that the TXT register space is located in the same 2MB page as
the TPM.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86: extract slaunch unit
Sergii Dmytruk [Sat, 16 Mar 2024 19:25:16 +0000 (21:25 +0200)]
arch/x86: extract slaunch unit

Collect its core functionality in one place instead of having some of
it in the intel_txt unit and the rest in the tpm unit.

TXT_EVTYPE_* now live in <asm/slaunch.h> and are called DLE_EVTYPE_*,
despite being based on the TXT specification.  This way code for
non-Intel CPUs won't need to include the TXT header.

No functional changes.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86: add separate AP boot method for Intel TXT
Sergii Dmytruk [Mon, 18 Mar 2024 19:21:18 +0000 (21:21 +0200)]
arch/x86: add separate AP boot method for Intel TXT

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86: pass SLRT pointer to find_evt_log()
Sergii Dmytruk [Sat, 16 Mar 2024 17:35:09 +0000 (19:35 +0200)]
arch/x86: pass SLRT pointer to find_evt_log()

This makes the function independent of the way in which the SLRT is
discovered, and moves the discovery code into a separate function
reusable in other places.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/boot: generate value of size field in MLE header
Sergii Dmytruk [Wed, 28 Feb 2024 13:47:52 +0000 (15:47 +0200)]
x86/boot: generate value of size field in MLE header

Do not rely on the bootloader to do that, to avoid discrepancies
between the measured data and the binary file being loaded.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86: process DRTM policy
Sergii Dmytruk [Sat, 28 Oct 2023 21:42:04 +0000 (00:42 +0300)]
arch/x86: process DRTM policy

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/tpm.c: implement event log for TPM2.0
Sergii Dmytruk [Fri, 30 Jun 2023 21:41:35 +0000 (00:41 +0300)]
x86/tpm.c: implement event log for TPM2.0

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/tpm.c: support extending PCRs of TPM2.0
Sergii Dmytruk [Wed, 28 Jun 2023 17:23:24 +0000 (20:23 +0300)]
x86/tpm.c: support extending PCRs of TPM2.0

SHA1 and SHA256 are hardcoded here, but the TPM's support for them is
checked.  The addition of an event log for TPM2.0 will generalize the
code further.

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/sha256.c: add file
Sergii Dmytruk [Sun, 25 Jun 2023 21:17:15 +0000 (00:17 +0300)]
x86/sha256.c: add file

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoarch/x86/smp: start APs in parallel during boot
Krystian Hebel [Fri, 16 Jun 2023 13:45:32 +0000 (15:45 +0200)]
arch/x86/smp: start APs in parallel during boot

Multiple delays are required when sending IPIs and waiting for
responses. During boot, 4 such IPIs were sent per AP. With this
change, only one set of broadcast IPIs is sent. This reduces boot time,
especially on platforms with a large number of cores.

Single-CPU initialization is still possible; it is used for hotplug.

During wakeup from S3, APs are started one by one. It should be possible
to enable parallel execution there as well, but I don't have a way of
testing it as of now.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
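
In outline, the per-AP INIT-SIPI-SIPI turns into a single broadcast
sequence, something like this sketch (the wrapper and delays are
assumptions; INIT is ICR delivery mode 0x500, Start-up is 0x600, and
0xc0000 is the "all excluding self" destination shorthand):

  #include <stdint.h>

  extern void apic_icr_write(uint32_t icr);   /* assumed wrapper */
  extern void mdelay(unsigned int ms);
  extern void udelay(unsigned int us);

  #define APIC_DM_INIT      0x500u
  #define APIC_DM_STARTUP   0x600u
  #define APIC_ALL_BUT_SELF 0xc0000u

  /* One broadcast INIT-SIPI-SIPI wakes every AP at once, so the
   * mandatory delays are paid once rather than once per AP. */
  void wake_all_aps(uint8_t trampoline_vector)
  {
      apic_icr_write(APIC_ALL_BUT_SELF | APIC_DM_INIT);
      mdelay(10);

      apic_icr_write(APIC_ALL_BUT_SELF | APIC_DM_STARTUP | trampoline_vector);
      udelay(200);
      apic_icr_write(APIC_ALL_BUT_SELF | APIC_DM_STARTUP | trampoline_vector);
  }
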
12 months agoarch/x86/smp: don't send INIT-SIPI-SIPI if AP is already running
Krystian Hebel [Fri, 16 Jun 2023 12:41:17 +0000 (14:41 +0200)]
arch/x86/smp: don't send INIT-SIPI-SIPI if AP is already running

This is another requirement for parallel AP bringup.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86/smp: remove MONITOR/MWAIT loop for TXT AP bringup
Krystian Hebel [Fri, 16 Jun 2023 12:31:27 +0000 (14:31 +0200)]
arch/x86/smp: remove MONITOR/MWAIT loop for TXT AP bringup

This is no longer necessary, since the AP loops on cpu_state and the
CPU index is passed as an argument.

In addition, move the TXT JOIN structure to static data. There is no
guarantee that it would be consumed before being overwritten on the
BSP's stack.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86/smp: make cpu_state per-CPU
Krystian Hebel [Fri, 16 Jun 2023 10:18:23 +0000 (12:18 +0200)]
arch/x86/smp: make cpu_state per-CPU

This will be used for parallel AP bring-up.

CPU_STATE_INIT changed direction. It was previously set by the BSP and
never consumed by the AP. Now it signals that the AP got through the
assembly part of initialization and is waiting for the BSP to call the
notifiers that set up data structures required for further
initialization.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86/smp: drop booting_cpu variable
Krystian Hebel [Tue, 13 Jun 2023 16:58:21 +0000 (18:58 +0200)]
arch/x86/smp: drop booting_cpu variable

The CPU id is obtained as a side effect of searching for the
appropriate stack for the AP, and can be used as a parameter to
start_secondary(). Coincidentally, this also eases further work on
making the AP bring-up code parallel.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86/shutdown: protect against recurrent machine_restart()
Krystian Hebel [Tue, 13 Jun 2023 13:56:12 +0000 (15:56 +0200)]
arch/x86/shutdown: protect against recurrent machine_restart()

If multiple CPUs call machine_restart() before the actual restart takes
place, but after the boot CPU has declared itself not online, the
ASSERT in on_selected_cpus() will fail. A few calls later, execution
ends up in machine_restart() again, with another frame on the call
stack for the new exception.

To protect against running out of stack, the code checks whether the
boot CPU is still online before calling on_selected_cpus().

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
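
The guard has roughly this shape (a sketch with hypothetical stand-ins
for the real primitives):

  #include <stdbool.h>
  #include <stddef.h>

  extern bool cpu_online(unsigned int cpu);
  extern void run_on_cpu0(void (*fn)(void *), void *arg);
  extern void reboot_from_this_cpu(void);
  extern void do_restart(void *arg);

  /* Only rendezvous on the boot CPU if it is still online; otherwise
   * fall back to rebooting directly, avoiding the recursive
   * machine_restart() -> failed ASSERT -> machine_restart() chain that
   * would otherwise exhaust the stack. */
  void machine_restart_sketch(void)
  {
      if ( cpu_online(0) )
          run_on_cpu0(do_restart, NULL);
      else
          reboot_from_this_cpu();
  }
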
12 months agoarch/x86/smp: call x2apic_ap_setup() earlier
Krystian Hebel [Tue, 13 Jun 2023 13:44:36 +0000 (15:44 +0200)]
arch/x86/smp: call x2apic_ap_setup() earlier

It used to be called from smp_callin(); however, BUG_ON() was invoked
on multiple occasions before that. BUG_ON() may end up calling
machine_restart(), which tries to get the APIC ID of the CPU running
this code. If the BSP detected that x2APIC is enabled, get_apic_id()
will try to use it for all CPUs. Enabling x2APIC on secondary CPUs
earlier protects against an endless loop of #GP exceptions caused by
attempts to read the IA32_X2APIC_APICID MSR while x2APIC is disabled
in IA32_APIC_BASE.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
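
In outline (MSR 0x1b is IA32_APIC_BASE with EXTD at bit 10, MSR 0x802
is IA32_X2APIC_APICID; the MSR wrappers are assumptions):

  #include <stdint.h>

  extern uint64_t read_msr(uint32_t msr);
  extern void write_msr(uint32_t msr, uint64_t val);

  #define MSR_IA32_APIC_BASE  0x1bU
  #define APIC_BASE_EXTD      (1ULL << 10)
  #define MSR_X2APIC_APICID   0x802U

  /* Enable x2APIC mode on this AP before anything that may need the
   * APIC ID: reading MSR 0x802 with EXTD clear raises #GP, and a #GP
   * handler which itself wants the APIC ID would loop forever. */
  uint32_t x2apic_ap_setup_sketch(void)
  {
      uint64_t base = read_msr(MSR_IA32_APIC_BASE);

      if ( !(base & APIC_BASE_EXTD) )
          write_msr(MSR_IA32_APIC_BASE, base | APIC_BASE_EXTD);

      return (uint32_t)read_msr(MSR_X2APIC_APICID);
  }
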
12 months agoarch/x86/smp: move stack_base to cpu_data
Krystian Hebel [Thu, 1 Jun 2023 17:27:22 +0000 (19:27 +0200)]
arch/x86/smp: move stack_base to cpu_data

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead
Krystian Hebel [Thu, 1 Jun 2023 15:01:59 +0000 (17:01 +0200)]
arch/x86/smp: drop x86_cpu_to_apicid, use cpu_data[cpu].apicid instead

Both fields held the same data.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoarch/x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)
Krystian Hebel [Thu, 1 Jun 2023 14:05:18 +0000 (16:05 +0200)]
arch/x86: don't access x86_cpu_to_apicid[] directly, use cpu_physical_id(cpu)

This is done in preparation to move data from x86_cpu_to_apicid[]
elsewhere.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agox86/smpboot.c: TXT AP bringup
Krystian Hebel [Wed, 16 Nov 2022 14:06:18 +0000 (15:06 +0100)]
x86/smpboot.c: TXT AP bringup

On Intel TXT, APs are started in one of two ways, depending on the ACM,
which reports the method in its information table. In both cases, all
APs are started simultaneously after the BSP requests them to do so.
The two possible ways are:
- the GETSEC[WAKEUP] instruction,
- a MONITOR address.

GETSEC[WAKEUP] requires version >= 7 of the SINIT to MLE Data table,
but there is no clear mapping of that version to processor families,
and it's not known which CPUs actually use it. It could have been
designed for TXT support on CPUs that lack MONITOR/MWAIT, because
GETSEC[WAKEUP] seems to be more complicated, in software and hardware
alike.

This patch implements only the MONITOR approach; GETSEC[WAKEUP] support
will be added later, once more details and means of testing are
available and if there is a practical need for it.

With this patch, every AP goes through the assembly part, and only in
start_secondary() in C does it re-enter MONITOR/MWAIT, iff it is not
the AP that was asked to boot. The same address is reused for
simplicity, and on the next wakeup call APs don't have to go through
the assembly part again (GDT, paging, stack setup).

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
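
The MONITOR-based parking reduces to an armed wait on a shared address,
roughly as follows (a sketch; the wakeup word and its protocol are
assumptions):

  #include <stdint.h>

  /* A cache-line-aligned word the BSP stores an APIC ID into. */
  extern volatile uint32_t ap_wakeup_word;

  static inline void monitor(const volatile void *addr)
  {
      asm volatile ( "monitor" :: "a" (addr), "c" (0), "d" (0) );
  }

  static inline void mwait(void)
  {
      asm volatile ( "mwait" :: "a" (0), "c" (0) );
  }

  /* Each AP arms MONITOR on the shared address and sleeps in MWAIT; it
   * proceeds only once the BSP writes its APIC ID there.  Re-checking
   * between MONITOR and MWAIT avoids sleeping through a racing write. */
  void ap_wait_for_boot(uint32_t my_apic_id)
  {
      while ( ap_wakeup_word != my_apic_id )
      {
          monitor(&ap_wakeup_word);
          if ( ap_wakeup_word != my_apic_id )
              mwait();
      }
  }
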
12 months agox86/boot: choose AP stack based on APIC ID
Krystian Hebel [Wed, 16 Nov 2022 14:03:07 +0000 (15:03 +0100)]
x86/boot: choose AP stack based on APIC ID

This is the first step towards making parallel AP bring-up possible.
It should be enough for pre-C code.

Parallel AP bring-up is necessary because TXT by design releases all
APs at once. In addition, it reduces the number of IPIs (and, more
importantly, the delays between them) required to start all logical
processors. This results in a significant reduction of boot time, even
when DRTM is not used, with the performance gain growing with the
number of logical CPUs.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
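
Conceptually, the pre-C stack choice is just indexed arithmetic on the
APIC ID, along these lines (sizes and layout are illustrative, not
Xen's):

  #include <stdint.h>

  #define AP_STACK_SIZE (8 * 4096)
  #define MAX_APIC_ID   512

  extern uint8_t ap_stacks[MAX_APIC_ID][AP_STACK_SIZE];

  /* With TXT releasing every AP at once, each AP must derive a unique
   * stack with no BSP hand-holding; its APIC ID is the only per-CPU
   * value available that early. */
  void *ap_stack_top(uint32_t apic_id)
  {
      return ap_stacks[apic_id] + AP_STACK_SIZE;
  }
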
12 months agox86/tpm.c: code for early hashing and extending PCRs (for TPM1.2)
Krystian Hebel [Fri, 21 Oct 2022 16:46:33 +0000 (18:46 +0200)]
x86/tpm.c: code for early hashing and extending PCRs (for TPM1.2)

This file is built twice: once for early 32-bit mode without paging,
to measure the MBI, and once for 64-bit code, to measure the dom0
kernel and initramfs. Since the MBI is small, the first case uses the
TPM to do the hashing. The kernel and initramfs, on the other hand,
are too big; sending them to the TPM would take multiple minutes.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
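
The size-based split amounts to choosing a hashing path, as in this
sketch (all helper names and the threshold are hypothetical):

  #include <stddef.h>
  #include <stdint.h>

  extern void tpm_hash_extend(unsigned int pcr, const void *buf, size_t len);
  extern void sha1(const void *buf, size_t len, uint8_t digest[20]);
  extern void tpm_extend(unsigned int pcr, const uint8_t digest[20]);

  #define TPM_HASH_LIMIT (64 * 1024)

  /* Small buffers (the MBI) are streamed to the TPM, which hashes and
   * extends in one operation; large ones (kernel, initramfs) are
   * hashed in software so only the 20-byte SHA1 digest goes to the TPM. */
  void measure(unsigned int pcr, const void *buf, size_t len)
  {
      if ( len <= TPM_HASH_LIMIT )
          tpm_hash_extend(pcr, buf, len);
      else
      {
          uint8_t digest[20];

          sha1(buf, len, digest);
          tpm_extend(pcr, digest);
      }
  }
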
12 months agox86/sha1.c: add file
Krystian Hebel [Tue, 25 Oct 2022 14:04:17 +0000 (16:04 +0200)]
x86/sha1.c: add file

The file comes from [1] and is licensed under the MIT License. Only the
changes needed to make it compile under Xen and to swap the endianness
of the result were made to the original file.

[1] https://www.nayuki.io/page/fast-sha1-hash-implementation-in-x86-assembly

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/intel_txt.c: restore boot MTRRs
Krystian Hebel [Wed, 19 Oct 2022 17:52:24 +0000 (19:52 +0200)]
x86/intel_txt.c: restore boot MTRRs

In preparation for the TXT SENTER call, GRUB had to modify the MTRR
settings to be UC for everything except the SINIT ACM. The old values
are restored from the SLRT, where they were saved by the bootloader.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoxen/arch/x86: reserve TXT memory
Kacper Stojek [Fri, 2 Sep 2022 06:11:43 +0000 (08:11 +0200)]
xen/arch/x86: reserve TXT memory

The TXT heap is marked as reserved in the e820 map to protect it
against being allocated and overwritten.

Signed-off-by: Kacper Stojek <kacper.stojek@3mdeb.com>
Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agoinclude/xen/slr_table.h: Secure Launch Resource Table definitions
Sergii Dmytruk [Sat, 28 Oct 2023 21:29:30 +0000 (00:29 +0300)]
include/xen/slr_table.h: Secure Launch Resource Table definitions

The file provides constants, structures and several helper functions
for parsing the SLRT.

slr_add_entry() and slr_init_table() were omitted to avoid issues with
memcpy() usage (it comes from different places for different
translation units).

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/boot/txt_early: add early TXT tests and restore MBI pointer
Krystian Hebel [Mon, 17 Apr 2023 18:09:54 +0000 (20:09 +0200)]
x86/boot/txt_early: add early TXT tests and restore MBI pointer

These tests validate that important parts of memory are protected
against DMA attacks, including Xen and the MBI. Modules can be tested
later, when it is possible to report issues to the user before
invoking a TXT reset.

TPM event log validation is temporarily disabled due to an issue with
its allocation by the bootloader (GRUB), which will need to be modified
to address this. Ultimately the event log will also have to be
validated early, as it is used immediately after these tests to hold
the MBI measurements. See the larger comment in verify_pmr_ranges().

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months agox86/boot: add MLE header and new entry point
Kacper Stojek [Wed, 31 Aug 2022 13:03:51 +0000 (15:03 +0200)]
x86/boot: add MLE header and new entry point

The MLE header is used with Intel TXT, together with the MB2 headers.
The entry point is different, but it is used just to differentiate
from other entry paths by moving a magic number into EAX. The
execution environment is similar to that of Multiboot 2, and the code
falls through to MB2's entry point.

Signed-off-by: Kacper Stojek <kacper.stojek@3mdeb.com>
Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agox86/include/asm/intel_txt.h: constants and accessors for TXT registers and heap
Krystian Hebel [Mon, 17 Apr 2023 18:10:13 +0000 (20:10 +0200)]
x86/include/asm/intel_txt.h: constants and accessors for TXT registers and heap

The file contains the TXT register spaces' base addresses, register
offsets, error codes, and inline functions for accessing structures
stored on the TXT heap.

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
12 months ago.github/workflows/build.yml: build QubesOS package
Sergii Dmytruk [Sat, 29 Jul 2023 21:19:23 +0000 (00:19 +0300)]
.github/workflows/build.yml: build QubesOS package

Signed-off-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
12 months agoTEST: applied Qubes patches
Krystian Hebel [Fri, 26 Apr 2024 16:05:25 +0000 (18:05 +0200)]
TEST: applied Qubes patches

Signed-off-by: Krystian Hebel <krystian.hebel@3mdeb.com>
13 months agoRelease: Update CHANGELOG.md
George Dunlap [Tue, 9 Apr 2024 15:48:56 +0000 (16:48 +0100)]
Release: Update CHANGELOG.md

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
13 months agoUpdate Xen version to 4.17.4
Andrew Cooper [Wed, 27 Mar 2024 18:23:18 +0000 (18:23 +0000)]
Update Xen version to 4.17.4

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
13 months agox86/spec-ctrl: Support the "long" BHB loop sequence
Andrew Cooper [Fri, 22 Mar 2024 19:29:34 +0000 (19:29 +0000)]
x86/spec-ctrl: Support the "long" BHB loop sequence

Out of an abundance of caution, implement the long loop too, and allow
for it to be opted in to.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit d5887c0decbd90e798b24ed696628645b04632fb)

13 months agox86/spec-ctrl: Wire up the Native-BHI software sequences
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Wire up the Native-BHI software sequences

In the absence of BHI_DIS_S, mitigating Native-BHI requires the use of a
software sequence.

Introduce a new bhb-seq= option to select between the available
sequences, and bhb-entry= to control the per-PV/HVM actions like we
have for other blocks.

Activate the short sequence by default for PV and HVM guests on affected
hardware if BHI_DIS_S isn't present.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 689ad48ce9cf4c38297cd126e7e003a1c13a3b9d)

13 months agox86/spec-ctrl: Software BHB-clearing sequences
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Software BHB-clearing sequences

Implement clear_bhb_{tsx,loops}() as per the BHI guidance.  The loops variant
is set up as the "short" sequence.

Introduce SCF_entry_bhb and extend SPEC_CTRL_ENTRY_* with a conditional call
to selected clearing routine.

Note that due to a limitation in the ALTERNATIVE capability, the TEST/JZ can't
be included alongside a CALL in a single alternative block.  This is going to
require further work to untangle.

The BHB sequences (if used) must be after the restoration of Xen's
MSR_SPEC_CTRL value, which must be accounted for when judging whether it is
safe to skip the safety LFENCEs.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 954c983abceee97bf5f6230b9ae164f2c49a9aa9)

13 months agox86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI
Andrew Cooper [Tue, 26 Mar 2024 19:01:37 +0000 (19:01 +0000)]
x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI

Introduce a "bhi-dis-s" boolean to match the other options we have for
MSR_SPEC_CTRL values.  Also introduce bhi_calculations().

Use BHI_DIS_S whenever possible.

Guests which are levelled to be migration compatible with older CPUs can't see
BHI_DIS_S, and Xen must fill in the difference to make the guest safe.  Use
the virt MSR_SPEC_CTRL infrastructure to force BHI_DIS_S behind the guest's
back.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 62a1106415c5e8a49b45147ca84d54a58d471343)

13 months agox86/tsx: Expose RTM_ALWAYS_ABORT to guests
Andrew Cooper [Sat, 6 Apr 2024 19:36:54 +0000 (20:36 +0100)]
x86/tsx: Expose RTM_ALWAYS_ABORT to guests

A TSX Abort is one option to mitigate Native-BHI, but a guest kernel
doesn't get to see this if Xen has turned RTM off using
MSR_TSX_{CTRL,FORCE_ABORT}.

Therefore, the meaning of RTM_ALWAYS_ABORT has been adjusted to "XBEGIN won't
fault", and it should be exposed to guests so they can make a better decision.

Expose it in the max policy for any RTM-capable system.  Offer it by default
only if RTM has been disabled.

Update test-tsx to account for this new meaning.  While adjusting the
logic in test_guest_policies(), take the opportunity to use feature
names (now that they're available) to make the logic easier to follow.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c94e2105924347de0d9f32065370e802a20cc829)

13 months agox86: Drop INDIRECT_JMP
Andrew Cooper [Fri, 22 Dec 2023 18:01:37 +0000 (18:01 +0000)]
x86: Drop INDIRECT_JMP

Indirect JMPs which are not tailcalls can lead to an unwelcome form of
speculative type confusion, and we've removed the uses of INDIRECT_JMP to
compensate.  Remove the temptation to reintroduce new instances.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0b66d7ce3c0290eaad28bdafb35200052d012b14)

13 months agox86: Use indirect calls in reset-stack infrastructure
Andrew Cooper [Fri, 22 Dec 2023 17:44:48 +0000 (17:44 +0000)]
x86: Use indirect calls in reset-stack infrastructure

Mixing up JMP and CALL indirect targets leads to a very fun form of
speculative type confusion.  A target which is expecting to be CALLed
needs a return address on the stack, and an indirect JMP doesn't place
one there.

An indirect JMP which predicts to a target intending to be CALLed can
end up with a RET speculatively executing with a value from the JMPer's
stack frame.

There are several ways to get indirect JMPs in Xen.

 * From tailcall optimisations.  These are safe because the compiler has
   arranged the stack to point at the callee's return address.

 * From jump tables.  These are unsafe, but Xen is built with -fno-jump-tables
   to work around several compiler issues.

 * From reset_stack_and_jump_ind(), which is particularly unsafe.  Because of
   the additional stack adjustment made, the value picked up off the stack is
   regs->r15 of the next vCPU to run.

In order to mitigate this type confusion, we want to make all indirect targets
be CALL targets, and remove the use of indirect JMP except via tailcall
optimisation.

Luckily due to XSA-348, all C target functions of reset_stack_and_jump_ind()
are noreturn.  {svm,vmx}_do_resume() exits via reset_stack_and_jump(); a
direct JMP with entirely different prediction properties.  idle_loop() is an
infinite loop which eventually exits via reset_stack_and_jump_ind() from a new
schedule.  i.e. These paths are all fine having one extra return address on
the stack.

This leaves continue_pv_domain(), which is expecting to be a JMP target.
Alter it to strip the return address off the stack, which is safe because
there isn't actually a RET expecting to return to its caller.

This allows us to change reset_stack_and_jump_ind() to
reset_stack_and_call_ind() in order to mitigate the speculative type
confusion.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8e186f98ce0e35d1754ec9299da41ec98873b65c)

13 months agox86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields
Andrew Cooper [Tue, 26 Mar 2024 22:43:18 +0000 (22:43 +0000)]
x86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields

Right now, they're all bytes, but MSR_SPEC_CTRL has been steadily gaining new
features.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45dac88e78e8a2d9d8738eef884fe6730faf9e67)

13 months agox86/vmx: Add support for virtualize SPEC_CTRL
Roger Pau Monne [Thu, 15 Feb 2024 16:46:53 +0000 (17:46 +0100)]
x86/vmx: Add support for virtualize SPEC_CTRL

The feature is defined in the tertiary exec control, and is available starting
from Sapphire Rapids and Alder Lake CPUs.

When enabled, two extra VMCS fields are used: SPEC_CTRL mask and shadow.  Bits
set in mask are not allowed to be toggled by the guest (either set or clear)
and the value in the shadow field is the value the guest expects to be in the
SPEC_CTRL register.

By using it, the hypervisor can force the value of SPEC_CTRL bits
behind the guest's back without having to trap all accesses to
SPEC_CTRL; note that no bits are forced into the guest as part of this
patch.  It also allows getting rid of SPEC_CTRL in the guest MSR load
list, since the value in the shadow field will be loaded by the
hardware on vmentry.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 97c5b8b657e41a6645de9d40713b881234417b49)

13 months agox86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*
Andrew Cooper [Mon, 25 Mar 2024 11:09:35 +0000 (11:09 +0000)]
x86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*

The complexity is getting out of hand.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 40dea83b75386cb693481cf340024ce093be5c0f)

13 months agox86/spec-ctrl: Simplify DO_COND_IBPB
Andrew Cooper [Fri, 22 Mar 2024 14:33:17 +0000 (14:33 +0000)]
x86/spec-ctrl: Simplify DO_COND_IBPB

With the prior refactoring, SPEC_CTRL_ENTRY_{PV,INTR} both load SCF into %ebx,
and handle the conditional safety including skipping if interrupting Xen.

Therefore, we can drop the maybexen parameter and the conditional safety.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2378d16a931de0e62c03669169989e9437306abe)

13 months agox86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}
Andrew Cooper [Fri, 22 Mar 2024 12:08:02 +0000 (12:08 +0000)]
x86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}

... as we do in the exit paths too.  This will allow simplification to the
sub-blocks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9607aeb6602b8ed9962404de3f5f90170ffddb66)

13 months agox86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV
Andrew Cooper [Fri, 22 Mar 2024 15:52:06 +0000 (15:52 +0000)]
x86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV

Other SPEC_CTRL_* paths already use %r14 like this, and it will allow for
simplifications.

All instances of SPEC_CTRL_ENTRY_FROM_PV are followed by a GET_STACK_END()
invocation, so this change is only really logic and register shuffling.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22390697bf1b4cd3024f2d10893dec3c3ec08a9c)

13 months agox86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*
Andrew Cooper [Fri, 22 Mar 2024 11:41:41 +0000 (11:41 +0000)]
x86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*

Right now, we have a mix of safety strategies in different blocks, making the
logic fragile and hard to follow.

Start addressing this by having a safety LFENCE at the end of the blocks,
which can be patched out if other safety criteria are met.  This will allow us
to simplify the sub-blocks.  For SPEC_CTRL_ENTRY_FROM_IST, simply leave an
LFENCE unconditionally at the end; the IST path is not a fast-path by any
stretch of the imagination.

For SPEC_CTRL_ENTRY_FROM_INTR, the existing description was incorrect.  The
IRET #GP path is non-fatal but can occur with the guest's choice of
MSR_SPEC_CTRL.  It is safe to skip the flush/barrier-like protections when
interrupting Xen, but we must run DO_SPEC_CTRL_ENTRY irrespective.

This will skip RSB stuffing which was previously unconditional even when
interrupting Xen.

AFAICT, this is a missing cleanup from commit 3fffaf9c13e9 ("x86/entry: Avoid
using alternatives in NMI/#MC paths") where we split the IST entry path out of
the main INTR entry path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 94896de1a98c4289fe6fef9e16ef99fc6ef2efc4)

13 months agox86/spec-ctrl: Rename spec_ctrl_flags to scf
Andrew Cooper [Thu, 28 Mar 2024 11:57:25 +0000 (11:57 +0000)]
x86/spec-ctrl: Rename spec_ctrl_flags to scf

XSA-455 was ultimately caused by having fields with too-similar names.

Both {xen,last}_spec_ctrl are fields containing an architectural MSR_SPEC_CTRL
value.  The spec_ctrl_flags field contains Xen-internal flags.

To more-obviously distinguish the two, rename spec_ctrl_flags to scf, which is
also the prefix of the constants used by the fields.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c62673c4334b3372ebd4292a7ac8185357e7ea27)

13 months agox86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests
Andrew Cooper [Tue, 9 Apr 2024 14:03:05 +0000 (15:03 +0100)]
x86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests

All of these are prediction-mode (i.e. CPL) based.  They don't operate as
advertised in PV context.

Fixes: 4dd676070684 ("x86/spec-ctrl: Expose IPRED_CTRL to guests")
Fixes: 478e4787fa64 ("x86/spec-ctrl: Expose RRSBA_CTRL to guests")
Fixes: 583f1d095052 ("x86/spec-ctrl: Expose BHI_CTRL to guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 4b3da946ad7e3452761478ae683da842e7ff20d6)

13 months agox86/alternatives: fix .init section reference in _apply_alternatives()
Roger Pau Monné [Tue, 9 Apr 2024 12:50:46 +0000 (14:50 +0200)]
x86/alternatives: fix .init section reference in _apply_alternatives()

The code in _apply_alternatives() will unconditionally attempt to read
__initdata_cf_clobber_{start,end} when called as part of applying alternatives
to a livepatch payload when Xen is using IBT.

That leads to a page-fault as __initdata_cf_clobber_{start,end} living in
.init section will have been unmapped by the time a livepatch gets loaded.

Fix by adding a check that limits the clobbering of endbr64 instructions to
boot time only.

Fixes: 37ed5da851b8 ('x86/altcall: Optimise away endbr64 instruction where possible')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4be1fef1e6572c2be0bd378902ffb62a6e73faeb)

13 months agox86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch
Andrew Cooper [Wed, 3 Apr 2024 16:43:42 +0000 (17:43 +0100)]
x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch

It turns out there is something wonky on some but not all CPUs with
MSR_TSX_FORCE_ABORT.  The presence of RTM_ALWAYS_ABORT causes Xen to think
it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions
genuinely #UD.

Spot this case and try to back out as cleanly as we can.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b33f191e3ca99458fdcea1cb5a29dfa4965d1604)

13 months agox86/spec-ctrl: Move __read_mostly data into __ro_after_init
Andrew Cooper [Thu, 28 Mar 2024 12:38:32 +0000 (12:38 +0000)]
x86/spec-ctrl: Move __read_mostly data into __ro_after_init

These variables predate the introduction of __ro_after_init, but all qualify.
Update them to be consistent with the rest of the file.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7a09966e7b2823b70f6d56d0cf66c11124f4a3c1)

13 months agoVMX: tertiary execution control infrastructure
Jan Beulich [Wed, 7 Feb 2024 12:46:11 +0000 (13:46 +0100)]
VMX: tertiary execution control infrastructure

This is a prereq to enabling e.g. the MSRLIST feature.

Note that the PROCBASED_CTLS3 MSR is different from other VMX feature
reporting MSRs, in that all 64 bits report allowed 1-settings.

vVMX code is left alone, though, for the time being.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 878159bf259bfbd7a40312829f1ea0ce1f6645e2)

13 months agox86/CPU: convert vendor hook invocations to altcall
Jan Beulich [Mon, 5 Feb 2024 09:48:11 +0000 (10:48 +0100)]
x86/CPU: convert vendor hook invocations to altcall

While not performance critical, these hook invocations still want
converting: This way all pre-filled struct cpu_dev instances can become
__initconst_cf_clobber, thus allowing to eliminate further 8 ENDBR
during the 2nd phase of alternatives patching (besides moving previously
resident data to .init.*).

Since all use sites need touching anyway, take the opportunity and also
address a Misra C:2012 Rule 5.5 violation: Rename the this_cpu static
variable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 660f8a75013c947fbe5358a640032a1f9f1eece5)

13 months agox86/guest: finish conversion to altcall
Jan Beulich [Mon, 5 Feb 2024 09:45:31 +0000 (10:45 +0100)]
x86/guest: finish conversion to altcall

While .setup() and .e820_fixup() don't need fiddling with for being run
only very early, both .ap_setup() and .resume() want converting too:
This way both pre-filled struct hypervisor_ops instances can become
__initconst_cf_clobber, thus allowing to eliminate up to 5 more ENDBR
(configuration dependent) during the 2nd phase of alternatives patching.

While fiddling with section annotations here, also move "ops" itself to
.data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul@xen.org>
(cherry picked from commit e931edccc53c9dd6e9a505ad0ff3a03d985669bc)

13 months agox86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()
Jan Beulich [Mon, 5 Feb 2024 09:44:46 +0000 (10:44 +0100)]
x86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()

While altcall is already used for them, the functions want announcing in
.init.rodata.cf_clobber, even if the resulting static variables aren't
otherwise used.

While doing this also move ctxt_switch_masking to .data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 044168fa3a65b6542bda5c21e373742de1bd5980)

13 months agox86/spec-ctrl: Expose BHI_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests

The CPUID feature bit signals the presence of the BHI_DIS_S control in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 583f1d0950529f3517b1741c2b21a028a82ba831)

13 months agox86/spec-ctrl: Expose RRSBA_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests

The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 478e4787fa64b621061177a7843c452e9a19916d)

13 months agox86/spec-ctrl: Expose IPRED_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests

The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4dd6760706848de30f7c8b5f83462b9bcb070c91)

13 months agoIRQ: generalize [gs]et_irq_regs()
Jan Beulich [Tue, 23 Jan 2024 11:03:23 +0000 (12:03 +0100)]
IRQ: generalize [gs]et_irq_regs()

Move functions (and their data) to common code, and invoke the functions
on Arm as well. This is in preparation of dropping the register
parameters from handler functions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit f67bddf3bccd99a5fee968c3b3f288db6a57d3be)

13 months agox86/MCE: switch some callback invocations to altcall
Jan Beulich [Mon, 22 Jan 2024 12:41:07 +0000 (13:41 +0100)]
x86/MCE: switch some callback invocations to altcall

While not performance critical, these hook invocations still would
better be converted: This way all pre-filled (and newly introduced)
struct mce_callback instances can become __initconst_cf_clobber, thus
allowing to eliminate another 9 ENDBR during the 2nd phase of
alternatives patching.

While this means registering callbacks a little earlier, doing so is
perhaps even advantageous, for having pointers be non-NULL earlier on.
Only one set of callbacks would ever be registered anyway, and
neither of the respective initialization functions can (subsequently)
fail.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 85ba4d050f9f3c4286164f21660ae88435b7e83c)

13 months agox86/MCE: separate BSP-only initialization
Jan Beulich [Mon, 22 Jan 2024 12:40:32 +0000 (13:40 +0100)]
x86/MCE: separate BSP-only initialization

Several function pointers are registered over and over again, when
setting them once on the BSP suffices. Arrange for this in the vendor
init functions and mark involved registration functions __init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 9f58616ddb1cc1870399de2202fafc7bf0d61694)

13 months agox86/PV: avoid indirect call for I/O emulation quirk hook
Jan Beulich [Mon, 22 Jan 2024 12:40:00 +0000 (13:40 +0100)]
x86/PV: avoid indirect call for I/O emulation quirk hook

This way ioemul_handle_proliant_quirk() won't need ENDBR anymore.

While touching this code, also
- arrange for it to not be built at all when !PV,
- add "const" to the last function parameter and bring the definition
  in sync with the declaration (for Misra).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1212af3e8c4d3a1350046d4fe0ca3b97b51e67de)

13 months agox86/MTRR: avoid several indirect calls
Jan Beulich [Mon, 22 Jan 2024 12:39:23 +0000 (13:39 +0100)]
x86/MTRR: avoid several indirect calls

The use of (supposedly) vendor-specific hooks is a relic from the days
when it was still possible to build Xen as a 32-bit binary. There's no
expectation that a new need for such an abstraction would arise. Convert
mtrr_if to a mere boolean and all prior calls through it to direct ones,
thus allowing to eliminate 6 ENDBR from .text.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9e0eb30d4d6565b411499ca826718b4b9acab68)

13 months agocore-parking: use alternative_call()
Jan Beulich [Mon, 22 Jan 2024 12:38:24 +0000 (13:38 +0100)]
core-parking: use alternative_call()

This way we can arrange for core_parking_{performance,power}()'s ENDBR
to also be zapped.

For the decision to be taken before the 2nd alternative patching pass,
the initcall needs to become a pre-SMP one, though.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1bc07ebcac3b1bb2a378732bc0f9a19940e76faf)

13 months agox86/HPET: avoid an indirect call
Jan Beulich [Wed, 17 Jan 2024 09:43:02 +0000 (10:43 +0100)]
x86/HPET: avoid an indirect call

When this code was written, indirect branches still weren't considered
much of a problem (besides being a little slower). Instead of a function
pointer, pass a boolean to _disable_pit_irq(), thus allowing to
eliminate two ENDBR (one of them in .text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 730d2637a8e5b98dc8e4e366179b4cedc496b3ad)
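
The pattern is simply a flag selecting between two direct calls instead
of an indirect call through a passed-in pointer (helper names here are
hypothetical stand-ins):

  #include <stdbool.h>

  extern void hpet_broadcast_setup(void);  /* stand-ins for the two */
  extern void legacy_pic_setup(void);      /* possible callees */

  /* Before: the caller passed one of two function pointers, forcing an
   * indirect call (and ENDBR landing pads on both callees).  After: a
   * boolean picks one of two direct calls the compiler can see. */
  void disable_pit_irq(bool use_hpet)
  {
      if ( use_hpet )
          hpet_broadcast_setup();
      else
          legacy_pic_setup();
  }
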

13 months agocpufreq: finish conversion to altcall
Jan Beulich [Wed, 17 Jan 2024 09:42:27 +0000 (10:42 +0100)]
cpufreq: finish conversion to altcall

Even functions used on infrequently executed paths want converting: This
way all pre-filled struct cpufreq_driver instances can become
__initconst_cf_clobber, thus allowing to eliminate another 15 ENDBR
during the 2nd phase of alternatives patching.

For acpi-cpufreq's optionally populated .get hook make sure alternatives
patching can actually see the pointer. See also the code comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 467ae515caee491e9b6ae1da8b9b98d094955822)

13 months agox86/APIC: finish genapic conversion to altcall
Jan Beulich [Wed, 17 Jan 2024 09:41:52 +0000 (10:41 +0100)]
x86/APIC: finish genapic conversion to altcall

While .probe() doesn't need fiddling with for being run only very early,
init_apic_ldr() wants converting too despite not being on a frequently
executed path: This way all pre-filled struct genapic instances can
become __initconst_cf_clobber, thus allowing to eliminate 15 more ENDBR
during the 2nd phase of alternatives patching.

While fiddling with section annotations here, also move "genapic" itself
to .data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b1cc53753cba4c3253f2e1093a3a6a9a828314bf)

13 months agox86/spec-ctrl: Fix BTC/SRSO mitigations
Andrew Cooper [Tue, 26 Mar 2024 22:47:25 +0000 (22:47 +0000)]
x86/spec-ctrl: Fix BTC/SRSO mitigations

We were looking for SCF_entry_ibpb in the wrong variable in the top-of-stack
block, and xen_spec_ctrl won't have had bit 5 set because Xen doesn't
understand SPEC_CTRL_RRSBA_DIS_U yet.

This is XSA-455 / CVE-2024-31142.

Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
13 months agohypercall_xlat_continuation: Replace BUG_ON with domain_crash
Bjoern Doebel [Wed, 27 Mar 2024 18:30:55 +0000 (18:30 +0000)]
hypercall_xlat_continuation: Replace BUG_ON with domain_crash

Instead of crashing the host in case of unexpected hypercall parameters,
resort to only crashing the calling domain.

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9926e692c4afc40bcd66f8416ff6a1e93ce402f6)

13 months agox86/HVM: clear upper halves of GPRs upon entry from 32-bit code
Jan Beulich [Wed, 27 Mar 2024 17:31:38 +0000 (17:31 +0000)]
x86/HVM: clear upper halves of GPRs upon entry from 32-bit code

Hypercalls in particular can be the subject of continuations, and logic
there checks updated state against incoming register values. If the
guest manufactured a suitable argument register with a non-zero upper
half before entering compatibility mode and issuing a hypercall from
there, checks in hypercall_xlat_continuation() might trip.

Since for HVM we want to also be sure to not hit a corner case in the
emulator, initiate the clipping right from the top of
{svm,vmx}_vmexit_handler(). Also rename the invoked function, as it no
longer does only invalidation of fields.

Note that architecturally the upper halves of registers are undefined
after a switch between compatibility and 64-bit mode (either direction).
Hence, once having entered compatibility mode, the guest can't assume
that the upper half of any register retains its value.

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 6a98383b0877bb66ebfe189da43bf81abe3d7909)

13 months agoxen/livepatch: Fix .altinstructions safety checks
Andrew Cooper [Thu, 13 Apr 2023 19:56:15 +0000 (20:56 +0100)]
xen/livepatch: Fix .altinstructions safety checks

The prior check has && vs || mixups, making it tautologically false and thus
providing no safety at all.  There are boundary errors too.

First start with a comment describing how the .altinstructions and
.altinstr_replacement sections interact, and perform suitable cross-checking.

Second, rewrite the alt_instr loop entirely from scratch.  Origin sites
have non-zero size, and must be fully contained within the livepatch's
.text section(s).  Any non-zero sized replacements must be fully
contained within the .altinstr_replacement section.

Fixes: f8a10174e8b1 ("xsplice: Add support for alternatives")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
(cherry picked from commit e74360e4ba4a6b6827a44f8b1b22a0ec4311694a)
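
The corrected containment checks have this shape (a sketch, not the
patch itself):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  struct region { uintptr_t start, end; };   /* half-open: [start, end) */

  /* An origin site of length `len` at `addr` is acceptable only if it
   * has non-zero size and lies fully inside the region.  Note the
   * consistent use of && between the bounds: mixing in || is what
   * rendered the prior check tautological and therefore useless. */
  bool fully_contained(const struct region *r, uintptr_t addr, size_t len)
  {
      return len > 0 &&
             addr >= r->start && addr <= r->end &&
             len <= r->end - addr;    /* also guards against overflow */
  }
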

13 months agoarm/alternatives: Rename alt_instr fields which are used in common code
Andrew Cooper [Sun, 16 Apr 2023 00:10:43 +0000 (01:10 +0100)]
arm/alternatives: Rename alt_instr fields which are used in common code

Alternatives auditing for livepatches is currently broken.  To fix it, the
livepatch code needs to inspect more fields of alt_instr.

Rename ARM's fields to match x86's, because:

 * ARM already exposes alt_offset under the repl name via ALT_REPL_PTR().
 * "alt" is ambiguous in a structure entirely about alternatives already.
 * "repl", being the same width as orig leads to slightly neater code.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 418cf59c4e29451010d7efb3835b900690d19866)

13 months agotests/resource: Fix HVM guest in !SHADOW builds
Andrew Cooper [Tue, 2 Apr 2024 14:24:07 +0000 (16:24 +0200)]
tests/resource: Fix HVM guest in !SHADOW builds

Right now, test-resource always creates HVM Shadow guests.  But if Xen has
SHADOW compiled out, running the test yields:

  $./test-resource
  XENMEM_acquire_resource tests
  Test x86 PV
    Created d1
    Test grant table
  Test x86 PVH
    Skip: 95 - Operation not supported

and doesn't really test HVM guests, but doesn't fail either.

There's nothing paging-mode-specific about this test, so default to HAP if
possible and provide a more specific message if neither HAP nor Shadow is
available.

As we've got physinfo to hand, also provide a more specific message about
the absence of PV or HVM support.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0263dc9069ddb66335c72a159e09050b1600e56a
master date: 2024-03-01 20:14:19 +0000

13 months agox86/boot: Support the watchdog on newer AMD systems
Andrew Cooper [Tue, 2 Apr 2024 14:20:30 +0000 (16:20 +0200)]
x86/boot: Support the watchdog on newer AMD systems

The MSRs used by setup_k7_watchdog() are architectural in 64bit.  The Unit
Select (0x76, cycles not in halt state) isn't, but it hasn't changed in 25
years, making this a trend likely to continue.

Drop the family check.  If the Unit Select does happen to change meaning in
the future, check_nmi_watchdog() will still notice the watchdog not operating
as expected.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 131892e0dcc1265b621c2b7d844cb9e7c3a4404f
master date: 2024-03-19 18:29:37 +0000

13 months agox86/boot: Improve the boot watchdog determination of stuck cpus
Andrew Cooper [Tue, 2 Apr 2024 14:20:09 +0000 (16:20 +0200)]
x86/boot: Improve the boot watchdog determination of stuck cpus

Right now, check_nmi_watchdog() has two processing loops over all online CPUs
using prev_nmi_count as storage.

Use a cpumask_t instead (1/32nd as much initdata) and have wait_for_nmis()
make the determination of whether it is stuck, rather than having both
functions needing to agree on how many ticks mean stuck.

More importantly though, it means we can use the standard cpumask
infrastructure, including turning this:

  (XEN) Brought up 512 CPUs
  (XEN) Testing NMI watchdog on all CPUs: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511} stuck

into the rather more manageable:

  (XEN) Brought up 512 CPUs
  (XEN) Testing NMI watchdog on all CPUs: {0-511} stuck

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 9e18f339830c828798aef465556d4029d83476a0
master date: 2024-03-19 18:29:37 +0000

13 months agox86/livepatch: Relax permissions on rodata too
Andrew Cooper [Tue, 2 Apr 2024 14:19:36 +0000 (16:19 +0200)]
x86/livepatch: Relax permissions on rodata too

This reinstates the capability to patch .rodata in load/unload hooks, which
was lost when we stopped using CR0.WP=0 to patch.

This turns out to be rather less of a large TODO than I thought at the time.

Fixes: 8676092a0f16 ("x86/livepatch: Fix livepatch application when CET is active")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: b083b1c393dc8961acf0959b1d2e0ad459985ae3
master date: 2024-03-07 14:24:42 +0000

13 months agoxen/virtual-region: Include rodata pointers
Andrew Cooper [Tue, 2 Apr 2024 14:19:11 +0000 (16:19 +0200)]
xen/virtual-region: Include rodata pointers

These are optional.  .init doesn't distinguish types of data like this, and
livepatches don't necessarily have any .rodata either.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ef969144a425e39f5b214a875b5713d0ea8575fb
master date: 2024-03-07 14:24:42 +0000

13 months agoxen/virtual-region: Rename the start/end fields
Andrew Cooper [Tue, 2 Apr 2024 14:18:51 +0000 (16:18 +0200)]
xen/virtual-region: Rename the start/end fields

... to text_{start,end}.  We're about to introduce another start/end pair.

Despite its name, struct virtual_region has always been a module-ish
description.  Call this out specifically.

As minor cleanup, replace ROUNDUP(x, PAGE_SIZE) with the more concise
PAGE_ALIGN() ahead of duplicating the example.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 989556c6f8ca080f5f202417af97d1188b9ba52a
master date: 2024-03-07 14:24:42 +0000

13 months agox86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies
Andrew Cooper [Tue, 2 Apr 2024 14:18:05 +0000 (16:18 +0200)]
x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies

The block in recalculate_cpuid_policy() predates the proper split between
default and max policies, and was a "slightly max for a toolstack which knows
about it" capability.  It didn't get transformed properly in Xen 4.14.

Because Xen will accept a VM with HTT/CMP_LEGACY seen, they should be visible
in the max policies.  Keep the default policy matching host settings.

This manifested as an incorrectly-rejected migration across XenServer's Xen
4.13 -> 4.17 upgrade, as Xapi is slowly growing the logic to check a VM
against the target max policy.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2d8a652251660c3252d92b442e1a9c5d6e6a1e9
master date: 2024-03-01 20:14:19 +0000

13 months agox86/cpu-policy: Hide x2APIC from PV guests
Andrew Cooper [Tue, 2 Apr 2024 14:17:25 +0000 (16:17 +0200)]
x86/cpu-policy: Hide x2APIC from PV guests

PV guests can't write to MSR_APIC_BASE (in order to set EXTD), nor can they
access any of the x2APIC MSR range.  Therefore they mustn't see the x2APIC
CPUID bit saying that they can.

Right now, the host x2APIC flag filters into PV guests, meaning that PV guests
generally see x2APIC except on Zen1-and-older AMD systems.

Linux works around this by explicitly hiding the bit itself, and filtering
EXTD out of MSR_APIC_BASE reads.  NetBSD behaves more in the spirit of PV
guests, and entirely ignores the APIC when built as a PV guest.

Change the annotation from !A to !S.  This has a consequence of stripping it
out of both PV featuremasks.  However, as existing guests may have seen the
bit, set it back into the PV Max policy; a VM which saw the bit and is alive
enough to migrate will have ignored it one way or another.

Hiding x2APIC does change the contents of leaf 0xb, but as the information is
nonsense to begin with, this is likely an improvement on the status quo.

Xen's blind assumption that APIC_ID = vCPU_ID * 2 isn't interlinked with the
host's topology structure, where a PV guest may see real host values, and the
APIC_IDs are useless without an MADT to start with.  Dom0 is the only PV VM to
get an MADT but it's the host one, meaning the two sets of APIC_IDs are from
different address spaces.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5420aa165dfa5fe95dd84bb71cb96c15459935b1
master date: 2024-03-01 20:14:19 +0000

13 months agotools/oxenstored: Make Quota.t pure
Edwin Török [Wed, 31 Jan 2024 10:52:56 +0000 (10:52 +0000)]
tools/oxenstored: Make Quota.t pure

Now that we no longer have a hashtable inside we can make Quota.t pure, and
push the mutable update to its callers.  Store.t already had a mutable Quota.t
field.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit 098d868e52ac0165b7f36e22b767ea70cef70054)

13 months agotools/oxenstored: Use Map instead of Hashtbl for quotas
Edwin Török [Wed, 31 Jan 2024 10:52:55 +0000 (10:52 +0000)]
tools/oxenstored: Use Map instead of Hashtbl for quotas

On a stress test running 1000 VMs, flamegraphs have shown that
`oxenstored` spends a large amount of time in `Hashtbl.copy` and the GC.

Hashtable complexity:
 * read/write: O(1) average
 * copy: O(domains) -- copying the entire table

Map complexity:
 * read/write: O(log n) worst case
 * copy: O(1) -- a word copy

We always perform at least one 'copy' when processing each xenstore
packet (regardless of whether it is a readonly operation or inside a
transaction), so the actual complexity per packet is:
  * Hashtbl: O(domains)
  * Map: O(log domains)

Maps are the clear winner, and a better fit for the immutable xenstore
tree.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit b6cf604207fd0a04451a48f2ce6d05fb66c612ab)

13 months agox86/PoD: tie together P2M update and increment of entry count
Jan Beulich [Wed, 27 Mar 2024 11:29:33 +0000 (12:29 +0100)]
x86/PoD: tie together P2M update and increment of entry count

When not holding the PoD lock across the entire region covering P2M
update and stats update, the entry count - if to be incorrect at all -
should indicate too large a value in preference to a too small one, to
avoid functions bailing early when they find the count is zero. However,
instead of moving the increment ahead (and adjust back upon failure),
extend the PoD-locked region.

Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
master commit: cc950c49ae6a6690f7fc3041a1f43122c250d250
master date: 2024-03-21 09:48:10 +0100