xenbits.xensource.com Git - people/aperard/xen-unstable.git/log

5 months ago - update Xen version to 4.16.7 [stable-4.16]
Jan Beulich [Wed, 4 Dec 2024 11:11:49 +0000 (12:11 +0100)]
update Xen version to 4.16.7

5 months ago - libxl: Use zero-ed memory for PVH acpi tables
Jason Andryuk [Tue, 12 Nov 2024 13:15:02 +0000 (14:15 +0100)]
libxl: Use zero-ed memory for PVH acpi tables

xl/libxl memory is leaking into a PVH guest through uninitialized
portions of the ACPI tables.

Use libxl_zalloc() to obtain zero-ed memory to avoid this issue.
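
A minimal sketch of the pattern, with plain-C calloc() standing in for the
zeroing allocator and all names hypothetical:

    #include <stdlib.h>

    /* calloc() returns zeroed memory, so struct padding and unwritten
     * table fields can't leak stale heap contents into the guest, unlike
     * a plain malloc(). */
    void *build_acpi_table(size_t len)
    {
        void *t = calloc(1, len);   /* zeroed, as libxl_zalloc() would be */

        if ( t == NULL )
            return NULL;
        /* ... fill in the header and used fields; padding stays zero ... */
        return t;
    }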

This is XSA-464 / CVE-2024-45819.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Fixes: 14c0d328da2b ("libxl/acpi: Build ACPI tables for HVMlite guests")
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0bfe567b58f1182889dea9207103fc9d00baf414
master date: 2024-11-12 13:32:45 +0100

5 months agox86/HVM: drop stdvga's "lock" struct member
Jan Beulich [Tue, 12 Nov 2024 13:14:43 +0000 (14:14 +0100)]
x86/HVM: drop stdvga's "lock" struct member

No state is left to protect. As this was the last field, drop the struct
itself as well. Similarly, drop the .complete handler, which would
otherwise have been left empty.

This is part of XSA-463 / CVE-2024-45818

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b180a50326c8a2c171f37c1940a0fbbdcad4be90)

5 months agox86/HVM: drop stdvga's "vram_page[]" struct member
Jan Beulich [Tue, 12 Nov 2024 13:14:25 +0000 (14:14 +0100)]
x86/HVM: drop stdvga's "vram_page[]" struct member

No uses are left, hence its setup, teardown, and the field itself can
also go away. stdvga_deinit() is then empty and can be dropped as well.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3beb4baf2a0a2eef40d39eb7e6eecbfd36da5d14)

5 months agox86/HVM: drop stdvga's "{g,s}r_index" struct members
Jan Beulich [Tue, 12 Nov 2024 13:14:08 +0000 (14:14 +0100)]
x86/HVM: drop stdvga's "{g,s}r_index" struct members

No consumers are left, hence the producer and the fields themselves can
also go away. stdvga_outb() is then useless, rendering stdvga_out()
useless as well. Hence the entire I/O port intercept can go away.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 86c03372e107f5c18266a62281663861b1144929)

5 months agox86/HVM: drop stdvga's "sr[]" struct member
Jan Beulich [Tue, 12 Nov 2024 13:13:51 +0000 (14:13 +0100)]
x86/HVM: drop stdvga's "sr[]" struct member

No consumers are left, hence the producer and the array itself can also
go away. The static sr_mask[] is then orphaned and hence needs dropping,
too.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 7aba44bdd78aedb97703811948c3b69ccff85032)

5 months agox86/HVM: drop stdvga's "gr[]" struct member
Jan Beulich [Tue, 12 Nov 2024 13:13:35 +0000 (14:13 +0100)]
x86/HVM: drop stdvga's "gr[]" struct member

No consumers are left, hence the producer and the array itself can also
go away. The static gr_mask[] is then orphaned and hence needs dropping,
too.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b16c0966a17f19c0e55ed0b9baa28191d2590178)

5 months ago - x86/HVM: remove unused MMIO handling code
Jan Beulich [Tue, 12 Nov 2024 13:13:20 +0000 (14:13 +0100)]
x86/HVM: remove unused MMIO handling code

All read accesses are rejected by the ->accept handler, while writes
bypass the bulk of the function body. Drop the dead code, leaving an
assertion in the read handler.

A number of other static items (and a macro) are then unreferenced and
hence also need (want) dropping. The same applies to the "latch" field
of the state structure.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 89108547af1f230b72893b48351f9c1106189649)

5 months agox86/HVM: drop stdvga's "stdvga" struct member
Jan Beulich [Tue, 12 Nov 2024 13:12:45 +0000 (14:12 +0100)]
x86/HVM: drop stdvga's "stdvga" struct member

Two of its consumers are dead (in compile-time constant conditionals)
and the only remaining ones merely control debug logging. Hence the
field is now pointless to set, which in particular allows getting rid
of the questionable conditional from which the field's value was
established (afaict 551ceee97513 ["x86, hvm: stdvga cache always on"]
had dropped too much of the earlier extra check that was there, and
quite likely further checks were missing).

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b740a9369e81bdda675a9780130ce2b9e75d4ec9)

5 months ago - x86/HVM: properly reject "indirect" VRAM writes
Jan Beulich [Tue, 12 Nov 2024 13:12:18 +0000 (14:12 +0100)]
x86/HVM: properly reject "indirect" VRAM writes

While ->count will only be different from 1 for "indirect" (data in
guest memory) accesses, it being 1 does not exclude the request being an
"indirect" one. Check both to be on the safe side, and bring the ->count
part also in line with what ioreq_send_buffered() actually refuses to
handle.
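
The shape of such a check, sketched with a stand-in for ioreq_t (field
names as commonly used; treat as illustrative, not the literal patch):

    struct req { unsigned char data_is_ptr; unsigned int count; };

    /* "Indirect" means the data lives in guest memory.  count == 1 alone
     * does not rule that out, so test both conditions, matching what
     * ioreq_send_buffered() refuses to handle. */
    static int accept_vram_write(const struct req *p)
    {
        return !p->data_is_ptr && p->count == 1;
    }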

This is part of XSA-463 / CVE-2024-45818

Fixes: 3bbaaec09b1b ("x86/hvm: unify stdvga mmio intercept with standard mmio intercept")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit eb7cd0593d88c4b967a24bca8bd30591966676cd)

5 months agox86/HVM: drop stdvga's "cache" struct member
Jan Beulich [Tue, 12 Nov 2024 13:10:47 +0000 (14:10 +0100)]
x86/HVM: drop stdvga's "cache" struct member

Since 68e1183411be ("libxc: introduce a xc_dom_arch for hvm-3.0-x86_32
guests"), HVM guests are built using XEN_DOMCTL_sethvmcontext, which
ends up disabling stdvga caching because of arch_hvm_load() being
involved in the processing of the request. With that the field is
useless, and can be dropped. Drop the helper functions manipulating /
checking as well right away, but leave the use sites of
stdvga_cache_is_enabled() with the hard-coded result the function would
have produced, to aid validation of subsequent dropping of further code.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 53b7246bdfb3c280adcdf714918e4decb7e108f4)

5 months ago - CI: Resync .cirrus.yml for FreeBSD testing
Andrew Cooper [Mon, 11 Nov 2024 17:05:03 +0000 (17:05 +0000)]
CI: Resync .cirrus.yml for FreeBSD testing

Includes:
  commit ebb7c6b2faf2 ("cirrus-ci: update to FreeBSD 14.1 image")
  commit 5ea7f2c9d7a1 ("CI: Update FreeBSD to 13.3")
  commit c2ce3466472e ("CirrusCI: drop FreeBSD 12")

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

5 months ago - stubdom: Fix newlib build with GCC-14
Andrew Cooper [Tue, 29 Oct 2024 15:39:22 +0000 (16:39 +0100)]
stubdom: Fix newlib build with GCC-14

Based on a fix from OpenSUSE, but adjusted to be Clang-compatible too.  Pass
-Wno-implicit-function-declaration library-wide rather than using local GCC
pragmas.

Fix of copy_past_newline() to avoid triggering -Wstrict-prototypes.

Link: https://build.opensuse.org/request/show/1178775
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 444cb9350f2c1cc202b6b86176ddd8e57525e2d9
master date: 2024-10-03 10:07:25 +0100

5 months ago - stubdom: fix errors in newlib:makedoc
Olaf Hering [Wed, 26 Apr 2023 10:52:39 +0000 (10:52 +0000)]
stubdom: fix errors in newlib:makedoc

rpm post-build-checks found a few code bugs in newlib, and marks them as
errors. Add another newlib patch and apply it during stubdom build.

[  227s] ../../../../newlib-1.16.0/newlib/doc/makedoc.c: In function 'lookup_word':
[  227s] ../../../../newlib-1.16.0/newlib/doc/makedoc.c:1147:10: warning: implicit declaration of function 'strcmp' [-Wimplicit-function-declaration]
[  227s]       if (strcmp(ptr->word, word) == 0) return ptr;
[  227s]           ^

[  460s] I: Program is using implicit definitions of special functions.
[  460s]    these functions need to use their correct prototypes to allow
[  460s]    the lightweight buffer overflow checking to work.
[  460s]      - Implicit memory/string functions need #include <string.h>.
[  460s]      - Implicit *printf functions need #include <stdio.h>.
[  460s]      - Implicit *printf functions need #include <stdio.h>.
[  460s]      - Implicit *read* functions need #include <unistd.h>.
[  460s]      - Implicit *recv* functions need #include <sys/socket.h>.
[  460s] E: xen implicit-fortify-decl ../../../../newlib-1.16.0/newlib/doc/makedoc.c:1147

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
(cherry picked from commit dde20f7dc182fdfeeb6c55648979326bb982ca8c)

5 months ago - stubdom: fix errors in newlib:cygmon-gmon.c
Olaf Hering [Wed, 26 Apr 2023 10:51:56 +0000 (10:51 +0000)]
stubdom: fix errors in newlib:cygmon-gmon.c

rpm post-build-checks found a few code bugs in newlib, and marks them as
errors. Add another newlib patch and apply it during stubdom build.

I: A function uses a 'return;' statement, but has actually a value
   to return, like an integer ('return 42;') or similar.
W: xen voidreturn ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:117, 125, 146, 157, 330

I: Program is using implicit definitions of special functions.
   these functions need to use their correct prototypes to allow
   the lightweight buffer overflow checking to work.
     - Implicit memory/string functions need #include <string.h>.
     - Implicit *printf functions need #include <stdio.h>.
     - Implicit *printf functions need #include <stdio.h>.
     - Implicit *read* functions need #include <unistd.h>.
     - Implicit *recv* functions need #include <sys/socket.h>.
E: xen implicit-fortify-decl ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:119

I: Program returns random data in a function
E: xen no-return-in-nonvoid-function ../../../../newlib-1.16.0/libgloss/i386/cygmon-gmon.c:362

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
(cherry picked from commit 860fb990bd208f590b78d938ba874e867e1c2986)

5 months ago - CI: Mark Archlinux/x86 as allowing failures
Andrew Cooper [Wed, 10 Jul 2024 12:38:52 +0000 (13:38 +0100)]
CI: Mark Archlinux/x86 as allowing failures

Archlinux is a rolling distro.  As a consequence, rebuilding the container
periodically changes the toolchain, and this affects all stable branches in
one go.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit 5e1773dc863d6e1fb4c1398e380bdfc754342f7b)

5 months ago - Config: Update MiniOS revision
Andrew Cooper [Wed, 30 Oct 2024 18:00:22 +0000 (18:00 +0000)]
Config: Update MiniOS revision

Commit a60612a2d009 ("mman: correct m{,un}lock() definitions")

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

7 months ago - x86/vLAPIC: prevent undue recursion of vlapic_error()
Jan Beulich [Tue, 24 Sep 2024 13:09:45 +0000 (15:09 +0200)]
x86/vLAPIC: prevent undue recursion of vlapic_error()

With the error vector set to an illegal value, the function invoking
vlapic_set_irq() would bring execution back here, with the non-recursive
lock already held. Avoid the call in this case, merely further updating
ESR (if necessary).
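
A hedged sketch of the guard (helper names hypothetical): always record
the error in ESR, but only inject the LVT error vector when it is legal,
since an illegal vector would re-enter this path with the non-recursive
lock still held:

    static void vlapic_error(struct vlapic *vlapic, unsigned int errmask)
    {
        spin_lock(&vlapic->esr_lock);
        esr_set_bits(vlapic, errmask);           /* always update ESR */
        if ( lvt_error_vector(vlapic) >= 16 )    /* vectors 0-15 are illegal */
            vlapic_set_irq(vlapic, lvt_error_vector(vlapic), 0);
        spin_unlock(&vlapic->esr_lock);
    }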

This is XSA-462 / CVE-2024-45817.

Fixes: 5f32d186a8b1 ("x86/vlapic: don't silently accept bad vectors")
Reported-by: Federico Serafini <federico.serafini@bugseng.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c42d9ec61f6d11e25fa77bd44dd11dad1edda268
master date: 2024-09-24 14:23:29 +0200

8 months ago - x86/pass-through: documents as security-unsupported when sharing resources
Jan Beulich [Tue, 13 Aug 2024 14:52:44 +0000 (16:52 +0200)]
x86/pass-through: documents as security-unsupported when sharing resources

When multiple devices share resources and one of them is to be passed
through to a guest, security of the entire system and of respective
guests individually cannot really be guaranteed without knowing
internals of any of the involved guests.  Therefore such a configuration
cannot really be security-supported, yet an explicit statement to that
effect was so far missing.

This is XSA-461 / CVE-2024-31146.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 9c94eda1e3790820699a6de3f6a7c959ecf30600
master date: 2024-08-13 16:37:25 +0200

8 months ago - x86/IOMMU: move tracking in iommu_identity_mapping()
Teddy Astie [Tue, 13 Aug 2024 14:52:06 +0000 (16:52 +0200)]
x86/IOMMU: move tracking in iommu_identity_mapping()

If for some reason xmalloc() fails after having mapped the reserved
regions, an error is reported, but the regions remain mapped in the P2M.

Similarly if an error occurs during set_identity_p2m_entry() (except on
the first call), the partial mappings of the region would be retained
without being tracked anywhere, and hence without there being a way to
remove them again from the domain's P2M.

Move the setting up of the list entry ahead of trying to map the region.
In cases other than the first mapping failing, keep record of the full
region, such that a subsequent unmapping request can be properly torn
down.

To compensate for the potentially excess unmapping requests, don't log a
warning from p2m_remove_identity_entry() when there really was nothing
mapped at a given GFN.
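
Illustrative ordering under the fix (types and helpers hypothetical):
record the region before mapping it, so partial failures remain tracked
and can be torn down by a later unmapping request:

    struct identity_map *map = xmalloc(struct identity_map);

    if ( !map )
        return -ENOMEM;
    map->base = base;
    map->end = end;
    list_add_tail(&map->list, &hd->identity_maps);   /* tracked up front */

    for ( pfn = PFN_DOWN(base); pfn <= PFN_DOWN(end); pfn++ )
    {
        int err = set_identity_p2m_entry(d, pfn, p2ma, flag);

        if ( err )
            return err;   /* full region stays recorded for later unmap */
    }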

This is XSA-460 / CVE-2024-31145.

Fixes: 2201b67b9128 ("VT-d: improve RMRR region handling")
Fixes: c0e19d7c6c42 ("IOMMU: generalize VT-d's tracking of mapped RMRR regions")
Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: beadd68b5490ada053d72f8a9ce6fd696d626596
master date: 2024-08-13 16:36:40 +0200

9 months ago - x86/IRQ: avoid double unlock in map_domain_pirq()
Jan Beulich [Tue, 16 Jul 2024 12:16:27 +0000 (14:16 +0200)]
x86/IRQ: avoid double unlock in map_domain_pirq()

Forever since its introduction the main loop in the function dealing
with multi-vector MSI had error exit points ("break") with different
properties: In one case no IRQ descriptor lock is being held.
Nevertheless the subsequent error cleanup path assumed such a lock would
uniformly need releasing. Identify the case by setting "desc" to NULL,
thus allowing the unlock to be skipped as necessary.
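
The pattern, sketched with hypothetical helpers (NULL marks the exit
points taken without the descriptor lock):

    while ( nr_vectors-- )
    {
        if ( early_error() )
        {
            desc = NULL;            /* no lock held on this exit */
            break;
        }
        desc = lock_irq_desc(irq);  /* returns with desc->lock held */
        if ( late_error() )
            break;                  /* lock still held on this exit */
    }
    if ( rc && desc )
        spin_unlock_irqrestore(&desc->lock, flags);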

This is CVE-2024-31143 / XSA-458.

Coverity ID: 1605298
Fixes: d1b6d0a02489 ("x86: enable multi-vector MSI")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 57338346f29cea7b183403561bdc5f407163b846
master date: 2024-07-16 14:09:14 +0200

12 months ago - x86/spec: adjust logic that elides lfence
Roger Pau Monné [Mon, 29 Apr 2024 08:20:07 +0000 (10:20 +0200)]
x86/spec: adjust logic that elides lfence

It's currently too restrictive by just checking whether there's a BHB clearing
sequence selected.  It should instead check whether BHB clearing is used on
entry from PV or HVM specifically.

Switch to use opt_bhb_entry_{pv,hvm} instead, and then remove cpu_has_bhb_seq
since it no longer has any users.

Reported-by: Jan Beulich <jbeulich@suse.com>
Fixes: 954c983abcee ('x86/spec-ctrl: Software BHB-clearing sequences')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 656ae8f1091bcefec9c46ec3ea3ac2118742d4f6
master date: 2024-04-25 16:37:01 +0200

12 months ago - x86/spec: fix reporting of BHB clearing usage from guest entry points
Roger Pau Monné [Mon, 29 Apr 2024 08:19:03 +0000 (10:19 +0200)]
x86/spec: fix reporting of BHB clearing usage from guest entry points

Reporting whether the BHB clearing on entry is done for the different domains
types based on cpu_has_bhb_seq is unhelpful, as that variable signals whether
there's a BHB clearing sequence selected, but that alone doesn't imply that
such a sequence is used from the PV and/or HVM entry points.

Instead use opt_bhb_entry_{pv,hvm} which do signal whether BHB clearing is
performed on entry from PV/HVM.

Fixes: 689ad48ce9cf ('x86/spec-ctrl: Wire up the Native-BHI software sequences')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 049ab0b2c9f1f5edb54b505fef0bc575787dafe9
master date: 2024-04-25 16:35:56 +0200

12 months ago - x86/vmx: prevent fallthrough in vmx_set_reg() for handled registers
Roger Pau Monné [Mon, 15 Apr 2024 10:22:31 +0000 (11:22 +0100)]
x86/vmx: prevent fallthrough in vmx_set_reg() for handled registers

vmx_set_reg() logic is split into two parts, the top one handles registers
that don't require loading the VMCS into context (ie: don't require a
VMWRITE).  The second half handles registers that do require the VMCS to be
loaded.

SPEC_CTRL MSR is handled differently depending on whether there's support for
virtualize SPEC_CTRL.  Without hardware help for virtualizing SPEC_CTRL the
value is handled using MSR load lists, which don't require the VMCS to be
loaded.  When there's hardware assistance however the value is stored in the
VMCS, and requires a VMWRITE.  The lack of a return statement when handling
SPEC_CTRL in the first half of the function leads to SPEC_CTRL being
unconditionally handled as if the host supported virtualize SPEC_CTRL, which means
Xen will either hit an ASSERT in debug builds, or will attempt to perform a
VMWRITE to an unhandled VMCS field if the host doesn't support the virtualize
SPEC_CTRL feature.

This bug occurred because the context wasn't adjusted accordingly to account
for the absence of commit 0626219dcc6a ("x86/hvm: Drop
hvm_{get,set}_guest_bndcfgs() and use {get,set}_regs() instead") in the 4.15
and 4.16 branches.

Fix by returning early from the function if the register is handled without
requiring the VMCS context to be loaded.
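
A hedged sketch of the fixed flow (condition and helper names are
illustrative): registers handled without the VMCS return early instead of
falling through to the VMWRITE-based half:

    switch ( reg )
    {
    case MSR_SPEC_CTRL:
        if ( !cpu_has_vmx_virt_spec_ctrl )
        {
            update_msr_load_list(v, MSR_SPEC_CTRL, val);  /* hypothetical */
            return;               /* the previously missing early return */
        }
        break;                    /* needs the VMCS: handled below */
    }

    vmx_vmcs_enter(v);
    /* ... VMWRITE-based handling for registers stored in the VMCS ... */
    vmx_vmcs_exit(v);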

Fixes: 295bf24af77c ('x86/vmx: Add support for virtualize SPEC_CTRL')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

12 months ago - x86/entry: Fix build with older toolchains
Andrew Cooper [Tue, 9 Apr 2024 20:39:51 +0000 (21:39 +0100)]
x86/entry: Fix build with older toolchains

Binutils older than 2.29 doesn't know INCSSPD.

Fixes: 8e186f98ce0e ("x86: Use indirect calls in reset-stack infrastructure")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit a9fa82500818a8d8ce5f2843f1577bd2c29d088e)

12 months ago - Update Xen version to 4.16.6
Andrew Cooper [Wed, 27 Mar 2024 18:23:18 +0000 (18:23 +0000)]
Update Xen version to 4.16.6

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

12 months ago - x86/spec-ctrl: Support the "long" BHB loop sequence
Andrew Cooper [Fri, 22 Mar 2024 19:29:34 +0000 (19:29 +0000)]
x86/spec-ctrl: Support the "long" BHB loop sequence

Out of an abundance of caution, implement the long loop too, and allow for
it to be opted in to.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit d5887c0decbd90e798b24ed696628645b04632fb)

12 months ago - x86/spec-ctrl: Wire up the Native-BHI software sequences
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Wire up the Native-BHI software sequences

In the absence of BHI_DIS_S, mitigating Native-BHI requires the use of a
software sequence.

Introduce a new bhb-seq= option to select between available sequences and
bhb-entry= to control the per-PV/HVM actions like we have for other blocks.

Activate the short sequence by default for PV and HVM guests on affected
hardware if BHI_DIS_S isn't present.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 689ad48ce9cf4c38297cd126e7e003a1c13a3b9d)

12 months ago - x86/spec-ctrl: Software BHB-clearing sequences
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Software BHB-clearing sequences

Implement clear_bhb_{tsx,loops}() as per the BHI guidance.  The loops variant
is set up as the "short" sequence.

Introduce SCF_entry_bhb and extend SPEC_CTRL_ENTRY_* with a conditional call
to selected clearing routine.

Note that due to a limitation in the ALTERNATIVE capability, the TEST/JZ can't
be included alongside a CALL in a single alternative block.  This is going to
require further work to untangle.

The BHB sequences (if used) must be after the restoration of Xen's
MSR_SPEC_CTRL value, which must be accounted for when judging whether it is
safe to skip the safety LFENCEs.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 954c983abceee97bf5f6230b9ae164f2c49a9aa9)

12 months ago - x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI
Andrew Cooper [Tue, 26 Mar 2024 19:01:37 +0000 (19:01 +0000)]
x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI

Introduce a "bhi-dis-s" boolean to match the other options we have for
MSR_SPEC_CTRL values.  Also introduce bhi_calculations().

Use BHI_DIS_S whenever possible.

Guests which are levelled to be migration compatible with older CPUs can't see
BHI_DIS_S, and Xen must fill in the difference to make the guest safe.  Use
the virt MSR_SPEC_CTRL infrastructure to force BHI_DIS_S behind the guest's
back.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 62a1106415c5e8a49b45147ca84d54a58d471343)

12 months ago - x86/tsx: Expose RTM_ALWAYS_ABORT to guests
Andrew Cooper [Sat, 6 Apr 2024 19:36:54 +0000 (20:36 +0100)]
x86/tsx: Expose RTM_ALWAYS_ABORT to guests

A TSX Abort is one option to mitigate Native-BHI, but a guest kernel doesn't get
to see this if Xen has turned RTM off using MSR_TSX_{CTRL,FORCE_ABORT}.

Therefore, the meaning of RTM_ALWAYS_ABORT has been adjusted to "XBEGIN won't
fault", and it should be exposed to guests so they can make a better decision.

Expose it in the max policy for any RTM-capable system.  Offer it by default
only if RTM has been disabled.

Update test-tsx to account for this new meaning.  While adjusting the logic in
test_guest_policies(), take the opportunity to use feature names (now they're
available) to make the logic easier to follow.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c94e2105924347de0d9f32065370e802a20cc829)

12 months ago - x86: Drop INDIRECT_JMP
Andrew Cooper [Fri, 22 Dec 2023 18:01:37 +0000 (18:01 +0000)]
x86: Drop INDIRECT_JMP

Indirect JMPs which are not tailcalls can lead to an unwelcome form of
speculative type confusion, and we've removed the uses of INDIRECT_JMP to
compensate.  Remove the temptation to reintroduce new instances.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0b66d7ce3c0290eaad28bdafb35200052d012b14)

12 months ago - x86: Use indirect calls in reset-stack infrastructure
Andrew Cooper [Fri, 22 Dec 2023 17:44:48 +0000 (17:44 +0000)]
x86: Use indirect calls in reset-stack infrastructure

Mixing up JMP and CALL indirect targets leads to a very fun form of speculative
type confusion.  A target which is expecting to be CALLed needs a
return address on the stack, and an indirect JMP doesn't place one there.

An indirect JMP which predicts to a target intending to be CALLed can end up
with a RET speculatively executing with a value from the JMPers stack frame.

There are several ways to get indirect JMPs in Xen.

 * From tailcall optimisations.  These are safe because the compiler has
   arranged the stack to point at the callee's return address.

 * From jump tables.  These are unsafe, but Xen is built with -fno-jump-tables
   to work around several compiler issues.

 * From reset_stack_and_jump_ind(), which is particularly unsafe.  Because of
   the additional stack adjustment made, the value picked up off the stack is
   regs->r15 of the next vCPU to run.

In order to mitigate this type confusion, we want to make all indirect targets
be CALL targets, and remove the use of indirect JMP except via tailcall
optimisation.

Luckily due to XSA-348, all C target functions of reset_stack_and_jump_ind()
are noreturn.  {svm,vmx}_do_resume() exits via reset_stack_and_jump(); a
direct JMP with entirely different prediction properties.  idle_loop() is an
infinite loop which eventually exits via reset_stack_and_jump_ind() from a new
schedule.  i.e. These paths are all fine having one extra return address on
the stack.

This leaves continue_pv_domain(), which is expecting to be a JMP target.
Alter it to strip the return address off the stack, which is safe because
there isn't actually a RET expecting to return to its caller.

This allows us change reset_stack_and_jump_ind() to reset_stack_and_call_ind()
in order to mitigate the speculative type confusion.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8e186f98ce0e35d1754ec9299da41ec98873b65c)

12 months ago - x86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields
Andrew Cooper [Tue, 26 Mar 2024 22:43:18 +0000 (22:43 +0000)]
x86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields

Right now, they're all bytes, but MSR_SPEC_CTRL has been steadily gaining new
features.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45dac88e78e8a2d9d8738eef884fe6730faf9e67)

12 months ago - x86/vmx: Add support for virtualize SPEC_CTRL
Roger Pau Monne [Thu, 15 Feb 2024 16:46:53 +0000 (17:46 +0100)]
x86/vmx: Add support for virtualize SPEC_CTRL

The feature is defined in the tertiary exec control, and is available starting
from Sapphire Rapids and Alder Lake CPUs.

When enabled, two extra VMCS fields are used: SPEC_CTRL mask and shadow.  Bits
set in mask are not allowed to be toggled by the guest (either set or clear)
and the value in the shadow field is the value the guest expects to be in the
SPEC_CTRL register.

By using it the hypervisor can force the value of SPEC_CTRL bits behind the
guest back without having to trap all accesses to SPEC_CTRL, note that no bits
are forced into the guest as part of this patch.  It also allows getting rid of
SPEC_CTRL in the guest MSR load list, since the value in the shadow field will
be loaded by the hardware on vmentry.
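
Restating the two fields' interaction from the description as pseudo-C
(illustrative only): guest reads return the shadow value, and a guest
write may only change bits that are clear in the mask:

    shadow = (shadow & mask) | (guest_wrmsr_value & ~mask);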

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 97c5b8b657e41a6645de9d40713b881234417b49)

12 months ago - x86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*
Andrew Cooper [Mon, 25 Mar 2024 11:09:35 +0000 (11:09 +0000)]
x86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*

The complexity is getting out of hand.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 40dea83b75386cb693481cf340024ce093be5c0f)

12 months ago - x86/spec-ctrl: Simplify DO_COND_IBPB
Andrew Cooper [Fri, 22 Mar 2024 14:33:17 +0000 (14:33 +0000)]
x86/spec-ctrl: Simplify DO_COND_IBPB

With the prior refactoring, SPEC_CTRL_ENTRY_{PV,INTR} both load SCF into %ebx,
and handle the conditional safety including skipping if interrupting Xen.

Therefore, we can drop the maybexen parameter and the conditional safety.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2378d16a931de0e62c03669169989e9437306abe)

12 months ago - x86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}
Andrew Cooper [Fri, 22 Mar 2024 12:08:02 +0000 (12:08 +0000)]
x86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}

... as we do in the exit paths too.  This will allow simplification to the
sub-blocks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9607aeb6602b8ed9962404de3f5f90170ffddb66)

12 months ago - x86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV
Andrew Cooper [Fri, 22 Mar 2024 15:52:06 +0000 (15:52 +0000)]
x86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV

Other SPEC_CTRL_* paths already use %r14 like this, and it will allow for
simplifications.

All instances of SPEC_CTRL_ENTRY_FROM_PV are followed by a GET_STACK_END()
invocation, so this change is only really logic and register shuffling.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22390697bf1b4cd3024f2d10893dec3c3ec08a9c)

12 months ago - x86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*
Andrew Cooper [Fri, 22 Mar 2024 11:41:41 +0000 (11:41 +0000)]
x86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*

Right now, we have a mix of safety strategies in different blocks, making the
logic fragile and hard to follow.

Start addressing this by having a safety LFENCE at the end of the blocks,
which can be patched out if other safety criteria are met.  This will allow us
to simplify the sub-blocks.  For SPEC_CTRL_ENTRY_FROM_IST, simply leave an
LFENCE unconditionally at the end; the IST path is not a fast-path by any
stretch of the imagination.

For SPEC_CTRL_ENTRY_FROM_INTR, the existing description was incorrect.  The
IRET #GP path is non-fatal but can occur with the guest's choice of
MSR_SPEC_CTRL.  It is safe to skip the flush/barrier-like protections when
interrupting Xen, but we must run DO_SPEC_CTRL_ENTRY irrespective.

This will skip RSB stuffing which was previously unconditional even when
interrupting Xen.

AFAICT, this is a missing cleanup from commit 3fffaf9c13e9 ("x86/entry: Avoid
using alternatives in NMI/#MC paths") where we split the IST entry path out of
the main INTR entry path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 94896de1a98c4289fe6fef9e16ef99fc6ef2efc4)

12 months ago - x86/spec-ctrl: Rename spec_ctrl_flags to scf
Andrew Cooper [Thu, 28 Mar 2024 11:57:25 +0000 (11:57 +0000)]
x86/spec-ctrl: Rename spec_ctrl_flags to scf

XSA-455 was ultimately caused by having fields with too-similar names.

Both {xen,last}_spec_ctrl are fields containing an architectural MSR_SPEC_CTRL
value.  The spec_ctrl_flags field contains Xen-internal flags.

To more-obviously distinguish the two, rename spec_ctrl_flags to scf, which is
also the prefix of the constants used by the fields.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c62673c4334b3372ebd4292a7ac8185357e7ea27)

12 months ago - x86/spec-ctrl: Make svm_vmexit_spec_ctrl conditional
Andrew Cooper [Mon, 18 Jul 2022 13:15:08 +0000 (14:15 +0100)]
x86/spec-ctrl: Make svm_vmexit_spec_ctrl conditional

The logic was written this way out of an abundance of caution, but the reality
is that AMD parts don't currently have the RAS-flushing side effect, nor do
they intend to gain it.

This removes one WRMSR from the VMExit path by default on Zen2 systems.

Fixes: 614cec7d79d7 ("x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c16a9eda77b2089206d5bc39ab6488c3793e11bf)

12 months ago - x86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests
Andrew Cooper [Tue, 9 Apr 2024 14:03:05 +0000 (15:03 +0100)]
x86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests

All of these are prediction-mode (i.e. CPL) based.  They don't operate as
advertised in PV context.

Fixes: 4dd676070684 ("x86/spec-ctrl: Expose IPRED_CTRL to guests")
Fixes: 478e4787fa64 ("x86/spec-ctrl: Expose RRSBA_CTRL to guests")
Fixes: 583f1d095052 ("x86/spec-ctrl: Expose BHI_CTRL to guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 4b3da946ad7e3452761478ae683da842e7ff20d6)

12 months ago - xen/wait: Remove indirect jump
Andrew Cooper [Fri, 22 Oct 2021 15:07:07 +0000 (16:07 +0100)]
xen/wait: Remove indirect jump

There is no need for this to be an indirect jump at all.  Execution always
returns to a specific location, so use a direct jump instead.

Use a named label for the jump.  As both 1 and 2 have disappeared as labels,
rename 3 to skip to better describe its purpose.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f03567bd7e8efc0f667a67905862e6047babdb7a)

12 months ago - x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch
Andrew Cooper [Wed, 3 Apr 2024 16:43:42 +0000 (17:43 +0100)]
x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch

It turns out there is something wonky on some but not all CPUs with
MSR_TSX_FORCE_ABORT.  The presence of RTM_ALWAYS_ABORT causes Xen to think
it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions
genuinely #UD.

Spot this case and try to back out as cleanly as we can.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b33f191e3ca99458fdcea1cb5a29dfa4965d1604)

12 months ago - VMX: tertiary execution control infrastructure
Jan Beulich [Wed, 7 Feb 2024 12:46:11 +0000 (13:46 +0100)]
VMX: tertiary execution control infrastructure

This is a prereq to enabling e.g. the MSRLIST feature.

Note that the PROCBASED_CTLS3 MSR is different from other VMX feature
reporting MSRs, in that all 64 bits report allowed 1-settings.

vVMX code is left alone, though, for the time being.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 878159bf259bfbd7a40312829f1ea0ce1f6645e2)

12 months ago - x86/spec-ctrl: Expose BHI_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests

The CPUID feature bit signals the presence of the BHI_DIS_S control in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 583f1d0950529f3517b1741c2b21a028a82ba831)

12 months ago - x86/spec-ctrl: Expose RRSBA_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests

The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 478e4787fa64b621061177a7843c452e9a19916d)

12 months ago - x86/spec-ctrl: Expose IPRED_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests

The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4dd6760706848de30f7c8b5f83462b9bcb070c91)

12 months ago - x86/spec-ctrl: Fix BTC/SRSO mitigations
Andrew Cooper [Tue, 26 Mar 2024 22:47:25 +0000 (22:47 +0000)]
x86/spec-ctrl: Fix BTC/SRSO mitigations

We were looking for SCF_entry_ibpb in the wrong variable in the top-of-stack
block, and xen_spec_ctrl won't have had bit 5 set because Xen doesn't
understand SPEC_CTRL_RRSBA_DIS_U yet.

This is XSA-455 / CVE-2024-31142.

Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

13 months ago - hypercall_xlat_continuation: Replace BUG_ON with domain_crash
Bjoern Doebel [Wed, 27 Mar 2024 18:30:55 +0000 (18:30 +0000)]
hypercall_xlat_continuation: Replace BUG_ON with domain_crash

Instead of crashing the host in case of unexpected hypercall parameters,
resort to only crashing the calling domain.
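
The shape of the change, sketched (the condition is illustrative):

    if ( unexpected_args )
    {
        domain_crash(current->domain);  /* instead of BUG_ON(...) */
        return -EINVAL;                 /* only the caller goes down */
    }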

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9926e692c4afc40bcd66f8416ff6a1e93ce402f6)

13 months ago - x86/HVM: clear upper halves of GPRs upon entry from 32-bit code
Jan Beulich [Wed, 27 Mar 2024 17:31:38 +0000 (17:31 +0000)]
x86/HVM: clear upper halves of GPRs upon entry from 32-bit code

Hypercalls in particular can be the subject of continuations, and logic
there checks updated state against incoming register values. If the
guest manufactured a suitable argument register with a non-zero upper
half before entering compatibility mode and issuing a hypercall from
there, checks in hypercall_xlat_continuation() might trip.

Since for HVM we want to also be sure to not hit a corner case in the
emulator, initiate the clipping right from the top of
{svm,vmx}_vmexit_handler(). Also rename the invoked function, as it no
longer does only invalidation of fields.

Note that architecturally the upper halves of registers are undefined
after a switch between compatibility and 64-bit mode (either direction).
Hence once having entered compatibility mode, the guest can't assume
the upper half of any register to retain its value.
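
Illustratively (regs layout per Xen's struct cpu_user_regs; treat as a
sketch), clearing an upper half is a 32-bit truncating store per GPR:

    static void clear_upper_halves(struct cpu_user_regs *regs)
    {
        regs->rbx = (uint32_t)regs->rbx;   /* discard guest-set upper half */
        regs->rcx = (uint32_t)regs->rcx;
        /* ... likewise for the remaining guest GPRs ... */
    }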

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 6a98383b0877bb66ebfe189da43bf81abe3d7909)

13 months ago - x86: protect conditional lock taking from speculative execution
Roger Pau Monné [Mon, 4 Mar 2024 15:24:21 +0000 (16:24 +0100)]
x86: protect conditional lock taking from speculative execution

Conditionally taken locks that use the pattern:

if ( lock )
    spin_lock(...);

Need an else branch in order to issue a speculation barrier in the else case,
just like it's done in case the lock needs to be acquired.

eval_nospec() could be used on the condition itself, but that would result in a
double barrier on the branch where the lock is taken.

Introduce a new pair of helpers, {gfn,spin}_lock_if() that can be used to
conditionally take a lock in a speculation safe way.
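
A hedged sketch of what such a helper can look like (details
illustrative): both branches end with a speculation barrier, without
doubling it up on the lock-taken path:

    static always_inline bool spin_lock_if(bool condition, spinlock_t *l)
    {
        if ( condition )
            spin_lock(l);               /* wrapper ends in a barrier */
        else
            block_lock_speculation();   /* barrier on the else path too */
        return condition;
    }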

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 03cf7ca23e0e876075954c558485b267b7d02406)

13 months ago - x86/mm: add speculation barriers to open coded locks
Roger Pau Monné [Mon, 4 Mar 2024 17:08:48 +0000 (18:08 +0100)]
x86/mm: add speculation barriers to open coded locks

Add a speculation barrier to the clearly identified open-coded lock taking
functions.

Note that the memory sharing page_lock() replacement (_page_lock()) is left
as-is, as the code is experimental and not security supported.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 42a572a38e22a97d86a4b648a22597628d5b42e4)

13 months ago - locking: attempt to ensure lock wrappers are always inline
Roger Pau Monné [Mon, 4 Mar 2024 13:29:36 +0000 (14:29 +0100)]
locking: attempt to ensure lock wrappers are always inline

In order to prevent the locking speculation barriers from being inside of
`call`ed functions that could be speculatively bypassed.

While there also add an extra locking barrier to _mm_write_lock() in the branch
taken when the lock is already held.

Note some functions are switched to use the unsafe variants (without speculation
barrier) of the locking primitives, but a speculation barrier is always added
to the exposed public lock wrapping helper.  That's the case with
sched_spin_lock_double() or pcidevs_lock() for example.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 197ecd838a2aaf959a469df3696d4559c4f8b762)

13 months ago - percpu-rwlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 16:57:38 +0000 (17:57 +0100)]
percpu-rwlock: introduce support for blocking speculation into critical regions

Add direct calls to block_lock_speculation() where required in order to prevent
speculation into the lock protected critical regions.  Also convert
_percpu_read_lock() from inline to always_inline.

Note that _percpu_write_lock() has been modified to use the non-speculation-
safe variants of the locking primitives, as a speculation barrier is added
unconditionally by the calling wrapper.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f218daf6d3a3b847736d37c6a6b76031a0d08441)

13 months ago - rwlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 15:08:52 +0000 (16:08 +0100)]
rwlock: introduce support for blocking speculation into critical regions

Introduce inline wrappers as required and add direct calls to
block_lock_speculation() in order to prevent speculation into the rwlock
protected critical regions.

Note the rwlock primitives are adjusted to use the non speculation safe variants
of the spinlock handlers, as a speculation barrier is added in the rwlock
calling wrappers.

trylock variants are protected by using lock_evaluate_nospec().

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a1fb15f61692b1fa9945fc51f55471ace49cdd59)

13 months ago - x86/spinlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 12:08:05 +0000 (13:08 +0100)]
x86/spinlock: introduce support for blocking speculation into critical regions

Introduce a new Kconfig option to block speculation into lock protected
critical regions.  The Kconfig option is enabled by default, but the mitigation
won't be engaged unless it's explicitly enabled in the command line using
`spec-ctrl=lock-harden`.

Convert the spinlock acquire macros into always-inline functions, and introduce
a speculation barrier after the lock has been taken.  Note the speculation
barrier is not placed inside the implementation of the spin lock functions, so
as to prevent speculation from falling through the call to the lock functions,
resulting in the barrier also being skipped.
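
Sketched shape of such a wrapper (illustrative):

    /* The barrier lives in the always-inline wrapper, after the
     * out-of-line lock implementation returns, so a speculatively
     * bypassed CALL can't also bypass the barrier. */
    static always_inline void spin_lock(spinlock_t *l)
    {
        _spin_lock(l);
        block_lock_speculation();
    }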

trylock variants are protected using a construct akin to the existing
evaluate_nospec().

This patch only implements the speculation barrier for x86.

Note spin locks are the only locking primitive taken care of in this change;
further locking primitives will be adjusted by separate changes.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7ef0084418e188d05f338c3e028fbbe8b6924afa)

13 months ago - xen: Swap order of actions in the FREE*() macros
Andrew Cooper [Fri, 2 Feb 2024 00:39:42 +0000 (00:39 +0000)]
xen: Swap order of actions in the FREE*() macros

Wherever possible, it is a good idea to NULL out the visible reference to an
object prior to freeing it.  The FREE*() macros already collect together both
parts, making it easy to adjust.

This has a marginal code generation improvement, as some of the calls to the
free() function can be tailcall optimised.
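
The swapped order, sketched as an XFREE()-style macro (illustrative):

    /* NULL the visible reference first, then free via a saved copy; with
     * the free() last, the call can be tail-call optimised. */
    #define XFREE(p) do {     \
        void *p_ = (p);       \
        (p) = NULL;           \
        xfree(p_);            \
    } while ( 0 )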

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c4f427ec879e7c0df6d44d02561e8bee838a293e)

13 months ago - x86/paging: Delete update_cr3()'s do_locking parameter
Andrew Cooper [Wed, 20 Sep 2023 19:06:53 +0000 (20:06 +0100)]
x86/paging: Delete update_cr3()'s do_locking parameter

Nicola reports that the XSA-438 fix introduced new MISRA violations because of
some incidental tidying it tried to do.  The parameter is useless, so resolve
the MISRA regression by removing it.

hap_update_cr3() discards the parameter entirely, while sh_update_cr3() uses
it to distinguish internal and external callers and therefore whether the
paging lock should be taken.

However, we have paging_lock_recursive() for this purpose, which also avoids
the ability for the shadow internal callers to accidentally not hold the lock.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Reported-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit e71157d1ac2a7fbf413130663cf0a93ff9fbcf7e)

13 months ago - x86/spec-ctrl: Mitigation Register File Data Sampling
Andrew Cooper [Thu, 22 Jun 2023 22:32:19 +0000 (23:32 +0100)]
x86/spec-ctrl: Mitigation Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined: RFDS_CLEAR to indicate VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)

13 months ago - x86/spec-ctrl: VERW-handling adjustments
Andrew Cooper [Tue, 5 Mar 2024 19:33:37 +0000 (19:33 +0000)]
x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)

13 months ago - x86/spec-ctrl: Rename VERW related options
Andrew Cooper [Mon, 12 Feb 2024 17:50:43 +0000 (17:50 +0000)]
x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)

13 months ago - x86/spec-ctrl: Perform VERW flushing later in exit paths
Andrew Cooper [Sat, 27 Jan 2024 18:20:56 +0000 (18:20 +0000)]
x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)

13 months ago - x86/vmx: Perform VERW flushing later in the VMExit path
Andrew Cooper [Fri, 23 Jun 2023 10:32:00 +0000 (11:32 +0100)]
x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale of why we're moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)

13 months ago - x86/cpu-policy: Allow for levelling of VERW side effects
Andrew Cooper [Thu, 29 Feb 2024 11:26:40 +0000 (11:26 +0000)]
x86/cpu-policy: Allow for levelling of VERW side effects

MD_CLEAR and FB_CLEAR need OR-ing across a migrate pool.  Allow this, by
having them unconditionally set in max, with the host values reflected in
default.  Annotate the bits as having special properties.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit de17162cafd27f2865a3102a2ec0f386a02ed03d)

13 months ago - x86/entry: Introduce EFRAME_* constants
Andrew Cooper [Sat, 27 Jan 2024 17:52:09 +0000 (17:52 +0000)]
x86/entry: Introduce EFRAME_* constants

restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so.  Also, almost all entrypaths
use raw %rsp displacements prior to pushing GPRs.

Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.

No functional change.  The resulting binary is identical.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)

14 months ago - x86: account for shadow stack in exception-from-stub recovery
Jan Beulich [Tue, 27 Feb 2024 13:13:21 +0000 (14:13 +0100)]
x86: account for shadow stack in exception-from-stub recovery

Dealing with exceptions raised from within emulation stubs involves
discarding return address (replaced by exception related information).
Such discarding of course also requires removing the corresponding entry
from the shadow stack.

Also amend the comment in fixup_exception_return(), to further clarify
why use of ptr[1] can't be an out-of-bounds access.

This is CVE-2023-46841 / XSA-451.

Fixes: 209fb9919b50 ("x86/extable: Adjust extable handling to be shadow stack compatible")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 91f5f7a9154919a765c3933521760acffeddbf28
master date: 2024-02-27 13:49:22 +0100

15 months ago - pci: fail device assignment if phantom functions cannot be assigned
Roger Pau Monné [Tue, 30 Jan 2024 13:42:41 +0000 (14:42 +0100)]
pci: fail device assignment if phantom functions cannot be assigned

The current behavior is that no error is reported if (some) phantom functions
fail to be assigned during device add or assignment, so the operation succeeds
even if some phantom functions are not correctly setup.

This can lead to devices possibly being successfully assigned to a domU while
some of the device phantom functions are still assigned to dom0.  Even when the
device is assigned domIO before being assigned to a domU phantom functions
might fail to be assigned to domIO, and also fail to be assigned to the domU,
leaving them assigned to dom0.

Since the device can generate requests using the IDs of those phantom
functions, given the scenario above a device in such state would be in control
of a domU, but still capable of generating transactions that use a context ID
targeting dom0 owned memory.

Modify device assign in order to attempt to deassign the device if phantom
functions failed to be assigned.
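
A hedged sketch of the unwinding (helper names hypothetical):

    rc = assign_one(d, seg, bus, devfn);             /* the base function */
    for ( i = 1; !rc && i < nr_phantom_fns; i++ )
        rc = assign_one(d, seg, bus, phantom_fn(devfn, i));
    if ( rc )
        deassign_device(d, seg, bus, devfn);  /* don't leave it half-assigned */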

Note that device addition is not modified in the same way, as in that case the
device is assigned to a trusted domain, and hence partial assign can lead to
device malfunction but not a security issue.

This is XSA-449 / CVE-2023-46839

Fixes: 4e9950dc1bd2 ('IOMMU: add phantom function support')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb4ecb3cc17b02c2814bc817efd05f3f3ba33d1e
master date: 2024-01-30 14:28:01 +0100

16 months ago - xen/arm: page: Avoid pointer overflow on cache clean & invalidate
Michal Orzel [Tue, 12 Dec 2023 13:53:13 +0000 (14:53 +0100)]
xen/arm: page: Avoid pointer overflow on cache clean & invalidate

On Arm32, after cleaning and invalidating the last dcache line of the top
domheap page i.e. VA = 0xfffff000 (as a result of flushing the page to
RAM), we end up adding the value of a dcache line size to the pointer
once again, which results in a pointer arithmetic overflow (with 64B line
size, operation 0xffffffc0 + 0x40 overflows to 0x0). Such behavior is
undefined and given the wide range of compiler versions we support, it is
difficult to determine what could happen in such scenario.

Modify clean_and_invalidate_dcache_va_range() as well as
clean_dcache_va_range() and invalidate_dcache_va_range(), which share the
same handling, to prevent pointer arithmetic overflow. Modify the loops to
use an additional variable to store the index of the next cacheline.
Add an assert to prevent passing a region that wraps around, which is
illegal and would end up in a page fault anyway (region 0-2MB is
unmapped). Lastly, return early if the size passed is 0.
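
The overflow-safe loop shape can be sketched as follows (illustrative C;
cacheline_bytes and the per-line flush primitive stand in for Xen's actual
helpers):

    static int clean_dcache_va_range_sketch(const void *p, unsigned long size)
    {
        unsigned long idx = 0;

        if ( !size )
            return 0;

        /* Wrapping regions are illegal; catch them early. */
        ASSERT((uintptr_t)p + size - 1 >= (uintptr_t)p);

        /* Align the start down, extending the size accordingly. */
        size += (uintptr_t)p & (cacheline_bytes - 1);
        p = (const void *)((uintptr_t)p & ~(cacheline_bytes - 1));

        /* 'p + idx' is only ever computed for a valid line start, so the
         * pointer arithmetic can no longer overflow. */
        for ( ; idx < size; idx += cacheline_bytes )
            clean_one_dcache_line(p + idx);   /* hypothetical primitive */

        return 0;
    }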

Note that on Arm64 we don't have this problem, given that the maximum VA
space we support is 48 bits.

This is XSA-447 / CVE-2023-46837.

Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 190b7f49af6487a9665da63d43adc9d9a5fbd01e
master date: 2023-12-12 14:01:00 +0100

18 months agox86/spec-ctrl: Remove conditional IRQs-on-ness for INT $0x80/0x82 paths
Andrew Cooper [Thu, 26 Oct 2023 13:37:38 +0000 (14:37 +0100)]
x86/spec-ctrl: Remove conditional IRQs-on-ness for INT $0x80/0x82 paths

Before speculation defences, some paths in Xen could genuinely get away with
being IRQs-on at entry.  But XPTI invalidated this property on most paths, and
attempting to maintain it on the remaining paths was a mistake.

Fast forward, and DO_SPEC_CTRL_COND_IBPB (protection for AMD BTC/SRSO) is not
IRQ-safe, running with IRQs enabled in some cases.  The other actions taken on
these paths happen to be IRQ-safe.

Make entry_int82() and int80_direct_trap() unconditionally Interrupt Gates
rather than Trap Gates.  Remove the conditional re-adjustment of
int80_direct_trap() in smp_prepare_cpus(), and have entry_int82() explicitly
enable interrupts when safe to do so.

In smp_prepare_cpus(), with the conditional re-adjustment removed, the
clearing of pv_cr3 is the only remaining action gated on XPTI, and it is out
of place anyway, repeating work already done by smp_prepare_boot_cpu().  Drop
the entire if() condition to avoid leaving an incorrect vestigial remnant.

Also drop comments which make incorrect statements about when it's safe to
enable interrupts.

This is XSA-446 / CVE-2023-46836

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit a48bb129f1b9ff55c22cf6d2b589247c8ba3b10e)

18 months agoiommu/amd-vi: use correct level for quarantine domain page tables
Roger Pau Monne [Wed, 11 Oct 2023 11:14:21 +0000 (13:14 +0200)]
iommu/amd-vi: use correct level for quarantine domain page tables

The current setup of the quarantine page tables assumes that the quarantine
domain (dom_io) has been initialized with an address width of
DEFAULT_DOMAIN_ADDRESS_WIDTH (48).

However, dom_io, being a PV domain, gets its AMD-Vi IOMMU page-table levels
based on the maximum (hot-pluggable) RAM address, and hence on systems with no
RAM above the 512GB mark only 3 page-table levels are configured in the IOMMU.

On systems without RAM above the 512GB boundary, amd_iommu_quarantine_init()
will set up page tables for the scratch page with 4 levels, while the IOMMU
will be configured to use 3 levels only.  The page destined to be used as
level 1, and to contain a directory of PTEs, ends up being the address in a
PTE itself, and thus the level 1 page becomes the leaf page.  Without the
level mismatch, it's the level 0 page that would be the leaf page instead.

The level 1 page won't be used as such, and hence it's not possible to use it
to gain access to other memory on the system.  However, that page is not
cleared in amd_iommu_quarantine_init() as part of the re-initialization of the
device quarantine page tables, and hence data on the level 1 page can be
leaked between device usages.

Fix this by making sure the paging levels setup by amd_iommu_quarantine_init()
match the number configured on the IOMMUs.

Note that IVMD regions are not affected by this issue, as those areas are
mapped taking the configured paging levels into account.

This is XSA-445 / CVE-2023-46835

Fixes: ea38867831da ('x86 / iommu: set up a scratch page in the quarantine domain')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fe1e4668b373ec4c1e5602e75905a9fa8cc2be3f)

19 months agox86/pv: Correct the auditing of guest breakpoint addresses
Andrew Cooper [Tue, 26 Sep 2023 19:06:57 +0000 (20:06 +0100)]
x86/pv: Correct the auditing of guest breakpoint addresses

The use of access_ok() is buggy, because it permits access to the compat
translation area.  64bit PV guests don't use the XLAT area, but on AMD
hardware, the DBEXT feature allows a breakpoint to match up to a 4G aligned
region, allowing the breakpoint to reach outside of the XLAT area.

Prior to c/s cda16c1bb223 ("x86: mirror compat argument translation area for
32-bit PV"), the live GDT was within 4G of the XLAT area.

Altogether, this allowed a malicious 64bit PV guest on AMD hardware to place
a breakpoint over the live GDT, and trigger a #DB livelock (CVE-2015-8104).

Introduce breakpoint_addr_ok() and explain why __addr_ok() happens to be an
appropriate check in this case.

For Xen 4.14 and later, this is a latent bug because the XLAT area has moved
to be on its own with nothing interesting adjacent.  For Xen 4.13 and older on
AMD hardware, this fixes a PV-trigger-able DoS.

This is part of XSA-444 / CVE-2023-34328.

Fixes: 65e355490817 ("x86/PV: support data breakpoint extension registers")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit dc9d9aa62ddeb14abd5672690d30789829f58f7e)

19 months agox86/svm: Fix asymmetry with AMD DR MASK context switching
Andrew Cooper [Tue, 26 Sep 2023 19:06:57 +0000 (20:06 +0100)]
x86/svm: Fix asymmetry with AMD DR MASK context switching

The handling of MSR_DR{0..3}_MASK is asymmetric between PV and HVM guests.

HVM guests switch in based on the guest's view of DBEXT, whereas PV guests
switch in based on the host capability.  Both guest types leave the
context dirty for the next vCPU.

This leads to the following issue:

 * PV or HVM vCPU has debugging active (%dr7 + mask)
 * Switch out deactivates %dr7 but leaves other state stale in hardware
 * HVM vCPU with debugging active but can't see DBEXT is switched in
 * Switch in loads %dr7 but leaves the mask MSRs alone

Now, the HVM vCPU is operating in the context of the prior vCPU's mask MSR,
and furthermore in a case where it genuinely expects there to be no masking
MSRs.

As a stopgap, adjust the HVM path to switch the masks in/out based on host
capabilities rather than guest visibility (i.e. like the PV path).
Adjustment of the intercepts still needs to be dependent on the guest
visibility of DBEXT.
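
The switch-in side of the stopgap looks roughly like this (MSR and field
names follow Xen's x86 conventions but are assumptions here):

    /* Load the breakpoint mask MSRs based on *host* capability, matching
     * the PV path, so a prior vCPU's masks never linger in hardware. */
    if ( cpu_has_dbext )                  /* host support, not guest view */
    {
        wrmsrl(MSR_AMD64_DR0_ADDRESS_MASK, v->arch.msrs->dr_mask[0]);
        wrmsrl(MSR_AMD64_DR1_ADDRESS_MASK, v->arch.msrs->dr_mask[1]);
        wrmsrl(MSR_AMD64_DR2_ADDRESS_MASK, v->arch.msrs->dr_mask[2]);
        wrmsrl(MSR_AMD64_DR3_ADDRESS_MASK, v->arch.msrs->dr_mask[3]);
    }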

This is part of XSA-444 / CVE-2023-34327

Fixes: c097f54912d3 ("x86/SVM: support data breakpoint extension registers")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 5d54282f984bb9a7a65b3d12208584f9fdf1c8e1)

19 months agolibxl: limit bootloader execution in restricted mode
Roger Pau Monne [Thu, 28 Sep 2023 10:22:35 +0000 (12:22 +0200)]
libxl: limit bootloader execution in restricted mode

Introduce a timeout for bootloader execution when running in restricted mode.

Allow overriding the default timeout with an environment-provided value.
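
A minimal sketch of the idea (the environment variable name and default shown
here are assumptions, not necessarily the exact ones in the patch):

    /* Pick the bootloader timeout, preferring an environment override. */
    const char *env = getenv("LIBXL_BOOTLOADER_TIMEOUT");   /* assumed name */
    unsigned int timeout = env ? strtoul(env, NULL, 10) : 120;

    if ( restricted && timeout )
        /* Kill the bootloader if it runs longer than 'timeout' seconds. */
        arm_bootloader_timeout(gc, bootloader, timeout);    /* hypothetical */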

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 9c114178ffd700112e91f5ec66cf5151b9c9a8cc)

19 months agolibxl: add support for running bootloader in restricted mode
Roger Pau Monne [Mon, 25 Sep 2023 12:30:20 +0000 (14:30 +0200)]
libxl: add support for running bootloader in restricted mode

Much like the device model depriv mode, add the same kind of support for the
bootloader.  This feature allows passing a UID as a parameter for the
bootloader to run as, together with the bootloader itself taking the necessary
actions to isolate itself.

Note that the user to run the bootloader as must have the right permissions to
access the guest disk image (in read mode only), and that the bootloader will
be run in non-interactive mode when restricted.

If enabled, bootloader restrict mode will attempt to re-use the user(s) from
the QEMU depriv implementation if no user is provided in the configuration
file or the environment.  See docs/features/qemu-deprivilege.pandoc for more
information about how to set up those users.

Bootloader restrict mode is not enabled by default as it requires certain
setup to be done first (setting up the user(s) to use in restrict mode).

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 1f762642d2cad1a40634e3280361928109d902f1)

19 months agotools/pygrub: Deprivilege pygrub
Alejandro Vallejo [Mon, 25 Sep 2023 17:32:25 +0000 (18:32 +0100)]
tools/pygrub: Deprivilege pygrub

Introduce a --runas=<uid> flag to deprivilege pygrub on Linux and *BSDs. It
also implicitly creates a chroot env where it drops a deprivileged forked
process. The chroot itself is cleaned up at the end.

If the --runas arg is present, then pygrub forks, leaving the child to
deprivilege itself, and waiting for it to complete. When the child exits,
the parent performs cleanup and exits with the same error code.

This is roughly what the child does:
  1. Initialize libfsimage (this loads every .so into memory so the chroot
     can avoid bind-mounting /{,usr}/lib*)
  2. Create a temporary empty chroot directory
  3. Mount tmpfs in it
  4. Bind mount the disk inside, because libfsimage expects a path, not a
     file descriptor.
  5. Remount the root tmpfs to be stricter (ro,nosuid,nodev)
  6. Set RLIMIT_FSIZE to a sensibly high amount (128 MiB)
  7. Depriv gid, groups and uid

With this scheme in place, the "output" files are writable (up to
RLIMIT_FSIZE octets) and the exposed filesystem is immutable and contains
the only file we can't easily get rid of (the disk).

If running on Linux, the child process also unshares mount, IPC, and
network namespaces before dropping its privileges.
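
In C terms, the child's isolation sequence looks roughly like the sketch
below (pygrub does the equivalent from Python; error handling is omitted and
the exact mount flags are assumptions):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mount.h>
    #include <sys/resource.h>
    #include <grp.h>
    #include <unistd.h>

    static void depriv_sketch(const char *dir, uid_t uid, gid_t gid)
    {
        unshare(CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWNET);   /* Linux only */
        mount("none", dir, "tmpfs", 0, NULL);
        /* ... bind-mount the disk image inside 'dir' (step 4) ... */
        mount("none", dir, "tmpfs",
              MS_REMOUNT | MS_RDONLY | MS_NOSUID | MS_NODEV, NULL);
        chroot(dir);
        chdir("/");
        setrlimit(RLIMIT_FSIZE,
                  &(struct rlimit){ 128 << 20, 128 << 20 });  /* 128 MiB */
        setgid(gid);
        setgroups(0, NULL);
        setuid(uid);        /* gid, groups, uid: depriv order matters */
    }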

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
(cherry picked from commit e0342ae5556f2b6e2db50701b8a0679a45822ca6)

19 months agotools/libfsimage: Export a new function to preload all plugins
Alejandro Vallejo [Mon, 25 Sep 2023 17:32:24 +0000 (18:32 +0100)]
tools/libfsimage: Export a new function to preload all plugins

This is work required in order to let pygrub operate in a highly deprivileged
chroot mode. This patch adds a function that preloads every plugin, ensuring
that on function exit every shared library is loaded in memory.

The new "init" function is supposed to be used before depriv, but that's
fine because it's not acting on untrusted data.

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
(cherry picked from commit 990e65c3ad9ac08642ce62a92852c80be6c83e96)

19 months agotools/pygrub: Open the output files earlier
Alejandro Vallejo [Mon, 25 Sep 2023 17:32:23 +0000 (18:32 +0100)]
tools/pygrub: Open the output files earlier

This patch allows pygrub to get ahold of every RW file descriptor it needs
early on. A later patch will clamp the filesystem it can access so it can't
obtain any others.

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
(cherry picked from commit 0710d7d44586251bfca9758890616dc3d6de8a74)

19 months agotools/pygrub: Small refactors
Alejandro Vallejo [Mon, 25 Sep 2023 17:32:22 +0000 (18:32 +0100)]
tools/pygrub: Small refactors

Small tidy up to ensure output_directory always has a trailing '/' to ease
concatenating paths and that `output` can only be a filename or None.

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
(cherry picked from commit 9f2ff9a7c9b3ac734ae99f17f0134ed0343dcccf)

19 months agotools/pygrub: Remove unnecessary hypercall
Alejandro Vallejo [Mon, 25 Sep 2023 17:32:21 +0000 (18:32 +0100)]
tools/pygrub: Remove unnecessary hypercall

There's a hypercall being issued in order to determine whether PV64 is
supported, but since Xen 4.3 that's always true, so it's not required.

Plus, this way we can avoid mapping the privcmd interface altogether in the
depriv pygrub.

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit f4b504c6170c446e61055cbd388ae4e832a9deca)

19 months agolibfsimage/xfs: Add compile-time check to libfsimage
Alejandro Vallejo [Thu, 14 Sep 2023 12:22:53 +0000 (13:22 +0100)]
libfsimage/xfs: Add compile-time check to libfsimage

Adds the common tools include folder to the -I compile flags
of libfsimage. This allows us to use:
  xen-tools/common-macros.h:BUILD_BUG_ON()

With it, statically assert a sanitized "blocklog - SECTOR_BITS" cannot
underflow.
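
The check itself is a one-liner along these lines (the constant names are
illustrative; the real patch's expression may differ):

    /* Fail the build if the minimum accepted blocklog could make
     * "blocklog - SECTOR_BITS" underflow. */
    BUILD_BUG_ON(XFS_SB_BLOCKLOG_MIN < SECTOR_BITS);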

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7d85c70431593550e32022e3a19a37f306f49e00)

19 months agolibfsimage/xfs: Sanity-check the superblock during mounts
Alejandro Vallejo [Thu, 14 Sep 2023 12:22:52 +0000 (13:22 +0100)]
libfsimage/xfs: Sanity-check the superblock during mounts

Sanity-check the XFS superblock for well-formedness in the mount handler.
This forces pygrub to abort parsing a potentially malformed filesystem and
ensures the invariants assumed throughout the rest of the code hold.

Also, derive parameters from previously sanitized parameters where possible
(rather than reading them off the superblock).

The code doesn't try to avoid overflowing the end of the disk, because
that's an unlikely and benign error. Parameters used in calculations of
xfs_daddr_t (like the root inode index) aren't in critical need of being
sanitized.

The sanitization of agblklog is basically checking that no obvious
overflows happen on agblklog, and then ensuring agblocks is contained in
the range (2^(sb_agblklog-1), 2^sb_agblklog].
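
For the agblklog part, the bounds described above translate roughly to the
following (field names per the XFS on-disk superblock; a sketch, not the
exact patch):

    /* Reject malformed geometry: require
     * 2^(sb_agblklog - 1) < sb_agblocks <= 2^sb_agblklog. */
    if ( sb->sb_agblklog == 0 || sb->sb_agblklog > 32 ||
         (uint64_t)sb->sb_agblocks > (1ULL << sb->sb_agblklog) ||
         (uint64_t)sb->sb_agblocks <= (1ULL << (sb->sb_agblklog - 1)) )
        return -1;   /* refuse to mount */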

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 620500dd1baf33347dfde5e7fde7cf7fe347da5c)

19 months agolibfsimage/xfs: Amend mask32lo() to allow the value 32
Alejandro Vallejo [Thu, 14 Sep 2023 12:22:51 +0000 (13:22 +0100)]
libfsimage/xfs: Amend mask32lo() to allow the value 32

agblklog could plausibly be 32, but that would overflow this shift.
Perform the shift as ULL and cast to u32 at the end instead.
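
The fix boils down to the following shape (a sketch matching the
description; naming in the real macro may differ):

    /* Before: (1 << 32) on a 32-bit type is undefined behaviour.
     * After: shift as unsigned long long, truncate to 32 bits last,
     * so mask32lo(32) correctly yields 0xffffffff. */
    #define mask32lo(n) ((uint32_t)((1ULL << (n)) - 1))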

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ddc45e4eea946bb373a4b4a60c84bf9339cf413b)

19 months agolibfsimage/xfs: Remove dead code
Alejandro Vallejo [Thu, 14 Sep 2023 12:22:50 +0000 (13:22 +0100)]
libfsimage/xfs: Remove dead code

xfs_info.agnolog (and related code) and XFS_INO_AGBNO_BITS are dead code
that serve no purpose.

This is part of XSA-443 / CVE-2023-34325

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37fc1e6c1c5c63aafd9cfd76a37728d5baea7d71)

19 months agoiommu/amd-vi: flush IOMMU TLB when flushing the DTE
Roger Pau Monne [Tue, 13 Jun 2023 13:01:05 +0000 (15:01 +0200)]
iommu/amd-vi: flush IOMMU TLB when flushing the DTE

The caching invalidation guidelines from the AMD-Vi specification (48882—Rev
3.07-PUB—Oct 2022) seem to be misleading on some hardware, as devices will
malfunction (see stale DMA mappings) if some fields of the DTE are updated but
the IOMMU TLB is not flushed. This has been observed in practice on AMD
systems.  Due to the lack of guidance from the currently published
specification, this patch aims to increase the flushing done in order to
prevent device malfunction.

In order to fix, issue an INVALIDATE_IOMMU_PAGES command from
amd_iommu_flush_device(), flushing all the address space.  Note this requires
callers to be adjusted in order to pass the DomID on the DTE previous to the
modification.
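
A sketch of the strengthened flush (helper names are modelled on Xen's
AMD-Vi code but are assumptions here):

    /* Flush the DTE, then invalidate the whole IOMMU TLB for the DomID
     * that was in the DTE before the modification. */
    static void flush_device_sketch(struct amd_iommu *iommu, uint16_t bdf,
                                    domid_t prev_domid)
    {
        flush_dev_table_entry(iommu, bdf);                /* hypothetical */
        if ( prev_domid != DOMID_INVALID )
            /* INVALIDATE_IOMMU_PAGES over the full address space */
            flush_iommu_pages_all(iommu, prev_domid);     /* hypothetical */
    }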

Some call sites don't provide a valid DomID to amd_iommu_flush_device() in
order to avoid the flush.  That's because the device had address translations
disabled and hence the previous DomID on the DTE is not valid.  Note the
current logic relies on the entity disabling address translations to also flush
the TLB of the in-use DomID.

Device I/O TLB flushing when ATS are enabled is not covered by the current
change, as ATS usage is not security supported.

This is XSA-442 / CVE-2023-34326

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fc98b97084a46884acef9320e643faf40d42212)

19 months agotools/xenstored: domain_entry_fix(): Handle conflicting transaction
Julien Grall [Fri, 22 Sep 2023 10:32:16 +0000 (11:32 +0100)]
tools/xenstored: domain_entry_fix(): Handle conflicting transaction

The function domain_entry_fix() will initially be called to check whether
the quota is correct before attempting to commit any nodes. So it is
possible that the accounting is temporarily negative. This is the case
in the following sequence:

  1) Create 50 nodes
  2) Start two transactions
  3) Delete all the nodes in each transaction
  4) Commit the two transactions

Because the first transaction will have succeeded and updated the
accounting, there is no guarantee that 'd->nbentry + num' will still
be above 0, so the assert() would be triggered.
The assert() was introduced in dbef1f748289 ("tools/xenstore: simplify
and fix per domain node accounting") with the assumption that the
value can't be negative. As this is not true, revert to the original
check, but restricted to the path where we don't update. Take the
opportunity to explain the rationale behind the check.
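
The restored check looks roughly like this (a sketch of the logic, not the
exact xenstored code):

    cnt = d->nbentry + num;
    if (!update && cnt < 0)
        /*
         * A parallel transaction may already have committed the same
         * deletions, making the provisional count negative.  That is
         * fine: this transaction will fail on the conflict anyway, so
         * don't trip an assert() here.
         */
        return 0;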

This is CVE-2023-34323 / XSA-440.

Fixes: dbef1f748289 ("tools/xenstore: simplify and fix per domain node accounting")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit c4e05c97f57d236040d1da5c1fbf6e3699dc86ea)

19 months agox86/shadow: defer releasing of PV's top-level shadow reference
Jan Beulich [Wed, 20 Sep 2023 09:34:24 +0000 (10:34 +0100)]
x86/shadow: defer releasing of PV's top-level shadow reference

sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock is dropped. Yet it needs to remain valid until the actual
page table switch has occurred.

Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.

While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.

This is CVE-2023-34322 / XSA-438.

Reported-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
(cherry picked from commit fb0ff49fe9f784bfee0370c2a3c5f20e39d7a1cb)

19 months agox86/spec-ctrl: Mitigate the Zen1 DIV leakage
Andrew Cooper [Wed, 30 Aug 2023 19:24:25 +0000 (20:24 +0100)]
x86/spec-ctrl: Mitigate the Zen1 DIV leakage

In the Zen1 microarchitecture, there is one divider in the pipeline which
services uops from both threads.  In the case of #DE, the latched result from
the previous DIV to execute will be forwarded speculatively.

This is an interesting covert channel that allows two threads to communicate
without any system calls.  It also allows userspace to obtain the result of
the most recent DIV instruction executed (even speculatively) in the core,
which can be from a higher privilege context.

Scrub the result from the divider by executing a non-faulting divide.  This
needs performing on the exit-to-guest paths, and ist_exit-to-Xen.
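
The scrub itself can be sketched as a single architecturally non-faulting
divide (illustrative inline asm; the real code lives in the exit-path asm
macros):

    static inline void div_scrub_sketch(void)
    {
        unsigned int q = 1, r = 0;

        /* 0:1 divided by 1 cannot fault, and overwrites the divider's
         * latched result with known-harmless values. */
        asm volatile ( "div %2" : "+a" (q), "+d" (r) : "r" (1U) );
    }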

The use of alternatives in IST context is believed safe now that it's done
in NMI context.

This is XSA-439 / CVE-2023-20588.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b5926c6ecf05c28ee99c6248c42d691ccbf0c315)

19 months agox86/amd: Introduce is_zen{1,2}_uarch() predicates
Andrew Cooper [Fri, 15 Sep 2023 11:13:51 +0000 (12:13 +0100)]
x86/amd: Introduce is_zen{1,2}_uarch() predicates

We already have 3 cases using STIBP as a Zen1/2 heuristic, and are about to
introduce a 4th.  Wrap the heuristic into a pair of predicates rather than
opencoding it, and the explanation of the heuristic, at each usage site.
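
The heuristic being wrapped is essentially the following (a sketch; the real
predicates live in Xen's AMD CPU code and are only valid after the caller's
vendor/family checks):

    /* On Fam17h, STIBP distinguishes the microarchitectures: Zen2 has
     * it, Zen1 does not. */
    #define is_zen1_uarch() (!boot_cpu_has(X86_FEATURE_AMD_STIBP))
    #define is_zen2_uarch()   boot_cpu_has(X86_FEATURE_AMD_STIBP)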

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit de1d265001397f308c5c3c5d3ffc30e7ef8c0705)

19 months agox86/spec-ctrl: Issue VERW during IST exit to Xen
Andrew Cooper [Wed, 13 Sep 2023 12:53:33 +0000 (13:53 +0100)]
x86/spec-ctrl: Issue VERW during IST exit to Xen

There is a corner case where e.g. an NMI hitting an exit-to-guest path after
SPEC_CTRL_EXIT_TO_* would have run the entire NMI handler *after* the VERW
flush to scrub potentially sensitive data from uarch buffers.

In order to compensate, issue VERW when exiting to Xen from an IST entry.

SPEC_CTRL_EXIT_TO_XEN already has two reads of spec_ctrl_flags off the stack,
and we're about to add a third.  Load the field into %ebx, and list the
register as clobbered.

%r12 has been arranged to be the ist_exit signal, so add this as an input
dependency and use it to identify when to issue a VERW.
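
In C-like terms, the added exit logic amounts to this (illustrative only;
the real implementation is asm in the exit macros, and the selector used
here is an arbitrary valid one):

    /* Scrub uarch buffers when leaving an IST context, if VERW flushing
     * is enabled on this CPU. */
    if ( ist_exit && (spec_ctrl_flags & SCF_verw) )
    {
        const uint16_t sel = __HYPERVISOR_DS;
        asm volatile ( "verw %0" :: "m" (sel) );
    }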

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3ee6066bcd737756b0990d417d94eddc0b0d2585)

19 months agox86/entry: Track the IST-ness of an entry for the exit paths
Andrew Cooper [Wed, 13 Sep 2023 11:20:12 +0000 (12:20 +0100)]
x86/entry: Track the IST-ness of an entry for the exit paths

Use %r12 to hold an ist_exit boolean.  This register is zero elsewhere in the
entry/exit asm, so it only needs setting in the IST path.

As this is subtle and fragile, add check_ist_exit() to be used in debugging
builds to cross-check that the ist_exit boolean matches the entry vector.

Write check_ist_exit() in C, because it's debug-only and the logic is more
complicated than I care to maintain in asm.

For now, we only need to use this signal in the exit-to-Xen path, but some
exit-to-guest paths happen in IST context too.  Check the correctness in all
exit paths to avoid the logic bit-rotting.
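
A sketch of the debug cross-check (the set of IST vectors and the exact
names are assumptions from context):

    void check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
    {
        unsigned int ev = regs->entry_vector;

        /* NMI, #DB, #DF and #MC are the IST entries; for every other
         * vector, ist_exit must be false. */
        bool is_ist = (ev == X86_EXC_NMI || ev == X86_EXC_DB ||
                       ev == X86_EXC_DF  || ev == X86_EXC_MC);

        ASSERT(ist_exit == is_ist);
    }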

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 21bdc25b05a0f8ab6bc73520a9ca01327360732c)

x86/entry: Partially revert IST-exit checks

The patch adding check_ist_exit() didn't account for the fact that
reset_stack_and_jump() is not an ABI-preserving boundary.  The IST-ness in
%r12 doesn't survive into the next context, and is a stale value.

This shows up in Gitlab CI for the Clang build:

  https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/5112783827

and in OSSTest for GCC 8:

  http://logs.test-lab.xenproject.org/osstest/logs/183045/test-amd64-amd64-xl-qemuu-debianhvm-amd64/serial-pinot0.log

There's no straightforward way to reconstruct the IST-exit-ness on the
exit-to-guest path after a context switch.  For now, we only need IST-exit on
the return-to-Xen path.

Fixes: 21bdc25b05a0 ("x86/entry: Track the IST-ness of an entry for the exit paths")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b57c800b79b96769ea3dcd6468578fa664d19f9)

19 months agox86/entry: Adjust restore_all_xen to hold stack_end in %r14
Andrew Cooper [Wed, 13 Sep 2023 12:48:16 +0000 (13:48 +0100)]
x86/entry: Adjust restore_all_xen to hold stack_end in %r14

All other SPEC_CTRL_{ENTRY,EXIT}_* helpers hold stack_end in %r14.  Adjust it
for consistency.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7aa28849a1155d856e214e9a80a7e65fffdc3e58)

19 months agox86/spec-ctrl: Improve all SPEC_CTRL_{ENTER,EXIT}_* comments
Andrew Cooper [Wed, 30 Aug 2023 19:11:50 +0000 (20:11 +0100)]
x86/spec-ctrl: Improve all SPEC_CTRL_{ENTER,EXIT}_* comments

... to better explain how they're used.

Doing so highlights that SPEC_CTRL_EXIT_TO_XEN is missing a VERW flush for the
corner case when e.g. an NMI hits late in an exit-to-guest path.

Leave a TODO, which will be addressed in subsequent patches which arrange for
VERW flushing to be safe within SPEC_CTRL_EXIT_TO_XEN.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f00557350dc7d0756551069803fc49c29184ca)

19 months agox86/spec-ctrl: Turn the remaining SPEC_CTRL_{ENTRY,EXIT}_* into asm macros
Andrew Cooper [Fri, 1 Sep 2023 10:38:44 +0000 (11:38 +0100)]
x86/spec-ctrl: Turn the remaining SPEC_CTRL_{ENTRY,EXIT}_* into asm macros

These have grown more complex over time, with some already having been
converted.

Provide full Requires/Clobbers comments, otherwise missing at this level of
indirection.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7125429aafb9e3c9c88fc93001fc2300e0ac2cc8)

19 months agox86/spec-ctrl: Fold DO_SPEC_CTRL_EXIT_TO_XEN into its single user
Andrew Cooper [Tue, 12 Sep 2023 16:03:16 +0000 (17:03 +0100)]
x86/spec-ctrl: Fold DO_SPEC_CTRL_EXIT_TO_XEN into its single user

With the SPEC_CTRL_EXIT_TO_XEN{,_IST} confusion fixed, it's now obvious that
there's only a single EXIT_TO_XEN path.  Fold DO_SPEC_CTRL_EXIT_TO_XEN into
SPEC_CTRL_EXIT_TO_XEN to simplify further fixes.

When merging labels, switch the name to .L\@_skip_sc_msr as "skip" on its own
is going to be too generic shortly.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 694bb0f280fd08a4377e36e32b84b5062def4de2)

19 months agox86/spec-ctrl: Fix confusion between SPEC_CTRL_EXIT_TO_XEN{,_IST}
Andrew Cooper [Tue, 12 Sep 2023 14:06:49 +0000 (15:06 +0100)]
x86/spec-ctrl: Fix confusion between SPEC_CTRL_EXIT_TO_XEN{,_IST}

c/s 3fffaf9c13e9 ("x86/entry: Avoid using alternatives in NMI/#MC paths")
dropped the only user, leaving behind the (incorrect) implication that Xen had
split exit paths.

Delete the unused SPEC_CTRL_EXIT_TO_XEN and rename SPEC_CTRL_EXIT_TO_XEN_IST
to SPEC_CTRL_EXIT_TO_XEN for consistency.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1c18d73774533a55ba9d1cbee8bdace03efdb5e7)

19 months agox86/AMD: extend Zenbleed check to models "good" ucode isn't known for
Jan Beulich [Wed, 23 Aug 2023 07:26:36 +0000 (09:26 +0200)]
x86/AMD: extend Zenbleed check to models "good" ucode isn't known for

Reportedly the AMD Custom APU 0405 found on SteamDeck, models 0x90 and
0x91 (quoting the respective Linux commit), is similarly affected. Put
another instance of our Zen1 vs Zen2 distinction checks in
amd_check_zenbleed(), forcing use of the chickenbit irrespective of
ucode version (building upon real hardware never surfacing a version of
0xffffffff).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 145a69c0944ac70cfcf9d247c85dee9e99d9d302)

20 months agoxen/arm: page: Handle cache flush of an element at the top of the address space
Stefano Stabellini [Tue, 5 Sep 2023 12:34:28 +0000 (14:34 +0200)]
xen/arm: page: Handle cache flush of an element at the top of the address space

The region that needs to be cleaned/invalidated may be at the top
of the address space. This means that 'end' (i.e. 'p + size') will
be 0 and therefore nothing will be cleaned/invalidated as the check
in the loop will always be false.

On Arm64, we only support up to a 48-bit virtual address space, so this
is not a concern there. However, for 32-bit, the mapcache is using the
last 2GB of the address space, therefore we may not properly
clean/invalidate some pages. This could lead to memory corruption or
data leakage (the scrubbed value may still sit in the cache when the
guest could read the memory directly and therefore read the old
content).

Rework invalidate_dcache_va_range(), clean_dcache_va_range(),
clean_and_invalidate_dcache_va_range() to handle a cache flush
with an element at the top of the address space.

This is CVE-2023-34321 / XSA-437.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Acked-by: Bertrand Marquis <bertrand.marquis@arm.com>
master commit: 9a216e92de9f9011097e4f1fb55ff67ba0a21704
master date: 2023-09-05 14:30:08 +0200

20 months agoUpdate Xen to version 4.16.5
Andrew Cooper [Mon, 7 Aug 2023 12:00:02 +0000 (13:00 +0100)]
Update Xen to version 4.16.5

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>