]> xenbits.xensource.com Git - xen.git/log
xen.git
2 months agox86/iommu: check for CMPXCHG16B when enabling IOMMU
Teddy Astie [Mon, 17 Feb 2025 12:29:27 +0000 (13:29 +0100)]
x86/iommu: check for CMPXCHG16B when enabling IOMMU

All hardware with VT-d/AMD-Vi has CMPXCHG16B support. Check this at
initialisation time, and otherwise refuse to use the IOMMU.

If the local APICs support x2APIC mode the IOMMU support for interrupt
remapping will be checked earlier using a specific helper.  If no support
for CX16 is detected by that earlier hook disable the IOMMU at that point
and prevent further poking for CX16 later in the boot process, which would
also fail.

There's a possible corner case when running virtualized, and the underlying
hypervisor exposing an IOMMU but no CMPXCHG16B support.  In which case
ignoring the IOMMU is fine, albeit the most natural would be for the
underlying hypervisor to also expose CMPXCHG16B support if an IOMMU is
available to the VM.

Note this change only introduces the checks, but doesn't remove the now
stale checks for CX16 support sprinkled in the IOMMU code.  Further changes
will take care of that.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2636fcdc15c707d5e097770133f0afb69e8d70c9
master date: 2025-01-27 13:05:11 +0100

2 months agox86/HVM: correct read/write split at page boundaries
Jan Beulich [Mon, 17 Feb 2025 12:29:05 +0000 (13:29 +0100)]
x86/HVM: correct read/write split at page boundaries

The MMIO cache is intended to have one entry used per independent memory
access that an insn does. This, in particular, is supposed to be
ignoring any page boundary crossing. Therefore when looking up a cache
entry, the access'es starting (linear) address is relevant, not the one
possibly advanced past a page boundary.

In order for the same offset-into-buffer variable to be usable in
hvmemul_phys_mmio_access() for both the caller's buffer and the cache
entry's it is further necessary to have the un-adjusted caller buffer
passed into there.

Fixes: 2d527ba310dc ("x86/hvm: split all linear reads and writes at page boundary")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 672894a11fe06e664a0ebfb600baf5dbb897b9e4
master date: 2025-01-24 10:15:56 +0100

2 months agox86/HVM: allocate emulation cache entries dynamically
Jan Beulich [Mon, 17 Feb 2025 12:28:28 +0000 (13:28 +0100)]
x86/HVM: allocate emulation cache entries dynamically

Both caches may need higher capacity, and the upper bound will need to
be determined dynamically based on CPUID policy (for AMX'es TILELOAD /
TILESTORE at least).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 23d60dbb0493b2f9ec1d89be5341eec2ee9dab32
master date: 2025-01-24 10:15:29 +0100

2 months agox86/HVM: correct MMIO emulation cache bounds check
Jan Beulich [Mon, 17 Feb 2025 12:28:13 +0000 (13:28 +0100)]
x86/HVM: correct MMIO emulation cache bounds check

To avoid overrunning the internal buffer we need to take the offset into
the buffer into account.

Fixes: d95da91fb497 ("x86/HVM: grow MMIO cache data size to 64 bytes")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e5339bb689dfa79a914c6c96e1d82d61e1ae3161
master date: 2025-01-23 11:14:48 +0100

2 months agox86/HVM: reduce recursion in linear_{read,write}()
Jan Beulich [Mon, 17 Feb 2025 12:27:22 +0000 (13:27 +0100)]
x86/HVM: reduce recursion in linear_{read,write}()

Let's make explicit what the compiler may or may not do on our behalf:
The 2nd of the recursive invocations each can fall through rather than
re-invoking the function. This will save us from adding yet another
parameter (or more) to the function, just for the recursive invocations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 18053054b7583810dd356efc8d7018bbc8720f36
master date: 2024-09-09 13:40:47 +0200

3 months agoxen/events: fix race with set_global_virq_handler()
Juergen Gross [Tue, 21 Jan 2025 08:30:59 +0000 (09:30 +0100)]
xen/events: fix race with set_global_virq_handler()

There is a possible race scenario between set_global_virq_handler()
and clear_global_virq_handlers() targeting the same domain, which
might result in that domain ending as a zombie domain.

In case set_global_virq_handler() is being called for a domain which
is just dying, it might happen that clear_global_virq_handlers() is
running first, resulting in set_global_virq_handler() taking a new
reference for that domain and entering in the global_virq_handlers[]
array afterwards. The reference will never be dropped, thus the domain
will never be freed completely.

This can be fixed by checking the is_dying state of the domain inside
the region guarded by global_virq_handlers_lock. In case the domain is
dying, handle it as if the domain wouldn't exist, which will be the
case in near future anyway.

Fixes: 87521589aa6a ("xen: allow global VIRQ handlers to be delegated to other domains")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4d8acc9c1cf14233dda21dd3a7791b5a84b0f6c3
master date: 2025-01-09 17:34:01 +0100

3 months agoxen/flask: Wire up XEN_DOMCTL_vuart_op
Michal Orzel [Tue, 21 Jan 2025 08:30:49 +0000 (09:30 +0100)]
xen/flask: Wire up XEN_DOMCTL_vuart_op

Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only VUART operation is initialization that
can occur only during domain creation.

Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011")
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 29daa72e4019aae92f857cf6e7e0c3ca8fb1483e
master date: 2025-01-08 13:05:38 +0100

3 months agox86emul: correct put_fpu()'s segment selector handling
Jan Beulich [Tue, 21 Jan 2025 08:30:25 +0000 (09:30 +0100)]
x86emul: correct put_fpu()'s segment selector handling

All selector fields under ctxt->regs are (normally) poisoned in the HVM
case, and the four ones besides CS and SS are potentially stale for PV.
Avoid using them in the hypervisor incarnation of the emulator, when
trying to cover for a missing ->read_segment() hook.

To make sure there's always a valid ->read_segment() handler for all HVM
cases, add a respective function to shadow code, even if it is not
expected for FPU insns to be used to update page tables.

Fixes: 0711b59b858a ("x86emul: correct FPU code/data pointers and opcode handling")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 645b8d48c78f5b6ffd6230873f9e3ced4e840acd
master date: 2025-01-08 11:02:16 +0100

3 months agox86emul: VCVT{,U}DQ2PD ignores embedded rounding
Jan Beulich [Tue, 21 Jan 2025 08:30:04 +0000 (09:30 +0100)]
x86emul: VCVT{,U}DQ2PD ignores embedded rounding

IOW we shouldn't raise #UD in that case. Be on the safe side though and
only encode fully legitimate forms into the stub to be executed.

Things weren't quite right for VCVT{,U}SI2SD either, in the attempt to
be on the safe side: Clearing EVEX.L'L isn't useful; it's EVEX.b which
primarily needs clearing. Also reflect the somewhat improved doc
situation in the comment there.

Fixes: ed806f373730 ("x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns")
Fixes: baf4a376f550 ("x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3709d1324aa140f064b9c68da37547f459f8e8d
master date: 2025-01-08 11:01:17 +0100

3 months agox86/amd: Misc setup for Fam1Ah processors
Andrew Cooper [Tue, 21 Jan 2025 08:29:49 +0000 (09:29 +0100)]
x86/amd: Misc setup for Fam1Ah processors

Fam1Ah is similar to Fam19h in these regards.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f29cc14de1d195bcd8312dcab2b5f8e634b57288
master date: 2025-01-06 18:01:32 +0000

3 months agox86/traps: Rework LER initialisation and support Zen5/Diamond Rapids
Andrew Cooper [Tue, 21 Jan 2025 08:29:34 +0000 (09:29 +0100)]
x86/traps: Rework LER initialisation and support Zen5/Diamond Rapids

AMD have always used the architectural MSRs for LER.  As the first processor
to support LER was the K7 (which was 32bit), we can assume it's presence
unconditionally in 64bit mode.

Intel are about to run out of space in Family 6 and start using 19.  It is
only the Pentium 4 which uses non-architectural LER MSRs.

percpu_traps_init(), which runs on every CPU, contains a lot of code which
should be init-only, and is the only reason why opt_ler can't be in initdata.

Write a brand new init_ler() which expects all future Intel and AMD CPUs to
continue using the architectural MSRs, and does all setup together.  Call it
from trap_init(), and remove the setup logic percpu_traps_init() except for
the single path configuring MSR_IA32_DEBUGCTLMSR.

Leave behind a warning if the user asked for LER and Xen couldn't enable it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 555866cb56002849014a1409ecdfa3f436c0c2c4
master date: 2025-01-06 12:24:05 +0000

3 months agox86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX
Andrew Cooper [Tue, 21 Jan 2025 08:28:42 +0000 (09:28 +0100)]
x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX

AMD have updated the SRSO whitepaper[1] with further information.  These
features exist on AMD Zen5 CPUs and are necessary for Xen to use.

The two features are in principle unrelated:

 * SRSO_U/S_NO is an enumeration saying that SRSO attacks can't cross the
   User(CPL3) / Supervisor(CPL<3) boundary.  i.e. Xen don't need to use
   IBPB-on-entry for PV64.  PV32 guests are explicitly unsupported for
   speculative issues, and excluded from consideration for simplicity.

 * SRSO_MSR_FIX is an enumeration identifying that the BP_SPEC_REDUCE bit is
   available in MSR_BP_CFG.  When set, SRSO attacks can't cross the host/guest
   boundary.  i.e. Xen don't need to use IBPB-on-entry for HVM.

Extend ibpb_calculations() to account for these when calculating
opt_ibpb_entry_{pv,hvm} defaults.  Add a `bp-spec-reduce=<bool>` option to
control the use of BP_SPEC_REDUCE, with it active by default.

Because MSR_BP_CFG is core-scoped with a race condition updating it, repurpose
amd_check_erratum_1485() into amd_check_bp_cfg() and calculate all updates at
once.

Xen also needs to to advertise SRSO_U/S_NO to guests to allow the guest kernel
to skip SRSO mitigations too:

 * This is trivial for HVM guests.  It is also is accurate for PV32 guests
   too, but we have already excluded them from consideration, and do so again
   here to simplify the policy logic.

 * As written, SRSO_U/S_NO does not help for the PV64 user->kernel boundary.
   However, after discussing with AMD, an implementation detail of having
   BP_SPEC_REDUCE active causes the PV64 user->kernel boundary to have the
   property described by SRSO_U/S_NO, so we can advertise SRSO_U/S_NO to
   guests when the BP_SPEC_REDUCE precondition is met.

Finally, fix a typo in the SRSO_NO's comment.

[1] https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a1746cd4434dd27ca2da8430dfb10edc76264bb3
master date: 2025-01-02 18:44:49 +0000

3 months agoxen/arch/x86: make objdump output user locale agnostic
Maximilian Engelhardt [Tue, 21 Jan 2025 08:28:28 +0000 (09:28 +0100)]
xen/arch/x86: make objdump output user locale agnostic

The objdump output is fed to grep, so make sure it doesn't change with
different user locales and break the grep parsing.
This problem was identified while updating xen in Debian and the fix is
needed for generating reproducible builds in varying environments.

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d729221ab74c5d2571e71501dc63838acbf752a
master date: 2024-12-30 21:40:37 +0000

3 months agotools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel
Marek Marczykowski-Górecki [Tue, 21 Jan 2025 08:28:13 +0000 (09:28 +0100)]
tools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel

Linux 6.12-rc2 fails to decompress with the current 128MiB, contrary to
the code comment. It results in a failure like this:

    domainbuilder: detail: xc_dom_kernel_file: filename="/var/lib/qubes/vm-kernels/6.12-rc2-1.1.fc37/vmlinuz"
    domainbuilder: detail: xc_dom_malloc_filemap    : 12104 kB
    domainbuilder: detail: xc_dom_module_file: filename="/var/lib/qubes/vm-kernels/6.12-rc2-1.1.fc37/initramfs"
    domainbuilder: detail: xc_dom_malloc_filemap    : 7711 kB
    domainbuilder: detail: xc_dom_boot_xen_init: ver 4.19, caps xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
    domainbuilder: detail: xc_dom_parse_image: called
    domainbuilder: detail: xc_dom_find_loader: trying multiboot-binary loader ...
    domainbuilder: detail: loader probe failed
    domainbuilder: detail: xc_dom_find_loader: trying HVM-generic loader ...
    domainbuilder: detail: loader probe failed
    domainbuilder: detail: xc_dom_find_loader: trying Linux bzImage loader ...
    domainbuilder: detail: _xc_try_lzma_decode: XZ decompression error: Memory usage limit reached
    xc: error: panic: xg_dom_bzimageloader.c:761: xc_dom_probe_bzimage_kernel unable to XZ decompress kernel: Invalid kernel
    domainbuilder: detail: loader probe failed
    domainbuilder: detail: xc_dom_find_loader: trying ELF-generic loader ...
    domainbuilder: detail: loader probe failed
    xc: error: panic: xg_dom_core.c:689: xc_dom_find_loader: no loader found: Invalid kernel
    libxl: error: libxl_dom.c:566:libxl__build_dom: xc_dom_parse_image failed

The important part: XZ decompression error: Memory usage limit reached

This looks to be related to the following change in Linux:
8653c909922743bceb4800e5cc26087208c9e0e6 ("xz: use 128 MiB dictionary and force single-threaded mode")

Fix this by increasing the block size to 256MiB. And remove the
misleading comment (from lack of better ideas).

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e6472d46680ccd2b804ad73c19042a5811d036f0
master date: 2024-12-19 17:33:54 +0000

3 months agox86emul: correct #UD check for AVX512-FP16 complex multiplications
Jan Beulich [Tue, 21 Jan 2025 08:27:45 +0000 (09:27 +0100)]
x86emul: correct #UD check for AVX512-FP16 complex multiplications

avx512_vlen_check()'s argument was inverted, while the surrounding
conditional wrongly forced the EVEX.L'L check for the scalar forms when
embedded rounding was in effect.

Fixes: d14c52cba0f5 ("x86emul: handle AVX512-FP16 complex multiplication insns")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a30d438ce58b70c5955f5d37f776086ab8f88623
master date: 2024-08-19 15:32:31 +0200

3 months agoupdate Xen version to 4.18.5-pre
Jan Beulich [Tue, 21 Jan 2025 08:27:05 +0000 (09:27 +0100)]
update Xen version to 4.18.5-pre

4 months agoupdate Xen version to 4.18.4 RELEASE-4.18.4
Jan Beulich [Thu, 19 Dec 2024 08:04:30 +0000 (09:04 +0100)]
update Xen version to 4.18.4

4 months agoMISRA: Unmark Rules 1.1 and 2.1 as clean following Eclair upgrade
Andrew Cooper [Tue, 17 Dec 2024 17:04:59 +0000 (17:04 +0000)]
MISRA: Unmark Rules 1.1 and 2.1 as clean following Eclair upgrade

Updating the Eclair runner has had knock-on effects with previously-clean
rules now flagging violations:

 - x86:   Rule 1.1, 1940 violations
 - ARM64: Rule 1.1, 725 violations, Rule 2.1, 255 violations

Fixes: 631f535a3d4f ("xen: update ECLAIR service identifiers from MC3R1 to MC3A2.")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 171cb318deaa0be786cc3af3599c72e8909e60f9)
[Xen 4.18 doesn't have R2.1 as clean.  However, D4.11 is unclean too]

4 months agoautomation/eclair_analysis: substitute deprecated service STD.emptrecd
Nicola Vetrini [Mon, 22 Apr 2024 13:12:47 +0000 (15:12 +0200)]
automation/eclair_analysis: substitute deprecated service STD.emptrecd

The ECLAIR service STD.emptrecd (which checks for empty structures) is being
deprecated; hence, as a preventive measure, STD.anonstct (which checks for
structures with no named members, an UB in C99) is used here; the latter being
a more general case than the previous one, this change does not affect the
analysis. This new service is already supported by the current version of
ECLAIR.

No functional change.

Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 93d6e6efa8942fb068df799527f36be406fcd2bc)

4 months agoxen: update ECLAIR service identifiers from MC3R1 to MC3A2.
Alessandro Zucchelli [Tue, 10 Dec 2024 10:37:23 +0000 (11:37 +0100)]
xen: update ECLAIR service identifiers from MC3R1 to MC3A2.

Rename all instances of ECLAIR MISRA C:2012 service identifiers,
identified by the prefix MC3R1, to use the prefix MC3A2, which
refers to MISRA C:2012 Amendment 2 guidelines.

This update is motivated by the need to upgrade ECLAIR GitLab runners
that use the new naming scheme for MISRA C:2012 Amendment 2 guidelines.

Changes to the docs/misra directory are needed in order to keep
comment-based deviation up to date.

Signed-off-by: Alessandro Zucchelli <alessandro.zucchelli@bugseng.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 631f535a3d4ffd66a270672f0f787d79f3bf38f8)
(cherry picked from commit 8b584c97f88724a390fd17b08b2735f488d5f980)

4 months agox86/io-apic: prevent early exit from i8259 loop detection
Roger Pau Monné [Tue, 17 Dec 2024 11:47:22 +0000 (12:47 +0100)]
x86/io-apic: prevent early exit from i8259 loop detection

Avoid exiting early from the loop when a pin that could be connected to the
i8259 is found, as such early exit would leave the EOI handler translation
array only partially allocated and/or initialized.

Otherwise on systems with multiple IO-APICs and an unmasked ExtINT pin on
any IO-APIC that's no the last one the following NULL pointer dereference
triggers:

(XEN) Enabling APIC mode.  Using 2 I/O APICs
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d040328046>] __ioapic_write_entry+0x83/0x95
[...]
(XEN) Xen call trace:
(XEN)    [<ffff82d040328046>] R __ioapic_write_entry+0x83/0x95
(XEN)    [<ffff82d04027464b>] F amd_iommu_ioapic_update_ire+0x1ea/0x273
(XEN)    [<ffff82d0402755a1>] F iommu_update_ire_from_apic+0xa/0xc
(XEN)    [<ffff82d040328056>] F __ioapic_write_entry+0x93/0x95
(XEN)    [<ffff82d0403283c1>] F arch/x86/io_apic.c#clear_IO_APIC_pin+0x7c/0x10e
(XEN)    [<ffff82d040328480>] F arch/x86/io_apic.c#clear_IO_APIC+0x2d/0x61
(XEN)    [<ffff82d0404448b7>] F enable_IO_APIC+0x2e3/0x34f
(XEN)    [<ffff82d04044c9b0>] F smp_prepare_cpus+0x254/0x27a
(XEN)    [<ffff82d04044bec2>] F __start_xen+0x1ce1/0x23ae
(XEN)    [<ffff82d0402033ae>] F __high_start+0x8e/0x90
(XEN)
(XEN) Pagetable walk from 0000000000000000:
(XEN)  L4[0x000] = 000000007dbfd063 ffffffffffffffff
(XEN)  L3[0x000] = 000000007dbfa063 ffffffffffffffff
(XEN)  L2[0x000] = 000000007dbcc063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 0000000000000000
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

Reported-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
Fixes: 86001b3970fe ('x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f38fd27c4ceadf7ec4e82e82d0731b6ea415c51e
master date: 2024-12-17 11:15:30 +0100

4 months agotools/ocaml: Specify rpath correctly for ocamlmklib
Andrii Sultanov [Mon, 16 Dec 2024 12:35:10 +0000 (13:35 +0100)]
tools/ocaml: Specify rpath correctly for ocamlmklib

ocamlmklib has special handling for C-like '-Wl,-rpath' option, but does
not know how to handle '-Wl,-rpath-link', as evidenced by warnings like:
"Unknown option
-Wl,-rpath-link=$HOME/xen/tools/ocaml/libs/eventchn/../../../../tools/libs/toollog"
Pass this option directly to the compiler with -ccopt instead.

Also pass -L directly to the linker with -ldopt. This prevents embedding absolute
paths from buildtime into binary's RPATH.

Fixes: f7b4e4558b42 ("tools/ocaml: Fix OCaml libs rules")
Reported-by: Fernando Rodrigues <alpha@sigmasquadron.net>
Tested-by: Fernando Rodrigues <alpha@sigmasquadron.net>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
master commit: bf8a209915804088c09ac6575bcca554450fa7e8
master date: 2024-12-11 10:45:08 +0000

4 months agolibs/guest: Fix migration compatibility with a security-patched Xen 4.13
Andrew Cooper [Mon, 16 Dec 2024 12:35:02 +0000 (13:35 +0100)]
libs/guest: Fix migration compatibility with a security-patched Xen 4.13

xc_cpuid_apply_policy() provides compatibility for migration of a pre-4.14 VM
where no CPUID data was provided in the stream.

It guesses the various max-leaf limits, based on what was true at the time of
writing, but this was not correctly adapted when speculative security issues
forced the advertisement of new feature bits.  Of note are:

 * LFENCE-DISPATCH, in leaf 0x80000021.eax
 * BHI-CTRL, in leaf 0x7[2].edx

In both cases, a VM booted on a security-patched Xen 4.13, and then migrated
on to any newer version of Xen on the same or compatible hardware would have
these features stripped back because Xen is still editing the cpu-policy for
sanity behind the back of the toolstack.

For VMs using BHI_DIS_S to mitigate Native-BHI, this resulted in a failure to
restore the guests MSR_SPEC_CTRL setting:

  (XEN) HVM d7v0 load MSR 0x48 with value 0x401 failed
  (XEN) HVM7 restore: failed to load entry 20/0 rc -6

Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Fixes: f3709b15fc86 ("x86/cpuid: Infrastructure for cpuid word 7:2.edx")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 28301682f492c1df2ff9c3e01a0aab6262bd925a
master date: 2024-12-03 12:20:41 +0000

4 months agoxen/Kconfig: livepatch-build-tools requires debug information
Roger Pau Monné [Mon, 16 Dec 2024 12:34:43 +0000 (13:34 +0100)]
xen/Kconfig: livepatch-build-tools requires debug information

The tools infrastructure used to build livepatches for Xen
(livepatch-build-tools) consumes some DWARF debug information present in
xen-syms to generate a livepatch (see livepatch-build script usage of readelf
-wi).

The current Kconfig defaults however will enable LIVEPATCH without DEBUG_INFO
on release builds, thus providing a default Kconfig selection that's not
suitable for livepatch-build-tools even when LIVEPATCH support is enabled,
because it's missing the DWARF debug section.

Fix by defaulting DEBUG_INFO to enabled when LIVEPATCH is.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 126b0a6e537ce1d486a29e35cfeec1f222a74d11
master date: 2024-12-02 15:22:05 +0100

4 months agox86emul: MOVBE requires a memory operand
Jan Beulich [Mon, 16 Dec 2024 12:34:19 +0000 (13:34 +0100)]
x86emul: MOVBE requires a memory operand

The reg-reg forms should cause #UD; they come into existence only with
APX, where MOVBE also extends BSWAP (for the latter not being "eligible"
to a REX2 prefix).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4c5d9a01f8fa81417a9c431e9624fb71361ec4f9
master date: 2024-12-02 09:50:14 +0100

5 months agobuild: Remove -fno-stack-protector-all from EMBEDDED_EXTRA_CFLAGS
Andrew Cooper [Wed, 27 Nov 2024 11:42:11 +0000 (12:42 +0100)]
build: Remove -fno-stack-protector-all from EMBEDDED_EXTRA_CFLAGS

This seems to have been introduced in commit f8beb54e2455 ("Disable PIE/SSP
features when building Xen, if GCC supports them.") in 2004.

However, neither GCC nor Clang appear to have ever supported taking the
negated form of -fstack-protector-all, meaning this been useless since its
introduction.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b661fe107d45713f4e1395f29180cc1ae50bba92
master date: 2024-11-25 13:44:56 +0000

5 months agox86/irq: fix calculation of max PV dom0 pIRQs
Roger Pau Monné [Wed, 27 Nov 2024 11:42:02 +0000 (12:42 +0100)]
x86/irq: fix calculation of max PV dom0 pIRQs

The current calculation of PV dom0 pIRQs uses:

n = min(fls(num_present_cpus()), dom0_max_vcpus());

The usage of fls() is wrong, as num_present_cpus() already returns the number
of present CPUs, not the bitmap mask of CPUs.

Fix by removing the usage of fls().

Fixes: 7e73a6e7f12a ('have architectures specify the number of PIRQs a hardware domain gets')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5c56361c618e5d05855fc73118c4655f998b8272
master date: 2024-11-25 11:33:06 +0100

5 months agox86/mm: ensure L2 is always freed if empty
Roger Pau Monné [Mon, 25 Nov 2024 11:19:08 +0000 (12:19 +0100)]
x86/mm: ensure L2 is always freed if empty

The current logic in modify_xen_mappings() allows for fully empty L2 tables to
not be freed and unhooked from the parent L3 if the last L2 slot is not
populated.

Ensure that even when an L2 slot is empty the logic to check whether the whole
L2 can be removed is not skipped.

Fixes: 4376c05c3113 ('x86-64: use 1GB pages in 1:1 mapping if available')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 41c80496084aa3601230e01c3bc579a0a6d8359a
master date: 2024-11-14 16:13:10 +0100

5 months agox86/setup: remove bootstrap_map_addr() usage of destroy_xen_mappings()
Roger Pau Monné [Mon, 25 Nov 2024 11:18:58 +0000 (12:18 +0100)]
x86/setup: remove bootstrap_map_addr() usage of destroy_xen_mappings()

bootstrap_map_addr() needs to be careful to not remove existing page-table
structures when tearing down mappings, as such pagetable structures might be
needed to fulfill subsequent mappings requests.  The comment ahead of the
function already notes that pagetable memory shouldn't be allocated.

Fix this by using map_pages_to_xen(), which does zap the page-table entries,
but does not free page-table structures even when empty.

Fixes: 4376c05c3113 ('x86-64: use 1GB pages in 1:1 mapping if available')
Signed-off-by: Roger Pau Monné <roger.pau@ctrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 73194b5701725e53d72b98e568484b2fccaf855c
master date: 2024-11-14 16:12:51 +0100

5 months agox86/mm: skip super-page alignment checks for non-present entries
Roger Pau Monné [Mon, 25 Nov 2024 11:18:38 +0000 (12:18 +0100)]
x86/mm: skip super-page alignment checks for non-present entries

INVALID_MFN is ~0, so by it having all bits as 1s it doesn't fulfill the
super-page address alignment checks for L3 and L2 entries.  Skip the alignment
checks if the new entry is a non-present one.

This fixes a regression introduced by 0b6b51a69f4d, where the switch from 0 to
INVALID_MFN caused all super-pages to be shattered when attempting to remove
mappings by passing INVALID_MFN instead of 0.

Fixes: 0b6b51a69f4d ('xen/mm: Switch map_pages_to_xen to use MFN typesafe')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/mm: fix alignment check for non-present entries

While the alignment of the mfn is not relevant for non-present entries, the
alignment of the linear address is.  Commit 5b52e1b0436f introduced a
regression by not checking the alignment of the linear address when the new
entry was a non-present one.

Fix by always checking the alignment of the linear address, non-present entries
must just skip the alignment check of the physical address.

Fixes: 5b52e1b0436f ('x86/mm: skip super-page alignment checks for non-present entries')
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5b52e1b0436f4adb784562f4d05ae67605ce8ce7
master date: 2024-11-14 16:12:35 +0100
master commit: b1ebb6461a027f07e4a844cae348fbd9cfabe984
master date: 2024-11-15 14:14:12 +0100

5 months agox86/mm: introduce helpers to detect super page alignment
Roger Pau Monné [Mon, 25 Nov 2024 11:18:26 +0000 (12:18 +0100)]
x86/mm: introduce helpers to detect super page alignment

Split the code that detects whether the physical and linear address of a
mapping request are suitable to be used in an L3 or L2 slot.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 97fb6fcf26e80b402aebe9fb31459ecdfd14a900
master date: 2024-11-14 16:12:12 +0100

5 months agox86emul: avoid double memory read for RORX
Jan Beulich [Mon, 25 Nov 2024 11:18:09 +0000 (12:18 +0100)]
x86emul: avoid double memory read for RORX

Originally only twobyte_table[0x3a] determined what part of generic
operand fetching (near the top of x86_emulate()) comes into play. When
ext0f3a_table[] was added, ->desc was updated to properly describe the
ModR/M byte's function. With that generic source operand fetching came
into play for RORX, rendering the explicit fetching in the respective
case block redundant (and wrong at the very least when MMIO with side
effects is accessed).

While there also make a purely cosmetic / documentary adjustment to
ext0f3a_table[]: RORX really is a 2-operand insn, MOV-like in that it
only writes its destination register.

Fixes: 9f7f5f6bc95b ("x86emul: add tables for 0f38 and 0f3a extension space")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 939a9e800c4156677c10c6cf08fde071e9b86eaf
master date: 2024-11-14 13:03:18 +0100

5 months agox86emul: ignore VEX.W for BMI{1,2} insns in 32-bit mode
Jan Beulich [Mon, 25 Nov 2024 11:17:49 +0000 (12:17 +0100)]
x86emul: ignore VEX.W for BMI{1,2} insns in 32-bit mode

While result values and other status flags are unaffected as long as we
can ignore the case of registers having their upper 32 bits non-zero
outside of 64-bit mode, EFLAGS.SF may obtain a wrong value when we
mistakenly re-execute the original insn with VEX.W set.

Note that guest the memory access, if any, is correctly carried out as
32-bit regardless of VEX.W. The emulator-local memory operand will be
accessed as a 64-bit quantity, but it is pre-initialised to zero so no
internal state can leak.

Fixes: 771daacd197a ("x86emul: support BMI1 insns")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1179d51dcb7d93111bfb35172c75eb5a73fe6a43
master date: 2024-11-14 13:00:57 +0100

5 months agox86/cpu-policy: Extend the guest max policy max leaf/subleaves
Andrew Cooper [Mon, 25 Nov 2024 11:16:34 +0000 (12:16 +0100)]
x86/cpu-policy: Extend the guest max policy max leaf/subleaves

We already have one migration case opencoded (feat.max_subleaf).  A more
recent discovery is that we advertise x2APIC to guests without ensuring that
we provide max_leaf >= 0xb.

In general, any leaf known to Xen can be safely configured by the toolstack if
it doesn't violate other constraints.

Therefore, introduce guest_common_{max,default}_leaves() to generalise the
special case we currently have for feat.max_subleaf, in preparation to be able
to provide x2APIC topology in leaf 0xb even on older hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa2d8318033e468a4ded1fc3d721dc3e019e449b
master date: 2024-10-30 17:34:32 +0000

5 months agox86/alternatives: do not BUG during apply
Roger Pau Monné [Mon, 25 Nov 2024 11:16:09 +0000 (12:16 +0100)]
x86/alternatives: do not BUG during apply

alternatives is used both at boot time, and when loading livepatch payloads.
While for the former it makes sense to panic, it's not useful for the later, as
for livepatches it's possible to fail to load the livepatch if alternatives
cannot be resolved and continue operating normally.

Relax the BUGs in _apply_alternatives() to instead return an error code.  The
caller will figure out whether the failures are fatal and panic.

Print an error message to provide some user-readable information about what
went wrong.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: aa5a06d5d6eda291afa3849ab143ffcd69b0d4e6
master date: 2024-09-26 14:18:03 +0100

5 months agoxen/livepatch: do Xen build-id check earlier
Roger Pau Monné [Mon, 25 Nov 2024 11:15:56 +0000 (12:15 +0100)]
xen/livepatch: do Xen build-id check earlier

The check against the expected Xen build ID should be done ahead of attempting
to apply the alternatives contained in the livepatch.

If the CPUID in the alternatives patching data is out of the scope of the
running Xen featureset the BUG() in _apply_alternatives() will trigger thus
bringing the system down.  Note the layout of struct alt_instr could also
change between versions.  It's also possible for struct exception_table_entry
to have changed format, hence leading to other kind of errors if parsing of the
payload is done ahead of checking if the Xen build-id matches.

Move the Xen build ID check as early as possible.  To do so introduce a new
check_xen_buildid() function that parses and checks the Xen build-id before
moving the payload.  Since the expected Xen build-id is used early to
detect whether the livepatch payload could be loaded, there's no reason to
store it in the payload struct, as a non-matching Xen build-id won't get the
payload populated in the first place.

Note printing the expected Xen build ID has part of dumping the payload
information is no longer done: all loaded payloads would have Xen build IDs
matching the running Xen, otherwise they would have failed to load.

Fixes: 879615f5db1d ('livepatch: Always check hypervisor build ID upon livepatch upload')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fa49f4be413cdabde5b2264fc85d2710b15ea691
master date: 2024-09-26 14:18:03 +0100

5 months agoxen/livepatch: simplify and unify logic in prepare_payload()
Roger Pau Monné [Mon, 25 Nov 2024 11:15:29 +0000 (12:15 +0100)]
xen/livepatch: simplify and unify logic in prepare_payload()

The following sections: .note.gnu.build-id, .livepatch.xen_depends and
.livepatch.depends are mandatory and ensured to be present by
check_special_sections() before prepare_payload() is called.

Simplify the logic in prepare_payload() by introducing a generic function to
parse the sections that contain a buildid.  Note the function assumes the
buildid related section to always be present.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 86d09d16dd74298b19a03df492d9503f20cfc17c
master date: 2024-09-26 14:18:03 +0100

5 months agoxen/livepatch: drop load_addr Elf section field
Roger Pau Monné [Mon, 25 Nov 2024 11:15:00 +0000 (12:15 +0100)]
xen/livepatch: drop load_addr Elf section field

The Elf loading logic will initially use the `data` section field to stash a
pointer to the temporary loaded data (from the buffer allocated in
livepatch_upload(), which is later relocated and the new pointer stashed in
`load_addr`.

Remove this dual field usage and use an `addr` uniformly.  Initially data will
point to the temporary buffer, until relocation happens, at which point the
pointer will be updated to the relocated address.

This avoids leaving a dangling pointer in the `data` field once the temporary
buffer is freed by livepatch_upload().

Note the `addr` field cannot retain the const attribute from the previous
`data`field, as there's logic that performs manipulations against the loaded
sections, like applying relocations or sorting the exception table.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 8c81423038f17f6cbc853dd35d69d50a4458f764
master date: 2024-09-26 14:18:03 +0100

5 months agox86/boot: Preserve the value clobbered by the load-base calculation
Frediano Ziglio [Mon, 25 Nov 2024 11:14:13 +0000 (12:14 +0100)]
x86/boot: Preserve the value clobbered by the load-base calculation

Right now, Xen clobbers the value at 0xffc when performing it's load-base
calculation.  We've got plenty of free registers at this point, so the value
can be preserved easily.

This fixes a real bug booting under Coreboot+SeaBIOS, where 0xffc happens to
be the cbmem pointer (e.g. Coreboot's dmesg ring, among other things).

However, there's also a better choice of memory location to use than 0xffc, as
all our supported boot protocols have a pointer to an info structure in %ebx.

Update the documentation to match.

Fixes: 1695e53851e5 ("x86/boot: Fix the boot time relocation calculations")
Fixes: d96bb172e8c9 ("x86/entry: Early PVH boot code")
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e58a2858d588ed57ca13200f3d3148d78ad0e491
master date: 2024-08-27 18:08:19 +0100

5 months agotools/ocaml: Fix the version embedded in META files
Andrew Cooper [Mon, 25 Nov 2024 11:13:47 +0000 (12:13 +0100)]
tools/ocaml: Fix the version embedded in META files

Xen 4.1 is more than a decade stale now.  Use the same mechanism as elsewhere
in the tree to get the current version number.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Reviewed-by: Edwin Török <edwin.torok@cloud.com>
master commit: 1965e9a930740b37637ac450f4752fd53edf63c4
master date: 2024-08-23 15:02:27 +0100

5 months agotools/ocaml: Drop the OCAMLOPTFLAG_G invocation
Andrew Cooper [Mon, 25 Nov 2024 11:13:28 +0000 (12:13 +0100)]
tools/ocaml: Drop the OCAMLOPTFLAG_G invocation

These days, `ocamlopt -h` asks you whether you meant --help instead, meaning
that the $(shell ) invocation here isn't going end up containing '-g'.

Make it unconditional, like it is in OCAMLCFLAGS already.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Reviewed-by: Edwin Török <edwin.torok@cloud.com>
master commit: 126293eae6485089471ebdfd91fe944a0274e613
master date: 2024-08-23 15:02:27 +0100

5 months agotools/ocaml: Fix OCaml libs rules
Andrii Sultanov [Mon, 25 Nov 2024 11:13:18 +0000 (12:13 +0100)]
tools/ocaml: Fix OCaml libs rules

This has been sat in the XenServer patchqueue for more than a decade.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8ffcf184affbc2ff1860dabe1757388509d5ed67
master date: 2024-08-23 10:48:52 +0100

5 months agotools/ocaml: Remove '-cc $(CC)' from OCAMLOPTFLAGS
Andrii Sultanov [Mon, 25 Nov 2024 11:13:02 +0000 (12:13 +0100)]
tools/ocaml: Remove '-cc $(CC)' from OCAMLOPTFLAGS

This option does not work as one might expect, and needs to be the full
compiler invocation including linking arguments to operate correctly.

See https://github.com/ocaml/ocaml/issues/12284 for more details.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d69635d27cb1d011ccdf419b2bc85cc7809ac92
master date: 2024-08-23 10:44:24 +0100

5 months agox86/shutdown: change default reboot method preference
Roger Pau Monné [Mon, 25 Nov 2024 11:11:24 +0000 (12:11 +0100)]
x86/shutdown: change default reboot method preference

The current logic to chose the preferred reboot method is based on the mode Xen
has been booted into, so if the box is booted from UEFI, the preferred reboot
method will be to use the ResetSystem() run time service call.

However, that method seems to be widely untested, and quite often leads to a
result similar to:

Hardware Dom0 shutdown: rebooting machine
----[ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C    ]----
CPU:    0
RIP:    e008:[<0000000000000017>] 0000000000000017
RFLAGS: 0000000000010202   CONTEXT: hypervisor
[...]
Xen call trace:
   [<0000000000000017>] R 0000000000000017
   [<ffff83207eff7b50>] S ffff83207eff7b50
   [<ffff82d0403525aa>] F machine_restart+0x1da/0x261
   [<ffff82d04035263c>] F apic_wait_icr_idle+0/0x37
   [<ffff82d040233689>] F smp_call_function_interrupt+0xc7/0xcb
   [<ffff82d040352f05>] F call_function_interrupt+0x20/0x34
   [<ffff82d04033b0d5>] F do_IRQ+0x150/0x6f3
   [<ffff82d0402018c2>] F common_interrupt+0x132/0x140
   [<ffff82d040283d33>] F arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x113/0x129
   [<ffff82d04028436c>] F arch/x86/acpi/cpu_idle.c#acpi_processor_idle+0x3eb/0x5f7
   [<ffff82d04032a549>] F arch/x86/domain.c#idle_loop+0xec/0xee

****************************************
Panic on CPU 0:
FATAL TRAP: vector = 6 (invalid opcode)
****************************************

Which in most cases does lead to a reboot, however that's unreliable.

Change the default reboot preference to prefer ACPI over UEFI if available and
not in reduced hardware mode.

This is in line to what Linux does, so it's unlikely to cause issues on current
and future hardware, since there's a much higher chance of vendors testing
hardware with Linux rather than Xen.

Add a special case for one Acer model that does require being rebooted using
ResetSystem().  See Linux commit 0082517fa4bce for rationale.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
master commit: d81dd31303513a1626973778d557a6493d86511a
master date: 2024-08-05 10:16:54 +0200

5 months agox86/viridian: Clarify some viridian logging strings
Alejandro Vallejo [Mon, 25 Nov 2024 11:11:15 +0000 (12:11 +0100)]
x86/viridian: Clarify some viridian logging strings

It's sadically misleading to show an error without letters and expect
the dmesg reader to understand it's in hex. The patch adds a 0x prefix
to all hex numbers that don't already have it.

On the one instance in which a boolean is printed as an integer, print
it as a decimal integer instead so it's 0/1 in the common case and not
misleading if it's ever not just that due to a bug.

While at it, rename VIRIDIAN CRASH to VIRIDIAN GUEST_CRASH. Every member
of a support team that looks at the message systematically believes
"viridian" crashed, which is absolutely not what goes on. It's the guest
asking the hypervisor for a sudden shutdown because it crashed, and
stating why.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: ba709d514aac1484f8a0825d3907dda11cf569bd
master date: 2024-07-30 11:51:23 +0200

5 months agotools/libxs: Stop playing with SIGPIPE
Andrew Cooper [Mon, 25 Nov 2024 11:11:04 +0000 (12:11 +0100)]
tools/libxs: Stop playing with SIGPIPE

It's very rude for a library to play with signals behind the back of the
application, no matter ones views on the default behaviour of SIGPIPE under
POSIX.  Even if the application doesn't care about the xenstored socket, it my
care about others.

This logic has existed since xenstore/xenstored was originally added in commit
29c9e570b1ed ("Add xenstore daemon and library") in 2005.

It's also unnecessary.  Pass MSG_NOSIGNAL when talking to xenstored over a
pipe (to avoid sucumbing to SIGPIPE if xenstored has crashed), and forgo any
playing with the signal disposition.

This has a side benefit of saving 2 syscalls per xenstore request.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: a17b6db9b00784b409c35e3017dc45aed1ec2bfb
master date: 2024-07-23 15:11:27 +0100

5 months agotools/libxs: Use writev()/sendmsg() instead of write()
Andrew Cooper [Mon, 25 Nov 2024 11:10:34 +0000 (12:10 +0100)]
tools/libxs: Use writev()/sendmsg() instead of write()

With the input data now conveniently arranged, use writev()/sendmsg() instead
of decomposing it into write() calls.

This causes all requests to be submitted with a single system call, rather
than at least two.  While in principle short writes can occur, the chances of
it happening are slim given that most xenbus comms are only a handful of
bytes.

Nevertheless, provide {writev,sendmsg}_exact() wrappers which take care of
resubmitting on EINTR or short write.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ebaeb0c64a6d363313e213eb9995f48307604ebb
master date: 2024-07-23 15:11:27 +0100

5 months agotools/libxs: Track whether we're using a socket or file
Andrew Cooper [Mon, 25 Nov 2024 11:10:16 +0000 (12:10 +0100)]
tools/libxs: Track whether we're using a socket or file

It will determine whether to use writev() or sendmsg().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 046efe529e82b8b999d8453d4ea49cb817c3f9b5
master date: 2024-07-23 15:11:27 +0100

5 months agotools/libxs: Rework xs_talkv() to take xsd_sockmsg within the iovec
Andrew Cooper [Mon, 25 Nov 2024 11:10:07 +0000 (12:10 +0100)]
tools/libxs: Rework xs_talkv() to take xsd_sockmsg within the iovec

We would like to writev() the whole outgoing message, but this is hard given
the current need to prepend the locally-constructed xsd_sockmsg.

Instead, have the caller provide xsd_sockmsg in iovec[0].  This in turn drops
the 't' and 'type' parameters from xs_talkv().

Note that xs_talkv() may alter the iovec structure.  This may happen when
writev() is really used under the covers, and it's preferable to having the
lower levels need to duplicate the iovec to edit it upon encountering a short
write.  xs_directory_part() is the only function impacted by this, and it's
easy to rearrange to be compatible.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: e2a93bed8b9e0f0c4779dcd4b10dc7ba2a959fbc
master date: 2024-07-23 15:11:27 +0100

5 months agotools/libxs: Fix length check in xs_talkv()
Andrew Cooper [Mon, 25 Nov 2024 11:09:51 +0000 (12:09 +0100)]
tools/libxs: Fix length check in xs_talkv()

If the sum of iov element lengths overflows, the XENSTORE_PAYLOAD_MAX can
pass, after which we'll write 4G of data with a good-looking length field, and
the remainder of the payload will be interpreted as subsequent commands.

Check each iov element length for XENSTORE_PAYLOAD_MAX before accmulating it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 42db2deb5e7617f0459b68cd73ab503938356186
master date: 2024-07-23 15:11:27 +0100

5 months agotools/misc: xen-hvmcrash: Inject #DF instead of overwriting RIP
Matthew Barnes [Mon, 25 Nov 2024 11:09:37 +0000 (12:09 +0100)]
tools/misc: xen-hvmcrash: Inject #DF instead of overwriting RIP

xen-hvmcrash would previously save records, overwrite the instruction
pointer with a bogus value, and then restore them to crash a domain
just enough to cause the guest OS to memdump.

This approach is found to be unreliable when tested on a guest running
Windows 10 x64, with some executions doing nothing at all.

Another approach would be to trigger NMIs. This approach is found to be
unreliable when tested on Linux (Ubuntu 22.04), as Linux will ignore
NMIs if it is not configured to handle such.

Injecting a double fault abort to all vCPUs is found to be more
reliable at crashing and invoking memdumps from Windows and Linux
domains.

This patch modifies the xen-hvmcrash tool to inject #DF to all vCPUs
belonging to the specified domain, instead of overwriting RIP.

Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e42e4d8c3e2c391e8ce990e0a2c76d6e9a17aad6
master date: 2024-07-22 10:50:03 +0100

5 months agoxen/x86: prevent addition of .note.gnu.property if livepatch is enabled
Roger Pau Monné [Tue, 12 Nov 2024 12:54:56 +0000 (13:54 +0100)]
xen/x86: prevent addition of .note.gnu.property if livepatch is enabled

GNU assembly that supports such feature will unconditionally add a
.note.gnu.property section to object files.  The content of that section can
change depending on the generated instructions.  The current logic in
livepatch-build-tools doesn't know how to deal with such section changing
as a result of applying a patch and rebuilding.

Since .note.gnu.property is not consumed by the Xen build, suppress its
addition when livepatch support is enabled.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 718400a54dcfcc8a11958a6d953168f50944f002
master date: 2024-11-11 13:19:45 +0100

5 months agox86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping
Roger Pau Monné [Tue, 12 Nov 2024 12:54:41 +0000 (13:54 +0100)]
x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping

When using AMD-Vi interrupt remapping the vector field in the IO-APIC RTE is
repurposed to contain part of the offset into the remapping table.  Previous to
2ca9fbd739b8 Xen had logic so that the offset into the interrupt remapping
table would match the vector.  Such logic was mandatory for end of interrupt to
work, since the vector field (even when not containing a vector) is used by the
IO-APIC to find for which pin the EOI must be performed.

A simple solution wold be to read the IO-APIC RTE each time an EOI is to be
performed, so the raw value of the vector field can be obtained.  However
that's likely to perform poorly.  Instead introduce a cache to store the
EOI handles when using interrupt remapping, so that the IO-APIC driver can
translate pins into EOI handles without having to read the IO-APIC RTE entry.
Note that to simplify the logic such cache is used unconditionally when
interrupt remapping is enabled, even if strictly it would only be required
for AMD-Vi.

Reported-by: Willi Junga <xenproject@ymy.be>
Suggested-by: David Woodhouse <dwmw@amazon.co.uk>
Fixes: 2ca9fbd739b8 ('AMD IOMMU: allocate IRTE entries instead of using a static mapping')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 86001b3970fea4536048607ea6e12541736c48e1
master date: 2024-11-05 10:36:53 +0000

5 months agolibxl: Use zero-ed memory for PVH acpi tables
Jason Andryuk [Tue, 12 Nov 2024 12:54:00 +0000 (13:54 +0100)]
libxl: Use zero-ed memory for PVH acpi tables

xl/libxl memory is leaking into a PVH guest through uninitialized
portions of the ACPI tables.

Use libxl_zalloc() to obtain zero-ed memory to avoid this issue.

This is XSA-464 / CVE-2024-45819.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Fixes: 14c0d328da2b ("libxl/acpi: Build ACPI tables for HVMlite guests")
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0bfe567b58f1182889dea9207103fc9d00baf414
master date: 2024-11-12 13:32:45 +0100

5 months agox86/hvm: Simplify stdvga_mem_accept() further
Andrew Cooper [Tue, 12 Nov 2024 12:53:40 +0000 (13:53 +0100)]
x86/hvm: Simplify stdvga_mem_accept() further

stdvga_mem_accept() is called on almost all IO emulations, and the
overwhelming likely answer is to reject the ioreq.  Simply rearranging the
expression yields an improvement:

  add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-57 (-57)
  Function                                     old     new   delta
  stdvga_mem_accept                            109      52     -57

which is best explained looking at the disassembly:

  Before:                                                    After:
  f3 0f 1e fa           endbr64                              f3 0f 1e fa           endbr64
  0f b6 4e 1e           movzbl 0x1e(%rsi),%ecx            |  0f b6 46 1e           movzbl 0x1e(%rsi),%eax
  48 8b 16              mov    (%rsi),%rdx                |  31 d2                 xor    %edx,%edx
  f6 c1 40              test   $0x40,%cl                  |  a8 30                 test   $0x30,%al
  75 38                 jne    <stdvga_mem_accept+0x48>   |  75 23                 jne    <stdvga_mem_accept+0x31>
  31 c0                 xor    %eax,%eax                  <
  48 81 fa ff ff 09 00  cmp    $0x9ffff,%rdx              <
  76 26                 jbe    <stdvga_mem_accept+0x41>   <
  8b 46 14              mov    0x14(%rsi),%eax            <
  8b 7e 10              mov    0x10(%rsi),%edi            <
  48 0f af c7           imul   %rdi,%rax                  <
  48 8d 54 02 ff        lea    -0x1(%rdx,%rax,1),%rdx     <
  31 c0                 xor    %eax,%eax                  <
  48 81 fa ff ff 0b 00  cmp    $0xbffff,%rdx              <
  77 0c                 ja     <stdvga_mem_accept+0x41>   <
  83 e1 30              and    $0x30,%ecx                 <
  75 07                 jne    <stdvga_mem_accept+0x41>   <
  83 7e 10 01           cmpl   $0x1,0x10(%rsi)               83 7e 10 01           cmpl   $0x1,0x10(%rsi)
  0f 94 c0              sete   %al                        |  75 1d                 jne    <stdvga_mem_accept+0x31>
  c3                    ret                               |  48 8b 0e              mov    (%rsi),%rcx
  66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)           |  48 81 f9 ff ff 09 00  cmp    $0x9ffff,%rcx
  8b 46 10              mov    0x10(%rsi),%eax            |  76 11                 jbe    <stdvga_mem_accept+0x31>
  8b 7e 14              mov    0x14(%rsi),%edi            |  8b 46 14              mov    0x14(%rsi),%eax
  49 89 d0              mov    %rdx,%r8                   |  48 8d 44 01 ff        lea    -0x1(%rcx,%rax,1),%rax
  48 83 e8 01           sub    $0x1,%rax                  |  48 3d ff ff 0b 00     cmp    $0xbffff,%rax
  48 8d 54 3a ff        lea    -0x1(%rdx,%rdi,1),%rdx     |  0f 96 c2              setbe  %dl
  48 0f af c7           imul   %rdi,%rax                  |  89 d0                 mov    %edx,%eax
  49 29 c0              sub    %rax,%r8                   <
  31 c0                 xor    %eax,%eax                  <
  49 81 f8 ff ff 09 00  cmp    $0x9ffff,%r8               <
  77 be                 ja     <stdvga_mem_accept+0x2a>   <
  c3                    ret                                  c3                    ret

By moving the "p->count != 1" check ahead of the
ioreq_mmio_{first,last}_byte() calls, both multiplies disappear along with a
lot of surrounding logic.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 08ffd8705d36c7c445df3ecee8ad9b8f8d65fbe0)

5 months agox86/HVM: drop stdvga's "lock" struct member
Jan Beulich [Tue, 12 Nov 2024 12:53:24 +0000 (13:53 +0100)]
x86/HVM: drop stdvga's "lock" struct member

No state is left to protect. It being the last field, drop the struct
itself as well. Similarly for then ending up empty, drop the .complete
handler.

This is part of XSA-463 / CVE-2024-45818

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b180a50326c8a2c171f37c1940a0fbbdcad4be90)

5 months agox86/HVM: drop stdvga's "vram_page[]" struct member
Jan Beulich [Tue, 12 Nov 2024 12:53:03 +0000 (13:53 +0100)]
x86/HVM: drop stdvga's "vram_page[]" struct member

No uses are left, hence its setup, teardown, and the field itself can
also go away. stdvga_deinit() is then empty and can be dropped as well.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3beb4baf2a0a2eef40d39eb7e6eecbfd36da5d14)

5 months agox86/HVM: drop stdvga's "{g,s}r_index" struct members
Jan Beulich [Tue, 12 Nov 2024 12:52:46 +0000 (13:52 +0100)]
x86/HVM: drop stdvga's "{g,s}r_index" struct members

No consumers are left, hence the producer and the fields themselves can
also go away. stdvga_outb() is then useless, rendering stdvga_out()
useless as well. Hence the entire I/O port intercept can go away.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 86c03372e107f5c18266a62281663861b1144929)

5 months agox86/HVM: drop stdvga's "sr[]" struct member
Jan Beulich [Tue, 12 Nov 2024 12:52:28 +0000 (13:52 +0100)]
x86/HVM: drop stdvga's "sr[]" struct member

No consumers are left, hence the producer and the array itself can also
go away. The static sr_mask[] is then orphaned and hence needs dropping,
too.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 7aba44bdd78aedb97703811948c3b69ccff85032)

5 months agox86/HVM: drop stdvga's "gr[]" struct member
Jan Beulich [Tue, 12 Nov 2024 12:52:08 +0000 (13:52 +0100)]
x86/HVM: drop stdvga's "gr[]" struct member

No consumers are left, hence the producer and the array itself can also
go away. The static gr_mask[] is then orphaned and hence needs dropping,
too.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b16c0966a17f19c0e55ed0b9baa28191d2590178)

5 months agox86/HVM: remove unused MMIO handling code
Jan Beulich [Tue, 12 Nov 2024 12:51:51 +0000 (13:51 +0100)]
x86/HVM: remove unused MMIO handling code

All read accesses are rejected by the ->accept handler, while writes
bypass the bulk of the function body. Drop the dead code, leaving an
assertion in the read handler.

A number of other static items (and a macro) are then unreferenced and
hence also need (want) dropping. The same applies to the "latch" field
of the state structure.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 89108547af1f230b72893b48351f9c1106189649)

5 months agox86/HVM: drop stdvga's "stdvga" struct member
Jan Beulich [Tue, 12 Nov 2024 12:51:30 +0000 (13:51 +0100)]
x86/HVM: drop stdvga's "stdvga" struct member

Two of its consumers are dead (in compile-time constant conditionals)
and the only remaining ones are merely controlling debug logging. Hence
the field is now pointless to set, which in particular allows to get rid
of the questionable conditional from which the field's value was
established (afaict 551ceee97513 ["x86, hvm: stdvga cache always on"]
had dropped too much of the earlier extra check that was there, and
quite likely further checks were missing).

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b740a9369e81bdda675a9780130ce2b9e75d4ec9)

5 months agox86/HVM: drop stdvga's "cache" struct member
Jan Beulich [Tue, 12 Nov 2024 12:50:54 +0000 (13:50 +0100)]
x86/HVM: drop stdvga's "cache" struct member

Since 68e1183411be ("libxc: introduce a xc_dom_arch for hvm-3.0-x86_32
guests"), HVM guests are built using XEN_DOMCTL_sethvmcontext, which
ends up disabling stdvga caching because of arch_hvm_load() being
involved in the processing of the request. With that the field is
useless, and can be dropped. Drop the helper functions manipulating /
checking as well right away, but leave the use sites of
stdvga_cache_is_enabled() with the hard-coded result the function would
have produced, to aid validation of subsequent dropping of further code.

This is part of XSA-463 / CVE-2024-45818

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 53b7246bdfb3c280adcdf714918e4decb7e108f4)

5 months agoCI: Stop building QEMU in general
Andrew Cooper [Sat, 13 Jul 2024 16:50:30 +0000 (17:50 +0100)]
CI: Stop building QEMU in general

We spend an awful lot of CI time building QEMU, even though most changes don't
touch the subset of tools/libs/ used by QEMU.  Some numbers taken at a time
when CI was otherwise quiet:

                       With     Without
  Alpine:              13m38s   6m04s
  Debian 12:           10m05s   8m10s
  OpenSUSE Tumbleweed: 11m40s   7m54s
  Ubuntu 24.04:        14m56s   8m06s

which is a >50% improvement in wallclock time in some cases.

The only build we have that needs QEMU is alpine-3.18-gcc-debug.  This is the
build deployed and used by the QubesOS ADL-* and Zen3p-* jobs.

Xilinx-x86_64 deploys it too, but is PVH-only and doesn't use QEMU.

QEMU is also built by CirrusCI for FreeBSD (fully Clang/LLVM toolchain).

This should help quite a lot with Gitlab CI capacity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit e305256e69b1c943db3ca20173da6ded3db2d252)

5 months agoCI: Resync .cirrus.yml for FreeBSD testing
Andrew Cooper [Mon, 11 Nov 2024 17:02:39 +0000 (17:02 +0000)]
CI: Resync .cirrus.yml for FreeBSD testing

Includes:
  commit ebb7c6b2faf2 ("cirrus-ci: update to FreeBSD 14.1 image")

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 months agoConfig: Fix MiniOS revision
Andrew Cooper [Mon, 11 Nov 2024 14:49:09 +0000 (14:49 +0000)]
Config: Fix MiniOS revision

This is the 4.19 revision, not the 4.18 one.

Fixes: 3c81457aa338 ("Config: Update MiniOS revision")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
5 months agoCI: Mark Archlinux/x86 as allowing failures
Andrew Cooper [Wed, 10 Jul 2024 12:38:52 +0000 (13:38 +0100)]
CI: Mark Archlinux/x86 as allowing failures

Archlinux is a rolling distro.  As a consequence, rebuilding the container
periodically changes the toolchain, and this affects all stable branches in
one go.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit 5e1773dc863d6e1fb4c1398e380bdfc754342f7b)

5 months agoConfig: Update MiniOS revision
Andrew Cooper [Wed, 30 Oct 2024 18:00:22 +0000 (18:00 +0000)]
Config: Update MiniOS revision

Commit ff13dabd3099 ("mman: correct m{,un}lock() definitions")

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 months agox86/boot: Fix XSM module handling during PVH boot
Daniel P. Smith [Tue, 29 Oct 2024 15:42:29 +0000 (16:42 +0100)]
x86/boot: Fix XSM module handling during PVH boot

As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.

This pattern has been in use since before PVH support was added.  This has
most likely gone unnoticed because no-one's tried using a detached Flask
policy in a PVH VM before.

Plumb the boot_info pointer down, replacing module_map and mbi.  Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.

As this is the final non-bi use of mbi in __start_xen(), make the pointer
unusable once bi has been established, to prevent new uses creeping back in.
This is a stopgap until mbi can be fully removed.

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6cf0aaeb8df951fb34679f0408461a5c67cb02c6
master date: 2024-10-23 18:14:24 +0100

6 months agox86/boot: Fix microcode module handling during PVH boot
Daniel P. Smith [Tue, 29 Oct 2024 15:42:16 +0000 (16:42 +0100)]
x86/boot: Fix microcode module handling during PVH boot

As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.

This pattern has been in use since before PVH support was added.  Inside a PVH
VM, it will go unnoticed as long as the microcode container parser doesn't
choke on the random data it finds.

The use within early_microcode_init() happens to be safe because it's prior to
move_xen().  microcode_init_cache() is after move_xen(), and therefore unsafe.

Plumb the boot_info pointer down, replacing module_map and mbi.  Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.

Note: microcode_scan_module() is still bogusly stashing a bootstrap_map()'d
      pointer in ucode_blob.data, which constitutes a different
      use-after-free, and only works in general because of a second bug.  This
      is unrelated to PVH, and needs untangling differently.

Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ddf63a252a6eae6e619ba2df9ad6b6f82e660c1
master date: 2024-10-23 18:14:24 +0100

6 months agoiommu/amd-vi: do not error if device referenced in IVMD is not behind any IOMMU
Roger Pau Monné [Tue, 29 Oct 2024 15:41:42 +0000 (16:41 +0100)]
iommu/amd-vi: do not error if device referenced in IVMD is not behind any IOMMU

IVMD table contains restrictions about memory which must be mandatory assigned
to devices (and which permissions it should use), or memory that should be
never accessible to devices.

Some hardware however contains ranges in IVMD that reference devices outside of
the IVHD tables (in other words, devices not behind any IOMMU).  Such mismatch
will cause Xen to fail in register_range_for_device(), ultimately leading to
the IOMMU being disabled, and Xen crashing as x2APIC support might be already
enabled and relying on the IOMMU functionality.

Relax IVMD parsing: allow IVMD blocks to reference devices not assigned to any
IOMMU.  It's impossible for Xen to fulfill the requirement in the IVMD block if
the device is not behind any IOMMU, but it's no worse than booting without
IOMMU support, and thus not parsing ACPI IVRS in the first place.

Reported-by: Willi Junga <xenproject@ymy.be>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2defb544900a11f93104ac68d2f8beba89d4bd02
master date: 2024-10-15 14:23:59 +0200

6 months agoxen/spinlock: Fix UBSAN "load of address with insufficient space" in lock_prof_init()
Andrew Cooper [Tue, 29 Oct 2024 15:41:30 +0000 (16:41 +0100)]
xen/spinlock: Fix UBSAN "load of address with insufficient space" in lock_prof_init()

UBSAN complains:

  (XEN) ================================================================================
  (XEN) UBSAN: Undefined behaviour in common/spinlock.c:794:10
  (XEN) load of address ffff82d040ae24c8 with insufficient space
  (XEN) for an object of type 'struct lock_profile *'
  (XEN) ----[ Xen-4.20-unstable  x86_64  debug=y ubsan=y  Tainted:   C    ]----

This shows up with GCC-14, but not with GCC-12.  I have not bisected further.

Either way, the types for __lock_profile_{start,end} are incorrect.

They are an array of struct lock_profile pointers.  Correct the extern's
types, and adjust the loop to match.

No practical change.

Reported-by: Andreas Glashauser <ag@andreasglashauser.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 542ac112fc68c66cfafc577e252404c21da4f75b
master date: 2024-10-14 16:14:26 +0100

6 months agox86/domctl: fix maximum number of MSRs in XEN_DOMCTL_{get,set}_vcpu_msrs
Roger Pau Monné [Tue, 29 Oct 2024 15:40:58 +0000 (16:40 +0100)]
x86/domctl: fix maximum number of MSRs in XEN_DOMCTL_{get,set}_vcpu_msrs

Since the addition of the MSR_AMD64_DR{1-4}_ADDRESS_MASK MSRs to the
msrs_to_send array, the calculations for the maximum number of MSRs that
the hypercall can handle is off by 4.

Remove the addition of 4 to the maximum number of MSRs that
XEN_DOMCTL_{set,get}_vcpu_msrs supports, as those are already part of the
array.

A further adjustment could be to subtract 4 from the maximum size if the DBEXT
CPUID feature is not exposed to the guest, but guest_{rd,wr}msr() will already
perform that check when fetching or loading the MSRs.  The maximum array is
used to indicate the caller of the buffer it needs to allocate in the get case,
and as an early input sanitation in the set case, using a buffer size slightly
lager than required is not an issue.

Fixes: 86d47adcd3c4 ('x86/msr: Handle MSR_AMD64_DR{0-3}_ADDRESS_MASK in the new MSR infrastructure')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c95cd5f9c5a8c1c6ab1b0b366d829fa8561958fd
master date: 2024-10-08 14:37:53 +0200

6 months agoioreq: don't wrongly claim "success" in ioreq_send_buffered()
Jan Beulich [Tue, 29 Oct 2024 15:40:46 +0000 (16:40 +0100)]
ioreq: don't wrongly claim "success" in ioreq_send_buffered()

Returning a literal number is a bad idea anyway when all other returns
use IOREQ_STATUS_* values. The function is dead on Arm, and mapping to
X86EMUL_OKAY is surely wrong on x86.

Fixes: f6bf39f84f82 ("x86/hvm: add support for broadcast of buffered ioreqs...")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 2e0b545b847df7d4feb07308d50bad708bd35a66
master date: 2024-10-08 14:36:27 +0200

6 months agox86/dpci: do not leak pending interrupts on CPU offline
Roger Pau Monné [Tue, 29 Oct 2024 15:39:43 +0000 (16:39 +0100)]
x86/dpci: do not leak pending interrupts on CPU offline

The current dpci logic relies on a softirq being executed as a side effect of
the cpu_notifier_call_chain() call in the code path that offlines the target
CPU.  However the call to cpu_notifier_call_chain() won't trigger any softirq
processing, and even if it did, such processing should be done after all
interrupts have been migrated off the current CPU, otherwise new pending dpci
interrupts could still appear.

Currently the ASSERT() in the cpu callback notifier is fairly easy to trigger
by doing CPU offline from a PVH dom0.

Solve this by instead moving out any dpci interrupts pending processing once
the CPU is dead.  This might introduce more latency than attempting to drain
before the CPU is put offline, but it's less complex, and CPU online/offline is
not a common action.  Any extra introduced latency should be tolerable.

Fixes: f6dd295381f4 ('dpci: replace tasklet with softirq')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 29555668b5725b9d5393b72bfe7ff9a3fa606714
master date: 2024-10-07 11:10:21 +0200

6 months agostubdom: Fix newlib build with GCC-14
Andrew Cooper [Tue, 29 Oct 2024 15:39:22 +0000 (16:39 +0100)]
stubdom: Fix newlib build with GCC-14

Based on a fix from OpenSUSE, but adjusted to be Clang-compatible too.  Pass
-Wno-implicit-function-declaration library-wide rather than using local GCC
pragmas.

Fix of copy_past_newline() to avoid triggering -Wstrict-prototypes.

Link: https://build.opensuse.org/request/show/1178775
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 444cb9350f2c1cc202b6b86176ddd8e57525e2d9
master date: 2024-10-03 10:07:25 +0100

6 months agox86/pv: Rename pv.iobmp_limit to iobmp_nr and clarify behaviour
Andrew Cooper [Tue, 29 Oct 2024 15:38:41 +0000 (16:38 +0100)]
x86/pv: Rename pv.iobmp_limit to iobmp_nr and clarify behaviour

Ever since it's introduction in commit 013351bd7ab3 ("Define new event-channel
and physdev hypercalls") in 2006, the public interface was named nr_ports
while the internal field was called iobmp_limit.

Rename the internal field to iobmp_nr to match the public interface, and
clarify that, when nonzero, Xen will read 2 bytes.

There isn't a perfect parallel with a real TSS, but iobmp_nr being 0 is the
paravirt "no IOPB" case, and it is important that no read occurs in this case.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 633ee8b2df963f7e5cb8de1219c1a48bfb4447f6
master date: 2024-10-01 14:58:18 +0100

6 months agox86/pv: Handle #PF correctly when reading the IO permission bitmap
Andrew Cooper [Tue, 29 Oct 2024 15:38:29 +0000 (16:38 +0100)]
x86/pv: Handle #PF correctly when reading the IO permission bitmap

The switch statement in guest_io_okay() is a very expensive way of
pre-initialising x with ~0, and performing a partial read into it.

However, the logic isn't correct either.

In a real TSS, the CPU always reads two bytes (like here), and any TSS limit
violation turns silently into no-access.  But, in-limit accesses trigger #PF
as usual.  AMD document this property explicitly, and while Intel don't (so
far as I can tell), they do behave consistently with AMD.

Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like
everything else in this file.  This removes code generation setting up
copy_from_user_hvm() (in the likely path even), and safety LFENCEs from
evaluate_nospec().

Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than
disallowing the IO port access.  This brings the behaviour better in line with
normal x86.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8a6c495d725408d333c1b47bb8af44615a5bfb18
master date: 2024-10-01 14:58:18 +0100

6 months agox86/pv: Rework guest_io_okay() to return X86EMUL_*
Andrew Cooper [Tue, 29 Oct 2024 15:38:17 +0000 (16:38 +0100)]
x86/pv: Rework guest_io_okay() to return X86EMUL_*

In order to fix a bug with guest_io_okay() (subsequent patch), rework
guest_io_okay() to take in an emulation context, and return X86EMUL_* rather
than a boolean.

For the failing case, take the opportunity to inject #GP explicitly, rather
than returning X86EMUL_UNHANDLEABLE.  There is a logical difference between
"we know what this is, and it's #GP", vs "we don't know what this is".

There is no change in practice as emulation is the final step on general #GP
resolution, but returning X86EMUL_UNHANDLEABLE would be a latent bug if a
subsequent action were to appear.

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7429e1cc071b0e20ea9581da4893fb9b2f6d21d4
master date: 2024-10-01 14:58:18 +0100

6 months agox86/traps: Re-enable interrupts after reading cr2 in the #PF handler
Alejandro Vallejo [Tue, 29 Oct 2024 15:37:32 +0000 (16:37 +0100)]
x86/traps: Re-enable interrupts after reading cr2 in the #PF handler

Hitting a page fault clobbers %cr2, so if a page fault is handled while
handling a previous page fault then %cr2 will hold the address of the
latter fault rather than the former. In particular, if a debug key
handler happens to trigger during #PF and before %cr2 is read, and that
handler itself encounters a #PF, then %cr2 will be corrupt for the outer #PF
handler.

This patch makes the page fault path delay re-enabling IRQs until %cr2
has been read in order to ensure it stays consistent.

A similar argument holds in additional cases, but they happen to be safe:
    * %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
    * MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.

While in the area, remove redundant q suffix to a movq in entry.S and
the space after the comma.

Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.")
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b06e76db7c35974f1b127762683e7852ca0c8e76
master date: 2024-10-01 09:45:49 +0200

6 months agox86/PV: simplify (and thus correct) guest accessor functions
Jan Beulich [Tue, 29 Oct 2024 15:37:12 +0000 (16:37 +0100)]
x86/PV: simplify (and thus correct) guest accessor functions

Taking a fault on a non-byte-granular insn means that the "number of
bytes not handled" return value would need extra care in calculating, if
we want callers to be able to derive e.g. exception context (to be
injected to the guest) - CR2 for #PF in particular - from the value. To
simplify things rather than complicating them, reduce inline assembly to
just byte-granular string insns. On recent CPUs that's also supposed to
be more efficient anyway.

For singular element accessors, however, alignment checks are added,
hence slightly complicating the code. Misaligned (user) buffer accesses
will now be forwarded to copy_{from,to}_guest_ll().

Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the
same way, as they're produced by mere re-processing of the same code.
Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their
comments made match reality; down the road we may want to change their
return types, e.g. to bool.

Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64")
Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 67a8e5721e1ea9c28526883036bf08fb2e8a8c9c
master date: 2024-10-01 09:44:55 +0200

6 months agoxen/ucode: Make Intel's microcode_sanity_check() stricter
Demi Marie Obenour [Tue, 29 Oct 2024 15:35:52 +0000 (16:35 +0100)]
xen/ucode: Make Intel's microcode_sanity_check() stricter

The SDM states that data size must be a multiple of 4, but Xen doesn't check
this propery.

This is liable to cause a later failures, but should be checked explicitly.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8752ad83e79754f8109457cff796e5f86f644348
master date: 2024-09-24 18:57:38 +0100

7 months agoxen/ucode: Fix buffer under-run when parsing AMD containers
Demi Marie Obenour [Tue, 24 Sep 2024 13:01:15 +0000 (15:01 +0200)]
xen/ucode: Fix buffer under-run when parsing AMD containers

The AMD container format has no formal spec.  It is, at best, precision
guesswork based on AMD's prior contributions to open source projects.  The
Equivalence Table has both an explicit length, and an expectation of having a
NULL entry at the end.

Xen was sanity checking the NULL entry, but without confirming that an entry
was present, resulting in a read off the front of the buffer.  With some
manual debugging/annotations this manifests as:

  (XEN) *** Buf ffff83204c00b19c, eq ffff83204c00b194
  (XEN) *** eq: 0c 00 00 00 44 4d 41 00 00 00 00 00 00 00 00 00 aa aa aa aa
                            ^-Actual buffer-------------------^
  (XEN) *** installed_cpu: 000c
  (XEN) microcode: Bad equivalent cpu table
  (XEN) Parsing microcode blob error -22

When loaded by hypercall, the 4 bytes interpreted as installed_cpu happen to
be the containing struct ucode_buf's len field, and luckily will be nonzero.

When loaded at boot, it's possible for the access to #PF if the module happens
to have been placed on a 2M boundary by the bootloader.  Under Linux, it will
commonly be the end of the CPIO header.

Drop the probe of the NULL entry; Nothing else cares.  A container without one
is well formed, insofar that we can still parse it correctly.  With this
dropped, the same container results in:

  (XEN) microcode: couldn't find any matching ucode in the provided blob!

Fixes: 4de936a38aa9 ("x86/ucode/amd: Rework parsing logic in cpu_request_microcode()")
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a8bf14f6f331d4f428010b4277b67c33f561ed19
master date: 2024-09-13 15:23:30 +0100

7 months agoblkif: reconcile protocol specification with in-use implementations
Roger Pau Monné [Tue, 24 Sep 2024 13:00:55 +0000 (15:00 +0200)]
blkif: reconcile protocol specification with in-use implementations

Current blkif implementations (both backends and frontends) have all slight
differences about how they handle the 'sector-size' xenstore node, and how
other fields are derived from this value or hardcoded to be expressed in units
of 512 bytes.

To give some context, this is an excerpt of how different implementations use
the value in 'sector-size' as the base unit for to other fields rather than
just to set the logical sector size of the block device:

                        │ sectors xenbus node │ requests sector_number │ requests {first,last}_sect
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
FreeBSD blk{front,back} │     sector-size     │      sector-size       │           512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
Linux blk{front,back}   │         512         │          512           │           512
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
QEMU blkback            │     sector-size     │      sector-size       │       sector-size
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
Windows blkfront        │     sector-size     │      sector-size       │       sector-size
────────────────────────┼─────────────────────┼────────────────────────┼───────────────────────────
MiniOS                  │     sector-size     │          512           │           512

An attempt was made by 67e1c050e36b in order to change the base units of the
request fields and the xenstore 'sectors' node.  That however only lead to more
confusion, as the specification now clearly diverged from the reference
implementation in Linux.  Such change was only implemented for QEMU Qdisk
and Windows PV blkfront.

Partially revert to the state before 67e1c050e36b while adjusting the
documentation for 'sectors' to match what it used to be previous to
2fa701e5346d:

 * Declare 'feature-large-sector-size' deprecated.  Frontends should not expose
   the node, backends should not make decisions based on its presence.

 * Clarify that 'sectors' xenstore node and the requests fields are always in
   512-byte units, like it was previous to 2fa701e5346d and 67e1c050e36b.

All base units for the fields used in the protocol are 512-byte based, the
xenbus 'sector-size' field is only used to signal the logic block size.  When
'sector-size' is greater than 512, blkfront implementations must make sure that
the offsets and sizes (despite being expressed in 512-byte units) are aligned
to the logical block size specified in 'sector-size', otherwise the backend
will fail to process the requests.

This will require changes to some of the frontends and backends in order to
properly support 'sector-size' nodes greater than 512.

Fixes: 2fa701e5346d ('blkif.h: Provide more complete documentation of the blkif interface')
Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 221f2748e8dabe8361b8cdfcffbeab9102c4c899
master date: 2024-09-12 14:04:56 +0200

7 months agoxen/x86/pvh: handle ACPI RSDT table in PVH Dom0 build
Stefano Stabellini [Tue, 24 Sep 2024 13:00:29 +0000 (15:00 +0200)]
xen/x86/pvh: handle ACPI RSDT table in PVH Dom0 build

Xen always generates an XSDT table even if the firmware only provided an
RSDT table.  Copy the RSDT header from the firmware table, adjusting the
signature, for the XSDT table when not provided by the firmware.

This is necessary to run Xen on QEMU.

Fixes: 1d74282c455f ('x86: setup PVHv2 Dom0 ACPI tables')
Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6e7f7a0c16c4d406bda6d4a900252ff63a7c5fad
master date: 2024-09-12 09:18:25 +0200

7 months agox86/HVM: properly reject "indirect" VRAM writes
Jan Beulich [Tue, 24 Sep 2024 13:00:07 +0000 (15:00 +0200)]
x86/HVM: properly reject "indirect" VRAM writes

While ->count will only be different from 1 for "indirect" (data in
guest memory) accesses, it being 1 does not exclude the request being an
"indirect" one. Check both to be on the safe side, and bring the ->count
part also in line with what ioreq_send_buffered() actually refuses to
handle.

Fixes: 3bbaaec09b1b ("x86/hvm: unify stdvga mmio intercept with standard mmio intercept")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb7cd0593d88c4b967a24bca8bd30591966676cd
master date: 2024-09-12 09:13:04 +0200

7 months agox86emul/test: fix build with gas 2.43
Jan Beulich [Tue, 24 Sep 2024 12:59:22 +0000 (14:59 +0200)]
x86emul/test: fix build with gas 2.43

Drop explicit {evex} pseudo-prefixes. New gas (validly) complains when
they're used on things other than instructions. Our use was potentially
ahead of macro invocations - see simd.h's "override" macro.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c09288298af881ea1bb568740deb2d2a06bcd41
master date: 2024-09-06 08:41:18 +0200

7 months agox86: fix UP build with gcc14
Jan Beulich [Tue, 24 Sep 2024 12:58:58 +0000 (14:58 +0200)]
x86: fix UP build with gcc14

The complaint is:

In file included from ././include/xen/config.h:17,
                 from <command-line>:
arch/x86/smpboot.c: In function ‘link_thread_siblings.constprop’:
./include/asm-generic/percpu.h:16:51: error: array subscript [0, 0] is outside array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds=]
   16 |     (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
./include/xen/compiler.h:140:29: note: in definition of macro ‘RELOC_HIDE’
  140 |     (typeof(ptr)) (__ptr + (off)); })
      |                             ^~~
arch/x86/smpboot.c:238:27: note: in expansion of macro ‘per_cpu’
  238 |     cpumask_set_cpu(cpu2, per_cpu(cpu_sibling_mask, cpu1));
      |                           ^~~~~~~
In file included from ./arch/x86/include/generated/asm/percpu.h:1,
                 from ./include/xen/percpu.h:30,
                 from ./arch/x86/include/asm/cpuid.h:9,
                 from ./arch/x86/include/asm/cpufeature.h:11,
                 from ./arch/x86/include/asm/system.h:6,
                 from ./include/xen/list.h:11,
                 from ./include/xen/mm.h:68,
                 from arch/x86/smpboot.c:12:
./include/asm-generic/percpu.h:12:22: note: while referencing ‘__per_cpu_offset’
   12 | extern unsigned long __per_cpu_offset[NR_CPUS];
      |                      ^~~~~~~~~~~~~~~~

Which I consider bogus in the first place ("array subscript [0, 0]" vs a
1-element array). Yet taking the experience from 99f942f3d410 ("Arm64:
adjust __irq_to_desc() to fix build with gcc14") I guessed that
switching function parameters to unsigned int (which they should have
been anyway) might help. And voilà ...

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a2de7dc4d845738e734b10fce6550c89c6b1092c
master date: 2024-09-04 16:09:28 +0200

7 months agoSUPPORT.md: split XSM from Flask
Jan Beulich [Tue, 24 Sep 2024 12:58:45 +0000 (14:58 +0200)]
SUPPORT.md: split XSM from Flask

XSM is a generic framework, which in particular is also used by SILO.
With this it can't really be experimental: Arm mandates SILO for having
a security supported configuration.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: d7c18b8720824d7efc39ffa7296751e1812865a9
master date: 2024-09-04 16:05:03 +0200

7 months agolibxl: Fix nul-termination of the return value of libxl_xen_console_read_line()
Javi Merino [Tue, 24 Sep 2024 12:58:13 +0000 (14:58 +0200)]
libxl: Fix nul-termination of the return value of libxl_xen_console_read_line()

When built with ASAN, "xl dmesg" crashes in the "printf("%s", line)"
call in main_dmesg().  ASAN reports a heap buffer overflow: an
off-by-one access to cr->buffer.

The readconsole sysctl copies up to count characters into the buffer,
but it does not add a null character at the end.  Despite the
documentation of libxl_xen_console_read_line(), line_r is not
nul-terminated if 16384 characters were copied to the buffer.

Fix this by asking xc_readconsolering() to fill the buffer up to size
- 1.  As the number of characters in the buffer is only needed in
libxl_xen_console_read_line(), make it a local variable there instead
of part of the libxl__xen_console_reader struct.

Fixes: 4024bae739cc ("xl: Add subcommand 'xl dmesg'")
Reported-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Javi Merino <javi.merino@cloud.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: bb03169bcb6ecccf372de1f6b9285cd519a26bb8
master date: 2024-09-03 10:53:44 +0100

7 months agoArm64: adjust __irq_to_desc() to fix build with gcc14
Jan Beulich [Tue, 24 Sep 2024 12:57:43 +0000 (14:57 +0200)]
Arm64: adjust __irq_to_desc() to fix build with gcc14

With the original code I observe

In function ‘__irq_to_desc’,
    inlined from ‘route_irq_to_guest’ at arch/arm/irq.c:465:12:
arch/arm/irq.c:54:16: error: array subscript -2 is below array bounds of ‘irq_desc_t[32]’ {aka ‘struct irq_desc[32]’} [-Werror=array-bounds=]
   54 |         return &this_cpu(local_irq_desc)[irq];
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

which looks pretty bogus: How in the world does the compiler arrive at
-2 when compiling route_irq_to_guest()? Yet independent of that the
function's parameter wants to be of unsigned type anyway, as shown by
a vast majority of callers (others use plain int when they really mean
non-negative quantities). With that adjustment the code compiles fine
again.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 99f942f3d410059dc223ee0a908827e928ef3592
master date: 2024-08-29 10:03:53 +0200

7 months agox86/HVM: correct partial HPET_STATUS write emulation
Jan Beulich [Tue, 24 Sep 2024 12:57:21 +0000 (14:57 +0200)]
x86/HVM: correct partial HPET_STATUS write emulation

For partial writes the non-written parts of registers are folded into
the full 64-bit value from what they're presently set to. That's wrong
to do though when the behavior is write-1-to-clear: Writes not
including to low 3 bits would unconditionally clear all ISR bits which
are presently set. Re-calculate the value to use.

Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41d358d2f9607ba37c216effa39b9f1bc58de69d
master date: 2024-08-29 10:02:20 +0200

7 months agox86/dom0: disable SMAP for PV domain building only
Roger Pau Monné [Tue, 24 Sep 2024 12:56:45 +0000 (14:56 +0200)]
x86/dom0: disable SMAP for PV domain building only

Move the logic that disables SMAP so it's only performed when building a PV
dom0, PVH dom0 builder doesn't require disabling SMAP.

The fixes tag is to account for the wrong usage of cpu_has_smap in
create_dom0(), it should instead have used
boot_cpu_has(X86_FEATURE_XEN_SMAP).  Fix while moving the logic to apply to PV
only.

While there also make cr4_pv32_mask __ro_after_init.

Fixes: 493ab190e5b1 ('xen/sm{e, a}p: allow disabling sm{e, a}p for Xen itself')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fb1658221a31ec1db33253a80001191391e73b17
master date: 2024-08-28 19:59:07 +0100

7 months agox86/x2APIC: correct cluster tracking upon CPUs going down for S3
Jan Beulich [Tue, 24 Sep 2024 12:56:16 +0000 (14:56 +0200)]
x86/x2APIC: correct cluster tracking upon CPUs going down for S3

Downing CPUs for S3 is somewhat special: Since we can expect the system
to come back up in exactly the same hardware configuration, per-CPU data
for the secondary CPUs isn't de-allocated (and then cleared upon re-
allocation when the CPUs are being brought back up). Therefore the
cluster_cpus per-CPU pointer will retain its value for all CPUs other
than the final one in a cluster (i.e. in particular for all CPUs in the
same cluster as CPU0). That, however, is in conflict with the assertion
early in init_apic_ldr_x2apic_cluster().

Note that the issue is avoided on Intel hardware, where we park CPUs
instead of bringing them down.

Extend the bypassing of the freeing to the suspend case, thus making
suspend/resume also a tiny bit faster.

Fixes: 2e6c8f182c9c ("x86: distinguish CPU offlining from CPU removal")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad3ff7b4279d16c91c23cda6e8be5bc670b25c9a
master date: 2024-08-26 10:30:40 +0200

7 months agox86emul: set (fake) operand size for AVX512CD broadcast insns
Jan Beulich [Tue, 24 Sep 2024 12:55:48 +0000 (14:55 +0200)]
x86emul: set (fake) operand size for AVX512CD broadcast insns

Back at the time I failed to pay attention to op_bytes still being zero
when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3-prefixed insns.

Fixes: 37ccca740c26 ("x86emul: support AVX512CD insns")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6fa6b7feaafd622db3a2f3436750cf07782f4c12
master date: 2024-08-23 09:12:24 +0200

7 months agox86emul: always set operand size for AVX-VNNI-INT8 insns
Jan Beulich [Tue, 24 Sep 2024 12:55:11 +0000 (14:55 +0200)]
x86emul: always set operand size for AVX-VNNI-INT8 insns

Unlike for AVX-VNNI-INT16 I failed to notice that op_bytes may still be
zero when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3- or F2-prefixed
insns.

Fixes: 842acaa743a5 ("x86emul: support AVX-VNNI-INT8")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d45687cca2450bfebe1dfbddb22f4f03c6fbc9cb
master date: 2024-08-23 09:11:15 +0200

7 months agox86/pv: Address Coverity complaint in check_guest_io_breakpoint()
Andrew Cooper [Tue, 24 Sep 2024 12:54:35 +0000 (14:54 +0200)]
x86/pv: Address Coverity complaint in check_guest_io_breakpoint()

Commit 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV
guests") caused a Coverity INTEGER_OVERFLOW complaint based on the reasoning
that width could be 0.

It can't, but digging into the code generation, GCC 8 and later (bisected on
godbolt) choose to emit a CSWITCH lookup table, and because the range (bottom
2 bits clear), it's a 16-entry lookup table.

So Coverity is understandable, given that GCC did emit a (dead) logic path
where width stayed 0.

Rewrite the logic.  Introduce x86_bp_width() which compiles to a single basic
block, which replaces the switch() statement.  Take the opportunity to also
make start and width be loop-scope variables.

No practical change, but it should compile better and placate Coverity.

Fixes: 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV guests")
Coverity-ID: 1616152
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6d41a9d8a12ff89adabdc286e63e9391a0481699
master date: 2024-08-21 23:59:19 +0100

7 months agox86/pv: Fix merging of new status bits into %dr6
Andrew Cooper [Tue, 24 Sep 2024 12:53:59 +0000 (14:53 +0200)]
x86/pv: Fix merging of new status bits into %dr6

All #DB exceptions result in an update of %dr6, but this isn't captured in
Xen's handling, and is buggy just about everywhere.

To begin resolving this issue, add a new pending_dbg field to x86_event
(unioned with cr2 to avoid taking any extra space, adjusting users to avoid
old-GCC bugs with anonymous unions), and introduce pv_inject_DB() to replace
the current callers using pv_inject_hw_exception().

Push the adjustment of v->arch.dr6 into pv_inject_event(), and use the new
x86_merge_dr6() rather than the current incorrect logic.

A key property is that pending_dbg is taken with positive polarity to deal
with RTM/BLD sensibly.  Most callers pass in a constant, but callers passing
in a hardware %dr6 value need to XOR the value with X86_DR6_DEFAULT to flip to
positive polarity.

This fixes the behaviour of the breakpoint status bits; that any left pending
are generally discarded when a new #DB is raised.  In principle it would fix
RTM/BLD too, except PV guests can't turn these capabilities on to start with.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db39fa4b27ea470902d4625567cb6fa24030ddfa
master date: 2024-08-21 23:59:19 +0100

7 months agox86/pv: Introduce x86_merge_dr6() and fix do_debug()
Andrew Cooper [Tue, 24 Sep 2024 12:53:22 +0000 (14:53 +0200)]
x86/pv: Introduce x86_merge_dr6() and fix do_debug()

Pretty much everywhere in Xen the logic to update %dr6 when injecting #DB is
buggy.  Introduce a new x86_merge_dr6() helper, and start fixing the mess by
adjusting the dr6 merge in do_debug().  Also correct the comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 54ef601a66e8d812a6a6a308f02524e81201825e
master date: 2024-08-21 23:59:19 +0100

7 months agoxl: fix incorrect output in "help" command
John E. Krokes [Tue, 24 Sep 2024 12:52:42 +0000 (14:52 +0200)]
xl: fix incorrect output in "help" command

In "xl help", the output includes this line:

 vsnd-list           List virtual display devices for a domain

This should obviously say "sound devices" instead of "display devices".

Signed-off-by: John E. Krokes <mag@netherworld.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 09226d165b57d919150458044c5b594d3d1dc23a
master date: 2024-08-14 08:49:44 +0200