xenbits.xensource.com Git - xen.git/log
xen.git
6 days ago xen: remove -N from the linker command line (stable-4.19, staging-4.19)
Roger Pau Monné [Tue, 29 Apr 2025 09:58:03 +0000 (11:58 +0200)]
xen: remove -N from the linker command line

It's unclear why -N is being used in the first place.  It was added by
commit 4676bbf96dc8 back in 2002 without any justification.

When building a PE image it's actually detrimental to forcefully set the
.text section as writable.  The GNU LD man page contains the following
warning regarding the -N option:

> Note: Although a writable text section is allowed for PE-COFF targets, it
> does not conform to the format specification published by Microsoft.

Remove the usage of -N uniformly on all architectures, assuming that the
addition was simply done as a copy and paste of the original x86 linking
rune.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: d444763f8ca556d0a67a4b933be303d346baef02
master date: 2025-04-23 16:12:25 +0200

6 days ago x86/intel: workaround several MONITOR/MWAIT errata
Roger Pau Monné [Tue, 29 Apr 2025 09:57:31 +0000 (11:57 +0200)]
x86/intel: workaround several MONITOR/MWAIT errata

There are several errata on Intel regarding the usage of the MONITOR/MWAIT
instructions, all having in common that stores to the monitored region
might not wake up the CPU.

Fix them by forcing the sending of an IPI for the affected models.

The Ice Lake issue has been reproduced internally on XenServer hardware,
and the fix does seem to prevent it.  The symptom was APs getting stuck in
the idle loop immediately after bring up, which in turn prevented the BSP
from making progress.  This would happen before the watchdog was
initialized, and hence the whole system would get stuck.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4aae4452efeee3d3bba092b875e37d1e7c8f6db9
master date: 2025-04-23 16:12:25 +0200

6 days ago compat/memory: avoid UB shifts in XENMEM_exchange handling
Jan Beulich [Tue, 29 Apr 2025 09:57:07 +0000 (11:57 +0200)]
compat/memory: avoid UB shifts in XENMEM_exchange handling

Add an early basic check, yielding the same error code as the more
thorough one the main handler would produce.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 560c51be8f6a88cde43c0a7c8be60158b5725982
master date: 2025-04-22 11:25:23 +0200

6 days ago x86emul: also clip repetition count for STOS
Jan Beulich [Tue, 29 Apr 2025 09:56:44 +0000 (11:56 +0200)]
x86emul: also clip repetition count for STOS

Like MOVS, INS, and OUTS, STOS also has a special purpose hook, where
the hook function may legitimately have the same expectation as to the
request not straddling address space start/end.
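
The clipping idea can be sketched as follows. This is an illustrative model only, not the emulator's code: the function name, its parameters, and the treatment of the direction flag are assumptions made for the example.

```python
def clip_reps(addr, bytes_per_rep, reps, addr_bits=64, df=False):
    """Clip a REP count so the accessed region stays inside the address space.

    df=False: addresses ascend, so the region must end at or before the top.
    df=True:  addresses descend, so the lowest element must start at or above 0.
    """
    if df:
        # Elements start at addr, addr - size, ...; the last start must be >= 0.
        max_reps = addr // bytes_per_rep + 1
    else:
        # Elements occupy [addr, addr + reps * size); the end must be <= 2**bits.
        max_reps = ((1 << addr_bits) - addr) // bytes_per_rep
    return min(reps, max_reps)


print(clip_reps(0xFFFFFFF0, 4, 10, addr_bits=32))  # 4
print(clip_reps(0x1000, 4, 10, addr_bits=32))      # 10
print(clip_reps(8, 4, 10, addr_bits=32, df=True))  # 3
```

A result of 0 (a single element that itself straddles the boundary) would need separate handling, which this sketch does not attempt.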

Fixes: 5dfe4aa4eeb6 ("x86_emulate: Do not request emulation of REP instructions beyond the")
Reported-by: Fabian Specht <f.specht@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8c5636b6c87777e6c2e4ffae28bffe1cfc189bfd
master date: 2025-04-22 11:24:20 +0200

6 days ago x86/HVM: update repeat count upon nested lin->phys failure
Jan Beulich [Tue, 29 Apr 2025 09:56:25 +0000 (11:56 +0200)]
x86/HVM: update repeat count upon nested lin->phys failure

For the X86EMUL_EXCEPTION case the repeat count must be correctly
propagated back. Since for the recursive invocation we use a local
helper variable, its value needs copying to the caller's one.

While there also correct the off-by-1 range in the comment ahead of the
function (strictly speaking for the "DF set" case we'd need to put
another, different range there as well).

Fixes: 53f87c03b4ea ("x86emul: generalize exception handling for rep_* hooks")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c07b16fd6e47782ebf1ee767cd07c1e2b4140f47
master date: 2025-04-17 10:01:19 +0200

6 days ago xen/rangeset: fix incorrect subtraction
Roger Pau Monné [Tue, 29 Apr 2025 09:55:34 +0000 (11:55 +0200)]
xen/rangeset: fix incorrect subtraction

Given the following rangeset operation:

{ [0, 1], [4, 5] } - { [3, 4] }

The current rangeset logic will output a rangeset:

{ [0, 2], [5, 5] }

This is incorrect, and has the additional undesirable property that the
resulting rangeset is expanded beyond the original one.

Fix this by making sure the bounds are correctly checked before modifying
the previous range.
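
For reference, the expected result of the operation above can be reproduced with a small model of inclusive-range subtraction. This is a sketch using plain lists of tuples, not the hypervisor's rangeset implementation; the function name is made up for the example.

```python
def subtract_range(ranges, lo, hi):
    """Remove the inclusive range [lo, hi] from a sorted list of disjoint
    inclusive (start, end) ranges and return the resulting list."""
    result = []
    for start, end in ranges:
        if end < lo or start > hi:
            # No overlap: the range is kept untouched.
            result.append((start, end))
            continue
        if start < lo:
            # Keep the part below the removed range.
            result.append((start, lo - 1))
        if end > hi:
            # Keep the part above the removed range.
            result.append((hi + 1, end))
    return result


print(subtract_range([(0, 1), (4, 5)], 3, 4))  # [(0, 1), (5, 5)]
```

The key point is that [0, 1] does not overlap [3, 4] and so must be left alone, rather than being extended to [0, 2].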

Fixes: 484a058c4828 ('Add auto-destructing per-domain rangeset data structure...')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e118fc98e7ae652a188d227bd7ea22f132724150
master date: 2025-04-11 12:20:10 +0200

6 days ago include: sort $(wildcard ...) results
Jan Beulich [Tue, 29 Apr 2025 09:55:16 +0000 (11:55 +0200)]
include: sort $(wildcard ...) results

The order of items is stored in .*.chk.cmd, and hence variations between
how items are ordered would result in re-invocation of the checking rule
during "make install-xen" despite that already having successfully run
earlier on. The difference can become noticeable when building (as
non-root) and installing (as root) use different GNU make versions: In 3.82
the sorting was deliberately undone, just for it to be restored in 4.3.
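
The effect can be modelled in a few lines. This is only an illustration of why ordering matters when a saved command string is compared against a regenerated one; the command name and file names are hypothetical, and the real comparison is done by the build system, not by code like this.

```python
def check_cmd(headers, sort_inputs):
    """Build the command string that would be saved for later comparison."""
    items = sorted(headers) if sort_inputs else list(headers)
    return "check-headers " + " ".join(items)


# The same files, in the order two make versions might return them from a glob.
build_order = ["public/xen.h", "public/memory.h", "public/domctl.h"]
install_order = ["public/domctl.h", "public/memory.h", "public/xen.h"]

# Unsorted: the strings differ, so the rule would be considered out of date.
print(check_cmd(build_order, False) == check_cmd(install_order, False))  # False
# Sorted: the strings match regardless of the glob order.
print(check_cmd(build_order, True) == check_cmd(install_order, True))    # True
```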

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ff835bbc8096a14ed1bffa235e25848c993f7240
master date: 2025-04-10 10:56:29 +0200

6 days ago xen: x86: irq: initialize irq desc in create_irq()
Volodymyr Babchuk [Tue, 29 Apr 2025 09:54:59 +0000 (11:54 +0200)]
xen: x86: irq: initialize irq desc in create_irq()

When building Xen with GCC 14.2.1 with the "-fcondition-coverage" option
or with "-Og", the compiler produces a false positive warning:

  arch/x86/irq.c: In function ‘create_irq’:
  arch/x86/irq.c:281:11: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
    281 |     ret = init_one_irq_desc(desc);
        |           ^~~~~~~~~~~~~~~~~~~~~~~
  arch/x86/irq.c:269:22: note: ‘desc’ was declared here
    269 |     struct irq_desc *desc;
        |                      ^~~~
  cc1: all warnings being treated as errors
  make[2]: *** [Rules.mk:252: arch/x86/irq.o] Error 1

While we have signed/unsigned comparison both in "for" loop and in
"if" statement, this still can't lead to use of uninitialized "desc",
as either loop will be executed at least once, or the function will
return early. So this is clearly a false positive warning due to a
bug [1] in GCC.

Initialize "desc" with NULL to make GCC happy.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119665

Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7a4484d90b3003171f1700e424ad45b931200ba6
master date: 2025-04-08 09:40:39 +0200

6 days ago x86/cpu: Validate CPUID leaf 0x2 EDX output
Ahmed S. Darwish [Tue, 29 Apr 2025 09:54:35 +0000 (11:54 +0200)]
x86/cpu: Validate CPUID leaf 0x2 EDX output

CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX.  For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.

Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and
ECX, but left EDX unchecked.

Validate EDX's most-significant bit as well.
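
The rule can be illustrated with a short sketch that checks bit 31 of all four registers. The function name and the sample register values are invented for the example; this is not the parsing code from intel.c.

```python
def valid_leaf2_registers(eax, ebx, ecx, edx):
    """Return the names of the registers whose descriptors may be used,
    i.e. those with the most significant bit (bit 31) clear."""
    regs = {"eax": eax, "ebx": ebx, "ecx": ecx, "edx": edx}
    return [name for name, val in regs.items() if (val & 0x80000000) == 0]


# EBX and EDX have bit 31 set, so only EAX and ECX carry valid descriptors.
print(valid_leaf2_registers(0x00FEFF01, 0x80000000, 0x00000000, 0x80FF0000))
# ['eax', 'ecx']
```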

Fixes: 1aa6feb63bfd ("Port CPU setup code from Linux 2.6")
Signed-off-by: Ahmed S. Darwish <darwi@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20250304085152.51092-3-darwi@linutronix.de
Use ARRAY_SIZE() though.

Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1881148215c6
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a47b44a8f0a58a6015faf6465921cd203f0b51d1
master date: 2025-04-08 09:37:38 +0200

6 days ago xen: vm_event: do not do vm_event_op for an invalid domain
Volodymyr Babchuk [Tue, 29 Apr 2025 09:53:56 +0000 (11:53 +0200)]
xen: vm_event: do not do vm_event_op for an invalid domain

A privileged domain can issue XEN_DOMCTL_vm_event_op with
op->domain == DOMID_INVALID. In this case the vm_event_domctl()
function will get NULL as its first parameter, and this will
cause a hypervisor panic when it tries to dereference the pointer.

Fix the issue by checking that a valid domain is passed in.

Fixes: 48b84249459f ("xen/vm-event: Drop unused u_domctl parameter from vm_event_domctl()")
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 6a884750f3b86a45ee5ffbd825c346fcbce86080
master date: 2025-04-08 09:36:38 +0200

6 days ago sched/null: avoid another crash after failed domU creation
Stewart Hildebrand [Tue, 29 Apr 2025 09:53:16 +0000 (11:53 +0200)]
sched/null: avoid another crash after failed domU creation

The following sequence of events may lead a debug build of Xen to crash
when using the null scheduler:

1. domain creation (e.g. d1) failed due to bad configuration
2. complete_domain_destroy() was deferred
3. domain creation (e.g. d2) succeeds

At this point, d2 is running, while the zombie d1 is not fully cleaned
up:

(XEN) Online Cpus: 0-3
(XEN) Cpupool 0:
(XEN) Cpus: 0-3
(XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
(XEN) Scheduler: null Scheduler (null)
(XEN)   cpus_free = 3
(XEN) Domain info:
(XEN)   Domain: 0
(XEN)     1: [0.0] pcpu=0
(XEN)     2: [0.1] pcpu=1
(XEN)   Domain: 1
(XEN)     3: [1.0] pcpu=2
(XEN)   Domain: 2
(XEN)     4: [2.0] pcpu=2

4. complete_domain_destroy() gets called for d1 and triggers the
following:

(XEN) Xen call trace:
(XEN)    [<00000a0000322ed4>] null.c#unit_deassign+0x2d8/0xb70 (PC)
(XEN)    [<00000a000032457c>] null.c#null_unit_remove+0x670/0xba8 (LR)
(XEN)    [<00000a000032457c>] null.c#null_unit_remove+0x670/0xba8
(XEN)    [<00000a0000336404>] sched_destroy_vcpu+0x354/0x8fc
(XEN)    [<00000a0000227324>] domain.c#complete_domain_destroy+0x11c/0x49c
(XEN)    [<00000a000029fbd0>] rcupdate.c#rcu_do_batch+0x94/0x3d0
(XEN)    [<00000a00002a10c0>] rcupdate.c#__rcu_process_callbacks+0x160/0x5f4
(XEN)    [<00000a00002a1e60>] rcupdate.c#rcu_process_callbacks+0xcc/0x1b0
(XEN)    [<00000a00002a3460>] softirq.c#__do_softirq+0x1f4/0x3d8
(XEN)    [<00000a00002a37c4>] do_softirq+0x14/0x1c
(XEN)    [<00000a0000465260>] traps.c#check_for_pcpu_work+0x30/0xb8
(XEN)    [<00000a000046bb08>] leave_hypervisor_to_guest+0x28/0x198
(XEN)    [<00000a0000409c84>] entry.o#guest_sync_slowpath+0xac/0xd8
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:383
(XEN) ****************************************

Fix by skipping unit_deassign() when the unit to be removed does not
match the pcpu's currently assigned unit.
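
The guard can be modelled as below. This is a toy sketch using a dictionary in place of the scheduler's per-pcpu data; the function name and the unit labels are assumptions, and the assignment of pcpu 2 to d2's unit is inferred from the dump and the failed assertion above.

```python
def unit_remove(pcpu_assignment, pcpu, unit):
    """Deassign `unit` from `pcpu` only if it is the unit currently assigned
    there. Returns True if a deassignment took place."""
    if pcpu_assignment.get(pcpu) != unit:
        # Stale unit (e.g. from a zombie domain): nothing to deassign.
        return False
    pcpu_assignment[pcpu] = None
    return True


assignment = {0: "d0v0", 1: "d0v1", 2: "d2v0"}
print(unit_remove(assignment, 2, "d1v0"))  # False, pcpu 2 still belongs to d2v0
print(assignment[2])                       # d2v0
```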

Fixes: c2eae2614c8f ("sched/null: avoid crash after failed domU creation")
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 54fe207f29f86c4226a62a4dd289f10d9d2abc40
master date: 2025-04-07 12:17:31 +0200

6 days ago x86/MTRR: hook mtrr_bp_restore() back up
Jan Beulich [Tue, 29 Apr 2025 09:52:22 +0000 (11:52 +0200)]
x86/MTRR: hook mtrr_bp_restore() back up

Unlike stated in the offending commit's description,
load_system_tables() wasn't the only thing left to retain from the
earlier restore_rest_processor_state(). Note that MTRR state was still
reloaded via mtrr_aps_sync_end(), but that happens quite a bit later in
the resume process.

While there also do Misra-related tidying for the function itself: The
function being used from assembly only means it doesn't need to have a
declaration, but wants to be asmlinkage.

Fixes: 4304ff420e51 ("x86/S3: Drop {save,restore}_rest_processor_state() completely")
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0414dedd6fde1a1c5c5e38dcbef4dad506e1398c
master date: 2025-04-03 09:39:13 +0200

6 days ago update Xen version to 4.19.3-pre
Jan Beulich [Tue, 29 Apr 2025 09:51:42 +0000 (11:51 +0200)]
update Xen version to 4.19.3-pre

3 weeks ago x86/ucode: Extend AMD digest checks to cover Zen5 CPUs
Andrew Cooper [Tue, 8 Apr 2025 16:09:15 +0000 (17:09 +0100)]
x86/ucode: Extend AMD digest checks to cover Zen5 CPUs

AMD have updated the SB-7033 advisory to include Zen5 CPUs.  Extend the digest
check to cover Zen5 too.

In practice, cover everything until further notice.

Observant readers may be wondering where the update to the digest list is.  At
the time of writing, no Zen5 patches are available via a verifiable channel.

Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html
Fixes: 630e8875ab36 ("x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit b63951467e964bcc927f823fc943e40069fac0c9)

x86/ucode: Extend warning about disabling digest check too

This was missed by accident.

Fixes: b63951467e96 ("x86/ucode: Extend AMD digest checks to cover Zen5 CPUs")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 59bb316ea89e7f9461690fe00547d7d2af96321d)

3 weeks ago x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode

Collisions have been found in the microcode signing algorithm used by AMD
Fam17h/19h CPUs, and now anyone can sign their own.

For more details, see:
  https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking
  https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html

As a stopgap mitigation, check the digest of patches against a table of blobs
with known provenance.  These are all Fam17h and Fam19h blobs included in
linux-firmware at the time of writing, specifically:

  https://git.kernel.org/firmware/linux-firmware/c/48bb90cceb882cab8e9ab692bc5779d3bf3a13b8

These checks can be opted out of by booting with ucode=no-digest-check, but
doing so is not recommended.
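
The general shape of such a check, a sorted table of known digests searched by patch ID, can be sketched as follows. The patch IDs, blob contents and function names are placeholders invented for the example, and the standard library's hashlib stands in for the hypervisor's own SHA2-256 code.

```python
import hashlib
from bisect import bisect_left


def build_table(known_blobs):
    """Return (patch_id, sha256 digest) pairs sorted by patch_id."""
    return sorted((pid, hashlib.sha256(blob).digest())
                  for pid, blob in known_blobs.items())


def digest_check(table, patch_id, blob):
    """Accept a blob only if its patch_id is in the table and its digest
    matches the recorded one."""
    ids = [pid for pid, _ in table]
    idx = bisect_left(ids, patch_id)          # binary search by patch ID
    if idx == len(ids) or ids[idx] != patch_id:
        return False                          # unknown patch: reject
    return hashlib.sha256(blob).digest() == table[idx][1]


table = build_table({0x1001: b"known blob A", 0x1002: b"known blob B"})
print(digest_check(table, 0x1002, b"known blob B"))   # True
print(digest_check(table, 0x1002, b"tampered blob"))  # False
```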

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 630e8875ab368b97cc7231aaf3809e3d7d5687e1)

Xen: CI fix from XSN-2

 * Add cf_check annotation to cmp_patch_id() used by bsearch().

Fixes: 630e8875ab36 ("x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)

3 weeks ago xen/lib: Introduce SHA2-256
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
xen/lib: Introduce SHA2-256

A future change will need to calculate SHA2-256 digests.  Introduce an
implementation in lib/, derived from Trenchboot which itself is derived from
Linux.

In order to be useful to other architectures, it is careful with endianness
and misaligned accesses as well as being more MISRA friendly, but is only
wired up for x86 in the short term.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 372af524411f5a013bcb0b117073d8d07c026563)

Xen: CI fix from XSN-2

 * Add U suffix to the K[] table to fix MISRA Rule 7.2 violations.

Fixes: 372af524411f ("xen/lib: Introduce SHA2-256")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)

3 weeks ago tools/libxl: do not use `-c -E` compiler options together
Roger Pau Monne [Mon, 7 Apr 2025 11:09:38 +0000 (13:09 +0200)]
tools/libxl: do not use `-c -E` compiler options together

It makes no sense to request preprocessor-only output and also request
object file generation.  Fix the _libxl.api-for-check target to only use
-E (preprocessor output).

Also Clang 20.0 reports an error if both options are used.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Fixes: 2862bf5b6c81 ('libxl: enforce prohibitions of internal callers')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
(cherry picked from commit a235f856e4bbd270b085590e1f5fc9599234dcdf)

3 weeks ago automation/eclair: Remove bespoke service B.UNEVALEFF
Nicola Vetrini [Thu, 10 Apr 2025 19:32:14 +0000 (21:32 +0200)]
automation/eclair: Remove bespoke service B.UNEVALEFF

The Eclair runners in GitlabCI have been updated.  This service is now
included, and redefining it results in an error.

No functional change.

Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit a43b0a770bdcf3933634c860049e4bd65854e472)

4 weeks ago update Xen version to 4.19.2 (RELEASE-4.19.2)
Jan Beulich [Fri, 4 Apr 2025 10:47:23 +0000 (12:47 +0200)]
update Xen version to 4.19.2

4 weeks ago CI: Add yet another HW runner
Marek Marczykowski-Górecki [Fri, 14 Mar 2025 03:06:26 +0000 (04:06 +0100)]
CI: Add yet another HW runner

This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.

This one has working S3, so add a test for it here.

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit debe8bf537ec2c69a4734393cd2b0c7f57f74c0c)

4 weeks ago automation/cirrus-ci: add smoke tests for the FreeBSD builds
Roger Pau Monne [Fri, 14 Mar 2025 12:37:46 +0000 (13:37 +0100)]
automation/cirrus-ci: add smoke tests for the FreeBSD builds

Introduce a basic set of smoke tests using the XTF selftest image, and run
them on QEMU.  Use the matrix keyword to create a different task for each
XTF flavor on each FreeBSD build.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7973cba4dbf72f6b963c780e9d1e0b99fd9622b9)

4 weeks ago automation/cirrus-ci: store XTF and Xen build artifacts
Roger Pau Monne [Fri, 14 Mar 2025 12:01:36 +0000 (13:01 +0100)]
automation/cirrus-ci: store XTF and Xen build artifacts

In preparation for adding some smoke tests that will consume those outputs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit c9de0e2ce41700132ffb9f25930fcca9da5e78af)

4 weeks ago automation/cirrus-ci: build XTF
Roger Pau Monne [Fri, 14 Mar 2025 11:16:19 +0000 (12:16 +0100)]
automation/cirrus-ci: build XTF

In preparation for using the XTF selftests to smoke test the FreeBSD based
Xen builds.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit a92c0a9888f14baf8e66a16e960bc524cc8194e6)

4 weeks ago automation/cirrus-ci: use matrix keyword to generate per-version build tasks
Roger Pau Monne [Sat, 15 Mar 2025 08:35:12 +0000 (09:35 +0100)]
automation/cirrus-ci: use matrix keyword to generate per-version build tasks

Move the current logic to use the matrix keyword to generate a task for
each version of FreeBSD we want to build Xen on.  The matrix keyword
however cannot be used in YAML aliases, so it needs to be explicitly used
inside of each task, which creates a bit of duplication.  At least abstract
the FreeBSD minor version numbers to avoid repetition of image names.

Note that the full build uses matrix over an env variable instead of using
it directly in image_family.  This is so that the alias can also be set
based on the FreeBSD version, in preparation for adding further tasks that
will depend on the full build having finished.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit b548a7dc4bb4202410983dd3ef7b5f1b469bdac3)

4 weeks ago automation/console.exp: do not assume expect is always at /usr/bin/
Roger Pau Monne [Mon, 17 Mar 2025 09:31:07 +0000 (10:31 +0100)]
automation/console.exp: do not assume expect is always at /usr/bin/

Instead use env to find the location of expect.

Additionally do not use the -f flag, as it's only meaningful when passing
arguments on the command line, which we never do for console.exp.  From the
expect 5.45.4 man page:

> The -f flag prefaces a file from which to read commands from.  The flag
> itself is optional as it is only useful when using the #! notation (see
> above), so  that other arguments may be supplied on the command line.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

4 weeks ago automation/cirrus-ci: store Xen Kconfig before doing a build
Roger Pau Monne [Fri, 14 Mar 2025 10:55:48 +0000 (11:55 +0100)]
automation/cirrus-ci: store Xen Kconfig before doing a build

In case the build fails or gets stuck, store the Kconfig file ahead of
starting the build.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 421c2bd58f35ede865fa81c82957ca5297e903a4)

4 weeks ago automation/cirrus-ci: update FreeBSD to 13.5
Roger Pau Monne [Fri, 14 Mar 2025 10:49:28 +0000 (11:49 +0100)]
automation/cirrus-ci: update FreeBSD to 13.5

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 5395cd7b892daf11cd80e4eec0cfa45c4b319d35)

4 weeks ago automation/cirrus-ci: add timestamps
Roger Pau Monne [Fri, 14 Mar 2025 10:44:45 +0000 (11:44 +0100)]
automation/cirrus-ci: add timestamps

Such timestamps can still be disabled from the Web UI using a tick box.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 1256159f3cfcfef475b94dda39f68a85702bba64)

4 weeks ago automation/cirrus-ci: store xen/.config as an artifact
Roger Pau Monne [Mon, 10 Mar 2025 17:41:57 +0000 (18:41 +0100)]
automation/cirrus-ci: store xen/.config as an artifact

Always store xen/.config as an artifact, renamed to xen-config to match
the naming used in the Gitlab CI tests.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 318659c818546b6415c2311339a53384dc20c427)

4 weeks ago CirrusCI: Use shallow clone
Andrew Cooper [Mon, 24 Feb 2025 15:36:11 +0000 (15:36 +0000)]
CirrusCI: Use shallow clone

This reduces the Clone step from ~50s to ~3s.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 96970b46e5e84fa2666f76f5b0972b826a3ffba4)

4 weeks ago automation/cirrus-ci: introduce FreeBSD randconfig builds
Roger Pau Monne [Thu, 16 Jan 2025 08:06:26 +0000 (09:06 +0100)]
automation/cirrus-ci: introduce FreeBSD randconfig builds

Add a new randconfig job for each FreeBSD version.  This requires some
rework of the template so common parts can be shared between the full and
the randconfig builds.  Such randconfig builds are relevant because FreeBSD
is the only tested system that has a full non-GNU toolchain.

While there replace the usage of the python311 package with python3, which is
already using 3.11, and remove the install of the plain python package for full
builds.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit c3f5d1bb40b57d467cb4051eafa86f5933ec9003)

4 weeks ago automation/cirrus-ci: update FreeBSD to 13.4
Roger Pau Monne [Thu, 16 Jan 2025 08:07:31 +0000 (09:07 +0100)]
automation/cirrus-ci: update FreeBSD to 13.4

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 850a263b7863ae9c22296a357a50573d5e6ae9d7)

4 weeks ago CI: Update to FreeBSD 14.2
Andrew Cooper [Tue, 3 Dec 2024 08:14:46 +0000 (08:14 +0000)]
CI: Update to FreeBSD 14.2

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 0cc8845fb9fd4300313da49374b6c12346889f9a)

4 weeks ago x86/P2M: synchronize fast and slow paths of p2m_get_page_from_gfn()
Jan Beulich [Wed, 2 Apr 2025 12:28:08 +0000 (14:28 +0200)]
x86/P2M: synchronize fast and slow paths of p2m_get_page_from_gfn()

Handling of both grants and foreign pages was different between the two
paths.

While permitting access to grants would be desirable, doing so would
require more involved handling; undo that for the time being. In
particular the page reference obtained would prevent the owning domain
from changing e.g. the page's type (after the grantee has released the
last reference of the grant). Instead perhaps another reference on the
grant would need obtaining. Which in turn would require determining
which grant that was.

Foreign pages in any event need permitting on both paths.

Introduce a helper function to be used on both paths, such that
respective checking differs in just the extra "to be unshared" condition
on the fast path.

While there adjust the sanity check for foreign pages: Don't leak the
reference on release builds when on a debug build the assertion would
have triggered. (Thanks to Roger for the suggestion.)

Fixes: 80ea7af17269 ("x86/mm: Introduce get_page_from_gfn()")
Fixes: 50fe6e737059 ("pvh dom0: add and remove foreign pages")
Fixes: cbbca7be4aaa ("x86/p2m: make p2m_get_page_from_gfn() handle grant case correctly")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a8325f981ce4ff8ac8bcc73735f357846b0a0fbb
master date: 2025-03-31 09:21:12 +0200

4 weeks ago ARM/vgic: Fix out-of-bounds accesses in vgic_mmio_write_sgir()
Andrew Cooper [Wed, 2 Apr 2025 12:27:56 +0000 (14:27 +0200)]
ARM/vgic: Fix out-of-bounds accesses in vgic_mmio_write_sgir()

The switch() statement is over bits 24:25 (unshifted) of the guest provided
value.  This makes case 0x3: dead, and not an implementation of the 4th
possible state.

A guest which writes (0x3 << 24) | (0xff << 16) to this register will skip the
early exit, then enter bitmap_for_each() with targets not bound by nr_vcpus.

If the guest has fewer than 8 vCPUs, bitmap_for_each() will read off the end
of d->vcpu[] and use the resulting vcpu pointer to ultimately derive irq, and
perform out-of-bounds writes.

Fix this by changing case 0x3 to default.
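
Why the 0x3 case is dead can be seen from a small model. This is a sketch of the comparison only: the constant and function names are invented, and the real handler is a C switch() statement rather than the membership test used here.

```python
MODE_MASK = 0x3 << 24  # bits 24:25 of the written value, left in place


def is_reserved_buggy(value):
    """Compare the unshifted field against 0x3: this can never match, because
    the field only takes the values 0, 1 << 24, 2 << 24 and 3 << 24."""
    return (value & MODE_MASK) == 0x3


def is_reserved_fixed(value):
    """Treat everything that is not one of the three known modes as reserved,
    which is what a 'default' label achieves."""
    known_modes = (0x0 << 24, 0x1 << 24, 0x2 << 24)
    return (value & MODE_MASK) not in known_modes


guest_value = (0x3 << 24) | (0xFF << 16)
print(is_reserved_buggy(guest_value))  # False, the early exit is skipped
print(is_reserved_fixed(guest_value))  # True
```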

Fixes: 08c688ca6422 ("ARM: new VGIC: Add SGIR register handler")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: be7f0cc651d8d02a95820792204c0558f1f29e03
master date: 2025-03-27 11:54:23 +0000

4 weeks ago tools/ocaml: Fix oxenstored build warning
Andrii Sultanov [Wed, 2 Apr 2025 12:27:30 +0000 (14:27 +0200)]
tools/ocaml: Fix oxenstored build warning

OCaml, in preparation for a renaming of the error string associated with
conversion failure in 'int_of_string' functions, started to issue this
warning:

  File "process.ml", line 440, characters 13-28:
  440 |   | (Failure "int_of_string")    -> reply_error "EINVAL"
                     ^^^^^^^^^^^^^^^
  Warning 52 [fragile-literal-pattern]: Code should not depend on the actual values of
  this constructor's arguments. They are only for information
  and may change in future versions. (See manual section 11.5)

Deal with this at the source, and instead create our own stable
ConversionFailure exception that's raised on the None case in
'int_of_string_opt'.

'c_int_of_string' is safe and does not raise such exceptions.

Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c11772277fe5f1b0874141a24554c2e3da2d9a6e
master date: 2025-02-25 13:30:55 +0000

5 weeks ago Arm/domctl: correct XEN_DOMCTL_vuart_op error return value
Jan Beulich [Thu, 27 Mar 2025 14:06:53 +0000 (15:06 +0100)]
Arm/domctl: correct XEN_DOMCTL_vuart_op error return value

copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
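
The pattern can be illustrated with a stand-in for the copy primitive. Everything here is a model: the Python copy_to_guest() below merely mimics the "returns bytes not copied" convention, and the two wrapper functions are hypothetical.

```python
import errno


def copy_to_guest(dest, src, writable_len):
    """Stand-in copy primitive: copies at most writable_len bytes and
    returns the number of bytes NOT copied (0 on full success)."""
    n = min(len(src), writable_len)
    dest[:n] = src[:n]
    return len(src) - n


def op_buggy(dest, src, writable_len):
    # Leaks the positive leftover count to the caller as if it were a status.
    return copy_to_guest(dest, src, writable_len)


def op_fixed(dest, src, writable_len):
    # Any incomplete copy is reported as -EFAULT; success is 0.
    return -errno.EFAULT if copy_to_guest(dest, src, writable_len) else 0


print(op_buggy(bytearray(8), b"abcdefgh", 5))                   # 3
print(op_fixed(bytearray(8), b"abcdefgh", 5) == -errno.EFAULT)  # True
```

A caller that treats any positive value as success would be misled by the first variant.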

Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 341c0df40bf73b0a5e27db27023ec400858a472d
master date: 2025-03-27 12:22:39 +0100

5 weeks ago x86/pmstat: correct get_cpufreq_para()'s error return value
Jan Beulich [Thu, 27 Mar 2025 14:06:45 +0000 (15:06 +0100)]
x86/pmstat: correct get_cpufreq_para()'s error return value

copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.

Fixes: 7542c4ff00f2 ("Add user PM control interface")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 855337ca4947508ffca23393e291c54b5307cc9a
master date: 2025-03-27 12:22:06 +0100

5 weeks ago x86/PVH: account for module command line length
Jan Beulich [Thu, 27 Mar 2025 14:06:33 +0000 (15:06 +0100)]
x86/PVH: account for module command line length

As per observation in practice, initrd->cmdline_pa is not normally zero.
Hence so far we always appended at least one byte. That alone may
already render insufficient the "allocation" made by find_memory().
Things would be worse when there's actually a (perhaps long) command
line.

Skip setup when the command line is empty. Amend the "allocation" size
by padding and actual size of module command line. Along these lines
also skip initrd setup when the initrd is zero size.

Fixes: 0ecb8eb09f9f ("x86/pvh: pass module command line to dom0")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 989584e532c9517a0f789e993f5f6744beaebe3e
master date: 2025-03-27 12:21:08 +0100

5 weeks ago x86/emul: Emulate %cr8 accesses
Andrew Cooper [Thu, 27 Mar 2025 14:06:09 +0000 (15:06 +0100)]
x86/emul: Emulate %cr8 accesses

Petr reports:

  (XEN) MMIO emulation failed (1): d12v1 64bit @ 0010:fffff8057ba7dfbf -> 45 0f 20 c2 ...

during introspection.

This is MOV %cr8, which is wired up for hvm_mov_{to,from}_cr(); the VMExit
fastpaths, but not for the full emulation slowpaths.

Xen's handling of %cr8 turns out to be quite wrong.  At a minimum, we need
storage for %cr8 separate to APIC_TPR, and to alter intercepts based on
whether the vLAPIC is enabled or not.  But that's more work than there is time
for in the short term, so make a stopgap fix.

Extend hvmemul_{read,write}_cr() with %cr8 cases.  Unlike hvm_mov_to_cr(),
hardware hasn't filtered out invalid values (#GP checks are ahead of
intercepts), so introduce X86_CR8_VALID_MASK.
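
A minimal sketch of the kind of filtering this implies, assuming that only
bits 3:0 of %cr8 (the task-priority field) are defined and everything else is
reserved.  The mask value and both helper names are illustrative assumptions,
not the code added by this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: only the 4-bit task-priority field (bits 3:0) is defined. */
#define X86_CR8_VALID_MASK UINT64_C(0xf)

/* Hypothetical helper: may 'val' be written to %cr8 without raising #GP? */
bool cr8_write_ok(uint64_t val)
{
    return !(val & ~X86_CR8_VALID_MASK);
}

/* Hypothetical helper: the priority class, defined only for accepted values. */
unsigned int cr8_priority(uint64_t val)
{
    assert(cr8_write_ok(val));   /* callers must have filtered already */
    return (unsigned int)val;
}
```

On the emulation slowpath nothing has filtered the value beforehand, so a
check of this shape has to run before the value is stored.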

Reported-by: Petr Beneš <w1benny@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 14fd9b5642cd4805b49fbe716bf2cd577e724169
master date: 2025-03-26 11:54:59 +0000

5 weeks agox86/emul: Rearrange the logic in hvmemul_{read,write}_cr()
Andrew Cooper [Thu, 27 Mar 2025 14:05:35 +0000 (15:05 +0100)]
x86/emul: Rearrange the logic in hvmemul_{read,write}_cr()

In hvmemul_read_cr(), make the TRACE()/X86EMUL_OKAY path common in preparation
for adding a %cr8 case.  Use a local 'val' variable instead of always
operating on a dereferenced pointer.

In both, calculate curr once.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b7264a15c28d30bb994ec9e58eba38932be231ec
master date: 2025-03-26 11:54:59 +0000

5 weeks agox86/PVH: expose OEMx ACPI tables to Dom0
Jan Beulich [Thu, 27 Mar 2025 14:05:11 +0000 (15:05 +0100)]
x86/PVH: expose OEMx ACPI tables to Dom0

What they contain we don't know, but we can't sensibly hide them. On my
Skylake system OEM1 (with a description of "INTEL  CPU EIST") is what
contains all the _PCT, _PPC, and _PSS methods, i.e. about everything
needed for cpufreq. (_PSD interestingly are in an SSDT there.)

Further OEM2 there has a description of "INTEL  CPU  HWP", while OEM4
has "INTEL  CPU  CST". Pretty clearly all three need exposing for
cpufreq and cpuidle to work.

Fixes: 8b1a5268daf0 ("pvh/dom0: whitelist PVH Dom0 ACPI tables")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6378909b41c40187a79df1d38ca4791b34393d67
master date: 2025-03-26 12:32:03 +0100

5 weeks agoxenpm: sanitize allocations in show_cpufreq_para_by_cpuid()
Jan Beulich [Thu, 27 Mar 2025 14:04:48 +0000 (15:04 +0100)]
xenpm: sanitize allocations in show_cpufreq_para_by_cpuid()

malloc(), when passed zero size, may return NULL (the behavior is
implementation defined). Mirror the ->gov_num check to the other two
allocations as well. Don't chance then actually using a NULL in
print_cpufreq_para().
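
A sketch of the guarded-allocation pattern, assuming a helper that treats a
zero element count as "nothing to allocate" rather than as a failure.  The
name alloc_array() is hypothetical and does not exist in xenpm:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate 'num' elements of 'size' bytes each.
 * On success *out holds the buffer, or NULL when num is zero.  Returns -1
 * only when a non-empty allocation could not be satisfied.
 */
int alloc_array(void **out, size_t num, size_t size)
{
    assert(out && size);

    *out = NULL;
    if (num == 0)
        return 0;                 /* nothing to allocate, not an error */
    if (num > SIZE_MAX / size)
        return -1;                /* num * size would overflow */

    *out = malloc(num * size);
    return *out ? 0 : -1;
}
```

With this shape a NULL result is ambiguous only if the caller ignores the
count, so consumers must check the count before dereferencing.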

Fixes: 75e06d089d48 ("xenpm: add cpu frequency control interface, through which user can")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 6c0dc87bb0e08fb31a68bf4c4149a18b92628f14
master date: 2025-03-26 12:30:57 +0100

5 weeks agox86/boot: Simplify the expression for extra allocation space
Andrew Cooper [Thu, 27 Mar 2025 14:04:03 +0000 (15:04 +0100)]
x86/boot: Simplify the expression for extra allocation space

The expression for one parameter of find_memory() is already complicated and
about to become more so.  Break it out into a new variable, and express it in
an easier-to-follow way.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ce703c84df1cb279605b0a85a45c6419464a16e8
master date: 2025-03-21 11:52:39 +0000

5 weeks agoxen/compiler: Fix the position of the visibility pragma
Andrew Cooper [Thu, 27 Mar 2025 14:03:56 +0000 (15:03 +0100)]
xen/compiler: Fix the position of the visibility pragma

This needs to be ahead of everything.  Right now, it is after xen/init.h being
included for -DINIT_SECTIONS_ONLY

  # 1 "./include/xen/compiler.h" 1
  # 83 "./include/xen/compiler.h"
  # 1 "./include/xen/init.h" 1
  # 62 "./include/xen/init.h"
  typedef int (*initcall_t)(void);
  typedef void (*exitcall_t)(void);
  # 72 "./include/xen/init.h"
  void do_presmp_initcalls(void);
  void do_initcalls(void);
  # 84 "./include/xen/compiler.h" 2
  # 122 "./include/xen/compiler.h"
  #pragma GCC visibility push(hidden)

Fixes: 84c4461b7d3a ("Force out-of-line instances of inline functions into .init.text in init-only code")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ab7ce0c8ed90f729a186babd87e3cd1fbed8ab98
master date: 2025-03-21 11:52:39 +0000

5 weeks agox86/vga: fix mapping of the VGA text buffer
Roger Pau Monné [Thu, 27 Mar 2025 14:03:20 +0000 (15:03 +0100)]
x86/vga: fix mapping of the VGA text buffer

The call to ioremap_wc() in video_init() will always fail, because
video_init() is called ahead of vm_init_type(), and so the underlying
__vmap() call will fail to allocate the linear address space.

Fix by reverting to the previous behavior and using __va() for the VGA text
buffer, as it's below the 1MB boundary, and thus always mapped in the
directmap.

Fixes: 81d195c6c0e2 ('x86: introduce ioremap_wc()')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1ca5f69e35548e5196eadb329e5a3976821dc982
master date: 2025-03-20 17:16:10 +0100

5 weeks agox86/xlat: fix UB pointer arithmetic in COMPAT_ARG_XLAT_VIRT_BASE
Roger Pau Monné [Thu, 27 Mar 2025 14:02:49 +0000 (15:02 +0100)]
x86/xlat: fix UB pointer arithmetic in COMPAT_ARG_XLAT_VIRT_BASE

UBSAN complains with:

UBSAN: Undefined behaviour in common/compat/memory.c:90:9
pointer operation overflowed ffff820080000000 to 0000020080000000
[...]
Xen call trace:
    [<ffff82d040303782>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
    [<ffff82d040304bc3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
    [<ffff82d0402a6259>] F compat_memory_op+0xf1/0x4d20
    [<ffff82d04041532d>] F hvm_memory_op+0x55/0xe0
    [<ffff82d040416150>] F hvm_hypercall+0xae8/0x21b0
    [<ffff82d0403b24ca>] F svm_vmexit_handler+0x1252/0x2450
    [<ffff82d0402049c0>] F svm_stgi_label+0x5/0x15

Adjust the calculations in COMPAT_ARG_XLAT_VIRT_BASE to subtract from the
per-domain area to obtain the mirrored linear address in the 4th slot,
instead of overflowing the per-domain linear address.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc302866f42f552337ae7d8d78877aec36e6e2ff
master date: 2025-03-20 12:28:30 +0100

5 weeks agox86/shadow: fix UB pointer arithmetic in sh_mfn_is_a_page_table()
Roger Pau Monné [Thu, 27 Mar 2025 14:02:28 +0000 (15:02 +0100)]
x86/shadow: fix UB pointer arithmetic in sh_mfn_is_a_page_table()

UBSAN complains with:

UBSAN: Undefined behaviour in arch/x86/mm/shadow/private.h:515:30
pointer operation overflowed ffff82e000000000 to ffff82dfffffffe0
[...]
Xen call trace:
    [<ffff82d040303782>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
    [<ffff82d040304bc3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
    [<ffff82d040471b2d>] F arch/x86/mm/shadow/guest_2.c#sh_page_fault__guest_2+0x1e350
    [<ffff82d0403b206b>] F svm_vmexit_handler+0xdf3/0x2450
    [<ffff82d0402049c0>] F svm_stgi_label+0x5/0x15

Fix by moving the call to mfn_to_page() after the check of whether the
passed gmfn is valid.  This avoids calling mfn_to_page() with an
INVALID_MFN parameter.

While there, make the page local variable const, as it's not modified by
the function.
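
The ordering can be sketched with a toy frame table.  Everything below
(the table size, the struct layout, the helper names) is a simplified
stand-in chosen for illustration, not Xen's actual definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_MFN UINT64_MAX   /* stand-in for Xen's INVALID_MFN */
#define NR_FRAMES   16           /* size of the toy frame table */

struct page_info { bool is_page_table; };
struct page_info frame_table[NR_FRAMES];

/* Toy mfn_to_page(): only defined for MFNs inside the frame table. */
const struct page_info *mfn_to_page(uint64_t mfn)
{
    assert(mfn < NR_FRAMES);
    return &frame_table[mfn];
}

/* Validate first, and only then form a pointer into the table. */
bool mfn_is_a_page_table(uint64_t mfn)
{
    const struct page_info *page;

    if (mfn == INVALID_MFN || mfn >= NR_FRAMES)
        return false;            /* bail out before any pointer arithmetic */

    page = mfn_to_page(mfn);
    return page->is_page_table;
}
```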

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 45ee73f1b24246f13cd9583cb2ee25fb9c782db8
master date: 2025-03-20 12:28:30 +0100

5 weeks agox86/mkelf32: account for offset when detecting note segment placement
Roger Pau Monné [Thu, 27 Mar 2025 14:02:20 +0000 (15:02 +0100)]
x86/mkelf32: account for offset when detecting note segment placement

mkelf32 attempts to check that the NOTE segment defined in the program
headers falls inside the LOAD segment, as the build-id should be loaded for
Xen to check at runtime.

However the current code doesn't take into account the LOAD program header
segment offset when calculating overlap with the NOTE segment.  This
results in incorrect detection, and the following build error:

arch/x86/boot/mkelf32 --notes xen-syms ./.xen.elf32 0x200000 \
               `nm xen-syms | sed -ne 's/^\([^ ]*\) . __2M_rwdata_end$/0x\1/p'`
Expected .note section within .text section!
Offset 4244776 not within 2910364!

When xen-syms has the following program headers:

Program Header:
    LOAD off    0x0000000000200000 vaddr 0xffff82d040200000 paddr 0x0000000000200000 align 2**21
         filesz 0x00000000002c689c memsz 0x00000000003f7e20 flags rwx
    NOTE off    0x000000000040c528 vaddr 0xffff82d04040c528 paddr 0x000000000040c528 align 2**2
         filesz 0x0000000000000024 memsz 0x0000000000000024 flags r--

Account for the program header offset of the LOAD segment when checking
whether the NOTE segment is contained within it.  Also fix the logic to
ensure the NOTE segment is fully contained within the LOAD segment.
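
A sketch of a containment check that honours the LOAD offset, using file
offsets and sizes as printed in the program headers above.  The struct and
function names are hypothetical and do not match the mkelf32 source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* File extent of a program header: [offset, offset + filesz). */
struct segment { uint64_t offset, filesz; };

/* True if 'note' lies entirely inside 'load', both taken as file extents. */
bool note_within_load(struct segment load, struct segment note)
{
    /* Toy precondition: neither extent may wrap around. */
    assert(load.offset + load.filesz >= load.offset);
    assert(note.offset + note.filesz >= note.offset);

    return note.offset >= load.offset &&
           note.offset + note.filesz <= load.offset + load.filesz;
}
```

With the headers quoted above, LOAD covers 0x200000 up to 0x4c689c and NOTE
covers 0x40c528 up to 0x40c54c, so the NOTE segment is contained.  Comparing
the NOTE offset (4244776) against the LOAD file size alone (2910364), as the
error message suggests the old code did, rejects it.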

Fixes: a353cab905af ('build_id: Provide ld-embedded build-ids')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e5fed7cb67c9f84653cdbd3924b8a119ef653be
master date: 2025-03-20 12:28:30 +0100

6 weeks agoxen/arinc653: call xfree() with local IRQ enabled
Anderson Choi [Thu, 20 Mar 2025 12:21:30 +0000 (13:21 +0100)]
xen/arinc653: call xfree() with local IRQ enabled

A Xen panic is observed with the following configuration.

1. Debug xen build (CONFIG_DEBUG=y)
2. dom1 of an ARINC653 domain
3. shutdown dom1 with xl command

$ xl shutdown <domain_name>

(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) Assertion '!in_irq() && (local_irq_is_enabled() || num_online_cpus() <= 1)' failed at common/xmalloc_tlsf.c:714
(XEN) ****************************************

The panic was triggered because xfree() was called with local IRQs
disabled, and therefore the assertion failed.

Fix this by calling xfree() after local IRQ is enabled.
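
The pattern can be sketched as follows: unlink the object while the lock is
held, and free it only once IRQs are enabled again.  The IRQ flag, the lock
helpers and the list are a toy model made up for illustration, not the
arinc653 scheduler code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

bool irqs_enabled = true;              /* toy model of the local IRQ state */

void lock_irqsave(void)      { irqs_enabled = false; }
void unlock_irqrestore(void) { irqs_enabled = true; }

/* Toy xfree(): mirrors the idea of the assertion quoted above. */
void toy_xfree(void *p)
{
    assert(irqs_enabled);
    free(p);
}

struct unit { struct unit *next; };
struct unit *unit_list;

/* Unlink under the lock, free only after IRQs are enabled again. */
void free_unit(struct unit *u)
{
    struct unit **pp;

    lock_irqsave();
    for (pp = &unit_list; *pp; pp = &(*pp)->next)
        if (*pp == u) {
            *pp = u->next;
            break;
        }
    unlock_irqrestore();

    toy_xfree(u);
}
```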

Fixes: 19049f8d796a ("sched: fix locking in a653sched_free_vdata()")
Signed-off-by: Anderson Choi <anderson.choi@boeing.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Nathan Studer <nathan.studer@dornerworks.com>
master commit: 3ee55c9543fcf0b35593f030b53f56f3222046b7
master date: 2025-03-19 16:44:00 +0000

6 weeks agox86/mm: Fix IS_ALIGNED() check in IS_LnE_ALIGNED()
Andrew Cooper [Thu, 20 Mar 2025 12:21:12 +0000 (13:21 +0100)]
x86/mm: Fix IS_ALIGNED() check in IS_LnE_ALIGNED()

The current CI failures turn out to be a latent bug triggered by a narrow set
of properties of the initrd and the host memory map, which CI encountered by
chance.

One step during boot involves constructing directmap mappings for modules.
With some probing at the point of creation, it is observed that there's a 4k
mapping missing towards the end of the initrd.

  (XEN) === Mapped Mod1 [0000000394001000, 00000003be1ff6dc] to Directmap
  (XEN) Probing paddr 394001000, va ffff830394001000
  (XEN) Probing paddr 3be1ff6db, va ffff8303be1ff6db
  (XEN) Probing paddr 3bdffffff, va ffff8303bdffffff
  (XEN) Probing paddr 3be001000, va ffff8303be001000
  (XEN) Probing paddr 3be000000, va ffff8303be000000
  (XEN) Early fatal page fault at e008:ffff82d04032014c (cr2=ffff8303be000000, ec=0000)

The conditions for this bug appear to be map_pages_to_xen() call with a start
address of exactly 4k beyond a 2M boundary, some number of full 2M pages, then
a tail needing 4k pages.

Anyway, the condition for spotting superpage boundaries in map_pages_to_xen()
is wrong.  The IS_ALIGNED() macro expects a power of two for the alignment
argument, and subtracts 1 itself.

Fixing this causes the failing case to now boot.
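
To illustrate the failure mode, here is a sketch working on frame numbers,
assuming the faulty check passed a mask of the form `size - 1` where the
macro wants the power of two itself.  The macro is a simplified form and the
helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified alignment test: 'align' must be a power of two. */
#define IS_ALIGNED(val, align) (!((val) & ((align) - 1)))

#define PFNS_PER_2M UINT64_C(512)      /* 4k frames in one 2M superpage */

static_assert((PFNS_PER_2M & (PFNS_PER_2M - 1)) == 0,
              "alignment must be a power of two");

/* Buggy form: passes a mask, so the macro ends up testing against 510. */
bool is_2m_aligned_buggy(uint64_t pfn)
{
    return IS_ALIGNED(pfn, PFNS_PER_2M - 1);
}

/* Fixed form: passes the power of two and lets the macro subtract 1. */
bool is_2m_aligned(uint64_t pfn)
{
    return IS_ALIGNED(pfn, PFNS_PER_2M);
}
```

Frame 0x394001 (paddr 394001000 in the log above) sits exactly one 4k page
past a 2M boundary.  The buggy form ignores bit 0 and reports it as aligned,
which matches the conditions described for the failing case.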

Fixes: 97fb6fcf26e8 ("x86/mm: introduce helpers to detect super page alignment")
Debugged-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b07c7d63f9b587e4df5d71f6da9eaa433512c974
master date: 2025-03-19 14:53:28 +0000

6 weeks agox86/ioremap: prevent additions against the NULL pointer
Roger Pau Monné [Thu, 20 Mar 2025 12:20:51 +0000 (13:20 +0100)]
x86/ioremap: prevent additions against the NULL pointer

This was reported by clang UBSAN as:

UBSAN: Undefined behaviour in arch/x86/mm.c:6297:40
applying zero offset to null pointer
[...]
Xen call trace:
    [<ffff82d040303662>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
    [<ffff82d040304aa3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
    [<ffff82d0406ebbc0>] F ioremap_wc+0xc8/0xe0
    [<ffff82d0406c3728>] F video_init+0xd0/0x180
    [<ffff82d0406ab6f5>] F console_init_preirq+0x3d/0x220
    [<ffff82d0406f1876>] F __start_xen+0x68e/0x5530
    [<ffff82d04020482e>] F __high_start+0x8e/0x90

Fix bt_ioremap() and ioremap{,_wc}() to not add the offset if the returned
pointer from __vmap() is NULL.
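
A sketch of the guarded form.  The mapping callback stands in for __vmap(),
and all names and the page size constant are assumptions made for the
example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_PAGE_OFFSET(pa) ((size_t)((pa) & (TOY_PAGE_SIZE - 1)))

/* Hypothetical stand-in for __vmap(): returns a page-aligned base or NULL. */
typedef void *(*map_fn)(uint64_t page_base);

char backing[TOY_PAGE_SIZE];           /* toy mapping target */
void *map_ok(uint64_t page_base)   { (void)page_base; return backing; }
void *map_fail(uint64_t page_base) { (void)page_base; return NULL; }

void *toy_ioremap(map_fn map, uint64_t pa)
{
    char *va;

    assert(map);
    va = map(pa & ~(uint64_t)(TOY_PAGE_SIZE - 1));

    /* Only add the sub-page offset to a real mapping, never to NULL. */
    return va ? va + TOY_PAGE_OFFSET(pa) : NULL;
}
```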

Fixes: d0d4635d034f ('implement vmap()')
Fixes: f390941a92f1 ('x86/DMI: fix table mapping when one lives above 1Mb')
Fixes: 81d195c6c0e2 ('x86: introduce ioremap_wc()')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9a6f2c52f75781acda39fab5cc96d1bcc54bf534
master date: 2025-03-17 13:33:29 +0100

6 weeks agoxen/sched: fix arinc653 to not use variables across cpupools
Juergen Gross [Thu, 20 Mar 2025 12:20:14 +0000 (13:20 +0100)]
xen/sched: fix arinc653 to not use variables across cpupools

a653sched_do_schedule() is using two function local static variables,
which is resulting in bad behavior when using more than one cpupool
with the arinc653 scheduler.

Fix that by moving those variables to the scheduler private data.
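
In sketch form, the state that used to be function-local static becomes a
member of a per-instance structure, so each cpupool advances its own copy.
The structure, field and function names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-scheduler private data: one instance per cpupool. */
struct sched_priv {
    unsigned int sched_index;     /* formerly a function-local static */
    int64_t next_switch_time;     /* formerly a function-local static */
};

/* Advance this instance's schedule position without touching any other. */
unsigned int next_slot(struct sched_priv *priv, unsigned int nr_slots)
{
    assert(priv && nr_slots);
    priv->sched_index = (priv->sched_index + 1) % nr_slots;
    return priv->sched_index;
}
```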

Fixes: 22787f2e107c ("ARINC 653 scheduler")
Reported-by: Choi Anderson <Anderson.Choi@boeing.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Nathan Studer <nathan.studer@dornerworks.com>
master commit: d0561ac8ab0e780b1e8ab41d0d15e9f9b076dee3
master date: 2025-03-14 10:17:11 +0100

6 weeks agolibxl: avoid infinite loop in libxl__remove_directory()
Jan Beulich [Thu, 20 Mar 2025 12:20:06 +0000 (13:20 +0100)]
libxl: avoid infinite loop in libxl__remove_directory()

Infinitely retrying the rmdir() invocation makes little sense. While the
original observation was the log filling the disk (due to repeated
"Directory not empty" errors, in turn occurring for unclear reasons),
the loop wants breaking even if there was no error message being logged
(much like is done in the similar loops in libxl__remove_file() and
libxl__remove_file_or_directory()).
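
A sketch of a bounded loop of this kind, using only POSIX rmdir().  It
assumes that EINTR is the only error worth retrying and that ENOENT counts
as success.  The function name is hypothetical and this is not the libxl
implementation:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Returns 0 if the directory is gone afterwards, -1 on any other error. */
int remove_directory(const char *path)
{
    assert(path);

    for (;;) {
        if (rmdir(path) == 0)
            return 0;
        if (errno == EINTR)
            continue;             /* interrupted: the only case retried */
        if (errno == ENOENT)
            return 0;             /* already removed */
        return -1;                /* e.g. ENOTEMPTY: report it, don't spin */
    }
}
```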

Fixes: c4dcbee67e6d ("libxl: provide libxl__remove_file et al")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 68baeb5c4852e652b9599e049f40477edac4060e
master date: 2025-03-13 10:23:10 +0100

6 weeks agox86/hvm: check return code of hvm_pi_update_irte when binding
Roger Pau Monné [Thu, 20 Mar 2025 12:19:36 +0000 (13:19 +0100)]
x86/hvm: check return code of hvm_pi_update_irte when binding

Consume the return code from hvm_pi_update_irte(), and propagate the error
back to the caller if hvm_pi_update_irte() fails.

Fixes: 35a1caf8b6b5 ('pass-through: update IRTE according to guest interrupt config changes')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb587f620ab56cc683347d8120ba63989fad2693
master date: 2025-03-12 13:32:31 +0100

6 weeks agox86/vmx: fix posted interrupts usage of msi_desc->msg field
Roger Pau Monné [Thu, 20 Mar 2025 12:19:17 +0000 (13:19 +0100)]
x86/vmx: fix posted interrupts usage of msi_desc->msg field

The current usage of msi_desc->msg in vmx_pi_update_irte() will make the
field contain a translated MSI message, instead of the expected
untranslated one.  This breaks dump_msi(), which uses the data in
msi_desc->msg to print the interrupt details.

Fix this by introducing a dummy local msi_msg, and use it with
iommu_update_ire_from_msi().  vmx_pi_update_irte() relies on the MSI
message not changing, so there's no need to propagate the resulting msi_msg
to the hardware, and the contents can be ignored.

Additionally add a comment to clarify that msi_desc->msg must always
contain the untranslated MSI message.

Fixes: a5e25908d18d ('VT-d: introduce new fields in msi_desc to track binding with guest interrupt')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30f0e55a79206702b4e82e86dad6b35033157858
master date: 2025-03-12 13:32:30 +0100

6 weeks agox86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD
Roger Pau Monné [Thu, 20 Mar 2025 12:18:46 +0000 (13:18 +0100)]
x86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD

The MMIO_CONF_BASE reports the base of the MCFG range on AMD systems.
Linux pre-6.14 is unconditionally attempting to read the MSR without a
safe MSR accessor, and since Xen doesn't allow access to it Linux reports
the following error:

unchecked MSR access error: RDMSR from 0xc0010058 at rIP: 0xffffffff8101d19f (xen_do_read_msr+0x7f/0xa0)
Call Trace:
 xen_read_msr+0x1e/0x30
 amd_get_mmconfig_range+0x2b/0x80
 quirk_amd_mmconfig_area+0x28/0x100
 pnp_fixup_device+0x39/0x50
 __pnp_add_device+0xf/0x150
 pnp_add_device+0x3d/0x100
 pnpacpi_add_device_handler+0x1f9/0x280
 acpi_ns_get_device_callback+0x104/0x1c0
 acpi_ns_walk_namespace+0x1d0/0x260
 acpi_get_devices+0x8a/0xb0
 pnpacpi_init+0x50/0x80
 do_one_initcall+0x46/0x2e0
 kernel_init_freeable+0x1da/0x2f0
 kernel_init+0x16/0x1b0
 ret_from_fork+0x30/0x50
 ret_from_fork_asm+0x1b/0x30

Such access is conditional on the presence of a device with PnP ID
"PNP0c01", which triggers the execution of the quirk_amd_mmconfig_area()
function.  Note that prior to commit 3fac3734c43a MSR accesses when running
as a PV guest would always use the safe variant, and thus silently handle
the #GP.

Fix by allowing access to the MSR on AMD systems for the hardware domain.

Write attempts to the MSR will still result in #GP for all domain types.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b4071d28c5bd9ca4fed76031cbf0e782b74209b9
master date: 2025-03-12 13:32:30 +0100

6 weeks agox86/vlapic: Fix handling of writes to APIC_ESR
Andrew Cooper [Thu, 20 Mar 2025 12:18:23 +0000 (13:18 +0100)]
x86/vlapic: Fix handling of writes to APIC_ESR

Xen currently presents APIC_ESR to guests as a simple read/write register.

This is incorrect.  The SDM states:

  The ESR is a write/read register. Before attempt to read from the ESR,
  software should first write to it. (The value written does not affect the
  values read subsequently; only zero may be written in x2APIC mode.) This
  write clears any previously logged errors and updates the ESR with any
  errors detected since the last write to the ESR.

Introduce a new pending_esr field in hvm_hw_lapic.

Update vlapic_error() to accumulate errors here, and extend vlapic_reg_write()
to discard the written value and transfer pending_esr into APIC_ESR.  Reads
are still as before.

Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.
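
The resulting semantics can be modelled with two fields.  This is a toy
model with made-up names, intended only to show the write/read ordering
described above:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the two pieces of state described above. */
struct toy_lapic {
    uint32_t esr;          /* value returned by reads of APIC_ESR */
    uint32_t pending_esr;  /* errors logged since the last write */
};

/* An error is detected: accumulate it without disturbing the visible ESR. */
void toy_error(struct toy_lapic *l, uint32_t errmask)
{
    assert(l);
    l->pending_esr |= errmask;
}

/* Guest write: the value is discarded, pending errors become visible. */
void toy_esr_write(struct toy_lapic *l, uint32_t val)
{
    (void)val;
    l->esr = l->pending_esr;
    l->pending_esr = 0;
}

/* Guest read: returns whatever the last write latched. */
uint32_t toy_esr_read(const struct toy_lapic *l)
{
    return l->esr;
}
```

A handler that writes and then reads sees the errors logged since the
previous write, and a second write followed by a read returns zero.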

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b28b590d4a23894672f1dd7fb98cdf9926ecb282
master date: 2025-03-07 14:34:08 +0000

6 weeks agotools/xl: fix channel configuration setting
Juergen Gross [Thu, 20 Mar 2025 12:17:41 +0000 (13:17 +0100)]
tools/xl: fix channel configuration setting

Channels work differently than other device types: their devid should
be -1 initially in order to distinguish them from the primary console
which has the devid of 0.

So when parsing the channel configuration, use
ARRAY_EXTEND_INIT_NODEVID() in order to avoid overwriting the devid
set by libxl_device_channel_init().

Fixes: 3a6679634766 ("libxl: set channel devid when not provided by application")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e1ccced4afe465d6541c5825a0f8d1b8f5fa4253
master date: 2025-03-05 16:37:37 +0100

6 weeks agox86/dom0: be less restrictive with the Interrupt Address Range
Roger Pau Monné [Thu, 20 Mar 2025 12:17:17 +0000 (13:17 +0100)]
x86/dom0: be less restrictive with the Interrupt Address Range

Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff].  This range has
two different purposes.  For accesses from the CPU it contains the default
position of the local APIC page at 0xfee00000.  For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) points to an address in that range to trigger an interrupt.

There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
the Linux kernel maps 0xfeec2000:

RIP: e030:xen_mc_flush+0x1e8/0x2b0
 xen_leave_lazy_mmu+0x15/0x60
 vmap_range_noflush+0x408/0x6f0
 __ioremap_caller+0x20d/0x350
 acpi_os_map_iomem+0x1a3/0x1c0
 acpi_ex_system_memory_space_handler+0x229/0x3f0
 acpi_ev_address_space_dispatch+0x17e/0x4c0
 acpi_ex_access_region+0x28a/0x510
 acpi_ex_field_datum_io+0x95/0x5c0
 acpi_ex_extract_from_field+0x36b/0x4e0
 acpi_ex_read_data_from_field+0xcb/0x430
 acpi_ex_resolve_node_to_value+0x2e0/0x530
 acpi_ex_resolve_to_value+0x1e7/0x550
 acpi_ds_evaluate_name_path+0x107/0x170
 acpi_ds_exec_end_op+0x392/0x860
 acpi_ps_parse_loop+0x268/0xa30
 acpi_ps_parse_aml+0x221/0x5e0
 acpi_ps_execute_method+0x171/0x3e0
 acpi_ns_evaluate+0x174/0x5d0
 acpi_evaluate_object+0x167/0x440
 acpi_evaluate_dsm+0xb6/0x130
 ucsi_acpi_dsm+0x53/0x80
 ucsi_acpi_read+0x2e/0x60
 ucsi_register+0x24/0xa0
 ucsi_acpi_probe+0x162/0x1e3
 platform_probe+0x48/0x90
 really_probe+0xde/0x340
 __driver_probe_device+0x78/0x110
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 bus_for_each_dev+0x77/0xc0
 bus_add_driver+0x112/0x1f0
 driver_register+0x72/0xd0
 do_one_initcall+0x48/0x300
 do_init_module+0x60/0x220
 __do_sys_init_module+0x17f/0x1b0
 do_syscall_64+0x82/0x170

Remove the restrictions to create mappings in the interrupt address range
for dom0.  Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present.  Additionally make sure the
emulated local APIC page is also not mapped, in case dom0 is using it.

Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages.  Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.

There's also the following restriction noted in Intel VT-d:

> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).

Similarly for AMD-Vi:

> Accesses to the interrupt address range (Table 3) are defined to go through
> the interrupt remapping portion of the IOMMU and not through address
> translation processing. Therefore, when a transaction is being processed as
> an interrupt remapping operation, the transaction attribute of
> pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should
> not be configured such that an address translation results in a special
> address such as the interrupt address range.

However those restrictions don't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping, and hence there's no output address after translation that
belongs to the interrupt address range.

Reported-by: Jürgen Groß <jgross@suse.com>
Link: https://lore.kernel.org/xen-devel/baade0a7-e204-4743-bda1-282df74e5f89@suse.com/
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 381caa38850771ae218eb6f6d490dc02e40df964
master date: 2025-03-05 10:26:46 +0100

6 weeks agox86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables
Roger Pau Monné [Thu, 20 Mar 2025 12:16:56 +0000 (13:16 +0100)]
x86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables

The current code in arch_iommu_hwdom_init() kind of open-codes the same
MMIO permission ranges that are added to the hardware domain ->iomem_caps.
Avoid this duplication and use ->iomem_caps in arch_iommu_hwdom_init() to
filter which memory regions should be added to the dom0 IOMMU page-tables.

Note the IO-APIC and MCFG page(s) must be set as not accessible for a PVH
dom0, otherwise the internal Xen emulation for those ranges won't work.
This requires adjustments in dom0_setup_permissions().

The call to pvh_setup_mmcfg() in dom0_construct_pvh() must now strictly be
done ahead of setting up dom0 permissions, so take the opportunity to also
put it inside the existing is_hardware_domain() region.

Also the special casing of E820_UNUSABLE regions no longer needs to be done
in arch_iommu_hwdom_init(), as those regions are already blocked in
->iomem_caps and thus would be removed from the rangeset as part of
->iomem_caps processing in arch_iommu_hwdom_init().  The E820_UNUSABLE
regions below 1Mb are not removed from ->iomem_caps, that's a slight
difference for the IOMMU created page-tables, but the aim is to allow
access to the same memory either from the CPU or the IOMMU page-tables.

Since ->iomem_caps already takes into account the domain max paddr, there's
no need to remove any regions past the last address addressable by the
domain, as applying ->iomem_caps would have already taken care of that.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62f3fc5296c452285e81adb50976bde2d68d3181
master date: 2025-03-05 10:26:46 +0100

6 weeks agox86/dom0: correctly set the maximum ->iomem_caps bound for PVH
Roger Pau Monné [Thu, 20 Mar 2025 12:16:37 +0000 (13:16 +0100)]
x86/dom0: correctly set the maximum ->iomem_caps bound for PVH

The logic in dom0_setup_permissions() sets the maximum bound in
->iomem_caps unconditionally using paddr_bits, which is not correct for HVM
based domains.  Instead use domain_max_paddr_bits() to get the correct
maximum paddr bits for each possible domain type.

Switch to using PFN_DOWN() instead of PAGE_SHIFT, as that's shorter.
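
A sketch of how the upper bound follows from the number of physical address
bits, assuming 4k pages.  The helper name and the constant are hypothetical,
and the real code obtains the bit count from domain_max_paddr_bits():

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define PFN_DOWN(x) ((x) >> TOY_PAGE_SHIFT)

/* Highest frame number addressable with 'paddr_bits' physical address bits. */
uint64_t max_pfn_for(unsigned int paddr_bits)
{
    assert(paddr_bits >= TOY_PAGE_SHIFT && paddr_bits < 64);
    return PFN_DOWN((UINT64_C(1) << paddr_bits) - 1);
}
```

A domain limited to fewer address bits than the host therefore gets a lower
bound, which is why the host-wide value is not appropriate for every domain
type.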

Fixes: 53de839fb409 ('x86: constrain MFN range Dom0 may access')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a00e08799cc7657d2a1aca158f4ad43d4c9103e7
master date: 2025-03-05 10:26:46 +0100

6 weeks agox86/dom0: attempt to fixup p2m page-faults for PVH dom0
Roger Pau Monné [Thu, 20 Mar 2025 12:16:14 +0000 (13:16 +0100)]
x86/dom0: attempt to fixup p2m page-faults for PVH dom0

When building a PVH dom0 Xen attempts to map all (relevant) MMIO regions
into the p2m for dom0 access.  However the information Xen has about the
host memory map is limited.  Xen doesn't have access to any resources
described in ACPI dynamic tables, and hence the p2m mappings provided might
not be complete.

PV doesn't suffer from this issue because a PV dom0 is capable of mapping
into its page-tables any address not explicitly banned in d->iomem_caps.

Introduce a new command line option that allows Xen to attempt to fixup
the p2m page-faults, by creating p2m identity maps in response to p2m
page-faults.

This is aimed as a workaround for small ACPI regions Xen doesn't know about.
Note that missing large MMIO regions mapped in this way will lead to
slowness due to the VM exit processing, plus the mappings will always use
small pages.

The ultimate aim is to attempt to bring better parity with a classic PV
dom0.

Note such fixups rely on the CPU doing the access to the unpopulated
address.  If the access is attempted from a device instead there's no
possible way to fixup, as IOMMU page-faults are asynchronous.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
master commit: 104591f5dd675d7bfb04885dace0e4e5a097fc1e
master date: 2025-03-05 10:26:46 +0100

6 weeks agox86/emul: dump unhandled memory accesses for PVH dom0
Roger Pau Monné [Thu, 20 Mar 2025 12:15:48 +0000 (13:15 +0100)]
x86/emul: dump unhandled memory accesses for PVH dom0

A PV dom0 can map any host memory as long as it's allowed by the IO
capability range in d->iomem_caps.  On the other hand, a PVH dom0 has no
way to populate MMIO regions onto its p2m, so it's limited to what Xen
initially populates on the p2m based on the host memory map and the enabled
device BARs.

Introduce a new debug build only printk that reports attempts by dom0 to
access addresses not populated on the p2m, and not handled by any emulator.
This is for information purposes only, but might allow getting an idea of
what MMIO ranges might be missing on the p2m.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 43d8a80a0cccfe3715bb3178b5c15fb983979651
master date: 2025-03-05 10:26:46 +0100

6 weeks agox86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()
Andrew Cooper [Thu, 20 Mar 2025 12:14:51 +0000 (13:14 +0100)]
x86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()

There is a corner case in the VMRUN instruction where its INTR_SHADOW state
leaks into guest state if a VMExit occurs before the VMRUN is complete.  An
example of this could be taking #NPF due to event injection.

Xen can safely execute STI anywhere between CLGI and VMRUN, as CLGI blocks
external interrupts too.  However, an exception (while fatal) will appear to
be in an irqs-on region (as GIF isn't considered), so position the STI after
the speculation actions but prior to the GPR pops.

Link: https://lore.kernel.org/all/CADH9ctBs1YPmE4aCfGPNBwA10cA8RuAk2gO7542DjMZgs4uzJQ@mail.gmail.com/
Fixes: 66b245d9eaeb ("SVM: limit GIF=0 region")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c989ff614f6bad48b3bd4b32694f711b31c7b2d6
master date: 2025-02-19 12:45:48 +0000

6 weeks agoxen/memory: Make resource_max_frames() to return 0 on unknown type
Oleksandr Tyshchenko [Thu, 20 Mar 2025 12:14:27 +0000 (13:14 +0100)]
xen/memory: Make resource_max_frames() to return 0 on unknown type

This is actually what the caller acquire_resource() expects on any kind
of error (the comment on top of resource_max_frames() also suggests that).
Otherwise, the caller will treat -errno as a valid value and propagate incorrect
nr_frames to the VM. As a possible consequence, a VM trying to query a resource
size of an unknown type will get the success result from the hypercall and obtain
nr_frames 4294967201.
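
The number above is what a negative errno looks like once it is treated as
an unsigned frame count.  A sketch, assuming a 32-bit unsigned int and an
error value of 95 (inferred from 4294967201, not stated in this message).
Both function names are hypothetical:

```c
#include <assert.h>

#define TOY_EOPNOTSUPP 95   /* assumption: inferred from 4294967201 */

static_assert(sizeof(unsigned int) == 4,
              "example assumes a 32-bit unsigned int");

/* Buggy shape: a negative errno leaks into an unsigned frame count. */
unsigned int frames_buggy(int known_type)
{
    return known_type ? 1 : -TOY_EOPNOTSUPP;   /* wraps to 4294967201 */
}

/* Fixed shape: an unknown type yields 0, which the caller treats as error. */
unsigned int frames_fixed(int known_type)
{
    return known_type ? 1 : 0;
}
```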

Also, add an ASSERT_UNREACHABLE() in the default case of _acquire_resource(),
normally we won't get to this point, as an unknown type will always be rejected
earlier in resource_max_frames().

Also, update test-resource app to verify that Xen can deal with invalid
(unknown) resource type properly.

Fixes: 9244528955de ("xen/memory: Fix acquire_resource size semantics")
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9b8708290002f0a4d0b363e0c66ce945f6b520bd
master date: 2025-02-18 14:47:34 +0000

6 weeks agoxen/console: Fix truncation of panic() messages
Andrew Cooper [Thu, 20 Mar 2025 12:13:44 +0000 (13:13 +0100)]
xen/console: Fix truncation of panic() messages

The panic() function uses a static buffer to format its arguments into, simply
to emit the result via printk("%s", buf).  This buffer is not large enough for
some existing users in Xen.  e.g.:

  (XEN) ****************************************
  (XEN) Panic on CPU 0:
  (XEN) Invalid device tree blob at physical address 0x46a00000.
  (XEN) The DTB must be 8-byte aligned and must not exceed 2 MB in size.
  (XEN)
  (XEN) Plea****************************************

The remainder of this particular message is 'e check your bootloader.', but
has been inherited by RISC-V from ARM.

It is also pointless double buffering.  Implement vprintk() beside printk(),
and use it directly rather than rendering into a local buffer, removing it as
one source of message limitation.
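
A minimal sketch of the difference, using only the C standard library rather than Xen's printk() machinery (both function names are hypothetical): rendering into a fixed buffer cuts long messages, while forwarding the va_list to a v*() printer has no intermediate buffer to overflow.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Old style: render into a fixed buffer first; long messages get cut. */
static int render_fixed(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int needed;

    va_start(args, fmt);
    needed = vsnprintf(buf, len, fmt, args); /* returns untruncated length */
    va_end(args);

    return needed;
}

/* New style: hand the va_list straight to a v*() printer, no local buffer. */
static int emit_direct(FILE *out, const char *fmt, ...)
{
    va_list args;
    int written;

    va_start(args, fmt);
    written = vfprintf(out, fmt, args);
    va_end(args);

    return written;
}
```

A return value from render_fixed() that is not smaller than the buffer length means the text was truncated, which is the situation shown in the console output above.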

This marginally simplifies panic(), and drops a global used-once buffer.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f8b1dd9407e4a3d9dc058b7fbbc591168649ad
master date: 2025-02-18 14:15:58 +0000

2 months agoIOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe
Jan Beulich [Thu, 27 Feb 2025 12:58:32 +0000 (12:58 +0000)]
IOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe

The function's use from set_msi_source_id() is guaranteed to be in an
IRQs-off region. While the invocation of that function could be moved
ahead in msi_msg_to_remap_entry() (doesn't need to be in the IOMMU-
intremap-locked region), the call tree from map_domain_pirq() holds an
IRQ descriptor lock. Hence all use sites of the lock need to become
IRQ-safe ones.

In find_upstream_bridge() do a tiny bit of tidying in adjacent code:
Change a variable's type to unsigned and merge a redundant assignment
into another variable's initializer.

This is XSA-467 / CVE-2025-1713.

Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 39bc6af3ba483282ed6bbf94b08aec38c93d39e6)

2 months agox86/iommu: disable interrupts at shutdown
Roger Pau Monné [Mon, 17 Feb 2025 12:24:23 +0000 (13:24 +0100)]
x86/iommu: disable interrupts at shutdown

Add a new hook to inhibit interrupt generation by the IOMMU(s).  Note the
hook is currently only implemented for x86 IOMMUs.  The purpose is to
disable interrupt generation at shutdown so any kexec chained image finds
the IOMMU(s) in a quiesced state.

It would also prevent "Receive accept error" being raised as a result of
non-disabled interrupts targeting offline CPUs.

Note that the iommu_quiesce() call in nmi_shootdown_cpus() is still
required even when there's a preceding iommu_crash_shutdown() call; the
latter can become a no-op depending on the setting of the "crash-disable"
command line option.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 819c3cb186a86ef3e04fb5af4d9f9f6de032c3ee
master date: 2025-02-12 15:56:07 +0100

2 months agox86/pci: disable MSI(-X) on all devices at shutdown
Roger Pau Monné [Mon, 17 Feb 2025 12:24:03 +0000 (13:24 +0100)]
x86/pci: disable MSI(-X) on all devices at shutdown

Attempt to disable MSI(-X) capabilities on all PCI devices known to Xen at
shutdown.  Doing so should help a kexec chained kernel boot more
reliably, as device MSI(-X) interrupt generation should be quiesced.

Only attempt to disable MSI(-X) on all devices in the crash context if the
PCI lock is not taken, otherwise the PCI device list could be in an
inconsistent state.  This requires introducing a new pcidevs_trylock()
helper to check whether the lock is currently taken.
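
The "only act if the lock is free" pattern can be sketched with C11 atomics. This is an illustration only: the names below are hypothetical and it does not reproduce the pcidevs lock or the real pcidevs_trylock() helper.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Try to take the lock without waiting; true means we now own it. */
static bool sketch_trylock(atomic_flag *lock)
{
    return !atomic_flag_test_and_set(lock);
}

static void sketch_unlock(atomic_flag *lock)
{
    atomic_flag_clear(lock);
}

/*
 * Crash-path style helper: touch the protected state only when the lock
 * was free, since a held lock may mean the state is mid-update.
 */
static bool quiesce_if_unlocked(atomic_flag *lock, int *quiesced)
{
    if (!sketch_trylock(lock))
        return false;       /* lock held elsewhere: skip the work */

    *quiesced = 1;          /* placeholder for the per-device work */
    sketch_unlock(lock);

    return true;
}
```

Skipping the work when the lock is held trades completeness for safety, which fits a crash context where waiting on the lock could hang.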

Disabling MSI(-X) should prevent "Receive accept error" being raised as a
result of non-disabled interrupts targeting offline CPUs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7ab6951981231b4c576a3588248c303001272588
master date: 2025-02-12 15:56:07 +0100

2 months agox86/smp: perform disabling on interrupts ahead of AP shutdown
Roger Pau Monné [Mon, 17 Feb 2025 12:23:50 +0000 (13:23 +0100)]
x86/smp: perform disabling on interrupts ahead of AP shutdown

Move the disabling of interrupt sources so it's done ahead of the offlining
of APs.  This is to prevent AMD systems triggering "Receive accept error"
when interrupts target CPUs that are no longer online.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db6daa9bf411260d2c1f5301e4fc786ae4a5cef8
master date: 2025-02-12 15:56:07 +0100

2 months agox86/shutdown: offline APs with interrupts disabled on all CPUs
Roger Pau Monné [Mon, 17 Feb 2025 12:23:27 +0000 (13:23 +0100)]
x86/shutdown: offline APs with interrupts disabled on all CPUs

The current shutdown logic in smp_send_stop() will disable the APs while
having interrupts enabled on the BSP or possibly other APs. On AMD systems
this can lead to local APIC errors:

APIC error on CPU0: 00(08), Receive accept error

Such error message can be printed in a loop, thus blocking the system from
rebooting.  I assume this loop is created by the error being triggered by
the console interrupt, which is further stirred by the ESR handler
printing to the console.

Intel SDM states:

"Receive Accept Error.

Set when the local APIC detects that the message it received was not
accepted by any APIC on the APIC bus, including itself. Used only on P6
family and Pentium processors."

So the error shouldn't trigger on any Intel CPU supported by Xen.

However AMD doesn't make such claims, and indeed the error is broadcast to
all local APICs when an interrupt targets a CPU that's already offline.

To prevent the error from stalling the shutdown process perform the
disabling of APs and the BSP local APIC with interrupts disabled on all
CPUs in the system, so that by the time interrupts are unmasked on the BSP
the local APIC is already disabled.  This can still lead to a spurious:

APIC error on CPU0: 00(00)

As a result of an LVT Error getting injected while interrupts are masked on
the CPU, and the vector only handled after the local APIC is already
disabled.  ESR reports 0 because as part of disable_local_APIC() the ESR
register is cleared.

Note the NMI crash path doesn't have such issue, because disabling of APs
and the caller local APIC is already done in the same contiguous region
with interrupts disabled.  There's a possible window on the NMI crash path
(nmi_shootdown_cpus()) where some APs might be disabled (and thus
interrupts targeting them raising "Receive accept error") before other APs
have interrupts disabled.  However the shutdown NMI will be handled,
regardless of whether the AP is processing a local APIC error, and hence
such interrupts will not cause the shutdown process to get stuck.

Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
intended goal of moving all interrupts to the BSP anyway.  The logic in
fixup_irqs() will move interrupts whose affinity doesn't overlap with the
passed mask, but the movement of interrupts is done to any CPU set in
cpu_online_map.  As in the shutdown path fixup_irqs() is called before APs
are cleared from cpu_online_map this leads to interrupts being shuffled
around, but not assigned to the BSP exclusively.
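
A toy model of the behaviour described above, with one bit per CPU and bit 0 standing for the BSP. It is a simplification built only from this description, not the actual fixup_irqs() logic, and every name in it is hypothetical.

```c
#include <assert.h>

/* One bit per CPU; bit 0 is the BSP in this sketch. */
typedef unsigned int cpumask_sketch_t;

/*
 * Interrupts whose affinity does not overlap 'mask' get moved, but the
 * new target is any CPU still present in 'online', not 'mask' itself.
 */
static cpumask_sketch_t retarget(cpumask_sketch_t affinity,
                                 cpumask_sketch_t mask,
                                 cpumask_sketch_t online)
{
    if (affinity & mask)
        return affinity;    /* overlaps the mask: not moved in this model */

    return online;          /* moved, but to any CPU still marked online */
}
```

With four CPUs still marked online (0xf), an interrupt bound to CPU 2 and a BSP-only mask ends up targeting 0xf rather than 0x1, which is the shuffling the paragraph describes.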

The Fixes tag is more of a guess than a certainty; it's possible the
previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
delivered before APs went offline.  However fixup_irqs() was still
incorrectly used, as it didn't (and still doesn't) move all interrupts to
target the provided cpu mask.

Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new destination in fixup_irqs()')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1191ce954f64244a3c5f553116184928bcc677e8
master date: 2025-02-12 15:56:07 +0100

2 months agoradix-tree: introduce RADIX_TREE{,_INIT}()
Jan Beulich [Mon, 17 Feb 2025 12:23:00 +0000 (13:23 +0100)]
radix-tree: introduce RADIX_TREE{,_INIT}()

... now that static initialization is possible. Use RADIX_TREE() for
pci_segments and ivrs_maps.

This then fixes an ordering issue on x86: With the call to
radix_tree_init(), acpi_mmcfg_init()'s invocation of pci_segments_init()
will zap the possible earlier introduction of segment 0 by
amd_iommu_detect_one_acpi()'s call to pci_ro_device(), and thus the
write-protection of the PCI devices representing AMD IOMMUs.

Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table parsing")
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 26fe09e34566d701ecaea76b4563bb9934e85861
master date: 2025-02-07 10:00:04 +0100

2 months agoradix-tree: purge node allocation override hooks
Jan Beulich [Mon, 17 Feb 2025 12:22:24 +0000 (13:22 +0100)]
radix-tree: purge node allocation override hooks

These were needed by TMEM only, which is long gone. The Linux original
doesn't have such either. This effectively reverts one of the "Other
changes" from 8dc6738dbb3c ("Update radix-tree.[ch] from upstream Linux
to gain RCU awareness").

Positive side effect: Two cf_check go away.

While there also convert xmalloc()+memset() to xzalloc().

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1275093a96fed45057db241b3aa6e191d9dcf596
master date: 2025-02-07 09:59:11 +0100

2 months agox86/intel: Fix PERF_GLOBAL fixup when virtualised
Andrew Cooper [Mon, 17 Feb 2025 12:21:50 +0000 (13:21 +0100)]
x86/intel: Fix PERF_GLOBAL fixup when virtualised

Logic using performance counters needs to look at
MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources.

When virtualised under ESX, Xen dies with a #GP fault trying to read
MSR_CORE_PERF_GLOBAL_CTRL.

Factor this logic out into a separate function (it's already too squashed to
the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE.

This also avoids setting X86_FEATURE_ARCH_PERFMON if MSR_MISC_ENABLE says that
PERF is unavailable, although oprofile (the only consumer of this flag)
cross-checks too.

Fixes: 6bdb965178bb ("x86/intel: ensure Global Performance Counter Control is setup correctly")
Reported-by: Jonathan Katz <jonathan.katz@aptar.com>
Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Jonathan Katz <jonathan.katz@aptar.com>
master commit: dd05d265b8abda4cc7206b29cd71b77fb46658bf
master date: 2025-01-28 11:19:45 +0000

2 months agox86/PV: further harden guest memory accesses against speculative abuse
Jan Beulich [Mon, 17 Feb 2025 12:21:30 +0000 (13:21 +0100)]
x86/PV: further harden guest memory accesses against speculative abuse

The original implementation has two issues: For one it doesn't preserve
non-canonical-ness of inputs in the range 0x8000000000000000 through
0x80007fffffffffff. Bogus guest pointers in that range would not cause a
(#GP) fault upon access, when they should.

And then there is an AMD-specific aspect, where only the low 48 bits of
an address are used for speculative execution; the architecturally
mandated #GP for non-canonical addresses would be raised at a later
execution stage. Therefore, to prevent Xen controlled data from making it
into any of the caches in a guest controllable manner, we need to
additionally ensure that for non-canonical inputs bit 47 would be clear.
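
For reference, a small sketch of what "canonical" means with 48-bit virtual addresses, and of the bit 47 test mentioned above. This is not the hardening sequence itself (that lives in the code comment the message refers to); the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 48-bit canonical: bits 63..47 are all equal (all zero or all one). */
static bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47; /* bits 63..47, 17 bits wide */

    return top == 0 || top == 0x1ffff;
}

/* Bit 47 is the top bit of the low 48 bits of an address. */
static bool bit47_set(uint64_t addr)
{
    return (addr >> 47) & 1;
}
```

Both ends of the range named above, 0x8000000000000000 and 0x80007fffffffffff, fail is_canonical_48() and have bit 47 clear.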

See the code comment for how addressing both is being achieved.

Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8306d773b03acec6062c0547ac05e3dd4a6960f6
master date: 2025-01-27 15:23:59 +0100

2 months agox86emul: further correct 64-bit mode zero count repeated string insn handling
Jan Beulich [Mon, 17 Feb 2025 12:21:00 +0000 (13:21 +0100)]
x86emul: further correct 64-bit mode zero count repeated string insn handling

In an entirely different context I came across Linux commit 428e3d08574b
("KVM: x86: Fix zero iterations REP-string"), which points out that
we're still doing things wrong: For one, there's no zero-extension at
all on AMD. And then while RCX is zero-extended from 32 bits uniformly
for all string instructions on newer hardware, RSI/RDI are only for MOVS
and STOS on the systems I have access to. (On an old family 0xf system
I've further found that for REP LODS even RCX is not zero-extended.)

While touching the lines anyway, replace two casts in get_rep_prefix().

Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5310a042c4e3135c471446c8253ad13250539957
master date: 2025-01-27 15:23:19 +0100

2 months agoiommu/amd: atomically update IRTE
Roger Pau Monné [Mon, 17 Feb 2025 12:20:38 +0000 (13:20 +0100)]
iommu/amd: atomically update IRTE

Either when using a 32bit Interrupt Remapping Entry or a 128bit one update
the entry atomically, by using cmpxchg unconditionally as IOMMU depends on
it.  No longer disable the entry by setting RemapEn = 0 ahead of updating
it.  As a consequence of not toggling RemapEn ahead of the update the
Interrupt Remapping Table needs to be flushed after the entry update.

This avoids a window where the IRTE has RemapEn = 0, which can lead to
IO_PAGE_FAULT if the underlying interrupt source is not masked.
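
The idea of replacing an entry in one step, instead of disabling it first, can be sketched with C11 atomics on a 32-bit value. The bit layout and function name are assumptions for illustration; the real IRTE formats and the table flush that must follow are not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 plays the role of RemapEn in a 32-bit entry. */
#define SKETCH_REMAP_EN 0x1u

/*
 * Replace the whole entry in one atomic step.  The enable bit is never
 * cleared in between, so an observer sees either the old or the new value.
 */
static void entry_update_atomic(_Atomic uint32_t *entry, uint32_t new_val)
{
    uint32_t old = atomic_load(entry);

    /* On failure 'old' is refreshed with the current value; just retry. */
    while (!atomic_compare_exchange_weak(entry, &old, new_val))
        ;
}
```

Because no intermediate state with the enable bit clear is ever written, there is no window in which a reader of the entry could observe it as disabled.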

There's no guidance in AMD-Vi specification about how IRTE update should be
performed as opposed to DTE updating which has specific guidance.  However
DTE updating claims that reads will always be at least 128bits in size, and
hence for the purposes here assume that reads and caching of the IRTE
entries in either 32 or 128 bit format will be done atomically from
the IOMMU.

Note that as part of introducing a new raw128 field in the IRTE struct, the
current raw field is renamed to raw64 to explicitly contain the size in the
field name.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b953a99da98d63a7c827248abc450d4e8e015ab6
master date: 2025-01-27 13:05:11 +0100

2 months agox86/iommu: check for CMPXCHG16B when enabling IOMMU
Teddy Astie [Mon, 17 Feb 2025 12:20:14 +0000 (13:20 +0100)]
x86/iommu: check for CMPXCHG16B when enabling IOMMU

All hardware with VT-d/AMD-Vi has CMPXCHG16B support. Check this at
initialisation time, and otherwise refuse to use the IOMMU.
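
As a rough illustration of such a check: CMPXCHG16B support is reported in CPUID leaf 1, ECX bit 13. The sketch below takes the ECX value as a parameter so it stays independent of how CPUID is read; the helper names are hypothetical and do not mirror Xen's feature-flag code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID leaf 1 reports CMPXCHG16B (CX16) in ECX bit 13. */
#define CPUID1_ECX_CX16 (UINT32_C(1) << 13)

static bool ecx_has_cx16(uint32_t leaf1_ecx)
{
    return (leaf1_ecx & CPUID1_ECX_CX16) != 0;
}

/* Hypothetical policy helper: only keep the IOMMU if CX16 is present. */
static bool iommu_usable(bool iommu_present, uint32_t leaf1_ecx)
{
    return iommu_present && ecx_has_cx16(leaf1_ecx);
}
```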

If the local APICs support x2APIC mode the IOMMU support for interrupt
remapping will be checked earlier using a specific helper.  If no support
for CX16 is detected by that earlier hook disable the IOMMU at that point
and prevent further poking for CX16 later in the boot process, which would
also fail.

There's a possible corner case when running virtualized, and the underlying
hypervisor exposing an IOMMU but no CMPXCHG16B support.  In which case
ignoring the IOMMU is fine, albeit the most natural would be for the
underlying hypervisor to also expose CMPXCHG16B support if an IOMMU is
available to the VM.

Note this change only introduces the checks, but doesn't remove the now
stale checks for CX16 support sprinkled in the IOMMU code.  Further changes
will take care of that.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2636fcdc15c707d5e097770133f0afb69e8d70c9
master date: 2025-01-27 13:05:11 +0100

2 months agox86/HVM: correct read/write split at page boundaries
Jan Beulich [Mon, 17 Feb 2025 12:19:51 +0000 (13:19 +0100)]
x86/HVM: correct read/write split at page boundaries

The MMIO cache is intended to have one entry used per independent memory
access that an insn does. This, in particular, is supposed to be
ignoring any page boundary crossing. Therefore when looking up a cache
entry, the access's starting (linear) address is relevant, not the one
possibly advanced past a page boundary.

In order for the same offset-into-buffer variable to be usable in
hvmemul_phys_mmio_access() for both the caller's buffer and the cache
entry's it is further necessary to have the un-adjusted caller buffer
passed into there.
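
A sketch of splitting one access at a page boundary, assuming 4 KiB pages and an access no longer than one page. The struct and function names are hypothetical. The point it illustrates is that the second chunk's address is advanced, while the start address of the access and a single offset into the buffer stay usable as the per-access key and index.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

struct chunk {
    uint64_t addr;        /* linear address this chunk starts at */
    unsigned int len;     /* bytes in this chunk */
    unsigned int buf_off; /* offset into the un-adjusted caller buffer */
};

/* Bytes from addr up to the end of its page. */
static unsigned int bytes_to_page_end(uint64_t addr)
{
    return SKETCH_PAGE_SIZE - (unsigned int)(addr & (SKETCH_PAGE_SIZE - 1));
}

/* Split [start, start + len) into one or two chunks; returns the count. */
static unsigned int split_access(uint64_t start, unsigned int len,
                                 struct chunk out[2])
{
    unsigned int first = bytes_to_page_end(start);

    if (len <= first) {
        out[0] = (struct chunk){ start, len, 0 };
        return 1;
    }

    out[0] = (struct chunk){ start, first, 0 };
    out[1] = (struct chunk){ start + first, len - first, first };

    return 2;
}
```

For an 8-byte access at 0x1ffe this yields 2 bytes at offset 0 and 6 bytes at offset 2; a cache lookup keyed on the access would still use 0x1ffe for both chunks.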

Fixes: 2d527ba310dc ("x86/hvm: split all linear reads and writes at page boundary")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 672894a11fe06e664a0ebfb600baf5dbb897b9e4
master date: 2025-01-24 10:15:56 +0100

2 months agox86/HVM: allocate emulation cache entries dynamically
Jan Beulich [Mon, 17 Feb 2025 12:19:04 +0000 (13:19 +0100)]
x86/HVM: allocate emulation cache entries dynamically

Both caches may need higher capacity, and the upper bound will need to
be determined dynamically based on CPUID policy (for AMX's TILELOAD /
TILESTORE at least).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 23d60dbb0493b2f9ec1d89be5341eec2ee9dab32
master date: 2025-01-24 10:15:29 +0100

2 months agox86/HVM: correct MMIO emulation cache bounds check
Jan Beulich [Mon, 17 Feb 2025 12:18:15 +0000 (13:18 +0100)]
x86/HVM: correct MMIO emulation cache bounds check

To avoid overrunning the internal buffer we need to take the offset into
the buffer into account.
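
The difference between the two checks can be shown in a few lines. The function names are hypothetical and the 64-byte capacity in the usage note is only an example value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Incomplete check: looks at the size alone. */
static bool fits_ignoring_offset(size_t size, size_t cap)
{
    return size <= cap;
}

/* Complete check: offset plus size must stay within the buffer. */
static bool fits_with_offset(size_t offset, size_t size, size_t cap)
{
    /* Written as a subtraction so offset + size cannot wrap around. */
    return offset <= cap && size <= cap - offset;
}
```

A 32-byte access at offset 48 into a 64-byte buffer passes the first check but fails the second.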

Fixes: d95da91fb497 ("x86/HVM: grow MMIO cache data size to 64 bytes")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e5339bb689dfa79a914c6c96e1d82d61e1ae3161
master date: 2025-01-23 11:14:48 +0100

2 months agox86/HVM: reduce recursion in linear_{read,write}()
Jan Beulich [Mon, 17 Feb 2025 12:17:45 +0000 (13:17 +0100)]
x86/HVM: reduce recursion in linear_{read,write}()

Let's make explicit what the compiler may or may not do on our behalf:
The 2nd of the recursive invocations each can fall through rather than
re-invoking the function. This will save us from adding yet another
parameter (or more) to the function, just for the recursive invocations.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 18053054b7583810dd356efc8d7018bbc8720f36
master date: 2024-09-09 13:40:47 +0200

3 months agoxen/events: fix race with set_global_virq_handler()
Juergen Gross [Tue, 21 Jan 2025 08:21:01 +0000 (09:21 +0100)]
xen/events: fix race with set_global_virq_handler()

There is a possible race scenario between set_global_virq_handler()
and clear_global_virq_handlers() targeting the same domain, which
might result in that domain ending as a zombie domain.

In case set_global_virq_handler() is being called for a domain which
is just dying, it might happen that clear_global_virq_handlers() is
running first, resulting in set_global_virq_handler() taking a new
reference for that domain and entering it in the global_virq_handlers[]
array afterwards. The reference will never be dropped, thus the domain
will never be freed completely.

This can be fixed by checking the is_dying state of the domain inside
the region guarded by global_virq_handlers_lock. In case the domain is
dying, handle it as if the domain wouldn't exist, which will be the
case in near future anyway.

Fixes: 87521589aa6a ("xen: allow global VIRQ handlers to be delegated to other domains")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4d8acc9c1cf14233dda21dd3a7791b5a84b0f6c3
master date: 2025-01-09 17:34:01 +0100

3 months agoxen/flask: Wire up XEN_DOMCTL_dt_overlay
Michal Orzel [Tue, 21 Jan 2025 08:20:51 +0000 (09:20 +0100)]
xen/flask: Wire up XEN_DOMCTL_dt_overlay

Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only dt overlay operation is attaching, which can
happen only after the domain is created. Dom0 can attach an overlay to itself
as well.

Fixes: 4c733873b5c2 ("xen/arm: Add XEN_DOMCTL_dt_overlay and device attachment to domains")
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 7fa1411676150634b1d6ca030e53b94c26a949dd
master date: 2025-01-08 13:05:50 +0100

3 months agoxen/flask: Wire up XEN_DOMCTL_vuart_op
Michal Orzel [Tue, 21 Jan 2025 08:20:42 +0000 (09:20 +0100)]
xen/flask: Wire up XEN_DOMCTL_vuart_op

Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only VUART operation is initialization, which
can occur only during domain creation.

Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011")
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 29daa72e4019aae92f857cf6e7e0c3ca8fb1483e
master date: 2025-01-08 13:05:38 +0100

3 months agox86emul: correct put_fpu()'s segment selector handling
Jan Beulich [Tue, 21 Jan 2025 08:19:56 +0000 (09:19 +0100)]
x86emul: correct put_fpu()'s segment selector handling

All selector fields under ctxt->regs are (normally) poisoned in the HVM
case, and the four ones besides CS and SS are potentially stale for PV.
Avoid using them in the hypervisor incarnation of the emulator, when
trying to cover for a missing ->read_segment() hook.

To make sure there's always a valid ->read_segment() handler for all HVM
cases, add a respective function to shadow code, even if it is not
expected for FPU insns to be used to update page tables.

Fixes: 0711b59b858a ("x86emul: correct FPU code/data pointers and opcode handling")
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 645b8d48c78f5b6ffd6230873f9e3ced4e840acd
master date: 2025-01-08 11:02:16 +0100

3 months agox86emul: VCVT{,U}DQ2PD ignores embedded rounding
Jan Beulich [Tue, 21 Jan 2025 08:19:39 +0000 (09:19 +0100)]
x86emul: VCVT{,U}DQ2PD ignores embedded rounding

IOW we shouldn't raise #UD in that case. Be on the safe side though and
only encode fully legitimate forms into the stub to be executed.

Things weren't quite right for VCVT{,U}SI2SD either, in the attempt to
be on the safe side: Clearing EVEX.L'L isn't useful; it's EVEX.b which
primarily needs clearing. Also reflect the somewhat improved doc
situation in the comment there.

Fixes: ed806f373730 ("x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns")
Fixes: baf4a376f550 ("x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3709d1324aa140f064b9c68da37547f459f8e8d
master date: 2025-01-08 11:01:17 +0100

3 months agox86/amd: Misc setup for Fam1Ah processors
Andrew Cooper [Tue, 21 Jan 2025 08:19:14 +0000 (09:19 +0100)]
x86/amd: Misc setup for Fam1Ah processors

Fam1Ah is similar to Fam19h in these regards.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f29cc14de1d195bcd8312dcab2b5f8e634b57288
master date: 2025-01-06 18:01:32 +0000

3 months agox86/traps: Rework LER initialisation and support Zen5/Diamond Rapids
Andrew Cooper [Tue, 21 Jan 2025 08:18:42 +0000 (09:18 +0100)]
x86/traps: Rework LER initialisation and support Zen5/Diamond Rapids

AMD have always used the architectural MSRs for LER.  As the first processor
to support LER was the K7 (which was 32bit), we can assume its presence
unconditionally in 64bit mode.

Intel are about to run out of space in Family 6 and start using 19.  It is
only the Pentium 4 which uses non-architectural LER MSRs.

percpu_traps_init(), which runs on every CPU, contains a lot of code which
should be init-only, and is the only reason why opt_ler can't be in initdata.

Write a brand new init_ler() which expects all future Intel and AMD CPUs to
continue using the architectural MSRs, and does all setup together.  Call it
from trap_init(), and remove the setup logic from percpu_traps_init() except for
the single path configuring MSR_IA32_DEBUGCTLMSR.

Leave behind a warning if the user asked for LER and Xen couldn't enable it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 555866cb56002849014a1409ecdfa3f436c0c2c4
master date: 2025-01-06 12:24:05 +0000

3 months agox86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX
Andrew Cooper [Tue, 21 Jan 2025 08:18:08 +0000 (09:18 +0100)]
x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX

AMD have updated the SRSO whitepaper[1] with further information.  These
features exist on AMD Zen5 CPUs and are necessary for Xen to use.

The two features are in principle unrelated:

 * SRSO_U/S_NO is an enumeration saying that SRSO attacks can't cross the
   User(CPL3) / Supervisor(CPL<3) boundary.  i.e. Xen doesn't need to use
   IBPB-on-entry for PV64.  PV32 guests are explicitly unsupported for
   speculative issues, and excluded from consideration for simplicity.

 * SRSO_MSR_FIX is an enumeration identifying that the BP_SPEC_REDUCE bit is
   available in MSR_BP_CFG.  When set, SRSO attacks can't cross the host/guest
   boundary.  i.e. Xen doesn't need to use IBPB-on-entry for HVM.

Extend ibpb_calculations() to account for these when calculating
opt_ibpb_entry_{pv,hvm} defaults.  Add a `bp-spec-reduce=<bool>` option to
control the use of BP_SPEC_REDUCE, with it active by default.

Because MSR_BP_CFG is core-scoped with a race condition updating it, repurpose
amd_check_erratum_1485() into amd_check_bp_cfg() and calculate all updates at
once.

Xen also needs to advertise SRSO_U/S_NO to guests to allow the guest kernel
to skip SRSO mitigations too:

 * This is trivial for HVM guests.  It is also accurate for PV32 guests,
   but we have already excluded them from consideration, and do so again
   here to simplify the policy logic.

 * As written, SRSO_U/S_NO does not help for the PV64 user->kernel boundary.
   However, after discussing with AMD, an implementation detail of having
   BP_SPEC_REDUCE active causes the PV64 user->kernel boundary to have the
   property described by SRSO_U/S_NO, so we can advertise SRSO_U/S_NO to
   guests when the BP_SPEC_REDUCE precondition is met.

Finally, fix a typo in the SRSO_NO's comment.

[1] https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a1746cd4434dd27ca2da8430dfb10edc76264bb3
master date: 2025-01-02 18:44:49 +0000

3 months agoxen/arch/x86: make objdump output user locale agnostic
Maximilian Engelhardt [Tue, 21 Jan 2025 08:17:58 +0000 (09:17 +0100)]
xen/arch/x86: make objdump output user locale agnostic

The objdump output is fed to grep, so make sure it doesn't change with
different user locales and break the grep parsing.
This problem was identified while updating xen in Debian and the fix is
needed for generating reproducible builds in varying environments.

Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d729221ab74c5d2571e71501dc63838acbf752a
master date: 2024-12-30 21:40:37 +0000

3 months agotools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel
Marek Marczykowski-Górecki [Tue, 21 Jan 2025 08:17:44 +0000 (09:17 +0100)]
tools/xg: increase LZMA_BLOCK_SIZE for uncompressing the kernel

Linux 6.12-rc2 fails to decompress with the current 128MiB, contrary to
the code comment. It results in a failure like this:

    domainbuilder: detail: xc_dom_kernel_file: filename="/var/lib/qubes/vm-kernels/6.12-rc2-1.1.fc37/vmlinuz"
    domainbuilder: detail: xc_dom_malloc_filemap    : 12104 kB
    domainbuilder: detail: xc_dom_module_file: filename="/var/lib/qubes/vm-kernels/6.12-rc2-1.1.fc37/initramfs"
    domainbuilder: detail: xc_dom_malloc_filemap    : 7711 kB
    domainbuilder: detail: xc_dom_boot_xen_init: ver 4.19, caps xen-3.0-x86_64 hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
    domainbuilder: detail: xc_dom_parse_image: called
    domainbuilder: detail: xc_dom_find_loader: trying multiboot-binary loader ...
    domainbuilder: detail: loader probe failed
    domainbuilder: detail: xc_dom_find_loader: trying HVM-generic loader ...
    domainbuilder: detail: loader probe failed
    domainbuilder: detail: xc_dom_find_loader: trying Linux bzImage loader ...
    domainbuilder: detail: _xc_try_lzma_decode: XZ decompression error: Memory usage limit reached
    xc: error: panic: xg_dom_bzimageloader.c:761: xc_dom_probe_bzimage_kernel unable to XZ decompress kernel: Invalid kernel
    domainbuilder: detail: loader probe failed
    domainbuilder: detail: xc_dom_find_loader: trying ELF-generic loader ...
    domainbuilder: detail: loader probe failed
    xc: error: panic: xg_dom_core.c:689: xc_dom_find_loader: no loader found: Invalid kernel
    libxl: error: libxl_dom.c:566:libxl__build_dom: xc_dom_parse_image failed

The important part: XZ decompression error: Memory usage limit reached

This looks to be related to the following change in Linux:
8653c909922743bceb4800e5cc26087208c9e0e6 ("xz: use 128 MiB dictionary and force single-threaded mode")

Fix this by increasing the block size to 256MiB. And remove the
misleading comment (from lack of better ideas).

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e6472d46680ccd2b804ad73c19042a5811d036f0
master date: 2024-12-19 17:33:54 +0000

4 months agoMISRA: Unmark Rules 1.1 and 2.1 as clean following Eclair upgrade
Andrew Cooper [Tue, 17 Dec 2024 17:04:59 +0000 (17:04 +0000)]
MISRA: Unmark Rules 1.1 and 2.1 as clean following Eclair upgrade

Updating the Eclair runner has had knock-on effects with previously-clean
rules now flagging violations:

 - x86:   Rule 1.1, 1940 violations
 - ARM64: Rule 1.1, 725 violations, Rule 2.1, 255 violations

Fixes: 631f535a3d4f ("xen: update ECLAIR service identifiers from MC3R1 to MC3A2.")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 171cb318deaa0be786cc3af3599c72e8909e60f9)

4 months agoxen: update ECLAIR service identifiers from MC3R1 to MC3A2.
Alessandro Zucchelli [Tue, 10 Dec 2024 10:37:23 +0000 (11:37 +0100)]
xen: update ECLAIR service identifiers from MC3R1 to MC3A2.

Rename all instances of ECLAIR MISRA C:2012 service identifiers,
identified by the prefix MC3R1, to use the prefix MC3A2, which
refers to MISRA C:2012 Amendment 2 guidelines.

This update is motivated by the need to upgrade ECLAIR GitLab runners
that use the new naming scheme for MISRA C:2012 Amendment 2 guidelines.

Changes to the docs/misra directory are needed in order to keep
comment-based deviation up to date.

Signed-off-by: Alessandro Zucchelli <alessandro.zucchelli@bugseng.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 631f535a3d4ffd66a270672f0f787d79f3bf38f8)

4 months agox86/io-apic: prevent early exit from i8259 loop detection
Roger Pau Monné [Tue, 17 Dec 2024 11:46:29 +0000 (12:46 +0100)]
x86/io-apic: prevent early exit from i8259 loop detection

Avoid exiting early from the loop when a pin that could be connected to the
i8259 is found, as such early exit would leave the EOI handler translation
array only partially allocated and/or initialized.

Otherwise, on systems with multiple IO-APICs and an unmasked ExtINT pin on
any IO-APIC that's not the last one, the following NULL pointer dereference
triggers:

(XEN) Enabling APIC mode.  Using 2 I/O APICs
(XEN) ----[ Xen-4.20-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d040328046>] __ioapic_write_entry+0x83/0x95
[...]
(XEN) Xen call trace:
(XEN)    [<ffff82d040328046>] R __ioapic_write_entry+0x83/0x95
(XEN)    [<ffff82d04027464b>] F amd_iommu_ioapic_update_ire+0x1ea/0x273
(XEN)    [<ffff82d0402755a1>] F iommu_update_ire_from_apic+0xa/0xc
(XEN)    [<ffff82d040328056>] F __ioapic_write_entry+0x93/0x95
(XEN)    [<ffff82d0403283c1>] F arch/x86/io_apic.c#clear_IO_APIC_pin+0x7c/0x10e
(XEN)    [<ffff82d040328480>] F arch/x86/io_apic.c#clear_IO_APIC+0x2d/0x61
(XEN)    [<ffff82d0404448b7>] F enable_IO_APIC+0x2e3/0x34f
(XEN)    [<ffff82d04044c9b0>] F smp_prepare_cpus+0x254/0x27a
(XEN)    [<ffff82d04044bec2>] F __start_xen+0x1ce1/0x23ae
(XEN)    [<ffff82d0402033ae>] F __high_start+0x8e/0x90
(XEN)
(XEN) Pagetable walk from 0000000000000000:
(XEN)  L4[0x000] = 000000007dbfd063 ffffffffffffffff
(XEN)  L3[0x000] = 000000007dbfa063 ffffffffffffffff
(XEN)  L2[0x000] = 000000007dbcc063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 0000000000000000
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...

Reported-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com>
Fixes: 86001b3970fe ('x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f38fd27c4ceadf7ec4e82e82d0731b6ea415c51e
master date: 2024-12-17 11:15:30 +0100
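
The failure mode described in the commit above can be illustrated with a small standalone sketch. It is not the Xen code: `eoi_xlat`, `extint_unmasked`, `scan_ioapics()` and the array sizes are hypothetical stand-ins, and the assumption is that the per-IO-APIC translation array is set up inside the same loop that looks for the i8259 pin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_APICS 2   /* hypothetical: two IO-APICs, as in the log above */
#define NR_PINS  4   /* hypothetical pin count per IO-APIC */

/* Hypothetical stand-in for the per-IO-APIC EOI handle translation array. */
static int *eoi_xlat[NR_APICS];

/* Hypothetical pin state: an unmasked ExtINT pin sits on the first IO-APIC. */
static const bool extint_unmasked[NR_APICS][NR_PINS] = {
    { false, true,  false, false },
    { false, false, false, false },
};

static void free_all(void)
{
    for (int apic = 0; apic < NR_APICS; apic++) {
        free(eoi_xlat[apic]);
        eoi_xlat[apic] = NULL;
    }
}

/* Number of IO-APICs whose translation array has been allocated. */
static int count_allocated(void)
{
    int n = 0;

    for (int apic = 0; apic < NR_APICS; apic++)
        if (eoi_xlat[apic] != NULL)
            n++;
    return n;
}

/*
 * Allocate the translation array for each IO-APIC and look for a pin that
 * could be wired to the i8259.  Returns the index of the IO-APIC holding the
 * first such pin, or -1.  With early_exit set, the scan stops at the first
 * match, so later IO-APICs never get their array allocated.
 */
static int scan_ioapics(bool early_exit)
{
    int found = -1;

    free_all();
    for (int apic = 0; apic < NR_APICS; apic++) {
        eoi_xlat[apic] = calloc(NR_PINS, sizeof(*eoi_xlat[apic]));
        assert(eoi_xlat[apic] != NULL);

        for (int pin = 0; pin < NR_PINS; pin++) {
            if (found < 0 && extint_unmasked[apic][pin]) {
                found = apic;
                if (early_exit)
                    return found;   /* leaves eoi_xlat[apic + 1..] NULL */
            }
        }
    }
    return found;
}
```

In this sketch, `scan_ioapics(true)` leaves the second array unallocated, so any later write through it would dereference NULL. `scan_ioapics(false)` records the pin but keeps iterating, so every array is allocated.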

4 months ago tools/ocaml: Specify rpath correctly for ocamlmklib
Andrii Sultanov [Mon, 16 Dec 2024 12:33:17 +0000 (13:33 +0100)]
tools/ocaml: Specify rpath correctly for ocamlmklib

ocamlmklib has special handling for C-like '-Wl,-rpath' option, but does
not know how to handle '-Wl,-rpath-link', as evidenced by warnings like:
"Unknown option
-Wl,-rpath-link=$HOME/xen/tools/ocaml/libs/eventchn/../../../../tools/libs/toollog"
Pass this option directly to the compiler with -ccopt instead.

Also pass -L directly to the linker with -ldopt. This prevents embedding
build-time absolute paths into the binary's RPATH.

Fixes: f7b4e4558b42 ("tools/ocaml: Fix OCaml libs rules")
Reported-by: Fernando Rodrigues <alpha@sigmasquadron.net>
Tested-by: Fernando Rodrigues <alpha@sigmasquadron.net>
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
master commit: bf8a209915804088c09ac6575bcca554450fa7e8
master date: 2024-12-11 10:45:08 +0000

4 months ago libs/guest: Fix migration compatibility with a security-patched Xen 4.13
Andrew Cooper [Mon, 16 Dec 2024 12:33:07 +0000 (13:33 +0100)]
libs/guest: Fix migration compatibility with a security-patched Xen 4.13

xc_cpuid_apply_policy() provides compatibility for migration of a pre-4.14 VM
where no CPUID data was provided in the stream.

It guesses the various max-leaf limits, based on what was true at the time of
writing, but this was not correctly adapted when speculative security issues
forced the advertisement of new feature bits.  Of note are:

 * LFENCE-DISPATCH, in leaf 0x80000021.eax
 * BHI-CTRL, in leaf 0x7[2].edx

In both cases, a VM booted on a security-patched Xen 4.13, and then migrated
on to any newer version of Xen on the same or compatible hardware would have
these features stripped back because Xen is still editing the cpu-policy for
sanity behind the back of the toolstack.

For VMs using BHI_DIS_S to mitigate Native-BHI, this resulted in a failure to
restore the guest's MSR_SPEC_CTRL setting:

  (XEN) HVM d7v0 load MSR 0x48 with value 0x401 failed
  (XEN) HVM7 restore: failed to load entry 20/0 rc -6

Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Fixes: f3709b15fc86 ("x86/cpuid: Infrastructure for cpuid word 7:2.edx")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 28301682f492c1df2ff9c3e01a0aab6262bd925a
master date: 2024-12-03 12:20:41 +0000
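
A minimal sketch of why a stale max-leaf guess strips a feature, under stated assumptions: `clamp_max_leaf()` and `leaf_visible()` are hypothetical helpers, not libxenguest functions, and the old limit of 0x8000001c is an illustrative value only. The point is that a feature bit in leaf 0x80000021 cannot be advertised once the policy's extended max leaf is clamped below that leaf.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: limit the hardware max leaf to a guessed ceiling. */
static uint32_t clamp_max_leaf(uint32_t hw_max, uint32_t guessed_limit)
{
    return hw_max < guessed_limit ? hw_max : guessed_limit;
}

/* A CPUID leaf is only visible to the guest if it is within max_leaf. */
static bool leaf_visible(uint32_t leaf, uint32_t max_leaf)
{
    return leaf <= max_leaf;
}
```

With a hardware maximum of, say, 0x80000023, clamping to the assumed old limit hides leaf 0x80000021 (and with it LFENCE-DISPATCH), while raising the limit to 0x80000021 keeps it visible. The same reasoning applies to the subleaf limit for leaf 0x7 and BHI-CTRL.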

4 months ago xen/Kconfig: livepatch-build-tools requires debug information
Roger Pau Monné [Mon, 16 Dec 2024 12:32:43 +0000 (13:32 +0100)]
xen/Kconfig: livepatch-build-tools requires debug information

The tools infrastructure used to build livepatches for Xen
(livepatch-build-tools) consumes some DWARF debug information present in
xen-syms to generate a livepatch (see livepatch-build script usage of readelf
-wi).

The current Kconfig defaults however will enable LIVEPATCH without DEBUG_INFO
on release builds, thus providing a default Kconfig selection that's not
suitable for livepatch-build-tools even when LIVEPATCH support is enabled,
because it's missing the DWARF debug section.

Fix by defaulting DEBUG_INFO to enabled when LIVEPATCH is.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 126b0a6e537ce1d486a29e35cfeec1f222a74d11
master date: 2024-12-02 15:22:05 +0100

4 months ago x86emul: MOVBE requires a memory operand
Jan Beulich [Mon, 16 Dec 2024 12:32:19 +0000 (13:32 +0100)]
x86emul: MOVBE requires a memory operand

The reg-reg forms should cause #UD; they come into existence only with
APX, where MOVBE also extends BSWAP (as the latter is not "eligible"
for a REX2 prefix).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4c5d9a01f8fa81417a9c431e9624fb71361ec4f9
master date: 2024-12-02 09:50:14 +0100
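
The check described in the MOVBE commit above can be sketched as follows. This is an illustration, not the x86emul implementation: `check_movbe_operand()` and the `decode_result` enum are hypothetical names. It relies only on the architectural fact that a ModRM byte with mod == 3 selects a register operand rather than a memory operand.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decode outcome for this sketch. */
enum decode_result { DECODE_OK, DECODE_UD };

/*
 * MOVBE (0F 38 F0 /r and 0F 38 F1 /r) takes one register and one memory
 * operand.  The top two bits of ModRM (the "mod" field) equal 3 when the
 * r/m operand is a register, which without APX must raise #UD.
 */
static enum decode_result check_movbe_operand(uint8_t modrm)
{
    uint8_t mod = modrm >> 6;

    return mod == 3 ? DECODE_UD : DECODE_OK;
}
```

For example, ModRM 0xC1 (mod == 3) is rejected, while 0x01, 0x45 and 0x80 (mod == 0, 1 and 2) encode memory operands and pass the check.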