Jan Beulich [Tue, 29 Apr 2025 09:47:50 +0000 (11:47 +0200)]
x86: constrain sub-page access length in mmio_ro_emulated_write()
Without doing so we could trigger the ASSERT_UNREACHABLE() in
subpage_mmio_write_emulate(). A comment there actually says this
validation would already have been done ...
Fixes: 8847d6e23f97 ("x86/mm: add API for marking only part of a MMIO page read only") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 8dbd9966f82f95b017f06e9397fc78064b688d61
master date: 2025-04-28 09:48:14 +0200
It's unclear why -N is being used in the first place. It was added by
commit 4676bbf96dc8 back in 2002 without any justification.
When building a PE image it's actually detrimental to forcefully set the
.text section as writable. The GNU LD man page contains the following
warning regarding the -N option:
> Note: Although a writable text section is allowed for PE-COFF targets, it
> does not conform to the format specification published by Microsoft.
Remove the usage of -N uniformly on all architectures, assuming that the
addition was simply done as a copy and paste of the original x86 linking
rune.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: d444763f8ca556d0a67a4b933be303d346baef02
master date: 2025-04-23 16:12:25 +0200
x86/intel: workaround several MONITOR/MWAIT errata
There are several errata on Intel regarding the usage of the MONITOR/MWAIT
instructions, all having in common that stores to the monitored region
might not wake up the CPU.
Fix them by forcing the sending of an IPI for the affected models.
The Ice Lake issue has been reproduced internally on XenServer hardware,
and the fix does seem to prevent it. The symptom was APs getting stuck in
the idle loop immediately after bring up, which in turn prevented the BSP
from making progress. This would happen before the watchdog was
initialized, and hence the whole system would get stuck.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4aae4452efeee3d3bba092b875e37d1e7c8f6db9
master date: 2025-04-23 16:12:25 +0200
Jan Beulich [Tue, 29 Apr 2025 09:46:37 +0000 (11:46 +0200)]
x86/EFI: correct mkreloc header (field) reading
With us now reading the full combined optional and NT headers, the
subsequent reading of (and seeking to) NT header fields is wrong. Since
PE32 and PE32+ NT headers are different anyway (beyond the image base
oddity extending across both headers), switch to using a union. This
allows to fetch the image base more directly then.
Additionally add checking to map_section(), which would have caught at
least the wrong (zero) image size that we previously used.
Fixes: f7f42accbbbb ("x86/efi: Use generic PE/COFF structures") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 042e9616f2e67476635487f2cd530406c6e3c0c1
master date: 2025-04-23 09:39:44 +0200
Jan Beulich [Tue, 29 Apr 2025 09:45:59 +0000 (11:45 +0200)]
compat/memory: avoid UB shifts in XENMEM_exchange handling
Add an early basic check, yielding the same error code as the more
thorough one the main handler would produce.
Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 560c51be8f6a88cde43c0a7c8be60158b5725982
master date: 2025-04-22 11:25:23 +0200
Jan Beulich [Tue, 29 Apr 2025 09:45:28 +0000 (11:45 +0200)]
x86emul: also clip repetition count for STOS
Like MOVS, INS, and OUTS, STOS also has a special purpose hook, where
the hook function may legitimately have the same expectation as to the
request not straddling address space start/end.
Fixes: 5dfe4aa4eeb6 ("x86_emulate: Do not request emulation of REP instructions beyond the") Reported-by: Fabian Specht <f.specht@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8c5636b6c87777e6c2e4ffae28bffe1cfc189bfd
master date: 2025-04-22 11:24:20 +0200
Jan Beulich [Tue, 29 Apr 2025 09:44:56 +0000 (11:44 +0200)]
x86/HVM: update repeat count upon nested lin->phys failure
For the X86EMUL_EXCEPTION case the repeat count must be correctly
propagated back. Since for the recursive invocation we use a local
helper variable, its value needs copying to the caller's one.
While there also correct the off-by-1 range in the comment ahead of the
function (strictly speaking for the "DF set" case we'd need to put
another, different range there as well).
Fixes: 53f87c03b4ea ("x86emul: generalize exception handling for rep_* hooks") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c07b16fd6e47782ebf1ee767cd07c1e2b4140f47
master date: 2025-04-17 10:01:19 +0200
x86/mm: account for the offset when performing subpage r/o MMIO access
The current logic in subpage_mmio_write_emulate() doesn't take into account
the page offset, and always performs the writes at offset 0 (start of the
page).
Fix this by accounting for the offset before performing the write.
Fixes: 8847d6e23f97 ('x86/mm: add API for marking only part of a MMIO page read only') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38d07809794e3c723a4de7e10c25c1f6cb590dc6
master date: 2025-04-15 16:01:48 +0200
Jan Beulich [Tue, 29 Apr 2025 09:42:47 +0000 (11:42 +0200)]
include: sort $(wildcard ...) results
The order of items is stored in .*.chk.cmd, and hence variations between
how items are ordered would result in re-invocation of the checking rule
during "make install-xen" despite that already having successfully run
earlier on. The difference can become noticable when building (as non-
root) and installing (as root) use different GNU make versions: In 3.82
the sorting was deliberately undone, just for it to be restored in 4.3.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ff835bbc8096a14ed1bffa235e25848c993f7240
master date: 2025-04-10 10:56:29 +0200
xen: x86: irq: initialize irq desc in create_irq()
While building xen with GCC 14.2.1 with "-fcondition-coverage" option
or with "-Og", the compiler produces a false positive warning:
arch/x86/irq.c: In function ‘create_irq’:
arch/x86/irq.c:281:11: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
281 | ret = init_one_irq_desc(desc);
| ^~~~~~~~~~~~~~~~~~~~~~~
arch/x86/irq.c:269:22: note: ‘desc’ was declared here
269 | struct irq_desc *desc;
| ^~~~
cc1: all warnings being treated as errors
make[2]: *** [Rules.mk:252: arch/x86/irq.o] Error 1
While we have signed/unsigned comparison both in "for" loop and in
"if" statement, this still can't lead to use of uninitialized "desc",
as either loop will be executed at least once, or the function will
return early. So this is a clearly false positive warning due to a
bug [1] in GCC.
Ahmed S. Darwish [Tue, 29 Apr 2025 09:41:13 +0000 (11:41 +0200)]
x86/cpu: Validate CPUID leaf 0x2 EDX output
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.
Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and
ECX, but left EDX unchecked.
xen: vm_event: do not do vm_event_op for an invalid domain
A privileged domain can issue XEN_DOMCTL_vm_event_op with
op->domain == DOMID_INVALID. In this case vm_event_domctl()
function will get NULL as the first parameter and this will
cause hypervisor panic, as it tries to derefer this pointer.
Fix the issue by checking if valid domain is passed in.
Fixes: 48b84249459f ("xen/vm-event: Drop unused u_domctl parameter from vm_event_domctl()") Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 6a884750f3b86a45ee5ffbd825c346fcbce86080
master date: 2025-04-08 09:36:38 +0200
Jan Beulich [Tue, 29 Apr 2025 09:38:58 +0000 (11:38 +0200)]
x86/MTRR: hook mtrr_bp_restore() back up
Unlike stated in the offending commit's description,
load_system_tables() wasn't the only thing left to retain from the
earlier restore_rest_processor_state(). Note that MTRR state was still
reloaded via mtrr_aps_sync_end(), but that happens quite a bit later in
the resume process.
While there also do Misra-related tidying for the function itself: The
function being used from assembly only means it doesn't need to have a
declaration, but wants to be asmlinkage.
Fixes: 4304ff420e51 ("x86/S3: Drop {save,restore}_rest_processor_state() completely") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0414dedd6fde1a1c5c5e38dcbef4dad506e1398c
master date: 2025-04-03 09:39:13 +0200
Andrew Cooper [Tue, 8 Apr 2025 16:09:15 +0000 (17:09 +0100)]
x86/ucode: Extend AMD digest checks to cover Zen5 CPUs
AMD have updated the SB-7033 advisory to include Zen5 CPUs. Extend the digest
check to cover Zen5 too.
In practice, cover everything until further notice.
Observant readers may be wondering where the update to the digest list is. At
the time of writing, no Zen5 patches are available via a verifiable channel.
x86/ucode: Extend warning about disabling digest check too
This was missed by accident.
Fixes: b63951467e96 ("x86/ucode: Extend AMD digest checks to cover Zen5 CPUs") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 59bb316ea89e7f9461690fe00547d7d2af96321d)
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode
Collisions have been found in the microcode signing algorithm used by AMD
Fam17h/19h CPUs, and now anyone can sign their own.
For more details, see:
https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking
https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html
As a stopgap mitigation, check the digest of patches against a table of blobs
with known provenance. These are all Fam17h and Fam19h blobs included in
linux-firwmare at the time of writing, specifically:
This checks can be opted out of by booting with ucode=no-digest-check, but
doing so is not recommended.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 630e8875ab368b97cc7231aaf3809e3d7d5687e1)
Xen: CI fix from XSN-2
* Add cf_check annotation to cmp_patch_id() used by bsearch().
Fixes: 630e8875ab36 ("x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
xen/lib: Introduce SHA2-256
A future change will need to calculate SHA2-256 digests. Introduce an
implementation in lib/, derived from Trenchboot which itself is derived from
Linux.
In order to be useful to other architectures, it is careful with endianness
and misaligned accesses as well as being more MISRA friendly, but is only
wired up for x86 in the short term.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 372af524411f5a013bcb0b117073d8d07c026563)
Xen: CI fix from XSN-2
* Add U suffix to the K[] table to fix MISRA Rule 7.2 violations.
Fixes: 372af524411f ("xen/lib: Introduce SHA2-256") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)
tools/libxl: do not use `-c -E` compiler options together
It makes no sense to request for preprocessor only output and also request
object file generation. Fix the _libxl.api-for-check target to only use
-E (preprocessor output).
Also Clang 20.0 reports an error if both options are used.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: 2862bf5b6c81 ('libxl: enforce prohibitions of internal callers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
(cherry picked from commit a235f856e4bbd270b085590e1f5fc9599234dcdf)
automation/eclair: Remove bespoke service B.UNEVALEFF
The Eclair runners in GitlabCI have been update. This service is now
included, and redefining results in an error.
No functional change.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit a43b0a770bdcf3933634c860049e4bd65854e472)
This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
This one has working S3, so add a test for it here.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit debe8bf537ec2c69a4734393cd2b0c7f57f74c0c)
Roger Pau Monne [Fri, 14 Mar 2025 12:37:46 +0000 (13:37 +0100)]
automation/cirrus-ci: add smoke tests for the FreeBSD builds
Introduce a basic set of smoke tests using the XTF selftest image, and run
them on QEMU. Use the matrix keyword to create a different task for each
XTF flavor on each FreeBSD build.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7973cba4dbf72f6b963c780e9d1e0b99fd9622b9)
Roger Pau Monne [Sat, 15 Mar 2025 08:35:12 +0000 (09:35 +0100)]
automation/cirrus-ci: use matrix keyword to generate per-version build tasks
Move the current logic to use the matrix keyword to generate a task for
each version of FreeBSD we want to build Xen on. The matrix keyword
however cannot be used in YAML aliases, so it needs to be explicitly used
inside of each task, which creates a bit of duplication. At least abstract
the FreeBSD minor version numbers to avoid repetition of image names.
Note that the full build uses matrix over an env variable instead of using
it directly in image_family. This is so that the alias can also be set
based on the FreeBSD version, in preparation for adding further tasks that
will depend on the full build having finished.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit b548a7dc4bb4202410983dd3ef7b5f1b469bdac3)
Roger Pau Monne [Mon, 17 Mar 2025 09:31:07 +0000 (10:31 +0100)]
automation/console.exp: do not assume expect is always at /usr/bin/
Instead use env to find the location of expect.
Additionally do not use the -f flag, as it's only meaningful when passing
arguments on the command line, which we never do for console.exp. From the
expect 5.45.4 man page:
> The -f flag prefaces a file from which to read commands from. The flag
> itself is optional as it is only useful when using the #! notation (see
> above), so that other arguments may be supplied on the command line.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit c6867d134eeb3aefcecf022f4ab20c325701ef30)
Roger Pau Monne [Mon, 10 Mar 2025 17:41:57 +0000 (18:41 +0100)]
automation/cirrus-ci: store xen/.config as an artifact
Always store xen/.config as an artifact, renamed to xen-config to match
the naming used in the Gitlab CI tests.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 318659c818546b6415c2311339a53384dc20c427)
Andrew Cooper [Mon, 24 Feb 2025 15:36:11 +0000 (15:36 +0000)]
CirrusCI: Use shallow clone
This reduces the Clone step from ~50s to ~3s.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 96970b46e5e84fa2666f76f5b0972b826a3ffba4)
Freeing per-CPU areas and setting __per_cpu_offset to INVALID_PERCPU_AREA
only occur when !park_offline_cpus and system_state is not SYS_STATE_suspend.
On ARM64, park_offline_cpus is always false, so setting __per_cpu_offset to
INVALID_PERCPU_AREA depends solely on the system state.
If the system is suspended, this area is not freed, and during resume, an error
occurs in init_percpu_area, causing a crash because INVALID_PERCPU_AREA is not
set and park_offline_cpus remains 0:
Jan Beulich [Wed, 2 Apr 2025 12:22:27 +0000 (14:22 +0200)]
x86/P2M: synchronize fast and slow paths of p2m_get_page_from_gfn()
Handling of both grants and foreign pages was different between the two
paths.
While permitting access to grants would be desirable, doing so would
require more involved handling; undo that for the time being. In
particular the page reference obtained would prevent the owning domain
from changing e.g. the page's type (after the grantee has released the
last reference of the grant). Instead perhaps another reference on the
grant would need obtaining. Which in turn would require determining
which grant that was.
Foreign pages in any event need permitting on both paths.
Introduce a helper function to be used on both paths, such that
respective checking differs in just the extra "to be unshared" condition
on the fast path.
While there adjust the sanity check for foreign pages: Don't leak the
reference on release builds when on a debug build the assertion would
have triggered. (Thanks to Roger for the suggestion.)
Fixes: 80ea7af17269 ("x86/mm: Introduce get_page_from_gfn()") Fixes: 50fe6e737059 ("pvh dom0: add and remove foreign pages") Fixes: cbbca7be4aaa ("x86/p2m: make p2m_get_page_from_gfn() handle grant case correctly") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a8325f981ce4ff8ac8bcc73735f357846b0a0fbb
master date: 2025-03-31 09:21:12 +0200
Andrew Cooper [Wed, 2 Apr 2025 12:22:10 +0000 (14:22 +0200)]
ARM/vgic: Fix out-of-bounds accesses in vgic_mmio_write_sgir()
The switch() statement is over bits 24:25 (unshifted) of the guest provided
value. This makes case 0x3: dead, and not an implementation of the 4th
possible state.
A guest which writes (0x3 << 24) | (0xff << 16) to this register will skip the
early exit, then enter bitmap_for_each() with targets not bound by nr_vcpus.
If the guest has fewer than 8 vCPUs, bitmap_for_each() will read off the end
of d->vcpu[] and use the resulting vcpu pointer to ultimately derive irq, and
perform out-of-bounds writes.
Fix this by changing case 0x3 to default.
Fixes: 08c688ca6422 ("ARM: new VGIC: Add SGIR register handler") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: be7f0cc651d8d02a95820792204c0558f1f29e03
master date: 2025-03-27 11:54:23 +0000
OCaml, in preparation for a renaming of the error string associated with
conversion failure in 'int_of_string' functions, started to issue this
warning:
File "process.ml", line 440, characters 13-28:
440 | | (Failure "int_of_string") -> reply_error "EINVAL"
^^^^^^^^^^^^^^^
Warning 52 [fragile-literal-pattern]: Code should not depend on the actual values of
this constructor's arguments. They are only for information
and may change in future versions. (See manual section 11.5)
Deal with this at the source, and instead create our own stable
ConversionFailure exception that's raised on the None case in
'int_of_string_opt'.
'c_int_of_string' is safe and does not raise such exceptions.
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c11772277fe5f1b0874141a24554c2e3da2d9a6e
master date: 2025-02-25 13:30:55 +0000
Jan Beulich [Thu, 27 Mar 2025 14:01:18 +0000 (15:01 +0100)]
Arm/domctl: correct XEN_DOMCTL_vuart_op error return value
copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 341c0df40bf73b0a5e27db27023ec400858a472d
master date: 2025-03-27 12:22:39 +0100
Jan Beulich [Thu, 27 Mar 2025 14:00:50 +0000 (15:00 +0100)]
x86/pmstat: correct get_cpufreq_para()'s error return value
copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
Fixes: 7542c4ff00f2 ("Add user PM control interface") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 855337ca4947508ffca23393e291c54b5307cc9a
master date: 2025-03-27 12:22:06 +0100
Jan Beulich [Thu, 27 Mar 2025 14:00:36 +0000 (15:00 +0100)]
x86/PVH: account for module command line length
As per observation in practice, initrd->cmdline_pa is not normally zero.
Hence so far we always appended at least one byte. That alone may
already render insufficient the "allocation" made by find_memory().
Things would be worse when there's actually a (perhaps long) command
line.
Skip setup when the command line is empty. Amend the "allocation" size
by padding and actual size of module command line. Along these lines
also skip initrd setup when the initrd is zero size.
Fixes: 0ecb8eb09f9f ("x86/pvh: pass module command line to dom0") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 989584e532c9517a0f789e993f5f6744beaebe3e
master date: 2025-03-27 12:21:08 +0100
This is MOV %cr8, which is wired up for hvm_mov_{to,from}_cr(); the VMExit
fastpaths, but not for the full emulation slowpaths.
Xen's handling of %cr8 turns out to be quite wrong. At a minimum, we need
storage for %cr8 separate to APIC_TPR, and to alter intercepts based on
whether the vLAPIC is enabled or not. But that's more work than there is time
for in the short term, so make a stopgap fix.
Extend hvmemul_{read,write}_cr() with %cr8 cases. Unlike hvm_mov_to_cr(),
hardware hasn't filtered out invalid values (#GP checks are ahead of
intercepts), so introduce X86_CR8_VALID_MASK.
Reported-by: Petr Beneš <w1benny@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 14fd9b5642cd4805b49fbe716bf2cd577e724169
master date: 2025-03-26 11:54:59 +0000
Andrew Cooper [Thu, 27 Mar 2025 13:59:36 +0000 (14:59 +0100)]
x86/emul: Rearrange the logic in hvmemul_{read,write}_cr()
In hvmemul_read_cr(), make the TRACE()/X86EMUL_OKAY path common in preparation
for adding a %cr8 case. Use a local 'val' variable instead of always
operating on a deferenced pointer.
In both, calculate curr once.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b7264a15c28d30bb994ec9e58eba38932be231ec
master date: 2025-03-26 11:54:59 +0000
Jan Beulich [Thu, 27 Mar 2025 13:58:59 +0000 (14:58 +0100)]
x86/PVH: expose OEMx ACPI tables to Dom0
What they contain we don't know, but we can't sensibly hide them. On my
Skylake system OEM1 (with a description of "INTEL CPU EIST") is what
contains all the _PCT, _PPC, and _PSS methods, i.e. about everything
needed for cpufreq. (_PSD interestingly are in an SSDT there.)
Further OEM2 there has a description of "INTEL CPU HWP", while OEM4
has "INTEL CPU CST". Pretty clearly all three need exposing for
cpufreq and cpuidle to work.
Jan Beulich [Thu, 27 Mar 2025 13:58:51 +0000 (14:58 +0100)]
xenpm: sanitize allocations in show_cpufreq_para_by_cpuid()
malloc(), when passed zero size, may return NULL (the behavior is
implementation defined). Mirror the ->gov_num check to the other two
allocations as well. Don't chance then actually using a NULL in
print_cpufreq_para().
Fixes: 75e06d089d48 ("xenpm: add cpu frequency control interface, through which user can") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 6c0dc87bb0e08fb31a68bf4c4149a18b92628f14
master date: 2025-03-26 12:30:57 +0100
Andrew Cooper [Thu, 27 Mar 2025 13:57:55 +0000 (14:57 +0100)]
x86/boot: Simplify the expression for extra allocation space
The expression for one parameter of find_memory() is already complicated and
about to become moreso. Break it out into a new variable, and express it in
an easier-to-follow way.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ce703c84df1cb279605b0a85a45c6419464a16e8
master date: 2025-03-21 11:52:39 +0000
Fixes: 84c4461b7d3a ("Force out-of-line instances of inline functions into .init.text in init-only code") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ab7ce0c8ed90f729a186babd87e3cd1fbed8ab98
master date: 2025-03-21 11:52:39 +0000
Roger Pau Monné [Thu, 27 Mar 2025 13:56:57 +0000 (14:56 +0100)]
x86/vga: fix mapping of the VGA text buffer
The call to ioremap_wc() in video_init() will always fail, because
video_init() is called ahead of vm_init_type(), and so the underlying
__vmap() call will fail to allocate the linear address space.
Fix by reverting to the previous behavior and use __va() for the VGA text
buffer, as it's below the 1MB boundary, and thus always mapped in the
directmap.
Adjust the calculations in COMPAT_ARG_XLAT_VIRT_BASE to subtract from the
per-domain area to obtain the mirrored linear address in the 4th slot,
instead of overflowing the per-domain linear address.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc302866f42f552337ae7d8d78877aec36e6e2ff
master date: 2025-03-20 12:28:30 +0100
Roger Pau Monné [Thu, 27 Mar 2025 13:55:57 +0000 (14:55 +0100)]
x86/shadow: fix UB pointer arithmetic in sh_mfn_is_a_page_table()
UBSAN complains with:
UBSAN: Undefined behaviour in arch/x86/mm/shadow/private.h:515:30
pointer operation overflowed ffff82e000000000 to ffff82dfffffffe0
[...]
Xen call trace:
[<ffff82d040303782>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304bc3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d040471b2d>] F arch/x86/mm/shadow/guest_2.c#sh_page_fault__guest_2+0x1e350
[<ffff82d0403b206b>] F svm_vmexit_handler+0xdf3/0x2450
[<ffff82d0402049c0>] F svm_stgi_label+0x5/0x15
Fix by moving the call to mfn_to_page() after the check of whether the
passed gmfn is valid. This avoid the call to mfn_to_page() with an
INVALID_MFN parameter.
While there make the page local variable const, it's not modified by the
function.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 45ee73f1b24246f13cd9583cb2ee25fb9c782db8
master date: 2025-03-20 12:28:30 +0100
Roger Pau Monné [Thu, 27 Mar 2025 13:55:29 +0000 (14:55 +0100)]
x86/mkelf32: account for offset when detecting note segment placement
mkelf32 attempt to check that the program header defined NOTE segment falls
inside of the LOAD segment, as the build-id should be loaded for Xen at
runtime to check.
However the current code doesn't take into account the LOAD program header
segment offset when calculating overlap with the NOTE segment. This
results in incorrect detection, and the following build error:
arch/x86/boot/mkelf32 --notes xen-syms ./.xen.elf32 0x200000 \
`nm xen-syms | sed -ne 's/^\([^ ]*\) . __2M_rwdata_end$/0x\1/p'`
Expected .note section within .text section!
Offset 4244776 not within 2910364!
Account for the program header offset of the LOAD segment when checking
whether the NOTE segments is contained within. Also fix the logic to
ensure the NOTE segments is fully contained between the LOAD segment.
Fixes: a353cab905af ('build_id: Provide ld-embedded build-ids') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e5fed7cb67c9f84653cdbd3924b8a119ef653be
master date: 2025-03-20 12:28:30 +0100
Jan Beulich [Thu, 20 Mar 2025 11:55:32 +0000 (12:55 +0100)]
x86/setup: correct off-by-1 in module mapping
If a module's length is an exact multiple of PAGE_SIZE, the 2nd argument
passed to set_pdx_range() would be one larger than intended. Use
PFN_{UP,DOWN}() there instead.
Fixes: cd7cc5320bb2 ("x86/boot: add start and size fields to struct boot_module") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: c3b54301fbe918c7e7e9c3b2dfe671c1ab79f882
master date: 2025-03-20 08:51:55 +0100
Andrew Cooper [Thu, 20 Mar 2025 11:54:53 +0000 (12:54 +0100)]
x86/mm: Fix IS_ALIGNED() check in IS_LnE_ALIGNED()
The current CI failures turn out to be a latent bug triggered by a narrow set
of properties of the initrd and the host memory map, which CI encountered by
chance.
One step during boot involves constructing directmap mappings for modules.
With some probing at the point of creation, it is observed that there's a 4k
mapping missing towards the end of the initrd.
The conditions for this bug appear to be map_pages_to_xen() call with a start
address of exactly 4k beyond a 2M boundary, some number of full 2M pages, then
a tail needing 4k pages.
Anyway, the condition for spotting superpage boundaries in map_pages_to_xen()
is wrong. The IS_ALIGNED() macro expects a power of two for the alignment
argument, and subtracts 1 itself.
Fixing this causes the failing case to now boot.
Fixes: 97fb6fcf26e8 ("x86/mm: introduce helpers to detect super page alignment") Debugged-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b07c7d63f9b587e4df5d71f6da9eaa433512c974
master date: 2025-03-19 14:53:28 +0000
Roger Pau Monné [Thu, 20 Mar 2025 11:54:16 +0000 (12:54 +0100)]
x86/ioremap: prevent additions against the NULL pointer
This was reported by clang UBSAN as:
UBSAN: Undefined behaviour in arch/x86/mm.c:6297:40
applying zero offset to null pointer
[...]
Xen call trace:
[<ffff82d040303662>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304aa3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d0406ebbc0>] F ioremap_wc+0xc8/0xe0
[<ffff82d0406c3728>] F video_init+0xd0/0x180
[<ffff82d0406ab6f5>] F console_init_preirq+0x3d/0x220
[<ffff82d0406f1876>] F __start_xen+0x68e/0x5530
[<ffff82d04020482e>] F __high_start+0x8e/0x90
Fix bt_ioremap() and ioremap{,_wc}() to not add the offset if the returned
pointer from __vmap() is NULL.
Fixes: d0d4635d034f ('implement vmap()') Fixes: f390941a92f1 ('x86/DMI: fix table mapping when one lives above 1Mb') Fixes: 81d195c6c0e2 ('x86: introduce ioremap_wc()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9a6f2c52f75781acda39fab5cc96d1bcc54bf534
master date: 2025-03-17 13:33:29 +0100
Jan Beulich [Thu, 20 Mar 2025 11:53:51 +0000 (12:53 +0100)]
libxl: avoid infinite loop in libxl__remove_directory()
Infinitely retrying the rmdir() invocation makes little sense. While the
original observation was the log filling the disk (due to repeated
"Directory not empty" errors, in turn occurring for unclear reasons),
the loop wants breaking even if there was no error message being logged
(much like is done in the similar loops in libxl__remove_file() and
libxl__remove_file_or_directory()).
Fixes: c4dcbee67e6d ("libxl: provide libxl__remove_file et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 68baeb5c4852e652b9599e049f40477edac4060e
master date: 2025-03-13 10:23:10 +0100
Juergen Gross [Thu, 20 Mar 2025 11:53:29 +0000 (12:53 +0100)]
xen/sched: fix arinc653 to not use variables across cpupools
a653sched_do_schedule() is using two function local static variables,
which is resulting in bad behavior when using more than one cpupool
with the arinc653 scheduler.
Fix that by moving those variables to the scheduler private data.
Fixes: 22787f2e107c ("ARINC 653 scheduler") Reported-by: Choi Anderson <Anderson.Choi@boeing.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Nathan Studer <nathan.studer@dornerworks.com>
master commit: d0561ac8ab0e780b1e8ab41d0d15e9f9b076dee3
master date: 2025-03-14 10:17:11 +0100
Jason Andryuk [Thu, 20 Mar 2025 11:53:13 +0000 (12:53 +0100)]
tools/libxl: Skip missing PCI GSIs
A PCI device may not have a legacy IRQ. In that case, we don't need to
do anything, so don't fail in libxl__arch_hvm_map_gsi() and
libxl__arch_hvm_unmap_gsi().
Requires an updated pciback to return -ENOENT.
Fixes: f97f885c7198 ("tools: Add new function to do PIRQ (un)map on PVH dom0") Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: d3ac8fdce26476d148fb2a8f18c7e5b0c153be0a
master date: 2025-03-13 10:23:52 +0100
Jason Andryuk [Thu, 20 Mar 2025 11:52:57 +0000 (12:52 +0100)]
tools/ctrl: Silence missing GSI in xc_pcidev_get_gsi()
It is valid for a PCI device to not have a legacy IRQ. In that case, do
not print an error to keep the logs clean.
This relies on pciback being updated to return -ENOENT for a missing
GSI.
Fixes: b93e5981d258 ("tools: Add new function to get gsi from dev") Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 8808f45305d146d05469ed4551bb5ccf931b1800
master date: 2025-03-13 10:23:42 +0100
Roger Pau Monné [Thu, 20 Mar 2025 11:52:00 +0000 (12:52 +0100)]
x86/vmx: fix posted interrupts usage of msi_desc->msg field
The current usage of msi_desc->msg in vmx_pi_update_irte() will make the
field contain a translated MSI message, instead of the expected
untranslated one. This breaks dump_msi(), that use the data in
msi_desc->msg to print the interrupt details.
Fix this by introducing a dummy local msi_msg, and use it with
iommu_update_ire_from_msi(). vmx_pi_update_irte() relies on the MSI
message not changing, so there's no need to propagate the resulting msi_msg
to the hardware, and the contents can be ignored.
Additionally add a comment to clarify that msi_desc->msg must always
contain the untranslated MSI message.
Fixes: a5e25908d18d ('VT-d: introduce new fields in msi_desc to track binding with guest interrupt') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30f0e55a79206702b4e82e86dad6b35033157858
master date: 2025-03-12 13:32:30 +0100
Roger Pau Monné [Thu, 20 Mar 2025 11:51:40 +0000 (12:51 +0100)]
x86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD
The MMIO_CONF_BASE reports the base of the MCFG range on AMD systems.
Linux pre-6.14 is unconditionally attempting to read the MSR without a
safe MSR accessor, and since Xen doesn't allow access to it Linux reports
the following error:
Such access is conditional to the presence of a device with PnP ID
"PNP0c01", which triggers the execution of the quirk_amd_mmconfig_area()
function. Note that prior to commit 3fac3734c43a MSR accesses when running
as a PV guest would always use the safe variant, and thus silently handle
the #GP.
Fix by allowing access to the MSR on AMD systems for the hardware domain.
Write attempts to the MSR will still result in #GP for all domain types.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b4071d28c5bd9ca4fed76031cbf0e782b74209b9
master date: 2025-03-12 13:32:30 +0100
Andrew Cooper [Thu, 20 Mar 2025 11:50:38 +0000 (12:50 +0100)]
x86/vlapic: Fix handling of writes to APIC_ESR
Xen currently presents APIC_ESR to guests as a simple read/write register.
This is incorrect. The SDM states:
The ESR is a write/read register. Before attempt to read from the ESR,
software should first write to it. (The value written does not affect the
values read subsequently; only zero may be written in x2APIC mode.) This
write clears any previously logged errors and updates the ESR with any
errors detected since the last write to the ESR.
Introduce a new pending_esr field in hvm_hw_lapic.
Update vlapic_error() to accumulate errors here, and extend vlapic_reg_write()
to discard the written value and transfer pending_esr into APIC_ESR. Reads
are still as before.
Importantly, this means that guests no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b28b590d4a23894672f1dd7fb98cdf9926ecb282
master date: 2025-03-07 14:34:08 +0000
Juergen Gross [Thu, 20 Mar 2025 11:50:24 +0000 (12:50 +0100)]
tools/xl: fix channel configuration setting
Channels work differently than other device types: their devid should
be -1 initially in order to distinguish them from the primary console
which has the devid of 0.
So when parsing the channel configuration, use
ARRAY_EXTEND_INIT_NODEVID() in order to avoid overwriting the devid
set by libxl_device_channel_init().
Fixes: 3a6679634766 ("libxl: set channel devid when not provided by application") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e1ccced4afe465d6541c5825a0f8d1b8f5fa4253
master date: 2025-03-05 16:37:37 +0100
Roger Pau Monné [Thu, 20 Mar 2025 11:49:57 +0000 (12:49 +0100)]
x86/dom0: be less restrictive with the Interrupt Address Range
Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff]. This range has
two different purposes. For accesses from the CPU is contains the default
position of local APIC page at 0xfee00000. For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) point to an address on that range to trigger an interrupt.
There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
Linux kernel maps 0xfeec2000:
Remove the restrictions to create mappings in the interrupt address range
for dom0. Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present. Additionally make sure the
emulated local APIC page is also not mapped, in case dom0 is using it.
Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages. Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.
There's also the following restriction noted in Intel VT-d:
> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).
Similarly for AMD-Vi:
> Accesses to the interrupt address range (Table 3) are defined to go through
> the interrupt remapping portion of the IOMMU and not through address
> translation processing. Therefore, when a transaction is being processed as
> an interrupt remapping operation, the transaction attribute of
> pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should
> not be configured such that an address translation results in a special
> address such as the interrupt address range.
However those restrictions don't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping, and hence there's no output address after translation that
belongs to the interrupt address range.
Roger Pau Monné [Thu, 20 Mar 2025 11:49:30 +0000 (12:49 +0100)]
x86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables
The current code in arch_iommu_hwdom_init() kind of open-codes the same
MMIO permission ranges that are added to the hardware domain ->iomem_caps.
Avoid this duplication and use ->iomem_caps in arch_iommu_hwdom_init() to
filter which memory regions should be added to the dom0 IOMMU page-tables.
Note the IO-APIC and MCFG page(s) must be set as not accessible for a PVH
dom0, otherwise the internal Xen emulation for those ranges won't work.
This requires adjustments in dom0_setup_permissions().
The call to pvh_setup_mmcfg() in dom0_construct_pvh() must now strictly be
done ahead of setting up dom0 permissions, so take the opportunity to also
put it inside the existing is_hardware_domain() region.
Also the special casing of E820_UNUSABLE regions no longer needs to be done
in arch_iommu_hwdom_init(), as those regions are already blocked in
->iomem_caps and thus would be removed from the rangeset as part of
->iomem_caps processing in arch_iommu_hwdom_init(). The E820_UNUSABLE
regions below 1Mb are not removed from ->iomem_caps, that's a slight
difference for the IOMMU created page-tables, but the aim is to allow
access to the same memory either from the CPU or the IOMMU page-tables.
Since ->iomem_caps already takes into account the domain max paddr, there's
no need to remove any regions past the last address addressable by the
domain, as applying ->iomem_caps would have already taken care of that.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62f3fc5296c452285e81adb50976bde2d68d3181
master date: 2025-03-05 10:26:46 +0100
Roger Pau Monné [Thu, 20 Mar 2025 11:49:12 +0000 (12:49 +0100)]
x86/dom0: correctly set the maximum ->iomem_caps bound for PVH
The logic in dom0_setup_permissions() sets the maximum bound in
->iomem_caps unconditionally using paddr_bits, which is not correct for HVM
based domains. Instead use domain_max_paddr_bits() to get the correct
maximum paddr bits for each possible domain type.
Switch to using PFN_DOWN() instead of PAGE_SHIFT, as that's shorter.
Fixes: 53de839fb409 ('x86: constrain MFN range Dom0 may access') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a00e08799cc7657d2a1aca158f4ad43d4c9103e7
master date: 2025-03-05 10:26:46 +0100
Roger Pau Monné [Thu, 20 Mar 2025 11:48:51 +0000 (12:48 +0100)]
x86/dom0: attempt to fixup p2m page-faults for PVH dom0
When building a PVH dom0 Xen attempts to map all (relevant) MMIO regions
into the p2m for dom0 access. However the information Xen has about the
host memory map is limited. Xen doesn't have access to any resources
described in ACPI dynamic tables, and hence the p2m mappings provided might
not be complete.
PV doesn't suffer from this issue because a PV dom0 is capable of mapping
into it's page-tables any address not explicitly banned in d->iomem_caps.
Introduce a new command line options that allows Xen to attempt to fixup
the p2m page-faults, by creating p2m identity maps in response to p2m
page-faults.
This is aimed as a workaround to small ACPI regions Xen doesn't know about.
Note that missing large MMIO regions mapped in this way will lead to
slowness due to the VM exit processing, plus the mappings will always use
small pages.
The ultimate aim is to attempt to bring better parity with a classic PV
dom0.
Note such fixup rely on the CPU doing the access to the unpopulated
address. If the access is attempted from a device instead there's no
possible way to fixup, as IOMMU page-fault are asynchronous.
Roger Pau Monné [Thu, 20 Mar 2025 11:48:32 +0000 (12:48 +0100)]
x86/emul: dump unhandled memory accesses for PVH dom0
A PV dom0 can map any host memory as long as it's allowed by the IO
capability range in d->iomem_caps. On the other hand, a PVH dom0 has no
way to populate MMIO region onto it's p2m, so it's limited to what Xen
initially populates on the p2m based on the host memory map and the enabled
device BARs.
Introduce a new debug build only printk that reports attempts by dom0 to
access addresses not populated on the p2m, and not handled by any emulator.
This is for information purposes only, but might allow getting an idea of
what MMIO ranges might be missing on the p2m.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 43d8a80a0cccfe3715bb3178b5c15fb983979651
master date: 2025-03-05 10:26:46 +0100
Andrew Cooper [Mon, 3 Mar 2025 14:06:55 +0000 (14:06 +0000)]
CHANGELOG.md: Set release date for 4.20
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit e28802927e0a24dab9c73082c3e322ef4dd0bd02)
Oleksii Kurochko [Thu, 27 Feb 2025 14:27:52 +0000 (15:27 +0100)]
CHANGELOG.md: Finalize changes in 4.20 release cycle
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit d3a7d29d76fe4ca4f58164cbe20a6b2dd4500ab8)
Jan Beulich [Thu, 27 Feb 2025 12:58:32 +0000 (12:58 +0000)]
IOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe
The function's use from set_msi_source_id() is guaranteed to be in an
IRQs-off region. While the invocation of that function could be moved
ahead in msi_msg_to_remap_entry() (doesn't need to be in the IOMMU-
intremap-locked region), the call tree from map_domain_pirq() holds an
IRQ descriptor lock. Hence all use sites of the lock need become IRQ-
safe ones.
In find_upstream_bridge() do a tiny bit of tidying in adjacent code:
Change a variable's type to unsigned and merge a redundant assignment
into another variable's initializer.
This is XSA-467 / CVE-2025-1713.
Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 39bc6af3ba483282ed6bbf94b08aec38c93d39e6)
Andrew Cooper [Wed, 26 Feb 2025 03:27:33 +0000 (21:27 -0600)]
PPC: Activate UBSAN in testing
Also enable -fno-sanitize=alignment like x86 since support for unaligned
accesses is guaranteed by the ISA and the existing OPAL setup code
relies on it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit 7cf163879c5add0a4f7f9c987b61f04f8f7051b1)
Jan Beulich [Thu, 20 Feb 2025 12:50:19 +0000 (13:50 +0100)]
x86/MCE-telem: adjust cookie definition
struct mctelem_ent is opaque outside of mcetelem.c; the cookie
abstraction exists - afaict - just to achieve this opaqueness. Then it
is irrelevant though which kind of pointer mctelem_cookie_t resolves to.
IOW we can as well use struct mctelem_ent there, allowing to remove the
casts from COOKIE2MCTE() and MCTE2COOKIE(). Their removal addresses
Misra C:2012 rule 11.2 ("Conversions shall not be performed between a
pointer to an incomplete type and any other type") violations.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-By: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Andrew Cooper [Mon, 17 Feb 2025 15:51:51 +0000 (15:51 +0000)]
x86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()
There is a corner case in the VMRUN instruction where its INTR_SHADOW state
leaks into guest state if a VMExit occurs before the VMRUN is complete. An
example of this could be taking #NPF due to event injection.
Xen can safely execute STI anywhere between CLGI and VMRUN, as CLGI blocks
external interrupts too. However, an exception (while fatal) will appear to
be in an irqs-on region (as GIF isn't considered), so position the STI after
the speculation actions but prior to the GPR pops.
xen/memory: Make resource_max_frames() to return 0 on unknown type
This is actually what the caller acquire_resource() expects on any kind
of error (the comment on top of resource_max_frames() also suggests that).
Otherwise, the caller will treat -errno as a valid value and propagate incorrect
nr_frames to the VM. As a possible consequence, a VM trying to query a resource
size of an unknown type will get the success result from the hypercall and obtain
nr_frames 4294967201.
Also, add an ASSERT_UNREACHABLE() in the default case of _acquire_resource(),
normally we won't get to this point, as an unknown type will always be rejected
earlier in resource_max_frames().
Also, update test-resource app to verify that Xen can deal with invalid
(unknown) resource type properly.
Fixes: 9244528955de ("xen/memory: Fix acquire_resource size semantics") Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Wed, 22 Jan 2025 12:13:24 +0000 (12:13 +0000)]
xen/console: Fix truncation of panic() messages
The panic() function uses a static buffer to format its arguments into, simply
to emit the result via printk("%s", buf). This buffer is not large enough for
some existing users in Xen. e.g.:
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Invalid device tree blob at physical address 0x46a00000.
(XEN) The DTB must be 8-byte aligned and must not exceed 2 MB in size.
(XEN)
(XEN) Plea****************************************
The remainder of this particular message is 'e check your bootloader.', but
has been inherited by RISC-V from ARM.
It is also pointless double buffering. Implement vprintk() beside printk(),
and use it directly rather than rendering into a local buffer, removing it as
one source of message limitation.
This marginally simplifies panic(), and drops a global used-once buffer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
+ dump_execution_state();
+
for_each_domain( d )
domain_unpause_by_systemcontroller(d);
failed with:
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) CPU0: Unexpected Trap: Undefined Instruction
(XEN) ----[ Xen-4.20-rc arm32 debug=n Not tainted ]----
(XEN) CPU: 0
<snip>
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Undefined Instruction
(XEN) ****************************************
This is because the condition for init text is wrong. While there's nothing
interesting from that point onwards in start_xen(), it's also wrong for
livepatches too.
Use is_active_kernel_text() which is the correct test for this purpose, and is
aware of init and livepatch regions as well as their lifetimes.
Fixes: 3e802c6ca1fb ("xen/arm: Correctly support WARN_ON") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 4 Feb 2025 10:46:14 +0000 (11:46 +0100)]
x86/iommu: disable interrupts at shutdown
Add a new hook to inhibit interrupt generation by the IOMMU(s). Note the
hook is currently only implemented for x86 IOMMUs. The purpose is to
disable interrupt generation at shutdown so any kexec chained image finds
the IOMMU(s) in a quiesced state.
It would also prevent "Receive accept error" being raised as a result of
non-disabled interrupts targeting offline CPUs.
Note that the iommu_quiesce() call in nmi_shootdown_cpus() is still
required even when there's a preceding iommu_crash_shutdown() call; the
later can become a no-op depending on the setting of the "crash-disable"
command line option.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Wed, 5 Feb 2025 14:05:47 +0000 (15:05 +0100)]
x86/pci: disable MSI(-X) on all devices at shutdown
Attempt to disable MSI(-X) capabilities on all PCI devices know by Xen at
shutdown. Doing such disabling should facilitate kexec chained kernel from
booting more reliably, as device MSI(-X) interrupt generation should be
quiesced.
Only attempt to disable MSI(-X) on all devices in the crash context if the
PCI lock is not taken, otherwise the PCI device list could be in an
inconsistent state. This requires introducing a new pcidevs_trylock()
helper to check whether the lock is currently taken.
Disabling MSI(-X) should prevent "Receive accept error" being raised as a
result of non-disabled interrupts targeting offline CPUs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Thu, 6 Feb 2025 11:20:04 +0000 (12:20 +0100)]
x86/smp: perform disabling on interrupts ahead of AP shutdown
Move the disabling of interrupt sources so it's done ahead of the offlining
of APs. This is to prevent AMD systems triggering "Receive accept error"
when interrupts target CPUs that are no longer online.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 28 Jan 2025 15:06:07 +0000 (16:06 +0100)]
x86/irq: drop fixup_irqs() parameters
The solely remaining caller always passes the same globally available
parameters. Drop the parameters and modify fixup_irqs() to use
cpu_online_map in place of the input mask parameter, and always be verbose
in its output printing.
While there remove some of the checks given the single context where
fixup_irqs() is now called, which should always be in the CPU offline path,
after the CPU going offline has been removed from cpu_online_map.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 28 Jan 2025 08:34:20 +0000 (09:34 +0100)]
x86/shutdown: offline APs with interrupts disabled on all CPUs
The current shutdown logic in smp_send_stop() will disable the APs while
having interrupts enabled on the BSP or possibly other APs. On AMD systems
this can lead to local APIC errors:
APIC error on CPU0: 00(08), Receive accept error
Such error message can be printed in a loop, thus blocking the system from
rebooting. I assume this loop is created by the error being triggered by
the console interrupt, which is further stirred by the ESR handler
printing to the console.
Intel SDM states:
"Receive Accept Error.
Set when the local APIC detects that the message it received was not
accepted by any APIC on the APIC bus, including itself. Used only on P6
family and Pentium processors."
So the error shouldn't trigger on any Intel CPU supported by Xen.
However AMD doesn't make such claims, and indeed the error is broadcast to
all local APICs when an interrupt targets a CPU that's already offline.
To prevent the error from stalling the shutdown process perform the
disabling of APs and the BSP local APIC with interrupts disabled on all
CPUs in the system, so that by the time interrupts are unmasked on the BSP
the local APIC is already disabled. This can still lead to a spurious:
APIC error on CPU0: 00(00)
As a result of an LVT Error getting injected while interrupts are masked on
the CPU, and the vector only handled after the local APIC is already
disabled. ESR reports 0 because as part of disable_local_APIC() the ESR
register is cleared.
Note the NMI crash path doesn't have such issue, because disabling of APs
and the caller local APIC is already done in the same contiguous region
with interrupts disabled. There's a possible window on the NMI crash path
(nmi_shootdown_cpus()) where some APs might be disabled (and thus
interrupts targeting them raising "Receive accept error") before others APs
have interrupts disabled. However the shutdown NMI will be handled,
regardless of whether the AP is processing a local APIC error, and hence
such interrupts will not cause the shutdown process to get stuck.
Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
intended goal of moving all interrupts to the BSP anyway. The logic in
fixup_irqs() will move interrupts whose affinity doesn't overlap with the
passed mask, but the movement of interrupts is done to any CPU set in
cpu_online_map. As in the shutdown path fixup_irqs() is called before APs
are cleared from cpu_online_map this leads to interrupts being shuffled
around, but not assigned to the BSP exclusively.
The Fixes tag is more of a guess than a certainty; it's possible the
previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
delivered before APs went offline. However fixup_irqs() was still
incorrectly used, as it didn't (and still doesn't) move all interrupts to
target the provided cpu mask.
Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new destination in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Andrew Cooper [Fri, 7 Feb 2025 21:19:21 +0000 (21:19 +0000)]
RISCV: Activate UBSAN in testing
RISC-V has less complicated headers, so update ubsan.c to pull in everything
it needs. Provide dump_execution_state(), and update the printk() message to
make it more obvious that it's an outstanding task.
As with commit 8ef2ac727e21 ("automation: enable UBSAN for debug tests"),
enable UBSAN in RISC-V testing too.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Fri, 7 Feb 2025 15:04:25 +0000 (15:04 +0000)]
RISCV/asm: Use CALL rather than JAL
JAL has a maximium displacement of 2M. To branch further, it needs pairing
with an AUIPC instruction. CALL is a pseudoinstruction which allows the
linker to pick the appropriate sequence when relaxations are enabled.
This avoids a build failure of the form:
prelink.o: in function `start':
xen/xen/arch/riscv/riscv64/head.S:28:(.text.header+0x2c):
relocation truncated to fit: R_RISCV_JAL against symbol `calc_phys_offset' defined in .init.text section in prelink.o
make[3]: *** [arch/riscv/Makefile:18: xen-syms] Error 1
when Xen gets large enough, e.g. with CONFIG_UBSAN enabled.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Enable CONFIG_UBSAN and CONFIG_UBSAN_FATAL for the ARM64 and x86_64
build jobs, with debug enabled, which are later used for Xen tests on
QEMU and/or real hardware.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> R-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Jan Beulich [Fri, 7 Feb 2025 09:00:04 +0000 (10:00 +0100)]
radix-tree: introduce RADIX_TREE{,_INIT}()
... now that static initialization is possible. Use RADIX_TREE() for
pci_segments and ivrs_maps.
This then fixes an ordering issue on x86: With the call to
radix_tree_init(), acpi_mmcfg_init()'s invocation of pci_segments_init()
will zap the possible earlier introduction of segment 0 by
amd_iommu_detect_one_acpi()'s call to pci_ro_device(), and thus the
write-protection of the PCI devices representing AMD IOMMUs.
Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table parsing") Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jan Beulich [Fri, 7 Feb 2025 08:59:11 +0000 (09:59 +0100)]
radix-tree: purge node allocation override hooks
These were needed by TMEM only, which is long gone. The Linux original
doesn't have such either. This effectively reverts one of the "Other
changes" from 8dc6738dbb3c ("Update radix-tree.[ch] from upstream Linux
to gain RCU awareness").
Positive side effect: Two cf_check go away.
While there also convert xmalloc()+memset() to xzalloc(). (Don't convert
to xvzalloc(), as that would require touching the freeing side, too.)
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jan Beulich [Tue, 4 Feb 2025 12:50:49 +0000 (13:50 +0100)]
AMD/IOMMU: drop stray MSI enabling
While the 2nd of the commits referenced below should have moved the call
to amd_iommu_msi_enable() instead of adding another one, the situation
wasn't quite right even before: It can't have done any good to enable
MSI when no IRQ was allocated for it, yet.
The other call to amd_iommu_msi_enable(), just out of patch context,
needs to stay there until S3 resume is re-worked. For the boot path that
call should be unnecessary, as iommu{,_maskable}_msi_startup() will have
done it already (by way of invoking iommu_msi_unmask()).
Fixes: 5f569f1ac50e ("AMD/IOMMU: allow enabling with IRQ not yet set up") Fixes: d9e49d1afe2e ("AMD/IOMMU: adjust setup of internal interrupt for x2APIC mode") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Tested-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jens Wiklander [Mon, 3 Feb 2025 10:21:12 +0000 (11:21 +0100)]
xen/arm: ffa: fix bind/unbind notification
The notification bitmask is in passed in the FF-A ABI in two 32-bit
registers w3 and w4. The lower 32-bits should go in w3 and the higher in
w4. These two registers has unfortunately been swapped for
FFA_NOTIFICATION_BIND and FFA_NOTIFICATION_UNBIND in the FF-A mediator.
So fix that by using the correct registers.