It's unclear why -N is being used in the first place. It was added by
commit 4676bbf96dc8 back in 2002 without any justification.
When building a PE image it's actually detrimental to forcefully set the
.text section as writable. The GNU LD man page contains the following
warning regarding the -N option:
> Note: Although a writable text section is allowed for PE-COFF targets, it
> does not conform to the format specification published by Microsoft.
Remove the usage of -N uniformly on all architectures, assuming that the
addition was simply done as a copy and paste of the original x86 linking
rune.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: d444763f8ca556d0a67a4b933be303d346baef02
master date: 2025-04-23 16:12:25 +0200
x86/intel: workaround several MONITOR/MWAIT errata
There are several errata on Intel regarding the usage of the MONITOR/MWAIT
instructions, all having in common that stores to the monitored region
might not wake up the CPU.
Fix them by forcing the sending of an IPI for the affected models.
The Ice Lake issue has been reproduced internally on XenServer hardware,
and the fix does seem to prevent it. The symptom was APs getting stuck in
the idle loop immediately after bring up, which in turn prevented the BSP
from making progress. This would happen before the watchdog was
initialized, and hence the whole system would get stuck.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4aae4452efeee3d3bba092b875e37d1e7c8f6db9
master date: 2025-04-23 16:12:25 +0200
Jan Beulich [Tue, 29 Apr 2025 09:57:07 +0000 (11:57 +0200)]
compat/memory: avoid UB shifts in XENMEM_exchange handling
Add an early basic check, yielding the same error code as the more
thorough one the main handler would produce.
Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 560c51be8f6a88cde43c0a7c8be60158b5725982
master date: 2025-04-22 11:25:23 +0200
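As an illustration of the kind of early check described for the XENMEM_exchange change above, here is a minimal sketch. It is not the Xen code: extent_bytes(), its parameters, and the -EINVAL return value are hypothetical stand-ins. It only shows that a shift count has to be validated before the shift is evaluated.

```c
#include <errno.h>
#include <limits.h>

/*
 * Hypothetical helper (not from Xen): compute 1UL << (order + page_shift),
 * rejecting any combination that would shift by the type width or more,
 * which is undefined behaviour in C.
 */
static int extent_bytes(unsigned int order, unsigned int page_shift,
                        unsigned long *bytes)
{
    const unsigned int width = sizeof(unsigned long) * CHAR_BIT;

    /* Check first, shift second; the error code here is a placeholder. */
    if ( page_shift >= width || order >= width - page_shift )
        return -EINVAL;

    *bytes = 1UL << (order + page_shift);

    return 0;
}
```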
Jan Beulich [Tue, 29 Apr 2025 09:56:44 +0000 (11:56 +0200)]
x86emul: also clip repetition count for STOS
Like MOVS, INS, and OUTS, STOS also has a special purpose hook, where
the hook function may legitimately have the same expectation as to the
request not straddling address space start/end.
Fixes: 5dfe4aa4eeb6 ("x86_emulate: Do not request emulation of REP instructions beyond the") Reported-by: Fabian Specht <f.specht@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8c5636b6c87777e6c2e4ffae28bffe1cfc189bfd
master date: 2025-04-22 11:24:20 +0200
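The clipping idea in the STOS change above can be sketched as follows. This is an assumption-laden illustration, not the emulator's code: clip_reps() is a hypothetical name, it models a flat 64-bit address space, and it only covers ascending accesses (DF clear).

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from x86emul): limit a repetition count so that
 * reps * bytes_per_rep bytes starting at addr do not run past the end of a
 * 64-bit address space. bytes_per_rep must be non-zero.
 */
static uint64_t clip_reps(uint64_t addr, unsigned int bytes_per_rep,
                          uint64_t reps)
{
    /* Bytes available from addr to the top, minus one to avoid overflow. */
    uint64_t room_minus_1 = UINT64_MAX - addr;
    uint64_t max_reps = room_minus_1 / bytes_per_rep;

    /* The remainder plus the withheld byte may complete one more element. */
    if ( room_minus_1 % bytes_per_rep == bytes_per_rep - 1 &&
         max_reps != UINT64_MAX )
        max_reps++;

    return reps < max_reps ? reps : max_reps;
}
```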
Jan Beulich [Tue, 29 Apr 2025 09:56:25 +0000 (11:56 +0200)]
x86/HVM: update repeat count upon nested lin->phys failure
For the X86EMUL_EXCEPTION case the repeat count must be correctly
propagated back. Since for the recursive invocation we use a local
helper variable, its value needs copying to the caller's one.
While there also correct the off-by-1 range in the comment ahead of the
function (strictly speaking for the "DF set" case we'd need to put
another, different range there as well).
Fixes: 53f87c03b4ea ("x86emul: generalize exception handling for rep_* hooks") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c07b16fd6e47782ebf1ee767cd07c1e2b4140f47
master date: 2025-04-17 10:01:19 +0200
Jan Beulich [Tue, 29 Apr 2025 09:55:16 +0000 (11:55 +0200)]
include: sort $(wildcard ...) results
The order of items is stored in .*.chk.cmd, and hence variations between
how items are ordered would result in re-invocation of the checking rule
during "make install-xen" despite that already having successfully run
earlier on. The difference can become noticeable when building (as
non-root) and installing (as root) use different GNU make versions: In 3.82
the sorting was deliberately undone, just for it to be restored in 4.3.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ff835bbc8096a14ed1bffa235e25848c993f7240
master date: 2025-04-10 10:56:29 +0200
xen: x86: irq: initialize irq desc in create_irq()
While building Xen with GCC 14.2.1 with the "-fcondition-coverage" option
or with "-Og", the compiler produces a false positive warning:
arch/x86/irq.c: In function ‘create_irq’:
arch/x86/irq.c:281:11: error: ‘desc’ may be used uninitialized [-Werror=maybe-uninitialized]
281 | ret = init_one_irq_desc(desc);
| ^~~~~~~~~~~~~~~~~~~~~~~
arch/x86/irq.c:269:22: note: ‘desc’ was declared here
269 | struct irq_desc *desc;
| ^~~~
cc1: all warnings being treated as errors
make[2]: *** [Rules.mk:252: arch/x86/irq.o] Error 1
While we have signed/unsigned comparison both in "for" loop and in
"if" statement, this still can't lead to use of uninitialized "desc",
as either loop will be executed at least once, or the function will
return early. So this is a clearly false positive warning due to a
bug [1] in GCC.
Ahmed S. Darwish [Tue, 29 Apr 2025 09:54:35 +0000 (11:54 +0200)]
x86/cpu: Validate CPUID leaf 0x2 EDX output
CPUID leaf 0x2 emits one-byte descriptors in its four output registers
EAX, EBX, ECX, and EDX. For these descriptors to be valid, the most
significant bit (MSB) of each register must be clear.
Leaf 0x2 parsing at intel.c only validated the MSBs of EAX, EBX, and
ECX, but left EDX unchecked.
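A minimal sketch of checking all four leaf 0x2 registers uniformly is shown below. sanitize_leaf2() is a hypothetical name and this is not the code in intel.c; it assumes the registers are passed in the order EAX, EBX, ECX, EDX, and it treats a register with bit 31 set as carrying no valid descriptors.

```c
#include <stdint.h>

/*
 * Hypothetical helper: regs[] holds EAX, EBX, ECX, EDX from CPUID leaf 0x2.
 * A register whose most significant bit is set is zeroed, so that its four
 * bytes read as null descriptors. Returns the number of valid registers.
 */
static unsigned int sanitize_leaf2(uint32_t regs[4])
{
    unsigned int i, valid = 0;

    for ( i = 0; i < 4; i++ )
    {
        if ( regs[i] & (UINT32_C(1) << 31) )
            regs[i] = 0;
        else
            valid++;
    }

    return valid;
}
```

Looping over all four registers, rather than naming three of them, makes it harder to leave one unchecked.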
xen: vm_event: do not do vm_event_op for an invalid domain
A privileged domain can issue XEN_DOMCTL_vm_event_op with
op->domain == DOMID_INVALID. In this case vm_event_domctl()
function will get NULL as the first parameter and this will
cause a hypervisor panic, as it tries to dereference this pointer.
Fix the issue by checking if valid domain is passed in.
Fixes: 48b84249459f ("xen/vm-event: Drop unused u_domctl parameter from vm_event_domctl()") Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 6a884750f3b86a45ee5ffbd825c346fcbce86080
master date: 2025-04-08 09:36:38 +0200
Jan Beulich [Tue, 29 Apr 2025 09:52:22 +0000 (11:52 +0200)]
x86/MTRR: hook mtrr_bp_restore() back up
Unlike stated in the offending commit's description,
load_system_tables() wasn't the only thing left to retain from the
earlier restore_rest_processor_state(). Note that MTRR state was still
reloaded via mtrr_aps_sync_end(), but that happens quite a bit later in
the resume process.
While there also do Misra-related tidying for the function itself: The
function being used from assembly only means it doesn't need to have a
declaration, but wants to be asmlinkage.
Fixes: 4304ff420e51 ("x86/S3: Drop {save,restore}_rest_processor_state() completely") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0414dedd6fde1a1c5c5e38dcbef4dad506e1398c
master date: 2025-04-03 09:39:13 +0200
Andrew Cooper [Tue, 8 Apr 2025 16:09:15 +0000 (17:09 +0100)]
x86/ucode: Extend AMD digest checks to cover Zen5 CPUs
AMD have updated the SB-7033 advisory to include Zen5 CPUs. Extend the digest
check to cover Zen5 too.
In practice, cover everything until further notice.
Observant readers may be wondering where the update to the digest list is. At
the time of writing, no Zen5 patches are available via a verifiable channel.
x86/ucode: Extend warning about disabling digest check too
This was missed by accident.
Fixes: b63951467e96 ("x86/ucode: Extend AMD digest checks to cover Zen5 CPUs") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 59bb316ea89e7f9461690fe00547d7d2af96321d)
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode
Collisions have been found in the microcode signing algorithm used by AMD
Fam17h/19h CPUs, and now anyone can sign their own.
For more details, see:
https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking
https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html
As a stopgap mitigation, check the digest of patches against a table of blobs
with known provenance. These are all Fam17h and Fam19h blobs included in
linux-firmware at the time of writing, specifically:
These checks can be opted out of by booting with ucode=no-digest-check, but
doing so is not recommended.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 630e8875ab368b97cc7231aaf3809e3d7d5687e1)
Xen: CI fix from XSN-2
* Add cf_check annotation to cmp_patch_id() used by bsearch().
Fixes: 630e8875ab36 ("x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
xen/lib: Introduce SHA2-256
A future change will need to calculate SHA2-256 digests. Introduce an
implementation in lib/, derived from Trenchboot which itself is derived from
Linux.
In order to be useful to other architectures, it is careful with endianness
and misaligned accesses as well as being more MISRA friendly, but is only
wired up for x86 in the short term.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 372af524411f5a013bcb0b117073d8d07c026563)
Xen: CI fix from XSN-2
* Add U suffix to the K[] table to fix MISRA Rule 7.2 violations.
Fixes: 372af524411f ("xen/lib: Introduce SHA2-256") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)
tools/libxl: do not use `-c -E` compiler options together
It makes no sense to request preprocessor-only output and also request
object file generation. Fix the _libxl.api-for-check target to only use
-E (preprocessor output).
Also Clang 20.0 reports an error if both options are used.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: 2862bf5b6c81 ('libxl: enforce prohibitions of internal callers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
(cherry picked from commit a235f856e4bbd270b085590e1f5fc9599234dcdf)
automation/eclair: Remove bespoke service B.UNEVALEFF
The Eclair runners in GitlabCI have been updated. This service is now
included, and redefining it results in an error.
No functional change.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit a43b0a770bdcf3933634c860049e4bd65854e472)
This is AMD Zen2 (Ryzen 5 4500U specifically), in a HP Probook 445 G7.
This one has working S3, so add a test for it here.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit debe8bf537ec2c69a4734393cd2b0c7f57f74c0c)
Roger Pau Monne [Fri, 14 Mar 2025 12:37:46 +0000 (13:37 +0100)]
automation/cirrus-ci: add smoke tests for the FreeBSD builds
Introduce a basic set of smoke tests using the XTF selftest image, and run
them on QEMU. Use the matrix keyword to create a different task for each
XTF flavor on each FreeBSD build.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7973cba4dbf72f6b963c780e9d1e0b99fd9622b9)
Roger Pau Monne [Sat, 15 Mar 2025 08:35:12 +0000 (09:35 +0100)]
automation/cirrus-ci: use matrix keyword to generate per-version build tasks
Move the current logic to use the matrix keyword to generate a task for
each version of FreeBSD we want to build Xen on. The matrix keyword
however cannot be used in YAML aliases, so it needs to be explicitly used
inside of each task, which creates a bit of duplication. At least abstract
the FreeBSD minor version numbers to avoid repetition of image names.
Note that the full build uses matrix over an env variable instead of using
it directly in image_family. This is so that the alias can also be set
based on the FreeBSD version, in preparation for adding further tasks that
will depend on the full build having finished.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit b548a7dc4bb4202410983dd3ef7b5f1b469bdac3)
Roger Pau Monne [Mon, 17 Mar 2025 09:31:07 +0000 (10:31 +0100)]
automation/console.exp: do not assume expect is always at /usr/bin/
Instead use env to find the location of expect.
Additionally do not use the -f flag, as it's only meaningful when passing
arguments on the command line, which we never do for console.exp. From the
expect 5.45.4 man page:
> The -f flag prefaces a file from which to read commands from. The flag
> itself is optional as it is only useful when using the #! notation (see
> above), so that other arguments may be supplied on the command line.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Roger Pau Monne [Mon, 10 Mar 2025 17:41:57 +0000 (18:41 +0100)]
automation/cirrus-ci: store xen/.config as an artifact
Always store xen/.config as an artifact, renamed to xen-config to match
the naming used in the Gitlab CI tests.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 318659c818546b6415c2311339a53384dc20c427)
Andrew Cooper [Mon, 24 Feb 2025 15:36:11 +0000 (15:36 +0000)]
CirrusCI: Use shallow clone
This reduces the Clone step from ~50s to ~3s.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 96970b46e5e84fa2666f76f5b0972b826a3ffba4)
Add a new randconfig job for each FreeBSD version. This requires some
rework of the template so common parts can be shared between the full and
the randconfig builds. Such randconfig builds are relevant because FreeBSD
is the only tested system that has a full non-GNU toolchain.
While there replace the usage of the python311 package with python3, which is
already using 3.11, and remove the install of the plain python package for full
builds.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit c3f5d1bb40b57d467cb4051eafa86f5933ec9003)
Roger Pau Monne [Thu, 16 Jan 2025 08:07:31 +0000 (09:07 +0100)]
automation/cirrus-ci: update FreeBSD to 13.4
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 850a263b7863ae9c22296a357a50573d5e6ae9d7)
Jan Beulich [Wed, 2 Apr 2025 12:28:08 +0000 (14:28 +0200)]
x86/P2M: synchronize fast and slow paths of p2m_get_page_from_gfn()
Handling of both grants and foreign pages was different between the two
paths.
While permitting access to grants would be desirable, doing so would
require more involved handling; undo that for the time being. In
particular the page reference obtained would prevent the owning domain
from changing e.g. the page's type (after the grantee has released the
last reference of the grant). Instead perhaps another reference on the
grant would need obtaining. Which in turn would require determining
which grant that was.
Foreign pages in any event need permitting on both paths.
Introduce a helper function to be used on both paths, such that
respective checking differs in just the extra "to be unshared" condition
on the fast path.
While there adjust the sanity check for foreign pages: Don't leak the
reference on release builds when on a debug build the assertion would
have triggered. (Thanks to Roger for the suggestion.)
Fixes: 80ea7af17269 ("x86/mm: Introduce get_page_from_gfn()") Fixes: 50fe6e737059 ("pvh dom0: add and remove foreign pages") Fixes: cbbca7be4aaa ("x86/p2m: make p2m_get_page_from_gfn() handle grant case correctly") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a8325f981ce4ff8ac8bcc73735f357846b0a0fbb
master date: 2025-03-31 09:21:12 +0200
Andrew Cooper [Wed, 2 Apr 2025 12:27:56 +0000 (14:27 +0200)]
ARM/vgic: Fix out-of-bounds accesses in vgic_mmio_write_sgir()
The switch() statement is over bits 24:25 (unshifted) of the guest provided
value. This makes case 0x3: dead, and not an implementation of the 4th
possible state.
A guest which writes (0x3 << 24) | (0xff << 16) to this register will skip the
early exit, then enter bitmap_for_each() with targets not bound by nr_vcpus.
If the guest has fewer than 8 vCPUs, bitmap_for_each() will read off the end
of d->vcpu[] and use the resulting vcpu pointer to ultimately derive irq, and
perform out-of-bounds writes.
Fix this by changing case 0x3 to default.
Fixes: 08c688ca6422 ("ARM: new VGIC: Add SGIR register handler") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: be7f0cc651d8d02a95820792204c0558f1f29e03
master date: 2025-03-27 11:54:23 +0000
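The dead-case problem described for vgic_mmio_write_sgir() above can be sketched in isolation. The names below are hypothetical and this is not the VGIC code; the sketch only assumes, as the description states, that the switch operates on bits 25:24 without shifting them down.

```c
#include <stdint.h>

/* Bits 25:24 of the written value, left in place (unshifted). */
#define SGIR_MODE_MASK (UINT32_C(3) << 24)

enum sgi_mode { SGI_TARGET_LIST, SGI_ALL_BUT_SELF, SGI_SELF_ONLY, SGI_RESERVED };

static enum sgi_mode decode_sgir_mode(uint32_t val)
{
    switch ( val & SGIR_MODE_MASK )
    {
    case UINT32_C(0) << 24: return SGI_TARGET_LIST;
    case UINT32_C(1) << 24: return SGI_ALL_BUT_SELF;
    case UINT32_C(2) << 24: return SGI_SELF_ONLY;
    /*
     * A bare "case 0x3:" can never match an unshifted value, so the fourth
     * encoding would fall through unhandled. "default:" catches it.
     */
    default:                return SGI_RESERVED;
    }
}
```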
OCaml, in preparation for a renaming of the error string associated with
conversion failure in 'int_of_string' functions, started to issue this
warning:
File "process.ml", line 440, characters 13-28:
440 | | (Failure "int_of_string") -> reply_error "EINVAL"
^^^^^^^^^^^^^^^
Warning 52 [fragile-literal-pattern]: Code should not depend on the actual values of
this constructor's arguments. They are only for information
and may change in future versions. (See manual section 11.5)
Deal with this at the source, and instead create our own stable
ConversionFailure exception that's raised on the None case in
'int_of_string_opt'.
'c_int_of_string' is safe and does not raise such exceptions.
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c11772277fe5f1b0874141a24554c2e3da2d9a6e
master date: 2025-02-25 13:30:55 +0000
Jan Beulich [Thu, 27 Mar 2025 14:06:53 +0000 (15:06 +0100)]
Arm/domctl: correct XEN_DOMCTL_vuart_op error return value
copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 341c0df40bf73b0a5e27db27023ec400858a472d
master date: 2025-03-27 12:22:39 +0100
Jan Beulich [Thu, 27 Mar 2025 14:06:45 +0000 (15:06 +0100)]
x86/pmstat: correct get_cpufreq_para()'s error return value
copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
Fixes: 7542c4ff00f2 ("Add user PM control interface") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 855337ca4947508ffca23393e291c54b5307cc9a
master date: 2025-03-27 12:22:06 +0100
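The pattern shared by the two copy_to_guest() fixes above can be sketched as below. fake_copy_to_guest() is a hypothetical stand-in that only mimics the "returns the number of bytes not copied" convention; report_value() is likewise illustrative and not a Xen function.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for copy_to_guest(): returns the number of bytes NOT copied. */
static size_t fake_copy_to_guest(void *dst, size_t dst_len,
                                 const void *src, size_t len)
{
    size_t n = len < dst_len ? len : dst_len;

    memcpy(dst, src, n);

    return len - n;
}

/* Convert "bytes not copied" into an errno-style value for the caller. */
static int report_value(void *guest_buf, size_t guest_len, unsigned int value)
{
    if ( fake_copy_to_guest(guest_buf, guest_len, &value, sizeof(value)) )
        return -EFAULT;

    return 0;
}
```

Returning the raw count instead would hand the caller a positive number where it expects zero or a negative error code.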
Jan Beulich [Thu, 27 Mar 2025 14:06:33 +0000 (15:06 +0100)]
x86/PVH: account for module command line length
As per observation in practice, initrd->cmdline_pa is not normally zero.
Hence so far we always appended at least one byte. That alone may
already render insufficient the "allocation" made by find_memory().
Things would be worse when there's actually a (perhaps long) command
line.
Skip setup when the command line is empty. Amend the "allocation" size
by padding and actual size of module command line. Along these lines
also skip initrd setup when the initrd is zero size.
Fixes: 0ecb8eb09f9f ("x86/pvh: pass module command line to dom0") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 989584e532c9517a0f789e993f5f6744beaebe3e
master date: 2025-03-27 12:21:08 +0100
This is MOV %cr8, which is wired up for hvm_mov_{to,from}_cr(); the VMExit
fastpaths, but not for the full emulation slowpaths.
Xen's handling of %cr8 turns out to be quite wrong. At a minimum, we need
storage for %cr8 separate to APIC_TPR, and to alter intercepts based on
whether the vLAPIC is enabled or not. But that's more work than there is time
for in the short term, so make a stopgap fix.
Extend hvmemul_{read,write}_cr() with %cr8 cases. Unlike hvm_mov_to_cr(),
hardware hasn't filtered out invalid values (#GP checks are ahead of
intercepts), so introduce X86_CR8_VALID_MASK.
Reported-by: Petr Beneš <w1benny@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 14fd9b5642cd4805b49fbe716bf2cd577e724169
master date: 2025-03-26 11:54:59 +0000
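A minimal sketch of the validity check implied by X86_CR8_VALID_MASK follows. The helper names are hypothetical and the mask value is an assumption based on the architectural definition of %cr8 (only bits 3:0 are defined, and they correspond to TPR bits 7:4); it is not a copy of the Xen definition.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: architecturally only bits 3:0 of %cr8 are defined. */
#define CR8_VALID_MASK UINT64_C(0xf)

/* A write with any reserved bit set must be refused (#GP on hardware). */
static bool cr8_write_ok(uint64_t val)
{
    return !(val & ~CR8_VALID_MASK);
}

/* Architectural relationship: %cr8[3:0] is the task priority class, TPR[7:4]. */
static uint8_t cr8_to_tpr(uint64_t cr8)
{
    return (uint8_t)((cr8 & CR8_VALID_MASK) << 4);
}
```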
Andrew Cooper [Thu, 27 Mar 2025 14:05:35 +0000 (15:05 +0100)]
x86/emul: Rearrange the logic in hvmemul_{read,write}_cr()
In hvmemul_read_cr(), make the TRACE()/X86EMUL_OKAY path common in preparation
for adding a %cr8 case. Use a local 'val' variable instead of always
operating on a dereferenced pointer.
In both, calculate curr once.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b7264a15c28d30bb994ec9e58eba38932be231ec
master date: 2025-03-26 11:54:59 +0000
Jan Beulich [Thu, 27 Mar 2025 14:05:11 +0000 (15:05 +0100)]
x86/PVH: expose OEMx ACPI tables to Dom0
What they contain we don't know, but we can't sensibly hide them. On my
Skylake system OEM1 (with a description of "INTEL CPU EIST") is what
contains all the _PCT, _PPC, and _PSS methods, i.e. about everything
needed for cpufreq. (_PSD interestingly are in an SSDT there.)
Further OEM2 there has a description of "INTEL CPU HWP", while OEM4
has "INTEL CPU CST". Pretty clearly all three need exposing for
cpufreq and cpuidle to work.
Jan Beulich [Thu, 27 Mar 2025 14:04:48 +0000 (15:04 +0100)]
xenpm: sanitize allocations in show_cpufreq_para_by_cpuid()
malloc(), when passed zero size, may return NULL (the behavior is
implementation defined). Mirror the ->gov_num check to the other two
allocations as well. Don't chance then actually using a NULL in
print_cpufreq_para().
Fixes: 75e06d089d48 ("xenpm: add cpu frequency control interface, through which user can") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 6c0dc87bb0e08fb31a68bf4c4149a18b92628f14
master date: 2025-03-26 12:30:57 +0100
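The zero-size allocation concern in the xenpm change above can be sketched as follows. alloc_table() is a hypothetical helper, not the xenpm code; it shows one way to keep "nothing to allocate" distinct from "allocation failed".

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate count entries. A zero count is not passed
 * to the allocator, since malloc(0) may legitimately return NULL.
 * Returns 0 on success (with *out == NULL when count is 0), -1 on failure.
 */
static int alloc_table(uint32_t **out, size_t count)
{
    *out = NULL;

    if ( count == 0 )
        return 0;

    *out = calloc(count, sizeof(**out));

    return *out ? 0 : -1;
}
```

Consumers then need to check the count, not the pointer, before walking the table.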
Andrew Cooper [Thu, 27 Mar 2025 14:04:03 +0000 (15:04 +0100)]
x86/boot: Simplify the expression for extra allocation space
The expression for one parameter of find_memory() is already complicated and
about to become more so. Break it out into a new variable, and express it in
an easier-to-follow way.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ce703c84df1cb279605b0a85a45c6419464a16e8
master date: 2025-03-21 11:52:39 +0000
Fixes: 84c4461b7d3a ("Force out-of-line instances of inline functions into .init.text in init-only code") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ab7ce0c8ed90f729a186babd87e3cd1fbed8ab98
master date: 2025-03-21 11:52:39 +0000
Roger Pau Monné [Thu, 27 Mar 2025 14:03:20 +0000 (15:03 +0100)]
x86/vga: fix mapping of the VGA text buffer
The call to ioremap_wc() in video_init() will always fail, because
video_init() is called ahead of vm_init_type(), and so the underlying
__vmap() call will fail to allocate the linear address space.
Fix by reverting to the previous behavior and use __va() for the VGA text
buffer, as it's below the 1MB boundary, and thus always mapped in the
directmap.
Adjust the calculations in COMPAT_ARG_XLAT_VIRT_BASE to subtract from the
per-domain area to obtain the mirrored linear address in the 4th slot,
instead of overflowing the per-domain linear address.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc302866f42f552337ae7d8d78877aec36e6e2ff
master date: 2025-03-20 12:28:30 +0100
Roger Pau Monné [Thu, 27 Mar 2025 14:02:28 +0000 (15:02 +0100)]
x86/shadow: fix UB pointer arithmetic in sh_mfn_is_a_page_table()
UBSAN complains with:
UBSAN: Undefined behaviour in arch/x86/mm/shadow/private.h:515:30
pointer operation overflowed ffff82e000000000 to ffff82dfffffffe0
[...]
Xen call trace:
[<ffff82d040303782>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304bc3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d040471b2d>] F arch/x86/mm/shadow/guest_2.c#sh_page_fault__guest_2+0x1e350
[<ffff82d0403b206b>] F svm_vmexit_handler+0xdf3/0x2450
[<ffff82d0402049c0>] F svm_stgi_label+0x5/0x15
Fix by moving the call to mfn_to_page() after the check of whether the
passed gmfn is valid. This avoids the call to mfn_to_page() with an
INVALID_MFN parameter.
While there make the page local variable const, it's not modified by the
function.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 45ee73f1b24246f13cd9583cb2ee25fb9c782db8
master date: 2025-03-20 12:28:30 +0100
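The reordering described for sh_mfn_is_a_page_table() above can be sketched with stand-in types. Everything here (the toy frame_table, its size, the flag field, the function name) is hypothetical; the point is only that the pointer is formed after the frame number has been validated.

```c
#include <stdbool.h>

#define INVALID_MFN  (~0UL)
#define NR_FRAMES    16UL

/* Toy stand-in for the frame table; not Xen's struct page_info. */
struct page_info { bool is_page_table; };
static struct page_info frame_table[NR_FRAMES];

static bool mfn_is_page_table(unsigned long mfn)
{
    const struct page_info *pg;

    /* Validate first: no pointer arithmetic on an invalid frame number. */
    if ( mfn == INVALID_MFN || mfn >= NR_FRAMES )
        return false;

    pg = &frame_table[mfn];

    return pg->is_page_table;
}
```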
Roger Pau Monné [Thu, 27 Mar 2025 14:02:20 +0000 (15:02 +0100)]
x86/mkelf32: account for offset when detecting note segment placement
mkelf32 attempts to check that the NOTE segment defined in the program headers
falls inside the LOAD segment, as the build-id must be loaded for Xen to
check it at runtime.
However the current code doesn't take into account the LOAD program header
segment offset when calculating overlap with the NOTE segment. This
results in incorrect detection, and the following build error:
arch/x86/boot/mkelf32 --notes xen-syms ./.xen.elf32 0x200000 \
`nm xen-syms | sed -ne 's/^\([^ ]*\) . __2M_rwdata_end$/0x\1/p'`
Expected .note section within .text section!
Offset 4244776 not within 2910364!
Account for the program header offset of the LOAD segment when checking
whether the NOTE segment is contained within. Also fix the logic to
ensure the NOTE segment is fully contained within the LOAD segment.
Fixes: a353cab905af ('build_id: Provide ld-embedded build-ids') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e5fed7cb67c9f84653cdbd3924b8a119ef653be
master date: 2025-03-20 12:28:30 +0100
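The containment check described for mkelf32 above can be sketched as below. note_within_load() and its parameters are hypothetical and do not mirror the mkelf32 source; the sketch assumes all four values are file offsets and sizes taken from the program headers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if [note_off, note_off + note_size) lies fully
 * inside [load_off, load_off + load_size). The comparison is arranged so
 * that no addition can overflow.
 */
static bool note_within_load(uint64_t load_off, uint64_t load_size,
                             uint64_t note_off, uint64_t note_size)
{
    if ( note_off < load_off || note_size > load_size )
        return false;

    /* Compare offsets relative to the start of the LOAD segment. */
    return note_off - load_off <= load_size - note_size;
}
```

Comparing note_off against load_size alone, without subtracting load_off, is the kind of mismatch that would reject a correctly placed note.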
Andrew Cooper [Thu, 20 Mar 2025 12:21:12 +0000 (13:21 +0100)]
x86/mm: Fix IS_ALIGNED() check in IS_LnE_ALIGNED()
The current CI failures turn out to be a latent bug triggered by a narrow set
of properties of the initrd and the host memory map, which CI encountered by
chance.
One step during boot involves constructing directmap mappings for modules.
With some probing at the point of creation, it is observed that there's a 4k
mapping missing towards the end of the initrd.
The conditions for this bug appear to be map_pages_to_xen() call with a start
address of exactly 4k beyond a 2M boundary, some number of full 2M pages, then
a tail needing 4k pages.
Anyway, the condition for spotting superpage boundaries in map_pages_to_xen()
is wrong. The IS_ALIGNED() macro expects a power of two for the alignment
argument, and subtracts 1 itself.
Fixing this causes the failing case to now boot.
Fixes: 97fb6fcf26e8 ("x86/mm: introduce helpers to detect super page alignment") Debugged-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b07c7d63f9b587e4df5d71f6da9eaa433512c974
master date: 2025-03-19 14:53:28 +0000
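The IS_ALIGNED() misuse described above can be reproduced in a few lines. This is a sketch under two assumptions: that IS_ALIGNED() has the conventional definition shown here, and that the check runs on 4k frame numbers (512 frames per 2M superpage). The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional definition: align must be a power of two. */
#define IS_ALIGNED(val, align) (((val) & ((align) - 1)) == 0)

#define FRAMES_PER_2M UINT64_C(512)

/* Wrong: passes a mask, so IS_ALIGNED() subtracts 1 a second time. */
static bool is_2m_aligned_buggy(uint64_t pfn)
{
    return IS_ALIGNED(pfn, FRAMES_PER_2M - 1);
}

/* Right: passes the power-of-two alignment itself. */
static bool is_2m_aligned(uint64_t pfn)
{
    return IS_ALIGNED(pfn, FRAMES_PER_2M);
}
```

With the double subtraction the effective mask is 0x1fe, so bit 0 is ignored and frame 0x201 (4k beyond a 2M boundary) is wrongly reported as aligned.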
Roger Pau Monné [Thu, 20 Mar 2025 12:20:51 +0000 (13:20 +0100)]
x86/ioremap: prevent additions against the NULL pointer
This was reported by clang UBSAN as:
UBSAN: Undefined behaviour in arch/x86/mm.c:6297:40
applying zero offset to null pointer
[...]
Xen call trace:
[<ffff82d040303662>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304aa3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d0406ebbc0>] F ioremap_wc+0xc8/0xe0
[<ffff82d0406c3728>] F video_init+0xd0/0x180
[<ffff82d0406ab6f5>] F console_init_preirq+0x3d/0x220
[<ffff82d0406f1876>] F __start_xen+0x68e/0x5530
[<ffff82d04020482e>] F __high_start+0x8e/0x90
Fix bt_ioremap() and ioremap{,_wc}() to not add the offset if the returned
pointer from __vmap() is NULL.
Fixes: d0d4635d034f ('implement vmap()') Fixes: f390941a92f1 ('x86/DMI: fix table mapping when one lives above 1Mb') Fixes: 81d195c6c0e2 ('x86: introduce ioremap_wc()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9a6f2c52f75781acda39fab5cc96d1bcc54bf534
master date: 2025-03-17 13:33:29 +0100
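The guard described for bt_ioremap() and ioremap{,_wc}() above can be sketched as follows. add_page_offset() is a hypothetical helper, and page_size is assumed to be a power of two; the real functions differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: va is the page-aligned mapping returned by the
 * allocator (possibly NULL), pa the original physical address. The
 * sub-page offset is only added when the mapping succeeded.
 */
static void *add_page_offset(void *va, uintptr_t pa, uintptr_t page_size)
{
    if ( va == NULL )
        return NULL;

    return (char *)va + (pa & (page_size - 1));
}
```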
Juergen Gross [Thu, 20 Mar 2025 12:20:14 +0000 (13:20 +0100)]
xen/sched: fix arinc653 to not use variables across cpupools
a653sched_do_schedule() is using two function local static variables,
which is resulting in bad behavior when using more than one cpupool
with the arinc653 scheduler.
Fix that by moving those variables to the scheduler private data.
Fixes: 22787f2e107c ("ARINC 653 scheduler") Reported-by: Choi Anderson <Anderson.Choi@boeing.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Nathan Studer <nathan.studer@dornerworks.com>
master commit: d0561ac8ab0e780b1e8ab41d0d15e9f9b076dee3
master date: 2025-03-14 10:17:11 +0100
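The move from function-local statics to per-scheduler data described above can be sketched like this. struct sched_priv, its field, and advance_slot() are hypothetical names, not the arinc653 code.

```c
/* Hypothetical per-scheduler (per-cpupool) private data. */
struct sched_priv {
    unsigned int sched_index;
};

/*
 * State lives in *priv rather than in a function-local static, so two
 * scheduler instances advance independently of each other.
 */
static unsigned int advance_slot(struct sched_priv *priv,
                                 unsigned int nr_slots)
{
    priv->sched_index = (priv->sched_index + 1) % nr_slots;

    return priv->sched_index;
}
```

With a static variable inside the function, a second instance would continue from wherever the first one left off.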
Jan Beulich [Thu, 20 Mar 2025 12:20:06 +0000 (13:20 +0100)]
libxl: avoid infinite loop in libxl__remove_directory()
Infinitely retrying the rmdir() invocation makes little sense. While the
original observation was the log filling the disk (due to repeated
"Directory not empty" errors, in turn occurring for unclear reasons),
the loop wants breaking even if there was no error message being logged
(much like is done in the similar loops in libxl__remove_file() and
libxl__remove_file_or_directory()).
Fixes: c4dcbee67e6d ("libxl: provide libxl__remove_file et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 68baeb5c4852e652b9599e049f40477edac4060e
master date: 2025-03-13 10:23:10 +0100
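A bounded retry loop of the kind described above might look like the sketch below. It is not the libxl implementation: remove_directory_sketch() is a hypothetical name, logging is omitted, and treating EINTR as the only retryable error is an assumption.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper: retry rmdir() only when interrupted. Any other
 * failure (for example ENOTEMPTY) ends the loop instead of spinning.
 * Returns 0 on success, -1 on failure with errno left set by rmdir().
 */
static int remove_directory_sketch(const char *path)
{
    for ( ;; )
    {
        if ( rmdir(path) == 0 )
            return 0;

        if ( errno == EINTR )
            continue;

        return -1;
    }
}
```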
Roger Pau Monné [Thu, 20 Mar 2025 12:19:17 +0000 (13:19 +0100)]
x86/vmx: fix posted interrupts usage of msi_desc->msg field
The current usage of msi_desc->msg in vmx_pi_update_irte() will make the
field contain a translated MSI message, instead of the expected
untranslated one. This breaks dump_msi(), which uses the data in
msi_desc->msg to print the interrupt details.
Fix this by introducing a dummy local msi_msg, and use it with
iommu_update_ire_from_msi(). vmx_pi_update_irte() relies on the MSI
message not changing, so there's no need to propagate the resulting msi_msg
to the hardware, and the contents can be ignored.
Additionally add a comment to clarify that msi_desc->msg must always
contain the untranslated MSI message.
Fixes: a5e25908d18d ('VT-d: introduce new fields in msi_desc to track binding with guest interrupt') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30f0e55a79206702b4e82e86dad6b35033157858
master date: 2025-03-12 13:32:30 +0100
Roger Pau Monné [Thu, 20 Mar 2025 12:18:46 +0000 (13:18 +0100)]
x86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD
The MMIO_CONF_BASE reports the base of the MCFG range on AMD systems.
Linux pre-6.14 is unconditionally attempting to read the MSR without a
safe MSR accessor, and since Xen doesn't allow access to it Linux reports
the following error:
Such access is conditional to the presence of a device with PnP ID
"PNP0c01", which triggers the execution of the quirk_amd_mmconfig_area()
function. Note that prior to commit 3fac3734c43a MSR accesses when running
as a PV guest would always use the safe variant, and thus silently handle
the #GP.
Fix by allowing access to the MSR on AMD systems for the hardware domain.
Write attempts to the MSR will still result in #GP for all domain types.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b4071d28c5bd9ca4fed76031cbf0e782b74209b9
master date: 2025-03-12 13:32:30 +0100
Andrew Cooper [Thu, 20 Mar 2025 12:18:23 +0000 (13:18 +0100)]
x86/vlapic: Fix handling of writes to APIC_ESR
Xen currently presents APIC_ESR to guests as a simple read/write register.
This is incorrect. The SDM states:
The ESR is a write/read register. Before attempt to read from the ESR,
software should first write to it. (The value written does not affect the
values read subsequently; only zero may be written in x2APIC mode.) This
write clears any previously logged errors and updates the ESR with any
errors detected since the last write to the ESR.
Introduce a new pending_esr field in hvm_hw_lapic.
Update vlapic_error() to accumulate errors here, and extend vlapic_reg_write()
to discard the written value and transfer pending_esr into APIC_ESR. Reads
are still as before.
Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b28b590d4a23894672f1dd7fb98cdf9926ecb282
master date: 2025-03-07 14:34:08 +0000
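The write-then-read behaviour described above can be modelled in a few lines. This is a heavily reduced sketch, not the vlapic code: only the pending_esr name comes from the commit message, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical model of the two pieces of state involved. */
struct vlapic_model {
    uint32_t esr;          /* value returned by reads of APIC_ESR */
    uint32_t pending_esr;  /* errors logged since the last write */
};

/* An error is detected: accumulate it, leave the readable value alone. */
void model_error(struct vlapic_model *v, uint32_t errmask)
{
    v->pending_esr |= errmask;
}

/* Guest write: the written value is discarded, pending errors are latched. */
void model_esr_write(struct vlapic_model *v, uint32_t val)
{
    (void)val;
    v->esr = v->pending_esr;
    v->pending_esr = 0;
}

/* Guest read: returns whatever the last write latched. */
uint32_t model_esr_read(const struct vlapic_model *v)
{
    return v->esr;
}
```

In this model an error becomes visible to reads only after the next write, and a second write with no new errors clears it again.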
Juergen Gross [Thu, 20 Mar 2025 12:17:41 +0000 (13:17 +0100)]
tools/xl: fix channel configuration setting
Channels work differently than other device types: their devid should
be -1 initially in order to distinguish them from the primary console
which has the devid of 0.
So when parsing the channel configuration, use
ARRAY_EXTEND_INIT_NODEVID() in order to avoid overwriting the devid
set by libxl_device_channel_init().
Fixes: 3a6679634766 ("libxl: set channel devid when not provided by application") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e1ccced4afe465d6541c5825a0f8d1b8f5fa4253
master date: 2025-03-05 16:37:37 +0100
Roger Pau Monné [Thu, 20 Mar 2025 12:17:17 +0000 (13:17 +0100)]
x86/dom0: be less restrictive with the Interrupt Address Range
Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff]. This range has
two different purposes. For accesses from the CPU it contains the default
position of the local APIC page at 0xfee00000. For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) points to an address in that range to trigger an interrupt.
There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
the Linux kernel maps 0xfeec2000:
Remove the restrictions to create mappings in the interrupt address range
for dom0. Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present. Additionally make sure the
emulated local APIC page is also not mapped, in case dom0 is using it.
Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages. Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.
There's also the following restriction noted in Intel VT-d:
> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).
Similarly for AMD-Vi:
> Accesses to the interrupt address range (Table 3) are defined to go through
> the interrupt remapping portion of the IOMMU and not through address
> translation processing. Therefore, when a transaction is being processed as
> an interrupt remapping operation, the transaction attribute of
> pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should
> not be configured such that an address translation results in a special
> address such as the interrupt address range.
However those restrictions don't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping, and hence there's no output address after translation that
belongs to the interrupt address range.
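For reference, the range check implied by the bounds quoted in this message can be written as below. The bounds come from the text; the macro and function names are made up for the sketch and do not correspond to Xen identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds as given in the commit message; names are hypothetical. */
#define INTR_RANGE_START 0xfee00000ULL
#define INTR_RANGE_END   0xfeefffffULL   /* inclusive */

/* True if addr falls inside the interrupt address range. */
bool in_interrupt_address_range(uint64_t addr)
{
    return addr >= INTR_RANGE_START && addr <= INTR_RANGE_END;
}
```

The reported mailbox address 0xfeec2000 satisfies this check, which is why the previous blanket restriction affected it.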
Roger Pau Monné [Thu, 20 Mar 2025 12:16:56 +0000 (13:16 +0100)]
x86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables
The current code in arch_iommu_hwdom_init() kind of open-codes the same
MMIO permission ranges that are added to the hardware domain ->iomem_caps.
Avoid this duplication and use ->iomem_caps in arch_iommu_hwdom_init() to
filter which memory regions should be added to the dom0 IOMMU page-tables.
Note the IO-APIC and MCFG page(s) must be set as not accessible for a PVH
dom0, otherwise the internal Xen emulation for those ranges won't work.
This requires adjustments in dom0_setup_permissions().
The call to pvh_setup_mmcfg() in dom0_construct_pvh() must now strictly be
done ahead of setting up dom0 permissions, so take the opportunity to also
put it inside the existing is_hardware_domain() region.
Also the special casing of E820_UNUSABLE regions no longer needs to be done
in arch_iommu_hwdom_init(), as those regions are already blocked in
->iomem_caps and thus would be removed from the rangeset as part of
->iomem_caps processing in arch_iommu_hwdom_init(). The E820_UNUSABLE
regions below 1Mb are not removed from ->iomem_caps, that's a slight
difference for the IOMMU created page-tables, but the aim is to allow
access to the same memory either from the CPU or the IOMMU page-tables.
Since ->iomem_caps already takes into account the domain max paddr, there's
no need to remove any regions past the last address addressable by the
domain, as applying ->iomem_caps would have already taken care of that.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62f3fc5296c452285e81adb50976bde2d68d3181
master date: 2025-03-05 10:26:46 +0100
Roger Pau Monné [Thu, 20 Mar 2025 12:16:37 +0000 (13:16 +0100)]
x86/dom0: correctly set the maximum ->iomem_caps bound for PVH
The logic in dom0_setup_permissions() sets the maximum bound in
->iomem_caps unconditionally using paddr_bits, which is not correct for HVM
based domains. Instead use domain_max_paddr_bits() to get the correct
maximum paddr bits for each possible domain type.
Switch to using PFN_DOWN() instead of PAGE_SHIFT, as that's shorter.
Fixes: 53de839fb409 ('x86: constrain MFN range Dom0 may access') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a00e08799cc7657d2a1aca158f4ad43d4c9103e7
master date: 2025-03-05 10:26:46 +0100
Roger Pau Monné [Thu, 20 Mar 2025 12:16:14 +0000 (13:16 +0100)]
x86/dom0: attempt to fixup p2m page-faults for PVH dom0
When building a PVH dom0 Xen attempts to map all (relevant) MMIO regions
into the p2m for dom0 access. However the information Xen has about the
host memory map is limited. Xen doesn't have access to any resources
described in ACPI dynamic tables, and hence the p2m mappings provided might
not be complete.
PV doesn't suffer from this issue because a PV dom0 is capable of mapping
into its page-tables any address not explicitly banned in d->iomem_caps.
Introduce a new command line option that allows Xen to attempt to fixup
the p2m page-faults, by creating p2m identity maps in response to p2m
page-faults.
This is intended as a workaround for small ACPI regions Xen doesn't know about.
Note that missing large MMIO regions mapped in this way will lead to
slowness due to the VM exit processing, plus the mappings will always use
small pages.
The ultimate aim is to attempt to bring better parity with a classic PV
dom0.
Note that such fixups rely on the CPU doing the access to the unpopulated
address. If the access is attempted from a device instead there's no
possible way to fixup, as IOMMU page-faults are asynchronous.
Roger Pau Monné [Thu, 20 Mar 2025 12:15:48 +0000 (13:15 +0100)]
x86/emul: dump unhandled memory accesses for PVH dom0
A PV dom0 can map any host memory as long as it's allowed by the IO
capability range in d->iomem_caps. On the other hand, a PVH dom0 has no
way to populate MMIO regions into its p2m, so it's limited to what Xen
initially populates on the p2m based on the host memory map and the enabled
device BARs.
Introduce a new debug build only printk that reports attempts by dom0 to
access addresses not populated on the p2m, and not handled by any emulator.
This is for information purposes only, but might allow getting an idea of
what MMIO ranges might be missing on the p2m.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 43d8a80a0cccfe3715bb3178b5c15fb983979651
master date: 2025-03-05 10:26:46 +0100
Andrew Cooper [Thu, 20 Mar 2025 12:14:51 +0000 (13:14 +0100)]
x86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()
There is a corner case in the VMRUN instruction where its INTR_SHADOW state
leaks into guest state if a VMExit occurs before the VMRUN is complete. An
example of this could be taking #NPF due to event injection.
Xen can safely execute STI anywhere between CLGI and VMRUN, as CLGI blocks
external interrupts too. However, an exception (while fatal) will appear to
be in an irqs-on region (as GIF isn't considered), so position the STI after
the speculation actions but prior to the GPR pops.
xen/memory: Make resource_max_frames() to return 0 on unknown type
This is actually what the caller acquire_resource() expects on any kind
of error (the comment on top of resource_max_frames() also suggests that).
Otherwise, the caller will treat -errno as a valid value and propagate incorrect
nr_frames to the VM. As a possible consequence, a VM trying to query a resource
size of an unknown type will get the success result from the hypercall and obtain
nr_frames 4294967201.
Also, add an ASSERT_UNREACHABLE() in the default case of _acquire_resource(),
normally we won't get to this point, as an unknown type will always be rejected
earlier in resource_max_frames().
Also, update test-resource app to verify that Xen can deal with invalid
(unknown) resource type properly.
Fixes: 9244528955de ("xen/memory: Fix acquire_resource size semantics") Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9b8708290002f0a4d0b363e0c66ce945f6b520bd
master date: 2025-02-18 14:47:34 +0000
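The odd-looking nr_frames value above is what a small negative number becomes when it is read as an unsigned 32-bit quantity: 4294967201 equals 2^32 - 95. The sketch below shows the conversion. Reading 95 as EOPNOTSUPP (its value on Linux) is an assumption on my part; the message does not name the error code.

```c
#include <stdint.h>

/*
 * Reinterpret a negated error number as an unsigned 32-bit frame count,
 * which is what happens when -errno is mistaken for a valid result.
 */
uint32_t errno_as_frame_count(int err)
{
    return (uint32_t)-err;
}
```

Calling errno_as_frame_count(95) yields 4294967201, matching the value quoted in the message.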
Andrew Cooper [Thu, 20 Mar 2025 12:13:44 +0000 (13:13 +0100)]
xen/console: Fix truncation of panic() messages
The panic() function uses a static buffer to format its arguments into, simply
to emit the result via printk("%s", buf). This buffer is not large enough for
some existing users in Xen. e.g.:
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Invalid device tree blob at physical address 0x46a00000.
(XEN) The DTB must be 8-byte aligned and must not exceed 2 MB in size.
(XEN)
(XEN) Plea****************************************
The remainder of this particular message is 'e check your bootloader.', but
has been inherited by RISC-V from ARM.
It is also pointless double buffering. Implement vprintk() beside printk(),
and use it directly rather than rendering into a local buffer, removing it as
one source of message limitation.
This marginally simplifies panic(), and drops a global used-once buffer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f8b1dd9407e4a3d9dc058b7fbbc591168649ad
master date: 2025-02-18 14:15:58 +0000
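The truncation described above is a generic hazard of formatting into a fixed intermediate buffer. The following sketch uses standard C (vsnprintf) rather than Xen's printk machinery, and a deliberately tiny buffer; both function names are hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Double buffering: anything beyond the intermediate buffer is lost. */
int emit_via_buffer(char *out, size_t out_size, const char *fmt, ...)
{
    char buf[16];               /* deliberately tiny for the demonstration */
    va_list args;

    va_start(args, fmt);
    vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);

    return snprintf(out, out_size, "%s", buf);
}

/* Direct: hand the va_list straight to the v*() formatting function. */
int emit_direct(char *out, size_t out_size, const char *fmt, ...)
{
    va_list args;
    int len;

    va_start(args, fmt);
    len = vsnprintf(out, out_size, fmt, args);
    va_end(args);

    return len;
}
```

The first variant can never deliver more than 15 characters regardless of the destination size; the second is limited only by the destination.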
Jan Beulich [Thu, 27 Feb 2025 12:58:32 +0000 (12:58 +0000)]
IOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe
The function's use from set_msi_source_id() is guaranteed to be in an
IRQs-off region. While the invocation of that function could be moved
ahead in msi_msg_to_remap_entry() (doesn't need to be in the IOMMU-
intremap-locked region), the call tree from map_domain_pirq() holds an
IRQ descriptor lock. Hence all use sites of the lock need to become
IRQ-safe ones.
In find_upstream_bridge() do a tiny bit of tidying in adjacent code:
Change a variable's type to unsigned and merge a redundant assignment
into another variable's initializer.
This is XSA-467 / CVE-2025-1713.
Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 39bc6af3ba483282ed6bbf94b08aec38c93d39e6)
Roger Pau Monné [Mon, 17 Feb 2025 12:24:23 +0000 (13:24 +0100)]
x86/iommu: disable interrupts at shutdown
Add a new hook to inhibit interrupt generation by the IOMMU(s). Note the
hook is currently only implemented for x86 IOMMUs. The purpose is to
disable interrupt generation at shutdown so any kexec chained image finds
the IOMMU(s) in a quiesced state.
It would also prevent "Receive accept error" being raised as a result of
non-disabled interrupts targeting offline CPUs.
Note that the iommu_quiesce() call in nmi_shootdown_cpus() is still
required even when there's a preceding iommu_crash_shutdown() call; the
latter can become a no-op depending on the setting of the "crash-disable"
command line option.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 819c3cb186a86ef3e04fb5af4d9f9f6de032c3ee
master date: 2025-02-12 15:56:07 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:24:03 +0000 (13:24 +0100)]
x86/pci: disable MSI(-X) on all devices at shutdown
Attempt to disable MSI(-X) capabilities on all PCI devices known to Xen at
shutdown. Doing so should help a kexec chained kernel boot more reliably,
as device MSI(-X) interrupt generation should be
quiesced.
Only attempt to disable MSI(-X) on all devices in the crash context if the
PCI lock is not taken, otherwise the PCI device list could be in an
inconsistent state. This requires introducing a new pcidevs_trylock()
helper to check whether the lock is currently taken.
Disabling MSI(-X) should prevent "Receive accept error" being raised as a
result of non-disabled interrupts targeting offline CPUs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7ab6951981231b4c576a3588248c303001272588
master date: 2025-02-12 15:56:07 +0100
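The "only if the lock is not taken" rule for the crash path can be sketched with a C11 atomic flag standing in for the real lock. Everything here is hypothetical scaffolding: it is not Xen's pcidevs lock, and the counter merely marks that the device walk took place.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock protecting a device list. */
struct simple_lock {
    atomic_flag held;
};

/* Returns true if the lock was acquired, false if it was already held. */
bool simple_trylock(struct simple_lock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void simple_unlock(struct simple_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Crash-path shape: only walk the device list if the lock could be taken. */
bool crash_disable_devices(struct simple_lock *l, unsigned int *walks)
{
    if (!simple_trylock(l))
        return false;           /* list may be mid-update: skip the walk */
    (*walks)++;                 /* placeholder for disabling each device */
    simple_unlock(l);
    return true;
}
```

Skipping the walk when the lock is held trades completeness for safety, since a crashing system cannot wait for the holder to finish.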
Roger Pau Monné [Mon, 17 Feb 2025 12:23:50 +0000 (13:23 +0100)]
x86/smp: perform disabling on interrupts ahead of AP shutdown
Move the disabling of interrupt sources so it's done ahead of the offlining
of APs. This is to prevent AMD systems triggering "Receive accept error"
when interrupts target CPUs that are no longer online.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db6daa9bf411260d2c1f5301e4fc786ae4a5cef8
master date: 2025-02-12 15:56:07 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:23:27 +0000 (13:23 +0100)]
x86/shutdown: offline APs with interrupts disabled on all CPUs
The current shutdown logic in smp_send_stop() will disable the APs while
having interrupts enabled on the BSP or possibly other APs. On AMD systems
this can lead to local APIC errors:
APIC error on CPU0: 00(08), Receive accept error
Such an error message can be printed in a loop, thus blocking the system from
rebooting. I assume this loop is created by the error being triggered by
the console interrupt, which is further stirred by the ESR handler
printing to the console.
Intel SDM states:
"Receive Accept Error.
Set when the local APIC detects that the message it received was not
accepted by any APIC on the APIC bus, including itself. Used only on P6
family and Pentium processors."
So the error shouldn't trigger on any Intel CPU supported by Xen.
However AMD doesn't make such claims, and indeed the error is broadcast to
all local APICs when an interrupt targets a CPU that's already offline.
To prevent the error from stalling the shutdown process perform the
disabling of APs and the BSP local APIC with interrupts disabled on all
CPUs in the system, so that by the time interrupts are unmasked on the BSP
the local APIC is already disabled. This can still lead to a spurious:
APIC error on CPU0: 00(00)
As a result of an LVT Error getting injected while interrupts are masked on
the CPU, and the vector only handled after the local APIC is already
disabled. ESR reports 0 because as part of disable_local_APIC() the ESR
register is cleared.
Note the NMI crash path doesn't have such an issue, because disabling of APs
and the caller local APIC is already done in the same contiguous region
with interrupts disabled. There's a possible window on the NMI crash path
(nmi_shootdown_cpus()) where some APs might be disabled (and thus
interrupts targeting them raising "Receive accept error") before other APs
have interrupts disabled. However the shutdown NMI will be handled,
regardless of whether the AP is processing a local APIC error, and hence
such interrupts will not cause the shutdown process to get stuck.
Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
intended goal of moving all interrupts to the BSP anyway. The logic in
fixup_irqs() will move interrupts whose affinity doesn't overlap with the
passed mask, but the movement of interrupts is done to any CPU set in
cpu_online_map. As in the shutdown path fixup_irqs() is called before APs
are cleared from cpu_online_map this leads to interrupts being shuffled
around, but not assigned to the BSP exclusively.
The Fixes tag is more of a guess than a certainty; it's possible the
previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
delivered before APs went offline. However fixup_irqs() was still
incorrectly used, as it didn't (and still doesn't) move all interrupts to
target the provided cpu mask.
Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new destination in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1191ce954f64244a3c5f553116184928bcc677e8
master date: 2025-02-12 15:56:07 +0100
Jan Beulich [Mon, 17 Feb 2025 12:23:00 +0000 (13:23 +0100)]
radix-tree: introduce RADIX_TREE{,_INIT}()
... now that static initialization is possible. Use RADIX_TREE() for
pci_segments and ivrs_maps.
This then fixes an ordering issue on x86: With the call to
radix_tree_init(), acpi_mmcfg_init()'s invocation of pci_segments_init()
will zap the possible earlier introduction of segment 0 by
amd_iommu_detect_one_acpi()'s call to pci_ro_device(), and thus the
write-protection of the PCI devices representing AMD IOMMUs.
Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table parsing") Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 26fe09e34566d701ecaea76b4563bb9934e85861
master date: 2025-02-07 10:00:04 +0100
Jan Beulich [Mon, 17 Feb 2025 12:22:24 +0000 (13:22 +0100)]
radix-tree: purge node allocation override hooks
These were needed by TMEM only, which is long gone. The Linux original
doesn't have such either. This effectively reverts one of the "Other
changes" from 8dc6738dbb3c ("Update radix-tree.[ch] from upstream Linux
to gain RCU awareness").
Positive side effect: Two cf_check go away.
While there also convert xmalloc()+memset() to xzalloc().
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1275093a96fed45057db241b3aa6e191d9dcf596
master date: 2025-02-07 09:59:11 +0100
Andrew Cooper [Mon, 17 Feb 2025 12:21:50 +0000 (13:21 +0100)]
x86/intel: Fix PERF_GLOBAL fixup when virtualised
Logic using performance counters needs to look at
MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources.
When virtualised under ESX, Xen dies with a #GP fault trying to read
MSR_CORE_PERF_GLOBAL_CTRL.
Factor this logic out into a separate function (it's already too squashed to
the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE.
This also avoids setting X86_FEATURE_ARCH_PERFMON if MSR_MISC_ENABLE says that
PERF is unavailable, although oprofile (the only consumer of this flag)
cross-checks too.
Fixes: 6bdb965178bb ("x86/intel: ensure Global Performance Counter Control is setup correctly") Reported-by: Jonathan Katz <jonathan.katz@aptar.com> Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Jonathan Katz <jonathan.katz@aptar.com>
master commit: dd05d265b8abda4cc7206b29cd71b77fb46658bf
master date: 2025-01-28 11:19:45 +0000
Jan Beulich [Mon, 17 Feb 2025 12:21:30 +0000 (13:21 +0100)]
x86/PV: further harden guest memory accesses against speculative abuse
The original implementation has two issues: For one it doesn't preserve
non-canonical-ness of inputs in the range 0x8000000000000000 through
0x80007fffffffffff. Bogus guest pointers in that range would not cause a
(#GP) fault upon access, when they should.
And then there is an AMD-specific aspect, where only the low 48 bits of
an address are used for speculative execution; the architecturally
mandated #GP for non-canonical addresses would be raised at a later
execution stage. Therefore to prevent Xen controlled data from making it
into any of the caches in a guest controllable manner, we need to
additionally ensure that for non-canonical inputs bit 47 would be clear.
See the code comment for how addressing both is being achieved.
Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8306d773b03acec6062c0547ac05e3dd4a6960f6
master date: 2025-01-27 15:23:59 +0100
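To make the address ranges in the message above concrete, here is a plain canonical-address check, assuming 48-bit virtual addresses (4-level paging). It only classifies addresses; it is not the hardening sequence the commit refers to, and the function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With 48-bit virtual addresses, an address is canonical when bits 63:47
 * are either all clear or all set.
 */
bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;          /* the 17 bits 63:47 */

    return top == 0 || top == 0x1ffff;
}
```

Both ends of the range named in the message, 0x8000000000000000 and 0x80007fffffffffff, fail this check, so an access through them is expected to fault.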
Jan Beulich [Mon, 17 Feb 2025 12:21:00 +0000 (13:21 +0100)]
x86emul: further correct 64-bit mode zero count repeated string insn handling
In an entirely different context I came across Linux commit 428e3d08574b
("KVM: x86: Fix zero iterations REP-string"), which points out that
we're still doing things wrong: For one, there's no zero-extension at
all on AMD. And then while RCX is zero-extended from 32 bits uniformly
for all string instructions on newer hardware, RSI/RDI are only for MOVS
and STOS on the systems I have access to. (On an old family 0xf system
I've further found that for REP LODS even RCX is not zero-extended.)
While touching the lines anyway, replace two casts in get_rep_prefix().
Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5310a042c4e3135c471446c8253ad13250539957
master date: 2025-01-27 15:23:19 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:20:38 +0000 (13:20 +0100)]
iommu/amd: atomically update IRTE
Whether using a 32bit Interrupt Remapping Entry or a 128bit one, update
the entry atomically, by using cmpxchg unconditionally as the IOMMU depends on
it. No longer disable the entry by setting RemapEn = 0 ahead of updating
it. As a consequence of not toggling RemapEn ahead of the update the
Interrupt Remapping Table needs to be flushed after the entry update.
This avoids a window where the IRTE has RemapEn = 0, which can lead to
IO_PAGE_FAULT if the underlying interrupt source is not masked.
There's no guidance in AMD-Vi specification about how IRTE update should be
performed as opposed to DTE updating which has specific guidance. However
DTE updating claims that reads will always be at least 128bits in size, and
hence for the purposes here assume that reads and caching of the IRTE
entries in either 32 or 128 bit format will be done atomically from
the IOMMU.
Note that as part of introducing a new raw128 field in the IRTE struct, the
current raw field is renamed to raw64 to explicitly contain the size in the
field name.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b953a99da98d63a7c827248abc450d4e8e015ab6
master date: 2025-01-27 13:05:11 +0100
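The idea of replacing an entry in a single atomic step, rather than disabling it first, can be sketched with C11 atomics on a 32-bit value. This is an illustration only: the type and function names are hypothetical, the real IRTE layout is not modelled, and the 128-bit case (which needs CMPXCHG16B) is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit entry. */
typedef _Atomic uint32_t entry32_t;

/*
 * Replace the whole entry in one atomic step, so an observer sees either
 * the old value or the new one, never an intermediate "disabled" state.
 * Returns false if the entry no longer held the expected old value.
 */
bool update_entry(entry32_t *entry, uint32_t old, uint32_t new_val)
{
    return atomic_compare_exchange_strong(entry, &old, new_val);
}
```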
Teddy Astie [Mon, 17 Feb 2025 12:20:14 +0000 (13:20 +0100)]
x86/iommu: check for CMPXCHG16B when enabling IOMMU
All hardware with VT-d/AMD-Vi has CMPXCHG16B support. Check this at
initialisation time, and otherwise refuse to use the IOMMU.
If the local APICs support x2APIC mode the IOMMU support for interrupt
remapping will be checked earlier using a specific helper. If no support
for CX16 is detected by that earlier hook disable the IOMMU at that point
and prevent further poking for CX16 later in the boot process, which would
also fail.
There's a possible corner case when running virtualized, and the underlying
hypervisor exposing an IOMMU but no CMPXCHG16B support. In which case
ignoring the IOMMU is fine, although the most natural approach would be for the
underlying hypervisor to also expose CMPXCHG16B support if an IOMMU is
available to the VM.
Note this change only introduces the checks, but doesn't remove the now
stale checks for CX16 support sprinkled in the IOMMU code. Further changes
will take care of that.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2636fcdc15c707d5e097770133f0afb69e8d70c9
master date: 2025-01-27 13:05:11 +0100
Jan Beulich [Mon, 17 Feb 2025 12:19:51 +0000 (13:19 +0100)]
x86/HVM: correct read/write split at page boundaries
The MMIO cache is intended to have one entry used per independent memory
access that an insn does. This, in particular, is supposed to be
ignoring any page boundary crossing. Therefore when looking up a cache
entry, the access's starting (linear) address is relevant, not the one
possibly advanced past a page boundary.
In order for the same offset-into-buffer variable to be usable in
hvmemul_phys_mmio_access() for both the caller's buffer and the cache
entry's it is further necessary to have the un-adjusted caller buffer
passed into there.
Fixes: 2d527ba310dc ("x86/hvm: split all linear reads and writes at page boundary") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 672894a11fe06e664a0ebfb600baf5dbb897b9e4
master date: 2025-01-24 10:15:56 +0100
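For readers unfamiliar with the page-boundary split mentioned above, the arithmetic is sketched below, assuming 4 KiB pages. The helper name is hypothetical. It computes how many bytes of an access fit before the next page boundary; the remainder forms the second part, while (per the message) a cache lookup keeps using the access's original starting address.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */

/* Number of bytes of an access that fit before the next page boundary. */
size_t bytes_in_first_page(uint64_t addr, size_t size)
{
    size_t room = SKETCH_PAGE_SIZE - (size_t)(addr & (SKETCH_PAGE_SIZE - 1));

    return size < room ? size : room;
}
```

A 4-byte access at 0xffe is therefore split into 2 bytes on the first page and 2 bytes starting at 0x1000.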
Both caches may need higher capacity, and the upper bound will need to
be determined dynamically based on CPUID policy (for AMX'es TILELOAD /
TILESTORE at least).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 23d60dbb0493b2f9ec1d89be5341eec2ee9dab32
master date: 2025-01-24 10:15:29 +0100
Jan Beulich [Mon, 17 Feb 2025 12:17:45 +0000 (13:17 +0100)]
x86/HVM: reduce recursion in linear_{read,write}()
Let's make explicit what the compiler may or may not do on our behalf:
The 2nd of the recursive invocations each can fall through rather than
re-invoking the function. This will save us from adding yet another
parameter (or more) to the function, just for the recursive invocations.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 18053054b7583810dd356efc8d7018bbc8720f36
master date: 2024-09-09 13:40:47 +0200
Juergen Gross [Tue, 21 Jan 2025 08:21:01 +0000 (09:21 +0100)]
xen/events: fix race with set_global_virq_handler()
There is a possible race scenario between set_global_virq_handler()
and clear_global_virq_handlers() targeting the same domain, which
might result in that domain ending up as a zombie domain.
In case set_global_virq_handler() is being called for a domain which
is just dying, it might happen that clear_global_virq_handlers() is
running first, resulting in set_global_virq_handler() taking a new
reference for that domain and entering it in the global_virq_handlers[]
array afterwards. The reference will never be dropped, thus the domain
will never be freed completely.
This can be fixed by checking the is_dying state of the domain inside
the region guarded by global_virq_handlers_lock. In case the domain is
dying, handle it as if the domain wouldn't exist, which will be the
case in the near future anyway.
Fixes: 87521589aa6a ("xen: allow global VIRQ handlers to be delegated to other domains") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4d8acc9c1cf14233dda21dd3a7791b5a84b0f6c3
master date: 2025-01-09 17:34:01 +0100
Michal Orzel [Tue, 21 Jan 2025 08:20:51 +0000 (09:20 +0100)]
xen/flask: Wire up XEN_DOMCTL_dt_overlay
Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only dt overlay operation is attaching, which can
happen only after the domain is created. Dom0 can attach an overlay to itself
as well.
Fixes: 4c733873b5c2 ("xen/arm: Add XEN_DOMCTL_dt_overlay and device attachment to domains") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 7fa1411676150634b1d6ca030e53b94c26a949dd
master date: 2025-01-08 13:05:50 +0100
Michal Orzel [Tue, 21 Jan 2025 08:20:42 +0000 (09:20 +0100)]
xen/flask: Wire up XEN_DOMCTL_vuart_op
Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only VUART operation is initialization, which
can occur only during domain creation.
Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 29daa72e4019aae92f857cf6e7e0c3ca8fb1483e
master date: 2025-01-08 13:05:38 +0100
All selector fields under ctxt->regs are (normally) poisoned in the HVM
case, and the four ones besides CS and SS are potentially stale for PV.
Avoid using them in the hypervisor incarnation of the emulator, when
trying to cover for a missing ->read_segment() hook.
To make sure there's always a valid ->read_segment() handler for all HVM
cases, add a respective function to shadow code, even if it is not
expected for FPU insns to be used to update page tables.
Fixes: 0711b59b858a ("x86emul: correct FPU code/data pointers and opcode handling") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 645b8d48c78f5b6ffd6230873f9e3ced4e840acd
master date: 2025-01-08 11:02:16 +0100
Jan Beulich [Tue, 21 Jan 2025 08:19:39 +0000 (09:19 +0100)]
x86emul: VCVT{,U}DQ2PD ignores embedded rounding
IOW we shouldn't raise #UD in that case. Be on the safe side though and
only encode fully legitimate forms into the stub to be executed.
Things weren't quite right for VCVT{,U}SI2SD either, in the attempt to
be on the safe side: Clearing EVEX.L'L isn't useful; it's EVEX.b which
primarily needs clearing. Also reflect the somewhat improved doc
situation in the comment there.
Fixes: ed806f373730 ("x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns") Fixes: baf4a376f550 ("x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3709d1324aa140f064b9c68da37547f459f8e8d
master date: 2025-01-08 11:01:17 +0100
Andrew Cooper [Tue, 21 Jan 2025 08:18:42 +0000 (09:18 +0100)]
x86/traps: Rework LER initialisation and support Zen5/Diamond Rapids
AMD have always used the architectural MSRs for LER. As the first processor
to support LER was the K7 (which was 32bit), we can assume its presence
unconditionally in 64bit mode.
Intel are about to run out of space in Family 6 and start using 19. It is
only the Pentium 4 which uses non-architectural LER MSRs.
percpu_traps_init(), which runs on every CPU, contains a lot of code which
should be init-only, and is the only reason why opt_ler can't be in initdata.
Write a brand new init_ler() which expects all future Intel and AMD CPUs to
continue using the architectural MSRs, and does all setup together. Call it
from trap_init(), and remove the setup logic from percpu_traps_init() except for
the single path configuring MSR_IA32_DEBUGCTLMSR.
Leave behind a warning if the user asked for LER and Xen couldn't enable it.
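The MSR selection described above might look roughly like this. It is a sketch under assumptions, not the actual init_ler(): the `cpu_id` structure, the `vendor` enum and `pick_ler_msr()` are hypothetical names, and returning 0 for unknown vendors (where a caller would print the warning) is an illustrative choice. The MSR indices are the commonly documented ones for the architectural and Pentium 4 "last exception from" registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MSR_IA32_LASTINTFROMIP 0x000001ddu  /* architectural LER "from" MSR */
#define MSR_P4_LER_FROM_LIP    0x000001d7u  /* Pentium 4 specific LER MSR */

enum vendor { VENDOR_INTEL, VENDOR_AMD, VENDOR_OTHER };

/* Hypothetical CPU identification record. */
struct cpu_id {
    enum vendor vendor;
    unsigned int family;
};

/*
 * Return the MSR index to use for LER, or 0 if none is known.
 * Everything except the Pentium 4 (Intel family 0xf) is assumed to use
 * the architectural MSR, including future Intel families such as 19.
 */
uint32_t pick_ler_msr(const struct cpu_id *c)
{
    assert(c != NULL);

    switch (c->vendor) {
    case VENDOR_INTEL:
        if (c->family == 0xf)
            return MSR_P4_LER_FROM_LIP;
        return MSR_IA32_LASTINTFROMIP;

    case VENDOR_AMD:
        return MSR_IA32_LASTINTFROMIP;

    default:
        return 0;  /* caller would warn that LER could not be enabled */
    }
}
```

Keeping the choice in one init-time function is what allows the per-CPU path to shrink to the single MSR_IA32_DEBUGCTLMSR write.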
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 555866cb56002849014a1409ecdfa3f436c0c2c4
master date: 2025-01-06 12:24:05 +0000
Andrew Cooper [Tue, 21 Jan 2025 08:18:08 +0000 (09:18 +0100)]
x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX
AMD have updated the SRSO whitepaper[1] with further information. These
features exist on AMD Zen5 CPUs and are necessary for Xen to use.
The two features are in principle unrelated:
* SRSO_U/S_NO is an enumeration saying that SRSO attacks can't cross the
User(CPL3) / Supervisor(CPL<3) boundary. i.e. Xen don't need to use
IBPB-on-entry for PV64. PV32 guests are explicitly unsupported for
speculative issues, and excluded from consideration for simplicity.
* SRSO_MSR_FIX is an enumeration identifying that the BP_SPEC_REDUCE bit is
available in MSR_BP_CFG. When set, SRSO attacks can't cross the host/guest
boundary. i.e. Xen don't need to use IBPB-on-entry for HVM.
Extend ibpb_calculations() to account for these when calculating
opt_ibpb_entry_{pv,hvm} defaults. Add a `bp-spec-reduce=<bool>` option to
control the use of BP_SPEC_REDUCE, with it active by default.
Because MSR_BP_CFG is core-scoped with a race condition updating it, repurpose
amd_check_erratum_1485() into amd_check_bp_cfg() and calculate all updates at
once.
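How the two enumerations feed into the IBPB-on-entry defaults can be sketched as below. This is a simplified model, assuming a CPU that would otherwise want IBPB-on-entry for both guest types. The structure names and `calc_ibpb_entry_defaults()` are hypothetical, and the real ibpb_calculations() takes more inputs into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of the two enumerated features. */
struct srso_features {
    bool srso_us_no;    /* SRSO can't cross the user/supervisor boundary */
    bool srso_msr_fix;  /* BP_SPEC_REDUCE is available in MSR_BP_CFG */
};

/* Hypothetical result: defaults for opt_ibpb_entry_{pv,hvm}. */
struct ibpb_defaults {
    bool entry_pv;
    bool entry_hvm;
};

struct ibpb_defaults
calc_ibpb_entry_defaults(const struct srso_features *f, bool bp_spec_reduce)
{
    /* Assumed baseline: an SRSO-affected CPU wants IBPB on every entry. */
    struct ibpb_defaults d = { true, true };

    assert(f != NULL);

    /* SRSO_U/S_NO: no IBPB-on-entry needed for PV64. */
    if (f->srso_us_no)
        d.entry_pv = false;

    /* SRSO_MSR_FIX only helps once BP_SPEC_REDUCE is actually set. */
    if (f->srso_msr_fix && bp_spec_reduce)
        d.entry_hvm = false;

    return d;
}
```

The second condition reflects why `bp-spec-reduce=<bool>` matters: with the option turned off, the HVM default falls back to IBPB-on-entry even on hardware that enumerates SRSO_MSR_FIX.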
Xen also needs to advertise SRSO_U/S_NO to guests to allow the guest kernel
to skip SRSO mitigations too:
* This is trivial for HVM guests. It is also accurate for PV32 guests,
but we have already excluded them from consideration, and do so again
here to simplify the policy logic.
* As written, SRSO_U/S_NO does not help for the PV64 user->kernel boundary.
However, after discussing with AMD, an implementation detail of having
BP_SPEC_REDUCE active causes the PV64 user->kernel boundary to have the
property described by SRSO_U/S_NO, so we can advertise SRSO_U/S_NO to
guests when the BP_SPEC_REDUCE precondition is met.
Finally, fix a typo in the SRSO_NO's comment.
[1] https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a1746cd4434dd27ca2da8430dfb10edc76264bb3
master date: 2025-01-02 18:44:49 +0000
xen/arch/x86: make objdump output user locale agnostic
The objdump output is fed to grep, so make sure it doesn't change with
different user locales and break the grep parsing.
This problem was identified while updating xen in Debian and the fix is
needed for generating reproducible builds in varying environments.
Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d729221ab74c5d2571e71501dc63838acbf752a
master date: 2024-12-30 21:40:37 +0000
The important part: XZ decompression error: Memory usage limit reached
This looks to be related to the following change in Linux: 8653c909922743bceb4800e5cc26087208c9e0e6 ("xz: use 128 MiB dictionary and force single-threaded mode")
Fix this by increasing the block size to 256MiB. And remove the
misleading comment (from lack of better ideas).
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e6472d46680ccd2b804ad73c19042a5811d036f0
master date: 2024-12-19 17:33:54 +0000
Fixes: 631f535a3d4f ("xen: update ECLAIR service identifiers from MC3R1 to MC3A2.") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 171cb318deaa0be786cc3af3599c72e8909e60f9)
xen: update ECLAIR service identifiers from MC3R1 to MC3A2.
Rename all instances of ECLAIR MISRA C:2012 service identifiers,
identified by the prefix MC3R1, to use the prefix MC3A2, which
refers to MISRA C:2012 Amendment 2 guidelines.
This update is motivated by the need to upgrade ECLAIR GitLab runners
that use the new naming scheme for MISRA C:2012 Amendment 2 guidelines.
Changes to the docs/misra directory are needed in order to keep
comment-based deviation up to date.
Roger Pau Monné [Tue, 17 Dec 2024 11:46:29 +0000 (12:46 +0100)]
x86/io-apic: prevent early exit from i8259 loop detection
Avoid exiting early from the loop when a pin that could be connected to the
i8259 is found, as such early exit would leave the EOI handler translation
array only partially allocated and/or initialized.
Otherwise on systems with multiple IO-APICs and an unmasked ExtINT pin on
any IO-APIC that's not the last one, the following NULL pointer dereference
triggers:
(XEN) Enabling APIC mode. Using 2 I/O APICs
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82d040328046>] __ioapic_write_entry+0x83/0x95
[...]
(XEN) Xen call trace:
(XEN) [<ffff82d040328046>] R __ioapic_write_entry+0x83/0x95
(XEN) [<ffff82d04027464b>] F amd_iommu_ioapic_update_ire+0x1ea/0x273
(XEN) [<ffff82d0402755a1>] F iommu_update_ire_from_apic+0xa/0xc
(XEN) [<ffff82d040328056>] F __ioapic_write_entry+0x93/0x95
(XEN) [<ffff82d0403283c1>] F arch/x86/io_apic.c#clear_IO_APIC_pin+0x7c/0x10e
(XEN) [<ffff82d040328480>] F arch/x86/io_apic.c#clear_IO_APIC+0x2d/0x61
(XEN) [<ffff82d0404448b7>] F enable_IO_APIC+0x2e3/0x34f
(XEN) [<ffff82d04044c9b0>] F smp_prepare_cpus+0x254/0x27a
(XEN) [<ffff82d04044bec2>] F __start_xen+0x1ce1/0x23ae
(XEN) [<ffff82d0402033ae>] F __high_start+0x8e/0x90
(XEN)
(XEN) Pagetable walk from 0000000000000000:
(XEN) L4[0x000] = 000000007dbfd063 ffffffffffffffff
(XEN) L3[0x000] = 000000007dbfa063 ffffffffffffffff
(XEN) L2[0x000] = 000000007dbcc063 ffffffffffffffff
(XEN) L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 0000000000000000
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
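The shape of the fix can be illustrated with a small model of the detection loop. This is a sketch, not the code in arch/x86/io_apic.c: `struct ioapic`, its fields and `scan_ioapics()` are hypothetical. The point it shows is recording the first match while still visiting every IO-APIC, so that each one gets its translation array allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-IO-APIC state. */
struct ioapic {
    unsigned int nr_pins;
    int extint_pin;     /* index of an unmasked ExtINT pin, or -1 if none */
    uint16_t *eoi_map;  /* stand-in for the EOI handler translation array */
};

/*
 * Allocate the translation array for every IO-APIC and return the index of
 * the first one with an ExtINT pin, -1 if there is none, or -2 on
 * allocation failure.
 */
int scan_ioapics(struct ioapic *apics, unsigned int nr)
{
    int found = -1;

    assert(apics != NULL);

    for (unsigned int i = 0; i < nr; i++) {
        apics[i].eoi_map = calloc(apics[i].nr_pins, sizeof(*apics[i].eoi_map));
        if (apics[i].eoi_map == NULL)
            return -2;

        /* Remember the first match, but keep iterating: no early exit. */
        if (found < 0 && apics[i].extint_pin >= 0)
            found = (int)i;
    }

    return found;
}
```

With a `break` in place of the bookkeeping, the second IO-APIC's `eoi_map` would stay NULL whenever the first one matched, which is the situation behind the page fault shown in the trace.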
Reported-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com> Fixes: 86001b3970fe ('x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f38fd27c4ceadf7ec4e82e82d0731b6ea415c51e
master date: 2024-12-17 11:15:30 +0100
Andrii Sultanov [Mon, 16 Dec 2024 12:33:17 +0000 (13:33 +0100)]
tools/ocaml: Specify rpath correctly for ocamlmklib
ocamlmklib has special handling for C-like '-Wl,-rpath' option, but does
not know how to handle '-Wl,-rpath-link', as evidenced by warnings like:
"Unknown option
-Wl,-rpath-link=$HOME/xen/tools/ocaml/libs/eventchn/../../../../tools/libs/toollog"
Pass this option directly to the compiler with -ccopt instead.
Also pass -L directly to the linker with -ldopt. This prevents embedding absolute
build-time paths into the binary's RPATH.
Andrew Cooper [Mon, 16 Dec 2024 12:33:07 +0000 (13:33 +0100)]
libs/guest: Fix migration compatibility with a security-patched Xen 4.13
xc_cpuid_apply_policy() provides compatibility for migration of a pre-4.14 VM
where no CPUID data was provided in the stream.
It guesses the various max-leaf limits, based on what was true at the time of
writing, but this was not correctly adapted when speculative security issues
forced the advertisement of new feature bits. Of note are:
* LFENCE-DISPATCH, in leaf 0x80000021.eax
* BHI-CTRL, in leaf 0x7[2].edx
In both cases, a VM booted on a security-patched Xen 4.13, and then migrated
on to any newer version of Xen on the same or compatible hardware would have
these features stripped back because Xen is still editing the cpu-policy for
sanity behind the back of the toolstack.
For VMs using BHI_DIS_S to mitigate Native-BHI, this resulted in a failure to
restore the guest's MSR_SPEC_CTRL setting:
(XEN) HVM d7v0 load MSR 0x48 with value 0x401 failed
(XEN) HVM7 restore: failed to load entry 20/0 rc -6
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit") Fixes: f3709b15fc86 ("x86/cpuid: Infrastructure for cpuid word 7:2.edx") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 28301682f492c1df2ff9c3e01a0aab6262bd925a
master date: 2024-12-03 12:20:41 +0000
Roger Pau Monné [Mon, 16 Dec 2024 12:32:43 +0000 (13:32 +0100)]
xen/Kconfig: livepatch-build-tools requires debug information
The tools infrastructure used to build livepatches for Xen
(livepatch-build-tools) consumes some DWARF debug information present in
xen-syms to generate a livepatch (see livepatch-build script usage of readelf
-wi).
The current Kconfig defaults however will enable LIVEPATCH without DEBUG_INFO
on release builds, thus providing a default Kconfig selection that's not
suitable for livepatch-build-tools even when LIVEPATCH support is enabled,
because it's missing the DWARF debug section.
Fix by defaulting DEBUG_INFO to enabled when LIVEPATCH is.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 126b0a6e537ce1d486a29e35cfeec1f222a74d11
master date: 2024-12-02 15:22:05 +0100
Jan Beulich [Mon, 16 Dec 2024 12:32:19 +0000 (13:32 +0100)]
x86emul: MOVBE requires a memory operand
The reg-reg forms should cause #UD; they come into existence only with
APX, where MOVBE also extends BSWAP (for the latter not being "eligible"
for a REX2 prefix).
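The operand check amounts to inspecting the ModRM.mod field, where the value 3 selects a register operand. The following is a minimal sketch of that test only. `struct decoded_insn`, the result enum and `check_movbe_operand()` are hypothetical names and do not mirror the emulator's internal state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum emul_result { EMUL_OKAY, EMUL_UD };

/* Hypothetical decode state holding just the ModRM byte. */
struct decoded_insn {
    uint8_t modrm;
};

/*
 * ModRM.mod lives in bits 7:6. mod == 3 means the r/m field names a
 * register, which is not a valid MOVBE encoding without APX.
 */
enum emul_result check_movbe_operand(const struct decoded_insn *insn)
{
    assert(insn != NULL);

    if ((insn->modrm & 0xc0) == 0xc0)
        return EMUL_UD;  /* reg-reg form: raise #UD */

    return EMUL_OKAY;    /* memory operand: proceed with emulation */
}
```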
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4c5d9a01f8fa81417a9c431e9624fb71361ec4f9
master date: 2024-12-02 09:50:14 +0100