Andrew Cooper [Tue, 8 Apr 2025 16:09:15 +0000 (17:09 +0100)]
x86/ucode: Extend AMD digest checks to cover Zen5 CPUs
AMD have updated the SB-7033 advisory to include Zen5 CPUs. Extend the digest
check to cover Zen5 too.
In practice, cover everything until further notice.
Observant readers may be wondering where the update to the digest list is. At
the time of writing, no Zen5 patches are available via a verifiable channel.
x86/ucode: Extend warning about disabling digest check too
This was missed by accident.
Fixes: b63951467e96 ("x86/ucode: Extend AMD digest checks to cover Zen5 CPUs") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 59bb316ea89e7f9461690fe00547d7d2af96321d)
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode
Collisions have been found in the microcode signing algorithm used by AMD
Fam17h/19h CPUs, and now anyone can sign their own.
For more details, see:
https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking
https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html
As a stopgap mitigation, check the digest of patches against a table of blobs
with known provenance. These are all Fam17h and Fam19h blobs included in
linux-firmware at the time of writing, specifically:
These checks can be opted out of by booting with ucode=no-digest-check, but
doing so is not recommended.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 630e8875ab368b97cc7231aaf3809e3d7d5687e1)
Xen: CI fix from XSN-2
* Add cf_check annotation to cmp_patch_id() used by bsearch().
Fixes: 630e8875ab36 ("x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)
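A minimal sketch of the table lookup described in the SHA2 check commit above, assuming a table sorted by patch ID and searched with bsearch(). The struct layout, the table name, digest_is_known() and the table contents are hypothetical placeholders (not real microcode digests); only the cmp_patch_id() name comes from the log, and Xen's cf_check annotation is omitted here.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGEST_LEN 32 /* SHA2-256 digest size in bytes */

/* Hypothetical table entry: a patch ID and the digest expected for it. */
struct patch_digest {
    uint32_t patch_id;
    uint8_t digest[DIGEST_LEN];
};

/* Placeholder entries, sorted by patch_id. Not real microcode digests. */
static const struct patch_digest known_patches[] = {
    { 0x1000, { 0xaa } },
    { 0x2000, { 0xbb } },
    { 0x3000, { 0xcc } },
};

/* bsearch() callback: key points at a patch ID, elem at a table entry. */
static int cmp_patch_id(const void *key, const void *elem)
{
    uint32_t id = *(const uint32_t *)key;
    const struct patch_digest *pd = elem;

    return (id > pd->patch_id) - (id < pd->patch_id);
}

/* Accept a patch only if its ID is listed and its digest matches. */
bool digest_is_known(uint32_t patch_id, const uint8_t digest[DIGEST_LEN])
{
    const struct patch_digest *pd =
        bsearch(&patch_id, known_patches,
                sizeof(known_patches) / sizeof(known_patches[0]),
                sizeof(known_patches[0]), cmp_patch_id);

    return pd && memcmp(pd->digest, digest, DIGEST_LEN) == 0;
}
```

A patch whose ID is absent from the table is rejected just like one whose digest differs, which matches the "known provenance" wording above.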
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
xen/lib: Introduce SHA2-256
A future change will need to calculate SHA2-256 digests. Introduce an
implementation in lib/, derived from Trenchboot which itself is derived from
Linux.
In order to be useful to other architectures, it is careful with endianness
and misaligned accesses as well as being more MISRA friendly, but is only
wired up for x86 in the short term.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 372af524411f5a013bcb0b117073d8d07c026563)
Xen: CI fix from XSN-2
* Add U suffix to the K[] table to fix MISRA Rule 7.2 violations.
Fixes: 372af524411f ("xen/lib: Introduce SHA2-256") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 15fe2eb5f1bac8a212c0ba3d6dfe60d1fdf851cf)
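As an illustration of what "careful with endianness and misaligned accesses" can mean in practice: SHA2-256 consumes its input as big-endian 32-bit words, and a byte-wise load gives the same result on any host and at any alignment. The helper name load_be32() is hypothetical; this is a sketch, not necessarily how the Xen implementation reads its input.

```c
#include <stdint.h>

/* Read a big-endian 32-bit word from a possibly misaligned byte pointer. */
uint32_t load_be32(const uint8_t *p)
{
    /* Casts before shifting avoid promoting to (signed) int. */
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```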
Victor Lira [Fri, 23 Aug 2024 22:29:04 +0000 (15:29 -0700)]
automation: update xilinx test scripts (tty)
Update serial device names from ttyUSB* to test board specific names.
Update xilinx-smoke-dom0-x86_64 with new Xen command line console options,
which are now set as Gitlab CI/CD variables. Abstract the directory where
binaries are stored. Increase the timeout to match new setup.
Signed-off-by: Victor Lira <victorm.lira@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 95764a0817a51741b7ffb1f78cba2a19b08ab2d1)
[Stripped down to xilinx-smoke-dom0less-arm64.sh only] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Mon, 28 Oct 2024 11:48:31 +0000 (12:48 +0100)]
tools/libxl: remove usage of VLA arrays
Clang 19 complains with the following error when building libxl:
libxl_utils.c:48:15: error: variable length array folded to constant array as an extension [-Werror,-Wgnu-folding-constant]
48 | char path[strlen("/local/domain") + 12];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
Replace the usage of strlen() with sizeof, which allows the literal
string length to be known at build time. Note sizeof accounts for the
NUL terminator while strlen() didn't, hence subtract 1 from the total
size calculation.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Frediano Ziglio <frediano.ziglio@cloud.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
(cherry picked from commit a7c7c3f6424504c4004bbb3437be319aa41ad580)
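The sizeof-versus-strlen() point in the VLA commit above can be sketched as follows. DOMAIN_PREFIX, PATH_BUF_LEN and format_domain_path() are hypothetical names chosen for this example, not libxl identifiers.

```c
#include <stdio.h>

#define DOMAIN_PREFIX "/local/domain"

/*
 * sizeof() on a string literal is a constant expression and counts the
 * NUL terminator, so subtract 1 to get the same value strlen() returned.
 */
enum { PATH_BUF_LEN = sizeof(DOMAIN_PREFIX) - 1 + 12 };

/* Format "<prefix>/<domid>" into buf; returns the length snprintf() reports. */
int format_domain_path(char *buf, size_t len, unsigned int domid)
{
    return snprintf(buf, len, DOMAIN_PREFIX "/%u", domid);
}
```

A caller can then declare char path[PATH_BUF_LEN]; as an ordinary fixed-size array, with no variable length array involved.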
tools/libxl: do not use `-c -E` compiler options together
It makes no sense to request preprocessor-only output and also request
object file generation. Fix the _libxl.api-for-check target to only use
-E (preprocessor output).
Also Clang 20.0 reports an error if both options are used.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: 2862bf5b6c81 ('libxl: enforce prohibitions of internal callers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
(cherry picked from commit a235f856e4bbd270b085590e1f5fc9599234dcdf)
Jan Beulich [Wed, 2 Apr 2025 12:31:26 +0000 (14:31 +0200)]
x86/P2M: synchronize fast and slow paths of p2m_get_page_from_gfn()
Handling of foreign pages was different between the two paths.
While permitting access to grants would be desirable, doing so would
require more involved handling; undo that for the time being. In
particular the page reference obtained would prevent the owning domain
from changing e.g. the page's type (after the grantee has released the
last reference of the grant). Instead perhaps another reference on the
grant would need obtaining. Which in turn would require determining
which grant that was.
Foreign pages in any event need permitting on both paths.
Introduce a helper function to be used on both paths, such that
respective checking differs in just the extra "to be unshared" condition
on the fast path.
While there adjust the sanity check for foreign pages: Don't leak the
reference on release builds when on a debug build the assertion would
have triggered. (Thanks to Roger for the suggestion.)
Andrew Cooper [Wed, 2 Apr 2025 12:31:17 +0000 (14:31 +0200)]
ARM/vgic: Fix out-of-bounds accesses in vgic_mmio_write_sgir()
The switch() statement is over bits 24:25 (unshifted) of the guest provided
value. This makes case 0x3: dead, and not an implementation of the 4th
possible state.
A guest which writes (0x3 << 24) | (0xff << 16) to this register will skip the
early exit, then enter bitmap_for_each() with targets not bound by nr_vcpus.
If the guest has fewer than 8 vCPUs, bitmap_for_each() will read off the end
of d->vcpu[] and use the resulting vcpu pointer to ultimately derive irq, and
perform out-of-bounds writes.
Fix this by changing case 0x3 to default.
Fixes: 08c688ca6422 ("ARM: new VGIC: Add SGIR register handler") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: be7f0cc651d8d02a95820792204c0558f1f29e03
master date: 2025-03-27 11:54:23 +0000
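A sketch of why "case 0x3:" is dead in the vgic commit above, assuming the switch is taken over the masked but unshifted field. The enum, the macro and classify_sgir_write() are hypothetical names; the real handler does considerably more than classify the write.

```c
#include <stdint.h>

#define SGIR_MODE_MASK (0x3u << 24) /* bits 25:24 of the written value */

enum sgi_action {
    SGI_TARGET_LIST,
    SGI_ALL_BUT_SELF,
    SGI_SELF_ONLY,
    SGI_IGNORED,
};

enum sgi_action classify_sgir_write(uint32_t val)
{
    /* The switch sees the unshifted field, so the case labels must be too. */
    switch (val & SGIR_MODE_MASK) {
    case 0x0u << 24: return SGI_TARGET_LIST;
    case 0x1u << 24: return SGI_ALL_BUT_SELF;
    case 0x2u << 24: return SGI_SELF_ONLY;
    default:         return SGI_IGNORED; /* "case 0x3:" could never match */
    }
}
```

With a default label, the value (0x3 << 24) | (0xff << 16) takes the early-exit path instead of falling through to target processing.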
OCaml, in preparation for a renaming of the error string associated with
conversion failure in 'int_of_string' functions, started to issue this
warning:
File "process.ml", line 440, characters 13-28:
440 | | (Failure "int_of_string") -> reply_error "EINVAL"
^^^^^^^^^^^^^^^
Warning 52 [fragile-literal-pattern]: Code should not depend on the actual values of
this constructor's arguments. They are only for information
and may change in future versions. (See manual section 11.5)
Deal with this at the source, and instead create our own stable
ConversionFailure exception that's raised on the None case in
'int_of_string_opt'.
'c_int_of_string' is safe and does not raise such exceptions.
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c11772277fe5f1b0874141a24554c2e3da2d9a6e
master date: 2025-02-25 13:30:55 +0000
Jan Beulich [Thu, 27 Mar 2025 14:27:09 +0000 (15:27 +0100)]
Arm/domctl: correct XEN_DOMCTL_vuart_op error return value
copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 341c0df40bf73b0a5e27db27023ec400858a472d
master date: 2025-03-27 12:22:39 +0100
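The return-value conversion described in the vuart_op commit above, as a standalone sketch. fake_copy_to_guest() is a hypothetical stand-in that mimics the "returns bytes not copied" convention, and report_value() is an invented caller; neither is a Xen function.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for copy_to_guest(): returns bytes NOT copied. */
static size_t fake_copy_to_guest(void *dst, const void *src, size_t len,
                                 size_t writable)
{
    size_t n = len < writable ? len : writable;

    memcpy(dst, src, n);
    return len - n;
}

/* Callers expect 0 or -errno, never a leftover byte count. */
int report_value(void *guest_buf, size_t writable, unsigned int value)
{
    if (fake_copy_to_guest(guest_buf, &value, sizeof(value), writable))
        return -EFAULT;

    return 0;
}
```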
Jan Beulich [Thu, 27 Mar 2025 14:26:52 +0000 (15:26 +0100)]
x86/pmstat: correct get_cpufreq_para()'s error return value
copy_to_guest() returns the number of bytes not copied; that's not what
the function should return to its caller though. Convert to returning
-EFAULT instead.
Fixes: 7542c4ff00f2 ("Add user PM control interface") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 855337ca4947508ffca23393e291c54b5307cc9a
master date: 2025-03-27 12:22:06 +0100
Jan Beulich [Thu, 27 Mar 2025 14:26:37 +0000 (15:26 +0100)]
x86/PVH: account for module command line length
As per observation in practice, initrd->cmdline_pa is not normally zero.
Hence so far we always appended at least one byte. That alone may
already render insufficient the "allocation" made by find_memory().
Things would be worse when there's actually a (perhaps long) command
line.
Skip setup when the command line is empty. Amend the "allocation" size
by padding and actual size of module command line. Along these lines
also skip initrd setup when the initrd is zero size.
Fixes: 0ecb8eb09f9f ("x86/pvh: pass module command line to dom0") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 989584e532c9517a0f789e993f5f6744beaebe3e
master date: 2025-03-27 12:21:08 +0100
This is MOV %cr8, which is wired up for hvm_mov_{to,from}_cr(); the VMExit
fastpaths, but not for the full emulation slowpaths.
Xen's handling of %cr8 turns out to be quite wrong. At a minimum, we need
storage for %cr8 separate to APIC_TPR, and to alter intercepts based on
whether the vLAPIC is enabled or not. But that's more work than there is time
for in the short term, so make a stopgap fix.
Extend hvmemul_{read,write}_cr() with %cr8 cases. Unlike hvm_mov_to_cr(),
hardware hasn't filtered out invalid values (#GP checks are ahead of
intercepts), so introduce X86_CR8_VALID_MASK.
Reported-by: Petr Beneš <w1benny@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 14fd9b5642cd4805b49fbe716bf2cd577e724169
master date: 2025-03-26 11:54:59 +0000
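A sketch of the validity check that the %cr8 commit above implies for the emulation slow path. The mask value is an assumption based on the architecture defining only bits 3:0 of %cr8; cr8_write_ok() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: only bits 3:0 (the task priority) are defined in %cr8. */
#define X86_CR8_VALID_MASK UINT64_C(0xf)

/* On the emulation path nothing has filtered the value yet, so check it. */
bool cr8_write_ok(uint64_t val)
{
    return !(val & ~X86_CR8_VALID_MASK);
}
```

An emulated write that fails this check would be expected to raise #GP rather than be stored.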
Andrew Cooper [Thu, 27 Mar 2025 14:25:48 +0000 (15:25 +0100)]
x86/emul: Rearrange the logic in hvmemul_{read,write}_cr()
In hvmemul_read_cr(), make the TRACE()/X86EMUL_OKAY path common in preparation
for adding a %cr8 case. Use a local 'val' variable instead of always
operating on a dereferenced pointer.
In both, calculate curr once.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b7264a15c28d30bb994ec9e58eba38932be231ec
master date: 2025-03-26 11:54:59 +0000
Jan Beulich [Thu, 27 Mar 2025 14:25:39 +0000 (15:25 +0100)]
x86/PVH: expose OEMx ACPI tables to Dom0
What they contain we don't know, but we can't sensibly hide them. On my
Skylake system OEM1 (with a description of "INTEL CPU EIST") is what
contains all the _PCT, _PPC, and _PSS methods, i.e. about everything
needed for cpufreq. (_PSD interestingly are in an SSDT there.)
Further OEM2 there has a description of "INTEL CPU HWP", while OEM4
has "INTEL CPU CST". Pretty clearly all three need exposing for
cpufreq and cpuidle to work.
Jan Beulich [Thu, 27 Mar 2025 14:25:30 +0000 (15:25 +0100)]
xenpm: sanitize allocations in show_cpufreq_para_by_cpuid()
malloc(), when passed zero size, may return NULL (the behavior is
implementation defined). Mirror the ->gov_num check to the other two
allocations as well. Don't chance then actually using a NULL in
print_cpufreq_para().
Fixes: 75e06d089d48 ("xenpm: add cpu frequency control interface, through which user can") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: 6c0dc87bb0e08fb31a68bf4c4149a18b92628f14
master date: 2025-03-26 12:30:57 +0100
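One way to sidestep the malloc(0) ambiguity mentioned in the xenpm commit above is to check the element count before allocating, so that a NULL result always means failure. alloc_array() is a hypothetical helper written for this example, not the xenpm code.

```c
#include <stdlib.h>

/* Allocate count elements; distinguish "nothing to allocate" from failure. */
int alloc_array(size_t count, size_t elem_size, void **out)
{
    *out = NULL;

    if (count == 0)
        return 0; /* skip malloc(0), whose result is implementation defined */

    *out = calloc(count, elem_size);

    return *out ? 0 : -1;
}
```

Consumers still need to tolerate a NULL array when the count is zero, which is the second half of the fix described above.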
Andrew Cooper [Thu, 27 Mar 2025 14:24:43 +0000 (15:24 +0100)]
x86/boot: Simplify the expression for extra allocation space
The expression for one parameter of find_memory() is already complicated and
about to become more so. Break it out into a new variable, and express it in
an easier-to-follow way.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ce703c84df1cb279605b0a85a45c6419464a16e8
master date: 2025-03-21 11:52:39 +0000
Fixes: 84c4461b7d3a ("Force out-of-line instances of inline functions into .init.text in init-only code") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ab7ce0c8ed90f729a186babd87e3cd1fbed8ab98
master date: 2025-03-21 11:52:39 +0000
Roger Pau Monné [Thu, 27 Mar 2025 14:24:13 +0000 (15:24 +0100)]
x86/vga: fix mapping of the VGA text buffer
The call to ioremap_wc() in video_init() will always fail, because
video_init() is called ahead of vm_init_type(), and so the underlying
__vmap() call will fail to allocate the linear address space.
Fix by reverting to the previous behavior and use __va() for the VGA text
buffer, as it's below the 1MB boundary, and thus always mapped in the
directmap.
Adjust the calculations in COMPAT_ARG_XLAT_VIRT_BASE to subtract from the
per-domain area to obtain the mirrored linear address in the 4th slot,
instead of overflowing the per-domain linear address.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc302866f42f552337ae7d8d78877aec36e6e2ff
master date: 2025-03-20 12:28:30 +0100
Roger Pau Monné [Thu, 27 Mar 2025 14:23:00 +0000 (15:23 +0100)]
x86/shadow: fix UB pointer arithmetic in sh_mfn_is_a_page_table()
UBSAN complains with:
UBSAN: Undefined behaviour in arch/x86/mm/shadow/private.h:515:30
pointer operation overflowed ffff82e000000000 to ffff82dfffffffe0
[...]
Xen call trace:
[<ffff82d040303782>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304bc3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d040471b2d>] F arch/x86/mm/shadow/guest_2.c#sh_page_fault__guest_2+0x1e350
[<ffff82d0403b206b>] F svm_vmexit_handler+0xdf3/0x2450
[<ffff82d0402049c0>] F svm_stgi_label+0x5/0x15
Fix by moving the call to mfn_to_page() after the check of whether the
passed gmfn is valid. This avoids the call to mfn_to_page() with an
INVALID_MFN parameter.
While there make the page local variable const, it's not modified by the
function.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 45ee73f1b24246f13cd9583cb2ee25fb9c782db8
master date: 2025-03-20 12:28:30 +0100
Roger Pau Monné [Thu, 27 Mar 2025 14:22:48 +0000 (15:22 +0100)]
x86/mkelf32: account for offset when detecting note segment placement
mkelf32 attempts to check that the program header defined NOTE segment falls
inside of the LOAD segment, as the build-id should be loaded for Xen at
runtime to check.
However the current code doesn't take into account the LOAD program header
segment offset when calculating overlap with the NOTE segment. This
results in incorrect detection, and the following build error:
arch/x86/boot/mkelf32 --notes xen-syms ./.xen.elf32 0x200000 \
`nm xen-syms | sed -ne 's/^\([^ ]*\) . __2M_rwdata_end$/0x\1/p'`
Expected .note section within .text section!
Offset 4244776 not within 2910364!
Account for the program header offset of the LOAD segment when checking
whether the NOTE segment is contained within. Also fix the logic to
ensure the NOTE segment is fully contained within the LOAD segment.
Fixes: a353cab905af ('build_id: Provide ld-embedded build-ids') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e5fed7cb67c9f84653cdbd3924b8a119ef653be
master date: 2025-03-20 12:28:30 +0100
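A sketch of a containment check that accounts for the LOAD segment's file offset, as the mkelf32 commit above describes. note_within_load() and its parameter names are hypothetical; the actual mkelf32 code works on ELF program headers directly.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Is the NOTE range [note_off, note_off + note_filesz) fully inside the
 * LOAD range [load_off, load_off + load_filesz)?  Written so that no
 * addition can overflow.
 */
bool note_within_load(uint64_t load_off, uint64_t load_filesz,
                      uint64_t note_off, uint64_t note_filesz)
{
    return note_off >= load_off &&
           note_filesz <= load_filesz &&
           note_off - load_off <= load_filesz - note_filesz;
}
```

With the figures from the error message (offset 4244776, size 2910364), a check that assumes a LOAD offset of 0 rejects the note, while a non-zero LOAD offset can make the same note fall inside the segment.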
Andrew Cooper [Thu, 20 Mar 2025 12:35:23 +0000 (13:35 +0100)]
x86/mm: Fix IS_ALIGNED() check in IS_LnE_ALIGNED()
The current CI failures turn out to be a latent bug triggered by a narrow set
of properties of the initrd and the host memory map, which CI encountered by
chance.
One step during boot involves constructing directmap mappings for modules.
With some probing at the point of creation, it is observed that there's a 4k
mapping missing towards the end of the initrd.
The conditions for this bug appear to be map_pages_to_xen() call with a start
address of exactly 4k beyond a 2M boundary, some number of full 2M pages, then
a tail needing 4k pages.
Anyway, the condition for spotting superpage boundaries in map_pages_to_xen()
is wrong. The IS_ALIGNED() macro expects a power of two for the alignment
argument, and subtracts 1 itself.
Fixing this causes the failing case to now boot.
Fixes: 97fb6fcf26e8 ("x86/mm: introduce helpers to detect super page alignment") Debugged-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b07c7d63f9b587e4df5d71f6da9eaa433512c974
master date: 2025-03-19 14:53:28 +0000
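The IS_ALIGNED() misuse can be reproduced in isolation. This sketch assumes the check operates on 4k frame numbers with 512 frames per 2M superpage, which is consistent with the failing case described above; the macro body and the helper names are written for this example.

```c
#include <stdbool.h>

/* Expects a power-of-two alignment and subtracts 1 itself. */
#define IS_ALIGNED(val, align) (((val) & ((align) - 1)) == 0)

#define FRAMES_PER_2M (1UL << 9) /* 512 4k frames per 2M superpage */

/* Wrong: passes a mask, so the test is effectively against 512 - 2. */
bool l2_aligned_buggy(unsigned long pfn)
{
    return IS_ALIGNED(pfn, FRAMES_PER_2M - 1);
}

/* Right: passes the alignment itself. */
bool l2_aligned(unsigned long pfn)
{
    return IS_ALIGNED(pfn, FRAMES_PER_2M);
}
```

With the buggy form, frame 513 (one 4k page past a 2M boundary) is reported as superpage aligned because bit 0 is never tested.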
Roger Pau Monné [Thu, 20 Mar 2025 12:35:04 +0000 (13:35 +0100)]
x86/ioremap: prevent additions against the NULL pointer
This was reported by clang UBSAN as:
UBSAN: Undefined behaviour in arch/x86/mm.c:6297:40
applying zero offset to null pointer
[...]
Xen call trace:
[<ffff82d040303662>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304aa3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d0406ebbc0>] F ioremap_wc+0xc8/0xe0
[<ffff82d0406c3728>] F video_init+0xd0/0x180
[<ffff82d0406ab6f5>] F console_init_preirq+0x3d/0x220
[<ffff82d0406f1876>] F __start_xen+0x68e/0x5530
[<ffff82d04020482e>] F __high_start+0x8e/0x90
Fix bt_ioremap() and ioremap{,_wc}() to not add the offset if the returned
pointer from __vmap() is NULL.
Fixes: d0d4635d034f ('implement vmap()') Fixes: f390941a92f1 ('x86/DMI: fix table mapping when one lives above 1Mb') Fixes: 81d195c6c0e2 ('x86: introduce ioremap_wc()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9a6f2c52f75781acda39fab5cc96d1bcc54bf534
master date: 2025-03-17 13:33:29 +0100
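The shape of the ioremap fix above, sketched with a hypothetical mapper callback in place of __vmap(). map_with_offset(), map_ok(), map_fail() and the 4k page mask are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapper type: returns NULL when the mapping fails. */
typedef void *(*map_fn)(uint64_t page_base);

static char fake_page[4096];

/* Two toy mappers so both outcomes can be exercised. */
void *map_ok(uint64_t page_base) { (void)page_base; return fake_page; }
void *map_fail(uint64_t page_base) { (void)page_base; return NULL; }

void *map_with_offset(map_fn map, uint64_t addr)
{
    uint64_t offs = addr & 0xfff; /* offset within the 4k page */
    char *va = map(addr - offs);

    /* Only add the offset to a valid pointer, never to NULL. */
    return va ? va + offs : NULL;
}
```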
Juergen Gross [Thu, 20 Mar 2025 12:34:45 +0000 (13:34 +0100)]
xen/sched: fix arinc653 to not use variables across cpupools
a653sched_do_schedule() is using two function local static variables,
which is resulting in bad behavior when using more than one cpupool
with the arinc653 scheduler.
Fix that by moving those variables to the scheduler private data.
Fixes: 22787f2e107c ("ARINC 653 scheduler") Reported-by: Choi Anderson <Anderson.Choi@boeing.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Nathan Studer <nathan.studer@dornerworks.com>
master commit: d0561ac8ab0e780b1e8ab41d0d15e9f9b076dee3
master date: 2025-03-14 10:17:11 +0100
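Why function-local statics misbehave with more than one scheduler instance, and what moving them into private data looks like. struct sched_priv, its fields and advance_slot() are hypothetical; the arinc653 scheduler's real state differs.

```c
#include <stdint.h>

/* Hypothetical per-instance scheduler state; one copy per cpupool. */
struct sched_priv {
    unsigned int sched_index;  /* position in this instance's schedule */
    uint64_t next_switch_time; /* when this instance next changes slot */
};

/*
 * The state lives in the instance.  Had sched_index been a function-local
 * static, every cpupool would have advanced the same counter.
 */
unsigned int advance_slot(struct sched_priv *priv, unsigned int nr_slots)
{
    priv->sched_index = (priv->sched_index + 1) % nr_slots;

    return priv->sched_index;
}
```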
Jan Beulich [Thu, 20 Mar 2025 12:34:15 +0000 (13:34 +0100)]
libxl: avoid infinite loop in libxl__remove_directory()
Infinitely retrying the rmdir() invocation makes little sense. While the
original observation was the log filling the disk (due to repeated
"Directory not empty" errors, in turn occurring for unclear reasons),
the loop wants breaking even if there was no error message being logged
(much like is done in the similar loops in libxl__remove_file() and
libxl__remove_file_or_directory()).
Fixes: c4dcbee67e6d ("libxl: provide libxl__remove_file et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 68baeb5c4852e652b9599e049f40477edac4060e
master date: 2025-03-13 10:23:10 +0100
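A sketch of an rmdir() loop that cannot spin forever, in the spirit of the libxl change above. remove_empty_dir() is a hypothetical function; retrying only on EINTR and treating ENOENT as success are assumptions made for this example.

```c
#include <errno.h>
#include <unistd.h>

/* Remove an already emptied directory; retry only when interrupted. */
int remove_empty_dir(const char *path)
{
    for (;;) {
        if (rmdir(path) == 0 || errno == ENOENT)
            return 0;

        if (errno == EINTR)
            continue;

        return -1; /* any other error: give up instead of looping */
    }
}
```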
Roger Pau Monné [Thu, 20 Mar 2025 12:33:31 +0000 (13:33 +0100)]
x86/vmx: fix posted interrupts usage of msi_desc->msg field
The current usage of msi_desc->msg in vmx_pi_update_irte() will make the
field contain a translated MSI message, instead of the expected
untranslated one. This breaks dump_msi(), which uses the data in
msi_desc->msg to print the interrupt details.
Fix this by introducing a dummy local msi_msg, and use it with
iommu_update_ire_from_msi(). vmx_pi_update_irte() relies on the MSI
message not changing, so there's no need to propagate the resulting msi_msg
to the hardware, and the contents can be ignored.
Additionally add a comment to clarify that msi_desc->msg must always
contain the untranslated MSI message.
Fixes: a5e25908d18d ('VT-d: introduce new fields in msi_desc to track binding with guest interrupt') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30f0e55a79206702b4e82e86dad6b35033157858
master date: 2025-03-12 13:32:30 +0100
Roger Pau Monné [Thu, 20 Mar 2025 12:33:12 +0000 (13:33 +0100)]
x86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD
The MMIO_CONF_BASE reports the base of the MCFG range on AMD systems.
Linux pre-6.14 is unconditionally attempting to read the MSR without a
safe MSR accessor, and since Xen doesn't allow access to it Linux reports
the following error:
Such access is conditional to the presence of a device with PnP ID
"PNP0c01", which triggers the execution of the quirk_amd_mmconfig_area()
function. Note that prior to commit 3fac3734c43a MSR accesses when running
as a PV guest would always use the safe variant, and thus silently handle
the #GP.
Fix by allowing access to the MSR on AMD systems for the hardware domain.
Write attempts to the MSR will still result in #GP for all domain types.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b4071d28c5bd9ca4fed76031cbf0e782b74209b9
master date: 2025-03-12 13:32:30 +0100
Andrew Cooper [Thu, 20 Mar 2025 12:32:41 +0000 (13:32 +0100)]
x86/vlapic: Fix handling of writes to APIC_ESR
Xen currently presents APIC_ESR to guests as a simple read/write register.
This is incorrect. The SDM states:
The ESR is a write/read register. Before attempt to read from the ESR,
software should first write to it. (The value written does not affect the
values read subsequently; only zero may be written in x2APIC mode.) This
write clears any previously logged errors and updates the ESR with any
errors detected since the last write to the ESR.
Introduce a new pending_esr field in hvm_hw_lapic.
Update vlapic_error() to accumulate errors here, and extend vlapic_reg_write()
to discard the written value and transfer pending_esr into APIC_ESR. Reads
are still as before.
Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b28b590d4a23894672f1dd7fb98cdf9926ecb282
master date: 2025-03-07 14:34:08 +0000
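The write-then-read ESR semantics described above can be modelled in a few lines. struct vlapic_esr and the esr_*() helpers are hypothetical; only the pending_esr name is taken from the commit message.

```c
#include <stdint.h>

struct vlapic_esr {
    uint32_t esr;         /* value the guest reads */
    uint32_t pending_esr; /* errors logged since the last write */
};

/* Errors accumulate in the pending field, not in the visible register. */
void esr_log_error(struct vlapic_esr *v, uint32_t errmask)
{
    v->pending_esr |= errmask;
}

/* A write discards the written value and latches the pending errors. */
void esr_write(struct vlapic_esr *v, uint32_t ignored)
{
    (void)ignored;
    v->esr = v->pending_esr;
    v->pending_esr = 0;
}

uint32_t esr_read(const struct vlapic_esr *v)
{
    return v->esr;
}
```

A handler that writes and then reads sees the logged error, and repeated reads keep returning it until the next write.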
Juergen Gross [Thu, 20 Mar 2025 12:32:19 +0000 (13:32 +0100)]
tools/xl: fix channel configuration setting
Channels work differently than other device types: their devid should
be -1 initially in order to distinguish them from the primary console
which has the devid of 0.
So when parsing the channel configuration, use
ARRAY_EXTEND_INIT_NODEVID() in order to avoid overwriting the devid
set by libxl_device_channel_init().
Fixes: 3a6679634766 ("libxl: set channel devid when not provided by application") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e1ccced4afe465d6541c5825a0f8d1b8f5fa4253
master date: 2025-03-05 16:37:37 +0100
Roger Pau Monné [Thu, 20 Mar 2025 12:31:03 +0000 (13:31 +0100)]
x86/dom0: be less restrictive with the Interrupt Address Range
Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff]. This range has
two different purposes. For accesses from the CPU it contains the default
position of the local APIC page at 0xfee00000. For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) points to an address in that range to trigger an interrupt.
There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
the Linux kernel maps 0xfeec2000:
Remove the restrictions to create mappings in the interrupt address range
for dom0. Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present. Additionally make sure the
emulated local APIC page is also not mapped, in case dom0 is using it.
Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages. Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.
There's also the following restriction noted in Intel VT-d:
> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).
Similarly for AMD-Vi:
> Accesses to the interrupt address range (Table 3) are defined to go through
> the interrupt remapping portion of the IOMMU and not through address
> translation processing. Therefore, when a transaction is being processed as
> an interrupt remapping operation, the transaction attribute of
> pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should
> not be configured such that an address translation results in a special
> address such as the interrupt address range.
However those restrictions don't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping, and hence there's no output address after translation that
belongs to the interrupt address range.
Roger Pau Monné [Thu, 20 Mar 2025 12:30:30 +0000 (13:30 +0100)]
x86/dom0: correctly set the maximum ->iomem_caps bound for PVH
The logic in dom0_setup_permissions() sets the maximum bound in
->iomem_caps unconditionally using paddr_bits, which is not correct for HVM
based domains. Instead use domain_max_paddr_bits() to get the correct
maximum paddr bits for each possible domain type.
Switch to using PFN_DOWN() instead of PAGE_SHIFT, as that's shorter.
Fixes: 53de839fb409 ('x86: constrain MFN range Dom0 may access') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a00e08799cc7657d2a1aca158f4ad43d4c9103e7
master date: 2025-03-05 10:26:46 +0100
Andrew Cooper [Thu, 20 Mar 2025 12:29:57 +0000 (13:29 +0100)]
x86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()
There is a corner case in the VMRUN instruction where its INTR_SHADOW state
leaks into guest state if a VMExit occurs before the VMRUN is complete. An
example of this could be taking #NPF due to event injection.
Xen can safely execute STI anywhere between CLGI and VMRUN, as CLGI blocks
external interrupts too. However, an exception (while fatal) will appear to
be in an irqs-on region (as GIF isn't considered), so position the STI after
the speculation actions but prior to the GPR pops.
xen/memory: Make resource_max_frames() to return 0 on unknown type
This is actually what the caller acquire_resource() expects on any kind
of error (the comment on top of resource_max_frames() also suggests that).
Otherwise, the caller will treat -errno as a valid value and propagate incorrect
nr_frames to the VM. As a possible consequence, a VM trying to query a resource
size of an unknown type will get the success result from the hypercall and obtain
nr_frames 4294967201.
Also, add an ASSERT_UNREACHABLE() in the default case of _acquire_resource(),
normally we won't get to this point, as an unknown type will always be rejected
earlier in resource_max_frames().
Also, update test-resource app to verify that Xen can deal with invalid
(unknown) resource type properly.
Fixes: 9244528955de ("xen/memory: Fix acquire_resource size semantics") Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9b8708290002f0a4d0b363e0c66ce945f6b520bd
master date: 2025-02-18 14:47:34 +0000
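The nr_frames figure quoted above is 2^32 - 95, which is what a negative return value of -95 looks like once it is stored in an unsigned 32-bit count. as_frame_count() is a hypothetical helper used only to show the conversion.

```c
#include <stdint.h>

/* What an unsigned 32-bit frame count holds when handed a negative int. */
uint32_t as_frame_count(int ret)
{
    return (uint32_t)ret; /* conversion is modulo 2^32 */
}
```

Returning 0 on unknown types avoids the conversion entirely, since 0 is already the value the caller treats as an error.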
Andrew Cooper [Thu, 20 Mar 2025 12:28:42 +0000 (13:28 +0100)]
xen/console: Fix truncation of panic() messages
The panic() function uses a static buffer to format its arguments into, simply
to emit the result via printk("%s", buf). This buffer is not large enough for
some existing users in Xen. e.g.:
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Invalid device tree blob at physical address 0x46a00000.
(XEN) The DTB must be 8-byte aligned and must not exceed 2 MB in size.
(XEN)
(XEN) Plea****************************************
The remainder of this particular message is 'e check your bootloader.', but
has been inherited by RISC-V from ARM.
It is also pointless double buffering. Implement vprintk() beside printk(),
and use it directly rather than rendering into a local buffer, removing it as
one source of message limitation.
This marginally simplifies panic(), and drops a global used-once buffer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f8b1dd9407e4a3d9dc058b7fbbc591168649ad
master date: 2025-02-18 14:15:58 +0000
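The truncation and its remedy, sketched with standard C I/O in place of Xen's printk() family. format_truncating() and emit_direct() are hypothetical names; the 8-byte buffer in the usage note is deliberately tiny to make the effect visible.

```c
#include <stdarg.h>
#include <stdio.h>

/* Old shape: render into a fixed buffer first; long messages are cut off. */
int format_truncating(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(buf, len, fmt, args);
    va_end(args);

    return n; /* length the full message needed, even if it was truncated */
}

/* New shape: pass the va_list straight to a v*() printer, no staging buffer. */
int emit_direct(FILE *out, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vfprintf(out, fmt, args);
    va_end(args);

    return n;
}
```

Formatting "Please check" into an 8-byte buffer leaves only "Please " in the buffer, whereas the direct variant emits the whole string.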
Jan Beulich [Thu, 27 Feb 2025 12:58:32 +0000 (12:58 +0000)]
IOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe
The function's use from set_msi_source_id() is guaranteed to be in an
IRQs-off region. While the invocation of that function could be moved
ahead in msi_msg_to_remap_entry() (doesn't need to be in the IOMMU-
intremap-locked region), the call tree from map_domain_pirq() holds an
IRQ descriptor lock. Hence all use sites of the lock need become IRQ-
safe ones.
In find_upstream_bridge() do a tiny bit of tidying in adjacent code:
Change a variable's type to unsigned and merge a redundant assignment
into another variable's initializer.
This is XSA-467 / CVE-2025-1713.
Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 39bc6af3ba483282ed6bbf94b08aec38c93d39e6)
Roger Pau Monné [Mon, 17 Feb 2025 12:36:15 +0000 (13:36 +0100)]
iommu/amd: atomically update IRTE
Either when using a 32bit Interrupt Remapping Entry or a 128bit one update
the entry atomically, by using cmpxchg unconditionally as IOMMU depends on
it. No longer disable the entry by setting RemapEn = 0 ahead of updating
it. As a consequence of not toggling RemapEn ahead of the update the
Interrupt Remapping Table needs to be flushed after the entry update.
This avoids a window where the IRTE has RemapEn = 0, which can lead to
IO_PAGE_FAULT if the underlying interrupt source is not masked.
There's no guidance in AMD-Vi specification about how IRTE update should be
performed as opposed to DTE updating which has specific guidance. However
DTE updating claims that reads will always be at least 128bits in size, and
hence for the purposes here assume that reads and caching of the IRTE
entries in either 32 or 128 bit format will be done atomically from
the IOMMU.
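A minimal sketch of the atomic replacement idea for the 32bit case, using C11 atomics rather than Xen's cmpxchg() helper. The entry layout (DEMO_REMAP_EN, DEMO_VECTOR) and the function name are hypothetical, chosen only for illustration; the 128bit format would need a 16-byte compare-exchange instead.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit entry layout, for illustration only. */
#define DEMO_REMAP_EN   (1u << 0)
#define DEMO_VECTOR(v)  ((uint32_t)(v) << 16)

/*
 * Replace the whole entry in one atomic operation, without first clearing
 * the enable bit.  The caller is assumed to hold a lock serialising writers,
 * so the compare-exchange is expected to succeed; a false return would
 * indicate an unexpected concurrent modification.
 */
static bool demo_update_irte32(_Atomic uint32_t *entry, uint32_t new_val)
{
    uint32_t old = atomic_load(entry);

    return atomic_compare_exchange_strong(entry, &old, new_val);
}
```

Since the entry stays enabled throughout, a real implementation would still have to flush the Interrupt Remapping Table after the update, as the message notes.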
Note that as part of introducing a new raw128 field in the IRTE struct, the
current raw field is renamed to raw64 to explicitly contain the size in the
field name.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b953a99da98d63a7c827248abc450d4e8e015ab6
master date: 2025-01-27 13:05:11 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:33:12 +0000 (13:33 +0100)]
x86/iommu: disable interrupts at shutdown
Add a new hook to inhibit interrupt generation by the IOMMU(s). Note the
hook is currently only implemented for x86 IOMMUs. The purpose is to
disable interrupt generation at shutdown so any kexec chained image finds
the IOMMU(s) in a quiesced state.
It would also prevent "Receive accept error" being raised as a result of
non-disabled interrupts targeting offline CPUs.
Note that the iommu_quiesce() call in nmi_shootdown_cpus() is still
required even when there's a preceding iommu_crash_shutdown() call; the
latter can become a no-op depending on the setting of the "crash-disable"
command line option.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 819c3cb186a86ef3e04fb5af4d9f9f6de032c3ee
master date: 2025-02-12 15:56:07 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:32:52 +0000 (13:32 +0100)]
x86/pci: disable MSI(-X) on all devices at shutdown
Attempt to disable MSI(-X) capabilities on all PCI devices known to Xen at
shutdown. Doing so should help a kexec chained kernel boot more reliably,
as device MSI(-X) interrupt generation should be quiesced.
Only attempt to disable MSI(-X) on all devices in the crash context if the
PCI lock is not taken, otherwise the PCI device list could be in an
inconsistent state. This requires introducing a new pcidevs_trylock()
helper to check whether the lock is currently taken.
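The trylock pattern for the crash path can be sketched as below. This is a toy model under stated assumptions: the lock is a C11 atomic_flag rather than Xen's PCI devices lock, and all demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the PCI devices lock. */
static atomic_flag demo_pcidevs_lock = ATOMIC_FLAG_INIT;

/* Returns true if the lock was acquired, false if someone else holds it. */
static bool demo_pcidevs_trylock(void)
{
    return !atomic_flag_test_and_set(&demo_pcidevs_lock);
}

static void demo_pcidevs_unlock(void)
{
    atomic_flag_clear(&demo_pcidevs_lock);
}

/* Counts devices "quiesced"; stands in for walking the device list. */
static unsigned int demo_msi_disabled;

/*
 * In the crash context never wait for the lock: if it is held, the device
 * list may be mid-update, so skip the walk altogether.
 */
static bool demo_disable_all_msi(bool crash_context, unsigned int nr_devs)
{
    if ( crash_context )
    {
        if ( !demo_pcidevs_trylock() )
            return false;
    }
    else
    {
        while ( !demo_pcidevs_trylock() )
            ; /* normal shutdown: spin until the lock is available */
    }

    demo_msi_disabled += nr_devs;
    demo_pcidevs_unlock();

    return true;
}
```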
Disabling MSI(-X) should prevent "Receive accept error" being raised as a
result of non-disabled interrupts targeting offline CPUs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7ab6951981231b4c576a3588248c303001272588
master date: 2025-02-12 15:56:07 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:32:40 +0000 (13:32 +0100)]
x86/smp: perform disabling on interrupts ahead of AP shutdown
Move the disabling of interrupt sources so it's done ahead of the offlining
of APs. This is to prevent AMD systems triggering "Receive accept error"
when interrupts target CPUs that are no longer online.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db6daa9bf411260d2c1f5301e4fc786ae4a5cef8
master date: 2025-02-12 15:56:07 +0100
Roger Pau Monné [Mon, 17 Feb 2025 12:32:17 +0000 (13:32 +0100)]
x86/shutdown: offline APs with interrupts disabled on all CPUs
The current shutdown logic in smp_send_stop() will disable the APs while
having interrupts enabled on the BSP or possibly other APs. On AMD systems
this can lead to local APIC errors:
APIC error on CPU0: 00(08), Receive accept error
Such an error message can be printed in a loop, thus blocking the system from
rebooting. I assume this loop is created by the error being triggered by
the console interrupt, which is further stirred by the ESR handler
printing to the console.
Intel SDM states:
"Receive Accept Error.
Set when the local APIC detects that the message it received was not
accepted by any APIC on the APIC bus, including itself. Used only on P6
family and Pentium processors."
So the error shouldn't trigger on any Intel CPU supported by Xen.
However AMD doesn't make such claims, and indeed the error is broadcast to
all local APICs when an interrupt targets a CPU that's already offline.
To prevent the error from stalling the shutdown process perform the
disabling of APs and the BSP local APIC with interrupts disabled on all
CPUs in the system, so that by the time interrupts are unmasked on the BSP
the local APIC is already disabled. This can still lead to a spurious:
APIC error on CPU0: 00(00)
As a result of an LVT Error getting injected while interrupts are masked on
the CPU, and the vector only handled after the local APIC is already
disabled. ESR reports 0 because as part of disable_local_APIC() the ESR
register is cleared.
Note the NMI crash path doesn't have such an issue, because disabling of APs
and the caller local APIC is already done in the same contiguous region
with interrupts disabled. There's a possible window on the NMI crash path
(nmi_shootdown_cpus()) where some APs might be disabled (and thus
interrupts targeting them raising "Receive accept error") before other APs
have interrupts disabled. However the shutdown NMI will be handled,
regardless of whether the AP is processing a local APIC error, and hence
such interrupts will not cause the shutdown process to get stuck.
Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
intended goal of moving all interrupts to the BSP anyway. The logic in
fixup_irqs() will move interrupts whose affinity doesn't overlap with the
passed mask, but the movement of interrupts is done to any CPU set in
cpu_online_map. As in the shutdown path fixup_irqs() is called before APs
are cleared from cpu_online_map this leads to interrupts being shuffled
around, but not assigned to the BSP exclusively.
The Fixes tag is more of a guess than a certainty; it's possible the
previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
delivered before APs went offline. However fixup_irqs() was still
incorrectly used, as it didn't (and still doesn't) move all interrupts to
target the provided cpu mask.
Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new destination in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1191ce954f64244a3c5f553116184928bcc677e8
master date: 2025-02-12 15:56:07 +0100
Jan Beulich [Mon, 17 Feb 2025 12:31:59 +0000 (13:31 +0100)]
radix-tree: introduce RADIX_TREE{,_INIT}()
... now that static initialization is possible. Use RADIX_TREE() for
pci_segments and ivrs_maps.
This then fixes an ordering issue on x86: With the call to
radix_tree_init(), acpi_mmcfg_init()'s invocation of pci_segments_init()
will zap the possible earlier introduction of segment 0 by
amd_iommu_detect_one_acpi()'s call to pci_ro_device(), and thus the
write-protection of the PCI devices representing AMD IOMMUs.
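The ordering problem can be illustrated with a simplified model. The structure and macros below are hypothetical and only mirror the shape of a static initializer; they are not Xen's radix tree definitions.

```c
#include <stddef.h>

/* Hypothetical, much simplified tree root. */
struct demo_tree {
    unsigned int height;
    void *rnode;
};

/* Static initializer plus a definition helper, in the spirit of the patch. */
#define DEMO_TREE_INIT()  { .height = 0, .rnode = NULL }
#define DEMO_TREE(name)   struct demo_tree name = DEMO_TREE_INIT()

/* Usable from the very start of boot, no init call required. */
static DEMO_TREE(demo_segments);

/* Run-time initializer: harmless on a fresh tree, destructive on a used one. */
static void demo_tree_init(struct demo_tree *t)
{
    t->height = 0;
    t->rnode = NULL;
}

/* Toy "insert": just records a single item as the root. */
static void demo_tree_set_root(struct demo_tree *t, void *item)
{
    t->rnode = item;
    t->height = 1;
}
```

With a statically initialized tree, an early insertion no longer risks being wiped by a later run-time init call, because that call is not needed at all.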
Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table parsing") Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 26fe09e34566d701ecaea76b4563bb9934e85861
master date: 2025-02-07 10:00:04 +0100
Jan Beulich [Mon, 17 Feb 2025 12:31:22 +0000 (13:31 +0100)]
radix-tree: purge node allocation override hooks
These were needed by TMEM only, which is long gone. The Linux original
doesn't have such either. This effectively reverts one of the "Other
changes" from 8dc6738dbb3c ("Update radix-tree.[ch] from upstream Linux
to gain RCU awareness").
Positive side effect: Two cf_check go away.
While there also convert xmalloc()+memset() to xzalloc().
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1275093a96fed45057db241b3aa6e191d9dcf596
master date: 2025-02-07 09:59:11 +0100
Andrew Cooper [Mon, 17 Feb 2025 12:31:06 +0000 (13:31 +0100)]
x86/intel: Fix PERF_GLOBAL fixup when virtualised
Logic using performance counters needs to look at
MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources.
When virtualised under ESX, Xen dies with a #GP fault trying to read
MSR_CORE_PERF_GLOBAL_CTRL.
Factor this logic out into a separate function (it's already too squashed to
the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE.
This also avoids setting X86_FEATURE_ARCH_PERFMON if MSR_MISC_ENABLE says that
PERF is unavailable, although oprofile (the only consumer of this flag)
cross-checks too.
Fixes: 6bdb965178bb ("x86/intel: ensure Global Performance Counter Control is setup correctly") Reported-by: Jonathan Katz <jonathan.katz@aptar.com> Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Jonathan Katz <jonathan.katz@aptar.com>
master commit: dd05d265b8abda4cc7206b29cd71b77fb46658bf
master date: 2025-01-28 11:19:45 +0000
Jan Beulich [Mon, 17 Feb 2025 12:30:38 +0000 (13:30 +0100)]
x86/PV: further harden guest memory accesses against speculative abuse
The original implementation has two issues: For one it doesn't preserve
non-canonical-ness of inputs in the range 0x8000000000000000 through
0x80007fffffffffff. Bogus guest pointers in that range would not cause a
(#GP) fault upon access, when they should.
And then there is an AMD-specific aspect, where only the low 48 bits of
an address are used for speculative execution; the architecturally
mandated #GP for non-canonical addresses would be raised at a later
execution stage. Therefore to prevent Xen controlled data from making it
into any of the caches in a guest controllable manner, we need to
additionally ensure that for non-canonical inputs bit 47 would be clear.
See the code comment for how addressing both is being achieved.
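The first issue can be illustrated with a simplified C model. It assumes that the earlier masking amounts to clearing bit 63 of any pointer below the end of the hypervisor range; the constant value, the demo_* names and the use of a branch are all assumptions made for readability (the real code is branch-free assembly and is not reproduced here).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed end of the hypervisor virtual range on x86-64. */
#define DEMO_HYPERVISOR_VIRT_END 0xffff880000000000ULL

/* With 48-bit addressing, bits 63..47 must be all zeros or all ones. */
static bool demo_is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47;

    return top == 0 || top == 0x1ffff;
}

/* Model of a mask which clears bit 63 for pointers below the range end. */
static uint64_t demo_naive_mask(uint64_t ptr)
{
    if ( ptr < DEMO_HYPERVISOR_VIRT_END )
        ptr &= ~(1ULL << 63);

    return ptr;
}
```

In this model a hypervisor address such as 0xffff800000001000 becomes non-canonical, as intended, but the non-canonical input 0x8000000000001000 is turned into the canonical address 0x1000 and hence would no longer fault.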
Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8306d773b03acec6062c0547ac05e3dd4a6960f6
master date: 2025-01-27 15:23:59 +0100
Jan Beulich [Mon, 17 Feb 2025 12:30:21 +0000 (13:30 +0100)]
x86emul: further correct 64-bit mode zero count repeated string insn handling
In an entirely different context I came across Linux commit 428e3d08574b
("KVM: x86: Fix zero iterations REP-string"), which points out that
we're still doing things wrong: For one, there's no zero-extension at
all on AMD. And then while RCX is zero-extended from 32 bits uniformly
for all string instructions on newer hardware, RSI/RDI are only for MOVS
and STOS on the systems I have access to. (On an old family 0xf system
I've further found that for REP LODS even RCX is not zero-extended.)
While touching the lines anyway, replace two casts in get_rep_prefix().
Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5310a042c4e3135c471446c8253ad13250539957
master date: 2025-01-27 15:23:19 +0100
Teddy Astie [Mon, 17 Feb 2025 12:29:27 +0000 (13:29 +0100)]
x86/iommu: check for CMPXCHG16B when enabling IOMMU
All hardware with VT-d/AMD-Vi has CMPXCHG16B support. Check this at
initialisation time, and otherwise refuse to use the IOMMU.
If the local APICs support x2APIC mode the IOMMU support for interrupt
remapping will be checked earlier using a specific helper. If no support
for CX16 is detected by that earlier hook disable the IOMMU at that point
and prevent further poking for CX16 later in the boot process, which would
also fail.
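A minimal sketch of the kind of check involved, assuming the caller has already obtained CPUID leaf 1 ECX. CMPXCHG16B is enumerated in bit 13 of that register; the function and macro names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.1:ECX bit 13 enumerates CMPXCHG16B (CX16). */
#define DEMO_CPUID1_ECX_CX16  (1u << 13)

/*
 * Decide whether the IOMMU may be used: refuse when the CPU does not
 * advertise CMPXCHG16B, even if IOMMU hardware has been detected.
 */
static bool demo_iommu_usable(uint32_t cpuid1_ecx, bool iommu_present)
{
    if ( !iommu_present )
        return false;

    return (cpuid1_ecx & DEMO_CPUID1_ECX_CX16) != 0;
}
```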
There's a possible corner case when running virtualized, where the underlying
hypervisor exposes an IOMMU but no CMPXCHG16B support. In that case
ignoring the IOMMU is fine, although the most natural thing would be for the
underlying hypervisor to also expose CMPXCHG16B support if an IOMMU is
available to the VM.
Note this change only introduces the checks, but doesn't remove the now
stale checks for CX16 support sprinkled in the IOMMU code. Further changes
will take care of that.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2636fcdc15c707d5e097770133f0afb69e8d70c9
master date: 2025-01-27 13:05:11 +0100
Jan Beulich [Mon, 17 Feb 2025 12:29:05 +0000 (13:29 +0100)]
x86/HVM: correct read/write split at page boundaries
The MMIO cache is intended to have one entry used per independent memory
access that an insn does. This, in particular, is supposed to be
ignoring any page boundary crossing. Therefore when looking up a cache
entry, the access's starting (linear) address is relevant, not the one
possibly advanced past a page boundary.
In order for the same offset-into-buffer variable to be usable in
hvmemul_phys_mmio_access() for both the caller's buffer and the cache
entry's it is further necessary to have the un-adjusted caller buffer
passed into there.
Fixes: 2d527ba310dc ("x86/hvm: split all linear reads and writes at page boundary") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 672894a11fe06e664a0ebfb600baf5dbb897b9e4
master date: 2025-01-24 10:15:56 +0100
Both caches may need higher capacity, and the upper bound will need to
be determined dynamically based on CPUID policy (for AMX'es TILELOAD /
TILESTORE at least).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 23d60dbb0493b2f9ec1d89be5341eec2ee9dab32
master date: 2025-01-24 10:15:29 +0100
Jan Beulich [Mon, 17 Feb 2025 12:27:22 +0000 (13:27 +0100)]
x86/HVM: reduce recursion in linear_{read,write}()
Let's make explicit what the compiler may or may not do on our behalf:
The 2nd of the recursive invocations each can fall through rather than
re-invoking the function. This will save us from adding yet another
parameter (or more) to the function, just for the recursive invocations.
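The fall-through idea can be sketched on a toy page-splitting reader. Everything here (the two-page memory array, the demo_* names, the one-page size limit) is an assumption for illustration and does not mirror the real linear_read() signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u

static uint8_t demo_mem[2 * DEMO_PAGE_SIZE];  /* toy linear address space */
static unsigned int demo_page_accesses;       /* counts per-page accesses */

/* Read at most one page worth of bytes, splitting at a page boundary. */
static int demo_linear_read(size_t addr, size_t bytes, uint8_t *buf)
{
    size_t offset = addr & (DEMO_PAGE_SIZE - 1);

    if ( bytes > DEMO_PAGE_SIZE || addr > sizeof(demo_mem) ||
         bytes > sizeof(demo_mem) - addr )
        return -1;

    if ( offset + bytes > DEMO_PAGE_SIZE )
    {
        size_t part1 = DEMO_PAGE_SIZE - offset;
        int rc = demo_linear_read(addr, part1, buf); /* 1st page: recurse */

        if ( rc )
            return rc;

        /* 2nd page: adjust the arguments and fall through, no re-invocation. */
        addr  += part1;
        buf   += part1;
        bytes -= part1;
    }

    memcpy(buf, &demo_mem[addr], bytes);
    demo_page_accesses++;

    return 0;
}
```

Only the first part of a split access recurses; the second part reuses the tail of the same invocation, so no extra parameter is needed to tell the function that it is handling a continuation.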
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 18053054b7583810dd356efc8d7018bbc8720f36
master date: 2024-09-09 13:40:47 +0200
Juergen Gross [Tue, 21 Jan 2025 08:30:59 +0000 (09:30 +0100)]
xen/events: fix race with set_global_virq_handler()
There is a possible race scenario between set_global_virq_handler()
and clear_global_virq_handlers() targeting the same domain, which
might result in that domain ending as a zombie domain.
In case set_global_virq_handler() is being called for a domain which
is just dying, it might happen that clear_global_virq_handlers() is
running first, resulting in set_global_virq_handler() taking a new
reference for that domain and entering it in the global_virq_handlers[]
array afterwards. The reference will never be dropped, thus the domain
will never be freed completely.
This can be fixed by checking the is_dying state of the domain inside
the region guarded by global_virq_handlers_lock. In case the domain is
dying, handle it as if the domain wouldn't exist, which will be the
case in near future anyway.
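The fix can be modelled in a single-threaded sketch where the locked region is marked by comments. The structure, the reference counting and the return values are hypothetical simplifications, not Xen's actual data structures or error codes.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_VIRQS 4

struct demo_domain {
    bool is_dying;
    unsigned int refcnt;
};

static struct demo_domain *demo_handlers[DEMO_NR_VIRQS];

static int demo_set_handler(struct demo_domain *d, unsigned int virq)
{
    struct demo_domain *old;
    int rc = 0;

    if ( virq >= DEMO_NR_VIRQS )
        return -1;

    d->refcnt++;                 /* reference for the handlers[] slot */

    /* --- region that would be guarded by the handlers lock --- */
    if ( d->is_dying )
    {
        old = d;                 /* treat as nonexistent: undo the reference */
        rc = -1;
    }
    else
    {
        old = demo_handlers[virq];
        demo_handlers[virq] = d;
    }
    /* --- end of locked region --- */

    if ( old )
        old->refcnt--;           /* drop the displaced (or refused) reference */

    return rc;
}
```

Because the dying check sits inside the same locked region as the array update, a dying domain can never be entered into the array after its handlers have been cleared.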
Fixes: 87521589aa6a ("xen: allow global VIRQ handlers to be delegated to other domains") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4d8acc9c1cf14233dda21dd3a7791b5a84b0f6c3
master date: 2025-01-09 17:34:01 +0100
Michal Orzel [Tue, 21 Jan 2025 08:30:49 +0000 (09:30 +0100)]
xen/flask: Wire up XEN_DOMCTL_vuart_op
Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only VUART operation is initialization that
can occur only during domain creation.
Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 29daa72e4019aae92f857cf6e7e0c3ca8fb1483e
master date: 2025-01-08 13:05:38 +0100
All selector fields under ctxt->regs are (normally) poisoned in the HVM
case, and the four ones besides CS and SS are potentially stale for PV.
Avoid using them in the hypervisor incarnation of the emulator, when
trying to cover for a missing ->read_segment() hook.
To make sure there's always a valid ->read_segment() handler for all HVM
cases, add a respective function to shadow code, even if it is not
expected for FPU insns to be used to update page tables.
Fixes: 0711b59b858a ("x86emul: correct FPU code/data pointers and opcode handling") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 645b8d48c78f5b6ffd6230873f9e3ced4e840acd
master date: 2025-01-08 11:02:16 +0100
Jan Beulich [Tue, 21 Jan 2025 08:30:04 +0000 (09:30 +0100)]
x86emul: VCVT{,U}DQ2PD ignores embedded rounding
IOW we shouldn't raise #UD in that case. Be on the safe side though and
only encode fully legitimate forms into the stub to be executed.
Things weren't quite right for VCVT{,U}SI2SD either, in the attempt to
be on the safe side: Clearing EVEX.L'L isn't useful; it's EVEX.b which
primarily needs clearing. Also reflect the somewhat improved doc
situation in the comment there.
Fixes: ed806f373730 ("x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns") Fixes: baf4a376f550 ("x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3709d1324aa140f064b9c68da37547f459f8e8d
master date: 2025-01-08 11:01:17 +0100
Andrew Cooper [Tue, 21 Jan 2025 08:29:34 +0000 (09:29 +0100)]
x86/traps: Rework LER initialisation and support Zen5/Diamond Rapids
AMD have always used the architectural MSRs for LER. As the first processor
to support LER was the K7 (which was 32bit), we can assume its presence
unconditionally in 64bit mode.
Intel are about to run out of space in Family 6 and start using 19. It is
only the Pentium 4 which uses non-architectural LER MSRs.
percpu_traps_init(), which runs on every CPU, contains a lot of code which
should be init-only, and is the only reason why opt_ler can't be in initdata.
Write a brand new init_ler() which expects all future Intel and AMD CPUs to
continue using the architectural MSRs, and does all setup together. Call it
from trap_init(), and remove the setup logic from percpu_traps_init() except for
the single path configuring MSR_IA32_DEBUGCTLMSR.
Leave behind a warning if the user asked for LER and Xen couldn't enable it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 555866cb56002849014a1409ecdfa3f436c0c2c4
master date: 2025-01-06 12:24:05 +0000
Andrew Cooper [Tue, 21 Jan 2025 08:28:42 +0000 (09:28 +0100)]
x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX
AMD have updated the SRSO whitepaper[1] with further information. These
features exist on AMD Zen5 CPUs and are necessary for Xen to use.
The two features are in principle unrelated:
* SRSO_U/S_NO is an enumeration saying that SRSO attacks can't cross the
User(CPL3) / Supervisor(CPL<3) boundary. i.e. Xen don't need to use
IBPB-on-entry for PV64. PV32 guests are explicitly unsupported for
speculative issues, and excluded from consideration for simplicity.
* SRSO_MSR_FIX is an enumeration identifying that the BP_SPEC_REDUCE bit is
available in MSR_BP_CFG. When set, SRSO attacks can't cross the host/guest
boundary. i.e. Xen don't need to use IBPB-on-entry for HVM.
Extend ibpb_calculations() to account for these when calculating
opt_ibpb_entry_{pv,hvm} defaults. Add a `bp-spec-reduce=<bool>` option to
control the use of BP_SPEC_REDUCE, with it active by default.
Because MSR_BP_CFG is core-scoped with a race condition updating it, repurpose
amd_check_erratum_1485() into amd_check_bp_cfg() and calculate all updates at
once.
Xen also needs to advertise SRSO_U/S_NO to guests to allow the guest kernel
to skip SRSO mitigations too:
* This is trivial for HVM guests. It is also accurate for PV32 guests
too, but we have already excluded them from consideration, and do so again
here to simplify the policy logic.
* As written, SRSO_U/S_NO does not help for the PV64 user->kernel boundary.
However, after discussing with AMD, an implementation detail of having
BP_SPEC_REDUCE active causes the PV64 user->kernel boundary to have the
property described by SRSO_U/S_NO, so we can advertise SRSO_U/S_NO to
guests when the BP_SPEC_REDUCE precondition is met.
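The decisions described in the bullet points above can be summarised as a small boolean model. This is a deliberately simplified sketch: the function names are hypothetical, `def_ibpb_entry` stands for "IBPB-on-entry would otherwise be the default on this CPU", and the real ibpb_calculations() takes more inputs into account.

```c
#include <stdbool.h>

/* PV64: SRSO_U/S_NO means attacks can't cross the user/supervisor boundary. */
static bool demo_ibpb_entry_pv(bool def_ibpb_entry, bool srso_us_no)
{
    return def_ibpb_entry && !srso_us_no;
}

/* HVM: SRSO_MSR_FIX only helps while BP_SPEC_REDUCE is actually set. */
static bool demo_ibpb_entry_hvm(bool def_ibpb_entry, bool srso_msr_fix,
                                bool bp_spec_reduce)
{
    return def_ibpb_entry && !(srso_msr_fix && bp_spec_reduce);
}

/* PV64 guests may see SRSO_U/S_NO when BP_SPEC_REDUCE is active. */
static bool demo_advertise_us_no_pv64(bool srso_msr_fix, bool bp_spec_reduce)
{
    return srso_msr_fix && bp_spec_reduce;
}
```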
Finally, fix a typo in the SRSO_NO's comment.
[1] https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a1746cd4434dd27ca2da8430dfb10edc76264bb3
master date: 2025-01-02 18:44:49 +0000
xen/arch/x86: make objdump output user locale agnostic
The objdump output is fed to grep, so make sure it doesn't change with
different user locales and break the grep parsing.
This problem was identified while updating xen in Debian and the fix is
needed for generating reproducible builds in varying environments.
Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d729221ab74c5d2571e71501dc63838acbf752a
master date: 2024-12-30 21:40:37 +0000
The important part: XZ decompression error: Memory usage limit reached
This looks to be related to the following change in Linux: 8653c909922743bceb4800e5cc26087208c9e0e6 ("xz: use 128 MiB dictionary and force single-threaded mode")
Fix this by increasing the block size to 256MiB. And remove the
misleading comment (from lack of better ideas).
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e6472d46680ccd2b804ad73c19042a5811d036f0
master date: 2024-12-19 17:33:54 +0000
Jan Beulich [Tue, 21 Jan 2025 08:27:45 +0000 (09:27 +0100)]
x86emul: correct #UD check for AVX512-FP16 complex multiplications
avx512_vlen_check()'s argument was inverted, while the surrounding
conditional wrongly forced the EVEX.L'L check for the scalar forms when
embedded rounding was in effect.
Fixes: d14c52cba0f5 ("x86emul: handle AVX512-FP16 complex multiplication insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a30d438ce58b70c5955f5d37f776086ab8f88623
master date: 2024-08-19 15:32:31 +0200
Fixes: 631f535a3d4f ("xen: update ECLAIR service identifiers from MC3R1 to MC3A2.") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 171cb318deaa0be786cc3af3599c72e8909e60f9)
[Xen 4.18 doesn't have R2.1 as clean. However, D4.11 is unclean too]
automation/eclair_analysis: substitute deprecated service STD.emptrecd
The ECLAIR service STD.emptrecd (which checks for empty structures) is being
deprecated; hence, as a preventive measure, STD.anonstct (which checks for
structures with no named members, an UB in C99) is used here; the latter being
a more general case than the previous one, this change does not affect the
analysis. This new service is already supported by the current version of
ECLAIR.
xen: update ECLAIR service identifiers from MC3R1 to MC3A2.
Rename all instances of ECLAIR MISRA C:2012 service identifiers,
identified by the prefix MC3R1, to use the prefix MC3A2, which
refers to MISRA C:2012 Amendment 2 guidelines.
This update is motivated by the need to upgrade ECLAIR GitLab runners
that use the new naming scheme for MISRA C:2012 Amendment 2 guidelines.
Changes to the docs/misra directory are needed in order to keep
comment-based deviation up to date.
Roger Pau Monné [Tue, 17 Dec 2024 11:47:22 +0000 (12:47 +0100)]
x86/io-apic: prevent early exit from i8259 loop detection
Avoid exiting early from the loop when a pin that could be connected to the
i8259 is found, as such early exit would leave the EOI handler translation
array only partially allocated and/or initialized.
Otherwise on systems with multiple IO-APICs and an unmasked ExtINT pin on
any IO-APIC that's not the last one the following NULL pointer dereference
triggers:
(XEN) Enabling APIC mode. Using 2 I/O APICs
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82d040328046>] __ioapic_write_entry+0x83/0x95
[...]
(XEN) Xen call trace:
(XEN) [<ffff82d040328046>] R __ioapic_write_entry+0x83/0x95
(XEN) [<ffff82d04027464b>] F amd_iommu_ioapic_update_ire+0x1ea/0x273
(XEN) [<ffff82d0402755a1>] F iommu_update_ire_from_apic+0xa/0xc
(XEN) [<ffff82d040328056>] F __ioapic_write_entry+0x93/0x95
(XEN) [<ffff82d0403283c1>] F arch/x86/io_apic.c#clear_IO_APIC_pin+0x7c/0x10e
(XEN) [<ffff82d040328480>] F arch/x86/io_apic.c#clear_IO_APIC+0x2d/0x61
(XEN) [<ffff82d0404448b7>] F enable_IO_APIC+0x2e3/0x34f
(XEN) [<ffff82d04044c9b0>] F smp_prepare_cpus+0x254/0x27a
(XEN) [<ffff82d04044bec2>] F __start_xen+0x1ce1/0x23ae
(XEN) [<ffff82d0402033ae>] F __high_start+0x8e/0x90
(XEN)
(XEN) Pagetable walk from 0000000000000000:
(XEN) L4[0x000] = 000000007dbfd063 ffffffffffffffff
(XEN) L3[0x000] = 000000007dbfa063 ffffffffffffffff
(XEN) L2[0x000] = 000000007dbcc063 ffffffffffffffff
(XEN) L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 0000000000000000
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
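The loop shape that avoids the crash above can be sketched as follows. The array and function names are hypothetical; the point is only that per-IO-APIC setup runs for every iteration, while the first matching pin is merely remembered.

```c
#include <stdbool.h>

#define DEMO_MAX_APICS 4

/* Stands in for the per-IO-APIC EOI handler translation array setup. */
static bool demo_eoi_table_ready[DEMO_MAX_APICS];

/* Returns the index of the first IO-APIC with an ExtINT pin, or -1. */
static int demo_scan_ioapics(const bool has_extint[], unsigned int nr)
{
    int found = -1;

    for ( unsigned int apic = 0; apic < nr && apic < DEMO_MAX_APICS; apic++ )
    {
        /* Per-IO-APIC setup must happen for every IO-APIC. */
        demo_eoi_table_ready[apic] = true;

        if ( found < 0 && has_extint[apic] )
            found = (int)apic;  /* remember the pin, but keep iterating */
    }

    return found;
}
```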
Reported-by: Sergii Dmytruk <sergii.dmytruk@3mdeb.com> Fixes: 86001b3970fe ('x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f38fd27c4ceadf7ec4e82e82d0731b6ea415c51e
master date: 2024-12-17 11:15:30 +0100
Andrii Sultanov [Mon, 16 Dec 2024 12:35:10 +0000 (13:35 +0100)]
tools/ocaml: Specify rpath correctly for ocamlmklib
ocamlmklib has special handling for C-like '-Wl,-rpath' option, but does
not know how to handle '-Wl,-rpath-link', as evidenced by warnings like:
"Unknown option
-Wl,-rpath-link=$HOME/xen/tools/ocaml/libs/eventchn/../../../../tools/libs/toollog"
Pass this option directly to the compiler with -ccopt instead.
Also pass -L directly to the linker with -ldopt. This prevents embedding
build-time absolute paths into the binary's RPATH.
Andrew Cooper [Mon, 16 Dec 2024 12:35:02 +0000 (13:35 +0100)]
libs/guest: Fix migration compatibility with a security-patched Xen 4.13
xc_cpuid_apply_policy() provides compatibility for migration of a pre-4.14 VM
where no CPUID data was provided in the stream.
It guesses the various max-leaf limits, based on what was true at the time of
writing, but this was not correctly adapted when speculative security issues
forced the advertisement of new feature bits. Of note are:
* LFENCE-DISPATCH, in leaf 0x80000021.eax
* BHI-CTRL, in leaf 0x7[2].edx
In both cases, a VM booted on a security-patched Xen 4.13, and then migrated
on to any newer version of Xen on the same or compatible hardware would have
these features stripped back because Xen is still editing the cpu-policy for
sanity behind the back of the toolstack.
For VMs using BHI_DIS_S to mitigate Native-BHI, this resulted in a failure to
restore the guests MSR_SPEC_CTRL setting:
(XEN) HVM d7v0 load MSR 0x48 with value 0x401 failed
(XEN) HVM7 restore: failed to load entry 20/0 rc -6
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit") Fixes: f3709b15fc86 ("x86/cpuid: Infrastructure for cpuid word 7:2.edx") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 28301682f492c1df2ff9c3e01a0aab6262bd925a
master date: 2024-12-03 12:20:41 +0000
Roger Pau Monné [Mon, 16 Dec 2024 12:34:43 +0000 (13:34 +0100)]
xen/Kconfig: livepatch-build-tools requires debug information
The tools infrastructure used to build livepatches for Xen
(livepatch-build-tools) consumes some DWARF debug information present in
xen-syms to generate a livepatch (see livepatch-build script usage of readelf
-wi).
The current Kconfig defaults however will enable LIVEPATCH without DEBUG_INFO
on release builds, thus providing a default Kconfig selection that's not
suitable for livepatch-build-tools even when LIVEPATCH support is enabled,
because it's missing the DWARF debug section.
Fix by defaulting DEBUG_INFO to enabled when LIVEPATCH is.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 126b0a6e537ce1d486a29e35cfeec1f222a74d11
master date: 2024-12-02 15:22:05 +0100
Jan Beulich [Mon, 16 Dec 2024 12:34:19 +0000 (13:34 +0100)]
x86emul: MOVBE requires a memory operand
The reg-reg forms should cause #UD; they come into existence only with
APX, where MOVBE also extends BSWAP (for the latter not being "eligible"
to a REX2 prefix).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4c5d9a01f8fa81417a9c431e9624fb71361ec4f9
master date: 2024-12-02 09:50:14 +0100
Andrew Cooper [Wed, 27 Nov 2024 11:42:11 +0000 (12:42 +0100)]
build: Remove -fno-stack-protector-all from EMBEDDED_EXTRA_CFLAGS
This seems to have been introduced in commit f8beb54e2455 ("Disable PIE/SSP
features when building Xen, if GCC supports them.") in 2004.
However, neither GCC nor Clang appear to have ever supported taking the
negated form of -fstack-protector-all, meaning this has been useless since its
introduction.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b661fe107d45713f4e1395f29180cc1ae50bba92
master date: 2024-11-25 13:44:56 +0000
Roger Pau Monné [Wed, 27 Nov 2024 11:42:02 +0000 (12:42 +0100)]
x86/irq: fix calculation of max PV dom0 pIRQs
The current calculation of PV dom0 pIRQs uses:
n = min(fls(num_present_cpus()), dom0_max_vcpus());
The usage of fls() is wrong, as num_present_cpus() already returns the number
of present CPUs, not the bitmap mask of CPUs.
Fix by removing the usage of fls().
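The effect of the stray fls() is easy to see with numbers. The sketch below uses a portable demo_fls() (position of the most significant set bit, 1-based) and hypothetical helper names; it only reproduces the arithmetic of the quoted expression.

```c
/* Find last set: 1-based index of the most significant set bit, 0 for 0. */
static unsigned int demo_fls(unsigned int x)
{
    unsigned int r = 0;

    while ( x )
    {
        r++;
        x >>= 1;
    }

    return r;
}

static unsigned int demo_min(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Old calculation: treats a CPU count as if it were a bitmap. */
static unsigned int demo_n_old(unsigned int present_cpus,
                               unsigned int max_vcpus)
{
    return demo_min(demo_fls(present_cpus), max_vcpus);
}

/* Fixed calculation: uses the CPU count directly. */
static unsigned int demo_n_new(unsigned int present_cpus,
                               unsigned int max_vcpus)
{
    return demo_min(present_cpus, max_vcpus);
}
```

With 8 present CPUs and 8 dom0 vCPUs, the old expression yields 4 while the fixed one yields 8, so the pIRQ limit derived from it was far too small on larger systems.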
Fixes: 7e73a6e7f12a ('have architectures specify the number of PIRQs a hardware domain gets') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5c56361c618e5d05855fc73118c4655f998b8272
master date: 2024-11-25 11:33:06 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:19:08 +0000 (12:19 +0100)]
x86/mm: ensure L2 is always freed if empty
The current logic in modify_xen_mappings() allows for fully empty L2 tables to
not be freed and unhooked from the parent L3 if the last L2 slot is not
populated.
Ensure that even when an L2 slot is empty the logic to check whether the whole
L2 can be removed is not skipped.
Fixes: 4376c05c3113 ('x86-64: use 1GB pages in 1:1 mapping if available') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 41c80496084aa3601230e01c3bc579a0a6d8359a
master date: 2024-11-14 16:13:10 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:18:58 +0000 (12:18 +0100)]
x86/setup: remove bootstrap_map_addr() usage of destroy_xen_mappings()
bootstrap_map_addr() needs to be careful to not remove existing page-table
structures when tearing down mappings, as such pagetable structures might be
needed to fulfill subsequent mapping requests. The comment ahead of the
function already notes that pagetable memory shouldn't be allocated.
Fix this by using map_pages_to_xen(), which does zap the page-table entries,
but does not free page-table structures even when empty.
Fixes: 4376c05c3113 ('x86-64: use 1GB pages in 1:1 mapping if available') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 73194b5701725e53d72b98e568484b2fccaf855c
master date: 2024-11-14 16:12:51 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:18:38 +0000 (12:18 +0100)]
x86/mm: skip super-page alignment checks for non-present entries
INVALID_MFN is ~0, so with all of its bits set it doesn't fulfill the
super-page address alignment checks for L3 and L2 entries. Skip the alignment
checks if the new entry is a non-present one.
This fixes a regression introduced by 0b6b51a69f4d, where the switch from 0 to
INVALID_MFN caused all super-pages to be shattered when attempting to remove
mappings by passing INVALID_MFN instead of 0.
Fixes: 0b6b51a69f4d ('xen/mm: Switch map_pages_to_xen to use MFN typesafe') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/mm: fix alignment check for non-present entries
While the alignment of the mfn is not relevant for non-present entries, the
alignment of the linear address is. Commit 5b52e1b0436f introduced a
regression by not checking the alignment of the linear address when the new
entry was a non-present one.
Fix by always checking the alignment of the linear address; non-present entries
must just skip the alignment check of the physical address.
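The combined rule from the two changes above can be sketched as follows. This
is an illustration only: demo_can_use_superpage() and the DEMO_* constants are
hypothetical names, and the real checks live in the page-table mapping code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12
#define DEMO_L2_ORDER    9                 /* 2M superpage = 512 4K pages */
#define DEMO_INVALID_MFN (~(uint64_t)0)    /* all bits set, never aligned */

/*
 * Decide whether a superpage of 2^order pages may be used at this slot.
 * The linear address is always checked; the MFN only for present entries.
 */
static bool demo_can_use_superpage(uint64_t virt, uint64_t mfn,
                                   bool present, unsigned int order)
{
    uint64_t mask;

    assert(order < 64);
    mask = ((uint64_t)1 << order) - 1;

    if ( (virt >> DEMO_PAGE_SHIFT) & mask )
        return false;                      /* misaligned linear address */

    if ( present && (mfn & mask) )
        return false;                      /* misaligned physical address */

    return true;
}
```

A removal request passing DEMO_INVALID_MFN with present == false therefore
keeps the superpage intact as long as the linear address is aligned.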
Fixes: 5b52e1b0436f ('x86/mm: skip super-page alignment checks for non-present entries') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5b52e1b0436f4adb784562f4d05ae67605ce8ce7
master date: 2024-11-14 16:12:35 +0100
master commit: b1ebb6461a027f07e4a844cae348fbd9cfabe984
master date: 2024-11-15 14:14:12 +0100
Jan Beulich [Mon, 25 Nov 2024 11:18:09 +0000 (12:18 +0100)]
x86emul: avoid double memory read for RORX
Originally only twobyte_table[0x3a] determined what part of generic
operand fetching (near the top of x86_emulate()) comes into play. When
ext0f3a_table[] was added, ->desc was updated to properly describe the
ModR/M byte's function. With that generic source operand fetching came
into play for RORX, rendering the explicit fetching in the respective
case block redundant (and wrong at the very least when MMIO with side
effects is accessed).
While there also make a purely cosmetic / documentary adjustment to
ext0f3a_table[]: RORX really is a 2-operand insn, MOV-like in that it
only writes its destination register.
Fixes: 9f7f5f6bc95b ("x86emul: add tables for 0f38 and 0f3a extension space") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 939a9e800c4156677c10c6cf08fde071e9b86eaf
master date: 2024-11-14 13:03:18 +0100
Jan Beulich [Mon, 25 Nov 2024 11:17:49 +0000 (12:17 +0100)]
x86emul: ignore VEX.W for BMI{1,2} insns in 32-bit mode
While result values and other status flags are unaffected as long as we
can ignore the case of registers having their upper 32 bits non-zero
outside of 64-bit mode, EFLAGS.SF may obtain a wrong value when we
mistakenly re-execute the original insn with VEX.W set.
Note that the guest memory access, if any, is correctly carried out as
32-bit regardless of VEX.W. The emulator-local memory operand will be
accessed as a 64-bit quantity, but it is pre-initialised to zero so no
internal state can leak.
Fixes: 771daacd197a ("x86emul: support BMI1 insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1179d51dcb7d93111bfb35172c75eb5a73fe6a43
master date: 2024-11-14 13:00:57 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:16:34 +0000 (12:16 +0100)]
x86/cpu-policy: Extend the guest max policy max leaf/subleaves
We already have one migration case opencoded (feat.max_subleaf). A more
recent discovery is that we advertise x2APIC to guests without ensuring that
we provide max_leaf >= 0xb.
In general, any leaf known to Xen can be safely configured by the toolstack if
it doesn't violate other constraints.
Therefore, introduce guest_common_{max,default}_leaves() to generalise the
special case we currently have for feat.max_subleaf, in preparation to be able
to provide x2APIC topology in leaf 0xb even on older hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa2d8318033e468a4ded1fc3d721dc3e019e449b
master date: 2024-10-30 17:34:32 +0000
Roger Pau Monné [Mon, 25 Nov 2024 11:16:09 +0000 (12:16 +0100)]
x86/alternatives: do not BUG during apply
alternatives is used both at boot time, and when loading livepatch payloads.
While for the former it makes sense to panic, it's not useful for the latter, as
for livepatches it's possible to fail to load the livepatch if alternatives
cannot be resolved and continue operating normally.
Relax the BUGs in _apply_alternatives() to instead return an error code. The
caller will figure out whether the failures are fatal and panic.
Print an error message to provide some user-readable information about what
went wrong.
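The pattern described above, reporting a problem and returning an error so
the caller can decide whether it is fatal, might look like the sketch below.
All names (demo_apply_one(), demo_apply_all(), DEMO_NR_FEATURES) are
hypothetical and the validation shown is an assumed example, not the checks
performed by _apply_alternatives().

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define DEMO_NR_FEATURES 64u /* hypothetical size of the known feature set */

/* Validate one patch site; report and return an error rather than crashing. */
static int demo_apply_one(unsigned int feature)
{
    if ( feature >= DEMO_NR_FEATURES )
    {
        fprintf(stderr, "alt: feature %u out of range (max %u)\n",
                feature, DEMO_NR_FEATURES - 1);
        return -EINVAL;
    }

    return 0;
}

/* Apply all sites, stopping at the first failure. */
static int demo_apply_all(const unsigned int *features, unsigned int nr)
{
    assert(features != NULL || nr == 0);

    for ( unsigned int i = 0; i < nr; i++ )
    {
        int rc = demo_apply_one(features[i]);

        if ( rc )
            return rc;
    }

    return 0;
}
```

A boot-time caller would treat a non-zero return as fatal, while a livepatch
caller would reject the payload and carry on.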
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: aa5a06d5d6eda291afa3849ab143ffcd69b0d4e6
master date: 2024-09-26 14:18:03 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:15:56 +0000 (12:15 +0100)]
xen/livepatch: do Xen build-id check earlier
The check against the expected Xen build ID should be done ahead of attempting
to apply the alternatives contained in the livepatch.
If the CPUID in the alternatives patching data is out of the scope of the
running Xen featureset the BUG() in _apply_alternatives() will trigger thus
bringing the system down. Note the layout of struct alt_instr could also
change between versions. It's also possible for struct exception_table_entry
to have changed format, hence leading to other kinds of errors if parsing of the
payload is done ahead of checking if the Xen build-id matches.
Move the Xen build ID check as early as possible. To do so introduce a new
check_xen_buildid() function that parses and checks the Xen build-id before
moving the payload. Since the expected Xen build-id is used early to
detect whether the livepatch payload could be loaded, there's no reason to
store it in the payload struct, as a non-matching Xen build-id won't get the
payload populated in the first place.
Note printing the expected Xen build ID as part of dumping the payload
information is no longer done: all loaded payloads would have Xen build IDs
matching the running Xen, otherwise they would have failed to load.
Fixes: 879615f5db1d ('livepatch: Always check hypervisor build ID upon livepatch upload') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fa49f4be413cdabde5b2264fc85d2710b15ea691
master date: 2024-09-26 14:18:03 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:15:29 +0000 (12:15 +0100)]
xen/livepatch: simplify and unify logic in prepare_payload()
The following sections: .note.gnu.build-id, .livepatch.xen_depends and
.livepatch.depends are mandatory and ensured to be present by
check_special_sections() before prepare_payload() is called.
Simplify the logic in prepare_payload() by introducing a generic function to
parse the sections that contain a buildid. Note the function assumes the
buildid related section to always be present.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 86d09d16dd74298b19a03df492d9503f20cfc17c
master date: 2024-09-26 14:18:03 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:15:00 +0000 (12:15 +0100)]
xen/livepatch: drop load_addr Elf section field
The Elf loading logic will initially use the `data` section field to stash a
pointer to the temporary loaded data (from the buffer allocated in
livepatch_upload()), which is later relocated and the new pointer stashed in
`load_addr`.
Remove this dual field usage and use an `addr` field uniformly. Initially data will
point to the temporary buffer, until relocation happens, at which point the
pointer will be updated to the relocated address.
This avoids leaving a dangling pointer in the `data` field once the temporary
buffer is freed by livepatch_upload().
Note the `addr` field cannot retain the const attribute from the previous
`data` field, as there's logic that performs manipulations against the loaded
sections, like applying relocations or sorting the exception table.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 8c81423038f17f6cbc853dd35d69d50a4458f764
master date: 2024-09-26 14:18:03 +0100
Frediano Ziglio [Mon, 25 Nov 2024 11:14:13 +0000 (12:14 +0100)]
x86/boot: Preserve the value clobbered by the load-base calculation
Right now, Xen clobbers the value at 0xffc when performing its load-base
calculation. We've got plenty of free registers at this point, so the value
can be preserved easily.
This fixes a real bug booting under Coreboot+SeaBIOS, where 0xffc happens to
be the cbmem pointer (e.g. Coreboot's dmesg ring, among other things).
However, there's also a better choice of memory location to use than 0xffc, as
all our supported boot protocols have a pointer to an info structure in %ebx.
Update the documentation to match.
Fixes: 1695e53851e5 ("x86/boot: Fix the boot time relocation calculations") Fixes: d96bb172e8c9 ("x86/entry: Early PVH boot code") Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e58a2858d588ed57ca13200f3d3148d78ad0e491
master date: 2024-08-27 18:08:19 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:13:47 +0000 (12:13 +0100)]
tools/ocaml: Fix the version embedded in META files
Xen 4.1 is more than a decade stale now. Use the same mechanism as elsewhere
in the tree to get the current version number.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Edwin Török <edwin.torok@cloud.com>
master commit: 1965e9a930740b37637ac450f4752fd53edf63c4
master date: 2024-08-23 15:02:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:13:28 +0000 (12:13 +0100)]
tools/ocaml: Drop the OCAMLOPTFLAG_G invocation
These days, `ocamlopt -h` asks you whether you meant --help instead, meaning
that the $(shell ) invocation here isn't going to end up containing '-g'.
Make it unconditional, like it is in OCAMLCFLAGS already.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Edwin Török <edwin.torok@cloud.com>
master commit: 126293eae6485089471ebdfd91fe944a0274e613
master date: 2024-08-23 15:02:27 +0100
The current logic to choose the preferred reboot method is based on the mode Xen
has been booted into, so if the box is booted from UEFI, the preferred reboot
method will be to use the ResetSystem() run time service call.
However, that method seems to be widely untested, and quite often leads to a
result similar to:
****************************************
Panic on CPU 0:
FATAL TRAP: vector = 6 (invalid opcode)
****************************************
Which in most cases does lead to a reboot, however that's unreliable.
Change the default reboot preference to prefer ACPI over UEFI if available and
not in reduced hardware mode.
This is in line with what Linux does, so it's unlikely to cause issues on current
and future hardware, since there's a much higher chance of vendors testing
hardware with Linux rather than Xen.
Add a special case for one Acer model that does require being rebooted using
ResetSystem(). See Linux commit 0082517fa4bce for rationale.
x86/viridian: Clarify some viridian logging strings
It's sadistically misleading to show an error without letters and expect
the dmesg reader to understand it's in hex. The patch adds a 0x prefix
to all hex numbers that don't already have it.
On the one instance in which a boolean is printed as an integer, print
it as a decimal integer instead so it's 0/1 in the common case and not
misleading if it's ever not just that due to a bug.
While at it, rename VIRIDIAN CRASH to VIRIDIAN GUEST_CRASH. Every member
of a support team that looks at the message systematically believes
"viridian" crashed, which is absolutely not what goes on. It's the guest
asking the hypervisor for a sudden shutdown because it crashed, and
stating why.
Andrew Cooper [Mon, 25 Nov 2024 11:11:04 +0000 (12:11 +0100)]
tools/libxs: Stop playing with SIGPIPE
It's very rude for a library to play with signals behind the back of the
application, no matter one's views on the default behaviour of SIGPIPE under
POSIX. Even if the application doesn't care about the xenstored socket, it may
care about others.
This logic has existed since xenstore/xenstored was originally added in commit 29c9e570b1ed ("Add xenstore daemon and library") in 2005.
It's also unnecessary. Pass MSG_NOSIGNAL when talking to xenstored over a
pipe (to avoid succumbing to SIGPIPE if xenstored has crashed), and forgo any
playing with the signal disposition.
This has a side benefit of saving 2 syscalls per xenstore request.
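As an illustration of the MSG_NOSIGNAL approach, a minimal sketch using a
socketpair in place of the xenstored connection. The demo_*() names are
hypothetical and this is not the libxs code.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send on a socket without risking SIGPIPE: a vanished peer is reported
 * as a -1 return with errno set (typically EPIPE) instead of a signal.
 */
static ssize_t demo_send_nosignal(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return send(fd, buf, len, MSG_NOSIGNAL);
}

/*
 * Close one end of a socketpair, send on the other, and return the errno
 * seen (0 if the send unexpectedly succeeded, -1 if setup failed).
 */
static int demo_errno_after_peer_close(void)
{
    int sv[2], err = 0;

    if ( socketpair(AF_UNIX, SOCK_STREAM, 0, sv) )
        return -1;

    close(sv[1]);

    if ( demo_send_nosignal(sv[0], "x", 1) < 0 )
        err = errno;

    close(sv[0]);

    return err;
}
```

No signal() or sigaction() calls are needed around the send, so the
application's own SIGPIPE disposition is left untouched.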
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: a17b6db9b00784b409c35e3017dc45aed1ec2bfb
master date: 2024-07-23 15:11:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:10:34 +0000 (12:10 +0100)]
tools/libxs: Use writev()/sendmsg() instead of write()
With the input data now conveniently arranged, use writev()/sendmsg() instead
of decomposing it into write() calls.
This causes all requests to be submitted with a single system call, rather
than at least two. While in principle short writes can occur, the chances of
it happening are slim given that most xenbus comms are only a handful of
bytes.
Nevertheless, provide {writev,sendmsg}_exact() wrappers which take care of
resubmitting on EINTR or short write.
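A wrapper of this kind could be sketched as below. demo_writev_exact() is a
hypothetical name and the in-place iovec adjustment is one possible design,
not necessarily what libxs does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write every byte described by iov[0..nr-1], retrying on EINTR and on
 * short writes.  The iovec array is modified in place as data is consumed.
 */
static bool demo_writev_exact(int fd, struct iovec *iov, unsigned int nr)
{
    assert(iov != NULL || nr == 0);

    while ( nr )
    {
        ssize_t res = writev(fd, iov, nr);
        size_t done;

        if ( res < 0 )
        {
            if ( errno == EINTR )
                continue;
            return false;
        }

        done = (size_t)res;

        /* Skip over elements which have been written completely. */
        while ( nr && done >= iov->iov_len )
        {
            done -= iov->iov_len;
            iov++;
            nr--;
        }

        /* Advance within a partially written element. */
        if ( nr )
        {
            iov->iov_base = (char *)iov->iov_base + done;
            iov->iov_len -= done;
        }
    }

    return true;
}
```

In the common case the loop body runs once, which is the single system call
per request mentioned above.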
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ebaeb0c64a6d363313e213eb9995f48307604ebb
master date: 2024-07-23 15:11:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:10:07 +0000 (12:10 +0100)]
tools/libxs: Rework xs_talkv() to take xsd_sockmsg within the iovec
We would like to writev() the whole outgoing message, but this is hard given
the current need to prepend the locally-constructed xsd_sockmsg.
Instead, have the caller provide xsd_sockmsg in iovec[0]. This in turn drops
the 't' and 'type' parameters from xs_talkv().
Note that xs_talkv() may alter the iovec structure. This may happen when
writev() is really used under the covers, and it's preferable to having the
lower levels need to duplicate the iovec to edit it upon encountering a short
write. xs_directory_part() is the only function impacted by this, and it's
easy to rearrange to be compatible.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: e2a93bed8b9e0f0c4779dcd4b10dc7ba2a959fbc
master date: 2024-07-23 15:11:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:09:51 +0000 (12:09 +0100)]
tools/libxs: Fix length check in xs_talkv()
If the sum of iov element lengths overflows, the XENSTORE_PAYLOAD_MAX check can
pass, after which we'll write 4G of data with a good-looking length field, and
the remainder of the payload will be interpreted as subsequent commands.
Check each iov element length for XENSTORE_PAYLOAD_MAX before accumulating it.
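A sketch of the difference between summing first and checking each element.
DEMO_PAYLOAD_MAX is a stand-in for XENSTORE_PAYLOAD_MAX and both function
names are hypothetical; this is not the xs_talkv() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

#define DEMO_PAYLOAD_MAX 4096u /* stand-in for XENSTORE_PAYLOAD_MAX */

/* Naive form: a 32-bit sum can wrap around and look small. */
static uint32_t demo_naive_sum(const struct iovec *iov, unsigned int nr)
{
    uint32_t len = 0;

    for ( unsigned int i = 0; i < nr; i++ )
        len += (uint32_t)iov[i].iov_len;

    return len;
}

/*
 * Checked form: reject any oversized element before adding it, and bound
 * the running total, so the accumulator can never wrap.
 */
static bool demo_payload_len(const struct iovec *iov, unsigned int nr,
                             uint32_t *len_out)
{
    uint32_t len = 0;

    assert(len_out != NULL);

    for ( unsigned int i = 0; i < nr; i++ )
    {
        if ( iov[i].iov_len > DEMO_PAYLOAD_MAX )
            return false;

        len += (uint32_t)iov[i].iov_len;

        if ( len > DEMO_PAYLOAD_MAX )
            return false;
    }

    *len_out = len;
    return true;
}
```

For element lengths 0xFFFFFFFF and 2, the naive sum wraps to 1 and would pass
a check made only on the total, while the checked form rejects the request.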
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 42db2deb5e7617f0459b68cd73ab503938356186
master date: 2024-07-23 15:11:27 +0100