Daniel P. Smith [Tue, 29 Oct 2024 15:42:29 +0000 (16:42 +0100)]
x86/boot: Fix XSM module handling during PVH boot
As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.
This pattern has been in use since before PVH support was added. This has
most likely gone unnoticed because no-one's tried using a detached Flask
policy in a PVH VM before.
Plumb the boot_info pointer down, replacing module_map and mbi. Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.
As this is the final non-bi use of mbi in __start_xen(), make the pointer
unusable once bi has been established, to prevent new uses creeping back in.
This is a stopgap until mbi can be fully removed.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6cf0aaeb8df951fb34679f0408461a5c67cb02c6
master date: 2024-10-23 18:14:24 +0100
Daniel P. Smith [Tue, 29 Oct 2024 15:42:16 +0000 (16:42 +0100)]
x86/boot: Fix microcode module handling during PVH boot
As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.
This pattern has been in use since before PVH support was added. Inside a PVH
VM, it will go unnoticed as long as the microcode container parser doesn't
choke on the random data it finds.
The use within early_microcode_init() happens to be safe because it's prior to
move_xen(). microcode_init_cache() is after move_xen(), and therefore unsafe.
Plumb the boot_info pointer down, replacing module_map and mbi. Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.
Note: microcode_scan_module() is still bogusly stashing a bootstrap_map()'d
pointer in ucode_blob.data, which constitutes a different
use-after-free, and only works in general because of a second bug. This
is unrelated to PVH, and needs untangling differently.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ddf63a252a6eae6e619ba2df9ad6b6f82e660c1
master date: 2024-10-23 18:14:24 +0100
Roger Pau Monné [Tue, 29 Oct 2024 15:41:42 +0000 (16:41 +0100)]
iommu/amd-vi: do not error if device referenced in IVMD is not behind any IOMMU
The IVMD table describes memory which must be mandatorily assigned to devices
(and which permissions to use), or memory that should never be accessible to
devices.
Some hardware however contains ranges in IVMD that reference devices outside of
the IVHD tables (in other words, devices not behind any IOMMU). Such mismatch
will cause Xen to fail in register_range_for_device(), ultimately leading to
the IOMMU being disabled, and Xen crashing as x2APIC support might be already
enabled and relying on the IOMMU functionality.
Relax IVMD parsing: allow IVMD blocks to reference devices not assigned to any
IOMMU. It's impossible for Xen to fulfill the requirement in the IVMD block if
the device is not behind any IOMMU, but it's no worse than booting without
IOMMU support, and thus not parsing ACPI IVRS in the first place.
Andrew Cooper [Tue, 29 Oct 2024 15:41:30 +0000 (16:41 +0100)]
xen/spinlock: Fix UBSAN "load of address with insufficient space" in lock_prof_init()
UBSAN complains:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/spinlock.c:794:10
(XEN) load of address ffff82d040ae24c8 with insufficient space
(XEN) for an object of type 'struct lock_profile *'
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y ubsan=y Tainted: C ]----
This shows up with GCC-14, but not with GCC-12. I have not bisected further.
Either way, the types for __lock_profile_{start,end} are incorrect.
They are an array of struct lock_profile pointers. Correct the extern's
types, and adjust the loop to match.
No practical change.
Reported-by: Andreas Glashauser <ag@andreasglashauser.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 542ac112fc68c66cfafc577e252404c21da4f75b
master date: 2024-10-14 16:14:26 +0100
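The type correction in lock_prof_init() can be illustrated with a minimal sketch. It assumes a simplified, hypothetical struct lock_profile and models the linker-provided __lock_profile_{start,end} bounds as an ordinary array of pointers passed in by the caller. The point is only that the loop variable is a pointer to a pointer, walking [start, end).

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for Xen's struct lock_profile. */
struct lock_profile {
    const char *name;
    struct lock_profile *next;
};

/*
 * Walk a table of pointers delimited by [start, end) and chain every
 * entry onto a singly linked list.  Returns the list head.
 */
struct lock_profile *chain_profiles(struct lock_profile **start,
                                    struct lock_profile **end)
{
    struct lock_profile *head = NULL;
    struct lock_profile **q;

    for ( q = start; q < end; q++ )
    {
        (*q)->next = head;
        head = *q;
    }

    return head;
}

/* Count the entries on a chained list. */
size_t count_profiles(const struct lock_profile *head)
{
    size_t n = 0;

    for ( ; head; head = head->next )
        n++;

    return n;
}
```

Declaring the bounds as arrays of pointers lets the compiler see that each element is a pointer-sized object, which is the shape the loop actually dereferences.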
Roger Pau Monné [Tue, 29 Oct 2024 15:40:58 +0000 (16:40 +0100)]
x86/domctl: fix maximum number of MSRs in XEN_DOMCTL_{get,set}_vcpu_msrs
Since the addition of the MSR_AMD64_DR{0-3}_ADDRESS_MASK MSRs to the
msrs_to_send array, the calculation of the maximum number of MSRs that
the hypercall can handle is off by 4.
Remove the addition of 4 to the maximum number of MSRs that
XEN_DOMCTL_{set,get}_vcpu_msrs supports, as those are already part of the
array.
A further adjustment could be to subtract 4 from the maximum size if the DBEXT
CPUID feature is not exposed to the guest, but guest_{rd,wr}msr() will already
perform that check when fetching or loading the MSRs. The maximum array size is
used to tell the caller how large a buffer it needs to allocate in the get case,
and as an early input sanitisation in the set case; using a buffer size slightly
larger than required is not an issue.
Fixes: 86d47adcd3c4 ('x86/msr: Handle MSR_AMD64_DR{0-3}_ADDRESS_MASK in the new MSR infrastructure') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c95cd5f9c5a8c1c6ab1b0b366d829fa8561958fd
master date: 2024-10-08 14:37:53 +0200
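The off-by-4 above can be sketched as follows. The table contents are placeholders (the EXAMPLE_* names and values are hypothetical, not Xen's real msrs_to_send[] entries). What matters is that once the four address-mask MSRs live in the array, the array size alone is the maximum.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Placeholder MSR indices, for illustration only. */
#define EXAMPLE_MSR_A         0x1000u
#define EXAMPLE_MSR_B         0x1001u
#define EXAMPLE_DR0_ADDR_MASK 0x2000u
#define EXAMPLE_DR1_ADDR_MASK 0x2001u
#define EXAMPLE_DR2_ADDR_MASK 0x2002u
#define EXAMPLE_DR3_ADDR_MASK 0x2003u

/* The four address-mask MSRs are already listed in the table. */
static const uint32_t msrs_to_send[] = {
    EXAMPLE_MSR_A,
    EXAMPLE_MSR_B,
    EXAMPLE_DR0_ADDR_MASK,
    EXAMPLE_DR1_ADDR_MASK,
    EXAMPLE_DR2_ADDR_MASK,
    EXAMPLE_DR3_ADDR_MASK,
};

/* Old calculation: counts the address-mask MSRs a second time. */
size_t max_msrs_old(void)
{
    return ARRAY_SIZE(msrs_to_send) + 4;
}

/* Corrected calculation: the array size is already the maximum. */
size_t max_msrs_fixed(void)
{
    return ARRAY_SIZE(msrs_to_send);
}
```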
Jan Beulich [Tue, 29 Oct 2024 15:40:46 +0000 (16:40 +0100)]
ioreq: don't wrongly claim "success" in ioreq_send_buffered()
Returning a literal number is a bad idea anyway when all other returns
use IOREQ_STATUS_* values. The function is dead on Arm, and mapping to
X86EMUL_OKAY is surely wrong on x86.
Fixes: f6bf39f84f82 ("x86/hvm: add support for broadcast of buffered ioreqs...") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 2e0b545b847df7d4feb07308d50bad708bd35a66
master date: 2024-10-08 14:36:27 +0200
Roger Pau Monné [Tue, 29 Oct 2024 15:39:43 +0000 (16:39 +0100)]
x86/dpci: do not leak pending interrupts on CPU offline
The current dpci logic relies on a softirq being executed as a side effect of
the cpu_notifier_call_chain() call in the code path that offlines the target
CPU. However the call to cpu_notifier_call_chain() won't trigger any softirq
processing, and even if it did, such processing should be done after all
interrupts have been migrated off the current CPU, otherwise new pending dpci
interrupts could still appear.
Currently the ASSERT() in the cpu callback notifier is fairly easy to trigger
by doing CPU offline from a PVH dom0.
Solve this by instead moving out any dpci interrupts pending processing once
the CPU is dead. This might introduce more latency than attempting to drain
before the CPU is put offline, but it's less complex, and CPU online/offline is
not a common action. Any extra introduced latency should be tolerable.
Fixes: f6dd295381f4 ('dpci: replace tasklet with softirq') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 29555668b5725b9d5393b72bfe7ff9a3fa606714
master date: 2024-10-07 11:10:21 +0200
Andrew Cooper [Tue, 29 Oct 2024 15:39:22 +0000 (16:39 +0100)]
stubdom: Fix newlib build with GCC-14
Based on a fix from OpenSUSE, but adjusted to be Clang-compatible too. Pass
-Wno-implicit-function-declaration library-wide rather than using local GCC
pragmas.
Fix up copy_past_newline() to avoid triggering -Wstrict-prototypes.
Andrew Cooper [Tue, 29 Oct 2024 15:38:41 +0000 (16:38 +0100)]
x86/pv: Rename pv.iobmp_limit to iobmp_nr and clarify behaviour
Ever since its introduction in commit 013351bd7ab3 ("Define new event-channel
and physdev hypercalls") in 2006, the public interface was named nr_ports
while the internal field was called iobmp_limit.
Rename the internal field to iobmp_nr to match the public interface, and
clarify that, when nonzero, Xen will read 2 bytes.
There isn't a perfect parallel with a real TSS, but iobmp_nr being 0 is the
paravirt "no IOPB" case, and it is important that no read occurs in this case.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 633ee8b2df963f7e5cb8de1219c1a48bfb4447f6
master date: 2024-10-01 14:58:18 +0100
Andrew Cooper [Tue, 29 Oct 2024 15:38:29 +0000 (16:38 +0100)]
x86/pv: Handle #PF correctly when reading the IO permission bitmap
The switch statement in guest_io_okay() is a very expensive way of
pre-initialising x with ~0, and performing a partial read into it.
However, the logic isn't correct either.
In a real TSS, the CPU always reads two bytes (like here), and any TSS limit
violation turns silently into no-access. But, in-limit accesses trigger #PF
as usual. AMD document this property explicitly, and while Intel don't (so
far as I can tell), they do behave consistently with AMD.
Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like
everything else in this file. This removes code generation setting up
copy_from_user_hvm() (in the likely path even), and safety LFENCEs from
evaluate_nospec().
Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than
disallowing the IO port access. This brings the behaviour better in line with
normal x86.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8a6c495d725408d333c1b47bb8af44615a5bfb18
master date: 2024-10-01 14:58:18 +0100
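The IO bitmap behaviour described in the two commits above can be modelled in a few lines. This is a userspace sketch under stated assumptions, not Xen's guest_io_okay(): struct guest_bitmap, io_okay() and the three result codes are hypothetical, guest memory is modelled as a buffer whose first `mapped` bytes are readable, and the IOPL check is left out.

```c
#include <stddef.h>
#include <stdint.h>

enum io_check { IO_ALLOWED, IO_DENIED, IO_FAULT };

/* Hypothetical model of the guest's IO permission bitmap. */
struct guest_bitmap {
    const uint8_t *data;   /* bitmap bytes */
    size_t mapped;         /* bytes readable without faulting */
    unsigned int nr_ports; /* limit; 0 means "no IOPB" */
};

enum io_check io_okay(const struct guest_bitmap *bm,
                      unsigned int port, unsigned int bytes)
{
    size_t off = port / 8;
    unsigned int x, mask;

    /* Limit violation (including nr_ports == 0): no access, and no read. */
    if ( port + bytes > bm->nr_ports )
        return IO_DENIED;

    /* In-limit: always read two bytes.  A failed read is a fault. */
    if ( off + 2 > bm->mapped )
        return IO_FAULT;

    x = bm->data[off] | ((unsigned int)bm->data[off + 1] << 8);

    /* Every bit covering [port, port + bytes) must be clear. */
    mask = ((1u << bytes) - 1) << (port & 7);

    return (x & mask) ? IO_DENIED : IO_ALLOWED;
}
```

Keeping the fault distinct from denial is what lets the caller raise #PF rather than silently refusing the port access.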
Andrew Cooper [Tue, 29 Oct 2024 15:38:17 +0000 (16:38 +0100)]
x86/pv: Rework guest_io_okay() to return X86EMUL_*
In order to fix a bug with guest_io_okay() (subsequent patch), rework
guest_io_okay() to take in an emulation context, and return X86EMUL_* rather
than a boolean.
For the failing case, take the opportunity to inject #GP explicitly, rather
than returning X86EMUL_UNHANDLEABLE. There is a logical difference between
"we know what this is, and it's #GP", vs "we don't know what this is".
There is no change in practice as emulation is the final step on general #GP
resolution, but returning X86EMUL_UNHANDLEABLE would be a latent bug if a
subsequent action were to appear.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7429e1cc071b0e20ea9581da4893fb9b2f6d21d4
master date: 2024-10-01 14:58:18 +0100
x86/traps: Re-enable interrupts after reading cr2 in the #PF handler
Hitting a page fault clobbers %cr2, so if a page fault is handled while
handling a previous page fault then %cr2 will hold the address of the
latter fault rather than the former. In particular, if a debug key
handler happens to trigger during #PF and before %cr2 is read, and that
handler itself encounters a #PF, then %cr2 will be corrupt for the outer #PF
handler.
This patch makes the page fault path delay re-enabling IRQs until %cr2
has been read in order to ensure it stays consistent.
A similar argument holds in additional cases, but they happen to be safe:
* %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
* MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.
While in the area, remove redundant q suffix to a movq in entry.S and
the space after the comma.
Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.") Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b06e76db7c35974f1b127762683e7852ca0c8e76
master date: 2024-10-01 09:45:49 +0200
Taking a fault on a non-byte-granular insn means that the "number of
bytes not handled" return value would need extra care in calculating, if
we want callers to be able to derive e.g. exception context (to be
injected to the guest) - CR2 for #PF in particular - from the value. To
simplify things rather than complicating them, reduce inline assembly to
just byte-granular string insns. On recent CPUs that's also supposed to
be more efficient anyway.
For singular element accessors, however, alignment checks are added,
hence slightly complicating the code. Misaligned (user) buffer accesses
will now be forwarded to copy_{from,to}_guest_ll().
Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the
same way, as they're produced by mere re-processing of the same code.
Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their
comments made to match reality; down the road we may want to change their
return types, e.g. to bool.
Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 67a8e5721e1ea9c28526883036bf08fb2e8a8c9c
master date: 2024-10-01 09:44:55 +0200
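Why byte granularity makes the "bytes not handled" value easy to use can be shown with a small model. It is a sketch only: copy_bytes() and fault_offset() are hypothetical helpers, and a fault is simulated by a destination that accepts just its first `to_limit` bytes, rather than by real inline assembly with exception fixups.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n bytes one at a time, stopping at the first byte that would
 * "fault" (modelled as reaching to_limit).  Returns the number of bytes
 * NOT copied, in the style of the copy_*_ll() helpers.
 */
size_t copy_bytes(uint8_t *to, size_t to_limit,
                  const uint8_t *from, size_t n)
{
    size_t done = 0;

    while ( done < n && done < to_limit )
    {
        to[done] = from[done];
        done++;
    }

    return n - done;
}

/*
 * Offset of the faulting byte from the start of the buffer.  This is
 * exact only because the copy advances one byte at a time; with wider
 * accesses the fault could sit anywhere inside the failing element.
 */
size_t fault_offset(size_t n, size_t not_copied)
{
    return n - not_copied;
}
```

A caller wanting to report CR2 for an injected #PF could add fault_offset() to the buffer's base address.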
xen/ucode: Fix buffer under-run when parsing AMD containers
The AMD container format has no formal spec. It is, at best, precision
guesswork based on AMD's prior contributions to open source projects. The
Equivalence Table has both an explicit length, and an expectation of having a
NULL entry at the end.
Xen was sanity checking the NULL entry, but without confirming that an entry
was present, resulting in a read off the front of the buffer. With some
manual debugging/annotations this manifests as:
(XEN) *** Buf ffff83204c00b19c, eq ffff83204c00b194
(XEN) *** eq: 0c 00 00 00 44 4d 41 00 00 00 00 00 00 00 00 00 aa aa aa aa
^-Actual buffer-------------------^
(XEN) *** installed_cpu: 000c
(XEN) microcode: Bad equivalent cpu table
(XEN) Parsing microcode blob error -22
When loaded by hypercall, the 4 bytes interpreted as installed_cpu happen to
be the containing struct ucode_buf's len field, and luckily will be nonzero.
When loaded at boot, it's possible for the access to #PF if the module happens
to have been placed on a 2M boundary by the bootloader. Under Linux, it will
commonly be the end of the CPIO header.
Drop the probe of the NULL entry; Nothing else cares. A container without one
is well formed, insofar that we can still parse it correctly. With this
dropped, the same container results in:
(XEN) microcode: couldn't find any matching ucode in the provided blob!
Fixes: 4de936a38aa9 ("x86/ucode/amd: Rework parsing logic in cpu_request_microcode()") Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a8bf14f6f331d4f428010b4277b67c33f561ed19
master date: 2024-09-13 15:23:30 +0100
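The under-run above comes from indexing the last entry of a table that may contain no entries. The sketch below shows the missing guard for illustration only, since the actual fix drops the probe altogether. The struct layout is an assumption about the Equivalence Table entry format, and has_null_terminator() is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one Equivalence Table entry (16 bytes). */
struct equiv_cpu_entry {
    uint32_t installed_cpu;
    uint32_t fixed_errata_mask;
    uint32_t fixed_errata_compare;
    uint16_t equiv_cpu;
    uint16_t reserved;
};

/*
 * Report whether the table of `len` bytes ends in a NULL entry.
 * Without the nr == 0 check, a table shorter than one entry would be
 * probed at eq[-1], i.e. off the front of the buffer.
 */
bool has_null_terminator(const struct equiv_cpu_entry *eq, size_t len)
{
    size_t nr = len / sizeof(*eq);

    if ( nr == 0 )
        return false;

    return eq[nr - 1].installed_cpu == 0;
}
```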
blkif: reconcile protocol specification with in-use implementations
Current blkif implementations (both backends and frontends) all differ slightly
in how they handle the 'sector-size' xenstore node, and in whether other fields
are derived from this value or hardcoded to be expressed in units of 512 bytes.
To give some context, this is an excerpt of how different implementations use
the value in 'sector-size' as the base unit for other fields rather than
just to set the logical sector size of the block device:
An attempt was made by 67e1c050e36b in order to change the base units of the
request fields and the xenstore 'sectors' node. That however only led to more
confusion, as the specification now clearly diverged from the reference
implementation in Linux. Such change was only implemented for QEMU Qdisk
and Windows PV blkfront.
Partially revert to the state before 67e1c050e36b while adjusting the
documentation for 'sectors' to match what it used to be previous to 2fa701e5346d:
* Declare 'feature-large-sector-size' deprecated. Frontends should not expose
the node, backends should not make decisions based on its presence.
* Clarify that 'sectors' xenstore node and the requests fields are always in
512-byte units, like it was previous to 2fa701e5346d and 67e1c050e36b.
All base units for the fields used in the protocol are 512-byte based, the
xenbus 'sector-size' field is only used to signal the logical block size. When
'sector-size' is greater than 512, blkfront implementations must make sure that
the offsets and sizes (despite being expressed in 512-byte units) are aligned
to the logical block size specified in 'sector-size', otherwise the backend
will fail to process the requests.
This will require changes to some of the frontends and backends in order to
properly support 'sector-size' nodes greater than 512.
Fixes: 2fa701e5346d ('blkif.h: Provide more complete documentation of the blkif interface') Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 221f2748e8dabe8361b8cdfcffbeab9102c4c899
master date: 2024-09-12 14:04:56 +0200
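The alignment rule above can be expressed as a small check that a frontend might run before queueing a request. This is a sketch, not code from any existing blkfront: BLKIF_UNIT, request_is_aligned() and device_size_bytes() are hypothetical names, and the arguments mirror the protocol's sector number and sector count expressed in 512-byte units.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLKIF_UNIT 512u /* base unit of all protocol fields */

/*
 * Offsets and sizes are in 512-byte units, but must be aligned to the
 * logical block size advertised in 'sector-size'.
 */
bool request_is_aligned(uint64_t sector_number, uint32_t nr_units,
                        uint32_t sector_size)
{
    uint32_t units_per_block;

    if ( sector_size < BLKIF_UNIT || sector_size % BLKIF_UNIT )
        return false;

    units_per_block = sector_size / BLKIF_UNIT;

    return (sector_number % units_per_block) == 0 &&
           (nr_units % units_per_block) == 0;
}

/* 'sectors' is always in 512-byte units, whatever 'sector-size' says. */
uint64_t device_size_bytes(uint64_t sectors)
{
    return sectors * BLKIF_UNIT;
}
```

With a 'sector-size' of 4096, for example, both the starting offset and the length must be multiples of 8 units.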
xen/x86/pvh: handle ACPI RSDT table in PVH Dom0 build
Xen always generates an XSDT table even if the firmware only provided an
RSDT table. Copy the RSDT header from the firmware table, adjusting the
signature, for the XSDT table when not provided by the firmware.
This is necessary to run Xen on QEMU.
Fixes: 1d74282c455f ('x86: setup PVHv2 Dom0 ACPI tables') Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6e7f7a0c16c4d406bda6d4a900252ff63a7c5fad
master date: 2024-09-12 09:18:25 +0200
Jan Beulich [Tue, 24 Sep 2024 13:00:07 +0000 (15:00 +0200)]
x86/HVM: properly reject "indirect" VRAM writes
While ->count will only be different from 1 for "indirect" (data in
guest memory) accesses, it being 1 does not exclude the request being an
"indirect" one. Check both to be on the safe side, and bring the ->count
part also in line with what ioreq_send_buffered() actually refuses to
handle.
Fixes: 3bbaaec09b1b ("x86/hvm: unify stdvga mmio intercept with standard mmio intercept") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb7cd0593d88c4b967a24bca8bd30591966676cd
master date: 2024-09-12 09:13:04 +0200
Jan Beulich [Tue, 24 Sep 2024 12:59:22 +0000 (14:59 +0200)]
x86emul/test: fix build with gas 2.43
Drop explicit {evex} pseudo-prefixes. New gas (validly) complains when
they're used on things other than instructions. Our use was potentially
ahead of macro invocations - see simd.h's "override" macro.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c09288298af881ea1bb568740deb2d2a06bcd41
master date: 2024-09-06 08:41:18 +0200
Jan Beulich [Tue, 24 Sep 2024 12:58:58 +0000 (14:58 +0200)]
x86: fix UP build with gcc14
The complaint is:
In file included from ././include/xen/config.h:17,
from <command-line>:
arch/x86/smpboot.c: In function ‘link_thread_siblings.constprop’:
./include/asm-generic/percpu.h:16:51: error: array subscript [0, 0] is outside array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds=]
16 | (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
./include/xen/compiler.h:140:29: note: in definition of macro ‘RELOC_HIDE’
140 | (typeof(ptr)) (__ptr + (off)); })
| ^~~
arch/x86/smpboot.c:238:27: note: in expansion of macro ‘per_cpu’
238 | cpumask_set_cpu(cpu2, per_cpu(cpu_sibling_mask, cpu1));
| ^~~~~~~
In file included from ./arch/x86/include/generated/asm/percpu.h:1,
from ./include/xen/percpu.h:30,
from ./arch/x86/include/asm/cpuid.h:9,
from ./arch/x86/include/asm/cpufeature.h:11,
from ./arch/x86/include/asm/system.h:6,
from ./include/xen/list.h:11,
from ./include/xen/mm.h:68,
from arch/x86/smpboot.c:12:
./include/asm-generic/percpu.h:12:22: note: while referencing ‘__per_cpu_offset’
12 | extern unsigned long __per_cpu_offset[NR_CPUS];
| ^~~~~~~~~~~~~~~~
Which I consider bogus in the first place ("array subscript [0, 0]" vs a
1-element array). Yet taking the experience from 99f942f3d410 ("Arm64:
adjust __irq_to_desc() to fix build with gcc14") I guessed that
switching function parameters to unsigned int (which they should have
been anyway) might help. And voilà ...
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a2de7dc4d845738e734b10fce6550c89c6b1092c
master date: 2024-09-04 16:09:28 +0200
Jan Beulich [Tue, 24 Sep 2024 12:58:45 +0000 (14:58 +0200)]
SUPPORT.md: split XSM from Flask
XSM is a generic framework, which in particular is also used by SILO.
With this it can't really be experimental: Arm mandates SILO for having
a security supported configuration.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: d7c18b8720824d7efc39ffa7296751e1812865a9
master date: 2024-09-04 16:05:03 +0200
libxl: Fix nul-termination of the return value of libxl_xen_console_read_line()
When built with ASAN, "xl dmesg" crashes in the "printf("%s", line)"
call in main_dmesg(). ASAN reports a heap buffer overflow: an
off-by-one access to cr->buffer.
The readconsole sysctl copies up to count characters into the buffer,
but it does not add a null character at the end. Despite the
documentation of libxl_xen_console_read_line(), line_r is not
nul-terminated if 16384 characters were copied to the buffer.
Fix this by asking xc_readconsolering() to fill the buffer up to size
- 1. As the number of characters in the buffer is only needed in
libxl_xen_console_read_line(), make it a local variable there instead
of part of the libxl__xen_console_reader struct.
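The shape of the fix, asking for at most size - 1 characters so that a terminator always fits, can be sketched as below. read_ring() is a hypothetical stand-in for a reader that, like the readconsole sysctl, copies up to *count characters and adds no terminator; it is not the real xc_readconsolering() signature.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical reader: copies up to *count characters into buf, without
 * a terminator, and updates *count to the number actually copied.
 */
void read_ring(const char *ring, size_t ring_len, char *buf, size_t *count)
{
    size_t n = ring_len < *count ? ring_len : *count;

    memcpy(buf, ring, n);
    *count = n;
}

/*
 * Fill buf (of `size` bytes) and always leave it nul-terminated, by
 * reserving the final byte before calling the reader.
 */
size_t read_terminated(const char *ring, size_t ring_len,
                       char *buf, size_t size)
{
    size_t count;

    if ( size == 0 )
        return 0;

    count = size - 1;
    read_ring(ring, ring_len, buf, &count);
    buf[count] = '\0';

    return count;
}
```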
Jan Beulich [Tue, 24 Sep 2024 12:57:43 +0000 (14:57 +0200)]
Arm64: adjust __irq_to_desc() to fix build with gcc14
With the original code I observe
In function ‘__irq_to_desc’,
inlined from ‘route_irq_to_guest’ at arch/arm/irq.c:465:12:
arch/arm/irq.c:54:16: error: array subscript -2 is below array bounds of ‘irq_desc_t[32]’ {aka ‘struct irq_desc[32]’} [-Werror=array-bounds=]
54 | return &this_cpu(local_irq_desc)[irq];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
which looks pretty bogus: How in the world does the compiler arrive at
-2 when compiling route_irq_to_guest()? Yet independent of that the
function's parameter wants to be of unsigned type anyway, as shown by
a vast majority of callers (others use plain int when they really mean
non-negative quantities). With that adjustment the code compiles fine
again.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 99f942f3d410059dc223ee0a908827e928ef3592
master date: 2024-08-29 10:03:53 +0200
For partial writes the non-written parts of registers are folded into
the full 64-bit value from what they're presently set to. That's wrong
to do though when the behavior is write-1-to-clear: Writes not
including the low 3 bits would unconditionally clear all ISR bits which
are presently set. Re-calculate the value to use.
Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41d358d2f9607ba37c216effa39b9f1bc58de69d
master date: 2024-08-29 10:02:20 +0200
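The write-1-to-clear problem above can be reproduced with plain integer arithmetic. The helpers below are hypothetical and only model the folding of a partial write into a 64-bit value; they are not the vHPET code. They assume off + len <= 8.

```c
#include <stdint.h>

/* Fold a write of `len` bytes at byte offset `off` into a 64-bit value. */
uint64_t merge_partial(uint64_t old, unsigned int off, unsigned int len,
                       uint64_t val)
{
    uint64_t mask = (len >= 8 ? ~0ULL : ((1ULL << (len * 8)) - 1))
                    << (off * 8);

    return (old & ~mask) | ((val << (off * 8)) & mask);
}

/*
 * Naive: the non-written bytes are taken from the current register, so
 * every bit presently set looks like a written 1 and clears itself.
 */
uint64_t w1c_naive(uint64_t isr, unsigned int off, unsigned int len,
                   uint64_t val)
{
    return isr & ~merge_partial(isr, off, len, val);
}

/* Corrected: non-written bytes contribute zeros, i.e. "leave alone". */
uint64_t w1c_fixed(uint64_t isr, unsigned int off, unsigned int len,
                   uint64_t val)
{
    return isr & ~merge_partial(0, off, len, val);
}
```

A 4-byte write of zero to the upper half of the register is the simplest trigger: the naive version wipes the low status bits, the corrected one leaves them untouched.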
x86/dom0: disable SMAP for PV domain building only
Move the logic that disables SMAP so it's only performed when building a PV
dom0; the PVH dom0 builder doesn't require disabling SMAP.
The Fixes tag accounts for the wrong usage of cpu_has_smap in
create_dom0(), which should instead have used
boot_cpu_has(X86_FEATURE_XEN_SMAP). Fix this while moving the logic to apply
to PV only.
While there also make cr4_pv32_mask __ro_after_init.
Fixes: 493ab190e5b1 ('xen/sm{e, a}p: allow disabling sm{e, a}p for Xen itself') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fb1658221a31ec1db33253a80001191391e73b17
master date: 2024-08-28 19:59:07 +0100
Jan Beulich [Tue, 24 Sep 2024 12:56:16 +0000 (14:56 +0200)]
x86/x2APIC: correct cluster tracking upon CPUs going down for S3
Downing CPUs for S3 is somewhat special: Since we can expect the system
to come back up in exactly the same hardware configuration, per-CPU data
for the secondary CPUs isn't de-allocated (and then cleared upon re-
allocation when the CPUs are being brought back up). Therefore the
cluster_cpus per-CPU pointer will retain its value for all CPUs other
than the final one in a cluster (i.e. in particular for all CPUs in the
same cluster as CPU0). That, however, is in conflict with the assertion
early in init_apic_ldr_x2apic_cluster().
Note that the issue is avoided on Intel hardware, where we park CPUs
instead of bringing them down.
Extend the bypassing of the freeing to the suspend case, thus making
suspend/resume also a tiny bit faster.
Fixes: 2e6c8f182c9c ("x86: distinguish CPU offlining from CPU removal") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad3ff7b4279d16c91c23cda6e8be5bc670b25c9a
master date: 2024-08-26 10:30:40 +0200
Jan Beulich [Tue, 24 Sep 2024 12:55:48 +0000 (14:55 +0200)]
x86emul: set (fake) operand size for AVX512CD broadcast insns
Back at the time I failed to pay attention to op_bytes still being zero
when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3-prefixed insns.
Fixes: 37ccca740c26 ("x86emul: support AVX512CD insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6fa6b7feaafd622db3a2f3436750cf07782f4c12
master date: 2024-08-23 09:12:24 +0200
Jan Beulich [Tue, 24 Sep 2024 12:55:11 +0000 (14:55 +0200)]
x86emul: always set operand size for AVX-VNNI-INT8 insns
Unlike for AVX-VNNI-INT16 I failed to notice that op_bytes may still be
zero when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3- or F2-prefixed
insns.
Fixes: 842acaa743a5 ("x86emul: support AVX-VNNI-INT8") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d45687cca2450bfebe1dfbddb22f4f03c6fbc9cb
master date: 2024-08-23 09:11:15 +0200
Andrew Cooper [Tue, 24 Sep 2024 12:54:35 +0000 (14:54 +0200)]
x86/pv: Address Coverity complaint in check_guest_io_breakpoint()
Commit 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV
guests") caused a Coverity INTEGER_OVERFLOW complaint based on the reasoning
that width could be 0.
It can't, but digging into the code generation, GCC 8 and later (bisected on
godbolt) choose to emit a CSWITCH lookup table, and because of the range
(bottom 2 bits clear), it's a 16-entry lookup table.
So Coverity is understandable, given that GCC did emit a (dead) logic path
where width stayed 0.
Rewrite the logic. Introduce x86_bp_width() which compiles to a single basic
block, which replaces the switch() statement. Take the opportunity to also
make start and width be loop-scope variables.
No practical change, but it should compile better and placate Coverity.
Fixes: 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV guests")
Coverity-ID: 1616152 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6d41a9d8a12ff89adabdc286e63e9391a0481699
master date: 2024-08-21 23:59:19 +0100
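A branch-free width decode of the kind described above might look like the following. This is a sketch and not necessarily how Xen's x86_bp_width() is written; bp_width() is a hypothetical name. It relies on the architectural LENn encoding in %dr7 (0, 1, 2, 3 meaning 1, 2, 8, 4 bytes), with LENn at bits 18 + 4n.

```c
#include <stdint.h>

#define DR_CONTROL_SHIFT 16 /* first R/W0 bit in %dr7 */
#define DR_CONTROL_SIZE   4 /* bits per breakpoint (R/Wn + LENn) */

/*
 * Width in bytes of breakpoint `bp` (0-3).  The LEN encoding maps
 * 0, 1, 2, 3 to 1, 2, 8, 4, so log2(width) is 0, 1, 3, 2, which is
 * raw ^ (raw >> 1).  No switch, hence no lookup table with dead slots.
 */
unsigned int bp_width(uint64_t dr7, unsigned int bp)
{
    unsigned int raw =
        (dr7 >> (DR_CONTROL_SHIFT + DR_CONTROL_SIZE * bp + 2)) & 3;

    return 1u << (raw ^ (raw >> 1));
}
```

Because every input yields a nonzero result, a static analyser has no path on which the width stays 0.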
Andrew Cooper [Tue, 24 Sep 2024 12:53:59 +0000 (14:53 +0200)]
x86/pv: Fix merging of new status bits into %dr6
All #DB exceptions result in an update of %dr6, but this isn't captured in
Xen's handling, and is buggy just about everywhere.
To begin resolving this issue, add a new pending_dbg field to x86_event
(unioned with cr2 to avoid taking any extra space, adjusting users to avoid
old-GCC bugs with anonymous unions), and introduce pv_inject_DB() to replace
the current callers using pv_inject_hw_exception().
Push the adjustment of v->arch.dr6 into pv_inject_event(), and use the new
x86_merge_dr6() rather than the current incorrect logic.
A key property is that pending_dbg is taken with positive polarity to deal
with RTM/BLD sensibly. Most callers pass in a constant, but callers passing
in a hardware %dr6 value need to XOR the value with X86_DR6_DEFAULT to flip to
positive polarity.
This fixes the behaviour of the breakpoint status bits; that any left pending
are generally discarded when a new #DB is raised. In principle it would fix
RTM/BLD too, except PV guests can't turn these capabilities on to start with.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db39fa4b27ea470902d4625567cb6fa24030ddfa
master date: 2024-08-21 23:59:19 +0100
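The polarity handling described above can be sketched with two small helpers. This is a simplification under stated assumptions: merge_dr6_sketch() is hypothetical, it discards stale breakpoint bits and accumulates everything else as the description suggests, and it omits the per-guest reserved-bit adjustments a real x86_merge_dr6() would need.

```c
#include <stdint.h>

#define X86_DR6_DEFAULT 0xffff0ff0u /* %dr6 value with nothing pending */
#define DR6_BP_MASK     0x0000000fu /* B0-B3 breakpoint status bits */

/* Flip between a hardware %dr6 value and positive-polarity pending bits. */
uint32_t dr6_to_pending(uint32_t dr6)
{
    return dr6 ^ X86_DR6_DEFAULT;
}

/*
 * Merge new positive-polarity pending bits into an existing %dr6 value:
 * stale breakpoint bits are dropped, other status bits accumulate.
 */
uint32_t merge_dr6_sketch(uint32_t dr6, uint32_t new_pending)
{
    uint32_t pending = dr6_to_pending(dr6);

    pending &= ~DR6_BP_MASK;
    pending |= new_pending;

    return pending ^ X86_DR6_DEFAULT;
}
```

The XOR is what makes inverted-polarity bits such as RTM behave like any other status bit while merging.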
Andrew Cooper [Tue, 24 Sep 2024 12:53:22 +0000 (14:53 +0200)]
x86/pv: Introduce x86_merge_dr6() and fix do_debug()
Pretty much everywhere in Xen the logic to update %dr6 when injecting #DB is
buggy. Introduce a new x86_merge_dr6() helper, and start fixing the mess by
adjusting the dr6 merge in do_debug(). Also correct the comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 54ef601a66e8d812a6a6a308f02524e81201825e
master date: 2024-08-21 23:59:19 +0100
Jan Beulich [Tue, 24 Sep 2024 12:52:15 +0000 (14:52 +0200)]
Arm: correct FIXADDR_TOP
While reviewing a RISC-V patch cloning the Arm code, I noticed an
off-by-1 here: FIX_PMAP_{BEGIN,END} being an inclusive range and
FIX_LAST being the same as FIX_PMAP_END, FIXADDR_TOP cannot derive from
FIX_LAST alone, or else the BUG_ON() in virt_to_fix() would trigger if
FIX_PMAP_END ended up being used.
While touching this area also add a check for fixmap and boot FDT area
to not only not overlap, but to have at least one (unmapped) page in
between.
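The off-by-1 can be demonstrated with a toy fixmap. Everything here is illustrative: the slot names, FIXMAP_BASE and in_fixmap() are hypothetical, and the "new" bound of FIX_LAST + 1 is an assumption about the shape of the fix. The range check treats the top as an exclusive bound, as the BUG_ON() in virt_to_fix() is described as doing.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define FIXMAP_BASE 0x00400000u /* hypothetical base address */

enum {
    FIX_CONSOLE,
    FIX_MISC,
    FIX_PMAP_BEGIN,
    FIX_PMAP_END = FIX_PMAP_BEGIN + 3, /* inclusive range */
    FIX_LAST = FIX_PMAP_END,
};

#define FIXMAP_ADDR(n)  (FIXMAP_BASE + (n) * PAGE_SIZE)

#define FIXADDR_START   FIXMAP_ADDR(0)
#define FIXADDR_TOP_OLD FIXMAP_ADDR(FIX_LAST)     /* excludes last slot */
#define FIXADDR_TOP_NEW FIXMAP_ADDR(FIX_LAST + 1) /* one past last slot */

/* True if va lies in [FIXADDR_START, top). */
bool in_fixmap(uintptr_t va, uintptr_t top)
{
    return va >= FIXADDR_START && va < top;
}
```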
Jan Beulich [Tue, 24 Sep 2024 12:49:18 +0000 (14:49 +0200)]
x86/vLAPIC: prevent undue recursion of vlapic_error()
With the error vector set to an illegal value, the function invoking
vlapic_set_irq() would bring execution back here, with the non-recursive
lock already held. Avoid the call in this case, merely further updating
ESR (if necessary).
This is XSA-462 / CVE-2024-45817.
Fixes: 5f32d186a8b1 ("x86/vlapic: don't silently accept bad vectors") Reported-by: Federico Serafini <federico.serafini@bugseng.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c42d9ec61f6d11e25fa77bd44dd11dad1edda268
master date: 2024-09-24 14:23:29 +0200
Use expect to invoke QEMU so that we can terminate the test as soon as
we get the right string in the output instead of waiting until the
final timeout.
For the timeout, instead of hardcoding the value, use a Gitlab CI
variable "QEMU_TIMEOUT" that can be changed depending on the latest
status of the Gitlab CI runners.
[This backport skips the PPC and RISC scripts as well as the XTF
scripts on x86]
The Yocto jobs take a long time to run. We are changing Gitlab ARM64
runners and the new runners might not be able to finish the Yocto jobs
in a reasonable time.
For now, disable the Yocto jobs by turning them into "manual" trigger
(they need to be manually executed.)
Jan Beulich [Tue, 13 Aug 2024 14:49:45 +0000 (16:49 +0200)]
x86/pass-through: documents as security-unsupported when sharing resources
When multiple devices share resources and one of them is to be passed
through to a guest, security of the entire system and of respective
guests individually cannot really be guaranteed without knowing
internals of any of the involved guests. Therefore such a configuration
cannot really be security-supported, yet making that explicit was so far
missing.
Teddy Astie [Tue, 13 Aug 2024 14:49:06 +0000 (16:49 +0200)]
x86/IOMMU: move tracking in iommu_identity_mapping()
If for some reason xmalloc() fails after having mapped the reserved
regions, an error is reported, but the regions remain mapped in the P2M.
Similarly if an error occurs during set_identity_p2m_entry() (except on
the first call), the partial mappings of the region would be retained
without being tracked anywhere, and hence without there being a way to
remove them again from the domain's P2M.
Move the setting up of the list entry ahead of trying to map the region.
In cases other than the first mapping failing, keep record of the full
region, such that a subsequent unmapping request can be properly torn
down.
To compensate for the potentially excess unmapping requests, don't log a
warning from p2m_remove_identity_entry() when there really was nothing
mapped at a given GFN.
This is XSA-460 / CVE-2024-31145.
Fixes: 2201b67b9128 ("VT-d: improve RMRR region handling") Fixes: c0e19d7c6c42 ("IOMMU: generalize VT-d's tracking of mapped RMRR regions") Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: beadd68b5490ada053d72f8a9ce6fd696d626596
master date: 2024-08-13 16:36:40 +0200
Matthew Barnes [Thu, 8 Aug 2024 11:53:26 +0000 (13:53 +0200)]
tools/lsevtchn: Use errno macro to handle hypercall error cases
Currently, lsevtchn aborts its event channel enumeration when it hits
an event channel that is owned by Xen.
lsevtchn does not distinguish between different hypercall errors, which
results in lsevtchn missing potential relevant event channels with
higher port numbers.
Use the errno macro to distinguish between hypercall errors, and
continue event channel enumeration if the hypercall error is not
critical to enumeration.
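A minimal sketch of the idea follows. The query callback, the mock, and the choice of which errno values are skippable are all assumptions made for illustration; they are not taken from the lsevtchn source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-port status hypercall wrapper. */
typedef int (*port_query_fn)(unsigned int port);

/*
 * Count the ports that can be queried. Treating EACCES as "skip this port"
 * and every other error as "stop" is an assumption made for this sketch.
 */
static unsigned int count_visible_ports(port_query_fn query,
                                        unsigned int max_port)
{
    unsigned int port, visible = 0;

    assert(query != NULL);

    for ( port = 0; port < max_port; port++ )
    {
        if ( query(port) == 0 )
        {
            visible++;
            continue;
        }

        if ( errno == EACCES )  /* not ours to inspect: skip and keep going */
            continue;

        break;                  /* any other error ends the enumeration */
    }

    return visible;
}

/* Mock: port 2 is inaccessible, ports >= 5 do not exist. */
static int mock_query(unsigned int port)
{
    if ( port == 2 )
    {
        errno = EACCES;
        return -1;
    }
    if ( port >= 5 )
    {
        errno = EINVAL;
        return -1;
    }
    return 0;
}
```

With the mock, ports 3 and 4 are still reported even though port 2 fails, which is the behaviour the change is after.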
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e92a453c8db8bba62d6be3006079e2b9990c3978
master date: 2024-08-02 08:43:57 +0200
George Dunlap [Thu, 8 Aug 2024 11:53:01 +0000 (13:53 +0200)]
xen/hvm: Don't skip MSR_READ trace record
Commit 37f074a3383 ("x86/msr: introduce guest_rdmsr()") introduced a
function to combine the MSR_READ handling between PV and HVM.
Unfortunately, by returning directly, it skipped the trace generation,
leading to gaps in the trace record, as well as xenalyze errors like
this:
hvm_generic_postprocess: d2v0 Strange, exit 7c(VMEXIT_MSR) missing a handler
Roger Pau Monné [Thu, 8 Aug 2024 11:52:11 +0000 (13:52 +0200)]
x86/altcall: further refine clang workaround
The current code in ALT_CALL_ARG() won't successfully workaround the clang
code-generation issue if the arg parameter has a size that's not a power of 2.
While there are no such sized parameters at the moment, improve the workaround
to also be effective when such sizes are used.
Instead of using a union with a long use an unsigned long that's first
initialized to 0 and afterwards set to the argument value.
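The approach can be sketched in portable C as follows. This is not the ALT_CALL_ARG() macro itself: widen_arg() and struct three_bytes are hypothetical names, and memcpy() stands in for whatever assignment the real macro performs on the register variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Copy an argument of any size up to a register's width into a
 * zero-initialised unsigned long, so that no byte of the result is left
 * undefined, whatever the argument's size.
 */
static unsigned long widen_arg(const void *arg, size_t size)
{
    unsigned long reg = 0;

    assert(size <= sizeof(reg));
    memcpy(&reg, arg, size);

    return reg;
}

/* A 3-byte type: a size that is not a power of 2. */
struct three_bytes {
    unsigned char b[3];
};
```

Because the whole unsigned long starts out as 0, the bytes not covered by the argument are zero regardless of whether the argument is 1, 3, or 8 bytes wide.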
Roger Pau Monné [Thu, 8 Aug 2024 11:51:39 +0000 (13:51 +0200)]
x86/dom0: fix restoring %cr3 and the mapcache override on PV build error
One of the error paths in the PV dom0 builder section that runs on the guest
page-tables was neither restoring the Xen value of %cr3 nor removing the
mapcache override.
Andrew Cooper [Thu, 8 Aug 2024 11:51:09 +0000 (13:51 +0200)]
XSM/domctl: Fix permission checks on XEN_DOMCTL_createdomain
The XSM checks for XEN_DOMCTL_createdomain are problematic. There's a split
between xsm_domctl() called early, and flask_domain_create() called quite late
during domain construction.
All XSM implementations except Flask have a simple IS_PRIV check in
xsm_domctl(), and operate as expected when an unprivileged domain tries to
make a hypercall.
Flask however foregoes any action in xsm_domctl() and defers everything,
including the simple "is the caller permitted to create a domain" check, to
flask_domain_create().
As a consequence, when XSM Flask is active, and irrespective of the policy
loaded, all domains irrespective of privilege can:
* Mutate the global 'rover' variable, used to track the next free domid.
Therefore, all domains can cause a domid wraparound, and combined with a
voluntary reboot, choose their own domid.
* Cause a reasonable amount of a domain to be constructed before ultimately
failing for permission reasons, including the use of settings outside of
supported limits.
In order to remediate this, pass the ssidref into xsm_domctl() and at least
check that the calling domain is privileged enough to create domains.
Take the opportunity to also fix the sign of the cmd parameter to be unsigned.
This issue has not been assigned an XSA, because Flask is experimental and not
security supported.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: ee32b9b29af449d38aad0a1b3a81aaae586f5ea7
master date: 2024-07-30 17:42:17 +0100
Ross Lagerwall [Thu, 8 Aug 2024 11:50:36 +0000 (13:50 +0200)]
bunzip2: fix rare decompression failure
The decompression code parses a huffman tree and counts the number of
symbols for a given bit length. In rare cases, there may be >= 256
symbols with a given bit length, causing the unsigned char to overflow.
This causes a decompression failure later when the code tries and fails to
find the bit length for a given symbol.
Since the maximum number of symbols is 258, use unsigned short instead.
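The wraparound is easy to reproduce in isolation. The sketch below is not the decompressor's code; the function names are hypothetical and it assumes 8-bit chars.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYMBOLS 258   /* maximum number of symbols, as stated above */

/* Count symbols with a given bit length using an 8-bit counter. */
static unsigned int count_with_uchar(const unsigned char *lengths, size_t n,
                                     unsigned char len)
{
    unsigned char count = 0;   /* wraps back to 0 after 255 */
    size_t i;

    assert(n <= MAX_SYMBOLS);
    for ( i = 0; i < n; i++ )
        if ( lengths[i] == len )
            count++;

    return count;
}

/* Same loop with a counter wide enough for MAX_SYMBOLS. */
static unsigned int count_with_ushort(const unsigned char *lengths, size_t n,
                                      unsigned char len)
{
    unsigned short count = 0;
    size_t i;

    assert(n <= MAX_SYMBOLS);
    for ( i = 0; i < n; i++ )
        if ( lengths[i] == len )
            count++;

    return count;
}
```

If all 258 symbols share one bit length, the 8-bit counter reports 2 while the wider counter reports 258.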
Fixes: ab77e81f6521 ("x86/dom0: support bzip2 and lzma compressed bzImage payloads") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 303d3ff85c90ee4af4bad4e3b1d4932fa2634d64
master date: 2024-07-30 11:55:56 +0200
x86/altcall: fix clang code-gen when using altcall in loop constructs
Yet another clang code generation issue when using altcalls.
The issue this time is with using loop constructs around alternative_{,v}call
instances using parameter types smaller than the register size.
Given the following example code:
static void bar(bool b)
{
    unsigned int i;
    for ( i = 0; i < 10; i++ )
    {
        int ret_;
        register union {
            bool e;
            unsigned long r;
        } di asm("rdi") = { .e = b };
        register unsigned long si asm("rsi");
        register unsigned long dx asm("rdx");
        register unsigned long cx asm("rcx");
        register unsigned long r8 asm("r8");
        register unsigned long r9 asm("r9");
        register unsigned long r10 asm("r10");
        register unsigned long r11 asm("r11");
Clang will generate machine code that only resets the low 8 bits of %rdi
between loop calls, leaving the rest of the register possibly containing
garbage from the use of %rdi inside the called function. Note also that clang
doesn't truncate the input parameters at the callee, thus breaking the psABI.
Fix this by turning the `e` element in the anonymous union into an array that
consumes the same space as an unsigned long, as this forces clang to reset the
whole %rdi register instead of just the low 8 bits.
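The shape of the adjusted union can be sketched without the register binding. The names union arg_reg and load_bool_arg() are hypothetical, and the sketch only shows the C-level effect (every byte of the unsigned long member is initialised); it says nothing about the code clang generates.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The bool member becomes an array spanning the same space as the
 * unsigned long member.
 */
union arg_reg {
    bool e[sizeof(unsigned long) / sizeof(bool)];
    unsigned long r;
};

_Static_assert(sizeof(union arg_reg) == sizeof(unsigned long),
               "array member must not enlarge the union");

static unsigned long load_bool_arg(bool b)
{
    /* Initialising e[0] zero-initialises the remaining array elements. */
    union arg_reg di = { .e = { b } };

    return di.r;
}
```

Since the initialiser covers the whole array, reading the unsigned long member never observes stale upper bytes.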
Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function parameters on clang') Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d51b2f5ea1915fe058f730b0ec542cf84254fca0
master date: 2024-07-23 13:59:30 +0200
x86/physdev: Return pirq that irq was already mapped to
Fix a bug introduced by 0762e2502f1f ("x86/physdev: factor out the code to allocate and
map a pirq"). After that refactoring, when pirq < 0 and current_pirq > 0 (the caller
wants a free pirq allocated for an irq that already has a mapped pirq), the function
returns the negative pirq and the call fails. Before the refactoring the behaviour was
different: the function returned the current_pirq that the irq was already mapped to,
and the call succeeded. Restore that behaviour.
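The intended decision can be modelled as below. This is a simplified sketch, not the Xen function: pick_pirq() and its parameters are hypothetical, and the -EEXIST result for a conflicting explicit request is an assumption added to make the model complete.

```c
#include <assert.h>
#include <errno.h>

/*
 * 'requested' < 0 means "allocate any free pirq"; 'current' is the pirq the
 * irq is already mapped to (0 if none); 'free_pirq' stands in for whatever
 * the allocator would hand out.
 */
static int pick_pirq(int requested, int current, int free_pirq)
{
    assert(free_pirq > 0);

    if ( requested < 0 )
    {
        if ( current > 0 )
            return current;     /* already mapped: report the existing pirq */

        return free_pirq;       /* not mapped yet: allocate a new one */
    }

    if ( current > 0 && requested != current )
        return -EEXIST;         /* assumed: explicit request conflicts */

    return requested;
}
```

The first branch is the case in question: a request for "any" pirq on an already mapped irq yields the existing mapping rather than a negative value.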
Fixes: 0762e2502f1f ("x86/physdev: factor out the code to allocate and map a pirq") Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0d2b87b5adfc19e87e9027d996db204c66a47f30
master date: 2024-07-08 14:46:12 +0100
Jan Beulich [Tue, 16 Jul 2024 12:14:43 +0000 (14:14 +0200)]
x86/IRQ: avoid double unlock in map_domain_pirq()
Forever since its introduction the main loop in the function dealing
with multi-vector MSI had error exit points ("break") with different
properties: In one case no IRQ descriptor lock is being held.
Nevertheless the subsequent error cleanup path assumed such a lock would
uniformly need releasing. Identify the case by setting "desc" to NULL,
thus allowing the unlock to be skipped as necessary.
Jan Beulich [Thu, 4 Jul 2024 14:57:29 +0000 (16:57 +0200)]
evtchn: build fix for Arm
When backporting daa90dfea917 ("pirq_cleanup_check() leaks") I neglected
to pay attention to it depending on 13a7b0f9f747 ("restrict concept of
pIRQ to x86"). That one doesn't want backporting imo, so use / adjust
custom #ifdef-ary to address the immediate issue of pirq_cleanup_check()
not being available on Arm.
Jan Beulich [Thu, 4 Jul 2024 12:14:49 +0000 (14:14 +0200)]
x86/entry: don't clear DF when raising #UD for lack of syscall handler
While doing so is intentional when invoking the actual callback, to
mimic a hard-coded SYSCALL_MASK / FMASK MSR, the same should not be done
when no handler is available and hence #UD is raised.
Fixes: ca6fcf4321b3 ("x86/pv: Inject #UD for missing SYSCALL callbacks") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d2fe9ab3048d503869ec81bc49db07e55a4a2386
master date: 2024-07-02 12:01:21 +0200
Jan Beulich [Thu, 4 Jul 2024 12:14:16 +0000 (14:14 +0200)]
cmdline: document and enforce "extra_guest_irqs" upper bounds
PHYSDEVOP_pirq_eoi_gmfn_v<N> accepting just a single GFN implies that no
more than 32k pIRQ-s can be used by a domain on x86. Document this upper
bound.
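The 32k figure follows from simple arithmetic, assuming 4 KiB pages and assuming the single page is used as a bitmap with one bit per pIRQ (both are assumptions of this sketch, as is the function name):

```c
#include <assert.h>
#include <limits.h>

#define PAGE_SIZE 4096U   /* assumed x86 page size */

/* Number of pIRQs representable with one bit each in a single page. */
static unsigned int max_pirqs_in_one_page(void)
{
    return PAGE_SIZE * CHAR_BIT;   /* 4096 bytes * 8 bits per byte */
}
```

With 8-bit bytes this evaluates to 32768, i.e. 32k.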
To also enforce the limit, (ab)use both arch_hwdom_irqs() (changing its
parameter type) and setup_system_domains(). This is primarily to avoid
exposing the two static variables or introducing yet further arch hooks.
While touching arch_hwdom_irqs() also mark it hwdom-init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
amend 'cmdline: document and enforce "extra_guest_irqs" upper bounds'
Address late review comments for what is now commit 17f6d398f765:
- bound max_irqs right away against nr_irqs
- introduce a #define for a constant used twice
Requested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 17f6d398f76597f8009ec0530842fb8705ece7ba
master date: 2024-07-02 12:00:27 +0200
master commit: 1f56accba33ffea0abf7d1c6384710823d10cbd6
master date: 2024-07-03 14:03:27 +0200
Andrew Cooper [Thu, 4 Jul 2024 12:12:31 +0000 (14:12 +0200)]
tools/dombuilder: Correct the length calculation in xc_dom_alloc_segment()
xc_dom_alloc_segment() is passed a size in bytes, calculates a size in pages
from it, then fills in the new segment information with a bytes value
re-calculated from the number of pages.
This causes the module information given to the guest (MB, or PVH) to have
incorrect sizes; specifically, sizes rounded up to the next page.
This in turn is problematic for Xen. When Xen finds a gzipped module, it
peeks at the end metadata to judge the decompressed size, which is a -4
backreference from the reported end of the module.
Fill in seg->vend using the correct number of bytes.
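The difference between the two end calculations can be shown with a toy model. struct segment, fill_segment(), and the page size are hypothetical stand-ins rather than libxc's definitions; the only external fact relied on is that a gzip stream ends with a 4-byte field holding the uncompressed size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL   /* assumed */

struct segment {
    uint64_t vstart;
    uint64_t vend;
    uint64_t pages;
};

/* Fill in a segment, either from the page count (old) or the byte size. */
static void fill_segment(struct segment *seg, uint64_t start,
                         uint64_t size_bytes, bool from_pages)
{
    seg->vstart = start;
    seg->pages  = (size_bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    seg->vend   = from_pages ? start + seg->pages * PAGE_SIZE
                             : start + size_bytes;
}

/* Offset of the trailing 4-byte size field in a gzip stream of 'len' bytes. */
static uint64_t gzip_size_field_offset(uint64_t len)
{
    assert(len >= 4);
    return len - 4;
}
```

For a 5000-byte module the page-based end reports 8192 bytes, so a reader looking 4 bytes before the reported end lands at offset 8188 in padding instead of at offset 4996 in the real trailer.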
Fixes: ea7c8a3d0e82 ("libxc: reorganize domain builder guest memory allocator") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 4c3a618b0adaa0cd59e0fa0898bb60978b8b3a5f
master date: 2024-07-02 10:50:18 +0100
Jan Beulich [Thu, 4 Jul 2024 12:11:57 +0000 (14:11 +0200)]
pirq_cleanup_check() leaks
Its original introduction had two issues: For one the "common" part of
the checks (carried out in the macro) was inverted. And then after
removal from the radix tree the structure wasn't scheduled for freeing.
(All structures still left in the radix tree would be freed upon domain
destruction, though.)
For the freeing to be safe even if it didn't use RCU (i.e. to avoid use-
after-free), re-arrange checks/operations in evtchn_close(), such that
the pointer wouldn't be used anymore after calling pirq_cleanup_check()
(noting that unmap_domain_pirq_emuirq() itself calls the function in the
success case).
Andrew Cooper [Thu, 4 Jul 2024 12:11:36 +0000 (14:11 +0200)]
tools/xl: Open xldevd.log with O_CLOEXEC
`xl devd` has been observed leaking /var/log/xldevd.log into children.
Note this is specifically safe; dup2() leaves O_CLOEXEC disabled on newfd, so
after setting up stdout/stderr, it's only the logfile fd which will close on
exec().
Link: https://github.com/QubesOS/qubes-issues/issues/8292 Reported-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Demi Marie Obenour <demi@invisiblethingslab.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: ba52b3b624e4a1a976908552364eba924ca45430
master date: 2024-06-24 16:22:59 +0100
Matthew Barnes [Thu, 4 Jul 2024 12:11:03 +0000 (14:11 +0200)]
x86/ioapic: Fix signed shifts in io_apic.c
There exist bitshifts in the IOAPIC code where signed integers are
shifted to the left by up to 31 bits, which is undefined behaviour.
This patch fixes this by changing the integers from signed to unsigned.
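As a generic illustration (not code from io_apic.c, and assuming a 32-bit int), shifting the signed constant 1 left by 31 is undefined, whereas the same shift on an unsigned operand is well defined. The helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a single-bit mask; the unsigned operand keeps a shift by 31 defined. */
static uint32_t bit_mask(unsigned int n)
{
    assert(n < 32);
    return 1U << n;   /* "1 << n" with a signed 1 would be UB for n == 31 */
}
```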
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c5746b021e573184fb92b601a0e93a295485054e
master date: 2024-06-21 15:09:26 +0100
Andrew Cooper [Thu, 4 Jul 2024 12:10:40 +0000 (14:10 +0200)]
tools: Drop libsystemd as a dependency
There are no more users, and we want to dissuade people from introducing new
users just for sd_notify() and friends. Drop the dependency.
We still want the overall --with{,out}-systemd to gate the generation of the
service/unit/mount/etc files.
Rerun autogen.sh, and mark the dependency as removed in the build containers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Christian Lindig <christian.lindig@cloud.com>
tools: (Actually) drop libsystemd as a dependency
When reinstating some of systemd.m4 between v1 and v2, I reintroduced a little
too much. While {c,o}xenstored are indeed no longer linked against
libsystemd, ./configure still looks for it.
Drop this too.
Fixes: ae26101f6bfc ("tools: Drop libsystemd as a dependency") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: ae26101f6bfc8185adcdb9165d469bdc467780db
master date: 2024-05-23 15:04:40 +0100
master commit: 6ef4fa1e7fe78c1dae07b451292b07facfce4902
master date: 2024-05-30 12:15:25 +0100
tools/tests: don't let test-xenstore write nodes exceeding default size
Today test-xenstore will write nodes with 3000 bytes node data. This
size is exceeding the default quota for the allowed node size. While
this works in dom0 with C-xenstored, OCAML-xenstored does not like that.
Use a size of 2000 instead, which is lower than the allowed default
node size of 2048.
Jan Beulich [Thu, 4 Jul 2024 12:06:19 +0000 (14:06 +0200)]
x86: re-run exception-from-stub recovery selftests with CET-SS enabled
On the BSP, shadow stacks are enabled only relatively late in the
booting process. They in particular aren't active yet when initcalls are
run. Keep the testing there, but invoke that testing a 2nd time when
shadow stacks are active, to make sure we won't regress that case after
addressing XSA-451.
While touching this code, switch the guard from NDEBUG to CONFIG_DEBUG,
such that IS_ENABLED() can validly be used at the new call site.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cfe3ad67127b86e1b1c06993b86422673a51b050
master date: 2024-02-27 13:49:52 +0100
Roger Pau Monné [Wed, 26 Jun 2024 11:44:08 +0000 (13:44 +0200)]
x86/irq: forward pending interrupts to new destination in fixup_irqs()
fixup_irqs() is used to evacuate interrupts from to-be-offlined CPUs. Given
the CPU is to become offline, the normal migration logic used by Xen where the
vector in the previous target(s) is left configured until the interrupt is
received on the new destination is not suitable.
Instead attempt to do as much as possible in order to prevent losing
interrupts. If fixup_irqs() is called from the CPU to be offlined (as is
currently the case for CPU hot unplug) attempt to forward pending vectors when
interrupts that target the current CPU are migrated to a different destination.
Additionally, for interrupts that have already been moved from the current CPU
prior to the call to fixup_irqs() but that haven't been delivered to the new
destination (iow: interrupts with move_in_progress set and the current CPU set
in ->arch.old_cpu_mask) also check whether the previous vector is pending and
forward it to the new destination.
This allows us to remove the window with interrupts enabled at the bottom of
fixup_irqs(). Such window wasn't safe anyway: references to the CPU to become
offline are removed from interrupts masks, but the per-CPU vector_irq[] array
is not updated to reflect those changes (as the CPU is going offline anyway).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e2bb28d621584fce15c907002ddc7c6772644b64
master date: 2024-06-20 12:09:32 +0200
Andrew Cooper [Wed, 26 Jun 2024 11:43:44 +0000 (13:43 +0200)]
x86/cpuid: Fix handling of XSAVE dynamic leaves
[ This is a minimal backport of commit 71cacfb035f4 ("x86/cpuid: Fix handling
of XSAVE dynamic leaves") to fix the bugs without depending on the large
rework of XSTATE handling in Xen 4.19 ]
First, if XSAVE is available in hardware but not visible to the guest, the
dynamic leaves shouldn't be filled in.
Second, the comment concerning XSS state is wrong. VT-x doesn't manage
host/guest state automatically, but there is provision for "host only" bits to
be set, so the implications are still accurate.
In Xen 4.18, no XSS states are supported, so it's safe to keep deferring to
real hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71cacfb035f4a78ee10970dc38a3baa04d387451
master date: 2024-06-19 13:00:06 +0100
Andrew Cooper [Wed, 26 Jun 2024 11:43:19 +0000 (13:43 +0200)]
x86/xstate: Fix initialisation of XSS cache
The clobbering of this_cpu(xcr0) and this_cpu(xss) to architecturally invalid
values is to force the subsequent set_xcr0() and set_msr_xss() to reload the
hardware register.
While XCR0 is reloaded in xstate_init(), MSR_XSS isn't. This causes
get_msr_xss() to return the invalid value, and logic of the form:
old = get_msr_xss();
set_msr_xss(new);
...
set_msr_xss(old);
to try and restore said invalid value.
The architecturally invalid value must be purged from the cache, meaning the
hardware register must be written at least once. This in turn highlights that
the invalid value must only be used in the case that the hardware register is
available.
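The caching pattern and the fix can be modelled with plain variables. Everything below is a hypothetical stand-in for the per-CPU cache and the MSR accessors, and writing 0 as the initial value is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INVALID_VALUE (~UINT64_C(0))   /* sentinel that never matches a real value */

static uint64_t hw_reg;         /* stands in for the hardware register */
static uint64_t cached;         /* software cache of hw_reg */
static unsigned int hw_writes;  /* number of simulated hardware writes */

static void set_reg(uint64_t val)
{
    if ( val == cached )
        return;                 /* cache hit: skip the hardware write */

    hw_reg = val;
    hw_writes++;
    cached = val;
}

static uint64_t get_reg(void)
{
    return cached;
}

/*
 * Clobber the cache so the next set_reg() really writes the hardware, then
 * write once so the sentinel can never be observed through get_reg().
 * Nothing is clobbered when the register does not exist.
 */
static void init_reg(bool reg_available)
{
    if ( !reg_available )
        return;

    cached = INVALID_VALUE;
    set_reg(0);
}
```

After init_reg(), a save/modify/restore sequence using get_reg() and set_reg() restores a real value rather than the sentinel.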
Fixes: f7f4a523927f ("x86/xstate: reset cached register values on resume") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9e6dbbe8bf400aacb99009ddffa91d2a0c312b39
master date: 2024-06-19 13:00:06 +0100
Andrew Cooper [Wed, 26 Jun 2024 11:42:30 +0000 (13:42 +0200)]
xen/ubsan: Fix UB in type_descriptor declaration
struct type_descriptor is arranged with a NUL terminated string following the
kind/info fields.
The only reason this doesn't trip UBSAN detection itself (on more modern
compilers at least) is because struct type_descriptor is only referenced in
suppressed regions.
Switch the declaration to be a real flexible member. No functional change.
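A declaration of that shape might look like the sketch below. The field names follow the description above but should be read as illustrative, and make_descriptor() is a hypothetical helper for demonstration only; real descriptors are emitted by the compiler rather than allocated at run time.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Kind/info fields followed by a NUL-terminated name as a flexible member. */
struct type_descriptor {
    uint16_t type_kind;
    uint16_t type_info;
    char type_name[];
};

/* Allocate a descriptor with room for the name; the caller frees it. */
static struct type_descriptor *make_descriptor(uint16_t kind, uint16_t info,
                                               const char *name)
{
    size_t len = strlen(name) + 1;
    struct type_descriptor *td = malloc(sizeof(*td) + len);

    if ( td )
    {
        td->type_kind = kind;
        td->type_info = info;
        memcpy(td->type_name, name, len);
    }

    return td;
}
```

With a flexible array member, indexing past the first character of the name is no longer an out-of-bounds access on a fixed-size array.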
Fixes: 00fcf4dd8eb4 ("xen/ubsan: Import ubsan implementation from Linux 4.13") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bd59af99700f075d06a6d47a16f777c9519928e0
master date: 2024-06-18 14:55:04 +0100
Roger Pau Monné [Wed, 26 Jun 2024 11:42:05 +0000 (13:42 +0200)]
x86/irq: handle moving interrupts in _assign_irq_vector()
Currently there's logic in fixup_irqs() that attempts to prevent
_assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
interrupts from the CPUs not present in the input mask. The current logic in
fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
_assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
to deal with interrupts that have either move_{in_progress,cleanup_count} set
and no remaining online CPUs in ->arch.cpu_mask.
If _assign_irq_vector() is requested to move an interrupt in the state
described above, first attempt to see if ->arch.old_cpu_mask contains any valid
CPUs that could be used as fallback, and if that's the case do move the
interrupt back to the previous destination. Note this is easier because the
vector hasn't been released yet, so there's no need to allocate and setup a new
vector on the destination.
Due to the logic in fixup_irqs() that clears offline CPUs from
->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
shouldn't be possible to get into _assign_irq_vector() with
->arch.move_{in_progress,cleanup_count} set but no online CPUs in
->arch.old_cpu_mask.
However if ->arch.move_{in_progress,cleanup_count} is set and the interrupt has
also changed affinity, it's possible the members of ->arch.old_cpu_mask are no
longer part of the affinity set. In that case move the interrupt to a different
CPU that is part of the provided mask, and keep the current
->arch.old_{cpu_mask,vector} for the pending interrupt movement to be completed.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 369558924a642bbb0cb731e9a3375958867cb17b
master date: 2024-06-18 15:15:10 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:41:35 +0000 (13:41 +0200)]
x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
Given the current logic it's possible for ->arch.old_cpu_mask to get out of
sync: if a CPU set in old_cpu_mask is offlined and then onlined
again without old_cpu_mask having been updated the data in the mask will no
longer be accurate, as when brought back online the CPU will no longer have
old_vector configured to handle the old interrupt source.
If there's an interrupt movement in progress, and the to be offlined CPU (which
is the call context) is in the old_cpu_mask, clear it and update the mask, so
it doesn't contain stale data.
Note that when the system is going down fixup_irqs() will be called by
smp_send_stop() from CPU 0 with a mask with only CPU 0 on it, effectively
asking to move all interrupts to the current caller (CPU 0) which is the only
CPU to remain online. In that case we don't care to migrate interrupts that
are in the process of being moved, as it's likely we won't be able to move all
interrupts to CPU 0 due to vector shortage anyway.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 817d1cd627be668c358d038f0fadbf7d24d417d3
master date: 2024-06-18 15:14:49 +0200
Jan Beulich [Wed, 26 Jun 2024 11:41:05 +0000 (13:41 +0200)]
x86/Intel: unlock CPUID earlier for the BSP
Intel CPUs have an MSR bit to limit CPUID enumeration to leaf two. If
this bit is set by the BIOS then CPUID evaluation does not work when
data from any leaf greater than two is needed; early_cpu_init() in
particular wants to collect leaf 7 data.
Cure this by unlocking CPUID right before evaluating anything which
depends on the maximum CPUID leaf being greater than two.
Inspired by (and description cloned from) Linux commit 0c2f6d04619e
("x86/topology/intel: Unlock CPUID before evaluating anything").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: fa4d026737a47cd1d66ffb797a29150b4453aa9f
master date: 2024-06-18 15:12:44 +0200
Jan Beulich [Wed, 26 Jun 2024 11:40:35 +0000 (13:40 +0200)]
x86/EPT: drop questionable mfn_valid() from epte_get_entry_emt()
mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
access to actual MMIO space should not generally be restricted to UC
only; especially video frame buffer accesses are unduly affected by such
a restriction.
Since, as of 777c71d31325 ("x86/EPT: avoid marking non-present entries
for re-configuring"), the function won't be called with INVALID_MFN or,
worse, truncated forms thereof anymore, we can fully drop that check.
Fixes: 81fd0d3ca4b2 ("x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 4fdd8d75566fdad06667a79ec0ce6f43cc466c54
master date: 2024-06-13 16:55:22 +0200
Jan Beulich [Wed, 26 Jun 2024 11:40:11 +0000 (13:40 +0200)]
x86/EPT: avoid marking non-present entries for re-configuring
For non-present entries EMT, like most other fields, is meaningless to
hardware. Make the logic in ept_set_entry() setting the field (and iPAT)
conditional upon dealing with a present entry, leaving the value at 0
otherwise. This has two effects for epte_get_entry_emt() which we'll
want to leverage subsequently:
1) The call moved here now won't be issued with INVALID_MFN anymore (a
respective BUG_ON() is being added).
2) Neither of the other two calls could now be issued with a truncated
form of INVALID_MFN anymore (as long as there's no bug anywhere
marking an entry present when that was populated using INVALID_MFN).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 777c71d31325bc55ba1cc3f317d4155fe519ab0b
master date: 2024-06-13 16:54:17 +0200
Jan Beulich [Wed, 26 Jun 2024 11:39:44 +0000 (13:39 +0200)]
x86/EPT: correct special page checking in epte_get_entry_emt()
mfn_valid() granularity is (currently) 256Mb. Therefore the start of a
1Gb page passing the test doesn't necessarily mean all parts of such a
range would also pass. Yet using the result of mfn_to_page() on an MFN
which doesn't pass mfn_valid() checking is liable to result in a crash
(the invocation of mfn_to_page() alone is presumably "just" UB in such a
case).
Fixes: ca24b2ffdbd9 ("x86/hvm: set 'ipat' in EPT for special pages") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5540b94e8191059eb9cbbe98ac316232a42208f6
master date: 2024-06-13 16:53:34 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:39:11 +0000 (13:39 +0200)]
x86/irq: limit interrupt movement done by fixup_irqs()
The current check used in fixup_irqs() to decide whether to move around
interrupts is based on the affinity mask, but such mask can have all bits set,
and hence is unlikely to be a subset of the input mask. For example if an
interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
an all set CPU mask would cause that interrupt to be shuffled around
unconditionally.
What fixup_irqs() cares about is evacuating interrupts from CPUs not set on the
input CPU mask, and for that purpose it should check whether the interrupt is
assigned to a CPU not present in the input mask. Assume that ->arch.cpu_mask
is a subset of the ->affinity mask, and keep the current logic that resets the
->affinity mask if the interrupt has to be shuffled around.
Doing the affinity movement based on ->arch.cpu_mask requires removing the
special handling to ->arch.cpu_mask done for high priority vectors, otherwise
the adjustment done to cpu_mask makes them always skip the CPU interrupt
movement.
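The difference between checking the affinity mask and checking the currently targeted CPUs can be shown with toy bitmasks. The cpumask_t typedef and helper names here are hypothetical stand-ins, with one bit per CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* toy stand-in: one bit per CPU */

/* True if every CPU in 'a' is also in 'b'. */
static bool mask_subset(cpumask_t a, cpumask_t b)
{
    return (a & ~b) == 0;
}

/* An interrupt must be moved if 'mask' names a CPU outside 'keep'. */
static bool needs_evacuation(cpumask_t mask, cpumask_t keep)
{
    return !mask_subset(mask, keep);
}
```

With CPUs 0-2 remaining, an all-ones affinity mask always triggers a move, whereas an interrupt whose target mask names only CPU 1 is left alone and one targeting CPU 3 is moved.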
While there also adjust the comment as to the purpose of fixup_irqs().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c7564d7366d865cc407e3d64bca816d07edee174
master date: 2024-06-12 14:30:40 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:38:36 +0000 (13:38 +0200)]
x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
Due to the current rwlock logic, if the CPU calling get_cpu_maps() does
so from a cpu_hotplug_{begin,done}() region the function will still
return success, because a CPU taking the rwlock in read mode after
having taken it in write mode is allowed. This corner case makes using
get_cpu_maps() alone not enough to prevent using the shorthand in CPU
hotplug regions.
Introduce a new helper to detect whether the current caller is between a
cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
shorthand usage.
Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 171c52fba5d94e050d704770480dcb983490d0ad
master date: 2024-06-12 14:29:31 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:37:20 +0000 (13:37 +0200)]
x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
When adjusting move_cleanup_count to account for CPUs that are offline also
adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
those again and create an imbalance in move_cleanup_count.
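The imbalance can be reproduced with a toy model. struct move_state and account_offline() are hypothetical; the flag selects between adjusting only the count and also trimming the mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t cpumask_t;   /* toy stand-in: one bit per CPU */

struct move_state {
    cpumask_t old_cpu_mask;
    unsigned int move_cleanup_count;
};

/* Number of bits set in the mask. */
static unsigned int mask_weight(cpumask_t m)
{
    unsigned int n = 0;

    for ( ; m; m &= m - 1 )
        n++;

    return n;
}

/* Account for CPUs in old_cpu_mask that are no longer online. */
static void account_offline(struct move_state *s, cpumask_t online,
                            bool trim_mask)
{
    unsigned int gone = mask_weight(s->old_cpu_mask & ~online);

    assert(gone <= s->move_cleanup_count);
    s->move_cleanup_count -= gone;

    if ( trim_mask )
        s->old_cpu_mask &= online;   /* forget CPUs already accounted for */
}
```

Calling the function twice without trimming subtracts the same offline CPU twice; with trimming, the second call is a no-op.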
Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e63209d3ba2fd1b2f232babd14c9c679ffa7b09a
master date: 2024-06-10 10:33:22 +0200
Andrew Cooper [Wed, 26 Jun 2024 11:36:13 +0000 (13:36 +0200)]
x86/ucode: Further fixes to identify "ucode already up to date"
When the revision in hardware is newer than anything Xen has to hand,
'microcode_cache' isn't set up. Then, `xen-ucode` initiates the update
because it doesn't know whether the revisions across the system are symmetric
or not. This involves the patch getting all the way into the
apply_microcode() hooks before being found to be too old.
This is all a giant mess and needs an overhaul, but in the short term simply
adjust the apply_microcode() to return -EEXIST.
Also, unconditionally print the preexisting microcode revision on boot. It's
relevant information which is otherwise unavailable if Xen doesn't find new
microcode to use.
Fixes: 648db37a155a ("x86/ucode: Distinguish "ucode already up to date"") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 977d98e67c2e929c62aa1f495fc4c6341c45abb5
master date: 2024-05-16 13:59:11 +0100
Roger Pau Monné [Tue, 21 May 2024 08:25:39 +0000 (10:25 +0200)]
x86/mtrr: avoid system wide rendezvous when setting AP MTRRs
There's no point in forcing a system wide update of the MTRRs on all processors
when there are no changes to be propagated. On AP startup it's only the AP
that needs to write the system wide MTRR values in order to match the rest of
the already online CPUs.
We have occasionally seen the watchdog trigger during `xen-hptool cpu-online`
in one Intel Cascade Lake box with 448 CPUs due to the re-setting of the MTRRs
on all the CPUs in the system.
While there adjust the comment to clarify why the system-wide resetting of the
MTRR registers is not needed for the purposes of mtrr_ap_init().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: abd00b037da5ffa4e8c4508a5df0cd6eabb805a4
master date: 2024-05-15 19:59:52 +0100
Leigh Brown [Tue, 21 May 2024 08:25:30 +0000 (10:25 +0200)]
tools/xentop: Fix cpu% sort order
In compare_cpu_pct(), there is a double -> unsigned long long conversion when
calling compare(). In C, this discards the fractional part, resulting in an
out-of-order sorting such as:
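A minimal sketch of the truncation itself, not of xentop's output. The helper
names below are hypothetical and are not xentop's actual code; the point is
only that passing a double to an integer comparison treats values differing
in their fractional part as equal:

```c
/* Hypothetical three-way comparison on integers. */
static int compare_ull(unsigned long long a, unsigned long long b)
{
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}

/*
 * Problematic pattern: the double arguments are implicitly converted to
 * unsigned long long at the call, which discards the fractional part.
 */
static int compare_pct_truncating(double a, double b)
{
    return compare_ull(a, b);
}

/* Comparing the doubles directly preserves the ordering. */
static int compare_pct(double a, double b)
{
    if (a < b)
        return -1;
    if (a > b)
        return 1;
    return 0;
}
```

Here compare_pct_truncating(1.2, 1.7) reports the two values as equal, while
compare_pct(1.2, 1.7) orders them correctly.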
Jan Beulich [Tue, 21 May 2024 08:25:08 +0000 (10:25 +0200)]
x86: respect mapcache_domain_init() failing
The function itself properly handles and hands onwards failure from
create_perdomain_mapping(). Therefore its caller should respect possible
failure, too.
Fixes: 4b28bf6ae90b ("x86: re-introduce map_domain_page() et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7270fdc7a0028d4b7b26fd1b36c6b9e97abcf3da
master date: 2024-05-15 19:59:52 +0100
Juergen Gross [Tue, 21 May 2024 08:24:26 +0000 (10:24 +0200)]
xen/sched: set all sched_resource data inside locked region for new cpu
When adding a cpu to a scheduler, set all data items of struct
sched_resource inside the locked region, as otherwise a race might
happen (e.g. when trying to access the cpupool of the cpu):
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: a8c6c623192e ("sched: clarify use cases of schedule_cpu_switch()") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d104a07524ffc92ae7a70dfe192c291de2a563cc
master date: 2024-05-15 19:59:52 +0100
libxl: Fix handling XenStore errors in device creation
If xenstored runs out of memory it is possible for it to fail operations
that should succeed. libxl wasn't robust against this, and could fail
to ensure that the TTY path of a non-initial console was created and
read-only for guests. This doesn't qualify for an XSA because guests
should not be able to run xenstored out of memory, but it still needs to
be fixed.
Add the missing error checks to ensure that all errors are properly
handled and that at no point can a guest make the TTY path of its
frontend directory writable.
Roger Pau Monné [Tue, 21 May 2024 08:23:27 +0000 (10:23 +0200)]
libxl: fix population of the online vCPU bitmap for PVH
libxl passes some information to libacpi to create the ACPI table for a PVH
guest, and among that information is a bitmap of which vCPUs are online,
which can be fewer than the maximum number of vCPUs assigned to the domain.
While the population of the bitmap is done correctly for HVM based on the
number of online vCPUs, for PVH the population of the bitmap is done based on
the number of maximum vCPUs allowed. This leads to all local APIC entries in
the MADT being set as enabled, which contradicts the data in xenstore if vCPUs
differs from maximum vCPUs.
Fix by copying the internal libxl bitmap that's populated based on the vCPUs
parameter.
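A minimal sketch of the difference, assuming a fixed-size byte array stands in
for the bitmap. The names and the 64-vCPU limit are hypothetical and are not
libxl's or libacpi's actual API:

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_VCPUS 64 /* hypothetical limit for this sketch */
#define SKETCH_MAP_BYTES (SKETCH_MAX_VCPUS / 8)

/* Old PVH behaviour: mark every vCPU below max_vcpus as online. */
static void fill_from_max(uint8_t *map, unsigned int max_vcpus)
{
    memset(map, 0, SKETCH_MAP_BYTES);
    for (unsigned int i = 0; i < max_vcpus; i++)
        map[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Fixed behaviour: copy the bitmap already populated from the vCPUs setting. */
static void fill_from_avail(uint8_t *map, const uint8_t *avail)
{
    memcpy(map, avail, SKETCH_MAP_BYTES);
}

/* Count how many vCPUs the bitmap reports as online. */
static unsigned int count_online(const uint8_t *map)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < SKETCH_MAX_VCPUS; i++)
        n += (map[i / 8] >> (i % 8)) & 1u;
    return n;
}
```

For a guest with 8 maximum vCPUs of which 2 are online, fill_from_max() marks
all 8 as online, whereas fill_from_avail() preserves the 2 that are actually
online.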
Reported-by: Arthur Borsboom <arthurborsboom@gmail.com> Link: https://gitlab.com/libvirt/libvirt/-/issues/399 Reported-by: Leigh Brown <leigh@solinno.co.uk> Fixes: 14c0d328da2b ('libxl/acpi: Build ACPI tables for HVMlite guests') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Leigh Brown <leigh@solinno.co.uk> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5cc7347b04b2d0a3133754c7a9b936f614ec656a
master date: 2024-05-11 00:13:43 +0100
Andrew Cooper [Tue, 21 May 2024 08:22:08 +0000 (10:22 +0200)]
x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake
Ever since Xen 4.14, there has been a latent bug with migration.
While some toolstacks can level the features properly, they don't shrink
feat.max_subleaf when all features have been dropped. This is because
we *still* have not completed the toolstack side work for full CPU Policy
objects.
As a consequence, even when properly feature levelled, VMs can't migrate
"backwards" across hardware which reduces feat.max_subleaf. One such example
is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
Extend the max policies' feat.max_subleaf to the highest number Xen knows
about, but leave the default policies matching the host. This will allow VMs
with a higher feat.max_subleaf than strictly necessary to migrate in.
Eventually we'll manage to teach the toolstack how to avoid creating such VMs
in the first place, but there's still more work to do there.
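A minimal sketch of why extending the max policy helps, assuming the
compatibility check reduces to a single comparison on max_subleaf. The struct
and function are hypothetical; Xen's real policy compatibility check covers
far more than this:

```c
#include <stdbool.h>

/* Hypothetical, heavily reduced view of a CPU policy. */
struct feat_policy_sketch {
    unsigned int max_subleaf;
};

/* A guest may migrate in only if its max_subleaf fits within the max policy. */
static bool can_migrate_in(const struct feat_policy_sketch *max,
                           const struct feat_policy_sketch *guest)
{
    return guest->max_subleaf <= max->max_subleaf;
}
```

With a max policy that mirrors a Cascade Lake host (max_subleaf=0), a guest
created on Ice Lake (max_subleaf=2) is rejected. Once the max policy is
extended to a value of at least 2, the same guest is accepted.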
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2330b51df267e20e66bbba6c5bf08f0570ed58b
master date: 2024-05-07 16:56:46 +0100
Jan Beulich [Tue, 21 May 2024 08:20:58 +0000 (10:20 +0200)]
VT-d: correct ATS checking for root complex integrated devices
Spec version 4.1 says
"The ATSR structures identifies PCI Express Root-Ports supporting
Address Translation Services (ATS) transactions. Software must enable
ATS on endpoint devices behind a Root Port only if the Root Port is
reported as supporting ATS transactions."
Clearly root complex integrated devices aren't "behind root ports",
matching my observation on a SapphireRapids system having an ATS-
capable root complex integrated device. Hence for such devices we
shouldn't try to locate a corresponding ATSR.
Since both pci_find_ext_capability() and pci_find_cap_offset() return
"unsigned int", change "pos" to that type at the same time.
Fixes: 903b93211f56 ("[VTD] laying the ground work for ATS") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 04e31583bab97e5042a44a1d00fce2760272635f
master date: 2024-05-06 09:22:45 +0200