Jan Beulich [Tue, 24 Sep 2024 12:59:22 +0000 (14:59 +0200)]
x86emul/test: fix build with gas 2.43
Drop explicit {evex} pseudo-prefixes. New gas (validly) complains when
they're used on things other than instructions. Our use was potentially
ahead of macro invocations - see simd.h's "override" macro.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c09288298af881ea1bb568740deb2d2a06bcd41
master date: 2024-09-06 08:41:18 +0200
Jan Beulich [Tue, 24 Sep 2024 12:58:58 +0000 (14:58 +0200)]
x86: fix UP build with gcc14
The complaint is:
In file included from ././include/xen/config.h:17,
from <command-line>:
arch/x86/smpboot.c: In function ‘link_thread_siblings.constprop’:
./include/asm-generic/percpu.h:16:51: error: array subscript [0, 0] is outside array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds=]
16 | (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
./include/xen/compiler.h:140:29: note: in definition of macro ‘RELOC_HIDE’
140 | (typeof(ptr)) (__ptr + (off)); })
| ^~~
arch/x86/smpboot.c:238:27: note: in expansion of macro ‘per_cpu’
238 | cpumask_set_cpu(cpu2, per_cpu(cpu_sibling_mask, cpu1));
| ^~~~~~~
In file included from ./arch/x86/include/generated/asm/percpu.h:1,
from ./include/xen/percpu.h:30,
from ./arch/x86/include/asm/cpuid.h:9,
from ./arch/x86/include/asm/cpufeature.h:11,
from ./arch/x86/include/asm/system.h:6,
from ./include/xen/list.h:11,
from ./include/xen/mm.h:68,
from arch/x86/smpboot.c:12:
./include/asm-generic/percpu.h:12:22: note: while referencing ‘__per_cpu_offset’
12 | extern unsigned long __per_cpu_offset[NR_CPUS];
| ^~~~~~~~~~~~~~~~
Which I consider bogus in the first place ("array subscript [0, 0]" vs a
1-element array). Yet taking the experience from 99f942f3d410 ("Arm64:
adjust __irq_to_desc() to fix build with gcc14") I guessed that
switching function parameters to unsigned int (which they should have
been anyway) might help. And voilà ...
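A hedged sketch of the change's shape (the real function also maintains the core mask):

    /* CPU numbers are inherently non-negative: take unsigned int. */
    static void link_thread_siblings(unsigned int cpu1, unsigned int cpu2)
    {
        cpumask_set_cpu(cpu1, per_cpu(cpu_sibling_mask, cpu2));
        cpumask_set_cpu(cpu2, per_cpu(cpu_sibling_mask, cpu1));
        /* ... cpu_core_mask handling elided ... */
    }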
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a2de7dc4d845738e734b10fce6550c89c6b1092c
master date: 2024-09-04 16:09:28 +0200
Jan Beulich [Tue, 24 Sep 2024 12:58:45 +0000 (14:58 +0200)]
SUPPORT.md: split XSM from Flask
XSM is a generic framework, which in particular is also used by SILO.
With this it can't really be experimental: Arm mandates SILO for having
a security-supported configuration.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: d7c18b8720824d7efc39ffa7296751e1812865a9
master date: 2024-09-04 16:05:03 +0200
libxl: Fix nul-termination of the return value of libxl_xen_console_read_line()
When built with ASAN, "xl dmesg" crashes in the "printf("%s", line)"
call in main_dmesg(). ASAN reports a heap buffer overflow: an
off-by-one access to cr->buffer.
The readconsole sysctl copies up to count characters into the buffer,
but it does not add a null character at the end. Despite the
documentation of libxl_xen_console_read_line(), line_r is not
nul-terminated if 16384 characters were copied to the buffer.
Fix this by asking xc_readconsolering() to fill the buffer up to size
- 1. As the number of characters in the buffer is only needed in
libxl_xen_console_read_line(), make it a local variable there instead
of part of the libxl__xen_console_reader struct.
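A hedged sketch of the described fix (field names assumed from the prose; xc_readconsolering()'s signature as in libxc):

    unsigned int nr = cr->size - 1;          /* leave room for the NUL */
    int rc = xc_readconsolering(ctx->xch, cr->buffer, &nr,
                                0 /* clear */, 1 /* incremental */,
                                &cr->index);
    if (!rc)
        cr->buffer[nr] = '\0';               /* now always nul-terminated */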
Jan Beulich [Tue, 24 Sep 2024 12:57:43 +0000 (14:57 +0200)]
Arm64: adjust __irq_to_desc() to fix build with gcc14
With the original code I observe
In function ‘__irq_to_desc’,
inlined from ‘route_irq_to_guest’ at arch/arm/irq.c:465:12:
arch/arm/irq.c:54:16: error: array subscript -2 is below array bounds of ‘irq_desc_t[32]’ {aka ‘struct irq_desc[32]’} [-Werror=array-bounds=]
54 | return &this_cpu(local_irq_desc)[irq];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
which looks pretty bogus: How in the world does the compiler arrive at
-2 when compiling route_irq_to_guest()? Yet independent of that the
function's parameter wants to be of unsigned type anyway, as shown by
a vast majority of callers (others use plain int when they really mean
non-negative quantities). With that adjustment the code compiles fine
again.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 99f942f3d410059dc223ee0a908827e928ef3592
master date: 2024-08-29 10:03:53 +0200
For partial writes the non-written parts of registers are folded into
the full 64-bit value from what they're presently set to. That's wrong
to do though when the behavior is write-1-to-clear: Writes not
including the low 3 bits would unconditionally clear all ISR bits which
are presently set. Re-calculate the value to use.
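As a hedged illustration of the pattern (not the actual vHPET code; "val" is assumed already shifted into register position):

    static uint64_t w1c_update(uint64_t isr, uint64_t val,
                               unsigned int off, unsigned int len)
    {
        /* Mask covering only the bytes the guest actually wrote. */
        uint64_t mask = (len < 8 ? (1ULL << (len * 8)) - 1 : ~0ULL)
                        << (off * 8);

        /* Clear exactly the bits written as 1; unwritten parts must
         * contribute 0, not the register's current contents. */
        return isr & ~(val & mask);
    }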
Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41d358d2f9607ba37c216effa39b9f1bc58de69d
master date: 2024-08-29 10:02:20 +0200
x86/dom0: disable SMAP for PV domain building only
Move the logic that disables SMAP so it's only performed when building a PV
dom0; the PVH dom0 builder doesn't require disabling SMAP.
The fixes tag is to account for the wrong usage of cpu_has_smap in
create_dom0(); it should instead have used
boot_cpu_has(X86_FEATURE_XEN_SMAP). Fix this while moving the logic to apply
to PV only.
While there also make cr4_pv32_mask __ro_after_init.
Fixes: 493ab190e5b1 ('xen/sm{e, a}p: allow disabling sm{e, a}p for Xen itself') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fb1658221a31ec1db33253a80001191391e73b17
master date: 2024-08-28 19:59:07 +0100
Jan Beulich [Tue, 24 Sep 2024 12:56:16 +0000 (14:56 +0200)]
x86/x2APIC: correct cluster tracking upon CPUs going down for S3
Downing CPUs for S3 is somewhat special: Since we can expect the system
to come back up in exactly the same hardware configuration, per-CPU data
for the secondary CPUs isn't de-allocated (and then cleared upon re-
allocation when the CPUs are being brought back up). Therefore the
cluster_cpus per-CPU pointer will retain its value for all CPUs other
than the final one in a cluster (i.e. in particular for all CPUs in the
same cluster as CPU0). That, however, is in conflict with the assertion
early in init_apic_ldr_x2apic_cluster().
Note that the issue is avoided on Intel hardware, where we park CPUs
instead of bringing them down.
Extend the bypassing of the freeing to the suspend case, thus making
suspend/resume also a tiny bit faster.
Fixes: 2e6c8f182c9c ("x86: distinguish CPU offlining from CPU removal") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad3ff7b4279d16c91c23cda6e8be5bc670b25c9a
master date: 2024-08-26 10:30:40 +0200
Jan Beulich [Tue, 24 Sep 2024 12:55:48 +0000 (14:55 +0200)]
x86emul: set (fake) operand size for AVX512CD broadcast insns
Back at the time I failed to pay attention to op_bytes still being zero
when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3-prefixed insns.
Fixes: 37ccca740c26 ("x86emul: support AVX512CD insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6fa6b7feaafd622db3a2f3436750cf07782f4c12
master date: 2024-08-23 09:12:24 +0200
Jan Beulich [Tue, 24 Sep 2024 12:55:11 +0000 (14:55 +0200)]
x86emul: always set operand size for AVX-VNNI-INT8 insns
Unlike for AVX-VNNI-INT16 I failed to notice that op_bytes may still be
zero when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3- or F2-prefixed
insns.
Fixes: 842acaa743a5 ("x86emul: support AVX-VNNI-INT8") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d45687cca2450bfebe1dfbddb22f4f03c6fbc9cb
master date: 2024-08-23 09:11:15 +0200
Andrew Cooper [Tue, 24 Sep 2024 12:54:35 +0000 (14:54 +0200)]
x86/pv: Address Coverity complaint in check_guest_io_breakpoint()
Commit 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV
guests") caused a Coverity INTEGER_OVERFLOW complaint based on the reasoning
that width could be 0.
It can't, but digging into the code generation, GCC 8 and later (bisected on
godbolt) choose to emit a CSWITCH lookup table, and because of the range (bottom
2 bits clear), it's a 16-entry lookup table.
So Coverity's complaint is understandable, given that GCC did emit a (dead)
logic path
where width stayed 0.
Rewrite the logic. Introduce x86_bp_width(), which compiles to a single basic
block and replaces the switch() statement. Take the opportunity to also
make start and width be loop-scope variables.
No practical change, but it should compile better and placate Coverity.
Fixes: 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV guests")
Coverity-ID: 1616152 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6d41a9d8a12ff89adabdc286e63e9391a0481699
master date: 2024-08-21 23:59:19 +0100
Andrew Cooper [Tue, 24 Sep 2024 12:53:59 +0000 (14:53 +0200)]
x86/pv: Fix merging of new status bits into %dr6
All #DB exceptions result in an update of %dr6, but this isn't captured in
Xen's handling, and is buggy just about everywhere.
To begin resolving this issue, add a new pending_dbg field to x86_event
(unioned with cr2 to avoid taking any extra space, adjusting users to avoid
old-GCC bugs with anonymous unions), and introduce pv_inject_DB() to replace
the current callers using pv_inject_hw_exception().
Push the adjustment of v->arch.dr6 into pv_inject_event(), and use the new
x86_merge_dr6() rather than the current incorrect logic.
A key property is that pending_dbg is taken with positive polarity to deal
with RTM/BLD sensibly. Most callers pass in a constant, but callers passing
in a hardware %dr6 value need to XOR the value with X86_DR6_DEFAULT to flip to
positive polarity.
This fixes the behaviour of the breakpoint status bits: any left pending
are generally discarded when a new #DB is raised. In principle it would fix
RTM/BLD too, except PV guests can't turn these capabilities on to start with.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db39fa4b27ea470902d4625567cb6fa24030ddfa
master date: 2024-08-21 23:59:19 +0100
Andrew Cooper [Tue, 24 Sep 2024 12:53:22 +0000 (14:53 +0200)]
x86/pv: Introduce x86_merge_dr6() and fix do_debug()
Pretty much everywhere in Xen the logic to update %dr6 when injecting #DB is
buggy. Introduce a new x86_merge_dr6() helper, and start fixing the mess by
adjusting the dr6 merge in do_debug(). Also correct the comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 54ef601a66e8d812a6a6a308f02524e81201825e
master date: 2024-08-21 23:59:19 +0100
Jan Beulich [Tue, 24 Sep 2024 12:52:15 +0000 (14:52 +0200)]
Arm: correct FIXADDR_TOP
While reviewing a RISC-V patch cloning the Arm code, I noticed an
off-by-1 here: FIX_PMAP_{BEGIN,END} being an inclusive range and
FIX_LAST being the same as FIX_PMAP_END, FIXADDR_TOP cannot derive from
FIX_LAST alone, or else the BUG_ON() in virt_to_fix() would trigger if
FIX_PMAP_END ended up being used.
While touching this area also add a check for fixmap and boot FDT area
to not only not overlap, but to have at least one (unmapped) page in
between.
Jan Beulich [Tue, 24 Sep 2024 12:49:18 +0000 (14:49 +0200)]
x86/vLAPIC: prevent undue recursion of vlapic_error()
With the error vector set to an illegal value, the function invoking
vlapic_set_irq() would bring execution back here, with the non-recursive
lock already held. Avoid the call in this case, merely further updating
ESR (if necessary).
This is XSA-462 / CVE-2024-45817.
Fixes: 5f32d186a8b1 ("x86/vlapic: don't silently accept bad vectors") Reported-by: Federico Serafini <federico.serafini@bugseng.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c42d9ec61f6d11e25fa77bd44dd11dad1edda268
master date: 2024-09-24 14:23:29 +0200
Use expect to invoke QEMU so that we can terminate the test as soon as
we get the right string in the output instead of waiting until the
final timeout.
For the timeout, instead of hardcoding the value, use a Gitlab CI
variable "QEMU_TIMEOUT" that can be changed depending on the latest
status of the Gitlab CI runners.
[This backport skips the PPC and RISC scripts as well as the XTF
scripts on x86]
The Yocto jobs take a long time to run. We are changing Gitlab ARM64
runners and the new runners might not be able to finish the Yocto jobs
in a reasonable time.
For now, disable the Yocto jobs by turning them into "manual" triggers
(they need to be manually executed).
Jan Beulich [Tue, 13 Aug 2024 14:49:45 +0000 (16:49 +0200)]
x86/pass-through: document as security-unsupported when sharing resources
When multiple devices share resources and one of them is to be passed
through to a guest, security of the entire system and of respective
guests individually cannot really be guaranteed without knowing
internals of any of the involved guests. Therefore such a configuration
cannot really be security-supported, yet making that explicit was so far
missing.
Teddy Astie [Tue, 13 Aug 2024 14:49:06 +0000 (16:49 +0200)]
x86/IOMMU: move tracking in iommu_identity_mapping()
If for some reason xmalloc() fails after having mapped the reserved
regions, an error is reported, but the regions remain mapped in the P2M.
Similarly if an error occurs during set_identity_p2m_entry() (except on
the first call), the partial mappings of the region would be retained
without being tracked anywhere, and hence without there being a way to
remove them again from the domain's P2M.
Move the setting up of the list entry ahead of trying to map the region.
In cases other than the first mapping failing, keep record of the full
region, such that a subsequent unmapping request can be properly torn
down.
To compensate for the potentially excess unmapping requests, don't log a
warning from p2m_remove_identity_entry() when there really was nothing
mapped at a given GFN.
This is XSA-460 / CVE-2024-31145.
Fixes: 2201b67b9128 ("VT-d: improve RMRR region handling") Fixes: c0e19d7c6c42 ("IOMMU: generalize VT-d's tracking of mapped RMRR regions") Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: beadd68b5490ada053d72f8a9ce6fd696d626596
master date: 2024-08-13 16:36:40 +0200
Matthew Barnes [Thu, 8 Aug 2024 11:53:26 +0000 (13:53 +0200)]
tools/lsevtchn: Use errno macro to handle hypercall error cases
Currently, lsevtchn aborts its event channel enumeration when it hits
an event channel that is owned by Xen.
lsevtchn does not distinguish between different hypercall errors, which
results in lsevtchn missing potential relevant event channels with
higher port numbers.
Use the errno macro to distinguish between hypercall errors, and
continue event channel enumeration if the hypercall error is not
critical to enumeration.
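A hedged sketch of the loop's shape (the specific errno values treated as non-fatal are illustrative):

    for (port = 0; port < max_ports; port++) {
        xc_evtchn_status_t status = { .dom = domid, .port = port };

        if (xc_evtchn_status(xch, &status) < 0) {
            if (errno == EACCES)   /* e.g. a Xen-owned channel: skip it */
                continue;
            break;                 /* other errors end the enumeration */
        }
        /* ... print details for this port ... */
    }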
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e92a453c8db8bba62d6be3006079e2b9990c3978
master date: 2024-08-02 08:43:57 +0200
George Dunlap [Thu, 8 Aug 2024 11:53:01 +0000 (13:53 +0200)]
xen/hvm: Don't skip MSR_READ trace record
Commit 37f074a3383 ("x86/msr: introduce guest_rdmsr()") introduced a
function to combine the MSR_READ handling between PV and HVM.
Unfortunately, by returning directly, it skipped the trace generation,
leading to gaps in the trace record, as well as xenalyze errors like
this:
hvm_generic_postprocess: d2v0 Strange, exit 7c(VMEXIT_MSR) missing a handler
Roger Pau Monné [Thu, 8 Aug 2024 11:52:11 +0000 (13:52 +0200)]
x86/altcall: further refine clang workaround
The current code in ALT_CALL_ARG() won't successfully work around the clang
code-generation issue if the arg parameter has a size that's not a power of 2.
While there are no such sized parameters at the moment, improve the workaround
to also be effective when such sizes are used.
Instead of using a union with a long use an unsigned long that's first
initialized to 0 and afterwards set to the argument value.
Roger Pau Monné [Thu, 8 Aug 2024 11:51:39 +0000 (13:51 +0200)]
x86/dom0: fix restoring %cr3 and the mapcache override on PV build error
One of the error paths in the PV dom0 builder section that runs on the guest
page-tables wasn't restoring the Xen value of %cr3, nor removing the
mapcache override.
Andrew Cooper [Thu, 8 Aug 2024 11:51:09 +0000 (13:51 +0200)]
XSM/domctl: Fix permission checks on XEN_DOMCTL_createdomain
The XSM checks for XEN_DOMCTL_createdomain are problematic. There's a split
between xsm_domctl() called early, and flask_domain_create() called quite late
during domain construction.
All XSM implementations except Flask have a simple IS_PRIV check in
xsm_domctl(), and operate as expected when an unprivileged domain tries to
make a hypercall.
Flask however foregoes any action in xsm_domctl() and defers everything,
including the simple "is the caller permitted to create a domain" check, to
flask_domain_create().
As a consequence, when XSM Flask is active, and irrespective of the policy
loaded, all domains irrespective of privilege can:
* Mutate the global 'rover' variable, used to track the next free domid.
Therefore, all domains can cause a domid wraparound, and combined with a
voluntary reboot, choose their own domid.
* Cause a reasonable amount of a domain to be constructed before ultimately
failing for permission reasons, including the use of settings outside of
supported limits.
In order to remediate this, pass the ssidref into xsm_domctl() and at least
check that the calling domain is privileged enough to create domains.
Take the opportunity to also fix the sign of the cmd parameter to be unsigned.
This issue has not been assigned an XSA, because Flask is experimental and not
security supported.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: ee32b9b29af449d38aad0a1b3a81aaae586f5ea7
master date: 2024-07-30 17:42:17 +0100
Ross Lagerwall [Thu, 8 Aug 2024 11:50:36 +0000 (13:50 +0200)]
bunzip2: fix rare decompression failure
The decompression code parses a huffman tree and counts the number of
symbols for a given bit length. In rare cases, there may be >= 256
symbols with a given bit length, causing the unsigned char to overflow.
This causes a decompression failure later when the code tries and fails to
find the bit length for a given symbol.
Since the maximum number of symbols is 258, use unsigned short instead.
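A hedged sketch of the counting step (array and constant names assumed to resemble the decompressor's):

    /* Up to 258 symbols may share one bit length; an unsigned char
     * counter wraps at 256, so count in an unsigned short instead. */
    unsigned short bitCount[MAX_HUFCODE_BITS + 1] = { 0 };
    unsigned int i;

    for (i = 0; i < symCount; i++)
        bitCount[length[i]]++;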
Fixes: ab77e81f6521 ("x86/dom0: support bzip2 and lzma compressed bzImage payloads") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 303d3ff85c90ee4af4bad4e3b1d4932fa2634d64
master date: 2024-07-30 11:55:56 +0200
x86/altcall: fix clang code-gen when using altcall in loop constructs
Yet another clang code generation issue when using altcalls.
The issue this time is with using loop constructs around alternative_{,v}call
instances using parameter types smaller than the register size.
Given the following example code:
static void bar(bool b)
{
unsigned int i;
for ( i = 0; i < 10; i++ )
{
int ret_;
register union {
bool e;
unsigned long r;
} di asm("rdi") = { .e = b };
register unsigned long si asm("rsi");
register unsigned long dx asm("rdx");
register unsigned long cx asm("rcx");
register unsigned long r8 asm("r8");
register unsigned long r9 asm("r9");
register unsigned long r10 asm("r10");
register unsigned long r11 asm("r11");
Clang will generate machine code that only resets the low 8 bits of %rdi
between loop calls, leaving the rest of the register possibly containing
garbage from the use of %rdi inside the called function. Note also that clang
doesn't truncate the input parameters at the callee, thus breaking the psABI.
Fix this by turning the `e` element in the anonymous union into an array that
consumes the same space as an unsigned long, as this forces clang to reset the
whole %rdi register instead of just the low 8 bits.
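A hedged sketch of the fix's shape (the real change lives in the ALT_CALL_ARG() macro and differs in detail):

    /* Padding the union member to full register width forces clang to
     * initialise all of %rdi, not just its low byte. */
    register union {
        bool e[sizeof(unsigned long)];   /* array fills the register */
        unsigned long r;
    } di asm("rdi") = { .e[0] = b };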
Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function parameters on clang') Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d51b2f5ea1915fe058f730b0ec542cf84254fca0
master date: 2024-07-23 13:59:30 +0200
x86/physdev: Return pirq that irq was already mapped to
Fix a bug introduced by 0762e2502f1f ("x86/physdev: factor out the code to
allocate and map a pirq"). After that refactoring, when pirq < 0 and
current_pirq > 0, i.e. the caller wants a free pirq allocated for the irq but
the irq already has a mapped pirq, the function returns the negative pirq and
thus fails. The logic before the refactoring was different: it returned the
current_pirq that the irq was already mapped to, making the call succeed.
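A hedged sketch of the restored behaviour (variable names assumed from the description):

    if ( pirq < 0 )                  /* caller wants any free pirq */
    {
        if ( current_pirq > 0 )      /* irq already has a mapping */
            return current_pirq;     /* reuse it and succeed */
        /* ... otherwise allocate a fresh pirq ... */
    }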
Fixes: 0762e2502f1f ("x86/physdev: factor out the code to allocate and map a pirq") Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0d2b87b5adfc19e87e9027d996db204c66a47f30
master date: 2024-07-08 14:46:12 +0100
Jan Beulich [Tue, 16 Jul 2024 12:14:43 +0000 (14:14 +0200)]
x86/IRQ: avoid double unlock in map_domain_pirq()
Forever since its introduction the main loop in the function dealing
with multi-vector MSI had error exit points ("break") with different
properties: In one case no IRQ descriptor lock is being held.
Nevertheless the subsequent error cleanup path assumed such a lock would
uniformly need releasing. Identify the case by setting "desc" to NULL,
thus allowing the unlock to be skipped as necessary.
Jan Beulich [Thu, 4 Jul 2024 14:57:29 +0000 (16:57 +0200)]
evtchn: build fix for Arm
When backporting daa90dfea917 ("pirq_cleanup_check() leaks") I neglected
to pay attention to it depending on 13a7b0f9f747 ("restrict concept of
pIRQ to x86"). That one doesn't want backporting imo, so use / adjust
custom #ifdef-ary to address the immediate issue of pirq_cleanup_check()
not being available on Arm.
Jan Beulich [Thu, 4 Jul 2024 12:14:49 +0000 (14:14 +0200)]
x86/entry: don't clear DF when raising #UD for lack of syscall handler
While doing so is intentional when invoking the actual callback, to
mimic a hard-coded SYSCALL_MASK / FMASK MSR, the same should not be done
when no handler is available and hence #UD is raised.
Fixes: ca6fcf4321b3 ("x86/pv: Inject #UD for missing SYSCALL callbacks") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d2fe9ab3048d503869ec81bc49db07e55a4a2386
master date: 2024-07-02 12:01:21 +0200
Jan Beulich [Thu, 4 Jul 2024 12:14:16 +0000 (14:14 +0200)]
cmdline: document and enforce "extra_guest_irqs" upper bounds
PHYSDEVOP_pirq_eoi_gmfn_v<N> accepting just a single GFN implies that no
more than 32k pIRQ-s can be used by a domain on x86. Document this upper
bound.
To also enforce the limit, (ab)use both arch_hwdom_irqs() (changing its
parameter type) and setup_system_domains(). This is primarily to avoid
exposing the two static variables or introducing yet further arch hooks.
While touching arch_hwdom_irqs() also mark it hwdom-init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
amend 'cmdline: document and enforce "extra_guest_irqs" upper bounds'
Address late review comments for what is now commit 17f6d398f765:
- bound max_irqs right away against nr_irqs
- introduce a #define for a constant used twice
Requested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 17f6d398f76597f8009ec0530842fb8705ece7ba
master date: 2024-07-02 12:00:27 +0200
master commit: 1f56accba33ffea0abf7d1c6384710823d10cbd6
master date: 2024-07-03 14:03:27 +0200
Andrew Cooper [Thu, 4 Jul 2024 12:12:31 +0000 (14:12 +0200)]
tools/dombuilder: Correct the length calculation in xc_dom_alloc_segment()
xc_dom_alloc_segment() is passed a size in bytes, calculates a size in pages
from it, then fills in the new segment information with a bytes value
re-calculated from the number of pages.
This causes the module information given to the guest (MB, or PVH) to have
incorrect sizes; specifically, sizes rounded up to the next page.
This in turn is problematic for Xen. When Xen finds a gzipped module, it
peeks at the end metadata to judge the decompressed size, which is a -4
backreference from the reported end of the module.
Fill in seg->vend using the correct number of bytes.
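A hedged sketch of the fix's shape (the allocation itself stays page-granular):

    seg->vstart = start;
    seg->vend   = start + size;   /* exact byte length, not pages << shift */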
Fixes: ea7c8a3d0e82 ("libxc: reorganize domain builder guest memory allocator") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 4c3a618b0adaa0cd59e0fa0898bb60978b8b3a5f
master date: 2024-07-02 10:50:18 +0100
Jan Beulich [Thu, 4 Jul 2024 12:11:57 +0000 (14:11 +0200)]
pirq_cleanup_check() leaks
Its original introduction had two issues: For one the "common" part of
the checks (carried out in the macro) was inverted. And then after
removal from the radix tree the structure wasn't scheduled for freeing.
(All structures still left in the radix tree would be freed upon domain
destruction, though.)
For the freeing to be safe even if it didn't use RCU (i.e. to avoid use-
after-free), re-arrange checks/operations in evtchn_close(), such that
the pointer wouldn't be used anymore after calling pirq_cleanup_check()
(noting that unmap_domain_pirq_emuirq() itself calls the function in the
success case).
Andrew Cooper [Thu, 4 Jul 2024 12:11:36 +0000 (14:11 +0200)]
tools/xl: Open xldevd.log with O_CLOEXEC
`xl devd` has been observed leaking /var/log/xldevd.log into children.
Note this is specifically safe; dup2() leaves O_CLOEXEC disabled on newfd, so
after setting up stdout/stderr, it's only the logfile fd which will close on
exec().
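A hedged sketch of the arrangement (path as in the prose):

    /* Only the logfile fd carries O_CLOEXEC; dup2() does not copy the
     * flag, so the stdio duplicates stay open across exec(). */
    int logfile = open("/var/log/xldevd.log",
                       O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644);

    dup2(logfile, STDOUT_FILENO);
    dup2(logfile, STDERR_FILENO);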
Link: https://github.com/QubesOS/qubes-issues/issues/8292 Reported-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Demi Marie Obenour <demi@invisiblethingslab.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: ba52b3b624e4a1a976908552364eba924ca45430
master date: 2024-06-24 16:22:59 +0100
Matthew Barnes [Thu, 4 Jul 2024 12:11:03 +0000 (14:11 +0200)]
x86/ioapic: Fix signed shifts in io_apic.c
There exist bit shifts in the IOAPIC code where signed integers are
shifted to the left by up to 31 bits, which is undefined behaviour.
This patch fixes this by changing the integers from signed to unsigned.
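A small illustration of the undefined behaviour being avoided:

    unsigned int pin = 31;
    unsigned int mask = 1u << pin;   /* well defined: unsigned shift */
    /* int bad = 1 << pin;              UB: shifts into the sign bit */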
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c5746b021e573184fb92b601a0e93a295485054e
master date: 2024-06-21 15:09:26 +0100
Andrew Cooper [Thu, 4 Jul 2024 12:10:40 +0000 (14:10 +0200)]
tools: Drop libsystemd as a dependency
There are no more users, and we want to dissuade people from introducing new
users just for sd_notify() and friends. Drop the dependency.
We still want the overall --with{,out}-systemd to gate the generation of the
service/unit/mount/etc files.
Rerun autogen.sh, and mark the dependency as removed in the build containers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Christian Lindig <christian.lindig@cloud.com>
tools: (Actually) drop libsystemd as a dependency
When reinstating some of systemd.m4 between v1 and v2, I reintroduced a little
too much. While {c,o}xenstored are indeed no longer linked against
libsystemd, ./configure still looks for it.
Drop this too.
Fixes: ae26101f6bfc ("tools: Drop libsystemd as a dependency") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: ae26101f6bfc8185adcdb9165d469bdc467780db
master date: 2024-05-23 15:04:40 +0100
master commit: 6ef4fa1e7fe78c1dae07b451292b07facfce4902
master date: 2024-05-30 12:15:25 +0100
tools/tests: don't let test-xenstore write nodes exceeding default size
Today test-xenstore will write nodes with 3000 bytes node data. This
size exceeds the default quota for the allowed node size. While this
works in dom0 with C-xenstored, OCAML-xenstored does not like it.
Use a size of 2000 instead, which is lower than the allowed default
node size of 2048.
Jan Beulich [Thu, 4 Jul 2024 12:06:19 +0000 (14:06 +0200)]
x86: re-run exception-from-stub recovery selftests with CET-SS enabled
On the BSP, shadow stacks are enabled only relatively late in the
booting process. They in particular aren't active yet when initcalls are
run. Keep the testing there, but invoke that testing a 2nd time when
shadow stacks are active, to make sure we won't regress that case after
addressing XSA-451.
While touching this code, switch the guard from NDEBUG to CONFIG_DEBUG,
such that IS_ENABLED() can validly be used at the new call site.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cfe3ad67127b86e1b1c06993b86422673a51b050
master date: 2024-02-27 13:49:52 +0100
Roger Pau Monné [Wed, 26 Jun 2024 11:44:08 +0000 (13:44 +0200)]
x86/irq: forward pending interrupts to new destination in fixup_irqs()
fixup_irqs() is used to evacuate interrupts from to be offlined CPUs. Given
the CPU is to become offline, the normal migration logic used by Xen where the
vector in the previous target(s) is left configured until the interrupt is
received on the new destination is not suitable.
Instead attempt to do as much as possible in order to prevent losing
interrupts. If fixup_irqs() is called from the CPU to be offlined (as is
currently the case for CPU hot unplug) attempt to forward pending vectors when
interrupts that target the current CPU are migrated to a different destination.
Additionally, for interrupts that have already been moved from the current CPU
prior to the call to fixup_irqs() but that haven't been delivered to the new
destination (iow: interrupts with move_in_progress set and the current CPU set
in ->arch.old_cpu_mask) also check whether the previous vector is pending and
forward it to the new destination.
This allows us to remove the window with interrupts enabled at the bottom of
fixup_irqs(). Such a window wasn't safe anyway: references to the CPU to become
offline are removed from interrupts masks, but the per-CPU vector_irq[] array
is not updated to reflect those changes (as the CPU is going offline anyway).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e2bb28d621584fce15c907002ddc7c6772644b64
master date: 2024-06-20 12:09:32 +0200
Andrew Cooper [Wed, 26 Jun 2024 11:43:44 +0000 (13:43 +0200)]
x86/cpuid: Fix handling of XSAVE dynamic leaves
[ This is a minimal backport of commit 71cacfb035f4 ("x86/cpuid: Fix handling
of XSAVE dynamic leaves") to fix the bugs without depending on the large
rework of XSTATE handling in Xen 4.19 ]
First, if XSAVE is available in hardware but not visible to the guest, the
dynamic leaves shouldn't be filled in.
Second, the comment concerning XSS state is wrong. VT-x doesn't manage
host/guest state automatically, but there is provision for "host only" bits to
be set, so the implications are still accurate.
In Xen 4.18, no XSS states are supported, so it's safe to keep deferring to
real hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71cacfb035f4a78ee10970dc38a3baa04d387451
master date: 2024-06-19 13:00:06 +0100
Andrew Cooper [Wed, 26 Jun 2024 11:43:19 +0000 (13:43 +0200)]
x86/xstate: Fix initialisation of XSS cache
The clobbering of this_cpu(xcr0) and this_cpu(xss) to architecturally invalid
values is to force the subsequent set_xcr0() and set_msr_xss() to reload the
hardware register.
While XCR0 is reloaded in xstate_init(), MSR_XSS isn't. This causes
get_msr_xss() to return the invalid value, and logic of the form:
old = get_msr_xss();
set_msr_xss(new);
...
set_msr_xss(old);
to try and restore said invalid value.
The architecturally invalid value must be purged from the cache, meaning the
hardware register must be written at least once. This in turn highlights that
the invalid value must only be used in the case that the hardware register is
available.
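A hedged sketch of the fix's shape in xstate_init():

    /* Purge the poisoned cache: write the register once, guarded on
     * the feature actually being available. */
    if ( cpu_has_xsaves )
        set_msr_xss(0);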
Fixes: f7f4a523927f ("x86/xstate: reset cached register values on resume") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9e6dbbe8bf400aacb99009ddffa91d2a0c312b39
master date: 2024-06-19 13:00:06 +0100
Andrew Cooper [Wed, 26 Jun 2024 11:42:30 +0000 (13:42 +0200)]
xen/ubsan: Fix UB in type_descriptor declaration
struct type_descriptor is arranged with a NUL terminated string following the
kind/info fields.
The only reason this doesn't trip UBSAN detection itself (on more modern
compilers at least) is because struct type_descriptor is only referenced in
suppressed regions.
Switch the declaration to be a real flexible member. No functional change.
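A hedged sketch of the declaration change (field names as in the Linux import):

    struct type_descriptor {
        u16 type_kind;
        u16 type_info;
        char type_name[];   /* flexible array member, replacing a
                               fixed-size array used as if it were one */
    };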
Fixes: 00fcf4dd8eb4 ("xen/ubsan: Import ubsan implementation from Linux 4.13") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bd59af99700f075d06a6d47a16f777c9519928e0
master date: 2024-06-18 14:55:04 +0100
Roger Pau Monné [Wed, 26 Jun 2024 11:42:05 +0000 (13:42 +0200)]
x86/irq: handle moving interrupts in _assign_irq_vector()
Currently there's logic in fixup_irqs() that attempts to prevent
_assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
interrupts from the CPUs not present in the input mask. The current logic in
fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
_assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
to deal with interrupts that have either move_{in_progress,cleanup_count} set
and no remaining online CPUs in ->arch.cpu_mask.
If _assign_irq_vector() is requested to move an interrupt in the state
described above, first attempt to see if ->arch.old_cpu_mask contains any valid
CPUs that could be used as fallback, and if that's the case do move the
interrupt back to the previous destination. Note this is easier because the
vector hasn't been released yet, so there's no need to allocate and setup a new
vector on the destination.
Due to the logic in fixup_irqs() that clears offline CPUs from
->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
shouldn't be possible to get into _assign_irq_vector() with
->arch.move_{in_progress,cleanup_count} set but no online CPUs in
->arch.old_cpu_mask.
However, if ->arch.move_{in_progress,cleanup_count} is set and the interrupt
has also changed affinity, it's possible the members of ->arch.old_cpu_mask
are no longer part of the affinity set. In that case, move the interrupt to a
different CPU that is part of the provided mask and keep the current
->arch.old_{cpu_mask,vector} for the pending interrupt movement to be
completed.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 369558924a642bbb0cb731e9a3375958867cb17b
master date: 2024-06-18 15:15:10 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:41:35 +0000 (13:41 +0200)]
x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
Given the current logic it's possible for ->arch.old_cpu_mask to get out of
sync: if a CPU set in old_cpu_mask is offlined and then onlined
again without old_cpu_mask having been updated, the data in the mask will no
longer be accurate, as when brought back online the CPU will no longer have
old_vector configured to handle the old interrupt source.
If there's an interrupt movement in progress, and the to be offlined CPU (which
is the call context) is in the old_cpu_mask, clear it and update the mask, so
it doesn't contain stale data.
Note that when the system is going down fixup_irqs() will be called by
smp_send_stop() from CPU 0 with a mask with only CPU 0 on it, effectively
asking to move all interrupts to the current caller (CPU 0) which is the only
CPU to remain online. In that case we don't care to migrate interrupts that
are in the process of being moved, as it's likely we won't be able to move all
interrupts to CPU 0 due to vector shortage anyway.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 817d1cd627be668c358d038f0fadbf7d24d417d3
master date: 2024-06-18 15:14:49 +0200
Jan Beulich [Wed, 26 Jun 2024 11:41:05 +0000 (13:41 +0200)]
x86/Intel: unlock CPUID earlier for the BSP
Intel CPUs have an MSR bit to limit CPUID enumeration to leaf two. If
this bit is set by the BIOS then CPUID evaluation does not work when
data from any leaf greater than two is needed; early_cpu_init() in
particular wants to collect leaf 7 data.
Cure this by unlocking CPUID right before evaluating anything which
depends on the maximum CPUID leaf being greater than two.
Inspired by (and description cloned from) Linux commit 0c2f6d04619e
("x86/topology/intel: Unlock CPUID before evaluating anything").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: fa4d026737a47cd1d66ffb797a29150b4453aa9f
master date: 2024-06-18 15:12:44 +0200
Jan Beulich [Wed, 26 Jun 2024 11:40:35 +0000 (13:40 +0200)]
x86/EPT: drop questionable mfn_valid() from epte_get_entry_emt()
mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
access to actual MMIO space should not generally be restricted to UC
only; especially video frame buffer accesses are unduly affected by such
a restriction.
Since, as of 777c71d31325 ("x86/EPT: avoid marking non-present entries
for re-configuring"), the function won't be called with INVALID_MFN or,
worse, truncated forms thereof anymore, we can fully drop that check.
Fixes: 81fd0d3ca4b2 ("x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 4fdd8d75566fdad06667a79ec0ce6f43cc466c54
master date: 2024-06-13 16:55:22 +0200
Jan Beulich [Wed, 26 Jun 2024 11:40:11 +0000 (13:40 +0200)]
x86/EPT: avoid marking non-present entries for re-configuring
For non-present entries EMT, like most other fields, is meaningless to
hardware. Make the logic in ept_set_entry() setting the field (and iPAT)
conditional upon dealing with a present entry, leaving the value at 0
otherwise. This has two effects for epte_get_entry_emt() which we'll
want to leverage subsequently:
1) The call moved here now won't be issued with INVALID_MFN anymore (a
respective BUG_ON() is being added).
2) Neither of the other two calls could now be issued with a truncated
form of INVALID_MFN anymore (as long as there's no bug anywhere
marking an entry present when that was populated using INVALID_MFN).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 777c71d31325bc55ba1cc3f317d4155fe519ab0b
master date: 2024-06-13 16:54:17 +0200
Jan Beulich [Wed, 26 Jun 2024 11:39:44 +0000 (13:39 +0200)]
x86/EPT: correct special page checking in epte_get_entry_emt()
mfn_valid() granularity is (currently) 256Mb. Therefore the start of a
1Gb page passing the test doesn't necessarily mean all parts of such a
range would also pass. Yet using the result of mfn_to_page() on an MFN
which doesn't pass mfn_valid() checking is liable to result in a crash
(the invocation of mfn_to_page() alone is presumably "just" UB in such a
case).
Fixes: ca24b2ffdbd9 ("x86/hvm: set 'ipat' in EPT for special pages") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5540b94e8191059eb9cbbe98ac316232a42208f6
master date: 2024-06-13 16:53:34 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:39:11 +0000 (13:39 +0200)]
x86/irq: limit interrupt movement done by fixup_irqs()
The current check used in fixup_irqs() to decide whether to move around
interrupts is based on the affinity mask, but such mask can have all bits set,
and hence is unlikely to be a subset of the input mask. For example if an
interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
an all set CPU mask would cause that interrupt to be shuffled around
unconditionally.
What fixup_irqs() cares about is evacuating interrupts from CPUs not set on the
input CPU mask, and for that purpose it should check whether the interrupt is
assigned to a CPU not present in the input mask. Assume that ->arch.cpu_mask
is a subset of the ->affinity mask, and keep the current logic that resets the
->affinity mask if the interrupt has to be shuffled around.
Doing the affinity movement based on ->arch.cpu_mask requires removing the
special handling to ->arch.cpu_mask done for high priority vectors, otherwise
the adjustment done to cpu_mask makes them always skip the CPU interrupt
movement.
While there also adjust the comment as to the purpose of fixup_irqs().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c7564d7366d865cc407e3d64bca816d07edee174
master date: 2024-06-12 14:30:40 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:38:36 +0000 (13:38 +0200)]
x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
Due to the current rwlock logic, if the CPU calling get_cpu_maps() does
so from a cpu_hotplug_{begin,done}() region the function will still
return success, because a CPU taking the rwlock in read mode after
having taken it in write mode is allowed. Such corner case makes using
get_cpu_maps() alone not enough to prevent using the shorthand in CPU
hotplug regions.
Introduce a new helper to detect whether the current caller is between a
cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
shorthand usage.
Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 171c52fba5d94e050d704770480dcb983490d0ad
master date: 2024-06-12 14:29:31 +0200
Roger Pau Monné [Wed, 26 Jun 2024 11:37:20 +0000 (13:37 +0200)]
x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
When adjusting move_cleanup_count to account for CPUs that are offline also
adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
those again and create an imbalance in move_cleanup_count.
Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e63209d3ba2fd1b2f232babd14c9c679ffa7b09a
master date: 2024-06-10 10:33:22 +0200
Andrew Cooper [Wed, 26 Jun 2024 11:36:13 +0000 (13:36 +0200)]
x86/ucode: Further fixes to identify "ucode already up to date"
When the revision in hardware is newer than anything Xen has to hand,
'microcode_cache' isn't set up. Then, `xen-ucode` initiates the update
because it doesn't know whether the revisions across the system are symmetric
or not. This involves the patch getting all the way into the
apply_microcode() hooks before being found to be too old.
This is all a giant mess and needs an overhaul, but in the short term simply
adjust the apply_microcode() to return -EEXIST.
Also, unconditionally print the preexisting microcode revision on boot. It's
relevant information which is otherwise unavailable if Xen doesn't find new
microcode to use.
Fixes: 648db37a155a ("x86/ucode: Distinguish "ucode already up to date"") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 977d98e67c2e929c62aa1f495fc4c6341c45abb5
master date: 2024-05-16 13:59:11 +0100
Roger Pau Monné [Tue, 21 May 2024 08:25:39 +0000 (10:25 +0200)]
x86/mtrr: avoid system wide rendezvous when setting AP MTRRs
There's no point in forcing a system wide update of the MTRRs on all processors
when there are no changes to be propagated. On AP startup it's only the AP
that needs to write the system wide MTRR values in order to match the rest of
the already online CPUs.
We have occasionally seen the watchdog trigger during `xen-hptool cpu-online`
in one Intel Cascade Lake box with 448 CPUs due to the re-setting of the MTRRs
on all the CPUs in the system.
While there adjust the comment to clarify why the system-wide resetting of the
MTRR registers is not needed for the purposes of mtrr_ap_init().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: abd00b037da5ffa4e8c4508a5df0cd6eabb805a4
master date: 2024-05-15 19:59:52 +0100
Leigh Brown [Tue, 21 May 2024 08:25:30 +0000 (10:25 +0200)]
tools/xentop: Fix cpu% sort order
In compare_cpu_pct(), there is a double -> unsigned long long conversion when
calling compare(). In C, this discards the fractional part, resulting in an
out-of-order sort.
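A hedged sketch of the fix's shape: compare the doubles directly, so fractional differences still order.

    static int compare_dbl(double d1, double d2)
    {
        if (d1 < d2)
            return -1;
        if (d1 > d2)
            return 1;
        return 0;
    }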
Jan Beulich [Tue, 21 May 2024 08:25:08 +0000 (10:25 +0200)]
x86: respect mapcache_domain_init() failing
The function itself properly handles and hands onwards failure from
create_perdomain_mapping(). Therefore its caller should respect possible
failure, too.
Fixes: 4b28bf6ae90b ("x86: re-introduce map_domain_page() et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7270fdc7a0028d4b7b26fd1b36c6b9e97abcf3da
master date: 2024-05-15 19:59:52 +0100
Juergen Gross [Tue, 21 May 2024 08:24:26 +0000 (10:24 +0200)]
xen/sched: set all sched_resource data inside locked region for new cpu
When adding a cpu to a scheduler, set all data items of struct
sched_resource inside the locked region, as otherwise a race might
happen (e.g. when trying to access the cpupool of the cpu).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: a8c6c623192e ("sched: clarify use cases of schedule_cpu_switch()") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d104a07524ffc92ae7a70dfe192c291de2a563cc
master date: 2024-05-15 19:59:52 +0100
libxl: Fix handling XenStore errors in device creation
If xenstored runs out of memory it is possible for it to fail operations
that should succeed. libxl wasn't robust against this, and could fail
to ensure that the TTY path of a non-initial console was created and
read-only for guests. This doesn't qualify for an XSA because guests
should not be able to run xenstored out of memory, but it still needs to
be fixed.
Add the missing error checks to ensure that all errors are properly
handled and that at no point can a guest make the TTY path of its
frontend directory writable.
Roger Pau Monné [Tue, 21 May 2024 08:23:27 +0000 (10:23 +0200)]
libxl: fix population of the online vCPU bitmap for PVH
libxl passes some information to libacpi to create the ACPI table for a PVH
guest, and among that information it's a bitmap of which vCPUs are online
which can be less than the maximum number of vCPUs assigned to the domain.
While the population of the bitmap is done correctly for HVM based on the
number of online vCPUs, for PVH the population of the bitmap is done based on
the number of maximum vCPUs allowed. This leads to all local APIC entries in
the MADT being set as enabled, which contradicts the data in xenstore if the
number of online vCPUs differs from the maximum.
Fix by copying the internal libxl bitmap that's populated based on the vCPUs
parameter.
Reported-by: Arthur Borsboom <arthurborsboom@gmail.com> Link: https://gitlab.com/libvirt/libvirt/-/issues/399 Reported-by: Leigh Brown <leigh@solinno.co.uk> Fixes: 14c0d328da2b ('libxl/acpi: Build ACPI tables for HVMlite guests') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Leigh Brown <leigh@solinno.co.uk> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5cc7347b04b2d0a3133754c7a9b936f614ec656a
master date: 2024-05-11 00:13:43 +0100
Andrew Cooper [Tue, 21 May 2024 08:22:08 +0000 (10:22 +0200)]
x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake
Ever since Xen 4.14, there has been a latent bug with migration.
While some toolstacks can level the features properly, they don't shrink
feat.max_subleaf when all features have been dropped. This is because
we *still* have not completed the toolstack side work for full CPU Policy
objects.
As a consequence, even when properly feature levelled, VMs can't migrate
"backwards" across hardware which reduces feat.max_subleaf. One such example
is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
Extend the max policies feat.max_subleaf to the highest number Xen knows
about, but leave the default policies matching the host. This will allow VMs
with a higher feat.max_subleaf than strictly necessary to migrate in.
Eventually we'll manage to teach the toolstack how to avoid creating such VMs
in the first place, but there's still more work to do there.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2330b51df267e20e66bbba6c5bf08f0570ed58b
master date: 2024-05-07 16:56:46 +0100
Jan Beulich [Tue, 21 May 2024 08:20:58 +0000 (10:20 +0200)]
VT-d: correct ATS checking for root complex integrated devices
Spec version 4.1 says
"The ATSR structures identifies PCI Express Root-Ports supporting
Address Translation Services (ATS) transactions. Software must enable
ATS on endpoint devices behind a Root Port only if the Root Port is
reported as supporting ATS transactions."
Clearly root complex integrated devices aren't "behind root ports",
matching my observation on a SapphireRapids system having an ATS-
capable root complex integrated device. Hence for such devices we
shouldn't try to locate a corresponding ATSR.
Since both pci_find_ext_capability() and pci_find_cap_offset() return
"unsigned int", change "pos" to that type at the same time.
Fixes: 903b93211f56 ("[VTD] laying the ground work for ATS") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 04e31583bab97e5042a44a1d00fce2760272635f
master date: 2024-05-06 09:22:45 +0200
Jason Andryuk [Tue, 21 May 2024 08:19:43 +0000 (10:19 +0200)]
xen/xsm: Wire up get_dom0_console
An XSM hook for get_dom0_console is currently missing. Using XSM with
a PVH dom0 shows:
(XEN) FLASK: Denying unknown platform_op: 64.
Wire up the hook, and allow it for dom0.
Fixes: 4dd160583c ("x86/platform: introduce hypercall to get initial video console settings") Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 647f7e50ebeeb8152974cad6a12affe474c74513
master date: 2024-04-30 08:33:41 +0200
It's currently too restrictive by just checking whether there's a BHB clearing
sequence selected. It should instead check whether BHB clearing is used on
entry from PV or HVM specifically.
Switch to use opt_bhb_entry_{pv,hvm} instead, and then remove cpu_has_bhb_seq
since it no longer has any users.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: 954c983abcee ('x86/spec-ctrl: Software BHB-clearing sequences') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 656ae8f1091bcefec9c46ec3ea3ac2118742d4f6
master date: 2024-04-25 16:37:01 +0200
x86/spec: fix reporting of BHB clearing usage from guest entry points
Reporting whether the BHB clearing on entry is done for the different domains
types based on cpu_has_bhb_seq is unhelpful, as that variable signals whether
there's a BHB clearing sequence selected, but that alone doesn't imply that
such sequence is used from the PV and/or HVM entry points.
Instead use opt_bhb_entry_{pv,hvm} which do signal whether BHB clearing is
performed on entry from PV/HVM.
Fixes: 689ad48ce9cf ('x86/spec-ctrl: Wire up the Native-BHI software sequences') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 049ab0b2c9f1f5edb54b505fef0bc575787dafe9
master date: 2024-04-25 16:35:56 +0200
Ross Lagerwall [Mon, 29 Apr 2024 07:36:04 +0000 (09:36 +0200)]
x86/rtc: Avoid UIP flag being set for longer than expected
In a test, OVMF reported an error initializing the RTC without
indicating the precise nature of the error. The only plausible
explanation I can find is as follows:
As part of the initialization, OVMF reads register C and then reads
register A repeatedly until the UIP flag is not set. If this takes longer
than 100 ms, OVMF fails and reports an error. This may happen with the
following sequence of events:
At guest time=0s, rtc_init() calls check_update_timer() which schedules
update_timer for t=(1 - 244us).
At t=1s, the update_timer function happens to have been called >= 244us
late. In the timer callback, it sets the UIP flag and schedules
update_timer2 for t=1s.
Before update_timer2 runs, the guest reads register C which calls
check_update_timer(). check_update_timer() stops the scheduled
update_timer2 and since the guest time is now outside of the update
cycle, it schedules update_timer for t=(2 - 244us).
The UIP flag will therefore be set for a whole second from t=1 to t=2
while the guest repeatedly reads register A waiting for the UIP flag to
clear. Fix it by clearing the UIP flag when scheduling update_timer.
I was able to reproduce this issue with a synthetic test and this
resolves the issue.
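A hedged sketch of the fix's shape (names as in Xen's RTC emulation; the exact code may differ):

    /* Drop UIP whenever update_timer is (re)armed, so a stale flag
     * can't survive until the next update cycle. */
    s->hw.cmos_data[RTC_REG_A] &= ~RTC_UIP;
    set_timer(&s->update_timer, next_update_time);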
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 43a07069863b419433dee12c9b58c1f7ce70aa97
master date: 2024-04-23 14:09:18 +0200
altcall: fix __alt_call_maybe_initdata so it's safe for livepatch
Setting alternative call variables as __init is not safe for use with
livepatch, as livepatches can rightfully introduce new alternative calls to
structures marked as __alt_call_maybe_initdata (possibly just indirectly due to
replacing existing functions that use those). Attempting to resolve those
alternative calls then results in page faults as the variable that holds the
function pointer address has been freed.
When livepatch is supported use the __ro_after_init attribute instead of
__initdata for __alt_call_maybe_initdata.
Fixes: f26bb285949b ('xen: Implement xen/alternative-call.h for use in common code') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: af4cd0a6a61cdb03bc1afca9478b05b0c9703599
master date: 2024-04-11 18:51:36 +0100
Andrew Cooper [Tue, 9 Apr 2024 20:39:51 +0000 (21:39 +0100)]
x86/entry: Fix build with older toolchains
Binutils older than 2.29 doesn't know INCSSPD.
Fixes: 8e186f98ce0e ("x86: Use indirect calls in reset-stack infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit a9fa82500818a8d8ce5f2843f1577bd2c29d088e)
Andrew Cooper [Fri, 22 Mar 2024 19:29:34 +0000 (19:29 +0000)]
x86/spec-ctrl: Support the "long" BHB loop sequence
Out of an abundance of caution, implement the long loop too, and allow for
it to be opted in to.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit d5887c0decbd90e798b24ed696628645b04632fb)
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Wire up the Native-BHI software sequences
In the absence of BHI_DIS_S, mitigating Native-BHI requires the use of a
software sequence.
Introduce a new bhb-seq= option to select between available sequences and
bhb-entry= to control the per-PV/HVM actions like we have for other blocks.
Activate the short sequence by default for PV and HVM guests on affected
hardware if BHI_DIS_S isn't present.
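For illustration, opting in from the hypervisor command line might look like
this (hedged; see docs/misc/xen-command-line.pandoc for the authoritative
bhb-seq=/bhb-entry= syntax):
    spec-ctrl=bhb-seq=short,bhb-entry=true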
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 689ad48ce9cf4c38297cd126e7e003a1c13a3b9d)
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Software BHB-clearing sequences
Implement clear_bhb_{tsx,loops}() as per the BHI guidance. The loops variant
is set up as the "short" sequence.
Introduce SCF_entry_bhb and extend SPEC_CTRL_ENTRY_* with a conditional call
to selected clearing routine.
Note that due to a limitation in the ALTERNATIVE capability, the TEST/JZ can't
be included alongside a CALL in a single alternative block. This is going to
require further work to untangle.
The BHB sequences (if used) must be after the restoration of Xen's
MSR_SPEC_CTRL value, which must be accounted for when judging whether it is
safe to skip the safety LFENCEs.
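In rough C terms (illustrative only; the real logic lives in the
SPEC_CTRL_ENTRY_* asm macros, and the bit position shown is made up):
    #define SCF_entry_bhb (1u << 6)          /* hypothetical bit position */

    extern void (*clear_bhb)(void);  /* clear_bhb_tsx or clear_bhb_loops */

    static inline void entry_bhb_clear(unsigned int scf)
    {
        if ( scf & SCF_entry_bhb )   /* the TEST/JZ guarding the CALL */
            clear_bhb();
    }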
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 954c983abceee97bf5f6230b9ae164f2c49a9aa9)
Andrew Cooper [Tue, 26 Mar 2024 19:01:37 +0000 (19:01 +0000)]
x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI
Introduce a "bhi-dis-s" boolean to match the other options we have for
MSR_SPEC_CTRL values. Also introduce bhi_calculations().
Use BHI_DIS_S whenever possible.
Guests which are levelled to be migration compatible with older CPUs can't see
BHI_DIS_S, and Xen must fill in the difference to make the guest safe. Use
the virt MSR_SPEC_CTRL infrastructure to force BHI_DIS_S behind the guest's
back.
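A sketch of "filling in the difference" (hypothetical helper and field
names; the real logic lives in bhi_calculations() and the virt MSR_SPEC_CTRL
plumbing):
    /* If the levelled policy hides BHI_DIS_S from the guest, force the
     * bit in the value loaded behind the guest's back on vmentry. */
    if ( !guest_policy_has_bhi_ctrl(d) )
        d->arch.spec_ctrl_forced |= SPEC_CTRL_BHI_DIS_S;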
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 62a1106415c5e8a49b45147ca84d54a58d471343)
Andrew Cooper [Sat, 6 Apr 2024 19:36:54 +0000 (20:36 +0100)]
x86/tsx: Expose RTM_ALWAYS_ABORT to guests
A TSX Abort is one option to mitigate Native-BHI, but a guest kernel doesn't get
to see this if Xen has turned RTM off using MSR_TSX_{CTRL,FORCE_ABORT}.
Therefore, the meaning of RTM_ALWAYS_ABORT has been adjusted to "XBEGIN won't
fault", and it should be exposed to guests so they can make a better decision.
Expose it in the max policy for any RTM-capable system. Offer it by default
only if RTM has been disabled.
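In sketch form (the feature-bitmap names max_fs/def_fs are stand-ins):
    /* Max policy: any RTM-capable part may usefully see the bit. */
    if ( cpu_has_rtm )
        __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, max_fs);

    /* Default policy: only advertise it when RTM has actually been
     * disabled, i.e. when XBEGIN genuinely always aborts. */
    if ( rtm_disabled )
        __set_bit(X86_FEATURE_RTM_ALWAYS_ABORT, def_fs);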
Update test-tsx to account for this new meaning. While adjusting the logic in
test_guest_policies(), take the opportunity to use feature names (now they're
available) to make the logic easier to follow.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c94e2105924347de0d9f32065370e802a20cc829)
Andrew Cooper [Fri, 22 Dec 2023 18:01:37 +0000 (18:01 +0000)]
x86: Drop INDIRECT_JMP
Indirect JMPs which are not tailcalls can lead to an unwelcome form of
speculative type confusion, and we've removed the uses of INDIRECT_JMP to
compensate. Remove the temptation to reintroduce new instances.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0b66d7ce3c0290eaad28bdafb35200052d012b14)
Andrew Cooper [Fri, 22 Dec 2023 17:44:48 +0000 (17:44 +0000)]
x86: Use indirect calls in reset-stack infrastructure
Mixing up JMP and CALL indirect targets leads to a very fun form of speculative
type confusion. A target which is expecting to be CALLed needs a return
address on the stack, and an indirect JMP doesn't place one there.
An indirect JMP which predicts to a target intending to be CALLed can end up
with a RET speculatively executing with a value from the JMPer's stack frame.
There are several ways to get indirect JMPs in Xen.
* From tailcall optimisations. These are safe because the compiler has
arranged the stack to point at the callee's return address.
* From jump tables. These are unsafe, but Xen is built with -fno-jump-tables
to work around several compiler issues.
* From reset_stack_and_jump_ind(), which is particularly unsafe. Because of
the additional stack adjustment made, the value picked up off the stack is
regs->r15 of the next vCPU to run.
In order to mitigate this type confusion, we want to make all indirect targets
be CALL targets, and remove the use of indirect JMP except via tailcall
optimisation.
Luckily due to XSA-348, all C target functions of reset_stack_and_jump_ind()
are noreturn. {svm,vmx}_do_resume() exits via reset_stack_and_jump(), a
direct JMP with entirely different prediction properties. idle_loop() is an
infinite loop which eventually exits via reset_stack_and_jump_ind() from a new
schedule, i.e. these paths are all fine having one extra return address on
the stack.
This leaves continue_pv_domain(), which is expecting to be a JMP target.
Alter it to strip the return address off the stack, which is safe because
there isn't actually a RET expecting to return to its caller.
This allows us to change reset_stack_and_jump_ind() to reset_stack_and_call_ind()
in order to mitigate the speculative type confusion.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8e186f98ce0e35d1754ec9299da41ec98873b65c)
Andrew Cooper [Tue, 26 Mar 2024 22:43:18 +0000 (22:43 +0000)]
x86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields
Right now, they're all bytes, but MSR_SPEC_CTRL has been steadily gaining new
features.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45dac88e78e8a2d9d8738eef884fe6730faf9e67)
Roger Pau Monne [Thu, 15 Feb 2024 16:46:53 +0000 (17:46 +0100)]
x86/vmx: Add support for virtualize SPEC_CTRL
The feature is defined in the tertiary exec control, and is available starting
from Sapphire Rapids and Alder Lake CPUs.
When enabled, two extra VMCS fields are used: SPEC_CTRL mask and shadow. Bits
set in mask are not allowed to be toggled by the guest (either set or clear)
and the value in the shadow field is the value the guest expects to be in the
SPEC_CTRL register.
By using it, the hypervisor can force the value of SPEC_CTRL bits behind the
guest's back without having to trap all accesses to SPEC_CTRL. Note that no
bits are forced into the guest as part of this patch. It also allows getting rid of
SPEC_CTRL in the guest MSR load list, since the value in the shadow field will
be loaded by the hardware on vmentry.
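Conceptually, the mask/shadow pair behaves like the following (a simplified
model, not the SDM's precise definition):
    /* Guest WRMSR: the shadow tracks what the guest wrote, while bits
     * covered by the mask keep the hypervisor-chosen value in the real
     * register. */
    shadow   = guest_val;
    real_msr = (guest_val & ~mask) | (host_val & mask);

    /* Guest RDMSR: returns the shadow, i.e. what the guest expects. */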
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 97c5b8b657e41a6645de9d40713b881234417b49)
Andrew Cooper [Mon, 25 Mar 2024 11:09:35 +0000 (11:09 +0000)]
x86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*
The complexity is getting out of hand.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 40dea83b75386cb693481cf340024ce093be5c0f)