Andrew Cooper [Thu, 8 Aug 2024 11:55:30 +0000 (13:55 +0200)]
XSM/domctl: Fix permission checks on XEN_DOMCTL_createdomain
The XSM checks for XEN_DOMCTL_createdomain are problematic. There's a split
between xsm_domctl() called early, and flask_domain_create() called quite late
during domain construction.
All XSM implementations except Flask have a simple IS_PRIV check in
xsm_domctl(), and operate as expected when an unprivileged domain tries to
make a hypercall.
Flask however foregoes any action in xsm_domctl() and defers everything,
including the simple "is the caller permitted to create a domain" check, to
flask_domain_create().
As a consequence, when XSM Flask is active, and irrespective of the policy
loaded, all domains, whatever their privilege level, can:
* Mutate the global 'rover' variable, used to track the next free domid.
Therefore, all domains can cause a domid wraparound, and combined with a
voluntary reboot, choose their own domid.
* Cause a reasonable amount of a domain to be constructed before ultimately
failing for permission reasons, including the use of settings outside of
supported limits.
In order to remediate this, pass the ssidref into xsm_domctl() and at least
check that the calling domain is privileged enough to create domains.
Take the opportunity to also fix the sign of the cmd parameter to be unsigned.
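For illustration, a minimal sketch of the shape of the fix, using simplified toy types modelled loosely on the dummy-XSM naming, not Xen's actual interfaces:

    struct domain { int is_privileged; };

    /* Dummy-XSM-style default action: a plain privilege check. */
    static int xsm_default_action(const struct domain *src)
    {
        return src->is_privileged ? 0 : -1;   /* -EPERM */
    }

    /* 'cmd' is unsigned now; 'ssidref' is threaded through so the
     * create check can happen early even with Flask in use. */
    static int xsm_domctl(const struct domain *curr, unsigned int cmd,
                          unsigned int ssidref)
    {
        (void)cmd;      /* XEN_DOMCTL_createdomain et al */
        (void)ssidref;  /* consumed by a loaded Flask policy */

        /* Fail before any domain state (e.g. 'rover') is touched. */
        return xsm_default_action(curr);
    }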
This issue has not been assigned an XSA, because Flask is experimental and not
security supported.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: ee32b9b29af449d38aad0a1b3a81aaae586f5ea7
master date: 2024-07-30 17:42:17 +0100
Ross Lagerwall [Thu, 8 Aug 2024 11:55:01 +0000 (13:55 +0200)]
bunzip2: fix rare decompression failure
The decompression code parses a huffman tree and counts the number of
symbols for a given bit length. In rare cases, there may be >= 256
symbols with a given bit length, causing the unsigned char to overflow.
This causes a decompression failure later when the code tries and fails to
find the bit length for a given symbol.
Since the maximum number of symbols is 258, use unsigned short instead.
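A self-contained illustration of the counter overflow (toy code, not the decompressor itself):

    #include <stdio.h>

    int main(void)
    {
        unsigned char count8 = 0;   /* old counter type */
        unsigned short count16 = 0; /* fixed counter type */

        /* 258 symbols sharing a single bit length, as bzip2 allows. */
        for (int sym = 0; sym < 258; sym++) {
            count8++;
            count16++;
        }

        printf("unsigned char:  %u\n", (unsigned)count8);  /* 2: wrapped */
        printf("unsigned short: %u\n", (unsigned)count16); /* 258: correct */
        return 0;
    }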
Fixes: ab77e81f6521 ("x86/dom0: support bzip2 and lzma compressed bzImage payloads") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 303d3ff85c90ee4af4bad4e3b1d4932fa2634d64
master date: 2024-07-30 11:55:56 +0200
x86/altcall: fix clang code-gen when using altcall in loop constructs
Yet another clang code generation issue when using altcalls.
The issue this time is with using loop constructs around alternative_{,v}call
instances using parameter types smaller than the register size.
Given the following example code:
static void bar(bool b)
{
    unsigned int i;

    for ( i = 0; i < 10; i++ )
    {
        int ret_;
        register union {
            bool e;
            unsigned long r;
        } di asm("rdi") = { .e = b };
        register unsigned long si asm("rsi");
        register unsigned long dx asm("rdx");
        register unsigned long cx asm("rcx");
        register unsigned long r8 asm("r8");
        register unsigned long r9 asm("r9");
        register unsigned long r10 asm("r10");
        register unsigned long r11 asm("r11");
Clang will generate machine code that only resets the low 8 bits of %rdi
between loop calls, leaving the rest of the register possibly containing
garbage from the use of %rdi inside the called function. Note also that clang
doesn't truncate the input parameters at the callee, thus breaking the psABI.
Fix this by turning the `e` element in the anonymous union into an array that
consumes the same space as an unsigned long, as this forces clang to reset the
whole %rdi register instead of just the low 8 bits.
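A sketch of the resulting declaration (x86-64 only; the member sizing is as described above, though the literal Xen macro differs):

    #include <stdbool.h>

    void demo(bool b)
    {
        /* 'e' now spans the whole register, so clang must initialise
         * all 64 bits of %rdi rather than only the low 8. */
        register union {
            bool e[sizeof(unsigned long)];  /* was: bool e; */
            unsigned long r;
        } di asm("rdi") = { .e = { b } };

        asm volatile ("" :: "r" (di.r));    /* keep di live */
    }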
Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function parameters on clang') Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d51b2f5ea1915fe058f730b0ec542cf84254fca0
master date: 2024-07-23 13:59:30 +0200
x86/physdev: Return pirq that irq was already mapped to
Fix a bug introduced by 0762e2502f1f ("x86/physdev: factor out the code to
allocate and map a pirq"). After that refactoring, when pirq < 0 and
current_pirq > 0, the caller wants to allocate a free pirq for the irq, but
the irq already has a mapped pirq; the function then returns the negative
pirq, so the call fails. The logic before the refactoring was different: it
should return the current_pirq that the irq was already mapped to and make
the call succeed.
Fixes: 0762e2502f1f ("x86/physdev: factor out the code to allocate and map a pirq") Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0d2b87b5adfc19e87e9027d996db204c66a47f30
master date: 2024-07-08 14:46:12 +0100
Jan Beulich [Tue, 16 Jul 2024 12:15:33 +0000 (14:15 +0200)]
x86/IRQ: avoid double unlock in map_domain_pirq()
Ever since its introduction the main loop in the function dealing
with multi-vector MSI had error exit points ("break") with different
properties: In one case no IRQ descriptor lock is being held.
Nevertheless the subsequent error cleanup path assumed such a lock would
uniformly need releasing. Identify the case by setting "desc" to NULL,
thus allowing the unlock to be skipped as necessary.
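A toy model of the pattern (simplified types; not the Xen code):

    struct irq_desc { int lock_held; };

    /* Error exits taken without the descriptor lock set desc = NULL,
     * so the shared cleanup path can distinguish the two kinds of
     * "break" and never unlocks what was never locked. */
    static int map_multi_vector_msi(struct irq_desc *d, int fail_unlocked)
    {
        struct irq_desc *desc = d;
        int rc = -1;                /* some failure occurred */

        if (fail_unlocked)
            desc = NULL;            /* exit point holding no lock */
        else
            desc->lock_held = 1;    /* exit point holding the lock */

        if (rc && desc)
            desc->lock_held = 0;    /* unlock only when appropriate */

        return rc;
    }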
Jan Beulich [Thu, 4 Jul 2024 14:59:03 +0000 (16:59 +0200)]
evtchn: build fix for Arm
When backporting daa90dfea917 ("pirq_cleanup_check() leaks") I neglected
to pay attention to it depending on 13a7b0f9f747 ("restrict concept of
pIRQ to x86"). That one doesn't want backporting imo, so use / adjust
custom #ifdef-ary to address the immediate issue of pirq_cleanup_check()
not being available on Arm.
Jan Beulich [Thu, 4 Jul 2024 12:23:30 +0000 (14:23 +0200)]
x86/entry: don't clear DF when raising #UD for lack of syscall handler
While doing so is intentional when invoking the actual callback, to
mimic a hard-coded SYSCALL_MASK / FMASK MSR, the same should not be done
when no handler is available and hence #UD is raised.
Fixes: ca6fcf4321b3 ("x86/pv: Inject #UD for missing SYSCALL callbacks") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d2fe9ab3048d503869ec81bc49db07e55a4a2386
master date: 2024-07-02 12:01:21 +0200
Jan Beulich [Thu, 4 Jul 2024 12:22:54 +0000 (14:22 +0200)]
cmdline: document and enforce "extra_guest_irqs" upper bounds
PHYSDEVOP_pirq_eoi_gmfn_v<N> accepting just a single GFN implies that no
more than 32k pIRQ-s can be used by a domain on x86. Document this upper
bound.
To also enforce the limit, (ab)use both arch_hwdom_irqs() (changing its
parameter type) and setup_system_domains(). This is primarily to avoid
exposing the two static variables or introducing yet further arch hooks.
While touching arch_hwdom_irqs() also mark it hwdom-init.
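The 32k figure above follows directly from the size of the single registered page (x86's 4 KiB pages, one pending-EOI bit per pIRQ):

    #define PAGE_SIZE 4096
    /* One bit per pIRQ in one GFN's worth of bitmap: */
    #define MAX_PIRQS (PAGE_SIZE * 8)   /* = 32768, the "32k" bound */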
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
amend 'cmdline: document and enforce "extra_guest_irqs" upper bounds'
Address late review comments for what is now commit 17f6d398f765:
- bound max_irqs right away against nr_irqs
- introduce a #define for a constant used twice
Requested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 17f6d398f76597f8009ec0530842fb8705ece7ba
master date: 2024-07-02 12:00:27 +0200
master commit: 1f56accba33ffea0abf7d1c6384710823d10cbd6
master date: 2024-07-03 14:03:27 +0200
Andrew Cooper [Thu, 4 Jul 2024 12:21:46 +0000 (14:21 +0200)]
tools/dombuilder: Correct the length calculation in xc_dom_alloc_segment()
xc_dom_alloc_segment() is passed a size in bytes, calculates a size in pages
from it, then fills in the new segment information with a bytes value
re-calculated from the number of pages.
This causes the module information given to the guest (MB, or PVH) to have
incorrect sizes; specifically, sizes rounded up to the next page.
This in turn is problematic for Xen. When Xen finds a gzipped module, it
peeks at the end metadata to judge the decompressed size, which is a -4
backreference from the reported end of the module.
Fill in seg->vend using the correct number of bytes.
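A simplified sketch of the corrected calculation (hypothetical names; the real code lives in xc_dom_alloc_segment()):

    #include <stddef.h>
    #include <stdint.h>

    struct seg { uint64_t vstart, vend; };

    /* Allocate whole pages, but record the exact byte length in the
     * segment rather than the page-rounded one. */
    static void fill_segment(struct seg *seg, uint64_t start,
                             size_t size, size_t page_size)
    {
        size_t pages = (size + page_size - 1) / page_size;

        (void)pages;                 /* drives the actual allocation */
        seg->vstart = start;
        seg->vend   = start + size;  /* was: start + pages * page_size */
    }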
Fixes: ea7c8a3d0e82 ("libxc: reorganize domain builder guest memory allocator") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 4c3a618b0adaa0cd59e0fa0898bb60978b8b3a5f
master date: 2024-07-02 10:50:18 +0100
Jan Beulich [Thu, 4 Jul 2024 12:20:50 +0000 (14:20 +0200)]
pirq_cleanup_check() leaks
Its original introduction had two issues: For one the "common" part of
the checks (carried out in the macro) was inverted. And then after
removal from the radix tree the structure wasn't scheduled for freeing.
(All structures still left in the radix tree would be freed upon domain
destruction, though.)
For the freeing to be safe even if it didn't use RCU (i.e. to avoid use-
after-free), re-arrange checks/operations in evtchn_close(), such that
the pointer wouldn't be used anymore after calling pirq_cleanup_check()
(noting that unmap_domain_pirq_emuirq() itself calls the function in the
success case).
Andrew Cooper [Thu, 4 Jul 2024 12:20:25 +0000 (14:20 +0200)]
tools/xl: Open xldevd.log with O_CLOEXEC
`xl devd` has been observed leaking /var/log/xldevd.log into children.
Note this is specifically safe; dup2() leaves O_CLOEXEC disabled on newfd, so
after setting up stdout/stderr, it's only the logfile fd which will close on
exec().
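A minimal standalone illustration of the pattern (path and mode are placeholders):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Open the logfile with O_CLOEXEC so children can't inherit it. */
        int logfd = open("/var/log/xldevd.log",
                         O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644);
        if (logfd < 0)
            return 1;

        /* dup2() leaves O_CLOEXEC clear on the new fds, so stdout and
         * stderr survive exec(); only logfd closes across exec(). */
        dup2(logfd, STDOUT_FILENO);
        dup2(logfd, STDERR_FILENO);
        return 0;
    }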
Link: https://github.com/QubesOS/qubes-issues/issues/8292 Reported-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Demi Marie Obenour <demi@invisiblethingslab.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: ba52b3b624e4a1a976908552364eba924ca45430
master date: 2024-06-24 16:22:59 +0100
Matthew Barnes [Thu, 4 Jul 2024 12:19:57 +0000 (14:19 +0200)]
x86/ioapic: Fix signed shifts in io_apic.c
There exist bit shifts in the IOAPIC code where signed integers are
shifted to the left by up to 31 bits, which is undefined behaviour.
This patch fixes this by changing the integers from signed to unsigned.
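A self-contained illustration of the class of bug:

    #include <stdio.h>

    int main(void)
    {
        unsigned int pin = 31;

        /* 1 << 31 shifts into the sign bit of a 32-bit int: UB. */
        /* int bad = 1 << pin; */

        /* An unsigned operand makes the shift well defined. */
        unsigned int mask = 1u << pin;

        printf("mask = %#x\n", mask);
        return 0;
    }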
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c5746b021e573184fb92b601a0e93a295485054e
master date: 2024-06-21 15:09:26 +0100
Andrew Cooper [Thu, 4 Jul 2024 12:19:35 +0000 (14:19 +0200)]
tools: Drop libsystemd as a dependency
There are no more users, and we want to dissuade people from introducing new
users just for sd_notify() and friends. Drop the dependency.
We still want the overall --with{,out}-systemd to gate the generation of the
service/unit/mount/etc files.
Rerun autogen.sh, and mark the dependency as removed in the build containers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Christian Lindig <christian.lindig@cloud.com>
tools: (Actually) drop libsystemd as a dependency
When reinstating some of systemd.m4 between v1 and v2, I reintroduced a little
too much. While {c,o}xenstored are indeed no longer linked against
libsystemd, ./configure still looks for it.
Drop this too.
Fixes: ae26101f6bfc ("tools: Drop libsystemd as a dependency") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: ae26101f6bfc8185adcdb9165d469bdc467780db
master date: 2024-05-23 15:04:40 +0100
master commit: 6ef4fa1e7fe78c1dae07b451292b07facfce4902
master date: 2024-05-30 12:15:25 +0100
tools/tests: don't let test-xenstore write nodes exceeding default size
Today test-xenstore will write nodes with 3000 bytes of node data. This
size exceeds the default quota for the allowed node size. While this works
in dom0 with C-xenstored, OCAML-xenstored does not like that.
Use a size of 2000 instead, which is lower than the allowed default
node size of 2048.
Jan Beulich [Thu, 4 Jul 2024 12:17:03 +0000 (14:17 +0200)]
x86: re-run exception-from-stub recovery selftests with CET-SS enabled
On the BSP, shadow stacks are enabled only relatively late in the
booting process. They in particular aren't active yet when initcalls are
run. Keep the testing there, but invoke that testing a 2nd time when
shadow stacks are active, to make sure we won't regress that case after
addressing XSA-451.
While touching this code, switch the guard from NDEBUG to CONFIG_DEBUG,
such that IS_ENABLED() can validly be used at the new call site.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cfe3ad67127b86e1b1c06993b86422673a51b050
master date: 2024-02-27 13:49:52 +0100
Roger Pau Monné [Wed, 26 Jun 2024 12:14:01 +0000 (14:14 +0200)]
x86/irq: forward pending interrupts to new destination in fixup_irqs()
fixup_irqs() is used to evacuate interrupts from to-be-offlined CPUs. Given
the CPU is to become offline, the normal migration logic used by Xen where the
vector in the previous target(s) is left configured until the interrupt is
received on the new destination is not suitable.
Instead attempt to do as much as possible in order to prevent losing
interrupts. If fixup_irqs() is called from the CPU to be offlined (as is
currently the case for CPU hot unplug) attempt to forward pending vectors when
interrupts that target the current CPU are migrated to a different destination.
Additionally, for interrupts that have already been moved from the current CPU
prior to the call to fixup_irqs() but that haven't been delivered to the new
destination (iow: interrupts with move_in_progress set and the current CPU set
in ->arch.old_cpu_mask) also check whether the previous vector is pending and
forward it to the new destination.
This allows us to remove the window with interrupts enabled at the bottom of
fixup_irqs(). Such a window wasn't safe anyway: references to the CPU to become
offline are removed from interrupts masks, but the per-CPU vector_irq[] array
is not updated to reflect those changes (as the CPU is going offline anyway).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e2bb28d621584fce15c907002ddc7c6772644b64
master date: 2024-06-20 12:09:32 +0200
Andrew Cooper [Wed, 26 Jun 2024 12:13:37 +0000 (14:13 +0200)]
x86/cpuid: Fix handling of XSAVE dynamic leaves
[ This is a minimal backport of commit 71cacfb035f4 ("x86/cpuid: Fix handling
of XSAVE dynamic leaves") to fix the bugs without depending on the large
rework of XSTATE handling in Xen 4.19 ]
First, if XSAVE is available in hardware but not visible to the guest, the
dynamic leaves shouldn't be filled in.
Second, the comment concerning XSS state is wrong. VT-x doesn't manage
host/guest state automatically, but there is provision for "host only" bits to
be set, so the implications are still accurate.
In Xen 4.18, no XSS states are supported, so it's safe to keep deferring to
real hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71cacfb035f4a78ee10970dc38a3baa04d387451
master date: 2024-06-19 13:00:06 +0100
Andrew Cooper [Wed, 26 Jun 2024 12:13:17 +0000 (14:13 +0200)]
x86/xstate: Fix initialisation of XSS cache
The clobbering of this_cpu(xcr0) and this_cpu(xss) to architecturally invalid
values is to force the subsequent set_xcr0() and set_msr_xss() to reload the
hardware register.
While XCR0 is reloaded in xstate_init(), MSR_XSS isn't. This causes
get_msr_xss() to return the invalid value, and logic of the form:
    old = get_msr_xss();
    set_msr_xss(new);
    ...
    set_msr_xss(old);
to try and restore said invalid value.
The architecturally invalid value must be purged from the cache, meaning the
hardware register must be written at least once. This in turn highlights that
the invalid value must only be used in the case that the hardware register is
available.
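A sketch of the resulting flow, modelling the helpers quoted above with toy definitions:

    #include <stdint.h>

    static uint64_t cached_xss;         /* models this_cpu(xss) */

    static void set_msr_xss(uint64_t val)
    {
        if (cached_xss != val)
            cached_xss = val;           /* stands in for the WRMSR */
    }

    /* Resume path: poison the cache, then perform one real write so
     * the invalid sentinel can never escape via get_msr_xss(). */
    static void xstate_init(int cpu_has_xsaves)
    {
        if (cpu_has_xsaves)
        {
            cached_xss = ~0ull;         /* architecturally invalid */
            set_msr_xss(0);             /* purges the sentinel */
        }
    }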
Fixes: f7f4a523927f ("x86/xstate: reset cached register values on resume") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9e6dbbe8bf400aacb99009ddffa91d2a0c312b39
master date: 2024-06-19 13:00:06 +0100
Andrew Cooper [Wed, 26 Jun 2024 12:12:38 +0000 (14:12 +0200)]
xen/ubsan: Fix UB in type_descriptor declaration
struct type_descriptor is arranged with a NUL terminated string following the
kind/info fields.
The only reason this doesn't trip UBSAN detection itself (on more modern
compilers at least) is because struct type_descriptor is only referenced in
suppressed regions.
Switch the declaration to be a real flexible member. No functional change.
Fixes: 00fcf4dd8eb4 ("xen/ubsan: Import ubsan implementation from Linux 4.13") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bd59af99700f075d06a6d47a16f777c9519928e0
master date: 2024-06-18 14:55:04 +0100
Roger Pau Monné [Wed, 26 Jun 2024 12:12:05 +0000 (14:12 +0200)]
x86/irq: handle moving interrupts in _assign_irq_vector()
Currently there's logic in fixup_irqs() that attempts to prevent
_assign_irq_vector() from failing, as fixup_irqs() is required to evacuate all
interrupts from the CPUs not present in the input mask. The current logic in
fixup_irqs() is incomplete, as it doesn't deal with interrupts that have
move_cleanup_count > 0 and a non-empty ->arch.old_cpu_mask field.
Instead of attempting to fixup the interrupt descriptor in fixup_irqs() so that
_assign_irq_vector() cannot fail, introduce logic in _assign_irq_vector()
to deal with interrupts that have either move_{in_progress,cleanup_count} set
and no remaining online CPUs in ->arch.cpu_mask.
If _assign_irq_vector() is requested to move an interrupt in the state
described above, first attempt to see if ->arch.old_cpu_mask contains any valid
CPUs that could be used as fallback, and if that's the case do move the
interrupt back to the previous destination. Note this is easier because the
vector hasn't been released yet, so there's no need to allocate and setup a new
vector on the destination.
Due to the logic in fixup_irqs() that clears offline CPUs from
->arch.old_cpu_mask (and releases the old vector if the mask becomes empty) it
shouldn't be possible to get into _assign_irq_vector() with
->arch.move_{in_progress,cleanup_count} set but no online CPUs in
->arch.old_cpu_mask.
However, if ->arch.move_{in_progress,cleanup_count} is set and the interrupt
has also changed affinity, it's possible that the members of
->arch.old_cpu_mask are no longer part of the affinity set. In that case,
move the interrupt to a different CPU that is part of the provided mask and
keep the current ->arch.old_{cpu_mask,vector} for the pending interrupt
movement to be completed.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 369558924a642bbb0cb731e9a3375958867cb17b
master date: 2024-06-18 15:15:10 +0200
Roger Pau Monné [Wed, 26 Jun 2024 12:11:33 +0000 (14:11 +0200)]
x86/irq: deal with old_cpu_mask for interrupts in movement in fixup_irqs()
Given the current logic it's possible for ->arch.old_cpu_mask to get out of
sync: if a CPU set in old_cpu_mask is offlined and then onlined again
without old_cpu_mask having been updated, the data in the mask will no
longer be accurate, as when brought back online the CPU will no longer have
old_vector configured to handle the old interrupt source.
If there's an interrupt movement in progress, and the to be offlined CPU (which
is the call context) is in the old_cpu_mask, clear it and update the mask, so
it doesn't contain stale data.
Note that when the system is going down fixup_irqs() will be called by
smp_send_stop() from CPU 0 with a mask with only CPU 0 on it, effectively
asking to move all interrupts to the current caller (CPU 0) which is the only
CPU to remain online. In that case we don't care to migrate interrupts that
are in the process of being moved, as it's likely we won't be able to move all
interrupts to CPU 0 due to vector shortage anyway.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 817d1cd627be668c358d038f0fadbf7d24d417d3
master date: 2024-06-18 15:14:49 +0200
Jan Beulich [Wed, 26 Jun 2024 12:11:08 +0000 (14:11 +0200)]
x86/Intel: unlock CPUID earlier for the BSP
Intel CPUs have a MSR bit to limit CPUID enumeration to leaf two. If
this bit is set by the BIOS then CPUID evaluation does not work when
data from any leaf greater than two is needed; early_cpu_init() in
particular wants to collect leaf 7 data.
Cure this by unlocking CPUID right before evaluating anything which
depends on the maximum CPUID leaf being greater than two.
Inspired by (and description cloned from) Linux commit 0c2f6d04619e
("x86/topology/intel: Unlock CPUID before evaluating anything").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: fa4d026737a47cd1d66ffb797a29150b4453aa9f
master date: 2024-06-18 15:12:44 +0200
Jan Beulich [Wed, 26 Jun 2024 12:10:40 +0000 (14:10 +0200)]
x86/EPT: drop questionable mfn_valid() from epte_get_entry_emt()
mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
access to actual MMIO space should not generally be restricted to UC
only; especially video frame buffer accesses are unduly affected by such
a restriction.
Since, as of 777c71d31325 ("x86/EPT: avoid marking non-present entries
for re-configuring"), the function won't be called with INVALID_MFN or,
worse, truncated forms thereof anymore, we can fully drop that check.
Fixes: 81fd0d3ca4b2 ("x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 4fdd8d75566fdad06667a79ec0ce6f43cc466c54
master date: 2024-06-13 16:55:22 +0200
Jan Beulich [Wed, 26 Jun 2024 12:10:15 +0000 (14:10 +0200)]
x86/EPT: avoid marking non-present entries for re-configuring
For non-present entries EMT, like most other fields, is meaningless to
hardware. Make the logic in ept_set_entry() setting the field (and iPAT)
conditional upon dealing with a present entry, leaving the value at 0
otherwise. This has two effects for epte_get_entry_emt() which we'll
want to leverage subsequently:
1) The call moved here now won't be issued with INVALID_MFN anymore (a
respective BUG_ON() is being added).
2) Neither of the other two calls could now be issued with a truncated
form of INVALID_MFN anymore (as long as there's no bug anywhere
marking an entry present when that was populated using INVALID_MFN).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 777c71d31325bc55ba1cc3f317d4155fe519ab0b
master date: 2024-06-13 16:54:17 +0200
Jan Beulich [Wed, 26 Jun 2024 12:09:50 +0000 (14:09 +0200)]
x86/EPT: correct special page checking in epte_get_entry_emt()
mfn_valid() granularity is (currently) 256Mb. Therefore the start of a
1Gb page passing the test doesn't necessarily mean all parts of such a
range would also pass. Yet using the result of mfn_to_page() on an MFN
which doesn't pass mfn_valid() checking is liable to result in a crash
(the invocation of mfn_to_page() alone is presumably "just" UB in such a
case).
Fixes: ca24b2ffdbd9 ("x86/hvm: set 'ipat' in EPT for special pages") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5540b94e8191059eb9cbbe98ac316232a42208f6
master date: 2024-06-13 16:53:34 +0200
Roger Pau Monné [Wed, 26 Jun 2024 12:09:15 +0000 (14:09 +0200)]
x86/irq: limit interrupt movement done by fixup_irqs()
The current check used in fixup_irqs() to decide whether to move around
interrupts is based on the affinity mask, but such a mask can have all bits set,
and hence is unlikely to be a subset of the input mask. For example if an
interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's not
an all set CPU mask would cause that interrupt to be shuffled around
unconditionally.
What fixup_irqs() cares about is evacuating interrupts from CPUs not set on the
input CPU mask, and for that purpose it should check whether the interrupt is
assigned to a CPU not present in the input mask. Assume that ->arch.cpu_mask
is a subset of the ->affinity mask, and keep the current logic that resets the
->affinity mask if the interrupt has to be shuffled around.
Doing the affinity movement based on ->arch.cpu_mask requires removing the
special handling to ->arch.cpu_mask done for high priority vectors, otherwise
the adjustment done to cpu_mask makes them always skip the CPU interrupt
movement.
While there also adjust the comment as to the purpose of fixup_irqs().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c7564d7366d865cc407e3d64bca816d07edee174
master date: 2024-06-12 14:30:40 +0200
Roger Pau Monné [Wed, 26 Jun 2024 12:08:40 +0000 (14:08 +0200)]
x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
Due to the current rwlock logic, if the CPU calling get_cpu_maps() does
so from a cpu_hotplug_{begin,done}() region the function will still
return success, because a CPU taking the rwlock in read mode after
having taken it in write mode is allowed. Such corner case makes using
get_cpu_maps() alone not enough to prevent using the shorthand in CPU
hotplug regions.
Introduce a new helper to detect whether the current caller is between a
cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
shorthand usage.
Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 171c52fba5d94e050d704770480dcb983490d0ad
master date: 2024-06-12 14:29:31 +0200
Roger Pau Monné [Wed, 26 Jun 2024 12:07:06 +0000 (14:07 +0200)]
x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
When adjusting move_cleanup_count to account for CPUs that are offline also
adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
those again and create an imbalance in move_cleanup_count.
Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e63209d3ba2fd1b2f232babd14c9c679ffa7b09a
master date: 2024-06-10 10:33:22 +0200
Andrew Cooper [Wed, 26 Jun 2024 12:05:54 +0000 (14:05 +0200)]
x86/ucode: Further fixes to identify "ucode already up to date"
When the revision in hardware is newer than anything Xen has to hand,
'microcode_cache' isn't set up. Then, `xen-ucode` initiates the update
because it doesn't know whether the revisions across the system are symmetric
or not. This involves the patch getting all the way into the
apply_microcode() hooks before being found to be too old.
This is all a giant mess and needs an overhaul, but in the short term simply
adjust the apply_microcode() to return -EEXIST.
Also, unconditionally print the preexisting microcode revision on boot. It's
relevant information which is otherwise unavailable if Xen doesn't find new
microcode to use.
Fixes: 648db37a155a ("x86/ucode: Distinguish "ucode already up to date"") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 977d98e67c2e929c62aa1f495fc4c6341c45abb5
master date: 2024-05-16 13:59:11 +0100
Roger Pau Monné [Tue, 21 May 2024 10:02:13 +0000 (12:02 +0200)]
x86/mtrr: avoid system wide rendezvous when setting AP MTRRs
There's no point in forcing a system wide update of the MTRRs on all processors
when there are no changes to be propagated. On AP startup it's only the AP
that needs to write the system wide MTRR values in order to match the rest of
the already online CPUs.
We have occasionally seen the watchdog trigger during `xen-hptool cpu-online`
in one Intel Cascade Lake box with 448 CPUs due to the re-setting of the MTRRs
on all the CPUs in the system.
While there adjust the comment to clarify why the system-wide resetting of the
MTRR registers is not needed for the purposes of mtrr_ap_init().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: abd00b037da5ffa4e8c4508a5df0cd6eabb805a4
master date: 2024-05-15 19:59:52 +0100
Leigh Brown [Tue, 21 May 2024 10:02:03 +0000 (12:02 +0200)]
tools/xentop: Fix cpu% sort order
In compare_cpu_pct(), there is a double -> unsigned long long conversion
when calling compare(). In C, this discards the fractional part, resulting
in an out-of-order sort.
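A standalone demonstration of the truncation:

    #include <stdio.h>

    int main(void)
    {
        double a = 0.9, b = 0.2;

        /* The double -> unsigned long long conversion discards the
         * fraction, so 0.9 and 0.2 both compare as 0. */
        printf("truncated: %llu vs %llu\n",
               (unsigned long long)a, (unsigned long long)b);

        /* Comparing as doubles preserves the ordering. */
        printf("a > b: %d\n", a > b);
        return 0;
    }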
Jan Beulich [Tue, 21 May 2024 10:01:33 +0000 (12:01 +0200)]
x86: respect mapcache_domain_init() failing
The function itself properly handles and hands onwards failure from
create_perdomain_mapping(). Therefore its caller should respect possible
failure, too.
Fixes: 4b28bf6ae90b ("x86: re-introduce map_domain_page() et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7270fdc7a0028d4b7b26fd1b36c6b9e97abcf3da
master date: 2024-05-15 19:59:52 +0100
Juergen Gross [Tue, 21 May 2024 10:01:06 +0000 (12:01 +0200)]
xen/sched: set all sched_resource data inside locked region for new cpu
When adding a cpu to a scheduler, set all data items of struct
sched_resource inside the locked region, as otherwise a race might
happen (e.g. when trying to access the cpupool of the cpu).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: a8c6c623192e ("sched: clarify use cases of schedule_cpu_switch()") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d104a07524ffc92ae7a70dfe192c291de2a563cc
master date: 2024-05-15 19:59:52 +0100
libxl: Fix handling XenStore errors in device creation
If xenstored runs out of memory it is possible for it to fail operations
that should succeed. libxl wasn't robust against this, and could fail
to ensure that the TTY path of a non-initial console was created and
read-only for guests. This doesn't qualify for an XSA because guests
should not be able to run xenstored out of memory, but it still needs to
be fixed.
Add the missing error checks to ensure that all errors are properly
handled and that at no point can a guest make the TTY path of its
frontend directory writable.
Roger Pau Monné [Tue, 21 May 2024 10:00:09 +0000 (12:00 +0200)]
libxl: fix population of the online vCPU bitmap for PVH
libxl passes some information to libacpi to create the ACPI table for a PVH
guest, and among that information is a bitmap of which vCPUs are online,
which can be less than the maximum number of vCPUs assigned to the domain.
While the population of the bitmap is done correctly for HVM based on the
number of online vCPUs, for PVH the population of the bitmap is done based on
the maximum number of vCPUs allowed. This leads to all local APIC entries in
the MADT being set as enabled, which contradicts the data in xenstore if the
number of online vCPUs differs from the maximum.
Fix by copying the internal libxl bitmap that's populated based on the vCPUs
parameter.
Reported-by: Arthur Borsboom <arthurborsboom@gmail.com> Link: https://gitlab.com/libvirt/libvirt/-/issues/399 Reported-by: Leigh Brown <leigh@solinno.co.uk> Fixes: 14c0d328da2b ('libxl/acpi: Build ACPI tables for HVMlite guests') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Leigh Brown <leigh@solinno.co.uk> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5cc7347b04b2d0a3133754c7a9b936f614ec656a
master date: 2024-05-11 00:13:43 +0100
Andrew Cooper [Tue, 21 May 2024 09:59:09 +0000 (11:59 +0200)]
x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake
Ever since Xen 4.14, there has been a latent bug with migration.
While some toolstacks can level the features properly, they don't shrink
feat.max_subleaf when all features have been dropped. This is because
we *still* have not completed the toolstack side work for full CPU Policy
objects.
As a consequence, even when properly feature levelled, VMs can't migrate
"backwards" across hardware which reduces feat.max_subleaf. One such example
is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).
Extend the max policies feat.max_subleaf to the highest number Xen knows
about, but leave the default policies matching the host. This will allow VMs
with a higher feat.max_subleaf than strictly necessary to migrate in.
Eventually we'll manage to teach the toolstack how to avoid creating such VMs
in the first place, but there's still more work to do there.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2330b51df267e20e66bbba6c5bf08f0570ed58b
master date: 2024-05-07 16:56:46 +0100
Jan Beulich [Tue, 21 May 2024 09:58:17 +0000 (11:58 +0200)]
VT-d: correct ATS checking for root complex integrated devices
Spec version 4.1 says
"The ATSR structures identifies PCI Express Root-Ports supporting
Address Translation Services (ATS) transactions. Software must enable
ATS on endpoint devices behind a Root Port only if the Root Port is
reported as supporting ATS transactions."
Clearly root complex integrated devices aren't "behind root ports",
matching my observation on a SapphireRapids system having an ATS-
capable root complex integrated device. Hence for such devices we
shouldn't try to locate a corresponding ATSR.
Since both pci_find_ext_capability() and pci_find_cap_offset() return
"unsigned int", change "pos" to that type at the same time.
Fixes: 903b93211f56 ("[VTD] laying the ground work for ATS") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 04e31583bab97e5042a44a1d00fce2760272635f
master date: 2024-05-06 09:22:45 +0200
Jason Andryuk [Tue, 21 May 2024 09:57:20 +0000 (11:57 +0200)]
xen/xsm: Wire up get_dom0_console
An XSM hook for get_dom0_console is currently missing. Using XSM with
a PVH dom0 shows:
(XEN) FLASK: Denying unknown platform_op: 64.
Wire up the hook, and allow it for dom0.
Fixes: 4dd160583c ("x86/platform: introduce hypercall to get initial video console settings") Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 647f7e50ebeeb8152974cad6a12affe474c74513
master date: 2024-04-30 08:33:41 +0200
Ross Lagerwall [Tue, 21 May 2024 09:55:49 +0000 (11:55 +0200)]
x86/rtc: Avoid UIP flag being set for longer than expected
In a test, OVMF reported an error initializing the RTC without
indicating the precise nature of the error. The only plausible
explanation I can find is as follows:
As part of the initialization, OVMF reads register C and then reads
register A repeatedly until the UIP flag is not set. If this takes longer
than 100 ms, OVMF fails and reports an error. This may happen with the
following sequence of events:
At guest time=0s, rtc_init() calls check_update_timer() which schedules
update_timer for t=(1 - 244us).
At t=1s, the update_timer function happens to have been called >= 244us
late. In the timer callback, it sets the UIP flag and schedules
update_timer2 for t=1s.
Before update_timer2 runs, the guest reads register C which calls
check_update_timer(). check_update_timer() stops the scheduled
update_timer2 and since the guest time is now outside of the update
cycle, it schedules update_timer for t=(2 - 244us).
The UIP flag will therefore be set for a whole second from t=1 to t=2
while the guest repeatedly reads register A waiting for the UIP flag to
clear. Fix it by clearing the UIP flag when scheduling update_timer.
I was able to reproduce this issue with a synthetic test and this
resolves the issue.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 43a07069863b419433dee12c9b58c1f7ce70aa97
master date: 2024-04-23 14:09:18 +0200
Roger Pau Monné [Tue, 21 May 2024 09:55:09 +0000 (11:55 +0200)]
altcall: fix __alt_call_maybe_initdata so it's safe for livepatch
Setting alternative call variables as __init is not safe for use with
livepatch, as livepatches can rightfully introduce new alternative calls to
structures marked as __alt_call_maybe_initdata (possibly just indirectly due to
replacing existing functions that use those). Attempting to resolve those
alternative calls then results in page faults as the variable that holds the
function pointer address has been freed.
When livepatch is supported use the __ro_after_init attribute instead of
__initdata for __alt_call_maybe_initdata.
Fixes: f26bb285949b ('xen: Implement xen/alternative-call.h for use in common code') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: af4cd0a6a61cdb03bc1afca9478b05b0c9703599
master date: 2024-04-11 18:51:36 +0100
Jason Andryuk [Tue, 21 May 2024 09:53:54 +0000 (11:53 +0200)]
block-common: Fix same_vm for no targets
same_vm is broken when the two main domains do not have targets. otvm
and targetvm are both missing, which means they get set to -1 and then
converted to empty strings:
++10697+ local targetvm=-1
++10697+ local otvm=-1
++10697+ otvm=
++10697+ othervm=/vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4
++10697+ targetvm=
++10697+ local frontend_uuid=/vm/844dea4e-44f8-4e3e-8145-325132a31ca5
The final comparison returns true since the two empty strings match.
Replace -1 with distinct strings indicating the lack of a value and remove
the coalescing to empty strings. The strings themselves will no longer
match, and that is correct.
It's currently too restrictive by just checking whether there's a BHB clearing
sequence selected. It should instead check whether BHB clearing is used on
entry from PV or HVM specifically.
Switch to use opt_bhb_entry_{pv,hvm} instead, and then remove cpu_has_bhb_seq
since it no longer has any users.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: 954c983abcee ('x86/spec-ctrl: Software BHB-clearing sequences') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 656ae8f1091bcefec9c46ec3ea3ac2118742d4f6
master date: 2024-04-25 16:37:01 +0200
x86/spec: fix reporting of BHB clearing usage from guest entry points
Reporting whether the BHB clearing on entry is done for the different domains
types based on cpu_has_bhb_seq is unhelpful, as that variable signals whether
there's a BHB clearing sequence selected, but that alone doesn't imply that
such sequence is used from the PV and/or HVM entry points.
Instead use opt_bhb_entry_{pv,hvm} which do signal whether BHB clearing is
performed on entry from PV/HVM.
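A toy model of the corrected reporting (variable names mirror the ones above; the print logic is simplified):

    #include <stdio.h>

    static int opt_bhb_entry_pv = 1, opt_bhb_entry_hvm = 0;

    int main(void)
    {
        /* Key the annotation off the per-type option, not off the
         * mere existence of a selected sequence (cpu_has_bhb_seq). */
        printf("PV:  %s\n", opt_bhb_entry_pv  ? "BHB-entry" : "-");
        printf("HVM: %s\n", opt_bhb_entry_hvm ? "BHB-entry" : "-");
        return 0;
    }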
Fixes: 689ad48ce9cf ('x86/spec-ctrl: Wire up the Native-BHI software sequences') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 049ab0b2c9f1f5edb54b505fef0bc575787dafe9
master date: 2024-04-25 16:35:56 +0200
Andrew Cooper [Tue, 9 Apr 2024 20:39:51 +0000 (21:39 +0100)]
x86/entry: Fix build with older toolchains
Binutils older than 2.29 doesn't know INCSSPD.
Fixes: 8e186f98ce0e ("x86: Use indirect calls in reset-stack infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit a9fa82500818a8d8ce5f2843f1577bd2c29d088e)
Andrew Cooper [Fri, 22 Mar 2024 19:29:34 +0000 (19:29 +0000)]
x86/spec-ctrl: Support the "long" BHB loop sequence
Out of an abundance of caution, implement the long loop too, and allow for
it to be opted in to.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit d5887c0decbd90e798b24ed696628645b04632fb)
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Wire up the Native-BHI software sequences
In the absence of BHI_DIS_S, mitigating Native-BHI requires the use of a
software sequence.
Introduce a new bhb-seq= option to select between available sequences and
bhb-entry= to control the per-PV/HVM actions like we have for other blocks.
Activate the short sequence by default for PV and HVM guests on affected
hardware if BHI_DIS_S isn't present.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 689ad48ce9cf4c38297cd126e7e003a1c13a3b9d)
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Software BHB-clearing sequences
Implement clear_bhb_{tsx,loops}() as per the BHI guidance. The loops variant
is set up as the "short" sequence.
Introduce SCF_entry_bhb and extend SPEC_CTRL_ENTRY_* with a conditional call
to selected clearing routine.
Note that due to a limitation in the ALTERNATIVE capability, the TEST/JZ can't
be included alongside a CALL in a single alternative block. This is going to
require further work to untangle.
The BHB sequences (if used) must be after the restoration of Xen's
MSR_SPEC_CTRL value, which must be accounted for when judging whether it is
safe to skip the safety LFENCEs.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 954c983abceee97bf5f6230b9ae164f2c49a9aa9)
Andrew Cooper [Tue, 26 Mar 2024 19:01:37 +0000 (19:01 +0000)]
x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI
Introduce a "bhi-dis-s" boolean to match the other options we have for
MSR_SPEC_CTRL values. Also introduce bhi_calculations().
Use BHI_DIS_S whenever possible.
Guests which are levelled to be migration compatible with older CPUs can't see
BHI_DIS_S, and Xen must fill in the difference to make the guest safe. Use
the virt MSR_SPEC_CTRL infrastructure to force BHI_DIS_S behind the guest's
back.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 62a1106415c5e8a49b45147ca84d54a58d471343)
Andrew Cooper [Sat, 6 Apr 2024 19:36:54 +0000 (20:36 +0100)]
x86/tsx: Expose RTM_ALWAYS_ABORT to guests
A TSX Abort is one option to mitigate Native-BHI, but a guest kernel doesn't get
to see this if Xen has turned RTM off using MSR_TSX_{CTRL,FORCE_ABORT}.
Therefore, the meaning of RTM_ALWAYS_ABORT has been adjusted to "XBEGIN won't
fault", and it should be exposed to guests so they can make a better decision.
Expose it in the max policy for any RTM-capable system. Offer it by default
only if RTM has been disabled.
Update test-tsx to account for this new meaning. While adjusting the logic in
test_guest_policies(), take the opportunity to use feature names (now they're
available) to make the logic easier to follow.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c94e2105924347de0d9f32065370e802a20cc829)
Andrew Cooper [Fri, 22 Dec 2023 18:01:37 +0000 (18:01 +0000)]
x86: Drop INDIRECT_JMP
Indirect JMPs which are not tailcalls can lead to an unwelcome form of
speculative type confusion, and we've removed the uses of INDIRECT_JMP to
compensate. Remove the temptation to reintroduce new instances.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0b66d7ce3c0290eaad28bdafb35200052d012b14)
Andrew Cooper [Fri, 22 Dec 2023 17:44:48 +0000 (17:44 +0000)]
x86: Use indirect calls in reset-stack infrastructure
Mixing up JMP and CALL indirect targets leads to a very fun form of
speculative type confusion. A target which is expecting to be CALLed needs
a return address on the stack, and an indirect JMP doesn't place one there.
An indirect JMP which predicts to a target intending to be CALLed can end up
with a RET speculatively executing with a value from the JMPer's stack frame.
There are several ways to get indirect JMPs in Xen.
* From tailcall optimisations. These are safe because the compiler has
arranged the stack to point at the callee's return address.
* From jump tables. These are unsafe, but Xen is built with -fno-jump-tables
to work around several compiler issues.
* From reset_stack_and_jump_ind(), which is particularly unsafe. Because of
the additional stack adjustment made, the value picked up off the stack is
regs->r15 of the next vCPU to run.
In order to mitigate this type confusion, we want to make all indirect targets
be CALL targets, and remove the use of indirect JMP except via tailcall
optimisation.
Luckily due to XSA-348, all C target functions of reset_stack_and_jump_ind()
are noreturn. {svm,vmx}_do_resume() exits via reset_stack_and_jump(); a
direct JMP with entirely different prediction properties. idle_loop() is an
infinite loop which eventually exits via reset_stack_and_jump_ind() from a new
schedule. i.e. These paths are all fine having one extra return address on
the stack.
This leaves continue_pv_domain(), which is expecting to be a JMP target.
Alter it to strip the return address off the stack, which is safe because
there isn't actually a RET expecting to return to its caller.
This allows us to change reset_stack_and_jump_ind() to reset_stack_and_call_ind()
in order to mitigate the speculative type confusion.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8e186f98ce0e35d1754ec9299da41ec98873b65c)
Andrew Cooper [Tue, 26 Mar 2024 22:43:18 +0000 (22:43 +0000)]
x86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields
Right now, they're all bytes, but MSR_SPEC_CTRL has been steadily gaining new
features.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45dac88e78e8a2d9d8738eef884fe6730faf9e67)
Roger Pau Monne [Thu, 15 Feb 2024 16:46:53 +0000 (17:46 +0100)]
x86/vmx: Add support for virtualize SPEC_CTRL
The feature is defined in the tertiary exec control, and is available starting
from Sapphire Rapids and Alder Lake CPUs.
When enabled, two extra VMCS fields are used: SPEC_CTRL mask and shadow. Bits
set in mask are not allowed to be toggled by the guest (either set or clear)
and the value in the shadow field is the value the guest expects to be in the
SPEC_CTRL register.
By using it the hypervisor can force the value of SPEC_CTRL bits behind the
guest back without having to trap all accesses to SPEC_CTRL, note that no bits
are forced into the guest as part of this patch. It also allows getting rid of
SPEC_CTRL in the guest MSR load list, since the value in the shadow field will
be loaded by the hardware on vmentry.
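A toy model of the mask/shadow semantics described above (not the VMCS interface itself):

    #include <stdint.h>

    static uint64_t sc_mask;    /* bits the guest may not toggle */
    static uint64_t sc_shadow;  /* value the guest expects to read */
    static uint64_t sc_real;    /* value actually in MSR_SPEC_CTRL */

    static void guest_wrmsr_spec_ctrl(uint64_t val)
    {
        sc_shadow = val;        /* reads reflect the guest's write */
        /* Masked bits keep their hypervisor-chosen value. */
        sc_real = (sc_real & sc_mask) | (val & ~sc_mask);
    }

    static uint64_t guest_rdmsr_spec_ctrl(void)
    {
        return sc_shadow;
    }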
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 97c5b8b657e41a6645de9d40713b881234417b49)
Andrew Cooper [Mon, 25 Mar 2024 11:09:35 +0000 (11:09 +0000)]
x86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*
The complexity is getting out of hand.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 40dea83b75386cb693481cf340024ce093be5c0f)
Andrew Cooper [Fri, 22 Mar 2024 14:33:17 +0000 (14:33 +0000)]
x86/spec-ctrl: Simplify DO_COND_IBPB
With the prior refactoring, SPEC_CTRL_ENTRY_{PV,INTR} both load SCF into %ebx,
and handle the conditional safety including skipping if interrupting Xen.
Therefore, we can drop the maybexen parameter and the conditional safety.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2378d16a931de0e62c03669169989e9437306abe)
Andrew Cooper [Fri, 22 Mar 2024 12:08:02 +0000 (12:08 +0000)]
x86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}
... as we do in the exit paths too. This will allow simplification to the
sub-blocks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9607aeb6602b8ed9962404de3f5f90170ffddb66)
Andrew Cooper [Fri, 22 Mar 2024 15:52:06 +0000 (15:52 +0000)]
x86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV
Other SPEC_CTRL_* paths already use %r14 like this, and it will allow for
simplifications.
All instances of SPEC_CTRL_ENTRY_FROM_PV are followed by a GET_STACK_END()
invocation, so this change is only really logic and register shuffling.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22390697bf1b4cd3024f2d10893dec3c3ec08a9c)
Andrew Cooper [Fri, 22 Mar 2024 11:41:41 +0000 (11:41 +0000)]
x86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*
Right now, we have a mix of safety strategies in different blocks, making the
logic fragile and hard to follow.
Start addressing this by having a safety LFENCE at the end of the blocks,
which can be patched out if other safety criteria are met. This will allow us
to simplify the sub-blocks. For SPEC_CTRL_ENTRY_FROM_IST, simply leave an
LFENCE unconditionally at the end; the IST path is not a fast-path by any
stretch of the imagination.
For SPEC_CTRL_ENTRY_FROM_INTR, the existing description was incorrect. The
IRET #GP path is non-fatal but can occur with the guest's choice of
MSR_SPEC_CTRL. It is safe to skip the flush/barrier-like protections when
interrupting Xen, but we must run DO_SPEC_CTRL_ENTRY irrespective.
This will skip RSB stuffing which was previously unconditional even when
interrupting Xen.
AFAICT, this is a missing cleanup from commit 3fffaf9c13e9 ("x86/entry: Avoid
using alternatives in NMI/#MC paths") where we split the IST entry path out of
the main INTR entry path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 94896de1a98c4289fe6fef9e16ef99fc6ef2efc4)
Andrew Cooper [Thu, 28 Mar 2024 11:57:25 +0000 (11:57 +0000)]
x86/spec-ctrl: Rename spec_ctrl_flags to scf
XSA-455 was ultimately caused by having fields with too-similar names.
Both {xen,last}_spec_ctrl are fields containing an architectural MSR_SPEC_CTRL
value. The spec_ctrl_flags field contains Xen-internal flags.
To more-obviously distinguish the two, rename spec_ctrl_flags to scf, which is
also the prefix of the constants used by the fields.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c62673c4334b3372ebd4292a7ac8185357e7ea27)
Andrew Cooper [Tue, 9 Apr 2024 14:03:05 +0000 (15:03 +0100)]
x86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests
All of these are prediction-mode (i.e. CPL) based. They don't operate as
advertised in PV context.
Fixes: 4dd676070684 ("x86/spec-ctrl: Expose IPRED_CTRL to guests") Fixes: 478e4787fa64 ("x86/spec-ctrl: Expose RRSBA_CTRL to guests") Fixes: 583f1d095052 ("x86/spec-ctrl: Expose BHI_CTRL to guests") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 4b3da946ad7e3452761478ae683da842e7ff20d6)
x86/alternatives: fix .init section reference in _apply_alternatives()
The code in _apply_alternatives() will unconditionally attempt to read
__initdata_cf_clobber_{start,end} when called as part of applying alternatives
to a livepatch payload when Xen is using IBT.
That leads to a page-fault as __initdata_cf_clobber_{start,end} living in
.init section will have been unmapped by the time a livepatch gets loaded.
Fix by adding a check that limits the clobbering of endbr64 instructions to
boot time only.
Fixes: 37ed5da851b8 ('x86/altcall: Optimise away endbr64 instruction where possible') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4be1fef1e6572c2be0bd378902ffb62a6e73faeb)
Andrew Cooper [Wed, 3 Apr 2024 16:43:42 +0000 (17:43 +0100)]
x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch
It turns out there is something wonky on some but not all CPUs with
MSR_TSX_FORCE_ABORT. The presence of RTM_ALWAYS_ABORT causes Xen to think
it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions
genuinely #UD.
Spot this case and try to back out as cleanly as we can.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b33f191e3ca99458fdcea1cb5a29dfa4965d1604)
Andrew Cooper [Thu, 28 Mar 2024 12:38:32 +0000 (12:38 +0000)]
x86/spec-ctrl: Move __read_mostly data into __ro_after_init
These variables predate the introduction of __ro_after_init, but all qualify.
Update them to be consistent with the rest of the file.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7a09966e7b2823b70f6d56d0cf66c11124f4a3c1)
Jan Beulich [Wed, 7 Feb 2024 12:46:11 +0000 (13:46 +0100)]
VMX: tertiary execution control infrastructure
This is a prereq to enabling e.g. the MSRLIST feature.
Note that the PROCBASED_CTLS3 MSR is different from other VMX feature
reporting MSRs, in that all 64 bits report allowed 1-settings.
vVMX code is left alone, though, for the time being.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 878159bf259bfbd7a40312829f1ea0ce1f6645e2)
Jan Beulich [Mon, 5 Feb 2024 09:48:11 +0000 (10:48 +0100)]
x86/CPU: convert vendor hook invocations to altcall
While not performance critical, these hook invocations still want
converting: This way all pre-filled struct cpu_dev instances can become
__initconst_cf_clobber, thus allowing to eliminate further 8 ENDBR
during the 2nd phase of alternatives patching (besides moving previously
resident data to .init.*).
Since all use sites need touching anyway, take the opportunity and also
address a Misra C:2012 Rule 5.5 violation: Rename the this_cpu static
variable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 660f8a75013c947fbe5358a640032a1f9f1eece5)
Jan Beulich [Mon, 5 Feb 2024 09:45:31 +0000 (10:45 +0100)]
x86/guest: finish conversion to altcall
While .setup() and .e820_fixup() don't need fiddling with for being run
only very early, both .ap_setup() and .resume() want converting too:
This way both pre-filled struct hypervisor_ops instances can become
__initconst_cf_clobber, thus allowing to eliminate up to 5 more ENDBR
(configuration dependent) during the 2nd phase of alternatives patching.
While fiddling with section annotations here, also move "ops" itself to
.data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org>
(cherry picked from commit e931edccc53c9dd6e9a505ad0ff3a03d985669bc)
Jan Beulich [Mon, 5 Feb 2024 09:44:46 +0000 (10:44 +0100)]
x86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()
While altcall is already used for them, the functions want announcing in
.init.rodata.cf_clobber, even if the resulting static variables aren't
otherwise used.
While doing this also move ctxt_switch_masking to .data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 044168fa3a65b6542bda5c21e373742de1bd5980)
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests
The CPUID feature bit signals the presence of the BHI_DIS_S control in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 583f1d0950529f3517b1741c2b21a028a82ba831)
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests
The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 478e4787fa64b621061177a7843c452e9a19916d)
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests
The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
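For reference across this and the two preceding entries, the controls
are all bits in MSR_SPEC_CTRL (positions as per the Intel SDM; worth
double-checking before relying on them):

#define SPEC_CTRL_IPRED_DIS_U  (_AC(1, ULL) <<  3)
#define SPEC_CTRL_IPRED_DIS_S  (_AC(1, ULL) <<  4)
#define SPEC_CTRL_RRSBA_DIS_U  (_AC(1, ULL) <<  5)
#define SPEC_CTRL_RRSBA_DIS_S  (_AC(1, ULL) <<  6)
#define SPEC_CTRL_BHI_DIS_S    (_AC(1, ULL) << 10)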
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4dd6760706848de30f7c8b5f83462b9bcb070c91)
Jan Beulich [Tue, 23 Jan 2024 11:03:23 +0000 (12:03 +0100)]
IRQ: generalize [gs]et_irq_regs()
Move functions (and their data) to common code, and invoke the functions
on Arm as well. This is in preparation for dropping the register
parameters from handler functions.
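A sketch of the generalized helpers (mirroring the well-known Linux
pattern; exact naming/placement in Xen may differ):

static DEFINE_PER_CPU(struct cpu_user_regs *, irq_regs);

static inline struct cpu_user_regs *get_irq_regs(void)
{
    return this_cpu(irq_regs);
}

static inline struct cpu_user_regs *set_irq_regs(struct cpu_user_regs *new_regs)
{
    struct cpu_user_regs *old_regs = this_cpu(irq_regs);

    this_cpu(irq_regs) = new_regs;

    return old_regs;
}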
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit f67bddf3bccd99a5fee968c3b3f288db6a57d3be)
Jan Beulich [Mon, 22 Jan 2024 12:41:07 +0000 (13:41 +0100)]
x86/MCE: switch some callback invocations to altcall
While not performance critical, these hook invocations are still worth
converting: This way all pre-filled (and newly introduced)
struct mce_callback instances can become __initconst_cf_clobber, thus
allowing another 9 ENDBR to be eliminated during the 2nd phase of
alternatives patching.
While this means registering callbacks a little earlier, doing so is
perhaps even advantageous, for having the pointers be non-NULL earlier on.
Only one set of callbacks will ever be registered anyway, and neither of
the respective initialization functions can (subsequently) fail.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 85ba4d050f9f3c4286164f21660ae88435b7e83c)
Jan Beulich [Mon, 22 Jan 2024 12:40:32 +0000 (13:40 +0100)]
x86/MCE: separate BSP-only initialization
Several function pointers are registered over and over again, when
setting them once on the BSP suffices. Arrange for this in the vendor
init functions and mark involved registration functions __init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 9f58616ddb1cc1870399de2202fafc7bf0d61694)
Jan Beulich [Mon, 22 Jan 2024 12:40:00 +0000 (13:40 +0100)]
x86/PV: avoid indirect call for I/O emulation quirk hook
This way ioemul_handle_proliant_quirk() won't need ENDBR anymore.
While touching this code, also
- arrange for it to not be built at all when !PV,
- add "const" to the last function parameter and bring the definition
in sync with the declaration (for Misra).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1212af3e8c4d3a1350046d4fe0ca3b97b51e67de)
Jan Beulich [Mon, 22 Jan 2024 12:39:23 +0000 (13:39 +0100)]
x86/MTRR: avoid several indirect calls
The use of (supposedly) vendor-specific hooks is a relic from the days
when Xen could still be built as a 32-bit binary. There's no
expectation that a new need for such an abstraction would arise. Convert
mtrr_if to a mere boolean, and all prior calls through it to direct ones,
thus allowing 6 ENDBR to be eliminated from .text.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9e0eb30d4d6565b411499ca826718b4b9acab68)
Jan Beulich [Mon, 22 Jan 2024 12:38:24 +0000 (13:38 +0100)]
core-parking: use alternative_call()
This way we can arrange for core_parking_{performance,power}()'s ENDBR
to also be zapped.
For the decision to be taken before the 2nd alternatives patching pass,
the initcall needs to become a pre-SMP one, though.
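Illustratively (the function name is hypothetical), the registration
moves to a pre-SMP initcall so it has run by the time the 2nd patching
pass executes:

static int __init cf_check core_parking_policy_init(void)
{
    /* Choose core_parking_performance vs core_parking_power here, fixing
     * the hook before the 2nd alternatives patching pass runs. */
    return 0;
}
presmp_initcall(core_parking_policy_init);   /* was: __initcall(...) */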
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1bc07ebcac3b1bb2a378732bc0f9a19940e76faf)
Jan Beulich [Wed, 17 Jan 2024 09:43:02 +0000 (10:43 +0100)]
x86/HPET: avoid an indirect call
When this code was written, indirect branches still weren't considered
much of a problem (besides being a little slower). Instead of a function
pointer, pass a boolean to _disable_pit_irq(), thus allowing two ENDBR
to be eliminated (one of them in .text).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 730d2637a8e5b98dc8e4e366179b4cedc496b3ad)
Jan Beulich [Wed, 17 Jan 2024 09:42:27 +0000 (10:42 +0100)]
cpufreq: finish conversion to altcall
Even functions used on infrequently executed paths want converting: This
way all pre-filled struct cpufreq_driver instances can become
__initconst_cf_clobber, thus allowing another 15 ENDBR to be eliminated
during the 2nd phase of alternatives patching.
For acpi-cpufreq's optionally populated .get hook, make sure alternatives
patching can actually see the pointer. See also the code comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 467ae515caee491e9b6ae1da8b9b98d094955822)
Jan Beulich [Wed, 17 Jan 2024 09:41:52 +0000 (10:41 +0100)]
x86/APIC: finish genapic conversion to altcall
While .probe() doesn't need touching, as it is run only very early,
init_apic_ldr() wants converting too despite not being on a frequently
executed path: This way all pre-filled struct genapic instances can
become __initconst_cf_clobber, thus allowing 15 more ENDBR to be
eliminated during the 2nd phase of alternatives patching.
While fiddling with section annotations here, also move "genapic" itself
to .data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b1cc53753cba4c3253f2e1093a3a6a9a828314bf)
Andrew Cooper [Tue, 26 Mar 2024 22:47:25 +0000 (22:47 +0000)]
x86/spec-ctrl: Fix BTC/SRSO mitigations
We were looking for SCF_entry_ibpb in the wrong variable in the top-of-stack
block, and xen_spec_ctrl won't have had bit 5 set because Xen doesn't
understand SPEC_CTRL_RRSBA_DIS_U yet.
This is XSA-455 / CVE-2024-31142.
Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Bjoern Doebel [Wed, 27 Mar 2024 18:30:55 +0000 (18:30 +0000)]
hypercall_xlat_continuation: Replace BUG_ON with domain_crash
Instead of crashing the host in case of unexpected hypercall parameters,
resort to crashing only the calling domain.
This is part of XSA-454 / CVE-2023-46842.
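As a hedged sketch of the shape of the change (condition and error
value illustrative, not the exact committed code):

/* Before: BUG_ON(bad_param); -- guest-reachable, takes the whole host
 * down.  After: crash only the offending domain and fail the call.   */
if ( bad_param )
{
    domain_crash(current->domain);
    return -EILSEQ;
}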
Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Bjoern Doebel <doebel@amazon.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9926e692c4afc40bcd66f8416ff6a1e93ce402f6)