xenbits.xensource.com Git - xen.git/log
10 months agox86/EPT: drop questionable mfn_valid() from epte_get_entry_emt()
Jan Beulich [Wed, 26 Jun 2024 12:10:40 +0000 (14:10 +0200)]
x86/EPT: drop questionable mfn_valid() from epte_get_entry_emt()

mfn_valid() is RAM-focused; it will often return false for MMIO. Yet
access to actual MMIO space should not generally be restricted to UC
only; especially video frame buffer accesses are unduly affected by such
a restriction.

Since, as of 777c71d31325 ("x86/EPT: avoid marking non-present entries
for re-configuring"), the function won't be called with INVALID_MFN or,
worse, truncated forms thereof anymore, we can fully drop that check.

Fixes: 81fd0d3ca4b2 ("x86/hvm: simplify 'mmio_direct' check in epte_get_entry_emt()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 4fdd8d75566fdad06667a79ec0ce6f43cc466c54
master date: 2024-06-13 16:55:22 +0200
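
As a rough sketch (not the literal Xen source; names simplified), the
dropped logic amounted to forcing UC for anything failing the RAM-centric
validity test:

    /* Simplified illustration of the check being removed: */
    if ( !mfn_valid(mfn) )
        return MTRR_TYPE_UNCACHABLE; /* penalises real MMIO, e.g. frame buffers */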

10 months agox86/EPT: avoid marking non-present entries for re-configuring
Jan Beulich [Wed, 26 Jun 2024 12:10:15 +0000 (14:10 +0200)]
x86/EPT: avoid marking non-present entries for re-configuring

For non-present entries EMT, like most other fields, is meaningless to
hardware. Make the logic in ept_set_entry() setting the field (and iPAT)
conditional upon dealing with a present entry, leaving the value at 0
otherwise. This has two effects for epte_get_entry_emt() which we'll
want to leverage subsequently:
1) The call moved here now won't be issued with INVALID_MFN anymore (a
   respective BUG_ON() is being added).
2) Neither of the other two calls could now be issued with a truncated
   form of INVALID_MFN anymore (as long as there's no bug anywhere
   marking an entry present when that was populated using INVALID_MFN).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 777c71d31325bc55ba1cc3f317d4155fe519ab0b
master date: 2024-06-13 16:54:17 +0200
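
In sketch form (ept_set_entry() logic simplified, signature approximated),
the change makes the EMT computation conditional on the entry being present:

    /* Only compute EMT/iPAT for present entries; leave both at 0 otherwise. */
    if ( is_present )  /* assumed predicate for a present entry */
    {
        bool ipat;

        new_entry.emt = epte_get_entry_emt(d, gfn, mfn, order, &ipat, type);
        new_entry.ipat = ipat;
    }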

10 months agox86/EPT: correct special page checking in epte_get_entry_emt()
Jan Beulich [Wed, 26 Jun 2024 12:09:50 +0000 (14:09 +0200)]
x86/EPT: correct special page checking in epte_get_entry_emt()

mfn_valid() granularity is (currently) 256Mb. Therefore the start of a
1Gb page passing the test doesn't necessarily mean all parts of such a
range would also pass. Yet using the result of mfn_to_page() on an MFN
which doesn't pass mfn_valid() checking is liable to result in a crash
(the invocation of mfn_to_page() alone is presumably "just" UB in such a
case).

Fixes: ca24b2ffdbd9 ("x86/hvm: set 'ipat' in EPT for special pages")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5540b94e8191059eb9cbbe98ac316232a42208f6
master date: 2024-06-13 16:53:34 +0200

10 months agox86/irq: limit interrupt movement done by fixup_irqs()
Roger Pau Monné [Wed, 26 Jun 2024 12:09:15 +0000 (14:09 +0200)]
x86/irq: limit interrupt movement done by fixup_irqs()

The current check used in fixup_irqs() to decide whether to move around
interrupts is based on the affinity mask, but such a mask can have all bits
set, and hence is unlikely to be a subset of the input mask.  For example if
an interrupt has an affinity mask of all 1s, any input to fixup_irqs() that's
not an all-set CPU mask would cause that interrupt to be shuffled around
unconditionally.

What fixup_irqs() cares about is evacuating interrupts from CPUs not set on the
input CPU mask, and for that purpose it should check whether the interrupt is
assigned to a CPU not present in the input mask.  Assume that ->arch.cpu_mask
is a subset of the ->affinity mask, and keep the current logic that resets the
->affinity mask if the interrupt has to be shuffled around.

Doing the affinity movement based on ->arch.cpu_mask requires removing the
special handling to ->arch.cpu_mask done for high priority vectors, otherwise
the adjustment done to cpu_mask makes them always skip the CPU interrupt
movement.

While there also adjust the comment as to the purpose of fixup_irqs().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c7564d7366d865cc407e3d64bca816d07edee174
master date: 2024-06-12 14:30:40 +0200
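
A sketch of the reworked condition (field names as described above, logic
simplified): evacuate only when the interrupt is currently assigned to a CPU
outside the input mask.

    /* Old: decision based on ->affinity, which may have all bits set. */
    /* New (sketch): */
    if ( !cpumask_subset(desc->arch.cpu_mask, mask) )
    {
        /* Move the interrupt and reset ->affinity accordingly. */
    }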

10 months agox86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts
Roger Pau Monné [Wed, 26 Jun 2024 12:08:40 +0000 (14:08 +0200)]
x86/smp: do not use shorthand IPI destinations in CPU hot{,un}plug contexts

Due to the current rwlock logic, if the CPU calling get_cpu_maps() does
so from a cpu_hotplug_{begin,done}() region the function will still
return success, because a CPU taking the rwlock in read mode after
having taken it in write mode is allowed.  Such a corner case makes using
get_cpu_maps() alone not enough to prevent using the shorthand in CPU
hotplug regions.

Introduce a new helper to detect whether the current caller is between a
cpu_hotplug_{begin,done}() region and use it in send_IPI_mask() to restrict
shorthand usage.

Fixes: 5500d265a2a8 ('x86/smp: use APIC ALLBUT destination shorthand when possible')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 171c52fba5d94e050d704770480dcb983490d0ad
master date: 2024-06-12 14:29:31 +0200
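
A sketch of the resulting logic in send_IPI_mask() (helper name per the
master commit; surrounding detail omitted):

    bool use_shorthand = false;

    if ( get_cpu_maps() )
    {
        /* Shorthands are unsafe between cpu_hotplug_{begin,done}(). */
        use_shorthand = !cpu_in_hotplug_context() /* && other criteria */;
        put_cpu_maps();
    }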

10 months agoCI: Update FreeBSD to 13.3
Andrew Cooper [Wed, 26 Jun 2024 12:07:53 +0000 (14:07 +0200)]
CI: Update FreeBSD to 13.3

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 5ea7f2c9d7a1334b3b2bd5f67fab4d447b60613d
master date: 2024-06-11 17:00:10 +0100

10 months agox86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count
Roger Pau Monné [Wed, 26 Jun 2024 12:07:06 +0000 (14:07 +0200)]
x86/irq: remove offline CPUs from old CPU mask when adjusting move_cleanup_count

When adjusting move_cleanup_count to account for CPUs that are offline also
adjust old_cpu_mask, otherwise further calls to fixup_irqs() could subtract
those again and create an imbalance in move_cleanup_count.

Fixes: 472e0b74c5c4 ('x86/IRQ: deal with move cleanup count state in fixup_irqs()')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e63209d3ba2fd1b2f232babd14c9c679ffa7b09a
master date: 2024-06-10 10:33:22 +0200
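
In sketch form (mask handling simplified), the fix pairs the count
adjustment with a matching mask adjustment:

    /* CPUs in old_cpu_mask that have gone offline. */
    cpumask_andnot(tmp, desc->arch.old_cpu_mask, &cpu_online_map);
    desc->arch.move_cleanup_count -= cpumask_weight(tmp);
    /* The fix: drop them from old_cpu_mask too, so a later fixup_irqs()
     * cannot subtract them a second time. */
    cpumask_andnot(desc->arch.old_cpu_mask, desc->arch.old_cpu_mask, tmp);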

10 months agox86/msi: prevent watchdog triggering when dumping MSI state
Roger Pau Monné [Wed, 26 Jun 2024 12:06:35 +0000 (14:06 +0200)]
x86/msi: prevent watchdog triggering when dumping MSI state

Use the same check that's used in dump_irqs().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 594b22ca5be681ec1b42c34f321cc2600d582210
master date: 2024-05-20 14:29:44 +0100

10 months agox86/ucode: Further fixes to identify "ucode already up to date"
Andrew Cooper [Wed, 26 Jun 2024 12:05:54 +0000 (14:05 +0200)]
x86/ucode: Further fixes to identify "ucode already up to date"

When the revision in hardware is newer than anything Xen has to hand,
'microcode_cache' isn't set up.  Then, `xen-ucode` initiates the update
because it doesn't know whether the revisions across the system are symmetric
or not.  This involves the patch getting all the way into the
apply_microcode() hooks before being found to be too old.

This is all a giant mess and needs an overhaul, but in the short term simply
adjust apply_microcode() to return -EEXIST.

Also, unconditionally print the preexisting microcode revision on boot.  It's
relevant information which is otherwise unavailable if Xen doesn't find new
microcode to use.

Fixes: 648db37a155a ("x86/ucode: Distinguish "ucode already up to date"")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 977d98e67c2e929c62aa1f495fc4c6341c45abb5
master date: 2024-05-16 13:59:11 +0100

11 months agoRevert "xen/xsm: Wire up get_dom0_console"
Jan Beulich [Tue, 21 May 2024 11:36:27 +0000 (13:36 +0200)]
Revert "xen/xsm: Wire up get_dom0_console"

This reverts commit 9cef77400470604e76c6c3aa9f647c40429ff956,
for not being applicable to this branch.

11 months agox86/mtrr: avoid system wide rendezvous when setting AP MTRRs
Roger Pau Monné [Tue, 21 May 2024 10:02:13 +0000 (12:02 +0200)]
x86/mtrr: avoid system wide rendezvous when setting AP MTRRs

There's no point in forcing a system wide update of the MTRRs on all processors
when there are no changes to be propagated.  On AP startup it's only the AP
that needs to write the system wide MTRR values in order to match the rest of
the already online CPUs.

We have occasionally seen the watchdog trigger during `xen-hptool cpu-online`
in one Intel Cascade Lake box with 448 CPUs due to the re-setting of the MTRRs
on all the CPUs in the system.

While there adjust the comment to clarify why the system-wide resetting of the
MTRR registers is not needed for the purposes of mtrr_ap_init().

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: abd00b037da5ffa4e8c4508a5df0cd6eabb805a4
master date: 2024-05-15 19:59:52 +0100

11 months agotools/xentop: Fix cpu% sort order
Leigh Brown [Tue, 21 May 2024 10:02:03 +0000 (12:02 +0200)]
tools/xentop: Fix cpu% sort order

In compare_cpu_pct(), there is a double -> unsigned long long conversion when
calling compare().  In C, this discards the fractional part, resulting in an
out-of-order sorting such as:

        NAME  STATE   CPU(sec) CPU(%)
       xendd --b---       4020    5.7
    icecream --b---       2600    3.8
    Domain-0 -----r       1060    1.5
        neon --b---        827    1.1
      cheese --b---        225    0.7
       pizza --b---        359    0.5
     cassini --b---        490    0.4
     fusilli --b---        159    0.2
         bob --b---        502    0.2
     blender --b---        121    0.2
       bread --b---         69    0.1
    chickpea --b---         67    0.1
      lentil --b---         67    0.1

Introduce a compare_dbl() function and update compare_cpu_pct() to call it.

Fixes: 49839b535b78 ("Add xenstat framework.")
Signed-off-by: Leigh Brown <leigh@solinno.co.uk>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e27fc7d15eab79e604e8b8728778594accc23cf1
master date: 2024-05-15 19:59:52 +0100
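
The shape of the fix, per the description above (a sketch; the truncation
happened because compare() takes unsigned long long arguments):

    /* Compare two doubles without truncating the fractional part. */
    static int compare_dbl(double d1, double d2)
    {
        if ( d1 < d2 )
            return -1;
        if ( d1 > d2 )
            return 1;
        return 0;
    }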

11 months agox86: respect mapcache_domain_init() failing
Jan Beulich [Tue, 21 May 2024 10:01:33 +0000 (12:01 +0200)]
x86: respect mapcache_domain_init() failing

The function itself properly handles and hands onwards failure from
create_perdomain_mapping(). Therefore its caller should respect possible
failure, too.

Fixes: 4b28bf6ae90b ("x86: re-introduce map_domain_page() et al")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7270fdc7a0028d4b7b26fd1b36c6b9e97abcf3da
master date: 2024-05-15 19:59:52 +0100
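
The caller-side pattern, sketched (context and error path assumed):

    rc = mapcache_domain_init(d);
    if ( rc )
        goto fail;  /* previously the return value was ignored */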

11 months agoxen/sched: set all sched_resource data inside locked region for new cpu
Juergen Gross [Tue, 21 May 2024 10:01:06 +0000 (12:01 +0200)]
xen/sched: set all sched_resource data inside locked region for new cpu

When adding a cpu to a scheduler, set all data items of struct
sched_resource inside the locked region, as otherwise a race might
happen (e.g. when trying to access the cpupool of the cpu):

  (XEN) ----[ Xen-4.19.0-1-d  x86_64  debug=y  Tainted:     H  ]----
  (XEN) CPU:    45
  (XEN) RIP:    e008:[<ffff82d040244cbf>] common/sched/credit.c#csched_load_balance+0x41/0x877
  (XEN) RFLAGS: 0000000000010092   CONTEXT: hypervisor
  (XEN) rax: ffff82d040981618   rbx: ffff82d040981618   rcx: 0000000000000000
  (XEN) rdx: 0000003ff68cd000   rsi: 000000000000002d   rdi: ffff83103723d450
  (XEN) rbp: ffff83207caa7d48   rsp: ffff83207caa7b98   r8:  0000000000000000
  (XEN) r9:  ffff831037253cf0   r10: ffff83103767c3f0   r11: 0000000000000009
  (XEN) r12: ffff831037237990   r13: ffff831037237990   r14: ffff831037253720
  (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 0000000000f526e0
  (XEN) cr3: 000000005bc2f000   cr2: 0000000000000010
  (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
  (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
  (XEN) Xen code around <ffff82d040244cbf> (common/sched/credit.c#csched_load_balance+0x41/0x877):
  (XEN)  48 8b 0c 10 48 8b 49 08 <48> 8b 79 10 48 89 bd b8 fe ff ff 49 8b 4e 28 48
  <snip>
  (XEN) Xen call trace:
  (XEN)    [<ffff82d040244cbf>] R common/sched/credit.c#csched_load_balance+0x41/0x877
  (XEN)    [<ffff82d040245a18>] F common/sched/credit.c#csched_schedule+0x36a/0x69f
  (XEN)    [<ffff82d040252644>] F common/sched/core.c#do_schedule+0xe8/0x433
  (XEN)    [<ffff82d0402572dd>] F common/sched/core.c#schedule+0x2e5/0x2f9
  (XEN)    [<ffff82d040232f35>] F common/softirq.c#__do_softirq+0x94/0xbe
  (XEN)    [<ffff82d040232fc8>] F do_softirq+0x13/0x15
  (XEN)    [<ffff82d0403075ef>] F arch/x86/domain.c#idle_loop+0x92/0xe6
  (XEN)
  (XEN) Pagetable walk from 0000000000000010:
  (XEN)  L4[0x000] = 000000103ff61063 ffffffffffffffff
  (XEN)  L3[0x000] = 000000103ff60063 ffffffffffffffff
  (XEN)  L2[0x000] = 0000001033dff063 ffffffffffffffff
  (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
  (XEN)
  (XEN) ****************************************
  (XEN) Panic on CPU 45:
  (XEN) FATAL PAGE FAULT
  (XEN) [error_code=0000]
  (XEN) Faulting linear address: 0000000000000010
  (XEN) ****************************************

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Fixes: a8c6c623192e ("sched: clarify use cases of schedule_cpu_switch()")
Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d104a07524ffc92ae7a70dfe192c291de2a563cc
master date: 2024-05-15 19:59:52 +0100

11 months agolibxl: Fix handling XenStore errors in device creation
Demi Marie Obenour [Tue, 21 May 2024 10:00:34 +0000 (12:00 +0200)]
libxl: Fix handling XenStore errors in device creation

If xenstored runs out of memory it is possible for it to fail operations
that should succeed.  libxl wasn't robust against this, and could fail
to ensure that the TTY path of a non-initial console was created and
read-only for guests.  This doesn't qualify for an XSA because guests
should not be able to run xenstored out of memory, but it still needs to
be fixed.

Add the missing error checks to ensure that all errors are properly
handled and that at no point can a guest make the TTY path of its
frontend directory writable.

Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 531d3bea5e9357357eaf6d40f5784a1b4c29b910
master date: 2024-05-11 00:13:43 +0100

11 months agolibxl: fix population of the online vCPU bitmap for PVH
Roger Pau Monné [Tue, 21 May 2024 10:00:09 +0000 (12:00 +0200)]
libxl: fix population of the online vCPU bitmap for PVH

libxl passes some information to libacpi to create the ACPI table for a PVH
guest, and among that information it's a bitmap of which vCPUs are online
which can be less than the maximum number of vCPUs assigned to the domain.

While the population of the bitmap is done correctly for HVM based on the
number of online vCPUs, for PVH the population of the bitmap is done based on
the number of maximum vCPUs allowed.  This leads to all local APIC entries in
the MADT being set as enabled, which contradicts the data in xenstore if the
number of online vCPUs differs from the maximum.

Fix by copying the internal libxl bitmap that's populated based on the vCPUs
parameter.

Reported-by: Arthur Borsboom <arthurborsboom@gmail.com>
Link: https://gitlab.com/libvirt/libvirt/-/issues/399
Reported-by: Leigh Brown <leigh@solinno.co.uk>
Fixes: 14c0d328da2b ('libxl/acpi: Build ACPI tables for HVMlite guests')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Leigh Brown <leigh@solinno.co.uk>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5cc7347b04b2d0a3133754c7a9b936f614ec656a
master date: 2024-05-11 00:13:43 +0100

11 months agox86/ucode: Distinguish "ucode already up to date"
Andrew Cooper [Tue, 21 May 2024 09:59:36 +0000 (11:59 +0200)]
x86/ucode: Distinguish "ucode already up to date"

Right now, Xen returns -ENOENT for both "the provided blob isn't correct for
this CPU", and "the blob isn't newer than what's loaded".

This in turn causes xen-ucode to exit with an error, when "nothing to do" is
more commonly a success condition.

Handle EEXIST specially and exit cleanly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 648db37a155aca6f66d4cf3bb118417a728c3579
master date: 2024-05-09 18:19:49 +0100
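
A sketch of the resulting xen-ucode behaviour (error handling approximated):

    ret = xc_microcode_update(xch, buf, len);
    if ( ret && errno == EEXIST )
    {
        printf("Microcode already up to date\n");
        exit(0);  /* "nothing to do" is success, not failure */
    }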

11 months agox86/cpu-policy: Fix migration from Ice Lake to Cascade Lake
Andrew Cooper [Tue, 21 May 2024 09:59:09 +0000 (11:59 +0200)]
x86/cpu-policy: Fix migration from Ice Lake to Cascade Lake

Ever since Xen 4.14, there has been a latent bug with migration.

While some toolstacks can level the features properly, they don't shrink
feat.max_subleaf when all features have been dropped.  This is because
we *still* have not completed the toolstack side work for full CPU Policy
objects.

As a consequence, even when properly feature levelled, VMs can't migrate
"backwards" across hardware which reduces feat.max_subleaf.  One such example
is Ice Lake (max_subleaf=2 for INTEL_PSFD) to Cascade Lake (max_subleaf=0).

Extend the max policies feat.max_subleaf to the highest number Xen knows
about, but leave the default policies matching the host.  This will allow VMs
with a higher feat.max_subleaf than strictly necessary to migrate in.

Eventually we'll manage to teach the toolstack how to avoid creating such VMs
in the first place, but there's still more work to do there.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2330b51df267e20e66bbba6c5bf08f0570ed58b
master date: 2024-05-07 16:56:46 +0100
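
In sketch form, the max policy adjustment (expression assumed):

    /* Advertise the highest feature subleaf Xen knows about in the max
     * policy; the default policies keep following the host. */
    p->feat.max_subleaf = ARRAY_SIZE(p->feat.raw) - 1;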

11 months agotools/libxs: Open /dev/xen/xenbus fds as O_CLOEXEC
Andrew Cooper [Tue, 21 May 2024 09:58:47 +0000 (11:58 +0200)]
tools/libxs: Open /dev/xen/xenbus fds as O_CLOEXEC

The header description for xs_open() goes as far as to suggest that the fd is
O_CLOEXEC, but it isn't actually.

`xl devd` has been observed leaking /dev/xen/xenbus into children.

Link: https://github.com/QubesOS/qubes-issues/issues/8292
Reported-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: f4f2f3402b2f4985d69ffc0d46f845d05fd0b60f
master date: 2024-05-07 15:18:36 +0100
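
The fix in sketch form (flags as described; fallback handling for platforms
lacking O_CLOEXEC omitted):

    int fd = open("/dev/xen/xenbus", O_RDWR | O_CLOEXEC);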

11 months agoVT-d: correct ATS checking for root complex integrated devices
Jan Beulich [Tue, 21 May 2024 09:58:17 +0000 (11:58 +0200)]
VT-d: correct ATS checking for root complex integrated devices

Spec version 4.1 says

"The ATSR structures identifies PCI Express Root-Ports supporting
 Address Translation Services (ATS) transactions. Software must enable
 ATS on endpoint devices behind a Root Port only if the Root Port is
 reported as supporting ATS transactions."

Clearly root complex integrated devices aren't "behind root ports",
matching my observation on a SapphireRapids system having an ATS-
capable root complex integrated device. Hence for such devices we
shouldn't try to locate a corresponding ATSR.

Since both pci_find_ext_capability() and pci_find_cap_offset() return
"unsigned int", change "pos" to that type at the same time.

Fixes: 903b93211f56 ("[VTD] laying the ground work for ATS")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 04e31583bab97e5042a44a1d00fce2760272635f
master date: 2024-05-06 09:22:45 +0200

11 months agoxen/x86: Fix Syntax warning in gen-cpuid.py
Jason Andryuk [Tue, 21 May 2024 09:57:41 +0000 (11:57 +0200)]
xen/x86: Fix Syntax warning in gen-cpuid.py

Python 3.12.2 warns:

xen/tools/gen-cpuid.py:50: SyntaxWarning: invalid escape sequence '\s'
  "\s+([\s\d]+\*[\s\d]+\+[\s\d]+)\)"
xen/tools/gen-cpuid.py:51: SyntaxWarning: invalid escape sequence '\s'
  "\s+/\*([\w!]*) .*$")

Specify the strings as raw strings so '\s' is read as literal '\' + 's'.
This avoids escaping all the '\'s in the strings.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 08e79bba73d74a85d3ce6ff0f91c5205f1e05eda
master date: 2024-04-30 08:34:37 +0200

11 months agoxen/xsm: Wire up get_dom0_console
Jason Andryuk [Tue, 21 May 2024 09:57:20 +0000 (11:57 +0200)]
xen/xsm: Wire up get_dom0_console

An XSM hook for get_dom0_console is currently missing.  Using XSM with
a PVH dom0 shows:
(XEN) FLASK: Denying unknown platform_op: 64.

Wire up the hook, and allow it for dom0.

Fixes: 4dd160583c ("x86/platform: introduce hypercall to get initial video console settings")
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 647f7e50ebeeb8152974cad6a12affe474c74513
master date: 2024-04-30 08:33:41 +0200

11 months agox86/rtc: Avoid UIP flag being set for longer than expected
Ross Lagerwall [Tue, 21 May 2024 09:55:49 +0000 (11:55 +0200)]
x86/rtc: Avoid UIP flag being set for longer than expected

In a test, OVMF reported an error initializing the RTC without
indicating the precise nature of the error. The only plausible
explanation I can find is as follows:

As part of the initialization, OVMF reads register C and then reads
register A repeatedly until the UIP flag is not set. If this takes longer
than 100 ms, OVMF fails and reports an error. This may happen with the
following sequence of events:

At guest time=0s, rtc_init() calls check_update_timer() which schedules
update_timer for t=(1 - 244us).

At t=1s, the update_timer function happens to have been called >= 244us
late. In the timer callback, it sets the UIP flag and schedules
update_timer2 for t=1s.

Before update_timer2 runs, the guest reads register C which calls
check_update_timer(). check_update_timer() stops the scheduled
update_timer2 and since the guest time is now outside of the update
cycle, it schedules update_timer for t=(2 - 244us).

The UIP flag will therefore be set for a whole second from t=1 to t=2
while the guest repeatedly reads register A waiting for the UIP flag to
clear. Fix it by clearing the UIP flag when scheduling update_timer.

I was able to reproduce this issue with a synthetic test and this
resolves the issue.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 43a07069863b419433dee12c9b58c1f7ce70aa97
master date: 2024-04-23 14:09:18 +0200
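
A sketch of the fix (register/field names as in the emulated RTC):

    /* When (re)scheduling update_timer, also clear a possibly stale UIP. */
    s->hw.cmos_data[RTC_REG_A] &= ~RTC_UIP;
    set_timer(&s->update_timer, next_update_time);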

11 months agoaltcall: fix __alt_call_maybe_initdata so it's safe for livepatch
Roger Pau Monné [Tue, 21 May 2024 09:55:09 +0000 (11:55 +0200)]
altcall: fix __alt_call_maybe_initdata so it's safe for livepatch

Setting alternative call variables as __init is not safe for use with
livepatch, as livepatches can rightfully introduce new alternative calls to
structures marked as __alt_call_maybe_initdata (possibly just indirectly due to
replacing existing functions that use those).  Attempting to resolve those
alternative calls then results in page faults as the variable that holds the
function pointer address has been freed.

When livepatch is supported use the __ro_after_init attribute instead of
__initdata for __alt_call_maybe_initdata.

Fixes: f26bb285949b ('xen: Implement xen/alternative-call.h for use in common code')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: af4cd0a6a61cdb03bc1afca9478b05b0c9703599
master date: 2024-04-11 18:51:36 +0100
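
The attribute selection then takes roughly this shape (a sketch):

    #ifdef CONFIG_LIVEPATCH
    # define __alt_call_maybe_initdata __ro_after_init
    #else
    # define __alt_call_maybe_initdata __initdata
    #endif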

11 months agox86/hvm: Fix Misra Rule 19.1 regression
Andrew Cooper [Tue, 21 May 2024 09:54:20 +0000 (11:54 +0200)]
x86/hvm: Fix Misra Rule 19.1 regression

Despite noticing an impending Rule 19.1 violation, the adjustment made (the
uint32_t cast) wasn't sufficient to avoid it.  Try again.

Subsequently noticed by Coverity too.

Fixes: 6a98383b0877 ("x86/HVM: clear upper halves of GPRs upon entry from 32-bit code")
Coverity-IDs: 1596289 thru 1596298
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: d0a718a45f14b86471d8eb3083acd72760963470
master date: 2024-04-11 13:23:08 +0100

11 months agoblock-common: Fix same_vm for no targets
Jason Andryuk [Tue, 21 May 2024 09:53:54 +0000 (11:53 +0200)]
block-common: Fix same_vm for no targets

same_vm is broken when the two main domains do not have targets.  otvm
and targetvm are both missing, which means they get set to -1 and then
converted to empty strings:

++10697+ local targetvm=-1
++10697+ local otvm=-1
++10697+ otvm=
++10697+ othervm=/vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4
++10697+ targetvm=
++10697+ local frontend_uuid=/vm/844dea4e-44f8-4e3e-8145-325132a31ca5

The final comparison returns true since the two empty strings match:

++10697+ '[' /vm/844dea4e-44f8-4e3e-8145-325132a31ca5 = /vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4 -o '' = /vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4 -o /vm/844dea4e-44f8-4e3e-8145-325132a31ca5 = '' -o '' = '' ']'

Replace -1 with distinct strings indicating the lack of a value and
remove the coalescing to empty strings.  The strings themselves will no
longer match, and that is correct.

++12364+ '[' /vm/844dea4e-44f8-4e3e-8145-325132a31ca5 = /vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4 -o 'No target' = /vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4 -o /vm/844dea4e-44f8-4e3e-8145-325132a31ca5 = 'No other target' -o 'No target' = 'No other target' ']'

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: e8f1bb803fdf44db708991593568a9e3e6b3d130
master date: 2024-02-07 13:46:52 +0100

11 months agoupdate Xen version to 4.17.5-pre
Jan Beulich [Tue, 21 May 2024 09:53:11 +0000 (11:53 +0200)]
update Xen version to 4.17.5-pre

12 months agox86/spec: adjust logic that elides lfence
Roger Pau Monné [Mon, 29 Apr 2024 07:39:53 +0000 (09:39 +0200)]
x86/spec: adjust logic that elides lfence

It's currently too restrictive by just checking whether there's a BHB clearing
sequence selected.  It should instead check whether BHB clearing is used on
entry from PV or HVM specifically.

Switch to use opt_bhb_entry_{pv,hvm} instead, and then remove cpu_has_bhb_seq
since it no longer has any users.

Reported-by: Jan Beulich <jbeulich@suse.com>
Fixes: 954c983abcee ('x86/spec-ctrl: Software BHB-clearing sequences')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 656ae8f1091bcefec9c46ec3ea3ac2118742d4f6
master date: 2024-04-25 16:37:01 +0200

12 months agox86/spec: fix reporting of BHB clearing usage from guest entry points
Roger Pau Monné [Mon, 29 Apr 2024 07:39:28 +0000 (09:39 +0200)]
x86/spec: fix reporting of BHB clearing usage from guest entry points

Reporting whether the BHB clearing on entry is done for the different domains
types based on cpu_has_bhb_seq is unhelpful, as that variable signals whether
there's a BHB clearing sequence selected, but that alone doesn't imply that
such sequence is used from the PV and/or HVM entry points.

Instead use opt_bhb_entry_{pv,hvm} which do signal whether BHB clearing is
performed on entry from PV/HVM.

Fixes: 689ad48ce9cf ('x86/spec-ctrl: Wire up the Native-BHI software sequences')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 049ab0b2c9f1f5edb54b505fef0bc575787dafe9
master date: 2024-04-25 16:35:56 +0200

12 months agox86/MTRR: correct inadvertently inverted WC check
Jan Beulich [Mon, 29 Apr 2024 07:38:47 +0000 (09:38 +0200)]
x86/MTRR: correct inadvertently inverted WC check

The ! clearly got lost by mistake.

Fixes: e9e0eb30d4d6 ("x86/MTRR: avoid several indirect calls")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 77e25f0e30ddd11e043e6fce84bf108ce7de5b6f
master date: 2024-04-23 14:13:48 +0200

12 months agox86/entry: Fix build with older toolchains
Andrew Cooper [Tue, 9 Apr 2024 20:39:51 +0000 (21:39 +0100)]
x86/entry: Fix build with older toolchains

Binutils older than 2.29 doesn't know INCSSPD.

Fixes: 8e186f98ce0e ("x86: Use indirect calls in reset-stack infrastructure")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit a9fa82500818a8d8ce5f2843f1577bd2c29d088e)

12 months agoRelease: Update CHANGELOG.md RELEASE-4.17.4
George Dunlap [Tue, 9 Apr 2024 15:48:56 +0000 (16:48 +0100)]
Release: Update CHANGELOG.md

Signed-off-by: George Dunlap <george.dunlap@cloud.com>
12 months agoUpdate Xen version to 4.17.4
Andrew Cooper [Wed, 27 Mar 2024 18:23:18 +0000 (18:23 +0000)]
Update Xen version to 4.17.4

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
12 months agox86/spec-ctrl: Support the "long" BHB loop sequence
Andrew Cooper [Fri, 22 Mar 2024 19:29:34 +0000 (19:29 +0000)]
x86/spec-ctrl: Support the "long" BHB loop sequence

Out of an abundance of caution, implement the long loop too, and allow for
it to be opted in to.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit d5887c0decbd90e798b24ed696628645b04632fb)

12 months agox86/spec-ctrl: Wire up the Native-BHI software sequences
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Wire up the Native-BHI software sequences

In the absence of BHI_DIS_S, mitigating Native-BHI requires the use of a
software sequence.

Introduce a new bhb-seq= option to select between available sequences and
bhb-entry= to control the per-PV/HVM actions like we have for other blocks.

Activate the short sequence by default for PV and HVM guests on affected
hardware if BHI_DIS_S isn't present.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 689ad48ce9cf4c38297cd126e7e003a1c13a3b9d)

12 months agox86/spec-ctrl: Software BHB-clearing sequences
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Software BHB-clearing sequences

Implement clear_bhb_{tsx,loops}() as per the BHI guidance.  The loops variant
is set up as the "short" sequence.

Introduce SCF_entry_bhb and extend SPEC_CTRL_ENTRY_* with a conditional call
to selected clearing routine.

Note that due to a limitation in the ALTERNATIVE capability, the TEST/JZ can't
be included alongside a CALL in a single alternative block.  This is going to
require further work to untangle.

The BHB sequences (if used) must be after the restoration of Xen's
MSR_SPEC_CTRL value, which must be accounted for when judging whether it is
safe to skip the safety LFENCEs.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 954c983abceee97bf5f6230b9ae164f2c49a9aa9)

12 months agox86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI
Andrew Cooper [Tue, 26 Mar 2024 19:01:37 +0000 (19:01 +0000)]
x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI

Introduce a "bhi-dis-s" boolean to match the other options we have for
MSR_SPEC_CTRL values.  Also introduce bhi_calculations().

Use BHI_DIS_S whenever possible.

Guests which are levelled to be migration compatible with older CPUs can't see
BHI_DIS_S, and Xen must fill in the difference to make the guest safe.  Use
the virt MSR_SPEC_CTRL infrastructure to force BHI_DIS_S behind the guest's
back.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 62a1106415c5e8a49b45147ca84d54a58d471343)

12 months agox86/tsx: Expose RTM_ALWAYS_ABORT to guests
Andrew Cooper [Sat, 6 Apr 2024 19:36:54 +0000 (20:36 +0100)]
x86/tsx: Expose RTM_ALWAYS_ABORT to guests

A TSX Abort is one option to mitigate Native-BHI, but a guest kernel doesn't get
to see this if Xen has turned RTM off using MSR_TSX_{CTRL,FORCE_ABORT}.

Therefore, the meaning of RTM_ALWAYS_ABORT has been adjusted to "XBEGIN won't
fault", and it should be exposed to guests so they can make a better decision.

Expose it in the max policy for any RTM-capable system.  Offer it by default
only if RTM has been disabled.

Update test-tsx to account for this new meaning.  While adjusting the logic in
test_guest_policies(), take the opportunity to use feature names (now they're
available) to make the logic easier to follow.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c94e2105924347de0d9f32065370e802a20cc829)

12 months agox86: Drop INDIRECT_JMP
Andrew Cooper [Fri, 22 Dec 2023 18:01:37 +0000 (18:01 +0000)]
x86: Drop INDIRECT_JMP

Indirect JMPs which are not tailcalls can lead to an unwelcome form of
speculative type confusion, and we've removed the uses of INDIRECT_JMP to
compensate.  Remove the temptation to reintroduce new instances.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0b66d7ce3c0290eaad28bdafb35200052d012b14)

12 months agox86: Use indirect calls in reset-stack infrastructure
Andrew Cooper [Fri, 22 Dec 2023 17:44:48 +0000 (17:44 +0000)]
x86: Use indirect calls in reset-stack infrastructure

Mixing up JMP and CALL indirect targets leads to a very fun form of
speculative type confusion.  A target which is expecting to be CALLed needs a
return address on the stack, and an indirect JMP doesn't place one there.

An indirect JMP which predicts to a target intending to be CALLed can end up
with a RET speculatively executing with a value from the JMPers stack frame.

There are several ways to get indirect JMPs in Xen.

 * From tailcall optimisations.  These are safe because the compiler has
   arranged the stack to point at the callee's return address.

 * From jump tables.  These are unsafe, but Xen is built with -fno-jump-tables
   to work around several compiler issues.

 * From reset_stack_and_jump_ind(), which is particularly unsafe.  Because of
   the additional stack adjustment made, the value picked up off the stack is
   regs->r15 of the next vCPU to run.

In order to mitigate this type confusion, we want to make all indirect targets
be CALL targets, and remove the use of indirect JMP except via tailcall
optimisation.

Luckily due to XSA-348, all C target functions of reset_stack_and_jump_ind()
are noreturn.  {svm,vmx}_do_resume() exits via reset_stack_and_jump(); a
direct JMP with entirely different prediction properties.  idle_loop() is an
infinite loop which eventually exits via reset_stack_and_jump_ind() from a new
schedule.  i.e. These paths are all fine having one extra return address on
the stack.

This leaves continue_pv_domain(), which is expecting to be a JMP target.
Alter it to strip the return address off the stack, which is safe because
there isn't actually a RET expecting to return to its caller.

This allows us to change reset_stack_and_jump_ind() to reset_stack_and_call_ind()
in order to mitigate the speculative type confusion.

This is part of XSA-456 / CVE-2024-2201.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8e186f98ce0e35d1754ec9299da41ec98873b65c)

12 months agox86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields
Andrew Cooper [Tue, 26 Mar 2024 22:43:18 +0000 (22:43 +0000)]
x86/spec-ctrl: Widen the {xen,last,default}_spec_ctrl fields

Right now, they're all bytes, but MSR_SPEC_CTRL has been steadily gaining new
features.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45dac88e78e8a2d9d8738eef884fe6730faf9e67)

12 months agox86/vmx: Add support for virtualize SPEC_CTRL
Roger Pau Monne [Thu, 15 Feb 2024 16:46:53 +0000 (17:46 +0100)]
x86/vmx: Add support for virtualize SPEC_CTRL

The feature is defined in the tertiary exec control, and is available starting
from Sapphire Rapids and Alder Lake CPUs.

When enabled, two extra VMCS fields are used: SPEC_CTRL mask and shadow.  Bits
set in mask are not allowed to be toggled by the guest (either set or clear)
and the value in the shadow field is the value the guest expects to be in the
SPEC_CTRL register.

By using it the hypervisor can force the value of SPEC_CTRL bits behind the
guest back without having to trap all accesses to SPEC_CTRL, note that no bits
are forced into the guest as part of this patch.  It also allows getting rid of
SPEC_CTRL in the guest MSR load list, since the value in the shadow field will
be loaded by the hardware on vmentry.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 97c5b8b657e41a6645de9d40713b881234417b49)

12 months agox86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*
Andrew Cooper [Mon, 25 Mar 2024 11:09:35 +0000 (11:09 +0000)]
x86/spec-ctrl: Detail the safety properties in SPEC_CTRL_ENTRY_*

The complexity is getting out of hand.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 40dea83b75386cb693481cf340024ce093be5c0f)

12 months agox86/spec-ctrl: Simplify DO_COND_IBPB
Andrew Cooper [Fri, 22 Mar 2024 14:33:17 +0000 (14:33 +0000)]
x86/spec-ctrl: Simplify DO_COND_IBPB

With the prior refactoring, SPEC_CTRL_ENTRY_{PV,INTR} both load SCF into %ebx,
and handle the conditional safety including skipping if interrupting Xen.

Therefore, we can drop the maybexen parameter and the conditional safety.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2378d16a931de0e62c03669169989e9437306abe)

12 months agox86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}
Andrew Cooper [Fri, 22 Mar 2024 12:08:02 +0000 (12:08 +0000)]
x86/spec_ctrl: Hold SCF in %ebx across SPEC_CTRL_ENTRY_{PV,INTR}

... as we do in the exit paths too.  This will allow simplification to the
sub-blocks.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9607aeb6602b8ed9962404de3f5f90170ffddb66)

12 months agox86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV
Andrew Cooper [Fri, 22 Mar 2024 15:52:06 +0000 (15:52 +0000)]
x86/entry: Arrange for %r14 to be STACK_END across SPEC_CTRL_ENTRY_FROM_PV

Other SPEC_CTRL_* paths already use %r14 like this, and it will allow for
simplifications.

All instances of SPEC_CTRL_ENTRY_FROM_PV are followed by a GET_STACK_END()
invocation, so this change is only really logic and register shuffling.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22390697bf1b4cd3024f2d10893dec3c3ec08a9c)

12 months agox86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*
Andrew Cooper [Fri, 22 Mar 2024 11:41:41 +0000 (11:41 +0000)]
x86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*

Right now, we have a mix of safety strategies in different blocks, making the
logic fragile and hard to follow.

Start addressing this by having a safety LFENCE at the end of the blocks,
which can be patched out if other safety criteria are met.  This will allow us
to simplify the sub-blocks.  For SPEC_CTRL_ENTRY_FROM_IST, simply leave an
LFENCE unconditionally at the end; the IST path is not a fast-path by any
stretch of the imagination.

For SPEC_CTRL_ENTRY_FROM_INTR, the existing description was incorrect.  The
IRET #GP path is non-fatal but can occur with the guest's choice of
MSR_SPEC_CTRL.  It is safe to skip the flush/barrier-like protections when
interrupting Xen, but we must run DO_SPEC_CTRL_ENTRY irrespective.

This will skip RSB stuffing which was previously unconditional even when
interrupting Xen.

AFAICT, this is a missing cleanup from commit 3fffaf9c13e9 ("x86/entry: Avoid
using alternatives in NMI/#MC paths") where we split the IST entry path out of
the main INTR entry path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 94896de1a98c4289fe6fef9e16ef99fc6ef2efc4)

12 months agox86/spec-ctrl: Rename spec_ctrl_flags to scf
Andrew Cooper [Thu, 28 Mar 2024 11:57:25 +0000 (11:57 +0000)]
x86/spec-ctrl: Rename spec_ctrl_flags to scf

XSA-455 was ultimately caused by having fields with too-similar names.

Both {xen,last}_spec_ctrl are fields containing an architectural MSR_SPEC_CTRL
value.  The spec_ctrl_flags field contains Xen-internal flags.

To more-obviously distinguish the two, rename spec_ctrl_flags to scf, which is
also the prefix of the constants used by the fields.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c62673c4334b3372ebd4292a7ac8185357e7ea27)

12 months agox86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests
Andrew Cooper [Tue, 9 Apr 2024 14:03:05 +0000 (15:03 +0100)]
x86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests

All of these are prediction-mode (i.e. CPL) based.  They don't operate as
advertised in PV context.

Fixes: 4dd676070684 ("x86/spec-ctrl: Expose IPRED_CTRL to guests")
Fixes: 478e4787fa64 ("x86/spec-ctrl: Expose RRSBA_CTRL to guests")
Fixes: 583f1d095052 ("x86/spec-ctrl: Expose BHI_CTRL to guests")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 4b3da946ad7e3452761478ae683da842e7ff20d6)

12 months agox86/alternatives: fix .init section reference in _apply_alternatives()
Roger Pau Monné [Tue, 9 Apr 2024 12:50:46 +0000 (14:50 +0200)]
x86/alternatives: fix .init section reference in _apply_alternatives()

The code in _apply_alternatives() will unconditionally attempt to read
__initdata_cf_clobber_{start,end} when called as part of applying alternatives
to a livepatch payload when Xen is using IBT.

That leads to a page-fault as __initdata_cf_clobber_{start,end} living in
.init section will have been unmapped by the time a livepatch gets loaded.

Fix by adding a check that limits the clobbering of endbr64 instructions to
boot time only.

Fixes: 37ed5da851b8 ('x86/altcall: Optimise away endbr64 instruction where possible')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4be1fef1e6572c2be0bd378902ffb62a6e73faeb)
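
A sketch of the added guard (condition approximated):

    /* .init is only mapped during boot; don't touch
     * __initdata_cf_clobber_{start,end} when livepatching. */
    if ( system_state < SYS_STATE_active )
    {
        /* ... clobber endbr64 instructions ... */
    }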

12 months agox86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch
Andrew Cooper [Wed, 3 Apr 2024 16:43:42 +0000 (17:43 +0100)]
x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch

It turns out there is something wonky on some but not all CPUs with
MSR_TSX_FORCE_ABORT.  The presence of RTM_ALWAYS_ABORT causes Xen to think
it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions
genuinely #UD.

Spot this case and try to back out as cleanly as we can.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b33f191e3ca99458fdcea1cb5a29dfa4965d1604)

12 months agox86/spec-ctrl: Move __read_mostly data into __ro_after_init
Andrew Cooper [Thu, 28 Mar 2024 12:38:32 +0000 (12:38 +0000)]
x86/spec-ctrl: Move __read_mostly data into __ro_after_init

These variables predate the introduction of __ro_after_init, but all qualify.
Update them to be consistent with the rest of the file.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7a09966e7b2823b70f6d56d0cf66c11124f4a3c1)

12 months agoVMX: tertiary execution control infrastructure
Jan Beulich [Wed, 7 Feb 2024 12:46:11 +0000 (13:46 +0100)]
VMX: tertiary execution control infrastructure

This is a prereq to enabling e.g. the MSRLIST feature.

Note that the PROCBASED_CTLS3 MSR is different from other VMX feature
reporting MSRs, in that all 64 bits report allowed 1-settings.

vVMX code is left alone, though, for the time being.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 878159bf259bfbd7a40312829f1ea0ce1f6645e2)

12 months agox86/CPU: convert vendor hook invocations to altcall
Jan Beulich [Mon, 5 Feb 2024 09:48:11 +0000 (10:48 +0100)]
x86/CPU: convert vendor hook invocations to altcall

While not performance critical, these hook invocations still want
converting: This way all pre-filled struct cpu_dev instances can become
__initconst_cf_clobber, thus allowing to eliminate further 8 ENDBR
during the 2nd phase of alternatives patching (besides moving previously
resident data to .init.*).

Since all use sites need touching anyway, take the opportunity and also
address a Misra C:2012 Rule 5.5 violation: Rename the this_cpu static
variable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 660f8a75013c947fbe5358a640032a1f9f1eece5)

12 months agox86/guest: finish conversion to altcall
Jan Beulich [Mon, 5 Feb 2024 09:45:31 +0000 (10:45 +0100)]
x86/guest: finish conversion to altcall

While .setup() and .e820_fixup() don't need fiddling with for being run
only very early, both .ap_setup() and .resume() want converting too:
This way both pre-filled struct hypervisor_ops instances can become
__initconst_cf_clobber, thus allowing to eliminate up to 5 more ENDBR
(configuration dependent) during the 2nd phase of alternatives patching.

While fiddling with section annotations here, also move "ops" itself to
.data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul@xen.org>
(cherry picked from commit e931edccc53c9dd6e9a505ad0ff3a03d985669bc)

12 months agox86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()
Jan Beulich [Mon, 5 Feb 2024 09:44:46 +0000 (10:44 +0100)]
x86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()

While altcall is already used for them, the functions want announcing in
.init.rodata.cf_clobber, even if the resulting static variables aren't
otherwise used.

While doing this also move ctxt_switch_masking to .data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 044168fa3a65b6542bda5c21e373742de1bd5980)

12 months agox86/spec-ctrl: Expose BHI_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests

The CPUID feature bit signals the presence of the BHI_DIS_S control in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 583f1d0950529f3517b1741c2b21a028a82ba831)

12 months agox86/spec-ctrl: Expose RRSBA_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests

The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 478e4787fa64b621061177a7843c452e9a19916d)

12 months agox86/spec-ctrl: Expose IPRED_CTRL to guests
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests

The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.

Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4dd6760706848de30f7c8b5f83462b9bcb070c91)

12 months agoIRQ: generalize [gs]et_irq_regs()
Jan Beulich [Tue, 23 Jan 2024 11:03:23 +0000 (12:03 +0100)]
IRQ: generalize [gs]et_irq_regs()

Move functions (and their data) to common code, and invoke the functions
on Arm as well. This is in preparation of dropping the register
parameters from handler functions.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit f67bddf3bccd99a5fee968c3b3f288db6a57d3be)
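
The helpers in question follow the familiar Linux shape (simplified sketch):

    static DEFINE_PER_CPU(struct cpu_user_regs *, irq_regs);

    struct cpu_user_regs *get_irq_regs(void)
    {
        return this_cpu(irq_regs);
    }

    struct cpu_user_regs *set_irq_regs(struct cpu_user_regs *new_regs)
    {
        struct cpu_user_regs *old_regs = this_cpu(irq_regs);

        this_cpu(irq_regs) = new_regs;
        return old_regs;
    }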

12 months agox86/MCE: switch some callback invocations to altcall
Jan Beulich [Mon, 22 Jan 2024 12:41:07 +0000 (13:41 +0100)]
x86/MCE: switch some callback invocations to altcall

While not performance critical, these hook invocations still would
better be converted: This way all pre-filled (and newly introduced)
struct mce_callback instances can become __initconst_cf_clobber, thus
allowing to eliminate another 9 ENDBR during the 2nd phase of
alternatives patching.

While this means registering callbacks a little earlier, doing so is
perhaps even advantageous, for having pointers be non-NULL earlier on.
Only one set of callbacks would ever be registered anyway, and neither
of the respective initialization functions can (subsequently) fail.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 85ba4d050f9f3c4286164f21660ae88435b7e83c)

12 months agox86/MCE: separate BSP-only initialization
Jan Beulich [Mon, 22 Jan 2024 12:40:32 +0000 (13:40 +0100)]
x86/MCE: separate BSP-only initialization

Several function pointers are registered over and over again, when
setting them once on the BSP suffices. Arrange for this in the vendor
init functions and mark involved registration functions __init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 9f58616ddb1cc1870399de2202fafc7bf0d61694)

12 months agox86/PV: avoid indirect call for I/O emulation quirk hook
Jan Beulich [Mon, 22 Jan 2024 12:40:00 +0000 (13:40 +0100)]
x86/PV: avoid indirect call for I/O emulation quirk hook

This way ioemul_handle_proliant_quirk() won't need ENDBR anymore.

While touching this code, also
- arrange for it to not be built at all when !PV,
- add "const" to the last function parameter and bring the definition
  in sync with the declaration (for Misra).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1212af3e8c4d3a1350046d4fe0ca3b97b51e67de)

12 months agox86/MTRR: avoid several indirect calls
Jan Beulich [Mon, 22 Jan 2024 12:39:23 +0000 (13:39 +0100)]
x86/MTRR: avoid several indirect calls

The use of (supposedly) vendor-specific hooks is a relic from the days
when Xen was still possible to build as 32-bit binary. There's no
expectation that a new need for such an abstraction would arise. Convert
mtrr_if to a mere boolean and all prior calls through it to direct ones,
thus allowing to eliminate 6 ENDBR from .text.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9e0eb30d4d6565b411499ca826718b4b9acab68)

12 months agocore-parking: use alternative_call()
Jan Beulich [Mon, 22 Jan 2024 12:38:24 +0000 (13:38 +0100)]
core-parking: use alternative_call()

This way we can arrange for core_parking_{performance,power}()'s ENDBR
to also be zapped.

For the decision to be taken before the 2nd alternative patching pass,
the initcall needs to become a pre-SMP one, though.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1bc07ebcac3b1bb2a378732bc0f9a19940e76faf)

12 months agox86/HPET: avoid an indirect call
Jan Beulich [Wed, 17 Jan 2024 09:43:02 +0000 (10:43 +0100)]
x86/HPET: avoid an indirect call

When this code was written, indirect branches still weren't considered
much of a problem (besides being a little slower). Instead of a function
pointer, pass a boolean to _disable_pit_irq(), thus allowing to
eliminate two ENDBR (one of them in .text).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 730d2637a8e5b98dc8e4e366179b4cedc496b3ad)

12 months agocpufreq: finish conversion to altcall
Jan Beulich [Wed, 17 Jan 2024 09:42:27 +0000 (10:42 +0100)]
cpufreq: finish conversion to altcall

Even functions used on infrequently executed paths want converting: This
way all pre-filled struct cpufreq_driver instances can become
__initconst_cf_clobber, thus allowing to eliminate another 15 ENDBR
during the 2nd phase of alternatives patching.

For acpi-cpufreq's optionally populated .get hook make sure alternatives
patching can actually see the pointer. See also the code comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 467ae515caee491e9b6ae1da8b9b98d094955822)

12 months agox86/APIC: finish genapic conversion to altcall
Jan Beulich [Wed, 17 Jan 2024 09:41:52 +0000 (10:41 +0100)]
x86/APIC: finish genapic conversion to altcall

While .probe() doesn't need fiddling with for being run only very early,
init_apic_ldr() wants converting too despite not being on a frequently
executed path: This way all pre-filled struct genapic instances can
become __initconst_cf_clobber, thus allowing to eliminate 15 more ENDBR
during the 2nd phase of alternatives patching.

While fiddling with section annotations here, also move "genapic" itself
to .data.ro_after_init.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b1cc53753cba4c3253f2e1093a3a6a9a828314bf)

12 months agox86/spec-ctrl: Fix BTC/SRSO mitigations
Andrew Cooper [Tue, 26 Mar 2024 22:47:25 +0000 (22:47 +0000)]
x86/spec-ctrl: Fix BTC/SRSO mitigations

We were looking for SCF_entry_ibpb in the wrong variable in the top-of-stack
block, and xen_spec_ctrl won't have had bit 5 set because Xen doesn't
understand SPEC_CTRL_RRSBA_DIS_U yet.

This is XSA-455 / CVE-2024-31142.

Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
12 months agohypercall_xlat_continuation: Replace BUG_ON with domain_crash
Bjoern Doebel [Wed, 27 Mar 2024 18:30:55 +0000 (18:30 +0000)]
hypercall_xlat_continuation: Replace BUG_ON with domain_crash

Instead of crashing the host in case of unexpected hypercall parameters,
resort to only crashing the calling domain.

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Bjoern Doebel <doebel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 9926e692c4afc40bcd66f8416ff6a1e93ce402f6)
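
The pattern change, sketched (condition illustrative):

    if ( unlikely(args_out_of_range) )
    {
        /* Was: BUG_ON(args_out_of_range), fatal to the whole host. */
        domain_crash(current->domain);
        return -EINVAL;
    }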

12 months agox86/HVM: clear upper halves of GPRs upon entry from 32-bit code
Jan Beulich [Wed, 27 Mar 2024 17:31:38 +0000 (17:31 +0000)]
x86/HVM: clear upper halves of GPRs upon entry from 32-bit code

Hypercalls in particular can be the subject of continuations, and logic
there checks updated state against incoming register values. If the
guest manufactured a suitable argument register with a non-zero upper
half before entering compatibility mode and issuing a hypercall from
there, checks in hypercall_xlat_continuation() might trip.

Since for HVM we want to also be sure to not hit a corner case in the
emulator, initiate the clipping right from the top of
{svm,vmx}_vmexit_handler(). Also rename the invoked function, as it no
longer does only invalidation of fields.

Note that architecturally the upper halves of registers are undefined
after a switch between compatibility and 64-bit mode (either direction).
Hence once having entered compatibility mode, the guest can't assume
the upper half of any register to retain its value.

This is part of XSA-454 / CVE-2023-46842.

Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op")
Reported-by: Manuel Andreas <manuel.andreas@tum.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 6a98383b0877bb66ebfe189da43bf81abe3d7909)
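
A sketch of the kind of clipping involved (hypothetical function name;
the real patch covers all GPRs at the top of {svm,vmx}_vmexit_handler()):

  /* Zero-extend guest GPRs so upper halves left over from 32-bit mode
   * can't trip later hypercall continuation checks. */
  static void sanitise_guest_regs(struct cpu_user_regs *regs)
  {
      regs->rax = (uint32_t)regs->rax;
      regs->rbx = (uint32_t)regs->rbx;
      regs->rcx = (uint32_t)regs->rcx;
      regs->rdx = (uint32_t)regs->rdx;
      regs->rsi = (uint32_t)regs->rsi;
      regs->rdi = (uint32_t)regs->rdi;
      regs->rbp = (uint32_t)regs->rbp;
      /* likewise for rsp and r8-r15 */
  }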

12 months ago xen/livepatch: Fix .altinstructions safety checks
Andrew Cooper [Thu, 13 Apr 2023 19:56:15 +0000 (20:56 +0100)]
xen/livepatch: Fix .altinstructions safety checks

The prior check has && vs || mixups, making it tautologically false and thus
providing no safety at all.  There are boundary errors too.

First start with a comment describing how the .altinstructions and
.altinstr_replacement sections interact, and perform suitable cross-checking.

Second, rewrite the alt_instr loop entirely from scratch.  Origin sites have
non-zero size, and must be fully contained within the livepatch's .text
section(s).  Any non-zero sized replacements must be fully contained within
the .altinstr_replacement section.

Fixes: f8a10174e8b1 ("xsplice: Add support for alternatives")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
(cherry picked from commit e74360e4ba4a6b6827a44f8b1b22a0ec4311694a)
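
A sketch of the cross-checking described, with a hypothetical
region_contains() helper and illustrative field names:

  for ( a = start; a < end; a++ )
  {
      /* Origin sites have non-zero size and must sit in payload .text. */
      if ( !a->orig_size ||
           !region_contains(text, a->orig, a->orig_size) )
          return -EINVAL;

      /* Non-empty replacements must sit in .altinstr_replacement. */
      if ( a->repl_size &&
           !region_contains(altinstr, a->repl, a->repl_size) )
          return -EINVAL;
  }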

12 months ago arm/alternatives: Rename alt_instr fields which are used in common code
Andrew Cooper [Sun, 16 Apr 2023 00:10:43 +0000 (01:10 +0100)]
arm/alternatives: Rename alt_instr fields which are used in common code

Alternatives auditing for livepatches is currently broken.  To fix it, the
livepatch code needs to inspect more fields of alt_instr.

Rename ARM's fields to match x86's, because:

 * ARM already exposes alt_offset under the repl name via ALT_REPL_PTR().
 * "alt" is ambiguous in a structure entirely about alternatives already.
 * "repl", being the same width as orig leads to slightly neater code.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 418cf59c4e29451010d7efb3835b900690d19866)

13 months ago tests/resource: Fix HVM guest in !SHADOW builds
Andrew Cooper [Tue, 2 Apr 2024 14:24:07 +0000 (16:24 +0200)]
tests/resource: Fix HVM guest in !SHADOW builds

Right now, test-resource always creates HVM Shadow guests.  But if Xen has
SHADOW compiled out, running the test yields:

  $ ./test-resource
  XENMEM_acquire_resource tests
  Test x86 PV
    Created d1
    Test grant table
  Test x86 PVH
    Skip: 95 - Operation not supported

and doesn't really test HVM guests, but doesn't fail either.

There's nothing paging-mode-specific about this test, so default to HAP if
possible and provide a more specific message if neither HAP nor Shadow is
available.

As we've got physinfo to hand, also provide a more specific message about the
absence of PV or HVM support.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0263dc9069ddb66335c72a159e09050b1600e56a
master date: 2024-03-01 20:14:19 +0000
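
A sketch of the selection logic using the public capability flags (the
test's actual code differs):

  xc_physinfo_t physinfo = {};

  if ( xc_physinfo(xch, &physinfo) )
      err(1, "Failed to get physinfo");

  if ( physinfo.capabilities & XEN_SYSCTL_PHYSCAP_hap )
      config.flags |= XEN_DOMCTL_CDF_hap;          /* prefer HAP */
  else if ( !(physinfo.capabilities & XEN_SYSCTL_PHYSCAP_shadow) )
      printf("  Skip: neither HAP nor Shadow available\n");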

13 months ago x86/boot: Support the watchdog on newer AMD systems
Andrew Cooper [Tue, 2 Apr 2024 14:20:30 +0000 (16:20 +0200)]
x86/boot: Support the watchdog on newer AMD systems

The MSRs used by setup_k7_watchdog() are architectural in 64bit.  The Unit
Select (0x76, cycles not in halt state) isn't, but it hasn't changed in 25
years, making this a trend likely to continue.

Drop the family check.  If the Unit Select does happen to change meaning in
the future, check_nmi_watchdog() will still notice the watchdog not operating
as expected.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 131892e0dcc1265b621c2b7d844cb9e7c3a4404f
master date: 2024-03-19 18:29:37 +0000
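
For context, a K7-style watchdog setup looks roughly like this (constants
per the AMD manuals; counts_per_tick is illustrative and the exact Xen
routine differs):

  #define MSR_K7_EVNTSEL0              0xc0010000
  #define MSR_K7_PERFCTR0              0xc0010004
  #define K7_EVNTSEL_ENABLE            (1u << 22)
  #define K7_EVNTSEL_INT               (1u << 20)
  #define K7_EVNTSEL_OS                (1u << 17)
  #define K7_EVNTSEL_USR               (1u << 16)
  #define K7_EVENT_CYCLES_NOT_IN_HALT  0x76      /* the Unit Select */

  /* Counter overflows, raising an NMI, after counts_per_tick events. */
  wrmsrl(MSR_K7_PERFCTR0, -(uint64_t)counts_per_tick);
  wrmsrl(MSR_K7_EVNTSEL0, K7_EVNTSEL_ENABLE | K7_EVNTSEL_INT |
                          K7_EVNTSEL_OS | K7_EVNTSEL_USR |
                          K7_EVENT_CYCLES_NOT_IN_HALT);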

13 months ago x86/boot: Improve the boot watchdog determination of stuck cpus
Andrew Cooper [Tue, 2 Apr 2024 14:20:09 +0000 (16:20 +0200)]
x86/boot: Improve the boot watchdog determination of stuck cpus

Right now, check_nmi_watchdog() has two processing loops over all online CPUs
using prev_nmi_count as storage.

Use a cpumask_t instead (1/32nd as much initdata) and have wait_for_nmis()
make the determination of whether it is stuck, rather than having both
functions needing to agree on how many ticks mean stuck.

More importantly though, it means we can use the standard cpumask
infrastructure, including turning this:

  (XEN) Brought up 512 CPUs
  (XEN) Testing NMI watchdog on all CPUs: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511} stuck

into the rather more manageable:

  (XEN) Brought up 512 CPUs
  (XEN) Testing NMI watchdog on all CPUs: {0-511} stuck

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 9e18f339830c828798aef465556d4029d83476a0
master date: 2024-03-19 18:29:37 +0000
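
A sketch of the cpumask-based bookkeeping (the stuck/not-stuck predicate
is hypothetical; in the real patch wait_for_nmis() makes that call):

  static cpumask_t __initdata stuck_cpus;   /* vs. one int per CPU */
  unsigned int cpu;

  for_each_online_cpu ( cpu )
      if ( !cpu_saw_an_nmi(cpu) )           /* hypothetical predicate */
          cpumask_set_cpu(cpu, &stuck_cpus);

  if ( !cpumask_empty(&stuck_cpus) )
      printk("Testing NMI watchdog on all CPUs: {%*pbl} stuck\n",
             CPUMASK_PR(&stuck_cpus));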

13 months ago x86/livepatch: Relax permissions on rodata too
Andrew Cooper [Tue, 2 Apr 2024 14:19:36 +0000 (16:19 +0200)]
x86/livepatch: Relax permissions on rodata too

This reinstates the capability to patch .rodata in load/unload hooks, which
was lost when we stopped using CR0.WP=0 to patch.

This turns out to be rather less of a large TODO than I thought at the time.

Fixes: 8676092a0f16 ("x86/livepatch: Fix livepatch application when CET is active")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: b083b1c393dc8961acf0959b1d2e0ad459985ae3
master date: 2024-03-07 14:24:42 +0000

13 months ago xen/virtual-region: Include rodata pointers
Andrew Cooper [Tue, 2 Apr 2024 14:19:11 +0000 (16:19 +0200)]
xen/virtual-region: Include rodata pointers

These are optional.  .init doesn't distinguish types of data like this, and
livepatches don't necessarily have any .rodata either.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ef969144a425e39f5b214a875b5713d0ea8575fb
master date: 2024-03-07 14:24:42 +0000

13 months ago xen/virtual-region: Rename the start/end fields
Andrew Cooper [Tue, 2 Apr 2024 14:18:51 +0000 (16:18 +0200)]
xen/virtual-region: Rename the start/end fields

... to text_{start,end}.  We're about to introduce another start/end pair.

Despite its name, struct virtual_region has always been a module-ish
description.  Call this out specifically.

As minor cleanup, replace ROUNDUP(x, PAGE_SIZE) with the more concise
PAGE_ALIGN() ahead of duplicating the example.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 989556c6f8ca080f5f202417af97d1188b9ba52a
master date: 2024-03-07 14:24:42 +0000
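
For reference, the two spellings are equivalent; Xen-style definitions
look roughly like this (the actual headers may differ in detail):

  #define ROUNDUP(x, a)  (((x) + (a) - 1) & ~((a) - 1))
  #define PAGE_ALIGN(x)  ROUNDUP(x, PAGE_SIZE)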

13 months ago x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies
Andrew Cooper [Tue, 2 Apr 2024 14:18:05 +0000 (16:18 +0200)]
x86/cpu-policy: Fix visibility of HTT/CMP_LEGACY in max policies

The block in recalculate_cpuid_policy() predates the proper split between
default and max policies, and was a "slightly max for a toolstack which knows
about it" capability.  It didn't get transformed properly in Xen 4.14.

Because Xen will accept a VM with HTT/CMP_LEGACY seen, they should be visible
in the max policies.  Keep the default policy matching host settings.

This manifested as an incorrectly-rejected migration across XenServer's Xen
4.13 -> 4.17 upgrade, as Xapi is slowly growing the logic to check a VM
against the target max policy.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2d8a652251660c3252d92b442e1a9c5d6e6a1e9
master date: 2024-03-01 20:14:19 +0000

13 months ago x86/cpu-policy: Hide x2APIC from PV guests
Andrew Cooper [Tue, 2 Apr 2024 14:17:25 +0000 (16:17 +0200)]
x86/cpu-policy: Hide x2APIC from PV guests

PV guests can't write to MSR_APIC_BASE (in order to set EXTD), nor can they
access any of the x2APIC MSR range.  Therefore they mustn't see the x2APIC
CPUID bit saying that they can.

Right now, the host x2APIC flag filters into PV guests, meaning that PV guests
generally see x2APIC except on Zen1-and-older AMD systems.

Linux works around this by explicitly hiding the bit itself, and filtering
EXTD out of MSR_APIC_BASE reads.  NetBSD behaves more in the spirit of PV
guests, and entirely ignores the APIC when built as a PV guest.

Change the annotation from !A to !S.  This has the consequence of stripping it
out of both PV featuremasks.  However, as existing guests may have seen the
bit, set it back into the PV Max policy; a VM which saw the bit and is alive
enough to migrate will have ignored it one way or another.

Hiding x2APIC does change the contents of leaf 0xb, but as the information is
nonsense to begin with, this is likely an improvement on the status quo.

Xen's blind assumption that APIC_ID = vCPU_ID * 2 isn't interlinked with the
host's topology structure, where a PV guest may see real host values, and the
APIC_IDs are useless without an MADT to start with.  Dom0 is the only PV VM to
get an MADT, but it's the host one, meaning the two sets of APIC_IDs are from
different address spaces.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5420aa165dfa5fe95dd84bb71cb96c15459935b1
master date: 2024-03-01 20:14:19 +0000

13 months ago tools/oxenstored: Make Quota.t pure
Edwin Török [Wed, 31 Jan 2024 10:52:56 +0000 (10:52 +0000)]
tools/oxenstored: Make Quota.t pure

Now that we no longer have a hashtable inside, we can make Quota.t pure, and
push the mutable update to its callers.  Store.t already had a mutable Quota.t
field.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit 098d868e52ac0165b7f36e22b767ea70cef70054)

13 months ago tools/oxenstored: Use Map instead of Hashtbl for quotas
Edwin Török [Wed, 31 Jan 2024 10:52:55 +0000 (10:52 +0000)]
tools/oxenstored: Use Map instead of Hashtbl for quotas

On a stress test running 1000 VMs, flamegraphs have shown that
`oxenstored` spends a large amount of time in `Hashtbl.copy` and the GC.

Hashtable complexity:
 * read/write: O(1) average
 * copy: O(domains) -- copying the entire table

Map complexity:
 * read/write: O(log n) worst case
 * copy: O(1) -- a word copy

We always perform at least one 'copy' when processing each xenstore
packet (regardless of whether it is a readonly operation or inside a
transaction), so the actual complexity per packet is:
  * Hashtbl: O(domains)
  * Map: O(log domains)
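
(Concretely: in the 1000-VM stress test above, each packet pays for
copying on the order of 1000 table entries with Hashtbl, versus roughly
log2(1000) ≈ 10 node visits with Map.)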

Maps are the clear winner, and a better fit for the immutable xenstore
tree.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit b6cf604207fd0a04451a48f2ce6d05fb66c612ab)

13 months ago x86/PoD: tie together P2M update and increment of entry count
Jan Beulich [Wed, 27 Mar 2024 11:29:33 +0000 (12:29 +0100)]
x86/PoD: tie together P2M update and increment of entry count

When not holding the PoD lock across the entire region covering the P2M
update and the stats update, the entry count - if it is to be incorrect
at all - should indicate too large a value in preference to too small a
one, to avoid functions bailing early when they find the count is zero.
However, instead of moving the increment ahead (and adjusting back upon
failure), extend the PoD-locked region.

Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
master commit: cc950c49ae6a6690f7fc3041a1f43122c250d250
master date: 2024-03-21 09:48:10 +0100
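
A sketch of the resulting shape (simplified; the real code keeps more
state under the lock):

  pod_lock(p2m);

  rc = p2m_set_entry(p2m, gfn, INVALID_MFN, order,
                     p2m_populate_on_demand, p2m->default_access);
  if ( rc == 0 )
      p2m->pod.entry_count += 1UL << order;   /* stats move with update */

  pod_unlock(p2m);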

13 months ago x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly
Andrew Cooper [Wed, 27 Mar 2024 11:29:11 +0000 (12:29 +0100)]
x86/boot: Fix setup_apic_nmi_watchdog() to fail more cleanly

Right now, if the user requests the watchdog on the command line,
setup_apic_nmi_watchdog() will blindly assume that setting up the watchdog
worked.  Reuse nmi_perfctr_msr to identify when the watchdog has been
configured.

Rearrange setup_p6_watchdog() to not set nmi_perfctr_msr until the sanity
checks are complete.  Turn setup_p4_watchdog() into a void function, matching
the others.

If the watchdog isn't set up, inform the user and override to NMI_NONE, which
will prevent check_nmi_watchdog() from claiming that all CPUs are stuck.

e.g.:

  (XEN) alt table ffff82d040697c38 -> ffff82d0406a97f0
  (XEN) Failed to configure NMI watchdog
  (XEN) Brought up 512 CPUs
  (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f658321374687c7339235e1ac643e0427acff717
master date: 2024-03-19 18:29:37 +0000

13 months ago x86/mm: use block_lock_speculation() in _mm_write_lock()
Jan Beulich [Wed, 27 Mar 2024 11:28:24 +0000 (12:28 +0100)]
x86/mm: use block_lock_speculation() in _mm_write_lock()

I can only guess that using block_speculation() there was a leftover
from earlier on, when SPECULATIVE_HARDEN_LOCK depended on
SPECULATIVE_HARDEN_BRANCH.

Fixes: 197ecd838a2a ("locking: attempt to ensure lock wrappers are always inline")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 62018f08708a5ff6ef8fc8ff2aaaac46e5a60430
master date: 2024-03-18 13:53:37 +0100

13 months ago tools: ipxe: update for fixing build with GCC12
Olaf Hering [Wed, 27 Mar 2024 11:27:03 +0000 (12:27 +0100)]
tools: ipxe: update for fixing build with GCC12

Use a snapshot which includes commit
b0ded89e917b48b73097d3b8b88dfa3afb264ed0 ("[build] Disable dangling
pointer checking for GCC"), which fixes build with gcc12.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 18a36b4a9b088875486cfe33a2d4a8ae7eb4ab47
master date: 2023-04-25 23:47:45 +0100

13 months ago x86: protect conditional lock taking from speculative execution
Roger Pau Monné [Mon, 4 Mar 2024 15:24:21 +0000 (16:24 +0100)]
x86: protect conditional lock taking from speculative execution

Conditionally taken locks that use the pattern:

if ( lock )
    spin_lock(...);

need an else branch in order to issue a speculation barrier in the else case,
just as is done on the branch where the lock is acquired.

evaluate_nospec() could be used on the condition itself, but that would
result in a double barrier on the branch where the lock is taken.

Introduce a new pair of helpers, {gfn,spin}_lock_if(), that can be used to
conditionally take a lock in a speculation safe way.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 03cf7ca23e0e876075954c558485b267b7d02406)
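
A sketch close to the shape of the new helpers (illustrative; see the
cherry-picked commit for the real definitions):

  static always_inline bool spin_lock_if(bool condition, spinlock_t *l)
  {
      if ( condition )
          spin_lock(l);
      else
          block_lock_speculation();  /* barrier on the untaken branch too */
      return condition;
  }

  /* Usage: */
  bool locked = spin_lock_if(need_lock, &d->lock);

  /* speculation-safe critical section */

  if ( locked )
      spin_unlock(&d->lock);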

13 months ago x86/mm: add speculation barriers to open coded locks
Roger Pau Monné [Mon, 4 Mar 2024 17:08:48 +0000 (18:08 +0100)]
x86/mm: add speculation barriers to open coded locks

Add a speculation barrier to the clearly identified open-coded lock taking
functions.

Note that the memory sharing page_lock() replacement (_page_lock()) is left
as-is, as the code is experimental and not security supported.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 42a572a38e22a97d86a4b648a22597628d5b42e4)

13 months ago locking: attempt to ensure lock wrappers are always inline
Roger Pau Monné [Mon, 4 Mar 2024 13:29:36 +0000 (14:29 +0100)]
locking: attempt to ensure lock wrappers are always inline

In order to prevent the locking speculation barriers from being inside of
`call`ed functions that could be speculatively bypassed.

While there, also add an extra locking barrier to _mm_write_lock() in the branch
taken when the lock is already held.

Note some functions are switched to use the unsafe variants (without speculation
barrier) of the locking primitives, but a speculation barrier is always added
to the exposed public lock wrapping helper.  That's the case with
sched_spin_lock_double() or pcidevs_lock() for example.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 197ecd838a2aaf959a469df3696d4559c4f8b762)

13 months ago percpu-rwlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 16:57:38 +0000 (17:57 +0100)]
percpu-rwlock: introduce support for blocking speculation into critical regions

Add direct calls to block_lock_speculation() where required in order to prevent
speculation into the lock protected critical regions.  Also convert
_percpu_read_lock() from inline to always_inline.

Note that _percpu_write_lock() has been modified to use the non speculation
safe variants of the locking primitives, as a speculation barrier is added
unconditionally by the calling wrapper.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f218daf6d3a3b847736d37c6a6b76031a0d08441)

13 months ago rwlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 15:08:52 +0000 (16:08 +0100)]
rwlock: introduce support for blocking speculation into critical regions

Introduce inline wrappers as required and add direct calls to
block_lock_speculation() in order to prevent speculation into the rwlock
protected critical regions.

Note the rwlock primitives are adjusted to use the non speculation safe variants
of the spinlock handlers, as a speculation barrier is added in the rwlock
calling wrappers.

trylock variants are protected by using lock_evaluate_nospec().

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a1fb15f61692b1fa9945fc51f55471ace49cdd59)

13 months ago x86/spinlock: introduce support for blocking speculation into critical regions
Roger Pau Monné [Tue, 13 Feb 2024 12:08:05 +0000 (13:08 +0100)]
x86/spinlock: introduce support for blocking speculation into critical regions

Introduce a new Kconfig option to block speculation into lock protected
critical regions.  The Kconfig option is enabled by default, but the mitigation
won't be engaged unless it's explicitly enabled in the command line using
`spec-ctrl=lock-harden`.

Convert the spinlock acquire macros into always-inline functions, and introduce
a speculation barrier after the lock has been taken.  Note the speculation
barrier is not placed inside the implementation of the spin lock functions, so
as to prevent speculation from falling through the call to the lock functions,
which would result in the barrier also being skipped.

trylock variants are protected using a construct akin to the existing
evaluate_nospec().

This patch only implements the speculation barrier for x86.

Note spin locks are the only locking primitive taken care of in this change;
further locking primitives will be adjusted by separate changes.

This is part of XSA-453 / CVE-2024-2193

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7ef0084418e188d05f338c3e028fbbe8b6924afa)
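
A sketch of the two constructs described (illustrative; the real patch
wraps the pre-existing _spin_lock()/_spin_trylock() implementations):

  /* Always-inline acquire wrapper: the barrier sits at the call site,
   * after the lock is held, so a call/return into the lock function
   * can't be used to skip it speculatively. */
  static always_inline void spin_lock(spinlock_t *l)
  {
      _spin_lock(l);
      block_lock_speculation();
  }

  /* trylock: only treat the lock as taken once speculation past the
   * result check has been blocked, akin to evaluate_nospec(). */
  #define spin_trylock(l) lock_evaluate_nospec(_spin_trylock(l))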

13 months ago xen: Swap order of actions in the FREE*() macros
Andrew Cooper [Fri, 2 Feb 2024 00:39:42 +0000 (00:39 +0000)]
xen: Swap order of actions in the FREE*() macros

Wherever possible, it is a good idea to NULL out the visible reference to an
object prior to freeing it.  The FREE*() macros already collect together both
parts, making it easy to adjust.

This has a marginal code generation improvement, as some of the calls to the
free() function can be tailcall optimised.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c4f427ec879e7c0df6d44d02561e8bee838a293e)
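
A sketch of the adjusted ordering, using XFREE() as the example (close
to, though not necessarily identical to, the real macro):

  /* NULL the visible reference first, then free: with nothing left to
   * do after xfree(), the call can be tail-call optimised. */
  #define XFREE(p) do {    \
      void *p_ = (p);      \
      (p) = NULL;          \
      xfree(p_);           \
  } while ( false )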

13 months ago x86/paging: Delete update_cr3()'s do_locking parameter
Andrew Cooper [Wed, 20 Sep 2023 19:06:53 +0000 (20:06 +0100)]
x86/paging: Delete update_cr3()'s do_locking parameter

Nicola reports that the XSA-438 fix introduced new MISRA violations because of
some incidental tidying it tried to do.  The parameter is useless, so resolve
the MISRA regression by removing it.

hap_update_cr3() discards the parameter entirely, while sh_update_cr3() uses
it to distinguish internal and external callers and therefore whether the
paging lock should be taken.

However, we have paging_lock_recursive() for this purpose, which also avoids
the ability for the shadow internal callers to accidentally not hold the lock.

Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Reported-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit e71157d1ac2a7fbf413130663cf0a93ff9fbcf7e)

13 months ago x86/spec-ctrl: Mitigation Register File Data Sampling
Andrew Cooper [Thu, 22 Jun 2023 22:32:19 +0000 (23:32 +0100)]
x86/spec-ctrl: Mitigation Register File Data Sampling

RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures.  This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.

Two new bits have been defined; RFDS_CLEAR to indicate VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected.  Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it.  Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.

Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)
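
A heavily simplified sketch of the synthesis described (the
cpu_is_unaffected_by_rfds() test is hypothetical shorthand for the real
model/Hybrid checks):

  /* If hardware asserts neither bit and the part is known-unaffected,
   * synthesise RFDS_NO so guests can rely on the enumeration. */
  if ( !cpu_has_rfds_no && !cpu_has_rfds_clear &&
       cpu_is_unaffected_by_rfds() )
      setup_force_cpu_cap(X86_FEATURE_RFDS_NO);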

13 months ago x86/spec-ctrl: VERW-handling adjustments
Andrew Cooper [Tue, 5 Mar 2024 19:33:37 +0000 (19:33 +0000)]
x86/spec-ctrl: VERW-handling adjustments

... before we add yet more complexity to this logic.  Mostly expanded
comments, but with three minor changes.

1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
   future ones.

2) We only ever need SC_VERW_IDLE when SMT is active.  If SMT isn't active,
   then there's no re-partition of pipeline resources based on thread-idleness
   to worry about.

3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
   it turns out, wrong.  SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
   is the relevant decision of whether to use L1D_FLUSH based on
   susceptibility and user preference.

   Rewrite the logic so it can be followed, and incorporate the fact that when
   FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)

13 months ago x86/spec-ctrl: Rename VERW related options
Andrew Cooper [Mon, 12 Feb 2024 17:50:43 +0000 (17:50 +0000)]
x86/spec-ctrl: Rename VERW related options

VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.

Rename the command line option from `md-clear=` to `verw=`.  This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes.  Retain
`md-clear=` as a deprecated alias.

Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.

No functional change.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)

13 months ago x86/spec-ctrl: Perform VERW flushing later in exit paths
Andrew Cooper [Sat, 27 Jan 2024 18:20:56 +0000 (18:20 +0000)]
x86/spec-ctrl: Perform VERW flushing later in exit paths

On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files.  To remove all
of Xen's values, the VERW must be after popping the GPRs.

Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts.  Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.

For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)

13 months ago x86/vmx: Perform VERW flushing later in the VMExit path
Andrew Cooper [Fri, 23 Jun 2023 10:32:00 +0000 (11:32 +0100)]
x86/vmx: Perform VERW flushing later in the VMExit path

Broken out of the following patch because this change is subtle enough on its
own.  See it for the rationale of why we're moving VERW.

As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.

Move the MOV CR earlier.  Intel specify flags to be undefined across it.

Encode the two conditions we want using SF and PF.  See the code comment for
exactly how.

Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place.  Sod's law says if we
delete it, we'll need to reintroduce it.

This is part of XSA-452 / CVE-2023-28746.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)