Juergen Gross [Wed, 6 Dec 2023 09:49:29 +0000 (10:49 +0100)]
xen/sched: fix adding offline cpu to cpupool
Trying to add an offline cpu to a cpupool can crash the hypervisor,
as the probably non-existing percpu area of the cpu is accessed before
the availability of the cpu is being tested. This can happen in case
the cpupool's granularity is "core" or "socket".
Fix that by testing the cpu to be online.
Fixes: cb563d7665f2 ("xen/sched: support core scheduling for moving cpus to/from cpupools") Reported-by: René Winther Højgaard <renewin@proton.me> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 06e8d65d33896aa90f5b6d9b2bce7f11433b33c9
master date: 2023-12-05 09:57:38 +0100
Jan Beulich [Wed, 6 Dec 2023 09:48:56 +0000 (10:48 +0100)]
x86emul: avoid triggering event related assertions
The assertion at the end of x86_emulate_wrapper() as well as the ones
in x86_emul_{hw_exception,pagefault}() can trigger if we ignore
X86EMUL_EXCEPTION coming back from certain hook functions. Squash
exceptions when merely probing MSRs, plus on SWAPGS'es "best effort"
error handling path.
In adjust_bnd() add another assertion after the read_xcr(0, ...)
invocation, paralleling the one in x86emul_get_fpu() - XCR0 reads should
never fault when XSAVE is (implicitly) known to be available.
Also update the respective comment in x86_emulate_wrapper().
Fixes: 14a6be89ec04 ("x86emul: correct EFLAGS.TF handling") Fixes: cb2626c75813 ("x86emul: conditionally clear BNDn for branches") Fixes: 6eb43fcf8a0b ("x86emul: support SWAPGS") Reported-by: AFL Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 787d11c5aaf4d3411d4658cff137cd49b0bd951b
master date: 2023-12-05 09:57:05 +0100
Both Intel and AMD manuals agree that in x2APIC mode, the APIC LDR and ID
registers are derivable from each other through a fixed formula.
Xen uses that formula, but applies it to vCPU IDs (which are sequential)
rather than x2APIC IDs (which are not, at the moment). As I understand it,
this is an attempt to tightly pack vCPUs into clusters so each cluster has
16 vCPUs rather than 8, but this is a spec violation.
This patch fixes the implementation so we follow the x2APIC spec for new
VMs, while preserving the behaviour (buggy or fixed) for migrated-in VMs.
While touching that area, remove the existing printk statement in
vlapic_load_fixup() (as the checks it performed didn't make sense in x2APIC
mode and wouldn't affect the outcome) and put another printk as an else
branch so we get warnings trying to load nonsensical LDR values we don't
know about.
Fixes: f9e0cccf7b35 ("x86/HVM: fix ID handling of x2APIC emulation") Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 90309854fd2440fb08b4c808f47d7670ba0d250d
master date: 2023-11-29 10:05:55 +0100
Roger Pau Monné [Wed, 6 Dec 2023 09:46:47 +0000 (10:46 +0100)]
livepatch: do not use .livepatch.funcs section to store internal state
Currently the livepatch logic inside of Xen will use fields of struct
livepatch_func in order to cache internal state of patched functions. Note
this is a field that is part of the payload, and is loaded as an ELF section
(.livepatch.funcs), taking into account the SHF_* flags in the section
header.
The flags for the .livepatch.funcs section, as set by livepatch-build-tools,
are SHF_ALLOC, which leads to its contents (the array of livepatch_func
structures) being placed in read-only memory:
This previously went unnoticed, as all writes to the fields of livepatch_func
happen in the critical region that had WP disabled in CR0. After 8676092a0f16
however WP is no longer toggled in CR0 for patch application, and only the
hypervisor .text mappings are made write-accessible. That leads to the
following page fault when attempting to apply a livepatch:
----[ Xen-4.19-unstable x86_64 debug=y Tainted: C ]----
CPU: 4
RIP: e008:[<ffff82d040221e81>] common/livepatch.c#apply_payload+0x45/0x1e1
[...]
Xen call trace:
[<ffff82d040221e81>] R common/livepatch.c#apply_payload+0x45/0x1e1
[<ffff82d0402235b2>] F check_for_livepatch_work+0x385/0xaa5
[<ffff82d04032508f>] F arch/x86/domain.c#idle_loop+0x92/0xee
****************************************
Panic on CPU 4:
FATAL PAGE FAULT
[error_code=0003]
Faulting linear address: ffff82d040625079
****************************************
Fix this by moving the internal Xen function patching state out of
livepatch_func into an area not allocated as part of the ELF payload. While
there also constify the array of livepatch_func structures in order to prevent
further surprises.
Note there's still one field (old_addr) that gets set during livepatch load. I
consider this fine since the field is read-only after load, and at the point
the field gets set the underlying mapping hasn't been made read-only yet.
Fixes: 8676092a0f16 ('x86/livepatch: Fix livepatch application when CET is active') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
xen/livepatch: fix livepatch tests
The current set of in-tree livepatch tests in xen/test/livepatch started
failing after the constify of the payload funcs array, and the movement of the
status data into a separate array.
Fix the tests so they respect the constness of the funcs array and also make
use of the new location of the per-func state data.
Fixes: 82182ad7b46e ('livepatch: do not use .livepatch.funcs section to store internal state') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 82182ad7b46e0f7a3856bb12c7a9bf2e2a4570bc
master date: 2023-11-27 15:16:01 +0100
master commit: 902377b690f42ddf44ae91c4b0751d597f1cd694
master date: 2023-11-29 10:46:42 +0000
Frediano Ziglio [Wed, 6 Dec 2023 09:46:01 +0000 (10:46 +0100)]
x86/mem_sharing: Release domain if we are not able to enable memory sharing
In case it's not possible to enable memory sharing (mem_sharing_control
fails) we just return the error code without releasing the domain
acquired some lines above by rcu_lock_live_remote_domain_by_id().
Fixes: 72f8d45d69b8 ("x86/mem_sharing: enable mem_sharing on first memop") Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: fbcec32d6d3ea0ac329301925b317478316209ed
master date: 2023-11-27 12:06:13 +0000
Juergen Gross [Thu, 23 Nov 2023 11:24:12 +0000 (12:24 +0100)]
xen/sched: fix sched_move_domain()
When moving a domain out of a cpupool running with the credit2
scheduler and having multiple run-queues, the following ASSERT() can
be observed:
(XEN) Xen call trace:
(XEN) [<ffff82d04023a700>] R credit2.c#csched2_unit_remove+0xe3/0xe7
(XEN) [<ffff82d040246adb>] S sched_move_domain+0x2f3/0x5b1
(XEN) [<ffff82d040234cf7>] S cpupool.c#cpupool_move_domain_locked+0x1d/0x3b
(XEN) [<ffff82d040236025>] S cpupool_move_domain+0x24/0x35
(XEN) [<ffff82d040206513>] S domain_kill+0xa5/0x116
(XEN) [<ffff82d040232b12>] S do_domctl+0xe5f/0x1951
(XEN) [<ffff82d0402276ba>] S timer.c#timer_lock+0x69/0x143
(XEN) [<ffff82d0402dc71b>] S pv_hypercall+0x44e/0x4a9
(XEN) [<ffff82d0402012b7>] S lstar_enter+0x137/0x140
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'svc->rqd == c2rqd(sched_unit_master(unit))' failed at common/sched/credit2.c:1159
(XEN) ****************************************
This is happening as sched_move_domain() is setting a different cpu
for a scheduling unit without telling the scheduler. When this unit is
removed from the scheduler, the ASSERT() will trigger.
In non-debug builds the result is usually a clobbered pointer, leading
to another crash a short time later.
Fix that by swapping the two involved actions (setting another cpu and
removing the unit from the scheduler).
Roger Pau Monné [Tue, 14 Nov 2023 13:01:33 +0000 (14:01 +0100)]
x86/i8259: do not assume interrupts always target CPU0
Sporadically we have seen the following during AP bringup on AMD platforms
only:
microcode: CPU59 updated from revision 0x830107a to 0x830107a, date = 2023-05-17
microcode: CPU60 updated from revision 0x830104d to 0x830107a, date = 2023-05-17
CPU60: No irq handler for vector 27 (IRQ -2147483648)
microcode: CPU61 updated from revision 0x830107a to 0x830107a, date = 2023-05-17
This is similar to the issue raised on Linux commit 36e9e1eab777e, where they
observed i8259 (active) vectors getting delivered to CPUs different than 0.
On AMD or Hygon platforms adjust the target CPU mask of i8259 interrupt
descriptors to contain all possible CPUs, so that APs will reserve the vector
at startup if any legacy IRQ is still delivered through the i8259. Note that
if the IO-APIC takes over those interrupt descriptors the CPU mask will be
reset.
Spurious i8259 interrupt vectors however (IRQ7 and IRQ15) can be injected even
when all i8259 pins are masked, and hence would need to be handled on all CPUs.
Continue to reserve PIC vectors on CPU0 only, but do check for such spurious
interrupts on all CPUs if the vendor is AMD or Hygon. Note that once the
vectors get used by devices detecting PIC spurious interrupts will no longer be
possible, however the device driver should be able to cope with spurious
interrupts. Such PIC spurious interrupts occurring when the vector is in use
by a local APIC routed source will lead to an extra EOI, which might
unintentionally clear a different vector from ISR. Note this is already the
current behavior, so assume it's infrequent enough to not cause real issues.
Finally, adjust the printed message to display the CPU where the spurious
interrupt has been received, so it looks like:
microcode: CPU1 updated from revision 0x830107a to 0x830107a, date = 2023-05-17
cpu1: spurious 8259A interrupt: IRQ7
microcode: CPU2 updated from revision 0x830104d to 0x830107a, date = 2023-05-17
Amends: 3fba06ba9f8b ('x86/IRQ: re-use legacy vector ranges on APs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 87f37449d586b4d407b75235bb0a171e018e25ec
master date: 2023-11-02 10:50:59 +0100
Roger Pau Monné [Tue, 14 Nov 2023 13:01:07 +0000 (14:01 +0100)]
x86/x2apic: remove usage of ACPI_FADT_APIC_CLUSTER
The ACPI FADT APIC_CLUSTER flag mandates that when the interrupt delivery is
Logical mode APIC must be configured for Cluster destination model. However in
apic_x2apic_probe() such flag is incorrectly used to gate whether Physical mode
can be used.
Since Xen when in x2APIC mode only uses Logical mode together with Cluster
model completely remove checking for ACPI_FADT_APIC_CLUSTER, as Xen always
fulfills the requirement signaled by the flag.
Fixes: eb40ae41b658 ('x86/Kconfig: add option for default x2APIC destination mode') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 26a449ce32cef33f2cb50602be19fcc0c4223ba9
master date: 2023-11-02 10:50:26 +0100
David Woodhouse [Tue, 14 Nov 2023 13:00:37 +0000 (14:00 +0100)]
x86/pv-shim: fix grant table operations for 32-bit guests
When switching to call the shim functions from the normal handlers, the
compat_grant_table_op() function was omitted, leaving it calling the
real grant table operations in !PV_SHIM_EXCLUSIVE builds. This leaves a
32-bit shim guest failing to set up its real grant table with the parent
hypervisor.
Fixes: e7db635f4428 ("x86/pv-shim: Don't modify the hypercall table") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93ec30bc545f15760039c23ee4b97b80c0b3b3b3
master date: 2023-10-31 16:10:14 +0000
Tamas K Lengyel [Tue, 14 Nov 2023 13:00:20 +0000 (14:00 +0100)]
x86/mem_sharing: add missing m2p entry when mapping shared_info page
When mapping in the shared_info page to a fork the m2p entry wasn't set
resulting in the shared_info being reset even when the fork reset was called
with only reset_state and not reset_memory. This results in an extra
unnecessary TLB flush.
Fixes: 1a0000ac775 ("mem_sharing: map shared_info page to same gfn during fork") Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23eb39acf011ef9bbe02ed4619c55f208fbcd39b
master date: 2023-10-31 16:10:14 +0000
Jan Beulich [Tue, 14 Nov 2023 12:58:18 +0000 (13:58 +0100)]
x86: support data operand independent timing mode
[1] specifies a long list of instructions which are intended to exhibit
timing behavior independent of the data they operate on. On certain
hardware this independence is optional, controlled by a bit in a new
MSR. Provide a command line option to control the mode Xen and its
guests are to operate in, with a build time control over the default.
Longer term we may want to allow guests to control this.
Since Arm64 supposedly also has such a control, put command line option
and Kconfig control in common files.
Roger Pau Monné [Tue, 14 Nov 2023 12:56:39 +0000 (13:56 +0100)]
iommu/vt-d: fix SAGAW capability parsing
SAGAW is a bitmap field, with bits 1, 2 and 3 signaling support for 3, 4 and 5
level page tables respectively. According to the Intel VT-d specification, an
IOMMU can report multiple SAGAW bits being set.
Commit 859d11b27912 claims to replace the open-coded find_first_set_bit(), but
it's actually replacing an open coded implementation to find the last set bit.
The change forces the used AGAW to the lowest supported by the IOMMU instead of
the highest one between 1 and 2.
Restore the previous SAGAW parsing by using fls() instead of
find_first_set_bit(), in order to get the highest (supported) AGAW to be used.
However there's a caveat related to the value the AW context entry field must
be set to when using passthrough mode:
"When the Translation-type (TT) field indicates pass-through processing (10b),
this field must be programmed to indicate the largest AGAW value supported by
hardware." [0]
Newer Intel IOMMU implementations support 5 level page tables for the IOMMU,
and signal such support in SAGAW bit 3.
Enabling 5 level paging support (AGAW 3) is too risky at this point in the Xen
4.18 release, so instead put a bodge to unconditionally disable passthough
mode if SAGAW has any bits greater than 2 set. Ignore bit 0; it's reserved in
current specifications, but had a meaning in the past and is unlikely to be
reused in the future.
Note the message about unhandled SAGAW bits being set is printed
unconditionally, regardless of whether passthrough mode is enabled. This is
done in order to easily notice IOMMU implementations with not yet supported
SAGAW values.
Roger Pau Monné [Tue, 14 Nov 2023 12:56:13 +0000 (13:56 +0100)]
iommu: fix quarantine mode command line documentation
With the addition of per-device quarantine page tables the sink page is now
exclusive for each device, and thus writable. Update the documentation to
reflect the current implementation.
Fixes: 14dd241aad8a ('IOMMU/x86: use per-device page tables for quarantining') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94a5127ebeb4a005f128150909ca78bfea50206a
master date: 2023-10-19 21:52:52 +0100
Roger Pau Monné [Tue, 14 Nov 2023 12:55:23 +0000 (13:55 +0100)]
x86/pvh: fix identity mapping of low 1MB
The mapping of memory regions below the 1MB mark was all done by the PVH dom0
builder code, causing the region to be avoided by the arch specific IOMMU
hardware domain initialization code. That lead to the IOMMU being enabled
without reserved regions in the low 1MB identity mapped in the p2m for PVH
hardware domains. Firmware which happens to be missing RMRR/IVMD ranges
describing E820 reserved regions in the low 1MB would transiently trigger IOMMU
faults until the p2m is populated by the PVH dom0 builder:
Those errors have been observed on the osstest pinot{0,1} boxes (AMD Fam15h
Opteron(tm) Processor 3350 HE).
Rely on the IOMMU arch init code to create any identity mappings for reserved
regions in the low 1MB range (like it already does for reserved regions
elsewhere), and leave the mapping of any holes to be performed by the dom0
builder code.
Fixes: 6b4f6a31ace1 ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4bb882fe6e4430782101fe06379649df1bbd458a
master date: 2023-10-19 09:52:43 +0200
This is an AMD feature to reduce the IBRS handling overhead. Once enabled,
processes running at CPL=0 are automatically IBRS-protected even if
SPEC_CTRL.IBRS is not set. Furthermore, the RAS/RSB is cleared on VMEXIT.
The feature is exposed in CPUID and toggled in EFER.
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8347d6bb29bfd0c3b5acdc078574a8643c5a5637
master date: 2023-05-30 18:24:07 +0100
tools/pygrub: Fix pygrub's --entry flag for python3
string.atoi() has been deprecated since Python 2.0, has a big scary warning
in the python2.7 docs and is absent from python3 altogether. int() does the
same thing and is compatible with both.
See https://docs.python.org/2/library/string.html#string.atoi:
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 40387f62061c4b9c780cda78b4ac0e29d478f648
master date: 2023-10-18 15:44:31 +0100
George Dunlap [Tue, 14 Nov 2023 12:52:27 +0000 (13:52 +0100)]
cxenstored: wait until after reset to notify dom0less domains
Commit fc2b57c9a ("xenstored: send an evtchn notification on
introduce_domain") introduced the sending of an event channel to the
guest when first introduced, so that dom0less domains waiting for the
connection would know that xenstore was ready to use.
Unfortunately, it was introduced in introduce_domain(), which 1) is
called by other functions, where such functionality is unneeded, and
2) after the main XS_INTRODUCE call, calls domain_conn_reset(). This
introduces a race condition, whereby if xenstored is delayed, a domain
can wake up, send messages to the buffer, only to have them deleted by
xenstore before finishing its processing of the XS_INTRODUCE message.
Move the connect-and-notfy call into do_introduce() instead, after the
domain_conn_rest(); predicated on the state being in the
XENSTORE_RECONNECT state.
(We don't need to check for "restoring", since that value is always
passed as "false" from do_domain_introduce()).
Also take the opportunity to add a missing wmb barrier after resetting
the indexes of the ring in domain_conn_reset.
This change will also remove an extra event channel notification for
dom0 (because the notification is now done by do_introduce which is not
called for dom0.) The extra dom0 event channel notification was only
introduced by fc2b57c9a and was never present before. It is not needed
because dom0 is the one to tell xenstored the connection parameters, so
dom0 has to know that the ring page is setup correctly by the time
xenstored starts looking at it. It is dom0 that performs the ring page
init.
Michal Orzel [Tue, 14 Nov 2023 12:52:01 +0000 (13:52 +0100)]
x86: Clarify that only 5 hypercall parameters are supported
The x86 hypercall ABI really used to have 6-argument hypercalls. V4V, the
downstream predecessor to Argo did take 6th args.
However, the 6th arg being %ebp in the 32bit ABI makes it unusable in
practice, because that's the frame pointer in builds with frame pointers
enabled. Therefore Argo was altered to being a 5-arg hypercall when it was
upstreamed.
c/s 2f531c122e95 ("x86: limit number of hypercall parameters to 5") removed
the ability for hypercalls to take 6 arguments.
Update the documentation to match reality.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: c035151902689aa5a3765aeb16fa52755917b9ca
master date: 2023-10-10 10:03:49 +0100
Roger Pau Monné [Tue, 14 Nov 2023 12:50:35 +0000 (13:50 +0100)]
x86/amd: do not expose HWCR.TscFreqSel to guests
OpenBSD 7.3 will unconditionally access HWCR if the TSC is reported as
Invariant, and it will then attempt to also unconditionally access PSTATE0 if
HWCR.TscFreqSel is set (currently the case on Xen).
The motivation for exposing HWCR.TscFreqSel was to avoid warning messages from
Linux. It has been agreed that Linux should be changed instead to not
complaint about missing HWCR.TscFreqSel when running virtualized.
The relation between HWCR.TscFreqSel and PSTATE0 is not clearly written down in
the PPR, but it's natural for OSes to attempt to fetch the P0 frequency if the
TSC is stated to increment at the P0 frequency.
Exposing PSTATEn (PSTATE0 at least) with all zeroes is not a suitable solution
because the PstateEn bit is read-write, and OSes could legitimately attempt to
set PstateEn=1 which Xen couldn't handle.
Furthermore, the TscFreqSel bit is model specific and was never safe to expose
like this in the first place. At a minimum it should have had a toolstack
adjustment to know not to migrate such a VM.
Therefore, simply remove the bit. Note the HWCR itself is an architectural
register, and does need to be accessible by the guest. Since HWCR contains
both architectural and non-architectural bits, going forward care must be taken
to assert the exposed value is correct on newer CPU families.
Reported-by: Solène Rapenne <solene@openbsd.org> Link: https://github.com/QubesOS/qubes-issues/issues/8502 Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e4ca4e261da3fdddd541c3a9842b1e9e2ad00525
master date: 2023-09-18 15:07:49 +0200
Andrew Cooper [Thu, 26 Oct 2023 13:37:38 +0000 (14:37 +0100)]
x86/spec-ctrl: Remove conditional IRQs-on-ness for INT $0x80/0x82 paths
Before speculation defences, some paths in Xen could genuinely get away with
being IRQs-on at entry. But XPTI invalidated this property on most paths, and
attempting to maintain it on the remaining paths was a mistake.
Fast forward, and DO_SPEC_CTRL_COND_IBPB (protection for AMD BTC/SRSO) is not
IRQ-safe, running with IRQs enabled in some cases. The other actions taken on
these paths happen to be IRQ-safe.
Make entry_int82() and int80_direct_trap() unconditionally Interrupt Gates
rather than Trap Gates. Remove the conditional re-adjustment of
int80_direct_trap() in smp_prepare_cpus(), and have entry_int82() explicitly
enable interrupts when safe to do so.
In smp_prepare_cpus(), with the conditional re-adjustment removed, the
clearing of pv_cr3 is the only remaining action gated on XPTI, and it is out
of place anyway, repeating work already done by smp_prepare_boot_cpu(). Drop
the entire if() condition to avoid leaving an incorrect vestigial remnant.
Also drop comments which make incorrect statements about when its safe to
enable interrupts.
This is XSA-446 / CVE-2023-46836
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit a48bb129f1b9ff55c22cf6d2b589247c8ba3b10e)
Roger Pau Monne [Wed, 11 Oct 2023 11:14:21 +0000 (13:14 +0200)]
iommu/amd-vi: use correct level for quarantine domain page tables
The current setup of the quarantine page tables assumes that the quarantine
domain (dom_io) has been initialized with an address width of
DEFAULT_DOMAIN_ADDRESS_WIDTH (48).
However dom_io being a PV domain gets the AMD-Vi IOMMU page tables levels based
on the maximum (hot pluggable) RAM address, and hence on systems with no RAM
above the 512GB mark only 3 page-table levels are configured in the IOMMU.
On systems without RAM above the 512GB boundary amd_iommu_quarantine_init()
will setup page tables for the scratch page with 4 levels, while the IOMMU will
be configured to use 3 levels only. The page destined to be used as level 1,
and to contain a directory of PTEs ends up being the address in a PTE itself,
and thus level 1 page becomes the leaf page. Without the level mismatch it's
level 0 page that should be the leaf page instead.
The level 1 page won't be used as such, and hence it's not possible to use it
to gain access to other memory on the system. However that page is not cleared
in amd_iommu_quarantine_init() as part of re-initialization of the device
quarantine page tables, and hence data on the level 1 page can be leaked
between device usages.
Fix this by making sure the paging levels setup by amd_iommu_quarantine_init()
match the number configured on the IOMMUs.
Note that IVMD regions are not affected by this issue, as those areas are
mapped taking the configured paging levels into account.
This is XSA-445 / CVE-2023-46835
Fixes: ea38867831da ('x86 / iommu: set up a scratch page in the quarantine domain') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fe1e4668b373ec4c1e5602e75905a9fa8cc2be3f)
Andrew Cooper [Tue, 26 Sep 2023 19:03:36 +0000 (20:03 +0100)]
x86/pv: Correct the auditing of guest breakpoint addresses
The use of access_ok() is buggy, because it permits access to the compat
translation area. 64bit PV guests don't use the XLAT area, but on AMD
hardware, the DBEXT feature allows a breakpoint to match up to a 4G aligned
region, allowing the breakpoint to reach outside of the XLAT area.
Prior to c/s cda16c1bb223 ("x86: mirror compat argument translation area for
32-bit PV"), the live GDT was within 4G of the XLAT area.
All together, this allowed a malicious 64bit PV guest on AMD hardware to place
a breakpoint over the live GDT, and trigger a #DB livelock (CVE-2015-8104).
Introduce breakpoint_addr_ok() and explain why __addr_ok() happens to be an
appropriate check in this case.
For Xen 4.14 and later, this is a latent bug because the XLAT area has moved
to be on its own with nothing interesting adjacent. For Xen 4.13 and older on
AMD hardware, this fixes a PV-trigger-able DoS.
This is part of XSA-444 / CVE-2023-34328.
Fixes: 65e355490817 ("x86/PV: support data breakpoint extension registers") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit dc9d9aa62ddeb14abd5672690d30789829f58f7e)
Andrew Cooper [Tue, 26 Sep 2023 19:03:36 +0000 (20:03 +0100)]
x86/svm: Fix asymmetry with AMD DR MASK context switching
The handling of MSR_DR{0..3}_MASK is asymmetric between PV and HVM guests.
HVM guests context switch in based on the guest view of DBEXT, whereas PV
guest switch in base on the host capability. Both guest types leave the
context dirty for the next vCPU.
This leads to the following issue:
* PV or HVM vCPU has debugging active (%dr7 + mask)
* Switch out deactivates %dr7 but leaves other state stale in hardware
* HVM vCPU with debugging activate but can't see DBEXT is switched in
* Switch in loads %dr7 but leaves the mask MSRs alone
Now, the HVM vCPU is operating in the context of the prior vCPU's mask MSR,
and furthermore in a case where it genuinely expects there to be no masking
MSRs.
As a stopgap, adjust the HVM path to switch in/out the masks based on host
capabilities rather than guest visibility (i.e. like the PV path). Adjustment
of the of the intercepts still needs to be dependent on the guest visibility
of DBEXT.
This is part of XSA-444 / CVE-2023-34327
Fixes: c097f54912d3 ("x86/SVM: support data breakpoint extension registers") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 5d54282f984bb9a7a65b3d12208584f9fdf1c8e1)
libxl: limit bootloader execution in restricted mode
Introduce a timeout for bootloader execution when running in restricted mode.
Allow overwriting the default time out with an environment provided value.
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 9c114178ffd700112e91f5ec66cf5151b9c9a8cc)
libxl: add support for running bootloader in restricted mode
Much like the device model depriv mode, add the same kind of support for the
bootloader. Such feature allows passing a UID as a parameter for the
bootloader to run as, together with the bootloader itself taking the necessary
actions to isolate.
Note that the user to run the bootloader as must have the right permissions to
access the guest disk image (in read mode only), and that the bootloader will
be run in non-interactive mode when restricted.
If enabled bootloader restrict mode will attempt to re-use the user(s) from the
QEMU depriv implementation if no user is provided on the configuration file or
the environment. See docs/features/qemu-deprivilege.pandoc for more
information about how to setup those users.
Bootloader restrict mode is not enabled by default as it requires certain
setup to be done first (setup of the user(s) to use in restrict mode).
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 1f762642d2cad1a40634e3280361928109d902f1)
Introduce a --runas=<uid> flag to deprivilege pygrub on Linux and *BSDs. It
also implicitly creates a chroot env where it drops a deprivileged forked
process. The chroot itself is cleaned up at the end.
If the --runas arg is present, then pygrub forks, leaving the child to
deprivilege itself, and waiting for it to complete. When the child exists,
the parent performs cleanup and exits with the same error code.
This is roughly what the child does:
1. Initialize libfsimage (this loads every .so in memory so the chroot
can avoid bind-mounting /{,usr}/lib*
2. Create a temporary empty chroot directory
3. Mount tmpfs in it
4. Bind mount the disk inside, because libfsimage expects a path, not a
file descriptor.
5. Remount the root tmpfs to be stricter (ro,nosuid,nodev)
6. Set RLIMIT_FSIZE to a sensibly high amount (128 MiB)
7. Depriv gid, groups and uid
With this scheme in place, the "output" files are writable (up to
RLIMIT_FSIZE octets) and the exposed filesystem is immutable and contains
the single only file we can't easily get rid of (the disk).
If running on Linux, the child process also unshares mount, IPC, and
network namespaces before dropping its privileges.
tools/libfsimage: Export a new function to preload all plugins
This is work required in order to let pygrub operate in highly deprivileged
chroot mode. This patch adds a function that preloads every plugin, hence
ensuring that a on function exit, every shared library is loaded in memory.
The new "init" function is supposed to be used before depriv, but that's
fine because it's not acting on untrusted data.
This patch allows pygrub to get ahold of every RW file descriptor it needs
early on. A later patch will clamp the filesystem it can access so it can't
obtain any others.
There's a hypercall being issued in order to determine whether PV64 is
supported, but since Xen 4.3 that's strictly true so it's not required.
Plus, this way we can avoid mapping the privcmd interface altogether in the
depriv pygrub.
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit f4b504c6170c446e61055cbd388ae4e832a9deca)
libfsimage/xfs: Sanity-check the superblock during mounts
Sanity-check the XFS superblock for wellformedness at the mount handler.
This forces pygrub to abort parsing a potentially malformed filesystem and
ensures the invariants assumed throughout the rest of the code hold.
Also, derive parameters from previously sanitized parameters where possible
(rather than reading them off the superblock)
The code doesn't try to avoid overflowing the end of the disk, because
that's an unlikely and benign error. Parameters used in calculations of
xfs_daddr_t (like the root inode index) aren't in critical need of being
sanitized.
The sanitization of agblklog is basically checking that no obvious
overflows happen on agblklog, and then ensuring agblocks is contained in
the range (2^(sb_agblklog-1), 2^sb_agblklog].
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 620500dd1baf33347dfde5e7fde7cf7fe347da5c)
Roger Pau Monne [Tue, 13 Jun 2023 13:01:05 +0000 (15:01 +0200)]
iommu/amd-vi: flush IOMMU TLB when flushing the DTE
The caching invalidation guidelines from the AMD-Vi specification (48882—Rev
3.07-PUB—Oct 2022) seem to be misleading on some hardware, as devices will
malfunction (see stale DMA mappings) if some fields of the DTE are updated but
the IOMMU TLB is not flushed. This has been observed in practice on AMD
systems. Due to the lack of guidance from the currently published
specification this patch aims to increase the flushing done in order to prevent
device malfunction.
In order to fix, issue an INVALIDATE_IOMMU_PAGES command from
amd_iommu_flush_device(), flushing all the address space. Note this requires
callers to be adjusted in order to pass the DomID on the DTE previous to the
modification.
Some call sites don't provide a valid DomID to amd_iommu_flush_device() in
order to avoid the flush. That's because the device had address translations
disabled and hence the previous DomID on the DTE is not valid. Note the
current logic relies on the entity disabling address translations to also flush
the TLB of the in use DomID.
Device I/O TLB flushing when ATS are enabled is not covered by the current
change, as ATS usage is not security supported.
This is XSA-442 / CVE-2023-34326
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fc98b97084a46884acef9320e643faf40d42212)
The function domain_entry_fix() will be initially called to check if the
quota is correct before attempt to commit any nodes. So it would be
possible that accounting is temporarily negative. This is the case
in the following sequence:
1) Create 50 nodes
2) Start two transactions
3) Delete all the nodes in each transaction
4) Commit the two transactions
Because the first transaction will have succeed and updated the
accounting, there is no guarantee that 'd->nbentry + num' will still
be above 0. So the assert() would be triggered.
The assert() was introduced in dbef1f748289 ("tools/xenstore: simplify
and fix per domain node accounting") with the assumption that the
value can't be negative. As this is not true revert to the original
check but restricted to the path where we don't update. Take the
opportunity to explain the rationale behind the check.
This CVE-2023-34323 / XSA-440.
Fixes: dbef1f748289 ("tools/xenstore: simplify and fix per domain node accounting") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Wed, 20 Sep 2023 09:33:26 +0000 (10:33 +0100)]
x86/shadow: defer releasing of PV's top-level shadow reference
sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch occurred.
Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.
While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.
This is CVE-2023-34322 / XSA-438.
Reported-by: Tim Deegan <tim@xen.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@cloud.com>
(cherry picked from commit fb0ff49fe9f784bfee0370c2a3c5f20e39d7a1cb)
Andrew Cooper [Wed, 30 Aug 2023 19:24:25 +0000 (20:24 +0100)]
x86/spec-ctrl: Mitigate the Zen1 DIV leakage
In the Zen1 microarchitecure, there is one divider in the pipeline which
services uops from both threads. In the case of #DE, the latched result from
the previous DIV to execute will be forwarded speculatively.
This is an interesting covert channel that allows two threads to communicate
without any system calls. In also allows userspace to obtain the result of
the most recent DIV instruction executed (even speculatively) in the core,
which can be from a higher privilege context.
Scrub the result from the divider by executing a non-faulting divide. This
needs performing on the exit-to-guest paths, and ist_exit-to-Xen.
Alternatives in IST context is believed safe now that it's done in NMI
context.
This is XSA-439 / CVE-2023-20588.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b5926c6ecf05c28ee99c6248c42d691ccbf0c315)
Andrew Cooper [Fri, 15 Sep 2023 11:13:51 +0000 (12:13 +0100)]
x86/amd: Introduce is_zen{1,2}_uarch() predicates
We already have 3 cases using STIBP as a Zen1/2 heuristic, and are about to
introduce a 4th. Wrap the heuristic into a pair of predicates rather than
opencoding it, and the explanation of the heuristic, at each usage site.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit de1d265001397f308c5c3c5d3ffc30e7ef8c0705)
Andrew Cooper [Wed, 13 Sep 2023 12:53:33 +0000 (13:53 +0100)]
x86/spec-ctrl: Issue VERW during IST exit to Xen
There is a corner case where e.g. an NMI hitting an exit-to-guest path after
SPEC_CTRL_EXIT_TO_* would have run the entire NMI handler *after* the VERW
flush to scrub potentially sensitive data from uarch buffers.
In order to compensate, issue VERW when exiting to Xen from an IST entry.
SPEC_CTRL_EXIT_TO_XEN already has two reads of spec_ctrl_flags off the stack,
and we're about to add a third. Load the field into %ebx, and list the
register as clobbered.
%r12 has been arranged to be the ist_exit signal, so add this as an input
dependency and use it to identify when to issue a VERW.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3ee6066bcd737756b0990d417d94eddc0b0d2585)
Andrew Cooper [Wed, 13 Sep 2023 11:20:12 +0000 (12:20 +0100)]
x86/entry: Track the IST-ness of an entry for the exit paths
Use %r12 to hold an ist_exit boolean. This register is zero elsewhere in the
entry/exit asm, so it only needs setting in the IST path.
As this is subtle and fragile, add check_ist_exit() to be used in debugging
builds to cross-check that the ist_exit boolean matches the entry vector.
Write check_ist_exit() it in C, because it's debug only and the logic more
complicated than I care to maintain in asm.
For now, we only need to use this signal in the exit-to-Xen path, but some
exit-to-guest paths happen in IST context too. Check the correctness in all
exit paths to avoid the logic bit-rotting.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 21bdc25b05a0f8ab6bc73520a9ca01327360732c)
x86/entry: Partially revert IST-exit checks
The patch adding check_ist_exit() didn't account for the fact that
reset_stack_and_jump() is not an ABI-preserving boundary. The IST-ness in
%r12 doesn't survive into the next context, and is a stale value C.
There's no straightforward way to reconstruct the IST-exit-ness on the
exit-to-guest path after a context switch. For now, we only need IST-exit on
the return-to-Xen path.
Fixes: 21bdc25b05a0 ("x86/entry: Track the IST-ness of an entry for the exit paths") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b57c800b79b96769ea3dcd6468578fa664d19f9)
Andrew Cooper [Wed, 13 Sep 2023 12:48:16 +0000 (13:48 +0100)]
x86/entry: Adjust restore_all_xen to hold stack_end in %r14
All other SPEC_CTRL_{ENTRY,EXIT}_* helpers hold stack_end in %r14. Adjust it
for consistency.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7aa28849a1155d856e214e9a80a7e65fffdc3e58)
Andrew Cooper [Wed, 30 Aug 2023 19:11:50 +0000 (20:11 +0100)]
x86/spec-ctrl: Improve all SPEC_CTRL_{ENTER,EXIT}_* comments
... to better explain how they're used.
Doing so highlights that SPEC_CTRL_EXIT_TO_XEN is missing a VERW flush for the
corner case when e.g. an NMI hits late in an exit-to-guest path.
Leave a TODO, which will be addressed in subsequent patches which arrange for
VERW flushing to be safe within SPEC_CTRL_EXIT_TO_XEN.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f00557350dc7d0756551069803fc49c29184ca)
Andrew Cooper [Fri, 1 Sep 2023 10:38:44 +0000 (11:38 +0100)]
x86/spec-ctrl: Turn the remaining SPEC_CTRL_{ENTRY,EXIT}_* into asm macros
These have grown more complex over time, with some already having been
converted.
Provide full Requires/Clobbers comments, otherwise missing at this level of
indirection.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7125429aafb9e3c9c88fc93001fc2300e0ac2cc8)
Andrew Cooper [Tue, 12 Sep 2023 16:03:16 +0000 (17:03 +0100)]
x86/spec-ctrl: Fold DO_SPEC_CTRL_EXIT_TO_XEN into it's single user
With the SPEC_CTRL_EXIT_TO_XEN{,_IST} confusion fixed, it's now obvious that
there's only a single EXIT_TO_XEN path. Fold DO_SPEC_CTRL_EXIT_TO_XEN into
SPEC_CTRL_EXIT_TO_XEN to simplify further fixes.
When merging labels, switch the name to .L\@_skip_sc_msr as "skip" on its own
is going to be too generic shortly.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 694bb0f280fd08a4377e36e32b84b5062def4de2)
Andrew Cooper [Tue, 12 Sep 2023 14:06:49 +0000 (15:06 +0100)]
x86/spec-ctrl: Fix confusion between SPEC_CTRL_EXIT_TO_XEN{,_IST}
c/s 3fffaf9c13e9 ("x86/entry: Avoid using alternatives in NMI/#MC paths")
dropped the only user, leaving behind the (incorrect) implication that Xen had
split exit paths.
Delete the unused SPEC_CTRL_EXIT_TO_XEN and rename SPEC_CTRL_EXIT_TO_XEN_IST
to SPEC_CTRL_EXIT_TO_XEN for consistency.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1c18d73774533a55ba9d1cbee8bdace03efdb5e7)
Jan Beulich [Wed, 23 Aug 2023 07:26:36 +0000 (09:26 +0200)]
x86/AMD: extend Zenbleed check to models "good" ucode isn't known for
Reportedly the AMD Custom APU 0405 found on SteamDeck, models 0x90 and
0x91, (quoting the respective Linux commit) is similarly affected. Put
another instance of our Zen1 vs Zen2 distinction checks in
amd_check_zenbleed(), forcing use of the chickenbit irrespective of
ucode version (building upon real hardware never surfacing a version of
0xffffffff).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 145a69c0944ac70cfcf9d247c85dee9e99d9d302)
xen/arm: page: Handle cache flush of an element at the top of the address space
The region that needs to be cleaned/invalidated may be at the top
of the address space. This means that 'end' (i.e. 'p + size') will
be 0 and therefore nothing will be cleaned/invalidated as the check
in the loop will always be false.
On Arm64, we only support we only support up to 48-bit Virtual
address space. So this is not a concern there. However, for 32-bit,
the mapcache is using the last 2GB of the address space. Therefore
we may not clean/invalidate properly some pages. This could lead
to memory corruption or data leakage (the scrubbed value may
still sit in the cache when the guest could read directly the memory
and therefore read the old content).
Rework invalidate_dcache_va_range(), clean_dcache_va_range(),
clean_and_invalidate_dcache_va_range() to handle a cache flush
with an element at the top of the address space.
x86/irq: fix reporting of spurious i8259 interrupts
The return value of bogus_8259A_irq() is wrong: the function will
return `true` when the IRQ is real and `false` when it's a spurious
IRQ. This causes the "No irq handler for vector ..." message in
do_IRQ() to be printed for spurious i8259 interrupts which is not
intended (and not helpful).
Fix by inverting the return value of bogus_8259A_irq().
Fixes: 132906348a14 ('x86/i8259: Handle bogus spurious interrupts more quietly') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 709f6c8ce6422475c372e67507606170a31ccb65
master date: 2023-08-30 10:03:53 +0200
Andrew Cooper [Tue, 5 Sep 2023 06:53:31 +0000 (08:53 +0200)]
x86/vmx: Revert "x86/VMX: sanitize rIP before re-entering guest"
At the time of XSA-170, the x86 instruction emulator was genuinely broken. It
would load arbitrary values into %rip and putting a check here probably was
the best stopgap security fix. It should have been reverted following c/s 81d3a0b26c1 "x86emul: limit-check branch targets" which corrected the emulator
behaviour.
However, everyone involved in XSA-170, myself included, failed to read the SDM
correctly. On the subject of %rip consistency checks, the SDM stated:
If the processor supports N < 64 linear-address bits, bits 63:N must be
identical
A non-canonical %rip (and SSP more recently) is an explicitly legal state in
x86, and the VMEntry consistency checks are intentionally off-by-one from a
regular canonical check.
The consequence of this bug is that Xen will currently take a legal x86 state
which would successfully VMEnter, and corrupt it into having non-architectural
behaviour.
Furthermore, in the time this bugfix has been pending in public, I
successfully persuaded Intel to clarify the SDM, adding the following
clarification:
The guest RIP value is not required to be canonical; the value of bit N-1
may differ from that of bit N.
Fixes: ffbbfda377 ("x86/VMX: sanitize rIP before re-entering guest") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 10c83bb0f5d158d101d983883741b76f927e54a3
master date: 2023-08-23 18:44:59 +0100
Jason Andryuk [Tue, 5 Sep 2023 06:52:33 +0000 (08:52 +0200)]
tboot: Disable CET at shutdown
tboot_shutdown() calls into tboot to perform the actual system shutdown.
tboot isn't built with endbr annotations, and Xen has CET-IBT enabled on
newer hardware. shutdown_entry isn't annotated with endbr and Xen
faults:
Panic on CPU 0:
CONTROL-FLOW PROTECTION FAULT: #CP[0003] endbranch
And Xen hangs at this point.
Disabling CET-IBT let Xen and tboot power off, but reboot was
perfoming a poweroff instead of a warm reboot. Disabling all of CET,
i.e. shadow stacks as well, lets tboot reboot properly.
Fixes: cdbe2b0a1aec ("x86: Enable CET Indirect Branch Tracking") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: 0801868f550539d417d46f82c49307480947ccaa
master date: 2023-08-17 16:24:49 +0200
Jan Beulich [Tue, 5 Sep 2023 06:52:15 +0000 (08:52 +0200)]
libxl: slightly correct JSON generation of CPU policy
The "cpuid_empty" label is also (in principle; maybe only for rubbish
input) reachable in the "cpuid_only" case. Hence the label needs to live
ahead of the check of the variable.
Fixes: 5b80cecb747b ("libxl: introduce MSR data in libxl_cpuid_policy") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: ebce4e3a146c39e57bb7a890e059e89c32b6d547
master date: 2023-08-17 16:24:17 +0200
x86/iommu: pass full IO-APIC RTE for remapping table update
So that the remapping entry can be updated atomically when possible.
Doing such update atomically will avoid Xen having to mask the IO-APIC
pin prior to performing any interrupt movements (ie: changing the
destination and vector fields), as the interrupt remapping entry is
always consistent.
This also simplifies some of the logic on both VT-d and AMD-Vi
implementations, as having the full RTE available instead of half of
it avoids to possibly read and update the missing other half from
hardware.
While there remove the explicit zeroing of new_ire fields in
ioapic_rte_to_remap_entry() and initialize the variable at definition
so all fields are zeroed. Note fields could be also initialized with
final values at definition, but I found that likely too much to be
done at this time.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3e033172b0250446bfe119f31c7f0f51684b0472
master date: 2023-08-01 11:48:39 +0200
iommu/vtd: rename io_apic_read_remap_rte() local variable
Preparatory change to unify the IO-APIC pin variable name between
io_apic_read_remap_rte() and amd_iommu_ioapic_update_ire(), so that
the local variable can be made a function parameter with the same name
across vendors.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a478b38c01b65fa030303f0324a3380d872eb165
master date: 2023-07-28 09:40:42 +0200
x86/ioapic: RTE modifications must use ioapic_write_entry
Do not allow to write to RTE registers using io_apic_write and instead
require changes to RTE to be performed using ioapic_write_entry.
This is in preparation for passing the full contents of the RTE to the
IOMMU interrupt remapping handlers, so remapping entries for IO-APIC
RTEs can be updated atomically when possible.
While immediately this commit might expand the number of MMIO accesses
in order to update an IO-APIC RTE, further changes will benefit from
getting the full RTE value passed to the IOMMU handlers, as the logic
is greatly simplified when the IOMMU handlers can get the complete RTE
value in one go.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ef7995ed1bcd7eac37fb3c3fe56eaa54ea9baf6c
master date: 2023-07-28 09:40:20 +0200
x86/ioapic: sanitize IO-APIC pins before enabling lapic LVTERR/ESR
The current logic to init the local APIC and the IO-APIC does init the
local APIC LVTERR/ESR before doing any sanitization on the IO-APIC pin
configuration. It's already noted on enable_IO_APIC() that Xen
shouldn't trust the IO-APIC being empty at bootup.
At XenServer we have a system where the IO-APIC 0 is handed to Xen
with pin 0 unmasked, set to Fixed delivery mode, edge triggered and
with a vector of 0 (all fields of the RTE are zeroed). Once the local
APIC LVTERR/ESR is enabled periodic injections from such pin cause the
local APIC to in turn inject periodic error vectors:
APIC error on CPU0: 00(40), Received illegal vector
APIC error on CPU0: 40(40), Received illegal vector
APIC error on CPU0: 40(40), Received illegal vector
APIC error on CPU0: 40(40), Received illegal vector
APIC error on CPU0: 40(40), Received illegal vector
APIC error on CPU0: 40(40), Received illegal vector
That prevents Xen from booting.
Move the masking of the IO-APIC pins ahead of the setup of the local
APIC. This has the side effect of also moving the detection of the
pin where the i8259 is connected, as such detection must be done
before masking any pins.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 813da5f0e73b8cbd2ac3c7922506e58c28cd736d
master date: 2023-07-17 10:31:10 +0200
A recent xentrace highlighted an unhandled corner case in the vcpu
"start-of-day" logic, if the trace starts after the last running ->
non-running transition, but before the first non-running -> running
transition. Because start-of-day wasn't handled, vcpu_next_update()
was expecting p->current to be NULL, and tripping out with the
following error message when it wasn't:
vcpu_next_update: FATAL: p->current not NULL! (d32768dv$p, runstate RUNSTATE_INIT)
where 32768 is the DEFAULT_DOMAIN, and $p is the pcpu number.
Instead of calling vcpu_start() piecemeal throughout
sched_runstate_process(), call it at the top of the function if the
vcpu in question is still in RUNSTATE_INIT, so that we can handle all
the cases in one place.
Sketch out at the top of the function all cases which we need to
handle, and what to do in those cases. Some transitions tell us where
v is running; some transitions tell us about what is (or is not)
running on p; some transitions tell us neither.
If a transition tells us where v is now running, update its state;
otherwise leave it in INIT, in order to avoid having to deal with TSC
skew on start-up.
If a transition tells us what is or is not running on p, update
p->current (either to v or NULL). Otherwise leave it alone.
If neither, do nothing.
Reifying those rules:
- If we're continuing to run, set v to RUNNING, and use p->first_tsc
as the runstate time.
- If we're starting to run, set v to RUNNING, and use ri->tsc as the
runstate time.
- If v is being deschedled, leave v in the INIT state to avoid dealing
with TSC skew; but set p->current to NULL so that whatever is
scheduled next won't trigger the assert in vcpu_next_update().
- If a vcpu is waking up (switching from one non-runnable state to
another non-runnable state), leave v in INIT, and p in whatever
state it's in (which may be the default domain, or some other vcpu
which has already run).
While here, fix the comment above vcpu_start; it's called when the
vcpu state is INIT, not when current is the default domain.
Signed-off-by: George Dunlap <george.dunlap@cloud.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: aab4b38b5d77e3c65f44bacd56427a85b7392a11
master date: 2023-06-30 11:25:33 +0100
Ensure that the base address is 2M aligned, or else the page table
entries created would be corrupt as reserved bits on the PDE end up
set.
We have encountered a broken firmware where grub2 would end up loading
Xen at a non 2M aligned region when using the multiboot2 protocol, and
that caused a very difficult to debug triple fault.
If the alignment is not as required by the page tables print an error
message and stop the boot. Also add a build time check that the
calculation of symbol offsets don't break alignment of passed
addresses.
The check could be performed earlier, but so far the alignment is
required by the page tables, and hence feels more natural that the
check lives near to the piece of code that requires it.
Note that when booted as an EFI application from the PE entry point
the alignment check is already performed by
efi_arch_load_addr_check(), and hence there's no need to add another
check at the point where page tables get built in
efi_arch_memory_setup().
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0946068e7faea22868c577d7afa54ba4970ff520
master date: 2023-05-03 13:36:25 +0200
The usage of VCPU_SSHOTTMR_future in Linux prior to 4.7 is bogus.
When the hypervisor returns -ETIME (timeout in the past) Linux keeps
retrying to setup the timer with a higher timeout instead of
self-injecting a timer interrupt.
On boxes without any hardware assistance for logdirty we have seen HVM
Linux guests < 4.7 with 32vCPUs give up trying to setup the timer when
logdirty is enabled:
CE: Reprogramming failure. Giving up
CE: xen increased min_delta_ns to 1000000 nsec
CE: Reprogramming failure. Giving up
CE: Reprogramming failure. Giving up
CE: xen increased min_delta_ns to 506250 nsec
CE: xen increased min_delta_ns to 759375 nsec
CE: xen increased min_delta_ns to 1000000 nsec
CE: Reprogramming failure. Giving up
CE: Reprogramming failure. Giving up
CE: Reprogramming failure. Giving up
Freezing user space processes ...
INFO: rcu_sched detected stalls on CPUs/tasks: { 14} (detected by 10, t=60002 jiffies, g=4006, c=4005, q=14130)
Task dump for CPU 14:
swapper/14 R running task 0 0 1 0x00000000
Call Trace:
[<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
[<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
[<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
[<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
[<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
[<ffffffff900000d5>] ? start_cpu+0x5/0x14
INFO: rcu_sched detected stalls on CPUs/tasks: { 26} (detected by 24, t=60002 jiffies, g=6922, c=6921, q=7013)
Task dump for CPU 26:
swapper/26 R running task 0 0 1 0x00000000
Call Trace:
[<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
[<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
[<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
[<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
[<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
[<ffffffff900000d5>] ? start_cpu+0x5/0x14
INFO: rcu_sched detected stalls on CPUs/tasks: { 26} (detected by 24, t=60002 jiffies, g=8499, c=8498, q=7664)
Task dump for CPU 26:
swapper/26 R running task 0 0 1 0x00000000
Call Trace:
[<ffffffff90160f5d>] ? rcu_eqs_enter_common.isra.30+0x3d/0xf0
[<ffffffff907b9bde>] ? default_idle+0x1e/0xd0
[<ffffffff90039570>] ? arch_cpu_idle+0x20/0xc0
[<ffffffff9010820a>] ? cpu_startup_entry+0x14a/0x1e0
[<ffffffff9005d3a7>] ? start_secondary+0x1f7/0x270
[<ffffffff900000d5>] ? start_cpu+0x5/0x14
Thus leading to CPU stalls and a broken system as a result.
Workaround this bogus usage by ignoring the VCPU_SSHOTTMR_future in
the hypervisor. Old Linux versions are the only ones known to have
(wrongly) attempted to use the flag, and ignoring it is compatible
with the behavior expected by any guests setting that flag.
Note the usage of the flag has been removed from Linux by commit:
node.c:158:17: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
Andrew Cooper [Fri, 17 Feb 2023 11:16:32 +0000 (11:16 +0000)]
CI: Resync FreeBSD config with staging
CI: Update FreeBSD to 13.1
Also print the compiler version before starting. It's not easy to find
otherwise, and does change from time to time.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 5e7667ea2dd33e0e5e0f3a96db37fdb4ecd98fba)
CI: Update FreeBSD to 13.2
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit f872a624cbf92de9944483eea7674ef80ced1380)
CI: Update FreeBSD to 12.4
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit a73560896ce3c513460f26bd1c205060d6ec4f8a)
Andrew Cooper [Fri, 18 Aug 2023 10:05:00 +0000 (11:05 +0100)]
rombios: Remove the use of egrep
As the Alpine 3.18 container notes:
egrep: warning: egrep is obsolescent; using grep -E
Adjust it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5ddac3c2852ecc120acab86fc403153a2097c5dc)
Andrew Cooper [Fri, 18 Aug 2023 09:47:46 +0000 (10:47 +0100)]
rombios: Avoid using K&R function syntax
Clang-15 complains:
tcgbios.c:598:25: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void tcpa_calling_int19h()
^
void
C2x formally removes K&R syntax. The declarations for these functions in
32bitprotos.h are already ANSI compatible. Update the definitions to match.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a562afa5679d4a7ceb9cb9222fec1fea9a61f738)
Andrew Cooper [Thu, 17 Aug 2023 20:32:53 +0000 (21:32 +0100)]
rombios: Work around GCC issue 99578
GCC 12 objects to pointers derived from a constant:
util.c: In function 'find_rsdp':
util.c:429:16: error: array subscript 0 is outside array bounds of 'uint16_t[0]' {aka 'short unsigned int[]'} [-Werror=array-bounds]
429 | ebda_seg = *(uint16_t *)ADDR_FROM_SEG_OFF(0x40, 0xe);
cc1: all warnings being treated as errors
This is a GCC bug, but work around it rather than turning array-bounds
checking off generally.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e35138a2ffbe1fe71edaaaaae71063dc545a8416)
Jan Beulich [Fri, 18 Aug 2023 13:04:28 +0000 (15:04 +0200)]
x86emul: rework wrapping of libc functions in test and fuzzing harnesses
Our present approach is working fully behind the compiler's back. This
was found to not work with LTO. Employ ld's --wrap= option instead. Note
that while this makes the build work at least with new enough gcc (it
doesn't with gcc7, for example, due to tool chain side issues afaict),
according to my testing things still won't work when building the
fuzzing harness with afl-cc: While with the gcc7 tool chain I see afl-as
getting invoked, this does not happen with gcc13. Yet without using that
assembler wrapper the resulting binary will look uninstrumented to
afl-fuzz.
While checking the resulting binaries I noticed that we've gained uses
of snprintf() and strstr(), which only just so happen to not cause any
problems. Add a wrappers for them as well.
Since we don't have any actual uses of v{,sn}printf(), no definitions of
their wrappers appear (just yet). But I think we want
__wrap_{,sn}printf() to properly use __real_v{,sn}printf() right away,
which means we need delarations of the latter.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 6fba45ca3be1c5d46cddb1eaf371d9e69550b244)
Anthony PERARD [Mon, 31 Jul 2023 13:02:34 +0000 (15:02 +0200)]
Config.mk: evaluate XEN_COMPILE_ARCH and XEN_OS immediately
With GNU make 4.4, the number of execution of the command present in
these $(shell ) increased greatly. This is probably because as of make
4.4, exported variable are also added to the environment of $(shell )
construct.
So to avoid having these command been run more than necessary, we
will replace ?= by an equivalent but with immediate expansion.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a07414d989cf52e5e84192b78023bee1589bbda4)
Anthony PERARD [Mon, 31 Jul 2023 13:02:18 +0000 (15:02 +0200)]
build: evaluate XEN_BUILD_* and XEN_DOMAIN immediately
With GNU make 4.4, the number of execution of the command present in
these $(shell ) increased greatly. This is probably because as of make
4.4, exported variable are also added to the environment of $(shell )
construct.
Also, `make -d` shows a lot of these:
Makefile:15: not recursively expanding XEN_BUILD_DATE to export to shell function
Makefile:16: not recursively expanding XEN_BUILD_TIME to export to shell function
Makefile:17: not recursively expanding XEN_BUILD_HOST to export to shell function
Makefile:14: not recursively expanding XEN_DOMAIN to export to shell function
So to avoid having these command been run more than necessary, we
will replace ?= by an equivalent but with immediate expansion.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0c594c1b57ee2ecec5f70826c53a2cf02a9c2acb)
Anthony PERARD [Wed, 5 Jul 2023 06:29:49 +0000 (08:29 +0200)]
build: remove TARGET_ARCH, a duplicate of SRCARCH
The same command is used to generate the value of both $(TARGET_ARCH)
and $(SRCARCH), as $(ARCH) is an alias for $(XEN_TARGET_ARCH).
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ac27b3beb9b7b423d5563768de890c7594c21b4e)
Anthony PERARD [Wed, 5 Jul 2023 06:27:51 +0000 (08:27 +0200)]
build: remove TARGET_SUBARCH, a duplicate of ARCH
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a6ab7dd061338c33faef629cbe52ed1608571d84)
Anthony PERARD [Wed, 5 Jul 2023 06:25:03 +0000 (08:25 +0200)]
build: define ARCH and SRCARCH later
Defining ARCH and SRCARCH later in xen/Makefile allows to switch to
immediate evaluation variable type.
ARCH and SRCARCH depend on value defined in Config.mk and aren't used
for e.g. TARGET_SUBARCH or TARGET_ARCH, and not before they're needed in
a sub-make or a rule.
This will help reduce the number of times the shell rune is been
run.
With GNU make 4.4, the number of execution of the command present in
these $(shell ) increased greatly. This is probably because as of make
4.4, exported variable are also added to the environment of $(shell )
construct.
Also, `make -d` shows a lot of these:
Makefile:39: not recursively expanding SRCARCH to export to shell function
Makefile:38: not recursively expanding ARCH to export to shell function
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 58e0a3f3b2c430f8640ef9df67ac857b0008ebc8)
Anthony PERARD [Mon, 21 Aug 2023 13:53:47 +0000 (15:53 +0200)]
libxl: Use XEN_LIB_DIR to store bootloader from pygrub
In osstest, the jobs using pygrub on arm64 on the branch linux-linus
started to fails with:
[Errno 28] No space left on device
Error writing temporary copy of ramdisk
This is because /var/run is small when dom0 has only 512MB to work
with, /var/run is only 40MB. The size of both kernel and ramdisk on
this jobs is now about 42MB, so not enough space in /var/run.
So, to avoid writing a big binary in ramfs, we will use /var/lib
instead, like we already do when saving the device model state on
migration.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: ad89640ad766d3cb6c92fc8b6406ca6bbab44136
master date: 2023-08-08 09:45:20 +0200
Andrew Cooper [Wed, 4 Jan 2023 16:32:44 +0000 (16:32 +0000)]
x86/spec-ctrl: Mitigate Gather Data Sampling
This is part of XSA-435 / CVE-2022-40982
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 56d690efd3ca3c68e1d222f259fb3d216206e8e5)
Andrew Cooper [Wed, 4 Jan 2023 17:32:44 +0000 (17:32 +0000)]
x86/spec-ctrl: Enumerations for Gather Data Sampling
GDS_CTRL is introduced by the August 2023 microcode. GDS_NO is for current
and future processors not susceptible to GDS.
This is part of XSA-435 / CVE-2022-40982
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9f585f59d90c8d3a1b21369a852b7d7eee8a29b9)
Andrew Cooper [Mon, 27 Feb 2023 15:36:49 +0000 (15:36 +0000)]
x86/cpu-policy: Hide CLWB by default on SKX/CLX/CPX
The August 2023 microcode for GDS has an impact on the CLWB instruction. See
code comments for full details.
This is part of XSA-435 / CVE-2022-40982
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2dd06b4ea10891750af38e4a0e1efaeb0a9b3518)
On native, synthesise the SRSO bits by probing various hardware properties as
given by AMD.
Extend the IBPB-on-entry mitigations to Zen3/4 CPUs. There is a microcode
prerequisite to make this an effective mitigation.
This is part of XSA-434 / CVE-2023-20569
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 220c06e6fefe2378f40e2a7391f5e265a2aa50f7)
Andrew Cooper [Wed, 14 Jun 2023 08:13:28 +0000 (09:13 +0100)]
x86/spec-ctrl: Enumerations for Speculative Return Stack Overflow
AMD have specified new CPUID bits relating to SRSO.
* SRSO_NO indicates that hardware is no longer vulnerable to SRSO.
* IBPB_BRTYPE indicates that IBPB flushes branch type information too.
* SBPB indicates support for a relaxed form of IBPB that does not flush
branch type information.
Current CPUs (Zen4 and older) are not expected to enumerate these bits.
Native software is expected to synthesise them for guests using model and
microcode revision checks.
Two are just status bits, and SBPB is trivial to support for guests by
tweaking the reserved bit calculation in guest_wrmsr() and feature
dependencies. Expose all by default to guests, so they start showing up when
Xen synthesises them.
While adding feature dependenies for IBPB, fix up an overlooked issue from
XSA-422. It's inappropriate to advertise that IBPB flushes RET predictions if
IBPB is unavailable itself.
This is part of XSA-434 / CVE-2023-20569
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2280b0ee2aed6e0fd4af3fa31bf99bc04d038bfe)
Andrew Cooper [Thu, 27 Jul 2023 19:03:28 +0000 (20:03 +0100)]
x86/spec-ctrl: Rework ibpb_calculations()
... in order to make the SRSO mitigations easier to integrate.
* Check for AMD/Hygon CPUs directly, rather than assuming based on IBPB.
In particular, Xen supports synthesising the IBPB bit to guests on Intel to
allow IBPB while dissuading the use of (legacy) IBRS.
* Collect def_ibpb_entry rather than opencoding the BTC_NO calculation for
both opt_ibpb_entry_{pv,hvm}.
No functional change.
This is part of XSA-434 / CVE-2023-20569
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 292f68fb77196a35ac92b296792770d0f3190d75)
Andrew Cooper [Wed, 17 May 2023 09:13:36 +0000 (10:13 +0100)]
x86/cpu-policy: Advertise MSR_ARCH_CAPS to guests by default
With xl/libxl now able to control the policy bits for MSR_ARCH_CAPS, it is
safe to advertise to guests by default. In turn, we don't need the special
case to expose details to dom0.
This advertises MSR_ARCH_CAPS to guests on *all* Intel hardware, even if the
register content ends up being empty.
- Advertising ARCH_CAPS and not RSBA signals "retpoline is safe here and
everywhere you might migrate to". This is important because it avoids the
guest kernel needing to rely on model checks.
- Alternatively, levelling for safety across the Broadwell/Skylake divide
requires advertising ARCH_CAPS and RSBA, meaning "retpoline not safe on
some hardware you might migrate to".
On Cascade Lake and later hardware, guests can now see RDCL_NO (not vulnerable
to Meltdown) amongst others. This causes substantial performance
improvements, as guests are no longer applying software mitigations in cases
where they don't need to.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4b2cdbfe766e5666e6754198946df2dc16f6a642)
Jan Beulich [Thu, 3 Aug 2023 15:35:39 +0000 (17:35 +0200)]
libxl: allow building with old gcc again
We can't use initializers of unnamed struct/union members just yet.
Fixes: d638fe233cb3 ("libxl: use the cpuid feature names from cpufeatureset.h") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1aa5acbbec3f37bf38d78fa96d210053f8e8efd5)
Jan Beulich [Thu, 3 Aug 2023 15:35:26 +0000 (17:35 +0200)]
libxl: avoid shadowing of index()
Because of -Wshadow the build otherwise fails with old enough glibc.
While there also obey line length limits for msr_add().
Fixes: 6d21cedbaa34 ("libxl: add support for parsing MSR features") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4f6afde88be3e8960eb311d16ac41d44ab71ed10)
Introduce support for handling MSR features in
libxl_cpuid_parse_config(). The MSR policies are added to the
libxl_cpuid_policy like the CPUID one, which gets passed to
xc_cpuid_apply_policy().
This allows existing users of libxl to provide MSR related features as
key=value pairs to libxl_cpuid_parse_config() without requiring the
usage of a different API.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 6d21cedbaa34b3a3856f964189e911112c732b21)
libxl: use the cpuid feature names from cpufeatureset.h
The current implementation in libxl_cpuid_parse_config() requires
keeping a list of cpuid feature bits that should be mostly in sync
with the contents of cpufeatureset.h.
Avoid such duplication by using the automatically generated list of
cpuid features in INIT_FEATURE_NAMES in order to map feature names to
featureset bits, and then translate from featureset bits into cpuid
leaf, subleaf, register tuple.
Note that the full contents of the previous cpuid translation table
can't be removed. That's because some feature names allowed by libxl
are not described in the featuresets, or because naming has diverged
and the previous nomenclature is preserved for compatibility reasons.
Should result in no functional change observed by callers, albeit some
new cpuid features will be available as a result of the change.
While there constify cpuid_flags name field.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit d638fe233cb3a45105319df75df0edfed2fde5a5)
libxl: split logic to parse user provided CPUID features
Move the CPUID value parsers out of libxl_cpuid_parse_config() into a
newly created cpuid_add() local helper. This is in preparation for
also adding MSR feature parsing support.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit e2b1da9b8fda0ed7d3dca7bd15829cfea496973a)
Add a new array field to libxl_cpuid_policy in order to store the MSR
policies.
Adding the MSR data in the libxl_cpuid_policy_list type is done so
that existing users can seamlessly pass MSR features as part of the
CPUID data, without requiring the introduction of a separate
domain_build_info field, and a new set of handlers functions.
Note that support for parsing the old JSON format is kept, as that's
required in order to restore domains or received migrations from
previous tool versions. Differentiation between the old and the new
formats is done based on whether the contents of the 'cpuid' field is
an array or a map JSON object.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 5b80cecb747b2176b9e85f6e7aa7be83416d77e1)
Currently libxl_cpuid_policy_list is an opaque type to the users of
libxl, and internally it's an array of xc_xend_cpuid objects.
Change the type to instead be a structure that contains one array for
CPUID policies, in preparation for it also holding another array for
MSR policies.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 4825d19603580949144ac2ac5cb22df75c9da954)
libs/guest: introduce support for setting guest MSRs
Like it's done with CPUID, introduce support for passing MSR values to
xc_cpuid_apply_policy(). The chosen format for expressing MSR policy
data matches the current one used for CPUID. Note that existing
callers of xc_cpuid_apply_policy() can pass NULL as the value for the
newly introduced 'msr' parameter in order to preserve the same
functionality, and in fact that's done in libxl on this patch.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit ed742cf1b65c822759833027ca5cbb087c506a41)
Andrew Cooper [Wed, 24 May 2023 14:41:21 +0000 (15:41 +0100)]
x86/cpu-policy: Derive RSBA/RRSBA for guest policies
The RSBA bit, "RSB Alternative", means that the RSB may use alternative
predictors when empty. From a practical point of view, this mean "Retpoline
not safe".
Enhanced IBRS (officially IBRS_ALL in Intel's docs, previously IBRS_ATT) is a
statement that IBRS is implemented in hardware (as opposed to the form
retrofitted to existing CPUs in microcode).
The RRSBA bit, "Restricted-RSBA", is a combination of RSBA, and the eIBRS
property that predictions are tagged with the mode in which they were learnt.
Therefore, it means "when eIBRS is active, the RSB may fall back to
alternative predictors but restricted to the current prediction mode". As
such, it's stronger statement than RSBA, but still means "Retpoline not safe".
CPUs are not expected to enumerate both RSBA and RRSBA.
Add feature dependencies for EIBRS and RRSBA. While technically they're not
linked, absolutely nothing good can come of letting the guest see RRSBA
without EIBRS. Nor a guest seeing EIBRS without IBRSB. Furthermore, we use
this dependency to simplify the max derivation logic.
The max policies gets RSBA and RRSBA unconditionally set (with the EIBRS
dependency maybe hiding RRSBA). We can run any VM, even if it has been told
"somewhere you might run, Retpoline isn't safe".
The default policies are more complicated. A guest shouldn't see both bits,
but it needs to see one if the current host suffers from any form of RSBA, and
which bit it needs to see depends on whether eIBRS is visible or not.
Therefore, the calculation must be performed after sanitise_featureset().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e0586a4ff514590eec50185e2440b97f9a31cb7f)
Andrew Cooper [Thu, 25 May 2023 19:31:22 +0000 (20:31 +0100)]
x86/spec-ctrl: Fix up the RSBA/RRSBA bits as appropriate
In order to level a VM safely for migration, the toolstack needs to know the
RSBA/RRSBA properties of the CPU, whether or not they happen to be enumerated.
See the code comment for details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 36525a964fb629d0bd26e5a1c42de467af7a42a7)