Roger Pau Monné [Mon, 4 Mar 2024 17:08:48 +0000 (18:08 +0100)]
x86/mm: add speculation barriers to open coded locks
Add a speculation barrier to the clearly identified open-coded lock taking
functions.
Note that the memory sharing page_lock() replacement (_page_lock()) is left
as-is, as the code is experimental and not security supported.
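As an illustration of the pattern (a simplified, standalone sketch, not the actual Xen code; Xen's real barrier is its block_lock_speculation() helper):

    #include <stdatomic.h>

    static inline void block_lock_speculation(void)
    {
        /* On x86, LFENCE does not let younger instructions execute until
         * all older instructions have completed locally. */
        __asm__ volatile ( "lfence" ::: "memory" );
    }

    static inline void open_coded_lock(atomic_flag *lock)
    {
        while ( atomic_flag_test_and_set_explicit(lock, memory_order_acquire) )
            ;                        /* spin until the lock is free */
        block_lock_speculation();    /* fence only once the lock is held */
    }

The point is the placement: the barrier executes strictly after the acquisition succeeds, so a speculative path cannot enter the protected region with the lock still free.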
This is part of XSA-453 / CVE-2024-2193
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 42a572a38e22a97d86a4b648a22597628d5b42e4)
Roger Pau Monné [Mon, 4 Mar 2024 13:29:36 +0000 (14:29 +0100)]
locking: attempt to ensure lock wrappers are always inline
In order to prevent the locking speculation barriers from being inside of
`call`ed functions that could be speculatively bypassed.
While there also add an extra locking barrier to _mm_write_lock() in the branch
taken when the lock is already held.
Note some functions are switched to use the unsafe variants (without speculation
barrier) of the locking primitives, but a speculation barrier is always added
to the exposed public lock wrapping helper. That's the case with
sched_spin_lock_double() or pcidevs_lock() for example.
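A hypothetical sketch of that shape (names modelled on, but not copied from, the Xen sources):

    #define always_inline inline __attribute__((__always_inline__))

    typedef struct spinlock spinlock_t;

    void _spin_lock(spinlock_t *l);        /* "unsafe" variant: no barrier */

    static inline void block_lock_speculation(void)
    {
        __asm__ volatile ( "lfence" ::: "memory" );
    }

    /* The public wrapper is forced inline, so the barrier is emitted at
     * every call site and cannot be bypassed by speculating over a `call`. */
    static always_inline void spin_lock(spinlock_t *l)
    {
        _spin_lock(l);
        block_lock_speculation();
    }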
This is part of XSA-453 / CVE-2024-2193
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 197ecd838a2aaf959a469df3696d4559c4f8b762)
Roger Pau Monné [Tue, 13 Feb 2024 16:57:38 +0000 (17:57 +0100)]
percpu-rwlock: introduce support for blocking speculation into critical regions
Add direct calls to block_lock_speculation() where required in order to prevent
speculation into the lock protected critical regions. Also convert
_percpu_read_lock() from inline to always_inline.
Note that _percpu_write_lock() has been modified to use the non-speculation-safe
variants of the locking primitives, as a speculation barrier is added
unconditionally by the calling wrapper.
This is part of XSA-453 / CVE-2024-2193
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f218daf6d3a3b847736d37c6a6b76031a0d08441)
Roger Pau Monné [Tue, 13 Feb 2024 15:08:52 +0000 (16:08 +0100)]
rwlock: introduce support for blocking speculation into critical regions
Introduce inline wrappers as required and add direct calls to
block_lock_speculation() in order to prevent speculation into the rwlock
protected critical regions.
Note the rwlock primitives are adjusted to use the non-speculation-safe
variants of the spinlock handlers, as a speculation barrier is added in the
rwlock calling wrappers.
trylock variants are protected by using lock_evaluate_nospec().
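A rough sketch of the trylock treatment (Xen's lock_evaluate_nospec() builds on its evaluate_nospec() machinery; this stand-in simply forces the boolean to resolve before it can steer execution):

    #include <stdbool.h>

    typedef struct rwlock rwlock_t;
    bool _rw_trylock(rwlock_t *l);       /* raw primitive, no protection */

    static inline bool lock_evaluate_nospec_example(bool locked)
    {
        __asm__ volatile ( "lfence" ::: "memory" );
        return locked;
    }

    static inline bool rw_trylock_example(rwlock_t *l)
    {
        return lock_evaluate_nospec_example(_rw_trylock(l));
    }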
This is part of XSA-453 / CVE-2024-2193
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a1fb15f61692b1fa9945fc51f55471ace49cdd59)
Roger Pau Monné [Tue, 13 Feb 2024 12:08:05 +0000 (13:08 +0100)]
x86/spinlock: introduce support for blocking speculation into critical regions
Introduce a new Kconfig option to block speculation into lock protected
critical regions. The Kconfig option is enabled by default, but the mitigation
won't be engaged unless it's explicitly enabled in the command line using
`spec-ctrl=lock-harden`.
Convert the spinlock acquire macros into always-inline functions, and introduce
a speculation barrier after the lock has been taken. Note the speculation
barrier is not placed inside the implementation of the spin lock functions, so
as to prevent speculation from falling through the call to the lock functions,
resulting in the barrier also being skipped.
trylock variants are protected using a construct akin to the existing
evaluate_nospec().
This patch only implements the speculation barrier for x86.
Note spin locks are the only locking primitive taken care of in this change;
further locking primitives will be adjusted by separate changes.
This is part of XSA-453 / CVE-2024-2193
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7ef0084418e188d05f338c3e028fbbe8b6924afa)
Andrew Cooper [Fri, 2 Feb 2024 00:39:42 +0000 (00:39 +0000)]
xen: Swap order of actions in the FREE*() macros
Wherever possible, it is a good idea to NULL out the visible reference to an
object prior to freeing it. The FREE*() macros already collect together both
parts, making it easy to adjust.
This has a marginal code generation improvement, as some of the calls to the
free() function can be tailcall optimised.
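A simplified sketch of the reordering (macro name and callee illustrative, not the Xen originals):

    #include <stdlib.h>

    /* Clear the visible reference first, then free the saved copy.  With
     * free() as the final action, the compiler may emit it as a tail call. */
    #define FREE_EXAMPLE(p) do {    \
        void *ptr_ = (p);           \
        (p) = NULL;                 \
        free(ptr_);                 \
    } while ( 0 )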
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c4f427ec879e7c0df6d44d02561e8bee838a293e)
Nicola reports that the XSA-438 fix introduced new MISRA violations because of
some incidental tidying it tried to do. The parameter is useless, so resolve
the MISRA regression by removing it.
hap_update_cr3() discards the parameter entirely, while sh_update_cr3() uses
it to distinguish internal and external callers and therefore whether the
paging lock should be taken.
However, we have paging_lock_recursive() for this purpose, which also avoids
the ability for the shadow internal callers to accidentally not hold the lock.
Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference")
Reported-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit e71157d1ac2a7fbf413130663cf0a93ff9fbcf7e)
Andrew Cooper [Thu, 22 Jun 2023 22:32:19 +0000 (23:32 +0100)]
x86/spec-ctrl: Mitigation Register File Data Sampling
RFDS affects Atom cores, also branded E-cores, between the Goldmont and
Gracemont microarchitectures. This includes Alder Lake and Raptor Lake hybrid
client systems which have a mix of Gracemont and other types of cores.
Two new bits have been defined; RFDS_CLEAR to indicate VERW has more side
effects, and RFDS_NO to indicate that the system is unaffected. Plenty of
unaffected CPUs won't be getting RFDS_NO retrofitted in microcode, so we
synthesise it. Alder Lake and Raptor Lake Xeon-E's are unaffected due to
their platform configuration, and we must use the Hybrid CPUID bit to
distinguish them from their non-Xeon counterparts.
Like MD_CLEAR and FB_CLEAR, RFDS_CLEAR needs OR-ing across a resource pool, so
set it in the max policies and reflect the host setting in default.
This is part of XSA-452 / CVE-2023-28746.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fb5b6f6744713410c74cfc12b7176c108e3c9a31)
Andrew Cooper [Tue, 5 Mar 2024 19:33:37 +0000 (19:33 +0000)]
x86/spec-ctrl: VERW-handling adjustments
... before we add yet more complexity to this logic. Mostly expanded
comments, but with three minor changes.
1) Introduce cpu_has_useful_md_clear to simplify later logic in this patch and
future ones.
2) We only ever need SC_VERW_IDLE when SMT is active. If SMT isn't active,
then there's no re-partition of pipeline resources based on thread-idleness
to worry about.
3) The logic to adjust HVM VERW based on L1D_FLUSH is unmaintainable and, as
it turns out, wrong. SKIP_L1DFL is just a hint bit, whereas opt_l1d_flush
is the relevant decision of whether to use L1D_FLUSH based on
susceptibility and user preference.
Rewrite the logic so it can be followed, and incorporate the fact that when
FB_CLEAR is visible, L1D_FLUSH isn't a safe substitution.
This is part of XSA-452 / CVE-2023-28746.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1eb91a8a06230b4b64228c9a380194f8cfe6c5e2)
Andrew Cooper [Mon, 12 Feb 2024 17:50:43 +0000 (17:50 +0000)]
x86/spec-ctrl: Rename VERW related options
VERW is going to be used for a 3rd purpose, and the existing nomenclature
didn't survive the Stale MMIO issues terribly well.
Rename the command line option from `md-clear=` to `verw=`. This is more
consistent with other options which tend to be named based on what they're
doing, not which feature enumeration they use behind the scenes. Retain
`md-clear=` as a deprecated alias.
Rename opt_md_clear_{pv,hvm} and opt_fb_clear_mmio to opt_verw_{pv,hvm,mmio},
which has a side effect of making spec_ctrl_init_domain() rather clearer to
follow.
No functional change.
This is part of XSA-452 / CVE-2023-28746.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f7603ca252e4226739eb3129a5290ee3da3f8ea4)
Andrew Cooper [Sat, 27 Jan 2024 18:20:56 +0000 (18:20 +0000)]
x86/spec-ctrl: Perform VERW flushing later in exit paths
On parts vulnerable to RFDS, VERW's side effects are extended to scrub all
non-architectural entries in various Physical Register Files. To remove all
of Xen's values, the VERW must be after popping the GPRs.
Rework SPEC_CTRL_COND_VERW to default to a CPUINFO_error_code %rsp position,
but with overrides for other contexts. Identify that it clobbers eflags; this
is particularly relevant for the SYSRET path.
For the IST exit return to Xen, have the main SPEC_CTRL_EXIT_TO_XEN put a
shadow copy of spec_ctrl_flags, as GPRs can't be used at the point we want to
issue the VERW.
This is part of XSA-452 / CVE-2023-28746.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 0a666cf2cd99df6faf3eebc81a1fc286e4eca4c7)
Andrew Cooper [Fri, 23 Jun 2023 10:32:00 +0000 (11:32 +0100)]
x86/vmx: Perform VERW flushing later in the VMExit path
Broken out of the following patch because this change is subtle enough on its
own. See it for the rationale of why we're moving VERW.
As for how, extend the trick already used to hold one condition in
flags (RESUME vs LAUNCH) through the POPing of GPRs.
Move the MOV CR earlier. Intel specify flags to be undefined across it.
Encode the two conditions we want using SF and PF. See the code comment for
exactly how.
Leave a comment to explain the lack of any content around
SPEC_CTRL_EXIT_TO_VMX, but leave the block in place. Sods law says if we
delete it, we'll need to reintroduce it.
This is part of XSA-452 / CVE-2023-28746.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 475fa20b7384464210f42bad7195f87bd6f1c63f)
Andrew Cooper [Thu, 29 Feb 2024 11:26:40 +0000 (11:26 +0000)]
x86/cpu-policy: Allow for levelling of VERW side effects
MD_CLEAR and FB_CLEAR need OR-ing across a migrate pool. Allow this, by
having them unconditionally set in max, with the host values reflected in
default. Annotate the bits as having special properties.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit de17162cafd27f2865a3102a2ec0f386a02ed03d)
Andrew Cooper [Sat, 27 Jan 2024 17:52:09 +0000 (17:52 +0000)]
x86/entry: Introduce EFRAME_* constants
restore_all_guest() does a lot of manipulation of the stack after popping the
GPRs, and uses raw %rsp displacements to do so. Also, almost all entrypaths
use raw %rsp displacements prior to pushing GPRs.
Provide better mnemonics, to aid readability and reduce the chance of errors
when editing.
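Purely illustrative flavour of such constants (names and offsets here are invented, not the real ones):

    /* Mnemonics for fields addressed relative to %rsp around GPR push/pop. */
    #define EFRAME_entry_vector  (4 * 8)   /* saved entry vector           */
    #define EFRAME_error_code    (5 * 8)   /* hardware/software error code */

    /* In asm, e.g.:  movl $0, EFRAME_error_code(%rsp)  */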
No functional change. The resulting binary is identical.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37541208f119a9c552c6c6c3246ea61be0d44035)
Jan Beulich [Tue, 27 Feb 2024 13:13:21 +0000 (14:13 +0100)]
x86: account for shadow stack in exception-from-stub recovery
Dealing with exceptions raised from within emulation stubs involves
discarding the return address (replaced by exception related information).
Such discarding of course also requires removing the corresponding entry
from the shadow stack.
Also amend the comment in fixup_exception_return(), to further clarify
why use of ptr[1] can't be an out-of-bounds access.
This is CVE-2023-46841 / XSA-451.
Fixes: 209fb9919b50 ("x86/extable: Adjust extable handling to be shadow stack compatible")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 91f5f7a9154919a765c3933521760acffeddbf28
master date: 2024-02-27 13:49:22 +0100
Roger Pau Monné [Tue, 30 Jan 2024 13:42:41 +0000 (14:42 +0100)]
pci: fail device assignment if phantom functions cannot be assigned
The current behavior is that no error is reported if (some) phantom functions
fail to be assigned during device add or assignment, so the operation succeeds
even if some phantom functions are not correctly set up.
This can lead to devices possibly being successfully assigned to a domU while
some of the device phantom functions are still assigned to dom0. Even when the
device is assigned to domIO before being assigned to a domU, phantom functions
might fail to be assigned to domIO, and also fail to be assigned to the domU,
leaving them assigned to dom0.
Since the device can generate requests using the IDs of those phantom
functions, given the scenario above a device in such state would be in control
of a domU, but still capable of generating transactions that use a context ID
targeting dom0 owned memory.
Modify device assign in order to attempt to deassign the device if phantom
functions failed to be assigned.
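A pseudo-C sketch of the adjusted flow (all helper names hypothetical):

    struct domain;
    struct pci_dev { unsigned int devfn; };

    int assign_device(struct domain *d, unsigned int devfn);
    int deassign_device(struct domain *d, unsigned int devfn);
    unsigned int next_phantom_fn(const struct pci_dev *pdev, unsigned int devfn);

    static int assign_with_phantoms(struct domain *d, struct pci_dev *pdev)
    {
        unsigned int df;
        int rc = assign_device(d, pdev->devfn);

        for ( df = next_phantom_fn(pdev, pdev->devfn); !rc && df;
              df = next_phantom_fn(pdev, df) )
            rc = assign_device(d, df);

        if ( rc )
            /* Roll back, rather than leaving phantom functions in dom0. */
            deassign_device(d, pdev->devfn);

        return rc;
    }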
Note that device addition is not modified in the same way, as in that case the
device is assigned to a trusted domain, and hence partial assign can lead to
device malfunction but not a security issue.
This is XSA-449 / CVE-2023-46839
Fixes: 4e9950dc1bd2 ('IOMMU: add phantom function support')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb4ecb3cc17b02c2814bc817efd05f3f3ba33d1e
master date: 2024-01-30 14:28:01 +0100
Michal Orzel [Tue, 12 Dec 2023 13:53:13 +0000 (14:53 +0100)]
xen/arm: page: Avoid pointer overflow on cache clean & invalidate
On Arm32, after cleaning and invalidating the last dcache line of the top
domheap page i.e. VA = 0xfffff000 (as a result of flushing the page to
RAM), we end up adding the value of a dcache line size to the pointer
once again, which results in a pointer arithmetic overflow (with 64B line
size, operation 0xffffffc0 + 0x40 overflows to 0x0). Such behavior is
undefined and, given the wide range of compiler versions we support, it is
difficult to determine what could happen in such a scenario.
Modify clean_and_invalidate_dcache_va_range() as well as
clean_dcache_va_range() and invalidate_dcache_va_range() due to similarity
of handling to prevent pointer arithmetic overflow. Modify the loops to
use an additional variable to store the index of the next cacheline.
Add an assert to prevent passing a region that wraps around which is
illegal and would end up in a page fault anyway (region 0-2MB is
unmapped). Lastly, return early if size passed is 0.
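A sketch of the overflow-safe loop shape (the line size is hardcoded and the per-line flush stubbed out for illustration):

    void flush_one_line(const void *va);         /* hypothetical helper */

    #define LINE_BYTES 64u

    static void clean_dcache_va_range_example(const void *p, unsigned long size)
    {
        unsigned long off;

        if ( !size )
            return;
        /* The real code also asserts that [p, p + size) doesn't wrap. */
        for ( off = 0; off < size; off += LINE_BYTES )
            flush_one_line((const char *)p + off);
    }

Tracking an offset instead of advancing the pointer means the expression p + size, which wraps to 0 at the top of the address space, is never computed.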
Note that on Arm64, we don't have this problem given that the max VA
space we support is 48-bits.
Andrew Cooper [Thu, 26 Oct 2023 13:37:38 +0000 (14:37 +0100)]
x86/spec-ctrl: Remove conditional IRQs-on-ness for INT $0x80/0x82 paths
Before speculation defences, some paths in Xen could genuinely get away with
being IRQs-on at entry. But XPTI invalidated this property on most paths, and
attempting to maintain it on the remaining paths was a mistake.
Fast forward, and DO_SPEC_CTRL_COND_IBPB (protection for AMD BTC/SRSO) is not
IRQ-safe, running with IRQs enabled in some cases. The other actions taken on
these paths happen to be IRQ-safe.
Make entry_int82() and int80_direct_trap() unconditionally Interrupt Gates
rather than Trap Gates. Remove the conditional re-adjustment of
int80_direct_trap() in smp_prepare_cpus(), and have entry_int82() explicitly
enable interrupts when safe to do so.
In smp_prepare_cpus(), with the conditional re-adjustment removed, the
clearing of pv_cr3 is the only remaining action gated on XPTI, and it is out
of place anyway, repeating work already done by smp_prepare_boot_cpu(). Drop
the entire if() condition to avoid leaving an incorrect vestigial remnant.
Also drop comments which make incorrect statements about when it's safe to
enable interrupts.
This is XSA-446 / CVE-2023-46836
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit a48bb129f1b9ff55c22cf6d2b589247c8ba3b10e)
Roger Pau Monne [Wed, 11 Oct 2023 11:14:21 +0000 (13:14 +0200)]
iommu/amd-vi: use correct level for quarantine domain page tables
The current setup of the quarantine page tables assumes that the quarantine
domain (dom_io) has been initialized with an address width of
DEFAULT_DOMAIN_ADDRESS_WIDTH (48).
However dom_io, being a PV domain, gets the AMD-Vi IOMMU page-table levels based
on the maximum (hot pluggable) RAM address, and hence on systems with no RAM
above the 512GB mark only 3 page-table levels are configured in the IOMMU.
On systems without RAM above the 512GB boundary amd_iommu_quarantine_init()
will setup page tables for the scratch page with 4 levels, while the IOMMU will
be configured to use 3 levels only. The page destined to be used as level 1,
and to contain a directory of PTEs, ends up being used as the address in a PTE
itself, and thus the level 1 page becomes the leaf page. Without the level
mismatch it's the level 0 page that would be the leaf page instead.
The level 1 page won't be used as such, and hence it's not possible to use it
to gain access to other memory on the system. However that page is not cleared
in amd_iommu_quarantine_init() as part of re-initialization of the device
quarantine page tables, and hence data on the level 1 page can be leaked
between device usages.
Fix this by making sure the paging levels setup by amd_iommu_quarantine_init()
match the number configured on the IOMMUs.
Note that IVMD regions are not affected by this issue, as those areas are
mapped taking the configured paging levels into account.
This is XSA-445 / CVE-2023-46835
Fixes: ea38867831da ('x86 / iommu: set up a scratch page in the quarantine domain')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fe1e4668b373ec4c1e5602e75905a9fa8cc2be3f)
Andrew Cooper [Tue, 26 Sep 2023 19:06:57 +0000 (20:06 +0100)]
x86/pv: Correct the auditing of guest breakpoint addresses
The use of access_ok() is buggy, because it permits access to the compat
translation area. 64bit PV guests don't use the XLAT area, but on AMD
hardware, the DBEXT feature allows a breakpoint to match up to a 4G aligned
region, allowing the breakpoint to reach outside of the XLAT area.
Prior to c/s cda16c1bb223 ("x86: mirror compat argument translation area for
32-bit PV"), the live GDT was within 4G of the XLAT area.
All together, this allowed a malicious 64bit PV guest on AMD hardware to place
a breakpoint over the live GDT, and trigger a #DB livelock (CVE-2015-8104).
Introduce breakpoint_addr_ok() and explain why __addr_ok() happens to be an
appropriate check in this case.
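The shape of the new check, simplified (Xen's real __addr_ok() is slightly broader, also accepting addresses above the hypervisor range):

    /* Breakpoints may only target canonical, non-Xen-owned addresses;
     * unlike access_ok(), this excludes the compat XLAT area too. */
    #define __addr_ok_example(addr) ((unsigned long)(addr) < (1UL << 47))
    #define breakpoint_addr_ok(addr) __addr_ok_example(addr)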
For Xen 4.14 and later, this is a latent bug because the XLAT area has moved
to be on its own with nothing interesting adjacent. For Xen 4.13 and older on
AMD hardware, this fixes a PV-trigger-able DoS.
This is part of XSA-444 / CVE-2023-34328.
Fixes: 65e355490817 ("x86/PV: support data breakpoint extension registers")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit dc9d9aa62ddeb14abd5672690d30789829f58f7e)
Andrew Cooper [Tue, 26 Sep 2023 19:06:57 +0000 (20:06 +0100)]
x86/svm: Fix asymmetry with AMD DR MASK context switching
The handling of MSR_DR{0..3}_MASK is asymmetric between PV and HVM guests.
HVM guests context switch in based on the guest view of DBEXT, whereas PV
guests switch in based on the host capability. Both guest types leave the
context dirty for the next vCPU.
This leads to the following issue:
* PV or HVM vCPU has debugging active (%dr7 + mask)
* Switch out deactivates %dr7 but leaves other state stale in hardware
* HVM vCPU with debugging active, but unable to see DBEXT, is switched in
* Switch in loads %dr7 but leaves the mask MSRs alone
Now, the HVM vCPU is operating in the context of the prior vCPU's mask MSR,
and furthermore in a case where it genuinely expects there to be no masking
MSRs.
As a stopgap, adjust the HVM path to switch in/out the masks based on host
capabilities rather than guest visibility (i.e. like the PV path). Adjustment
of the intercepts still needs to be dependent on the guest visibility of
DBEXT.
This is part of XSA-444 / CVE-2023-34327
Fixes: c097f54912d3 ("x86/SVM: support data breakpoint extension registers")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 5d54282f984bb9a7a65b3d12208584f9fdf1c8e1)
libxl: limit bootloader execution in restricted mode
Introduce a timeout for bootloader execution when running in restricted mode.
Allow overriding the default timeout with an environment-provided value.
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 9c114178ffd700112e91f5ec66cf5151b9c9a8cc)
libxl: add support for running bootloader in restricted mode
Much like the device model depriv mode, add the same kind of support for the
bootloader. Such feature allows passing a UID as a parameter for the
bootloader to run as, together with the bootloader itself taking the necessary
actions to isolate.
Note that the user to run the bootloader as must have the right permissions to
access the guest disk image (in read mode only), and that the bootloader will
be run in non-interactive mode when restricted.
If enabled, bootloader restrict mode will attempt to reuse the user(s) from the
QEMU depriv implementation if no user is provided in the configuration file or
the environment. See docs/features/qemu-deprivilege.pandoc for more
information about how to set up those users.
Bootloader restrict mode is not enabled by default as it requires certain
setup to be done first (setup of the user(s) to use in restrict mode).
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 1f762642d2cad1a40634e3280361928109d902f1)
Introduce a --runas=<uid> flag to deprivilege pygrub on Linux and *BSDs. It
also implicitly creates a chroot env where it drops a deprivileged forked
process. The chroot itself is cleaned up at the end.
If the --runas arg is present, then pygrub forks, leaving the child to
deprivilege itself, and waiting for it to complete. When the child exits,
the parent performs cleanup and exits with the same error code.
This is roughly what the child does:
1. Initialize libfsimage (this loads every .so in memory so the chroot
can avoid bind-mounting /{,usr}/lib*)
2. Create a temporary empty chroot directory
3. Mount tmpfs in it
4. Bind mount the disk inside, because libfsimage expects a path, not a
file descriptor.
5. Remount the root tmpfs to be stricter (ro,nosuid,nodev)
6. Set RLIMIT_FSIZE to a sensibly high amount (128 MiB)
7. Depriv gid, groups and uid
With this scheme in place, the "output" files are writable (up to
RLIMIT_FSIZE octets) and the exposed filesystem is immutable and contains
the only file we can't easily get rid of (the disk).
If running on Linux, the child process also unshares mount, IPC, and
network namespaces before dropping its privileges.
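A rough C rendition of steps 2-7 (pygrub does this in Python; error paths and the bind mount of the disk are elided, and the unshare step is Linux-only):

    #include <sys/mount.h>
    #include <sys/resource.h>
    #include <grp.h>
    #include <unistd.h>

    static int depriv_example(const char *root, uid_t uid, gid_t gid)
    {
        struct rlimit lim = { 128u << 20, 128u << 20 };   /* 128 MiB */

        if ( mount("none", root, "tmpfs", 0, NULL) )      /* step 3 */
            return -1;
        /* step 4: bind-mount the disk image somewhere under root */
        if ( mount(NULL, root, NULL,                      /* step 5 */
                   MS_REMOUNT | MS_RDONLY | MS_NOSUID | MS_NODEV, NULL) )
            return -1;
        if ( setrlimit(RLIMIT_FSIZE, &lim) )              /* step 6 */
            return -1;
        if ( chroot(root) || chdir("/") )
            return -1;
        /* step 7: drop gid and supplementary groups before uid */
        if ( setgid(gid) || setgroups(0, NULL) || setuid(uid) )
            return -1;
        return 0;
    }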
tools/libfsimage: Export a new function to preload all plugins
This is work required in order to let pygrub operate in a highly deprivileged
chroot mode. This patch adds a function that preloads every plugin, hence
ensuring that, on function exit, every shared library is loaded in memory.
The new "init" function is supposed to be used before depriv, but that's
fine because it's not acting on untrusted data.
This patch allows pygrub to get ahold of every RW file descriptor it needs
early on. A later patch will clamp the filesystem it can access so it can't
obtain any others.
There's a hypercall being issued in order to determine whether PV64 is
supported, but since Xen 4.3 that's always true, so it's not required.
Plus, this way we can avoid mapping the privcmd interface altogether in the
depriv pygrub.
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit f4b504c6170c446e61055cbd388ae4e832a9deca)
libfsimage/xfs: Sanity-check the superblock during mounts
Sanity-check the XFS superblock for wellformedness at the mount handler.
This forces pygrub to abort parsing a potentially malformed filesystem and
ensures the invariants assumed throughout the rest of the code hold.
Also, derive parameters from previously sanitized parameters where possible
(rather than reading them off the superblock).
The code doesn't try to avoid overflowing the end of the disk, because
that's an unlikely and benign error. Parameters used in calculations of
xfs_daddr_t (like the root inode index) aren't in critical need of being
sanitized.
The sanitization of agblklog is basically checking that no obvious
overflows happen on agblklog, and then ensuring agblocks is contained in
the range (2^(sb_agblklog-1), 2^sb_agblklog].
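A sketch of that range check (standalone; the sb_ field prefixes are dropped):

    #include <stdbool.h>
    #include <stdint.h>

    static bool agblklog_ok(uint8_t agblklog, uint32_t agblocks)
    {
        if ( !agblklog || agblklog > 32 )    /* reject shift overflows */
            return false;
        /* agblocks must lie in (2^(agblklog - 1), 2^agblklog]. */
        return agblocks > (1ULL << (agblklog - 1)) &&
               agblocks <= (1ULL << agblklog);
    }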
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 620500dd1baf33347dfde5e7fde7cf7fe347da5c)
Roger Pau Monne [Tue, 13 Jun 2023 13:01:05 +0000 (15:01 +0200)]
iommu/amd-vi: flush IOMMU TLB when flushing the DTE
The caching invalidation guidelines from the AMD-Vi specification (48882—Rev
3.07-PUB—Oct 2022) seem to be misleading on some hardware, as devices will
malfunction (see stale DMA mappings) if some fields of the DTE are updated but
the IOMMU TLB is not flushed. This has been observed in practice on AMD
systems. Due to the lack of guidance from the currently published
specification this patch aims to increase the flushing done in order to prevent
device malfunction.
In order to fix this, issue an INVALIDATE_IOMMU_PAGES command from
amd_iommu_flush_device(), flushing all the address space. Note this requires
callers to be adjusted in order to pass the DomID on the DTE prior to the
modification.
Some call sites don't provide a valid DomID to amd_iommu_flush_device() in
order to avoid the flush. That's because the device had address translations
disabled and hence the previous DomID on the DTE is not valid. Note the
current logic relies on the entity disabling address translations to also flush
the TLB of the in use DomID.
Device I/O TLB flushing when ATS is enabled is not covered by the current
change, as ATS usage is not security supported.
This is XSA-442 / CVE-2023-34326
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fc98b97084a46884acef9320e643faf40d42212)
The function domain_entry_fix() will initially be called to check if the
quota is correct before attempting to commit any nodes. So it would be
possible that accounting is temporarily negative. This is the case
in the following sequence:
1) Create 50 nodes
2) Start two transactions
3) Delete all the nodes in each transaction
4) Commit the two transactions
Because the first transaction will have succeeded and updated the
accounting, there is no guarantee that 'd->nbentry + num' will still
be above 0. So the assert() would be triggered.
The assert() was introduced in dbef1f748289 ("tools/xenstore: simplify
and fix per domain node accounting") with the assumption that the
value can't be negative. As this is not true, revert to the original
check, but restricted to the path where we don't update. Take the
opportunity to explain the rationale behind the check.
This is CVE-2023-34323 / XSA-440.
Fixes: dbef1f748289 ("tools/xenstore: simplify and fix per domain node accounting")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit c4e05c97f57d236040d1da5c1fbf6e3699dc86ea)
Jan Beulich [Wed, 20 Sep 2023 09:34:24 +0000 (10:34 +0100)]
x86/shadow: defer releasing of PV's top-level shadow reference
sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch has occurred.
Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.
While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.
This is CVE-2023-34322 / XSA-438.
Reported-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@cloud.com>
(cherry picked from commit fb0ff49fe9f784bfee0370c2a3c5f20e39d7a1cb)
Andrew Cooper [Wed, 30 Aug 2023 19:24:25 +0000 (20:24 +0100)]
x86/spec-ctrl: Mitigate the Zen1 DIV leakage
In the Zen1 microarchitecture, there is one divider in the pipeline which
services uops from both threads. In the case of #DE, the latched result from
the previous DIV to execute will be forwarded speculatively.
This is an interesting covert channel that allows two threads to communicate
without any system calls. It also allows userspace to obtain the result of
the most recent DIV instruction executed (even speculatively) in the core,
which can be from a higher privilege context.
Scrub the result from the divider by executing a non-faulting divide. This
needs performing on the exit-to-guest paths, and ist_exit-to-Xen.
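A sketch of such a scrub in isolation (the real one is folded into the exit-path assembly):

    static inline void div_scrub_example(void)
    {
        unsigned int q = 1, r = 0;

        /* EDX:EAX = 0:1 divided by 1 cannot fault, and overwrites the
         * divider's latched quotient and remainder. */
        asm volatile ( "div %2"
                       : "+a" (q), "+d" (r)
                       : "r" (1u) );
    }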
Alternatives in IST context is believed safe now that it's done in NMI
context.
This is XSA-439 / CVE-2023-20588.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b5926c6ecf05c28ee99c6248c42d691ccbf0c315)
Andrew Cooper [Fri, 15 Sep 2023 11:13:51 +0000 (12:13 +0100)]
x86/amd: Introduce is_zen{1,2}_uarch() predicates
We already have 3 cases using STIBP as a Zen1/2 heuristic, and are about to
introduce a 4th. Wrap the heuristic into a pair of predicates rather than
opencoding it, and the explanation of the heuristic, at each usage site.
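A sketch of the resulting predicates (cpu_has_amd_stibp standing in for the CPUID-derived flag):

    #include <stdbool.h>

    extern bool cpu_has_amd_stibp;

    /* Within Fam17h, Zen2 parts advertise STIBP while Zen1 parts do not. */
    static inline bool is_zen1_uarch(void) { return !cpu_has_amd_stibp; }
    static inline bool is_zen2_uarch(void) { return cpu_has_amd_stibp; }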
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit de1d265001397f308c5c3c5d3ffc30e7ef8c0705)
Andrew Cooper [Wed, 13 Sep 2023 12:53:33 +0000 (13:53 +0100)]
x86/spec-ctrl: Issue VERW during IST exit to Xen
There is a corner case where e.g. an NMI hitting an exit-to-guest path after
SPEC_CTRL_EXIT_TO_* would have run the entire NMI handler *after* the VERW
flush to scrub potentially sensitive data from uarch buffers.
In order to compensate, issue VERW when exiting to Xen from an IST entry.
SPEC_CTRL_EXIT_TO_XEN already has two reads of spec_ctrl_flags off the stack,
and we're about to add a third. Load the field into %ebx, and list the
register as clobbered.
%r12 has been arranged to be the ist_exit signal, so add this as an input
dependency and use it to identify when to issue a VERW.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3ee6066bcd737756b0990d417d94eddc0b0d2585)
Andrew Cooper [Wed, 13 Sep 2023 11:20:12 +0000 (12:20 +0100)]
x86/entry: Track the IST-ness of an entry for the exit paths
Use %r12 to hold an ist_exit boolean. This register is zero elsewhere in the
entry/exit asm, so it only needs setting in the IST path.
As this is subtle and fragile, add check_ist_exit() to be used in debugging
builds to cross-check that the ist_exit boolean matches the entry vector.
Write check_ist_exit() in C, because it's debug-only and the logic is more
complicated than I care to maintain in asm.
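A sketch of the cross-check (vector numbers per the x86 architecture; Xen's version is only built into debugging builds):

    #include <assert.h>
    #include <stdbool.h>

    static void check_ist_exit_example(unsigned int vector, bool ist_exit)
    {
        /* Vectors taken on IST stacks: #DB(1), NMI(2), #DF(8), #MC(18). */
        bool ist_entry = vector == 1 || vector == 2 ||
                         vector == 8 || vector == 18;

        assert(ist_entry == ist_exit);
    }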
For now, we only need to use this signal in the exit-to-Xen path, but some
exit-to-guest paths happen in IST context too. Check the correctness in all
exit paths to avoid the logic bit-rotting.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 21bdc25b05a0f8ab6bc73520a9ca01327360732c)
x86/entry: Partially revert IST-exit checks
The patch adding check_ist_exit() didn't account for the fact that
reset_stack_and_jump() is not an ABI-preserving boundary. The IST-ness in
%r12 doesn't survive into the next context, and ends up as a stale value there.
There's no straightforward way to reconstruct the IST-exit-ness on the
exit-to-guest path after a context switch. For now, we only need IST-exit on
the return-to-Xen path.
Fixes: 21bdc25b05a0 ("x86/entry: Track the IST-ness of an entry for the exit paths")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b57c800b79b96769ea3dcd6468578fa664d19f9)
Andrew Cooper [Wed, 13 Sep 2023 12:48:16 +0000 (13:48 +0100)]
x86/entry: Adjust restore_all_xen to hold stack_end in %r14
All other SPEC_CTRL_{ENTRY,EXIT}_* helpers hold stack_end in %r14. Adjust it
for consistency.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7aa28849a1155d856e214e9a80a7e65fffdc3e58)
Andrew Cooper [Wed, 30 Aug 2023 19:11:50 +0000 (20:11 +0100)]
x86/spec-ctrl: Improve all SPEC_CTRL_{ENTER,EXIT}_* comments
... to better explain how they're used.
Doing so highlights that SPEC_CTRL_EXIT_TO_XEN is missing a VERW flush for the
corner case when e.g. an NMI hits late in an exit-to-guest path.
Leave a TODO, which will be addressed in subsequent patches which arrange for
VERW flushing to be safe within SPEC_CTRL_EXIT_TO_XEN.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f00557350dc7d0756551069803fc49c29184ca)
Andrew Cooper [Fri, 1 Sep 2023 10:38:44 +0000 (11:38 +0100)]
x86/spec-ctrl: Turn the remaining SPEC_CTRL_{ENTRY,EXIT}_* into asm macros
These have grown more complex over time, with some already having been
converted.
Provide full Requires/Clobbers comments, otherwise missing at this level of
indirection.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7125429aafb9e3c9c88fc93001fc2300e0ac2cc8)
Andrew Cooper [Tue, 12 Sep 2023 16:03:16 +0000 (17:03 +0100)]
x86/spec-ctrl: Fold DO_SPEC_CTRL_EXIT_TO_XEN into it's single user
With the SPEC_CTRL_EXIT_TO_XEN{,_IST} confusion fixed, it's now obvious that
there's only a single EXIT_TO_XEN path. Fold DO_SPEC_CTRL_EXIT_TO_XEN into
SPEC_CTRL_EXIT_TO_XEN to simplify further fixes.
When merging labels, switch the name to .L\@_skip_sc_msr as "skip" on its own
is going to be too generic shortly.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 694bb0f280fd08a4377e36e32b84b5062def4de2)
Andrew Cooper [Tue, 12 Sep 2023 14:06:49 +0000 (15:06 +0100)]
x86/spec-ctrl: Fix confusion between SPEC_CTRL_EXIT_TO_XEN{,_IST}
c/s 3fffaf9c13e9 ("x86/entry: Avoid using alternatives in NMI/#MC paths")
dropped the only user, leaving behind the (incorrect) implication that Xen had
split exit paths.
Delete the unused SPEC_CTRL_EXIT_TO_XEN and rename SPEC_CTRL_EXIT_TO_XEN_IST
to SPEC_CTRL_EXIT_TO_XEN for consistency.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1c18d73774533a55ba9d1cbee8bdace03efdb5e7)
Jan Beulich [Wed, 23 Aug 2023 07:26:36 +0000 (09:26 +0200)]
x86/AMD: extend Zenbleed check to models "good" ucode isn't known for
Reportedly the AMD Custom APU 0405 found on SteamDeck, models 0x90 and
0x91, (quoting the respective Linux commit) is similarly affected. Put
another instance of our Zen1 vs Zen2 distinction checks in
amd_check_zenbleed(), forcing use of the chickenbit irrespective of
ucode version (building upon real hardware never surfacing a version of
0xffffffff).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 145a69c0944ac70cfcf9d247c85dee9e99d9d302)
xen/arm: page: Handle cache flush of an element at the top of the address space
The region that needs to be cleaned/invalidated may be at the top
of the address space. This means that 'end' (i.e. 'p + size') will
be 0 and therefore nothing will be cleaned/invalidated as the check
in the loop will always be false.
On Arm64, we only support up to 48-bit virtual
address space. So this is not a concern there. However, for 32-bit,
the mapcache is using the last 2GB of the address space. Therefore
we may not clean/invalidate properly some pages. This could lead
to memory corruption or data leakage (the scrubbed value may
still sit in the cache while the guest could read the memory directly
and therefore observe the old content).
Rework invalidate_dcache_va_range(), clean_dcache_va_range(),
clean_and_invalidate_dcache_va_range() to handle a cache flush
with an element at the top of the address space.
Andrew Cooper [Wed, 4 Jan 2023 16:32:44 +0000 (16:32 +0000)]
x86/spec-ctrl: Mitigate Gather Data Sampling
This is part of XSA-435 / CVE-2022-40982
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 56d690efd3ca3c68e1d222f259fb3d216206e8e5)
Andrew Cooper [Wed, 4 Jan 2023 17:32:44 +0000 (17:32 +0000)]
x86/spec-ctrl: Enumerations for Gather Data Sampling
GDS_CTRL is introduced by the August 2023 microcode. GDS_NO is for current
and future processors not susceptible to GDS.
This is part of XSA-435 / CVE-2022-40982
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9f585f59d90c8d3a1b21369a852b7d7eee8a29b9)
Andrew Cooper [Mon, 27 Feb 2023 15:36:49 +0000 (15:36 +0000)]
x86/cpu-policy: Hide CLWB by default on SKX/CLX/CPX
The August 2023 microcode for GDS has an impact on the CLWB instruction. See
code comments for full details.
This is part of XSA-435 / CVE-2022-40982
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2dd06b4ea10891750af38e4a0e1efaeb0a9b3518)
On native, synthesise the SRSO bits by probing various hardware properties as
given by AMD.
Extend the IBPB-on-entry mitigations to Zen3/4 CPUs. There is a microcode
prerequisite to make this an effective mitigation.
This is part of XSA-434 / CVE-2023-20569
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 220c06e6fefe2378f40e2a7391f5e265a2aa50f7)
Andrew Cooper [Wed, 14 Jun 2023 08:13:28 +0000 (09:13 +0100)]
x86/spec-ctrl: Enumerations for Speculative Return Stack Overflow
AMD have specified new CPUID bits relating to SRSO.
* SRSO_NO indicates that hardware is no longer vulnerable to SRSO.
* IBPB_BRTYPE indicates that IBPB flushes branch type information too.
* SBPB indicates support for a relaxed form of IBPB that does not flush
branch type information.
Current CPUs (Zen4 and older) are not expected to enumerate these bits.
Native software is expected to synthesise them for guests using model and
microcode revision checks.
Two are just status bits, and SBPB is trivial to support for guests by
tweaking the reserved bit calculation in guest_wrmsr() and feature
dependencies. Expose all by default to guests, so they start showing up when
Xen synthesises them.
While adding feature dependencies for IBPB, fix up an overlooked issue from
XSA-422. It's inappropriate to advertise that IBPB flushes RET predictions if
IBPB is unavailable itself.
This is part of XSA-434 / CVE-2023-20569
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2280b0ee2aed6e0fd4af3fa31bf99bc04d038bfe)
Andrew Cooper [Thu, 27 Jul 2023 19:03:28 +0000 (20:03 +0100)]
x86/spec-ctrl: Rework ibpb_calculations()
... in order to make the SRSO mitigations easier to integrate.
* Check for AMD/Hygon CPUs directly, rather than assuming based on IBPB.
In particular, Xen supports synthesising the IBPB bit to guests on Intel to
allow IBPB while dissuading the use of (legacy) IBRS.
* Collect def_ibpb_entry rather than opencoding the BTC_NO calculation for
both opt_ibpb_entry_{pv,hvm}.
No functional change.
This is part of XSA-434 / CVE-2023-20569
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 292f68fb77196a35ac92b296792770d0f3190d75)
Andrew Cooper [Wed, 17 May 2023 09:13:36 +0000 (10:13 +0100)]
x86/cpu-policy: Advertise MSR_ARCH_CAPS to guests by default
With xl/libxl now able to control the policy bits for MSR_ARCH_CAPS, it is
safe to advertise to guests by default. In turn, we don't need the special
case to expose details to dom0.
This advertises MSR_ARCH_CAPS to guests on *all* Intel hardware, even if the
register content ends up being empty.
- Advertising ARCH_CAPS and not RSBA signals "retpoline is safe here and
everywhere you might migrate to". This is important because it avoids the
guest kernel needing to rely on model checks.
- Alternatively, levelling for safety across the Broadwell/Skylake divide
requires advertising ARCH_CAPS and RSBA, meaning "retpoline not safe on
some hardware you might migrate to".
On Cascade Lake and later hardware, guests can now see RDCL_NO (not vulnerable
to Meltdown) amongst others. This causes substantial performance
improvements, as guests are no longer applying software mitigations in cases
where they don't need to.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4b2cdbfe766e5666e6754198946df2dc16f6a642)
Jan Beulich [Thu, 3 Aug 2023 15:35:39 +0000 (17:35 +0200)]
libxl: allow building with old gcc again
We can't use initializers of unnamed struct/union members just yet.
Fixes: d638fe233cb3 ("libxl: use the cpuid feature names from cpufeatureset.h")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1aa5acbbec3f37bf38d78fa96d210053f8e8efd5)
Jan Beulich [Thu, 3 Aug 2023 15:35:26 +0000 (17:35 +0200)]
libxl: avoid shadowing of index()
Because of -Wshadow the build otherwise fails with old enough glibc.
While there also obey line length limits for msr_add().
Fixes: 6d21cedbaa34 ("libxl: add support for parsing MSR features")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 4f6afde88be3e8960eb311d16ac41d44ab71ed10)
Introduce support for handling MSR features in
libxl_cpuid_parse_config(). The MSR policies are added to the
libxl_cpuid_policy like the CPUID one, which gets passed to
xc_cpuid_apply_policy().
This allows existing users of libxl to provide MSR related features as
key=value pairs to libxl_cpuid_parse_config() without requiring the
usage of a different API.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 6d21cedbaa34b3a3856f964189e911112c732b21)
libxl: use the cpuid feature names from cpufeatureset.h
The current implementation in libxl_cpuid_parse_config() requires
keeping a list of cpuid feature bits that should be mostly in sync
with the contents of cpufeatureset.h.
Avoid such duplication by using the automatically generated list of
cpuid features in INIT_FEATURE_NAMES in order to map feature names to
featureset bits, and then translate from featureset bits into cpuid
leaf, subleaf, register tuple.
Note that the full contents of the previous cpuid translation table
can't be removed. That's because some feature names allowed by libxl
are not described in the featuresets, or because naming has diverged
and the previous nomenclature is preserved for compatibility reasons.
Should result in no functional change observed by callers, albeit some
new cpuid features will be available as a result of the change.
While there constify cpuid_flags name field.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit d638fe233cb3a45105319df75df0edfed2fde5a5)
libxl: split logic to parse user provided CPUID features
Move the CPUID value parsers out of libxl_cpuid_parse_config() into a
newly created cpuid_add() local helper. This is in preparation for
also adding MSR feature parsing support.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit e2b1da9b8fda0ed7d3dca7bd15829cfea496973a)
Add a new array field to libxl_cpuid_policy in order to store the MSR
policies.
Adding the MSR data in the libxl_cpuid_policy_list type is done so
that existing users can seamlessly pass MSR features as part of the
CPUID data, without requiring the introduction of a separate
domain_build_info field, and a new set of handler functions.
Note that support for parsing the old JSON format is kept, as that's
required in order to restore domains or received migrations from
previous tool versions. Differentiation between the old and the new
formats is done based on whether the contents of the 'cpuid' field is
an array or a map JSON object.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 5b80cecb747b2176b9e85f6e7aa7be83416d77e1)
Currently libxl_cpuid_policy_list is an opaque type to the users of
libxl, and internally it's an array of xc_xend_cpuid objects.
Change the type to instead be a structure that contains one array for
CPUID policies, in preparation for it also holding another array for
MSR policies.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit 4825d19603580949144ac2ac5cb22df75c9da954)
libs/guest: introduce support for setting guest MSRs
Like it's done with CPUID, introduce support for passing MSR values to
xc_cpuid_apply_policy(). The chosen format for expressing MSR policy
data matches the current one used for CPUID. Note that existing
callers of xc_cpuid_apply_policy() can pass NULL as the value for the
newly introduced 'msr' parameter in order to preserve the same
functionality, and in fact that's done in libxl on this patch.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit ed742cf1b65c822759833027ca5cbb087c506a41)
Andrew Cooper [Wed, 24 May 2023 14:41:21 +0000 (15:41 +0100)]
x86/cpu-policy: Derive RSBA/RRSBA for guest policies
The RSBA bit, "RSB Alternative", means that the RSB may use alternative
predictors when empty. From a practical point of view, this means "Retpoline
not safe".
Enhanced IBRS (officially IBRS_ALL in Intel's docs, previously IBRS_ATT) is a
statement that IBRS is implemented in hardware (as opposed to the form
retrofitted to existing CPUs in microcode).
The RRSBA bit, "Restricted-RSBA", is a combination of RSBA, and the eIBRS
property that predictions are tagged with the mode in which they were learnt.
Therefore, it means "when eIBRS is active, the RSB may fall back to
alternative predictors but restricted to the current prediction mode". As
such, it's a stronger statement than RSBA, but still means "Retpoline not safe".
CPUs are not expected to enumerate both RSBA and RRSBA.
Add feature dependencies for EIBRS and RRSBA. While technically they're not
linked, absolutely nothing good can come of letting the guest see RRSBA
without EIBRS. Nor a guest seeing EIBRS without IBRSB. Furthermore, we use
this dependency to simplify the max derivation logic.
The max policies get RSBA and RRSBA unconditionally set (with the EIBRS
dependency maybe hiding RRSBA). We can run any VM, even if it has been told
"somewhere you might run, Retpoline isn't safe".
The default policies are more complicated. A guest shouldn't see both bits,
but it needs to see one if the current host suffers from any form of RSBA, and
which bit it needs to see depends on whether eIBRS is visible or not.
Therefore, the calculation must be performed after sanitise_featureset().
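A standalone sketch of that derivation (booleans stand in for the policy bits):

    #include <stdbool.h>

    static void derive_default_rsba(bool host_rsba_any, bool eibrs_visible,
                                    bool *rsba, bool *rrsba)
    {
        *rsba = *rrsba = false;
        if ( !host_rsba_any )
            return;              /* retpoline safe: show neither bit */
        if ( eibrs_visible )
            *rrsba = true;       /* eIBRS visible: RRSBA is the matching bit */
        else
            *rsba = true;        /* otherwise plain RSBA */
    }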
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e0586a4ff514590eec50185e2440b97f9a31cb7f)
Andrew Cooper [Thu, 25 May 2023 19:31:22 +0000 (20:31 +0100)]
x86/spec-ctrl: Fix up the RSBA/RRSBA bits as appropriate
In order to level a VM safely for migration, the toolstack needs to know the
RSBA/RRSBA properties of the CPU, whether or not they happen to be enumerated.
See the code comment for details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 36525a964fb629d0bd26e5a1c42de467af7a42a7)
Andrew Cooper [Fri, 26 May 2023 09:35:47 +0000 (10:35 +0100)]
x86/spec-ctrl: Rename retpoline_safe() to retpoline_calculations()
This is prep work, split out to simplify the diff on the following change.
* Rename to retpoline_calculations(), and call unconditionally. It is
shortly going to synthesise missing enumerations required for guest safety.
* For the model check switch statement, store the result in a variable and
break rather than returning directly.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 724c0d94ff79b208312d30676392bfdd693403be)
Andrew Cooper [Mon, 5 Jun 2023 10:09:11 +0000 (11:09 +0100)]
x86/spec-ctrl: Use a taint for CET without MSR_SPEC_CTRL
Reword the comment for 'S' to include an incompatible set of features on the
same core.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3f63f4510422c29fda7ba238b880cbb53eca34fe)
Andrew Cooper [Mon, 12 Jun 2023 19:24:00 +0000 (20:24 +0100)]
x86/spec-ctrl: Fix the rendering of FB_CLEAR
FB_CLEAR is a read-only status bit, not a read-write control. Move it from
"Hardware features" into "Hardware hints".
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 921afcbae843bb3f575a8f4a270b8e6cf471f4ca)
This is prep work, split out to simplify the diff on the following change.
* Split the INTEL check out of the IvyBridge RDRAND check, as the former will
be reused.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 064f572f96f1558faae0a74cad616ba95ec8ff34)
Andrew Cooper [Tue, 30 May 2023 15:03:16 +0000 (16:03 +0100)]
x86/spec-ctrl: Update hardware hints
* Rename IBRS_ALL to EIBRS. EIBRS is the term that everyone knows, and this
makes ARCH_CAPS_EIBRS match the X86_FEATURE_EIBRS form.
* Print RRSBA too, which is also a hint about behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 94200e1bae07e725cc07238c11569c5cab7befb7)
MSR_ARCH_CAPS data is now included in featureset information. Replace
opencoded checks with regular feature ones.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 511b9f286c3dadd041e0d90beeff7d47c9bf3b7a)
Andrew Cooper [Mon, 15 May 2023 18:05:01 +0000 (19:05 +0100)]
x86/tsx: Remove opencoded MSR_ARCH_CAPS check
The current cpu_has_tsx_ctrl tristate is serving a double purpose: to signal the
first pass through tsx_init(), and the availability of MSR_TSX_CTRL.
Drop the variable, replacing it with a once boolean, and altering
cpu_has_tsx_ctrl to come out of the feature information.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 205a9f970378c31ae3e00b52d59103a2e881b9e0)
Andrew Cooper [Mon, 15 May 2023 15:59:25 +0000 (16:59 +0100)]
x86/vtx: Remove opencoded MSR_ARCH_CAPS check
MSR_ARCH_CAPS data is now included in featureset information.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8f6bc7f9b72eb7cf0c8c5ae5d80498a58ba0b7c3)
Andrew Cooper [Fri, 12 May 2023 14:53:35 +0000 (15:53 +0100)]
x86/boot: Expose MSR_ARCH_CAPS data in guest max policies
We already have common and default feature adjustment helpers. Introduce one
for max featuresets too.
Offer MSR_ARCH_CAPS unconditionally in the max policy, and stop clobbering the
data inherited from the Host policy. This will be necessary to level a VM
safely for migration. Annotate the ARCH_CAPS CPUID bit as special. Note:
ARCH_CAPS is still max-only for now, so will not be inherited by the default
policies.
With this done, the special case for dom0 can be shrunk to just resampling the
Host policy (as ARCH_CAPS isn't visible by default yet).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit bbb289f3d5bdd3358af748d7c567343532ac45b5)
Andrew Cooper [Fri, 12 May 2023 14:37:02 +0000 (15:37 +0100)]
x86/boot: Record MSR_ARCH_CAPS for the Raw and Host CPU policy
Extend x86_cpu_policy_fill_native() with a read of ARCH_CAPS based on the
CPUID information just read, removing the special handling in
calculate_raw_cpu_policy().
Right now, the only use of x86_cpu_policy_fill_native() outside of Xen is the
unit tests. Getting MSR data in this context is left to whomever first
encounters a genuine need to have it.
Extend generic_identify() to read ARCH_CAPS into x86_capability[], which is
fed into the Host Policy. This in turn means there's no need to special case
arch_caps in calculate_host_policy().
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 70553000d6b44dd7c271a35932b0b3e1f22c5532)
Andrew Cooper [Fri, 12 May 2023 17:50:59 +0000 (18:50 +0100)]
x86/cpu-policy: MSR_ARCH_CAPS feature names
Seed the default visibility from the dom0 special case, which for the most
part just exposes the *_NO bits. EIBRS is the one non-*_NO bit, which is
"just" a status bit to the guest indicating a change in implemention of IBRS
which is already fully supported.
Insert a block dependency from the ARCH_CAPS CPUID bit to the entire content
of the MSR. This is because MSRs have no structure information similar to
CPUID, and the dependency is used by x86_cpu_policy_clear_out_of_range_leaves()
in order to bulk-clear inaccessible words.
The overall CPUID bit is still max-only, so all of MSR_ARCH_CAPS is hidden in
the default policies.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ce8c930851a5ca21c4e70f83be7e8b290ce1b519)
Andrew Cooper [Fri, 12 May 2023 16:55:21 +0000 (17:55 +0100)]
x86/cpu-policy: Infrastructure for MSR_ARCH_CAPS
Bits through 24 are already defined, meaning that we're not far off needing
the second word. Put both in right away.
As both halves are present now, the arch_caps field is full width. Adjust the
unit test, which notices.
The bool bitfield names in the arch_caps union are unused, and somewhat out of
date. They'll shortly be automatically generated.
Add CPUID and MSR prefixes to the ./xen-cpuid verbose output, now that there
are a mix of the two.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d9fe459ffad8a6eac2f695adb2331aff83c345d1)
Andrew Cooper [Mon, 15 May 2023 13:14:53 +0000 (14:14 +0100)]
x86/boot: Adjust MSR_ARCH_CAPS handling for the Host policy
We are about to move MSR_ARCH_CAPS into featureset, but the order of
operations (copy raw policy, then copy x86_capability[] in) will end up
clobbering the ARCH_CAPS value.
Some toolstacks use this information to handle TSX compatibility across the
CPUs and microcode versions where support was removed.
To avoid this transient breakage, read from raw_cpu_policy rather than
modifying it in place. This logic will be removed entirely in due course.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 43912f8dbb1888ffd7f00adb10724c70e71927c4)
Andrew Cooper [Fri, 12 May 2023 12:52:39 +0000 (13:52 +0100)]
x86/boot: Rework dom0 feature configuration
Right now, dom0's feature configuration is split between the common path and
a dom0-specific one. This is mostly by accident, and causes some very subtle
bugs.
First, clearly define init_dom0_cpuid_policy() to cover only the domain that
Xen builds automatically. The late hwdom case is still constructed in a
mostly normal way, with the control domain having full discretion over the CPU
policy.
Identifying this highlights a latent bug - the two halves of the MSR_ARCH_CAPS
bodge are asymmetric with respect to the hardware domain. This means that
shim, or a control-only dom0, sees the MSR_ARCH_CAPS CPUID bit but none of the
MSR content. This in turn declares the hardware to be retpoline-safe by
failing to advertise the {R,}RSBA bits appropriately. Restrict this logic to
the hardware domain, although the special case will cease to exist shortly.
For the CPUID Faulting adjustment, the comment in ctxt_switch_levelling()
isn't actually relevant. Provide a better explanation.
Move the recalculate_cpuid_policy() call outside of the dom0-cpuid= case.
This is no change for now, but will become necessary shortly.
Finally, place the second half of the MSR_ARCH_CAPS bodge after the
recalculate_cpuid_policy() call. This is necessary to avoid transiently
breaking the hardware domain's view while the handling is cleaned up. This
special case will cease to exist shortly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ef1987fcb0fdfaa7ee148024037cb5fa335a7b2d)
Andrew Cooper [Wed, 10 May 2023 18:58:43 +0000 (19:58 +0100)]
x86/cpuid: Calculate FEATURESET_NR_ENTRIES more helpfully
When adding new featureset words, it is convenient to split the work into
several patches. However, GCC 12 spotted that the way we prefer to split the
work results in a real (transient) breakage whereby the policy <-> featureset
helpers perform out-of-bounds accesses on the featureset array.
Fix this by having gen-cpuid.py calculate FEATURESET_NR_ENTRIES from the
comments describing the word blocks, rather than from the XEN_CPUFEATURE()
with the greatest value.
For simplicity, require that the word blocks appear in order. This can be
revisited if we find a good reason to have blocks out of order.
No functional change.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 56e2c8e5860090a35d5f0cafe168223a2a7c0e62)
Andrew Cooper [Wed, 29 Mar 2023 12:07:03 +0000 (13:07 +0100)]
x86: Remove temporary {cpuid,msr}_policy defines
With all code areas updated, drop the temporary defines and adjust all
remaining users.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 994c1553a158ada9db5ab64c9178a0d23c0a42ce)
Andrew Cooper [Mon, 3 Apr 2023 13:18:43 +0000 (14:18 +0100)]
libx86: Update library API for cpu_policy
Adjust the API and comments appropriately.
x86_cpu_policy_fill_native() will eventually contain MSR reads, but leave a
TODO in the short term.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1b67fccf3b02825f6a036bad06cd17963d0972d2)
tools/libs/guest: Fix build following libx86 changes
I appear to have lost this hunk somewhere...
Fixes: 1b67fccf3b02 ("libx86: Update library API for cpu_policy") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 48d76e6da92f9ef76c8468e299349a2f698362fa)
Andrew Cooper [Mon, 3 Apr 2023 16:14:14 +0000 (17:14 +0100)]
tools/fuzz: Rework afl-policy-fuzzer
With cpuid_policy and msr_policy merged to form cpu_policy, merge the
respective fuzzing logic.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a16dcd48c2db3f6820a15ea482551d289bd9cdec)
Andrew Cooper [Mon, 3 Apr 2023 19:03:57 +0000 (20:03 +0100)]
x86/emul: Switch x86_emulate_ctxt to cpu_policy
As with struct domain, retain cpuid as a valid alias for local code clarity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 441b1b2a50ea3656954d75e06d42c96d619ea0fc)
Andrew Cooper [Mon, 3 Apr 2023 18:06:02 +0000 (19:06 +0100)]
x86/boot: Merge CPUID policy initialisation logic into cpu-policy.c
Switch to the newer cpu_policy nomenclature. Do some easy cleanup of
includes.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8eb56eb959a50bf9afd0fd590ec394e9145970a4)
Andrew Cooper [Mon, 3 Apr 2023 16:48:43 +0000 (17:48 +0100)]
x86/boot: Move MSR policy initialisation logic into cpu-policy.c
Switch to the newer cpu_policy nomenclature.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4f20f596ce9bd95bde077a1ae0d7e07d20a5f6be)
Andrew Cooper [Thu, 30 Mar 2023 17:21:01 +0000 (18:21 +0100)]
x86: Out-of-inline the policy<->featureset convertors
These are already getting over-large for being inline functions, and are only
going to grow further over time. Out of line them, for a net code-size
reduction per bloat-o-meter.
Switch to the newer cpu_policy terminology while doing so.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 1027df4c00823f8b448e3a6861cc7b6ce61ba4e4)
Andrew Cooper [Wed, 29 Mar 2023 11:01:33 +0000 (12:01 +0100)]
x86: Drop struct old_cpu_policy
With all the complicated callers of x86_cpu_policies_are_compatible() updated
to use a single cpu_policy object, we can drop the final user of struct
old_cpu_policy.
Update x86_cpu_policies_are_compatible() to take (new) cpu_policy pointers,
reducing the amount of internal pointer chasing, and update all callers to
pass their cpu_policy objects directly.
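The resulting prototype, sketched with assumed parameter names:
    /* Plain cpu_policy pointers replace the old pair-of-pointers container. */
    int x86_cpu_policies_are_compatible(const struct cpu_policy *host,
                                        const struct cpu_policy *guest,
                                        struct cpu_policy_errors *err);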
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 66c5c99656314451ff9520f91cff5bb39fee9fed)
Andrew Cooper [Wed, 29 Mar 2023 11:37:33 +0000 (12:37 +0100)]
x86: Merge xc_cpu_policy's cpuid and msr objects
Right now, they're the same underlying type, containing disjoint information.
Use a single object instead. Also take the opportunity to rename 'entries' to
'msrs' which is more descriptive, and more in line with nr_msrs being the
count of MSR entries in the API.
test-tsx uses xg_private.h to access the internals of xc_cpu_policy, so needs
updating at the same time. Take the opportunity to improve the code clarity
by passing a cpu_policy rather than an xc_cpu_policy into some functions.
No practical change. This undoes the transient doubling of storage space from
earlier patches.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c9985233ca663fea20fc8807cf509d2e3fef0dca)
Andrew Cooper [Wed, 29 Mar 2023 10:32:25 +0000 (11:32 +0100)]
x86: Merge a domain's {cpuid,msr} policy objects
Right now, they're the same underlying type, containing disjoint information.
Drop the d->arch.msr pointer, and union d->arch.cpuid to give it a second name
of cpu_policy in the interim.
Merge init_domain_{cpuid,msr}_policy() into a single init_domain_cpu_policy(),
moving the implementation into cpu-policy.c
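A minimal sketch of the interim aliasing, assuming an anonymous union (the surrounding layout is illustrative):
    struct arch_domain {
        /* ... */
        union {
            struct cpu_policy *cpuid;      /* Legacy name. */
            struct cpu_policy *cpu_policy; /* New name. */
        };
        /* d->arch.msr is dropped; its data lives in the merged object. */
    };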
No practical change. This undoes the transient doubling of storage space from
earlier patches.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit bd13dae34809e61e37ba1cd5de893c5c10c46256)
Andrew Cooper [Wed, 29 Mar 2023 06:39:44 +0000 (07:39 +0100)]
x86: Merge the system {cpuid,msr} policy objects
Right now, they're the same underlying type, containing disjoint information.
Introduce a new cpu-policy.{h,c} to be the new location for all policy
handling logic. Place the combined objects in __ro_after_init, which is new
since the original logic was written.
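Sketched placement (the object names mirror the Raw/Host naming used throughout this series):
    /* Written during boot, read-only thereafter. */
    struct cpu_policy __ro_after_init raw_cpu_policy;
    struct cpu_policy __ro_after_init host_cpu_policy;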
As we're trying to phase out the use of struct old_cpu_policy entirely, rework
update_domain_cpu_policy() to not pointer-chase through system_policies[].
This in turn allows system_policies[] in sysctl.c to become static and reduced
in scope to XEN_SYSCTL_get_cpu_policy.
No practical change. This undoes the transient doubling of storage space from
earlier patches.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6bc33366795d14a21a3244d0f3b63f7dccea87ef)
Andrew Cooper [Tue, 28 Mar 2023 20:24:20 +0000 (21:24 +0100)]
x86: Merge struct msr_policy into struct cpu_policy
As with the cpuid side, use a temporary define to make struct msr_policy still
work.
Note, this means that domains now have two separate struct cpu_policy
allocations with disjoint information, and system policies are in a similar
position, as well as xc_cpu_policy objects in libxenguest. All of these
duplications will be addressed in the following patches.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 03812da3754d550dd8cbee7289469069ea6f0073)
Andrew Cooper [Tue, 28 Mar 2023 17:55:19 +0000 (18:55 +0100)]
x86: Rename struct cpuid_policy to struct cpu_policy
Also merge lib/x86/cpuid.h entirely into lib/x86/cpu-policy.h
Use a temporary define to make struct cpuid_policy still work.
There's one forward declaration of struct cpuid_policy in
tools/tests/x86_emulator/x86-emulate.h that isn't covered by the define, and
it's easier to rename that now than to rearrange the includes.
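The shim amounts to something like this (a sketch; the real header layout differs):
    struct cpu_policy {
        /* ... merged CPUID and MSR state ... */
    };

    /* Temporary, until all users are converted to the new name. */
    #define cpuid_policy cpu_policy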
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 743e530380a007774017df9dc2d8cb0659040ee3)
x86: Rename {domctl,sysctl}.cpu_policy.{cpuid,msr}_policy fields
These weren't great names to begin with, and using {leaves,msrs} matches up
better with the existing nr_{leaves,msr} parameters anyway.
Furthermore, by renaming these fields we can get away with using some #define
trickery to avoid the struct {cpuid,msr}_policy merge needing to happen in a
single changeset.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 21e3ef57e0406b6b9a783f721f29df8f91a00f99)
xen: Correct comments after renaming xen_{dom,sys}ctl_cpu_policy fields
Fixes: 21e3ef57e040 ("x86: Rename {domctl,sysctl}.cpu_policy.{cpuid,msr}_policy fields") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 6e06d229d538ea51b92dc189546c522f5e903511)
Andrew Cooper [Tue, 28 Mar 2023 19:31:33 +0000 (20:31 +0100)]
x86: Rename struct cpu_policy to struct old_cpu_policy
We want to merge struct cpuid_policy and struct msr_policy together, and the
result wants to be called struct cpu_policy.
The current struct cpu_policy, being a pair of pointers, isn't terribly
useful. Rename the type to struct old_cpu_policy, but it will disappear
entirely once the merge is complete.
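For reference, the transitional type is just the pair of pointers described above:
    struct old_cpu_policy {
        struct cpuid_policy *cpuid;
        struct msr_policy   *msr;
    };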
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c2ec94c370f211d73f336ccfbdb32499f1b05f82)
Featuresets are supposed to be disappearing when the CPU policy infrastructure
is complete, but that has taken longer than expected, and isn't going to be
complete imminently either.
In the meantime, Xen does have proper default/max featuresets, and xen-cpuid
can even get them via the XEN_SYSCTL_cpu_policy_* interface, but only knows
how to render them nicely via the featureset interface.
Differences between default and max are a frequent source of errors, often
while working in secret in the lead-up to an embargo, so extend the featureset
sysctl to allow xen-cpuid to render them all nicely.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Christian Lindig <christian.lindig@cloud.com>
(cherry picked from commit 433d012c6c2737ad5a9aaa994355a4140d601852)
Andrew Cooper [Fri, 10 Mar 2023 19:04:22 +0000 (19:04 +0000)]
tools/xen-cpuid: Rework the handling of dynamic featuresets
struct fsinfo is the vestigial remnant of an older internal design which
didn't survive very long.
Simplify things by inlining get_featureset() and having a single memory
allocation that gets reused. This in turn changes featuresets[] to be a
simple list of names, so rename it to fs_names[].
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec3474e1dd42e6f410601f50b6e74fb7c442cfb9)
Andrew Cooper [Tue, 14 Dec 2021 16:53:36 +0000 (16:53 +0000)]
x86/cpuid: Introduce dom0-cpuid command line option
Specifically, this lets the user opt in to non-default features.
Collect all dom0 settings together in dom0_{en,dis}able_feat[], and apply it
to dom0's policy when other tweaks are being made.
As recalculate_cpuid_policy() is an expensive action, and dom0-cpuid= is
likely to only be used by the x86 maintainers for development purposes, forgo
the recalculation in the general case.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5bd2b82df28cb7390f5ffb00fac635d0b9e36674)
Andrew Cooper [Wed, 15 Dec 2021 16:30:25 +0000 (16:30 +0000)]
x86/cpuid: Factor common parsing out of parse_xen_cpuid()
dom0-cpuid= is going to want to reuse the common parsing loop, so factor it
out into parse_cpuid().
Irritatingly, despite being static const, the features[] array gets duplicated
each time parse_cpuid() is inlined. As it is a large (and ever-growing with
new CPU features) data structure, move it to file scope so that all inlined
copies use the same single object.
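A sketch of the hoisting (INIT_FEATURE_NAMES is an assumed stand-in for the generated initialiser):
    /* At file scope: one shared object, instead of a private copy embedded
     * in each inlined instance of parse_cpuid(). */
    static const struct feature {
        const char *name;
        unsigned int bit;
    } features[] = INIT_FEATURE_NAMES;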
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 94c3df9188d6deed6fe213754492b11b9d409262)
Andrew Cooper [Wed, 15 Dec 2021 15:36:59 +0000 (15:36 +0000)]
x86/cpuid: Split dom0 handling out of init_domain_cpuid_policy()
To implement dom0-cpuid= support, the special cases would need extending.
However there is already a problem with late hwdom where the special cases
override toolstack settings, which is unintended and poor behaviour.
Introduce a new init_dom0_cpuid_policy() for the purpose, moving the ITSC and
ARCH_CAPS logic. The is_hardware_domain() check can be dropped, and for now
there is no need to rerun recalculate_cpuid_policy(); it is a relatively
expensive operation, and will become more so over time.
Rearrange the logic in create_dom0() to make room for a call to
init_dom0_cpuid_policy(). The AMX plans for variable-sized XSAVE states
require that modifications to the policy happen before vCPUs are created.
Additionally, factor out domid into a variable so we can be slightly more
correct in the case of a failure, and also print the error from
domain_create(). This will at least help distinguish -EINVAL from -ENOMEM.
No practical change in behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c17072fc164a72583fda8e2b836c71d2e3f8e84d)
    init.c:348:18: error: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Werror,-Wsingle-bit-bitfield-constant-conversion]
            ctrl->is_server = 1;
                            ^ ~
    1 error generated.
    make[6]: *** [/builds/xen-project/people/andyhhp/xen/tools/libs/vchan/../../../tools/Rules.mk:188: init.o] Error 1
In Xen 4.18, this was fixed with c/s 99ab02f63ea8 ("tools: convert bitfields
to unsigned type") but this is an ABI change which can't be backported.
Switch 1 for -1 to provide a minimally invasive way to fix the build.
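In sketch form (struct abridged):
    struct libxenvchan {
        /* ... */
        int is_server:1; /* Signed 1-bit field: representable values are
                          * 0 and -1, so assigning 1 truncates. */
    };

    ctrl->is_server = -1; /* Still nonzero/true, without the ABI change of
                           * making the field unsigned. */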
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>