Bertrand Marquis [Thu, 17 Feb 2022 14:52:54 +0000 (14:52 +0000)]
xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3
Allow guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported and create a fastpath in the code to handle guests request to
do the workaround.
The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.
As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.
Rahul Singh [Mon, 14 Feb 2022 18:47:32 +0000 (18:47 +0000)]
xen/arm: Add Spectre BHB handling
This commit is adding Spectre BHB handling to Xen on Arm.
The commit is introducing new alternative code to be executed during
exception entry:
- SMCC workaround 3 call
- loop workaround (with 8, 24 or 32 iterations)
- use of new clearbhb instruction
Cpuerrata is modified by this patch to apply the required workaround for
CPU affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is
enabled.
To do this the system previously used to apply smcc workaround 1 is
reused and new alternative code to be copied in the exception handler is
introduced.
To define the type of workaround required by a processor, 4 new cpu
capabilities are introduced (for each number of loop and for smcc
workaround 3).
When a processor is affected, enable_spectre_bhb_workaround is called
and if the processor does not have CSV2 set to 3 or ECBHB feature (which
would mean that the processor is doing what is required in hardware),
the proper code is enabled at exception entry.
In the case where workaround 3 is not supported by the firmware, we
enable workaround 1 when possible as it will also mitigate Spectre BHB
on systems without CSV2.
Bertrand Marquis [Wed, 23 Feb 2022 09:42:18 +0000 (09:42 +0000)]
xen/arm: Add ECBHB and CLEARBHB ID fields
Introduce ID coprocessor register ID_AA64ISAR2_EL1.
Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and
CLEARBHB in isar2 ID coprocessor registers.
Bertrand Marquis [Tue, 15 Feb 2022 10:39:47 +0000 (10:39 +0000)]
xen/arm: move errata CSV2 check earlier
CSV2 availability check is done after printing to the user that
workaround 1 will be used. Move the check before to prevent saying to the
user that workaround 1 is used when it is not because it is not needed.
This will also allow to reuse install_bp_hardening_vec function for
other use cases.
Code previously returning "true", now returns "0" to conform to
enable_smccc_arch_workaround_1 returning an int and surrounding code
doing a "return 0" if workaround is not needed.
Andrew Cooper [Tue, 19 Oct 2021 20:22:27 +0000 (21:22 +0100)]
x86/spec-ctrl: Support Intel PSFD for guests
The Feb 2022 microcode from Intel retrofits AMD's MSR_SPEC_CTRL.PSFD interface
to Sunny Cove (IceLake) and later cores.
Update the MSR_SPEC_CTRL emulation, and expose it to guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 52ce1c97844db213de01c5300eaaa8cf101a285f)
Andrew Cooper [Thu, 27 Jan 2022 21:07:40 +0000 (21:07 +0000)]
x86/cpuid: Infrastructure for cpuid word 7:2.edx
While in principle it would be nice to keep leaf 7 in order, that would
involve having an extra 5 words of zeros in a featureset.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit f3709b15fc86c6c6a0959cec8d97f21d0e9f9629)
Andrew Cooper [Wed, 16 Sep 2020 15:15:52 +0000 (16:15 +0100)]
x86/tsx: Cope with TSX deprecation on WHL-R/CFL-R
The February 2022 microcode is formally de-featuring TSX on the TAA-impacted
client CPUs. The backup TAA mitigation (VERW regaining its flushing side
effect) is being dropped, meaning that `smt=0 spec-ctrl=md-clear` no longer
protects against TAA on these parts.
The new functionality enumerates itself via the RTM_ALWAYS_ABORT CPUID
bit (the same as June 2021), but has its control in MSR_MCU_OPT_CTRL as
opposed to MSR_TSX_FORCE_ABORT.
TSX now defaults to being disabled on ucode load. Furthermore, if SGX is
enabled in the BIOS, TSX is locked and cannot be re-enabled. In this case,
override opt_tsx to 0, so the RTM/HLE CPUID bits get hidden by default.
While updating the command line documentation, take the opportunity to add a
paragraph explaining what TSX being disabled actually means, and how migration
compatibility works.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ad9f7c3b2e0df38ad6d54f4769d4dccf765fbcee)
Andrew Cooper [Wed, 23 Jun 2021 20:53:58 +0000 (21:53 +0100)]
x86/tsx: Move has_rtm_always_abort to an outer scope
We are about to introduce a second path which needs to conditionally force the
presence of RTM_ALWAYS_ABORT.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 4116139131e93b4f075e5442e3c1b424280f6f1f)
Andrew Cooper [Wed, 19 May 2021 18:40:28 +0000 (19:40 +0100)]
x86/spec-ctrl: Clean up MSR_MCU_OPT_CTRL handling
Introduce cpu_has_srbds_ctrl as more users are going to appear shortly.
MSR_MCU_OPT_CTRL is gaining extra functionality, meaning that the current
default_xen_mcu_opt_ctrl is no longer a good fit.
Introduce two new helpers, update_mcu_opt_ctrl() which does a full RMW cycle
on the MSR, and set_in_mcu_opt_ctrl() which lets callers configure specific
bits at a time without clobbering each others settings.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 39a40f3835efcc25c1b05a25c321a01d7e11cbd7)
Jan Beulich [Thu, 27 Jan 2022 12:54:42 +0000 (12:54 +0000)]
x86/cpuid: Infrastructure for leaf 7:1.ebx
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e1828e3032ebfe036023cd733adfd2d4ec856688)
Andrew Cooper [Thu, 27 Jan 2022 13:56:04 +0000 (13:56 +0000)]
x86/cpuid: Disentangle logic for new feature leaves
Adding a new feature leaf is a reasonable amount of boilerplate and for the
patch to build, at least one feature from the new leaf needs defining. This
typically causes two non-trivial changes to be merged together.
First, have gen-cpuid.py write out some extra placeholder defines:
This allows DECL_BITFIELD() to be added to struct cpuid_policy without
requiring a XEN_CPUFEATURE() declared for the leaf. The choice of 4 is
arbitrary, and allows us to add more than one leaf at a time if necessary.
Second, rework generic_identify() to not use specific feature names.
The choice of deriving the index from a feature was to avoid mismatches, but
its correctness depends on bugs like c/s 249e0f1d8f20 ("x86/cpuid: Fix
TSXLDTRK definition") not happening.
Switch to using FEATURESET_* just like the policy/featureset helpers. This
breaks the cognitive complexity of needing to know which leaf a specifically
named feature should reside in, and is shorter to write. It is also far
easier to identify as correct at a glance, given the correlation with the
CPUID leaf being read.
In addition, tidy up some other bits of generic_identify()
* Drop leading zeros from leaf numbers.
* Don't use a locked update for X86_FEATURE_APERFMPERF.
* Rework extended_cpuid_level calculation to avoid setting it twice.
* Use "leaf >= $N" consistently so $N matches with the CPUID input.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e3662437eb43cc8002bd39be077ef68b131649c5)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default
With all other pieces in place, MSR_SPEC_CTRL is fully working for HVM guests.
Update the CPUID derivation logic (both PV and HVM to avoid losing subtle
changes), drop the MSR intercept, and explicitly enable the CPUID bits for HVM
guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a7e7c7260cde78a148810db5320cbf39686c3e09)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/msr: AMD MSR_SPEC_CTRL infrastructure
Fill in VMCB accessors for spec_ctrl in svm_{get,set}_reg(), and CPUID checks
for all supported bits in guest_{rd,wr}msr().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22b9add22b4a9af37305c8441fec12cb26bd142b)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL
Hardware maintains both host and guest versions of MSR_SPEC_CTRL, but guests
run with the logical OR of both values. Therefore, in principle we want to
clear Xen's value before entering the guest. However, for migration
compatibility (future work), and for performance reasons with SEV-SNP guests,
we want the ability to use a nonzero value behind the guest's back. Use
vcpu_msrs to hold this value, with the guest value in the VMCB.
On the VMEntry path, adjusting MSR_SPEC_CTRL must be done after CLGI so as to
be atomic with respect to NMIs/etc.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 614cec7d79d76786f5638a6e4da0576b57732ca1)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD
Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in
the system. This ceases to be true when using the common logic.
Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and
introduce an AMD specific block to control alternatives. Also update the
boot/resume paths to configure default_xen_spec_ctrl.
svm.h needs an adjustment to remove a dependency on include order.
For now, only active alternatives for HVM - PV will require more work. No
functional change, as no alternatives are defined yet for HVM yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)
Andrew Cooper [Fri, 28 Jan 2022 11:57:19 +0000 (11:57 +0000)]
x86/spec-ctrl: Record the last write to MSR_SPEC_CTRL
In some cases, writes to MSR_SPEC_CTRL do not have interesting side effects,
and we should implement lazy context switching like we do with other MSRs.
In the short term, this will be used by the SVM infrastructure, but I expect
to extend it to other contexts in due course.
Introduce cpu_info.last_spec_ctrl for the purpose, and cache writes made from
the boot/resume paths. The value can't live in regular per-cpu data when it
is eventually used for PV guests when XPTI might be active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 00f2992b6c7a9d4090443c1a85bf83224a87eeb9)
Andrew Cooper [Fri, 28 Jan 2022 12:03:42 +0000 (12:03 +0000)]
x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3
'idle' here refers to hlt/mwait. The S3 path isn't an idle path - it is a
platform reset.
We need to load default_xen_spec_ctrl unilaterally on the way back up.
Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next
return-to-guest, but that's fragile behaviour.
Conversely, there is no need to clear IBRS and flush the store buffers on the
way down; we're microseconds away from cutting power.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)
Andrew Cooper [Tue, 25 Jan 2022 17:14:48 +0000 (17:14 +0000)]
x86/spec-ctrl: Introduce new has_spec_ctrl boolean
Most MSR_SPEC_CTRL setup will be common between Intel and AMD. Instead of
opencoding an OR of two features everywhere, introduce has_spec_ctrl instead.
Reword the comment above the Intel specific alternatives block to highlight
that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS
out because it will want to be common.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)
Andrew Cooper [Tue, 25 Jan 2022 16:09:59 +0000 (16:09 +0000)]
x86/spec-ctrl: Drop use_spec_ctrl boolean
Several bugfixes have reduced the utility of this variable from it's original
purpose, and now all it does is aid in the setup of SCF_ist_wrmsr.
Simplify the logic by drop the variable, and doubling up the setting of
SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL
support easier to follow. Leave a comment explaining why SCF_ist_wrmsr is
still necessary for the VMExit case.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)
Andrew Cooper [Thu, 27 Jan 2022 21:28:48 +0000 (21:28 +0000)]
x86/cpuid: Advertise SSB_NO to guests by default
This is a statement of hardware behaviour, and not related to controls for the
guest kernel to use. Pass it straight through from hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 15b7611efd497c4b65f350483857082cb70fc348)
Andrew Cooper [Wed, 19 Jan 2022 19:55:02 +0000 (19:55 +0000)]
x86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL
This bug existed in early in 2018 between MSR_SPEC_CTRL arriving in microcode,
and SSBD arriving a few months later. It went unnoticed presumably because
everyone was busy rebooting everything.
The same bug will reappear when adding PSFD support.
Clamp the guest MSR_SPEC_CTRL value to that permitted by CPUID on migrate.
The guest is already playing with reserved bits at this point, and clamping
the value will prevent a migration to a less capable host from failing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 969a57f73f6b011b2ebf4c0ab1715efc65837335)
Andrew Cooper [Tue, 1 Feb 2022 13:34:49 +0000 (13:34 +0000)]
x86/vmx: Drop spec_ctrl load in VMEntry path
This is not needed now that the VMEntry path is not responsible for loading
the guest's MSR_SPEC_CTRL value.
Fixes: 81f0eaadf84d ("x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9ce3ef20b4f085a7dc8ee41b0fec6fdeced3773e)
x86/cpuid: support LFENCE always serialising CPUID bit
AMD Milan (Zen3) CPUs have an LFENCE Always Serialising CPUID bit in
leaf 80000021.eax. Previous AMD versions used to have a user settable
bit in DE_CFG MSR to select whether LFENCE was dispatch serialising,
which Xen always attempts to set. The forcefully always on setting is
due to the addition of SEV-SNP so that a VMM cannot break the
confidentiality of a guest.
In order to support this new CPUID bit move the LFENCE_DISPATCH
synthetic CPUID bit to map the hardware bit (leaving a hole in the
synthetic range) and either rely on the bit already being set by the
native CPUID output, or attempt to fake it in Xen by modifying the
DE_CFG MSR. This requires adding one more entry to the featureset to
support leaf 80000021.eax.
The bit is always exposed to guests by default even if the underlying
hardware doesn't support leaf 80000021. Note that Xen doesn't allow
guests to change the DE_CFG value, so once set by Xen LFENCE will always
be serialising.
Note that the access to DE_CFG by guests is left as-is: reads will
unconditionally return LFENCE_SERIALISE bit set, while writes are
silently dropped.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Always expose to guests by default] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit e9b4fe26364950258c9f57f0f68eccb778eeadbb)
x86/cpuid: do not expand max leaves on restore
When restoring limit the maximum leaves to the ones supported by Xen
4.12 in order to not expand the maximum leaves a guests sees. Note
this is unlikely to cause real issues.
Guests restored from Xen versions 4.13 or greater will contain CPUID
data on the stream that will override the values set by
xc_cpuid_apply_policy.
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 111c8c33a8a18588f3da3c5dbb7f5c63ddb98ce5)
tools/libxenguest: Fix max_extd_leaf calculation for legacy restore
0x1c is lower than any value which will actually be observed in
p->extd.max_leaf, but higher than the logical 9 leaves worth of extended data
on Intel systems, causing x86_cpuid_copy_to_buffer() to fail with -ENOBUFS.
Correct the calculation.
The problem was first noticed in c/s 34990446ca9 "libxl: don't ignore the
return value from xc_cpuid_apply_policy" but introduced earlier.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5fa174cbf54cc625a023b8e7170e359dd150c072)
It was Xen 4.14 where CPUID data was added to the migration stream, and 4.13
that we need to worry about with regards to compatibility. Xen 4.12 isn't
relevant.
Expand and correct the commentary.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 820cc393434097f3b7976acdccbf1d96071d6d23)
x86/amd: split LFENCE dispatch serializing setup logic into helper
Split the logic to attempt to setup LFENCE to be dispatch serializing
on AMD into a helper, so it can be shared with Hygon.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3e9460ec93341fa6a80ecf99832aa5d9975339c9)
Roger Pau Monné [Wed, 26 Jan 2022 11:52:09 +0000 (12:52 +0100)]
x86/pvh: fix population of the low 1MB for dom0
RMRRs are setup ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.
Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.
Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d5fc9120d556ec3c4b1acf0ab5660a6d3f7ebeb
master date: 2022-01-25 10:52:24 +0000
Andrew Cooper [Wed, 26 Jan 2022 11:51:31 +0000 (12:51 +0100)]
x86: Fix build with the get/set_reg() infrastructure
I clearly messed up concluding that the stubs were safe to drop.
The is_{pv,hvm}_domain() predicates are not symmetrical with both CONFIG_PV
and CONFIG_HVM. As a result logic of the form `if ( pv/hvm ) ... else ...`
will always have one side which can't be DCE'd.
While technically only the hvm stubs are needed, due to the use of the
is_pv_domain() predicate in guest_{rd,wr}msr(), sort out the pv stubs too to
avoid leaving a bear trap for future users.
Fixes: 88d3ff7ab15d ("x86/guest: Introduce {get,set}_reg() infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 13caa585791234fe3e3719c8376f7ea731012451
master date: 2022-01-21 12:42:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:53:14 +0000 (13:53 +0100)]
x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling
The logic was based on a mistaken understanding of how NMI blocking on vmexit
works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
and the guest's value will be clobbered before it is saved.
Switch to using MSR load/save lists. This causes the guest value to be saved
atomically with respect to NMIs/MCEs/etc.
First, update vmx_cpuid_policy_changed() to configure the load/save lists at
the same time as configuring the intercepts. This function is always used in
remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
function, rather than having multiple remote acquisitions of the same VMCS.
Both of vmx_{add,del}_guest_msr() can fail. The -ESRCH delete case is fine,
but all others are fatal to the running of the VM, so handle them using
domain_crash() - this path is only used during domain construction anyway.
Second, update vmx_{get,set}_reg() to use the MSR load/save lists rather than
vcpu_msrs, and update the vcpu_msrs comment to describe the new state
location.
Finally, adjust the entry/exit asm.
Because the guest value is saved and loaded atomically, we do not need to
manually load the guest value, nor do we need to enable SCF_use_shadow. This
lets us remove the use of DO_SPEC_CTRL_EXIT_TO_GUEST. Additionally,
SPEC_CTRL_ENTRY_FROM_PV gets removed too, because on an early entry failure,
we're no longer in the guest MSR_SPEC_CTRL context needing to switch back to
Xen's context.
The only action remaining is to load Xen's MSR_SPEC_CTRL value on vmexit. We
could in principle use the host msr list, but is expected to complicated
future work. Delete DO_SPEC_CTRL_ENTRY_FROM_HVM entirely, and use a shorter
code sequence to simply reload Xen's setting from the top-of-stack block.
Adjust the comment at the top of spec_ctrl_asm.h in light of this bugfix.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f0eaadf84d273a6ff8df3660b874a02d0e7677
master date: 2022-01-20 16:32:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:52:56 +0000 (13:52 +0100)]
x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM
These were written before Spectre/Meltdown went public, and there was large
uncertainty in how the protections would evolve. As it turns out, they're
very specific to Intel hardware, and not very suitable for AMD.
Drop the macros, opencoding the relevant subset of functionality, and leaving
grep-fodder to locate the logic. No change at all for VT-x.
For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB,
although we will soon be adding (different) logic to handle MSR_SPEC_CTRL.
This has a marginal improvement of removing an unconditional pile of long-nops
from the vmentry/exit path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 95b13fa43e0753b7514bef13abe28253e8614f62
master date: 2022-01-20 16:32:11 +0000
Various registers have per-guest-type or per-vendor locations or access
requirements. To support their use from common code, provide accessors which
allow for per-guest-type behaviour.
For now, just infrastructure handling default cases and expectations.
Subsequent patches will start handling registers using this infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 88d3ff7ab15da277a85b39735797293fb541c718
master date: 2022-01-20 16:32:11 +0000
Jan Beulich [Tue, 25 Jan 2022 12:51:49 +0000 (13:51 +0100)]
x86/time: improve TSC / CPU freq calibration accuracy
While the problem report was for extreme errors, even smaller ones would
better be avoided: The calculated period to run calibration loops over
can (and usually will) be shorter than the actual time elapsed between
first and last platform timer and TSC reads. Adjust values returned from
the init functions accordingly.
On a Skylake system I've tested this on accuracy (using HPET) went from
detecting in some cases more than 220kHz too high a value to about
±2kHz. On other systems (or on this system, but with PMTMR) the original
error range was much smaller, with less (in some cases only very little)
improvement.
Reported-by: James Dingwall <james-xen@dingwall.me.uk> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a5c9a80af34eefcd6e31d0ed2b083f452cd9076d
master date: 2022-01-13 14:31:52 +0100
Jan Beulich [Tue, 25 Jan 2022 12:51:36 +0000 (13:51 +0100)]
x86/time: use relative counts in calibration loops
Looping until reaching/exceeding a certain value is error prone: If the
target value is close enough to the wrapping point, the loop may not
terminate at all. Switch to using delta values, which then allows to
fold the two loops each into just one.
Fixes: 93340297802b ("x86/time: calibrate TSC against platform timer") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 467191641d2a2fd2e43b3ae7b80399f89d339980
master date: 2022-01-13 14:30:18 +0100
Julien Grall [Tue, 25 Jan 2022 12:49:26 +0000 (13:49 +0100)]
xen/grant-table: Only decrement the refcounter when grant is fully unmapped
The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a
simple revert of the changes done by the grant mapping hypercall
(GNTTABOP_map_grant_ref).
Instead, it is possible to partially (or even not) clear some flags.
This will leave the grant is mapped until a future call where all
the flags would be cleared.
XSA-380 introduced a refcounting that is meant to only be dropped
when the grant is fully unmapped. Unfortunately, unmap_common() will
decrement the refcount for every successful call.
A consequence is a domain would be able to underflow the refcount
and trigger a BUG().
Looking at the code, it is not clear to me why a domain would
want to partially clear some flags in the grant-table. But as
this is part of the ABI, it is better to not change the behavior
for now.
Fix it by checking if the maptrack handle has been released before
decrementing the refcounting.
Julien Grall [Tue, 25 Jan 2022 12:49:06 +0000 (13:49 +0100)]
xen/arm: p2m: Always clear the P2M entry when the mapping is removed
Commit 2148a125b73b ("xen/arm: Track page accessed between batch of
Set/Way operations") allowed an entry to be invalid from the CPU PoV
(lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful
to track which page is accessed and only perform an action on them
(e.g. clean & invalidate the cache after a set/way instruction).
Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when
lpae_is_valid() returns true. This means the entry will not be zeroed
if the entry was valid from Xen PoV but invalid from the CPU PoV for
tracking purpose.
As a consequence, this will allow a domain to continue to access the
page after it was removed.
Resolve the issue by always zeroing the entry if it the LPAE bit is
set or the entry is about to be removed.
Andrew Cooper [Fri, 7 Jan 2022 07:54:38 +0000 (08:54 +0100)]
x86/spec-ctrl: Fix default calculation of opt_srb_lock
Since this logic was introduced, opt_tsx has become more complicated and
shouldn't be compared to 0 directly. While there are no buggy logic paths,
the correct expression is !(opt_tsx & 1) but the rtm_disabled boolean is
easier and clearer to use.
Fixes: 8fe24090d940 ("x86/cpuid: Rework HLE and RTM handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 31f3bc97f4508687215e459a5e35676eecf1772b
master date: 2022-01-05 09:44:26 +0000
While its description is correct from an abstract or real hardware pov,
the range is special inside HVM guests. The range being UC in particular
gets in the way of OVMF, which places itself at [FFE00000,FFFFFFFF].
While this is benign to epte_get_entry_emt() as long as the IOMMU isn't
enabled for a guest, it becomes a very noticable problem otherwise: It
takes about half a minute for OVMF to decompress itself into its
designated address range.
And even beyond OVMF there's no reason to have e.g. the ACPI memory
range marked UC.
Fixes: c22bd567ce22 ("hvmloader: PA range 0xfc000000-0xffffffff should be UC") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ea187c0b7a73c26258c0e91e4f3656989804555f
master date: 2021-12-17 08:56:15 +0100
Jan Beulich [Fri, 7 Jan 2022 07:52:48 +0000 (08:52 +0100)]
x86/HVM: permit CLFLUSH{,OPT} on execute-only code segments
Both SDM and PM explicitly permit this.
Fixes: 52dba7bd0b36 ("x86emul: generalize wbinvd() hook") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org>
master commit: df3e1a5efe700a9f59eced801cac73f9fd02a0e2
master date: 2021-12-10 14:03:56 +0100
Jan Beulich [Fri, 7 Jan 2022 07:52:18 +0000 (08:52 +0100)]
x86: avoid wrong use of all-but-self IPI shorthand
With "nosmp" I did observe a flood of "APIC error on CPU0: 04(04), Send
accept error" log messages on an AMD system. And rightly so - nothing
excludes the use of the shorthand in send_IPI_mask() in this case. Set
"unaccounted_cpus" to "true" also when command line restrictions are the
cause.
Note that PV-shim mode is unaffected by this change, first and foremost
because "nosmp" and "maxcpus=" are ignored in this case.
Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7621880de0bb40bae6436a5b106babc0e4718f4d
master date: 2021-12-10 10:26:52 +0100
Jan Beulich [Fri, 7 Jan 2022 07:51:51 +0000 (08:51 +0100)]
x86/HVM: fail virt-to-linear conversion for insn fetches from non-code segments
Just like (in protected mode) reads may not go to exec-only segments and
writes may not go to non-writable ones, insn fetches may not access data
segments.
Fixes: 623e83716791 ("hvm: Support hardware task switching") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 311297f4216a4387bdae6df6cfbb1f5edb06618a
master date: 2021-12-06 14:15:05 +0100
Jan Beulich [Fri, 7 Jan 2022 07:51:27 +0000 (08:51 +0100)]
VT-d: don't leak domid mapping on error path
While domain_context_mapping() invokes domain_context_unmap() in a sub-
case of handling DEV_TYPE_PCI when encountering an error, thus avoiding
a leak, individual calls to domain_context_mapping_one() aren't
similarly covered. Such a leak might persist until domain destruction.
Leverage that these cases can be recognized by pdev being non-NULL.
Fixes: dec403cc668f ("VT-d: fix iommu_domid for PCI/PCIx devices assignment") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: e6252a51faf42c892eb5fc71f8a2617580832196
master date: 2021-11-24 11:07:11 +0100
Roger Pau Monné [Fri, 7 Jan 2022 07:50:22 +0000 (08:50 +0100)]
efi: fix alignment of function parameters in compat mode
Currently the max_store_size, remain_store_size and max_size in
compat_pf_efi_runtime_call are 4 byte aligned, which makes clang
13.0.0 complain with:
In file included from compat.c:30:
./runtime.c:646:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 2 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.max_store_size,
^
./runtime.c:647:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 3 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.remain_store_size,
^
./runtime.c:648:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 4 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.max_size);
^
Fix this by bouncing the variables on the stack in order for them to
be 8 byte aligned.
Note this could be done in a more selective manner to only apply to
compat code calls, but given the overhead of making an EFI call doing
an extra copy of 3 variables doesn't seem to warrant the special
casing.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Signed-off-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be12fcca8b784e456df3adedbffe657d753c5ff9
master date: 2021-11-19 17:01:24 +0000
xen/arm: Do not invalidate the P2M when the PT is shared with the IOMMU
Set/Way flushes never work correctly in a virtualized environment.
Our current implementation is based on clearing the valid bit in the p2m
pagetable to track guest memory accesses. This technique doesn't work
when the IOMMU is enabled for the domain and the pagetable is shared
between IOMMU and MMU because it triggers IOMMU faults.
Specifically, p2m_invalidate_root causes IOMMU faults if
iommu_use_hap_pt returns true for the domain.
Add a check in p2m_set_way_flush: if a set/way instruction is used
and iommu_use_hap_pt returns true, rather than failing with obscure
IOMMU faults, inject an undef exception straight away into the guest,
and print a verbose error message to explain the problem.
Also add an ASSERT in p2m_invalidate_root to make sure we don't
inadvertently stumble across this problem again in the future.
Ian Jackson [Mon, 6 Dec 2021 14:40:24 +0000 (14:40 +0000)]
MAINTAINERS: Resign from tools stable branch maintainership
Signed-off-by: Ian Jackson <iwj@xenproject.org> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit c623a84c2a4fda1cd25f5347a6298706218eb5fb)
Jan Beulich [Tue, 23 Nov 2021 12:30:09 +0000 (13:30 +0100)]
x86/P2M: deal with partial success of p2m_set_entry()
M2P and PoD stats need to remain in sync with P2M; if an update succeeds
only partially, respective adjustments need to be made. If updates get
made before the call, they may also need undoing upon complete failure
(i.e. including the single-page case).
Log-dirty state would better also be kept in sync.
Note that the change to set_typed_p2m_entry() may not be strictly
necessary (due to the order restriction enforced near the top of the
function), but is being kept here to be on the safe side.
This is CVE-2021-28705 and CVE-2021-28709 / XSA-389.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 74a11c43fd7e074b1f77631b446dd2115eacb9e8
master date: 2021-11-22 12:27:30 +0000
Jan Beulich [Tue, 23 Nov 2021 12:29:54 +0000 (13:29 +0100)]
x86/PoD: handle intermediate page orders in p2m_pod_cache_add()
p2m_pod_decrease_reservation() may pass pages to the function which
aren't 4k, 2M, or 1G. Handle all intermediate orders as well, to avoid
hitting the BUG() at the switch() statement's "default" case.
This is CVE-2021-28708 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ec13f68e0b026863d23e7f44f252d06478bc809
master date: 2021-11-22 12:27:30 +0000
Jan Beulich [Tue, 23 Nov 2021 12:29:41 +0000 (13:29 +0100)]
x86/PoD: deal with misaligned GFNs
Users of XENMEM_decrease_reservation and XENMEM_populate_physmap aren't
required to pass in order-aligned GFN values. (While I consider this
bogus, I don't think we can fix this there, as that might break existing
code, e.g Linux'es swiotlb, which - while affecting PV only - until
recently had been enforcing only page alignment on the original
allocation.) Only non-PoD code paths (guest_physmap_{add,remove}_page(),
p2m_set_entry()) look to be dealing with this properly (in part by being
implemented inefficiently, handling every 4k page separately).
Introduce wrappers taking care of splitting the incoming request into
aligned chunks, without putting much effort in trying to determine the
largest possible chunk at every iteration.
Also "handle" p2m_set_entry() failure for non-order-0 requests by
crashing the domain in one more place. Alongside putting a log message
there, also add one to the other similar path.
Note regarding locking: This is left in the actual worker functions on
the assumption that callers aren't guaranteed atomicity wrt acting on
multiple pages at a time. For mis-aligned GFNs gfn_lock() wouldn't have
locked the correct GFN range anyway, if it didn't simply resolve to
p2m_lock(), and for well-behaved callers there continues to be only a
single iteration, i.e. behavior is unchanged for them. (FTAOD pulling
out just pod_lock() into p2m_pod_decrease_reservation() would result in
a lock order violation.)
This is CVE-2021-28704 and CVE-2021-28707 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 182c737b9ba540ebceb1433f3940fbed6eac4ea9
master date: 2021-11-22 12:27:30 +0000
Julien Grall [Tue, 23 Nov 2021 12:29:09 +0000 (13:29 +0100)]
xen/page_alloc: Harden assign_pages()
domain_tot_pages() and d->max_pages are 32-bit values. While the order
should always be quite small, it would still be possible to overflow
if domain_tot_pages() is near to (2^32 - 1).
As this code may be called by a guest via XENMEM_increase_reservation
and XENMEM_populate_physmap, we want to make sure the guest is not going
to be able to allocate more than it is allowed.
Rework the allocation check to avoid any possible overflow. While the
check domain_tot_pages() < d->max_pages should technically not be
necessary, it is probably best to have it to catch any possible
inconsistencies in the future.
Jan Beulich [Fri, 19 Nov 2021 08:41:41 +0000 (09:41 +0100)]
x86/APIC: avoid iommu_supports_x2apic() on error path
The value it returns may change from true to false in case
iommu_enable_x2apic() fails and, as a side effect, clears iommu_intremap
(as can happen at least on AMD). Latch the return value from the first
invocation to replace the second one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 0f50d1696b3c13cbf0b18fec817fc291d5a30a31
master date: 2021-11-04 14:44:43 +0100
Jan Beulich [Fri, 19 Nov 2021 08:41:09 +0000 (09:41 +0100)]
x86/IOMMU: mark IOMMU / intremap not in use when ACPI tables are missing
x2apic_bsp_setup() gets called ahead of iommu_setup(), and since x2APIC
mode (physical vs clustered) depends on iommu_intremap, that variable
needs to be set to off as soon as we know we can't / won't enable
interrupt remapping, i.e. in particular when parsing of the respective
ACPI tables failed. Move the turning off of iommu_intremap from AMD
specific code into acpi_iommu_init(), accompanying it by clearing of
iommu_enable.
Take the opportunity and also fully skip ACPI table parsing logic on
VT-d when both "iommu=off" and "iommu=no-intremap" are in effect anyway,
like was already the case for AMD.
The tag below only references the commit uncovering a pre-existing
anomaly.
Fixes: d8bd82327b0f ("AMD/IOMMU: obtain IVHD type to use earlier") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 46c4061cd2bf69e8039021af615c2bdb94e50088
master date: 2021-11-04 14:44:01 +0100
x86/xstate: reset cached register values on resume
set_xcr0() and set_msr_xss() use cached value to avoid setting the
register to the same value over and over. But suspend/resume implicitly
reset the registers and since percpu areas are not deallocated on
suspend anymore, the cache gets stale.
Reset the cache on resume, to ensure the next write will really hit the
hardware. Choose value 0, as it will never be a legitimate write to
those registers - and so, will force write (and cache update).
Note the cache is used io get_xcr0() and get_msr_xss() too, but:
- set_xcr0() is called few lines below in xstate_init(), so it will
update the cache with appropriate value
- get_msr_xss() is not used anywhere - and thus not before any
set_msr_xss() that will fill the cache
Fixes: aca2a985a55a "xen: don't free percpu areas during suspend" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f7f4a523927fa4c7598e4647a16bc3e3cf8009d0
master date: 2021-11-04 14:42:37 +0100
Andrew Cooper [Fri, 19 Nov 2021 08:40:19 +0000 (09:40 +0100)]
x86/traps: Fix typo in do_entry_CP()
The call to debugger_trap_entry() should pass the correct vector. The
break-for-gdbsx logic is in practice unreachable because PV guests can't
generate #CP, but it will interfere with anyone inserting custom debugging
into debugger_trap_entry().
Fixes: 5ad05b9c2490 ("x86/traps: Implement #CP handler and extend #PF for shadow stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 512863ed238d7390f74d43f0ba298b1dfa8f4803
master date: 2021-11-03 19:13:17 +0000
Andrew Cooper [Fri, 19 Nov 2021 08:39:46 +0000 (09:39 +0100)]
x86/shstk: Fix use of shadow stacks with XPTI active
The call to setup_cpu_root_pgt(0) in smp_prepare_cpus() is too early. It
clones the BSP's stack while the .data mapping is still in use, causing all
mappings to be fully read read/write (and with no guard pages either). This
ultimately causes #DF when trying to enter the dom0 kernel for the first time.
Defer setting up BSPs XPTI pagetable until reinit_bsp_stack() after we've set
up proper shadow stack permissions.
Fixes: 60016604739b ("x86/shstk: Rework the stack layout to support shadow stacks") Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b2851580b1f2ff121737a37cb25a370d7692ae3b
master date: 2021-11-03 13:08:42 +0000
Dongli Zhang [Fri, 19 Nov 2021 08:39:09 +0000 (09:39 +0100)]
update system time immediately when VCPUOP_register_vcpu_info
The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This is to borrow the idea of
VCPUOP_register_vcpu_time_memory_area, where the
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.
Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().
Jan Beulich [Fri, 19 Nov 2021 08:38:42 +0000 (09:38 +0100)]
x86/paging: restrict physical address width reported to guests
Modern hardware may report more than 48 bits of physical address width.
For paging-external guests our P2M implementation does not cope with
larger values. Telling the guest of more available bits means misleading
it into perhaps trying to actually put some page there (like was e.g.
intermediately done in OVMF for the shared info page).
While there also convert the PV check to a paging-external one (which in
our current code base are synonyms of one another anyway).
Fixes: 5dbd60e16a1f ("x86/shadow: Correct guest behaviour when creating PTEs above maxphysaddr") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b7635526acffbe4ad8ad16fd92812c57742e54c2
master date: 2021-10-19 10:08:30 +0200
Jan Beulich [Fri, 19 Nov 2021 08:38:09 +0000 (09:38 +0100)]
x86/AMD: make HT range dynamic for Fam17 and up
At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.
To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.
While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: d6e38eea2d806c53d976603717aebf6e5de30a1e
master date: 2021-10-19 10:04:13 +0200
Jan Beulich [Fri, 19 Nov 2021 08:37:37 +0000 (09:37 +0100)]
x86emul: de-duplicate scatters to the same linear address
The SDM specifically allows for earlier writes to fully overlapping
ranges to be dropped. If a guest did so, hvmemul_phys_mmio_access()
would crash it if varying data was written to the same address. Detect
overlaps early, as doing so in hvmemul_{linear,phys}_mmio_access() would
be quite a bit more difficult. To maintain proper faulting behavior,
instead of dropping earlier write instances of fully overlapping slots
altogether, write the data of the final of these slots multiple times.
(We also can't pull ahead the [single] write of the data of the last of
the slots, clearing all involved slots' op_mask bits together, as this
would yield incorrect results if there were intervening partially
overlapping ones.)
Note that due to cache slot use being linear address based, there's no
similar issue with multiple writes to the same physical address (mapped
through different linear addresses).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a8cddbac5051020bb4a59a7f0ea27500c51063fb
master date: 2021-10-19 10:02:39 +0200
Jan Beulich [Fri, 15 Oct 2021 09:20:04 +0000 (11:20 +0200)]
x86/PV32: fix physdev_op_compat handling
The conversion of the original code failed to recognize that the 32-bit
compat variant of this (sorry, two different meanings of "compat" here)
needs to continue to invoke the compat handler, not the native one.
Arrange for this by adding yet another #define.
Affected functions (having existed prior to the introduction of the new
hypercall) are PHYSDEVOP_set_iobitmap and PHYSDEVOP_apic_{read,write}.
For all others the operand struct layout doesn't differ.
Fixes: 1252e2823117 ("x86/pv: Export pv_hypercall_table[] rather than working around it in several ways") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 834cb8761051f7d87816785c0d99fe9bd5f0ce30
master date: 2021-10-12 11:55:42 +0200
Jan Beulich [Fri, 15 Oct 2021 09:19:41 +0000 (11:19 +0200)]
AMD/IOMMU: consider hidden devices when flushing device I/O TLBs
Hidden devices are associated with DomXEN but usable by the
hardware domain. Hence they need flushing as well when all devices are
to have flushes invoked.
While there drop a redundant ATS-enabled check and constify the first
parameter of the involved function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 036432e8b27e1ef21e0f0204ba9b0e3972a031c2
master date: 2021-10-12 11:54:34 +0200
Jan Beulich [Fri, 15 Oct 2021 09:19:11 +0000 (11:19 +0200)]
x86/HVM: fix xsm_op for 32-bit guests
Like for PV, 32-bit guests need to invoke the compat handler, not the
native one.
Fixes: db984809d61b ("hvm: wire up domctl and xsm hypercalls") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6b672e8a925ff4b71a1a67bc7d213ef445af74f
master date: 2021-10-11 10:58:44 +0200
Jan Beulich [Fri, 15 Oct 2021 09:18:51 +0000 (11:18 +0200)]
x86/build: suppress EFI-related tool chain checks upon local $(MAKE) recursion
The xen-syms and xen.efi linking steps are serialized only when the
intermediate note.o file is necessary. Otherwise both may run in
parallel. This in turn means that the compiler / linker invocations to
create efi/check.o / efi/check.efi may also happen twice in parallel.
Obviously it's a bad idea to have multiple producers of the same output
race with one another - every once in a while one may e.g. observe
objdump: efi/check.efi: file format not recognized
We don't need this EFI related checking to occur when producing the
intermediate symbol and relocation table objects, and we have an easy
way of suppressing it: Simply pass in "efi-y=", overriding the
assignments done in the Makefile and thus forcing the tool chain checks
to be bypassed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 24b0ce9a5da2e648cde818055a085bcbcf24ecb0
master date: 2021-10-11 10:58:17 +0200
Jan Beulich [Fri, 15 Oct 2021 09:17:32 +0000 (11:17 +0200)]
VT-d: consider hidden devices when unmapping
Whether to clear an IOMMU's bit in the domain's bitmap should depend on
all devices the domain can control. For the hardware domain this
includes hidden devices, which are associated with DomXEN.
While touching related logic
- convert the "current device" exclusion check to a simple pointer
comparison,
- convert "found" to "bool",
- adjust style and correct a typo in an existing comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 75bfe6ec4844f83b300b9807bceaed1e2fe23270
master date: 2021-09-20 10:24:27 +0200
Roger Pau Monné [Fri, 15 Oct 2021 09:16:41 +0000 (11:16 +0200)]
x86: quote section names when defining them in linker script
LLVM ld seems to require section names to be quoted at both definition
and when referencing them for a match to happen, or else we get the
following errors:
The original fix for GNU ld 2.37 only quoted the section name when
referencing it in the ADDR function. Fix by also quoting the section
names when declaring them.
Fixes: 58ad654ebce7 ("x86: work around build issue with GNU ld 2.37") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6254920587c33bcc7ab884e6c9a11cfc0d5867ab
master date: 2021-09-15 11:02:21 +0200
Andrew Cooper [Fri, 15 Oct 2021 09:15:14 +0000 (11:15 +0200)]
x86/amd: Use newer SSBD mechanisms if they exist
The opencoded legacy Memory Disambiguation logic in init_amd() neglected
Fam19h for the Zen3 microarchitecture. Further more, all Zen2 based system
have the architectural MSR_SPEC_CTRL and the SSBD bit within it, so shouldn't
be using MSR_AMD64_LS_CFG.
Implement the algorithm given in AMD's SSBD whitepaper, and leave a
printk_once() behind in the case that no controls can be found.
This now means that a user explicitly choosing `spec-ctrl=ssbd` will properly
turn off Memory Disambiguation on Fam19h/Zen3 systems.
This still remains a single system-wide setting (for now), and is not context
switched between vCPUs. As such, it doesn't interact with Intel's use of
MSR_SPEC_CTRL and default_xen_spec_ctrl (yet).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2a4e6c4e4bea2e0bb720418c331ee28ff9c7632e
master date: 2021-09-08 14:16:19 +0100
Andrew Cooper [Fri, 15 Oct 2021 09:14:46 +0000 (11:14 +0200)]
x86/amd: Enumeration for speculative features/hints
There is a step change in speculation protections between the Zen1 and Zen2
microarchitectures.
Zen1 and older have no special support. Control bits in non-architectural
MSRs are used to make lfence be dispatch-serialising (Spectre v1), and to
disable Memory Disambiguation (Speculative Store Bypass). IBPB was
retrofitted in a microcode update, and software methods are required for
Spectre v2 protections.
Because the bit controlling Memory Disambiguation is model specific,
hypervisors are expected to expose a MSR_VIRT_SPEC_CTRL interface which
abstracts the model specific details.
Zen2 and later implement the MSR_SPEC_CTRL interface in hardware, and
virtualise the interface for HVM guests to use. A number of hint bits are
specified too to help guide OS software to the most efficient mitigation
strategy.
Zen3 introduced a new feature, Predictive Store Forwarding, along with a
control to disable it in sensitive code.
Add CPUID and VMCB details for all the new functionality.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 747424c664bb164a04e7a9f2ffbf02d4a1630d7d
master date: 2021-09-08 14:16:19 +0100
Andrew Cooper [Fri, 15 Oct 2021 09:14:16 +0000 (11:14 +0200)]
x86/spec-ctrl: Split the "Hardware features" diagnostic line
Separate the read-only hints from the features requiring active actions on
Xen's behalf.
Also take the opportunity split the IBRS/IBPB and IBPB mess. More features
with overlapping enumeration are on the way, and and it is not useful to split
them like this.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 565ebcda976c05b0c6191510d5e32b621a2b1867
master date: 2021-09-08 14:16:19 +0100
Jan Beulich [Fri, 1 Oct 2021 13:05:42 +0000 (15:05 +0200)]
VT-d: fix deassign of device with RMRR
Ignoring a specific error code here was not meant to short circuit
deassign to _just_ the unmapping of RMRRs. This bug was previously
hidden by the bogus (potentially indefinite) looping in
pci_release_devices(), until f591755823a7 ("IOMMU/PCI: don't let domain
cleanup continue when device de-assignment failed") fixed that loop.
This is CVE-2021-28702 / XSA-386.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Reported-by: Ivan Kardykov <kardykov@tabit.pro> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ivan Kardykov <kardykov@tabit.pro>
(cherry picked from commit 24ebe875a77833696bbe5c9372e9e1590a7e7101)
Jan Beulich [Wed, 8 Sep 2021 12:53:04 +0000 (14:53 +0200)]
gnttab: deal with status frame mapping race
Once gnttab_map_frame() drops the grant table lock, the MFN it reports
back to its caller is free to other manipulation. In particular
gnttab_unpopulate_status_frames() might free it, by a racing request on
another CPU, thus resulting in a reference to a deallocated page getting
added to a domain's P2M.
Obtain a page reference in gnttab_map_frame() to prevent freeing of the
page until xenmem_add_to_physmap_one() has actually completed its acting
on the page. Do so uniformly, even if only strictly required for v2
status pages, to avoid extra conditionals (which then would all need to
be kept in sync going forward).
Jan Beulich [Wed, 8 Sep 2021 12:52:13 +0000 (14:52 +0200)]
x86/p2m-pt: fix p2m_flags_to_access()
The initial if() was inverted, invalidating all output from this
function. Which in turn means the mirroring of P2M mappings into the
IOMMU didn't always work as intended: Mappings may have got updated when
there was no need to. There would not have been too few (un)mappings;
what saves us is that alongside the flags comparison MFNs also get
compared, with non-present entries always having an MFN of 0 or
INVALID_MFN while present entries always have MFNs different from these
two (0 in the table also meant to cover INVALID_MFN):
OLD NEW
P W access MFN P W access MFN
0 0 r 0 0 0 n 0
0 1 rw 0 0 1 n 0
1 0 n non-0 1 0 r non-0
1 1 n non-0 1 1 rw non-0
present <-> non-present transitions are fine because the MFNs differ.
present -> present transitions as well as non-present -> non-present
ones are potentially causing too many map/unmap operations, but never
too few, because in that case old (bogus) and new access differ.
Fixes: d1bb6c97c31e ("IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()) Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e70a9a043a5ce6d4025420f729bc473f711bf5d1
master date: 2021-09-07 14:24:49 +0200
Jan Beulich [Wed, 8 Sep 2021 12:51:48 +0000 (14:51 +0200)]
x86/P2M: relax guarding of MMIO entries
One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. At least in
the case of PVH Dom0 hitting an RMRR covered by an E820 ACPI region,
this is too strict. Generally short-circuit requests establishing the
same kind of mapping (mfn, type), but allow permissions to differ.
While there, also add a log message to the other domain_crash()
invocation that did prevent PVH Dom0 from coming up after the XSA-378
changes.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 111469cc7b3f586c2335e70205320ed3c828b89e
master date: 2021-09-07 09:39:38 +0200
Jan Beulich [Wed, 8 Sep 2021 12:51:24 +0000 (14:51 +0200)]
x86/PVH: de-duplicate mappings for first Mb of Dom0 memory
One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. This means we
need to be more careful about the mappings put in place in this range -
mappings should be created exactly once:
- iommu_hwdom_init() comes first; it should avoid the first Mb,
- pvh_populate_p2m() should insert identity mappings only into ranges
not populated as RAM,
- pvh_setup_acpi() should again avoid the first Mb, which was already
dealt with at that point.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6b4f6a31ace125d658a581e8d10809e4fccdc272
master date: 2021-08-31 17:43:36 +0200
Jan Beulich [Wed, 8 Sep 2021 12:50:39 +0000 (14:50 +0200)]
gnttab: avoid triggering assertion in radix_tree_ulong_to_ptr()
Relevant quotes from the C11 standard:
"Except where explicitly stated otherwise, for the purposes of this
subclause unnamed members of objects of structure and union type do not
participate in initialization. Unnamed members of structure objects
have indeterminate value even after initialization."
"If there are fewer initializers in a brace-enclosed list than there are
elements or members of an aggregate, [...], the remainder of the
aggregate shall be initialized implicitly the same as objects that have
static storage duration."
"If an object that has static or thread storage duration is not
initialized explicitly, then:
[...]
— if it is an aggregate, every member is initialized (recursively)
according to these rules, and any padding is initialized to zero
bits;
[...]"
"A bit-field declaration with no declarator, but only a colon and a
width, indicates an unnamed bit-field." Footnote: "An unnamed bit-field
structure member is useful for padding to conform to externally imposed
layouts."
"There may be unnamed padding within a structure object, but not at its
beginning."
Which makes me conclude:
- Whether an unnamed bit-field member is an unnamed member or padding is
unclear, and hence also whether the last quote above would render the
big endian case of the structure declaration invalid.
- Whether the number of members of an aggregate includes unnamed ones is
also not really clear.
- The initializer in map_grant_ref() initializes all fields of the "cnt"
sub-structure of the union, so assuming the second quote above applies
here (indirectly), the compiler isn't required to implicitly
initialize the rest (i.e. in particular any padding) like would happen
for static storage duration objects.
Gcc 7.4.1 can be observed (apparently in debug builds only) to translate
aforementioned initializer to a read-modify-write operation of a stack
variable, leaving unchanged the top two bits of whatever was previously
in that stack slot. Clearly if either of the two bits were set,
radix_tree_ulong_to_ptr()'s assertion would trigger.
Therefore, to be on the safe side, add an explicit padding field for the
non-big-endian-bitfields case and give a dummy name to both padding
fields.
Fixes: 9781b51efde2 ("gnttab: replace mapkind()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6da9d0414d69c2682214ee3ecf9816fcac500d0
master date: 2021-08-27 10:54:46 +0200
Anthony PERARD [Tue, 1 Jun 2021 10:28:03 +0000 (11:28 +0100)]
tools/firmware/ovmf: Use OvmfXen platform file is exist
A platform introduced in EDK II named OvmfXen is now the one to use for
Xen instead of OvmfX64. It comes with PVH support.
Also, the Xen support in OvmfX64 is deprecated,
"deprecation notice: *dynamic* multi-VMM (QEMU vs. Xen) support in OvmfPkg"
https://edk2.groups.io/g/devel/message/75498
Juergen Gross [Wed, 25 Aug 2021 13:11:24 +0000 (15:11 +0200)]
xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
With smt=0 during a suspend/resume cycle of the machine the threads
which have been parked before will briefly come up again. This can
result in problems e.g. with cpufreq driver being active as this will
call into get_cpu_idle_time() for a cpu without initialized scheduler
data.
Fix that by letting get_cpu_idle_time() deal with this case. Drop a
redundant check in exchange.
Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 5293470a77ad980dce2af9b7e6c3f11eeebf1b64
master date: 2021-08-19 13:38:31 +0200
Jan Beulich [Wed, 25 Aug 2021 13:11:11 +0000 (15:11 +0200)]
VT-d: Tylersburg errata apply to further steppings
While for 5500 and 5520 chipsets only B3 and C2 are mentioned in the
spec update, X58's also mentions B2, and searching the internet suggests
systems with this stepping are actually in use. Even worse, for X58
erratum #69 is marked applicable even to C2. Split the check to cover
all applicable steppings and to also report applicable errata numbers in
the log message. The splitting requires using the DMI port instead of
the System Management Registers device, but that's then in line (also
revision checking wise) with the spec updates.
Fixes: 6890cebc6a98 ("VT-d: deal with 5500/5520/X58 errata") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 517a90d1ca09ce00e50d46ac25566cc3bd2eb34d
master date: 2021-08-18 09:44:14 +0200
Andrew Cooper [Wed, 25 Aug 2021 13:10:58 +0000 (15:10 +0200)]
x86/cet: Fix shskt manipulation error with BUGFRAME_{warn,run_fn}
This was a clear oversight in the original CET work. The BUGFRAME_run_fn and
BUGFRAME_warn paths update regs->rip without an equivalent adjustment to the
shadow stack, causing IRET to suffer #CP because of the mismatch.
One subtle, and therefore fragile, aspect of extable_shstk_fixup() was that it
required regs->rip to have its old value as a cross-check that the right word
in the shadow stack was being edited.
Rework extable_shstk_fixup() into fixup_exception_return() which takes
ownership of the update to both the regular and shadow stacks, ensuring that
the regs->rip update is ordered correctly.
Use the new fixup_exception_return() for BUGFRAME_run_fn and BUGFRAME_warn to
ensure that the shadow stack is updated too.
Fixes: 209fb9919b50 ("x86/extable: Adjust extable handling to be shadow stack compatible") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/cet: Fix build on newer versions of GCC
Some versions of GCC complain with:
traps.c:405:22: error: 'get_shstk_bottom' defined but not used [-Werror=unused-function]
static unsigned long get_shstk_bottom(unsigned long sp)
^~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Change #ifdef to if ( IS_ENABLED(...) ) to make the sole user of
get_shstk_bottom() visible to the compiler.
Fixes: 35727551c070 ("x86/cet: Fix shskt manipulation error with BUGFRAME_{warn,run_fn}") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Compile-tested-by: Jan Beulich <jbeulich@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: 35727551c0703493a2240e967cffc3063b13d49c
master date: 2021-08-16 16:03:20 +0100
master commit: 54c9736382e0d558a6acd820e44185e020131c48
master date: 2021-08-17 12:55:48 +0100
Dario Faggioli [Wed, 25 Aug 2021 13:10:45 +0000 (15:10 +0200)]
credit2: avoid picking a spurious idle unit when caps are used
Commit 07b0eb5d0ef0 ("credit2: make sure we pick a runnable unit from the
runq if there is one") did not fix completely the problem of potentially
selecting a scheduling unit that will then not be able to run.
In fact, in case caps are used and the unit we are currently looking
at, during the runqueue scan, does not have enough budget for being run,
we should continue looking instead than giving up and picking the idle
unit.
Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0f742839ae57e10687e7a573070c37430f31068c
master date: 2021-08-10 09:29:10 +0200
Jane Malalane [Wed, 25 Aug 2021 13:10:32 +0000 (15:10 +0200)]
xen/lib: Fix strcmp() and strncmp()
The C standard requires that each character be compared as unsigned
char. Xen's current behaviour compares as signed char, which changes
the answer when chars with a value greater than 0x7f are used.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
master commit: 3747a2bb67daa5a8baeff6cda57dc98a5ef79c3e
master date: 2021-07-30 10:52:46 +0100
Andrew Cooper [Wed, 25 Aug 2021 13:10:18 +0000 (15:10 +0200)]
x86/hvm: Propagate real error information up through hvm_load()
hvm_load() is currently a mix of -errno and -1 style error handling, which
aliases -EPERM. This leads to the following confusing diagnostics:
From userspace:
xc: info: Restoring domain
xc: error: Unable to restore HVM context (1 = Operation not permitted): Internal error
xc: error: Restore failed (1 = Operation not permitted): Internal error
xc_domain_restore: [1] Restore failed (1 = Operation not permitted)
From Xen:
(XEN) HVM10.0 restore: inconsistent xsave state (feat=0x2ff accum=0x21f xcr0=0x7 bv=0x3 err=-22)
(XEN) HVM10 restore: failed to load entry 16/0
The actual error was a bad backport, but the -EINVAL got converted to -EPERM
on the way out of the hypercall.
The overwhelming majority of *_load() handlers already use -errno consistenty.
Fix up the rest to be consistent, and fix a few other errors noticed along the
way.
* Failures of hvm_load_entry() indicate a truncated record or other bad data
size. Use -ENODATA.
* Don't use {g,}dprintk(). Omitting diagnostics in release builds is rude,
and almost everything uses unconditional printk()'s.
* Switch some errors for more appropriate ones.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 96e5ad4c476e70688295b3cfb537847a3351d6fd
master date: 2021-07-19 14:34:38 +0100
Julien Grall [Wed, 25 Aug 2021 13:08:29 +0000 (15:08 +0200)]
xen/arm: Restrict the amount of memory that dom0less domU and dom0 can allocate
Currently, both dom0less domUs and dom0 can allocate an "unlimited"
amount of memory because d->max_pages is set to ~0U.
In particular, the former are meant to be unprivileged. Therefore the
memory they could allocate should be bounded. As the domain are not yet
officially aware of Xen (we don't expose advertise it in the DT, yet
the hypercalls are accessible), they should not need to allocate more
than the initial amount. So cap set d->max_pages directly the amount of
memory we are meant to allocate.
Take the opportunity to also restrict the memory for dom0 as the
domain is direct mapped (e.g. MFN == GFN) and therefore cannot
allocate outside of the pre-allocated region.
Jan Beulich [Wed, 25 Aug 2021 13:08:09 +0000 (15:08 +0200)]
gnttab: fix array capacity check in gnttab_get_status_frames()
The number of grant frames is of no interest here; converting the passed
in op.nr_frames this way means we allow for 8 times as many GFNs to be
written as actually fit in the array. We would corrupt xlat areas of
higher vCPU-s (after having faulted many times while trying to write to
the guard pages between any two areas) for 32-bit PV guests. For HVM
guests we'd simply crash as soon as we hit the first guard page, as
accesses to the xlat area are simply memcpy() there.
This is CVE-2021-28699 / XSA-382.
Fixes: 18b1be5e324b ("gnttab: make resource limits per domain") Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: ec820035b875cdbedce5e73f481ce65963ede9ed
master date: 2021-08-25 14:19:09 +0200
Jan Beulich [Wed, 25 Aug 2021 13:07:40 +0000 (15:07 +0200)]
gnttab: replace mapkind()
mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.
To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.
As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.
As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.
Jan Beulich [Wed, 25 Aug 2021 13:07:25 +0000 (15:07 +0200)]
gnttab: add preemption check to gnttab_release_mappings()
A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).
Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.
In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.
This is part of CVE-2021-28698 / XSA-380.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: b1ee10be5625b7d502cef1e6ee3818610ab0d29c
master date: 2021-08-25 14:18:18 +0200
Jan Beulich [Wed, 25 Aug 2021 13:07:09 +0000 (15:07 +0200)]
x86/mm: widen locked region in xenmem_add_to_physmap_one()
For pages which can be made part of the P2M by the guest, but which can
also later be de-allocated (grant table v2 status pages being the
present example), it is imperative that they be mapped at no more than a
single GFN. We therefore need to make sure that of two parallel
XENMAPSPACE_grant_table requests for the same status page one completes
before the second checks at which other GFN the underlying MFN is
presently mapped.
Pull ahead the respective get_gfn() and push down the respective
put_gfn(). This leverages that gfn_lock() really aliases p2m_lock(), but
the function makes this assumption already anyway: In the
XENMAPSPACE_gmfn case lock nesting constraints for both involved GFNs
would otherwise need to be enforced to avoid ABBA deadlocks.
Jan Beulich [Wed, 25 Aug 2021 13:06:51 +0000 (15:06 +0200)]
x86/p2m: guard (in particular) identity mapping entries
Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).
As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.
Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.
Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.
This is CVE-2021-28694 / part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 753cb68e653002e89fdcd1c80e52905fdbfb78cb
master date: 2021-08-25 14:17:32 +0200