Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/cpuid: Enable MSR_SPEC_CTRL in SVM guests by default
With all other pieces in place, MSR_SPEC_CTRL is fully working for HVM guests.
Update the CPUID derivation logic (both PV and HVM to avoid losing subtle
changes), drop the MSR intercept, and explicitly enable the CPUID bits for HVM
guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a7e7c7260cde78a148810db5320cbf39686c3e09)
Andrew Cooper [Mon, 17 Jan 2022 20:29:09 +0000 (20:29 +0000)]
x86/msr: AMD MSR_SPEC_CTRL infrastructure
Fill in VMCB accessors for spec_ctrl in svm_{get,set}_reg(), and CPUID checks
for all supported bits in guest_{rd,wr}msr().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 22b9add22b4a9af37305c8441fec12cb26bd142b)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL
Hardware maintains both host and guest versions of MSR_SPEC_CTRL, but guests
run with the logical OR of both values. Therefore, in principle we want to
clear Xen's value before entering the guest. However, for migration
compatibility (future work), and for performance reasons with SEV-SNP guests,
we want the ability to use a nonzero value behind the guest's back. Use
vcpu_msrs to hold this value, with the guest value in the VMCB.
On the VMEntry path, adjusting MSR_SPEC_CTRL must be done after CLGI so as to
be atomic with respect to NMIs/etc.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 614cec7d79d76786f5638a6e4da0576b57732ca1)
Andrew Cooper [Fri, 21 Jan 2022 15:59:03 +0000 (15:59 +0000)]
x86/spec-ctrl: Use common MSR_SPEC_CTRL logic for AMD
Currently, amd_init_ssbd() works by being the only write to MSR_SPEC_CTRL in
the system. This ceases to be true when using the common logic.
Include AMD MSR_SPEC_CTRL in has_spec_ctrl to activate the common paths, and
introduce an AMD specific block to control alternatives. Also update the
boot/resume paths to configure default_xen_spec_ctrl.
svm.h needs an adjustment to remove a dependency on include order.
For now, only activate alternatives for HVM - PV will require more work. No
functional change, as no alternatives are defined for HVM yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 378f2e6df31442396f0afda19794c5c6091d96f9)
Andrew Cooper [Fri, 28 Jan 2022 11:57:19 +0000 (11:57 +0000)]
x86/spec-ctrl: Record the last write to MSR_SPEC_CTRL
In some cases, writes to MSR_SPEC_CTRL do not have interesting side effects,
and we should implement lazy context switching like we do with other MSRs.
In the short term, this will be used by the SVM infrastructure, but I expect
to extend it to other contexts in due course.
Introduce cpu_info.last_spec_ctrl for the purpose, and cache writes made from
the boot/resume paths. The value can't live in regular per-cpu data when it
is eventually used for PV guests when XPTI might be active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 00f2992b6c7a9d4090443c1a85bf83224a87eeb9)
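The lazy context switching idea described above can be sketched roughly as follows. This is a minimal illustration, not the Xen implementation: `wrmsr_spec_ctrl()` is a hypothetical stand-in that only counts writes, and `struct cpu_info_sketch` is an invented container. Only the field name `last_spec_ctrl` comes from the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real MSR write; it only counts writes. */
static unsigned int hw_writes;
static uint32_t hw_spec_ctrl;

static void wrmsr_spec_ctrl(uint32_t val)
{
    hw_spec_ctrl = val;
    hw_writes++;
}

/* Illustrative per-CPU block; only last_spec_ctrl is named in the commit. */
struct cpu_info_sketch {
    uint32_t last_spec_ctrl;
};

/* Skip the (expensive) MSR write when the cached value already matches. */
static void lazy_set_spec_ctrl(struct cpu_info_sketch *info, uint32_t val)
{
    if ( info->last_spec_ctrl == val )
        return;

    wrmsr_spec_ctrl(val);
    info->last_spec_ctrl = val;
}
```

The cache is only valid if every write goes through the same helper, which is presumably why the boot/resume paths are updated to cache their writes too.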
Andrew Cooper [Fri, 28 Jan 2022 12:03:42 +0000 (12:03 +0000)]
x86/spec-ctrl: Don't use spec_ctrl_{enter,exit}_idle() for S3
'idle' here refers to hlt/mwait. The S3 path isn't an idle path - it is a
platform reset.
We need to load default_xen_spec_ctrl unilaterally on the way back up.
Currently it happens as a side effect of X86_FEATURE_SC_MSR_IDLE or the next
return-to-guest, but that's fragile behaviour.
Conversely, there is no need to clear IBRS and flush the store buffers on the
way down; we're microseconds away from cutting power.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 71fac402e05ade7b0af2c34f77517449f6f7e2c1)
Andrew Cooper [Tue, 25 Jan 2022 17:14:48 +0000 (17:14 +0000)]
x86/spec-ctrl: Introduce new has_spec_ctrl boolean
Most MSR_SPEC_CTRL setup will be common between Intel and AMD. Instead of
opencoding an OR of two features everywhere, introduce has_spec_ctrl instead.
Reword the comment above the Intel specific alternatives block to highlight
that it is Intel specific, and pull the setting of default_xen_spec_ctrl.IBRS
out because it will want to be common.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5d9eff3a312763d889cfbf3c8468b6dfb3ab490c)
Andrew Cooper [Tue, 25 Jan 2022 16:09:59 +0000 (16:09 +0000)]
x86/spec-ctrl: Drop use_spec_ctrl boolean
Several bugfixes have reduced the utility of this variable from its original
purpose, and now all it does is aid in the setup of SCF_ist_wrmsr.
Simplify the logic by dropping the variable, and doubling up the setting of
SCF_ist_wrmsr for the PV and HVM blocks, which will make the AMD SPEC_CTRL
support easier to follow. Leave a comment explaining why SCF_ist_wrmsr is
still necessary for the VMExit case.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ec083bf552c35e10347449e21809f4780f8155d2)
Andrew Cooper [Thu, 27 Jan 2022 21:28:48 +0000 (21:28 +0000)]
x86/cpuid: Advertise SSB_NO to guests by default
This is a statement of hardware behaviour, and not related to controls for the
guest kernel to use. Pass it straight through from hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 15b7611efd497c4b65f350483857082cb70fc348)
Andrew Cooper [Wed, 19 Jan 2022 19:55:02 +0000 (19:55 +0000)]
x86/msr: Fix migration compatibility issue with MSR_SPEC_CTRL
This bug existed early in 2018 between MSR_SPEC_CTRL arriving in microcode,
and SSBD arriving a few months later. It went unnoticed presumably because
everyone was busy rebooting everything.
The same bug will reappear when adding PSFD support.
Clamp the guest MSR_SPEC_CTRL value to that permitted by CPUID on migrate.
The guest is already playing with reserved bits at this point, and clamping
the value will prevent a migration to a less capable host from failing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 969a57f73f6b011b2ebf4c0ab1715efc65837335)
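Clamping the value to what CPUID permits can be sketched as below. This is an illustration under stated assumptions: `struct policy_sketch` and both helper functions are hypothetical and much simpler than Xen's real CPUID policy handling. The bit positions for IBRS, STIBP and SSBD are the architectural ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MSR_SPEC_CTRL bit positions. */
#define SPEC_CTRL_IBRS  (UINT64_C(1) << 0)
#define SPEC_CTRL_STIBP (UINT64_C(1) << 1)
#define SPEC_CTRL_SSBD  (UINT64_C(1) << 2)

/* Hypothetical, simplified view of the features the guest can see. */
struct policy_sketch {
    bool ibrs, stibp, ssbd;
};

/* Build the set of bits the guest is permitted to have set. */
static uint64_t spec_ctrl_valid_bits(const struct policy_sketch *p)
{
    return (p->ibrs  ? SPEC_CTRL_IBRS  : 0) |
           (p->stibp ? SPEC_CTRL_STIBP : 0) |
           (p->ssbd  ? SPEC_CTRL_SSBD  : 0);
}

/* Drop any bits the policy does not advertise instead of failing. */
static uint64_t clamp_spec_ctrl(const struct policy_sketch *p, uint64_t val)
{
    return val & spec_ctrl_valid_bits(p);
}
```

Masking rather than rejecting means a value containing a bit unknown to the destination loses only that bit, so the restore does not fail outright.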
It was Xen 4.14 where CPUID data was added to the migration stream, and 4.13
that we need to worry about with regards to compatibility. Xen 4.12 isn't
relevant.
Expand and correct the commentary.
Fixes: 111c8c33a8a1 ("x86/cpuid: do not expand max leaves on restore") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 820cc393434097f3b7976acdccbf1d96071d6d23)
Andrew Cooper [Tue, 1 Feb 2022 13:34:49 +0000 (13:34 +0000)]
x86/vmx: Drop spec_ctrl load in VMEntry path
This is not needed now that the VMEntry path is not responsible for loading
the guest's MSR_SPEC_CTRL value.
Fixes: 81f0eaadf84d ("x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9ce3ef20b4f085a7dc8ee41b0fec6fdeced3773e)
Roger Pau Monné [Wed, 26 Jan 2022 11:18:07 +0000 (12:18 +0100)]
x86/pvh: fix population of the low 1MB for dom0
RMRRs are set up ahead of populating the p2m and hence the ASSERT when
populating the low 1MB needs to be relaxed when it finds an existing
entry: it's either RAM or a RMRR resulting from the IOMMU setup.
Rework the logic a bit and introduce a local mfn variable in order to
assert that if the gfn is populated and not RAM it is an identity map.
Fixes: 6b4f6a31ac ('x86/PVH: de-duplicate mappings for first Mb of Dom0 memory') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d5fc9120d556ec3c4b1acf0ab5660a6d3f7ebeb
master date: 2022-01-25 10:52:24 +0000
Andrew Cooper [Wed, 26 Jan 2022 11:17:04 +0000 (12:17 +0100)]
x86: Fix build with the get/set_reg() infrastructure
I clearly messed up concluding that the stubs were safe to drop.
The is_{pv,hvm}_domain() predicates are not symmetrical with both CONFIG_PV
and CONFIG_HVM. As a result logic of the form `if ( pv/hvm ) ... else ...`
will always have one side which can't be DCE'd.
While technically only the hvm stubs are needed, due to the use of the
is_pv_domain() predicate in guest_{rd,wr}msr(), sort out the pv stubs too to
avoid leaving a bear trap for future users.
Fixes: 88d3ff7ab15d ("x86/guest: Introduce {get,set}_reg() infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 13caa585791234fe3e3719c8376f7ea731012451
master date: 2022-01-21 12:42:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:39:44 +0000 (13:39 +0100)]
x86/spec-ctrl: Fix NMI race condition with VT-x MSR_SPEC_CTRL handling
The logic was based on a mistaken understanding of how NMI blocking on vmexit
works. NMIs are only blocked for EXIT_REASON_NMI, and not for general exits.
Therefore, an NMI can in general hit early in the vmx_asm_vmexit_handler path,
and the guest's value will be clobbered before it is saved.
Switch to using MSR load/save lists. This causes the guest value to be saved
atomically with respect to NMIs/MCEs/etc.
First, update vmx_cpuid_policy_changed() to configure the load/save lists at
the same time as configuring the intercepts. This function is always used in
remote context, so extend the vmx_vmcs_{enter,exit}() block to cover the whole
function, rather than having multiple remote acquisitions of the same VMCS.
Both of vmx_{add,del}_guest_msr() can fail. The -ESRCH delete case is fine,
but all others are fatal to the running of the VM, so handle them using
domain_crash() - this path is only used during domain construction anyway.
Second, update vmx_{get,set}_reg() to use the MSR load/save lists rather than
vcpu_msrs, and update the vcpu_msrs comment to describe the new state
location.
Finally, adjust the entry/exit asm.
Because the guest value is saved and loaded atomically, we do not need to
manually load the guest value, nor do we need to enable SCF_use_shadow. This
lets us remove the use of DO_SPEC_CTRL_EXIT_TO_GUEST. Additionally,
SPEC_CTRL_ENTRY_FROM_PV gets removed too, because on an early entry failure,
we're no longer in the guest MSR_SPEC_CTRL context needing to switch back to
Xen's context.
The only action remaining is to load Xen's MSR_SPEC_CTRL value on vmexit. We
could in principle use the host msr list, but that is expected to complicate
future work. Delete DO_SPEC_CTRL_ENTRY_FROM_HVM entirely, and use a shorter
code sequence to simply reload Xen's setting from the top-of-stack block.
Adjust the comment at the top of spec_ctrl_asm.h in light of this bugfix.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 81f0eaadf84d273a6ff8df3660b874a02d0e7677
master date: 2022-01-20 16:32:11 +0000
Andrew Cooper [Tue, 25 Jan 2022 12:39:31 +0000 (13:39 +0100)]
x86/spec-ctrl: Drop SPEC_CTRL_{ENTRY_FROM,EXIT_TO}_HVM
These were written before Spectre/Meltdown went public, and there was large
uncertainty in how the protections would evolve. As it turns out, they're
very specific to Intel hardware, and not very suitable for AMD.
Drop the macros, opencoding the relevant subset of functionality, and leaving
grep-fodder to locate the logic. No change at all for VT-x.
For AMD, the only relevant piece of functionality is DO_OVERWRITE_RSB,
although we will soon be adding (different) logic to handle MSR_SPEC_CTRL.
This has a marginal improvement of removing an unconditional pile of long-nops
from the vmentry/exit path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 95b13fa43e0753b7514bef13abe28253e8614f62
master date: 2022-01-20 16:32:11 +0000
Various registers have per-guest-type or per-vendor locations or access
requirements. To support their use from common code, provide accessors which
allow for per-guest-type behaviour.
For now, just infrastructure handling default cases and expectations.
Subsequent patches will start handling registers using this infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 88d3ff7ab15da277a85b39735797293fb541c718
master date: 2022-01-20 16:32:11 +0000
Jason Andryuk [Tue, 25 Jan 2022 12:38:14 +0000 (13:38 +0100)]
libxl/PCI: Fix PV hotplug & stubdom coldplug
commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
reflected in the config" broke PCI hotplug (xl pci-attach) for PV
domains when it moved libxl__create_pci_backend() later in the function.
This also broke HVM + stubdom PCI passthrough coldplug. For that, the
PCI devices are hotplugged to a running PV stubdom, and then the QEMU
QMP device_add commands are made to QEMU inside the stubdom.
A running PV domain calls libxl__wait_for_backend(). With the current
placement of libxl__create_pci_backend(), the path does not exist and
the call immediately fails:
libxl: error: libxl_device.c:1388:libxl__wait_for_backend: Backend /local/domain/0/backend/pci/43/0 does not exist
libxl: error: libxl_pci.c:1764:device_pci_add_done: Domain 42:libxl__device_pci_add failed for PCI device 0:2:0.0 (rc -3)
libxl: error: libxl_create.c:1857:domcreate_attach_devices: Domain 42:unable to add pci devices
The wait is only relevant when:
1) The domain is PV
2) The domain is running
3) The backend is already present
This is because:
1) xen-pcifront is only used for PV. It does not load for HVM domains
where QEMU is used.
2) If the domain is not running (starting), then the frontend state will
be Initialising. xen-pciback waits for the frontend to transition to
Initialised before attempting to connect. So a wait for a
non-running domain is not applicable as the backend will not
transition to Connected.
3) For presence, num_devs is already used to determine if the backend
needs to be created. Re-use num_devs to determine if the backend
wait is necessary. The wait is necessary to avoid racing with
another PCI attachment reconfiguring the front/back or changing to
some other state like closing. If we are creating the backend, then
we don't have to worry about the state since it is being created.
Fixes: 0fdb48ffe7a1 ("libxl: Make sure devices added by pci-attach are
reflected in the config")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 73ee2795aaef2cb086ac078bffe1c6b33c0ea91b
master date: 2022-01-13 14:33:16 +0100
Jan Beulich [Tue, 25 Jan 2022 12:37:59 +0000 (13:37 +0100)]
x86/time: improve TSC / CPU freq calibration accuracy
While the problem report was for extreme errors, even smaller ones would
better be avoided: The calculated period to run calibration loops over
can (and usually will) be shorter than the actual time elapsed between
first and last platform timer and TSC reads. Adjust values returned from
the init functions accordingly.
On a Skylake system I've tested this on, accuracy (using HPET) went from
detecting in some cases more than 220kHz too high a value to about
±2kHz. On other systems (or on this system, but with PMTMR) the original
error range was much smaller, with less (in some cases only very little)
improvement.
Reported-by: James Dingwall <james-xen@dingwall.me.uk> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a5c9a80af34eefcd6e31d0ed2b083f452cd9076d
master date: 2022-01-13 14:31:52 +0100
Jan Beulich [Tue, 25 Jan 2022 12:37:45 +0000 (13:37 +0100)]
x86/time: use relative counts in calibration loops
Looping until reaching/exceeding a certain value is error prone: If the
target value is close enough to the wrapping point, the loop may not
terminate at all. Switch to using delta values, which then allows
folding the two loops each into just one.
Fixes: 93340297802b ("x86/time: calibrate TSC against platform timer") Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 467191641d2a2fd2e43b3ae7b80399f89d339980
master date: 2022-01-13 14:30:18 +0100
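The difference between comparing against an absolute target and comparing elapsed deltas can be sketched as follows. This is a standalone illustration with a hypothetical 32-bit counter and invented function names, not the calibration code itself.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Absolute comparison: if the counter wraps between two reads, "now" becomes
 * a small value again and the test against a target near UINT32_MAX keeps
 * failing, so a loop built on it may never terminate.
 */
static bool reached_absolute(uint32_t start, uint32_t delta, uint32_t now)
{
    uint32_t target = start + delta; /* may itself wrap */

    return now >= target;
}

/*
 * Relative comparison: unsigned subtraction is performed modulo 2^32, so the
 * elapsed count stays correct across a single wrap of the counter.
 */
static bool reached_relative(uint32_t start, uint32_t delta, uint32_t now)
{
    return (uint32_t)(now - start) >= delta;
}
```

With `start = 0xFFFFFF00`, `delta = 0xF0` and a read of `0x5` after the wrap, the absolute form still reports "not reached" although 0x105 ticks have elapsed, while the relative form reports completion.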
Julien Grall [Tue, 25 Jan 2022 12:35:08 +0000 (13:35 +0100)]
xen/grant-table: Only decrement the refcounter when grant is fully unmapped
The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a
simple revert of the changes done by the grant mapping hypercall
(GNTTABOP_map_grant_ref).
Instead, it is possible to partially (or even not) clear some flags.
This will leave the grant mapped until a future call where all
the flags would be cleared.
XSA-380 introduced a refcounting that is meant to only be dropped
when the grant is fully unmapped. Unfortunately, unmap_common() will
decrement the refcount for every successful call.
A consequence is a domain would be able to underflow the refcount
and trigger a BUG().
Looking at the code, it is not clear to me why a domain would
want to partially clear some flags in the grant-table. But as
this is part of the ABI, it is better to not change the behavior
for now.
Fix it by checking if the maptrack handle has been released before
decrementing the refcount.
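The intended rule, dropping the reference only once every flag has been cleared, can be sketched as below. All names here (`struct mapping_sketch`, `unmap_sketch()`, the `MAP_*` flags) are hypothetical, and the structure is far simpler than the real maptrack handling.

```c
#include <stdbool.h>

#define MAP_HOST   (1u << 0)
#define MAP_DEVICE (1u << 1)

/* Illustrative mapping record. */
struct mapping_sketch {
    unsigned int flags;    /* kinds of mapping that are still active */
    unsigned int *refcnt;  /* reference taken once at map time */
};

/* Returns true only when this call fully released the mapping. */
static bool unmap_sketch(struct mapping_sketch *m, unsigned int clear)
{
    if ( !m->flags )
        return false;      /* already released: nothing left to drop */

    m->flags &= ~clear;

    if ( m->flags )
        return false;      /* partial unmap: keep the reference */

    --*m->refcnt;          /* last flag cleared: drop the reference once */
    return true;
}
```

A successful but partial unmap leaves the counter alone, so repeated calls cannot underflow it.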
Julien Grall [Tue, 25 Jan 2022 12:34:55 +0000 (13:34 +0100)]
xen/arm: p2m: Always clear the P2M entry when the mapping is removed
Commit 2148a125b73b ("xen/arm: Track page accessed between batch of
Set/Way operations") allowed an entry to be invalid from the CPU PoV
(lpae_is_valid()) but valid for Xen (p2m_is_valid()). This is useful
to track which page is accessed and only perform an action on them
(e.g. clean & invalidate the cache after a set/way instruction).
Unfortunately, __p2m_set_entry() is only zeroing the P2M entry when
lpae_is_valid() returns true. This means the entry will not be zeroed
if the entry was valid from Xen PoV but invalid from the CPU PoV for
tracking purpose.
As a consequence, this will allow a domain to continue to access the
page after it was removed.
Resolve the issue by always zeroing the entry if the LPAE bit is
set or the entry is about to be removed.
Andrew Cooper [Thu, 6 Jan 2022 13:15:14 +0000 (14:15 +0100)]
x86/spec-ctrl: Fix default calculation of opt_srb_lock
Since this logic was introduced, opt_tsx has become more complicated and
shouldn't be compared to 0 directly. While there are no buggy logic paths,
the correct expression is !(opt_tsx & 1) but the rtm_disabled boolean is
easier and clearer to use.
Fixes: 8fe24090d940 ("x86/cpuid: Rework HLE and RTM handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 31f3bc97f4508687215e459a5e35676eecf1772b
master date: 2022-01-05 09:44:26 +0000
While its description is correct from an abstract or real hardware pov,
the range is special inside HVM guests. The range being UC in particular
gets in the way of OVMF, which places itself at [FFE00000,FFFFFFFF].
While this is benign to epte_get_entry_emt() as long as the IOMMU isn't
enabled for a guest, it becomes a very noticeable problem otherwise: It
takes about half a minute for OVMF to decompress itself into its
designated address range.
And even beyond OVMF there's no reason to have e.g. the ACPI memory
range marked UC.
Fixes: c22bd567ce22 ("hvmloader: PA range 0xfc000000-0xffffffff should be UC") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ea187c0b7a73c26258c0e91e4f3656989804555f
master date: 2021-12-17 08:56:15 +0100
Jan Beulich [Thu, 6 Jan 2022 13:13:38 +0000 (14:13 +0100)]
x86/HVM: permit CLFLUSH{,OPT} on execute-only code segments
Both SDM and PM explicitly permit this.
Fixes: 52dba7bd0b36 ("x86emul: generalize wbinvd() hook") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org>
master commit: df3e1a5efe700a9f59eced801cac73f9fd02a0e2
master date: 2021-12-10 14:03:56 +0100
Jan Beulich [Thu, 6 Jan 2022 13:12:53 +0000 (14:12 +0100)]
x86: avoid wrong use of all-but-self IPI shorthand
With "nosmp" I did observe a flood of "APIC error on CPU0: 04(04), Send
accept error" log messages on an AMD system. And rightly so - nothing
excludes the use of the shorthand in send_IPI_mask() in this case. Set
"unaccounted_cpus" to "true" also when command line restrictions are the
cause.
Note that PV-shim mode is unaffected by this change, first and foremost
because "nosmp" and "maxcpus=" are ignored in this case.
Fixes: 5500d265a2a8 ("x86/smp: use APIC ALLBUT destination shorthand when possible") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7621880de0bb40bae6436a5b106babc0e4718f4d
master date: 2021-12-10 10:26:52 +0100
Jan Beulich [Thu, 6 Jan 2022 13:12:26 +0000 (14:12 +0100)]
x86/HVM: fail virt-to-linear conversion for insn fetches from non-code segments
Just like (in protected mode) reads may not go to exec-only segments and
writes may not go to non-writable ones, insn fetches may not access data
segments.
Fixes: 623e83716791 ("hvm: Support hardware task switching") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 311297f4216a4387bdae6df6cfbb1f5edb06618a
master date: 2021-12-06 14:15:05 +0100
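The protected-mode rule stated above can be sketched with the descriptor type bits. This is a simplified model with invented names (`enum access_sketch`, `access_permitted()`): it ignores real mode, long mode, limits and privilege checks, and only uses the architectural meaning of type bit 3 (code segment) and bit 1 (readable for code, writable for data).

```c
#include <stdbool.h>

enum access_sketch { ACC_READ, ACC_WRITE, ACC_FETCH };

/* "type" is the 4-bit type field of a code/data segment descriptor. */
static bool access_permitted(unsigned int type, enum access_sketch acc)
{
    bool code = type & 0x8; /* bit 3: code (1) or data (0) segment */
    bool rw   = type & 0x2; /* bit 1: readable (code) / writable (data) */

    switch ( acc )
    {
    case ACC_READ:  return !code || rw;  /* exec-only code is not readable */
    case ACC_WRITE: return !code && rw;  /* code is never writable */
    case ACC_FETCH: return code;         /* no insn fetch from data */
    }

    return false;
}
```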
Jan Beulich [Thu, 6 Jan 2022 13:11:58 +0000 (14:11 +0100)]
x86/Viridian: fix error code use
Both the wrong use of HV_STATUS_* and the return type of
hv_vpset_to_vpmask() can lead to viridian_hypercall()'s
ASSERT_UNREACHABLE() triggering when translating error codes from Xen
to Viridian representation.
Fixes: b4124682db6e ("viridian: add ExProcessorMasks variants of the flush hypercalls") Fixes: 9afa867d42ba ("viridian: add ExProcessorMasks variant of the IPI hypercall") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 857fee77845be0c5c35fd51bac64455369d32a6f
master date: 2021-11-24 11:09:56 +0100
Jan Beulich [Thu, 6 Jan 2022 13:11:23 +0000 (14:11 +0100)]
VT-d: don't leak domid mapping on error path
While domain_context_mapping() invokes domain_context_unmap() in a sub-
case of handling DEV_TYPE_PCI when encountering an error, thus avoiding
a leak, individual calls to domain_context_mapping_one() aren't
similarly covered. Such a leak might persist until domain destruction.
Leverage that these cases can be recognized by pdev being non-NULL.
Fixes: dec403cc668f ("VT-d: fix iommu_domid for PCI/PCIx devices assignment") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: e6252a51faf42c892eb5fc71f8a2617580832196
master date: 2021-11-24 11:07:11 +0100
Andrew Cooper [Wed, 24 Nov 2021 21:11:52 +0000 (21:11 +0000)]
Revert "x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents"
OSSTest has identified a 3rd regression caused by this change. Migration
between Xen 4.15 and 4.16 on the nocera pair of machines (AMD Opteron 4133)
fails with:
which is a safety check to prevent resuming the guest when the CPUID data has
been truncated. The problem is caused by shrinking of the max policies, which
is an ABI that needs handling compatibly between different versions of Xen.
Furthermore, shrinking of the default policies also breaks things in some
cases, because certain cpuid= settings in a VM config file which used to have
an effect will now be silently discarded.
Fixes: 540d911c2813 ("x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents") Fixes: 81da2b544cbb ("x86/cpuid: prevent shrinking migrated policies max leaves") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Release_Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Mon, 22 Nov 2021 11:12:32 +0000 (11:12 +0000)]
x86/P2M: deal with partial success of p2m_set_entry()
M2P and PoD stats need to remain in sync with P2M; if an update succeeds
only partially, respective adjustments need to be made. If updates get
made before the call, they may also need undoing upon complete failure
(i.e. including the single-page case).
Log-dirty state would better also be kept in sync.
Note that the change to set_typed_p2m_entry() may not be strictly
necessary (due to the order restriction enforced near the top of the
function), but is being kept here to be on the safe side.
This is CVE-2021-28705 and CVE-2021-28709 / XSA-389.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 22 Nov 2021 11:11:44 +0000 (11:11 +0000)]
x86/PoD: handle intermediate page orders in p2m_pod_cache_add()
p2m_pod_decrease_reservation() may pass pages to the function which
aren't 4k, 2M, or 1G. Handle all intermediate orders as well, to avoid
hitting the BUG() at the switch() statement's "default" case.
This is CVE-2021-28708 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 22 Nov 2021 11:11:44 +0000 (11:11 +0000)]
x86/PoD: deal with misaligned GFNs
Users of XENMEM_decrease_reservation and XENMEM_populate_physmap aren't
required to pass in order-aligned GFN values. (While I consider this
bogus, I don't think we can fix this there, as that might break existing
code, e.g. Linux's swiotlb, which - while affecting PV only - until
recently had been enforcing only page alignment on the original
allocation.) Only non-PoD code paths (guest_physmap_{add,remove}_page(),
p2m_set_entry()) look to be dealing with this properly (in part by being
implemented inefficiently, handling every 4k page separately).
Introduce wrappers taking care of splitting the incoming request into
aligned chunks, without putting much effort in trying to determine the
largest possible chunk at every iteration.
Also "handle" p2m_set_entry() failure for non-order-0 requests by
crashing the domain in one more place. Alongside putting a log message
there, also add one to the other similar path.
Note regarding locking: This is left in the actual worker functions on
the assumption that callers aren't guaranteed atomicity wrt acting on
multiple pages at a time. For mis-aligned GFNs gfn_lock() wouldn't have
locked the correct GFN range anyway, if it didn't simply resolve to
p2m_lock(), and for well-behaved callers there continues to be only a
single iteration, i.e. behavior is unchanged for them. (FTAOD pulling
out just pod_lock() into p2m_pod_decrease_reservation() would result in
a lock order violation.)
This is CVE-2021-28704 and CVE-2021-28707 / part of XSA-388.
Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
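Splitting a possibly misaligned request into aligned chunks can be sketched as follows. This is an illustration, not the wrapper added by the patch: `split_aligned()`, `lowest_set_bit()` and `struct chunk` are hypothetical names, and the chunk order is chosen cheaply from the lowest set bit of `gfn | left` rather than by searching for the largest possible chunk.

```c
#include <stdint.h>

/* Index of the lowest set bit; v must be nonzero. */
static unsigned int lowest_set_bit(uint64_t v)
{
    unsigned int i = 0;

    while ( !(v & 1) )
    {
        v >>= 1;
        i++;
    }

    return i;
}

/* Records the chunks for inspection (illustrative only). */
struct chunk {
    uint64_t gfn;
    unsigned int order;
};

/*
 * Split 2^order pages starting at gfn into naturally aligned chunks.
 * Returns the number of chunks written; output is truncated at max.
 */
static unsigned int split_aligned(uint64_t gfn, unsigned int order,
                                  struct chunk *out, unsigned int max)
{
    uint64_t left = UINT64_C(1) << order;
    unsigned int n = 0;

    while ( left && n < max )
    {
        /* Both gfn and left are multiples of 2^co, so the chunk is aligned. */
        unsigned int co = lowest_set_bit(gfn | left);

        out[n].gfn = gfn;
        out[n].order = co;
        n++;

        gfn += UINT64_C(1) << co;
        left -= UINT64_C(1) << co;
    }

    return n;
}
```

For an order-aligned GFN the loop runs exactly once, matching the note that behaviour is unchanged for well-behaved callers.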
Julien Grall [Mon, 22 Nov 2021 11:11:05 +0000 (11:11 +0000)]
xen/page_alloc: Harden assign_pages()
domain_tot_pages() and d->max_pages are 32-bit values. While the order
should always be quite small, it would still be possible to overflow
if domain_tot_pages() is near to (2^32 - 1).
As this code may be called by a guest via XENMEM_increase_reservation
and XENMEM_populate_physmap, we want to make sure the guest is not going
to be able to allocate more than it is allowed.
Rework the allocation check to avoid any possible overflow. While the
check domain_tot_pages() < d->max_pages should technically not be
necessary, it is probably best to have it to catch any possible
inconsistencies in the future.
This is CVE-2021-28706 / part of XSA-385.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
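The overflow concern can be sketched with plain 32-bit arithmetic. The function names and parameters below are hypothetical; they stand in for domain_tot_pages(), the number of pages being assigned, and d->max_pages, and do not reproduce the actual check in assign_pages().

```c
#include <stdbool.h>
#include <stdint.h>

/* Naive check: tot + nr can wrap in 32 bits and then appear to fit. */
static bool fits_naive(uint32_t tot, uint32_t nr, uint32_t max)
{
    return (uint32_t)(tot + nr) <= max;
}

/* Overflow-safe check: compare nr against the remaining headroom. */
static bool fits_safe(uint32_t tot, uint32_t nr, uint32_t max)
{
    return tot <= max && nr <= max - tot;
}
```

With `tot` at 2^32 - 1 and `nr = 2`, the naive sum wraps to 1 and passes, while the headroom form rejects the request.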
Roger Pau Monne [Thu, 18 Nov 2021 08:28:06 +0000 (09:28 +0100)]
efi: fix alignment of function parameters in compat mode
Currently the max_store_size, remain_store_size and max_size in
compat_pf_efi_runtime_call are 4 byte aligned, which makes clang
13.0.0 complain with:
In file included from compat.c:30:
./runtime.c:646:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 2 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.max_store_size,
^
./runtime.c:647:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 3 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.remain_store_size,
^
./runtime.c:648:13: error: passing 4-byte aligned argument to 8-byte aligned parameter 4 of 'QueryVariableInfo' may result in an unaligned pointer access [-Werror,-Walign-mismatch]
&op->u.query_variable_info.max_size);
^
Fix this by bouncing the variables on the stack in order for them to
be 8 byte aligned.
Note this could be done in a more selective manner to only apply to
compat code calls, but given the overhead of making an EFI call doing
an extra copy of 3 variables doesn't seem to warrant the special
casing.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Ian Jackson <iwj@xenproject.org> Signed-off-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
Changes since v3:
- Remove hard tabs. Apply Jan's r-b as authorised in email.
Changes since v2:
- Adjust the commentary as per discussion.
Changes since v1:
- Copy back the results.
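Bouncing a value through a suitably aligned local and copying the result back can be sketched as below. `callee_sketch()` is a hypothetical stand-in for a firmware call that requires an 8-byte aligned pointer, and `query_bounced()` is an invented wrapper. The real code operates on the fields of the compat structure rather than on a raw byte pointer.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical callee that needs a naturally aligned uint64_t pointer. */
static void callee_sketch(uint64_t *out)
{
    *out = 42;
}

/* field may be only 4-byte aligned, so never hand it to the callee. */
static void query_bounced(void *field)
{
    uint64_t tmp;                        /* naturally aligned local */

    memcpy(&tmp, field, sizeof(tmp));    /* copy in */
    callee_sketch(&tmp);
    memcpy(field, &tmp, sizeof(tmp));    /* copy the result back */
}
```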
Anthony PERARD [Fri, 19 Nov 2021 10:29:48 +0000 (10:29 +0000)]
golang/xenlight: regen generated code
Fixes: 7379f9e10a3b ("gnttab: allow setting max version per-domain") Fixes: 1e6706b0d123 ("xen/arm: Introduce gpaddr_bits field to struct xen_domctl_getdomaininfo") Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Nick Rosbrook <rosbrookn@ainfosec.com> Acked-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Fri, 19 Nov 2021 14:14:08 +0000 (15:14 +0100)]
VT-d: fix reduced page table levels support when sharing tables
domain_pgd_maddr() contains logic to adjust the root address to be put
in the context entry in case 4-level page tables aren't supported by an
IOMMU. This logic may not be bypassed when sharing page tables.
This is CVE-2021-28710 / XSA-390.
Fixes: 25ccd093425c ("iommu: remove the share_p2m operation") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monné [Wed, 17 Nov 2021 11:43:05 +0000 (12:43 +0100)]
tools/python: fix python libxc bindings to pass a max grant version
Such max version should be provided by the caller, otherwise the
bindings will default to specifying a max version of 2, which is
in line with the current defaults in the hypervisor.
Fixes: 7379f9e10a ('gnttab: allow setting max version per-domain') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <iwj@xenproject.org> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monné [Wed, 17 Nov 2021 07:13:18 +0000 (08:13 +0100)]
test/tsx: set grant version for created domains
Set the grant table version for the created domains to use version 1,
as such test domains don't require the usage of the grant table at
all. A TODO note is added to switch those dummy domains to not have a
grant table at all when possible. Without setting the grant version
the domains for the tests cannot be created.
Fixes: 7379f9e10a ('gnttab: allow setting max version per-domain') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monné [Wed, 17 Nov 2021 07:13:02 +0000 (08:13 +0100)]
tests/resource: set grant version for created domains
Set the grant table version for the created domains to use version 1,
as that's the one used by the test cases. Without setting the grant
version the domains for the tests cannot be created.
Fixes: 7379f9e10a ('gnttab: allow setting max version per-domain') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monné [Wed, 17 Nov 2021 07:12:00 +0000 (08:12 +0100)]
domctl: introduce a macro to set the grant table max version
Such macro just clamps the passed version to fit in the designated
bits of the domctl field. The main purpose is to make it clearer in
the code when max grant version is being set in the grant_opts field.
Existing users that were setting the version in the grant_opts field
are switched to use the macro.
No functional change intended.
Requested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
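A minimal sketch of such a macro, assuming the version lives in the low 4 bits of a 32-bit options field and that "clamping" means masking to those bits. GRANT_VERSION_MASK, GRANT_VERSION() and grant_opts_set_version() are illustrative names, not the identifiers used in the Xen headers.

```c
#include <stdint.h>

/* Hypothetical layout: the low 4 bits of the options field hold the version. */
#define GRANT_VERSION_MASK  0xfu
#define GRANT_VERSION(v)    ((uint32_t)(v) & GRANT_VERSION_MASK)

/* Replace the version bits in opts and leave every other bit untouched. */
static inline uint32_t grant_opts_set_version(uint32_t opts, unsigned int version)
{
    return (opts & ~GRANT_VERSION_MASK) | GRANT_VERSION(version);
}
```

Routing every assignment through one macro keeps the field width in a single place, which is the readability gain the commit message describes.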
Jane Malalane [Fri, 12 Nov 2021 14:48:21 +0000 (14:48 +0000)]
tests/resource: Extend to check that the grant frames are mapped correctly
Previously, we checked that we could map 40 pages with nothing
complaining. Now we're adding extra logic to check that those 40
frames are "correct".
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 10 Nov 2021 17:40:59 +0000 (18:40 +0100)]
x86/cpuid: prevent shrinking migrated policies max leaves
CPUID policies of guests being migrated shouldn't have their maximum
leaves shrunk, as that would be a guest-visible change. The hypervisor
has no knowledge of whether a guest has been migrated or is built from
scratch, and hence it must not blindly shrink the CPUID policy in
recalculate_cpuid_policy. Remove the
x86_cpuid_policy_shrink_max_leaves call from recalculate_cpuid_policy.
Removing such call could be seen as a partial revert of 540d911c28.
Instead let the toolstack shrink the policies for newly created
guests, while keeping the previous values for guests that are migrated
in. Note that guests migrated in without a CPUID policy won't get any
kind of shrinking applied.
Fixes: 540d911c28 ('x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Fri, 12 Nov 2021 12:56:51 +0000 (13:56 +0100)]
VT-d: per-domain IOMMU bitmap needs to have dynamic size
With no upper bound (anymore) on the number of IOMMUs, a fixed-size
64-bit map may be insufficient (systems with 40 IOMMUs have already been
observed).
Fixes: 27713fa2aa21 ("VT-d: improve save/restore of registers across S3") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
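To illustrate the difference from a fixed 64-bit map, here is a rough sketch of a bitmap sized at run time from the number of IOMMUs. It uses plain libc allocation and made-up helper names (bitmap_longs, iommu_bitmap_alloc, bitmap_set, bitmap_test); Xen's own bitmap and allocator helpers are not shown.

```c
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold nbits bits. */
static size_t bitmap_longs(size_t nbits)
{
    return (nbits + BITS_PER_LONG - 1) / BITS_PER_LONG;
}

/* Allocate a zeroed bitmap with one bit per IOMMU; NULL on failure. */
static unsigned long *iommu_bitmap_alloc(size_t nr_iommus)
{
    return calloc(bitmap_longs(nr_iommus), sizeof(unsigned long));
}

static void bitmap_set(unsigned long *map, size_t bit)
{
    map[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}

static int bitmap_test(const unsigned long *map, size_t bit)
{
    return (int)((map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1UL);
}
```

The caller owns the allocation and releases it with free() when the domain's IOMMU state is torn down.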
xen/arm: allocate_bank_memory: don't create memory banks of size zero
allocate_bank_memory can be called with a tot_size of zero, as an
example see the implementation of allocate_memory which can call
allocate_bank_memory with a tot_size of zero for the second memory bank.
If tot_size == 0, don't create an empty memory bank, just return
immediately without error. Otherwise a zero-size memory bank will be
added to the domain device tree.
Note that Linux is known to be able to cope with zero-size memory banks,
and Xen more recently gained the ability to do so as well (5a37207df520
"xen/arm: bootfdt: Ignore empty memory bank"). However, there might be
other non-Linux OSes that are not able to cope with empty memory banks
as well as Linux (and now Xen). It would be more robust to avoid
zero-size memory banks unless required.
Moreover, the code to find empty address regions in make_hypervisor_node
in Xen is not able to cope with empty memory banks today and would
result in a Xen crash. This is only a latent bug because
make_hypervisor_node is only called for Dom0 at present and
allocate_memory is only called for DomU at the moment. (But if
make_hypervisor_node was to be called for a DomU, then the Xen crash
would become manifest.)
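The early return can be sketched as follows. struct bank, struct meminfo, MAX_BANKS and add_bank() are simplified stand-ins invented for this example and do not mirror the real allocate_bank_memory signature.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_BANKS 4

struct bank { uint64_t start, size; };
struct meminfo { unsigned int nr_banks; struct bank bank[MAX_BANKS]; };

/* Returns true on success. A zero-sized request succeeds without adding a bank. */
static bool add_bank(struct meminfo *mem, uint64_t start, uint64_t tot_size)
{
    if (tot_size == 0)
        return true;            /* nothing to do, and not an error */

    if (mem->nr_banks >= MAX_BANKS)
        return false;

    mem->bank[mem->nr_banks].start = start;
    mem->bank[mem->nr_banks].size = tot_size;
    mem->nr_banks++;
    return true;
}
```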
xen/arm: don't assign domU static-mem to dom0 as reserved-memory
DomUs static-mem ranges are added to the reserved_mem array for
accounting, but they shouldn't be assigned to dom0 as the other regular
reserved-memory ranges in device tree.
In make_memory_nodes, fix the error by skipping banks with xen_domain
set to true in the reserved-memory array. Also make sure to use the
first valid (!xen_domain) start address for the memory node name.
Fixes: 41c031ff437b ("xen/arm: introduce domain on Static Allocation") Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Penny Zheng <penny.zheng@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
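A rough sketch of the filtering described above, with a hypothetical struct rsv_bank and helper collect_dom0_banks() standing in for the real data structures used by make_memory_nodes:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rsv_bank { uint64_t start, size; bool xen_domain; };

/*
 * Copy every bank that is not a domU static-mem range into out and return
 * how many were kept. If at least one bank is kept, *first_start receives
 * the start address of the first kept bank (the one a node name would use).
 */
static size_t collect_dom0_banks(const struct rsv_bank *in, size_t n,
                                 struct rsv_bank *out, uint64_t *first_start)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i].xen_domain)
            continue;           /* accounted for, but not given to dom0 */
        if (kept == 0)
            *first_start = in[i].start;
        out[kept++] = in[i];
    }
    return kept;
}
```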
Roger Pau Monne [Tue, 9 Nov 2021 09:47:21 +0000 (10:47 +0100)]
tools/configure: make iPXE dependent on QEMU traditional
iPXE is only used by QEMU traditional, so make it off by default
unless QEMU traditional is enabled.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: bcf77ce510 ('configure: modify default of building rombios') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monne [Thu, 4 Nov 2021 10:48:34 +0000 (11:48 +0100)]
gnttab: allow setting max version per-domain
Introduce a new domain create field so that toolstack can specify the
maximum grant table version usable by the domain. This is plumbed into
xl and settable by the user as max_grant_version.
Previously this was only settable on a per host basis using the
gnttab command line option.
Note the version is specified using 4 bits, which leaves room to
specify up to grant table version 15. Given that we only have 2 grant
table versions right now, and a new version is unlikely in the near
future using 4 bits seems more than enough.
xenstored stubdomains are limited to grant table v1 because the
current MiniOS code used to build them only has support for grants v1.
There are existing limits set for xenstored stubdomains at creation
time that already match the defaults in MiniOS.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Andrew Cooper [Fri, 29 Oct 2021 17:38:13 +0000 (18:38 +0100)]
xen: Report grant table v1/v2 capabilities to the toolstack
In order to let the toolstack be able to set the gnttab version on a
per-domain basis, it needs to know which ABIs Xen supports. Introduce
XEN_SYSCTL_PHYSCAP_gnttab_v{1,2} for the purpose, and plumb it down into
userspace.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Luca Fancellu [Fri, 5 Nov 2021 13:07:28 +0000 (13:07 +0000)]
xen/efi: Fix Grub2 boot on arm64
The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
("arm/efi: Use dom0less configuration when using EFI boot") is
introducing a problem to boot Xen using Grub2 on ARM machine using EDK2.
Despite UEFI specification, EDK2+Grub2 is returning a NULL DeviceHandle
inside the interface given by the LOADED_IMAGE_PROTOCOL service, this
handle is used later by efi_bs->HandleProtocol(...) inside
get_parent_handle(...) when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL
interface, causing Xen to stop the boot because of an EFI_INVALID_PARAMETER
error.
Before the commit above, the function was never called because the
logic was skipping the call when there were multiboot modules in the
DT because the filesystem was never used and the bootloader had
put in place all the right modules in memory and the addresses
in the DT.
To fix the problem the old logic is put back in place. Because the handle
was given to the efi_check_dt_boot(...), but the revert put the handle
out of scope, the signature of the function is changed to use an
EFI_LOADED_IMAGE handle and request the EFI_FILE_HANDLE only when
needed (module found using xen,uefi-binary).
Another problem is found when the UEFI stub tries to check if Dom0
image or DomUs are present.
The logic doesn't work when the UEFI stub is not responsible to load
any modules, so the efi_check_dt_boot(...) return value is modified
to return the number of multiboot modules found and not only the number
of modules loaded by the stub.
Taking the occasion to update the comment in handle_module_node(...)
to explain why we return success even if xen,uefi-binary is not found.
Fixes: a1743fc3a9 ("arm/efi: Use dom0less configuration when using EFI boot") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Juergen Gross [Thu, 4 Nov 2021 16:11:21 +0000 (17:11 +0100)]
tools: disable building qemu-trad per default
Using qemu-traditional as device model has been deprecated for some time now.
So change the default for building it to "disable". This will affect
ioemu-stubdom, too, as there is a direct dependency between the two.
Today it is possible to use a PVH/HVM Linux-based stubdom as device
model. Additionally using ioemu-stubdom isn't really helping for
security, as it requires to run a very old and potentially buggy qemu
version in a PV domain. This is adding probably more security problems
than it is removing by using a stubdom.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Acked-by: Ian Jackson <iwj@xenproject.org> Release-acked-by: Ian Jackson <iwj@xenproject.org>
Juergen Gross [Thu, 4 Nov 2021 16:11:20 +0000 (17:11 +0100)]
configure: modify default of building rombios
The tools/configure script will default to build rombios if qemu
traditional is enabled. If rombios is being built, ipxe will be built
per default, too.
This results in rombios and ipxe no longer being built by default when
disabling qemu traditional.
Fix that by rearranging the dependencies:
- build ipxe by default
- build rombios by default if either ipxe or qemu traditional are
being built
This modification prepares not building qemu traditional by default
without affecting build of rombios and ipxe.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-acked-by: Ian Jackson <iwj@xenproject.org>
Juergen Gross [Thu, 4 Nov 2021 14:42:42 +0000 (15:42 +0100)]
tools/helpers: fix broken xenstore stubdom init
Commit 1787cc167906f3f ("libs/guest: Move the guest ABI check earlier
into xc_dom_parse_image()") broke starting the xenstore stubdom. This
is due to a rather special way the xenstore stubdom domain config is
being initialized: in order to support both, PV and PVH stubdom,
init-xenstore-domain is using xc_dom_parse_image() to find the correct
domain type. Unfortunately above commit requires xc_dom_boot_xen_init()
to have been called before using xc_dom_parse_image(). This requires
the domid, which is known only after xc_domain_create(), which requires
the domain type.
In order to break this circular dependency, call xc_dom_boot_xen_init()
with an arbitrary domid first, and then set dom->guest_domid later.
Fixes: 1787cc167906f3f ("libs/guest: Move the guest ABI check earlier into xc_dom_parse_image()") Signed-off-by: Juergen Gross <jgross@suse.com> Release-acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Nov 2021 13:44:43 +0000 (14:44 +0100)]
x86/APIC: avoid iommu_supports_x2apic() on error path
The value it returns may change from true to false in case
iommu_enable_x2apic() fails and, as a side effect, clears iommu_intremap
(as can happen at least on AMD). Latch the return value from the first
invocation to replace the second one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
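The latching pattern can be illustrated with stand-in functions. query_x2apic_support(), try_enable_x2apic() and setup_x2apic() are invented for this sketch and only imitate the side effect described above; they are not the Xen functions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: the query depends on a flag that a failed enable clears. */
static bool intremap_enabled = true;

static bool query_x2apic_support(void)
{
    return intremap_enabled;
}

static int try_enable_x2apic(void)
{
    intremap_enabled = false;   /* side effect of the failure */
    return -1;                  /* always fails in this sketch */
}

/*
 * Returns 0 on success. On failure, *support_seen_on_error holds what the
 * error path believes about support, taken from the latched value rather
 * than from a second query whose answer may have flipped.
 */
static int setup_x2apic(bool *support_seen_on_error)
{
    const bool supported = query_x2apic_support();  /* latched once */
    int rc = supported ? try_enable_x2apic() : -1;

    if (rc != 0)
        *support_seen_on_error = supported;
    return rc;
}
```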
Jan Beulich [Thu, 4 Nov 2021 13:44:01 +0000 (14:44 +0100)]
x86/IOMMU: mark IOMMU / intremap not in use when ACPI tables are missing
x2apic_bsp_setup() gets called ahead of iommu_setup(), and since x2APIC
mode (physical vs clustered) depends on iommu_intremap, that variable
needs to be set to off as soon as we know we can't / won't enable
interrupt remapping, i.e. in particular when parsing of the respective
ACPI tables failed. Move the turning off of iommu_intremap from AMD
specific code into acpi_iommu_init(), accompanying it by clearing of
iommu_enable.
Take the opportunity and also fully skip ACPI table parsing logic on
VT-d when both "iommu=off" and "iommu=no-intremap" are in effect anyway,
like was already the case for AMD.
The tag below only references the commit uncovering a pre-existing
anomaly.
Fixes: d8bd82327b0f ("AMD/IOMMU: obtain IVHD type to use earlier") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
x86/xstate: reset cached register values on resume
set_xcr0() and set_msr_xss() use cached value to avoid setting the
register to the same value over and over. But suspend/resume implicitly
reset the registers and since percpu areas are not deallocated on
suspend anymore, the cache gets stale.
Reset the cache on resume, to ensure the next write will really hit the
hardware. Choose value 0, as it will never be a legitimate write to
those registers - and so, will force write (and cache update).
Note the cache is used in get_xcr0() and get_msr_xss() too, but:
- set_xcr0() is called few lines below in xstate_init(), so it will
update the cache with appropriate value
- get_msr_xss() is not used anywhere - and thus not before any
set_msr_xss() that will fill the cache
Fixes: aca2a985a55a "xen: don't free percpu areas during suspend" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
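A small model of the stale-cache problem and the fix, using an ordinary variable in place of the hardware register. All names here (hw_reg, cached, set_reg, simulate_suspend, on_resume) are made up for the illustration.

```c
#include <stdint.h>

static uint64_t hw_reg;         /* stand-in for the hardware register */
static uint64_t cached;         /* last value written through set_reg() */
static unsigned int hw_writes;  /* counts writes that reached "hardware" */

/* Skip the write when the cache says the register already holds val. */
static void set_reg(uint64_t val)
{
    if (val == cached)
        return;
    hw_reg = val;
    hw_writes++;
    cached = val;
}

/* Suspend implicitly resets the register; the cache is left untouched. */
static void simulate_suspend(void)
{
    hw_reg = 1;
}

/* On resume, reset the cache to 0, a value assumed never to be written legitimately. */
static void on_resume(void)
{
    cached = 0;
}
```

Without the on_resume() step, the first set_reg() after a suspend with the previously cached value would be skipped and the register would keep its reset contents.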
Andrew Cooper [Tue, 28 Sep 2021 20:55:56 +0000 (21:55 +0100)]
x86/traps: Fix typo in do_entry_CP()
The call to debugger_trap_entry() should pass the correct vector. The
break-for-gdbsx logic is in practice unreachable because PV guests can't
generate #CP, but it will interfere with anyone inserting custom debugging
into debugger_trap_entry().
Fixes: 5ad05b9c2490 ("x86/traps: Implement #CP handler and extend #PF for shadow stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
xen/arm: fix SBDF calculation for vPCI MMIO handlers
While in vPCI MMIO trap handlers for the guest PCI host bridge it is not
enough for SBDF translation to simply call VPCI_ECAM_BDF(info->gpa) as
the base address may not be aligned in the way that the translation
always works. If not adjusted with respect to the base address it may not be
able to properly convert SBDF.
Fix this by adjusting the gpa with respect to the host bridge base address
in the same way as it is done for x86.
Please note, that this change is not strictly required given the current
value of GUEST_VPCI_ECAM_BASE which has bits 0 to 27 clear, but could cause
issues if such value is changed, or when handlers for dom0 ECAM
regions are added as those will be mapped over existing hardware
regions that could use non-aligned base addresses.
Fixes: d59168dc05a5 ("xen/arm: Enable the existing x86 virtual PCI support for ARM") Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
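The effect of the base address can be shown with a small sketch. ecam_bdf() decodes the standard ECAM layout (bits 27:20 bus, 19:15 device, 14:12 function); the helper names and the example base address are assumptions made for this illustration.

```c
#include <stdint.h>

/* ECAM offset layout: bits 27:20 bus, 19:15 device, 14:12 function, 11:0 register. */
static uint16_t ecam_bdf(uint64_t offset)
{
    return (uint16_t)((offset & 0x0ffff000u) >> 12);
}

/* Decode relative to the host bridge base, so an unaligned base does not leak into the BDF. */
static uint16_t bdf_from_gpa(uint64_t gpa, uint64_t ecam_base)
{
    return ecam_bdf(gpa - ecam_base);
}
```

With a base that has bits 0 to 27 clear, both decodings agree. With a base such as 0x10100000, decoding the raw gpa folds bit 20 of the base into the bus number, while subtracting the base first gives the intended device.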
Andrew Cooper [Mon, 1 Nov 2021 20:45:26 +0000 (20:45 +0000)]
x86/shstk: Fix use of shadow stacks with XPTI active
The call to setup_cpu_root_pgt(0) in smp_prepare_cpus() is too early. It
clones the BSP's stack while the .data mapping is still in use, causing all
mappings to be fully read/write (and with no guard pages either). This
ultimately causes #DF when trying to enter the dom0 kernel for the first time.
Defer setting up BSP's XPTI pagetable until reinit_bsp_stack() after we've set
up proper shadow stack permissions.
Fixes: 60016604739b ("x86/shstk: Rework the stack layout to support shadow stacks") Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Dongli Zhang [Wed, 3 Nov 2021 09:19:06 +0000 (10:19 +0100)]
update system time immediately when VCPUOP_register_vcpu_info
The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This is to borrow the idea of
VCPUOP_register_vcpu_time_memory_area, where the
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.
Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().
Reference: https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
As of 724b55f48a6c ("x86: introduce MWAIT-based, ACPI-less CPU idle
driver") they (also) live in asm/mwait.h; no idea how I missed the
duplicates back at the time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
automation: add a QEMU based x86_64 Dom0/DomU test
Introduce a test based on QEMU to run Xen, Dom0 and start a DomU.
This is similar to the existing qemu-alpine-arm64.sh script and test.
The only differences are:
- use Debian's qemu-system-x86_64 (on ARM we build our own)
- use ipxe instead of u-boot and ImageBuilder
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Build a 5.10 kernel to be used as Dom0 and DomU kernel for testing. This
is almost the same as the existing ARM64 recipe for Linux 5.9, the
only differences are:
- upgrade to latest 5.10.x stable
- force Xen modules to built-in (on ARM it was already done by defconfig)
Also add the exporting job to build.yaml so that the binary can be used
during gitlab-ci runs.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Roger Pau Monne [Wed, 27 Oct 2021 14:00:50 +0000 (16:00 +0200)]
x86/cpuid: prevent decreasing of hypervisor max leaf on migration
In order to be compatible with previous Xen versions, and not change
max hypervisor leaf as a result of a migration, keep the clamping of
the maximum leaf value provided to XEN_CPUID_MAX_NUM_LEAVES, instead
of doing it based on the domain type. Also set the default maximum
leaf without taking the domain type into account. The maximum
hypervisor leaf is not migrated, so we need the default to not regress
beyond what might already be reported to a guest by existing Xen
versions.
This is a partial revert of 540d911c28 and restores the previous
behaviour and ensures that HVM guests won't see their maximum
hypervisor leaf reduced from 5 to 4 as a result of a migration.
Fixes: 540d911c28 ('x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monne [Tue, 26 Oct 2021 15:12:33 +0000 (17:12 +0200)]
x86/hpet: setup HPET even when disabled due to stopping in deep C states
Always allow the HPET to be set up, but don't report a frequency back
to the platform time source probe in order to prevent it from being
selected as a valid timer if it's not usable.
Doing the setup even when not intended to be used as a platform timer
is required so that it can be used in legacy replacement mode in order
to assert the IO-APIC is capable of receiving interrupts.
Fixes: c12731493a ('x86/hpet: Use another crystalball to evaluate HPET usability') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Anthony PERARD [Fri, 22 Oct 2021 16:36:44 +0000 (17:36 +0100)]
automation: actually build with clang for ubuntu-focal-clang* jobs
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Hongda Deng [Thu, 21 Oct 2021 12:03:19 +0000 (20:03 +0800)]
xen/arm: vgic: Ignore write access to ICPENDR*
Currently, Xen will return IO unhandled when guests write ICPENDR*
virtual registers, which will raise a data abort inside the guest.
For Linux guest, these virtual registers will not be accessed. But
for Zephyr, these virtual registers will be accessed during the
initialization. Zephyr guest will get an IO data abort and crash.
Emulating ICPENDR is not easy with the existing vGIC, this patch
reworks the emulation to ignore write access to ICPENDR* virtual
registers and print a message about whether they are already pending
instead of returning unhandled.
More details can be found at [1].
Julien Grall [Wed, 20 Oct 2021 14:45:19 +0000 (14:45 +0000)]
tools/xenstored: Ignore domain we were unable to restore
Commit 939775cfd3 "handle dying domains in live update" was meant to
gracefully handle dying domains. However, the @releaseDomain watch
will end up being sent as soon as we have finished restoring the
Xenstored state.
This may be before Xen reports the domain to be dying (such as if
the guest decided to revoke access to the xenstore page). Consequently
daemons like xenconsoled will not clean up the domain and it will be
left as a zombie.
To avoid the problem, mark the connection as ignored. This also
requires tweaking conn_can_write() and conn_can_read() to prevent
dereferencing a NULL pointer (the interface will not be mapped).
The check conn->is_ignored was originally added after the callbacks
because the helpers for a socket connection may close the fd. However,
ignore_connection() will close a socket connection directly. So it is
fine to do the re-order.
Signed-off-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Bertrand Marquis [Wed, 20 Oct 2021 15:22:52 +0000 (17:22 +0200)]
xen/pci: Install vpci handlers on x86 and fix error paths
Xen might not be able to discover at boot time all devices or some devices
might appear after specific actions from dom0.
In this case dom0 can use the PHYSDEVOP_pci_device_add to signal some
PCI devices to Xen.
As those devices were not known to Xen before, the vpci handlers must
be properly installed during pci_device_add for x86 PVH Dom0, in the
same way as what is done currently on arm (where Xen does not detect PCI
devices but relies on Dom0 to declare them all the time).
So this patch is removing the ifdef protecting the call to
vpci_add_handlers and the comment which was arm specific.
vpci_add_handlers is called during pci_device_add which can be called
at runtime through hypercall physdev_op.
Remove __hwdom_init as the call is not limited anymore to hardware
domain init and fix linker script to only keep vpci_array in rodata
section.
Add missing vpci handlers cleanup during pci_device_remove and in case
of error with iommu during pci_device_add.
Move code adding the domain to the pdev domain_list as vpci_add_handlers
needs this to be set and remove it from the list in the error path.
Exit early of vpci_remove_device if the domain has no vpci support.
Add empty static inline for vpci_remove_device when CONFIG_VPCI is not
defined.
Add an ASSERT in vpci_add_handlers to check that the function is not
called twice for the same device.
Fixes: d59168dc05 ("xen/arm: Enable the existing x86 virtual PCI support for ARM") Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Thomas Gleixner [Wed, 20 Oct 2021 10:50:15 +0000 (12:50 +0200)]
x86/hpet: Use another crystalball to evaluate HPET usability
On recent Intel systems the HPET stops working when the system reaches PC10
idle state.
The approach of adding PCI ids to the early quirks to disable HPET on
these systems is a whack a mole game which makes no sense.
Check for PC10 instead and force disable HPET if supported. The check is
overbroad as it does not take ACPI, mwait-idle enablement and command
line parameters into account. That's fine as long as there is at least
PMTIMER available to calibrate the TSC frequency. The decision can be
overruled by adding "clocksource=hpet" on the Xen command line.
Remove the related PCI quirks for affected Coffee Lake systems as they
are no longer required. That should also cover all other systems, i.e.
Ice Lake, Tiger Lake, and newer generations, which are most likely
affected by this as well.
Fixes: Yet another hardware trainwreck Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Linux commit: 6e3cd95234dc1eda488f4f487c281bac8fef4d9b]
I have to admit that the purpose of checking CPUID5_ECX_INTERRUPT_BREAK
is unclear to me, but I didn't want to diverge in technical aspects from
the Linux commit.
In mwait_pc10_supported(), besides some cosmetic adjustments, avoid UB
from shifting left a signed 4-bit constant by 28 bits.
Pull in Linux'es MSR_PKG_CST_CONFIG_CONTROL.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
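The undefined behaviour mentioned above comes from an expression like `0xf << 28`: the constant has type int, and the result does not fit in a 32-bit int. A minimal sketch of the well-defined form follows, assuming the field of interest sits in bits 31:28 of the register value. SUBSTATE_SHIFT, SUBSTATE_MASK and substates_31_28() are illustrative names.

```c
#include <stdint.h>

#define SUBSTATE_SHIFT 28
#define SUBSTATE_MASK  0xfu   /* unsigned constant, so the left shift is well defined */

/* Extract the 4-bit field held in bits 31:28 of edx. */
static unsigned int substates_31_28(uint32_t edx)
{
    return (edx & (SUBSTATE_MASK << SUBSTATE_SHIFT)) >> SUBSTATE_SHIFT;
}
```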
Jan Beulich [Wed, 20 Oct 2021 10:42:44 +0000 (12:42 +0200)]
x86/PoD: defer nested P2M flushes
With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
order violation when the PoD lock is held around it. Hence such flushing
needs to be deferred. Steal the approach from p2m_change_type_range().
(Note that strictly speaking the change at the out_of_memory label is
not needed, as the domain gets crashed there anyway. The change is being
made nevertheless to avoid setting up a trap from someone meaning to
deal with that case better than by domain_crash().)
Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
p2m_flush_nestedp2m() invocation conditional. Note that this then also
alters behavior of p2m_change_type_range() on EPT, deferring the nested
flushes there as well. I think this should have been that way from the
introduction of the flag.
Reported-by: Elliott Mitchell <ehem+xen@m5p.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
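The deferral idea can be modelled without any real locking. In this sketch struct p2m_state and its helpers are invented stand-ins; the comments mark where the lock would be held, and the flush only happens once that region has been left.

```c
#include <stdbool.h>

struct p2m_state {
    bool defer_nested_flush;   /* set while flushing must not happen inline */
    bool need_flush;           /* a flush was requested while deferred */
    unsigned int flushes;      /* number of flushes actually performed */
};

static void flush_nested(struct p2m_state *p)
{
    p->flushes++;
    p->need_flush = false;
}

/* Called for every entry update; flushes inline unless deferral is active. */
static void entry_updated(struct p2m_state *p)
{
    if (p->defer_nested_flush)
        p->need_flush = true;
    else
        flush_nested(p);
}

/* Update n entries "under the lock", then flush once after dropping it. */
static void locked_batch_update(struct p2m_state *p, unsigned int n)
{
    p->defer_nested_flush = true;
    /* the lock would be taken here */
    for (unsigned int i = 0; i < n; i++)
        entry_updated(p);
    /* the lock would be dropped here */
    p->defer_nested_flush = false;

    if (p->need_flush)
        flush_nested(p);
}
```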
Juergen Gross [Tue, 19 Oct 2021 11:21:40 +0000 (13:21 +0200)]
tools: fix oom setting of xenstored
Commit f282182af32939 ("tools/xenstore: set oom score for xenstore
daemon on Linux") introduced a regression when not setting the oom
value in the xencommons file. Fix that.
Fixes: f282182af32939 ("tools/xenstore: set oom score for xenstore daemon on Linux") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Tue, 19 Oct 2021 08:08:30 +0000 (10:08 +0200)]
x86/paging: restrict physical address width reported to guests
Modern hardware may report more than 48 bits of physical address width.
For paging-external guests our P2M implementation does not cope with
larger values. Telling the guest of more available bits means misleading
it into perhaps trying to actually put some page there (like was e.g.
intermediately done in OVMF for the shared info page).
While there also convert the PV check to a paging-external one (which in
our current code base are synonyms of one another anyway).
Fixes: 5dbd60e16a1f ("x86/shadow: Correct guest behaviour when creating PTEs above maxphysaddr") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
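As a rough illustration, the restriction amounts to capping the reported width for paging-external guests. The 48-bit cap is taken from the description above; P2M_MAX_PADDR_BITS and guest_paddr_bits() are hypothetical names.

```c
#include <stdbool.h>

#define P2M_MAX_PADDR_BITS 48u  /* widest value assumed to be handled, per the text above */

/* Width to report to a guest, given the hardware width and the guest's paging mode. */
static unsigned int guest_paddr_bits(unsigned int hw_bits, bool paging_external)
{
    if (paging_external && hw_bits > P2M_MAX_PADDR_BITS)
        return P2M_MAX_PADDR_BITS;
    return hw_bits;
}
```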
Jan Beulich [Tue, 19 Oct 2021 08:07:42 +0000 (10:07 +0200)]
x86/PV: replace assertions in '0' debug key stack dumping
While it was me to add them, I'm afraid I don't see justification for
the assertions: A vCPU may very well have got preempted while in user
mode. Limit compat guest user mode stack dumps to the containing page
(like is done when using do_page_walk()), and suppress user mode stack
dumping altogether for 64-bit domains.
Fixes: cc0de53a903c ("x86: improve output resulting from sending '0' over serial") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
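Limiting a dump to the containing page boils down to computing how many bytes remain before the next page boundary. A minimal sketch, assuming 4 KiB pages; bytes_to_page_end() is a name made up for this example.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Number of bytes from sp up to (not including) the next page boundary. */
static unsigned int bytes_to_page_end(uint64_t sp)
{
    return PAGE_SIZE - (unsigned int)(sp & (PAGE_SIZE - 1));
}
```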