Rahul Singh [Wed, 4 May 2022 17:15:12 +0000 (18:15 +0100)]
arm/its: enable LPIs before mapping the collection table
When Xen boots on a platform that implements the GIC-600, an ITS
MAPC_LPI_OFF uncorrectable command error is observed.
As per the GIC-600 TRM (revision r1p6), a MAPC_LPI_OFF command error can
be reported if the MAPC command has tried to map a collection to a core
that does not have LPIs enabled. The definition of GICR_CTLR.EnableLPIs
also suggests enabling the LPIs before sending any ITS command that
involves LPIs:
0b0 LPI support is disabled. Any doorbell interrupt generated as a
result of a write to a virtual LPI register must be discarded,
and any ITS translation requests or commands involving LPIs in
this Redistributor are ignored.
0b1 LPI support is enabled.
To fix the MAPC command error issue, enable the LPIs using
GICR_CTLR.EnableLPIs before mapping the collection table.
gicv3_enable_lpis() uses writel_relaxed(), so the write to the GICR_CTLR
register may not be visible before gicv3_its_setup_collection() sends the
MAPC command. Use wmb() after writel_relaxed() to make sure the register
write that enables LPIs is visible.
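A minimal sketch of the resulting sequence (readl_relaxed()/writel_relaxed()
and GICR_CTLR_ENABLE_LPIS are names used in Xen's Arm GIC code; the
redistributor base pointer and exact call site are illustrative):

    /* Enable LPIs on this redistributor before any MAPC is issued. */
    uint32_t val = readl_relaxed(rdist_base + GICR_CTLR);

    writel_relaxed(val | GICR_CTLR_ENABLE_LPIS, rdist_base + GICR_CTLR);
    wmb();  /* make EnableLPIs visible before the ITS MAPC command */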
Edwin Török [Wed, 3 Aug 2022 10:39:13 +0000 (12:39 +0200)]
x86/msr: fix X2APIC_LAST
The latest Intel manual now says the X2APIC reserved range is only
0x800 to 0x8ff (NOT 0xbff).
This changed between SDM 68 (Nov 2018) and SDM 69 (Jan 2019).
The AMD manual documents 0x800-0x8ff too.
There are non-X2APIC MSRs in the 0x900-0xbff range now:
e.g. 0x981 is IA32_TME_CAPABILITY, an architectural MSR.
The new MSR in this range appears to have been introduced in Icelake,
so this commit should be backported to Xen versions supporting Icelake.
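A sketch of the corrected bound, assuming the usual macro layout in
msr-index.h:

    #define MSR_X2APIC_FIRST 0x00000800
    /* Reserved x2APIC range per current manuals: 0x800-0x8ff, not 0xbff. */
    #define MSR_X2APIC_LAST  (MSR_X2APIC_FIRST + 0xff)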
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 13316827faadbb4f72ae6c625af9938d8f976f86
master date: 2022-07-27 12:57:10 +0200
Roger Pau Monné [Wed, 3 Aug 2022 10:38:36 +0000 (12:38 +0200)]
tools/libxl: env variable to signal whether disk/nic backend is trusted
Introduce support in libxl for fetching the default backend trusted
option for disk and nic devices.
Users can set LIBXL_{DISK,NIC}_BACKEND_UNTRUSTED environment variable
to notify libxl of whether the backends for disk and nic devices
should be trusted. Such information is passed into the frontend so it
can take the appropriate measures.
This is part of XSA-403.
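A hypothetical sketch of fetching such a default (the helper name and the
value convention are assumptions for illustration, not libxl's actual API):

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* "1" in LIBXL_{DISK,NIC}_BACKEND_UNTRUSTED marks the backend untrusted;
     * absent or any other value leaves it trusted (assumed semantics). */
    static bool backend_trusted_default(const char *env_name)
    {
        const char *val = getenv(env_name);

        return !val || strcmp(val, "1");
    }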
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
In common/memory.c the ifdef code surrounding ptdom_max_order is
using HAS_PASSTHROUGH instead of CONFIG_HAS_PASSTHROUGH, fix the
problem using the correct macro.
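The shape of the fix (the guarded declaration is illustrative; only the
macro name in the #ifdef changes):

    #ifdef CONFIG_HAS_PASSTHROUGH   /* was: #ifdef HAS_PASSTHROUGH */
    static unsigned int __read_mostly ptdom_max_order = CONFIG_PTDOM_MAX_ORDER;
    #endif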
Fixes: e0d44c1f9461 ("build: convert HAS_PASSTHROUGH use to Kconfig") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5707470bf3103ebae43697a7ac2faced6cd35f92
master date: 2022-07-26 08:33:46 +0200
Jan Beulich [Wed, 27 Jul 2022 07:22:31 +0000 (09:22 +0200)]
x86: also suppress use of MMX insns
Passing -mno-sse alone is not enough: The compiler may still find
(questionable) reasons to use MMX insns. In particular with gcc12 use
of MOVD+PUNPCKLDQ+MOVQ was observed in an apparent attempt to auto-
vectorize the storing of two adjacent zeroes, 32 bits each.
Reported-by: ChrisD <chris@dalessio.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6fe2e39a0243bddba60f83b77b972a5922d25eb8
master date: 2022-07-20 15:48:49 +0200
Jan Beulich [Wed, 27 Jul 2022 07:21:59 +0000 (09:21 +0200)]
x86emul: add memory operand low bits checks for ENQCMD{,S}
Already ISE rev 044 added text to this effect; rev 045 further dropped
leftover earlier text indicating the contrary:
- ENQCMD requires the low 32 bits of the memory operand to be clear,
- ENQCMDS requires bits 20...30 of the memory operand to be clear.
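In sketch form, with 'src' standing for the low 64 bits of the memory
operand and the fault plumbing illustrative:

    /* ENQCMD (F2 prefix): low 32 bits must be clear. */
    generate_exception_if(src & 0xffffffffULL, EXC_GP, 0);

    /* ENQCMDS (F3 prefix): bits 20...30 must be clear. */
    generate_exception_if(src & 0x7ff00000ULL, EXC_GP, 0);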
Fixes: d27385968741 ("x86emul: support ENQCMD insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d620c66bdbe5510c3bae89be8cc7ca9a2a6cbaba
master date: 2022-07-20 15:46:48 +0200
Jan Beulich [Wed, 27 Jul 2022 07:21:20 +0000 (09:21 +0200)]
x86: deal with gcc12 release build issues
While a number of issues we previously had with pre-release gcc12 were
fixed in the final release, we continue to have one issue (with multiple
instances) when doing release builds (i.e. at higher optimization
levels): The compiler takes issue with subtracting (always 1 in our
case) from artificial labels (expressed as arrays) marking the end of
certain regions. This isn't an unreasonable position to take. Simply
hide the "array-ness" by casting to an integer type. To keep things
looking consistently, apply the same cast also on the respective
expressions dealing with the starting addresses. (Note how
efi_arch_memory_setup()'s l2_table_offset() invocations avoid a similar
issue by already having the necessary casts.) In is_xen_fixed_mfn()
further switch from __pa() to virt_to_maddr() to better match the left
sides of the <= operators.
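A sketch of the resulting expression shape in is_xen_fixed_mfn() (symbol
names per Xen's linker script; details may differ from the committed form):

    #define is_xen_fixed_mfn(mfn)                                         \
        ((mfn_to_maddr(mfn) >= virt_to_maddr((unsigned long)_stext)) &&   \
         (mfn_to_maddr(mfn) <= virt_to_maddr((unsigned long)__2M_rwdata_end - 1)))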
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9723507daf2120131410c91980d4e4d9b0d0aa90
master date: 2022-07-19 08:37:29 +0200
Jan Beulich [Wed, 27 Jul 2022 07:20:06 +0000 (09:20 +0200)]
xl: move freemem()'s "credit expired" loop exit
Move the "credit expired" loop exit to the middle of the loop,
immediately after "return true". This way having reached the goal on the
last iteration would be reported as success to the caller, rather than
as "timed out".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d8f8cb8bdd02fad3b6986ae93511f750fa7f7e6a
master date: 2022-07-18 17:48:18 +0200
Jan Beulich [Wed, 27 Jul 2022 07:14:32 +0000 (09:14 +0200)]
xl: relax freemem()'s retry calculation
While in principle possible also under other conditions as long as other
parallel operations potentially consuming memory aren't "locked out", in
particular with IOMMU large page mappings used in Dom0 (for PV when in
strict mode; for PVH when not sharing page tables with HAP) ballooning
out of individual pages can actually lead to less free memory available
afterwards. This is because to split a large page, one or more page
table pages are necessary (one per level that is split).
When rebooting a guest I've observed freemem() to fail: A single page
was required to be ballooned out (presumably because of heap
fragmentation in the hypervisor). This ballooning out of a single page
of course went fast, but freemem() then found that it needed to balloon
out another page. This repeating just one more time led the function to
signal failure to the caller - without having come anywhere
near the designated 30s that the whole process is allowed to not make
any progress at all.
Convert from a simple retry count to actually calculating elapsed time,
subtracting from an initial credit of 30s. Don't go as far as limiting
the "wait_secs" value passed to libxl_wait_for_memory_target(), though.
While this leads to the overall process now possibly taking longer (if
the previous iteration ended very close to the intended 30s), this
compensates to some degree for the value passed really meaning "allowed
to run for this long without making progress".
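A sketch of the elapsed-time credit (libxl_wait_for_memory_target() is the
real libxl call; the surrounding bookkeeping is illustrative):

    int credit = 30;   /* seconds of permitted lack of progress */

    do {
        time_t start = time(NULL);

        /* ... balloon dom0 down, then ... */
        rc = libxl_wait_for_memory_target(ctx, 0, credit);
        credit -= time(NULL) - start;   /* subtract elapsed time, not iterations */
    } while ( rc && credit > 0 );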
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: e58370df76eacf1f7ca0340e9b96430c77b41a79
master date: 2022-07-12 15:25:00 +0200
Jan Beulich [Tue, 26 Jul 2022 12:59:07 +0000 (14:59 +0200)]
x86/mm: correct TLB flush condition in _get_page_type()
When this logic was moved, it was moved across the point where nx is
updated to hold the new type for the page. IOW originally it was
equivalent to using x (and perhaps x would better have been used), but
now it isn't anymore. Switch to using x, which then brings things in
line again with the slightly earlier comment there (now) talking about
transitions _from_ writable.
I have to confess though that I cannot make a direct connection between
the reported observed behavior of guests leaving several pages around
with pending general references and the change here. Repeated testing,
nevertheless, confirms the reported issue is no longer there.
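In sketch form, the corrected condition tests the old type rather than the
new one (names per Xen's mm code; the surrounding logic is elided):

    /* Flush on transitions *from* writable: look at x, the old type/count. */
    if ( (x & PGT_type_mask) == PGT_writable_page )
        flush_tlb_mask(mask);   /* mask computation elided */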
This is CVE-2022-33745 / XSA-408.
Reported-by: Charles Arnold <carnold@suse.com> Fixes: 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a9949efb288fd6e21bbaf9d5826207c7c41cda27
master date: 2022-07-26 14:54:34 +0200
Andrew Cooper [Mon, 27 Jun 2022 18:29:40 +0000 (19:29 +0100)]
x86/spec-ctrl: Mitigate Branch Type Confusion when possible
Branch Type Confusion affects AMD/Hygon CPUs on Zen2 and earlier. To
mitigate, we require SMT safety (STIBP on Zen2, no-SMT on Zen1), and to issue
an IBPB on each entry to Xen, to flush the BTB.
Due to performance concerns, dom0 (which is trusted in most configurations) is
excluded from protections by default.
Therefore:
* Use STIBP by default on Zen2 too, which now means we want it on by default
on all hardware supporting STIBP.
* Break the current IBPB logic out into a new function, extending it with
IBPB-at-entry logic.
* Change the existing IBPB-at-ctxt-switch boolean to be tristate, and disable
it by default when IBPB-at-entry is providing sufficient safety.
If all PV guests on the system are trusted, then it is recommended to boot
with `spec-ctrl=ibpb-entry=no-pv`, as this will provide an additional marginal
perf improvement.
This is part of XSA-407 / CVE-2022-23825.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d8cb7e0f069e0f106d24941355b59b45a731eabe)
Andrew Cooper [Mon, 16 May 2022 14:48:24 +0000 (15:48 +0100)]
x86/cpuid: Enumeration for BTC_NO
BTC_NO indicates that hardware is not susceptible to Branch Type Confusion.
Zen3 CPUs don't suffer BTC.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 76cb04ad64f3ab9ae785988c40655a71dde9c319)
Andrew Cooper [Thu, 24 Feb 2022 13:44:33 +0000 (13:44 +0000)]
x86/spec-ctrl: Support IBPB-on-entry
We are going to need this to mitigate Branch Type Confusion on AMD/Hygon CPUs,
but as we've talked about using it in other cases too, arrange to support it
generally. However, this is also very expensive in some cases, so we're going
to want per-domain controls.
Introduce SCF_ist_ibpb and SCF_entry_ibpb controls, adding them to the IST and
DOM masks as appropriate. Also introduce X86_FEATURE_IBPB_ENTRY_{PV,HVM}
to patch the code blocks.
For SVM, the STGI is serialising enough to protect against Spectre-v1 attacks,
so no "else lfence" is necessary. VT-x will use use the MSR host load list,
so doesn't need any code in the VMExit path.
For the IST path, we can't safely check CPL==0 to skip a flush, as we might
have hit an entry path before its IBPB. As IST hitting Xen is rare, flush
irrespective of CPL. A later path, SCF_ist_sc_msr, provides Spectre-v1
safety.
For the PV paths, we know we're interrupting CPL>0, while for the INTR paths,
we can safely check CPL==0. Only flush when interrupting guest context.
An "else lfence" is needed for safety, but we want to be able to skip it on
unaffected CPUs, so the block wants to be an alternative, which means the
lfence has to be inline rather than UNLIKELY() (the replacement block doesn't
have displacements fixed up for anything other than the first instruction).
As with SPEC_CTRL_ENTRY_FROM_INTR_IST, %rdx is 0 on entry so rely on this to
shrink the logic marginally. Update the comments to specify this new
dependency.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 53a570b285694947776d5190f591a0d5b9b18de7)
We are shortly going to add a conditional IBPB in this path.
Therefore, we cannot hold spec_ctrl_flags in %eax, and rely on only clobbering
it after we're done with its contents. %rbx is available for use, and the
more normal register to hold preserved information in.
With %rax freed up, use it instead of %rdx for the RSB tmp register, and for
the adjustment to spec_ctrl_flags.
This leaves no use of %rdx, except as 0 for the upper half of WRMSR. In
practice, %rdx is 0 from SAVE_ALL on all paths and isn't likely to change in
the foreseeable future, so update the macro entry requirements to state this
dependency. This marginal optimisation can be revisited if circumstances
change.
No practical change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit e9b8d31981f184c6539f91ec54bd9cae29cdae36)
Andrew Cooper [Mon, 4 Jul 2022 20:32:17 +0000 (21:32 +0100)]
x86/spec-ctrl: Rename opt_ibpb to opt_ibpb_ctxt_switch
We are about to introduce the use of IBPB at different points in Xen, making
opt_ibpb ambiguous. Rename it to opt_ibpb_ctxt_switch.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit a8e5ef079d6f5c88c472e3e620db5a8d1402a50d)
Andrew Cooper [Tue, 28 Jun 2022 13:36:56 +0000 (14:36 +0100)]
x86/spec-ctrl: Rename SCF_ist_wrmsr to SCF_ist_sc_msr
We are about to introduce SCF_ist_ibpb, at which point SCF_ist_wrmsr becomes
ambiguous.
No functional change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 76d6a36f645dfdbad8830559d4d52caf36efc75e)
We are shortly going to need to context switch new bits in both the vcpu and
S3 paths. Introduce SCF_IST_MASK and SCF_DOM_MASK, and rework d->arch.verw
into d->arch.spec_ctrl_flags to accommodate.
No functional change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 5796912f7279d9348a3166655588d30eae9f72cc)
Anthony PERARD [Tue, 12 Jul 2022 09:16:30 +0000 (11:16 +0200)]
libxl: check return value of libxl__xs_directory in name2bdf
libxl__xs_directory() can potentially return NULL without setting `n`.
As `n` isn't initialised, we need to check libxl__xs_directory()
return value before checking `n`. Otherwise, `n` might be non-zero
with `bdfs` NULL which would lead to a segv.
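The hardened call then looks roughly like (error path illustrative):

    bdfs = libxl__xs_directory(gc, XBT_NULL, path, &n);
    if ( !bdfs || !n )
        goto out;   /* nothing to iterate; in particular don't trust 'n' */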
Fixes: 57bff091f4 ("libxl: add 'name' field to 'libxl_device_pci' in the IDL...") Reported-by: "G.R." <firemeteor@users.sourceforge.net> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: "G.R." <firemeteor@users.sourceforge.net>
master commit: d778089ac70e5b8e3bdea0c85fc8c0b9ed0eaf2f
master date: 2022-07-12 08:38:51 +0200
Charles Arnold [Tue, 12 Jul 2022 09:14:07 +0000 (11:14 +0200)]
libxc: fix compilation error with gcc13
xc_psr.c:161:5: error: conflicting types for 'xc_psr_cmt_get_data'
due to enum/integer mismatch;
Signed-off-by: Charles Arnold <carnold@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 8eeae8c2b4efefda8e946461e86cf2ae9c18e5a9
master date: 2022-07-06 13:06:40 +0200
Andrew Cooper [Tue, 12 Jul 2022 09:13:33 +0000 (11:13 +0200)]
x86/spec-ctrl: Knobs for STIBP and PSFD, and follow hardware STIBP hint
STIBP and PSFD are slightly weird bits, because they're both implied by other
bits in MSR_SPEC_CTRL. Add fine grain controls for them, and take the
implications into account when setting IBRS/SSBD.
Rearrange the IBPB text/variables/logic to keep all the MSR_SPEC_CTRL bits
together, for consistency.
However, AMD have a hardware hint CPUID bit recommending that STIBP be set
unilaterally. This is advertised on Zen3, so follow the recommendation.
Furthermore, in such cases, set STIBP behind the guest's back for now. This
has negligible overhead for the guest, but saves a WRMSR on vmentry. This is
the only default change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: fef244b179c06fcdfa581f7d57fa6e578c49ff50
master date: 2022-06-30 18:07:13 +0100
Andrew Cooper [Tue, 12 Jul 2022 09:12:46 +0000 (11:12 +0200)]
x86/spec-ctrl: Only adjust MSR_SPEC_CTRL for idle with legacy IBRS
Back at the time of the original Spectre-v2 fixes, it was recommended to clear
MSR_SPEC_CTRL when going idle. This is because of the side effects on the
sibling thread caused by the microcode IBRS and STIBP implementations which
were retrofitted to existing CPUs.
However, there are no relevant cross-thread impacts for the hardware
IBRS/STIBP implementations, so this logic should not be used on Intel CPUs
supporting eIBRS, or any AMD CPUs; doing so only adds unnecessary latency to
the idle path.
Furthermore, there's no point playing with MSR_SPEC_CTRL in the idle paths if
SMT is disabled for other reasons.
Fixes: 8d03080d2a33 ("x86/spec-ctrl: Cease using thunk=lfence on AMD") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: ffc7694e0c99eea158c32aa164b7d1e1bb1dc46b
master date: 2022-06-30 18:07:13 +0100
At the moment, corrupt() is neither checking for allocation failure
nor freeing the allocated memory.
Harden the code by printing ENOMEM if the allocation failed and
free 'str' after the last use.
This is not considered to be a security issue because corrupt() should
only be called when Xenstored thinks the database is corrupted. Note
that the trigger (i.e. a guest reliably provoking the call) would be
a security issue.
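A hedged sketch of the hardened corrupt() tail (xenstored uses talloc-style
allocation; the exact log call and allocation helper are assumptions):

    char *str = talloc_vasprintf(NULL, fmt, ap);

    log("corruption detected: %s", str ?: "ENOMEM");
    talloc_free(str);   /* safe even when str is NULL */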
Fixes: 06d17943f0cd ("Added a basic integrity checker, and some basic ability to recover from store") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: db3382dd4f468c763512d6bf91c96773395058fb
master date: 2022-06-23 13:44:10 +0100
Jan Beulich [Tue, 12 Jul 2022 09:10:34 +0000 (11:10 +0200)]
IOMMU/x86: work around bogus gcc12 warning in hvm_gsi_eoi()
As per [1] the expansion of the pirq_dpci() macro causes a -Waddress
controlled warning (enabled implicitly in our builds, if not by default)
tying the middle part of the involved conditional expression to the
surrounding boolean context. Work around this by introducing a local
inline function in the affected source file.
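The workaround, in sketch form (the wrapper's name is illustrative):

    /* A local inline wrapper detaches the macro expansion from the boolean
     * context that triggers -Waddress. */
    static struct hvm_pirq_dpci *_pirq_dpci(struct pirq *pirq)
    {
        return pirq_dpci(pirq);
    }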
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102967
master commit: 80ad8db8a4d9bb24952f0aea788ce6f47566fa76
master date: 2022-06-15 10:19:32 +0200
Andrew Cooper [Mon, 13 Jun 2022 18:18:32 +0000 (19:18 +0100)]
x86/spec-ctrl: Add spec-ctrl=unpriv-mmio
Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.
As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.
However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage. Such
users should enable this option.
On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.
On CPUs fixed to MDS but vulnerable to MMIO stale data leakage, this option:
* On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
using FB_CLEAR.
* On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
srb-lock, previously used to mitigate SRBDS.
Both mitigations require microcode from IPU 2022.1, May 2022.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 8c24b70fedcb52633b2370f834d8a2be3f7fa38e)
Andrew Cooper [Mon, 20 Sep 2021 17:47:49 +0000 (18:47 +0100)]
x86/spec-ctrl: Enumeration for MMIO Stale Data controls
The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.
FB_CLEAR indicates that the VERW instruction has re-gained its Fill Buffer
flushing side effect. This is only enumerated on parts where VERW had
previously lost its flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.
FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 2ebe8fe9b7e0d36e9ec3cfe4552b2b197ef0dcec)
Andrew Cooper [Mon, 13 Jun 2022 15:19:01 +0000 (16:19 +0100)]
x86/spec-ctrl: Make VERW flushing runtime conditional
Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type. However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.
Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.
Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.
For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.
No change in behaviour.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e06b95c1d44ab80da255219fc9f1e2fc423edcb6)
Jan Beulich [Fri, 10 Jun 2022 08:28:28 +0000 (10:28 +0200)]
x86/mm: account for PGT_pae_xen_l2 in recently added assertion
While PGT_pae_xen_l2 will be zapped once the type refcount of an L2 page
reaches zero, it'll be retained as long as the type refcount is non-
zero. Hence any checking against the requested type needs to either zap
the bit from the type or include it in the used mask.
Fixes: 9186e96b199e ("x86/pv: Clean up _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c2095ac76be0f4a1940346c9ffb49fb967345060
master date: 2022-06-10 10:21:06 +0200
Andrew Cooper [Thu, 9 Jun 2022 13:29:38 +0000 (15:29 +0200)]
x86/pv: Track and flush non-coherent mappings of RAM
There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with
devices that make non-coherent writes. The Linux sound subsystem makes
extensive use of this technique.
For such usecases, the guest's DMA buffer is mapped and consistently used as
WC, and Xen doesn't interact with the buffer.
However, a mischievous guest can use WC mappings to deliberately create
non-coherency between the cache and RAM, and use this to trick Xen into
validating a pagetable which isn't actually safe.
Allocate a new PGT_non_coherent to track the non-coherency of mappings. Set
it whenever a non-coherent writeable mapping is created. If the page is used
as anything other than PGT_writable_page, force a cache flush before
validation. Also force a cache flush before the page is returned to the heap.
This is CVE-2022-26364, part of XSA-402.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c1c9cae3a9633054b177c5de21ad7268162b2f2c
master date: 2022-06-09 14:23:37 +0200
Andrew Cooper [Thu, 9 Jun 2022 13:29:13 +0000 (15:29 +0200)]
x86/amd: Work around CLFLUSH ordering on older parts
On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything,
including reads and writes to the address, and LFENCE/SFENCE instructions.
This creates a multitude of problematic corner cases, laid out in the manual.
Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering.
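Conceptually (Xen's actual code patches the fences in via alternatives;
this plain form is a sketch):

    /* Force ordering around CLFLUSH on affected AMD parts. */
    asm volatile ( "mfence; clflush %0; mfence"
                   :: "m" (*(const char *)addr) );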
This is part of XSA-402.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 062868a5a8b428b85db589fa9a6d6e43969ffeb9
master date: 2022-06-09 14:23:07 +0200
Andrew Cooper [Thu, 9 Jun 2022 13:28:48 +0000 (15:28 +0200)]
x86: Split cache_flush() out of cache_writeback()
Subsequent changes will want a fully flushing version.
Use the new helper rather than opencoding it in flush_area_local(). This
resolves an outstanding issue where the conditional sfence is on the wrong
side of the clflushopt loop. clflushopt is ordered with respect to older
stores, not to younger stores.
Rename gnttab_cache_flush()'s helper to avoid colliding in name.
grant_table.c can see the prototype from cache.h so the build fails
otherwise.
This is part of XSA-402.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9a67ffee3371506e1cbfdfff5b90658d4828f6a2
master date: 2022-06-09 14:22:38 +0200
Andrew Cooper [Thu, 9 Jun 2022 13:28:23 +0000 (15:28 +0200)]
x86: Don't change the cacheability of the directmap
Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings
in response to guest mapping requests") attempted to keep the cacheability
consistent between different mappings of the same page.
The reason wasn't described in the changelog, but it is understood to be in
regards to a concern over machine check exceptions, owing to errata when using
mixed cacheabilities. It did this primarily by updating Xen's mapping of the
page in the direct map when the guest mapped a page with reduced cacheability.
Unfortunately, the logic didn't actually prevent mixed cacheability from
occurring:
* A guest could map a page normally, and then map the same page with
different cacheability; nothing prevented this.
* The cacheability of the directmap was always latest-takes-precedence in
terms of guest requests.
* Grant-mapped frames with lesser cacheability didn't adjust the page's
cacheattr settings.
* The map_domain_page() function still unconditionally created WB mappings,
irrespective of the page's cacheattr settings.
Additionally, update_xen_mappings() had a bug where the alias calculation was
wrong for mfn's which were .init content, which should have been treated as
fully guest pages, not Xen pages.
Worse yet, the logic introduced a vulnerability whereby necessary
pagetable/segdesc adjustments made by Xen in the validation logic could become
non-coherent between the cache and main memory. The CPU could subsequently
operate on the stale value in the cache, rather than the safe value in main
memory.
The directmap contains primarily mappings of RAM. PAT/MTRR conflict
resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser
cacheability resolves to being coherent. The special case is WC mappings,
which are non-coherent against MTRR=WB regions (except for fully-coherent
CPUs).
Xen must not have any WC cacheability in the directmap, to prevent Xen's
actions from creating non-coherency. (Guest actions creating non-coherency
are dealt with in subsequent patches.) As all memory types for MTRR=WB
ranges inter-operate coherently, leave Xen's directmap mappings as WB.
Only PV guests with access to devices can use reduced-cacheability mappings to
begin with, and they're trusted not to mount DoSs against the system anyway.
Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them.
Shift the later PGC_* constants up, to gain 3 extra bits in the main reference
count. Retain the check in get_page_from_l1e() for special_pages() because a
guest has no business using reduced cacheability on these.
Andrew Cooper [Thu, 9 Jun 2022 13:27:37 +0000 (15:27 +0200)]
x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 7.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 8cc5036bc385112a82f1faff27a0970e6440dfed
master date: 2022-06-09 14:21:04 +0200
Andrew Cooper [Thu, 9 Jun 2022 13:27:19 +0000 (15:27 +0200)]
x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop being infinite is the compiler barrier hidden in cpu_relax().
* Replace one page_get_owner(page) with the already-calculated 'd' in
scope.
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9186e96b199e4f7e52e033b238f9fe869afb69c7
master date: 2022-06-09 14:20:36 +0200
Jan Beulich [Tue, 7 Jun 2022 12:08:06 +0000 (14:08 +0200)]
PCI: don't allow "pci-phantom=" to mark real devices as phantom functions
IOMMU code mapping / unmapping devices and interrupts will misbehave if
a wrong command line option declared a function "phantom" when there's a
real device at that position. Warn about this and adjust the specified
stride (in the worst case ignoring the option altogether).
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 444b555dc9e09fa3ce90f066e0c88dec9b47f422
master date: 2022-05-20 12:20:35 +0200
Intel LPSS has INTERRUPT_LINE set to 0xff by default, that is declared
by the PCI Local Bus Specification Revision 3.0 (from 2004) as
"unknown"/"no connection". Fallback to poll mode in this case.
The 0xff handling is x86-specific, the surrounding code is guarded with
CONFIG_X86 anyway.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6a2ea1a2370a0c8a0210accac0ae62e68c185134
master date: 2022-05-20 12:19:45 +0200
Jan Beulich [Tue, 7 Jun 2022 12:07:11 +0000 (14:07 +0200)]
build: silence GNU ld warning about executable stacks
While for C files the compiler is supposed to arrange for emitting
respective information, for assembly sources we're responsible ourselves.
Present GNU ld master started warning about such, and hence 2.39 is
anticipated to have this warning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 62d22296a95d259c934ca2f39ac511d729cfbb68
master date: 2022-05-18 11:18:45 +0200
Jan Beulich [Tue, 7 Jun 2022 12:06:51 +0000 (14:06 +0200)]
build: suppress GNU ld warning about RWX load segments
We cannot really avoid such and we're also not really at risk because of
them, as we control page table permissions ourselves rather than relying
on a loader of some sort. Present GNU ld master started warning about
such, and hence 2.39 is anticipated to have this warning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 68f5aac012b9ae36ce9b65d9ca9cc9f232191ad3
master date: 2022-05-18 11:17:19 +0200
Julien Grall [Tue, 7 Jun 2022 12:06:11 +0000 (14:06 +0200)]
xen: io: Fix race between sending an I/O and domain shutdown
Xen provides hypercalls to shutdown (SCHEDOP_shutdown{,_code}) and
resume a domain (XEN_DOMCTL_resumedomain). They can be used for checkpoint
where the expectation is the domain should continue as nothing happened
afterwards.
hvmemul_do_io() and handle_pio() will act differently if the return
code of hvm_send_ioreq() (resp. hvmemul_do_pio_buffer()) is X86EMUL_RETRY.
In this case, the I/O state will be reset to STATE_IOREQ_NONE (i.e
no I/O is pending) and/or the PC will not be advanced.
If the shutdown request happens right after the I/O was sent to the
IOREQ, then the emulation code will end up re-executing the instruction
and therefore forwarding the same I/O again (at least when reading an IO port).
This would be a problem if the access has a side-effect. A dumb example
is a device implementing a counter which is incremented by one for every
access. When running shutdown/resume in a loop, the value read by the
OS may not be the old value + 1.
Add an extra boolean in the structure hvm_vcpu_io to indicate whether
the I/O was suspended. This is then used in place of checking the domain
is shutting down in hvmemul_do_io() and handle_pio() as they should
act on suspend (i.e. vcpu_start_shutdown_deferral() returns false) rather
than shutdown.
This confuses some tools (like gdb) and prevents proper parsing of the
binary.
The issue has already been reported and is being fixed in LLD. In
order to work around this issue and keep GNU ld support, define
different DECL_SECTION macros depending on the ld implementation in
use.
Drop the quotes from the definitions of the debug sections in
DECL_DEBUG{2}, as those quotes are not required for GNU ld either.
Fixes: 6254920587c3 ('x86: quote section names when defining them in linker script') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 702c9a800eb3ecd4b8595998d37a769d470c5bb0
master date: 2022-05-02 08:51:45 +0200
Roger Pau Monné [Tue, 7 Jun 2022 12:05:06 +0000 (14:05 +0200)]
kconfig: detect LD implementation
Detect GNU and LLVM ld implementations. This is required for further
patches that will introduce diverging behaviour depending on the
linker implementation in use.
Note that LLVM ld returns "compatible with GNU linkers" as part of the
version string, so be on the safe side and use '^' to only match at
the start of the line in case LLVM ever decides to change the text to
use "compatible with GNU ld" instead.
Roger Pau Monné [Tue, 7 Jun 2022 12:04:16 +0000 (14:04 +0200)]
x86/msr: handle reads to MSR_P5_MC_{ADDR,TYPE}
Windows Server 2019 Essentials will unconditionally attempt to read the
P5_MC_ADDR MSR at boot and throw a BSOD if a #GP fault is injected.
Fix this by mapping MSR_P5_MC_{ADDR,TYPE} to
MSR_IA32_MCi_{ADDR,STATUS}, as reportedly also done by hardware, per the
Intel SDM "Mapping of the Pentium Processor Machine-Check Errors to the
Machine-Check Architecture" section.
Jan Beulich [Tue, 7 Jun 2022 12:03:20 +0000 (14:03 +0200)]
IOMMU/x86: disallow device assignment to PoD guests
While it is okay for IOMMU page tables to be set up for guests starting
in PoD mode, actual device assignment may only occur once all PoD
entries have been removed from the P2M. So far this was enforced only
for boot-time assignment, and only in the tool stack.
Also use the new function to replace p2m_pod_entry_count(): Its unlocked
access to p2m->pod.entry_count wasn't really okay (irrespective of the
result being stale by the time the caller gets to see it). Nor was the
use of that function in line with the immediately preceding comment: A
PoD guest isn't just one with a non-zero entry count, but also one with
a non-empty cache (e.g. prior to actually launching the guest).
To allow the tool stack to see a consistent snapshot of PoD state, move
the tail of XENMEM_{get,set}_pod_target handling into a function, adding
proper locking there.
In libxl take the liberty to use the new local variable r also for a
pre-existing call into libxc.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: ad4312d764e8b40a1e45b64aac6d840a60c59f13
master date: 2022-05-02 08:48:02 +0200
Jan Beulich [Tue, 7 Jun 2022 12:02:30 +0000 (14:02 +0200)]
IOMMU: make domctl handler tolerate NULL domain
Besides the reporter's issue of hitting a NULL deref when !CONFIG_GDBSX,
XEN_DOMCTL_test_assign_device can legitimately end up having NULL passed
here, when the domctl was passed DOMID_INVALID.
Fixes: 71e617a6b8f6 ("use is_iommu_enabled() where appropriate...") Reported-by: Cheyenne Wills <cheyenne.wills@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: fa4d84e6dd3c3bfd23a525b75a5483d4ce15adbb
master date: 2022-04-26 10:25:54 +0200
Juergen Gross [Tue, 7 Jun 2022 12:02:08 +0000 (14:02 +0200)]
xen/iommu: cleanup iommu related domctl handling
Today iommu_do_domctl() is being called from arch_do_domctl() in the
"default:" case of a switch statement. This has led already to crashes
due to unvalidated parameters.
Fix that by moving the call of iommu_do_domctl() to the main switch
statement of do_domctl().
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> # Arm
master commit: 9cd7e31b3f584e97a138a770cfb031a91a867936
master date: 2022-04-26 10:23:58 +0200
Juergen Gross [Tue, 7 Jun 2022 12:01:27 +0000 (14:01 +0200)]
tools/libs/guest: don't set errno to a negative value
Setting errno to a negative error value makes no sense.
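For illustration, the corrected pattern is simply (error value chosen
arbitrarily):

    if ( rc < 0 )
    {
        errno = ENOMEM;   /* a positive errno value; not: errno = -ENOMEM; */
        return -1;
    }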
Fixes: cb99a64029c9 ("libxc: arm: allow passing a device tree blob to the guest") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 438e96ab479495a932391a22e219ee62fa8c4f47
master date: 2022-04-22 20:39:34 +0100
Juergen Gross [Tue, 7 Jun 2022 12:01:03 +0000 (14:01 +0200)]
tools/libs/ctrl: don't set errno to a negative value
The claimed reason for setting errno to -1 is wrong. On x86
xc_domain_pod_target() will set errno to a sane value in the error
case.
Fixes: ff1745d5882b ("tools: libxl: do not set the PoD target on ARM") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0fb7e0e73483ed042d5ca34861a891a51ad337b
master date: 2022-04-22 20:39:34 +0100
Juergen Gross [Tue, 7 Jun 2022 12:00:31 +0000 (14:00 +0200)]
tools/libs/evtchn: don't set errno to negative values
Setting errno to a negative value makes no sense.
Fixes: 6b6500b3cbaa ("tools/libs/evtchn: Add support for restricting a handle") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 60245b71c1cd001686fa7b7a26869cbcb80d074c
master date: 2022-04-22 20:39:34 +0100
David Vrabel [Tue, 7 Jun 2022 11:59:31 +0000 (13:59 +0200)]
x86/mm: avoid inadvertently degrading a TLB flush to local only
If the direct map is incorrectly modified with interrupts disabled,
the required TLB flushes are degraded to flushing the local CPU only.
This could lead to very hard to diagnose problems, as different CPUs
would end up with different views of memory, although no such issues
have yet been identified.
Change the check in the flush_area() macro to look at system_state
instead. This defers the switch from local to all later in the boot
(see xen/arch/x86/setup.c:__start_xen()). This is fine because
additional PCPUs are not brought up until after the system state is
SYS_STATE_smp_boot.
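A sketch of the revised macro (close to, but not necessarily identical
with, the committed form):

    #define flush_area(v, f)                          \
        (system_state < SYS_STATE_smp_boot            \
         ? flush_area_local(v, f)                     \
         : flush_area_all(v, f))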
Signed-off-by: David Vrabel <dvrabel@amazon.co.uk> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/flushtlb: remove flush_area check on system state
Booting with Shadow Stacks leads to the following assert on a debug
hypervisor:
Assertion 'local_irq_is_enabled()' failed at arch/x86/smp.c:265
----[ Xen-4.17.0-10.24-d x86_64 debug=y Not tainted ]----
CPU: 0
RIP: e008:[<ffff82d040345300>] flush_area_mask+0x40/0x13e
[...]
Xen call trace:
[<ffff82d040345300>] R flush_area_mask+0x40/0x13e
[<ffff82d040338a40>] F modify_xen_mappings+0xc5/0x958
[<ffff82d0404474f9>] F arch/x86/alternative.c#_alternative_instructions+0xb7/0xb9
[<ffff82d0404476cc>] F alternative_branches+0xf/0x12
[<ffff82d04044e37d>] F __start_xen+0x1ef4/0x2776
[<ffff82d040203344>] F __high_start+0x94/0xa0
This is due to SYS_STATE_smp_boot being set before calling
alternative_branches(), and the flush in modify_xen_mappings() then
using flush_area_all() with interrupts disabled. Note that
alternative_branches() is called before APs are started, so the flush
must be a local one (and indeed the cpumask passed to
flush_area_mask() just contains one CPU).
Take the opportunity to simplify a bit the logic and make flush_area()
an alias of flush_area_all() in mm.c, taking into account that
cpu_online_map just contains the BSP before APs are started. This
requires widening the assert in flush_area_mask() to allow being
called with interrupts disabled as long as it's strictly a local only
flush.
The overall result is that a conditional can be removed from
flush_area().
While there also introduce an ASSERT to check that a vCPU state flush
is not issued for the local CPU only.
Fixes: 78e072bc37 ('x86/mm: avoid inadvertently degrading a TLB flush to local only') Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 78e072bc375043e81691a59454e09f0b38241ddd
master date: 2022-04-20 10:55:01 +0200
master commit: 9f735ee4903f1b9f1966bb4ba5b5616b03ae08b5
master date: 2022-05-25 11:09:46 +0200
Jan Beulich [Tue, 7 Jun 2022 11:58:16 +0000 (13:58 +0200)]
VT-d: refuse to use IOMMU with reserved CAP.ND value
The field taking the value 7 (resulting in 18-bit DIDs when using the
calculation in cap_ndoms(), when the DID fields are only 16 bits wide)
is reserved. Instead of misbehaving in case we would encounter such an
IOMMU, refuse to use it.
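In sketch form (the ND field occupies the low bits of the capability
register; the error path is illustrative):

    /* CAP.ND == 7 is reserved: cap_ndoms() would yield 18-bit DIDs. */
    if ( (iommu->cap & 7ULL) == 7 )
    {
        printk(XENLOG_ERR "IOMMU: reserved CAP.ND value, not using\n");
        return -EINVAL;
    }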
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a1545fbf45c689aff39ce76a6eaa609d32ef72a7
master date: 2022-04-20 10:54:26 +0200
Juergen Gross [Tue, 7 Jun 2022 11:56:54 +0000 (13:56 +0200)]
xen: fix XEN_DOMCTL_gdbsx_guestmemio crash
A hypervisor built without CONFIG_GDBSX will crash in case the
XEN_DOMCTL_gdbsx_guestmemio domctl is being called, as the call will
end up in iommu_do_domctl() with d == NULL.
It used to be permitted to pass DOMID_IDLE to dbg_rw_mem(), which is why the
special case skipping the domid checks exists. Now that it is only permitted
to pass proper domids, remove the special case, making 'd' always valid.
Reported-by: Cheyenne Wills <cheyenne.wills@gmail.com> Fixes: e726a82ca0dc ("xen: make gdbsx support configurable") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f00daf1fb3213a9b0335d9dcd90fe9cb5c02b7a9
master date: 2022-04-19 17:07:08 +0100
Jason Andryuk [Tue, 7 Jun 2022 11:55:39 +0000 (13:55 +0200)]
x86/irq: skip unmap_domain_pirq XSM during destruction
xsm_unmap_domain_irq was seen denying unmap_domain_pirq when called from
complete_domain_destroy as an RCU callback. The source context was an
unexpected, random domain. Since this is a Xen-internal operation,
going through the XSM hook is inappropriate.
Check d->is_dying and skip the XSM hook when set since this is a cleanup
operation for a domain being destroyed.
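In sketch form (the hook's exact arguments are elided):

    /* Cleanup for a dying domain is a Xen-internal operation: don't
     * consult XSM, whose answer would depend on an unrelated context. */
    if ( !d->is_dying )
        ret = xsm_unmap_domain_irq(XSM_HOOK, d, irq, data);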
Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 2e6f95a942d1927a53f077c301db0b799c54c05a
master date: 2022-04-08 14:51:52 +0200
Track whether symbols belong to ignored sections in order to avoid
applying relocations referencing those symbols. The address of such
symbols won't be resolved and thus the relocation will likely fail or
write garbage to the destination.
Return an error in that case, as leaving unresolved relocations would
lead to malfunctioning payload code.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Bjoern Doebel <doebel@amazon.de> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 9120b5737f517fe9d2a3936c38d3a2211630323b
master date: 2022-04-08 10:27:11 +0200
A side effect of ignoring such sections is that symbols belonging to
them won't be resolved, and that could make relocations belonging to
other sections that reference those symbols fail.
For example it's likely to have an empty .altinstr_replacement with
symbols pointing to it, and marking the section as ignored will
prevent the symbols from being resolved, which in turn will cause any
relocations against them to fail.
In order to solve this do not ignore sections with 0 size, only ignore
sections that don't have the SHF_ALLOC flag set.
Special case such empty sections in move_payload so they are not taken
into account when deciding whether a livepatch can be safely
re-applied after a revert.
Fixes: 98b728a7b2 ('livepatch: Disallow applying after an revert') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Bjoern Doebel <doebel@amazon.de> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 0dc1f929e8fed681dec09ca3ea8de38202d5bf30
master date: 2022-04-08 10:24:10 +0200
Andrew Cooper [Fri, 8 Apr 2022 12:57:54 +0000 (14:57 +0200)]
x86/cpuid: Clobber CPUID leaves 0x800000{1d..20} in policies
c/s 1a914256dca5 increased the AMD max leaf from 0x8000001c to 0x80000021, but
did not adjust anything in the calculate_*_policy() chain. As a result, on
hardware supporting these leaves, we read the real hardware values into the
raw policy, then copy into host, and all the way into the PV/HVM default
policies.
All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
next by PQOS), so any software following the rules is fine and will leave them
alone. However, leaf 0x8000001d takes a subleaf input and at least two
userspace utilities have been observed to loop indefinitely under Xen (clearly
waiting for eax to report "no more cache levels").
Such userspace is buggy, but Xen's behaviour isn't great either.
In the short term, clobber all information in these leaves. This is a giant
bodge, but there are complexities with implementing all of these leaves
properly.
Fixes: 1a914256dca5 ("x86/cpuid: support LFENCE always serialising CPUID bit") Link: https://github.com/QubesOS/qubes-issues/issues/7392 Reported-by: fosslinux <fosslinux@aussies.space> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4012d50082c2eae2f3cbe7770be13b9227fbc3f
master date: 2022-04-07 11:36:45 +0100
Jan Beulich [Fri, 8 Apr 2022 12:57:25 +0000 (14:57 +0200)]
VT-d: avoid infinite recursion on domain_context_mapping_one() error path
Despite the comment there infinite recursion was still possible, by
flip-flopping between two domains. This is because prev_dom is derived
from the DID found in the context entry, which was already updated by
the time error recovery is invoked. Simply introduce yet another mode
flag to prevent rolling back an in-progress roll-back of a prior
mapping attempt.
Also drop the existing recursion prevention for having been dead anyway:
Earlier in the function we already bail when prev_dom == domain.
Jan Beulich [Fri, 8 Apr 2022 12:56:54 +0000 (14:56 +0200)]
VT-d: avoid NULL deref on domain_context_mapping_one() error paths
First there's a printk() which actually wrongly uses pdev in the first
place: We want to log the coordinates of the (perhaps fake) device
acted upon, which may not be pdev.
Then it was quite pointless for eb19326a328d ("VT-d: prepare for per-
device quarantine page tables (part I)") to add a domid_t parameter to
domain_context_unmap_one(): It's only used to pass back here via
me_wifi_quirk() -> map_me_phantom_function(). Drop the parameter again.
Finally there's the invocation of domain_context_mapping_one(), which
needs to be passed the correct domain ID. Avoid taking that path when
pdev is NULL and the quarantine state is what would need restoring to.
This means we can't security-support non-PCI-Express devices with RMRRs
(if such exist in practice) any longer; note that as of the 1st of the
two commits referenced below assigning them to DomU-s is unsupported
anyway.
Fixes: 8f41e481b485 ("VT-d: re-assign devices directly") Fixes: 14dd241aad8a ("IOMMU/x86: use per-device page tables for quarantining")
Coverity ID: 1503784 Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 608394b906e71587f02e6662597bc985bad33a5a
master date: 2022-04-07 12:30:19 +0200
Jan Beulich [Fri, 8 Apr 2022 12:55:55 +0000 (14:55 +0200)]
VT-d: don't needlessly look up DID
If get_iommu_domid() in domain_context_unmap_one() fails, we better
wouldn't clear the context entry in the first place, as we're then unable
to issue the corresponding flush. However, we have no need to look up the
DID in the first place: What needs flushing is very specifically the DID
that was in the context entry before our clearing of it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 445ab9852d69d8957467f0036098ebec75fec092
master date: 2022-04-07 12:29:03 +0200
tools/firmware: do not add a .note.gnu.property section
Prevent the assembler from creating a .note.gnu.property section on
the output objects, as it's not useful for firmware related binaries,
and breaks the resulting rombios image.
This requires modifying the cc-option Makefile macro so it can test
assembler options (by replacing the usage of the -S flag with -c) and
also stripping the -Wa, prefix if present when checking for the test
output.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e270af94280e6a9610705ebc1fdd1d7a9b1f8a98
master date: 2022-04-04 12:30:07 +0100
Do so right in firmware/Rules.mk, like it's done for other compiler
flags.
Fixes: 3667f7f8f7 ('x86: Introduce support for CET-IBT') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7225f6e0cd3afd48b4d61c43dd8fead0f4c92193
master date: 2022-04-04 12:30:00 +0100
Jason Andryuk [Wed, 6 Apr 2022 08:19:57 +0000 (10:19 +0200)]
libxl: Re-scope qmp_proxy_spawn.ao usage
I've observed this failed assertion:
libxl_event.c:2057: libxl__ao_inprogress_gc: Assertion `ao' failed.
AFAICT, this is happening in qmp_proxy_spawn_outcome where
sdss->qmp_proxy_spawn.ao is NULL.
The out label of spawn_stub_launch_dm() calls qmp_proxy_spawn_outcome(),
but it is only in the success path that sdss->qmp_proxy_spawn.ao gets
set to the current ao.
qmp_proxy_spawn_outcome() should instead use sdss->dm.spawn.ao, which is
the already in-use ao when spawn_stub_launch_dm() is called. The same
is true for spawn_qmp_proxy().
With this, move sdss->qmp_proxy_spawn.ao initialization to
spawn_qmp_proxy() since its use is for libxl__spawn_spawn() and it can
be initialized along with the rest of sdss->qmp_proxy_spawn.
Fixes: 83c845033dc8 ("libxl: use vchan for QMP access with Linux stubdomain") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: d62a34423a1a98aefd7c30e22d2d82d198f077c8
master date: 2022-04-01 17:01:57 +0100
Move dcs->console_xswait initialization into the callers of
initiate_domain_create, do_domain_create() and do_domain_soft_reset(),
so it is initialized along with the other dcs state.
Jason Andryuk [Wed, 6 Apr 2022 08:18:36 +0000 (10:18 +0200)]
xl: Fix global pci options
commit babde47a3fed "introduce a 'passthrough' configuration option to
xl.cfg..." moved the pci list parsing ahead of the global pci option
parsing. This broke the global pci configuration options since they
need to be set first so that looping over the pci devices assigns their
values.
Move the global pci options ahead of the pci list to restore their
function.
Fixes: babde47a3fed ("introduce a 'passthrough' configuration option to xl.cfg...") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: e45ad0b1b0bd6a43f59aaf4a6f86d88783c630e5
master date: 2022-03-31 19:48:12 +0100
The size of the video memory of PVH guests should be set to 0 in case
no value has been specified.
Not doing so will leave it at -1, resulting in an additional 1 kB
of RAM being advertised in the memory map (here the output of a PVH
Mini-OS boot with 16 MB of RAM assigned):
Jan Beulich [Tue, 5 Apr 2022 12:49:40 +0000 (14:49 +0200)]
IOMMU/x86: use per-device page tables for quarantining
Devices with RMRRs / unity mapped regions, due to it being unspecified
how/when these memory regions may be accessed, may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to have mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.
This is CVE-2022-26361 / part of XSA-400.
While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.
Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.
As to the removal of QUARANTINE_SKIP() from domain_context_unmap_one():
I think this was never really needed there, as the function explicitly
deals with finding a non-present context entry. Leaving it there would
require propagating pgd_maddr into the function (like was done by "VT-d:
prepare for per-device quarantine page tables" for
domain_context_mapping_one()).
In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 14dd241aad8af447680ac73e8579990e2c09c1e7
master date: 2022-04-05 14:24:18 +0200
Jan Beulich [Tue, 5 Apr 2022 12:48:58 +0000 (14:48 +0200)]
IOMMU/x86: drop TLB flushes from quarantine_init() hooks
The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prerequisite for
per-device domain ID use when quarantining devices, as dom_io isn't
going to be assigned a DID anymore: the warning in get_iommu_did()
would otherwise trigger.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 54c5cef49239e2f27ec3b3fc8804bf57aa4bf46d
master date: 2022-04-05 14:19:42 +0200
Jan Beulich [Tue, 5 Apr 2022 12:48:29 +0000 (14:48 +0200)]
IOMMU/x86: maintain a per-device pseudo domain ID
In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.
Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.
Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.
The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped,
while still avoiding the addition of a new parameter to
domain_context_unmap().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 97af062b89d52c0ecf7af254b53345c97d438e33
master date: 2022-04-05 14:19:10 +0200
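A sketch of the ID-space split (the 15-bit usable width and the macro
name are assumptions for illustration, not the exact Xen definitions):
keeping real domain IDs in the lower half leaves the upper half free
for per-device pseudo IDs.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint16_t domid_t;

    #define DOMID_MASK 0x7fffU  /* assumed usable domain ID bits */

    /* Setting the top usable bit yields an ID no real domain can hold. */
    static domid_t make_pseudo_domid(unsigned int idx)
    {
        return (domid_t)(idx | (DOMID_MASK & ~(DOMID_MASK >> 1)));
    }

    static bool is_pseudo_domid(domid_t id)
    {
        return (id & (DOMID_MASK & ~(DOMID_MASK >> 1))) != 0;
    }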
Jan Beulich [Tue, 5 Apr 2022 12:48:09 +0000 (14:48 +0200)]
VT-d: prepare for per-device quarantine page tables (part II)
Replace the passing of struct domain * by domid_t in preparation of
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.
No functional change intended (except for slightly adjusted log message
text).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 7131163c4806e3c7de24873164d1a003d2a27dee
master date: 2022-04-05 14:18:48 +0200
Jan Beulich [Tue, 5 Apr 2022 12:47:32 +0000 (14:47 +0200)]
VT-d: prepare for per-device quarantine page tables (part I)
Arrange for the domain ID and the page table root to be passed around,
the latter in particular to domain_pgd_maddr(), so that taking it from
the per-domain fields can be overridden.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: eb19326a328d49a6a4dc3930391b340f3bcd8948
master date: 2022-04-05 14:18:26 +0200
Jan Beulich [Tue, 5 Apr 2022 12:47:05 +0000 (14:47 +0200)]
AMD/IOMMU: re-assign devices directly
Since it is unspecified how/when devices with unity map ranges may
access those memory ranges, such devices may not be left disconnected
from their unity mappings (as long as it's not certain that the device
has been fully quiesced). Hence rather than tearing down the old root
page table pointer and then establishing the new one, re-assignment
needs to be done in a single step.
This is CVE-2022-26360 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly, quarantining in scratch-page mode relies on the page tables
remaining continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges. The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, the availability of which we can't take for granted.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1fa6e9aa36233fe9c29a204fcb2697e985b8345f
master date: 2022-04-05 14:18:04 +0200
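A hedged sketch of the atomic-update idea (the DTE layout is simplified
and the real code hand-rolls CMPXCHG16B with a more careful
ordered-write fallback): with -mcx16, GCC/Clang compile the 16-byte
compare-exchange below to CMPXCHG16B.

    #include <stdint.h>

    /* Simplified 128-bit device table entry for illustration. */
    union dte128 {
        unsigned __int128 raw;
        uint64_t raw64[2];
    };

    static void update_dte(union dte128 *dte, union dte128 new_dte,
                           int have_unity_map)
    {
        if (have_unity_map) {
            /* The whole entry must change in one shot. */
            unsigned __int128 old = dte->raw;

            while (!__atomic_compare_exchange_n(&dte->raw, &old,
                                                new_dte.raw, 0,
                                                __ATOMIC_SEQ_CST,
                                                __ATOMIC_RELAXED))
                ; /* retry until the full entry is swapped atomically */
        } else {
            /* Ordered writes suffice: update the half without the valid
             * bit first, then the half containing it. */
            dte->raw64[1] = new_dte.raw64[1];
            __atomic_thread_fence(__ATOMIC_RELEASE);
            dte->raw64[0] = new_dte.raw64[0];
        }
    }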
Jan Beulich [Tue, 5 Apr 2022 12:46:45 +0000 (14:46 +0200)]
VT-d: re-assign devices directly
Since it is unspecified how/when devices with RMRRs may access the
specified memory regions, such devices may not be left disconnected
from their respective mappings (as long as it's not certain that the
device has been fully quiesced). Hence rather than unmapping the old
context and then mapping the new one, re-assignment needs to be done in
a single step.
This is CVE-2022-26359 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly, quarantining in scratch-page mode relies on the page tables
remaining continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, the availability of which we can't take for granted.
The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: this way, callers
passing NULL for pdev don't need to be touched.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8f41e481b4852173909363b88c1ab3da747d3a05
master date: 2022-04-05 14:17:42 +0200
Jan Beulich [Tue, 5 Apr 2022 12:46:03 +0000 (14:46 +0200)]
VT-d: drop ownership checking from domain_context_mapping_one()
Despite putting in quite a bit of effort, it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Rather than have a subsequent change further complicate this
logic, simply get rid of it.
Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a680b8134b2d1828bbbf443a97feea66e8a85c75
master date: 2022-04-05 14:17:21 +0200
Jan Beulich [Tue, 5 Apr 2022 12:44:53 +0000 (14:44 +0200)]
VT-d: fix add/remove ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.
Also switch to %pd in related log messages.
Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR") Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3221f270cf2eba0a22fd4f92319d664eacb92889
master date: 2022-04-05 14:16:10 +0200
Jan Beulich [Tue, 5 Apr 2022 12:44:14 +0000 (14:44 +0200)]
VT-d: fix (de)assign ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.
Also adjust a related log message.
This is CVE-2022-26358 / part of XSA-400.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 78a40f8b5dfa1a3aec43528663f39473d4429101
master date: 2022-04-05 14:15:33 +0200
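The ordering rule shared by the two RMRR fixes above, in sketch form
(all helpers are illustrative, not the VT-d driver's real functions):
the identity mappings must exist for the entire window in which the
context entry points at the domain.

    struct domain;
    struct pci_dev;

    int map_rmrr_regions(struct domain *d, struct pci_dev *pdev);
    void unmap_rmrr_regions(struct domain *d, struct pci_dev *pdev);
    int update_context_entry(struct domain *d, struct pci_dev *pdev);
    int clear_context_entry(struct pci_dev *pdev);

    static int assign_with_rmrr(struct domain *d, struct pci_dev *pdev)
    {
        int rc = map_rmrr_regions(d, pdev);   /* mappings first */

        if (rc)
            return rc;

        rc = update_context_entry(d, pdev);   /* then the context entry */
        if (rc)
            unmap_rmrr_regions(d, pdev);      /* roll back on failure */
        return rc;
    }

    static int deassign_with_rmrr(struct domain *d, struct pci_dev *pdev)
    {
        int rc = clear_context_entry(pdev);   /* context entry first */

        if (!rc)
            unmap_rmrr_regions(d, pdev);      /* tear down only on success */
        return rc;
    }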
Jan Beulich [Tue, 5 Apr 2022 12:43:57 +0000 (14:43 +0200)]
VT-d: correct ordering of operations in cleanup_domid_map()
The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so it needs
to play safe wrt other accesses to domid_map[] and domid_bitmap[]. This
avoids context_set_domain_id()'s write to domid_map[] being reset to
zero right away when it races the freeing of a DID.
For the interaction with context_set_domain_id() and did_to_domain_id()
see the code comment.
{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.
domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.
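A sketch of the safe ordering (types and atomics simplified; the real
code also carries a comment explaining the interaction with
context_set_domain_id() and did_to_domain_id()):

    #include <stdint.h>

    #define BITS_PER_LONG (8 * sizeof(unsigned long))

    /* Zap the did -> domain mapping before freeing the DID, with release
     * semantics so the two stores cannot be reordered. The reverse order
     * would let a racing allocator grab the DID and write a fresh
     * domid_map[] entry, only for this function to zero it right away. */
    static void cleanup_domid_map(unsigned int did, uint16_t *domid_map,
                                  unsigned long *domid_bitmap)
    {
        __atomic_store_n(&domid_map[did], 0, __ATOMIC_RELEASE);
        __atomic_and_fetch(&domid_bitmap[did / BITS_PER_LONG],
                           ~(1UL << (did % BITS_PER_LONG)),
                           __ATOMIC_RELEASE);
    }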
x86/hap: do not switch on log dirty for VRAM tracking
XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.
This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:
Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]----
CPU: 34
RIP: e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6
RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
[<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
[<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
[<ffff82d04031577f>] F paging_domctl+0x251/0xd41
[<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
[<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
[<ffff82d0403a729d>] F lstar_enter+0x12d/0x140
The assert triggered because the stubdom used
XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty became enabled while the old
structures were being retired, leading to new entries being populated
in already-cleared slots.
Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.
Note that paging_log_dirty_range (used to get the dirty bitmap for
VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn in the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.
This is CVE-2022-26356 / XSA-397.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4f4db53784d912c4f409a451c36ebfd4754e0a42
master date: 2022-04-05 14:11:30 +0200
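A sketch of that accounting path (the p2m helpers are illustrative, not
the real Xen API): no log-dirty bitmap is involved; a gfn counts as
dirty when a write fault has flipped it from p2m_ram_logdirty back to
p2m_ram_rw since the previous scan.

    struct domain;

    enum p2m_type { p2m_ram_rw, p2m_ram_logdirty };

    enum p2m_type get_p2m_type(struct domain *d, unsigned long gfn);
    void set_p2m_type(struct domain *d, unsigned long gfn, enum p2m_type t);
    void mark_dirty(unsigned long *bitmap, unsigned long bit);

    static void vram_dirty_scan(struct domain *d, unsigned long begin,
                                unsigned long end, unsigned long *bitmap)
    {
        for (unsigned long gfn = begin; gfn < end; gfn++) {
            if (get_p2m_type(d, gfn) != p2m_ram_logdirty) {
                mark_dirty(bitmap, gfn - begin);        /* written since last scan */
                set_p2m_type(d, gfn, p2m_ram_logdirty); /* re-arm write faults */
            }
        }
    }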
Jan Beulich [Thu, 31 Mar 2022 09:00:57 +0000 (11:00 +0200)]
livepatch: account for patch offset when applying NOP patch
While not triggered by the trivial xen_nop in-tree patch on
staging/master, that patch exposes a problem on the stable trees, where
all functions have ENDBR inserted. When NOP-ing out a range, we need to
account for this. Handle this right in livepatch_insn_len().
This requires livepatch_insn_len() to be called _after_ ->patch_offset
was set.
Fixes: 6974c75180f1 ("xen/x86: Livepatch: support patching CET-enhanced functions") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8a87b9a0fb0564f9d68f0be0a0d1a17c34117b8b
master date: 2022-03-31 10:45:46 +0200
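In sketch form (field names follow the description above; treat this as
an assumption-laden model rather than the exact Xen header): with
->patch_offset set first, the NOP range shrinks by the bytes kept for
ENDBR.

    #include <stdint.h>

    #define ARCH_PATCH_INSN_SIZE 5  /* the 5-byte JMP used for patching */

    struct livepatch_func {
        void *new_addr;        /* NULL means "NOP the old code out" */
        uint32_t new_size;
        uint8_t patch_offset;  /* bytes skipped at the function start */
    };

    /* Must only be called after ->patch_offset has been set, so a NOP
     * patch does not overwrite the preserved ENDBR64 instruction. */
    static uint32_t livepatch_insn_len(const struct livepatch_func *f)
    {
        if (!f->new_addr)
            return f->new_size - f->patch_offset;
        return ARCH_PATCH_INSN_SIZE;
    }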
Roger Pau Monné [Thu, 31 Mar 2022 08:58:42 +0000 (10:58 +0200)]
vpci/msix: fix PBA accesses
Map the PBA in order to access it from the MSI-X read and write
handlers. Note that previously the handlers would pass the host
physical address into the {read,write}{l,q} accessors, which is wrong,
as those expect a linear address.
Map the PBA using ioremap when the first access is performed. Note
that 32bit arches might want to abstract the call to ioremap into a
vPCI arch handler, so they can use a fixmap range to map the PBA.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Alex Olson <Alex.Olson@starlab.io>
master commit: b4f21160601155762a4d014db9623af921fec959
master date: 2022-03-09 16:21:01 +0100
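The map-on-first-access pattern, sketched (the vpci field name and the
ioremap wrappers are assumptions here; the real handler also deals with
table sizes and offsets):

    #include <stddef.h>

    struct vpci_msix {
        void *pba;  /* cached linear mapping of the PBA, NULL until used */
    };

    void *ioremap(unsigned long paddr, unsigned long size);
    void iounmap(void *va);
    unsigned long pba_paddr(const struct vpci_msix *msix);
    unsigned long pba_size(const struct vpci_msix *msix);

    static void *get_pba(struct vpci_msix *msix)
    {
        void *pba = __atomic_load_n(&msix->pba, __ATOMIC_ACQUIRE);
        void *expected = NULL;

        if (pba)
            return pba;

        pba = ioremap(pba_paddr(msix), pba_size(msix));
        if (!pba)
            return NULL;

        /* Publish the mapping; if another CPU raced us, keep its copy. */
        if (!__atomic_compare_exchange_n(&msix->pba, &expected, pba, 0,
                                         __ATOMIC_ACQ_REL,
                                         __ATOMIC_ACQUIRE)) {
            iounmap(pba);
            pba = expected;
        }
        return pba;
    }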
Roger Pau Monné [Thu, 31 Mar 2022 08:57:23 +0000 (10:57 +0200)]
x86/Kconfig: introduce option to select retpoline usage
Add a new Kconfig option under the "Speculative hardening" section
that allows selecting whether to enable retpoline. This depends on the
underlying compiler having retpoline support.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 95d9ab46143685f169f636cfdd7997e2fc630e86
master date: 2022-02-21 18:17:56 +0000
Roger Pau Monné [Thu, 31 Mar 2022 08:56:34 +0000 (10:56 +0200)]
x86/clang: add retpoline support
Detect whether the compiler supports clang's retpoline option, and
enable it by default if available, just as is done for gcc.
Note clang already disables jump tables when retpoline is enabled, so
there's no need to also pass the fno-jump-tables parameter. Also, on
amd64 clang always passes the return address in a register, so there's
no need for any equivalent of the mindirect-branch-register
parameter.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9412486707f8f1ca2eb31c2ef330c5e39c0a2f30
master date: 2022-02-21 18:17:56 +0000
Roger Pau Monné [Thu, 31 Mar 2022 08:54:08 +0000 (10:54 +0200)]
x86/retpoline: split retpoline compiler support into separate option
Keep the previous option as a way to signal generic retpoline support
regardless of the underlying compiler, while introducing a new
CC_HAS_INDIRECT_THUNK that signals whether the underlying compiler
supports retpoline.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e245bc154300b5d0367b64e8b937c9d1da508ad3
master date: 2022-02-21 18:17:56 +0000
Bjoern Doebel [Wed, 9 Mar 2022 15:22:03 +0000 (16:22 +0100)]
livepatch: resolve old address before function verification
When verifying that a livepatch can be applied, we may also want to
inspect the target function to be patched. To do so, we need to resolve
that function's address before running the arch-specific
livepatch_verify hook.
Signed-off-by: Bjoern Doebel <doebel@amazon.de> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
(cherry picked from commit 5142dc5c25e317c208e3dc16d16b664b9f05dab5)
Andrew Cooper [Mon, 28 Feb 2022 19:31:00 +0000 (19:31 +0000)]
x86/cet: Remove XEN_SHSTK's dependency on EXPERT
CET-SS hardware is now available from multiple vendors, the feature has
downstream users, and was declared security supported in XSA-398.
Enable it by default.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit fc90d75c2b71ae15b75128e7d0d4dbe718164ecb)
Bjoern Doebel [Thu, 10 Mar 2022 07:35:36 +0000 (07:35 +0000)]
xen/x86: Livepatch: support patching CET-enhanced functions
Xen enabled CET for supporting architectures. The control flow aspect of
CET requires functions that can be called indirectly (i.e., via function
pointers) to start with an ENDBR64 instruction; otherwise a control flow
exception is raised.
This expectation breaks livepatching flows because we patch functions by
overwriting their first 5 bytes with a JMP + <offset>, thus breaking the
ENDBR64. We fix this by checking whether a patched function starts with
an ENDBR64 instruction; if so, we move the livepatch JMP to start
behind it.
To avoid having to guess the ENDBR64 offset again on patch reversal
(which might race with other mechanisms adding/removing ENDBR
dynamically), use the livepatch metadata to store the computed offset
along with the saved bytes of the overwritten function.
Signed-off-by: Bjoern Doebel <doebel@amazon.de> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Tested-by: Jiamei Xie <jiamei.xie@arm.com>
(cherry picked from commit 6974c75180f1aad44e5428eabf2396b2b50fb0e4)
Note: For backports to 4.14 thru 4.16, there is no endbr-clobbering, hence no
is_endbr64_poison() logic.
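A sketch of the ENDBR64 check described above (the byte pattern is the
architectural F3 0F 1E FA encoding; the helper names mirror, but are
not guaranteed to match, the in-tree ones):

    #include <stdint.h>
    #include <string.h>

    #define ENDBR64_LEN 4

    static int is_endbr64(const void *ptr)
    {
        static const uint8_t endbr64[ENDBR64_LEN] = { 0xf3, 0x0f, 0x1e, 0xfa };

        return memcmp(ptr, endbr64, sizeof(endbr64)) == 0;
    }

    /* During patch application (sketch): shift the 5-byte JMP past the
     * landing pad, and keep the shift in the livepatch metadata so
     * reversal does not have to guess it again. */
    static void set_patch_offset(const void *old_addr, uint8_t *patch_offset)
    {
        if (is_endbr64(old_addr))
            *patch_offset = ENDBR64_LEN;
    }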
Andrew Cooper [Tue, 15 Mar 2022 12:07:18 +0000 (12:07 +0000)]
x86/cet: Remove writeable mapping of the BSPs shadow stack
An unintended consequence of the BSP using cpu0_stack[] is that writeable
mappings to the BSP's shadow stacks are retained in the .bss. This
renders CET-SS almost useless, as an attacker can update both copies of
a return address (regular and shadow stack) and the ret will not fault.
We specifically don't want to shatter the superpage mapping .data and .bss, so
the only way to fix this is to not have the BSP stack in the main Xen image.
Break cpu_alloc_stack() out of cpu_smpboot_alloc(), and dynamically allocate
the BSP stack as early as reasonable in __start_xen(). As a consequence,
there is no need to delay the BSP's memguard_guard_stack() call.
Copy the top of the cpu info block just before switching to use the new stack.
Fix a latent bug by setting %rsp to info->guest_cpu_user_regs rather than
->es; this would be buggy if reinit_bsp_stack() called schedule() (which
rewrites the GPR block) directly, but luckily it doesn't.
Finally, move cpu0_stack[] into .init, so it can be reclaimed after boot.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 37786b23b027ab83051175cb8ce9ac86cacfc58e)
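A sketch of the allocation split described above (the allocator and
guard helpers are stand-ins for the Xen ones, and the real code derives
the memflags from the CPU's node):

    void *alloc_xenheap_pages(unsigned int order, unsigned int memflags);
    void memguard_guard_stack(void *stack);

    /* Common to the BSP and APs: the stack (including its shadow-stack
     * portion) lives on the heap, so no writeable alias stays in .bss. */
    void *cpu_alloc_stack(unsigned int order, unsigned int memflags)
    {
        void *stack = alloc_xenheap_pages(order, memflags);

        if (stack)
            memguard_guard_stack(stack);
        return stack;
    }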