Jan Beulich [Fri, 8 Apr 2022 12:50:29 +0000 (14:50 +0200)]
x86/P2M: p2m.c is HVM-only
This only requires moving p2m_percpu_rwlock elsewhere (ultimately I
think all P2M locking should go away as well when !HVM, but this looks
to require further code juggling). The two other unguarded functions are
already unneeded (by virtue of DCE) when !HVM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:48:45 +0000 (14:48 +0200)]
paged_pages field is MEM_PAGING-only
Conditionalize it and its uses accordingly.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:47:56 +0000 (14:47 +0200)]
shr_pages field is MEM_SHARING-only
Conditionalize it and its uses accordingly. The main goal though is to
demonstrate that x86's p2m_teardown() is now empty when !HVM, which in
particular means the last remaining use of p2m_lock() in this case goes
away.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:47:11 +0000 (14:47 +0200)]
x86/p2m: re-arrange {,__}put_gfn()
All explicit callers of __put_gfn() are in HVM-only code and hold a valid
P2M pointer in their hands. Move the paging_mode_translate() check out of
there into put_gfn(), renaming __put_gfn() and making its GFN parameter
type-safe.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:46:30 +0000 (14:46 +0200)]
x86/P2M: derive HVM-only variant from __get_gfn_type_access()
Introduce an inline wrapper dealing with the non-translated-domain case,
while stripping that logic from the main function, which gets renamed to
p2m_get_gfn_type_access(). HVM-only callers can then directly use the
main function.
Along with renaming the main function also make its and the new inline
helper's GFN parameters type-safe.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
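A minimal sketch of the wrapper/main-function split described above, using toy stand-in types. Apart from the general idea of a type-safe GFN, every name here (lookup_gfn(), p2m_lookup(), the struct domain layout) is hypothetical and does not reflect the actual Xen declarations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical type-safe GFN: a struct wrapper, so a plain integer
 * cannot be passed where a GFN is expected. */
typedef struct { uint64_t g; } gfn_t;
static inline gfn_t mk_gfn(uint64_t g) { return (gfn_t){ g }; }
static inline uint64_t gfn_val(gfn_t gfn) { return gfn.g; }

/* Hypothetical, heavily simplified domain. */
struct domain {
    bool translated;   /* stands in for paging_mode_translate() */
    uint64_t p2m[8];   /* toy GFN -> MFN table */
};

/* "Main" function: only valid for translated (HVM-like) domains, so it
 * carries no special-casing for the non-translated case. */
static uint64_t p2m_lookup(const struct domain *d, gfn_t gfn)
{
    assert(d->translated);
    return d->p2m[gfn_val(gfn) % 8];
}

/* Inline wrapper: handles the non-translated case (GFN == MFN) and
 * defers to the main function otherwise. */
static inline uint64_t lookup_gfn(const struct domain *d, gfn_t gfn)
{
    if ( !d->translated )
        return gfn_val(gfn);
    return p2m_lookup(d, gfn);
}
```

Code that already knows it deals with a translated domain can call p2m_lookup() directly and skip the check in the wrapper.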
Jan Beulich [Fri, 8 Apr 2022 12:45:37 +0000 (14:45 +0200)]
x86/P2M: p2m_get_page_from_gfn() is HVM-only
This function is the wrong layer to go through for PV guests. It happens
to work, but produces results which aren't fully consistent with
get_page_from_gfn(). The latter function, however, cannot be used in
map_domain_gfn() as it may not be the host P2M we mean to act on.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:44:05 +0000 (14:44 +0200)]
x86/P2M: split out init/teardown functions
Mostly just code movement, and certainly no functional change intended.
In p2m_final_teardown() the calls to p2m_teardown_{alt,nested}p2m() need
to be guarded by an is_hvm_domain() check now, though. This matches
p2m_init(). And p2m_is_logdirty_range() also gets moved inside the (so
far) adjacent #ifdef.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:41:51 +0000 (14:41 +0200)]
x86/P2M: PoD, altp2m, and nested-p2m are HVM-only
There's no need to initialize respective data for PV domains. Note that
p2m_teardown_{alt,nested}p2m() will handle the lack-of-initialization
case fine.
As a result, despite PV domains having a host P2M associated with them
and hence using XENMEM_get_pod_target on such may not be a real problem,
calling p2m_pod_set_mem_target() for a PV domain is surely wrong, even
if benign at present. Add a guard there as well.
In p2m_pod_demand_populate() the situation is a little different: This
function is reachable only for HVM domains anyway, but following from
other PoD functions only ever acting on the host P2M (and hence PoD
entries only ever existing in host P2Ms), assert and bail from there for
non-host-P2Ms.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 8 Apr 2022 12:39:43 +0000 (14:39 +0200)]
x86/mm: split set_identity_p2m_entry() into PV and HVM parts
..., moving the former into the new physmap.c. Also call the new
functions directly from arch_iommu_hwdom_init() and
vpci_make_msix_hole(), as the PV/HVM split is explicit there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Track whether symbols belong to ignored sections in order to avoid
applying relocations referencing those symbols. The address of such
symbols won't be resolved and thus the relocation will likely fail or
write garbage to the destination.
Return an error in that case, as leaving unresolved relocations would
lead to malfunctioning payload code.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Bjoern Doebel <doebel@amazon.de> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
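The check described above could look roughly like the following sketch. The struct layout, the function name, and the choice of -EINVAL as the error value are all assumptions made for illustration; the real livepatch code is organized differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified symbol record. */
struct sym {
    const char *name;
    bool ignored;         /* symbol belongs to a section that is not loaded */
    unsigned long value;  /* resolved address; meaningless when ignored */
};

/* Refuse a relocation whose target symbol lives in an ignored section:
 * its address was never resolved, so applying the relocation would write
 * garbage to the destination. */
static int check_reloc_target(const struct sym *syms, size_t nr, size_t idx)
{
    assert(syms);

    if ( idx >= nr )
        return -EINVAL;   /* symbol index out of range */

    if ( syms[idx].ignored )
        return -EINVAL;   /* unresolved symbol: fail instead of patching */

    return 0;
}
```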
A side effect of ignoring such sections is that symbols belonging to
them won't be resolved, and that could make relocations belonging to
other sections that reference those symbols fail.
For example it's likely to have an empty .altinstr_replacement with
symbols pointing to it, and marking the section as ignored will
prevent the symbols from being resolved, which in turn will cause any
relocations against them to fail.
In order to solve this do not ignore sections with 0 size, only ignore
sections that don't have the SHF_ALLOC flag set.
Special case such empty sections in move_payload so they are not taken
into account in order to decide whether a livepatch can be safely
re-applied after a revert.
Fixes: 98b728a7b2 ('livepatch: Disallow applying after an revert') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Bjoern Doebel <doebel@amazon.de> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
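To make the change in the ignore rule concrete, here is a small sketch contrasting the old and new predicates. SHF_ALLOC is the standard ELF flag value; struct sec and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SHF_ALLOC 0x2   /* ELF section flag: section occupies memory */

/* Hypothetical, simplified section header. */
struct sec {
    uint64_t flags;
    uint64_t size;
};

/* Old rule (as described above): empty sections were ignored too. */
static bool sec_ignored_old(const struct sec *s)
{
    assert(s);
    return !(s->flags & SHF_ALLOC) || s->size == 0;
}

/* New rule: only sections without SHF_ALLOC are ignored, so symbols in
 * an empty but allocated section (e.g. .altinstr_replacement) can still
 * be resolved. */
static bool sec_ignored_new(const struct sec *s)
{
    assert(s);
    return !(s->flags & SHF_ALLOC);
}
```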
Anthony PERARD [Thu, 7 Apr 2022 15:58:44 +0000 (17:58 +0200)]
build: shuffle main Makefile
Reorganize the Makefile a bit ahead of patch
"build: adding out-of-tree support to the xen build"
We are going to want to calculate all the $(*srctree) and $(*objtree)
once, when we can calculate them. This can happen within the
"$(root-make-done)" guard, in an out-of-tree build scenario, so move
those variables there.
$(XEN_ROOT) is going to depend on the value of $(abs_srctree) so
needs to move as well. "Kbuild.include" also depends on $(srctree).
Next, "Config.mk" depends on $(XEN_ROOT) and $(TARGET_*ARCH) depends
on "Config.mk" so those need to move as well.
This should only be code movement without functional changes.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Thu, 7 Apr 2022 15:57:44 +0000 (17:57 +0200)]
build: specify source tree in include/ for prerequisite
When doing an out-of-tree build, and thus setting VPATH,
GNU Make 3.81 on Ubuntu Trusty complains about Circular dependency of
include/Makefile and include/xlat.lst and drops them. The build fails
later due to malformed headers.
This might be due to bug #13529
"Incorrect circular dependancy"
https://savannah.gnu.org/bugs/?13529
which was fixed in 3.82.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Thu, 7 Apr 2022 15:56:53 +0000 (17:56 +0200)]
build: rework "headers*.chk" prerequisite in include/
Listing public headers when out-of-tree builds are involved becomes
more annoying, as every path to every header needs to start with
"$(srctree)/$(src)", or $(wildcard ) will not work. This means more
repetition. ( "$(srcdir)" is a shortcut for "$(srctree)/$(src)" )
This patch attempts to reduce the amount of duplication and make better
use of make's meta programming capability. The filters are now listed
in a variable and don't have to repeat the path to the header files
as this is added later as needed.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Thu, 7 Apr 2022 15:56:00 +0000 (17:56 +0200)]
build: replace $(BASEDIR) and use $(srctree)
$(srctree) is a better description for the source directory than
$(BASEDIR) that has been used for both source and build directory
(which were the same).
This adds $(srctree) to a few paths where make's VPATH=$(srctree) won't
apply, and replaces $(BASEDIR) with $(srctree).
Introduce "$(srcdir)" as a shortcut for "$(srctree)/$(src)" as the
latter is used often enough.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> # XSM
Andrew Cooper [Wed, 6 Apr 2022 21:40:20 +0000 (22:40 +0100)]
x86/cpuid: Clobber CPUID leaves 0x800000{1d..20} in policies
c/s 1a914256dca5 increased the AMD max leaf from 0x8000001c to 0x80000021, but
did not adjust anything in the calculate_*_policy() chain. As a result, on
hardware supporting these leaves, we read the real hardware values into the
raw policy, then copy into host, and all the way into the PV/HVM default
policies.
All 4 of these leaves have enable bits (first two by TopoExt, next by SEV,
next by PQOS), so any software following the rules is fine and will leave them
alone. However, leaf 0x8000001d takes a subleaf input and at least two
userspace utilities have been observed to loop indefinitely under Xen (clearly
waiting for eax to report "no more cache levels").
Such userspace is buggy, but Xen's behaviour isn't great either.
In the short term, clobber all information in these leaves. This is a giant
bodge, but there are complexities with implementing all of these leaves
properly.
Fixes: 1a914256dca5 ("x86/cpuid: support LFENCE always serialising CPUID bit") Link: https://github.com/QubesOS/qubes-issues/issues/7392 Reported-by: fosslinux <fosslinux@aussies.space> Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
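The short-term fix amounts to zeroing a range of extended leaves in a policy object. The sketch below shows that idea on a toy policy; struct policy, EXTD_BASE, EXTD_MAX and clobber_extd_range() are hypothetical names, and the subleaf dimension of leaf 0x8000001d is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct cpuid_leaf { uint32_t a, b, c, d; };

#define EXTD_BASE 0x80000000u
#define EXTD_MAX  0x80000021u   /* max extended leaf named in the text */

/* Hypothetical toy policy: one entry per extended leaf, no subleaves. */
struct policy {
    struct cpuid_leaf extd[EXTD_MAX - EXTD_BASE + 1];
};

/* Zero all registers of the extended leaves first..last (inclusive). */
static void clobber_extd_range(struct policy *p, uint32_t first, uint32_t last)
{
    assert(p);
    assert(first >= EXTD_BASE && last <= EXTD_MAX && first <= last);

    memset(&p->extd[first - EXTD_BASE], 0,
           (last - first + 1) * sizeof(p->extd[0]));
}
```

With every register of leaf 0x8000001d reading as zero, a consumer that loops over subleaves until eax reports "no more cache levels" terminates on the first iteration.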
Jan Beulich [Thu, 7 Apr 2022 10:31:16 +0000 (12:31 +0200)]
VT-d: avoid infinite recursion on domain_context_mapping_one() error path
Despite the comment there infinite recursion was still possible, by
flip-flopping between two domains. This is because prev_dom is derived
from the DID found in the context entry, which was already updated by
the time error recovery is invoked. Simply introduce yet another mode
flag to prevent rolling back an in-progress roll-back of a prior
mapping attempt.
Also drop the existing recursion prevention for having been dead anyway:
Earlier in the function we already bail when prev_dom == domain.
Fixes: 8f41e481b485 ("VT-d: re-assign devices directly") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
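The recursion guard can be pictured with the following toy model. It is only a sketch: struct ctx, map_one() and the MAP_ERROR_RECOVERY flag name are hypothetical, and the failure is simulated with a boolean rather than real hardware state.

```c
#include <assert.h>
#include <stdbool.h>

#define MAP_ERROR_RECOVERY 0x1u   /* hypothetical mode flag */

/* Hypothetical toy state. */
struct ctx {
    int did;              /* owner recorded in the (toy) context entry */
    bool fail;            /* simulate a failure after the entry is written */
    unsigned int calls;   /* number of mapping attempts made */
};

static int map_one(struct ctx *c, int domain, unsigned int mode)
{
    int prev;

    assert(c);
    prev = c->did;
    c->calls++;
    c->did = domain;      /* the entry is updated before errors show up */

    if ( !c->fail )
        return 0;

    /* Roll back to the previous owner, but never roll back a roll-back:
     * without this guard two failing domains could flip-flop forever. */
    if ( !(mode & MAP_ERROR_RECOVERY) )
        map_one(c, prev, mode | MAP_ERROR_RECOVERY);

    return -1;
}
```

Even when every attempt fails, the flag bounds the nesting at one recovery call per original mapping attempt.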
Jan Beulich [Thu, 7 Apr 2022 10:30:19 +0000 (12:30 +0200)]
VT-d: avoid NULL deref on domain_context_mapping_one() error paths
First there's a printk() which actually wrongly uses pdev in the first
place: We want to log the coordinates of the (perhaps fake) device
acted upon, which may not be pdev.
Then it was quite pointless for eb19326a328d ("VT-d: prepare for per-
device quarantine page tables (part I)") to add a domid_t parameter to
domain_context_unmap_one(): It's only used to pass back here via
me_wifi_quirk() -> map_me_phantom_function(). Drop the parameter again.
Finally there's the invocation of domain_context_mapping_one(), which
needs to be passed the correct domain ID. Avoid taking that path when
pdev is NULL and the quarantine state is what would need restoring to.
This means we can't security-support non-PCI-Express devices with RMRRs
(if such exist in practice) any longer; note that as of the 1st of the
two commits referenced below assigning them to DomU-s is unsupported
anyway.
Fixes: 8f41e481b485 ("VT-d: re-assign devices directly") Fixes: 14dd241aad8a ("IOMMU/x86: use per-device page tables for quarantining")
Coverity ID: 1503784 Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 7 Apr 2022 10:29:03 +0000 (12:29 +0200)]
VT-d: don't needlessly look up DID
If get_iommu_domid() in domain_context_unmap_one() fails, we better
wouldn't clear the context entry in the first place, as we're then unable
to issue the corresponding flush. However, we have no need to look up the
DID in the first place: What needs flushing is very specifically the DID
that was in the context entry before our clearing of it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
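As a sketch of the idea, the DID to flush can be read out of the entry right before the entry is cleared, so no separate lookup is needed. The entry layout below (DID in the low 16 bits, a present bit at bit 16) is a made-up toy format, and flush_did() merely records its argument.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy context entry, NOT the real VT-d layout. */
struct context_entry { uint32_t val; };
#define CTX_PRESENT  (1u << 16)
#define CTX_DID_MASK 0xffffu

static unsigned int flushed_did = ~0u;   /* last DID passed to flush_did() */
static void flush_did(unsigned int did) { flushed_did = did; }

/* Returns 1 if an entry was cleared and flushed, 0 if it was not present. */
static int unmap_one(struct context_entry *e)
{
    unsigned int did;

    assert(e);
    if ( !(e->val & CTX_PRESENT) )
        return 0;

    did = e->val & CTX_DID_MASK;   /* take the DID from the old entry ... */
    e->val = 0;                    /* ... then clear the entry ... */
    flush_did(did);                /* ... and flush exactly that DID */

    return 1;
}
```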
platform/cpufreq: add public defines for CPUFREQ_SHARED_TYPE_
The values set in the shared_type field of xen_processor_performance
have so far relied on Xen and Linux having the same
CPUFREQ_SHARED_TYPE_ defines, as those have never been part of the
public interface.
Formalize by adding the defines for the allowed values in the public
header, while renaming them to use the XEN_CPUPERF_SHARED_TYPE_ prefix
for clarity.
Set the Xen internal defines for CPUFREQ_SHARED_TYPE_ using the newly
introduced XEN_CPUPERF_SHARED_TYPE_ public defines in order to avoid
unnecessary code churn. While there also drop
CPUFREQ_SHARED_TYPE_NONE as it's unused.
Fixes: 2fa7bee0a0 ('Get ACPI Px from dom0 and choose Px controller') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
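The arrangement described above can be sketched as follows. Only the two prefixes come from the commit text; the _HW/_ALL/_ANY suffixes and the numeric values are assumptions here, and the two helper functions are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Public-interface defines (suffixes and values assumed for this sketch). */
#define XEN_CPUPERF_SHARED_TYPE_HW   1
#define XEN_CPUPERF_SHARED_TYPE_ALL  2
#define XEN_CPUPERF_SHARED_TYPE_ANY  3

/* Internal names defined in terms of the public ones, so existing
 * internal users need no changes. */
#define CPUFREQ_SHARED_TYPE_HW   XEN_CPUPERF_SHARED_TYPE_HW
#define CPUFREQ_SHARED_TYPE_ALL  XEN_CPUPERF_SHARED_TYPE_ALL
#define CPUFREQ_SHARED_TYPE_ANY  XEN_CPUPERF_SHARED_TYPE_ANY

/* Hypothetical helper: is this one of the allowed shared_type values? */
static bool shared_type_valid(unsigned int type)
{
    switch ( type )
    {
    case CPUFREQ_SHARED_TYPE_HW:
    case CPUFREQ_SHARED_TYPE_ALL:
    case CPUFREQ_SHARED_TYPE_ANY:
        return true;
    }
    return false;
}

/* Hypothetical helper: short name for logging. */
static const char *shared_type_name(unsigned int type)
{
    assert(shared_type_valid(type));
    return type == CPUFREQ_SHARED_TYPE_HW ? "hw" :
           type == CPUFREQ_SHARED_TYPE_ALL ? "all" : "any";
}
```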
Jan Beulich [Thu, 7 Apr 2022 06:34:58 +0000 (08:34 +0200)]
x86/boot: fold branches in video handling code
Using Jcc to branch around a JMP is necessary only in pre-386 code,
where Jcc is limited to disp8. Use the opposite Jcc directly in two
places. Since it's adjacent, also convert an ORB to TESTB.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 7 Apr 2022 06:34:07 +0000 (08:34 +0200)]
x86/boot: simplify mode_table
There's no point in writing 80x25 text mode information via multiple
insns all storing immediate values. The data can simply be included
first thing in the vga_modes table, allowing the already present
REP MOVSB to take care of everything in one go.
While touching this also correct a related but stale comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 7 Apr 2022 06:33:09 +0000 (08:33 +0200)]
x86/EFI: retrieve EDID
When booting directly from EFI, obtaining this information from EFI is
the only possible way. And even when booting with a boot loader
interposed, it's more clean not to use legacy BIOS calls for this
purpose. (The downside being that there are no "capabilities" that we
can retrieve the EFI way.)
To achieve this we need to propagate the handle used to obtain the
EFI_GRAPHICS_OUTPUT_PROTOCOL instance for further obtaining an
EFI_EDID_*_PROTOCOL instance, which has been part of the spec since 2.5.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> # Arm, common Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Bertrand Marquis <bertrand.marquis@arm.com> #arm
Jan Beulich [Thu, 7 Apr 2022 06:29:33 +0000 (08:29 +0200)]
x86/P2M: introduce p2m_{add,remove}_page()
Rename guest_physmap_add_entry() to p2m_add_page(); make
guest_physmap_remove_page() a trivial wrapper around p2m_remove_page().
This way callers can use suitable pairs of functions (previously
violated by hvm/grant_table.c).
In HVM-specific code further avoid going through the guest_physmap_*()
layer, and instead use the two new/renamed functions directly.
Ultimately the goal is to have guest_physmap_...() functions cover all
types of guests, but p2m_...() dealing only with translated ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Tue, 5 Apr 2022 12:24:18 +0000 (14:24 +0200)]
IOMMU/x86: use per-device page tables for quarantining
Devices with RMRRs / unity mapped regions, due to it being unspecified
how/when these memory regions may be accessed, may not be left
disconnected from the mappings of these regions (as long as it's not
certain that the device has been fully quiesced). Hence even the page
tables used when quarantining such devices need to have mappings of
those regions. This implies installing page tables in the first place
even when not in scratch-page quarantining mode.
This is CVE-2022-26361 / part of XSA-400.
While for the purpose here it would be sufficient to have devices with
RMRRs / unity mapped regions use per-device page tables, extend this to
all devices (in scratch-page quarantining mode). This allows the leaf
pages to be mapped r/w, thus covering also memory writes (rather than
just reads) issued by non-quiescent devices.
Set up quarantine page tables as late as possible, yet early enough to
not encounter failure during de-assign. This means setup generally
happens in assign_device(), while (for now) the one in deassign_device()
is there mainly to be on the safe side.
As to the removal of QUARANTINE_SKIP() from domain_context_unmap_one():
I think this was never really needed there, as the function explicitly
deals with finding a non-present context entry. Leaving it there would
require propagating pgd_maddr into the function (like was done by "VT-d:
prepare for per-device quarantine page tables" for
domain_context_mapping_one()).
In VT-d's DID allocation function don't require the IOMMU lock to be
held anymore: All involved code paths hold pcidevs_lock, so this way we
avoid the need to acquire the IOMMU lock around the new call to
context_set_domain_id().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 5 Apr 2022 12:19:42 +0000 (14:19 +0200)]
IOMMU/x86: drop TLB flushes from quarantine_init() hooks
The page tables just created aren't hooked up yet anywhere, so there's
nothing that could be present in any TLB, and hence nothing to flush.
Dropping this flush is, at least on the VT-d side, a prereq to per-
device domain ID use when quarantining devices, as dom_io isn't going
to be assigned a DID anymore: The warning in get_iommu_did() would
trigger.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 5 Apr 2022 12:19:10 +0000 (14:19 +0200)]
IOMMU/x86: maintain a per-device pseudo domain ID
In order to subsequently enable per-device quarantine page tables, we'll
need domain-ID-like identifiers to be inserted in the respective device
(AMD) or context (Intel) table entries alongside the per-device page
table root addresses.
Make use of "real" domain IDs occupying only half of the value range
coverable by domid_t.
Note that in VT-d's iommu_alloc() I didn't want to introduce new memory
leaks in case of error, but existing ones don't get plugged - that'll be
the subject of a later change.
The VT-d changes are slightly asymmetric, but this way we can avoid
assigning pseudo domain IDs to devices which would never be mapped while
still avoiding the need to add a new parameter to domain_context_unmap().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
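One way to picture pseudo domain IDs living alongside "real" ones is a small allocator that hands out values from the upper half of the 16-bit range. This is only a sketch under that assumption: the base value, the pool size, the failure marker and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t domid_t;

#define PSEUDO_DOMID_BASE 0x8000u   /* assumed start of the pseudo range */
#define NR_PSEUDO_DOMIDS  64u       /* toy pool size */
#define PSEUDO_DOMID_NONE 0xffffu   /* hypothetical "allocation failed" */

static uint64_t pseudo_map;         /* one bit per pseudo ID in use */

/* Allocate the lowest free pseudo ID, or PSEUDO_DOMID_NONE if exhausted. */
static domid_t pseudo_domid_alloc(void)
{
    for ( unsigned int i = 0; i < NR_PSEUDO_DOMIDS; i++ )
        if ( !(pseudo_map & (1ULL << i)) )
        {
            pseudo_map |= 1ULL << i;
            return (domid_t)(PSEUDO_DOMID_BASE + i);
        }

    return PSEUDO_DOMID_NONE;
}

static void pseudo_domid_free(domid_t id)
{
    assert(id >= PSEUDO_DOMID_BASE &&
           id < PSEUDO_DOMID_BASE + NR_PSEUDO_DOMIDS);
    pseudo_map &= ~(1ULL << (id - PSEUDO_DOMID_BASE));
}

/* Real domain IDs stay below the base, so the two kinds never collide. */
static bool is_pseudo_domid(domid_t id)
{
    return id >= PSEUDO_DOMID_BASE;
}
```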
Jan Beulich [Tue, 5 Apr 2022 12:18:48 +0000 (14:18 +0200)]
VT-d: prepare for per-device quarantine page tables (part II)
Replace the passing of struct domain * by domid_t in preparation of
per-device quarantine page tables also requiring per-device pseudo
domain IDs, which aren't going to be associated with any struct domain
instances.
No functional change intended (except for slightly adjusted log message
text).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 5 Apr 2022 12:18:26 +0000 (14:18 +0200)]
VT-d: prepare for per-device quarantine page tables (part I)
Arrange for domain ID and page table root to be passed around, the latter in
particular to domain_pgd_maddr() such that taking it from the per-domain
fields can be overridden.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 5 Apr 2022 12:18:04 +0000 (14:18 +0200)]
AMD/IOMMU: re-assign devices directly
Devices with unity map ranges, due to it being unspecified how/when
these memory ranges may get accessed, may not be left disconnected from
their unity mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than tearing down the old root page
table pointer and then establishing the new one, re-assignment needs to
be done in a single step.
This is CVE-2022-26360 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any unity map
ranges. The main difference is when it comes to updating DTEs, which need
to be atomic when there are unity mappings. Yet atomicity can only be
achieved with CMPXCHG16B, availability of which we can't take for granted.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 5 Apr 2022 12:17:42 +0000 (14:17 +0200)]
VT-d: re-assign devices directly
Devices with RMRRs, due to it being unspecified how/when the specified
memory regions may get accessed, may not be left disconnected from their
respective mappings (as long as it's not certain that the device has
been fully quiesced). Hence rather than unmapping the old context and
then mapping the new one, re-assignment needs to be done in a single
step.
This is CVE-2022-26359 / part of XSA-400.
Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Similarly quarantining scratch-page mode relies on page tables to be
continuously wired up.
To avoid complicating things more than necessary, treat all devices
mostly equally, i.e. regardless of their association with any RMRRs. The
main difference is when it comes to updating context entries, which need
to be atomic when there are RMRRs. Yet atomicity can only be achieved
with CMPXCHG16B, availability of which we can't take for granted.
The seemingly complicated choice of non-negative return values for
domain_context_mapping_one() is to limit code churn: This way callers
passing NULL for pdev don't need fiddling with.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 5 Apr 2022 12:17:21 +0000 (14:17 +0200)]
VT-d: drop ownership checking from domain_context_mapping_one()
Despite putting in quite a bit of effort it was not possible to
establish why exactly this code exists (beyond possibly sanity
checking). Instead of a subsequent change further complicating this
logic, simply get rid of it.
Take the opportunity and move the respective unmap_vtd_domain_page() out
of the locked region.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
This is to make more obvious that nothing outside of domain_iommu(d)
actually changes or is otherwise needed by the function.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 5 Apr 2022 12:16:10 +0000 (14:16 +0200)]
VT-d: fix add/remove ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully cleared.
Also switch to %pd in related log messages.
Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR") Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
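The ordering rule can be illustrated with a toy model that records the sequence of steps. All function names are hypothetical, the steps only append to a trace string, and undoing the RMRR mapping when the context update fails is an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

static char trace[128];   /* records the order of operations */

static void step(const char *s)
{
    assert(strlen(trace) + strlen(s) < sizeof(trace));
    strcat(trace, s);
}

static int map_rmrr(void)          { step("map_rmrr;"); return 0; }
static void unmap_rmrr(void)       { step("unmap_rmrr;"); }
static int set_context(int fail)   { step("set_ctx;"); return fail ? -1 : 0; }
static int clear_context(int fail) { step("clear_ctx;"); return fail ? -1 : 0; }

/* Add: establish RMRR mappings first, then point the device at them. */
static int add_device(int fail_ctx)
{
    int rc = map_rmrr();

    if ( rc )
        return rc;

    rc = set_context(fail_ctx);
    if ( rc )
        unmap_rmrr();     /* assumed error handling: undo the mapping */

    return rc;
}

/* Remove: clear the context entry first; tear the mappings down only
 * once that has succeeded, as the device may otherwise still use them. */
static int remove_device(int fail_ctx)
{
    int rc = clear_context(fail_ctx);

    if ( rc )
        return rc;

    unmap_rmrr();
    return 0;
}
```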
Jan Beulich [Tue, 5 Apr 2022 12:15:33 +0000 (14:15 +0200)]
VT-d: fix (de)assign ordering when RMRRs are in use
In the event that the RMRR mappings are essential for device operation,
they should be established before updating the device's context entry,
while they should be torn down only after the device's context entry was
successfully updated.
Also adjust a related log message.
This is CVE-2022-26358 / part of XSA-400.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 5 Apr 2022 12:12:27 +0000 (14:12 +0200)]
VT-d: correct ordering of operations in cleanup_domid_map()
The function may be called without any locks held (leaving aside the
domctl one, which we surely don't want to depend on here), so needs to
play safe wrt other accesses to domid_map[] and domid_bitmap[]. This is
to avoid context_set_domain_id()'s writing of domid_map[] to be reset to
zero right away in the case of it racing the freeing of a DID.
For the interaction with context_set_domain_id() and did_to_domain_id()
see the code comment.
{check_,}cleanup_domid_map() are called with pcidevs_lock held or during
domain cleanup only (and pcidevs_lock is also held around
context_set_domain_id()), i.e. racing calls with the same (dom, iommu)
tuple cannot occur.
domain_iommu_domid(), besides its use by cleanup_domid_map(), has its
result used only to control flushing, and hence a stale result would
only lead to a stray extra flush.
This is CVE-2022-26357 / XSA-399.
Fixes: b9c20c78789f ("VT-d: per-iommu domain-id") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 23 Feb 2022 08:40:40 +0000 (09:40 +0100)]
x86/hap: do not switch on log dirty for VRAM tracking
XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable
when using HAP mode, and it can interact badly with other ongoing
paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl
lock.
This was detected as a result of the following assert triggering when
doing repeated migrations of a HAP HVM domain with a stubdom:
Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198
----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]----
CPU: 34
RIP: e008:[<ffff82d040314b3b>] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6
RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23)
[...]
Xen call trace:
[<ffff82d040314b3b>] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a
[<ffff82d040279f96>] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67
[<ffff82d04031577f>] F paging_domctl+0x251/0xd41
[<ffff82d04031640c>] F paging_domctl_continuation+0x19d/0x202
[<ffff82d0403202fa>] F pv_hypercall+0x150/0x2a7
[<ffff82d0403a729d>] F lstar_enter+0x12d/0x140
Such assert triggered because the stubdom used
XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing
XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty became enabled while
retiring the old structures, thus leading to new entries being
populated in already clear slots.
Fix this by not enabling log dirty for VRAM tracking, similar to what
is done when using shadow instead of HAP. Call
p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to
get some hardware assistance if available. As a side effect the memory
pressure on the p2m pool should go down if only VRAM tracking is
enabled, as the dirty bitmap is no longer allocated.
Note that paging_log_dirty_range (used to get the dirty bitmap for
VRAM tracking) doesn't use the log dirty bitmap, and instead relies on
checking whether each gfn on the range has been switched from
p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages.
This is CVE-2022-26356 / XSA-397.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
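The type-based dirty detection mentioned in the note above can be sketched as a scan over per-GFN types. This is a toy model: the two-value enum is a subset chosen for illustration, collect_dirty() is a hypothetical name, and re-arming each dirty page back to p2m_ram_logdirty is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy subset of P2M types. */
typedef enum { p2m_ram_rw, p2m_ram_logdirty } p2m_type_t;

/* Scan a range of per-GFN types. A page that has been switched from
 * p2m_ram_logdirty to p2m_ram_rw was written to, so report it as dirty
 * and (assumed here) re-arm it for the next round. Returns the number
 * of dirty pages found. */
static unsigned int collect_dirty(p2m_type_t *types, size_t nr,
                                  uint8_t *bitmap)
{
    unsigned int dirty = 0;

    assert(types && bitmap);

    for ( size_t i = 0; i < nr; i++ )
        if ( types[i] == p2m_ram_rw )
        {
            bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
            types[i] = p2m_ram_logdirty;
            dirty++;
        }

    return dirty;
}
```

No global log-dirty bitmap is consulted here; the per-page type alone carries the information.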
Jan Beulich [Tue, 5 Apr 2022 09:40:58 +0000 (11:40 +0200)]
x86/time: use fake read_tsc()
Go a step further than bed9ae54df44 ("x86/time: switch platform timer
hooks to altcall") did and eliminate the "real" read_tsc() altogether:
It's not used except in pointer comparisons, and hence it looks overall
more safe to simply poison plt_tsc's read_counter hook.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 5 Apr 2022 09:38:04 +0000 (11:38 +0200)]
x86/APIC: make connections between seemingly arbitrary numbers
Making adjustments to arbitrarily chosen values shouldn't require
auditing the code for possible derived numbers - such a change should
be doable in a single place, having an effect on all code depending on
that choice.
For one make the TDCR write actually use APIC_DIVISOR. With the
necessary mask constant introduced, also use that in vLAPIC code. While
introducing the constant, drop APIC_TDR_DIV_TMBASE: The bit has been
undefined in halfway recent SDM and PM versions.
And then introduce a constant tying together the scale used when
converting nanoseconds to bus clocks.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
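To show what deriving the TDCR value from a single divisor constant might look like, the sketch below encodes a power-of-two divisor into the divide configuration bits (bits 0, 1 and 3) and decodes it again. APIC_DIVISOR is named in the commit text, but the value 16 is picked arbitrarily here; TDCR_DIV_MASK, tdcr_encode() and tdcr_decode() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define APIC_DIVISOR  16u    /* arbitrary value for this sketch */
#define TDCR_DIV_MASK 0xbu   /* hypothetical name: bits 0, 1 and 3 */

/* Encode a power-of-two divisor (1..128) into the TDCR divide bits.
 * Divide-by-2 is encoded as 0, divide-by-128 as 6 and divide-by-1 as 7,
 * with the third bit of that 3-bit value stored in register bit 3. */
static uint32_t tdcr_encode(unsigned int divisor)
{
    unsigned int log2 = 0, v;

    assert(divisor >= 1 && divisor <= 128 && !(divisor & (divisor - 1)));

    while ( (1u << log2) < divisor )
        log2++;

    v = (log2 + 7) & 7;
    return (v & 3) | ((v & 4) << 1);
}

/* Inverse of tdcr_encode(); bits outside TDCR_DIV_MASK are ignored. */
static unsigned int tdcr_decode(uint32_t tdcr)
{
    unsigned int v;

    tdcr &= TDCR_DIV_MASK;
    v = (tdcr & 3) | ((tdcr & 8) >> 1);
    return 1u << ((v + 1) & 7);
}
```

With helpers of this kind, changing the divisor in one place changes both the programmed register value and any arithmetic that depends on it.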
Jan Beulich [Tue, 5 Apr 2022 09:36:32 +0000 (11:36 +0200)]
x86/APIC: calibrate against platform timer when possible
Use the original calibration against PIT only when the platform timer
is PIT. This implicitly excludes the "xen_guest" case from using the PIT
logic (init_pit() fails there, and as of 5e73b2594c54 ["x86/time: minor
adjustments to init_pit()"] using_pit also isn't being set too early
anymore), so the respective hack there can be dropped at the same time.
This also reduces calibration time from 100ms to 50ms, albeit this step
is being skipped as of 0731a56c7c72 ("x86/APIC: no need for timer
calibration when using TDT") anyway.
While re-indenting the PIT logic in calibrate_APIC_clock(), besides
adjusting style also switch around the 2nd TSC/TMCCT read pair, to match
the order of the 1st one, yielding more consistent deltas.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
tools/firmware: do not add a .note.gnu.property section
Prevent the assembler from creating a .note.gnu.property section on
the output objects, as it's not useful for firmware related binaries,
and breaks the resulting rombios image.
This requires modifying the cc-option Makefile macro so it can test
assembler options (by replacing the usage of the -S flag with -c) and
also stripping the -Wa, prefix if present when checking for the test
output.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools/firmware: fix setting of fcf-protection=none
Setting the fcf-protection=none option in EMBEDDED_EXTRA_CFLAGS in the
Makefile doesn't get it propagated to the subdirectories, so instead
set the flag in firmware/Rules.mk, like it's done for other compiler
flags.
Fixes: 3667f7f8f7 ('x86: Introduce support for CET-IBT') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jason Andryuk [Fri, 1 Apr 2022 14:33:10 +0000 (10:33 -0400)]
libxl: Re-scope qmp_proxy_spawn.ao usage
I've observed this failed assertion:
libxl_event.c:2057: libxl__ao_inprogress_gc: Assertion `ao' failed.
AFAICT, this is happening in qmp_proxy_spawn_outcome where
sdss->qmp_proxy_spawn.ao is NULL.
The out label of spawn_stub_launch_dm() calls qmp_proxy_spawn_outcome(),
but it is only in the success path that sdss->qmp_proxy_spawn.ao gets
set to the current ao.
qmp_proxy_spawn_outcome() should instead use sdss->dm.spawn.ao, which is
the already in-use ao when spawn_stub_launch_dm() is called. The same
is true for spawn_qmp_proxy().
With this, move sdss->qmp_proxy_spawn.ao initialization to
spawn_qmp_proxy() since its use is for libxl__spawn_spawn() and it can
be initialized along with the rest of sdss->qmp_proxy_spawn.
Fixes: 83c845033dc8 ("libxl: use vchan for QMP access with Linux stubdomain") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Move dcs->console_xswait initialization into the callers of
initiate_domain_create, do_domain_create() and do_domain_soft_reset(),
so it is initialized along with the other dcs state.
Fixes: c57e6ebd8c3e ("(lib)xl: soft reset support") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jason Andryuk [Wed, 30 Mar 2022 18:17:41 +0000 (14:17 -0400)]
xl: Fix global pci options
commit babde47a3fed "introduce a 'passthrough' configuration option to
xl.cfg..." moved the pci list parsing ahead of the global pci option
parsing. This broke the global pci configuration options since they
need to be set first so that looping over the pci devices assigns their
values.
Move the global pci options ahead of the pci list to restore their
function.
Fixes: babde47a3fed ("introduce a 'passthrough' configuration option to xl.cfg...") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Thu, 31 Mar 2022 08:45:46 +0000 (10:45 +0200)]
livepatch: account for patch offset when applying NOP patch
While not triggered by the trivial xen_nop in-tree patch on
staging/master, that patch exposes a problem on the stable trees, where
all functions have ENDBR inserted. When NOP-ing out a range, we need to
account for this. Handle this right in livepatch_insn_len().
This requires livepatch_insn_len() to be called _after_ ->patch_offset
was set.
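As a rough illustration of the length calculation described above (a sketch under stated assumptions, not the actual livepatch_insn_len()): assume a zero new address marks a NOP patch, and that the patch offset is the number of bytes skipped at the start of the function. The name insn_len_sketch and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PATCH_INSN_SIZE 5  /* length of a JMP rel32 on x86 */

/*
 * Hypothetical sketch: number of bytes to overwrite at the patch site.
 * Assumes new_addr == 0 means "NOP out new_size bytes" and that
 * patch_offset bytes (e.g. an ENDBR64) are skipped at the function start.
 */
static size_t insn_len_sketch(uintptr_t new_addr, size_t new_size,
                              unsigned int patch_offset)
{
    if (new_addr == 0) {
        /* The NOP range starts after the skipped bytes, so it is shorter. */
        assert(patch_offset <= new_size);
        return new_size - patch_offset;
    }

    /* A regular patch always writes one JMP instruction. */
    return PATCH_INSN_SIZE;
}
```

Because the result depends on the offset, the offset has to be known before the length is computed, which matches the ordering requirement stated above.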
Fixes: 6974c75180f1 ("xen/x86: Livepatch: support patching CET-enhanced functions") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 29 Mar 2022 13:48:15 +0000 (15:48 +0200)]
build: generic top-level rule to build individual files
In particular when cross-compiling or having in place other tool chain
overrides, invoking make to build individual files (e.g. object,
preprocessed, or assembly ones) so far involves putting the various
overrides on the command line instead of simply getting them from
./.config.
Furthermore this helps working around a yet unaddressed make quirk [1]:
Variables put on the command line are invisible to $(shell ...), unless
invoked from a recursive make: During the recursive invocation such
variables are put in the recursive make's environment and hence become
"visible".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
[1] https://savannah.gnu.org/bugs/?10593
xen/arm: set CPSR Z bit when creating aarch32 guests
The first 32 bytes of zImage are NOPs. When CONFIG_EFI is enabled in the
kernel, certain versions of Linux will use an UNPREDICTABLE NOP
encoding, sometimes resulting in an unbootable kernel. Whether the
resulting kernel is bootable or not depends on the processor. See commit a92882a4d270 in the Linux kernel for all the details.
All kernel releases starting from Linux 4.9 without commit a92882a4d270
are affected.
Fortunately there is a simple workaround: setting the "Z" bit in CPSR
makes it so those invalid NOP instructions are never executed. That is
because the instruction is conditional (not equal). So, on QEMU at
least, the instruction will end up being ignored and not generate an
exception. Setting the "Z" bit makes those kernel versions bootable
again and it is harmless in the other cases.
Note that both U-Boot and QEMU -kernel set the "Z" bit in CPSR when
booting a zImage kernel on aarch32.
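The effect of the workaround can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the base value used in the usage note (SVC mode with A/I/F masked) is an assumption about a typical initial guest state. The Z flag being bit 30 of CPSR is architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PSR_Z (UINT32_C(1) << 30)  /* CPSR Zero flag */

/* Hypothetical helper: initial aarch32 guest CPSR with the Z bit set. */
static uint32_t guest32_initial_cpsr(uint32_t base)
{
    /* The caller is expected to pass a base value with flags clear. */
    assert((base & PSR_Z) == 0);
    return base | PSR_Z;
}

/* An instruction with the NE (not equal) condition executes only if Z == 0. */
static bool cond_ne_executes(uint32_t cpsr)
{
    return (cpsr & PSR_Z) == 0;
}
```

With a base of 0x1d3, the resulting CPSR is 0x400001d3, and cond_ne_executes() reports that NE-conditional instructions are skipped.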
Jan Beulich [Tue, 22 Mar 2022 12:10:59 +0000 (13:10 +0100)]
x86/build: work around older GNU ld not leaving .got.plt empty
The initial three entries in .got.plt are "static", i.e. present
independent of actual entries, the allocation of which is triggered by
respective relocations. When no real entries are needed, halfway recent
ld discards the "static" portion of the table as well, but older GNU ld
fails to do so.
Fixes: dedb0aa42c6d ("x86/build: use --orphan-handling linker option if available") Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Julien Grall <jgrall@amazon.com>
Andrew Cooper [Tue, 22 Mar 2022 12:07:24 +0000 (13:07 +0100)]
x86/hvm: Annotate hvm_physdev_op() with cf_check
This was missed previously, and would yield a fatal #CP for any HVM domain
which issues a physdevop hypercall.
Fixes: cdbe2b0a1aec ("x86: Enable CET Indirect Branch Tracking") Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Raphael Ning [Wed, 16 Mar 2022 18:38:41 +0000 (18:38 +0000)]
evtchn/fifo: Don't set PENDING bit if guest misbehaves
Currently, evtchn_fifo_set_pending() will mark the event as PENDING even
if it fails to lock the FIFO event queue(s), or if the guest has not
initialized the FIFO control block for the target vCPU. A well-behaved
guest should never trigger either of these cases.
There is no good reason to set the PENDING bit (the guest should not
depend on this behaviour anyway) or check for pollers in such corner
cases, so skip that. In fact, both the comment above the for loop and
the commit message for
suggest that the bit should be set after the FIFO queue(s) are locked.
Take the opportunity to rename the was_pending variable (flipping its
sense) and switch to the standard bool type.
Suggested-by: David Vrabel <dvrabel@amazon.co.uk> Signed-off-by: Raphael Ning <raphning@amazon.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: David Vrabel <dvrabel@amazon.co.uk> Tested-by: Luca Fancellu <luca.fancellu@arm.com>
xen/arm64: io: Handle the abort due to access to stage1 translation table
If the abort was caused by an access to the stage 1 translation table, Xen
will try to set the p2m entry (assuming that the Stage 1 translation
table is in a non MMIO region).
If there is no such entry found, then Xen will try to map the address as
a MMIO region (assuming that the Stage 1 translation table is in a
direct MMIO region).
If that fails as well, then there are the two following scenarios:-
1. Stage 1 translation table being in an emulated MMIO region - Xen
can read the region, but it has no way to return the value read to the
CPU page table walker (which tries to go through the stage1 tables to
resolve the translation fault).
2. Stage 1 translation table address is invalid.
In both the above scenarios, Xen will forward the abort to the guest.
xen/arm64: io: Emulate instructions (with invalid ISS) on MMIO region
When an instruction is trapped in Xen due to translation fault, Xen
checks if the ISS is invalid (for data abort) or it is an instruction
abort. If so, Xen tries to resolve the translation fault using p2m page
tables. In case of data abort, Xen will try to map the mmio region to
the guest (ie tries to emulate the mmio region).
If the ISS is not valid and it is a data abort, then Xen tries to
decode the instruction. In case of ioreq, Xen saves the decoding state,
rn and imm9 to vcpu_io. Whenever the vcpu handles the ioreq successfully,
it will read the decoding state to determine if the instruction decoded
was a ldr/str post indexing (ie INSTR_LDR_STR_POSTINDEXING). If so, it
uses these details to post increment rn.
In case of mmio handler, if the mmio operation was successful, then Xen
retrieves the decoding state, rn and imm9. For state ==
INSTR_LDR_STR_POSTINDEXING, Xen will update rn.
If there is an error encountered while decoding/executing the instruction,
Xen will forward the abort to the guest.
Also, the logic to infer the type of instruction has been moved from
try_handle_mmio() to try_decode_instruction() which is called before.
try_handle_mmio() is solely responsible for handling the mmio operation.
Bjoern Doebel [Thu, 10 Mar 2022 07:35:36 +0000 (07:35 +0000)]
xen/x86: Livepatch: support patching CET-enhanced functions
Xen enabled CET for supporting architectures. The control flow aspect of
CET requires functions that can be called indirectly (i.e., via function
pointers) to start with an ENDBR64 instruction. Otherwise a control flow
exception is raised.
This expectation breaks livepatching flows because we patch functions by
overwriting their first 5 bytes with a JMP + <offset>, thus breaking the
ENDBR64. We fix this by checking the start of a patched function for
being ENDBR64. In the positive case we move the livepatch JMP to start
behind the ENDBR64 instruction.
To avoid having to guess the ENDBR64 offset again on patch reversal
(which might race with other mechanisms adding/removing ENDBR
dynamically), use the livepatch metadata to store the computed offset
along with the saved bytes of the overwritten function.
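A minimal sketch of the approach, operating on a byte buffer rather than live code. The struct and function names are hypothetical; the ENDBR64 encoding (f3 0f 1e fa) and the JMP rel32 encoding (e9 followed by a little-endian displacement relative to the end of the instruction) are architectural.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define JMP_REL32_LEN 5

static const uint8_t ENDBR64[4] = { 0xf3, 0x0f, 0x1e, 0xfa };

/* Hypothetical per-function metadata kept for patch reversal. */
struct patch_meta {
    uint8_t saved[JMP_REL32_LEN];  /* bytes overwritten by the JMP */
    uint8_t offset;                /* 4 if the function starts with ENDBR64 */
};

/* Place "JMP new_addr" at the start of code, or right behind an ENDBR64. */
static void apply_jmp(uint8_t *code, uint64_t old_addr, uint64_t new_addr,
                      struct patch_meta *meta)
{
    meta->offset = memcmp(code, ENDBR64, sizeof(ENDBR64)) == 0
                   ? sizeof(ENDBR64) : 0;

    uint8_t *site = code + meta->offset;
    /* The displacement is relative to the end of the JMP instruction. */
    int64_t disp = (int64_t)new_addr -
                   (int64_t)(old_addr + meta->offset + JMP_REL32_LEN);
    assert(disp >= INT32_MIN && disp <= INT32_MAX);

    memcpy(meta->saved, site, JMP_REL32_LEN);
    site[0] = 0xe9;
    for (int i = 0; i < 4; i++)
        site[1 + i] = (uint8_t)((uint32_t)disp >> (8 * i));
}

/* Reversal uses the stored offset instead of re-inspecting the code. */
static void revert_jmp(uint8_t *code, const struct patch_meta *meta)
{
    memcpy(code + meta->offset, meta->saved, JMP_REL32_LEN);
}

/* Demonstration: patch and revert a made-up function body. */
static int demo_roundtrip(void)
{
    uint8_t code[12] = { 0xf3, 0x0f, 0x1e, 0xfa, 0x55, 0x48,
                         0x89, 0xe5, 0x90, 0x90, 0x5d, 0xc3 };
    uint8_t orig[sizeof(code)];
    struct patch_meta meta;

    memcpy(orig, code, sizeof(code));
    apply_jmp(code, 0x1000, 0x2000, &meta);

    int endbr_kept = memcmp(code, ENDBR64, sizeof(ENDBR64)) == 0;
    int jmp_placed = code[4] == 0xe9 && code[5] == 0xf7 && code[6] == 0x0f;

    revert_jmp(code, &meta);
    return meta.offset == 4 && endbr_kept && jmp_placed &&
           memcmp(code, orig, sizeof(code)) == 0;
}
```

Storing the offset in the metadata means reversal does not depend on what the first bytes of the function look like at that later point.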
Signed-off-by: Bjoern Doebel <doebel@amazon.de> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Tested-by: Jiamei Xie <jiamei.xie@arm.com>
Andrew Cooper [Tue, 15 Mar 2022 12:07:18 +0000 (12:07 +0000)]
x86/cet: Remove writeable mapping of the BSPs shadow stack
An unintended consequence of the BSP using cpu0_stack[] is that writeable
mappings to the BSPs shadow stacks are retained in the bss. This renders
CET-SS almost useless, as an attacker can update both return addresses and the
ret will not fault.
We specifically don't want to shatter the superpage mapping .data and .bss, so
the only way to fix this is to not have the BSP stack in the main Xen image.
Break cpu_alloc_stack() out of cpu_smpboot_alloc(), and dynamically allocate
the BSP stack as early as reasonable in __start_xen(). As a consequence,
there is no need to delay the BSP's memguard_guard_stack() call.
Copy the top of cpu info block just before switching to use the new stack.
Fix a latent bug by setting %rsp to info->guest_cpu_user_regs rather than
->es; this would be buggy if reinit_bsp_stack() called schedule() (which
rewrites the GPR block) directly, but luckily it doesn't.
Finally, move cpu0_stack[] into .init, so it can be reclaimed after boot.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 Mar 2022 13:47:25 +0000 (13:47 +0000)]
x86/cet: Use dedicated NOP4 for cf_clobber
For livepatching, we need to look at a potentially clobbered function and
determine whether it used to have an ENDBR64 instruction.
Use a non-default 4-byte P6 long nop, not emitted by toolchains, and extend
check-endbr.sh to look for it. The same logic can check for the absence of
any endbr32 instructions, so include a check for those too.
The choice of nop has some complicated consequences. nopw (%rax) has a ModRM
byte of 0, which the Bourne compatible shells unconditionally strip from
parameters, meaning that we can't pass it to `grep -aob`.
Therefore, use nopw (%rcx) so the ModRM byte becomes 1.
This then demonstrates another bug. Under perl regexes, \1 thru \9 are
subpattern matches, and not octal escapes, while the behaviour of \10 and
higher depend on the number of capture groups. Switch the `grep -P` runes to
use hex escapes instead, which are unambiguous.
The build time check then requires that the endbr64 poison have the same
treatment as endbr64 to avoid placing the byte pattern in immediate operands.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 14 Mar 2022 10:30:46 +0000 (10:30 +0000)]
x86/cet: Clear IST supervisor token busy bits on S3 resume
Stacks are not freed across S3. Execution just stops, leaving supervisor
token busy bits active. Fixing this for the primary shadow stack was done
previously, but there is a (rare) risk that an IST token is left busy too, if
the platform power-off happens to intersect with an NMI/#MC arriving. This
will manifest as #DF next time the IST vector gets used.
Introduce rdssp() and wrss() helpers in a new shstk.h, cleaning up
fixup_exception_return() and explaining the trick with the literal 1.
Then use this infrastructure to rewrite the IST tokens in load_system_tables()
when all the other IST details are being set up. In the case that an IST
token were left busy across S3, this will clear the busy bit before the stack
gets used.
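For illustration only, a model of what rewriting a token amounts to. It assumes the supervisor shadow stack token layout in which the token holds its own 8-byte aligned linear address and bit 0 is the busy bit. The helper names are hypothetical, and the real code writes to shadow stack memory with wrss rather than computing values like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SSTK_TOKEN_BUSY UINT64_C(1)  /* bit 0 of the token */

/* Hypothetical helper: the value of a non-busy token located at token_addr. */
static uint64_t sstk_fresh_token(uint64_t token_addr)
{
    /* Tokens are 8-byte aligned, so the low bits of the address are clear. */
    assert((token_addr & 7) == 0);
    return token_addr;
}

/* True if the token was left marked busy, e.g. across S3. */
static bool sstk_token_is_busy(uint64_t token)
{
    return (token & SSTK_TOKEN_BUSY) != 0;
}
```

Unconditionally writing the fresh value at setup time avoids having to detect whether a token was left busy.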
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 17 Mar 2022 16:42:13 +0000 (17:42 +0100)]
x86emul/test: correct VSCALEF{P,S}{S,D} entries in predicates test
I can't see why these would want / need to suppress testing of the
register forms of the insns. Quite likely a copy-and-paste oversight
when originally creating the table.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Thu, 10 Mar 2022 21:37:05 +0000 (21:37 +0000)]
MAINTAINERS: Propose Henry Wang as the new release manager
ARM has proposed Henry Wang as a release manager for 4.17. Signify
this by giving him maintainership over CHANGELOG.md.
Below is an introduction given by Bertrand Marquis:
Henry Wang is an open-source software engineer at Arm focusing on the
hypervisor and virtualization technology. Before joining the
AIS-Hypervisor team, he was one of the leading Arm contributors of the
Rust-VMM and the Cloud Hypervisor community. He is the Arm reviewer
of the Cloud Hypervisor project. His work includes basic project
enabling on Arm platform, Arm device emulation, advanced features
support on Arm and bug fixes.
After joining the AIS-Hypervisor team at Arm, he has been involved in Xen feature
development on Arm in various areas, including:
1. Xen Arm MPAM extension research and PoC: Ongoing, the design will
be shared on xen-devel soon.
2. Port of Xen to Arm MPU systems: Working together with Penny Zheng
on coding and testing, will be soon sent to xen-devel.
3. Static Xen heap on Arm: Work done but depends on the direct mapping
series from Penny Zheng, will be upstreamed in the next weeks.
4. Virtio PoC for Xen on Arm using kvmtool as the Xen virtio backend:
Work done, including the enabling of the virtio and the virtio
performance tuning.
5. Participated in code reviews and discussions in xen-devel,
including the foreign memory mapping series from EPAM, etc.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Henry Wang <Henry.Wang@arm.com> Acked-by: Julien Grall <jgrall@amazon.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
This adds support for serial console as found in a laptop with TGL-LP
(StarBook MkV). Since the device is on the bus 0, it needs to be enabled
via "com1=...,amt", not just "...,pci".
Device specification is in Intel docs 631119-007 and 631120-001.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/arm64: io: Handle the abort due to access to stage1 translation table
If the abort was caused by an access to the stage 1 translation table, Xen
will try to set the p2m entry (assuming that the Stage 1 translation
table is in the non MMIO region).
If there is no such entry found, then Xen will try to map the address as
a MMIO region (assuming that the Stage 1 translation table is in the
direct MMIO region).
If that fails as well, then there are the two following scenarios:-
1. Stage 1 translation table being in an emulated MMIO region - Xen
can read the region, but it has no way to return the value read to the
CPU page table walker (which tries to go through the stage1 tables to
resolve the translation fault).
2. Stage 1 translation table address is invalid.
In both the above scenarios, Xen will forward the abort to the guest.
xen/arm64: io: Support instructions (for which ISS is not valid) on emulated MMIO region using MMIO/ioreq handler
When an instruction is trapped in Xen due to translation fault, Xen
checks if the ISS is invalid (for data abort) or it is an instruction
abort. If so, Xen tries to resolve the translation fault using p2m page
tables. In case of data abort, Xen will try to map the mmio region to
the guest (ie tries to emulate the mmio region).
If the ISS is not valid and it is a data abort, then Xen tries to
decode the instruction. In case of ioreq, Xen saves the decoding state,
rn and imm9 to vcpu_io. Whenever the vcpu handles the ioreq successfully,
it will read the decoding state to determine if the instruction decoded
was a ldr/str post indexing (ie INSTR_LDR_STR_POSTINDEXING). If so, it
uses these details to post increment rn.
In case of mmio handler, if the mmio operation was successful, then Xen
retrieves the decoding state, rn and imm9. For state ==
INSTR_LDR_STR_POSTINDEXING, Xen will update rn.
If there is an error encountered while decoding/executing the instruction,
Xen will forward the abort to the guest.
Also, the logic to infer the type of instruction has been moved from
try_handle_mmio() to try_decode_instruction() which is called before.
try_handle_mmio() is solely responsible for handling the mmio operation.
xen/arm64: Decode ldr/str post increment operations
At the moment, Xen does not decode any of the arm64 instructions. This
means that when hsr_dabt.isv == 0, Xen cannot handle those instructions.
This leads Xen to abort the guests (from which those instructions
originate).
With this patch, Xen is able to decode ldr/str post indexing instructions.
These are a subset of instructions for which hsr_dabt.isv == 0.
Jan Beulich [Mon, 14 Mar 2022 09:33:35 +0000 (10:33 +0100)]
x86/build: use --orphan-handling linker option if available
As was e.g. making necessary 4b7fd8153ddf ("x86: fold sections in final
binaries"), arbitrary sections appearing without our linker script
placing them explicitly can be a problem. Have the linker make us aware
of such sections, so we would know that the script needs adjusting.
To deal with the resulting warnings:
- Retain .note.* explicitly for ELF, and discard all of them (except the
earlier consumed .note.gnu.build-id) for PE/COFF.
- Have explicit statements for .got, .plt, and alike and add assertions
that they're empty. No output sections will be created for these as
long as they remain empty (or else the assertions would cause early
failure anyway).
- Collect all .rela.* into a single section, with again an assertion
added for the resulting section to be empty.
- Extend the enumerating of .debug_* to ELF. Note that for Clang adding
of .debug_macinfo is necessary. Amend this by its Dwarf5 counterpart,
.debug_macro, then as well (albeit more may need adding for full
coverage).
- For LLVM ld also enumerate .symtab, .strtab, and .shstrtab.
Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 14 Mar 2022 09:32:40 +0000 (10:32 +0100)]
IOMMU/x86: tidy adjust_irq_affinities hook
As of 3e56754b0887 ("xen/cet: Fix __initconst_cf_clobber") there's no
need for a non-void return value anymore, as the hook functions are no
longer themselves passed to __initcall(). For the same reason the
iommu_enabled checks can now move from the individual functions to the
wrapper.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monné [Mon, 14 Mar 2022 09:30:02 +0000 (10:30 +0100)]
pci/ats: do not allow broken devices to be assigned to guests
Introduce a new field to mark devices as broken: having it set prevents
the device from being assigned to guests. Use the field in order to mark
ATS devices that have failed a flush when using VT-d as broken, thus
preventing them from being assigned to any guest.
This allows the device IOMMU context entry to be cleaned up properly, as
calling _pci_hide_device will just change the ownership of the device,
but the IOMMU context entry of the device would be left as-is. It would
also leak a VT-d Domain ID if using one, as removing the device from
its previous owner will allow releasing the IOMMU DID used by the device
without having cleaned up the context entry.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monné [Mon, 14 Mar 2022 09:29:24 +0000 (10:29 +0100)]
x86/vmx: remove dead code to create domains without a vLAPIC
After the removal of PVHv1 it's no longer supported to create a domain
using hardware virtualization extensions and without a local APIC:
PVHv2 mandates domains to always have a LAPIC. Remove some stale code
in VMCS construction and related helpers that catered for that
use-case.
No functional change.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Mon, 14 Mar 2022 09:27:57 +0000 (10:27 +0100)]
x86/time: further improve TSC / CPU freq calibration accuracy
Calibration logic assumes that the platform timer (HPET or ACPI PM
timer) and the TSC are read at about the same time. This assumption may
not hold when a long latency event (e.g. SMI or NMI) occurs between the
two reads. Reduce the risk of reading uncorrelated values by doing at
least four pairs of reads, using the tuple where the delta between the
enclosing TSC reads was smallest. From the fourth iteration onwards bail
if the new TSC delta isn't better (smaller) than the best earlier one.
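The selection logic can be sketched as follows. This is an illustration under assumptions, not the code from the patch: the scripted counter type stands in for RDTSC and the platform timer read, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter source, scripted for demonstration purposes. */
struct counter {
    const uint64_t *values;
    size_t count;
    size_t next;
};

static uint64_t counter_read(struct counter *c)
{
    assert(c->next < c->count);
    return c->values[c->next++];
}

struct reading {
    uint64_t tsc;        /* TSC value of the best pair */
    uint64_t plt;        /* platform timer value of the best pair */
    uint64_t tsc_delta;  /* distance between the enclosing TSC reads */
};

/* Keep the pair whose enclosing TSC reads were closest together. */
static struct reading read_correlated(struct counter *tsc, struct counter *plt)
{
    struct reading best = { 0, 0, UINT64_MAX };
    uint64_t tsc_prev = counter_read(tsc);

    for (unsigned int i = 1; ; i++) {
        uint64_t plt_now = counter_read(plt);
        uint64_t tsc_now = counter_read(tsc);
        uint64_t delta = tsc_now - tsc_prev;

        if (delta < best.tsc_delta) {
            best.tsc = tsc_now;
            best.plt = plt_now;
            best.tsc_delta = delta;
        } else if (i >= 4) {
            break;  /* fourth pair or later and no improvement: stop */
        }

        tsc_prev = tsc_now;
    }

    return best;
}

/* Demonstration: the third pair is disturbed by a long-latency event. */
static struct reading demo_reading(void)
{
    static const uint64_t tsc_vals[] = { 1000, 1050, 1080, 1200, 1215, 1260 };
    static const uint64_t plt_vals[] = { 10, 20, 30, 40, 50 };
    struct counter tsc = { tsc_vals, 6, 0 };
    struct counter plt = { plt_vals, 5, 0 };

    return read_correlated(&tsc, &plt);
}
```

In the demonstration the fourth pair has the smallest delta (15), so it is kept, and the loop stops at the fifth pair because that one is no better.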
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 9 Mar 2022 12:28:46 +0000 (13:28 +0100)]
livepatch: set -f{function,data}-sections compiler option
If livepatching support is enabled build the hypervisor with
-f{function,data}-sections compiler options, which is required by the
livepatching tools to detect changes and create livepatches.
This shouldn't result in any functional change on the hypervisor
binary image, but does however require some changes in the linker
script in order to handle that each function and data item will now be
placed into its own section in object files. As a result add catch-all
for .text, .data and .bss in order to merge each individual item
section into the final image.
The main difference will be that .text.startup will end up being part
of .text rather than .init, and thus won't be freed. .text.exit will
also be part of .text rather than dropped. Overall this could make the
image bigger, and package some .text code in a sub-optimal way.
On Arm the .data.read_mostly needs to be moved ahead of the .data
section like it's already done on x86, so the .data.* catch-all
doesn't also include .data.read_mostly. The alignment of
.data.read_mostly also needs to be set to PAGE_SIZE so it doesn't end
up being placed at the tail of a read-only page from the previous
section. While there move the alignment of the .data section ahead of
the section declaration, like it's done for other sections.
The benefit of having CONFIG_LIVEPATCH enable those compiler options
is that the livepatch build tools no longer need to fiddle with the
build system in order to enable them. Note the current livepatch tools
are broken after the recent build changes due to the way they
attempt to set -f{function,data}-sections.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Wed, 9 Mar 2022 12:28:45 +0000 (13:28 +0100)]
xen/build: put image header into a separate section
So it can be explicitly placed ahead of the rest of the .text content
in the linker script (and thus the resulting image). This is a
prerequisite for further work that will add a catch-all to the text
section (.text.*).
Note that placement of the sections inside of .text is also slightly
adjusted to be more similar to the position found in the default GNU
ld linker script.
The special handling of the object file containing the header data as
the first object file passed to the linker command line can also be
removed.
While there also remove the special handling of efi/ on x86. There's
no need for the resulting object file to be passed in any special
order to the linker.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Andrew Cooper [Mon, 7 Mar 2022 20:19:18 +0000 (20:19 +0000)]
x86/kexec: Fix kexec-reboot with CET active
The kexec_reloc() asm has an indirect jump to relocate onto the identity
trampoline. While we clear CET in machine_crash_shutdown(), we fail to clear
CET for the non-crash path. This in turn highlights that the same is true of
resetting the CPUID masking/faulting.
Move both pieces of logic from machine_crash_shutdown() to machine_kexec(),
the latter being common for all kexec transitions. Adjust the condition for
CET being considered active to check in CR4, which is simpler and more robust.
Fixes: 311434bfc9d1 ("x86/setup: Rework MSR_S_CET handling for CET-IBT") Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks") Fixes: 5ab9564c6fa1 ("x86/cpu: Context switch cpuid masks and faulting state in context_switch()") Reported-by: David Vrabel <dvrabel@amazon.co.uk> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: David Vrabel <dvrabel@amazon.co.uk>
Bjoern Doebel [Wed, 9 Mar 2022 15:22:03 +0000 (16:22 +0100)]
livepatch: resolve old address before function verification
When verifying that a livepatch can be applied, we may also want to
inspect the target function to be patched. To do so, we need to resolve
this function's address before running the arch-specific
livepatch_verify hook.
Signed-off-by: Bjoern Doebel <doebel@amazon.de> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Roger Pau Monné [Wed, 9 Mar 2022 15:21:01 +0000 (16:21 +0100)]
vpci/msix: fix PBA accesses
Map the PBA in order to access it from the MSI-X read and write
handlers. Note that previously the handlers would pass the physical
host address into the {read,write}{l,q} handlers, which is wrong as
those expect a linear address.
Map the PBA using ioremap when the first access is performed. Note
that 32bit arches might want to abstract the call to ioremap into a
vPCI arch handler, so they can use a fixmap range to map the PBA.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Alex Olson <Alex.Olson@starlab.io>
Andrew Cooper [Mon, 7 Mar 2022 16:35:52 +0000 (16:35 +0000)]
x86/spec-ctrl: Cease using thunk=lfence on AMD
AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer
considered safe. AMD are recommending using retpoline everywhere.
Retpoline is incompatible with CET. All CET-capable hardware has efficient
IBRS (specifically, not something retrofitted in microcode), so use IBRS (and
STIBP for consistency's sake).
This is a logical change on AMD, but not on Intel as the default calculations
would end up with these settings anyway. Leave behind a message if IBRS is
found to be missing.
Also update the default heuristics to never select THUNK_LFENCE. This causes
AMD CPUs to change their default to retpoline.
Also update the printed message to include the AMD MSR_SPEC_CTRL settings, and
STIBP now that we set it for consistency's sake.
This is part of XSA-398 / CVE-2021-26401.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Bertrand Marquis [Thu, 17 Feb 2022 14:52:54 +0000 (14:52 +0000)]
xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3
Allow guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is
supported and create a fastpath in the code to handle guest requests to
do the workaround.
The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for
flushing the branch history. So we want the handling to be as fast as
possible.
As the mitigation is applied on every guest exit, we can check for the
call before saving all context and return very early.