Andrew Cooper [Mon, 4 Oct 2021 18:11:45 +0000 (19:11 +0100)]
x86/pv: Split pv_hypercall() in two
The is_pv_32bit_vcpu() conditionals hide four lfences, with two taken on any
individual path through the function. There is very little code common
between compat and native, and context-dependent conditionals predict very
badly for a period of time after context switch.
Move do_entry_int82() from pv/traps.c into pv/hypercall.c, allowing
_pv_hypercall() to be static and forced inline. The delta is:
add/remove: 0/0 grow/shrink: 1/1 up/down: 300/-282 (18)
Function old new delta
do_entry_int82 50 350 +300
pv_hypercall 579 297 -282
which is tiny, but the perf implications are large:
These are percentage improvements in raw TSC deltas for a xen_version
hypercall, with obvious outliers excluded. Therefore, it is an idealised best
case improvement.
The pv64 path uses `syscall`, while the pv32 path uses `int $0x82` so
necessarily has higher overhead. Therefore, dropping the lfences is less of
an overall improvement.
I don't know why the Naples pv32 improvement is so small, but I've double
checked the numbers and they're consistent. There's presumably something
we're doing which is a large overhead in the pipeline.
On the Intel side, both systems are writing to MSR_SPEC_CTRL on
entry/exit (SKX using the retrofitted microcode implementation, CFL-R using
the hardware implementation), while SKX is suffering further from XPTI for
Meltdown protection.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
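The split described above can be sketched in plain C. This is only an illustration of the technique (a forced-inline common body with a constant flag at each call site), not the actual Xen code: hypercall_common(), entry_native(), entry_compat(), do_native() and do_compat() are hypothetical names, and the always_inline attribute is a GCC/Clang extension.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the native and compat dispatch paths. */
static long do_native(unsigned long nr) { return (long)nr + 1000; }
static long do_compat(unsigned int nr)  { return (long)nr + 2000; }

/*
 * Common body. 'compat' is a compile-time constant at each call site and
 * the function is forced inline, so the compiler can drop the dead arm and
 * no run-time conditional needs to remain in either entry point.
 */
static inline __attribute__((always_inline))
long hypercall_common(unsigned long nr, bool compat)
{
    if ( compat )
        return do_compat((unsigned int)nr);

    return do_native(nr);
}

/* One entry point per guest type, each with its own copy of the body. */
long entry_native(unsigned long nr) { return hypercall_common(nr, false); }
long entry_compat(unsigned long nr) { return hypercall_common(nr, true); }
```

The cost is the duplicated body seen in the size delta; the gain is that neither path carries a conditional which depends on the guest type.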
Jan Beulich [Tue, 12 Oct 2021 09:57:08 +0000 (11:57 +0200)]
VT-d: Tylersburg isoch DMAR unit with no TLB space
BIOSes, when enabling the dedicated DMAR unit for the sound device,
need to also set a non-zero number of TLB entries in a respective
system management register (VTISOCHCTRL). At least one BIOS is known
to fail to do so, causing the VT-d engine to deadlock when used.
Vaguely based on Linux'es e0fc7e0b4b5e ("intel-iommu: Yet another BIOS
workaround: Isoch DMAR unit with no TLB space").
To limit message string redundancy, fold parts with the IGD quirk logic.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Tue, 12 Oct 2021 09:56:21 +0000 (11:56 +0200)]
VT-d: generalize and correct "iommu=no-igfx" handling
Linux'es supposedly equivalent "intel_iommu=igfx_off" deals with any
graphics devices (not just Intel ones) while at the same time limiting
the effect to IOMMUs covering only graphics devices. Keying the decision
to leave translation disabled for an IOMMU to merely a magic SBDF tuple
was wrong in the first place - systems may very well have non-graphics
devices at 0000:00:02.0 (ordinary root ports commonly live there, for
example). Any use of igd_drhd_address (and hence is_igd_drhd()) needs
further qualification.
Introduce a new "graphics only" field in struct acpi_drhd_unit and set
it according to device scope parsing outcome. Replace the bad use of
is_igd_drhd() in iommu_enable_translation() by use of this new field.
While adding the new field also convert the adjacent include_all one to
"bool".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
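As a rough illustration of deciding "graphics only" from device scope parsing, the sketch below marks a unit as graphics-only when every device in its scope has the PCI display base class (0x03). The function name and the flat array of base classes are assumptions made for the example; the real parsing code works on ACPI device scope entries.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_BASE_CLASS_DISPLAY 0x03

/*
 * Hypothetical helper: 'base_class' holds the PCI base class of each device
 * found in one unit's device scope. An empty scope is not treated as
 * graphics-only.
 */
bool scope_is_graphics_only(const uint8_t *base_class, size_t n)
{
    size_t i;

    if ( n == 0 )
        return false;

    for ( i = 0; i < n; i++ )
        if ( base_class[i] != PCI_BASE_CLASS_DISPLAY )
            return false;

    return true;
}
```

Keying on what the scope actually contains, rather than on a fixed SBDF, is what keeps a root port at 0000:00:02.0 from being mistaken for a graphics device.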
Jan Beulich [Tue, 12 Oct 2021 09:55:42 +0000 (11:55 +0200)]
x86/PV32: fix physdev_op_compat handling
The conversion of the original code failed to recognize that the 32-bit
compat variant of this (sorry, two different meanings of "compat" here)
needs to continue to invoke the compat handler, not the native one.
Arrange for this by adding yet another #define.
Affected functions (having existed prior to the introduction of the new
hypercall) are PHYSDEVOP_set_iobitmap and PHYSDEVOP_apic_{read,write}.
For all others the operand struct layout doesn't differ.
Fixes: 1252e2823117 ("x86/pv: Export pv_hypercall_table[] rather than working around it in several ways") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 12 Oct 2021 09:54:34 +0000 (11:54 +0200)]
AMD/IOMMU: consider hidden devices when flushing device I/O TLBs
Hidden devices are associated with DomXEN but usable by the
hardware domain. Hence they need flushing as well when all devices are
to have flushes invoked.
While there drop a redundant ATS-enabled check and constify the first
parameter of the involved function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Anthony PERARD [Tue, 12 Oct 2021 09:53:47 +0000 (11:53 +0200)]
build: avoid building arch/arm/*/head.o twice
head.o is being built twice, once because it is in $(ALL_OBJS) and a
second time because it is in $(extra-y) and thus it is rebuilt when
building "arch/arm/built_in.o".
Fix this by adding a dependency of "head.o" on the directory
"arch/arm/".
Also, we should avoid building objects that are in subdirectories, so
move the declaration in there. This doesn't change anything as
"arch/arm/built_in.o" depends on "arch/arm/$subarch/built_in.o" which
depends on $(extra-y), so we still need to depend on
"arch/arm/built_in.o".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
Anthony PERARD [Tue, 12 Oct 2021 09:48:46 +0000 (11:48 +0200)]
x86/mm: avoid building multiple .o from a single .c file
This replaces the use of a single .c file for multiple .o files by
creating multiple .c files which include the first one.
There are quite a few issues with trying to build more than one object
file from a single source file: there is a duplication of the make
rules to generate those targets; there is an additional ".file" symbol
added in order to differentiate between the object files; and
tools/symbols has a heuristic to try to pick up the right ".file".
This patch adds new .c source files, which avoids the need to add a
second ".file" symbol and thus avoids the need to deal with those
issues.
Also remove __OBJECT_FILE__ from $(CC) command line as it isn't used
anywhere anymore. And remove the macro "build-intermediate" since the
generic rules for single targets can be used.
And rename the objects in mm/hap/ to remove the extra "level".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxl: Only map legacy PCI IRQs if they are supported
Arm's PCI passthrough implementation doesn't support legacy interrupts,
only MSI/MSI-X. This can be the case for other platforms too.
For that reason introduce a new CONFIG_PCI_SUPP_LEGACY_IRQ and add
it to the CFLAGS and compile the relevant code in the toolstack only if
applicable.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
[stefano: minor change to Makefile] Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Luca Fancellu [Mon, 11 Oct 2021 07:56:38 +0000 (08:56 +0100)]
arm/efi: Fix null pointer dereference
Fix for commit 60649d443dc395243e74d2b3e05594ac0c43cfe3
that introduces a null pointer dereference when the
fdt_node_offset_by_compatible is called with "fdt"
argument null.
tools/console: use xenforeignmemory to map console ring
This patch replaces the usage of xc_map_foreign_range with
xenforeignmemory_map from the stable xenforeignmemory library. Note
there are still other uses of libxc functions which prevent removing
the dependency.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
docs: add references to Argo Linux driver sources and information
Add a section to the Argo design document to supply guidance on how to
enable Argo in Xen and where to obtain source code and documentation
for Argo device drivers for guest OSes, primarily from OpenXT.
Signed-off-by: Christopher Clark <christopher.w.clark@gmail.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jan Beulich [Mon, 11 Oct 2021 08:58:44 +0000 (10:58 +0200)]
x86/HVM: fix xsm_op for 32-bit guests
Like for PV, 32-bit guests need to invoke the compat handler, not the
native one.
Fixes: db984809d61b ("hvm: wire up domctl and xsm hypercalls") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 11 Oct 2021 08:58:17 +0000 (10:58 +0200)]
x86/build: suppress EFI-related tool chain checks upon local $(MAKE) recursion
The xen-syms and xen.efi linking steps are serialized only when the
intermediate note.o file is necessary. Otherwise both may run in
parallel. This in turn means that the compiler / linker invocations to
create efi/check.o / efi/check.efi may also happen twice in parallel.
Obviously it's a bad idea to have multiple producers of the same output
race with one another - every once in a while one may e.g. observe
objdump: efi/check.efi: file format not recognized
We don't need this EFI related checking to occur when producing the
intermediate symbol and relocation table objects, and we have an easy
way of suppressing it: Simply pass in "efi-y=", overriding the
assignments done in the Makefile and thus forcing the tool chain checks
to be bypassed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
translate_noncontig() allocates a domheap page for the translated list
before calling allocate_optee_shm_buf(), which can fail for a number
of reasons. After such a failure we need to free the allocated page(s).
Another leak is possible if the same translate_noncontig() function
fails to get a domain page. In this case it should free the allocated
optee_shm_buf prior to exit. This will also free the allocated domheap page.
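The shape of such a fix is the usual unwind-on-error pattern. The sketch below is a generic illustration using malloc()/free() and a counter of live allocations; translate(), tracked_alloc() and tracked_free() are hypothetical names, and the sketch releases everything on success as well so that the counter can be checked.

```c
#include <stdbool.h>
#include <stdlib.h>

int live_allocs;   /* outstanding allocations, for illustration only */

static void *tracked_alloc(size_t sz)
{
    void *p = malloc(sz);

    if ( p )
        live_allocs++;
    return p;
}

static void tracked_free(void *p)
{
    if ( p )
    {
        free(p);
        live_allocs--;
    }
}

/* Two-step setup: the second step fails when 'fail_second' is true. */
int translate(bool fail_second)
{
    void *list, *buf;
    int rc = -1;

    list = tracked_alloc(64);
    if ( !list )
        return rc;

    buf = fail_second ? NULL : tracked_alloc(64);
    if ( !buf )
        goto err_free_list;      /* do not leak 'list' on this path */

    /* ... use list and buf ... */
    rc = 0;
    tracked_free(buf);

 err_free_list:
    tracked_free(list);
    return rc;
}
```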
While adding a PCI device mark it as such, so other frameworks
can distinguish it from DT devices.
For that introduce an architecture defined helper which may perform
additional initialization of the newly created PCI device.
Add new device type (DEV_PCI) to distinguish PCI devices from platform
DT devices, so some drivers, like IOMMU, can handle PCI devices
differently.
Also add a helper which, when given a struct device, returns the
corresponding struct pci_dev which this device is a part of.
Because of the header cross-dependencies, e.g. we need both
struct pci_dev and struct arch_pci_dev at the same time, this cannot be
done with an inline.
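Such a helper is typically built on container_of(). The sketch below assumes a layout where struct device is embedded in struct arch_pci_dev, which in turn is embedded in struct pci_dev; the field names, the DEV_DT enumerator and the function name dev_to_pci() are assumptions for illustration, and everything lives in one file here, so the header cross-dependency issue does not arise.

```c
#include <stddef.h>

enum device_type { DEV_DT, DEV_PCI };

struct device { enum device_type type; };
struct arch_pci_dev { struct device dev; };
struct pci_dev { unsigned int devfn; struct arch_pci_dev arch; };

/* Recover a pointer to the enclosing structure from a member pointer. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Return the PCI device this struct device is part of, or NULL. */
struct pci_dev *dev_to_pci(struct device *dev)
{
    struct arch_pci_dev *arch_dev;

    if ( dev->type != DEV_PCI )
        return NULL;

    arch_dev = container_of(dev, struct arch_pci_dev, dev);
    return container_of(arch_dev, struct pci_dev, arch);
}
```

Checking the type first is what lets a driver such as the IOMMU one treat PCI and DT devices differently from the same struct device pointer.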
Andrew Cooper [Mon, 4 Oct 2021 20:39:03 +0000 (21:39 +0100)]
x86/spec-ctrl: Build with BRANCH_HARDEN lfences by default
Branch Harden is enabled by default at compile and boot time. Invert the
logic to compile with lfence by default and nop out in the non-default case.
This has several advantages. It removes 3829 patch points (in the random
build of Xen I have to hand) by default on boot, 70% (!) of the
.altinstr_replacement section. For builds of Xen with a non-nops capable tool
chain, the code after `spec-ctrl=no-branch-harden` is better because Xen can
write long nops.
Most importantly however, it means the disassembly actually matches what runs
in the common case, with the ability to distinguish the lfences from other
uses of nops.
Finally, make opt_branch_harden local to spec_ctrl.c and __initdata. It has
never been used externally, even at its introduction in c/s 3860d5534df4
"spec: add l1tf-barrier".
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
This is a follow-up of
"b6fe410 xen/arm: Add handling of extended regions for Dom0"
Add various in-code comments, update Xen hypervisor device tree
bindings text, change the log level for some prints and clarify
format specifier, reuse dt_for_each_range() to avoid open-coding
in find_memory_holes().
Rahul Singh [Wed, 6 Oct 2021 17:40:33 +0000 (18:40 +0100)]
xen/domctl: Introduce XEN_DOMCTL_CDF_vpci flag
Introduce XEN_DOMCTL_CDF_vpci flag to enable VPCI support in XEN.
Reject the use of this new flag for x86 as VPCI is not supported for
DOMU guests for x86.
Signed-off-by: Rahul Singh <rahul.singh@arm.com>
[stefano: drop _XEN_DOMCTL_CDF_vpci] Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Rahul Singh [Wed, 6 Oct 2021 17:40:30 +0000 (18:40 +0100)]
xen/arm: PCI host bridge discovery within XEN on ARM
XEN during boot will read the PCI device tree node "reg" property
and will map the PCI config space to the XEN memory.
As of now only "pci-host-ecam-generic" compatible boards are supported.
"linux,pci-domain" device tree property assigns a fixed PCI domain
number to a host bridge, otherwise an unstable (across boots) unique
number will be assigned by Linux. XEN accesses the PCI devices based on
Segment:Bus:Device:Function. A Segment number in XEN is the same as a
domain number in Linux. Segment number and domain number have to be in
sync to access the correct PCI devices.
XEN will read the "linux,pci-domain" property from the device tree node
and configure the host bridge segment number accordingly. If this
property is not available XEN will allocate a unique segment number
for the host bridge.
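The segment selection just described can be sketched as follows. struct bridge_node and bridge_segment() are hypothetical names, and the sketch does not guard against a device tree that mixes nodes with and without the property, which could produce colliding numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical summary of one host bridge device tree node. */
struct bridge_node {
    bool has_domain_prop;   /* "linux,pci-domain" present? */
    uint32_t domain;        /* its value, if present */
};

static int next_free_segment;   /* next number to hand out */

/* Use the property when present, otherwise allocate a fresh number. */
int bridge_segment(const struct bridge_node *n)
{
    if ( n->has_domain_prop )
        return (int)n->domain;

    return next_free_segment++;
}
```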
Rahul Singh [Wed, 6 Oct 2021 17:40:28 +0000 (18:40 +0100)]
xen/arm: Add PHYSDEVOP_pci_device_(*add/remove) support for ARM
The hardware domain is in charge of doing the PCI enumeration. It will
discover the PCI devices and then communicate them to XEN via the
hypercall PHYSDEVOP_pci_device_add(..) to add the PCI devices in XEN.
Also implement PHYSDEVOP_pci_device_remove(..) to remove the PCI device.
As most of the code for PHYSDEVOP_pci_device_* is the same between x86
and ARM, move the code to a common file to avoid duplication.
There are other PHYSDEVOP_pci_device_* operations to add PCI devices.
Only PHYSDEVOP_pci_device_remove(..) and PHYSDEVOP_pci_device_add(..)
are implemented for now, as those are the minimum required to
support PCI passthrough on ARM.
Anthony PERARD [Thu, 7 Oct 2021 15:57:10 +0000 (17:57 +0200)]
build/riscv: tell the build system about riscv64/head.S
This allows running `make arch/riscv/riscv64/head.o`.
Example of rune on a fresh copy of the repository:
make XEN_TARGET_ARCH=riscv64 CROSS_COMPILE=riscv64-linux-gnu- KBUILD_DEFCONFIG=tiny64_defconfig arch/riscv/riscv64/head.o
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Bob Eshleman <bobbyeshleman@gmail.com> Reviewed-by: Alistair Francis <alistair.francis@wdc.com> Acked-by: Connor Davis <connojdavis@gmail.com>
Jan Beulich [Fri, 1 Oct 2021 13:05:42 +0000 (15:05 +0200)]
VT-d: fix deassign of device with RMRR
Ignoring a specific error code here was not meant to short circuit
deassign to _just_ the unmapping of RMRRs. This bug was previously
hidden by the bogus (potentially indefinite) looping in
pci_release_devices(), until f591755823a7 ("IOMMU/PCI: don't let domain
cleanup continue when device de-assignment failed") fixed that loop.
This is CVE-2021-28702 / XSA-386.
Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") Reported-by: Ivan Kardykov <kardykov@tabit.pro> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Ivan Kardykov <kardykov@tabit.pro>
xen/arm: Add handling of extended regions for Dom0
The extended region (safe range) is a region of guest physical
address space which is unused and could be safely used to create
grant/foreign mappings instead of wasting real RAM pages from
the domain memory for establishing these mappings.
The extended regions are chosen at the domain creation time and
advertised to it via "reg" property under hypervisor node in
the guest device-tree. As region 0 is reserved for grant table
space (always present), the indexes for extended regions are 1...N.
If extended regions could not be allocated for some reason,
Xen doesn't fail and behaves as usual, so only inserts region 0.
Please note the following limitations:
- The extended region feature is only supported for 64-bit domain
currently.
- The ACPI case is not covered.
***
As Dom0 is a direct mapped domain on Arm (i.e. MFN == GFN)
the algorithm to choose extended regions for it is different
in comparison with the algorithm for non-direct mapped DomU.
What is more, extended regions should be chosen differently
depending on whether the IOMMU is enabled or not.
Provide RAM not assigned to Dom0 if IOMMU is disabled or memory
holes found in host device-tree if otherwise. Make sure that
extended regions are 2MB-aligned and located within maximum possible
addressable physical memory range. The minimum size of extended
region is 64MB. The maximum number of extended regions is 128,
which is an artificial limit to minimize code changes (we reuse
struct meminfo to describe extended regions, so there is an array
field for 128 elements).
It is worth mentioning that the unallocated memory solution (when the IOMMU
is disabled) will work safely until Dom0 is able to allocate memory
outside of the original range.
Also introduce command line option to be able to globally enable or
disable support for extended regions for Dom0 (enabled by default).
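The alignment and minimum-size constraints above can be illustrated with a small filter over a candidate range. trim_region() is a hypothetical helper, 'end' is taken as exclusive, and the sketch assumes addresses far enough below 2^64 that the round-up cannot overflow.

```c
#include <stdbool.h>
#include <stdint.h>

#define MB(x)        ((uint64_t)(x) << 20)
#define REGION_ALIGN MB(2)     /* extended regions are 2MB-aligned */
#define REGION_MIN   MB(64)    /* and at least 64MB in size */

/*
 * Shrink [*start, *end) inwards to 2MB boundaries. Returns true and updates
 * the range if what remains is large enough, false otherwise.
 */
bool trim_region(uint64_t *start, uint64_t *end)
{
    uint64_t s = (*start + REGION_ALIGN - 1) & ~(REGION_ALIGN - 1);
    uint64_t e = *end & ~(REGION_ALIGN - 1);

    if ( e <= s || e - s < REGION_MIN )
        return false;

    *start = s;
    *end = e;
    return true;
}
```

For example, a hole from 1GB+1MB to 1GB+128MB is trimmed to start at 1GB+2MB and accepted, while a 64MB hole starting at 1GB+1MB shrinks to 62MB and is rejected.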
Introduce the xen,uefi-cfg-load DT property of /chosen
node for ARM whose presence decides whether to force
the load of the UEFI Xen configuration file.
The logic is that if any multiboot,module is found in
the DT, then the xen,uefi-cfg-load property is used to see
if the UEFI Xen configuration file is needed.
Modify a comment in efi_arch_use_config_file, removing
the part that states "dom0 required" because it's not
true anymore with this commit.
Juergen Gross [Fri, 1 Oct 2021 13:11:03 +0000 (15:11 +0200)]
include/public: add possible status values to usbif.h
The interface definition of PV USB devices is lacking the specification
of possible values of the status field in a response. Those are
negative errno values as used in Linux, so they might differ in other
OS's. Specify them via appropriate defines.
Anthony PERARD [Thu, 30 Sep 2021 16:17:19 +0000 (17:17 +0100)]
automation: switch GitLab x86 smoke test to use PV 64bit binary
Xen is now built without CONFIG_PV32 by default and thus test jobs
"qemu-smoke-x86-64-gcc" and "qemu-smoke-x86-64-clang" fail because
they are using XTF's "test-pv32pae-example" which is a hello world
32bit PV guest.
As we are looking for whether Xen boots or not with a quick smoke test,
just use 64bit tests instead.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Import the Linux helper of_get_pci_domain_nr. This function will try to
obtain the host bridge domain number by finding a property called
"linux,pci-domain" of the given device node.
Import the Linux helper of_property_read_variable_u32_array. This
function finds and reads an array of 32 bit integers from a property,
with bounds on the minimum and maximum array size.
xen/pci: Refactor MSI code that implements MSI functionality within XEN
On Arm, the initial plan is to only support GICv3 ITS which doesn't
require us to manage the MSIs because the HW will protect against
spoofing. Move the code under CONFIG_HAS_PCI_MSI flag to gate the code
for ARM.
No functional change intended.
Signed-off-by: Rahul Singh <rahul.singh@arm.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Thu, 9 Sep 2021 14:33:06 +0000 (15:33 +0100)]
build: add --full to version.sh to guess $(XEN_FULLVERSION)
Running $(MAKE) like that in a $(shell ) while parsing the Makefile
doesn't work reliably. In some cases, make will complain with
"jobserver unavailable: using -j1. Add '+' to parent make rule.".
Also, it isn't possible to distinguish between the output produced by
the target "xenversion" and `make`'s own output.
Instead of running make, this patch "improves" `version.sh` to try to
guess the output of `make xenversion`.
In order to have version.sh work in more scenarios, it will use
XEN_EXTRAVERSION and XEN_VENDORVERSION from the environment when
present. As for the cases where those two variables are overridden by
make command line arguments, we export them when invoking version.sh
via a new $(XEN_FULLVERSION) macro.
That should hopefully get us to having ./version.sh returning the same
value that `make xenversion` would.
This fixes GitLab CI's build job "debian-unstable-gcc-arm64".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Wed, 29 Sep 2021 09:57:22 +0000 (11:57 +0200)]
x86/PVH: actually show Dom0's stacks from debug key '0'
show_guest_stack() does nothing for HVM. Introduce a HVM-specific
dumping function, paralleling the 64- and 32-bit PV ones. We don't know
the real stack size, so only dump up to the next page boundary.
Rather than adding a vcpu parameter to hvm_copy_from_guest_linear(),
introduce hvm_copy_from_vcpu_linear() which - for now at least - in
return won't need a "pfinfo" parameter.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
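Limiting a dump to the next page boundary comes down to one line of arithmetic. The sketch below assumes 4KiB pages and a hypothetical helper name; it treats a page-aligned stack pointer as having a whole page available, which is a choice made for this example and not necessarily what the patch does.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Number of bytes from 'sp' up to (not including) the next page boundary. */
size_t bytes_to_page_end(uint64_t sp)
{
    return (size_t)(PAGE_SIZE - (sp & (PAGE_SIZE - 1)));
}
```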
Jan Beulich [Wed, 29 Sep 2021 09:56:18 +0000 (11:56 +0200)]
x86/HVM: convert hvm_virtual_to_linear_addr() to be remote-capable
While all present callers want to act on "current", stack dumping for
HVM vCPU-s will require the function to be able to act on a remote vCPU.
To avoid touching all present callers, convert the existing function to
an inline wrapper around the extended new one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 28 Sep 2021 14:03:38 +0000 (16:03 +0200)]
x86/PVH: actually show Dom0's register state from debug key '0'
vcpu_show_registers() didn't do anything for HVM so far. Note though
that some extra hackery is needed for VMX - see the code comment.
Note further that the show_guest_stack() invocation is left alone here:
While strictly speaking guest_kernel_mode() should be predicated by a
PV / !HVM check, show_guest_stack() itself will bail immediately for
HVM.
While there and despite not being PVH-specific, take the opportunity and
filter offline vCPU-s: There's not really any register state associated
with them, so avoid spamming the log with useless information while
still leaving an indication of the fact.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
xen/arm: optee: Fix arm_smccc_smc's a0 for OPTEE_SMC_DISABLE_SHM_CACHE
Fix a possible copy-paste error in arm_smccc_smc's first argument (a0)
for OPTEE_SMC_DISABLE_SHM_CACHE case.
This error causes Linux > v5.14-rc5 (b5c10dd04b7418793517e3286cde5c04759a86de
optee: Clear stale cache entries during initialization) to get stuck
repeatedly issuing the OPTEE_SMC_DISABLE_SHM_CACHE call and waiting for
the result to be OPTEE_SMC_RETURN_ENOTAVAIL, which will never happen.
In case abi-dumper is available the stubdom builds will fail due to a
false dependency on dynamic loadable libraries. Fix that.
Fixes: d7c9f7a7a3959913b4 ("tools/libs: Write out an ABI analysis when abi-dumper is available") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Kevin Stefanov [Wed, 15 Sep 2021 14:30:00 +0000 (15:30 +0100)]
tools/libxl: Correctly align the ACPI tables
The memory allocator currently calculates alignment in libxl's virtual
address space, rather than guest physical address space. This results
in the FACS being commonly misaligned.
Furthermore, the allocator has several other bugs.
The opencoded align-up calculation is currently susceptible to a bug
that occurs in the corner case that the buffer is already aligned to
begin with. In that case, an align-sized memory hole is introduced.
The while loop is dead logic because its effects are entirely and
unconditionally overwritten immediately after it.
Rework the memory allocator to align in guest physical address space
instead of libxl's virtual memory and improve the calculation, drop
errant extra page in allocated buffer for ACPI tables, and give some
of the variables better names/types.
Fixes: 14c0d328da2b ("libxl/acpi: Build ACPI tables for HVMlite guests") Signed-off-by: Kevin Stefanov <kevin.stefanov@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Jackson <iwj@xenproject.org>
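The two problems named above (aligning the wrong address, and an align-up that skips ahead when the input is already aligned) can be illustrated as follows. The buggy form shown is one common way to get this wrong and is not claimed to be the exact expression that was in libxl; alloc_gpa() is a hypothetical bump allocator.

```c
#include <stdint.h>

/* Illustrative buggy form: always advances, even if 'addr' is aligned. */
uint64_t align_up_buggy(uint64_t addr, uint64_t align)
{
    return (addr + align) & ~(align - 1);
}

/* Correct form for power-of-two 'align': aligned input is unchanged. */
uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * 'gpa_base' is the guest physical address the buffer will be mapped at and
 * '*used' the number of bytes consumed so far. Alignment is computed on the
 * guest physical address, not on a pointer in the toolstack's own address
 * space. Returns the guest physical address of the new allocation.
 */
uint64_t alloc_gpa(uint64_t gpa_base, uint64_t *used,
                   uint64_t size, uint64_t align)
{
    uint64_t gpa = align_up(gpa_base + *used, align);

    *used = gpa - gpa_base + size;
    return gpa;
}
```

With a 64-byte alignment, align_up_buggy(0x1000, 64) yields 0x1040 and leaves a 64-byte hole, whereas align_up(0x1000, 64) stays at 0x1000.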
x86: initialize memnodemapsize while faking NUMA node
When the system turns NUMA off or lacks NUMA support,
Xen will fake a NUMA node to make the system work as a single
node NUMA system.
In this case the memory node map doesn't need to be allocated
from boot pages, it will use the _memnodemap directly. But
memnodemapsize hasn't been set, so Xen should hit the assertion in phys_to_nid.
Because x86 was using an empty macro "VIRTUAL_BUG_ON" to replace
ASSERT, this bug will not be triggered on x86.
Actually, Xen will only use 1 slot of memnodemap in this case.
So we set memnodemap[0] to 0 and memnodemapsize to 1 in this
patch to fix it.
Signed-off-by: Wei Chen <wei.chen@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
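A simplified model shows why the size matters. The names below (node_map, node_map_size, node_shift, addr_to_node(), fake_single_node()) are stand-ins invented for this sketch, and the lookup is reduced to a plain shift; it assumes addresses below 2^63.

```c
#include <assert.h>
#include <stdint.h>

uint8_t node_map[1];            /* stand-in for the static _memnodemap */
unsigned long node_map_size;    /* stand-in for memnodemapsize */
unsigned int node_shift;        /* stand-in for the map's address shift */

/* Simplified lookup: index the map by the shifted address. */
unsigned int addr_to_node(uint64_t addr)
{
    uint64_t idx = addr >> node_shift;

    assert(idx < node_map_size);   /* fires if the size was never set */
    return node_map[idx];
}

/* Fake a single node: every address maps to slot 0, which holds node 0. */
void fake_single_node(void)
{
    node_shift = 63;
    node_map[0] = 0;
    node_map_size = 1;
}
```

Without the last assignment in fake_single_node(), node_map_size stays 0 and the assertion fails on the first lookup, which is the situation the patch addresses.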
Jan Beulich [Fri, 24 Sep 2021 09:00:30 +0000 (11:00 +0200)]
common: guest_physmap_add_page()'s return value needs checking
The function may fail; it is not correct to indicate "success" in this
case up the call stack. Mark the function must-check to prove all
cases have been caught (and no new ones will get introduced).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Jackson <iwj@xenproject.org> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
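For reference, a must-check annotation in GCC and Clang is spelled with the warn_unused_result attribute. The sketch below uses a local must_check macro and hypothetical physmap_add() and populate() functions to show a caller that propagates the error instead of reporting success.

```c
#include <errno.h>

#define must_check __attribute__((warn_unused_result))

/* Hypothetical operation that can fail; ~0UL models an invalid frame. */
must_check int physmap_add(unsigned long gfn, unsigned long mfn)
{
    if ( gfn == ~0UL || mfn == ~0UL )
        return -EINVAL;

    return 0;
}

/* A caller has to consume the result, and should pass failures upwards. */
int populate(unsigned long gfn, unsigned long mfn)
{
    int rc = physmap_add(gfn, mfn);

    if ( rc )
        return rc;

    /* ... continue only once the mapping is known to be in place ... */
    return 0;
}
```

With the attribute in place, a bare `physmap_add(gfn, mfn);` statement draws a compiler warning, which is how the annotation helps show that all call sites have been dealt with.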
Jan Beulich [Wed, 22 Sep 2021 14:19:21 +0000 (16:19 +0200)]
x86: drop a bogus SHARED_M2P() check from PV Dom0 building code
If anything, a check covering a wider range of invalid M2P entries ought
to be used (e.g. VALID_M2P()). But since everything is fully under Xen's
control at this stage, simply remove the BUG_ON().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Commit 540a637c3410780b519fc055f432afe271f642f8 defines a new
helper mark_page_free to extract common codes, while it accidently
breaks the local variable "tainted".
This patch fix it by letting mark_page_free() return bool of whether the
page is offlined and rename local variable "tainted" to "pg_offlined".
Tamas K Lengyel [Wed, 22 Sep 2021 14:17:54 +0000 (16:17 +0200)]
x86/mem_sharing: don't lock parent during fork reset
During fork reset operation the parent domain doesn't need to be gathered using
rcu_lock_live_remote_domain_by_id, the fork already has the parent pointer.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 22 Sep 2021 14:17:04 +0000 (16:17 +0200)]
AMD/IOMMU: add "ivmd=" command line option
Just like VT-d's "rmrr=" it can be used to cover for firmware omissions.
Since systems surfacing IVMDs seem to be rare, it is also meant to allow
testing of the involved code.
Only the IVMD flavors actually understood by the IVMD parsing logic can
be generated, and for this initial implementation there's also no way to
control the flags field - unity r/w mappings are assumed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 22 Sep 2021 14:16:28 +0000 (16:16 +0200)]
AMD/IOMMU: provide function backing XENMEM_reserved_device_memory_map
Just like for VT-d, exclusion / unity map ranges would better be
reflected in e.g. the guest's E820 map. The reporting infrastructure
was put in place still pretty tailored to VT-d's needs; extend
get_reserved_device_memory() to allow vendor specific code to probe
whether a particular (seg,bus,dev,func) tuple would get its data
actually recorded. I admit the de-duplication of entries is quite
limited for now, but considering our trouble to find a system
surfacing _any_ IVMD this is likely not a critical issue for this
initial implementation.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 22 Sep 2021 14:15:29 +0000 (16:15 +0200)]
AMD/IOMMU: also insert IVMD ranges into Dom0's page tables
So far only one region would be taken care of, if it can be placed in
the exclusion range registers of the IOMMU. Take care of further ranges
as well. Seeing that we've been doing fine without this, make both
insertion and removal best effort only.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 22 Sep 2021 14:14:19 +0000 (16:14 +0200)]
AMD/IOMMU: check / convert IVMD ranges for being / to be reserved
While the specification doesn't say so, just like for VT-d's RMRRs no
good can come from these ranges being e.g. conventional RAM or entirely
unmarked and hence usable for placing e.g. PCI device BARs. Check
whether they are, and put in some limited effort to convert to reserved.
(More advanced logic can be added if actual problems are found with this
simplistic variant.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Andrew Cooper [Mon, 20 Sep 2021 14:02:32 +0000 (15:02 +0100)]
x86/pv: Move x86/trace.c to x86/pv/trace.c
This entire file is pv-only, and not excluded from the build by
CONFIG_TRACEBUFFER. Move it into the pv/ directory, build it conditionally,
and drop unused includes.
Also move the contents of asm/trace.h to asm/pv/trace.h to avoid the functions
being declared across the entire hypervisor.
One caller in fixup_page_fault() is effectively PV only, but is not subject to
dead code elimination. Add an additional IS_ENABLED(CONFIG_PV) to keep the
build happy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 17 Sep 2021 23:32:12 +0000 (00:32 +0100)]
x86/hvm: Remove duplicate calls caused by tracing
1) vpic_ack_pending_irq() calls vlapic_accept_pic_intr() twice, once in the
TRACE_2D() instantiation and once "for real". Make the call only once.
2) vlapic_accept_pic_intr() similarly calls __vlapic_accept_pic_intr() twice,
although this is more complicated to disentangle.
v cannot be NULL because it has already been dereferenced in the function,
causing the ternary expression to always call __vlapic_accept_pic_intr().
However, the return expression of the function takes care to skip the call
if this vCPU isn't the PIC target. As __vlapic_accept_pic_intr() is far
from trivial, make the TRACE_2D() semantics match the return semantics by
only calling __vlapic_accept_pic_intr() when the vCPU is the PIC target.
3) hpet_set_timer() duplicates calls to hpet_tick_to_ns(). Pull the logic out
which simplifies both the TRACE and create_periodic_time() calls.
4) lapic_rearm() makes multiple calls to vlapic_lvtt_period(). Pull it out
into a local variable.
vlapic_accept_pic_intr() is called on every VMEntry, so this is a reduction in
VMEntry complexity across the board.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
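The pattern behind items 1) to 4) is an expression evaluated once for the trace record and again for the real result. A minimal model, with hypothetical names throughout (expensive_query() stands in for a non-trivial helper, TRACE_1D() for a trace macro):

```c
unsigned int nr_calls;      /* counts evaluations of the helper */
unsigned int last_traced;   /* last value handed to the trace macro */

/* Stand-in for a helper that is far from trivial. */
unsigned int expensive_query(unsigned int v)
{
    nr_calls++;
    return v & 1;
}

/* Stand-in for a trace macro: records one value. */
#define TRACE_1D(val) (last_traced = (val))

/* Before: the helper runs once for the trace and once "for real". */
unsigned int accept_before(unsigned int v)
{
    TRACE_1D(expensive_query(v));
    return expensive_query(v);
}

/* After: evaluate once and reuse the result for both purposes. */
unsigned int accept_after(unsigned int v)
{
    unsigned int accept = expensive_query(v);

    TRACE_1D(accept);
    return accept;
}
```

Both variants return and trace the same value; only the number of helper invocations differs (two versus one).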
Andrew Cooper [Wed, 15 Sep 2021 16:04:00 +0000 (17:04 +0100)]
x86/hvm: Reduce stack usage from HVMTRACE_ND()
It is pointless to write all 6 entries and only consume the useful subset.
bloat-o-meter shows quite how obscene the overhead is in vmx_vmexit_handler(),
weighing in at 12% of the function arranging unread zeroes on the stack, and
8% for svm_vmexit_handler().
Adjust all users of HVMTRACE_ND(), using TRC_PAR_LONG() where appropriate
instead of opencoding it.
The 0 case needs a little help. All objects in C must have a unique address
and _d is passed by pointer. Explicitly permit the optimiser to drop the
array.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 20 Sep 2021 08:24:27 +0000 (10:24 +0200)]
VT-d: consider hidden devices when unmapping
Whether to clear an IOMMU's bit in the domain's bitmap should depend on
all devices the domain can control. For the hardware domain this
includes hidden devices, which are associated with DomXEN.
While touching related logic
- convert the "current device" exclusion check to a simple pointer
comparison,
- convert "found" to "bool",
- adjust style and correct a typo in an existing comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Mon, 20 Sep 2021 08:23:08 +0000 (10:23 +0200)]
page-alloc: further adjust assign_page{,s}()
The on-commit editing of 5260e8fb93f0 ("xen: re-define assign_pages and
introduce a new function assign_page") didn't go quite far enough: A
local variable and a function argument also would have wanted adjusting.
modify acquire_domstatic_pages to take an unsigned int size parameter
acquire_domstatic_pages currently takes an unsigned long nr_mfns
parameter, but actually it cannot handle anything larger than an
unsigned int nr_mfns. That's because acquire_domstatic_pages is based on
assign_pages which also takes an unsigned int nr parameter.
So modify the nr_mfns parameter of acquire_domstatic_pages to be
unsigned int.
There is only one caller in
xen/arch/arm/domain_build.c:allocate_static_memory. Check that the value
to be passed to acquire_domstatic_pages is no larger than UINT_MAX. If
it is, print an error and goto fail.
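The range check described above might look like the following sketch. The function names acquire_pages() and allocate_bank() are hypothetical and only mirror the shape of the caller/callee pair; the real code uses Xen's own printing and error paths.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Hypothetical callee that can only handle an unsigned int page count. */
static int acquire_pages(unsigned int nr)
{
    return nr ? 0 : -1;
}

/* Hypothetical caller: refuse values that would be truncated when narrowed. */
static int allocate_bank(unsigned long nr_mfns)
{
    if ( nr_mfns > UINT_MAX )
    {
        fprintf(stderr, "bank too large: %lu pages\n", nr_mfns);
        return -1;   /* stands in for "goto fail" */
    }

    return acquire_pages((unsigned int)nr_mfns);
}
```

Checking before the narrowing conversion keeps a too-large count from silently wrapping to a small one.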
Sanitize CTR_EL0 value between cores and taint Xen if incompatible
values are found.
In the case of different i-cache types, the sanitized ctr_el0 will have a
sanitized value, but this is currently not used or exposed to guests, which
still see the original ctr_el0 value.
Use the opportunity to rename CTR_L1Ip to use an upper case name like
Linux does.
The patch is also defining ICACHE_POLICY_xxx instead of only having
CTR_L1IP_xxx to sync the definitions with Linux and is updating the code
using those accordingly (arm32 setup).
On platforms with only the same type of cores, this patch should not
modify the current Xen behaviour.
Use arm64 cpu feature sanitization to TAINT Xen if different DCZID values
are found (ftr_dczid is using only STRICT method).
In this case actual memory being cleaned by DC ZVA operations would be
different depending on the cores which could make a guest zeroing too
much or too little memory if it is migrated between CPUs.
We could, on processors supporting it, trap access to DCZID_EL0 register
using HFGRTR_EL2 register but this would not solve the case where a
process is being migrated during a copy or if it cached the value of the
register.
Replace the code in p2m trying to find a sane value for the VMID size
supported and the PAR to use. We are now using the boot cpuinfo as the
values there are sanitized during boot and the value for those
parameters is now the safest possible value on the system.
Define a sanitize_cpu function to be called on secondary cores to
sanitize the system cpuinfo structure.
The safest value is taken when possible and the system is marked tainted
if we encounter values which are incompatible with each other.
Call the update_system_features function on all secondary cores that are
kept running and taint the system if different midr are found between
cores but hmp-unsafe=true was passed on Xen command line.
This is only supported on arm64 so update_system_features is an empty
static inline on arm32.
The patch is adding a new TAINT_CPU_OUT_OF_SPEC to warn the user if
Xen is running on a system with feature differences between cores which
are not supported.
The patch is disabling CTR_EL0, DCZID_EL0 and ZCR using #if 0 with a TODO
as this patch is not handling sanitization of those registers.
CTR_EL0/DCZID will be handled in a future patch to properly handle
different cache attributes when possible.
ZCR should be sanitized once we add support for SVE in Xen.
As we will sanitize the content of boot_cpu_data it will not really
contain the boot cpu information but the sanitized system information.
Rename the structure to system_cpuinfo so the user is informed that this
is the set of features available system wide and no longer the features
of the boot cpu.
The original boot cpu data is still available in cpu_data.
Import structures declared in Linux file arch/arm64/kernel/cpufeature.c
and the required types from arch/arm64/include/asm/cpufeature.h.
Current code has been imported from Linux 5.13-rc5 (Commit ID cd1245d75ce93b8fd206f4b34eb58bcfe156d5e9) and copied into cpufeature.c
in arm64 code and cpufeature.h in arm64 specific headers.
Those structures will be used to sanitize the cpu features available to
the ones available on all cores of a system even if we are on a
heterogeneous platform (for example a big/LITTLE).
For each feature field of all ID registers, those structures define what
is the safest value and whether different values can be allowed on
different cores.
This patch is introducing Linux code without any changes to it.
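The per-field sanitization idea can be sketched as follows. This is a heavily simplified illustration, not the imported Linux code: struct ftr_field, demo_fields and sanitize_reg() are hypothetical, the policy names only loosely follow Linux naming, and only unsigned fields narrower than 64 bits are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ftr_policy { FTR_LOWER_SAFE, FTR_HIGHER_SAFE, FTR_EXACT };

/* Hypothetical field descriptor; width is assumed to be below 64. */
struct ftr_field {
    unsigned int shift, width;
    enum ftr_policy policy;
    bool strict;           /* a mismatch between cores taints the system */
    uint64_t safe_val;     /* fallback when values must match exactly */
};

static const struct ftr_field demo_fields[] = {
    { .shift = 0, .width = 4, .policy = FTR_LOWER_SAFE, .strict = false, .safe_val = 0 },
    { .shift = 4, .width = 4, .policy = FTR_EXACT,      .strict = true,  .safe_val = 0 },
};

/* Merge one core's register value into the system-wide value. */
static uint64_t sanitize_reg(const struct ftr_field *f, size_t n,
                             uint64_t sys, uint64_t cpu, bool *tainted)
{
    for ( size_t i = 0; i < n; i++ )
    {
        uint64_t mask = ((UINT64_C(1) << f[i].width) - 1) << f[i].shift;
        uint64_t a = (sys & mask) >> f[i].shift;
        uint64_t b = (cpu & mask) >> f[i].shift;
        uint64_t safe;

        if ( a == b )
            continue;
        if ( f[i].strict )
            *tainted = true;

        switch ( f[i].policy )
        {
        case FTR_LOWER_SAFE:  safe = a < b ? a : b; break;
        case FTR_HIGHER_SAFE: safe = a > b ? a : b; break;
        default:              safe = f[i].safe_val; break;
        }

        sys = (sys & ~mask) | (safe << f[i].shift);
    }

    return sys;
}
```

Calling sanitize_reg() once per secondary core leaves the system-wide value holding, for each field, a value every core can honour, with the taint flag set when a strict field differed.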
Jan Beulich [Thu, 16 Sep 2021 09:02:48 +0000 (11:02 +0200)]
VT-d: skip IOMMU bitmap cleanup for phantom devices
Doing the cleanup also for phantom devices is at best redundant with
doing it for the corresponding real device. I couldn't force myself into
checking all the code paths whether it really is: It seems better to
explicitly skip this step in such cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Thu, 16 Sep 2021 09:02:08 +0000 (11:02 +0200)]
VT-d: defer "no DRHD" check when (un)mapping devices
If devices are to be skipped anyway (which is the case in particular for
host bridges), there's no point complaining about a missing DRHD (and
hence a missing association with an IOMMU).
While there convert assignments to initializers and constify "drhd"
local variables.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Daniel P. Smith [Thu, 16 Sep 2021 08:59:40 +0000 (10:59 +0200)]
xsm: refactor xsm_ops handling
This renames the `struct xsm_operations` to the shorter `struct xsm_ops` and
converts the global xsm_ops from being a pointer to an explicit instance. As
part of this conversion, it reworks the XSM modules init function to return
their xsm_ops struct which is copied in to the global xsm_ops instance.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
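The pointer-to-instance conversion can be sketched as below. The single hook, the module functions and xsm_setup() are hypothetical simplifications; the real struct xsm_ops has many more hooks and a different init signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, one-hook version of the ops structure. */
struct xsm_ops {
    int (*check)(int domid);
};

/* Global instance (previously this would have been a pointer). */
static struct xsm_ops xsm_ops;

static int dummy_check(int domid)
{
    (void)domid;
    return 0;
}

static int module_check(int domid)
{
    return domid == 0 ? 0 : -1;
}

/* A module's init function hands back its ops structure. */
static const struct xsm_ops *module_init(void)
{
    static const struct xsm_ops ops = { .check = module_check };

    return &ops;
}

/* Copy the module's ops into the global instance, defaulting unset hooks. */
static void xsm_setup(const struct xsm_ops *(*init)(void))
{
    const struct xsm_ops *ops = init();

    if ( ops )
        xsm_ops = *ops;
    if ( !xsm_ops.check )
        xsm_ops.check = dummy_check;
}
```

Callers then invoke xsm_ops.check(...) directly on the instance rather than dereferencing a global pointer first.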
Daniel P. Smith [Thu, 16 Sep 2021 08:58:59 +0000 (10:58 +0200)]
xsm: apply coding style
Instead of intermixing coding style changes with code changes as they
are encountered in this patch set, move all coding style changes
into a single commit. The coding style changes here focus on:
- move trailing comments to line above
- ensuring line length does not exceed 80 chars
- ensuring proper indentation for 80 char wrapping
- convert u32 type statements to uint32_t
- remove space between closing and opening parens
- drop extern on function declarations
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 16 Sep 2021 08:56:25 +0000 (10:56 +0200)]
IOMMU: page table dumping adjustments
For one, none of the three IOMMU implementations on Arm specify a dumping
hook. Generalize VT-d's "don't dump shared page tables" to cover for
this.
Further, in the past I was told that on Arm in principle there could be
multiple different IOMMUs, and hence different domains' platform_ops
pointers could differ. Use each domain's ops for calling the dump hook.
(In the long run all uses of iommu_get_ops() would likely need to
disappear for this reason.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
If the new gfn matches the previous one (ie: gpfn == old_gpfn)
xenmem_add_to_physmap_one will issue a duplicated call to
guest_physmap_remove_page with the same guest frame number, because
the get_gpfn_from_mfn call has been moved by commit f8582da041 to be
performed before the original page is removed. This leads to the
second guest_physmap_remove_page failing, which was not the case
before commit f8582da041.
Fix this by adding a check that prevents a second call to
guest_physmap_remove_page if the previous one has already removed the
backing page from that gfn.
Fixes: f8582da041 ('x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86: quote section names when defining them in linker script
LLVM ld seems to require section names to be quoted at both definition
and when referencing them for a match to happen, or else we get the
following errors:
The original fix for GNU ld 2.37 only quoted the section name when
referencing it in the ADDR function. Fix by also quoting the section
names when declaring them.
Fixes: 58ad654ebce7 ("x86: work around build issue with GNU ld 2.37") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 15 Sep 2021 09:01:29 +0000 (11:01 +0200)]
x86/boot: properly "ignore" early evaluated "no-real-mode"
The option parser takes off "no-" prefixes before matching, so they also
shouldn't be specified to match against.
Fixes: e44d98608476 ("x86/setup: Ignore early boot parameters like no-real-mode") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
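Why a table entry spelled with the "no-" prefix can never match is shown by this minimal sketch. option_matches() is a hypothetical helper, not the Xen parser, and it assumes the prefix is stripped exactly once before an exact string comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical matcher: strip a leading "no-" from the command line
 * argument, then compare against the registered option name.
 */
static bool option_matches(const char *arg, const char *name)
{
    if ( !strncmp(arg, "no-", 3) )
        arg += 3;

    return !strcmp(arg, name);
}
```

With this behaviour, "no-real-mode" on the command line matches an option registered as "real-mode", but never one registered as "no-real-mode".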
Jan Beulich [Wed, 15 Sep 2021 09:00:40 +0000 (11:00 +0200)]
x86/ACPI: ignore processors which cannot be brought online
ACPI 6.3 introduced a flag which allows telling MADT entries describing
hotpluggable processors apart from ones which are simply placeholders
(often used by firmware writers to simplify handling there).
Inspired by a Linux patch by Mario Limonciello <mario.limonciello@amd.com>.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
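A rough sketch of the resulting decision follows. It assumes the flag layout as I understand it from ACPI 6.3 (bit 0 "Enabled", bit 1 "Online Capable" in the local APIC flags); the macro names, cpu_usable() and the has_online_capable parameter are hypothetical and do not reproduce the Xen code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag bits in a MADT local APIC entry. */
#define MADT_ENABLED        (UINT32_C(1) << 0)
#define MADT_ONLINE_CAPABLE (UINT32_C(1) << 1)

/*
 * Decide whether an entry is worth registering.
 * has_online_capable: hypothetical input saying whether the firmware's
 * table is new enough to define the Online Capable bit at all.
 */
static bool cpu_usable(uint32_t flags, bool has_online_capable)
{
    if ( flags & MADT_ENABLED )
        return true;               /* usable right away */

    if ( !has_online_capable )
        return true;               /* older tables: may still be hotpluggable */

    return (flags & MADT_ONLINE_CAPABLE) != 0;
}
```

Entries with both bits clear on a sufficiently new table are treated as placeholders and ignored.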