Roger Pau Monné [Wed, 17 Nov 2021 07:13:18 +0000 (08:13 +0100)]
test/tsx: set grant version for created domains
Set the grant table version for the created domains to use version 1,
as such test domains don't require the usage of the grant table at
all. A TODO note is added to switch those dummy domains to not have a
grant table at all when possible. Without setting the grant version
the domains for the tests cannot be created.
Fixes: 7379f9e10a ('gnttab: allow setting max version per-domain') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monné [Wed, 17 Nov 2021 07:13:02 +0000 (08:13 +0100)]
tests/resource: set grant version for created domains
Set the grant table version for the created domains to use version 1,
as that's the one used by the test cases. Without setting the grant
version the domains for the tests cannot be created.
Fixes: 7379f9e10a ('gnttab: allow setting max version per-domain') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monné [Wed, 17 Nov 2021 07:12:00 +0000 (08:12 +0100)]
domctl: introduce a macro to set the grant table max version
Such macro just clamps the passed version to fit in the designated
bits of the domctl field. The main purpose is to make it clearer in
the code when max grant version is being set in the grant_opts field.
Existing users that were setting the version in the grant_opts field
are switched to use the macro.
No functional change intended.
Requested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
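A minimal sketch of what such a macro could look like, assuming "clamping" means masking the value to the 4 bits reserved for the version. The macro, mask, and structure names below are illustrative and are not taken from the Xen headers.

```c
#include <stdint.h>

/* Hypothetical names: the low 4 bits of grant_opts carry the max version. */
#define GRANT_OPTS_VERSION_MASK 0xfu
#define GRANT_OPTS_VERSION(v)   ((uint32_t)(v) & GRANT_OPTS_VERSION_MASK)

/* Stand-in for the domain creation structure; only the relevant field. */
struct create_opts {
    uint32_t grant_opts;
};

/* Build the options for a test domain limited to grant table v1. */
static struct create_opts make_v1_opts(void)
{
    struct create_opts opts = { .grant_opts = GRANT_OPTS_VERSION(1) };

    return opts;
}
```

Writing `GRANT_OPTS_VERSION(1)` at the call site makes it visible that the max grant version is what is being stored in the field, which is the readability gain the commit message describes.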
Jane Malalane [Fri, 12 Nov 2021 14:48:21 +0000 (14:48 +0000)]
tests/resource: Extend to check that the grant frames are mapped correctly
Previously, we checked that we could map 40 pages with nothing
complaining. Now we're adding extra logic to check that those 40
frames are "correct".
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 10 Nov 2021 17:40:59 +0000 (18:40 +0100)]
x86/cpuid: prevent shrinking migrated policies max leaves
CPUID policies of guests being migrated shouldn't have their maximum
leaves shrunk, as that would be a guest-visible change. The hypervisor
has no knowledge of whether a guest has been migrated or is built from
scratch, and hence it must not blindly shrink the CPUID policy in
recalculate_cpuid_policy. Remove the
x86_cpuid_policy_shrink_max_leaves call from recalculate_cpuid_policy.
Removing such call could be seen as a partial revert of 540d911c28.
Instead let the toolstack shrink the policies for newly created
guests, while keeping the previous values for guests that are migrated
in. Note that guests migrated in without a CPUID policy won't get any
kind of shrinking applied.
Fixes: 540d911c28 ('x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Fri, 12 Nov 2021 12:56:51 +0000 (13:56 +0100)]
VT-d: per-domain IOMMU bitmap needs to have dynamic size
With no upper bound (anymore) on the number of IOMMUs, a fixed-size
64-bit map may be insufficient (systems with 40 IOMMUs have already been
observed).
Fixes: 27713fa2aa21 ("VT-d: improve save/restore of registers across S3") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
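As a rough illustration of a dynamically sized bitmap, the sketch below allocates one bit per IOMMU instead of relying on a fixed 64-bit word. The helper names are hypothetical and use plain `calloc()` rather than Xen's allocators.

```c
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_LONG    (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Allocate a zeroed bitmap with one bit per IOMMU; NULL on failure. */
static unsigned long *iommu_bitmap_alloc(unsigned int nr_iommus)
{
    return calloc(BITS_TO_LONGS(nr_iommus), sizeof(unsigned long));
}

/* Mark IOMMU number 'bit' as in use by the domain. */
static void iommu_bitmap_set(unsigned long *map, unsigned int bit)
{
    map[bit / BITS_PER_LONG] |= 1UL << (bit % BITS_PER_LONG);
}

/* Return 1 if IOMMU number 'bit' is marked, 0 otherwise. */
static int iommu_bitmap_test(const unsigned long *map, unsigned int bit)
{
    return (map[bit / BITS_PER_LONG] >> (bit % BITS_PER_LONG)) & 1;
}
```

The caller sizes the map from the number of IOMMUs found at boot, so no compile-time upper bound is needed.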
xen/arm: allocate_bank_memory: don't create memory banks of size zero
allocate_bank_memory can be called with a tot_size of zero, as an
example see the implementation of allocate_memory which can call
allocate_bank_memory with a tot_size of zero for the second memory bank.
If tot_size == 0, don't create an empty memory bank, just return
immediately without error. Otherwise a zero-size memory bank will be
added to the domain device tree.
Note that Linux is known to be able to cope with zero-size memory banks,
and Xen more recently gained the ability to do so as well (5a37207df520
"xen/arm: bootfdt: Ignore empty memory bank"). However, there might be
other non-Linux OSes that are not able to cope with empty memory banks
as well as Linux (and now Xen). It would be more robust to avoid
zero-size memory banks unless required.
Moreover, the code to find empty address regions in make_hypervisor_node
in Xen is not able to cope with empty memory banks today and would
result in a Xen crash. This is only a latent bug because
make_hypervisor_node is only called for Dom0 at present and
allocate_memory is only called for DomU at the moment. (But if
make_hypervisor_node was to be called for a DomU, then the Xen crash
would become manifest.)
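The early return for tot_size == 0 can be sketched as follows. This is a simplified stand-in with hypothetical types and a hypothetical bank limit, not the actual allocate_bank_memory implementation.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_BANKS 4 /* illustrative limit */

struct bank {
    uint64_t start, size;
};

struct meminfo {
    unsigned int nr_banks;
    struct bank bank[MAX_BANKS];
};

/* Returns true on success. A zero tot_size succeeds without adding a bank. */
static bool add_bank(struct meminfo *mem, uint64_t start, uint64_t tot_size)
{
    if (tot_size == 0)
        return true;            /* nothing to allocate, and not an error */

    if (mem->nr_banks >= MAX_BANKS)
        return false;

    mem->bank[mem->nr_banks].start = start;
    mem->bank[mem->nr_banks].size = tot_size;
    mem->nr_banks++;

    return true;
}
```

Because the zero-size case returns before touching the array, no empty bank can later end up in the domain device tree.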
xen/arm: don't assign domU static-mem to dom0 as reserved-memory
DomUs static-mem ranges are added to the reserved_mem array for
accounting, but they shouldn't be assigned to dom0 as the other regular
reserved-memory ranges in device tree.
In make_memory_nodes, fix the error by skipping banks with xen_domain
set to true in the reserved-memory array. Also make sure to use the
first valid (!xen_domain) start address for the memory node name.
Fixes: 41c031ff437b ("xen/arm: introduce domain on Static Allocation") Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Penny Zheng <penny.zheng@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
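A minimal sketch of the skipping logic, assuming a bank structure with a xen_domain flag as described above. The function name and signature are hypothetical and do not mirror make_memory_nodes itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct bank {
    uint64_t start, size;
    bool xen_domain;            /* true: static memory reserved for a domU */
};

/*
 * Copy the banks that are not reserved for a domU into out[] and return
 * how many were kept. *first_start receives the start address of the
 * first kept bank, which is what the node name would be derived from.
 */
static size_t keep_non_domain_banks(const struct bank *in, size_t n,
                                    struct bank *out, uint64_t *first_start)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (in[i].xen_domain)
            continue;           /* listed for accounting only */
        if (kept == 0)
            *first_start = in[i].start;
        out[kept++] = in[i];
    }

    return kept;
}
```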
Roger Pau Monne [Tue, 9 Nov 2021 09:47:21 +0000 (10:47 +0100)]
tools/configure: make iPXE dependent on QEMU traditional
iPXE is only used by QEMU traditional, so make it off by default
unless QEMU traditional is enabled.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: bcf77ce510 ('configure: modify default of building rombios') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monne [Thu, 4 Nov 2021 10:48:34 +0000 (11:48 +0100)]
gnttab: allow setting max version per-domain
Introduce a new domain create field so that the toolstack can specify the
maximum grant table version usable by the domain. This is plumbed into
xl and settable by the user as max_grant_version.
Previously this was only settable on a per host basis using the
gnttab command line option.
Note the version is specified using 4 bits, which leaves room to
specify up to grant table version 15. Given that we only have 2 grant
table versions right now, and a new version is unlikely in the near
future, using 4 bits seems more than enough.
xenstored stubdomains are limited to grant table v1 because the
current MiniOS code used to build them only has support for grants v1.
There are existing limits set for xenstored stubdomains at creation
time that already match the defaults in MiniOS.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Andrew Cooper [Fri, 29 Oct 2021 17:38:13 +0000 (18:38 +0100)]
xen: Report grant table v1/v2 capabilities to the toolstack
To let the toolstack set the gnttab version on a per-domain basis, it
needs to know which ABIs Xen supports. Introduce
XEN_SYSCTL_PHYSCAP_gnttab_v{1,2} for the purpose, and plumb it down into
userspace.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Luca Fancellu [Fri, 5 Nov 2021 13:07:28 +0000 (13:07 +0000)]
xen/efi: Fix Grub2 boot on arm64
The code introduced by commit a1743fc3a9fe9b68c265c45264dddf214fd9b882
("arm/efi: Use dom0less configuration when using EFI boot") introduces
a problem when booting Xen using Grub2 on ARM machines using EDK2.
Despite the UEFI specification, EDK2+Grub2 returns a NULL DeviceHandle
inside the interface given by the LOADED_IMAGE_PROTOCOL service. This
handle is used later by efi_bs->HandleProtocol(...) inside
get_parent_handle(...) when requesting the SIMPLE_FILE_SYSTEM_PROTOCOL
interface, causing Xen to stop the boot because of an EFI_INVALID_PARAMETER
error.
Before the commit above, the function was never called because the
logic was skipping the call when there were multiboot modules in the
DT because the filesystem was never used and the bootloader had
put in place all the right modules in memory and the addresses
in the DT.
To fix the problem the old logic is put back in place. Because the handle
was given to the efi_check_dt_boot(...), but the revert put the handle
out of scope, the signature of the function is changed to use an
EFI_LOADED_IMAGE handle and request the EFI_FILE_HANDLE only when
needed (module found using xen,uefi-binary).
Another problem is found when the UEFI stub tries to check if Dom0
image or DomUs are present.
The logic doesn't work when the UEFI stub is not responsible for loading
any modules, so the efi_check_dt_boot(...) return value is modified
to return the number of multiboot modules found and not only the number
of modules loaded by the stub.
Take the opportunity to update the comment in handle_module_node(...)
to explain why we return success even if xen,uefi-binary is not found.
Fixes: a1743fc3a9 ("arm/efi: Use dom0less configuration when using EFI boot") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Juergen Gross [Thu, 4 Nov 2021 16:11:21 +0000 (17:11 +0100)]
tools: disable building qemu-trad per default
Using qemu-traditional as device model has been deprecated for some time now.
So change the default for building it to "disable". This will affect
ioemu-stubdom, too, as there is a direct dependency between the two.
Today it is possible to use a PVH/HVM Linux-based stubdom as device
model. Additionally using ioemu-stubdom isn't really helping for
security, as it requires running a very old and potentially buggy qemu
version in a PV domain. This probably adds more security problems
than using a stubdom removes.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Acked-by: Ian Jackson <iwj@xenproject.org> Release-acked-by: Ian Jackson <iwj@xenproject.org>
Juergen Gross [Thu, 4 Nov 2021 16:11:20 +0000 (17:11 +0100)]
configure: modify default of building rombios
The tools/configure script will default to build rombios if qemu
traditional is enabled. If rombios is being built, ipxe will be built
per default, too.
This results in rombios and ipxe no longer being built by default when
disabling qemu traditional.
Fix that by rearranging the dependencies:
- build ipxe by default
- build rombios by default if either ipxe or qemu traditional are
being built
This modification prepares not building qemu traditional by default
without affecting build of rombios and ipxe.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-acked-by: Ian Jackson <iwj@xenproject.org>
Juergen Gross [Thu, 4 Nov 2021 14:42:42 +0000 (15:42 +0100)]
tools/helpers: fix broken xenstore stubdom init
Commit 1787cc167906f3f ("libs/guest: Move the guest ABI check earlier
into xc_dom_parse_image()") broke starting the xenstore stubdom. This
is due to a rather special way the xenstore stubdom domain config is
being initialized: in order to support both PV and PVH stubdoms,
init-xenstore-domain is using xc_dom_parse_image() to find the correct
domain type. Unfortunately the above commit requires xc_dom_boot_xen_init()
to have been called before using xc_dom_parse_image(). This requires
the domid, which is known only after xc_domain_create(), which requires
the domain type.
In order to break this circular dependency, call xc_dom_boot_xen_init()
with an arbitrary domid first, and then set dom->guest_domid later.
Fixes: 1787cc167906f3f ("libs/guest: Move the guest ABI check earlier into xc_dom_parse_image()") Signed-off-by: Juergen Gross <jgross@suse.com> Release-acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Nov 2021 13:44:43 +0000 (14:44 +0100)]
x86/APIC: avoid iommu_supports_x2apic() on error path
The value it returns may change from true to false in case
iommu_enable_x2apic() fails and, as a side effect, clears iommu_intremap
(as can happen at least on AMD). Latch the return value from the first
invocation to replace the second one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
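The latching idea from "x86/APIC: avoid iommu_supports_x2apic() on error path" can be illustrated with a small simulation. All names here are stand-ins, and the enable function is hard-wired to fail, so this only shows why a second query would give a different answer.

```c
#include <stdbool.h>

static bool intremap = true;    /* stand-in for iommu_intremap */

/* Stand-in for the capability query; its answer depends on intremap. */
static bool supports_x2apic(void)
{
    return intremap;
}

/* Simulated enable that fails and clears intremap as a side effect. */
static int enable_x2apic(void)
{
    intremap = false;
    return -1;
}

/* Returns the value the error path should act on. */
static bool error_path_view(void)
{
    const bool supported = supports_x2apic();   /* latched once */

    if (supported)
        (void)enable_x2apic();  /* fails, flipping the live answer */

    return supported;           /* not a second supports_x2apic() call */
}
```

After `error_path_view()` returns true, a fresh `supports_x2apic()` call returns false, which is exactly the inconsistency the latched value avoids.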
Jan Beulich [Thu, 4 Nov 2021 13:44:01 +0000 (14:44 +0100)]
x86/IOMMU: mark IOMMU / intremap not in use when ACPI tables are missing
x2apic_bsp_setup() gets called ahead of iommu_setup(), and since x2APIC
mode (physical vs clustered) depends on iommu_intremap, that variable
needs to be set to off as soon as we know we can't / won't enable
interrupt remapping, i.e. in particular when parsing of the respective
ACPI tables failed. Move the turning off of iommu_intremap from AMD
specific code into acpi_iommu_init(), accompanying it by clearing of
iommu_enable.
Take the opportunity and also fully skip ACPI table parsing logic on
VT-d when both "iommu=off" and "iommu=no-intremap" are in effect anyway,
like was already the case for AMD.
The tag below only references the commit uncovering a pre-existing
anomaly.
Fixes: d8bd82327b0f ("AMD/IOMMU: obtain IVHD type to use earlier") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
x86/xstate: reset cached register values on resume
set_xcr0() and set_msr_xss() use a cached value to avoid setting the
register to the same value over and over. But suspend/resume implicitly
reset the registers and since percpu areas are not deallocated on
suspend anymore, the cache gets stale.
Reset the cache on resume, to ensure the next write will really hit the
hardware. Choose value 0, as it will never be a legitimate write to
those registers - and so, will force write (and cache update).
Note the cache is used in get_xcr0() and get_msr_xss() too, but:
- set_xcr0() is called a few lines below in xstate_init(), so it will
update the cache with the appropriate value
- get_msr_xss() is not used anywhere - and thus not before any
set_msr_xss() that will fill the cache
Fixes: aca2a985a55a "xen: don't free percpu areas during suspend" Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
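The stale-cache problem and the reset to 0 can be sketched with a simulated register. The variables and function names below are hypothetical; the real code operates on per-CPU data and actual hardware registers.

```c
#include <stdint.h>

static uint64_t hw_xcr0;        /* simulated hardware register */
static uint64_t cached_xcr0;    /* last value written via set_xcr0_sim() */
static unsigned int hw_writes;  /* number of simulated register writes */

/* Write the register unless the cache says it already holds 'val'. */
static void set_xcr0_sim(uint64_t val)
{
    if (val == cached_xcr0)
        return;                 /* skip the redundant write */

    hw_xcr0 = val;
    hw_writes++;
    cached_xcr0 = val;
}

/* On resume: 0 is never a legitimate value, so the next write goes through. */
static void resume_reset_cache(void)
{
    cached_xcr0 = 0;
}
```

Without the call to `resume_reset_cache()`, writing the pre-suspend value again would be skipped even though the simulated hardware register no longer holds it.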
Andrew Cooper [Tue, 28 Sep 2021 20:55:56 +0000 (21:55 +0100)]
x86/traps: Fix typo in do_entry_CP()
The call to debugger_trap_entry() should pass the correct vector. The
break-for-gdbsx logic is in practice unreachable because PV guests can't
generate #CP, but it will interfere with anyone inserting custom debugging
into debugger_trap_entry().
Fixes: 5ad05b9c2490 ("x86/traps: Implement #CP handler and extend #PF for shadow stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
xen/arm: fix SBDF calculation for vPCI MMIO handlers
In the vPCI MMIO trap handlers for the guest PCI host bridge it is not
enough for SBDF translation to simply call VPCI_ECAM_BDF(info->gpa), as
the base address may not be aligned in a way that makes the translation
always work. If the gpa is not adjusted with respect to the base address,
the SBDF may not be converted properly.
Fix this by adjusting the gpa with respect to the host bridge base address,
in the same way as is done for x86.
Please note, that this change is not strictly required given the current
value of GUEST_VPCI_ECAM_BASE which has bits 0 to 27 clear, but could cause
issues if such value is changed, or when handlers for dom0 ECAM
regions are added as those will be mapped over existing hardware
regions that could use non-aligned base addresses.
Fixes: d59168dc05a5 ("xen/arm: Enable the existing x86 virtual PCI support for ARM") Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
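A small sketch of why the adjustment matters, using the standard ECAM layout (bus in bits 20-27, device in bits 15-19, function in bits 12-14). The `ECAM_BDF` macro and the example addresses are illustrative assumptions, not copied from Xen.

```c
#include <stdint.h>

/* Extract the 16-bit BDF from an ECAM offset (bits 12-27). */
#define ECAM_BDF(addr) ((uint32_t)(((addr) & 0x0ffff000ULL) >> 12))

/* Translation applied directly to the guest physical address. */
static uint32_t bdf_unadjusted(uint64_t gpa)
{
    return ECAM_BDF(gpa);
}

/* Translation applied to the offset from the host bridge base. */
static uint32_t bdf_adjusted(uint64_t gpa, uint64_t base)
{
    return ECAM_BDF(gpa - base);
}
```

With a hypothetical base of 0x10100000, which has bit 20 set, an access to device 1 function 0 on bus 0 lands at 0x10108000. The unadjusted translation folds the base's bit 20 into the bus number, while the adjusted one does not. For a base with bits 0 to 27 clear both variants agree, which matches the note about the current GUEST_VPCI_ECAM_BASE.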
Andrew Cooper [Mon, 1 Nov 2021 20:45:26 +0000 (20:45 +0000)]
x86/shstk: Fix use of shadow stacks with XPTI active
The call to setup_cpu_root_pgt(0) in smp_prepare_cpus() is too early. It
clones the BSP's stack while the .data mapping is still in use, causing all
mappings to be fully read/write (and with no guard pages either). This
ultimately causes #DF when trying to enter the dom0 kernel for the first time.
Defer setting up the BSP's XPTI pagetable until reinit_bsp_stack() after we've set
up proper shadow stack permissions.
Fixes: 60016604739b ("x86/shstk: Rework the stack layout to support shadow stacks") Fixes: b60ab42db2f0 ("x86/shstk: Activate Supervisor Shadow Stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Dongli Zhang [Wed, 3 Nov 2021 09:19:06 +0000 (10:19 +0100)]
update system time immediately when VCPUOP_register_vcpu_info
The guest may access the pv vcpu_time_info immediately after
VCPUOP_register_vcpu_info. This borrows the idea of
VCPUOP_register_vcpu_time_memory_area, where
force_update_vcpu_system_time() is called immediately when the new memory
area is registered.
Otherwise, we may observe clock drift at the VM side if the VM accesses
the clocksource immediately after VCPUOP_register_vcpu_info().
Reference: https://lists.xenproject.org/archives/html/xen-devel/2021-10/msg00571.html Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
As of 724b55f48a6c ("x86: introduce MWAIT-based, ACPI-less CPU idle
driver") they (also) live in asm/mwait.h; no idea how I missed the
duplicates back at the time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
automation: add a QEMU based x86_64 Dom0/DomU test
Introduce a test based on QEMU to run Xen, Dom0 and start a DomU.
This is similar to the existing qemu-alpine-arm64.sh script and test.
The only differences are:
- use Debian's qemu-system-x86_64 (on ARM we build our own)
- use ipxe instead of u-boot and ImageBuilder
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Build a 5.10 kernel to be used as Dom0 and DomU kernel for testing. This
is almost the same as the existing ARM64 recipe for Linux 5.9, the
only differences are:
- upgrade to latest 5.10.x stable
- force Xen modules to built-in (on ARM it was already done by defconfig)
Also add the exporting job to build.yaml so that the binary can be used
during gitlab-ci runs.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Roger Pau Monne [Wed, 27 Oct 2021 14:00:50 +0000 (16:00 +0200)]
x86/cpuid: prevent decreasing of hypervisor max leaf on migration
In order to be compatible with previous Xen versions, and not change
max hypervisor leaf as a result of a migration, keep the clamping of
the maximum leaf value provided to XEN_CPUID_MAX_NUM_LEAVES, instead
of doing it based on the domain type. Also set the default maximum
leaf without taking the domain type into account. The maximum
hypervisor leaf is not migrated, so we need the default to not regress
beyond what might already be reported to a guest by existing Xen
versions.
This is a partial revert of 540d911c28 and restores the previous
behaviour and ensures that HVM guests won't see their maximum
hypervisor leaf reduced from 5 to 4 as a result of a migration.
Fixes: 540d911c28 ('x86/CPUID: shrink max_{,sub}leaf fields according to actual leaf contents') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Roger Pau Monne [Tue, 26 Oct 2021 15:12:33 +0000 (17:12 +0200)]
x86/hpet: setup HPET even when disabled due to stopping in deep C states
Always allow the HPET to be set up, but don't report a frequency back
to the platform time source probe, to prevent it from being
selected as a valid timer if it's not usable.
Doing the setup even when not intended to be used as a platform timer
is required so that it can be used in legacy replacement mode in order
to assert the IO-APIC is capable of receiving interrupts.
Fixes: c12731493a ('x86/hpet: Use another crystalball to evaluate HPET usability') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Anthony PERARD [Fri, 22 Oct 2021 16:36:44 +0000 (17:36 +0100)]
automation: actually build with clang for ubuntu-focal-clang* jobs
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Hongda Deng [Thu, 21 Oct 2021 12:03:19 +0000 (20:03 +0800)]
xen/arm: vgic: Ignore write access to ICPENDR*
Currently, Xen will return IO unhandled when guests write ICPENDR*
virtual registers, which will raise a data abort inside the guest.
Linux guests do not access these virtual registers, but Zephyr
accesses them during initialization, so a Zephyr guest will get
an IO data abort and crash.
Emulating ICPENDR is not easy with the existing vGIC, so this patch
reworks the emulation to ignore write access to ICPENDR* virtual
registers and print a message about whether they are already pending
instead of returning unhandled.
More details can be found at [1].
Julien Grall [Wed, 20 Oct 2021 14:45:19 +0000 (14:45 +0000)]
tools/xenstored: Ignore domain we were unable to restore
Commit 939775cfd3 "handle dying domains in live update" was meant to
gracefully handle dying domains. However, the @releaseDomain watch
will end up being sent as soon as we have finished restoring Xenstored
state.
This may be before Xen reports the domain to be dying (such as if
the guest decided to revoke access to the xenstore page). Consequently
daemons like xenconsoled will not clean up the domain and it will be
left as a zombie.
To avoid the problem, mark the connection as ignored. This also
requires tweaking conn_can_write() and conn_can_read() to prevent
dereferencing a NULL pointer (the interface will not be mapped).
The check conn->is_ignored was originally added after the callbacks
because the helpers for a socket connection may close the fd. However,
ignore_connection() will close a socket connection directly. So it is
fine to do the re-order.
Signed-off-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Bertrand Marquis [Wed, 20 Oct 2021 15:22:52 +0000 (17:22 +0200)]
xen/pci: Install vpci handlers on x86 and fix error paths
Xen might not be able to discover at boot time all devices or some devices
might appear after specific actions from dom0.
In this case dom0 can use the PHYSDEVOP_pci_device_add to signal some
PCI devices to Xen.
As those devices were not known to Xen before, the vpci handlers must
be properly installed during pci_device_add for x86 PVH Dom0, in the
same way as what is done currently on arm (where Xen does not detect PCI
devices but relies on Dom0 to declare them all the time).
So this patch removes the ifdef protecting the call to
vpci_add_handlers and the comment which was arm specific.
vpci_add_handlers is called during pci_device_add, which can be called
at runtime through the physdev_op hypercall.
Remove __hwdom_init as the call is not limited anymore to hardware
domain init and fix linker script to only keep vpci_array in rodata
section.
Add missing vpci handlers cleanup during pci_device_remove and in case
of error with iommu during pci_device_add.
Move code adding the domain to the pdev domain_list as vpci_add_handlers
needs this to be set and remove it from the list in the error path.
Exit early from vpci_remove_device if the domain has no vpci support.
Add empty static inline for vpci_remove_device when CONFIG_VPCI is not
defined.
Add an ASSERT in vpci_add_handlers to check that the function is not
called twice for the same device.
Fixes: d59168dc05 ("xen/arm: Enable the existing x86 virtual PCI support for ARM") Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Thomas Gleixner [Wed, 20 Oct 2021 10:50:15 +0000 (12:50 +0200)]
x86/hpet: Use another crystalball to evaluate HPET usability
On recent Intel systems the HPET stops working when the system reaches PC10
idle state.
The approach of adding PCI ids to the early quirks to disable HPET on
these systems is a whack a mole game which makes no sense.
Check for PC10 instead and force disable HPET if supported. The check is
overbroad as it does not take ACPI, mwait-idle enablement and command
line parameters into account. That's fine as long as there is at least
PMTIMER available to calibrate the TSC frequency. The decision can be
overruled by adding "clocksource=hpet" on the Xen command line.
Remove the related PCI quirks for affected Coffee Lake systems as they
are no longer required. That should also cover all other systems, i.e.
Ice Lake, Tiger Lake, and newer generations, which are most likely
affected by this as well.
Fixes: Yet another hardware trainwreck Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Linux commit: 6e3cd95234dc1eda488f4f487c281bac8fef4d9b]
I have to admit that the purpose of checking CPUID5_ECX_INTERRUPT_BREAK
is unclear to me, but I didn't want to diverge in technical aspects from
the Linux commit.
In mwait_pc10_supported(), besides some cosmetic adjustments, avoid UB
from shifting left a signed 4-bit constant by 28 bits.
Pull in Linux's MSR_PKG_CST_CONFIG_CONTROL.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
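The undefined behaviour mentioned for mwait_pc10_supported() comes from an expression of the form `0xF << 28`: the constant has type int, and the result does not fit in a 32-bit int. A hedged sketch of the well-defined variant, with a hypothetical macro name and helper, looks like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 31:28 of a 32-bit word; the U suffix keeps the shift well defined. */
#define TOP_NIBBLE_MASK (0xfU << 28)

/* Returns true if any of bits 31:28 is set in 'word'. */
static bool top_nibble_set(uint32_t word)
{
    return (word & TOP_NIBBLE_MASK) != 0;
}
```

Using an unsigned constant gives the mask the value 0xf0000000 without relying on how a particular compiler treats signed overflow in shifts.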
Jan Beulich [Wed, 20 Oct 2021 10:42:44 +0000 (12:42 +0200)]
x86/PoD: defer nested P2M flushes
With NPT or shadow in use, the p2m_set_entry() -> p2m_pt_set_entry() ->
write_p2m_entry() -> p2m_flush_nestedp2m() call sequence triggers a lock
order violation when the PoD lock is held around it. Hence such flushing
needs to be deferred. Steal the approach from p2m_change_type_range().
(Note that strictly speaking the change at the out_of_memory label is
not needed, as the domain gets crashed there anyway. The change is being
made nevertheless to avoid setting up a trap from someone meaning to
deal with that case better than by domain_crash().)
Similarly for EPT I think ept_set_entry() -> ept_sync_domain() ->
ept_sync_domain_prepare() -> p2m_flush_nestedp2m() is affected. Make its
p2m_flush_nestedp2m() invocation conditional. Note that this then also
alters behavior of p2m_change_type_range() on EPT, deferring the nested
flushes there as well. I think this should have been that way from the
introduction of the flag.
Reported-by: Elliott Mitchell <ehem+xen@m5p.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
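The deferral pattern described in "x86/PoD: defer nested P2M flushes" can be sketched as a simulation: while the lock is held, entry updates only record that a flush is owed, and the flush runs once after the lock is dropped. The structure, field, and function names are hypothetical and the lock is modelled as a plain flag.

```c
#include <stdbool.h>

struct p2m {
    bool pod_locked;          /* stand-in for holding the PoD lock */
    bool defer_nested_flush;  /* hypothetical flag: postpone flushes */
    bool need_flush;          /* a flush was requested while deferred */
    unsigned int flushes;     /* flushes actually performed */
    unsigned int violations;  /* flushes attempted with the lock held */
};

static void flush_nested(struct p2m *p)
{
    if (p->pod_locked)
        p->violations++;      /* would be a lock order violation */
    p->flushes++;
}

/* Entry update: flush now, or only record that one is owed. */
static void set_entry(struct p2m *p)
{
    if (p->defer_nested_flush)
        p->need_flush = true;
    else
        flush_nested(p);
}

/* Update several entries under the lock, then flush once afterwards. */
static void pod_update(struct p2m *p, unsigned int nr_entries)
{
    p->pod_locked = true;
    p->defer_nested_flush = true;

    for (unsigned int i = 0; i < nr_entries; i++)
        set_entry(p);

    p->defer_nested_flush = false;
    p->pod_locked = false;

    if (p->need_flush) {
        p->need_flush = false;
        flush_nested(p);
    }
}
```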
Juergen Gross [Tue, 19 Oct 2021 11:21:40 +0000 (13:21 +0200)]
tools: fix oom setting of xenstored
Commit f282182af32939 ("tools/xenstore: set oom score for xenstore
daemon on Linux") introduced a regression when not setting the oom
value in the xencommons file. Fix that.
Fixes: f282182af32939 ("tools/xenstore: set oom score for xenstore daemon on Linux") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Tue, 19 Oct 2021 08:08:30 +0000 (10:08 +0200)]
x86/paging: restrict physical address width reported to guests
Modern hardware may report more than 48 bits of physical address width.
For paging-external guests our P2M implementation does not cope with
larger values. Telling the guest of more available bits means misleading
it into perhaps trying to actually put some page there (like was e.g.
intermediately done in OVMF for the shared info page).
While there also convert the PV check to a paging-external one (which in
our current code base are synonyms of one another anyway).
Fixes: 5dbd60e16a1f ("x86/shadow: Correct guest behaviour when creating PTEs above maxphysaddr") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 19 Oct 2021 08:07:42 +0000 (10:07 +0200)]
x86/PV: replace assertions in '0' debug key stack dumping
While it was me who added them, I'm afraid I don't see justification for
the assertions: A vCPU may very well have got preempted while in user
mode. Limit compat guest user mode stack dumps to the containing page
(like is done when using do_page_walk()), and suppress user mode stack
dumping altogether for 64-bit domains.
Fixes: cc0de53a903c ("x86: improve output resulting from sending '0' over serial") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
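Limiting a stack dump to the containing page amounts to computing how many bytes remain before the next page boundary. A minimal sketch, assuming 4 KiB pages and a hypothetical helper name:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Number of bytes to dump starting at sp without crossing its page. */
static size_t dump_len_within_page(uintptr_t sp, size_t wanted)
{
    size_t room = PAGE_SIZE - (sp & (PAGE_SIZE - 1));

    return wanted < room ? wanted : room;
}
```

A stack pointer 16 bytes below a page boundary therefore yields at most 16 bytes, regardless of how much the caller asked for.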
Jan Beulich [Tue, 19 Oct 2021 08:07:00 +0000 (10:07 +0200)]
x86/PV: make '0' debug key dump Dom0's stacks again
The conversion to __get_guest() failed to account for the fact that for
remote vCPU-s dumping gets done through a pointer obtained from
map_domain_page(): __get_guest() arranges for (apparent) accesses to
hypervisor space to cause #GP(0).
Fixes: 6a1d72d3739e ('x86: split __{get,put}_user() into "guest" and "unsafe" variants') Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 19 Oct 2021 08:05:10 +0000 (10:05 +0200)]
x86/altp2m: don't consider "active" when enabling failed
We should not rely on guests to not use altp2m after reporting failure
of HVMOP_altp2m_set_domain_state to them. Set "active" back to false in
this case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 19 Oct 2021 08:04:13 +0000 (10:04 +0200)]
x86/AMD: make HT range dynamic for Fam17 and up
At the time of d838ac2539cf ("x86: don't allow Dom0 access to the HT
address range") documentation correctly stated that the range was
completely fixed. For Fam17 and newer, it lives at the top of physical
address space, though.
To correctly determine the top of physical address space, we need to
account for their physical address reduction, hence the calculation of
paddr_bits also gets adjusted.
While for paddr_bits < 40 the HT range is completely hidden, there's no
need to suppress the range insertion in that case: It'll just have no
real meaning.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
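Editorial sketch (not part of the commit): one way to picture the change is to compute where the HT range would sit for a given family and address width. The 12 GiB size is taken from the legacy fixed range (0xFD_0000_0000 to 0xFF_FFFF_FFFF); placing a range of the same size at the very top of the address space for Fam17 and newer is an assumption made for illustration, as are all names below.

```python
from typing import Tuple

GIB = 1 << 30
HT_RANGE_SIZE = 12 * GIB      # size of the legacy fixed range (assumed to carry over)
LEGACY_TOP = 1 << 40          # the fixed range ends just below 1 TiB


def ht_range(family: int, paddr_bits: int) -> Tuple[int, int]:
    """Return (first, last) byte address of the HT range.

    paddr_bits is expected to be the already reduced physical address
    width mentioned in the commit message.
    """
    top = (1 << paddr_bits) if family >= 0x17 else LEGACY_TOP
    return top - HT_RANGE_SIZE, top - 1
```

For a pre-Fam17 part this yields the fixed range regardless of paddr_bits; for Fam17 with 48 usable bits it yields 0xFFFD_0000_0000 to 0xFFFF_FFFF_FFFF.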
Jan Beulich [Tue, 19 Oct 2021 08:02:39 +0000 (10:02 +0200)]
x86emul: de-duplicate scatters to the same linear address
The SDM specifically allows for earlier writes to fully overlapping
ranges to be dropped. If a guest did so, hvmemul_phys_mmio_access()
would crash it if varying data was written to the same address. Detect
overlaps early, as doing so in hvmemul_{linear,phys}_mmio_access() would
be quite a bit more difficult. To maintain proper faulting behavior,
instead of dropping earlier write instances of fully overlapping slots
altogether, write the data of the final of these slots multiple times.
(We also can't pull ahead the [single] write of the data of the last of
the slots, clearing all involved slots' op_mask bits together, as this
would yield incorrect results if there were intervening partially
overlapping ones.)
Note that due to cache slot use being linear address based, there's no
similar issue with multiple writes to the same physical address (mapped
through different linear addresses).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
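Editorial sketch (not part of the commit): the de-duplication strategy can be illustrated on a list of per-element writes. This is a simplified model that assumes all elements have the same size, so "fully overlapping" reduces to "same linear address"; partial overlaps are not modelled. The function name is hypothetical.

```python
from typing import Dict, List, Tuple


def dedup_scatter_writes(writes: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Rewrite (linear_address, data) pairs given in element order.

    Every write instance is kept, so faulting behaviour is unchanged,
    but each instance carries the data of the last element targeting
    the same address.
    """
    final_data: Dict[int, int] = {}
    for addr, data in writes:
        final_data[addr] = data          # later elements win
    return [(addr, final_data[addr]) for addr, _ in writes]
```

Because the same value is now written each time an address is hit, the emulation path never sees varying data for one address within a single instruction.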
The stubdom based pv-grub is using a very outdated version of grub
(0.97) and should not be used any longer. Mainline grub has had support for
PV guests for a long time now, so that should be used as a boot loader
of a PV domain.
So disable building pv-grub by default. In case someone really wants
to continue using it he/she can still use a pv-grub binary from an older
Xen version or manually enable building it via:
configure --enable-pv-grub
[ This was already disabled in osstest by 8dee6e333622
"make-flight: Drop pvgrub (pvgrub1) tests" -iwj ]
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Acked-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Juergen Gross [Tue, 12 Oct 2021 13:41:48 +0000 (15:41 +0200)]
tools/xenstore: set open file descriptor limit for xenstored
Add a configuration item for the maximum number of open file
descriptors xenstored should be allowed to have.
The default should be "unlimited" in order not to restrict xenstored
in the number of domains it can support, but unfortunately the kernel
is normally limiting the maximum value via /proc/sys/fs/nr_open [1],
[2]. So check whether that file exists and, if it does, limit the maximum
value to the one specified by /proc/sys/fs/nr_open.
As an aid for the admin configuring the value add a comment specifying
the common needs of xenstored for the different domain types.
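Editorial sketch (not part of the commit): the limit selection described above could look roughly like this. The function name and the use of None for "unlimited" are assumptions for illustration; only the /proc/sys/fs/nr_open path comes from the commit message.

```python
import os
from typing import Optional


def effective_fd_limit(configured: Optional[int],
                       nr_open_path: str = "/proc/sys/fs/nr_open") -> Optional[int]:
    """Return the open-file limit to apply; None stands for "unlimited".

    If the kernel file exists, the result never exceeds the value it holds.
    """
    if not os.path.exists(nr_open_path):
        return configured
    with open(nr_open_path) as f:
        kernel_max = int(f.read().strip())
    if configured is None:
        return kernel_max
    return min(configured, kernel_max)
```

Passing the path as a parameter keeps the sketch testable on systems without procfs.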
Juergen Gross [Tue, 12 Oct 2021 13:41:47 +0000 (15:41 +0200)]
tools/xenstore: set oom score for xenstore daemon on Linux
Xenstored is absolutely mandatory for a Xen host and it can't be
restarted, so being killed by OOM-killer in case of memory shortage is
to be avoided.
Set /proc/$pid/oom_score_adj (if available) per default to -500 (this
translates to 50% of dom0 memory size) in order to allow xenstored to
use large amounts of memory without being killed.
The percentage of dom0 memory above which the oom killer is allowed to
kill xenstored can be set via XENSTORED_OOM_MEM_THRESHOLD in
xencommons.
To make sure the pid file isn't a left-over from a previous run, delete it
before starting xenstored.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
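Editorial sketch (not part of the commit): the relationship between the memory threshold and the oom score can be written down directly. The factor of 10 is inferred from the statement that -500 corresponds to 50%, together with the Linux oom_score_adj range of -1000 to 1000; the function name is hypothetical.

```python
def oom_score_adj(mem_threshold_percent: int = 50) -> int:
    """Map a dom0 memory threshold (percent) to an oom_score_adj value."""
    if not 0 <= mem_threshold_percent <= 100:
        raise ValueError("threshold must be between 0 and 100 percent")
    return -10 * mem_threshold_percent
```

Under this model the default threshold of 50 gives -500, and a threshold of 100 gives -1000, which makes the process exempt from the OOM killer.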
Jan Beulich [Mon, 18 Oct 2021 12:23:29 +0000 (14:23 +0200)]
x86/PV: address odd UB in I/O emulation
Compilers are certainly right in detecting UB here, given that fully
parenthesized (to express precedence) the original offending expression
was (((stub_va + p) - ctxt->io_emul_stub) + 5), which in fact exhibits
two overflows in pointer calculations. We really want to calculate
(p - ctxt->io_emul_stub) first, which is guaranteed to not overflow.
The issue was observed with clang 9 on 4.13.
The oddities are
- the issue was detected on APPEND_CALL(save_guest_gprs), despite the
earlier similar APPEND_CALL(load_guest_gprs),
- merely casting the original offending expression to long was reported
to also help.
While at it also avoid converting guaranteed (with our current address
space layout) negative values to unsigned long (which has implementation
defined behavior): Have stub_va be of pointer type. And since it's on an
immediately adjacent line, also constify this_stubs.
Fixes: d89e5e65f305 ("x86/ioemul: Rewrite stub generation to be shadow stack compatible") Reported-by: Franklin Shen <2284696125@qq.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 18 Oct 2021 12:22:32 +0000 (14:22 +0200)]
x86/shadow: make a local variable in sh_page_fault() HVM-only
I recall checking that "r" would still have a user, but when doing so I
failed to recognize that all uses are inside a CONFIG_HVM conditional
section.
Fixes: 9f4f20b27b07 ("x86/shadow: adjust some shadow_set_l<N>e() callers") Reported-by: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 18 Oct 2021 12:21:17 +0000 (14:21 +0200)]
x86/HVM: correct cleanup after failed viridian_vcpu_init()
This happens after nestedhvm_vcpu_initialise(), so its effects also need
to be undone.
Fixes: 40a4a9d72d16 ("viridian: add init hooks") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Rahul Singh [Fri, 15 Oct 2021 16:51:45 +0000 (17:51 +0100)]
arm/libxl: Emulated PCI device tree node in libxl
libxl will create an emulated PCI device tree node in the device tree to
enable the guest OS to discover the virtual PCI during guest boot.
Emulated PCI device tree node will only be created when there is any
device assigned to guest.
A new area has been reserved in the arm guest physical map at
which the VPCI bus is declared in the device tree (reg and ranges
parameters of the node).
Note that currently we are using num_pcidevs instead of
c_info->passthrough to decide whether to create a vPCI DT node.
This will be insufficient if and when ARM does PCI hotplug.
Add this note inside libxl_create.c where c_info->passthrough
is set.
Signed-off-by: Rahul Singh <rahul.singh@arm.com> Signed-off-by: Michal Orzel <michal.orzel@arm.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Acked-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Michal Orzel [Fri, 15 Oct 2021 16:51:44 +0000 (17:51 +0100)]
tools/libxl_arm: Modify libxl__prepare_dtb...
... to take a second argument of type libxl_domain_config*
rather than libxl_domain_build_info*.
This change will be needed to get access from
libxl__prepare_dtb to "num_pcidevs" field of
libxl_domain_config to check whether to create
a vPCI DT node or not.
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Suggested-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
... to take a second argument of type libxl_domain_config*
rather than libxl_domain_build_info*.
We need to pass the whole libxl_domain_config
structure as this will be needed later on to modify
the libxl__prepare_dtb function to also take
libxl_domain_config.
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Suggested-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Ian Jackson <iwj@xenproject.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Rahul Singh [Fri, 15 Oct 2021 16:51:42 +0000 (17:51 +0100)]
xen/arm: Enable the existing x86 virtual PCI support for ARM
The existing VPCI support available for X86 is adapted for Arm.
When a device is added to XEN via the hypercall
"PHYSDEVOP_pci_device_add", a VPCI handler for config space
access is added to XEN to emulate the PCI device's config space.
An MMIO trap handler for the PCI ECAM space is registered in XEN
so that when a guest tries to access the PCI config space, XEN
will trap the access and emulate the read/write using the VPCI and
not the real PCI hardware.
For Dom0less systems scan_pci_devices() would be used to discover the
PCI devices in XEN, and the VPCI handlers will be added during XEN boot.
This patch also makes some small fixes to address compilation errors on
arm32 in vpci and to prevent 64bit accesses on 32bit:
- use %zu instead of %lu in header.c for print
- prevent 64bit accesses in vpci_access_allowed
- ifdef out using CONFIG_64BIT handling of len 8 in
vpci_ecam_{read/write}
TODO: currently vpci_add_handlers is marked as __hwdom_init, but on ARM
vpci_add_handlers can be called after boot from
PHYSDEVOP_pci_device_add. Consider removing __hwdom_init.
Signed-off-by: Rahul Singh <rahul.singh@arm.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
[stefano: add TODO item to commit message] Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
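Editorial sketch (not part of the commit): the access check affected by the arm32 fixes can be modelled as below. The is_64bit parameter stands in for the CONFIG_64BIT build option, and the natural-alignment rule is an assumption about how the check behaves.

```python
def vpci_access_allowed(reg: int, length: int, is_64bit: bool) -> bool:
    """Model of a config space access check.

    Accesses must be 1, 2 or 4 bytes wide (8 only on 64-bit builds)
    and aligned to their own size.
    """
    allowed_sizes = (1, 2, 4, 8) if is_64bit else (1, 2, 4)
    if length not in allowed_sizes:
        return False
    return reg % length == 0
```

On a 32-bit build an 8-byte access is rejected up front, which mirrors the intent of the second and third bullet in the list above.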
Bertrand Marquis [Fri, 15 Oct 2021 16:51:41 +0000 (17:51 +0100)]
xen/vpci: Move ecam access functions to common code
The PCI standard uses ECAM and not MCFG, which comes from ACPI [1].
Use ECAM/ecam instead of MCFG in common code and in new functions added
in common vpci code by this patch.
Move vpci_access_allowed from arch/x86/hvm/io.c to drivers/vpci/vpci.c.
Create vpci_ecam_{read,write} in drivers/vpci/vpci.c that
contain the common code to perform these operations, and change
vpci_mmcfg_{read,write} accordingly to make use of these functions.
The vpci_ecam_{read,write} functions are returning false on error and
true on success. As the x86 code was previously always returning
X86EMUL_OKAY the return code is ignored. A comment has been added in
the code to show that this is intentional.
Those functions will be used in a following patch inside the Arm vPCI
implementation.
Rename MMCFG_BDF to VPCI_ECAM_BDF and move it to vpci.h.
This macro is only used by functions calling vpci_ecam helpers.
No functional change intended with this patch.
[1] https://wiki.osdev.org/PCI_Express
Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
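Editorial sketch (not part of the commit): an ECAM offset encodes bus, device, function and register in fixed bit fields, which is what a BDF-extracting macro relies on. The field layout below follows the PCI Express ECAM definition; the mask used in vpci_ecam_bdf() is an assumption about how VPCI_ECAM_BDF is defined, and decode_ecam() is a hypothetical helper.

```python
from typing import Tuple


def vpci_ecam_bdf(addr: int) -> int:
    """Extract the 16-bit bus/device/function value from an ECAM offset."""
    return (addr & 0x0FFFF000) >> 12


def decode_ecam(addr: int) -> Tuple[int, int, int, int]:
    """Split an ECAM offset into (bus, device, function, register)."""
    bdf = vpci_ecam_bdf(addr)
    bus = (bdf >> 8) & 0xFF     # bits 27:20 of the offset
    dev = (bdf >> 3) & 0x1F     # bits 19:15
    fn = bdf & 0x7              # bits 14:12
    reg = addr & 0xFFF          # bits 11:0, 4 KiB of config space per function
    return bus, dev, fn, reg
```

For example, an offset built from bus 1, device 2, function 3 and register 0x10 decodes back to exactly those four values.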
Jan Beulich [Fri, 15 Oct 2021 11:43:35 +0000 (13:43 +0200)]
x86/shadow: adjust 2-level case of SHADOW_FOREACH_L2E()
Coverity apparently takes issue with the assignment inside an if(), but
then only in two of the cases (sh_destroy_l2_shadow() and
sh_unhook_32b_mappings()). As it's pretty simple to break out of the
outer loop without the need for a local helper variable, adjust the code
that way.
While there, with the other "unused value" reports also in mind, further
drop a dead assignment from SHADOW_FOREACH_L1E().
Coverity-ID: 1492857 Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 15 Oct 2021 10:48:31 +0000 (12:48 +0200)]
x86/shadow: adjust some shadow_set_l<N>e() callers
Coverity dislikes sh_page_fault() storing the return value into a local
variable but then never using the value (and oddly enough spots this in
the 2- and 3-level cases, but not in the 4-level one). Instead of adding
yet another cast to void as replacement, take the opportunity and drop a
bunch of such casts at the same time - not using function return values
is a common thing to do. (It of course is an independent question
whether ignoring errors like this is a good idea.)
Coverity-ID: 1492856
Coverity-ID: 1492858 Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 15 Oct 2021 10:47:18 +0000 (12:47 +0200)]
AMD/IOMMU: pull ATS disabling earlier
Disabling should be done in the opposite order of enabling: ATS wants to
be turned off before adjusting the DTE, just like it gets enabled only
after the DTE was suitably prepared. Note that we want ATS to be
disabled as soon as any of the DTEs involved in the handling of a device
(including phantom devices) gets adjusted respectively. For this reason
the "devfn == pdev->devfn" of the original conditional gets dropped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Fri, 15 Oct 2021 10:46:42 +0000 (12:46 +0200)]
AMD/IOMMU: respect AtsDisabled device flag
IVHD entries may specify that ATS is to be blocked for a device or range
of devices. Honor firmware telling us so.
While adding respective checks I noticed that the 2nd conditional in
amd_iommu_setup_domain_device() failed to check the IOMMU's capability.
Add the missing part of the condition there, as no good can come from
enabling ATS on a device when the IOMMU is not capable of dealing with
ATS requests.
For actually using ACPI_IVHD_ATS_DISABLED, make its expansion no longer
exhibit UB.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Fri, 15 Oct 2021 10:46:05 +0000 (12:46 +0200)]
AMD/IOMMU: check IVMD ranges against host implementation limits
When such ranges can't be represented as 1:1 mappings in page tables,
reject them as presumably bogus. Note that when we detect features late
(because of EFRSup being clear in the ACPI tables), it would be quite a
bit of work to check for (and drop) out of range IVMD ranges, so IOMMU
initialization gets failed in this case instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Fri, 15 Oct 2021 10:45:16 +0000 (12:45 +0200)]
AMD/IOMMU: improve (extended) feature detection
First of all the documentation is very clear about ACPI table data
superseding raw register data. Use raw register data only if EFRSup is
clear in the ACPI tables (which may still go too far). Additionally if
this flag is clear, the IVRS type 11H table is reserved and hence may
not be recognized.
Furthermore propagate IVRS type 10H data into the feature flags
recorded, as the full extended features field is available in type 11H
only.
Note that this also makes it necessary to stop the bad practice of us
finding a type 11H IVHD entry, but still processing the type 10H one
in detect_iommu_acpi()'s invocation of amd_iommu_detect_one_acpi().
Note also that the features.raw check in amd_iommu_prepare_one() needs
replacing, now that the field can also be populated by different means.
Key IOMMUv2 availability off of IVHD type not being 10H, and then move
it a function layer up, so that it would be set only once all IOMMUs
have been successfully prepared.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Fri, 15 Oct 2021 10:44:20 +0000 (12:44 +0200)]
AMD/IOMMU: obtain IVHD type to use earlier
Doing this in amd_iommu_prepare() is too late for it, in particular, to
be used in amd_iommu_detect_one_acpi(), as a subsequent change will want
to do. Moving it immediately ahead of amd_iommu_detect_acpi() is
(luckily) pretty simple, (pretty importantly) without breaking
amd_iommu_prepare()'s logic to prevent multiple processing.
This involves moving table checksumming, as
amd_iommu_get_supported_ivhd_type() -> get_supported_ivhd_type() will
now be invoked before amd_iommu_detect_acpi() -> detect_iommu_acpi(). In
the course of doing so stop open-coding acpi_tb_checksum(), seeing that
we have other uses of this originally ACPI-private function elsewhere in
the tree.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Today the build will fail if --disable-pv-grub is passed as a parameter to
configure, as the main Makefile will unconditionally try to build a
32-bit pv-grub stubdom.
Fix that by introducing a pv-grub-if-enabled target in
stubdom/Makefile taking care of this situation.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Reviewed-by: Ian Jackson <iwj@xenproject.org>
Allocate anonymous domheap pages as there is no strict need to
account them to a particular domain.
Since XSA-383 "xen/arm: Restrict the amount of memory that dom0less
domU and dom0 can allocate" the dom0 cannot allocate memory outside
of the pre-allocated region. This means if we try to allocate
a non-anonymous page to be accounted to dom0 we will get an
over-allocation issue when assigning that page to the domain.
The anonymous page, in turn, is not assigned to any domain.
libxl/arm: Add handling of extended regions for DomU
The extended region (safe range) is a region of guest physical
address space which is unused and could be safely used to create
grant/foreign mappings instead of wasting real RAM pages from
the domain memory for establishing these mappings.
The extended regions are chosen at the domain creation time and
advertised to it via "reg" property under hypervisor node in
the guest device-tree. As region 0 is reserved for grant table
space (always present), the indexes for extended regions are 1...N.
If extended regions could not be allocated for some reason,
Xen doesn't fail and behaves as usual, so only inserts region 0.
Please note the following limitations:
- The extended region feature is only supported for 64-bit domains
currently.
- The ACPI case is not covered.
***
The algorithm to choose extended regions for non-direct mapped
DomU is simpler in comparison with the algorithm for direct mapped
Dom0. We usually have a lot of unused space above 4GB, and might
have some unused space below 4GB (depends on guest memory size).
Try to allocate separate 2MB-aligned extended regions from the first
(below 4GB) and second (above 4GB) RAM banks taking into account
the maximum supported guest physical address space size and the amount
of memory assigned to the guest. The minimum size of an extended region
is the same as for Dom0 (64MB).
Please note, we introduce the fdt_property_reg_placeholder helper whose
purpose is to create N ranges that are zeroed. The interesting fact
is that libfdt already has fdt_property_placeholder(). But this was
introduced only in 2017, so there is a risk that some distros may not
ship the latest libfdt version. This is why we implement our own light
variant for now.
Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Acked-by: Ian Jackson <iwj@xenproject.org>
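Editorial sketch (not part of the commit): the per-bank selection of an extended region for a DomU can be modelled as follows. The 2MB alignment and 64MB minimum come from the commit message; the function names, the parameter layout and the bank values used in the usage note are assumptions for illustration.

```python
from typing import Optional, Tuple

MB = 1 << 20
REGION_ALIGN = 2 * MB       # extended regions are 2MB-aligned
REGION_MIN_SIZE = 64 * MB   # same minimum as for Dom0


def align_up(value: int, align: int) -> int:
    return (value + align - 1) & ~(align - 1)


def align_down(value: int, align: int) -> int:
    return value & ~(align - 1)


def extended_region(bank_base: int, bank_size: int, ram_in_bank: int,
                    gpaddr_bits: int) -> Optional[Tuple[int, int]]:
    """Return (base, size) of the unused tail of a RAM bank, or None.

    The bank is truncated to the guest physical address space size, the
    region starts after the RAM assigned to the guest in this bank, and
    regions smaller than the minimum are dropped.
    """
    bank_end = min(bank_base + bank_size, 1 << gpaddr_bits)
    start = align_up(bank_base + ram_in_bank, REGION_ALIGN)
    if start >= bank_end:
        return None
    size = align_down(bank_end - start, REGION_ALIGN)
    if size < REGION_MIN_SIZE:
        return None
    return start, size
```

With a hypothetical first bank at 1GB spanning 3GB and a guest using 1GB of it, the sketch returns a 2GB region starting at 2GB; if only 32MB of the bank were left, no region would be reported for that bank.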
xen/arm: Introduce gpaddr_bits field to struct xen_domctl_getdomaininfo
We need to pass info about maximum supported guest physical
address space size to the toolstack on Arm in order to properly
calculate the base and size of the extended region (safe range)
for the guest. The extended region is unused address space which
could be safely used by domain for foreign/grant mappings on Arm.
The extended region itself will be handled by the subsequent
patch.
Currently the same guest physical address space size is used
for all guests (p2m_ipa_bits variable on Arm, the x86 equivalent
is hap_paddr_bits).
Add an explicit padding after "gpaddr_bits" field and also
(while at it) after "domain" field.
Also make sure that full structure is cleared in all cases by
moving the clearing into getdomaininfo(). Currently it is only
cleared by the sysctl caller (and only once).
Please note, we do not need to bump XEN_DOMCTL_INTERFACE_VERSION
as a bump has already occurred in this release cycle. But we do
need to bump XEN_SYSCTL_INTERFACE_VERSION as the structure is
re-used in a sysctl.
Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
[hypervisor parts] Reviewed-by: Jan Beulich <jbeulich@suse.com>
Rahul Singh [Wed, 13 Oct 2021 20:28:43 +0000 (13:28 -0700)]
xen/arm: Add linux,pci-domain property for hwdom if not available.
If the property is not present in the device tree node for host bridge,
XEN, while creating the dtb for hwdom, will create this property and
assign the already allocated segment to the host bridge
so that XEN and Linux will have the same segment for the host bridges.
Luca Fancellu [Wed, 13 Oct 2021 14:52:02 +0000 (15:52 +0100)]
arm/docs: Clarify legacy DT bindings on UEFI
Since the introduction of UEFI boot for Xen, the legacy
compatible strings were not supported and the stub code
was checking only the presence of “multiboot,module” to
require the Xen UEFI configuration file or not.
The documentation was not updated to specify that behavior.
Add a phrase to docs/misc/arm/device-tree/booting.txt
to clarify it.
Michal Orzel [Wed, 13 Oct 2021 12:33:52 +0000 (14:33 +0200)]
xen: Expose the PMU to the guests
Add parameter vpmu to xl domain configuration syntax
to enable the access to PMU registers by disabling
the PMU traps (currently only for ARM).
The current status is that the PMU registers are not
virtualized and the physical registers are directly
accessible when this parameter is enabled. There is no
interrupt support and Xen will not save/restore the
register values on context switches.
According to the Arm ARM, section D7.1:
"The Performance Monitors Extension is common
to AArch64 operation and AArch32 operation."
That means we have an assurance that if PMU is
present in one exception state, it must also be
present in the other.
Please note that this feature is experimental.
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Signed-off-by: Julien Grall <julien@xen.org> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <iwj@xenproject.org>
Michal Orzel [Tue, 12 Oct 2021 08:13:22 +0000 (10:13 +0200)]
xen/arm: Check for PMU platform support
ID_AA64DFR0_EL1/ID_DFR0_EL1 registers provide
information about PMU support. Replace structure
dbg64/dbg32 with a union and fill in all the
register fields according to document:
ARM Architecture Registers (DDI 0595, 2021-06).
Add macros boot_dbg_feature64/boot_dbg_feature32
to check for a debug feature. Add macro
cpu_has_pmu to check for PMU support.
Any value higher than 0 and less than 15 means
that PMU is supported (we do not care about its
version for now).
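Editorial sketch (not part of the commit): the "higher than 0 and less than 15" rule can be expressed on the raw register value. Reading PMUVer from bits [11:8] of ID_AA64DFR0_EL1 follows the Arm architecture register description; the function names are hypothetical and do not mirror the macros added by the patch.

```python
def pmu_version_aarch64(id_aa64dfr0_el1: int) -> int:
    """Extract the 4-bit PMUVer field (bits 11:8) from ID_AA64DFR0_EL1."""
    return (id_aa64dfr0_el1 >> 8) & 0xF


def cpu_has_pmu(pmu_ver: int) -> bool:
    """PMU counts as supported for any field value strictly between 0 and 15."""
    return 0 < pmu_ver < 15
```

A field value of 0 (not implemented) or 15 is treated as "no PMU"; every other value is accepted without looking at the specific version.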
Michal Orzel [Tue, 12 Oct 2021 08:13:21 +0000 (10:13 +0200)]
xen+tools: Introduce XEN_SYSCTL_PHYSCAP_vpmu
Introduce flag XEN_SYSCTL_PHYSCAP_vpmu which
indicates whether the platform supports vPMU
functionality. Modify Xen and tools accordingly.
Take the opportunity and fix XEN_SYSCTL_PHYSCAP_vmtrace
definition in sysctl.h which wrongly uses (1 << 6)
instead of (1u << 6).
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Nick Rosbrook <rosbrookn@ainfosec.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Ian Jackson <iwj@xenproject.org> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Luca Fancellu [Mon, 11 Oct 2021 18:15:27 +0000 (19:15 +0100)]
arm/efi: Use dom0less configuration when using EFI boot
This patch introduces the support for dom0less configuration
when using UEFI boot on ARM. It permits the EFI boot to
continue if no dom0 kernel is specified but at least one domU
is found.
Introduce the new property "xen,uefi-binary" for device tree boot
module nodes that are subnodes of "xen,domain" compatible nodes.
The property holds a string containing the file name of the
binary that shall be loaded by the uefi loader from the filesystem.
Introduce a new call efi_check_dt_boot(...) called during EFI boot
that checks for modules to be loaded using device tree.
Architectures that don't support device tree don't have to
provide this function.
Update efi documentation about how to start a dom0less
setup using UEFI.
Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
[stefano: drop inline from efi_check_dt_boot] Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 4 Oct 2021 18:11:45 +0000 (19:11 +0100)]
x86/pv: Split pv_hypercall() in two
The is_pv_32bit_vcpu() conditionals hide four lfences, with two taken on any
individual path through the function. There is very little code common
between compat and native, and context-dependent conditionals predict very
badly for a period of time after context switch.
Move do_entry_int82() from pv/traps.c into pv/hypercall.c, allowing
_pv_hypercall() to be static and forced inline. The delta is:
add/remove: 0/0 grow/shrink: 1/1 up/down: 300/-282 (18)
Function old new delta
do_entry_int82 50 350 +300
pv_hypercall 579 297 -282
which is tiny, but the perf implications are large:
These are percentage improvements in raw TSC deltas for a xen_version
hypercall, with obvious outliers excluded. Therefore, it is an idealised best
case improvement.
The pv64 path uses `syscall`, while the pv32 path uses `int $0x82` so
necessarily has higher overhead. Therefore, dropping the lfences is less of
an overall improvement.
I don't know why the Naples pv32 improvement is so small, but I've double
checked the numbers and they're consistent. There's presumably something
we're doing which is a large overhead in the pipeline.
On the Intel side, both systems are writing to MSR_SPEC_CTRL on
entry/exit (SKX using the retrofitted microcode implementation, CFL-R using
the hardware implementation), while SKX is suffering further from XPTI for
Meltdown protection.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 12 Oct 2021 09:57:08 +0000 (11:57 +0200)]
VT-d: Tylersburg isoch DMAR unit with no TLB space
BIOSes, when enabling the dedicated DMAR unit for the sound device,
need to also set a non-zero number of TLB entries in a respective
system management register (VTISOCHCTRL). At least one BIOS is known
to fail to do so, causing the VT-d engine to deadlock when used.
Vaguely based on Linux's e0fc7e0b4b5e ("intel-iommu: Yet another BIOS
workaround: Isoch DMAR unit with no TLB space").
To limit message string redundancy, fold parts with the IGD quirk logic.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Tue, 12 Oct 2021 09:56:21 +0000 (11:56 +0200)]
VT-d: generalize and correct "iommu=no-igfx" handling
Linux's supposedly equivalent "intel_iommu=igfx_off" deals with any
graphics devices (not just Intel ones) while at the same time limiting
the effect to IOMMUs covering only graphics devices. Keying the decision
to leave translation disabled for an IOMMU to merely a magic SBDF tuple
was wrong in the first place - systems may very well have non-graphics
devices at 0000:00:02.0 (ordinary root ports commonly live there, for
example). Any use of igd_drhd_address (and hence is_igd_drhd()) needs
further qualification.
Introduce a new "graphics only" field in struct acpi_drhd_unit and set
it according to device scope parsing outcome. Replace the bad use of
is_igd_drhd() in iommu_enable_translation() by use of this new field.
While adding the new field also convert the adjacent include_all one to
"bool".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
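Editorial sketch (not part of the commit): the idea of a "graphics only" flag derived from device scope parsing can be modelled as a predicate over the PCI base classes of the devices a DRHD covers. The function name, the treatment of include_all units and the empty-scope case are assumptions for illustration; 0x03 is the PCI base class for display controllers.

```python
from typing import Iterable

PCI_BASE_CLASS_DISPLAY = 0x03


def is_gfx_only(scope_base_classes: Iterable[int], include_all: bool = False) -> bool:
    """Return True if every device in the unit's scope is a display device.

    Units that cover all remaining devices, or whose scope is empty,
    are never considered graphics only in this model.
    """
    classes = list(scope_base_classes)
    if include_all or not classes:
        return False
    return all(c == PCI_BASE_CLASS_DISPLAY for c in classes)
```

Keying the decision on what the unit actually covers, rather than on a fixed SBDF such as 0000:00:02.0, avoids misclassifying a unit whose scope contains a root port at that address.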
Jan Beulich [Tue, 12 Oct 2021 09:55:42 +0000 (11:55 +0200)]
x86/PV32: fix physdev_op_compat handling
The conversion of the original code failed to recognize that the 32-bit
compat variant of this (sorry, two different meanings of "compat" here)
needs to continue to invoke the compat handler, not the native one.
Arrange for this by adding yet another #define.
Affected functions (having existed prior to the introduction of the new
hypercall) are PHYSDEVOP_set_iobitmap and PHYSDEVOP_apic_{read,write}.
For all others the operand struct layout doesn't differ.
Fixes: 1252e2823117 ("x86/pv: Export pv_hypercall_table[] rather than working around it in several ways") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 12 Oct 2021 09:54:34 +0000 (11:54 +0200)]
AMD/IOMMU: consider hidden devices when flushing device I/O TLBs
Hidden devices are associated with DomXEN but usable by the
hardware domain. Hence they need flushing as well when all devices are
to have flushes invoked.
While there drop a redundant ATS-enabled check and constify the first
parameter of the involved function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Anthony PERARD [Tue, 12 Oct 2021 09:53:47 +0000 (11:53 +0200)]
build: avoid building arm/arm/*/head.o twice
head.o is being built twice, once because it is in $(ALL_OBJS) and a
second time because it is in $(extra-y) and thus it is rebuilt when
building "arch/arm/built_in.o".
Fix this by adding a dependency of "head.o" on the directory
"arch/arm/".
Also, we should avoid building objects that are in subdirectories, so
move the declaration in there. This doesn't change anything as
"arch/arm/built_in.o" depends on "arch/arm/$subarch/built_in.o" which
depends on $(extra-y), so we still need to depend on
"arch/arm/built_in.o".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
Anthony PERARD [Tue, 12 Oct 2021 09:48:46 +0000 (11:48 +0200)]
x86/mm: avoid building multiple .o from a single .c file
This replaces the use of a single .c file for multiple .o files by
creating multiple .c files including the first one.
There are quite a few issues with trying to build more than one object
file from a single source file: there is a duplication of the make
rules to generate those targets; there is an additional ".file" symbol
added in order to differentiate between the object files; and
tools/symbols has a heuristic to try to pick up the right ".file".
This patch adds new .c source files, which avoids the need to add a
second ".file" symbol and thus avoids the need to deal with those
issues.
Also remove __OBJECT_FILE__ from $(CC) command line as it isn't used
anywhere anymore. And remove the macro "build-intermediate" since the
generic rules for single targets can be used.
And rename the objects in mm/hap/ to remove the extra "level".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxl: Only map legacy PCI IRQs if they are supported
Arm's PCI passthrough implementation doesn't support legacy interrupts,
only MSI/MSI-X. This can be the case for other platforms too.
For that reason, introduce a new CONFIG_PCI_SUPP_LEGACY_IRQ, add it to
the CFLAGS, and compile the relevant code in the toolstack only if
applicable.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
[stefano: minor change to Makefile] Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Luca Fancellu [Mon, 11 Oct 2021 07:56:38 +0000 (08:56 +0100)]
arm/efi: Fix null pointer dereference
Fix for commit 60649d443dc395243e74d2b3e05594ac0c43cfe3,
which introduced a null pointer dereference when
fdt_node_offset_by_compatible is called with a null "fdt"
argument.
tools/console: use xenforeignmemory to map console ring
This patch replaces the usage of xc_map_foreign_range with
xenforeignmemory_map from the stable xenforeignmemory library. Note
there are still other uses of libxc functions which prevent removing
the dependency.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
docs: add references to Argo Linux driver sources and information
Add a section to the Argo design document to supply guidance on how to
enable Argo in Xen and where to obtain source code and documentation
for Argo device drivers for guest OSes, primarily from OpenXT.
Signed-off-by: Christopher Clark <christopher.w.clark@gmail.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jan Beulich [Mon, 11 Oct 2021 08:58:44 +0000 (10:58 +0200)]
x86/HVM: fix xsm_op for 32-bit guests
Like for PV, 32-bit guests need to invoke the compat handler, not the
native one.
Fixes: db984809d61b ("hvm: wire up domctl and xsm hypercalls") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
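The dispatch rule described in the fix above can be sketched as follows.
This is only an illustration under stated assumptions, not the Xen
hypercall table: the struct, the handler functions and their return
values are all hypothetical and exist only to show a 32-bit caller
being routed to the compat entry point instead of the native one.

```c
#include <stdbool.h>

/* Hypothetical handler type; not the real Xen hypercall signature. */
typedef long (*handler_fn)(unsigned long arg);

/* Hypothetical handlers that only report which path was taken. */
static long sketch_xsm_op_native(unsigned long arg) { (void)arg; return 64; }
static long sketch_xsm_op_compat(unsigned long arg) { (void)arg; return 32; }

/* One table entry carries both variants of the same operation. */
struct sketch_entry {
    handler_fn native;  /* used by 64-bit callers */
    handler_fn compat;  /* used by 32-bit callers */
};

static const struct sketch_entry sketch_xsm_entry = {
    sketch_xsm_op_native,
    sketch_xsm_op_compat,
};

/* Pick the handler that matches the caller's bitness. */
long sketch_dispatch(const struct sketch_entry *e, bool guest_is_32bit,
                     unsigned long arg)
{
    return (guest_is_32bit ? e->compat : e->native)(arg);
}
```

In this sketch the bug class being fixed corresponds to a table entry
whose compat slot points at the native handler, so a 32-bit caller's
arguments would be interpreted with the 64-bit layout.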
Jan Beulich [Mon, 11 Oct 2021 08:58:17 +0000 (10:58 +0200)]
x86/build: suppress EFI-related tool chain checks upon local $(MAKE) recursion
The xen-syms and xen.efi linking steps are serialized only when the
intermediate note.o file is necessary. Otherwise both may run in
parallel. This in turn means that the compiler / linker invocations to
create efi/check.o / efi/check.efi may also happen twice in parallel.
Obviously it's a bad idea to have multiple producers of the same output
race with one another - every once in a while one may e.g. observe
objdump: efi/check.efi: file format not recognized
We don't need this EFI related checking to occur when producing the
intermediate symbol and relocation table objects, and we have an easy
way of suppressing it: Simply pass in "efi-y=", overriding the
assignments done in the Makefile and thus forcing the tool chain checks
to be bypassed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
translate_noncontig() allocates a domheap page for the translated list
before calling allocate_optee_shm_buf(), which can fail for a number
of reasons. After such a failure we need to free the allocated page(s).
Another leak is possible if the same translate_noncontig() function
fails to get a domain page. In this case it should free the allocated
optee_shm_buf prior to exit. This will also free the allocated domheap page.
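The two error paths described above can be sketched as follows. This
is a minimal illustration under stated assumptions and not the OP-TEE
mediator code: malloc()/free() stand in for domheap allocation, and
every name here (track_alloc, sketch_shm_buf, translate_sketch, the
fail_* flags) is hypothetical. A counter of live allocations makes a
leak observable.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: number of allocations not yet freed. */
static int live_allocs;

static void *track_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        live_allocs++;
    return p;
}

static void track_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

/* Hypothetical buffer object that owns the page list once created. */
struct sketch_shm_buf {
    void *pages;
};

/* Freeing the buffer also frees the pages it owns. */
void sketch_free_shm_buf(struct sketch_shm_buf *buf)
{
    if (!buf)
        return;
    track_free(buf->pages);
    track_free(buf);
}

/* fail_buf and fail_map inject the two failures from the message. */
int translate_sketch(bool fail_buf, bool fail_map,
                     struct sketch_shm_buf **out)
{
    void *pages = track_alloc(4096);
    struct sketch_shm_buf *buf;

    if (!pages)
        return -1;

    buf = fail_buf ? NULL : track_alloc(sizeof(*buf));
    if (!buf) {
        /* First leak: the buffer does not exist, free pages directly. */
        track_free(pages);
        return -1;
    }
    buf->pages = pages;

    if (fail_map) {
        /* Second leak: free the buffer, which releases the pages too. */
        sketch_free_shm_buf(buf);
        return -1;
    }

    *out = buf;
    return 0;
}
```

On both failure paths the live allocation count returns to zero; on
success the caller owns the buffer and releases everything through
sketch_free_shm_buf().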