xenbits.xensource.com Git - people/andrewcoop/xen.git/log
people/andrewcoop/xen.git
7 weeks agoxen/mm: Exclude flushtlb.h from mm.h for x86 xen-tlb-clk
Andrew Cooper [Fri, 7 Mar 2025 14:24:23 +0000 (14:24 +0000)]
xen/mm: Exclude flushtlb.h from mm.h for x86

Various files pick up flushtlb.h transitively through mm.h.  Fix these, and
finally resolve the TODO in microcode/amd.c

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
v2:
 * hyperv/tlb.c as well

7 weeks agoxen/mm: Exclude flushtlb.h from mm.h for ARM
Andrew Cooper [Wed, 12 Mar 2025 16:37:09 +0000 (16:37 +0000)]
xen/mm: Exclude flushtlb.h from mm.h for ARM

A number of files pick up flushtlb.h transitively through mm.h, while the
flushtlb.h hierarchy itself isn't even self-sufficient.

Address all of these, and exclude flushtlb.h from mm.h.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
7 weeks agoxen/mm: Exclude flushtlb.h from mm.h for PPC and RISC-V
Andrew Cooper [Wed, 12 Mar 2025 15:28:33 +0000 (15:28 +0000)]
xen/mm: Exclude flushtlb.h from mm.h for PPC and RISC-V

put_page_alloc_ref(), the final function in xen/mm.h, uses test_and_clear_bit()
which is picked up transitively by all architectures.  RISC-V gets it only via
flushtlb.h, hence why it notices here.

ARM and x86 will be cleaned up in subsequent patches.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
7 weeks agoxen/arch: Strip out tlb-clock stubs for non-implementors
Andrew Cooper [Wed, 12 Mar 2025 14:37:25 +0000 (14:37 +0000)]
xen/arch: Strip out tlb-clock stubs for non-implementors

Now that there's a common stub implementation of TLB clocks, there's no need for
architectures to provide their own.

Repeatedly zeroing page->tlbflush_timestamp is no use, so provide an even more
empty common stub for page_set_tlbflush_timestamp().

No practical change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
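
As an illustration of the stub arrangement described in the commit above, here
is a minimal sketch of a config-gated helper.  page_set_tlbflush_timestamp()
and CONFIG_HAS_TLB_CLOCK are named in this log; struct page_info_sketch,
tlbflush_clock_sketch and the body of the non-stub branch are assumptions made
for illustration, not the actual Xen code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down page descriptor; only the field named in the commit. */
struct page_info_sketch {
    uint32_t tlbflush_timestamp;
};

#ifdef CONFIG_HAS_TLB_CLOCK
/* Toy stand-in for an architecture with a real TLB clock (assumption). */
static uint32_t tlbflush_clock_sketch = 1;

static inline void page_set_tlbflush_timestamp(struct page_info_sketch *pg)
{
    pg->tlbflush_timestamp = tlbflush_clock_sketch;
}
#else
/* Common stub: without a TLB clock there is nothing to record. */
static inline void page_set_tlbflush_timestamp(struct page_info_sketch *pg)
{
    (void)pg;
}
#endif
```

With CONFIG_HAS_TLB_CLOCK undefined, the call compiles to nothing and the
timestamp field is never written, which is the "even more empty" behaviour the
message describes.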
7 weeks agoxen/common: Split tlb-clock.h out of mm.h
Andrew Cooper [Wed, 12 Mar 2025 14:12:47 +0000 (14:12 +0000)]
xen/common: Split tlb-clock.h out of mm.h

xen/mm.h includes asm/tlbflush.h almost at the end, which creates a horrible
tangle.  This is in order to provide two common files with an abstraction over
the x86-specific TLB clock logic.

First, introduce CONFIG_HAS_TLB_CLOCK, selected by x86 only.  Next, introduce
xen/tlb-clock.h, providing empty stubs, and include this into memory.c and
page_alloc.c

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
There is still a mess here with the common vs x86 split, but it's better
contained than before.

7 weeks agoxen: Sort includes
Andrew Cooper [Fri, 7 Mar 2025 17:28:35 +0000 (17:28 +0000)]
xen: Sort includes

... needing later adjustment.  Drop types.h when it's clearly not needed.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
7 weeks agoxen/livepatch: Fix include hierarchy
Andrew Cooper [Fri, 7 Mar 2025 14:55:37 +0000 (14:55 +0000)]
xen/livepatch: Fix include hierarchy

xen/livepatch.h includes public/sysctl.h twice, which can be deduplicated, and
includes asm/livepatch.h meaning that each livepatch.c does not need to
include both.

Comment the #else and #endif cases to aid legibility.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
7 weeks agoxen/elfstructs: Include xen/types.h
Andrew Cooper [Fri, 7 Mar 2025 14:40:27 +0000 (14:40 +0000)]
xen/elfstructs: Include xen/types.h

elfstructs.h needs the stdint.h types.  Two headers arrange this manually, but
elf.h and livepatch.h do not, which breaks source files whose headers are
properly sorted.

elfstructs.h is used by tools too, so limit this to Xen only.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Michal Orzel <michal.orzel@amd.com>
CC: Jan Beulich <jbeulich@suse.com>
CC: Julien Grall <julien@xen.org>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Stefano Stabellini <sstabellini@kernel.org>
CC: Volodymyr Babchuk <Volodymyr_Babchuk@epam.com>
CC: Bertrand Marquis <bertrand.marquis@arm.com>
CC: Oleksii Kurochko <oleksii.kurochko@gmail.com>
CC: Shawn Anastasio <sanastasio@raptorengineering.com>
7 weeks agox86/iommu: avoid MSI address and data writes if IRT index hasn't changed
Roger Pau Monne [Fri, 7 Mar 2025 09:16:01 +0000 (10:16 +0100)]
x86/iommu: avoid MSI address and data writes if IRT index hasn't changed

Attempt to reduce the MSI entry writes, and the associated checking whether
memory decoding and MSI-X is enabled for the PCI device, when the MSI data
hasn't changed.

When using Interrupt Remapping the MSI entry will contain an index into
the remapping table, and it is in that remapping table that the MSI vector
and destination CPU are stored.  As such, when using interrupt remapping,
changes to the interrupt affinity shouldn't result in changes to the MSI
entry, and the MSI entry update can be avoided.

Signal from the IOMMU update_ire_from_msi hook whether the MSI data or
address fields have changed, and thus need writing to the device registers.
Such signaling is done by returning 1 from the function.  Otherwise
returning 0 means no update of the MSI fields, and thus no write
required.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
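
The return-value convention described above can be sketched as follows.  This
is a simplified model, not the Xen implementation: struct msi_msg_sketch,
update_ire_sketch(), set_affinity_sketch() and the address encoding are
hypothetical, and treating negative values as errors is an assumption.  Only
the "1 means changed, 0 means unchanged" convention comes from the commit
message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified MSI message as programmed into the device. */
struct msi_msg_sketch {
    uint64_t address;
    uint32_t data;
};

/*
 * Stand-in for the IOMMU hook.  With interrupt remapping, the message only
 * carries the remapping-table index, so it changes only when the index does.
 * Returns 1 if *msg changed, 0 if it did not.
 */
static int update_ire_sketch(struct msi_msg_sketch *msg, unsigned int irt_index)
{
    struct msi_msg_sketch new_msg = {
        /* Illustrative encoding only; not the real remappable format. */
        .address = UINT64_C(0xfee00000) | ((uint64_t)irt_index << 5),
        .data = 0,
    };

    if (new_msg.address == msg->address && new_msg.data == msg->data)
        return 0;

    *msg = new_msg;
    return 1;
}

static unsigned int device_writes; /* counts simulated device register writes */

static int set_affinity_sketch(struct msi_msg_sketch *msg, unsigned int irt_index)
{
    int rc = update_ire_sketch(msg, irt_index);

    if (rc < 0)
        return rc;        /* assumption: negative values are errors */
    if (rc > 0)
        device_writes++;  /* only now touch the MSI address/data registers */

    return 0;
}
```

An affinity change that keeps the same remapping index therefore costs no
device write in this model.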
7 weeks agox86/hvm: check return code of hvm_pi_update_irte when binding
Roger Pau Monne [Mon, 10 Mar 2025 17:13:52 +0000 (18:13 +0100)]
x86/hvm: check return code of hvm_pi_update_irte when binding

Consume the return code from hvm_pi_update_irte(), and propagate the error
back to the caller if hvm_pi_update_irte() fails.

Fixes: 35a1caf8b6b5 ('pass-through: update IRTE according to guest interrupt config changes')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 weeks agox86/vmx: fix posted interrupts usage of msi_desc->msg field
Roger Pau Monne [Mon, 10 Mar 2025 15:49:29 +0000 (16:49 +0100)]
x86/vmx: fix posted interrupts usage of msi_desc->msg field

The current usage of msi_desc->msg in vmx_pi_update_irte() will make the
field contain a translated MSI message, instead of the expected
untranslated one.  This breaks dump_msi(), which uses the data in
msi_desc->msg to print the interrupt details.

Fix this by introducing a dummy local msi_msg, and use it with
iommu_update_ire_from_msi().  vmx_pi_update_irte() relies on the MSI
message not changing, so there's no need to propagate the resulting msi_msg
to the hardware, and the contents can be ignored.

Additionally add a comment to clarify that msi_desc->msg must always
contain the untranslated MSI message.

Fixes: a5e25908d18d ('VT-d: introduce new fields in msi_desc to track binding with guest interrupt')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 weeks agoxen/page_alloc: Simplify domain_adjust_tot_pages
Alejandro Vallejo [Tue, 4 Mar 2025 11:10:00 +0000 (11:10 +0000)]
xen/page_alloc: Simplify domain_adjust_tot_pages

The logic has too many levels of indirection and it's very hard to
understand in its current form. Split it between the corner case where
the adjustment is bigger than the current claim and the rest to avoid 5
auxiliary variables.

Add a functional change to prevent negative adjustments from
re-increasing the claim. This has the nice side effect of avoiding
taking the heap lock here on every free.

While at it, fix incorrect field name in nearby comment.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
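
A rough sketch of the split described above, under stated assumptions: the
struct, the global counter and the lock counter are simplified stand-ins
(struct claim_domain_sketch, outstanding_claims_sketch and
heap_lock_taken_sketch are hypothetical names), and the real function operates
on struct domain under the appropriate locks.

```c
#include <assert.h>

/* Hypothetical, simplified per-domain counters. */
struct claim_domain_sketch {
    unsigned long tot_pages;
    unsigned long outstanding_pages;  /* this domain's remaining claim */
};

static unsigned long outstanding_claims_sketch; /* sum of all domains' claims */
static unsigned int heap_lock_taken_sketch;     /* where real code would lock */

static unsigned long adjust_tot_pages_sketch(struct claim_domain_sketch *d,
                                             long pages)
{
    d->tot_pages += pages;

    /* Frees never re-increase the claim, so no heap lock is needed for them. */
    if (d->outstanding_pages == 0 || pages <= 0)
        return d->tot_pages;

    heap_lock_taken_sketch++;

    if ((unsigned long)pages > d->outstanding_pages) {
        /* Corner case: the adjustment exceeds the claim, so drop it all. */
        outstanding_claims_sketch -= d->outstanding_pages;
        d->outstanding_pages = 0;
    } else {
        outstanding_claims_sketch -= pages;
        d->outstanding_pages -= pages;
    }

    return d->tot_pages;
}
```

The early return is what avoids taking the heap lock on every free in this
model.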
7 weeks agox86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD
Roger Pau Monne [Fri, 21 Feb 2025 11:34:49 +0000 (12:34 +0100)]
x86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD

The MMIO_CONF_BASE reports the base of the MCFG range on AMD systems.
Linux pre-6.14 is unconditionally attempting to read the MSR without a
safe MSR accessor, and since Xen doesn't allow access to it Linux reports
the following error:

unchecked MSR access error: RDMSR from 0xc0010058 at rIP: 0xffffffff8101d19f (xen_do_read_msr+0x7f/0xa0)
Call Trace:
 xen_read_msr+0x1e/0x30
 amd_get_mmconfig_range+0x2b/0x80
 quirk_amd_mmconfig_area+0x28/0x100
 pnp_fixup_device+0x39/0x50
 __pnp_add_device+0xf/0x150
 pnp_add_device+0x3d/0x100
 pnpacpi_add_device_handler+0x1f9/0x280
 acpi_ns_get_device_callback+0x104/0x1c0
 acpi_ns_walk_namespace+0x1d0/0x260
 acpi_get_devices+0x8a/0xb0
 pnpacpi_init+0x50/0x80
 do_one_initcall+0x46/0x2e0
 kernel_init_freeable+0x1da/0x2f0
 kernel_init+0x16/0x1b0
 ret_from_fork+0x30/0x50
 ret_from_fork_asm+0x1b/0x30

Such access is conditional on the presence of a device with PnP ID
"PNP0c01", which triggers the execution of the quirk_amd_mmconfig_area()
function.  Note that prior to commit 3fac3734c43a MSR accesses when running
as a PV guest would always use the safe variant, and thus silently handle
the #GP.

Fix by allowing access to the MSR on AMD systems for the hardware domain.

Write attempts to the MSR will still result in #GP for all domain types.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 weeks agox86/IDT: Fix IDT generation for INT $0x80
Andrew Cooper [Tue, 11 Mar 2025 21:13:33 +0000 (21:13 +0000)]
x86/IDT: Fix IDT generation for INT $0x80

When PV is enabled, entry_int80 needs to be DPL3, not DPL0.

This, combined with a QEMU bug which incorrectly calculates the error
code (fix submitted separately), causes the XSA-259 PoC to fail with:

  --- Xen Test Framework ---
  Environment: PV 64bit (Long mode 4 levels)
  XSA-259 PoC
  Error: Unexpected fault 0x800d0802, #GP[IDT[256]]
  Test result: ERROR

Fixes: 3da2149cf4dc ("x86/IDT: Generate bsp_idt[] at build time")
Reported-by: Luca Fancellu <luca.fancellu@arm.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
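
To illustrate why the gate's DPL matters for INT $0x80, here is a small sketch
of the attribute byte of an x86 gate descriptor and the privilege check applied
to software interrupts.  The helper names are hypothetical and this is not how
Xen builds bsp_idt[]; it only models the architectural rule that a software
INT n faults with #GP unless CPL <= gate DPL.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_PRESENT   0x80u  /* P bit (bit 7) */
#define GATE_TYPE_INTR 0x0eu  /* 64-bit interrupt gate type (bits 3:0) */

/* Build the type/attribute byte of an IDT gate for a given DPL (0-3). */
static uint8_t gate_attr_sketch(unsigned int dpl)
{
    return (uint8_t)(GATE_PRESENT | ((dpl & 3u) << 5) | GATE_TYPE_INTR);
}

/* A software INT n is allowed only if CPL <= DPL; otherwise the CPU raises #GP. */
static int int_n_allowed_sketch(uint8_t attr, unsigned int cpl)
{
    unsigned int dpl = (attr >> 5) & 3u;

    return cpl <= dpl;
}
```

A PV guest's userspace runs at CPL 3, so a DPL0 gate for vector 0x80 turns
every INT $0x80 into #GP in this model, while a DPL3 gate lets it through.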
7 weeks agodocs: add explanation for 'Resolves:'
Denis Mukhin [Tue, 11 Mar 2025 07:28:26 +0000 (07:28 +0000)]
docs: add explanation for 'Resolves:'

The 'Resolves:' tag may be used if the patch addresses one of the tickets
logged via Gitlab, to auto-close such a ticket when the patch gets merged.

Add documentation for the tag.

Resolves: https://gitlab.com/xen-project/xen/-/issues/199
Signed-off-by: Denis Mukhin <dmukhin@ford.com>
Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
7 weeks agoMISRA: Rephrase the deviation for Directive 4.10
Andrew Cooper [Tue, 4 Mar 2025 23:48:54 +0000 (23:48 +0000)]
MISRA: Rephrase the deviation for Directive 4.10

The use of "legitimately" mixes the concepts of "it was designed to do this"
and "it was correct to do this".

The latter in particular can go stale.  "intended" is a better way of phrasing
this.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 weeks agox86/P2M: don't include MMIO_DM in p2m_is_valid()
Jan Beulich [Tue, 11 Mar 2025 08:55:47 +0000 (09:55 +0100)]
x86/P2M: don't include MMIO_DM in p2m_is_valid()

MMIO_DM specifically marks pages which aren't valid, much like INVALID
does. Dropping the type from the predicate
- (conceptually) corrects _sh_propagate(), where the comment says that
  "something valid" is needed (the only call path not passing in RAM_RW
  would pass in INVALID_GFN along with MMIO_DM),
- is benign to the use in sh_page_fault(), where the subsequent
  mfn_valid() check would otherwise cause the same bail-out code path to
  be taken,
- is benign to all three uses in p2m_pt_get_entry(), as MMIO_DM entries
  will only ever yield non-present entries, which are being checked for
  earlier,
- is benign to sh_unshadow_for_p2m_change(), for the same reason,
- is benign to gnttab_transfer() with EPT not in use, again because
  MMIO_DM entries will only ever yield non-present entries, and
  INVALID_MFN is returned for those anyway by p2m_pt_get_entry().
- for gnttab_transfer() with EPT in use (conceptually) corrects the
  corner case of a page first being subject to XEN_DMOP_set_mem_type
  converting a RAM type to MMIO_DM (which retains the MFN in the entry),
  and then being subject to GNTTABOP_transfer, except that steal_page()
  would later make the operation fail unconditionally anyway.

While there also drop the unused (and otherwise now redundant)
p2m_has_emt().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
8 weeks agox86/P2M: correct old entry checking in p2m_remove_entry()
Jan Beulich [Tue, 11 Mar 2025 08:55:20 +0000 (09:55 +0100)]
x86/P2M: correct old entry checking in p2m_remove_entry()

Using p2m_is_valid() isn't quite right here. It expanding to RAM+MMIO,
the subsequent p2m_mmio_direct check effectively reduces its use to
RAM+MMIO_DM. Yet MMIO_DM entries, which are never marked present in the
page tables, won't pass the mfn_valid() check. It is, however, quite
plausible (and supported by the rest of the function) to permit
"removing" hole entries, i.e. in particular to convert MMIO_DM to
INVALID. Which leaves the original check to be against RAM (plus MFN
validity), while HOLE then instead wants INVALID_MFN to be passed in.

Furthermore, grant and foreign entries (together with RAM becoming
ANY_RAM) as well as BROKEN want the MFN checking, too.

All other types (i.e. MMIO_DIRECT and POD) want rejecting here rather
than skipping, for needing handling / accounting elsewhere.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
8 weeks agoPCI: drop pci_segments_init()
Jan Beulich [Tue, 11 Mar 2025 08:54:19 +0000 (09:54 +0100)]
PCI: drop pci_segments_init()

Have callers invoke pci_add_segment() directly instead: With radix tree
initialization moved out of the function, its name isn't quite
describing anymore what it actually does.

On x86 move the logic into __start_xen() itself, to reduce the risk of
re-introducing ordering issues like the one which was addressed by
26fe09e34566 ("radix-tree: introduce RADIX_TREE{,_INIT}()").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
8 weeks agoautomation/cirrus-ci: store xen/.config as an artifact
Roger Pau Monne [Mon, 10 Mar 2025 17:41:57 +0000 (18:41 +0100)]
automation/cirrus-ci: store xen/.config as an artifact

Always store xen/.config as an artifact, renamed to xen-config to match
the naming used in the Gitlab CI tests.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 weeks agox86/apic: remove delivery and destination mode fields from drivers
Roger Pau Monne [Thu, 6 Mar 2025 08:07:31 +0000 (09:07 +0100)]
x86/apic: remove delivery and destination mode fields from drivers

All local APIC drivers use physical destination and fixed delivery modes,
remove the fields from the genapic struct and simplify the logic.

No functional change intended.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 weeks agodocs: fix INTRODUCE description in xenstore.txt
Juergen Gross [Thu, 6 Mar 2025 07:47:52 +0000 (08:47 +0100)]
docs: fix INTRODUCE description in xenstore.txt

The description of the Xenstore INTRODUCE command is still referencing
xend. Fix that.

The <evtchn> description is starting with a grammatically wrong
sentence. Fix that.

While at it, make clear that the Xenstore implementation is allowed
to ignore the specified gfn and use the Xenstore reserved grant id
GNTTAB_RESERVED_XENSTORE instead.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 weeks agoxen/watchdog: Identify which domain watchdog fired
Andrew Cooper [Fri, 7 Mar 2025 14:24:42 +0000 (14:24 +0000)]
xen/watchdog: Identify which domain watchdog fired

When a watchdog fires, the domain is crashed and can't dump any state.

Xen allows a domain to have two separate watchdogs.  Therefore, for a
domain running multiple watchdogs (e.g. one based around network, one
for disk), it is important for diagnostics to know which watchdog
fired.

As the printk() is in a timer callback, this is a bit awkward to
arrange, but there are 12 spare bits in the bottom of the domain
pointer owing to its alignment.

Reuse these bits to encode the watchdog id too, so the one which fired
is identified when the domain is crashed.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
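
The pointer-packing trick described above can be sketched like this.  The
names (wd_pack() and friends) and the 4096-byte page size are assumptions for
illustration; the actual patch may encode the id differently.

```c
#include <assert.h>
#include <stdint.h>

#define WD_PAGE_SIZE 4096u               /* assumed page size */
#define WD_ID_MASK   (WD_PAGE_SIZE - 1)  /* the 12 spare low bits */

struct wd_domain;  /* opaque here: only the pointer value matters */

/* Combine a page-aligned domain pointer and a small id into one word. */
static void *wd_pack(struct wd_domain *d, unsigned int id)
{
    uintptr_t p = (uintptr_t)d;

    assert((p & WD_ID_MASK) == 0);  /* relies on page alignment */
    assert(id <= WD_ID_MASK);

    return (void *)(p | id);
}

/* Recover the domain pointer by masking off the low bits. */
static struct wd_domain *wd_unpack_domain(void *data)
{
    return (struct wd_domain *)((uintptr_t)data & ~(uintptr_t)WD_ID_MASK);
}

/* Recover the watchdog id from the low bits. */
static unsigned int wd_unpack_id(void *data)
{
    return (unsigned int)((uintptr_t)data & WD_ID_MASK);
}
```

A timer callback that receives only a single void * argument can then report
both the domain and which of its watchdogs fired.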
8 weeks agoxen/domain: Initialise the domain handle before inserting into the domlist
Andrew Cooper [Fri, 7 Mar 2025 16:38:26 +0000 (16:38 +0000)]
xen/domain: Initialise the domain handle before inserting into the domlist

As soon as the domain is in the domlist, it can be queried via various
means, ahead of being fully constructed.  Ensure it has the toolstack-given
UUID prior to becoming visible.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 weeks agoCI: Drop the now-obsolete 11-riscv64.dockerfile
Andrew Cooper [Fri, 7 Mar 2025 15:16:29 +0000 (15:16 +0000)]
CI: Drop the now-obsolete 11-riscv64.dockerfile

Fixes: bd9bda50553b ("automation: drop debian:11-riscv64 container")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 weeks agotools/libs: Make uselibs.mk more legible
Andrew Cooper [Mon, 8 Mar 2021 23:31:11 +0000 (23:31 +0000)]
tools/libs: Make uselibs.mk more legible

A few blank lines go a very long way.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Anthony PERARD <anthony.perard@vates.tech>
8 weeks agovpci: Add resizable bar support
Jiqian Chen [Mon, 24 Feb 2025 03:24:33 +0000 (11:24 +0800)]
vpci: Add resizable bar support

Some devices, like AMDGPU, support resizable bar capability,
but vpci of Xen doesn't support this feature, so they fail
to resize bars and then cause probing failure.

According to PCIe spec, each bar that supports resizing has
two registers, PCI_REBAR_CAP and PCI_REBAR_CTRL. So, add
handlers to support resizing the size of BARs.

Note that Xen will only trap PCI_REBAR_CTRL, as PCI_REBAR_CAP
is a read-only register and the hardware domain already gets
access to it without needing any setup.

Link: https://gitlab.com/xen-project/xen/-/issues/87
Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
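
As a hedged illustration of what a PCI_REBAR_CTRL handler has to decode, the
sketch below extracts the BAR index and requested size from a control register
value.  The field positions (BAR index in bits 2:0, encoded size in bits 13:8,
encoding 0 meaning 1 MiB) are my recollection of the PCIe Resizable BAR
capability and should be checked against the specification; the macro and
function names are hypothetical and are not the ones used by Xen's vPCI code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout of the Resizable BAR Control register. */
#define REBAR_CTRL_BAR_IDX_MASK   0x00000007u  /* bits 2:0: which BAR */
#define REBAR_CTRL_BAR_SIZE_MASK  0x00003f00u  /* bits 13:8: encoded size */
#define REBAR_CTRL_BAR_SIZE_SHIFT 8
#define REBAR_SIZE_BIAS           20           /* encoding 0 means 2^20 bytes */
#define REBAR_SIZE_MAX_ENC        43           /* 2^63 bytes; larger is invalid */

static unsigned int rebar_ctrl_bar_index(uint32_t ctrl)
{
    return ctrl & REBAR_CTRL_BAR_IDX_MASK;
}

/* Returns the requested BAR size in bytes, or 0 for an out-of-range encoding. */
static uint64_t rebar_ctrl_size_bytes(uint32_t ctrl)
{
    unsigned int enc = (ctrl & REBAR_CTRL_BAR_SIZE_MASK) >>
                       REBAR_CTRL_BAR_SIZE_SHIFT;

    if (enc > REBAR_SIZE_MAX_ENC)
        return 0;

    return UINT64_C(1) << (enc + REBAR_SIZE_BIAS);
}
```

A trap handler would use the decoded size to update its view of the BAR before
forwarding the write to hardware.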
8 weeks agoxen/arm: Factor out construct_hwdom()
Jason Andryuk [Mon, 10 Mar 2025 08:53:51 +0000 (09:53 +0100)]
xen/arm: Factor out construct_hwdom()

Factor out construct_hwdom() from construct_dom0().  This will be
re-used by the dom0less code when building a domain with the hardware
capability.

iommu_hwdom_init(d) is moved into construct_hwdom() which moves it after
kernel_probe().  kernel_probe() doesn't seem to depend on its setting.

Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 weeks agoxen/consoled: clean up console handling for PV shim
Denis Mukhin [Mon, 10 Mar 2025 08:53:11 +0000 (09:53 +0100)]
xen/consoled: clean up console handling for PV shim

There are a few places which check pv_shim console under CONFIG_PV_SHIM or
CONFIG_X86 in xen console driver.

Instead of inconsistent #ifdef-ing, introduce and use consoled_is_enabled() in
switch_serial_input() and __serial_rx().

PV shim case is fixed in __serial_rx() - should be under 'pv_shim &&
pv_console' check.

Signature of consoled_guest_{rx,tx} has changed so the errors can be logged
on the callsites.

Also, move get_initial_domain_id() to arch-independent header since it is now
required by console driver.

Lastly, add missing SPDX-License-Identifier to xen/consoled.h

Signed-off-by: Denis Mukhin <dmukhin@ford.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 weeks agotools/libs/store: use single_with_domid() in xs_get_domain_path()
Juergen Gross [Mon, 10 Mar 2025 08:52:54 +0000 (09:52 +0100)]
tools/libs/store: use single_with_domid() in xs_get_domain_path()

xs_get_domain_path() can be simplified by using single_with_domid().

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 weeks agotools/hvmloader: Replace LAPIC_ID() with cpu_to_apicid[]
Alejandro Vallejo [Mon, 10 Mar 2025 08:52:39 +0000 (09:52 +0100)]
tools/hvmloader: Replace LAPIC_ID() with cpu_to_apicid[]

Replace uses of the LAPIC_ID() macro with accesses to the
cpu_to_apicid[] lookup table. This table contains the APIC IDs of each
vCPU as probed at runtime rather than assuming a predefined relation.

Moved smp_initialise() ahead of apic_setup() in order to initialise
cpu_to_apicid ASAP and avoid using it uninitialised. Note that bringing
up the APs doesn't need the APIC in hvmloader because it always runs
virtualized and uses the PV interface.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 weeks agotools/hvmloader: Retrieve APIC IDs from the APs themselves
Alejandro Vallejo [Mon, 10 Mar 2025 08:52:30 +0000 (09:52 +0100)]
tools/hvmloader: Retrieve APIC IDs from the APs themselves

Make it so the APs expose their own APIC IDs in a lookup table (LUT). We
can use that LUT to populate the MADT, decoupling the algorithm that
relates CPU IDs and APIC IDs from hvmloader.

Modified the printf to also print the APIC ID of each CPU, as well as
fixing a (benign) wrong specifier being used for the vcpu id.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 weeks agodocs: hardware runners setup
Stefano Stabellini [Sat, 8 Mar 2025 00:57:44 +0000 (16:57 -0800)]
docs: hardware runners setup

Document how to set up a new hardware runner.

Signed-off-by: Victor Lira <VictorM.Lira@amd.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
8 weeks agox86/e820: Remove opencoded vendor/feature checks
Andrew Cooper [Thu, 6 Mar 2025 23:21:07 +0000 (23:21 +0000)]
x86/e820: Remove opencoded vendor/feature checks

We've already scanned features by the time init_e820() is called.  Remove the
cpuid() calls.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 weeks agox86/vlapic: Drop vlapic->esr_lock
Andrew Cooper [Thu, 28 Nov 2024 00:47:37 +0000 (00:47 +0000)]
x86/vlapic: Drop vlapic->esr_lock

The exact behaviour of LVTERR interrupt generation is implementation
specific.

 * Newer Intel CPUs generate an interrupt when pending_esr becomes
   nonzero.

 * Older Intel and all AMD CPUs generate an interrupt when any
   individual bit in pending_esr becomes nonzero.

Neither vendor documents their behaviour very well.  Xen implements
the per-bit behaviour and has done since support was added.

Importantly, the per-bit behaviour can be expressed using the atomic
operations available in the x86 architecture, whereas the
former (interrupt only on pending_esr becoming nonzero) cannot.

With vlapic->hw.pending_esr held outside of the main LAPIC register page,
it's much easier to use atomic operations.

Use xchg() in vlapic_reg_write(), and *set_bit() in vlapic_error().

The only interesting change is that vlapic_error() now needs to take a
single bit only, rather than a mask, but this is fine for all current
callers and foreseeable changes.

No change from a guest's perspective.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
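
A minimal sketch of the lock-free, per-bit scheme described above, written
with C11 atomics rather than Xen's own xchg() and *set_bit() primitives.  The
struct and function names are hypothetical, and the counter standing in for
LVTERR delivery is purely illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct vlapic_sketch {
    _Atomic uint32_t pending_esr;  /* errors accumulated since last ESR write */
    uint32_t esr;                  /* value visible to guest reads */
    unsigned int lvterr_raised;    /* stand-in for injecting LVTERR */
};

/* Record a single error bit; raise LVTERR only if that bit was clear before. */
static void vlapic_error_sketch(struct vlapic_sketch *v, unsigned int bit)
{
    uint32_t mask = UINT32_C(1) << bit;
    uint32_t old = atomic_fetch_or(&v->pending_esr, mask);

    if (!(old & mask))
        v->lvterr_raised++;
}

/* Guest write to ESR: atomically take the pending errors and latch them. */
static void vlapic_esr_write_sketch(struct vlapic_sketch *v)
{
    v->esr = atomic_exchange(&v->pending_esr, 0);
}
```

Because each error is a single bit, one atomic read-modify-write both records
it and tells the caller whether it was newly set, so no lock is needed.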
8 weeks agox86/vlapic: Fix handling of writes to APIC_ESR
Andrew Cooper [Thu, 28 Nov 2024 00:47:36 +0000 (00:47 +0000)]
x86/vlapic: Fix handling of writes to APIC_ESR

Xen currently presents APIC_ESR to guests as a simple read/write register.

This is incorrect.  The SDM states:

  The ESR is a write/read register. Before attempt to read from the ESR,
  software should first write to it. (The value written does not affect the
  values read subsequently; only zero may be written in x2APIC mode.) This
  write clears any previously logged errors and updates the ESR with any
  errors detected since the last write to the ESR.

Introduce a new pending_esr field in hvm_hw_lapic.

Update vlapic_error() to accumulate errors here, and extend vlapic_reg_write()
to discard the written value and transfer pending_esr into APIC_ESR.  Reads
are still as before.

Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
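
The difference between the two behaviours can be shown with a small model.
This is a sketch under assumptions: struct esr_model and the helpers are
hypothetical, and the "old" model simply treats the register as plain
read/write storage with errors ORed straight in.

```c
#include <assert.h>
#include <stdint.h>

struct esr_model {
    uint32_t esr;          /* value returned by guest reads */
    uint32_t pending_esr;  /* used by the new model only */
};

/* Old model: plain read/write register, errors land directly in ESR. */
static void old_error(struct esr_model *m, uint32_t err) { m->esr |= err; }
static void old_write(struct esr_model *m, uint32_t val) { m->esr = val; }

/* New model: errors accumulate separately; a write latches them into ESR. */
static void new_error(struct esr_model *m, uint32_t err) { m->pending_esr |= err; }
static void new_write(struct esr_model *m, uint32_t val)
{
    (void)val;  /* the written value is discarded */
    m->esr = m->pending_esr;
    m->pending_esr = 0;
}

/* LVTERR handler following the SDM: write ESR first, then read it. */
static uint32_t handler_old(struct esr_model *m) { old_write(m, 0); return m->esr; }
static uint32_t handler_new(struct esr_model *m) { new_write(m, 0); return m->esr; }
```

In the old model the handler's own write wipes out the error it was about to
read; in the new model the same sequence returns the logged error.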
8 weeks agox86/trampoline: Rename entrypoints
Andrew Cooper [Wed, 25 Sep 2024 14:28:04 +0000 (15:28 +0100)]
x86/trampoline: Rename entrypoints

... to be more concise, and to match our other entrypoints into Xen.

In acpi_sleep_prepare(), calculate bootsym_phys() once, which GCC seems
unwilling to do of its own accord.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 weeks agox86/vmx: Rewrite vmx_sync_pir_to_irr() to be more efficient
Andrew Cooper [Tue, 25 Jun 2024 16:23:12 +0000 (17:23 +0100)]
x86/vmx: Rewrite vmx_sync_pir_to_irr() to be more efficient

There are two issues.  First, pi_test_and_clear_on() pulls the cache-line to
the CPU and dirties it even if there's nothing outstanding, and second,
bitmap_for_each() is O(256) when O(8) would do, and would avoid multiple
atomic updates to the same IRR word.

Rewrite it from scratch, explaining what's going on at each step.

Bloat-o-meter reports 177 -> 145 (net -32), but real improvement is the
removal of calls to __find_{first,next}_bit() hidden behind bitmap_for_each().

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
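
The two points made above (check before dirtying the cache line, and work
word-at-a-time rather than bit-at-a-time) can be sketched as below.  This is
an assumption-laden model using C11 atomics: the descriptor layout, the 32-bit
word size (chosen to match the O(8) figure) and all names are hypothetical,
and the real function updates the vLAPIC IRR rather than a plain array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NR_VECTORS_SKETCH 256
#define PIR_WORDS_SKETCH  (NR_VECTORS_SKETCH / 32)  /* 8 words of 32 bits */

struct pi_desc_sketch {
    _Atomic uint32_t pir[PIR_WORDS_SKETCH];  /* posted interrupt requests */
    _Atomic uint32_t on;                     /* outstanding notification */
};

/* Moves pending bits from PIR into irr[]; returns the number of words moved. */
static unsigned int sync_pir_to_irr_sketch(struct pi_desc_sketch *pi,
                                           uint32_t irr[PIR_WORDS_SKETCH])
{
    unsigned int moved = 0;

    /* Read-only check first: nothing is written when nothing is pending. */
    if (!atomic_load(&pi->on))
        return 0;

    atomic_store(&pi->on, 0);

    for (unsigned int i = 0; i < PIR_WORDS_SKETCH; i++) {
        uint32_t pending;

        /* Skip empty words without issuing an atomic read-modify-write. */
        if (!atomic_load(&pi->pir[i]))
            continue;

        pending = atomic_exchange(&pi->pir[i], 0);
        irr[i] |= pending;  /* a single update per IRR word */
        moved++;
    }

    return moved;
}
```

The loop runs at most eight times regardless of how many vectors are pending,
and each non-empty word costs one exchange and one IRR update.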
8 weeks agoxen/domain: Annotate struct domain as page aligned
Andrew Cooper [Tue, 18 Feb 2025 23:01:11 +0000 (23:01 +0000)]
xen/domain: Annotate struct domain as page aligned

struct domain is always a page aligned allocation.  Update its type to
reflect this, so we can safely reuse the lower bits in the pointer for
auxiliary information.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 weeks agox86/traps: Convert pv_trap_init() to being an initcall
Andrew Cooper [Fri, 3 Jan 2025 17:17:38 +0000 (17:17 +0000)]
x86/traps: Convert pv_trap_init() to being an initcall

With most of pv_trap_init() being done at build time, opening of NMI_SOFTIRQ
can be a regular initcall, simplifying trap_init().

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 weeks agox86/IDT: Don't rewrite bsp_idt[] at boot time
Andrew Cooper [Fri, 3 Jan 2025 15:16:45 +0000 (15:16 +0000)]
x86/IDT: Don't rewrite bsp_idt[] at boot time

Now that bsp_idt[] is constructed at build time, we do not need to manually
initialise it in init_idt_traps() and trap_init().

When swapping the early pagefault handler for the normal one, switch to using
_update_gate_addr_lower() as we do on the kexec path for NMI and #MC.

This in turn allows us to drop set_{intr,swint}_gate() and the underlying
infrastructure.  It also lets us drop autogen_entrypoints[] and that
underlying infrastructure.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 weeks agox86/IDT: Generate bsp_idt[] at build time
Andrew Cooper [Fri, 3 Jan 2025 14:44:19 +0000 (14:44 +0000)]
x86/IDT: Generate bsp_idt[] at build time

... rather than dynamically at boot time.  Aside from less runtime overhead,
this approach is less fragile than the preexisting autogen stubs mechanism.

We can manage this with some linker calculations.  See patch comments for full
details.

For simplicity, we create a new set of entry stubs here, and clean up the old
ones in the subsequent patch.  bsp_idt[] needs to move from .bss to .data.

No functional change yet; the boot path still (re)writes bsp_idt[] at this
juncture.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 weeks agox86/IDT: Make idt_tables[] be per_cpu(idt)
Andrew Cooper [Thu, 2 Jan 2025 17:47:24 +0000 (17:47 +0000)]
x86/IDT: Make idt_tables[] be per_cpu(idt)

This can be a plain per_cpu() variable, and __read_mostly seeing as it's
allocated once and not touched again during its lifetime.

This removes an NR_CPUS-sized structure, and improves NUMA locality of access
for both the VT-x and SVM context switch paths.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 weeks agox86/IDT: Rename idt_table[] to bsp_idt[]
Andrew Cooper [Thu, 2 Jan 2025 17:17:30 +0000 (17:17 +0000)]
x86/IDT: Rename idt_table[] to bsp_idt[]

Having variables named idt_table[] and idt_tables[] is not ideal.

Use X86_IDT_VECTORS and remove IDT_ENTRIES.  State the size of bsp_idt[] in
idt.h so that load_system_tables() and cpu_smpboot_alloc() can use sizeof()
rather than opencoding the calculation.
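
A minimal sketch of the sizeof() point, assuming a 16-byte gate descriptor.
The idt_entry_t layout and the idt_bytes*() helpers are simplified stand-ins
for illustration, not Xen's actual definitions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X86_IDT_VECTORS 256

/* Simplified stand-in for a 64-bit IDT gate descriptor (assumed 16 bytes). */
typedef struct {
    uint64_t lo, hi;
} idt_entry_t;

static_assert(sizeof(idt_entry_t) == 16, "gate descriptor must be 16 bytes");

/* Stating the array size in the declaration lets every user apply sizeof(). */
extern idt_entry_t bsp_idt[X86_IDT_VECTORS];
idt_entry_t bsp_idt[X86_IDT_VECTORS];

/* Open-coded form which the sized declaration makes unnecessary. */
static inline size_t idt_bytes_opencoded(void)
{
    return X86_IDT_VECTORS * sizeof(idt_entry_t);
}

/* Preferred form: the compiler derives the size from the declaration. */
static inline size_t idt_bytes(void)
{
    return sizeof(bsp_idt);
}
```

Both helpers yield the same value; the second one cannot drift out of sync
if the element type or vector count changes.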

Move the variable into a new traps-setup.c, to make a start at splitting
traps.c in half.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 weeks agoxen/public: Fix documentation of VIRQs
Andrew Cooper [Fri, 7 Mar 2025 11:34:04 +0000 (11:34 +0000)]
xen/public: Fix documentation of VIRQs

It has been discovered that VIRQ_ARGO is a 3rd type of VIRQ.  Also, recent
work has prevented global VIRQs from being stolen from the owning domain.

Rewrite the description of VIRQ classifications.  Drop the (DOM0) comment from
the global VIRQs; it's not been true for a long time.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 weeks agoxen/events: fix global virq handling
Juergen Gross [Fri, 7 Mar 2025 10:11:41 +0000 (11:11 +0100)]
xen/events: fix global virq handling

VIRQs are split into "global" and "per vcpu" ones. Unfortunately in
reality there are "per domain" ones, too.

send_global_virq() and set_global_virq_handler() only make sense for
the real "global" ones, so replace virq_is_global() with a new
function get_virq_type() returning one of the 3 possible types (global,
domain, vcpu VIRQ).

To make its intended purpose more clear, also rename
send_guest_global_virq() to send_guest_domain_virq().
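
A rough sketch of the three-way classification, under stated assumptions:
the enum names, the VIRQ numbers and may_set_global_handler() are
illustrative only, and the real classification in Xen covers more VIRQs:

```c
#include <assert.h>

/* Illustrative VIRQ numbers; the public headers hold the real ones. */
#define VIRQ_TIMER    0
#define VIRQ_CONSOLE  2
#define VIRQ_ARGO     11

/* Hypothetical names for the 3 possible types. */
enum virq_type {
    VIRQ_GLOBAL,
    VIRQ_DOMAIN,
    VIRQ_VCPU,
};

static enum virq_type get_virq_type(unsigned int virq)
{
    switch ( virq )
    {
    case VIRQ_CONSOLE:
        return VIRQ_GLOBAL;

    case VIRQ_ARGO:
        return VIRQ_DOMAIN;

    default:
        return VIRQ_VCPU;
    }
}

/* Only real global VIRQs may have their handling domain changed. */
static int may_set_global_handler(unsigned int virq)
{
    return get_virq_type(virq) == VIRQ_GLOBAL;
}
```

A boolean virq_is_global() cannot express the middle case, which is why a
type-returning function is the more natural interface here.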

Fixes: 980822c5edd1 ("xen/events: allow setting of global virq handler only for unbound virqs")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 weeks agoxen/events: fix get_global_virq_handler() usage without hardware domain
Juergen Gross [Thu, 6 Mar 2025 16:23:36 +0000 (17:23 +0100)]
xen/events: fix get_global_virq_handler() usage without hardware domain

Some use cases of get_global_virq_handler() didn't account for the
case of running without hardware domain.

Fix that by testing get_global_virq_handler() returning NULL where
needed (e.g. when directly dereferencing the result).

Fixes: 980822c5edd1 ("xen/events: allow setting of global virq handler only for unbound virqs")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoXSM: correct xsm_get_domain_state()
Jan Beulich [Thu, 6 Mar 2025 14:21:52 +0000 (15:21 +0100)]
XSM: correct xsm_get_domain_state()

Add the missing first parameter and move it next to a close relative.

Fixes: 3ad3df1bd0aa ("xen: add new domctl get_domain_state")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 months agoRevert "EFI: Avoid crash calling PrintErrMesg() from efi_multiboot2()"
Jan Beulich [Thu, 6 Mar 2025 14:20:39 +0000 (15:20 +0100)]
Revert "EFI: Avoid crash calling PrintErrMesg() from efi_multiboot2()"

This reverts commit eaed0d185ab8b73cd18ac2830878520b3011f5ab. It breaks the
build with old Clang (3.8).

2 months agoconfig: update Mini-OS commit
Juergen Gross [Thu, 6 Mar 2025 13:54:50 +0000 (14:54 +0100)]
config: update Mini-OS commit

Update the Mini-OS upstream revision.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/public: add missing Xenstore commands to xs_wire.h
Juergen Gross [Thu, 6 Mar 2025 13:03:51 +0000 (14:03 +0100)]
xen/public: add missing Xenstore commands to xs_wire.h

The GET_FEATURE, SET_FEATURE, GET_QUOTA and SET_QUOTA Xenstore commands
are defined in docs/misc/xenstore.txt, but they are missing in
xs_wire.h.

Add the missing commands to xs_wire.h

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/public: remove some unused defines from xs_wire.h
Juergen Gross [Thu, 6 Mar 2025 13:03:37 +0000 (14:03 +0100)]
xen/public: remove some unused defines from xs_wire.h

xs_wire.h contains some defines XS_WRITE_* which seem to be leftovers
from some decades ago. They haven't been used in the Xen tree since at
least Xen 2.0 and they make no sense anyway.

Remove them, as they seem not to be related to any Xen interface we
have today.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoRISCV/bitops: Use Zbb to provide arch-optimised bitops
Andrew Cooper [Thu, 6 Mar 2025 13:03:15 +0000 (14:03 +0100)]
RISCV/bitops: Use Zbb to provide arch-optimised bitops

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
2 months agoxen/riscv: identify specific ISA supported by cpu
Oleksii Kurochko [Thu, 6 Mar 2025 13:02:51 +0000 (14:02 +0100)]
xen/riscv: identify specific ISA supported by cpu

Supported ISA extensions are specified in the device tree within the CPU
node, using two properties: `riscv,isa-extensions` and `riscv,isa`.

Currently, Xen does not support the `riscv,isa-extensions` property; support
for it will be added in the future.

The `riscv,isa` property is parsed for each CPU, and the common extensions
are stored in the `host_riscv_isa` bitmap.
This bitmap is then used by `riscv_isa_extension_available()` to check
if a specific extension is supported.

The current implementation is based on Linux kernel v6.12-rc3
implementation with the following changes:
 - Drop unconditional setting of {RISCV_ISA_EXT_ZICSR,
   RISCV_ISA_EXT_ZIFENCEI, RISCV_ISA_EXT_ZICNTR, RISCV_ISA_EXT_ZIHPM} because
   Xen is going to run on hardware produced after the aforementioned
   extensions were split out of "i".
 - Remove saving of the ISA for each CPU, only the common available ISA is
   saved.
 - Remove ACPI-related code as ACPI is not supported by Xen.
 - Drop handling of elf_hwcap, since Xen does not provide hwcap to
   userspace.
 - Replace of_cpu_device_node_get() API, which is not available in Xen,
   with a combination of dt_for_each_child_node(), dt_device_type_is_equal(),
   and dt_get_cpuid_from_node() to retrieve cpuid and riscv,isa in
   riscv_fill_hwcap_from_isa_string().
 - Rename arguments of __RISCV_ISA_EXT_DATA() from _name to ext_name, and
   _id to ext_id for clarity.
 - Replace instances of __RISCV_ISA_EXT_DATA with RISCV_ISA_EXT_DATA.
 - Replace instances of __riscv_isa_extension_available with
   riscv_isa_extension_available for consistency. Also, update the type of
   `bit` argument of riscv_isa_extension_available().
 - Redefine RISCV_ISA_EXT_DATA() to work only with ext_name and ext_id,
   as other fields are not used in Xen currently. Also RISCV_ISA_EXT_DATA()
   is reworked in the way to take only one argument `ext_name`.
 - Add a check of the first 4 letters of the riscv,isa string to
   riscv_isa_parse_string(), as Xen doesn't do this check earlier, so it is
   necessary to check the correctness of the riscv,isa string (it should start
   with rv{32,64}, taking into account upper and lower case of "rv").
   Additionally, check that 'i' follows 'rv{32,64}' to be sure that
   `out_bitmap` can't be empty.
 - Drop an argument of riscv_fill_hwcap() and riscv_fill_hwcap_from_isa_string()
   as it isn't used, at the moment.
 - Update the comment message about QEMU workaround.
 - Apply Xen coding style.
 - s/pr_info/printk.
 - Drop handling of uppercase letters of riscv,isa in riscv_isa_parse_string(), as
   Xen checks that riscv,isa is in lowercase, according to the device tree
   bindings.
 - Update the logic of riscv_isa_parse_string(): it now stops parsing riscv,isa
   when an illegal symbol is found, instead of ignoring it.
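
The parsing rules listed above can be sketched roughly as follows.  This is
a simplified illustration, not the code from the patch: it handles only the
single-letter part of the string, and isa_bitmap_t, isa_parse_string(),
isa_common() and isa_extension_available() are hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One bit per single-letter extension: 'a' -> bit 0 ... 'z' -> bit 25. */
typedef uint32_t isa_bitmap_t;

/*
 * Parse the single-letter part of a riscv,isa string such as "rv64imac".
 * Fails unless the string starts with lowercase rv32i/rv64i.  Parsing stops
 * at the first character which is not a lowercase letter; multi-letter
 * extensions after '_' are out of scope for this sketch.
 */
static bool isa_parse_string(const char *isa, isa_bitmap_t *out)
{
    if ( strncmp(isa, "rv32", 4) && strncmp(isa, "rv64", 4) )
        return false;

    if ( isa[4] != 'i' )
        return false;

    *out = 0;
    for ( isa += 4; *isa >= 'a' && *isa <= 'z'; isa++ )
        *out |= (isa_bitmap_t)1 << (*isa - 'a');

    return true;
}

/* Keep only the extensions which every CPU reports. */
static bool isa_common(const char *const *cpus, unsigned int nr,
                       isa_bitmap_t *host)
{
    isa_bitmap_t this_cpu;

    *host = ~(isa_bitmap_t)0;
    for ( unsigned int i = 0; i < nr; i++ )
    {
        if ( !isa_parse_string(cpus[i], &this_cpu) )
            return false;
        *host &= this_cpu;
    }

    return true;
}

static bool isa_extension_available(isa_bitmap_t map, char ext)
{
    return map & ((isa_bitmap_t)1 << (ext - 'a'));
}
```

Intersecting the per-CPU bitmaps means an extension is reported as available
only when all CPUs advertise it, which matches the "common extensions"
wording above.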

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/riscv: make zbb as mandatory
Oleksii Kurochko [Thu, 6 Mar 2025 13:01:53 +0000 (14:01 +0100)]
xen/riscv: make zbb as mandatory

According to riscv/booting.txt, it is expected that Zbb should be supported.

Drop ANDN_INSN() in asm/cmpxchg.h, as Zbb is now mandatory and the `andn`
instruction can be used directly.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/riscv: drop CONFIG_RISCV_ISA_RV64G
Oleksii Kurochko [Thu, 6 Mar 2025 13:01:26 +0000 (14:01 +0100)]
xen/riscv: drop CONFIG_RISCV_ISA_RV64G

'G' stands for "imafd_zicsr_zifencei".

Extensions 'f' and 'd' aren't really needed for Xen, and allowing floating
point registers to be used can lead to crashes.

Extensions 'i', 'm', 'a', 'zicsr', and 'zifencei' are necessary for the
operation of Xen, which is why they are used explicitly (unconditionally)
in -march.

Drop "Base ISA" choice from riscv/Kconfig as it is always empty.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoautomation: drop debian:11-riscv64 container
Oleksii Kurochko [Thu, 6 Mar 2025 13:01:07 +0000 (14:01 +0100)]
automation: drop debian:11-riscv64 container

There are two reasons for that:
1. In the README, GCC baseline is chosen to be 12.2, whereas Debian 11
   uses GCC 10.2.1.
2. Xen requires some Z extensions as mandatory, but GCC 10.2.1 does not
   support Z extensions in -march, causing the compilation to fail.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 months agoVMX: convert vmx_vmfunc
Jan Beulich [Thu, 6 Mar 2025 13:00:25 +0000 (14:00 +0100)]
VMX: convert vmx_vmfunc

... to a field in the capability/controls struct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_ept_vpid_cap
Jan Beulich [Thu, 6 Mar 2025 12:59:56 +0000 (13:59 +0100)]
VMX: convert vmx_ept_vpid_cap

... to fields in the capability/controls struct: Take the opportunity
and split the two halves into separate EPT and VPID fields.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_vmentry_control
Jan Beulich [Thu, 6 Mar 2025 12:59:30 +0000 (13:59 +0100)]
VMX: convert vmx_vmentry_control

... to a field in the capability/controls struct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_vmexit_control
Jan Beulich [Thu, 6 Mar 2025 12:59:09 +0000 (13:59 +0100)]
VMX: convert vmx_vmexit_control

... to a field in the capability/controls struct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_tertiary_exec_control
Jan Beulich [Thu, 6 Mar 2025 12:58:47 +0000 (13:58 +0100)]
VMX: convert vmx_tertiary_exec_control

... to a field in the capability/controls struct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_secondary_exec_control
Jan Beulich [Thu, 6 Mar 2025 12:58:24 +0000 (13:58 +0100)]
VMX: convert vmx_secondary_exec_control

... to a field in the capability/controls struct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_cpu_based_exec_control
Jan Beulich [Thu, 6 Mar 2025 12:58:04 +0000 (13:58 +0100)]
VMX: convert vmx_cpu_based_exec_control

... to a field in the capability/controls struct.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_pin_based_exec_control
Jan Beulich [Thu, 6 Mar 2025 12:57:41 +0000 (13:57 +0100)]
VMX: convert vmx_pin_based_exec_control

... to a field in the capability/controls struct. Use an instance of
that struct also in vmx_init_vmcs_config().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: convert vmx_basic_msr
Jan Beulich [Thu, 6 Mar 2025 12:57:21 +0000 (13:57 +0100)]
VMX: convert vmx_basic_msr

... to a struct field, which is then going to be accompanied by other
capability/control data presently living in individual variables. As
this structure isn't supposed to be altered post-boot, put it in
.data.ro_after_init right away.

Suggested-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoVMX: drop vmcs_revision_id
Jan Beulich [Thu, 6 Mar 2025 12:56:49 +0000 (13:56 +0100)]
VMX: drop vmcs_revision_id

It's effectively redundant with vmx_basic_msr. For the #define
replacement to work, struct vmcs_struct's respective field name also
needs to change: Drop the not really meaningful "vmcs_" prefix from it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agox86/HVM: improve CET-IBT pruning of ENDBR
Jan Beulich [Thu, 6 Mar 2025 12:56:21 +0000 (13:56 +0100)]
x86/HVM: improve CET-IBT pruning of ENDBR

__init{const,data}_cf_clobber can have an effect only for pointers
actually populated in the respective tables. While not the case for SVM
right now, VMX installs a number of pointers only under certain
conditions. Hence the respective functions would have their ENDBR purged
only when those conditions are met. Invoke "pruning" functions after
having copied the respective tables, for them to install any "missing"
pointers.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agotools/xenstored: use new stable interface instead of libxenctrl
Juergen Gross [Thu, 6 Mar 2025 12:54:55 +0000 (13:54 +0100)]
tools/xenstored: use new stable interface instead of libxenctrl

Replace the current use of the unstable xc_domain_getinfo_single()
interface with the stable domctl XEN_DOMCTL_get_domain_state call
via the new libxenmanage library.

This will remove the last usage of libxenctrl by Xenstore, so update
the library dependencies accordingly.

For now only do a direct replacement without using the functionality
of obtaining information about domains having changed the state.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
2 months agotools/libs: add a new libxenmanage library
Juergen Gross [Thu, 6 Mar 2025 12:53:56 +0000 (13:53 +0100)]
tools/libs: add a new libxenmanage library

In order to have a stable interface in user land for using stable
domctl and possibly later sysctl interfaces, add a new library
libxenmanage.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
2 months agoxen: add new domctl get_domain_state
Juergen Gross [Thu, 6 Mar 2025 12:52:38 +0000 (13:52 +0100)]
xen: add new domctl get_domain_state

Add a new domctl sub-function to get data of a domain having changed
state (this is needed by Xenstore).

The returned state just contains the domid, the domain unique id,
and some flags (existing, shutdown, dying).

In order to enable Xenstore stubdom being built for multiple Xen
versions, make this domctl stable.  For stable domctls the
interface_version is always 0.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen: add bitmap to indicate per-domain state changes
Juergen Gross [Thu, 6 Mar 2025 12:52:14 +0000 (13:52 +0100)]
xen: add bitmap to indicate per-domain state changes

Add a bitmap with one bit per possible domid indicating the respective
domain has changed its state (created, deleted, dying, crashed,
shutdown).

Registering the VIRQ_DOM_EXC event will result in setting the bits for
all existing domains and resetting all other bits.

As the usage of this bitmap is tightly coupled with the VIRQ_DOM_EXC
event, it is meant to be used only by a single consumer in the system,
just like the VIRQ_DOM_EXC event.

Resetting a bit will be done in a future patch.

This information is needed for Xenstore to keep track of all domains.
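
A minimal sketch of such a bitmap, assuming 0x7ff0 valid domids.  The names
dom_state_changed, mark_changed(), has_changed() and consumer_registered()
are hypothetical, and locking is omitted:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define NR_DOMIDS      0x7ff0u   /* assumed number of valid domids */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS       ((NR_DOMIDS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per possible domid: set when the domain has changed state. */
static unsigned long dom_state_changed[NR_WORDS];

static void mark_changed(unsigned int domid)
{
    dom_state_changed[domid / BITS_PER_WORD] |= 1UL << (domid % BITS_PER_WORD);
}

static bool has_changed(unsigned int domid)
{
    return dom_state_changed[domid / BITS_PER_WORD] &
           (1UL << (domid % BITS_PER_WORD));
}

/* On consumer registration: set bits for existing domains, clear the rest. */
static void consumer_registered(const unsigned int *existing, unsigned int nr)
{
    memset(dom_state_changed, 0, sizeof(dom_state_changed));
    for ( unsigned int i = 0; i < nr; i++ )
        mark_changed(existing[i]);
}
```

Starting from "every existing domain has changed" lets a freshly registered
consumer discover all domains with the same code path it later uses for
incremental updates.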

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/events: allow setting of global virq handler only for unbound virqs
Juergen Gross [Thu, 6 Mar 2025 12:51:55 +0000 (13:51 +0100)]
xen/events: allow setting of global virq handler only for unbound virqs

XEN_DOMCTL_set_virq_handler will happily steal a global virq from the
current domain having bound it and assign it to another domain. The
former domain will just never receive any further events for that
virq without knowing what happened.

Change the behavior to allow XEN_DOMCTL_set_virq_handler only if the
virq in question is not bound by the current domain allowed to use it.

Currently the only user of XEN_DOMCTL_set_virq_handler in the Xen code
base is init-xenstore-domain, so changing the behavior like above will
not cause any problems.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
2 months agoxen/events: don't allow binding a global virq from any domain
Juergen Gross [Thu, 6 Mar 2025 12:51:35 +0000 (13:51 +0100)]
xen/events: don't allow binding a global virq from any domain

Today Xen will happily allow binding a global virq by a domain which
isn't configured to receive it. This won't result in any bad actions,
but the bind will appear to have succeeded with no event ever being
received by that event channel.

Instead of allowing the bind, error out if the domain isn't set to
handle that virq. Note that this check is inside the write_lock() on
purpose, as a future patch will put a related check into
set_global_virq_handler() with the addition of using the same lock.
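
As an illustration only, the new check could look something like the
following.  The table, the error codes and bind_global_virq() are
assumptions made for this sketch, and the locking mentioned above is
left out:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_VIRQS 24

struct domain { int id; };   /* placeholder for the real structure */

/* Handling domain per global VIRQ; NULL means none is configured. */
static struct domain *global_virq_handlers[NR_VIRQS];

/*
 * Refuse the bind unless d is the domain set to handle this VIRQ.
 * The choice of -EBUSY is an assumption of this sketch.
 */
static int bind_global_virq(struct domain *d, unsigned int virq)
{
    if ( virq >= NR_VIRQS )
        return -EINVAL;

    if ( global_virq_handlers[virq] != d )
        return -EBUSY;

    return 0;
}
```

Failing the bind makes the misconfiguration visible to the caller instead of
leaving it with an event channel which never fires.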

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
2 months agoEFI: Avoid crash calling PrintErrMesg() from efi_multiboot2()
Frediano Ziglio [Thu, 6 Mar 2025 12:51:01 +0000 (13:51 +0100)]
EFI: Avoid crash calling PrintErrMesg() from efi_multiboot2()

Although the code is compiled with the -fpic option, data is not position
independent. This causes data pointers to become invalid if the
code is not relocated properly, which is what happens for
efi_multiboot2, which is called by the multiboot entry code.

Code tested adding
   PrintErrMesg(L"Test message", EFI_BUFFER_TOO_SMALL);
in efi_multiboot2 before calling efi_arch_edd (this function
can potentially call PrintErrMesg).

Before the patch (XenServer installation on Qemu, xen replaced
with vanilla xen.gz):
  Booting `XenServer (Serial)'Booting `XenServer (Serial)'
  Test message: !!!! X64 Exception Type - 0E(#PF - Page-Fault)  CPU Apic ID - 00000000 !!!!
  ExceptionData - 0000000000000000  I:0 R:0 U:0 W:0 P:0 PK:0 SS:0 SGX:0
  RIP  - 000000007EE21E9A, CS  - 0000000000000038, RFLAGS - 0000000000210246
  RAX  - 000000007FF0C1B5, RCX - 0000000000000050, RDX - 0000000000000010
  RBX  - 0000000000000000, RSP - 000000007FF0C180, RBP - 000000007FF0C210
  RSI  - FFFF82D040467CE8, RDI - 0000000000000000
  R8   - 000000007FF0C1C8, R9  - 000000007FF0C1C0, R10 - 0000000000000000
  R11  - 0000000000001020, R12 - FFFF82D040467CE8, R13 - 000000007FF0C1B8
  R14  - 000000007EA33328, R15 - 000000007EA332D8
  DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
  GS   - 0000000000000030, SS  - 0000000000000030
  CR0  - 0000000080010033, CR2 - FFFF82D040467CE8, CR3 - 000000007FC01000
  CR4  - 0000000000000668, CR8 - 0000000000000000
  DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
  DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
  GDTR - 000000007F9DB000 0000000000000047, LDTR - 0000000000000000
  IDTR - 000000007F48E018 0000000000000FFF,   TR - 0000000000000000
  FXSAVE_STATE - 000000007FF0BDE0
  !!!! Find image based on IP(0x7EE21E9A) (No PDB)  (ImageBase=000000007EE20000, EntryPoint=000000007EE23935) !!!!

After the patch:
  Booting `XenServer (Serial)'Booting `XenServer (Serial)'
  Test message: Buffer too small
  BdsDxe: loading Boot0000 "UiApp" from Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(462CAA21-7614-4503-836E-8AB6F4662331)
  BdsDxe: starting Boot0000 "UiApp" from Fv(7CB8BDC9-F8EB-4F34-AAEA-3EE4AF6516A1)/FvFile(462CAA21-7614-4503-836E-8AB6F4662331)

This partially rolls back commit 00d5d5ce23e6.

Fixes: 9180f5365524 ("x86: add multiboot2 protocol support for EFI platforms")
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
2 months agoxen/arm: mpu: Ensure that the page size is 4KB
Ayan Kumar Halder [Tue, 4 Mar 2025 17:57:08 +0000 (17:57 +0000)]
xen/arm: mpu: Ensure that the page size is 4KB

Similar to commit (d736b6eb451b, "xen/arm: mpu: Define Xen start address for
MPU systems"), one needs to add a build assertion to ensure that the page size
is 4KB on arm32 based systems as well.
The existing build assertion is moved under "xen/arch/arm/mpu" as it applies
for both arm64 and arm32 based systems.

Signed-off-by: Ayan Kumar Halder <ayan.kumar.halder@amd.com>
Acked-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
2 months agoxen/arm: mpu: Move some of the definitions to common file
Ayan Kumar Halder [Tue, 4 Mar 2025 17:57:07 +0000 (17:57 +0000)]
xen/arm: mpu: Move some of the definitions to common file

For AArch32, refer to ARM DDI 0568A.c ID110520.
MPU_REGION_SHIFT is same between AArch32 and AArch64 (HPRBAR).
Also, NUM_MPU_REGIONS_SHIFT is same between AArch32 and AArch64
(HMPUIR).

Signed-off-by: Ayan Kumar Halder <ayan.kumar.halder@amd.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Michal Orzel <michal.orzel@amd.com>
2 months agoXen: CI fixes from XSN-2
Andrew Cooper [Wed, 5 Mar 2025 22:17:22 +0000 (22:17 +0000)]
Xen: CI fixes from XSN-2

 * Add cf_check annotation to cmp_patch_id() used by bsearch().
 * Add U suffix to the K[] table to fix MISRA Rule 7.2 violations.

Fixes: 372af524411f ("xen/lib: Introduce SHA2-256")
Fixes: 630e8875ab36 ("x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 months agox86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
x86/ucode: Perform extra SHA2 checks on AMD Fam17h/19h microcode

Collisions have been found in the microcode signing algorithm used by AMD
Fam17h/19h CPUs, and now anyone can sign their own.

For more details, see:
  https://bughunters.google.com/blog/5424842357473280/zen-and-the-art-of-microcode-hacking
  https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7033.html

As a stopgap mitigation, check the digest of patches against a table of blobs
with known provenance.  These are all Fam17h and Fam19h blobs included in
linux-firmware at the time of writing, specifically:

  https://git.kernel.org/firmware/linux-firmware/c/48bb90cceb882cab8e9ab692bc5779d3bf3a13b8

These checks can be opted out of by booting with ucode=no-digest-check, but
doing so is not recommended.
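
A rough sketch of a table lookup of this kind.  The struct layout, the
placeholder table entries and check_digest() are assumptions for
illustration; real digests are deliberately not reproduced here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SHA2_256_DIGEST_SIZE 32

struct patch_digest {
    uint32_t patch_id;
    uint8_t digest[SHA2_256_DIGEST_SIZE];
};

/* Placeholder entries, sorted by patch_id as bsearch() requires. */
static const struct patch_digest known_patches[] = {
    { 0x00000010U, { 0xaa } },
    { 0x00000020U, { 0xbb } },
    { 0x00000030U, { 0xcc } },
};

static int cmp_patch_id(const void *key, const void *elem)
{
    const struct patch_digest *pd = elem;
    uint32_t id = *(const uint32_t *)key;

    if ( id == pd->patch_id )
        return 0;

    return id < pd->patch_id ? -1 : 1;
}

/* Accept a patch only if its digest matches the entry of known provenance. */
static bool check_digest(uint32_t patch_id, const uint8_t *digest,
                         bool digest_check)
{
    const struct patch_digest *pd;

    if ( !digest_check ) /* opted out, e.g. via ucode=no-digest-check */
        return true;

    pd = bsearch(&patch_id, known_patches,
                 sizeof(known_patches) / sizeof(known_patches[0]),
                 sizeof(known_patches[0]), cmp_patch_id);

    return pd && !memcmp(pd->digest, digest, SHA2_256_DIGEST_SIZE);
}
```

An unknown patch ID is rejected just like a mismatching digest, so the table
acts as an allowlist rather than a blocklist.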

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoxen/lib: Introduce SHA2-256
Andrew Cooper [Fri, 13 Dec 2024 14:34:00 +0000 (14:34 +0000)]
xen/lib: Introduce SHA2-256

A future change will need to calculate SHA2-256 digests.  Introduce an
implementation in lib/, derived from Trenchboot which itself is derived from
Linux.

In order to be useful to other architectures, it is careful with endianness
and misaligned accesses as well as being more MISRA friendly, but is only
wired up for x86 in the short term.
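
To illustrate what being careful with endianness and misaligned accesses can
mean in practice, here is a generic sketch (load_be32() and store_be32() are
hypothetical helpers, not necessarily what the implementation uses):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Read a big-endian 32-bit word one byte at a time.  This makes no
 * assumption about the alignment of p or the endianness of the host.
 */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] <<  8) |  (uint32_t)p[3];
}

/* Write a 32-bit word in big-endian order, again byte by byte. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

SHA2-256 defines its message words and digest output as big-endian, so
helpers of this shape keep the result identical across architectures.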

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months agoRevert "xen/riscv: drop CONFIG_RISCV_ISA_RV64G"
Jan Beulich [Wed, 5 Mar 2025 16:06:23 +0000 (17:06 +0100)]
Revert "xen/riscv: drop CONFIG_RISCV_ISA_RV64G"

This reverts commit 86b1b8ec3d9d0508a95540e368432291b883837f. It
fails in CI without an adjustment there.

2 months agotools/xl: fix channel configuration setting
Juergen Gross [Wed, 5 Mar 2025 15:37:37 +0000 (16:37 +0100)]
tools/xl: fix channel configuration setting

Channels work differently than other device types: their devid should
be -1 initially in order to distinguish them from the primary console
which has the devid of 0.

So when parsing the channel configuration, use
ARRAY_EXTEND_INIT_NODEVID() in order to avoid overwriting the devid
set by libxl_device_channel_init().

Fixes: 3a6679634766 ("libxl: set channel devid when not provided by application")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
2 months agox86/xstate: Map/unmap xsave area in {compress,expand}_xsave_states()
Alejandro Vallejo [Wed, 5 Mar 2025 15:37:14 +0000 (16:37 +0100)]
x86/xstate: Map/unmap xsave area in {compress,expand}_xsave_states()

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agox86/domctl: Map/unmap xsave area in arch_get_info_guest()
Alejandro Vallejo [Wed, 5 Mar 2025 15:37:02 +0000 (16:37 +0100)]
x86/domctl: Map/unmap xsave area in arch_get_info_guest()

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agox86/hvm: Map/unmap xsave area in hvmemul_{get,put}_fpu()
Alejandro Vallejo [Wed, 5 Mar 2025 15:36:25 +0000 (16:36 +0100)]
x86/hvm: Map/unmap xsave area in hvmemul_{get,put}_fpu()

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agox86/xstate: Map/unmap xsave area in xstate_set_init() and handle_setbv()
Alejandro Vallejo [Wed, 5 Mar 2025 15:35:57 +0000 (16:35 +0100)]
x86/xstate: Map/unmap xsave area in xstate_set_init() and handle_setbv()

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agox86/fpu: Map/unmap xsave area in vcpu_{reset,setup}_fpu()
Alejandro Vallejo [Wed, 5 Mar 2025 15:35:37 +0000 (16:35 +0100)]
x86/fpu: Map/unmap xsave area in vcpu_{reset,setup}_fpu()

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agox86/hvm: Map/unmap xsave area in hvm_save_cpu_ctxt()
Alejandro Vallejo [Wed, 5 Mar 2025 15:35:04 +0000 (16:35 +0100)]
x86/hvm: Map/unmap xsave area in hvm_save_cpu_ctxt()

No functional change.

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agox86/xstate: Create map/unmap primitives for xsave areas
Alejandro Vallejo [Wed, 5 Mar 2025 15:34:27 +0000 (16:34 +0100)]
x86/xstate: Create map/unmap primitives for xsave areas

Add infrastructure to simplify ASI handling. With ASI in the picture
we'll have several different means of accessing the XSAVE area of a
given vCPU, depending on whether a domain is covered by ASI or not and
whether the vCPU in question is scheduled on the current pCPU or not.

Having these complexities exposed at the call sites becomes unwieldy
very fast. These wrappers are intended to be used in a similar way to
map_domain_page() and unmap_domain_page(); the map operation will
dispatch the appropriate pointer for each case in a future patch, while
unmap will remain a no-op where no unmap is required (e.g. when there's
no ASI) and remove the transient mapping if one was required.

Follow-up patches replace all uses of raw v->arch.xsave_area with this
mechanism, in preparation for adding the aforementioned dispatch logic
at a later time.
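
The intended usage pattern might be sketched as below.  The structures are
placeholders, and map_xsave_area(), unmap_xsave_area() and
clear_first_byte() are hypothetical names chosen for this example rather
than the primitives added by the patch:

```c
#include <assert.h>
#include <stddef.h>

struct xsave_struct { unsigned char data[64]; };   /* placeholder layout */
struct vcpu { struct xsave_struct *xsave_area; };  /* placeholder vCPU */

/* Without ASI, mapping simply hands back the raw pointer. */
static struct xsave_struct *map_xsave_area(struct vcpu *v)
{
    return v->xsave_area;
}

/*
 * Without ASI there is nothing to tear down.  Clearing the caller's
 * pointer catches use after unmap.
 */
static void unmap_xsave_area(struct vcpu *v, struct xsave_struct **xsa)
{
    (void)v;
    *xsa = NULL;
}

/* A call site brackets every access with map/unmap. */
static void clear_first_byte(struct vcpu *v)
{
    struct xsave_struct *xsa = map_xsave_area(v);

    xsa->data[0] = 0;
    unmap_xsave_area(v, &xsa);
}
```

Because call sites never touch the raw field, the dispatch logic can later
change inside the two wrappers without revisiting each user.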

Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/cpufreq: abstract Energy Performance Preference value
Penny Zheng [Wed, 5 Mar 2025 14:45:10 +0000 (15:45 +0100)]
xen/cpufreq: abstract Energy Performance Preference value

Intel's hwp Energy Performance Preference value is compatible with
CPPC's Energy Performance Preference value, so this commit abstracts
the value and moves it to the common header file cpufreq.h, so that it
can be used for more than just hwp in the future.

Signed-off-by: Penny Zheng <Penny.Zheng@amd.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/riscv: drop CONFIG_RISCV_ISA_RV64G
Oleksii Kurochko [Wed, 5 Mar 2025 14:44:12 +0000 (15:44 +0100)]
xen/riscv: drop CONFIG_RISCV_ISA_RV64G

'G' stands for "imafd_zicsr_zifencei".

Extensions 'f' and 'd' aren't really needed for Xen, and allowing floating
point registers to be used can lead to crashes.

Extensions 'i', 'm', 'a', 'zicsr', and 'zifencei' are necessary for the
operation of Xen, which is why they are used explicitly (unconditionally)
in -march.

Drop "Base ISA" choice from riscv/Kconfig as it is always empty.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/README: add compiler and binutils versions for RISCV-64
Oleksii Kurochko [Wed, 5 Mar 2025 14:43:55 +0000 (15:43 +0100)]
xen/README: add compiler and binutils versions for RISCV-64

Considering that the Zbb extension is supported since GCC version 12 [1]
and that older GCC versions do not support Z extensions in -march (I haven't
faced this issue for GCC >=11.2), leading to compilation failures,
the baseline version for GCC is set to 12.2 and for GNU binutils to 2.39.

The GCC version is set to 12.2 instead of 12.1 because Xen's GitLab CI uses
Debian 12, which includes GCC 12.2 and GNU binutils 2.39.

[1] https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=149e217033f01410a9783c5cb2d020cf8334ae4c

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/list: fix comments in include/xen/list.h
Juergen Gross [Wed, 5 Mar 2025 14:43:32 +0000 (15:43 +0100)]
xen/list: fix comments in include/xen/list.h

There are several places in list.h where "list_struct" is used instead
of "struct list_head". Fix that.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months agoxen/console: introduce console_{get,put}_domain()
Denis Mukhin [Wed, 5 Mar 2025 14:42:49 +0000 (15:42 +0100)]
xen/console: introduce console_{get,put}_domain()

console_input_domain() takes an RCU lock to protect domain structure.
That implies a call to rcu_unlock_domain() after use.

Introduce a pair of console_get_domain() / console_put_domain() to highlight
the correct use of the call within the code interacting with Xen console
driver.

The new calls are used in __serial_rx(), which also fixes console forwarding to
late hardware domains which run with domain IDs different from 0.

While moving the guest_printk() invocation also drop the redundant _G infix.

Signed-off-by: Denis Mukhin <dmukhin@ford.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
2 months agox86/HVM: drop redundant access splitting
Jan Beulich [Wed, 5 Mar 2025 14:42:12 +0000 (15:42 +0100)]
x86/HVM: drop redundant access splitting

With all paths into hvmemul_linear_mmio_access() coming through
linear_{read,write}(), there's no need anymore to split accesses at
page boundaries there. Leave an assertion, though.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
2 months ago x86/HVM: slightly improve CMPXCHG16B emulation
Jan Beulich [Wed, 5 Mar 2025 14:41:14 +0000 (15:41 +0100)]
x86/HVM: slightly improve CMPXCHG16B emulation

Using hvmemul_linear_mmio_write() directly (as fallback when mapping the
memory operand isn't possible) won't work properly when the access
crosses a RAM/MMIO boundary. Use linear_write() instead, which splits at
such boundaries as necessary.
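
Since a RAM/MMIO boundary can only fall on a page boundary, the splitting can
be sketched as cutting an access into page-contained chunks.  This is an
illustration only, assuming 4 KiB pages; struct chunk and split_at_pages() are
hypothetical and do not reproduce linear_write().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u  /* assumed 4 KiB pages */

struct chunk {
    uint64_t addr;
    unsigned int size;
};

/*
 * Split [addr, addr + size) into chunks that never cross a page boundary.
 * Writes at most 'max' chunks to 'out' and returns the number written.
 */
static unsigned int split_at_pages(uint64_t addr, unsigned int size,
                                   struct chunk *out, unsigned int max)
{
    unsigned int n = 0;

    assert(out != NULL);

    while ( size > 0 && n < max )
    {
        /* Bytes left before the end of the current page. */
        unsigned int room =
            TOY_PAGE_SIZE - (unsigned int)(addr & (TOY_PAGE_SIZE - 1));
        unsigned int len = size < room ? size : room;

        out[n].addr = addr;
        out[n].size = len;
        n++;

        addr += len;
        size -= len;
    }

    return n;
}
```

A 16-byte operand starting 8 bytes before a page end yields two 8-byte
chunks, each of which can then be routed to RAM or MMIO handling on its own.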

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 months ago x86/dom0: be less restrictive with the Interrupt Address Range
Roger Pau Monne [Wed, 12 Feb 2025 10:37:50 +0000 (11:37 +0100)]
x86/dom0: be less restrictive with the Interrupt Address Range

Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff].  This range has
two different purposes.  For accesses from the CPU it contains the default
position of the local APIC page at 0xfee00000.  For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) points to an address in that range to trigger an interrupt.
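
The range checks involved can be sketched as below.  The constants come from
this message; the function names and the simplified policy in
toy_dom0_may_map() are assumptions for illustration (the emulated local APIC
page and all other dom0 restrictions are deliberately left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IAR_START          UINT64_C(0xfee00000)
#define IAR_END            UINT64_C(0xfeefffff)  /* inclusive */
#define LAPIC_DEFAULT_BASE UINT64_C(0xfee00000)
#define TOY_PAGE_MASK      (~UINT64_C(0xfff))    /* assumed 4 KiB pages */

/* True if addr lies in the interrupt address range. */
static bool in_interrupt_range(uint64_t addr)
{
    return addr >= IAR_START && addr <= IAR_END;
}

/* True if addr falls in the default local APIC page. */
static bool is_default_lapic_page(uint64_t addr)
{
    return (addr & TOY_PAGE_MASK) == LAPIC_DEFAULT_BASE;
}

/*
 * Simplified policy once the range-wide restriction is removed: inside the
 * range, only the local APIC page stays blocked.
 */
static bool toy_dom0_may_map(uint64_t addr)
{
    assert(in_interrupt_range(addr));  /* sketch only models the range */
    return !is_default_lapic_page(addr);
}
```

The address 0xfeec2000 mentioned in this message is inside the range but
outside the local APIC page, so this sketch allows it.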

There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
the Linux kernel maps 0xfeec2000:

RIP: e030:xen_mc_flush+0x1e8/0x2b0
 xen_leave_lazy_mmu+0x15/0x60
 vmap_range_noflush+0x408/0x6f0
 __ioremap_caller+0x20d/0x350
 acpi_os_map_iomem+0x1a3/0x1c0
 acpi_ex_system_memory_space_handler+0x229/0x3f0
 acpi_ev_address_space_dispatch+0x17e/0x4c0
 acpi_ex_access_region+0x28a/0x510
 acpi_ex_field_datum_io+0x95/0x5c0
 acpi_ex_extract_from_field+0x36b/0x4e0
 acpi_ex_read_data_from_field+0xcb/0x430
 acpi_ex_resolve_node_to_value+0x2e0/0x530
 acpi_ex_resolve_to_value+0x1e7/0x550
 acpi_ds_evaluate_name_path+0x107/0x170
 acpi_ds_exec_end_op+0x392/0x860
 acpi_ps_parse_loop+0x268/0xa30
 acpi_ps_parse_aml+0x221/0x5e0
 acpi_ps_execute_method+0x171/0x3e0
 acpi_ns_evaluate+0x174/0x5d0
 acpi_evaluate_object+0x167/0x440
 acpi_evaluate_dsm+0xb6/0x130
 ucsi_acpi_dsm+0x53/0x80
 ucsi_acpi_read+0x2e/0x60
 ucsi_register+0x24/0xa0
 ucsi_acpi_probe+0x162/0x1e3
 platform_probe+0x48/0x90
 really_probe+0xde/0x340
 __driver_probe_device+0x78/0x110
 driver_probe_device+0x1f/0x90
 __driver_attach+0xd2/0x1c0
 bus_for_each_dev+0x77/0xc0
 bus_add_driver+0x112/0x1f0
 driver_register+0x72/0xd0
 do_one_initcall+0x48/0x300
 do_init_module+0x60/0x220
 __do_sys_init_module+0x17f/0x1b0
 do_syscall_64+0x82/0x170

Remove the restrictions to create mappings in the interrupt address range
for dom0.  Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present.  Additionally make sure the
emulated local APIC page is also not mapped, in case dom0 is using it.

Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages.  Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.

There's also the following restriction noted in the Intel VT-d specification:

> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).

Similarly for AMD-Vi:

> Accesses to the interrupt address range (Table 3) are defined to go through
> the interrupt remapping portion of the IOMMU and not through address
> translation processing. Therefore, when a transaction is being processed as
> an interrupt remapping operation, the transaction attribute of
> pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should
> not be configured such that an address translation results in a special
> address such as the interrupt address range.

However those restrictions don't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping, and hence there's no output address after translation that
belongs to the interrupt address range.

Reported-by: Jürgen Groß <jgross@suse.com>
Link: https://lore.kernel.org/xen-devel/baade0a7-e204-4743-bda1-282df74e5f89@suse.com/
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 months ago x86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables
Roger Pau Monne [Fri, 14 Feb 2025 09:39:29 +0000 (10:39 +0100)]
x86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables

The current code in arch_iommu_hwdom_init() kind of open-codes the same
MMIO permission ranges that are added to the hardware domain ->iomem_caps.
Avoid this duplication and use ->iomem_caps in arch_iommu_hwdom_init() to
filter which memory regions should be added to the dom0 IOMMU page-tables.
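
Filtering candidate regions against a permission set can be sketched as
below.  This is a toy model only: struct pfn_range and the toy_* functions are
hypothetical and do not use Xen's rangeset API, and whole candidates are
dropped here rather than trimmed to the permitted part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive range of page frame numbers. */
struct pfn_range {
    uint64_t s, e;
};

/* True if [s, e] is fully covered by a single permitted range. */
static bool toy_access_permitted(const struct pfn_range *caps, size_t n_caps,
                                 uint64_t s, uint64_t e)
{
    assert(s <= e);

    for ( size_t i = 0; i < n_caps; i++ )
        if ( caps[i].s <= s && e <= caps[i].e )
            return true;

    return false;
}

/* Compact 'cand' in place, keeping only permitted ranges; returns the count. */
static size_t toy_filter(struct pfn_range *cand, size_t n_cand,
                         const struct pfn_range *caps, size_t n_caps)
{
    size_t kept = 0;

    for ( size_t i = 0; i < n_cand; i++ )
        if ( toy_access_permitted(caps, n_caps, cand[i].s, cand[i].e) )
            cand[kept++] = cand[i];

    return kept;
}
```

The point of the design is that a single permission set decides what is
reachable, so CPU and IOMMU mappings cannot drift apart.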

Note the IO-APIC and MCFG page(s) must be set as not accessible for a PVH
dom0, otherwise the internal Xen emulation for those ranges won't work.
This requires adjustments in dom0_setup_permissions().

The call to pvh_setup_mmcfg() in dom0_construct_pvh() must now strictly be
done ahead of setting up dom0 permissions, so take the opportunity to also
put it inside the existing is_hardware_domain() region.

Also the special casing of E820_UNUSABLE regions no longer needs to be done
in arch_iommu_hwdom_init(), as those regions are already blocked in
->iomem_caps and thus would be removed from the rangeset as part of
->iomem_caps processing in arch_iommu_hwdom_init().  The E820_UNUSABLE
regions below 1MB are not removed from ->iomem_caps.  That's a slight
difference for the IOMMU created page-tables, but the aim is to allow
access to the same memory either from the CPU or the IOMMU page-tables.

Since ->iomem_caps already takes into account the domain max paddr, there's
no need to remove any regions past the last address addressable by the
domain, as applying ->iomem_caps would have already taken care of that.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months ago x86/dom0: correctly set the maximum ->iomem_caps bound for PVH
Roger Pau Monne [Tue, 18 Feb 2025 16:57:49 +0000 (17:57 +0100)]
x86/dom0: correctly set the maximum ->iomem_caps bound for PVH

The logic in dom0_setup_permissions() sets the maximum bound in
->iomem_caps unconditionally using paddr_bits, which is not correct for HVM
based domains.  Instead use domain_max_paddr_bits() to get the correct
maximum paddr bits for each possible domain type.

Switch to using PFN_DOWN() instead of PAGE_SHIFT, as that's shorter.
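
The two spellings compute the same bound.  A minimal sketch, assuming the
bound is the last page frame addressable with a given number of physical
address bits and a PAGE_SHIFT of 12; the exact expression used in the patch is
not shown here, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12                      /* assumed 4 KiB pages */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Last addressable PFN, written with an explicit PAGE_SHIFT subtraction. */
static uint64_t max_pfn_via_shift(unsigned int paddr_bits)
{
    assert(paddr_bits > PAGE_SHIFT && paddr_bits < 64);
    return (UINT64_C(1) << (paddr_bits - PAGE_SHIFT)) - 1;
}

/* The same bound, written with PFN_DOWN(). */
static uint64_t max_pfn_via_pfn_down(unsigned int paddr_bits)
{
    assert(paddr_bits > PAGE_SHIFT && paddr_bits < 64);
    return PFN_DOWN(UINT64_C(1) << paddr_bits) - 1;
}
```

For 36 physical address bits both forms give 0xffffff.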

Fixes: 53de839fb409 ('x86: constrain MFN range Dom0 may access')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 months ago x86/dom0: attempt to fixup p2m page-faults for PVH dom0
Roger Pau Monne [Thu, 13 Feb 2025 09:58:45 +0000 (10:58 +0100)]
x86/dom0: attempt to fixup p2m page-faults for PVH dom0

When building a PVH dom0 Xen attempts to map all (relevant) MMIO regions
into the p2m for dom0 access.  However the information Xen has about the
host memory map is limited.  Xen doesn't have access to any resources
described in ACPI dynamic tables, and hence the p2m mappings provided might
not be complete.

PV doesn't suffer from this issue because a PV dom0 is capable of mapping
into its page-tables any address not explicitly banned in d->iomem_caps.

Introduce a new command line option that allows Xen to attempt to fixup
the p2m page-faults, by creating p2m identity maps in response to p2m
page-faults.
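
The fixup idea can be shown with a toy p2m.  This is a sketch under stated
assumptions: struct toy_p2m, its tiny fixed-size table and the toy_* functions
are hypothetical, and the fixup_enabled flag merely models the new command
line option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_P2M_FRAMES 16
#define INVALID_MFN    UINT64_MAX

struct toy_p2m {
    uint64_t mfn[TOY_P2M_FRAMES];  /* gfn -> mfn, INVALID_MFN if unmapped */
    bool fixup_enabled;            /* models the new command line option */
    unsigned int faults;           /* number of p2m faults taken */
};

static void toy_p2m_init(struct toy_p2m *p, bool fixup_enabled)
{
    for ( unsigned int i = 0; i < TOY_P2M_FRAMES; i++ )
        p->mfn[i] = INVALID_MFN;
    p->fixup_enabled = fixup_enabled;
    p->faults = 0;
}

/* Fault path: create an identity mapping if fixups are enabled. */
static bool toy_p2m_fixup(struct toy_p2m *p, uint64_t gfn)
{
    assert(gfn < TOY_P2M_FRAMES);

    p->faults++;
    if ( !p->fixup_enabled )
        return false;

    if ( p->mfn[gfn] == INVALID_MFN )
        p->mfn[gfn] = gfn;  /* identity map: gfn == mfn */

    return true;            /* the access can be retried */
}

/* CPU access: succeeds directly if mapped, else goes through the fault path. */
static bool toy_cpu_access(struct toy_p2m *p, uint64_t gfn)
{
    assert(gfn < TOY_P2M_FRAMES);

    if ( p->mfn[gfn] != INVALID_MFN )
        return true;

    return toy_p2m_fixup(p, gfn);
}
```

Only the first access to an unpopulated frame takes the fault; later accesses
hit the mapping created by the fixup.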

This is intended as a workaround for small ACPI regions Xen doesn't know about.
Note that missing large MMIO regions mapped in this way will lead to
slowness due to the VM exit processing, plus the mappings will always use
small pages.

The ultimate aim is to attempt to bring better parity with a classic PV
dom0.

Note such fixups rely on the CPU doing the access to the unpopulated
address.  If the access is attempted from a device instead there's no
possible way to fix it up, as IOMMU page-faults are asynchronous.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Only slightly tested on my local PVH dom0 deployment.
---
Changes since v1:
 - Make the fixup function static.
 - Print message in case mapping already exists.