Andrew Cooper [Wed, 19 Mar 2025 02:58:18 +0000 (02:58 +0000)]
x86/mm: Fix IS_ALIGNED() check in IS_LnE_ALIGNED()
The current CI failures turn out to be a latent bug triggered by a narrow set
of properties of the initrd and the host memory map, which CI encountered by
chance.
One step during boot involves constructing directmap mappings for modules.
With some probing at the point of creation, it is observed that there's a 4k
mapping missing towards the end of the initrd.
The conditions for this bug appear to be a map_pages_to_xen() call with a start
address of exactly 4k beyond a 2M boundary, some number of full 2M pages, then
a tail needing 4k pages.
Anyway, the condition for spotting superpage boundaries in map_pages_to_xen()
is wrong. The IS_ALIGNED() macro expects a power of two for the alignment
argument, and subtracts 1 itself.
Fixing this causes the failing case to now boot.
Fixes: 97fb6fcf26e8 ("x86/mm: introduce helpers to detect super page alignment") Debugged-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
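A minimal sketch of the failure mode described above, expressed on 4k frame numbers. The IS_ALIGNED() definition follows the description in the message (power-of-two alignment, the macro subtracts 1 itself); the helper names and the FRAMES_PER_2M constant are assumptions for illustration and not the actual Xen code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Expects a power-of-two alignment and derives the mask itself. */
#define IS_ALIGNED(val, align) (((val) & ((align) - 1)) == 0)

/* Hypothetical: number of 4k frames covered by one 2M superpage. */
#define FRAMES_PER_2M UINT64_C(512)

/* Buggy form: the caller already subtracted 1, so the mask becomes 510. */
static bool is_2m_aligned_buggy(uint64_t pfn)
{
    return IS_ALIGNED(pfn, FRAMES_PER_2M - 1);
}

/* Fixed form: pass the power of two and let the macro build the mask. */
static bool is_2m_aligned_fixed(uint64_t pfn)
{
    return IS_ALIGNED(pfn, FRAMES_PER_2M);
}
```

With these definitions, frame 513 (one 4k page past a 2M boundary) passes the buggy check but not the fixed one, which matches the start address condition noted in the message.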
Jiqian Chen [Tue, 18 Mar 2025 08:48:00 +0000 (09:48 +0100)]
CHANGELOG.md: Mention PCI passthrough for HVM domUs
PCI passthrough is already supported for HVM domUs when dom0 is PVH
on x86. The last related patch on the QEMU side was merged after the Xen 4.20
release, so mention this feature in the Xen 4.21 entry.
SR-IOV is not yet supported on PVH dom0, so add a note for it.
Juergen Gross [Tue, 18 Mar 2025 08:47:45 +0000 (09:47 +0100)]
tools/xenstored: use xenmanage_poll_changed_domain()
Instead of checking each known domain after having received a
VIRQ_DOM_EXC event, use the new xenmanage_poll_changed_domain()
function for directly getting the domid of a domain having changed
its state.
A test doing "xl shutdown" of 1000 guests has shown a 6% reduction in the
CPU time consumed by xenstored with this change applied.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Jan Beulich [Tue, 18 Mar 2025 08:44:57 +0000 (09:44 +0100)]
symbols: don't over-align generated data
x86 is one of the few architectures where .align has the same meaning as
.balign; most other architectures (Arm, PPC, and RISC-V in particular)
give it the same meaning as .p2align. Aligning every one of these items
to 256 bytes (on all 64-bit architectures except x86-64) is clearly too
much.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools: Mark ACPI SDTs as NVS in the PVH build path
Commit cefeffc7e583 marked ACPI tables as NVS in the hvmloader path
because SeaBIOS may otherwise just mark it as RAM. There is, however,
yet another reason to do it even in the PVH path. Xen's incarnation of
AML relies on having access to some ACPI tables (e.g. _STA of Processor
objects relies on reading the processor online bit in its MADT entry).
This is problematic if the OS tries to reclaim ACPI memory for page
tables as it's needed for runtime and can't be reclaimed after the OSPM
is up and running.
Fixes: de6d188a519f ("hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region") Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 11 Jun 2024 19:03:32 +0000 (20:03 +0100)]
x86/hvm: Use for_each_set_bit() in hvm_emulate_writeback()
... which is more concise than the opencoded form, and more efficient when
compiled.
Furthermore, now that find_{first,next}_bit() are no longer in use, the
seg_reg_{accessed,dirty} fields aren't forced to be unsigned long, although
they do need to remain unsigned int because of __set_bit() elsewhere.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 31 Dec 2024 16:52:39 +0000 (16:52 +0000)]
x86/boot: Fix zap_low_mappings() to map less of the trampoline
Regular data access into the trampoline is via the directmap.
As now discussed quite extensively in asm/trampoline.h, the trampoline is
arranged so that only the AP and S3 paths need an identity mapping, and that
they fit within a single page.
Right now, PFN_UP(trampoline_end - trampoline_start) is 2, causing more than
expected of the trampoline to be mapped. Cut it down to just the single page it
ought to be.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 13 Mar 2025 11:19:48 +0000 (12:19 +0100)]
x86/ioremap: prevent additions against the NULL pointer
This was reported by clang UBSAN as:
UBSAN: Undefined behaviour in arch/x86/mm.c:6297:40
applying zero offset to null pointer
[...]
Xen call trace:
[<ffff82d040303662>] R common/ubsan/ubsan.c#ubsan_epilogue+0xa/0xc0
[<ffff82d040304aa3>] F __ubsan_handle_pointer_overflow+0xcb/0x100
[<ffff82d0406ebbc0>] F ioremap_wc+0xc8/0xe0
[<ffff82d0406c3728>] F video_init+0xd0/0x180
[<ffff82d0406ab6f5>] F console_init_preirq+0x3d/0x220
[<ffff82d0406f1876>] F __start_xen+0x68e/0x5530
[<ffff82d04020482e>] F __high_start+0x8e/0x90
Fix bt_ioremap() and ioremap{,_wc}() to not add the offset if the returned
pointer from __vmap() is NULL.
Fixes: d0d4635d034f ('implement vmap()') Fixes: f390941a92f1 ('x86/DMI: fix table mapping when one lives above 1Mb') Fixes: 81d195c6c0e2 ('x86: introduce ioremap_wc()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
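The pattern behind this fix can be sketched as follows. The helper name apply_offset() is hypothetical and stands in for the point where a sub-page offset is added back to a mapping result; it is not the actual Xen code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: re-apply the sub-page offset to a mapping result,
 * but only when the mapping succeeded.
 */
static void *apply_offset(void *va, size_t offs)
{
    if ( va == NULL )
        return NULL;            /* never compute NULL + offs */

    return (char *)va + offs;
}
```

Returning NULL unchanged keeps the failure visible to the caller and avoids the pointer arithmetic that UBSAN flagged.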
Roger Pau Monne [Thu, 13 Mar 2025 10:08:05 +0000 (11:08 +0100)]
x86/dom0: placate GCC 12 compile-time errors with UBSAN and PVH_GUEST
When building Xen with GCC 12 with UBSAN and PVH_GUEST both enabled the
compiler emits the following errors:
arch/x86/setup.c: In function '__start_xen':
arch/x86/setup.c:1504:19: error: 'consider_modules' reading 40 bytes from a region of size 4 [-Werror=stringop-overread]
1504 | end = consider_modules(s, e, reloc_size + mask,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1505 | bi->mods, bi->nr_modules, -1);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/setup.c:1504:19: note: referencing argument 4 of type 'const struct boot_module[0]'
arch/x86/setup.c:686:24: note: in a call to function 'consider_modules'
686 | static uint64_t __init consider_modules(
| ^~~~~~~~~~~~~~~~
arch/x86/setup.c:1535:19: error: 'consider_modules' reading 40 bytes from a region of size 4 [-Werror=stringop-overread]
1535 | end = consider_modules(s, e, size, bi->mods,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1536 | bi->nr_modules + relocated, j);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/x86/setup.c:1535:19: note: referencing argument 4 of type 'const struct boot_module[0]'
arch/x86/setup.c:686:24: note: in a call to function 'consider_modules'
686 | static uint64_t __init consider_modules(
| ^~~~~~~~~~~~~~~~
This seems to be the result of some function manipulation done by UBSAN
triggering GCC stringops related errors. Placate the errors by declaring
the function parameter as `const struct boot_module *` instead of `const
struct boot_module[]`.
Note that GCC 13 seems to be fixed, and doesn't trigger the error when
using `[]`.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Wed, 12 Mar 2025 12:35:53 +0000 (13:35 +0100)]
xen/ubsan: provide helper for clang's -fsanitize=function
clang's -fsanitize=function relies on the presence of
__ubsan_handle_function_type_mismatch() to print the detection of indirect
calls of a function through a function pointer of the wrong type.
Implement the helper, inspired by the LLVM ubsan lib implementation.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Extend coverage of CONFIG_VM_EVENT option and make the build of VM events
and monitoring support optional. Also make MEM_PAGING option depend on VM_EVENT
to document that mem_paging is relying on vm_event.
This is to reduce code size on Arm when this option isn't enabled.
Sergiy Kibrik [Fri, 14 Mar 2025 05:23:14 +0000 (07:23 +0200)]
x86:monitor: control monitor.c build with CONFIG_VM_EVENT option
Replace more general CONFIG_HVM option with CONFIG_VM_EVENT which is more
relevant and specific to monitoring. This is only to clarify at build level
to which subsystem this file belongs.
No functional change here, as VM_EVENT depends on HVM.
Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Sergiy Kibrik [Fri, 14 Mar 2025 05:21:09 +0000 (07:21 +0200)]
xen: kconfig: rename MEM_ACCESS -> VM_EVENT
Use more generic CONFIG_VM_EVENT name throughout Xen code instead of
CONFIG_MEM_ACCESS. This reflects the fact that vm_event is a higher level
feature, with mem_access & monitor depending on it.
Suggested-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com>
Andrew Cooper [Sun, 29 Dec 2024 14:06:18 +0000 (14:06 +0000)]
x86/elf: Improve code generation in elf_core_save_regs()
A CALL with 0 displacement is handled specially, and is why this logic
functions even with CET Shadow Stacks active. Nevertheless a RIP-relative LEA
is the more normal way of doing this in 64bit code.
The retrieval of flags modifies the stack pointer so needs to state a
dependency on the stack pointer. Despite its name, ASM_CALL_CONSTRAINT is
the way to do this.
read_sreg() forces the answer through a register, causing code generation of
the form:
Jan Beulich [Fri, 14 Mar 2025 09:18:34 +0000 (10:18 +0100)]
VT-d: have set_msi_source_id() return a success indicator
Handling possible internal errors by just emitting a (debug-build-only)
log message can't be quite enough. Return error codes in those cases,
and have the caller propagate those up.
Drop a pointless return path, rather than "inventing" an error code for
it.
While touching the function declarator anyway also constify its first
parameter.
Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 14 Mar 2025 09:18:12 +0000 (10:18 +0100)]
VT-d: move obtaining of MSI/HPET source ID
This was the original attempt to address XSA-467, until it was found
that IRQs can be off already from higher up the call stack. Nevertheless
moving code out of locked regions is generally desirable anyway; some of
the callers, after all, don't disable interrupts or acquire other locks.
Hence, despite this not addressing the original report:
Data collection solely depends on the passed in PCI device. Furthermore,
since the function only writes to a local variable, we can pull the
invocation of set_msi_source_id() (and also set_hpet_source_id()) ahead
of the acquiring of the (IRQ-safe) lock.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Fri, 14 Mar 2025 09:17:11 +0000 (10:17 +0100)]
xen/sched: fix arinc653 to not use variables across cpupools
a653sched_do_schedule() is using two function local static variables,
which is resulting in bad behavior when using more than one cpupool
with the arinc653 scheduler.
Fix that by moving those variables to the scheduler private data.
Fixes: 22787f2e107c ("ARINC 653 scheduler") Reported-by: Choi Anderson <Anderson.Choi@boeing.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Nathan Studer <nathan.studer@dornerworks.com>
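To illustrate why function-local statics misbehave with more than one scheduler instance, here is a reduced sketch. The structure and function names are hypothetical and the schedule is reduced to a round-robin index; the real scheduler state is richer.

```c
#include <assert.h>

/* Hypothetical per-scheduler-instance state (one instance per cpupool). */
struct sched_priv {
    unsigned int sched_index;   /* current position in the schedule */
    unsigned int num_entries;   /* number of schedule entries */
};

/* Problematic pattern: one hidden index shared by every instance. */
static unsigned int next_entry_shared(unsigned int num_entries)
{
    static unsigned int sched_index;
    unsigned int cur = sched_index;

    sched_index = (sched_index + 1) % num_entries;
    return cur;
}

/* Fixed pattern: the index lives in the instance's private data. */
static unsigned int next_entry_private(struct sched_priv *p)
{
    unsigned int cur;

    assert(p->num_entries > 0);
    cur = p->sched_index;
    p->sched_index = (p->sched_index + 1) % p->num_entries;
    return cur;
}
```

With the shared variant, a second pool starts wherever the first pool left off; with the private variant each pool advances through its own schedule independently.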
Jan Beulich [Thu, 13 Mar 2025 09:24:15 +0000 (10:24 +0100)]
x86/shadow: replace p2m_is_valid() uses
The justification for dropping p2m_mmio_dm from p2m_is_valid() was wrong
for two of the shadow mode uses.
In _sh_propagate() we want to create special L1 entries for p2m_mmio_dm
pages. Hence we need to make sure we don't bail early for that type.
In _sh_page_fault() we want to handle p2m_mmio_dm by forwarding to
(internal or external) emulation. Pull the !p2m_is_mmio() check out of
the || expression (as otherwise it would need adding to the lhs as
well).
In both cases, p2m_is_valid() in combination with p2m_is_grant() still
doesn't cover foreign mappings. Hence use p2m_is_any_ram() plus (as
necessary) p2m_mmio_* instead.
Fixes: be59cceb2dbb ("x86/P2M: don't include MMIO_DM in p2m_is_valid()") Reported-by: Luca Fancellu <Luca.Fancellu@arm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Luca Fancellu <luca.fancellu@arm.com>
Jason Andryuk [Thu, 13 Mar 2025 09:23:52 +0000 (10:23 +0100)]
tools/libxl: Skip missing PCI GSIs
A PCI device may not have a legacy IRQ. In that case, we don't need to
do anything, so don't fail in libxl__arch_hvm_map_gsi() and
libxl__arch_hvm_unmap_gsi().
Requires an updated pciback to return -ENOENT.
Fixes: f97f885c7198 ("tools: Add new function to do PIRQ (un)map on PVH dom0") Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Jason Andryuk [Thu, 13 Mar 2025 09:23:42 +0000 (10:23 +0100)]
tools/ctrl: Silence missing GSI in xc_pcidev_get_gsi()
It is valid for a PCI device to not have a legacy IRQ. In that case, do
not print an error to keep the logs clean.
This relies on pciback being updated to return -ENOENT for a missing
GSI.
Fixes: b93e5981d258 ("tools: Add new function to get gsi from dev") Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Jan Beulich [Thu, 13 Mar 2025 09:23:10 +0000 (10:23 +0100)]
libxl: avoid infinite loop in libxl__remove_directory()
Infinitely retrying the rmdir() invocation makes little sense. While the
original observation was the log filling the disk (due to repeated
"Directory not empty" errors, in turn occurring for unclear reasons),
the loop wants breaking even if there was no error message being logged
(much like is done in the similar loops in libxl__remove_file() and
libxl__remove_file_or_directory()).
Fixes: c4dcbee67e6d ("libxl: provide libxl__remove_file et al") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
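A bounded version of such a removal loop might look like the sketch below. It uses only POSIX rmdir() and errno; the function name is hypothetical, and treating ENOENT as success is an assumption for this sketch rather than a statement about libxl.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the final step of a directory removal helper.
 * Returns 0 on success (or if the directory is already gone), -1 otherwise.
 */
static int remove_empty_dir(const char *path)
{
    assert(path != NULL);

    for ( ;; )
    {
        if ( rmdir(path) == 0 )
            return 0;
        if ( errno == ENOENT )
            return 0;           /* nothing left to remove */
        if ( errno == EINTR )
            continue;           /* interrupted: the only case worth retrying */
        return -1;              /* any other error: stop instead of looping */
    }
}
```

Only EINTR leads to another iteration, so an error such as "Directory not empty" ends the loop after a single attempt.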
Luca Fancellu [Wed, 12 Mar 2025 13:52:49 +0000 (13:52 +0000)]
xen/passthrough: Provide stub functions when !HAS_PASSTHROUGH
When Xen is built without HAS_PASSTHROUGH, there are some parts
in arm where iommu_* functions are called in the codebase, but
their implementation is under xen/drivers/passthrough that is
not built.
So provide some stub for these functions in order to build Xen
when !HAS_PASSTHROUGH, which is the case for example on systems
with MPU support.
For gnttab_need_iommu_mapping() in the Arm part, modify the macro
to use IS_ENABLED for the HAS_PASSTHROUGH Kconfig.
Fixes: 0388a5979b21 ("xen/arm: mpu: Introduce choice between MMU and MPU") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Acked-by: Julien Grall <jgrall@amazon.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Michal Orzel [Wed, 12 Mar 2025 10:16:19 +0000 (11:16 +0100)]
tools/arm: Reject configuration with incorrect nr_spis value
If the calculated value for nr_spis by the toolstack is bigger than the
value provided by the user, we silently ignore the latter. This is not
consistent with the approach we have in Xen on Arm when we try to reject
incorrect configuration. Also, the documentation for nr_spis is
incorrect as it mentions 991 as the number of max SPIs, where it should
be 960 i.e. (1020 - 32) rounded down to the nearest multiple of 32.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
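The arithmetic behind the corrected limit can be checked with a short sketch. The ROUNDDOWN() macro and the function name are defined here for illustration only; the figures 1020 and 32 are the ones quoted in the message.

```c
#include <assert.h>

/* Round x down to a multiple of a (a > 0). */
#define ROUNDDOWN(x, a) ((x) - ((x) % (a)))

/* (1020 - 32) rounded down to the nearest multiple of 32. */
static unsigned int max_spis(void)
{
    return ROUNDDOWN(1020u - 32u, 32u);
}
```

Here 1020 - 32 is 988, and the largest multiple of 32 not exceeding 988 is 960.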
Michal Orzel [Wed, 12 Mar 2025 10:16:18 +0000 (11:16 +0100)]
xen/arm: Improve handling of nr_spis
At the moment, we print a warning about max number of IRQs supported by
GIC bigger than vGIC only for hardware domain. This check is not hwdom
special, and should be made common. Also, in case of user not specifying
nr_spis for dom0less domUs, we should take into account max number of
IRQs supported by vGIC if it's smaller than for GIC.
Introduce VGIC_MAX_IRQS macro and use it instead of hardcoded 992 value.
Introduce VGIC_DEF_NR_SPIS macro to store the default number of vGIC SPIs.
Fix calculation of nr_spis for dom0less domUs and make the GIC/vGIC max
IRQs comparison common.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
xen/arm: fix iomem_ranges cfg in map_range_to_domain()
Now the following code in map_range_to_domain()
res = rangeset_add_range(mr_data->iomem_ranges,
paddr_to_pfn(addr),
paddr_to_pfn_aligned(addr + len - 1));
where
paddr_to_pfn_aligned(paddr) defined as paddr_to_pfn(PAGE_ALIGN(paddr))
calculates the iomem range end address by rounding it up to the next Xen
page, on the incorrect assumption that the end address passed to
rangeset_add_range() is exclusive, whereas it is expected to be inclusive.
For example, if the requested range is [00e6140000:00e6141004] then it is expected
to add the [e6140:e6141] range (num_pages=2) to the mr_data->iomem_ranges
rangeset, but [e6140:e6142] (num_pages=3) is added instead.
To fix it, drop PAGE_ALIGN() from the iomem range end address calculation
formula and just use paddr_to_pfn(addr + len - 1).
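The off-by-one-page effect can be reproduced with the example range from the message. The macros below are simplified stand-ins assuming 4k pages, and the two function names are hypothetical; the length 0x1005 is derived from the inclusive range [00e6140000:00e6141004].

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT       12
#define PAGE_SIZE        (UINT64_C(1) << PAGE_SHIFT)
#define PAGE_ALIGN(x)    (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))
#define paddr_to_pfn(pa) ((uint64_t)(pa) >> PAGE_SHIFT)

/* Old calculation: rounds the inclusive end up to the next page. */
static uint64_t end_pfn_old(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    return paddr_to_pfn(PAGE_ALIGN(addr + len - 1));
}

/* Fixed calculation: the frame containing the last byte of the range. */
static uint64_t end_pfn_new(uint64_t addr, uint64_t len)
{
    assert(len > 0);
    return paddr_to_pfn(addr + len - 1);
}
```

For addr 0xe6140000 and len 0x1005, the old calculation yields frame 0xe6142 while the fixed one yields 0xe6141, the last frame actually covered by the range.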
xen/arm: fix iomem permissions cfg in map_range_to_domain()
Now the following code in map_range_to_domain()
res = iomem_permit_access(d, paddr_to_pfn(addr),
paddr_to_pfn(PAGE_ALIGN(addr + len - 1)));
calculates the iomem range end address by rounding it up to the next Xen
page, on the incorrect assumption that the end address passed to
iomem_permit_access() is exclusive, whereas it is expected to be inclusive.
This gives the control domain (Dom0) access to an incorrect MMIO range with
one additional page.
For example, if the requested range is [00e6140000:00e6141004] then it is expected
to add the [e6140:e6141] range (num_pages=2) to the domain iomem_caps rangeset,
but [e6140:e6142] (num_pages=3) is added instead.
To fix it, drop PAGE_ALIGN() from the iomem range end address calculation
formula.
Roger Pau Monne [Fri, 7 Mar 2025 09:16:01 +0000 (10:16 +0100)]
x86/iommu: avoid MSI address and data writes if IRT index hasn't changed
Attempt to reduce the MSI entry writes, and the associated checking whether
memory decoding and MSI-X is enabled for the PCI device, when the MSI data
hasn't changed.
When using Interrupt Remapping the MSI entry will contain an index into
the remapping table, and it's in such remapping table where the MSI vector
and destination CPU is stored. As such, when using interrupt remapping,
changes to the interrupt affinity shouldn't result in changes to the MSI
entry, and the MSI entry update can be avoided.
Signal from the IOMMU update_ire_from_msi hook whether the MSI data or
address fields have changed, and thus need writing to the device registers.
Such signaling is done by returning 1 from the function. Otherwise
returning 0 means no update of the MSI fields, and thus no write
required.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
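The return-value convention described above can be sketched like this. All names are hypothetical stand-ins (including the structure layout and the write counter); only the convention of returning 1 when the MSI fields changed and 0 when they did not comes from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced MSI message. */
struct msi_msg_sketch {
    uint32_t address_lo, address_hi, data;
};

static unsigned int device_writes;  /* counts simulated register writes */

/* Stand-in for the IOMMU hook: 1 if the fields changed, 0 otherwise. */
static int update_ire_sketch(struct msi_msg_sketch *cached,
                             const struct msi_msg_sketch *translated)
{
    if ( cached->address_lo == translated->address_lo &&
         cached->address_hi == translated->address_hi &&
         cached->data == translated->data )
        return 0;

    *cached = *translated;
    return 1;
}

/* Caller: only touch the device when the hook reports a change. */
static int set_msi_sketch(struct msi_msg_sketch *cached,
                          const struct msi_msg_sketch *translated)
{
    int rc = update_ire_sketch(cached, translated);

    if ( rc > 0 )
        device_writes++;        /* would write address/data registers here */

    return rc < 0 ? rc : 0;
}
```

When interrupt remapping keeps the index stable, the translated message stays identical and the device write is skipped.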
Roger Pau Monne [Mon, 10 Mar 2025 17:13:52 +0000 (18:13 +0100)]
x86/hvm: check return code of hvm_pi_update_irte when binding
Consume the return code from hvm_pi_update_irte(), and propagate the error
back to the caller if hvm_pi_update_irte() fails.
Fixes: 35a1caf8b6b5 ('pass-through: update IRTE according to guest interrupt config changes') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Mon, 10 Mar 2025 15:49:29 +0000 (16:49 +0100)]
x86/vmx: fix posted interrupts usage of msi_desc->msg field
The current usage of msi_desc->msg in vmx_pi_update_irte() will make the
field contain a translated MSI message, instead of the expected
untranslated one. This breaks dump_msi(), which uses the data in
msi_desc->msg to print the interrupt details.
Fix this by introducing a dummy local msi_msg, and use it with
iommu_update_ire_from_msi(). vmx_pi_update_irte() relies on the MSI
message not changing, so there's no need to propagate the resulting msi_msg
to the hardware, and the contents can be ignored.
Additionally add a comment to clarify that msi_desc->msg must always
contain the untranslated MSI message.
Fixes: a5e25908d18d ('VT-d: introduce new fields in msi_desc to track binding with guest interrupt') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
The logic has too many levels of indirection and it's very hard to
understand it in its current form. Split it between the corner case where
the adjustment is bigger than the current claim and the rest to avoid 5
auxiliary variables.
Add a functional change to prevent negative adjustments from
re-increasing the claim. This has the nice side effect of avoiding
taking the heap lock here on every free.
While at it, fix incorrect field name in nearby comment.
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 21 Feb 2025 11:34:49 +0000 (12:34 +0100)]
x86/msr: expose MSR_FAM10H_MMIO_CONF_BASE on AMD
The MMIO_CONF_BASE reports the base of the MCFG range on AMD systems.
Linux pre-6.14 is unconditionally attempting to read the MSR without a
safe MSR accessor, and since Xen doesn't allow access to it Linux reports
the following error:
Such access is conditional on the presence of a device with PnP ID
"PNP0c01", which triggers the execution of the quirk_amd_mmconfig_area()
function. Note that prior to commit 3fac3734c43a MSR accesses when running
as a PV guest would always use the safe variant, and thus silently handle
the #GP.
Fix by allowing access to the MSR on AMD systems for the hardware domain.
Write attempts to the MSR will still result in #GP for all domain types.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 11 Mar 2025 08:55:47 +0000 (09:55 +0100)]
x86/P2M: don't include MMIO_DM in p2m_is_valid()
MMIO_DM specifically marks pages which aren't valid, much like INVALID
does. Dropping the type from the predicate
- (conceptually) corrects _sh_propagate(), where the comment says that
"something valid" is needed (the only call path not passing in RAM_RW
would pass in INVALID_GFN along with MMIO_DM),
- is benign to the use in sh_page_fault(), where the subsequent
mfn_valid() check would otherwise cause the same bail-out code path to
be taken,
- is benign to all three uses in p2m_pt_get_entry(), as MMIO_DM entries
will only ever yield non-present entries, which are being checked for
earlier,
- is benign to sh_unshadow_for_p2m_change(), for the same reason,
- is benign to gnttab_transfer() with EPT not in use, again because
MMIO_DM entries will only ever yield non-present entries, and
INVALID_MFN is returned for those anyway by p2m_pt_get_entry().
- for gnttab_transfer() with EPT in use (conceptually) corrects the
corner case of a page first being subject to XEN_DMOP_set_mem_type
converting a RAM type to MMIO_DM (which retains the MFN in the entry),
and then being subject to GNTTABOP_transfer, except that steal_page()
would later make the operation fail unconditionally anyway.
While there also drop the unused (and otherwise now redundant)
p2m_has_emt().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 11 Mar 2025 08:55:20 +0000 (09:55 +0100)]
x86/P2M: correct old entry checking in p2m_remove_entry()
Using p2m_is_valid() isn't quite right here. It expanding to RAM+MMIO,
the subsequent p2m_mmio_direct check effectively reduces its use to
RAM+MMIO_DM. Yet MMIO_DM entries, which are never marked present in the
page tables, won't pass the mfn_valid() check. It is, however, quite
plausible (and supported by the rest of the function) to permit
"removing" hole entries, i.e. in particular to convert MMIO_DM to
INVALID. Which leaves the original check to be against RAM (plus MFN
validity), while HOLE then instead wants INVALID_MFN to be passed in.
Furthermore grant and foreign entries (together with RAM becoming
ANY_RAM) as well as BROKEN want the MFN checking, too.
All other types (i.e. MMIO_DIRECT and POD) want rejecting here rather
than skipping, for needing handling / accounting elsewhere.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 11 Mar 2025 08:54:19 +0000 (09:54 +0100)]
PCI: drop pci_segments_init()
Have callers invoke pci_add_segment() directly instead: With radix tree
initialization moved out of the function, its name isn't quite
describing anymore what it actually does.
On x86 move the logic into __start_xen() itself, to reduce the risk of
re-introducing ordering issues like the one which was addressed by 26fe09e34566 ("radix-tree: introduce RADIX_TREE{,_INIT}()").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
Roger Pau Monne [Mon, 10 Mar 2025 17:41:57 +0000 (18:41 +0100)]
automation/cirrus-ci: store xen/.config as an artifact
Always store xen/.config as an artifact, renamed to xen-config to match
the naming used in the Gitlab CI tests.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Juergen Gross [Thu, 6 Mar 2025 07:47:52 +0000 (08:47 +0100)]
docs: fix INTRODUCE description in xenstore.txt
The description of the Xenstore INTRODUCE command is still referencing
xend. Fix that.
The <evtchn> description is starting with a grammatically wrong
sentence. Fix that.
While at it, make clear that the Xenstore implementation is allowed
to ignore the specified gfn and use the Xenstore reserved grant id
GNTTAB_RESERVED_XENSTORE instead.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 7 Mar 2025 14:24:42 +0000 (14:24 +0000)]
xen/watchdog: Identify which domain watchdog fired
When a watchdog fires, the domain is crashed and can't dump any state.
Xen allows a domain to have two separate watchdogs. Therefore, for a
domain running multiple watchdogs (e.g. one based around network, one
for disk), it is important for diagnostics to know which watchdog
fired.
As the printk() is in a timer callback, this is a bit awkward to
arrange, but there are 12 spare bits in the bottom of the domain
pointer owing to its alignment.
Reuse these bits to encode the watchdog id too, so the one which fired
is identified when the domain is crashed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com>
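The idea of carrying a small id in the low bits of a page-aligned pointer can be sketched as follows. The type and function names are hypothetical, and a 4k page size is assumed; this is an illustration of the technique, not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Hypothetical page-aligned object standing in for struct domain. */
struct domain_sketch {
    _Alignas(PAGE_SIZE) int domain_id;
};

static struct domain_sketch demo_domain = { .domain_id = 7 };

/* Store a small id in the 12 low bits left free by the alignment. */
static void *pack_watchdog(struct domain_sketch *d, unsigned int id)
{
    assert(((uintptr_t)d & (PAGE_SIZE - 1)) == 0);
    assert(id < PAGE_SIZE);

    return (void *)((uintptr_t)d | id);
}

static struct domain_sketch *unpack_domain(void *p)
{
    return (struct domain_sketch *)((uintptr_t)p & ~(uintptr_t)(PAGE_SIZE - 1));
}

static unsigned int unpack_id(void *p)
{
    return (uintptr_t)p & (PAGE_SIZE - 1);
}
```

A timer callback receiving the packed value can recover both the domain and the watchdog id without any extra storage.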
Andrew Cooper [Fri, 7 Mar 2025 16:38:26 +0000 (16:38 +0000)]
xen/domain: Initialise the domain handle before inserting into the domlist
As soon as the domain is in the domlist, it can be queried via various
means, ahead of being fully constructed. Ensure it has the toolstack-given
UUID prior to becoming visible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 7 Mar 2025 15:16:29 +0000 (15:16 +0000)]
CI: Drop the now-obsolete 11-riscv64.dockerfile
Fixes: bd9bda50553b ("automation: drop debian:11-riscv64 container") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Jiqian Chen [Mon, 24 Feb 2025 03:24:33 +0000 (11:24 +0800)]
vpci: Add resizable bar support
Some devices, like AMDGPU, support resizable bar capability,
but vpci of Xen doesn't support this feature, so they fail
to resize bars and then cause probing failure.
According to PCIe spec, each bar that supports resizing has
two registers, PCI_REBAR_CAP and PCI_REBAR_CTRL. So, add
handlers to support resizing the size of BARs.
Note that Xen will only trap PCI_REBAR_CTRL, as PCI_REBAR_CAP
is read-only register and the hardware domain already gets
access to it without needing any setup.
tools/hvmloader: Replace LAPIC_ID() with cpu_to_apicid[]
Replace uses of the LAPIC_ID() macro with accesses to the
cpu_to_apicid[] lookup table. This table contains the APIC IDs of each
vCPU as probed at runtime rather than assuming a predefined relation.
Moved smp_initialise() ahead of apic_setup() in order to initialise
cpu_to_apicid ASAP and avoid using it uninitialised. Note that bringing
up the APs doesn't need the APIC in hvmloader because it always runs
virtualized and uses the PV interface.
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Jan Beulich <jbeulich@suse.com>
tools/hvmloader: Retrieve APIC IDs from the APs themselves
Make it so the APs expose their own APIC IDs in a lookup table (LUT). We
can use that LUT to populate the MADT, decoupling the algorithm that
relates CPU IDs and APIC IDs from hvmloader.
Modified the printf to also print the APIC ID of each CPU, as well as
fixing a (benign) wrong specifier being used for the vcpu id.
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 28 Nov 2024 00:47:37 +0000 (00:47 +0000)]
x86/vlapic: Drop vlapic->esr_lock
The exact behaviour of LVTERR interrupt generation is implementation
specific.
* Newer Intel CPUs generate an interrupt when pending_esr becomes
nonzero.
* Older Intel and all AMD CPUs generate an interrupt when any
individual bit in pending_esr becomes nonzero.
Neither vendor documents their behaviour very well. Xen implements
the per-bit behaviour and has done since support was added.
Importantly, the per-bit behaviour can be expressed using the atomic
operations available in the x86 architecture, whereas the
former (interrupt only on pending_esr becoming nonzero) cannot.
With vlapic->hw.pending_esr held outside of the main LAPIC register page,
it's much easier to use atomic operations.
Use xchg() in vlapic_reg_write(), and *set_bit() in vlapic_error().
The only interesting change is that vlapic_error() now needs to take a
single bit only, rather than a mask, but this is fine for all current
callers and foreseeable changes.
No change from a guest's perspective.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
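A lock-free version of the per-bit behaviour can be sketched with C11 atomics, used here as portable stand-ins for Xen's xchg() and *set_bit() primitives. The structure and function names are hypothetical and the model is heavily reduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the virtual LAPIC error state. */
struct vlapic_sketch {
    _Atomic uint32_t pending_esr;  /* errors accumulated since last write */
    uint32_t esr;                  /* value the guest reads back */
};

/*
 * Record a single error bit.  Returns true when the bit was newly set,
 * i.e. when the per-bit model would raise an LVTERR interrupt.
 */
static bool vlapic_error_sketch(struct vlapic_sketch *v, unsigned int bit)
{
    uint32_t mask;

    assert(bit < 32);
    mask = UINT32_C(1) << bit;

    return !(atomic_fetch_or(&v->pending_esr, mask) & mask);
}

/* Guest write to ESR: the value is discarded and pending errors latch. */
static void esr_write_sketch(struct vlapic_sketch *v, uint32_t val)
{
    (void)val;
    v->esr = atomic_exchange(&v->pending_esr, 0);
}
```

Each operation is a single atomic read-modify-write, which is why no separate lock is needed in this model.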
Andrew Cooper [Thu, 28 Nov 2024 00:47:36 +0000 (00:47 +0000)]
x86/vlapic: Fix handling of writes to APIC_ESR
Xen currently presents APIC_ESR to guests as a simple read/write register.
This is incorrect. The SDM states:
The ESR is a write/read register. Before attempt to read from the ESR,
software should first write to it. (The value written does not affect the
values read subsequently; only zero may be written in x2APIC mode.) This
write clears any previously logged errors and updates the ESR with any
errors detected since the last write to the ESR.
Introduce a new pending_esr field in hvm_hw_lapic.
Update vlapic_error() to accumulate errors here, and extend vlapic_reg_write()
to discard the written value and transfer pending_esr into APIC_ESR. Reads
are still as before.
Importantly, this means that a guest no longer destroys the ESR value it's
looking for in the LVTERR handler when following the SDM instructions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 25 Jun 2024 16:23:12 +0000 (17:23 +0100)]
x86/vmx: Rewrite vmx_sync_pir_to_irr() to be more efficient
There are two issues. First, pi_test_and_clear_on() pulls the cache-line to
the CPU and dirties it even if there's nothing outstanding, and second,
bitmap_for_each() is O(256) when O(8) would do, and would avoid multiple
atomic updates to the same IRR word.
Rewrite it from scratch, explaining what's going on at each step.
Bloat-o-meter reports 177 -> 145 (net -32), but the real improvement is the
removal of calls to __find_{first,next}_bit() hidden behind bitmap_for_each().
No functional change.
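A rough sketch of the word-at-a-time approach, using C11 atomics. The descriptor layout, the ex_* names and the flat IRR array are assumptions made for illustration; this is not the code from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define EX_NR_VECTORS 256
#define EX_PIR_WORDS  (EX_NR_VECTORS / 64) /* 4 x 64-bit PIR words */
#define EX_IRR_WORDS  (EX_NR_VECTORS / 32) /* 8 x 32-bit IRR words */

static_assert(EX_IRR_WORDS == 2 * EX_PIR_WORDS, "two IRR words per PIR word");

/* Hypothetical posted-interrupt descriptor. */
struct ex_pi_desc {
    _Atomic uint64_t pir[EX_PIR_WORDS];
    _Atomic uint32_t control; /* bit 0: outstanding notification (ON) */
};

void ex_sync_pir_to_irr(struct ex_pi_desc *pi,
                        _Atomic uint32_t irr[EX_IRR_WORDS])
{
    /* Read-only check first, so an idle descriptor is not dirtied. */
    if ( !(atomic_load(&pi->control) & 1) )
        return;

    atomic_fetch_and(&pi->control, ~UINT32_C(1));

    for ( unsigned int i = 0; i < EX_PIR_WORDS; i++ )
    {
        uint64_t pending;
        uint32_t lo, hi;

        if ( !atomic_load(&pi->pir[i]) )
            continue;

        /* Grab and clear a whole PIR word in one atomic operation. */
        pending = atomic_exchange(&pi->pir[i], 0);
        lo = (uint32_t)pending;
        hi = (uint32_t)(pending >> 32);

        /* At most one atomic update per IRR word. */
        if ( lo )
            atomic_fetch_or(&irr[2 * i], lo);
        if ( hi )
            atomic_fetch_or(&irr[2 * i + 1], hi);
    }
}
```

The loop touches each IRR word at most once, instead of issuing one atomic bit operation per pending vector.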
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 18 Feb 2025 23:01:11 +0000 (23:01 +0000)]
xen/domain: Annotate struct domain as page aligned
struct domain is always a page aligned allocation. Update its type to
reflect this, so we can safely reuse the lower bits in the pointer for
auxiliary information.
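As an illustration of why the alignment matters, here is a minimal sketch of packing a small tag into the low bits of a page-aligned pointer. The struct, the 4096-byte page size and the ex_* helpers are hypothetical; the aligned attribute is a GCC/Clang extension.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define EX_PAGE_SIZE 4096u

/* Hypothetical page-aligned structure. */
struct ex_domain {
    unsigned int domid;
} __attribute__((__aligned__(EX_PAGE_SIZE)));

/* Pack a tag into the low bits, which the alignment guarantees are zero. */
uintptr_t ex_pack(const struct ex_domain *d, unsigned int tag)
{
    assert(tag < EX_PAGE_SIZE);
    assert(((uintptr_t)d & (EX_PAGE_SIZE - 1)) == 0);

    return (uintptr_t)d | tag;
}

/* Recover the pointer by masking the tag bits off. */
struct ex_domain *ex_unpack_ptr(uintptr_t v)
{
    return (struct ex_domain *)(v & ~(uintptr_t)(EX_PAGE_SIZE - 1));
}

/* Recover the tag from the low bits. */
unsigned int ex_unpack_tag(uintptr_t v)
{
    return (unsigned int)(v & (EX_PAGE_SIZE - 1));
}
```

Stating the alignment in the type lets the compiler, and readers, rely on the low 12 bits of any such pointer being zero.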
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Fri, 3 Jan 2025 15:16:45 +0000 (15:16 +0000)]
x86/IDT: Don't rewrite bsp_idt[] at boot time
Now that bsp_idt[] is constructed at build time, we do not need to manually
initialise it in init_idt_traps() and trap_init().
When swapping the early pagefault handler for the normal one, switch to using
_update_gate_addr_lower() as we do on the kexec path for NMI and #MC.
This in turn allows us to drop set_{intr,swint}_gate() and the underlying
infrastructure. It also lets us drop autogen_entrypoints[] and that
underlying infrastructure.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 3 Jan 2025 14:44:19 +0000 (14:44 +0000)]
x86/IDT: Generate bsp_idt[] at build time
... rather than dynamically at boot time. Aside from less runtime overhead,
this approach is less fragile than the preexisting autogen stubs mechanism.
We can manage this with some linker calculations. See patch comments for full
details.
For simplicity, we create a new set of entry stubs here, and clean up the old
ones in the subsequent patch. bsp_idt[] needs to move from .bss to .data.
No functional change yet; the boot path still (re)writes bsp_idt[] at this
juncture.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 2 Jan 2025 17:17:30 +0000 (17:17 +0000)]
x86/IDT: Rename idt_table[] to bsp_idt[]
Having variables named idt_table[] and idt_tables[] is not ideal.
Use X86_IDT_VECTORS and remove IDT_ENTRIES. State the size of bsp_idt[] in
idt.h so that load_system_tables() and cpu_smpboot_alloc() can use sizeof()
rather than opencoding the calculation.
Move the variable into a new traps-setup.c, to make a start at splitting
traps.c in half.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 7 Mar 2025 10:11:41 +0000 (11:11 +0100)]
xen/events: fix global virq handling
VIRQs are split into "global" and "per vcpu" ones. Unfortunately in
reality there are "per domain" ones, too.
send_global_virq() and set_global_virq_handler() only make sense for
the real "global" ones, so replace virq_is_global() with a new
function get_virq_type() returning one of the 3 possible types (global,
domain, vcpu VIRQ).
To make its intended purpose more clear, also rename
send_guest_global_virq() to send_guest_domain_virq().
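A small sketch of the three-way classification. The enum names, the VIRQ numbers and which VIRQ falls in which class are assumptions made purely for illustration; only the idea of replacing a boolean with a three-valued type comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>

/* The three possible VIRQ types. */
enum ex_virq_type { EX_TYPE_GLOBAL, EX_TYPE_DOMAIN, EX_TYPE_VCPU };

/* Hypothetical VIRQ numbers, for illustration only. */
enum ex_virq {
    EX_VIRQ_TIMER,
    EX_VIRQ_CONSOLE,
    EX_VIRQ_DOM_EXC,
    EX_VIRQ_ARGO,
    EX_NR_VIRQS
};

enum ex_virq_type ex_get_virq_type(unsigned int virq)
{
    assert(virq < EX_NR_VIRQS);

    switch ( virq )
    {
    case EX_VIRQ_TIMER:
        return EX_TYPE_VCPU;   /* assumed per-vcpu */

    case EX_VIRQ_ARGO:
        return EX_TYPE_DOMAIN; /* assumed per-domain */

    default:
        return EX_TYPE_GLOBAL;
    }
}

/* A global handler only makes sense for truly global VIRQs. */
bool ex_global_handler_allowed(unsigned int virq)
{
    return ex_get_virq_type(virq) == EX_TYPE_GLOBAL;
}
```

A boolean "is global" test would lump per-domain VIRQs in with one of the other two classes, which is the confusion the new function avoids.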
Fixes: 980822c5edd1 ("xen/events: allow setting of global virq handler only for unbound virqs") Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Thu, 6 Mar 2025 16:23:36 +0000 (17:23 +0100)]
xen/events: fix get_global_virq_handler() usage without hardware domain
Some use cases of get_global_virq_handler() didn't account for the
case of running without hardware domain.
Fix that by testing get_global_virq_handler() returning NULL where
needed (e.g. when directly dereferencing the result).
Fixes: 980822c5edd1 ("xen/events: allow setting of global virq handler only for unbound virqs") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 6 Mar 2025 14:21:52 +0000 (15:21 +0100)]
XSM: correct xsm_get_domain_state()
Add the missing first parameter and move it next to a close relative.
Fixes: 3ad3df1bd0aa ("xen: add new domctl get_domain_state") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Thu, 6 Mar 2025 13:03:37 +0000 (14:03 +0100)]
xen/public: remove some unused defines from xs_wire.h
xs_wire.h contains some defines XS_WRITE_* which seem to be leftovers
from some decades ago. They haven't been used in the Xen tree since at
least Xen 2.0 and they make no sense anyway.
Remove them, as they seem not to be related to any Xen interface we
have today.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Supported ISA extensions are specified in the device tree within the CPU
node, using two properties: `riscv,isa-extensions` and `riscv,isa`.
Currently, Xen does not support the `riscv,isa-extensions` property; support
will be added in the future.
The `riscv,isa` property is parsed for each CPU, and the common extensions
are stored in the `host_riscv_isa` bitmap.
This bitmap is then used by `riscv_isa_extension_available()` to check
if a specific extension is supported.
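The "common extensions" idea can be sketched as a bitmap intersection across CPUs. This is a simplified model: a single 64-bit word stands in for the bitmap, extension IDs are just letter offsets, and every ex_* name is hypothetical rather than taken from Xen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_ISA_EXT_MAX 64

/* Hypothetical extension ID scheme: single letters 'a'..'z' map to 0..25. */
#define EX_EXT(c) (UINT64_C(1) << ((c) - 'a'))

static uint64_t ex_host_isa;   /* extensions common to all CPUs seen so far */
static bool ex_host_isa_valid; /* false until the first CPU is merged */

/* Fold one CPU's parsed ISA bitmap into the host-wide common set. */
void ex_merge_cpu_isa(uint64_t cpu_isa)
{
    if ( !ex_host_isa_valid )
    {
        ex_host_isa = cpu_isa;
        ex_host_isa_valid = true;
    }
    else
        ex_host_isa &= cpu_isa;
}

/* Check whether an extension is available on every CPU. */
bool ex_isa_extension_available(unsigned int bit)
{
    assert(bit < EX_ISA_EXT_MAX);

    return (ex_host_isa >> bit) & 1;
}
```

An extension reported by only some CPUs drops out of the common set, so later availability checks hold regardless of which CPU the code runs on.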
The current implementation is based on Linux kernel v6.12-rc3
implementation with the following changes:
- Drop unconditional setting of {RISCV_ISA_EXT_ZICSR,
RISCV_ISA_EXT_ZIFENCEI, RISCV_ISA_EXT_ZICNTR, RISCV_ISA_EXT_ZIHPM} because
Xen is going to run on hardware produced after the aforementioned
extensions were split out of "i".
- Remove saving of the ISA for each CPU, only the common available ISA is
saved.
- Remove ACPI-related code as ACPI is not supported by Xen.
- Drop handling of elf_hwcap, since Xen does not provide hwcap to
userspace.
- Replace of_cpu_device_node_get() API, which is not available in Xen,
with a combination of dt_for_each_child_node(), dt_device_type_is_equal(),
and dt_get_cpuid_from_node() to retrieve cpuid and riscv,isa in
riscv_fill_hwcap_from_isa_string().
- Rename arguments of __RISCV_ISA_EXT_DATA() from _name to ext_name, and
_id to ext_id for clarity.
- Replace instances of __RISCV_ISA_EXT_DATA with RISCV_ISA_EXT_DATA.
- Replace instances of __riscv_isa_extension_available with
riscv_isa_extension_available for consistency. Also, update the type of
`bit` argument of riscv_isa_extension_available().
- Redefine RISCV_ISA_EXT_DATA() to work only with ext_name and ext_id,
as other fields are not used in Xen currently. Also RISCV_ISA_EXT_DATA()
is reworked to take only one argument, `ext_name`.
- Add a check of the first 4 letters of the riscv,isa string to
riscv_isa_parse_string(), as Xen doesn't do this check beforehand, so it is
necessary to check the correctness of the riscv,isa string (it should start
with rv{32,64}, taking into account upper and lower case of "rv").
Additionally, check that 'i' follows 'rv{32,64}' to be sure that
`out_bitmap` can't be empty.
- Drop an argument of riscv_fill_hwcap() and riscv_fill_hwcap_from_isa_string()
as it isn't used at the moment.
- Update the comment message about QEMU workaround.
- Apply Xen coding style.
- s/pr_info/printk.
- Drop handling of uppercase letters of riscv,isa in riscv_isa_parse_string() as
Xen checks that riscv,isa should be in lowercase according to the device tree
bindings.
- Update logic of riscv_isa_parse_string(): it now stops parsing riscv,isa
if an illegal symbol is found, instead of ignoring it.
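The validation rules from the list above (lowercase rv32/rv64 prefix, a mandatory 'i', and stopping on an illegal symbol) can be sketched as follows. This is a deliberately simplified parser that assumes ASCII, handles only single-letter extensions and stops at the first '_'; ex_parse_isa() is a hypothetical name, not the Xen function.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Parse the single-letter part of a riscv,isa string into a bitmap where
 * bit (letter - 'a') is set for each extension seen.
 * Returns 0 and fills *out on success, -1 on malformed input.
 */
int ex_parse_isa(const char *isa, uint64_t *out)
{
    uint64_t bits = 0;

    assert(isa != NULL && out != NULL);

    /* The string must start with lowercase "rv32" or "rv64". */
    if ( strncmp(isa, "rv32", 4) && strncmp(isa, "rv64", 4) )
        return -1;
    isa += 4;

    /* 'i' must come next, so the resulting bitmap can never be empty. */
    if ( *isa != 'i' )
        return -1;

    /* Multi-letter extensions after '_' are not handled in this sketch. */
    for ( ; *isa && *isa != '_'; isa++ )
    {
        /* Stop with an error on an illegal symbol rather than skipping it. */
        if ( *isa < 'a' || *isa > 'z' )
            return -1;

        bits |= UINT64_C(1) << (*isa - 'a');
    }

    *out = bits;

    return 0;
}
```

Rejecting the whole string on the first bad character means a malformed device tree is reported rather than silently yielding a partial extension set.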
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Extensions 'f' and 'd' aren't really needed for Xen, and allowing floating
point registers to be used can lead to crashes.
Extensions 'i', 'm', 'a', 'zicsr', and 'zifencei' are necessary for the
operation of Xen, which is why they are used explicitly (unconditionally)
in -march.
Drop "Base ISA" choice from riscv/Kconfig as it is always empty.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
There are two reasons for that:
1. In the README, GCC baseline is chosen to be 12.2, whereas Debian 11
uses GCC 10.2.1.
2. Xen requires some Z extensions, but GCC 10.2.1 does not
support Z extensions in -march, causing the compilation to fail.
Jan Beulich [Thu, 6 Mar 2025 12:57:21 +0000 (13:57 +0100)]
VMX: convert vmx_basic_msr
... to a struct field, which is then going to be accompanied by other
capability/control data presently living in individual variables. As
this structure isn't supposed to be altered post-boot, put it in
.data.ro_after_init right away.
Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 6 Mar 2025 12:56:49 +0000 (13:56 +0100)]
VMX: drop vmcs_revision_id
It's effectively redundant with vmx_basic_msr. For the #define
replacement to work, struct vmcs_struct's respective field name also
needs to change: Drop the not really meaningful "vmcs_" prefix from it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 6 Mar 2025 12:56:21 +0000 (13:56 +0100)]
x86/HVM: improve CET-IBT pruning of ENDBR
__init{const,data}_cf_clobber can have an effect only for pointers
actually populated in the respective tables. While not the case for SVM
right now, VMX installs a number of pointers only under certain
conditions. Hence the respective functions would have their ENDBR purged
only when those conditions are met. Invoke "pruning" functions after
having copied the respective tables, for them to install any "missing"
pointers.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Juergen Gross [Thu, 6 Mar 2025 12:54:55 +0000 (13:54 +0100)]
tools/xenstored: use new stable interface instead of libxenctrl
Replace the current use of the unstable xc_domain_getinfo_single()
interface with the stable domctl XEN_DOMCTL_get_domain_state call
via the new libxenmanage library.
This will remove the last usage of libxenctrl by Xenstore, so update
the library dependencies accordingly.
For now only do a direct replacement without using the functionality
of obtaining information about domains having changed the state.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Juergen Gross [Thu, 6 Mar 2025 12:52:38 +0000 (13:52 +0100)]
xen: add new domctl get_domain_state
Add a new domctl sub-function to get data of a domain having changed
state (this is needed by Xenstore).
The returned state just contains the domid, the domain unique id,
and some flags (existing, shutdown, dying).
In order to enable Xenstore stubdom being built for multiple Xen
versions, make this domctl stable. For stable domctls the
interface_version is always 0.
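To make the shape of the returned data concrete, here is a hypothetical sketch of such a state record and how a consumer might interpret the flags. The field layout, flag values and all ex_* names are assumptions for illustration and do not reflect the real public ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values. */
#define EX_STATE_EXIST    (1u << 0)
#define EX_STATE_SHUTDOWN (1u << 1)
#define EX_STATE_DYING    (1u << 2)

/* Hypothetical record: domid, unique id and state flags. */
struct ex_domain_state {
    uint16_t domid;
    uint16_t flags;     /* EX_STATE_* */
    uint64_t unique_id; /* unique id of the domain */
};

enum ex_summary { EX_GONE, EX_DYING, EX_SHUTDOWN, EX_RUNNING };

/* Reduce the flags to the single most relevant condition. */
enum ex_summary ex_summarise(const struct ex_domain_state *s)
{
    assert(s != NULL);

    if ( !(s->flags & EX_STATE_EXIST) )
        return EX_GONE;
    if ( s->flags & EX_STATE_DYING )
        return EX_DYING;
    if ( s->flags & EX_STATE_SHUTDOWN )
        return EX_SHUTDOWN;

    return EX_RUNNING;
}
```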
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>