Roger Pau Monne [Wed, 5 Mar 2025 10:53:20 +0000 (11:53 +0100)]
xen: remove -N from the linker command line
It's unclear why -N is being used in the first place. It was added by
commit 40828c657dd0c back in 2004 without any justification.
When building a PE image it's actually detrimental to fercefully set the
.text section as writable. The GNU LD man page contains the following
warning regarding the -N option:
> Note: Although a writable text section is allowed for PE-COFF targets, it
> does not conform to the format specification published by Microsoft.
Remove the usage of -N uniformly on all architectures, assuming that the
addition was simply done as a copy and paste or the original x86 linking
rune.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 5 Mar 2025 17:08:13 +0000 (18:08 +0100)]
x86/mkelf32: account for offset when detecting note phadr placement
mkelf32 attempt to check that the program header defined NOTE segment falls
inside of the LOAD segment, as the build-id should be loaded for Xen at
runtime to check.
However the current code doesn't take into account the LOAD program header
segment offset when calculating overlap with the NOTE segment. This
results in incorrect detection, and the following build error:
arch/x86/boot/mkelf32 --notes xen-syms ./.xen.elf32 0x200000 \
`nm xen-syms | sed -ne 's/^\([^ ]*\) . __2M_rwdata_end$/0x\1/p'`
Expected .note section within .text section!
Offset 4244776 not within 2910364!
Roger Pau Monne [Wed, 12 Feb 2025 10:37:50 +0000 (11:37 +0100)]
x86/dom0: be less restrictive with the Interrupt Address Range
Xen currently prevents dom0 from creating CPU or IOMMU page-table mappings
into the interrupt address range [0xfee00000, 0xfeefffff]. This range has
two different purposes. For accesses from the CPU is contains the default
position of local APIC page at 0xfee00000. For accesses from devices
it's the MSI address range, so the address field in the MSI entries
(usually) point to an address on that range to trigger an interrupt.
There are reports of Lenovo Thinkpad devices placing what seems to be the
UCSI shared mailbox at address 0xfeec2000 in the interrupt address range.
Attempting to use that device with a Linux PV dom0 leads to an error when
Linux kernel maps 0xfeec2000:
Remove the restrictions to create mappings in the interrupt address range
for dom0. Note that the restriction to map the local APIC page is enforced
separately, and that continues to be present. Additionally make sure the
emulated local APIC page is also not mapped, in case dom0 is using it.
Note that even if the interrupt address range entries are populated in the
IOMMU page-tables no device access will reach those pages. Device accesses
to the Interrupt Address Range will always be converted into Interrupt
Messages and are not subject to DMA remapping.
There's also the following restriction noted in Intel VT-d:
> Software must not program paging-structure entries to remap any address to
> the interrupt address range. Untranslated requests and translation requests
> that result in an address in the interrupt range will be blocked with
> condition code LGN.4 or SGN.8. Translated requests with an address in the
> interrupt address range are treated as Unsupported Request (UR).
Similarly for AMD-Vi:
> Accesses to the interrupt address range (Table 3) are defined to go through
> the interrupt remapping portion of the IOMMU and not through address
> translation processing. Therefore, when a transaction is being processed as
> an interrupt remapping operation, the transaction attribute of
> pretranslated or untranslated is ignored.
>
> Software Note: The IOMMU should
> not be configured such that an address translation results in a special
> address such as the interrupt address range.
However those restrictions don't apply to the identity mappings possibly
created for dom0, since the interrupt address range is never subject to DMA
remapping, and hence there's no output address after translation that
belongs to the interrupt address range.
Roger Pau Monne [Fri, 14 Feb 2025 09:39:29 +0000 (10:39 +0100)]
x86/iommu: account for IOMEM caps when populating dom0 IOMMU page-tables
The current code in arch_iommu_hwdom_init() kind of open-codes the same
MMIO permission ranges that are added to the hardware domain ->iomem_caps.
Avoid this duplication and use ->iomem_caps in arch_iommu_hwdom_init() to
filter which memory regions should be added to the dom0 IOMMU page-tables.
Note the IO-APIC and MCFG page(s) must be set as not accessible for a PVH
dom0, otherwise the internal Xen emulation for those ranges won't work.
This requires adjustments in dom0_setup_permissions().
The call to pvh_setup_mmcfg() in dom0_construct_pvh() must now strictly be
done ahead of setting up dom0 permissions, so take the opportunity to also
put it inside the existing is_hardware_domain() region.
Also the special casing of E820_UNUSABLE regions no longer needs to be done
in arch_iommu_hwdom_init(), as those regions are already blocked in
->iomem_caps and thus would be removed from the rangeset as part of
->iomem_caps processing in arch_iommu_hwdom_init(). The E820_UNUSABLE
regions below 1Mb are not removed from ->iomem_caps, that's a slight
difference for the IOMMU created page-tables, but the aim is to allow
access to the same memory either from the CPU or the IOMMU page-tables.
Since ->iomem_caps already takes into account the domain max paddr, there's
no need to remove any regions past the last address addressable by the
domain, as applying ->iomem_caps would have already taken care of that.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Tue, 18 Feb 2025 16:57:49 +0000 (17:57 +0100)]
x86/dom0: correctly set the maximum ->iomem_caps bound for PVH
The logic in dom0_setup_permissions() sets the maximum bound in
->iomem_caps unconditionally using paddr_bits, which is not correct for HVM
based domains. Instead use domain_max_paddr_bits() to get the correct
maximum paddr bits for each possible domain type.
Switch to using PFN_DOWN() instead of PAGE_SHIFT, as that's shorter.
Fixes: 53de839fb409 ('x86: constrain MFN range Dom0 may access') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 13 Feb 2025 09:58:45 +0000 (10:58 +0100)]
x86/dom0: attempt to fixup p2m page-faults for PVH dom0
When building a PVH dom0 Xen attempts to map all (relevant) MMIO regions
into the p2m for dom0 access. However the information Xen has about the
host memory map is limited. Xen doesn't have access to any resources
described in ACPI dynamic tables, and hence the p2m mappings provided might
not be complete.
PV doesn't suffer from this issue because a PV dom0 is capable of mapping
into it's page-tables any address not explicitly banned in d->iomem_caps.
Introduce a new command line options that allows Xen to attempt to fixup
the p2m page-faults, by creating p2m identity maps in response to p2m
page-faults.
This is aimed as a workaround to small ACPI regions Xen doesn't know about.
Note that missing large MMIO regions mapped in this way will lead to
slowness due to the VM exit processing, plus the mappings will always use
small pages.
The ultimate aim is to attempt to bring better parity with a classic PV
dom0.
Note such fixup rely on the CPU doing the access to the unpopulated
address. If the access is attempted from a device instead there's no
possible way to fixup, as IOMMU page-fault are asynchronous.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Only slightly tested on my local PVH dom0 deployment.
---
Changes since v1:
- Make the fixup function static.
- Print message in case mapping already exists.
Roger Pau Monne [Thu, 13 Feb 2025 08:08:01 +0000 (09:08 +0100)]
x86/emul: dump unhandled memory accesses for PVH dom0
A PV dom0 can map any host memory as long as it's allowed by the IO
capability range in d->iomem_caps. On the other hand, a PVH dom0 has no
way to populate MMIO region onto it's p2m, so it's limited to what Xen
initially populates on the p2m based on the host memory map and the enabled
device BARs.
Introduce a new debug build only printk that reports attempts by dom0 to
access addresses not populated on the p2m, and not handled by any emulator.
This is for information purposes only, but might allow getting an idea of
what MMIO ranges might be missing on the p2m.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 1 Jan 2025 15:43:20 +0000 (15:43 +0000)]
x86/IDT: Collect IDT related content idt.h
Logic concerning the IDT is somewhat different to the other system tables, and
in particular ought not to be in asm/processor.h. Collect it together a new
header.
While doing so, make a few minor adjustments:
* Make set_ist() use volatile rather than ACCESS_ONCE(), as
_write_gate_lower() already does, removing the need for xen/lib.h.
* Move the BUILD_BUG_ON() from subarch_percpu_traps_init() into mm.c's
build_assertions(), rather than including idt.h into x86_64/traps.c.
* Drop UL from IST constants.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 30 Dec 2024 06:41:46 +0000 (06:41 +0000)]
xen: Don't cast away const-ness in vcpu_show_registers()
The final hunk is `(struct vcpu *)v` in disguise, expressed using a runtime
pointer chase through memory and a technicality of the C type system to work
around the fact that get_hvm_registers() strictly requires a mutable pointer.
For anyone interested, this is one reason why C cannot optimise any reads
across sequence points, even for a function purporting to take a const object.
Anyway, have the function correctly state that it needs a mutable vcpu. All
callers have a mutable vCPU to hand, and it removes the runtime pointer chase
in x86.
Make one style adjustment in ARM while adjusting the parameter type.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Nicola Vetrini [Tue, 4 Mar 2025 17:49:36 +0000 (18:49 +0100)]
automation/eclair: Reduce verbosity of ECLAIR logs.
While activating verbose logging simplifies debugging, this causes
GitLab logs to be truncated, preventing the links to the ECLAIR
analysis database to be shown.
Andrew Cooper [Mon, 3 Mar 2025 14:06:55 +0000 (14:06 +0000)]
CHANGELOG.md: Set release date for 4.20
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
common: remove -fno-stack-protector from EMBEDDED_EXTRA_CFLAGS
This patch is preparation for making stack protector
configurable. First step is to remove -fno-stack-protector flag from
EMBEDDED_EXTRA_CFLAGS so separate components (Hypervisor in this case)
can enable/disable this feature by themselves.
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Fix platforms/Kconfig and Kconfig.debug help indent to respect the
standard (tab + 2 spaces).
While there also move some default in Kconfig.debug before the help
message.
Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Michal Orzel [Mon, 3 Mar 2025 08:56:50 +0000 (09:56 +0100)]
xen/arm: Don't blindly print hwdom in generic panic messages
These functions are generic and used not only for hardware domain. This
creates confusion when printing any of these panic messages (e.g.
failure when loading domU kernel would result in informing a user about
a failure in loading hwdom kernel).
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Michal Orzel [Mon, 3 Mar 2025 08:56:49 +0000 (09:56 +0100)]
xen/arm: static-shmem: Drop unused size_cells
Value stored in size_cells is never read because we're only interested
in retrieving gbase address of shmem region for which we only need
address cells.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Michal Orzel [Mon, 3 Mar 2025 08:56:48 +0000 (09:56 +0100)]
xen/arm: Check return code from fdt_finish_reservemap()
fdt_finish_reservemap() may fail (with -FDT_ERR_NOSPACE) in which case
further DTB creation (in prepare_dtb_hwdom()) makes no sense. Fix it.
Fixes: 13bb63b754e4 ("device tree,arm: supply a flat device tree to dom0") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Michal Orzel [Mon, 3 Mar 2025 08:56:47 +0000 (09:56 +0100)]
xen/arm: dm: Bail out if padding != 0 for XEN_DMOP_set_irq_level
XEN_DMOP_set_irq_level operation requires elements of pad array (being
member of xen_dm_op_set_irq_level structure) to be 0. While handling the
hypercall we validate this. If one of the elements is not zero, we set
rc to -EINVAL. At this point we should stop further DM handling and bail
out propagating the error to the caller. However, instead of goto the
code uses break which has basically no meaningful effect. The rc value
is never read and the code continues with the hypercall processing ending
up (possibly) with the interrupt injection. Fix it.
Luca Fancellu [Wed, 26 Feb 2025 21:52:56 +0000 (21:52 +0000)]
xen/arm: Don't use copy_from_paddr for DTB relocation
Currently the early stage of the Arm boot maps the DTB using
early_fdt_map() using PAGE_HYPERVISOR_RO which is cacheable
read-only memory, later during DTB relocation the function
copy_from_paddr() is used to map pages in the same range on
the fixmap but using PAGE_HYPERVISOR_WC which is non-cacheable
read-write memory.
The Arm specifications, ARM DDI0487L.a, section B2.11 "Mismatched
memory attributes" discourage using mismatched attributes for
aliases of the same location.
Given that there is nothing preventing the relocation since the region
is already mapped, fix that by open-coding copy_from_paddr inside
relocate_fdt, without mapping on the fixmap.
Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
automation: allow selecting individual jobs via CI variables
Debugging sometimes involves running specific jobs on different
versions. It's useful to easily avoid running all of the not interesting
ones (for given case) to save both time and CI resources. Doing so used
to require changing the yaml files, usually in several places.
Ease this step by adding SELECTED_JOBS_ONLY variable that takes a regex.
Note that one needs to satisfy job dependencies on their own (for
example if a test job needs a build job, that specific build job
needs to be included too).
The variable can be specified via Gitlab web UI when scheduling a
pipeline, but it can be also set when doing git push directly:
More details at https://docs.gitlab.co.jp/ee/user/project/push_options.html
The variable needs to include regex for selecting jobs, including
enclosing slashes.
A coma/space separated list of jobs to select would be friendlier UX,
but unfortunately that is not supported:
https://gitlab.com/gitlab-org/gitlab/-/issues/209904 (note the proposed
workaround doesn't work for job-level CI_JOB_NAME).
On the other hand, the regex is more flexible (one can select for
example all arm32 jobs).
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
automation: add jobs running tests from tools/tests/*
There are a bunch of tests in tools/tests/, let them run in CI.
For each subdirectory expect "make run" will run the test, and observe
its exit code. This way, adding new tests is easy, and they will be
automatically picked up.
For better visibility, log test output to junit xml format, and let
gitlab ingest it. Set SUT_ADDR variable with name/address of the system
under test, so a network can be used to extract the file. The actual
address is set using DHCP. And for the test internal network, still add
the 192.168.0.1 IP (but don't replace the DHCP-provided one).
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Stefano Stabellini <stefano.stabellini@amd.com>
Nicola Vetrini [Fri, 14 Feb 2025 20:45:23 +0000 (21:45 +0100)]
automation: Update ECLAIR analysis configuration
The Xen configurations for the ARM64 and X86_64 ECLAIR analyses
is currently held in fixed files under
'automation/eclair_analysis/xen_{arm,x86}_config'. The values
of the configuration options there are susceptible to going stale
due to configuration option changes.
To enhance maintainability, the configuration under analysis is
derived from the respective architecture's defconfig, with suitable
changes added via EXTRA_XEN_CONFIG.
ARM: ITS: implement quirks and add support for Renesas Gen4 ITS
There are number of ITS implementations exist which are different from
the base one which have number of functionalities defined as is
"IMPLEMENTATION DEFINED", e.g. there may exist differences in cacheability,
shareability and memory requirements and others. This requires
appropriate handling of such HW requirements which are implemented as
ITS quirks: GITS_IIDR (ITS Implementer Identification Register) is used to
differentiate the ITS implementations and select appropriate set of
quirks if any.
As an example of such ITSes add quirk implementation for Renesas Gen4 ITS:
- add possibility to override default cacheability and shareability
settings used for ITS memory allocations;
- change relevant memory allocations to alloc_xenheap_pages which allows
to specify memory access flags, free_xenheap_pages is used to free;
- add quirks validation to ensure that all ITSes share the same quirks
in case of multiple ITSes are present in the system;
The Gen4 ITS memory requirements are not covered in any errata as of yet,
but observed behavior suggests that they are indeed required.
The idea of the quirk implementation is inspired by the Linux kernel ITS
quirk implementation [1].
Changelog:
v2 -> v3:
- added missing memset;
v1 -> v2:
- switched to using alloc_xenheap_pages/free_xenheap_pages for ITS memory
allocations;
- updated declaration of its_quirk_flags;
- added quirks validation to ensure that all ITSes share the same quirks;
- removed unnecessary vITS changes;
Michal Orzel [Wed, 19 Feb 2025 17:29:46 +0000 (18:29 +0100)]
xen/arm: Create GIC node using the node name from host dt
At the moment the GIC node we create for hwdom has a name
"interrupt-controller". Change it so that we use the same name as the
GIC node from host device tree. This is done for at least 2 purposes:
1) The convention in DT spec is that a node name with "reg" property
is formed "node-name@unit-address".
2) With DT overlay feature, many overlays refer to the GIC node using
the symbol under __symbols__ that we copy to hwdom 1:1. With the name
changed, the symbol is no longer valid and requires error prone manual
change by the user.
The unit-address part of the node name always refers to the first
address in the "reg" property which in case of GIC, always refers to
GICD and hwdom uses host memory layout.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
arch: arm64: always set IL=1 when injecting an abort exception
ARM Architecture Reference Manual states that IL field of ESR_EL1
register should be 1 in some cases, and all these cases are covered by
inject_abt64_exception()
Section D24.2.40, page D24-7337 of ARM DDI 0487L:
IL, bit [25]
Instruction Length for synchronous exceptions. Possible values of this bit are:
[...]
0b1 - 32-bit instruction trapped.
This value is also used when the exception is one of the following:
[...]
- An Instruction Abort exception.
- A Data Abort exception for which the value of the ISV bit is 0.
[...]
inject_abt64_exception() function injects either Instruction Abort or
Data Abort exception. In both cases, ISS is 0, which means that ISV
bit is 0 as well. Thus, IL must be set to 1 unconditionally.
To align code with the specification, set .len field to 1 in
inject_abt64_exception() and remove unneeded third parameter.
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
arch: arm64: always set IL=1 when injecting undefined exception
ARM Architecture Reference Manual states that IL field of ESR_EL1
register should be 1 when EC is 0b000000 aka HSR_EC_UNKNOWN.
Section D24.2.40, page D24-7337 of ARM DDI 0487L:
IL, bit [25]
Instruction Length for synchronous exceptions. Possible values of this bit are:
[...]
0b1 - 32-bit instruction trapped.
This value is also used when the exception is one of the following:
[...]
- An exception reported using EC value 0b000000.
To align code with the specification, set .len field to 1 in
inject_undef64_exception() and remove unneeded second parameter.
Michal Orzel [Wed, 12 Feb 2025 15:43:58 +0000 (17:43 +0200)]
device-tree: optimize size of struct dt_device_node
The current placement of fields in struct dt_device_node is not optimal and
introduces holes due to fields alignment.
Checked with "'pahole xen-syms -C dt_device_node"
ARM64 size 144B, 16B holes:
/* size: 144, cachelines: 3, members: 15 */
/* sum members: 128, holes: 3, sum holes: 16 */
/* last cacheline: 16 bytes */
ARM32 size 72B, 4B holes
/* size: 72, cachelines: 2, members: 15 */
/* sum members: 68, holes: 2, sum holes: 4 */
/* last cacheline: 8 bytes */
This patch optimizes size of struct dt_device_node by rearranging its
field, which eliminates holes and reduces structure size by 16B(ARM64) and
4B(ARM32).
After ARM64 size 128B, no holes (-16B):
/* size: 128, cachelines: 2, members: 15 */
After ARM32 size 68B, no holes (-4B)
/* size: 68, cachelines: 2, members: 15 */
/* last cacheline: 4 bytes */
Note that the property hard-affinity is optional. It is possible to add
other properties in the future not only to specify soft affinity, but
also to provide more precise methods for configuring affinity. For
instance, on ARM the MPIDR could be use to specify the pCPU. For now, it
is left to the future.
xen/arm: introduce legacy dom0less option for xenstore allocation
The new xenstore page allocation scheme might break older unpatched
Linux kernels that do not check for the Xenstore connection status
before proceeding with Xenstore initialization.
Introduce a dom0less configuration option to retain the older behavior.
The older behavior triggered by this option is to allocate the xenstore
page in init-dom0less. That does not work with static-mem guests.
However, it will make it possible to run as regular guests older Linux
kernel versions that are left unpatched.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
init-dom0less: allocate xenstore page if not already allocated
We check if the xenstore page is already allocated. If yes, there is
nothing to do. If no, we proceed allocating it to support old unpatched
Linux kernels.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Henry Wang [Fri, 24 May 2024 22:55:22 +0000 (15:55 -0700)]
docs/features/dom0less: Update the late XenStore init protocol
With the new allocation strategy of Dom0less DomUs XenStore page,
update the doc of the late XenStore init protocol accordingly.
Signed-off-by: Henry Wang <xin.wang2@amd.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Henry Wang [Fri, 24 May 2024 22:55:20 +0000 (15:55 -0700)]
xen/arm: Alloc XenStore page for Dom0less DomUs from hypervisor
There are use cases (for example using the PV driver) in Dom0less
setup that require Dom0less DomUs start immediately with Dom0, but
initialize XenStore later after Dom0's successful boot and call to
the init-dom0less application.
An error message can seen from the init-dom0less application on
1:1 direct-mapped domains:
```
Allocating magic pages
memory.c:238:d0v0 mfn 0x39000 doesn't belong to d1
Error on alloc magic pages
```
The "magic page" is a terminology used in the toolstack as reserved
pages for the VM to have access to virtual platform capabilities.
Currently the magic pages for Dom0less DomUs are populated by the
init-dom0less app through populate_physmap(), and populate_physmap()
automatically assumes gfn == mfn for 1:1 direct mapped domains. This
cannot be true for the magic pages that are allocated later from the
init-dom0less application executed in Dom0. For domain using statically
allocated memory but not 1:1 direct-mapped, similar error "failed to
retrieve a reserved page" can be seen as the reserved memory list is
empty at that time.
Since for init-dom0less, the magic page region is only for XenStore.
To solve above issue, this commit allocates the XenStore page for
Dom0less DomUs at the domain construction time. The PFN will be
noted and communicated to the init-dom0less application executed
from Dom0. To keep the XenStore late init protocol, set the connection
status to XENSTORE_RECONNECT.
Since the guest magic region allocation from init-dom0less is for
XenStore, and the XenStore page is now allocated from the hypervisor,
instead of hardcoding the guest magic pages region, use
xc_hvm_param_get() to get the XenStore page PFN. Rename alloc_xs_page()
to get_xs_page() to reflect the changes.
With this change, some existing code is not needed anymore, including:
(1) The definition of the XenStore page offset.
(2) Call to xc_domain_setmaxmem() and xc_clear_domain_page() as we
don't need to set the max mem and clear the page anymore.
(3) Foreign mapping of the XenStore page, setting of XenStore interface
status and HVM_PARAM_STORE_PFN from init-dom0less, as they are set
by the hypervisor.
Take the opportunity to do some coding style improvements when possible.
Reported-by: Alec Kwapis <alec.kwapis@medtronic.com> Signed-off-by: Henry Wang <xin.wang2@amd.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> Tested-by: Michal Orzel <michal.orzel@amd.com>
Andrew Cooper [Fri, 28 Feb 2025 09:58:51 +0000 (09:58 +0000)]
MISRA: Update path for bsearch devation
This ought to have been part of the original patch, so as to avoid breaking
CI.
Fixes: 31c0d6fdf421 ("xen/bsearch: Split out of lib.h into it's own header") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Jan Beulich [Thu, 27 Feb 2025 12:58:32 +0000 (12:58 +0000)]
IOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe
The function's use from set_msi_source_id() is guaranteed to be in an
IRQs-off region. While the invocation of that function could be moved
ahead in msi_msg_to_remap_entry() (doesn't need to be in the IOMMU-
intremap-locked region), the call tree from map_domain_pirq() holds an
IRQ descriptor lock. Hence all use sites of the lock need become IRQ-
safe ones.
In find_upstream_bridge() do a tiny bit of tidying in adjacent code:
Change a variable's type to unsigned and merge a redundant assignment
into another variable's initializer.
This is XSA-467 / CVE-2025-1713.
Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Wed, 26 Feb 2025 03:27:33 +0000 (21:27 -0600)]
PPC: Activate UBSAN in testing
Also enable -fno-sanitize=alignment like x86 since support for unaligned
accesses is guaranteed by the ISA and the existing OPAL setup code
relies on it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 24 Oct 2024 12:47:20 +0000 (13:47 +0100)]
x86/ucode: Drop the match_reg[] field from AMD's microcode_patch
This was true in the K10 days, but even back then the match registers were
really payload data rather than header data.
But, it's really model specific data, and these days typically part of the
signature, so is random data for all intents and purposes.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 23 Aug 2024 18:38:59 +0000 (19:38 +0100)]
x86/ucode: Rename hypercall-context functions
microcode_update{,_helper}() are overly generic names in a file that has
multiple update routines and helper functions contexts.
Rename microcode_update() to ucode_update_hcall() so it explicitly identifies
itself as hypercall context, and rename microcode_update_helper() to
ucode_update_hcall_cont() to make it clear it is in continuation context.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Jan Beulich [Wed, 26 Feb 2025 11:26:23 +0000 (12:26 +0100)]
x86/DM: slightly simplify set_mem_type()
There's no need to access the static array twice per iteration, even
more so when that's effectively open-coding array_access_nospec().
Along with renaming the "new type" variable, rename the "old type" one
as well, to clarify which one is which.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Oleksii Kurochko [Wed, 26 Feb 2025 11:24:26 +0000 (12:24 +0100)]
xen/riscv: update mfn calculation in pt_mapping_level()
When pt_update() is called with arguments (..., INVALID_MFN, ..., 0 or 1),
it indicates that a mapping is being destroyed/modifyed.
In the case when modifying or destroying a mapping, it is necessary to
search until a leaf node is found, instead of searching for a page table
entry based on the precalculated `level` and `order`(look at pt_update()).
This is because when `mfn` == INVALID_MFN, the `mask` (in pt_mapping_level())
will take into account only `vfn`, which could accidentally return an
incorrect level, leading to the discovery of an incorrect page table entry.
For example, if `vfn` is page table level 1 aligned, but it was mapped as
page table level 0, then pt_mapping_level() will return `level` = 1, since
only `vfn` (which is page table level 1 aligned) is taken into account when
`mfn` == INVALID_MFN (look at pt_mapping_level()).
Have unmap_table() check for NULL, such that individual callers don't need
to.
Jan Beulich [Wed, 26 Feb 2025 11:23:49 +0000 (12:23 +0100)]
x86/MCE-telem: drop unnecessary per-CPU field
struct mc_telem_cpu_ctl's processing field is used solely in
mctelem_process_deferred(), where the local variable can as well be used
directly when retrieving the head of the list to process. This then also
eliminates the field holding a dangling pointer once the processing of
the list finished, in particular when the entry is handed to
mctelem_dismiss().
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 26 Feb 2025 11:23:19 +0000 (12:23 +0100)]
x86/MCE: fail init more gracefully when CPU vendor isn't supported
When mcheck_init() doesn't recognize the CPU vendor, it will undo the
all-banks allocation, and it will in particular not install the CPU
notifier. This way APs will pointlessly try to re-establish an
all-banks allocation, while then falling over NULL pointers due to the
notifier not having run and hence not having allocated anything for
them.
Prevent both from happening, and additionally delay writing MCG_CTL
until no errors can occur anymore in mca_cap_init().
Fixes: 741367e77d6c ("mce: Clean-up mcheck_init handler") Fixes: a5e1b534ac6f ("x86: mce cleanup for both Intel and AMD mce logic") Fixes: 560cf418c845 ("x86/mcheck: allow varying bank counts per CPU") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 24 Feb 2025 15:36:11 +0000 (15:36 +0000)]
CirrusCI: Use shallow clone
This reduces the Clone step from ~50s to ~3s.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Thu, 31 Oct 2024 13:35:40 +0000 (13:35 +0000)]
scripts: Fix git-checkout.sh to work with non-master branches (take 2)
First, rename $TAG to $COMMITTISH. We already pass tags, branches (well, only
master) and full SHAs into this script.
Xen uses master for QEMU_UPSTREAM_REVISION, and has done for other trees too
in the path. Apparently we've never specified a different branch, because the
git-clone rune only pulls in the master branch; it does not pull in diverging
branches.
Fix this by performing an explicit fetch of the $COMMITTISH, then checking out
the dummy branch from the FETCH_HEAD.
Suggested-by: Jason Andryuk <jason.andryuk@amd.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Andrii Sultanov [Fri, 14 Feb 2025 15:24:27 +0000 (15:24 +0000)]
tools/ocaml: Fix oxenstored build warning
OCaml, in preparation for a renaming of the error string associated with
conversion failure in 'int_of_string' functions, started to issue this
warning:
File "process.ml", line 440, characters 13-28:
440 | | (Failure "int_of_string") -> reply_error "EINVAL"
^^^^^^^^^^^^^^^
Warning 52 [fragile-literal-pattern]: Code should not depend on the actual values of
this constructor's arguments. They are only for information
and may change in future versions. (See manual section 11.5)
Deal with this at the source, and instead create our own stable
ConversionFailure exception that's raised on the None case in
'int_of_string_opt'.
'c_int_of_string' is safe and does not raise such exceptions.
Signed-off-by: Andrii Sultanov <andrii.sultanov@cloud.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 17 Feb 2025 19:13:01 +0000 (19:13 +0000)]
xen/ACPI: Drop local acpi_os_{v,}printf() and use plain {v,}printk()
Now that Xen has a real vprintk(), there's no need to opencode it locally with
vsnprintf(). Redirect the debug routines to the real {v,}printk() and drop
the local acpi_os_{v,}printf() implementations.
Amongst other things, this removes one arbitrary limit on message size, as
well as removing a 512 byte static buffer that ought to have been in
__initdata given that is private to an __init function.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 23 Jan 2025 03:27:07 +0000 (03:27 +0000)]
xen/console: Optimise the parameter order of vprintk_common()
For ABIs which pass parameters by register (all cases that we compile Xen
for), inserting new arguments on the left hand side involves shuffling all
other parameters along by one register whereas appending a new argument
doesn't involve shuffling of existing registers.
Reorder vprintk_common()'s prefix parameter to being last. This is a marginal
improvement on all architectures:
Function old new delta
vprintk 18 12 -6 x86
vprintk 32 24 -8 arm32
vprintk 52 48 -4 arm64
vprintk 52 48 -4 riscv64
vprintk 80 72 -8 ppc64
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Ross Lagerwall [Mon, 17 Feb 2025 17:50:11 +0000 (17:50 +0000)]
x86/ucode: Add option to scan microcode by default
A lot of systems automatically add microcode to the initramfs so it can
be useful as a vendor policy to always scan for microcode. Add a Kconfig
option to allow setting the default behaviour.
The default behaviour is unchanged since the new option defaults to
"no".
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Oleksii Kurochko [Tue, 25 Feb 2025 07:47:12 +0000 (08:47 +0100)]
xen/riscv: update defintion of vmap_to_mfn()
vmap_to_mfn() uses virt_to_maddr(), which is designed to work with VA from
either the direct map region or Xen's linkage region (XEN_VIRT_START).
An assertion will occur if it is used with other regions, in particular for
the VMAP region.
Since RISC-V lacks a hardware feature to request the MMU to translate a VA to
a PA (as Arm does, for example), software page table walking (pt_walk()) is
used for the VMAP region to obtain the mfn from pte_t.
To avoid introduce a circular dependency between asm/mm.h and asm/page.h by
including each other, the static inline function _vmap_to_mfn() is introduced
in asm/page.h, as it uses struct pte_t and pte_is_mapping() from asm/page.h.
_vmap_to_mfn() is then reused in the definition of vmap_to_mfn() macro in
asm/mm.h.
Fixes: 7db8d2bd9b ("xen/riscv: add minimal stuff to mm.h to build full Xen") Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Oleksii Kurochko [Tue, 25 Feb 2025 07:46:32 +0000 (08:46 +0100)]
xen/riscv: implement software page table walking
RISC-V doesn't have hardware feature to ask MMU to translate
virtual address to physical address ( like Arm has, for example ),
so software page table walking is implemented.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 25 Feb 2025 07:45:14 +0000 (08:45 +0100)]
radix-tree: don't left-shift negative values
Any (signed) integer is okay to pass into radix_tree_int_to_ptr(), yet
left shifting negative values is UB. Use an unsigned intermediate type,
reducing the impact to implementation defined behavior (for the
unsigned->signed conversion).
Also please Misra C:2012 rule 7.3 by dropping the lower case numeric 'l'
tag.
No difference in generated code, at least on x86.
Fixes: b004883e29bb ("Simplify and build-fix (for some gcc versions) radix_tree_int_to_ptr()") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Penny Zheng [Tue, 25 Feb 2025 07:41:41 +0000 (08:41 +0100)]
xen/x86: add CPPC feature flag for AMD processors
Add Collaborative Processor Performance Control feature flag for
AMD processors.
amd-cppc is the AMD CPU performance scaling driver that
introduces a new CPU frequency control mechanism on modern AMD
APU and CPU series.
There are two types of hardware implementations: "Full MSR Support"
and "Shared Memory Support".
Right now, xen will only implement "Full MSR Support", and this new
feature flag indicates whether processor has this feature or not.
Signed-off-by: Penny Zheng <Penny.Zheng@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Sergiy Kibrik [Tue, 25 Feb 2025 07:41:07 +0000 (08:41 +0100)]
ioreq: allow arch_vcpu_ioreq_completion() to signal an error
Return false from arch_vcpu_ioreq_completion() when completion is not handled.
According to coding-best-practices.pandoc an error should be propagated to
caller, if caller is expecting to handle it, which seems to the case for
callers of arch_vcpu_ioreq_completion().
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Sergiy Kibrik <Sergiy_Kibrik@epam.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 20 Feb 2025 12:50:19 +0000 (13:50 +0100)]
x86/MCE-telem: adjust cookie definition
struct mctelem_ent is opaque outside of mcetelem.c; the cookie
abstraction exists - afaict - just to achieve this opaqueness. Then it
is irrelevant though which kind of pointer mctelem_cookie_t resolves to.
IOW we can as well use struct mctelem_ent there, allowing to remove the
casts from COOKIE2MCTE() and MCTE2COOKIE(). Their removal addresses
Misra C:2012 rule 11.2 ("Conversions shall not be performed between a
pointer to an incomplete type and any other type") violations.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-By: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Andrew Cooper [Mon, 17 Feb 2025 15:51:51 +0000 (15:51 +0000)]
x86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()
There is a corner case in the VMRUN instruction where its INTR_SHADOW state
leaks into guest state if a VMExit occurs before the VMRUN is complete. An
example of this could be taking #NPF due to event injection.
Xen can safely execute STI anywhere between CLGI and VMRUN, as CLGI blocks
external interrupts too. However, an exception (while fatal) will appear to
be in an irqs-on region (as GIF isn't considered), so position the STI after
the speculation actions but prior to the GPR pops.
xen/memory: Make resource_max_frames() to return 0 on unknown type
This is actually what the caller acquire_resource() expects on any kind
of error (the comment on top of resource_max_frames() also suggests that).
Otherwise, the caller will treat -errno as a valid value and propagate incorrect
nr_frames to the VM. As a possible consequence, a VM trying to query a resource
size of an unknown type will get the success result from the hypercall and obtain
nr_frames 4294967201.
Also, add an ASSERT_UNREACHABLE() in the default case of _acquire_resource(),
normally we won't get to this point, as an unknown type will always be rejected
earlier in resource_max_frames().
Also, update test-resource app to verify that Xen can deal with invalid
(unknown) resource type properly.
Fixes: 9244528955de ("xen/memory: Fix acquire_resource size semantics") Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Wed, 22 Jan 2025 12:13:24 +0000 (12:13 +0000)]
xen/console: Fix truncation of panic() messages
The panic() function uses a static buffer to format its arguments into, simply
to emit the result via printk("%s", buf). This buffer is not large enough for
some existing users in Xen. e.g.:
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Invalid device tree blob at physical address 0x46a00000.
(XEN) The DTB must be 8-byte aligned and must not exceed 2 MB in size.
(XEN)
(XEN) Plea****************************************
The remainder of this particular message is 'e check your bootloader.', but
has been inherited by RISC-V from ARM.
It is also pointless double buffering. Implement vprintk() beside printk(),
and use it directly rather than rendering into a local buffer, removing it as
one source of message limitation.
This marginally simplifies panic(), and drops a global used-once buffer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
+ dump_execution_state();
+
for_each_domain( d )
domain_unpause_by_systemcontroller(d);
failed with:
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) CPU0: Unexpected Trap: Undefined Instruction
(XEN) ----[ Xen-4.20-rc arm32 debug=n Not tainted ]----
(XEN) CPU: 0
<snip>
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Undefined Instruction
(XEN) ****************************************
This is because the condition for init text is wrong. While there's nothing
interesting from that point onwards in start_xen(), it's also wrong for
livepatches too.
Use is_active_kernel_text() which is the correct test for this purpose, and is
aware of init and livepatch regions as well as their lifetimes.
Fixes: 3e802c6ca1fb ("xen/arm: Correctly support WARN_ON") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 4 Feb 2025 10:46:14 +0000 (11:46 +0100)]
x86/iommu: disable interrupts at shutdown
Add a new hook to inhibit interrupt generation by the IOMMU(s). Note the
hook is currently only implemented for x86 IOMMUs. The purpose is to
disable interrupt generation at shutdown so any kexec chained image finds
the IOMMU(s) in a quiesced state.
It would also prevent "Receive accept error" being raised as a result of
non-disabled interrupts targeting offline CPUs.
Note that the iommu_quiesce() call in nmi_shootdown_cpus() is still
required even when there's a preceding iommu_crash_shutdown() call; the
later can become a no-op depending on the setting of the "crash-disable"
command line option.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Wed, 5 Feb 2025 14:05:47 +0000 (15:05 +0100)]
x86/pci: disable MSI(-X) on all devices at shutdown
Attempt to disable MSI(-X) capabilities on all PCI devices know by Xen at
shutdown. Doing such disabling should facilitate kexec chained kernel from
booting more reliably, as device MSI(-X) interrupt generation should be
quiesced.
Only attempt to disable MSI(-X) on all devices in the crash context if the
PCI lock is not taken, otherwise the PCI device list could be in an
inconsistent state. This requires introducing a new pcidevs_trylock()
helper to check whether the lock is currently taken.
Disabling MSI(-X) should prevent "Receive accept error" being raised as a
result of non-disabled interrupts targeting offline CPUs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Thu, 6 Feb 2025 11:20:04 +0000 (12:20 +0100)]
x86/smp: perform disabling on interrupts ahead of AP shutdown
Move the disabling of interrupt sources so it's done ahead of the offlining
of APs. This is to prevent AMD systems triggering "Receive accept error"
when interrupts target CPUs that are no longer online.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 28 Jan 2025 15:06:07 +0000 (16:06 +0100)]
x86/irq: drop fixup_irqs() parameters
The solely remaining caller always passes the same globally available
parameters. Drop the parameters and modify fixup_irqs() to use
cpu_online_map in place of the input mask parameter, and always be verbose
in its output printing.
While there remove some of the checks given the single context where
fixup_irqs() is now called, which should always be in the CPU offline path,
after the CPU going offline has been removed from cpu_online_map.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 28 Jan 2025 08:34:20 +0000 (09:34 +0100)]
x86/shutdown: offline APs with interrupts disabled on all CPUs
The current shutdown logic in smp_send_stop() will disable the APs while
having interrupts enabled on the BSP or possibly other APs. On AMD systems
this can lead to local APIC errors:
APIC error on CPU0: 00(08), Receive accept error
Such error message can be printed in a loop, thus blocking the system from
rebooting. I assume this loop is created by the error being triggered by
the console interrupt, which is further stirred by the ESR handler
printing to the console.
Intel SDM states:
"Receive Accept Error.
Set when the local APIC detects that the message it received was not
accepted by any APIC on the APIC bus, including itself. Used only on P6
family and Pentium processors."
So the error shouldn't trigger on any Intel CPU supported by Xen.
However AMD doesn't make such claims, and indeed the error is broadcast to
all local APICs when an interrupt targets a CPU that's already offline.
To prevent the error from stalling the shutdown process perform the
disabling of APs and the BSP local APIC with interrupts disabled on all
CPUs in the system, so that by the time interrupts are unmasked on the BSP
the local APIC is already disabled. This can still lead to a spurious:
APIC error on CPU0: 00(00)
As a result of an LVT Error getting injected while interrupts are masked on
the CPU, and the vector only handled after the local APIC is already
disabled. ESR reports 0 because as part of disable_local_APIC() the ESR
register is cleared.
Note the NMI crash path doesn't have such issue, because disabling of APs
and the caller local APIC is already done in the same contiguous region
with interrupts disabled. There's a possible window on the NMI crash path
(nmi_shootdown_cpus()) where some APs might be disabled (and thus
interrupts targeting them raising "Receive accept error") before others APs
have interrupts disabled. However the shutdown NMI will be handled,
regardless of whether the AP is processing a local APIC error, and hence
such interrupts will not cause the shutdown process to get stuck.
Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
intended goal of moving all interrupts to the BSP anyway. The logic in
fixup_irqs() will move interrupts whose affinity doesn't overlap with the
passed mask, but the movement of interrupts is done to any CPU set in
cpu_online_map. As in the shutdown path fixup_irqs() is called before APs
are cleared from cpu_online_map this leads to interrupts being shuffled
around, but not assigned to the BSP exclusively.
The Fixes tag is more of a guess than a certainty; it's possible the
previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
delivered before APs went offline. However fixup_irqs() was still
incorrectly used, as it didn't (and still doesn't) move all interrupts to
target the provided cpu mask.
Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new destination in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Andrew Cooper [Fri, 7 Feb 2025 21:19:21 +0000 (21:19 +0000)]
RISCV: Activate UBSAN in testing
RISC-V has less complicated headers, so update ubsan.c to pull in everything
it needs. Provide dump_execution_state(), and update the printk() message to
make it more obvious that it's an outstanding task.
As with commit 8ef2ac727e21 ("automation: enable UBSAN for debug tests"),
enable UBSAN in RISC-V testing too.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Fri, 7 Feb 2025 15:04:25 +0000 (15:04 +0000)]
RISCV/asm: Use CALL rather than JAL
JAL has a maximium displacement of 2M. To branch further, it needs pairing
with an AUIPC instruction. CALL is a pseudoinstruction which allows the
linker to pick the appropriate sequence when relaxations are enabled.
This avoids a build failure of the form:
prelink.o: in function `start':
xen/xen/arch/riscv/riscv64/head.S:28:(.text.header+0x2c):
relocation truncated to fit: R_RISCV_JAL against symbol `calc_phys_offset' defined in .init.text section in prelink.o
make[3]: *** [arch/riscv/Makefile:18: xen-syms] Error 1
when Xen gets large enough, e.g. with CONFIG_UBSAN enabled.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>