Roger Pau Monné [Thu, 20 Mar 2025 11:48:51 +0000 (12:48 +0100)]
x86/dom0: attempt to fixup p2m page-faults for PVH dom0
When building a PVH dom0 Xen attempts to map all (relevant) MMIO regions
into the p2m for dom0 access. However the information Xen has about the
host memory map is limited. Xen doesn't have access to any resources
described in ACPI dynamic tables, and hence the p2m mappings provided might
not be complete.
PV doesn't suffer from this issue because a PV dom0 is capable of mapping
into its page-tables any address not explicitly banned in d->iomem_caps.
Introduce a new command line option that allows Xen to attempt to fixup
the p2m page-faults, by creating p2m identity maps in response to p2m
page-faults.
This is intended as a workaround for small ACPI regions Xen doesn't know
about.
Note that missing large MMIO regions mapped in this way will lead to
slowness due to the VM exit processing, plus the mappings will always use
small pages.
The ultimate aim is to attempt to bring better parity with a classic PV
dom0.
Note such fixups rely on the CPU doing the access to the unpopulated
address. If the access is attempted from a device instead there's no
possible way to fixup, as IOMMU page-faults are asynchronous.
Roger Pau Monné [Thu, 20 Mar 2025 11:48:32 +0000 (12:48 +0100)]
x86/emul: dump unhandled memory accesses for PVH dom0
A PV dom0 can map any host memory as long as it's allowed by the IO
capability range in d->iomem_caps. On the other hand, a PVH dom0 has no
way to populate MMIO regions into its p2m, so it's limited to what Xen
initially populates on the p2m based on the host memory map and the enabled
device BARs.
Introduce a new debug build only printk that reports attempts by dom0 to
access addresses not populated on the p2m, and not handled by any emulator.
This is for information purposes only, but might allow getting an idea of
what MMIO ranges might be missing on the p2m.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 43d8a80a0cccfe3715bb3178b5c15fb983979651
master date: 2025-03-05 10:26:46 +0100
Andrew Cooper [Mon, 3 Mar 2025 14:06:55 +0000 (14:06 +0000)]
CHANGELOG.md: Set release date for 4.20
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit e28802927e0a24dab9c73082c3e322ef4dd0bd02)
Oleksii Kurochko [Thu, 27 Feb 2025 14:27:52 +0000 (15:27 +0100)]
CHANGELOG.md: Finalize changes in 4.20 release cycle
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit d3a7d29d76fe4ca4f58164cbe20a6b2dd4500ab8)
Jan Beulich [Thu, 27 Feb 2025 12:58:32 +0000 (12:58 +0000)]
IOMMU/x86: the bus-to-bridge lock needs to be acquired IRQ-safe
The function's use from set_msi_source_id() is guaranteed to be in an
IRQs-off region. While the invocation of that function could be moved
ahead in msi_msg_to_remap_entry() (doesn't need to be in the IOMMU-
intremap-locked region), the call tree from map_domain_pirq() holds an
IRQ descriptor lock. Hence all use sites of the lock need become IRQ-
safe ones.
In find_upstream_bridge() do a tiny bit of tidying in adjacent code:
Change a variable's type to unsigned and merge a redundant assignment
into another variable's initializer.
This is XSA-467 / CVE-2025-1713.
Fixes: 476bbccc811c ("VT-d: fix MSI source-id of interrupt remapping") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit 39bc6af3ba483282ed6bbf94b08aec38c93d39e6)
Andrew Cooper [Wed, 26 Feb 2025 03:27:33 +0000 (21:27 -0600)]
PPC: Activate UBSAN in testing
Also enable -fno-sanitize=alignment like x86 since support for unaligned
accesses is guaranteed by the ISA and the existing OPAL setup code
relies on it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit 7cf163879c5add0a4f7f9c987b61f04f8f7051b1)
Jan Beulich [Thu, 20 Feb 2025 12:50:19 +0000 (13:50 +0100)]
x86/MCE-telem: adjust cookie definition
struct mctelem_ent is opaque outside of mcetelem.c; the cookie
abstraction exists - afaict - just to achieve this opaqueness. Then it
is irrelevant though which kind of pointer mctelem_cookie_t resolves to.
IOW we can as well use struct mctelem_ent there, allowing to remove the
casts from COOKIE2MCTE() and MCTE2COOKIE(). Their removal addresses
Misra C:2012 rule 11.2 ("Conversions shall not be performed between a
pointer to an incomplete type and any other type") violations.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-By: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Andrew Cooper [Mon, 17 Feb 2025 15:51:51 +0000 (15:51 +0000)]
x86/svm: Separate STI and VMRUN instructions in svm_asm_do_resume()
There is a corner case in the VMRUN instruction where its INTR_SHADOW state
leaks into guest state if a VMExit occurs before the VMRUN is complete. An
example of this could be taking #NPF due to event injection.
Xen can safely execute STI anywhere between CLGI and VMRUN, as CLGI blocks
external interrupts too. However, an exception (while fatal) will appear to
be in an irqs-on region (as GIF isn't considered), so position the STI after
the speculation actions but prior to the GPR pops.
xen/memory: Make resource_max_frames() return 0 on unknown type
This is actually what the caller acquire_resource() expects on any kind
of error (the comment on top of resource_max_frames() also suggests that).
Otherwise, the caller will treat -errno as a valid value and propagate incorrect
nr_frames to the VM. As a possible consequence, a VM trying to query a resource
size of an unknown type will get the success result from the hypercall and obtain
nr_frames 4294967201.
Also, add an ASSERT_UNREACHABLE() in the default case of _acquire_resource().
Normally we won't get to this point, as an unknown type will always be rejected
earlier in resource_max_frames().
Also, update test-resource app to verify that Xen can deal with invalid
(unknown) resource type properly.
Fixes: 9244528955de ("xen/memory: Fix acquire_resource size semantics") Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
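The 4294967201 quoted above is what a negative error code looks like once it lands in an unsigned 32-bit field: 2^32 - 95. A minimal sketch of that arithmetic follows. The helper names and the frame count of 32 are hypothetical, and the error value 95 is inferred from the number in the text rather than taken from Xen's headers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed error number, inferred from 4294967201 == 2^32 - 95. */
#define ERR_UNKNOWN_TYPE 95

/* Hypothetical helper with the old behaviour: failure reported as -errno. */
static int max_frames_buggy(int known_type)
{
    return known_type ? 32 : -ERR_UNKNOWN_TYPE;
}

/* Hypothetical helper with the new behaviour: 0 on unknown type. */
static unsigned int max_frames_fixed(int known_type)
{
    return known_type ? 32u : 0u;
}

/* The caller stores the result in an unsigned 32-bit nr_frames field. */
static uint32_t query_nr_frames_buggy(int known_type)
{
    return (uint32_t)max_frames_buggy(known_type);
}

static uint32_t query_nr_frames_fixed(int known_type)
{
    return (uint32_t)max_frames_fixed(known_type);
}
```

With the old shape an unknown type yields a huge but "valid" frame count; with the new shape the caller sees 0 and can treat it as an error.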
Andrew Cooper [Wed, 22 Jan 2025 12:13:24 +0000 (12:13 +0000)]
xen/console: Fix truncation of panic() messages
The panic() function uses a static buffer to format its arguments into, simply
to emit the result via printk("%s", buf). This buffer is not large enough for
some existing users in Xen. e.g.:
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Invalid device tree blob at physical address 0x46a00000.
(XEN) The DTB must be 8-byte aligned and must not exceed 2 MB in size.
(XEN)
(XEN) Plea****************************************
The remainder of this particular message is 'e check your bootloader.', but
has been inherited by RISC-V from ARM.
It is also pointless double buffering. Implement vprintk() beside printk(),
and use it directly rather than rendering into a local buffer, removing it as
one source of message limitation.
This marginally simplifies panic(), and drops a global used-once buffer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
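The truncation described above comes from formatting into a fixed buffer before printing. The sketch below contrasts that with forwarding the va_list directly. It uses the C library's vsnprintf() as a stand-in for Xen's console path, and the function names and the 16-byte buffer are illustrative assumptions, not Xen code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Old shape: render into a small intermediate buffer, then print "%s". */
static int format_buffered(char *out, size_t outlen, const char *fmt, ...)
{
    char buf[16]; /* deliberately tiny so the truncation is visible */
    va_list args;

    va_start(args, fmt);
    vsnprintf(buf, sizeof(buf), fmt, args); /* silently cuts long messages */
    va_end(args);

    return snprintf(out, outlen, "%s", buf);
}

/* New shape: hand the va_list straight to the v*() formatter. */
static int format_direct(char *out, size_t outlen, const char *fmt, ...)
{
    va_list args;
    int n;

    va_start(args, fmt);
    n = vsnprintf(out, outlen, fmt, args);
    va_end(args);

    return n;
}
```

The direct variant is limited only by the final sink, which is the "removing it as one source of message limitation" point made in the commit message.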
+ dump_execution_state();
+
for_each_domain( d )
domain_unpause_by_systemcontroller(d);
failed with:
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) CPU0: Unexpected Trap: Undefined Instruction
(XEN) ----[ Xen-4.20-rc arm32 debug=n Not tainted ]----
(XEN) CPU: 0
<snip>
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) CPU0: Unexpected Trap: Undefined Instruction
(XEN) ****************************************
This is because the condition for init text is wrong. While there's nothing
interesting from that point onwards in start_xen(), it's wrong for
livepatches too.
Use is_active_kernel_text() which is the correct test for this purpose, and is
aware of init and livepatch regions as well as their lifetimes.
Fixes: 3e802c6ca1fb ("xen/arm: Correctly support WARN_ON") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 4 Feb 2025 10:46:14 +0000 (11:46 +0100)]
x86/iommu: disable interrupts at shutdown
Add a new hook to inhibit interrupt generation by the IOMMU(s). Note the
hook is currently only implemented for x86 IOMMUs. The purpose is to
disable interrupt generation at shutdown so any kexec chained image finds
the IOMMU(s) in a quiesced state.
It would also prevent "Receive accept error" being raised as a result of
non-disabled interrupts targeting offline CPUs.
Note that the iommu_quiesce() call in nmi_shootdown_cpus() is still
required even when there's a preceding iommu_crash_shutdown() call; the
latter can become a no-op depending on the setting of the "crash-disable"
command line option.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Wed, 5 Feb 2025 14:05:47 +0000 (15:05 +0100)]
x86/pci: disable MSI(-X) on all devices at shutdown
Attempt to disable MSI(-X) capabilities on all PCI devices known to Xen at
shutdown. Doing so should help a kexec chained kernel boot more reliably,
as device MSI(-X) interrupt generation should be quiesced.
Only attempt to disable MSI(-X) on all devices in the crash context if the
PCI lock is not taken, otherwise the PCI device list could be in an
inconsistent state. This requires introducing a new pcidevs_trylock()
helper to check whether the lock is currently taken.
Disabling MSI(-X) should prevent "Receive accept error" being raised as a
result of non-disabled interrupts targeting offline CPUs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Thu, 6 Feb 2025 11:20:04 +0000 (12:20 +0100)]
x86/smp: perform disabling of interrupts ahead of AP shutdown
Move the disabling of interrupt sources so it's done ahead of the offlining
of APs. This is to prevent AMD systems triggering "Receive accept error"
when interrupts target CPUs that are no longer online.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 28 Jan 2025 15:06:07 +0000 (16:06 +0100)]
x86/irq: drop fixup_irqs() parameters
The sole remaining caller always passes the same globally available
parameters. Drop the parameters and modify fixup_irqs() to use
cpu_online_map in place of the input mask parameter, and always be verbose
in its output printing.
While there remove some of the checks given the single context where
fixup_irqs() is now called, which should always be in the CPU offline path,
after the CPU going offline has been removed from cpu_online_map.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Tue, 28 Jan 2025 08:34:20 +0000 (09:34 +0100)]
x86/shutdown: offline APs with interrupts disabled on all CPUs
The current shutdown logic in smp_send_stop() will disable the APs while
having interrupts enabled on the BSP or possibly other APs. On AMD systems
this can lead to local APIC errors:
APIC error on CPU0: 00(08), Receive accept error
Such an error message can be printed in a loop, thus blocking the system from
rebooting. I assume this loop is created by the error being triggered by
the console interrupt, which is further stirred by the ESR handler
printing to the console.
Intel SDM states:
"Receive Accept Error.
Set when the local APIC detects that the message it received was not
accepted by any APIC on the APIC bus, including itself. Used only on P6
family and Pentium processors."
So the error shouldn't trigger on any Intel CPU supported by Xen.
However AMD doesn't make such claims, and indeed the error is broadcast to
all local APICs when an interrupt targets a CPU that's already offline.
To prevent the error from stalling the shutdown process perform the
disabling of APs and the BSP local APIC with interrupts disabled on all
CPUs in the system, so that by the time interrupts are unmasked on the BSP
the local APIC is already disabled. This can still lead to a spurious:
APIC error on CPU0: 00(00)
As a result of an LVT Error getting injected while interrupts are masked on
the CPU, and the vector only handled after the local APIC is already
disabled. ESR reports 0 because as part of disable_local_APIC() the ESR
register is cleared.
Note the NMI crash path doesn't have such an issue, because disabling of APs
and the caller local APIC is already done in the same contiguous region
with interrupts disabled. There's a possible window on the NMI crash path
(nmi_shootdown_cpus()) where some APs might be disabled (and thus
interrupts targeting them raising "Receive accept error") before other APs
have interrupts disabled. However the shutdown NMI will be handled,
regardless of whether the AP is processing a local APIC error, and hence
such interrupts will not cause the shutdown process to get stuck.
Remove the call to fixup_irqs() in smp_send_stop(): it doesn't achieve the
intended goal of moving all interrupts to the BSP anyway. The logic in
fixup_irqs() will move interrupts whose affinity doesn't overlap with the
passed mask, but the movement of interrupts is done to any CPU set in
cpu_online_map. As in the shutdown path fixup_irqs() is called before APs
are cleared from cpu_online_map this leads to interrupts being shuffled
around, but not assigned to the BSP exclusively.
The Fixes tag is more of a guess than a certainty; it's possible the
previous sleep window in fixup_irqs() allowed any in-flight interrupt to be
delivered before APs went offline. However fixup_irqs() was still
incorrectly used, as it didn't (and still doesn't) move all interrupts to
target the provided cpu mask.
Fixes: e2bb28d62158 ('x86/irq: forward pending interrupts to new destination in fixup_irqs()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Andrew Cooper [Fri, 7 Feb 2025 21:19:21 +0000 (21:19 +0000)]
RISCV: Activate UBSAN in testing
RISC-V has less complicated headers, so update ubsan.c to pull in everything
it needs. Provide dump_execution_state(), and update the printk() message to
make it more obvious that it's an outstanding task.
As with commit 8ef2ac727e21 ("automation: enable UBSAN for debug tests"),
enable UBSAN in RISC-V testing too.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Fri, 7 Feb 2025 15:04:25 +0000 (15:04 +0000)]
RISCV/asm: Use CALL rather than JAL
JAL has a maximum displacement of 2M. To branch further, it needs pairing
with an AUIPC instruction. CALL is a pseudoinstruction which allows the
linker to pick the appropriate sequence when relaxations are enabled.
This avoids a build failure of the form:
prelink.o: in function `start':
xen/xen/arch/riscv/riscv64/head.S:28:(.text.header+0x2c):
relocation truncated to fit: R_RISCV_JAL against symbol `calc_phys_offset' defined in .init.text section in prelink.o
make[3]: *** [arch/riscv/Makefile:18: xen-syms] Error 1
when Xen gets large enough, e.g. with CONFIG_UBSAN enabled.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
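As a rough illustration of the limit behind the "relocation truncated to fit" error: JAL encodes a 21-bit signed, 2-byte-aligned offset, so the 2M figure is the total window, split as 1 MiB backwards and just under 1 MiB forwards. The helper below is a hypothetical check written for this note, not part of the Xen build or the linker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if a PC-relative byte offset is encodable in a single JAL.
 * The immediate is 21 bits, signed, with bit 0 implied zero, giving the
 * range [-1 MiB, +1 MiB) in steps of 2 bytes.
 */
static bool fits_in_jal(int64_t offset)
{
    const int64_t limit = INT64_C(1) << 20; /* 1 MiB */

    return offset >= -limit && offset < limit && (offset % 2) == 0;
}
```

When the target falls outside this range, an AUIPC + JALR pair is needed, which is the sequence CALL lets the linker choose.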
Enable CONFIG_UBSAN and CONFIG_UBSAN_FATAL for the ARM64 and x86_64
build jobs, with debug enabled, which are later used for Xen tests on
QEMU and/or real hardware.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> R-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Jan Beulich [Fri, 7 Feb 2025 09:00:04 +0000 (10:00 +0100)]
radix-tree: introduce RADIX_TREE{,_INIT}()
... now that static initialization is possible. Use RADIX_TREE() for
pci_segments and ivrs_maps.
This then fixes an ordering issue on x86: With the call to
radix_tree_init(), acpi_mmcfg_init()'s invocation of pci_segments_init()
will zap the possible earlier introduction of segment 0 by
amd_iommu_detect_one_acpi()'s call to pci_ro_device(), and thus the
write-protection of the PCI devices representing AMD IOMMUs.
Fixes: 3950f2485bbc ("x86/x2APIC: defer probe until after IOMMU ACPI table parsing") Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jan Beulich [Fri, 7 Feb 2025 08:59:11 +0000 (09:59 +0100)]
radix-tree: purge node allocation override hooks
These were needed by TMEM only, which is long gone. The Linux original
doesn't have such either. This effectively reverts one of the "Other
changes" from 8dc6738dbb3c ("Update radix-tree.[ch] from upstream Linux
to gain RCU awareness").
Positive side effect: Two cf_check go away.
While there also convert xmalloc()+memset() to xzalloc(). (Don't convert
to xvzalloc(), as that would require touching the freeing side, too.)
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jan Beulich [Tue, 4 Feb 2025 12:50:49 +0000 (13:50 +0100)]
AMD/IOMMU: drop stray MSI enabling
While the 2nd of the commits referenced below should have moved the call
to amd_iommu_msi_enable() instead of adding another one, the situation
wasn't quite right even before: It can't have done any good to enable
MSI when no IRQ was allocated for it, yet.
The other call to amd_iommu_msi_enable(), just out of patch context,
needs to stay there until S3 resume is re-worked. For the boot path that
call should be unnecessary, as iommu{,_maskable}_msi_startup() will have
done it already (by way of invoking iommu_msi_unmask()).
Fixes: 5f569f1ac50e ("AMD/IOMMU: allow enabling with IRQ not yet set up") Fixes: d9e49d1afe2e ("AMD/IOMMU: adjust setup of internal interrupt for x2APIC mode") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Tested-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jens Wiklander [Mon, 3 Feb 2025 10:21:12 +0000 (11:21 +0100)]
xen/arm: ffa: fix bind/unbind notification
The notification bitmask is passed in the FF-A ABI in two 32-bit
registers w3 and w4. The lower 32 bits should go in w3 and the higher in
w4. These two registers have unfortunately been swapped for
FFA_NOTIFICATION_BIND and FFA_NOTIFICATION_UNBIND in the FF-A mediator.
So fix that by using the correct registers.
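The register split described above can be sketched as follows. The function and parameter names are made up for illustration; only the low-half-in-w3, high-half-in-w4 ordering comes from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit notification bitmap: low 32 bits to w3, high 32 bits to w4. */
static void split_bitmap(uint64_t bitmap, uint32_t *w3_lo, uint32_t *w4_hi)
{
    *w3_lo = (uint32_t)(bitmap & 0xffffffffu);
    *w4_hi = (uint32_t)(bitmap >> 32);
}

/* Inverse operation, as the receiving side would reassemble it. */
static uint64_t join_bitmap(uint32_t w3_lo, uint32_t w4_hi)
{
    return ((uint64_t)w4_hi << 32) | w3_lo;
}
```

Swapping the two halves silently binds the wrong notification IDs for any bitmap whose halves differ, which is why a round-trip check like the one above is a useful test.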
Jan Beulich [Mon, 3 Feb 2025 10:43:49 +0000 (11:43 +0100)]
AMD/IOMMU: log IVHD contents
Despite all the verbosity with "iommu=debug", information on the IOMMUs
themselves was missing.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Tested-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Michal Orzel [Tue, 28 Jan 2025 09:40:02 +0000 (10:40 +0100)]
xen/arm: Fix build issue when CONFIG_PHYS_ADDR_T_32=y
On Arm32, when CONFIG_PHYS_ADDR_T_32 is set, a build failure is observed:
arch/arm/platforms/vexpress.c: In function 'vexpress_smp_init':
arch/arm/platforms/vexpress.c:102:12: error: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'long long unsigned int' [-Werror=format=]
102 | printk("Set SYS_FLAGS to %"PRIpaddr" (%p)\n",
When CONFIG_PHYS_ADDR_T_32 is set, paddr_t is defined as unsigned long.
Commit 96f35de69e59 dropped __virt_to_maddr() which used paddr_t as a
return type. Without a cast, the expression type is unsigned long long
which causes the issue. Fix it.
Michal Orzel [Tue, 28 Jan 2025 09:40:01 +0000 (10:40 +0100)]
device-tree: bootfdt: Fix build issue when CONFIG_PHYS_ADDR_T_32=y
On Arm32, when CONFIG_PHYS_ADDR_T_32 is set, a build failure is observed:
common/device-tree/bootfdt.c: In function 'build_assertions':
./include/xen/macros.h:47:31: error: static assertion failed: "!(alignof(struct membanks) != 8)"
47 | #define BUILD_BUG_ON(cond) ({ _Static_assert(!(cond), "!(" #cond ")"); })
| ^~~~~~~~~~~~~~
common/device-tree/bootfdt.c:31:5: note: in expansion of macro 'BUILD_BUG_ON'
31 | BUILD_BUG_ON(alignof(struct membanks) != 8);
When CONFIG_PHYS_ADDR_T_32 is set, paddr_t is defined as unsigned long,
therefore the struct membanks alignment is 4B and not 8B. The check is
there to ensure the struct membanks and struct membank, which is a
member of the former, are equally aligned. Therefore modify the check to
compare alignments obtained via alignof not to rely on hardcoded
values.
Fixes: 2209c1e35b47 ("xen/arm: Introduce a generic way to access memory bank structures") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Tested-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Julien Grall <julien@xen.org>
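A minimal sketch of the kind of check described above, comparing the two alignments instead of hardcoding 8. The struct layouts here are simplified stand-ins, not the real Xen definitions, and uint32_t is used in place of a 32-bit paddr_t so the example behaves the same on any host.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Stand-in for paddr_t when CONFIG_PHYS_ADDR_T_32 is set (assumption). */
typedef uint32_t paddr32_t;

/* Simplified, hypothetical layouts for illustration only. */
struct membank {
    paddr32_t start;
    paddr32_t size;
};

struct membanks {
    unsigned int nr_banks;
    unsigned int max_banks;
    struct membank bank[]; /* flexible array of the member type */
};

/* Holds for both 32-bit and 64-bit address types: no hardcoded value. */
_Static_assert(alignof(struct membanks) == alignof(struct membank),
               "membanks and membank must be equally aligned");
```

A check of the form `alignof(struct membanks) != 8` would fail for this layout, since the alignment follows the widest member, which is 4 bytes here.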
Andrew Cooper [Tue, 21 Jan 2025 16:56:26 +0000 (16:56 +0000)]
x86/intel: Fix PERF_GLOBAL fixup when virtualised
Logic using performance counters needs to look at
MSR_MISC_ENABLE.PERF_AVAILABLE before touching any other resources.
When virtualised under ESX, Xen dies with a #GP fault trying to read
MSR_CORE_PERF_GLOBAL_CTRL.
Factor this logic out into a separate function (it's already too squashed to
the RHS), and insert a check of MSR_MISC_ENABLE.PERF_AVAILABLE.
This also avoids setting X86_FEATURE_ARCH_PERFMON if MSR_MISC_ENABLE says that
PERF is unavailable, although oprofile (the only consumer of this flag)
cross-checks too.
Fixes: 6bdb965178bb ("x86/intel: ensure Global Performance Counter Control is setup correctly") Reported-by: Jonathan Katz <jonathan.katz@aptar.com> Link: https://xcp-ng.org/forum/topic/10286/nesting-xcp-ng-on-esx-8 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Jonathan Katz <jonathan.katz@aptar.com>
Jan Beulich [Mon, 27 Jan 2025 14:23:59 +0000 (15:23 +0100)]
x86/PV: further harden guest memory accesses against speculative abuse
The original implementation has two issues: For one it doesn't preserve
non-canonical-ness of inputs in the range 0x8000000000000000 through
0x80007fffffffffff. Bogus guest pointers in that range would not cause a
(#GP) fault upon access, when they should.
And then there is an AMD-specific aspect, where only the low 48 bits of
an address are used for speculative execution; the architecturally
mandated #GP for non-canonical addresses would be raised at a later
execution stage. Therefore to prevent Xen controlled data from making it
into any of the caches in a guest controllable manner, we need to
additionally ensure that for non-canonical inputs bit 47 would be clear.
See the code comment for how addressing both is being achieved.
Fixes: 4dc181599142 ("x86/PV: harden guest memory accesses against speculative abuse") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
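For readers unfamiliar with the term, the sketch below shows what "canonical" means for the range quoted above, assuming 48-bit virtual addresses (4-level paging). It is a plain architectural check written for this note; it is not the hardening sequence the commit implements, which lives in the code comment the message refers to.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical (48-bit case) when bits 63..47 are all equal,
 * i.e. the upper bits are a sign extension of bit 47.
 */
static bool is_canonical48(uint64_t addr)
{
    uint64_t top = addr >> 47; /* the 17 bits 63..47 */

    return top == 0 || top == 0x1ffff;
}
```

Every address from 0x8000000000000000 through 0x80007fffffffffff has bit 63 set and bit 47 clear, so it fails this check and an access through it should raise #GP.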
Jan Beulich [Mon, 27 Jan 2025 14:23:19 +0000 (15:23 +0100)]
x86emul: further correct 64-bit mode zero count repeated string insn handling
In an entirely different context I came across Linux commit 428e3d08574b
("KVM: x86: Fix zero iterations REP-string"), which points out that
we're still doing things wrong: For one, there's no zero-extension at
all on AMD. And then while RCX is zero-extended from 32 bits uniformly
for all string instructions on newer hardware, RSI/RDI are only for MOVS
and STOS on the systems I have access to. (On an old family 0xf system
I've further found that for REP LODS even RCX is not zero-extended.)
While touching the lines anyway, replace two casts in get_rep_prefix().
Fixes: 79e996a89f69 ("x86emul: correct 64-bit mode repeated string insn handling with zero count") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Released-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Roger Pau Monne [Mon, 20 Jan 2025 14:48:21 +0000 (15:48 +0100)]
iommu/amd: atomically update IRTE
Whether using a 32bit Interrupt Remapping Entry or a 128bit one, update
the entry atomically, by using cmpxchg unconditionally as IOMMU depends on
it. No longer disable the entry by setting RemapEn = 0 ahead of updating
it. As a consequence of not toggling RemapEn ahead of the update the
Interrupt Remapping Table needs to be flushed after the entry update.
This avoids a window where the IRTE has RemapEn = 0, which can lead to
IO_PAGE_FAULT if the underlying interrupt source is not masked.
There's no guidance in AMD-Vi specification about how IRTE update should be
performed as opposed to DTE updating which has specific guidance. However
DTE updating claims that reads will always be at least 128bits in size, and
hence for the purposes here assume that reads and caching of the IRTE
entries in either 32 or 128 bit format will be done atomically from
the IOMMU.
Note that as part of introducing a new raw128 field in the IRTE struct, the
current raw field is renamed to raw64 to explicitly contain the size in the
field name.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
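The idea of replacing an entry in one step, without first clearing its enable bit, can be sketched with C11 atomics on a 32-bit value. This is only an illustration of the compare-and-exchange pattern: the function name and the sample values are invented, and the real code also has to handle the 128bit format and flush the Interrupt Remapping Table afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Replace the whole entry with a single compare-and-exchange. An observer
 * sees either the old or the new value, never an intermediate state with
 * the enable bit cleared.
 */
static bool update_entry(_Atomic uint32_t *entry, uint32_t new_val)
{
    uint32_t old = atomic_load(entry);

    /* Fails (returns false) only if the entry changed under our feet. */
    return atomic_compare_exchange_strong(entry, &old, new_val);
}
```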
x86/iommu: remove non-CX16 logic from DMA remapping
As CX16 support is now mandatory for IOMMU usage, the checks for CX16 in
the DMA remapping code are stale. Remove them together with the associated
code introduced in case CX16 was not available.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
iommu/vtd: remove non-CX16 logic from interrupt remapping
As CX16 support is now mandatory for IOMMU usage, the checks for CX16 in
the interrupt remapping code are stale. Remove them together with the
associated code introduced in case CX16 was not available.
Note that AMD-Vi support for atomically updating a 128bit IRTE entry is
still not implemented, it will be done by further changes.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Teddy Astie [Fri, 24 Jan 2025 11:31:15 +0000 (12:31 +0100)]
x86/iommu: check for CMPXCHG16B when enabling IOMMU
All hardware with VT-d/AMD-Vi has CMPXCHG16B support. Check this at
initialisation time, and otherwise refuse to use the IOMMU.
If the local APICs support x2APIC mode the IOMMU support for interrupt
remapping will be checked earlier using a specific helper. If no support
for CX16 is detected by that earlier hook disable the IOMMU at that point
and prevent further poking for CX16 later in the boot process, which would
also fail.
There's a possible corner case when running virtualized, where the
underlying hypervisor exposes an IOMMU but no CMPXCHG16B support. In that
case ignoring the IOMMU is fine, although the most natural approach would be
for the underlying hypervisor to also expose CMPXCHG16B support if an IOMMU
is available to the VM.
Note this change only introduces the checks, but doesn't remove the now
stale checks for CX16 support sprinkled in the IOMMU code. Further changes
will take care of that.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Jan Beulich [Fri, 24 Jan 2025 09:15:56 +0000 (10:15 +0100)]
x86/HVM: correct read/write split at page boundaries
The MMIO cache is intended to have one entry used per independent memory
access that an insn does. This, in particular, is supposed to be
ignoring any page boundary crossing. Therefore when looking up a cache
entry, the access's starting (linear) address is relevant, not the one
possibly advanced past a page boundary.
In order for the same offset-into-buffer variable to be usable in
hvmemul_phys_mmio_access() for both the caller's buffer and the cache
entry's it is further necessary to have the un-adjusted caller buffer
passed into there.
Fixes: 2d527ba310dc ("x86/hvm: split all linear reads and writes at page boundary") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
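To make the page-boundary split concrete, the sketch below computes how many bytes of an access fall before the boundary, assuming 4 KiB pages. The helper name is hypothetical. The point from the text is that the remaining bytes are then handled at an offset into the same buffer and the same cache entry, with the lookup keyed on the original starting address rather than the advanced one.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Number of bytes of an access that lie within the first page it touches. */
static unsigned int first_chunk(uint64_t addr, unsigned int bytes)
{
    unsigned int to_boundary =
        PAGE_SIZE - (unsigned int)(addr & (PAGE_SIZE - 1));

    return bytes < to_boundary ? bytes : to_boundary;
}
```

For an 8-byte access at 0xffc the first part is 4 bytes; the second part starts at buffer offset 4, while the cache entry is still identified by 0xffc.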
Both caches may need higher capacity, and the upper bound will need to
be determined dynamically based on CPUID policy (for AMX's TILELOAD /
TILESTORE at least).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
To avoid overrunning the internal buffer we need to take the offset into
the buffer into account.
Fixes: d95da91fb497 ("x86/HVM: grow MMIO cache data size to 64 bytes") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Add a new randconfig job for each FreeBSD version. This requires some
rework of the template so common parts can be shared between the full and
the randconfig builds. Such randconfig builds are relevant because FreeBSD
is the only tested system that has a full non-GNU toolchain.
While there replace the usage of the python311 package with python3, which is
already using 3.11, and remove the install of the plain python package for full
builds.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Roger Pau Monne [Thu, 16 Jan 2025 08:07:31 +0000 (09:07 +0100)]
automation/cirrus-ci: update FreeBSD to 13.4
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Jan Beulich [Fri, 17 Jan 2025 07:54:03 +0000 (08:54 +0100)]
xl: properly dispose of libxl_dominfo struct instances
The ssid_label field requires separate freeing; make sure to call
libxl_dominfo_dispose() as well as libxl_dominfo_init(). Since vcpuset()
calls only the former, add a call to the latter there at the same time.
Coverity-ID: 1638727
Coverity-ID: 1638728 Fixes: c458c404da16 ("xl: use libxl_domain_info to get the uuid in printf_info") Fixes: 48dab9767d2e ("tools/xl: use libxl_domain_info to get domain type for vcpu-pin") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
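The init/dispose pairing can be sketched with stand-in types. Nothing below is the libxl API: struct dominfo, dominfo_init(), dominfo_dispose() and dominfo_query() are hypothetical names that only mirror the shape of a struct owning a separately allocated string:

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a struct that owns a heap-allocated label. */
struct dominfo {
    char *ssid_label;
    int domid;
};

static void dominfo_init(struct dominfo *info)
{
    memset(info, 0, sizeof(*info));
}

static void dominfo_dispose(struct dominfo *info)
{
    free(info->ssid_label);
    info->ssid_label = NULL;
}

/* Hypothetical query that fills in the struct, allocating the label. */
static int dominfo_query(struct dominfo *info, int domid)
{
    static const char label[] = "example_label";

    info->ssid_label = malloc(sizeof(label));
    if ( !info->ssid_label )
        return -1;
    memcpy(info->ssid_label, label, sizeof(label));
    info->domid = domid;
    return 0;
}

/* Every path after init goes through dispose, so the label never leaks. */
static int lookup_domid(int domid)
{
    struct dominfo info;
    int rc;

    dominfo_init(&info);
    rc = dominfo_query(&info, domid);
    if ( rc == 0 )
        rc = info.domid;
    dominfo_dispose(&info);

    return rc;
}
```

Because init zeroes the struct, dispose is safe to call even when the query failed before allocating anything.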
Jan Beulich [Fri, 17 Jan 2025 07:53:27 +0000 (08:53 +0100)]
xentrace: free CPU mask string before overwriting pointer
While multiple -c options may be unexpected, we'd still better deal with
them properly.
Also restore the blank line that was bogusly zapped by the same commit.
Coverity-ID: 1638723 Fixes: e4ad2836842a ("xentrace: Implement cpu mask range parsing of human values (-c)") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
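The pattern amounts to releasing the previous string before storing the new one. A minimal sketch, assuming a single global holding the option value (cpu_mask_str and set_cpu_mask are illustrative names, not the xentrace ones):

```c
#include <stdlib.h>
#include <string.h>

static char *cpu_mask_str; /* hypothetical storage for the -c argument */

/* Returns 0 on success, -1 on allocation failure. */
static int set_cpu_mask(const char *arg)
{
    size_t len = strlen(arg) + 1;
    char *copy = malloc(len);

    if ( !copy )
        return -1;
    memcpy(copy, arg, len);

    /* Release the string from any earlier -c before overwriting the pointer. */
    free(cpu_mask_str);
    cpu_mask_str = copy;

    return 0;
}
```

free(NULL) is a no-op, so the first -c needs no special casing.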
Bernhard Kaindl [Wed, 15 Jan 2025 15:09:04 +0000 (16:09 +0100)]
docs/misc: Fix a few typos
While skimming through the misc docs, I spotted a few typos.
Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Roger Pau Monne [Tue, 14 Jan 2025 14:10:14 +0000 (15:10 +0100)]
automation/gitlab: disable coverage from clang randconfig
If randconfig enables coverage support the build times out due to GNU LD
taking too long. For the time being prevent coverage from being enabled in
clang randconfig job.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
****************************************
Panic on CPU 0:
FATAL PAGE FAULT
[error_code=0011]
Faulting linear address: 0000000062ccfa70
****************************************
Swap the preference to default to CMOS first, and EFI later, in an attempt to
use EFI_GET_TIME as a last resort option only. Note that Linux for example
doesn't allow calling the get_time method, and instead provides a dummy handler
that unconditionally returns EFI_UNSUPPORTED on x86-64.
Such change in the preferences requires some re-arranging of the function
logic, so that panic messages with workaround suggestions are suitably printed.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-By: Oleksii Kurochko<oleksii.kurochko@gmail.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
x86/time: introduce command line option to select wallclock
Allow setting the used wallclock from the command line. When the option is set
to a value different than `auto` the probing is bypassed and the selected
implementation is used (as long as it's available).
The `xen` and `efi` options require being booted as a Xen guest (with Xen guest
support built-in) or from UEFI firmware respectively.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
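A rough sketch of the selection logic described above, not Xen's parser. The enum, the function names and the `cmos` spelling are assumptions; only `auto`, `xen` and `efi` are named in the message:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum wallclock_src { WC_AUTO, WC_XEN, WC_CMOS, WC_EFI, WC_UNKNOWN };

/* Map the option string to a source; unknown strings are rejected. */
static enum wallclock_src parse_wallclock(const char *arg)
{
    static const struct {
        const char *name;
        enum wallclock_src src;
    } opts[] = {
        { "auto", WC_AUTO },
        { "xen",  WC_XEN  },
        { "cmos", WC_CMOS }, /* assumed spelling */
        { "efi",  WC_EFI  },
    };

    for ( size_t i = 0; i < sizeof(opts) / sizeof(opts[0]); i++ )
        if ( !strcmp(arg, opts[i].name) )
            return opts[i].src;

    return WC_UNKNOWN;
}

/* Probing only happens when no explicit source was selected. */
static bool needs_probe(enum wallclock_src src)
{
    return src == WC_AUTO;
}
```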
Roger Pau Monne [Tue, 14 Jan 2025 11:08:22 +0000 (12:08 +0100)]
automation/eclair: make Misra rule 20.7 blocking
There are no violations left, make the rule globally blocking for both x86
and ARM.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Oleksii Kurochko<oleksii.kurochko@gmail.com>
Rule 11.8 states the following: "A cast shall not remove any `const' or
`volatile' qualification from the type pointed to by a pointer".
Function `__hvm_copy' in `xen/arch/x86/hvm/hvm.c' is a double-use
function, where the parameter needs to not be const because it can be
set for write or not. As it was decided a new const-only function will
lead to more developer confusion than it's worth, this violation is
addressed by deviating the function.
All cases of casting away const-ness are accompanied with a comment
explaining why it is safe given the other flags passed in; such comment is used
by the deviation in order to match the appropriate function call.
Petr Beneš [Thu, 2 Jan 2025 17:13:28 +0000 (17:13 +0000)]
x86: Add Support for Paging-Write Feature
This patch introduces a new XENMEM_access_r_pw permission.
Functionally, it is similar to XENMEM_access_r, but for processors
with TERTIARY_EXEC_EPT_PAGING_WRITE support (Intel 12th Gen/Alder Lake
and later, Xeon 4th Gen/Sapphire Rapids and later), it also permits the
CPU to write to the page during guest page-table walks (e.g., updating
A/D bits) without triggering an EPT violation.
This behavior works by both enabling the EPT paging-write feature and
setting the EPT paging-write flag in the EPT leaf entry.
This feature provides a significant performance boost for
introspection tools that monitor guest page-table updates. Previously,
every page-table modification by the guest—including routine updates
like setting A/D bits—triggered an EPT violation, adding unnecessary
overhead. The new XENMEM_access_r_pw permission allows these
"uninteresting" updates to occur without EPT violations, improving
efficiency.
Additionally, this feature simplifies the handling of race conditions
in scenarios where an introspection tool:
- Sets an "invisible breakpoint" in the altp2m view for a function F.
- Monitors guest page-table updates to track whether the page
containing F is paged out.
- Encounters a cleared Access (A) bit on the page containing F while
the guest is about to execute the breakpoint.
In the current implementation:
- If xc_monitor_inguest_pagefault() is enabled, the introspection tool
must emulate both the breakpoint and the setting of the Access bit.
- If xc_monitor_inguest_pagefault() is disabled, Xen handles the EPT
violation without notifying the introspection tool, setting the
Access bit and emulating the instruction. However, Xen fetches the
instruction from the default view instead of the altp2m view,
potentially causing the breakpoint to be missed.
With this patch, setting XENMEM_access_r_pw for monitored guest
page-tables prevents EPT violations in these cases. This change
enhances performance and reduces complexity for introspection tools,
ensuring seamless breakpoint handling while tracking guest page-table
updates.
Signed-off-by: Petr Beneš <w1benny@gmail.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Petr Beneš [Thu, 2 Jan 2025 17:13:27 +0000 (17:13 +0000)]
x86: Rename _rsvd field to pw and move it to the bit 58
The EPT Paging-write feature (when enabled by the
TERTIARY_EXEC_EPT_PAGING_WRITE bit) uses bit 58 of the EPT entry to
indicate that guest paging may update the page, even if the W access
is not set.
This patch is a preparation for the EPT Paging-write feature.
Signed-off-by: Petr Beneš <w1benny@gmail.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
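For illustration, the bit position can be expressed with plain masks. This is a simplified sketch, not Xen's ept_entry_t bitfield: the macro and helper names are made up, and the check assumes the paging-write control is already enabled:

```c
#include <stdbool.h>
#include <stdint.h>

#define EPT_R   (UINT64_C(1) << 0)  /* read access */
#define EPT_W   (UINT64_C(1) << 1)  /* write access */
#define EPT_PW  (UINT64_C(1) << 58) /* paging-write, as described above */

/* Set or clear the paging-write flag in a raw EPT entry value. */
static uint64_t ept_set_paging_write(uint64_t entry, bool enable)
{
    return enable ? (entry | EPT_PW) : (entry & ~EPT_PW);
}

/*
 * May a guest page-table walk update this page (e.g. A/D bits)?
 * Either ordinary write access or the paging-write flag suffices.
 */
static bool ept_paging_write_allowed(uint64_t entry)
{
    return (entry & (EPT_W | EPT_PW)) != 0;
}
```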
Use the solution described in [1] to provide a wrapper to the 'date'
command that uses SOURCE_DATE_EPOCH if available. This is needed for
reproducible builds.
The -d "@..." syntax was introduced in GNU date about 2005 (but only
added to the documentation in 2011), so I assume a version supporting
this syntax is available, if SOURCE_DATE_EPOCH is defined. If
SOURCE_DATE_EPOCH is not defined, nothing changes with respect to the
current behavior.
Update all users of 'date' in the tree to use the new wrapper.
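The wrapper itself is built around the 'date' command. The C below is only an analogue of its fallback logic, under the assumption that an unset, empty or malformed SOURCE_DATE_EPOCH means "use the current time"; build_time() is a hypothetical name:

```c
#include <stdlib.h>
#include <time.h>

/* Return SOURCE_DATE_EPOCH if set and valid, else the current time. */
static time_t build_time(void)
{
    const char *s = getenv("SOURCE_DATE_EPOCH");

    if ( s && *s )
    {
        char *end;
        long long v = strtoll(s, &end, 10);

        /* Accept only a complete, non-negative decimal number. */
        if ( *end == '\0' && v >= 0 )
            return (time_t)v;
    }

    return time(NULL);
}
```

With the variable set, two builds of the same tree report the same timestamp regardless of when they run.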
Not having ppc and riscv included in DOC_ARCHES causes "multiple
definitions of ..." message on documentation build, similar to the
example shown below:
include/public/arch-ppc.h:91: multiple definitions of Typedef
vcpu_guest_core_regs_t: include/public/arch-arm.h:300
include/public/arch-ppc.h:91: multiple definitions of Typedef
vcpu_guest_core_regs_t: include/public/arch-ppc.h:85
It can also make the generated html documentation link to the header
files of another architecture. This is additionally a problem as it can
randomly make the documentation build non-reproducible.
Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 9 Jan 2025 15:06:34 +0000 (15:06 +0000)]
Update Xen version to 4.20-rc
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Thu, 9 Jan 2025 15:10:01 +0000 (15:10 +0000)]
Config.mk: Pin QEMU_UPSTREAM_REVISION
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Commit a14593e3995a ("xen/device-tree: Allow region overlapping with
/memreserve/ ranges") introduced a type in the 'struct membanks_hdr'
but forgot to update the 'struct kernel_info' initialiser. While
this doesn't lead to failures, because the field is not currently
used while managing kernel_info structures, it's good to have it
for completeness.
There are other instances of structures using 'struct membanks_hdr'
that are dynamically allocated and don't fully initialise these
fields, provide a static inline helper for that.
Fixes: a14593e3995a ("xen/device-tree: Allow region overlapping with /memreserve/ ranges") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Juergen Gross [Thu, 9 Jan 2025 16:34:01 +0000 (17:34 +0100)]
xen/events: fix race with set_global_virq_handler()
There is a possible race scenario between set_global_virq_handler()
and clear_global_virq_handlers() targeting the same domain, which
might result in that domain ending up as a zombie domain.
In case set_global_virq_handler() is being called for a domain which
is just dying, it might happen that clear_global_virq_handlers() is
running first, resulting in set_global_virq_handler() taking a new
reference for that domain and entering it in the global_virq_handlers[]
array afterwards. The reference will never be dropped, thus the domain
will never be freed completely.
This can be fixed by checking the is_dying state of the domain inside
the region guarded by global_virq_handlers_lock. In case the domain is
dying, handle it as if the domain didn't exist, which will be the
case in the near future anyway.
Fixes: 87521589aa6a ("xen: allow global VIRQ handlers to be delegated to other domains") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
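The shape of the fix can be sketched in userspace with a pthread mutex standing in for global_virq_handlers_lock. The struct, the single handler slot and the plain integer refcount are simplifications invented for this example, not Xen's data structures:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct domain {
    bool is_dying;
    int refcnt;
};

static pthread_mutex_t handlers_lock = PTHREAD_MUTEX_INITIALIZER;
static struct domain *handler; /* one slot standing in for the array */

/* Returns true if d was installed as handler. */
static bool set_handler(struct domain *d)
{
    bool ok = false;

    pthread_mutex_lock(&handlers_lock);
    /*
     * is_dying is set before clear_handlers() runs. A check made under
     * the lock therefore either precedes the clear (which then drops the
     * reference taken here) or observes is_dying and takes none.
     */
    if ( !d->is_dying )
    {
        if ( handler )
            handler->refcnt--;
        d->refcnt++;
        handler = d;
        ok = true;
    }
    pthread_mutex_unlock(&handlers_lock);

    return ok;
}

/* Teardown path of d, called after is_dying has been set. */
static void clear_handlers(struct domain *d)
{
    pthread_mutex_lock(&handlers_lock);
    if ( handler == d )
    {
        handler = NULL;
        d->refcnt--;
    }
    pthread_mutex_unlock(&handlers_lock);
}
```

Without the is_dying check, a set_handler() call arriving after clear_handlers() would take a reference that nothing ever drops.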
In file included from arch/arm/tee/ffa.c:72:
arch/arm/tee/ffa_private.h:329:17: error: 'used' attribute ignored on a non-definition declaration [-Werror,-Wignored-attributes]
extern uint32_t __ro_after_init ffa_fw_version;
^
The variable ffa_fw_version is only used in ffa.c. Remove the
declaration in the header and make the definition in ffa.c static.
Fixes: 2f9f240a5e87 ("xen/arm: ffa: Fine granular call support") Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Andrew Cooper [Wed, 8 Jan 2025 12:05:38 +0000 (12:05 +0000)]
CI: Update Fedora to 41
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Michal Orzel [Wed, 8 Jan 2025 07:57:19 +0000 (08:57 +0100)]
xen/arm64: Drop relocate_and_switch_ttbr() stub
In the original patch e7a80636f16e ("xen/arm: add cache coloring support
for Xen image"), the stub was added under the wrong assumption that DCE
won't remove the function call if it's not static. This assumption is
incorrect as we already rely on DCE for cases like this one. Therefore
drop the stub, which would otherwise be a place potentially prone to
errors in the future.
Michal Orzel [Tue, 7 Jan 2025 09:27:19 +0000 (10:27 +0100)]
xen/flask: Wire up XEN_DOMCTL_set_llc_colors
Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. Setting LLC colors is only possible during domain
creation.
Fixes: 6985aa5e0c3c ("xen: extend domctl interface for cache coloring") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Michal Orzel [Tue, 7 Jan 2025 09:27:18 +0000 (10:27 +0100)]
xen/flask: Wire up XEN_DOMCTL_dt_overlay
Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only dt overlay operation is attaching, which can
happen only after the domain is created. Dom0 can attach overlay to itself
as well.
Fixes: 4c733873b5c2 ("xen/arm: Add XEN_DOMCTL_dt_overlay and device attachment to domains") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Michal Orzel [Tue, 7 Jan 2025 09:27:17 +0000 (10:27 +0100)]
xen/flask: Wire up XEN_DOMCTL_vuart_op
Addition of FLASK permission for this hypercall was overlooked in the
original patch. Fix it. The only VUART operation is initialization, which
can occur only during domain creation.
Fixes: 86039f2e8c20 ("xen/arm: vpl011: Add a new domctl API to initialize vpl011") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
All selector fields under ctxt->regs are (normally) poisoned in the HVM
case, and the four ones besides CS and SS are potentially stale for PV.
Avoid using them in the hypervisor incarnation of the emulator, when
trying to cover for a missing ->read_segment() hook.
To make sure there's always a valid ->read_segment() handler for all HVM
cases, add a respective function to shadow code, even if it is not
expected for FPU insns to be used to update page tables.
Fixes: 0711b59b858a ("x86emul: correct FPU code/data pointers and opcode handling") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 8 Jan 2025 10:01:17 +0000 (11:01 +0100)]
x86emul: VCVT{,U}DQ2PD ignores embedded rounding
IOW we shouldn't raise #UD in that case. Be on the safe side though and
only encode fully legitimate forms into the stub to be executed.
Things weren't quite right for VCVT{,U}SI2SD either, in the attempt to
be on the safe side: Clearing EVEX.L'L isn't useful; it's EVEX.b which
primarily needs clearing. Also reflect the somewhat improved doc
situation in the comment there.
Fixes: ed806f373730 ("x86emul: support AVX512F legacy-equivalent packed int/FP conversion insns") Fixes: baf4a376f550 ("x86emul: support AVX512F legacy-equivalent scalar int/FP conversion insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Sun, 29 Dec 2024 18:18:22 +0000 (18:18 +0000)]
xen/perfc: Add perfc_defn.h to asm-generic
... and hook it up for RISC-V and PPC.
On RISC-V at least, no combination of headers pulls in errno.h, so include it
explicitly.
Guard the hypercalls array declaration based on NR_hypercalls existing. This
is sufficient to get PERF_COUNTERS fully working on RISC-V and PPC, so drop
the randconfig override.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Thu, 2 Jan 2025 19:46:19 +0000 (19:46 +0000)]
x86/pv: Fix build with Clang and CONFIG_PERF_COUNTERS
Clang, of at least version 17, complains:
arch/x86/pv/hypercall.c:30:10: error: variable 'eax' is used uninitialized
whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
30 | if ( !compat )
| ^~~~~~~
arch/x86/pv/hypercall.c:87:29: note: uninitialized use occurs here
87 | perfc_incra(hypercalls, eax);
| ^~~
This function is forced always_inline to cause compat to be
constant-propagated through, but that is only a heuristic to try and get the
compiler to do what we want, not a guarantee that it does.
Clang doesn't appear to be able to see that the only case where compat is
true (and therefore the if() is false) is when there's an else clause on the
end which sets eax too.
Initialise eax to -1, which ought to be optimised out, but if for whatever
reason it happens not to be, then perfc_incra() will fail its bounds check
and do nothing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
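Why -1 is a harmless initialiser can be shown with a cut-down model. The counter array, its size and both function names are assumptions for this sketch; perfc_incra() itself is not reproduced here:

```c
#include <stdbool.h>

#define NR_HYPERCALLS 64 /* illustrative size */

static unsigned int hypercall_counts[NR_HYPERCALLS];

/* Bounds-checked increment: an out-of-range index is silently ignored. */
static void count_hypercall(unsigned int idx)
{
    if ( idx < NR_HYPERCALLS )
        hypercall_counts[idx]++;
}

static void dispatch(bool compat, unsigned int nr)
{
    /* -1 converts to UINT_MAX, which can never pass the bounds check. */
    unsigned int eax = -1;

    if ( !compat )
        eax = nr;
    /* The compat path is omitted here, so eax keeps its initial value. */

    count_hypercall(eax);
}
```

In this model the compat call leaves every counter untouched, which is the "do nothing" fallback the message describes.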
Andrew Cooper [Tue, 31 Dec 2024 14:06:19 +0000 (14:06 +0000)]
x86/traps: Rework LER initialisation and support Zen5/Diamond Rapids
AMD have always used the architectural MSRs for LER. As the first processor
to support LER was the K7 (which was 32bit), we can assume its presence
unconditionally in 64bit mode.
Intel are about to run out of space in Family 6 and start using 19. It is
only the Pentium 4 which uses non-architectural LER MSRs.
percpu_traps_init(), which runs on every CPU, contains a lot of code which
should be init-only, and is the only reason why opt_ler can't be in initdata.
Write a brand new init_ler() which expects all future Intel and AMD CPUs to
continue using the architectural MSRs, and does all setup together. Call it
from trap_init(), and remove the setup logic from percpu_traps_init() except for
the single path configuring MSR_IA32_DEBUGCTLMSR.
Leave behind a warning if the user asked for LER and Xen couldn't enable it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Nicola Vetrini [Sun, 22 Dec 2024 14:04:08 +0000 (15:04 +0100)]
eclair-analysis: tidy toolchain.ecl configuration and mark Rule 1.1 clean
Reformat the list of GNU extensions and non-standard tokens used by Xen
in the ECLAIR configuration to make it easier to review any changes to it.
The extension "ext_missing_varargs_arg", which captures the GNU extension that
allows variadic functions and macros not to require at least one named parameter
before C23, has been renamed to "ext_c_missing_varargs_arg" in the current version
of ECLAIR used in CI, therefore this resolves regressions on MISRA C Rule 1.1:
"The program shall contain no violations of the standard C syntax and constraints,
and shall not exceed the implementation's translation limits."
As a result, Rule 1.1 now has no violations and is tagged as such.
Remove two unused configurations, that were already commented out.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Fixes: 631f535a3d4f ("xen: update ECLAIR service identifiers from MC3R1 to MC3A2.") Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Mon, 25 Mar 2024 15:14:46 +0000 (15:14 +0000)]
x86/spec-ctrl: Support for SRSO_U/S_NO and SRSO_MSR_FIX
AMD have updated the SRSO whitepaper[1] with further information. These
features exist on AMD Zen5 CPUs and are necessary for Xen to use.
The two features are in principle unrelated:
* SRSO_U/S_NO is an enumeration saying that SRSO attacks can't cross the
User(CPL3) / Supervisor(CPL<3) boundary. i.e. Xen don't need to use
IBPB-on-entry for PV64. PV32 guests are explicitly unsupported for
speculative issues, and excluded from consideration for simplicity.
* SRSO_MSR_FIX is an enumeration identifying that the BP_SPEC_REDUCE bit is
available in MSR_BP_CFG. When set, SRSO attacks can't cross the host/guest
boundary. i.e. Xen don't need to use IBPB-on-entry for HVM.
Extend ibpb_calculations() to account for these when calculating
opt_ibpb_entry_{pv,hvm} defaults. Add a `bp-spec-reduce=<bool>` option to
control the use of BP_SPEC_REDUCE, with it active by default.
Because MSR_BP_CFG is core-scoped with a race condition updating it, repurpose
amd_check_erratum_1485() into amd_check_bp_cfg() and calculate all updates at
once.
Xen also needs to advertise SRSO_U/S_NO to guests to allow the guest kernel
to skip SRSO mitigations too:
* This is trivial for HVM guests. It is also accurate for PV32 guests
too, but we have already excluded them from consideration, and do so again
here to simplify the policy logic.
* As written, SRSO_U/S_NO does not help for the PV64 user->kernel boundary.
However, after discussing with AMD, an implementation detail of having
BP_SPEC_REDUCE active causes the PV64 user->kernel boundary to have the
property described by SRSO_U/S_NO, so we can advertise SRSO_U/S_NO to
guests when the BP_SPEC_REDUCE precondition is met.
Finally, fix a typo in the SRSO_NO's comment.
[1] https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
xen/arch/x86: make objdump output user locale agnostic
The objdump output is fed to grep, so make sure it doesn't change with
different user locales and break the grep parsing.
This problem was identified while updating xen in Debian and the fix is
needed for generating reproducible builds in varying environments.
Signed-off-by: Maximilian Engelhardt <maxi@daemonizer.de> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>