With the stack mapped on a per-CPU basis there's no risk of other CPUs being
able to read the stack contents, but vCPUs running on the current pCPU could
read stack rubble from operations of previous vCPUs.
The #DF stack is not zeroed because handling of #DF results in a panic.
The contents of the shadow stack are not cleared as part of this change. It's
arguable that leaking internal Xen return addresses is not guest confidential
data. At most those could be used by an attacker to figure out the paths
inside of Xen previous execution flows have used.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- Zero the stack forward to use ERMS.
- Only zero the IST stacks if they have been used.
- Only zero the primary stack for full context switches.
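The zeroing policy listed above can be pictured with a small stand-alone model. This is only a sketch: struct cpu_stacks, the ist_used flag, STACK_SIZE and zero_stacks() are hypothetical names, and memset() stands in for whatever forward-filling primitive the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define STACK_SIZE 4096   /* hypothetical size, for illustration only */

struct cpu_stacks {
    unsigned char primary[STACK_SIZE];
    unsigned char ist[STACK_SIZE];
    bool ist_used;        /* hypothetical: set when the IST stack was entered */
};

/* Clear leftover stack contents before another vCPU runs on this pCPU. */
static void zero_stacks(struct cpu_stacks *s, bool full_ctxt_switch)
{
    assert(s != NULL);

    /* Only pay for the IST stack if something actually ran on it. */
    if ( s->ist_used )
    {
        memset(s->ist, 0, sizeof(s->ist));
        s->ist_used = false;
    }

    /* The primary stack is only cleared on full context switches. */
    if ( full_ctxt_switch )
        memset(s->primary, 0, sizeof(s->primary));
}
```

Tracking usage with a flag keeps the common path cheap, since an unused IST stack is never touched.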
x86/mm: switch to a per-CPU mapped stack when using ASI
When using ASI the CPU stack is mapped using a range of fixmap entries in the
per-CPU region. This ensures the stack is only accessible by the current CPU.
Note however there's further work required in order to allocate the stack from
domheap instead of xenheap, and ensure the stack is not part of the direct
map.
For domains not running with ASI enabled all the CPU stacks are mapped in the
per-domain L3, so that the stack is always at the same linear address,
regardless of whether ASI is enabled or not for the domain.
When calling UEFI runtime methods the current per-domain slot needs to be added
to the EFI L4, so that the stack is available in UEFI.
Finally, some users of callfunc IPIs pass parameters from the stack, so when
handling a callfunc IPI the stack of the caller CPU is mapped into the address
space of the CPU handling the IPI. This needs further work to use a bounce
buffer in order to avoid having to map remote CPU stacks.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
There's also further work required in order to avoid mapping remote stack when
handling callfunc IPIs.
Roger Pau Monne [Thu, 10 Oct 2024 07:31:46 +0000 (09:31 +0200)]
x86/pv: allow using a unique per-pCPU root page table (L4)
When running PV guests it's possible for the guest to use the same root page
table (L4) for all vCPUs, which in turn will result in Xen also using the same
root page table on all pCPUs that are running any domain vCPU.
When using XPTI Xen switches to a per-CPU shadow L4 when running in guest
context, switching to the fully populated L4 when in Xen context.
Take advantage of this existing shadowing and force the usage of a per-CPU L4
that shadows the guest selected L4 when Address Space Isolation is requested
for PV guests.
The mapping of the guest L4 is done with a per-CPU fixmap entry, which however
requires that the currently loaded L4 has the per-CPU slot set up. To ensure
this, first switch to the shadow per-CPU L4 with just the Xen slots populated,
and then map the guest L4 and copy the contents of the guest controlled
slots.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 27 Nov 2024 16:37:01 +0000 (17:37 +0100)]
x86/mm: switch {create,destroy}_perdomain_mapping() domain parameter to vCPU
In preparation for the per-domain area being per-vCPU. This requires moving
some of the {create,destroy}_perdomain_mapping() calls from the domain
initialization and teardown paths into vCPU initialization and teardown.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 14 Jun 2024 11:07:31 +0000 (13:07 +0200)]
x86/spec-ctrl: introduce Address Space Isolation command line option
No functional change, as the option is not used.
Introduced now so that newly added functionality can be keyed on the option
being enabled, even if the feature is non-functional.
When ASI is enabled for PV domains, printing the usage of XPTI might be
omitted if it must be uniformly disabled given the usage of ASI.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- Improve comments and documentation about what ASI provides.
- Do not print the XPTI information if ASI is used for pv domUs and dom0 is
PVH, or if ASI is used for both domU and dom0.
FWIW, I would print the state of XPTI uniformly, as otherwise I find the output
might be confusing for users expecting to assert the state of XPTI.
Roger Pau Monne [Thu, 20 Jun 2024 11:12:14 +0000 (13:12 +0200)]
x86/mm: split setup of the per-domain slot on context switch
It's currently only used for XPTI. Move the code to a separate helper in
preparation for it gaining more logic.
Gate the calling of the new helper in paravirt_ctxt_switch_to() to XPTI being
enabled for the domain rather than root_pgt having been allocated for the pCPU,
as it's possible to have XPTI active only for domUs and not dom0, and in that
case setting the domain slot in root_pgt is useless when running dom0 without
XPTI.
While there switch to using l4e_write(): in the current context the L4 is
not active when modified, but that could change.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
x86/mm: move FLUSH_ROOT_PGTBL handling before TLB flush
Move the handling of FLUSH_ROOT_PGTBL in flush_area_local() ahead of the logic
that does the TLB flushing, in preparation for further changes requiring the
TLB flush to be strictly done after having handled FLUSH_ROOT_PGTBL.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 14 Jun 2024 10:41:04 +0000 (12:41 +0200)]
x86/pv: untie issuing FLUSH_ROOT_PGTBL from XPTI
The current logic gates issuing flush TLB requests with the FLUSH_ROOT_PGTBL
flag to XPTI being enabled.
In preparation for FLUSH_ROOT_PGTBL also being needed when not using XPTI,
untie it from the xpti domain boolean and instead introduce a new flush_root_pt
field.
No functional change intended, as flush_root_pt == xpti.
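As a rough illustration of the decoupling, the toy model below keys the flush flags on the new field rather than on xpti. The struct, the flag values and both helper functions are made up for the example; only the field names xpti and flush_root_pt and the FLUSH_ROOT_PGTBL flag name come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the PV domain state; only the two flags matter here. */
struct pv_domain_flags {
    bool xpti;           /* XPTI enabled for the domain */
    bool flush_root_pt;  /* FLUSH_ROOT_PGTBL must accompany TLB flushes */
};

#define FLUSH_TLB         (1u << 0)   /* illustrative flag values */
#define FLUSH_ROOT_PGTBL  (1u << 1)

static void init_flags(struct pv_domain_flags *f, bool xpti)
{
    assert(f != NULL);
    f->xpti = xpti;
    /* For now the new field simply mirrors xpti. */
    f->flush_root_pt = xpti;
}

/* Flush requests are keyed on flush_root_pt, no longer on xpti. */
static unsigned int tlb_flush_flags(const struct pv_domain_flags *f)
{
    assert(f != NULL);
    return FLUSH_TLB | (f->flush_root_pt ? FLUSH_ROOT_PGTBL : 0u);
}
```

A later change can then set flush_root_pt for a domain without XPTI without touching any of the flush call sites.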
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 18 Oct 2024 13:49:42 +0000 (15:49 +0200)]
x86/mm: replace {create,destroy}_perdomain_mapping() implementations to use map_pages()
The current logic in {create,destroy}_perdomain_mapping() is an ad-hoc
implementation that modifies page-tables to fulfill the usages required by the
per-domain slot.
Remove such open-coded page-table manipulations and instead delegate page-table
changes to map_pages() and destroy_mappings(). Since such functions require a
root page table as parameter, use a scratch L4 only populated with the
per-domain slot as the parameter.
If the caller has requested the region to be populated, such filling is done by
fetching the L1 page-table after map_pages() has created the paging structures
and populating the slots directly. Otherwise each entry to be populated would
require a call to map_pages() which is wasteful when we know the L1 table is
unconditionally present.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
There are no longer any callers of create_perdomain_mapping() that request a
reference to the used L1 tables, and hence the only difference between them is
whether the caller wants the region to be populated, or just the paging
structures to be allocated.
Simplify the arguments to create_perdomain_mapping() to reflect the current
usages: drop the last two arguments and instead introduce a boolean to signal
whether the caller wants the region populated.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 24 Oct 2024 08:54:23 +0000 (09:54 +0100)]
x86/pv: remove stashing of GDT/LDT L1 page-tables
There are no remaining callers of pv_gdt_ptes() or pv_ldt_ptes() that use the
stashed L1 page-tables in the domain structure. As such, the helpers and the
fields can now be removed.
No functional change intended, as the removed logic is not used.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Tue, 22 Oct 2024 11:47:55 +0000 (13:47 +0200)]
x86/pv: update guest LDT mappings using the linear entries
The pv_map_ldt_shadow_page() and pv_destroy_ldt() functions rely on the L1
table(s) that contain such mappings being stashed in the domain structure, and
thus such mappings being modified by merely updating the required L1 entries.
Switch pv_map_ldt_shadow_page() to unconditionally use the linear recursive
mappings, as that logic is always called while the vCPU is running on the
current pCPU.
For pv_destroy_ldt() use the linear mappings if the vCPU is the one currently
running on the pCPU, otherwise use destroy_mappings().
Note this requires keeping an array with the pages currently mapped at the LDT
area, as that allows dropping the extra taken page reference when removing the
mappings.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Tue, 26 Nov 2024 15:29:13 +0000 (16:29 +0100)]
x86/pv: set/clear guest GDT mappings using {populate,destroy}_perdomain_mapping()
The pv_{set,destroy}_gdt() functions rely on the L1 table(s) that contain such
mappings being stashed in the domain structure, and thus such mappings being
modified by merely updating the L1 entries.
Switch both pv_{set,destroy}_gdt() to instead use
{populate,destroy}_perdomain_mapping().
Note that this requires moving the pv_set_gdt() call in arch_set_info_guest()
strictly after update_cr3(), so v->arch.cr3 is valid when
populate_perdomain_mapping() is called.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Mon, 21 Oct 2024 09:44:04 +0000 (11:44 +0200)]
x86/pv: introduce function to populate perdomain area and use it to map Xen GDT
The current code to update the Xen part of the GDT when running a PV guest
relies on caching the direct map address of all the L1 tables used to map the
GDT and LDT, so that entries can be modified.
Introduce a new function that populates the per-domain region, either using the
recursive linear mappings when the target vCPU is the current one, or by
directly modifying the L1 table of the per-domain region.
Using such function to populate per-domain addresses drops the need to keep a
reference to per-domain L1 tables previously used to change the per-domain
mappings.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 27 Nov 2024 09:55:12 +0000 (10:55 +0100)]
x86/mm: introduce helper to detect per-domain L1 entries that need freeing
L1 entries that require the underlying page to be freed have the _PAGE_AVAIL0
bit set, introduce a helper to unify the checking logic into a single place.
No functional change intended.
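A minimal sketch of such a helper follows. The helper name is hypothetical, the entry is modelled as a plain 64-bit value, and also requiring the present bit is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 PTE flag bits: bit 0 is "present", bit 9 is software-available. */
#define _PAGE_PRESENT 0x001u
#define _PAGE_AVAIL0  0x200u

/* Hypothetical helper name; the real one is not given in the message. */
static bool perdomain_l1e_needs_freeing(uint64_t l1e)
{
    const uint64_t mask = _PAGE_PRESENT | _PAGE_AVAIL0;

    /* Only present entries tagged with AVAIL0 own their underlying page. */
    return (l1e & mask) == mask;
}
```

Having one predicate means every teardown path agrees on which pages are owned by the per-domain area.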
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 27 Nov 2024 09:00:48 +0000 (10:00 +0100)]
x86/mm: rely on free_perdomain_mappings() to teardown all per-domain mappings
The current domain destroy and several error paths attempt to cleanup specific
per-domain areas using destroy_perdomain_mapping(). This is not needed, since
free_perdomain_mappings() will already take care of tearing down per-domain
page tables, plus freeing any pages that had been allocated by
create_perdomain_mapping().
Rely on such behavior and avoid unnecessary destroy_perdomain_mapping().
Note that after this change destroy_perdomain_mapping() becomes unused. Don't
remove it yet, however, as further changes will make use of the function.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 4 Oct 2024 08:23:47 +0000 (10:23 +0200)]
x86/domain: limit window where curr_vcpu != current on context switch
On x86 Xen will perform lazy context switches to the idle vCPU, where the
previously running vCPU context is not overwritten, and only current is updated
to point to the idle vCPU. The state is then disjunct between current and
curr_vcpu: current points to the idle vCPU, while curr_vcpu points to the vCPU
whose context is loaded on the pCPU.
While on that lazy context switched state, certain calls (like
map_domain_page()) will trigger a full synchronization of the pCPU state by
forcing a context switch. Note however how calling any of such functions
inside the context switch code itself is very likely to trigger an infinite
recursion loop.
Attempt to limit the window where curr_vcpu != current in the context switch
code, so as to prevent an infinite recursion loop around sync_local_execstate().
This is required for using map_domain_page() in the vCPU context switch code,
otherwise using map_domain_page() in that context ends up in a recursive
sync_local_execstate() loop:
Roger Pau Monne [Fri, 22 Nov 2024 10:42:13 +0000 (11:42 +0100)]
x86/mm: limit non-idle page-table manipulations to the per-domain area
Page-tables used by Xen, either the idle page-table, the HVM monitor ones, or
the page-tables built by PV guests all share a set of slots that are owned by
Xen. Limit which modifications can be made to non-idle page-tables and only
allow modifying the per-domain slot region, which is different and set based on
the domain owning the page-table.
With the added limitation of which area can be manipulated on non-idle
page-tables, also relax the locking and TLB flushing discipline. When
modifying the per-domain area the caller is responsible for ensuring correctness
of concurrent accesses, either by external locking or by not concurrently
modifying the same entries. Similarly the caller is also responsible for doing
any TLB flushing if necessary.
flush_area() now has a dependency on 'root_pgt' being declared in the caller
context, and containing the address of the root page-table being modified.
Note the scope of flush_area() is limited to the functions where it's used, so
the dependency is better than the churn introduced by modifying every caller to
pass 'root_pgt' as a macro parameter.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 22 Nov 2024 11:51:46 +0000 (12:51 +0100)]
x86/mm: optionally pass a domain to allocate page-tables from
Functions that perform page-table management need to allocate memory to use for
the paging structures. So far this memory has always been allocated from
domheap by passing a NULL domain parameter.
In preparation for using map_pages() and related functions to manipulate
page-tables different than the idle one, allow passing a domain parameter that
will be propagated into alloc_domheap_pages() when allocating memory for paging
structures.
Note that virt_to_l3e() is not adjusted to take an extra domain parameter,
that's because L4 slots are expected to always be populated on the idle
page-tables, and hence d should always be NULL. There's already a check in
virt_to_l3e() that ensures L4 entry setup only happens for the idle
page-tables.
No functional change intended, as no caller yet passes a domain parameter.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 18 Oct 2024 13:03:54 +0000 (15:03 +0200)]
x86/mm: allow modifying page-tables different than the idle vCPU
The current functions map_pages_to_xen(), modify_xen_mappings() and
destroy_xen_mappings() are generic enough to be used against arbitrary
page-tables, however they are currently limited to only manipulating the idle
vCPU page-table.
Introduce non Xen variants of the above functions, that take a root page-table
pointer as a parameter. This allows modifying page-tables different than the
idle vCPU one.
No functional change intended, as there are no callers of the newly introduced
functions.
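The relationship between the existing and the new variants can be sketched with a toy model. The signatures below are deliberately simplified and are not the real ones: a "page-table" is just a small array, and root_pgt_t, idle_pgt and NR_SLOTS are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 8   /* toy table size */

typedef struct { unsigned long slot[NR_SLOTS]; } root_pgt_t;

static root_pgt_t idle_pgt;   /* stands in for the idle vCPU page-table */

/* Generic variant: the caller picks which root page-table is modified. */
static int map_pages(root_pgt_t *root, unsigned int idx, unsigned long val)
{
    assert(root != NULL);

    if ( idx >= NR_SLOTS )
        return -1;

    root->slot[idx] = val;
    return 0;
}

/* Existing entry point: same behaviour as before, fixed to the idle table. */
static int map_pages_to_xen(unsigned int idx, unsigned long val)
{
    return map_pages(&idle_pgt, idx, val);
}
```

Keeping the Xen-specific name as a thin wrapper is what lets existing callers stay untouched.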
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
The destroy_perdomain_mapping() call in the hvm_domain_initialise() fail path
is useless. destroy_perdomain_mapping() called with nr == 0 is effectively a
no-op, as there are no entries torn down. Remove the call, as
arch_domain_create() already calls free_perdomain_mappings() on failure.
There's also a call to destroy_perdomain_mapping() in pv_domain_destroy() which
is also not needed. arch_domain_destroy() will already unconditionally call
free_perdomain_mappings(), which does the same as destroy_perdomain_mapping(),
plus additionally frees the page table structures.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Michal Orzel [Tue, 3 Dec 2024 09:22:14 +0000 (10:22 +0100)]
bootfdt: Add missing trailing commas in BOOTINFO_{ACPI,SHMEM}_INIT
Commit a14593e3995a extended BOOTINFO_{ACPI,SHMEM}_INIT initializers
list with a new 'type' member but forgot to add trailing commas (they
were present before). This results in a build failure when building
with CONFIG_ACPI=y and CONFIG_STATIC_SHM=y:
./include/xen/bootfdt.h:155:5: error: request for member 'shmem' in something not a structure or union
155 | .shmem.common.max_banks = NR_SHMEM_BANKS, \
| ^
./include/xen/bootfdt.h:168:5: note: in expansion of macro 'BOOTINFO_SHMEM_INIT'
168 | BOOTINFO_SHMEM_INIT \
| ^~~~~~~~~~~~~~~~~~~
common/device-tree/bootinfo.c:22:39: note: in expansion of macro 'BOOTINFO_INIT'
22 | struct bootinfo __initdata bootinfo = BOOTINFO_INIT;
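The failure mode can be reproduced in miniature. The struct and macro names below are invented for the example; the point is only that each chained initializer fragment has to end in a comma, otherwise the next designator is parsed as a member access on the previous value.

```c
#include <assert.h>

struct banks { unsigned int max_banks; int type; };
struct info  { struct banks mem, acpi, shmem; };

/* Each optional fragment ends in a comma so that fragments can be chained. */
#define INFO_ACPI_INIT                 \
    .acpi.max_banks = 4,               \
    .acpi.type = 1,

#define INFO_SHMEM_INIT                \
    .shmem.max_banks = 8,              \
    .shmem.type = 2,

#define INFO_INIT                      \
{                                      \
    .mem.max_banks = 16,               \
    INFO_ACPI_INIT                     \
    INFO_SHMEM_INIT                    \
}

static const struct info example_info = INFO_INIT;
```

Dropping the comma after `.acpi.type = 1` would make the expansion read `.acpi.type = 1 .shmem.max_banks = 8`, which matches the kind of diagnostic quoted above.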
Fixes: a14593e3995a ("xen/device-tree: Allow region overlapping with /memreserve/ ranges")
Signed-off-by: Michal Orzel <michal.orzel@amd.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Andrew Cooper [Mon, 2 Dec 2024 17:27:05 +0000 (17:27 +0000)]
libs/guest: Fix migration compatibility with a security-patched Xen 4.13
xc_cpuid_apply_policy() provides compatibility for migration of a pre-4.14 VM
where no CPUID data was provided in the stream.
It guesses the various max-leaf limits, based on what was true at the time of
writing, but this was not correctly adapted when speculative security issues
forced the advertisement of new feature bits. Of note are:
* LFENCE-DISPATCH, in leaf 0x80000021.eax
* BHI-CTRL, in leaf 0x7[2].edx
In both cases, a VM booted on a security-patched Xen 4.13, and then migrated
on to any newer version of Xen on the same or compatible hardware would have
these features stripped back because Xen is still editing the cpu-policy for
sanity behind the back of the toolstack.
For VMs using BHI_DIS_S to mitigate Native-BHI, this resulted in a failure to
restore the guest's MSR_SPEC_CTRL setting:
(XEN) HVM d7v0 load MSR 0x48 with value 0x401 failed
(XEN) HVM7 restore: failed to load entry 20/0 rc -6
Fixes: e9b4fe263649 ("x86/cpuid: support LFENCE always serialising CPUID bit")
Fixes: f3709b15fc86 ("x86/cpuid: Infrastructure for cpuid word 7:2.edx")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 3 Dec 2024 08:14:46 +0000 (08:14 +0000)]
CI: Update to FreeBSD 14.2
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Anthony PERARD <anthony.perard@vates.tech>
CC: Stefano Stabellini <sstabellini@kernel.org>
https://cirrus-ci.com/task/5913096629256192
xen/arm: mpu: Implement a dummy enable_secondary_cpu_mm
Secondary cpus initialization is not yet supported. Thus, we print an
appropriate message and put the secondary cpus in WFE state.
We also introduce a BUILD_BUG_ON to prevent users from building Xen
on multiprocessor based MPU systems.
In Arm, there is no clean way to disable SMP. As of now, we wish to support
MPU on UNP only. So, we have defined the default range of NR_CPUs to be 1 for
MPU.
After the regions have been created, we enable the MPU. For this we disable
the background region so that the new memory map created for the regions takes
effect. Also, we treat all RW regions as non executable and the data cache is
enabled.
xen/arm: mpu: Create boot-time MPU protection regions
Define enable_boot_cpu_mm() for the Armv8-R AArch64.
Like the boot-time page tables on an MMU system, an MPU system needs a
boot-time MPU protection region configuration so that Xen can fetch code and
data from normal memory.
To do this, Xen maps the following sections of the binary as separate regions
(with permissions) :-
1. Text (Read only at EL2, execution is permitted)
2. RO data (Read only at EL2)
3. RO after init data and RW data (Read/Write at EL2)
4. Init Text (Read only at EL2, execution is permitted)
5. Init data and BSS (Read/Write at EL2)
Before creating a region, we check if the count exceeds the number defined in
MPUIR_EL2. If so, then the boot fails.
Also we check if the region is empty or not. IOW, if the start and end
addresses are the same, we skip mapping the region.
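The two checks can be modelled in plain C as follows. This is a sketch only: the real code programs PRBAR_EL2/PRLAR_EL2 from early boot code, whereas struct mpu_model, map_region() and the return codes here are hypothetical, and the order of the two checks is a choice made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_MODEL_REGIONS 16   /* capacity of this toy model only */

struct mpu_model {
    unsigned int max_regions;  /* what MPUIR_EL2 would report */
    unsigned int used;
    struct { uint64_t base, limit; } region[MAX_MODEL_REGIONS];
};

enum { REGION_MAPPED = 0, REGION_SKIPPED = 1, REGION_FAIL = -1 };

static int map_region(struct mpu_model *m, uint64_t start, uint64_t end)
{
    assert(m != NULL && m->max_regions <= MAX_MODEL_REGIONS);

    /* Empty section: nothing to map. */
    if ( start == end )
        return REGION_SKIPPED;

    /* No free region left: the boot would fail at this point. */
    if ( m->used >= m->max_regions )
        return REGION_FAIL;

    m->region[m->used].base = start;
    m->region[m->used].limit = end - 1;   /* stored as inclusive in this model */
    m->used++;

    return REGION_MAPPED;
}
```

Skipping empty sections first means they never consume one of the limited regions.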
To map a region, Xen uses the PRBAR_EL2, PRLAR_EL2 and PRSELR_EL2 registers.
One can refer to ARM DDI 0600B.a ID062922 G1.3 "General System Control
Registers", to get the definitions of these registers. Also, refer to G1.2
"Accessing MPU memory region registers", which states the following:
```
The MPU provides two register interfaces to program the MPU regions:
- Access to any of the MPU regions via PRSELR_ELx, PRBAR<n>_ELx, and
PRLAR<n>_ELx.
```
We use the above mechanism to create the MPU memory regions.
Also, the compiler needs the flag ("-march=armv8-r") in order to build Xen for
Armv8-R AArch64 MPU based systems. There will be no need for us to explicitly
define MPU specific registers.
Julien Grall [Wed, 27 Nov 2024 10:55:12 +0000 (10:55 +0000)]
xen/arm32: Get rid of __memzero()
All the code in arch/arm32/lib/ was copied from Linux 3.16
and never re-synced since then.
A few years ago, Linux got rid of __memzero() because the implementation
is very similar to memset(p,0,n) and the current use of __memzero()
interferes with optimization. See full commit message from Linux below.
So it makes sense to get rid of __memzero in Xen as well.
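For illustration, a former __memzero() call site reduces to a plain memset(). The wrapper name below is hypothetical and exists only to make the example testable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical wrapper standing in for a former __memzero() call site. */
static void clear_buffer(void *p, size_t n)
{
    assert(p != NULL);

    /* memset() copes with n == 0, so no separate length guard is needed. */
    memset(p, 0, n);
}
```

With a constant zero fill value visible to it, the compiler is free to inline or otherwise optimize the call.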
From ff5fdafc9e9702846480e0cea55ba861f72140a2 Mon Sep 17 00:00:00 2001
From: Nicolas Pitre <nicolas.pitre@linaro.org>
Date: Fri, 19 Jan 2018 18:17:46 +0100
Subject: [PATCH] ARM: 8745/1: get rid of __memzero()
The __memzero assembly code is almost identical to memset's except for
two orr instructions. The runtime performance of __memzero(p, n) and
memset(p, 0, n) is accordingly almost identical.
However, the memset() macro used to guard against a zero length and to
call __memzero at compile time when the fill value is a constant zero
interferes with compiler optimizations.
Arnd found that the test against a zero length brings up some new
warnings with gcc v8:
And successively removing the test against a zero length and the call
to __memzero optimization produces the following kernel sizes for
defconfig with gcc 6:
So it is probably not worth keeping __memzero around given that the
compiler can do a better job at inlining trivial memset(p,0,n) on its
own. And the memset code already handles a zero length just fine.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ff5fdafc9e97
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Michal Orzel <michal.orzel@amd.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Roger Pau Monné [Mon, 2 Dec 2024 14:22:05 +0000 (15:22 +0100)]
xen/Kconfig: livepatch-build-tools requires debug information
The tools infrastructure used to build livepatches for Xen
(livepatch-build-tools) consumes some DWARF debug information present in
xen-syms to generate a livepatch (see livepatch-build script usage of readelf
-wi).
The current Kconfig defaults however will enable LIVEPATCH without DEBUG_INFO
on release builds, thus providing a default Kconfig selection that's not
suitable for livepatch-build-tools even when LIVEPATCH support is enabled,
because it's missing the DWARF debug section.
Fix by defaulting DEBUG_INFO to enabled when LIVEPATCH is.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 2 Dec 2024 08:50:14 +0000 (09:50 +0100)]
x86emul: MOVBE requires a memory operand
The reg-reg forms should cause #UD; they come into existence only with
APX, where MOVBE also extends BSWAP (for the latter not being "eligible"
to a REX2 prefix).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jason Andryuk [Mon, 2 Dec 2024 08:49:38 +0000 (09:49 +0100)]
xl: Keep monitoring suspended domain
When a VM transitioned to LIBXL_SHUTDOWN_REASON_SUSPEND, the xl daemon
was exiting as 0 = DOMAIN_RESTART_NONE "No domain restart".
Later, when the VM actually shut down, the missing xl daemon meant the
domain wasn't cleaned up properly.
Add a new DOMAIN_RESTART_SUSPENDED to handle the case. The xl daemon
keeps running to react to future shutdown events.
The domain death event needs to be re-enabled to catch subsequent
events. The libxl_evgen_domain_death is moved from death_list to
death_reported, and then it isn't found on subsequent iterations through
death_list. We enable the new event before disabling the old event, to
keep the xenstore watch active. If it is unregistered and
re-registered, it'll fire immediately for our suspended domain which
will end up continuously re-triggering.
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
Oleksii Kurochko [Tue, 19 Nov 2024 14:55:32 +0000 (15:55 +0100)]
drivers/char: rename arm-uart.c to uart-init.c
Rename the file containing uart_init() to enable reuse across other
architectures that utilize device trees or SPCR tables to locate UART
information.
After locating UART data, {acpi}_device_init() is called to initialize
the UART.
arm_uart_init() is renamed to uart_init() to be reused by other
architectures.
A new configuration option, CONFIG_GENERIC_UART_INIT, is introduced,
currently available only for Arm. Enabling CONFIG_GENERIC_UART_INIT on
additional architectures will require additional functionality, such as
device tree mapping and unflattening, etc.
arm-uart.c is removed from "ARM (W/ VIRTUALIZATION EXTENSIONS) ARCHITECTURE"
section in the MAINTAINERS file, as it is no longer Arm-specific and can
now be maintained by maintainers of other architectures.
Use GENERIC_UART_INIT for CONFIG_ARM by adding `select GENERIC_UART_INIT`
to CONFIG_ARM.
Luca Fancellu [Thu, 14 Nov 2024 10:28:02 +0000 (10:28 +0000)]
xen/device-tree: Allow region overlapping with /memreserve/ ranges
There are some cases where the device tree exposes a memory range
in both /memreserve/ and the reserved-memory node. In this case the
current code will stop Xen from booting, since it will find that the
latter range clashes with the already recorded /memreserve/
ranges.
Furthermore, u-boot lists boot module ranges, such as the ramdisk,
in the /memreserve/ part, and this will also prevent
Xen from booting, since it will see that the module memory range that
it is going to add in 'add_boot_module' clashes with a /memreserve/
range.
When Xen populates the data structure that tracks the memory ranges,
it also adds a memory type described in 'enum membank_type'. So,
in order to fix this behavior, allow overlapping with the /memreserve/
ranges in the 'check_reserved_regions_overlap' function when a flag
is set.
In order to implement this solution, a distinction needs to be made between
the kinds of 'struct membanks *' handled by meminfo_overlap_check(...),
because the static shared memory banks don't have
a usable bank[].type field and so it can't be accessed. Hence
'struct membanks_hdr' now has an 'enum region_type type' field in order
to be able to identify static shared memory banks in meminfo_overlap_check(...).
While there, set a type for the memory recorded using meminfo_add_bank()
from efi-boot.h.
Denis Mukhin [Tue, 26 Nov 2024 23:21:52 +0000 (15:21 -0800)]
xsm/flask: missing breaks, MISRA rule 16.4
While working on console forwarding for virtual NS8250 I stepped into
flask_domain_alloc_security()
where break statement was missing in default case which violates MISRA
rule 16.4.
Fixed everywhere in hooks.c.
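The shape of the fix is the usual one: the default clause is terminated by an explicit break even when it has nothing to do. The enum and function below are invented for illustration and are not taken from hooks.c.

```c
#include <assert.h>

enum sid_kind { KIND_DOM0, KIND_DOMU, KIND_OTHER };

static int pick_sid(enum sid_kind kind)
{
    int sid = 0;

    switch ( kind )
    {
    case KIND_DOM0:
        sid = 1;
        break;

    case KIND_DOMU:
        sid = 2;
        break;

    default:
        /* Nothing to do, but the clause still ends in a break. */
        break;
    }

    return sid;
}
```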
Signed-off-by: Denis Mukhin <dmukhin@ford.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Oleksii Kurochko [Wed, 27 Nov 2024 10:40:55 +0000 (11:40 +0100)]
xen/riscv: finalize boot allocator and transition to boot state
Add a call to end_boot_allocator() in start_xen() to finalize the
boot memory allocator, moving free pages to the domain sub-allocator.
After initializing the memory subsystem, update `system_state` from
`SYS_STATE_early_boot` to `SYS_STATE_boot`, signifying the end of the
early boot phase.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Oleksii Kurochko [Wed, 27 Nov 2024 10:40:20 +0000 (11:40 +0100)]
xen/riscv: introduce setup_mm()
Introduce the implementation of setup_mm(), which includes:
1. Adding all free regions to the boot allocator, as memory is needed
to allocate page tables used for frame table mapping.
2. Calculating RAM size and the RAM end address.
3. Setting up direct map mappings from each RAM bank and initializing
directmap_virt_start to keep VA <-> PA translation simple. If
RAM_start isn't properly aligned, additional alignment is applied
to directmap_virt_start so that it is properly aligned with the RAM
start, allowing more superpages to be used and reducing pressure on the TLB.
4. Setting up frame table mappings for the range [ram_start, ram_end)
and initializing frametable_virt_start properly so as to have simplified
versions of mfn_to_page() and page_to_mfn().
5. Setting up max_page.
Introduce DIRECTMAP_VIRT_END to have a convenient way to do some basic
checks of address ranges.
Based on the memory layout mentioned in config.h, DIRECTMAP_SIZE is
expected to be inclusive, i.e., [DIRECTMAP_VIRT_START, DIRECTMAP_VIRT_END].
Therefore, DIRECTMAP_SIZE is updated to reflect this.
Update virt_to_maddr() to use introduced directmap_virt_start and newly
introduced DIRECTMAP_VIRT_END.
Implement maddr_to_virt() function to convert a machine address
to a virtual address. This function is specifically designed to be used
only for the DIRECTMAP region, so a check has been added to ensure that
the address does not exceed DIRECTMAP_VIRT_END.
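A simplified model of the two translation helpers and the inclusive end bound is shown below. The layout constants and the explicit ram_start offset are assumptions made for the example and do not reproduce the values in config.h or the exact arithmetic of the real helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative layout values, not the ones from config.h. */
#define DIRECTMAP_VIRT_START 0xffffffc000000000ULL
#define DIRECTMAP_SIZE       0x0000000800000000ULL
/* Inclusive end: the last valid address of the region. */
#define DIRECTMAP_VIRT_END   (DIRECTMAP_VIRT_START + DIRECTMAP_SIZE - 1)

/* In this model RAM is assumed to start at 2 GiB. */
static const uint64_t ram_start = 0x80000000ULL;
static const uint64_t directmap_virt_start = DIRECTMAP_VIRT_START;

static uint64_t maddr_to_virt(uint64_t ma)
{
    uint64_t va = directmap_virt_start + (ma - ram_start);

    /* Only addresses that land inside the direct map are valid here. */
    assert(ma >= ram_start && va <= DIRECTMAP_VIRT_END);

    return va;
}

static uint64_t virt_to_maddr(uint64_t va)
{
    assert(va >= directmap_virt_start && va <= DIRECTMAP_VIRT_END);

    return ram_start + (va - directmap_virt_start);
}
```

With an inclusive end, range checks can use a plain `<=` comparison against DIRECTMAP_VIRT_END.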
After the introduction of maddr_to_virt() the following linkage error starts
to occur and to avoid it share_xen_page_with_guest() stub is added:
riscv64-linux-gnu-ld: prelink.o: in function `tasklet_kill':
/build/xen/common/tasklet.c:176: undefined reference to `share_xen_page_with_guest'
riscv64-linux-gnu-ld: ./.xen-syms.0: hidden symbol `share_xen_page_with_guest' isn't defined
riscv64-linux-gnu-ld: final link failed: bad value
Despite the linker fingering tasklet.c, it's trace.o which has the undefined
reference:
$ find . -name \*.o | while read F; do nm $F | grep share_xen_page_with_guest &&
echo $F; done
U share_xen_page_with_guest
./xen/common/built_in.o
U share_xen_page_with_guest
./xen/common/trace.o
U share_xen_page_with_guest
./xen/prelink.o
Looking at trace.i, there is a call to share_xen_page_with_guest(), but while
maddr_to_virt() was defined as a stub ("BUG_ON(); return NULL;") DCE happened and
the code was simply eliminated.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Nov 2024 10:25:45 +0000 (11:25 +0100)]
page-alloc: make scrub_one_page() static
Before starting to alter its properties, restrict the function's
visibility. The only external user is mem-paging, which we can
accommodate by different means.
Also move the function up in its source file, so we won't need to
forward-declare it. Constify its parameter at the same time.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
Roger Pau Monné [Tue, 26 Nov 2024 10:23:58 +0000 (11:23 +0100)]
x86/pv: don't populate the GDT/LDT L3 slot at domain creation
The current code in pv_domain_initialise() populates the L3 slot used for the
GDT/LDT, however that's not needed, since the create_perdomain_mapping() in
pv_create_gdt_ldt_l1tab() will already take care of allocating an L2 and
populating the L3 entry if not present.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/pci: remove logic catering to adding VF without PF
The hardware domain is expected to add a PF first before adding
associated VFs. If adding happens out of order, print a warning and
return an error. Drop the recursive call to pci_add_device().
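The ordering rule can be sketched with a toy registry. Everything here is hypothetical (the registry type, the integer device ids and the choice of -ENODEV as the error), and the warning printed by the real code is only noted in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS 8

struct dev_registry {
    int id[MAX_DEVS];
    size_t count;
};

static bool registry_has(const struct dev_registry *r, int id)
{
    for ( size_t i = 0; i < r->count; i++ )
        if ( r->id[i] == id )
            return true;

    return false;
}

static int registry_add(struct dev_registry *r, int id, bool is_vf, int pf_id)
{
    assert(r != NULL);

    /* A VF whose PF is unknown is rejected instead of adding the PF for it. */
    if ( is_vf && !registry_has(r, pf_id) )
        return -ENODEV;   /* the real code also prints a warning */

    if ( r->count >= MAX_DEVS )
        return -ENOMEM;

    r->id[r->count++] = id;
    return 0;
}
```

Rejecting the out-of-order case is what removes the need for a recursive add.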
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
In commit 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci
structure") a lock was moved from allocate_and_map_msi_pirq() to the
caller and changed from pcidevs_lock() to read_lock(&d->pci_lock).
However, one call path wasn't updated to reflect the change, leading to
a failed assertion observed under the following conditions:
* PV dom0
* Debug build (CONFIG_DEBUG=y) of Xen
* There is an SR-IOV device in the system with one or more VFs enabled
* Dom0 has loaded the driver for the VF and enabled MSI-X
(XEN) Assertion 'd || pcidevs_locked()' failed at drivers/passthrough/pci.c:535
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040284da8>] R pci_get_pdev+0x4c/0xab
(XEN) [<ffff82d040344f5c>] F arch/x86/msi.c#read_pci_mem_bar+0x58/0x272
(XEN) [<ffff82d04034530e>] F arch/x86/msi.c#msix_capability_init+0x198/0x755
(XEN) [<ffff82d040345dad>] F arch/x86/msi.c#__pci_enable_msix+0x82/0xe8
(XEN) [<ffff82d0403463e5>] F pci_enable_msi+0x3f/0x78
(XEN) [<ffff82d04034be2b>] F map_domain_pirq+0x2a4/0x6dc
(XEN) [<ffff82d04034d4d5>] F allocate_and_map_msi_pirq+0x103/0x262
(XEN) [<ffff82d04035da5d>] F physdev_map_pirq+0x210/0x259
(XEN) [<ffff82d04035e798>] F do_physdev_op+0x9c3/0x1454
(XEN) [<ffff82d040329475>] F pv_hypercall+0x5ac/0x6af
(XEN) [<ffff82d0402012d3>] F lstar_enter+0x143/0x150
In read_pci_mem_bar(), the VF obtains the struct pci_dev pointer for its
associated PF to access the vf_rlen array. This array is initialized in
pci_add_device() and is only populated in the associated PF's struct
pci_dev.
Access the vf_rlen array via the link to the PF, and remove the
troublesome call to pci_get_pdev().
Fixes: 4f78438b45e2 ("vpci: use per-domain PCI lock to protect vpci structure") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Add links between a VF's struct pci_dev and its associated PF struct
pci_dev.
The hardware domain is expected to remove the associated VFs before
removing the PF. If removal happens out of order, print a warning and
return an error. This means that VFs can only exist with an associated
PF.
Additionally, if the hardware domain attempts to remove a PF with VFs
still present, mark the PF and VFs broken, because Linux Dom0 has been
observed to not respect the error returned.
Move the calls to pci_get_pdev() and pci_add_device() down to avoid
dropping and re-acquiring the pcidevs_lock().
Check !pdev->pf_pdev before adding the VF to the list to guard against
adding it multiple times.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 25 Nov 2024 11:30:41 +0000 (11:30 +0000)]
build: Remove -fno-stack-protector-all from EMBEDDED_EXTRA_CFLAGS
This seems to have been introduced in commit f8beb54e2455 ("Disable PIE/SSP
features when building Xen, if GCC supports them.") in 2004.
However, neither GCC nor Clang appear to have ever supported taking the
negated form of -fstack-protector-all, meaning this has been useless since its
introduction.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 22 Nov 2024 16:29:01 +0000 (16:29 +0000)]
docs/sphinx: Refresh config for newer Sphinx
Sphinx 5.0 and newer objects to language = None. Switch to 'en'.
Also update the copyright year. Use %Y to avoid this problem in the future,
and provide compatibility for versions of Sphinx prior to 8.1 which don't
support the syntax.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Fri, 22 Nov 2024 16:34:20 +0000 (16:34 +0000)]
docs/sphinx: Fix FUSA indexing
Sphinx complains:
docs/fusa/index.rst:6: WARNING: toctree contains reference to nonexisting document 'fusa/reqs'
docs/fusa/reqs/index.rst:6: WARNING: toctree contains reference to nonexisting document 'fusa/reqs/market-reqs'
docs/fusa/reqs/index.rst:6: WARNING: toctree contains reference to nonexisting document 'fusa/reqs/product-reqs'
docs/fusa/reqs/index.rst:6: WARNING: toctree contains reference to nonexisting document 'fusa/reqs/design-reqs/arm64'
docs/fusa/index.rst: WARNING: document isn't included in any toctree
docs/fusa/reqs/design-reqs/arm64/generic-timer.rst: WARNING: document isn't included in any toctree
docs/fusa/reqs/design-reqs/arm64/sbsa-uart.rst: WARNING: document isn't included in any toctree
docs/fusa/reqs/index.rst: WARNING: document isn't included in any toctree
docs/fusa/reqs/market-reqs/reqs.rst: WARNING: document isn't included in any toctree
docs/fusa/reqs/product-reqs/arm64/reqs.rst: WARNING: document isn't included in any toctree
Fix the toctrees.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ayan Kumar Halder <ayan.kumar.halder@amd.com>
Oleksii Kurochko [Mon, 25 Nov 2024 10:34:40 +0000 (11:34 +0100)]
xen/common: Move gic_dt_preinit() to common code
Introduce intc_dt_preinit() in the common codebase, as it is not
architecture-specific and can be reused by both PPC and RISC-V.
This function identifies the node with the interrupt-controller property
in the device tree and calls device_init() to handle architecture-specific
initialization of the interrupt controller.
Make minor adjustments compared to the original ARM implementation of
gic_dt_preinit():
- Remove the local rc variable in gic_dt_preinit() since it is only used once.
- Change the prefix from gic to intc to clarify that the function is not
specific to ARM’s GIC, making it suitable for other architectures as well.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Roger Pau Monné [Mon, 25 Nov 2024 10:33:38 +0000 (11:33 +0100)]
x86/pvh: also print hardware domain pIRQ limit for PVH
Do not return early in the PVH/HVM case, so that the number of pIRQs is also
printed. While PVH dom0 doesn't have access to the hypercalls to manage pIRQs
itself, nor the knowledge to do so, pIRQs are still used by Xen to map and
bind interrupts to a PVH dom0 behind its back. Hence the pIRQ limit is still
relevant for a PVH dom0.
Fixes: 17f6d398f765 ('cmdline: document and enforce "extra_guest_irqs" upper bounds') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Mon, 25 Nov 2024 10:33:06 +0000 (11:33 +0100)]
x86/irq: fix calculation of max PV dom0 pIRQs
The current calculation of PV dom0 pIRQs uses:
n = min(fls(num_present_cpus()), dom0_max_vcpus());
The usage of fls() is wrong, as num_present_cpus() already returns the number
of present CPUs, not the bitmap mask of CPUs.
Fix by removing the usage of fls().
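A minimal sketch of why the fls() call is wrong, assuming fls() returns the 1-based index of the most significant set bit. The helpers fls_sketch(), n_old() and n_new() are hypothetical stand-ins for illustration, not Xen's code:

```c
/* Hypothetical stand-in for fls(): 1-based index of the highest set bit. */
static unsigned int fls_sketch(unsigned int x)
{
    unsigned int bit = 0;

    while ( x )
    {
        bit++;
        x >>= 1;
    }

    return bit;
}

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Old calculation: treats the CPU count as if it were a bitmap. */
static unsigned int n_old(unsigned int present_cpus, unsigned int max_vcpus)
{
    return min_uint(fls_sketch(present_cpus), max_vcpus);
}

/* Fixed calculation: the count is used directly. */
static unsigned int n_new(unsigned int present_cpus, unsigned int max_vcpus)
{
    return min_uint(present_cpus, max_vcpus);
}
```

With 64 present CPUs and 64 dom0 vCPUs, the old form yields 7 where the fixed form yields 64.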
Fixes: 7e73a6e7f12a ('have architectures specify the number of PIRQs a hardware domain gets') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Mon, 25 Nov 2024 10:32:41 +0000 (11:32 +0100)]
xen/arm32: mm: Rename 'first' to 'root' in init_secondary_pagetables()
The arm32 version of init_secondary_pagetables() will soon be re-used
for arm64 as well where the root table starts at level 0 rather than level 1.
So rename 'first' to 'root'.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Elias El Yandouzi <eliasely@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Andrew Cooper [Tue, 19 Nov 2024 10:40:41 +0000 (10:40 +0000)]
x86/mce: Compile do_mca() for CONFIG_PV only
Eclair reports a Misra Rule 8.4 violation; that do_mca() can't see its
declaration. It turns out that this is a consequence of do_mca() being
PV-only, and the declaration being compiled out in !PV builds.
Therefore, arrange for do_mca() to be compiled out in !PV builds. This in
turn requires a number of static functions to become __maybe_unused.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Michal Orzel [Tue, 19 Nov 2024 11:51:41 +0000 (12:51 +0100)]
bootfdt: Unify early printing of memory ranges endpoints
At the moment, when printing memory ranges during early boot, endpoints
of some ranges are printed as inclusive (RAM, RESVD, SHMEM) and some
as exclusive (Initrd, MODULE). Make the behavior consistent and print
all the endpoints as inclusive.
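As a small illustration of the inclusive convention, the sketch below derives the last byte of a range from its start and size. The helper names and the output format are assumptions made for the example, not the ones used by bootfdt:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Last byte covered by the range; assumes size > 0 and no overflow. */
static uint64_t range_end_inclusive(uint64_t start, uint64_t size)
{
    return start + size - 1;
}

/* Format "<name>: <start> - <inclusive end>" into buf (illustrative format). */
static int format_range(char *buf, size_t len, const char *name,
                        uint64_t start, uint64_t size)
{
    return snprintf(buf, len, "%s: %016" PRIx64 " - %016" PRIx64,
                    name, start, range_end_inclusive(start, size));
}
```

An exclusive endpoint would instead be start + size, which is the first byte after the range.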
misra: increase identifiers length to 63 and align doc with ECLAIR config
Currently the identifiers characters limit is arbitrarily set to 40. It
causes a few violations as we have some identifiers longer than 40.
Increase the limit to another rather arbitrary limit of 63. Thanks to
this change, we remove a few violations, getting us one step closer to
marking Rules 5.2 and 5.4 as clean.
The ECLAIR configuration is already using 63, so this change matches
the rules.rst documentation with the ECLAIR behavior.
Daniel P. Smith [Fri, 15 Nov 2024 13:12:01 +0000 (08:12 -0500)]
x86/boot: add start and size fields to struct boot_module
Introduce the start and size fields to struct boot_module and assign
their values during boot_info construction. All uses of module_t to get
the address and size of a module are replaced with start and size.
The EFI entry point is a special case, as the EFI file loading boot
service may load a file beyond the 4G barrier. As a result, to make the
address fit in the 32bit integer used by the MB1 module_t structure, the
frame number is stored in mod_start and size in mod_end. Until the EFI
entry point is enlightened to work with boot_info and boot_module,
multiboot_fill_boot_info will handle the alternate values in mod_start
and mod_end when EFI is detected.
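The frame number trick described above can be sketched as follows, assuming 4K pages and a page-aligned load address. The helper names are hypothetical and only illustrate the encoding, not the code in multiboot_fill_boot_info:

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K pages */

/* Store a page-aligned physical address as a frame number in a 32bit field. */
static uint32_t addr_to_mod_start(uint64_t paddr)
{
    return (uint32_t)(paddr >> SKETCH_PAGE_SHIFT);
}

/* Recover the full 64bit address from the stored frame number. */
static uint64_t mod_start_to_addr(uint32_t mod_start)
{
    return (uint64_t)mod_start << SKETCH_PAGE_SHIFT;
}
```

An address such as 0x123456000, which lies above 4G, survives the round trip, whereas storing it directly in a 32bit field would drop its upper bits.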
The switch to start/size removes all uses of the mod field
in struct boot_modules, along with the uses of the bootstrap_map() and
release_module() functions. With all usage gone, they are all dropped
here.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 19 Nov 2024 08:12:43 +0000 (09:12 +0100)]
x86/pmstat: deal with Misra 8.4 violations
While the override #define-s in x86_64/platform_hypercall.c are good for
the consuming side of the compat variants of set_{cx,px}_pminfo(), the
producers lack the respective declarations. Include pmstat.h early,
before the overrides are put in place, while adding explicit
declarations of the compat functions (alongside structure forward
declarations).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 18 Nov 2024 16:57:29 +0000 (16:57 +0000)]
x86/boot: Introduce boot-helpers.h
Eclair complains that neither reloc_trampoline{32,64}() can see their
declarations.
reloc_trampoline32() needs to become asmlinkage, while reloc_trampoline64()
needs declaring properly in a way that both efi-boot.h and reloc-trampoline.c
can see.
Introduce boot-helpers.h for the purpose.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 19 Nov 2024 10:34:41 +0000 (11:34 +0100)]
x86/mm: fix IS_LnE_ALIGNED() to comply with Misra Rule 20.7
While not strictly needed to guarantee operator precedence is as expected, add
the parentheses to comply with Misra Rule 20.7.
No functional change intended.
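For context, a sketch of the general hazard Rule 20.7 guards against. The macros below are hypothetical and are not IS_LnE_ALIGNED(); as noted above, the parentheses were not strictly needed in the real macro:

```c
/* Non-compliant: parameters expand without parentheses. */
#define IS_ALIGNED_BAD(v, mask)  ((v & mask) == 0)

/* Compliant with Rule 20.7: each parameter is parenthesized. */
#define IS_ALIGNED_GOOD(v, mask) (((v) & (mask)) == 0)

/* Expands to ((a | b & 0xfffUL) == 0), i.e. a | (b & 0xfff). */
static int bad_check(unsigned long a, unsigned long b)
{
    return IS_ALIGNED_BAD(a | b, 0xfffUL);
}

/* Expands to (((a | b) & (0xfffUL)) == 0), as the caller intended. */
static int good_check(unsigned long a, unsigned long b)
{
    return IS_ALIGNED_GOOD(a | b, 0xfffUL);
}
```

For a = 0x1000 and b = 0 the two forms disagree, because & binds tighter than | in the unparenthesized expansion.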
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Fixes: 5b52e1b0436f ('x86/mm: skip super-page alignment checks for non-present entries') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Daniel P. Smith [Fri, 15 Nov 2024 13:12:00 +0000 (08:12 -0500)]
x86/boot: introduce module release
A precarious approach was used to release the pages used to hold a boot module.
The precariousness stemmed from the fact that in the case of PV dom0, the
initrd module pages may be either mapped or copied into the dom0 address space.
In the former case, the PV dom0 construction code will set the size of the
module to zero, relying on discard_initial_images() to skip any modules with a
size of zero. In the latter case, the pages are freed by the PV dom0
construction code. This freeing of pages is done so that in either case, the
initrd variable can be reused for tracking the initrd location in dom0 memory
through the remaining dom0 construction code.
To encapsulate the logical action of releasing a boot module, the function
release_boot_module() is introduced along with the `released` flag added to
boot module. The boot module flag `released` allows the tracking of when a boot
module has been released by release_boot_module().
As part of adopting release_boot_module() the function discard_initial_images()
is renamed to free_boot_modules(), a name that better reflects the function's
actions.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Carlo Nonato [Fri, 25 Oct 2024 09:50:11 +0000 (11:50 +0200)]
xen/arm: use domain memory to allocate p2m page tables
Cache colored domains can benefit from having p2m page tables allocated
with the same coloring schema so that isolation can be achieved also for
those kinds of memory accesses.
In order to do that, the domain struct is passed to the allocator and the
MEMF_no_owner flag is used.
This will be useful also when NUMA will be supported on Arm.
Signed-off-by: Carlo Nonato <carlo.nonato@minervasys.tech> Acked-by: Julien Grall <julien@xen.org>
Daniel P. Smith [Fri, 15 Nov 2024 13:11:59 +0000 (08:11 -0500)]
x86/boot: convert domain construction to use boot info
With all the components used to construct dom0 encapsulated in struct boot_info
and struct boot_module, it is no longer necessary to pass them all as
parameters down the domain construction call chain. Change the parameter list
to pass the struct boot_info instance and the struct domain reference.
In dom0_construct() change i to be unsigned, and split some multiple
assignments to placate MISRA.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
Andrew Cooper [Fri, 15 Nov 2024 13:12:27 +0000 (13:12 +0000)]
x86/emul: Adjust get_stub() to avoid shadowing an outer variable
Eclair reports a violation of MISRA Rule 5.3.
get_stub() has a local ptr variable which genuinely shadows x86_emul_rmw()'s
parameter of the same name. The logic is correct, so the easiest fix is to
rename one of the variables.
With this addressed, Rule 5.3 is clean, so mark it as such.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 7 Nov 2024 17:11:39 +0000 (17:11 +0000)]
x86/ucode: Drop MIS_UCODE and microcode_match_result
All uses of MIS_UCODE have been removed, leaving only a simple ordering
relation, and microcode_match_result being a stale name.
Drop the enum entirely, and use a simple int -1/0/1 scheme like other standard
ordering primitives in C.
Swap the order of parameters to compare_patch(), to reduce cognitive
complexity; all other logic operates the other way around. Rename the hook to
simply compare().
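A minimal sketch of a -1/0/1 ordering primitive of the kind described. The struct and function are hypothetical; in particular the parameter order and the meaning assigned to each sign are assumptions, not the actual compare() hook:

```c
#include <stdint.h>

/* Hypothetical patch descriptor; only the revision matters here. */
struct patch_sketch {
    uint32_t rev;
};

/* Returns -1, 0 or 1 if a is older than, the same as, or newer than b. */
static int compare_sketch(const struct patch_sketch *a,
                          const struct patch_sketch *b)
{
    /* Avoids subtraction, which could wrap for unsigned revisions. */
    return (a->rev > b->rev) - (a->rev < b->rev);
}
```

Callers can then test the result against zero, as with strcmp() or a qsort() comparator, instead of matching enum values.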
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 7 Nov 2024 22:33:53 +0000 (22:33 +0000)]
x86/ucode: Fix cache handling in microcode_update_helper()
microcode_update_cache() now has a single caller, but inlining it shows how
unnecessarily complicated the logic really is.
Outside of error paths, there is always one microcode patch to free. It's
either the result of parse_blob(), or it's the old cached value.
In order to fix this, have a local patch pointer (mostly to avoid the
unnecessary verbosity of patch_with_flags.patch), and always free it at the
end. The only error path needing care is the IS_ERR(patch) path, which is
easy enough to handle.
Also, widen the scope of result. We only need to call compare_patch() once,
and the answer is still good later when updating the cache. In order to
update the cache, simply SWAP() the patch and the cache pointers, allowing the
singular xfree() at the end to cover both cases.
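The swap-then-free pattern can be sketched as below. This is an illustration under assumptions: the names are hypothetical, free() stands in for xfree(), and "newer revision wins" stands in for the real comparison:

```c
#include <stdlib.h>

struct patch_sketch {
    unsigned int rev;
};

static struct patch_sketch *cache_sketch; /* hypothetical cached patch */

static struct patch_sketch *alloc_sketch(unsigned int rev)
{
    struct patch_sketch *p = malloc(sizeof(*p));

    if ( p )
        p->rev = rev;

    return p;
}

/* Takes ownership of 'patch'; returns 1 if the cache was updated. */
static int update_sketch(struct patch_sketch *patch)
{
    int newer;

    if ( !patch )
        return 0;

    newer = !cache_sketch || patch->rev > cache_sketch->rev;

    if ( newer )
    {
        /* Swap: 'patch' now refers to the old cached value (or NULL). */
        struct patch_sketch *tmp = cache_sketch;

        cache_sketch = patch;
        patch = tmp;
    }

    /* Single free covering both cases; free(NULL) is a no-op. */
    free(patch);

    return newer;
}
```

Whichever pointer ends up in the local variable is the one that is no longer wanted, so one free at the end is enough.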
This also removes all callers of microcode_free_patch(), which removes the need to
cast away const to allow it to compile. This also removes several violations
of MISRA Rule 11.8 which disallows casting away const.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 6 Nov 2024 15:56:48 +0000 (15:56 +0000)]
x86/ucode: Remove the collect_cpu_info() call from parse_blob()
With the tangle of logic starting to come under control, it is now plain to
see that parse_blob()'s side effect of re-gathering the signature/revision is
pointless.
The signature is invariant for the lifetime of Xen, and the revision is kept
suitably up to date in apply_microcode(). The BSP gathers this in
early_microcode_init(), and the APs and S3 in microcode_update_one().
Therefore, there is no need for parse_blob() to discard a good copy of the
data and re-gather it.
This finally gets us down to a single call per CPU on boot / S3 resume, and no
calls during late-load hypercalls.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Fri, 15 Nov 2024 13:14:12 +0000 (14:14 +0100)]
x86/mm: fix alignment check for non-present entries
While the alignment of the mfn is not relevant for non-present entries, the
alignment of the linear address is. Commit 5b52e1b0436f introduced a
regression by not checking the alignment of the linear address when the new
entry was a non-present one.
Fix by always checking the alignment of the linear address, non-present entries
must just skip the alignment check of the physical address.
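The resulting rule can be sketched as a predicate. The function name, the 4K page shift and the use of an order expressed in 4K pages (9 for 2M, 18 for 1G) are assumptions for the example, not the code in map_pages_to_xen():

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K base pages */

/* Can a super-page entry of the given order be written at 'virt'? */
static bool superpage_ok(uint64_t virt, uint64_t mfn, bool present,
                         unsigned int order)
{
    uint64_t mask = (UINT64_C(1) << order) - 1;

    /* The linear address must always be aligned. */
    if ( (virt >> SKETCH_PAGE_SHIFT) & mask )
        return false;

    /* The frame number only matters when the entry maps something. */
    return !present || !(mfn & mask);
}
```

A non-present entry carrying an all-ones frame number is accepted at an aligned linear address and rejected at a misaligned one.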
Fixes: 5b52e1b0436f ('x86/mm: skip super-page alignment checks for non-present entries') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 21 Jun 2024 20:58:00 +0000 (21:58 +0100)]
xen/multicall: Change nr_calls to uniformly be unsigned long
Right now, the non-compat declaration and definition of do_multicall()
have differing types for the nr_calls parameter.
This is a MISRA rule 8.3 violation, but it's also a time-bomb waiting for the
first 128bit architecture (RISC-V looks as if it might get there first).
Worse, the type chosen here has a side effect of truncating the guest
parameter, because Xen still doesn't have a clean hypercall ABI definition.
Switch uniformly to using unsigned long.
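The truncation can be modelled with fixed-width types, assuming a 64-bit guest register and a 32-bit parameter. The function names are hypothetical and only model the effect:

```c
#include <stdint.h>

/* Models a handler whose parameter is narrower than the guest register. */
static uint64_t seen_by_narrow_handler(uint64_t guest_reg)
{
    uint32_t nr_calls = (uint32_t)guest_reg; /* upper 32 bits are dropped */

    return nr_calls;
}

/* Models a handler taking the full register width. */
static uint64_t seen_by_wide_handler(uint64_t guest_reg)
{
    return guest_reg;
}
```

A guest passing 0x100000002 would have been treated as asking for 2 calls by the narrow form, while the wide form sees the full value.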
This addresses the MISRA violation, and while it is a guest-visible ABI
change, it's only in the corner case where the guest kernel passed a
bogus-but-correct-when-truncated value. I can't find any users of
multicall which pass a bad size to begin with, so this should have no
practical effect on guests.
In fact, this brings the behaviour of multicalls more in line with the header
description of how it behaves.
With this fix, Xen is now fully clean to Rule 8.3, so mark it so.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Fri, 8 Nov 2024 15:59:05 +0000 (15:59 +0000)]
x86/trampoline: Rationalise the constants to describe the size
The logic is far more sane to follow with a total size, and the position of
the end of the heap. Remove or fix the remaining descriptions of how the
trampoline is laid out.
Move the relevant constants into trampoline.h, which requires making the
header safe to include in assembly files.
No functional change. The compiled binary is identical.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Frediano Ziglio <frediano.ziglio@cloud.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 13 Nov 2024 18:46:47 +0000 (18:46 +0000)]
xen/multiboot: Make headers be standalone
Both require xen/stdint.h.
Change multiboot.h to include const.h by its more normal path, and swap u32
for uint32_t.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Andrew Cooper [Wed, 6 Nov 2024 14:17:37 +0000 (14:17 +0000)]
xen/earlycpio: Fix header to be standalone
Split out of yet-more microcode cleanup work.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Frediano Ziglio <frediano.ziglio@cloud.com>
Roger Pau Monné [Thu, 14 Nov 2024 15:13:10 +0000 (16:13 +0100)]
x86/mm: ensure L2 is always freed if empty
The current logic in modify_xen_mappings() allows for fully empty L2 tables to
not be freed and unhooked from the parent L3 if the last L2 slot is not
populated.
Ensure that even when an L2 slot is empty the logic to check whether the whole
L2 can be removed is not skipped.
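A toy model of the issue, not the modify_xen_mappings() code: it assumes the walk only checks the whole table once the last slot is reached, and that the old behaviour skipped that check whenever the slot being visited was already empty. All names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define L2_ENTRIES 512

static bool l2_is_empty(const uint64_t *l2)
{
    for ( size_t i = 0; i < L2_ENTRIES; i++ )
        if ( l2[i] )
            return false;

    return true;
}

/* Zap every slot; returns true if the walk decided the L2 can be freed. */
static bool zap_l2(uint64_t *l2, bool check_on_empty_slot)
{
    for ( size_t i = 0; i < L2_ENTRIES; i++ )
    {
        if ( !l2[i] )
        {
            if ( !check_on_empty_slot )
                continue;      /* old behaviour: skips the check below */
        }
        else
            l2[i] = 0;         /* zap a populated slot */

        /* Only look at the whole table once its last slot is reached. */
        if ( i == L2_ENTRIES - 1 && l2_is_empty(l2) )
            return true;
    }

    return false;
}
```

With only slot 3 populated, the old behaviour leaves an empty table that is never reported as freeable; with the check no longer skipped, it is.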
Fixes: 4376c05c3113 ('x86-64: use 1GB pages in 1:1 mapping if available') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Thu, 14 Nov 2024 15:12:51 +0000 (16:12 +0100)]
x86/setup: remove bootstrap_map_addr() usage of destroy_xen_mappings()
bootstrap_map_addr() needs to be careful to not remove existing page-table
structures when tearing down mappings, as such pagetable structures might be
needed to fulfill subsequent mapping requests. The comment ahead of the
function already notes that pagetable memory shouldn't be allocated.
Fix this by using map_pages_to_xen(), which does zap the page-table entries,
but does not free page-table structures even when empty.
Fixes: 4376c05c3113 ('x86-64: use 1GB pages in 1:1 mapping if available') Signed-off-by: Roger Pau Monné <roger.pau@ctrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Thu, 14 Nov 2024 15:12:35 +0000 (16:12 +0100)]
x86/mm: skip super-page alignment checks for non-present entries
INVALID_MFN is ~0, so by it having all bits as 1s it doesn't fulfill the
super-page address alignment checks for L3 and L2 entries. Skip the alignment
checks if the new entry is a non-present one.
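A sketch of the problem and of the approach taken in this change, using hypothetical names and frame-number based arguments. An all-ones value has every low bit set, so it can never pass a super-page alignment test:

```c
#include <stdbool.h>
#include <stdint.h>

#define INVALID_MFN_SKETCH (~UINT64_C(0)) /* stand-in for INVALID_MFN */

/* order: log2 of the number of 4K pages (9 for 2M, 18 for 1G). */
static bool frame_aligned(uint64_t frame, unsigned int order)
{
    return !(frame & ((UINT64_C(1) << order) - 1));
}

/* Approach of this change: non-present entries skip the alignment checks. */
static bool checks_pass_sketch(uint64_t virt_frame, uint64_t mfn,
                               bool present, unsigned int order)
{
    if ( !present )
        return true;

    return frame_aligned(virt_frame, order) && frame_aligned(mfn, order);
}
```

Passing 0 as the frame number satisfied the alignment test by accident, which is why the switch to INVALID_MFN exposed the issue.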
This fixes a regression introduced by 0b6b51a69f4d, where the switch from 0 to
INVALID_MFN caused all super-pages to be shattered when attempting to remove
mappings by passing INVALID_MFN instead of 0.
Fixes: 0b6b51a69f4d ('xen/mm: Switch map_pages_to_xen to use MFN typesafe') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 14 Nov 2024 12:03:18 +0000 (13:03 +0100)]
x86emul: avoid double memory read for RORX
Originally only twobyte_table[0x3a] determined what part of generic
operand fetching (near the top of x86_emulate()) comes into play. When
ext0f3a_table[] was added, ->desc was updated to properly describe the
ModR/M byte's function. With that generic source operand fetching came
into play for RORX, rendering the explicit fetching in the respective
case block redundant (and wrong at the very least when MMIO with side
effects is accessed).
While there also make a purely cosmetic / documentary adjustment to
ext0f3a_table[]: RORX really is a 2-operand insn, MOV-like in that it
only writes its destination register.
Fixes: 9f7f5f6bc95b ("x86emul: add tables for 0f38 and 0f3a extension space") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Make explicit the fallthrough intention by adding the pseudo keyword
where missing and replace fallthrough comments not following the
agreed syntax.
This satisfies the requirements to deviate violations of
MISRA C:2012 Rule 16.3 "An unconditional break statement shall
terminate every switch-clause".
No functional change.
Signed-off-by: Federico Serafini <federico.serafini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/emul: auxiliary definition of pseudo keyword fallthrough
The pseudo keyword fallthrough shall be used to make explicit the
fallthrough intention at the end of a case statement (doing this
using comments is deprecated).
A definition of such pseudo keyword is already present in the
Xen build. This auxiliary definition also makes it available to the
test and fuzzing harnesses without interfering with the one
that the Xen build has.
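For illustration, a common way to define and use such a pseudo keyword is sketched below. The definition is modelled on the usual attribute-based approach and is an assumption, not necessarily the exact text added by this change:

```c
/* Use the statement attribute where the compiler provides it. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif

/* Otherwise fall back to an empty statement. */
#ifndef fallthrough
# define fallthrough do {} while ( 0 ) /* fallthrough */
#endif

/* Hypothetical example: case 2 deliberately continues into case 1. */
static int count_sketch(int op)
{
    int n = 0;

    switch ( op )
    {
    case 2:
        n++;
        fallthrough;
    case 1:
        n++;
        break;

    default:
        break;
    }

    return n;
}
```

Unlike a comment, the attribute form is visible to the compiler, so a missing break elsewhere can still be diagnosed.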
Signed-off-by: Federico Serafini <federico.serafini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 14 Nov 2024 12:00:57 +0000 (13:00 +0100)]
x86emul: ignore VEX.W for BMI{1,2} insns in 32-bit mode
While result values and other status flags are unaffected as long as we
can ignore the case of registers having their upper 32 bits non-zero
outside of 64-bit mode, EFLAGS.SF may obtain a wrong value when we
mistakenly re-execute the original insn with VEX.W set.
Note that the guest memory access, if any, is correctly carried out as
32-bit regardless of VEX.W. The emulator-local memory operand will be
accessed as a 64-bit quantity, but it is pre-initialised to zero so no
internal state can leak.
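The EFLAGS.SF discrepancy can be modelled with ANDN, which computes ~src1 & src2 and sets SF from the most significant bit of the result. The helpers below are a hypothetical model of the two operand sizes, not emulator code:

```c
#include <stdbool.h>
#include <stdint.h>

/* SF for ANDN executed as a 32-bit operation: bit 31 of the result. */
static bool andn_sf_32(uint32_t src1, uint32_t src2)
{
    uint32_t res = ~src1 & src2;

    return res >> 31;
}

/* SF for the same inputs, zero-extended and executed as a 64-bit operation. */
static bool andn_sf_64(uint32_t src1, uint32_t src2)
{
    uint64_t res = ~(uint64_t)src1 & (uint64_t)src2;

    return res >> 63;
}

/* Result value of the 64-bit form, for comparison with the 32-bit one. */
static uint64_t andn_val_64(uint32_t src1, uint32_t src2)
{
    return ~(uint64_t)src1 & (uint64_t)src2;
}
```

For src1 = 0 and src2 = 0x80000000 both forms produce the value 0x80000000, yet only the 32-bit form reports SF as set.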
Fixes: 771daacd197a ("x86emul: support BMI1 insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>