Jan Beulich [Mon, 12 Feb 2024 08:37:18 +0000 (09:37 +0100)]
x86/PV32: restore PAE-extended-CR3 logic
While the PAE-extended-CR3 VM assist is a 32-bit-only concept, it still
applies to guests when run on a 64-bit hypervisor: the "extended
CR3" format has to be used there as well, to fit the address in the only
32-bit-wide register available. As a result it was a mistake that the check
was never enabled for that case, and it was then mistakenly deleted in the
course of removing 32-bit Xen code (218adf199e68 ["x86: We can assume
CONFIG_PAGING_LEVELS==4"]).
Similarly, during Dom0 construction, kernel awareness needs to be taken
into account, and the respective code was again mistakenly never enabled for
32-bit Dom0 running on 64-bit Xen (and thus wrongly deleted by 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
At the same time restrict enabling of the assist for Dom0 to just the
32-bit case. Furthermore there's no need for an atomic update there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
The failure is reported for the following line:
(paddr_t)(uintptr_t)(_start + boot_phys_offset)
This occurs because the compiler treats (ptr + size), with size bigger than
PTRDIFF_MAX, as undefined behavior. To address this, switch to the macro
virt_to_maddr(), given the future plans to eliminate boot_phys_offset.
eclair: move function and macro properties outside ECLAIR
Function and macro properties contained in ECLAIR/call_properties.ecl are of
general interest: this patch moves these annotations into a generic JSON file
in docs. This way, they can be exploited for other purposes (e.g. documentation,
other tools).
Add an rst file explaining how to update function_macro_properties.json.
Add a script to convert the JSON file into ECL configurations.
Remove ECLAIR/call_properties.ecl: the file is now automatically generated from
the JSON file.
Jason Andryuk [Wed, 7 Feb 2024 12:46:52 +0000 (13:46 +0100)]
block-common: Fix same_vm for no targets
same_vm is broken when the two main domains do not have targets. otvm
and targetvm are both missing, which means they get set to -1 and then
converted to empty strings:
++10697+ local targetvm=-1
++10697+ local otvm=-1
++10697+ otvm=
++10697+ othervm=/vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4
++10697+ targetvm=
++10697+ local frontend_uuid=/vm/844dea4e-44f8-4e3e-8145-325132a31ca5
The final comparison returns true since the two empty strings match.
Replace -1 with distinct strings indicating the lack of a value and
remove the coalescing to empty strings. The strings themselves will no
longer match, and that is correct.
Michal Orzel [Tue, 6 Feb 2024 15:20:12 +0000 (16:20 +0100)]
automation: Switch yocto-qemux86-64 job to run on x86
At the moment, all Yocto jobs run on Arm64 runners. To address CI
capacity issues, move yocto-qemux86-64 job to x86. Reflect the change in
the makefile generating Yocto docker files, and fix the CONTAINER name
definition, which incorrectly expects the YOCTO_HOST variable to be set for the
x86 container too, even though that one does not have a platform name appended.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Frediano Ziglio [Tue, 6 Feb 2024 10:56:38 +0000 (11:56 +0100)]
x86/paging: Use more specific constant
__HYPERVISOR_arch_1 and __HYPERVISOR_paging_domctl_cont have the same
value for x86, but this function handles the "paging_domctl_cont"
hypercall, so using the latter mnemonic in the code is clearer.
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 6 Feb 2024 10:56:13 +0000 (11:56 +0100)]
amd-vi: fix IVMD memory type checks
The current code that parses the IVMD blocks is relaxed with regard to the
restriction that such unity regions should always fall into memory ranges
marked as reserved in the memory map.
However the type checks for the IVMD addresses are inverted, and as a result
IVMD ranges falling into RAM areas are accepted. Note that having such ranges
in the first place is a firmware bug, as IVMD should always fall into reserved
ranges.
Fixes: ed6c77ebf0c1 ('AMD/IOMMU: check / convert IVMD ranges for being / to be reserved')
Reported-by: Ox <oxjo@proton.me>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: oxjo <oxjo@proton.me>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Hongyan Xia [Tue, 6 Feb 2024 10:54:50 +0000 (11:54 +0100)]
acpi: vmap pages in acpi_os_alloc_memory
Also, introduce a wrapper around vmap that maps a contiguous range for
boot allocations. Unfortunately, the new helper cannot be a static inline
because the dependencies are a mess. We would need to re-include
asm/page.h (removed in aa4b9d1ee653 "include: don't use asm/page.h
from common headers"), and even that no longer looks to be enough,
because bits from asm/cpufeature.h are used in the definition of PAGE_NX.
Lastly, with the move to vmap(), it is now easier to find the size
of the mapping. So pass the whole area to init_boot_pages() rather than
just the first page.
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 6 Feb 2024 10:54:17 +0000 (11:54 +0100)]
xen/vmap: Introduce vmap_size() and use it
vunmap() and vfree() currently duplicate the (small) logic to find the
size of a vmap area. In a follow-up patch, we will want to introduce
another copy (this time externally).
So introduce a new helper vmap_size() that will return the number of
pages in the area starting at the given address. Take the opportunity
to replace the open-coded version.
Note that vfree() was storing the type of the area in a local variable.
But this seems to have never been used (even when it was introduced).
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Tue, 6 Feb 2024 10:53:04 +0000 (11:53 +0100)]
setup: Move vm_init() before acpi calls
After the direct map removal, pages from the boot allocator are not
going to be mapped in the direct map. Although we have map_domain_page(),
such mappings are ephemeral and less helpful for anything spanning more
than a page, so we want a mechanism to globally map a range of pages,
which is what vmap is for. Therefore, bring vm_init into the early boot
stage.
To allow vmap to be initialised and used in early boot, we need to
modify vmap to receive pages from the boot allocator during early boot
stage.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Woodhouse <dwmw2@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Juergen Gross [Mon, 5 Feb 2024 10:49:57 +0000 (11:49 +0100)]
tools/xenstored: map stubdom interface
When running as stubdom, map the stubdom's Xenstore ring page in order
to support using the 9pfs frontend.
Use the same pattern as in dom0_init() when running as daemon in dom0
(introduce our own domain, then send an event to the client side to
signal Xenstore is ready to communicate).
Juergen Gross [Mon, 5 Feb 2024 10:49:56 +0000 (11:49 +0100)]
tools/xenstored: split domain_init()
Today domain_init() is called either just before calling dom0_init()
in case no live update is being performed, or it is called after
reading the global state from read_state_global(), as the event
channel fd is needed.
Split up domain_init() into a preparation part, which can be called
unconditionally, and a part setting up the event channel handle.
Note that there is no chance that chk_domain_generation() can be
called now before xc_handle has been set up, so there is no need for
the related special case anymore.
Juergen Gross [Mon, 5 Feb 2024 10:49:55 +0000 (11:49 +0100)]
tools/xenstored: rework ring page (un)map functions
When [un]mapping the ring page of a Xenstore client, different actions
are required for "normal" guests and dom0. Today this distinction is
made at call site.
Move this distinction into [un]map_interface() instead, avoiding code
duplication and preparing special handling for [un]mapping the stub
domain's ring page.
Juergen Gross [Mon, 5 Feb 2024 10:49:52 +0000 (11:49 +0100)]
tools/xenstored: move all log-pipe handling into posix.c
All of the log-pipe handling is needed only when running as daemon.
Move it into posix.c. This requires a service function in the
main event loop for handling the related requests, and one for setting
up the fds[] array, which is renamed to poll_fds to have a more specific
name. Use generic names for those functions, as socket handling can
be added to them later, too.
Juergen Gross [Mon, 5 Feb 2024 10:49:50 +0000 (11:49 +0100)]
tools/xenstored: add early_init() function
Some xenstored initialization needs to be done in the daemon case only,
so split it out into a new early_init() function, which is a stub in the
stubdom case.
Remove the call of talloc_enable_leak_report_full(), as it serves no
real purpose: the daemon only ever exits due to a crash, in which case
a log of talloc()ed memory hardly has any value.
Juergen Gross [Mon, 5 Feb 2024 10:49:47 +0000 (11:49 +0100)]
tools/xenstored: rename xenbus_evtchn()
Rename the xenbus_evtchn() function to get_xenbus_evtchn() in order to
avoid two externally visible symbols with the same name when Xenstore-
stubdom is being built with a Mini-OS with CONFIG_XENBUS set.
Cyril Rébert [Sun, 4 Feb 2024 10:19:40 +0000 (11:19 +0100)]
tools/xentop: fix sorting bug for some columns
Sorting doesn't work on columns VBD_OO, VBD_RD, VBD_WR and VBD_RSECT.
Fix by adjusting variable names in the compare functions.
Bug fix only. No functional change.
Fixes: 91c3e3dc91d6 ("tools/xentop: Display '-' when stats are not available.")
Signed-off-by: Cyril Rébert (zithro) <slack@rabbit.lu>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
We have two copies of __bitmap_weight() that differ by whether they make
hweight32() or hweight64() calls, yet we already have hweight_long() which is
the form that __bitmap_weight() wants.
Fix hweight_long() to return unsigned int like all the other hweight helpers,
and fix __bitmap_weight() to use unsigned integers.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 2 Feb 2024 17:57:37 +0000 (17:57 +0000)]
x86/ucode: Remove accidentally introduced tabs
Fixes: cf7fe8b72dea ("x86/ucode: Fix stability of the raw CPU Policy rescan")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 5 Feb 2024 09:48:11 +0000 (10:48 +0100)]
x86/CPU: convert vendor hook invocations to altcall
While not performance critical, these hook invocations still want
converting: this way all pre-filled struct cpu_dev instances can become
__initconst_cf_clobber, thus allowing a further 8 ENDBR to be eliminated
during the 2nd phase of alternatives patching (besides moving previously
resident data to .init.*).
Since all use sites need touching anyway, take the opportunity and also
address a Misra C:2012 Rule 5.5 violation: Rename the this_cpu static
variable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Feb 2024 09:45:31 +0000 (10:45 +0100)]
x86/guest: finish conversion to altcall
While .setup() and .e820_fixup() don't need fiddling with for being run
only very early, both .ap_setup() and .resume() want converting too:
this way both pre-filled struct hypervisor_ops instances can become
__initconst_cf_clobber, thus allowing up to 5 more ENDBR (configuration
dependent) to be eliminated during the 2nd phase of alternatives patching.
While fiddling with section annotations here, also move "ops" itself to
.data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 5 Feb 2024 09:44:46 +0000 (10:44 +0100)]
x86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()
While altcall is already used for them, the functions want announcing in
.init.rodata.cf_clobber, even if the resulting static variables aren't
otherwise used.
While doing this also move ctxt_switch_masking to .data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 26 Jan 2024 19:57:01 +0000 (19:57 +0000)]
x86: Remove gdbstub
In 13y of working on Xen, I've never seen it used. The implementation
was introduced (commit b69f92f3012e, Jul 28 2004) with known issues such as:
/* Resuming after we've stopped used to work, but more through luck
than any actual intention. It doesn't at the moment. */
which appear to have gone unfixed for the 20 years since.
Nowadays there are more robust ways of inspecting crashed state, such as a
kexec crash kernel, or running Xen in a VM.
This will allow us to clean up some hooks around the codebase which are
proving awkward for other tasks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests
The CPUID feature bit signals the presence of the BHI_DIS_S control in the
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests
The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests
The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Wed, 31 Jan 2024 10:42:48 +0000 (10:42 +0000)]
tools/ocaml: Bump minimum version to OCaml 4.05
Char.lowercase got removed in OCaml 5.0 (it has been deprecated since 2014),
and doesn't build any more.
Char.lowercase_ascii has existed since OCaml 4.03, so that is the new
minimum version for oxenstored.
However, OCaml 4.05 is the oldest new-enough version found in common distros,
so pick this as a baseline.
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
[Update CHANGELOG.md]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 31 Jan 2024 17:05:47 +0000 (17:05 +0000)]
xen/bitmap: Consistently use unsigned bits values
Right now, most of the static inline helpers take an unsigned nbits quantity,
and most of the library functions take a signed quantity. Because
BITMAP_LAST_WORD_MASK() is expressed as a divide, the compiler is forced to
emit two different paths to get the correct semantics for signed division.
Swap all signed bit-counts to being unsigned bit-counts for the simple cases.
This includes the return value of bitmap_weight().
Bloat-o-meter for a random x86 build reports:
add/remove: 0/0 grow/shrink: 8/19 up/down: 167/-413 (-246)
which all comes from the compiler not emitting "dead" logic paths for negative
bit counts.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 13:59:07 +0000 (13:59 +0000)]
x86/boot: Add braces in reloc.c
107 lines is an unreasonably large switch statement to live inside a
brace-less for loop. Drop the comment that's clumsily trying to cover the
fact that this logic has wrong-looking indentation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 20:44:34 +0000 (20:44 +0000)]
xen/sched: Fix UB shift in compat_set_timer_op()
Tamas reported this UBSAN failure from fuzzing:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/sched/compat.c:48:37
(XEN) left shift of negative value -2147425536
(XEN) ----[ Xen-4.19-unstable x86_64 debug=y ubsan=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
(XEN) [<ffff82d040308afb>] F __ubsan_handle_shift_out_of_bounds+0x11a/0x1c5
(XEN) [<ffff82d040307758>] F compat_set_timer_op+0x41/0x43
(XEN) [<ffff82d04040e4cc>] F hvm_do_multicall_call+0x77f/0xa75
(XEN) [<ffff82d040519462>] F arch_do_multicall_call+0xec/0xf1
(XEN) [<ffff82d040261567>] F do_multicall+0x1dc/0xde3
(XEN) [<ffff82d04040d2b3>] F hvm_hypercall+0xa00/0x149a
(XEN) [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
(XEN) [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200
Left-shifting any negative value is strictly undefined behaviour in C, and
the two parameters here come straight from the guest.
The fuzzer happened to choose lo 0xf, hi 0x8000e300.
Switch everything to be unsigned values, making the shift well defined.
As GCC documents:
As an extension to the C language, GCC does not use the latitude given in
C99 and C11 only to treat certain aspects of signed '<<' as undefined.
However, -fsanitize=shift (and -fsanitize=undefined) will diagnose such
cases.
this was deemed not to need an XSA.
Note: The unsigned -> signed conversion for do_set_timer_op()'s s_time_t
parameter is also well defined. C makes it implementation defined, and GCC
defines it as reduction modulo 2^N to be within range of the new type.
Fixes: 2942f45e09fb ("Enable compatibility mode operation for HYPERVISOR_sched_op and HYPERVISOR_set_timer_op.")
Reported-by: Tamas K Lengyel <tamas@tklengyel.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 18:13:14 +0000 (18:13 +0000)]
x86/hvm: Fix UBSAN failure in do_hvm_op() printk
Tamas reported this UBSAN failure from fuzzing:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/vsprintf.c:64:19
(XEN) negation of -9223372036854775808 cannot be represented in type 'long long int':
(XEN) ----[ Xen-4.19-unstable x86_64 debug=y ubsan=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
(XEN) [<ffff82d04030805d>] F __ubsan_handle_negate_overflow+0x99/0xce
(XEN) [<ffff82d04028868f>] F vsprintf.c#number+0x10a/0x93e
(XEN) [<ffff82d04028ac74>] F vsnprintf+0x19e2/0x1c56
(XEN) [<ffff82d04030a47a>] F console.c#vprintk_common+0x76/0x34d
(XEN) [<ffff82d04030a79e>] F printk+0x4d/0x4f
(XEN) [<ffff82d04040c42b>] F do_hvm_op+0x288e/0x28f5
(XEN) [<ffff82d04040d385>] F hvm_hypercall+0xad2/0x149a
(XEN) [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
(XEN) [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200
The problem is an unsigned -> signed conversion because of a bad
format specifier (%ld trying to format an unsigned long).
We could fix it by swapping to %lu, but this is a useless printk() even in
debug builds, so just drop it completely.
Reported-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Thu, 1 Feb 2024 17:35:22 +0000 (17:35 +0000)]
xen/arm: Properly clean update to init_ttbr and smp_up_cpu
Recent rework to the secondary boot code modified how init_ttbr and
smp_up_cpu are accessed. Rather than directly accessing them, we
are using a pointer to them.
The helper clean_dcache() is expected to take the variable as parameter
and then clean its content. As we now pass a pointer to the variable,
we will clean the area storing the address rather than the content itself.
Switch to use clean_dcache_va_range() to avoid casting the pointer.
Fixes: a5ed59e62c6f ("arm/mmu: Move init_ttbr to a new section .data.idmap")
Fixes: 9a5114074b04 ("arm/smpboot: Move smp_up_cpu to a new section .data.idmap")
Reported-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jan Beulich [Thu, 1 Feb 2024 15:21:04 +0000 (16:21 +0100)]
IOMMU: iommu_use_hap_pt() implies CONFIG_HVM
Allow the compiler a little more room for DCE by moving the compile-time-
constant condition into the predicate (from the one place where it was
added in an open-coded fashion for XSA-450).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jason Andryuk [Thu, 1 Feb 2024 15:19:36 +0000 (16:19 +0100)]
xenpm: Print message for disabled commands
xenpm get-cpufreq-states currently just prints no output when cpufreq is
disabled or HWP is running. Have it print an appropriate message. The
cpufreq disabled one mirrors the cpuidle disabled one.
cpufreq disabled:
$ xenpm get-cpufreq-states
Either Xen cpufreq is disabled or no valid information is registered!
Under HWP:
$ xenpm get-cpufreq-states
P-State information not supported. Try 'get-cpufreq-average' or 'start'.
Also allow xenpm to handle EOPNOTSUPP from the pmstat hypercalls.
EOPNOTSUPP is returned when HWP is active in some cases and allows the
differentiation from cpufreq being disabled.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 1 Feb 2024 15:18:28 +0000 (16:18 +0100)]
x86/PoD: simplify / improve p2m_pod_cache_add()
Avoid recurring MFN -> page or page -> MFN translations. Drop the pretty
pointless local variable "p". Make sure the MFN logged in a debugging
error message is actually the offending one. Return negative errno
values rather than -1 (presently no caller really cares, but imo this
should change). Adjust style.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@cloud.com>
Andrew Cooper [Tue, 30 Jan 2024 13:29:15 +0000 (14:29 +0100)]
VT-d: Fix "else" vs "#endif" misplacement
In domain_pgd_maddr() the "#endif" is misplaced with respect to "else". This
generates incorrect logic when CONFIG_HVM is compiled out, as the "else" body
is executed unconditionally.
Rework the logic to use IS_ENABLED() instead of explicit #ifdef-ary, as it's
clearer to follow. This in turn involves adjusting p2m_get_pagetable() to
compile when CONFIG_HVM is disabled.
This is XSA-450 / CVE-2023-46840.
Fixes: 033ff90aa9c1 ("x86/P2M: p2m_{alloc,free}_ptp() and p2m_alloc_table() are HVM-only")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 30 Jan 2024 13:28:01 +0000 (14:28 +0100)]
pci: fail device assignment if phantom functions cannot be assigned
The current behavior is that no error is reported if (some) phantom functions
fail to be assigned during device add or assignment, so the operation succeeds
even if some phantom functions are not correctly set up.
This can lead to devices being successfully assigned to a domU while
some of the device's phantom functions are still assigned to dom0. Even when
the device is assigned to domIO before being assigned to a domU, phantom
functions might fail to be assigned to domIO, and also fail to be assigned to
the domU, leaving them assigned to dom0.
Since the device can generate requests using the IDs of those phantom
functions, given the scenario above a device in such state would be in control
of a domU, but still capable of generating transactions that use a context ID
targeting dom0 owned memory.
Modify device assignment to attempt to deassign the device if phantom
functions fail to be assigned.
Note that device addition is not modified in the same way, as in that case the
device is assigned to a trusted domain, and hence partial assign can lead to
device malfunction but not a security issue.
This is XSA-449 / CVE-2023-46839.
Fixes: 4e9950dc1bd2 ('IOMMU: add phantom function support')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Wed, 24 Jan 2024 17:29:52 +0000 (18:29 +0100)]
x86/iommu: switch hwdom IOMMU to use a rangeset
The current loop that iterates from 0 to the maximum RAM address in order to
setup the IOMMU mappings is highly inefficient, and it will get worse as the
amount of RAM increases. It's also not accounting for any reserved regions
past the last RAM address.
Instead of iterating over memory addresses, iterate over the memory map regions
and use a rangeset in order to keep track of which ranges need to be identity
mapped in the hardware domain physical address space.
On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
arch_iommu_hwdom_init() in nanoseconds is:
x old
+ new
N Min Max Median Avg Stddev
x 5 2.2364154e+10 2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
+ 5 1025012 1033036 1026188 1028276.2 3623.1194
Difference at 95.0% confidence
-2.26214e+10 +/- 4.42931e+08
-99.9955% +/- 9.05152e-05%
(Student's t, pooled s = 3.03701e+08)
Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.
Note there's a change for HVM domains (i.e. PVH dom0), which are switched to
creating the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible. Note that the interface of map_mmio_regions() doesn't
allow creating read-only mappings, but so far there are no such mappings
created for PVH dom0 in arch_iommu_hwdom_init().
No change intended in the resulting mappings that a hardware domain gets.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 25 Jan 2024 13:26:26 +0000 (14:26 +0100)]
x86/iommu: remove regions not to be mapped
Introduce the code to remove regions not to be mapped from the rangeset
that will be used to setup the IOMMU page tables for the hardware domain.
This change also introduces two new functions: remove_xen_ranges() and
vpci_subtract_mmcfg() that copy the logic in xen_in_range() and
vpci_is_mmcfg_address() respectively and remove the ranges that would otherwise
be intercepted by the original functions.
Note that the rangeset is still not populated.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 29 Jan 2024 08:23:43 +0000 (09:23 +0100)]
x86: purge NMI_IO_APIC
Even going back to the 3.2 source code, I can't spot how this watchdog mode
could ever have been enabled in Xen. The only effect its presence had
for all the years was the retention of a dead string literal.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 29 Jan 2024 08:22:35 +0000 (09:22 +0100)]
x86/APIC: purge {GET,SET}_APIC_DELIVERY_MODE()
The few uses we have can easily be replaced, eliminating the need for
redundant APIC_DM_* and APIC_MODE_* constants. Therefore also purge all
respective APIC_MODE_* constants, introducing APIC_DM_MASK anew instead.
This is further relevant since we have a different set of APIC_MODE_*,
which could otherwise end up confusing.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 29 Jan 2024 08:21:16 +0000 (09:21 +0100)]
NUMA: no need for asm/numa.h when !NUMA
There's no point in every architecture carrying the same stubs for the
case when NUMA isn't enabled (or even supported). Move all of that to
xen/numa.h; replace explicit uses of asm/numa.h in common code. Make
inclusion of asm/numa.h dependent upon NUMA=y.
Drop the no longer applicable "implement NUMA support" comments - in a
!NUMA section this simply makes no sense.
Roger Pau Monné [Fri, 26 Jan 2024 14:54:18 +0000 (15:54 +0100)]
x86/entry: fix jump into restore_all_guest without %rbx correctly set
e047b8d0fa05 went too far when limiting where the vCPU pointer is obtained.
While the code in ist_dispatch_done does indeed only need the vCPU pointer
when PV32 is enabled, the !PV32 path will end up jumping into
restore_all_guest, which does require %rbx == vCPU pointer.
Fix by moving the fetching of the vCPU pointer to be done outside of the PV32
code block.
Fixes: e047b8d0fa05 ('x86/entry: replace two GET_CURRENT() uses')
Reported-by: Edwin Torok <edwin.torok@cloud.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Thu, 25 Jan 2024 18:36:27 +0000 (18:36 +0000)]
xen/arm64: head: Use PRINT_ID() for secondary CPU MMU-off boot code
With the upcoming work to color Xen, the binary will no longer be
physically contiguous. This will be a problem during boot, as the
assembly code will need to work out where each piece of Xen resides.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, most of the early printk messages are using PRINT(), which
will add the message to .rodata. This is unlikely to be within the
same page as the rest of the idmap.
So replace all the PRINT() that can be reached by a secondary CPU with
the MMU off with PRINT_ID().
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Julien Grall [Thu, 25 Jan 2024 18:33:50 +0000 (18:33 +0000)]
arm/smpboot: Move smp_up_cpu to a new section .data.idmap
With the upcoming work to color Xen, the binary will no longer be
physically contiguous. This will be a problem during boot, as the
assembly code will need to work out where each piece of Xen resides.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, smp_up_cpu is used by secondary CPUs to wait their turn for
booting before the MMU is on. Yet it is currently in .data, which is
unlikely to be within the same page as the rest of the idmap.
Move smp_up_cpu to the recently created section .data.idmap. The idmap is
currently part of the text section and therefore will be mapped read-only
executable. This means that we need to temporarily remap
smp_up_cpu in order to update it.
Introduce a new function set_smp_up_cpu() for this purpose so the code
is not duplicated between when opening and closing the gate.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Julien Grall [Thu, 25 Jan 2024 18:32:38 +0000 (18:32 +0000)]
arm/mmu: Move init_ttbr to a new section .data.idmap
With the upcoming work to color Xen, the binary will no longer be
physically contiguous. This will be a problem during boot, as the
assembly code will need to work out where each piece of Xen resides.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, init_ttbr is used by secondary CPUs to find there page-tables
before the MMU is on. Yet it is currently in .data which is unlikely
to be within the same page as the rest of the idmap.
Create a new section .data.idmap that will be used for variables
accessed by the early boot code. The first one is init_ttbr.
The idmap is currently part of the text section and therefore will
be mapped read-only executable. This means that we need to temporarily
remap init_ttbr in order to update it.
Introduce a new function set_init_ttbr() for this purpose so the code
is not duplicated between arm64 and arm32.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Andrew Cooper [Fri, 10 Feb 2023 21:20:42 +0000 (21:20 +0000)]
x86/entry: Avoid register spilling in cr4_pv32_restore()
cr4_pv32_restore() needs two registers. Right now, it spills %rdx and
clobbers %rax.
However, %rcx is free to use at all callsites. Annotate CR4_PV32_RESTORE with
our usual clobber comments, and swap %rdx for %rcx in the non-fatal paths.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 25 Jan 2024 09:30:41 +0000 (10:30 +0100)]
tools: don't expose XENFEAT_hvm_pirqs by default
The HVM pirq feature allows routing interrupts from both physical and emulated
devices over event channels; this was done as a performance improvement.
However its usage is fully undocumented, and the only reference implementation
is in Linux. It defeats the purpose of local APIC hardware virtualization,
because when it is used interrupts bypass the local APIC altogether.
It has also been reported to not work properly with certain devices: at least
with some AMD GPUs, Linux attempts to route interrupts over event
channels, but Xen doesn't correctly detect such routing, which leads to the
hypervisor complaining with:
(XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
When an MSI is routed over an event channel the entry's delivery
mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
inject the interrupt following the native MSI path, where the ExtINT delivery
mode is not supported.
Disable HVM PIRQs by default and provide a per-domain option in xl.cfg to
enable the feature. Also, for backwards compatibility, keep the feature
enabled for any resumed domains that don't have an explicit selection.
Note that the only user of the feature (Linux) is also able to handle native
interrupts fine, as the feature was already not used if Xen reported local APIC
hardware virtualization active.
Link: https://github.com/QubesOS/qubes-issues/issues/7971 Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Thu, 25 Jan 2024 09:30:40 +0000 (10:30 +0100)]
x86/hvm: make X86_EMU_USE_PIRQ optional
Allow selecting X86_EMU_USE_PIRQ for HVM guests, so it's no longer mandated
to always be on.
There's no restriction in Xen that forces the feature to be always on for HVM
guests; PVH guests, for example, don't support it. As such, allow the
toolstack to select whether to enable it on a per-domain basis.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 23 Jan 2024 20:24:22 +0000 (20:24 +0000)]
x86/ucode: Fix stability of the raw CPU Policy rescan
Always run microcode_update_helper() on the BSP, so that the updated Raw CPU
policy doesn't get non-BSP topology details included.
Have calculate_raw_cpu_policy() clear the instantaneous XSTATE sizes. The
value XCR0 | MSR_XSS had when we scanned the policy isn't terribly interesting
to report.
When CPUID Masking is active, it affects CPUID instructions issued by Xen
too. Transiently disable masking to get a clean scan.
Fixes: 694d79ed5aac ("x86/ucode: Refresh raw CPU policy after microcode load") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jason Andryuk [Thu, 25 Jan 2024 15:11:49 +0000 (16:11 +0100)]
pmstat: Limit hypercalls under HWP
When HWP is active, the cpufreq P-state information is not updated. In
that case, return -EOPNOTSUPP instead of bogus, incomplete info.
Similarly, set_cpufreq_para() is not applicable when HWP is active.
Many of the options already checked the governor and were inaccessible,
but SCALING_MIN/MAX_FREQ was still accessible (though it would do
nothing). Add an earlier HWP check to handle all cases.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 25 Jan 2024 15:10:58 +0000 (16:10 +0100)]
x86/entry: replace two GET_CURRENT() uses
Now that we have %r14 set up using GET_STACK_END() in a number of
places, in two places we can eliminate the redundancy of GET_CURRENT()
also invoking that macro. In handle_ist_exception() actually go a step
further and avoid using %rbx altogether when retrieving the processor
ID: Obtain the current vCPU pointer only in the PV32-specific code
actually needing it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 25 Jan 2024 15:10:06 +0000 (16:10 +0100)]
x86/NMI: refine "watchdog stuck" log message
Observing
"Testing NMI watchdog on all CPUs: 0 stuck"
it felt not quite right, but I still read it as "no CPU stuck;
all good", when really the system suffered from what 6bdb965178bb
("x86/intel: ensure Global Performance Counter Control is setup
correctly") works around. Convert this to
"Testing NMI watchdog on all CPUs: {0} stuck"
or, with multiple CPUs having an issue, e.g.
"Testing NMI watchdog on all CPUs: {0,40} stuck"
to make more obvious that a lone number is not a count of CPUs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 22 Jan 2024 14:50:10 +0000 (14:50 +0000)]
x86/entry: Fix ELF metadata for NMI and handle_ist_exception
handle_ist_exception isn't part of the NMI handler, just like handle_exception
isn't part of #PF.
Fixes: b3a9037550df ("x86: annotate entry points with type and size") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 28 Oct 2021 19:03:21 +0000 (20:03 +0100)]
x86/kexec: Drop compatibility_mode_far
LJMP is (famously?) incompatible between Intel and AMD CPUs, and while we're
using one of the compatible forms, we've got a good stack and LRET is the far
more common way of doing this.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 23 Jan 2024 11:03:23 +0000 (12:03 +0100)]
IRQ: generalize [gs]et_irq_regs()
Move functions (and their data) to common code, and invoke the functions
on Arm as well. This is in preparation for dropping the register
parameters from handler functions.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Michal Orzel [Tue, 23 Jan 2024 11:02:44 +0000 (12:02 +0100)]
lib{fdt,elf}: move lib{fdt,elf}-temp.o and their deps to $(targets)
At the moment, trying to run xencov read/reset (calling SYSCTL_coverage_op
under the hood) results in a crash. This is due to a profiler trying to
access data in the .init.* sections (libfdt for Arm and libelf for x86)
that are stripped after boot. Normally, the build system compiles any
*.init.o file without COV_FLAGS. However, these two libraries are
handled differently as sections will be renamed to init after linking.
To override COV_FLAGS to empty for these libraries, lib{fdt,elf}.o were
added to nocov-y. This worked until e321576f4047 ("xen/build: start using
if_changed") that added lib{fdt,elf}-temp.o and their deps to extra-y.
This way, even though these objects appear as prerequisites of
lib{fdt,elf}.o and the settings should propagate to them, make can also
build them as a prerequisite of __build, in which case COV_FLAGS would
still have the unwanted flags. Fix it by switching to $(targets) instead.
Also, for libfdt, append libfdt.o to nocov-y only if CONFIG_OVERLAY_DTB
is not set. Otherwise, there is no section renaming and we should be able
to run the coverage.
Fixes: e321576f4047 ("xen/build: start using if_changed") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 22 Jan 2024 12:55:11 +0000 (13:55 +0100)]
RISC-V: annotate entry points with type and size
Use the generic framework in xen/linkage.h. No change in generated code
except of course the converted symbols change to be hidden ones and gain
a valid size.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Jan Beulich [Mon, 22 Jan 2024 12:54:34 +0000 (13:54 +0100)]
Arm: annotate entry points with type and size
Use the generic framework in xen/linkage.h. No change in generated code
except for the changed padding value (noticeable when config.gz isn't a
multiple of 4 in size). Plus of course the converted symbols change to
be hidden ones.
Note that ASM_INT() is switched to DATA(), not DATA_LOCAL(), as the only
use site wants the symbol global anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Mon, 22 Jan 2024 12:52:13 +0000 (13:52 +0100)]
x86: also mark assembler globals hidden
Let's have assembler symbols be consistent with C ones. In principle
there are (a few) cases where gas can produce smaller code this way,
just that for now there's a gas bug causing smaller code to be emitted
even when that shouldn't be the case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:51:31 +0000 (13:51 +0100)]
x86: annotate entry points with type and size
Use the generic framework in xen/linkage.h.
For switch_to_kernel() and restore_all_guest() so far implicit alignment
(from being first in their respective sections) is being made explicit
(as in: using FUNC() without 2nd argument). Whereas for
{,compat}create_bounce_frame() and autogen_entrypoints[] alignment is
newly arranged for.
Except for the added/adjusted alignment padding (including their
knock-on effects) no change in generated code/data. Note that the basis
for support of weak definitions is added despite them not having any use
right now.
Note that ASM_INT() is switched to DATA(), not DATA_LOCAL(), as the only
use site wants the symbol global anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:50:40 +0000 (13:50 +0100)]
common: assembly entry point type/size annotations
Recent gas versions generate minimalistic Dwarf debug info for items
annotated as functions and having their sizes specified [1]. Furthermore
generating live patches wants items properly annotated. "Borrow" Arm's
END() and (remotely) derive other annotation infrastructure from
Linux's, for all architectures to use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
[1] https://sourceware.org/git?p=binutils-gdb.git;a=commitdiff;h=591cc9fbbfd6d51131c0f1d4a92e7893edcc7a28
Jan Beulich [Mon, 22 Jan 2024 12:41:07 +0000 (13:41 +0100)]
x86/MCE: switch some callback invocations to altcall
While not performance critical, these hook invocations would still better
be converted: This way all pre-filled (and newly introduced)
struct mce_callback instances can become __initconst_cf_clobber, thus
allowing the elimination of another 9 ENDBR during the 2nd phase of
alternatives patching.
While this means registering callbacks a little earlier, doing so is
perhaps even advantageous, for having pointers be non-NULL earlier on.
Only one set of callbacks would ever be registered anyway, and
neither of the respective initialization functions can (subsequently)
fail.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:40:32 +0000 (13:40 +0100)]
x86/MCE: separate BSP-only initialization
Several function pointers are registered over and over again, when
setting them once on the BSP suffices. Arrange for this in the vendor
init functions and mark involved registration functions __init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:40:00 +0000 (13:40 +0100)]
x86/PV: avoid indirect call for I/O emulation quirk hook
This way ioemul_handle_proliant_quirk() won't need ENDBR anymore.
While touching this code, also
- arrange for it to not be built at all when !PV,
- add "const" to the last function parameter and bring the definition
in sync with the declaration (for Misra).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:39:23 +0000 (13:39 +0100)]
x86/MTRR: avoid several indirect calls
The use of (supposedly) vendor-specific hooks is a relic from the days
when Xen could still be built as a 32-bit binary. There's no
expectation that a new need for such an abstraction would arise. Convert
mtrr_if to a mere boolean and all prior calls through it to direct ones,
thus allowing the elimination of 6 ENDBR from .text.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 22 Dec 2023 21:06:16 +0000 (21:06 +0000)]
xen/livepatch: Make check_for_livepatch_work() faster in the common case
When livepatching is enabled, this function is used all the time. Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures). This cuts out 3 pointer
dereferences in the "nothing to do" path.
However, GCC still needs some help to persuade it not to set up the full
stack frame (6 spilled registers, 3 slots of locals) even on the fastpath.
Create a new check_for_livepatch_work() with the fastpath only, and make the
"new" do_livepatch_work() noinline. This causes the fastpath to need no stack
frame, making it faster still.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 27 Oct 2023 16:02:21 +0000 (17:02 +0100)]
x86/vmx: Disallow the use of inactivity states
Right now, vvmx will blindly copy L12's ACTIVITY_STATE into the L02 VMCS and
enter the vCPU. Luckily for us, nested-virt is explicitly unsupported for
security bugs.
The inactivity states are HLT, SHUTDOWN and WAIT-FOR-SIPI, and as noted by the
SDM in Vol3 27.7 "Special Features of VM Entry":
If VM entry ends with the logical processor in an inactive activity state,
the VM entry generates any special bus cycle that is normally generated when
that activity state is entered from the active state.
Also,
Some activity states unconditionally block certain events.
I.e. a VMEntry with ACTIVITY=SHUTDOWN will initiate a platform reset, while a
VMEntry with ACTIVITY=WAIT-FOR-SIPI will really block everything other than
SIPIs.
Both of these activity states are for the TXT ACM to use, not for regular
hypervisors, and Xen doesn't support dropping the HLT intercept either.
There are two paths in Xen which operate on ACTIVITY_STATE.
1) The vmx_{get,set}_nonreg_state() helpers for VM-Fork.
As regular VMs can't use any inactivity states, this is just duplicating
the 0 from construct_vmcs(). Retain the ability to query activity_state,
but crash the domain on any attempt to set an inactivity state.
2) Nested virt, because of ACTIVITY_STATE in vmcs_gstate_field[].
Explicitly hide the inactivity states in the guest's view of MSR_VMX_MISC,
and remove ACTIVITY_STATE from vmcs_gstate_field[].
In virtual_vmentry(), we should trigger a VMEntry failure for the use of
any inactivity states, but there's no support for that in the code at all
so leave a TODO for when we finally start working on nested-virt in
earnest.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com>