Jan Beulich [Mon, 12 Feb 2024 08:37:18 +0000 (09:37 +0100)]
x86/PV32: restore PAE-extended-CR3 logic
While the PAE-extended-CR3 VM assist is a 32-bit-only concept, it still
applies to guests when run on a 64-bit hypervisor: the "extended
CR3" format has to be used there as well, to fit the address in the only
32-bit-wide register available. As a result it was a mistake that the check
was never enabled for that case, and it was then mistakenly deleted in the
course of removing 32-bit Xen code (218adf199e68 ["x86: We can assume
CONFIG_PAGING_LEVELS==4"]).
Similarly, during Dom0 construction, kernel awareness needs to be taken
into account, and the respective code was again mistakenly never enabled for
32-bit Dom0 running on 64-bit Xen (and thus wrongly deleted by 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
At the same time restrict enabling of the assist for Dom0 to just the
32-bit case. Furthermore there's no need for an atomic update there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
The failure is reported for the following line:
(paddr_t)(uintptr_t)(_start + boot_phys_offset)
This occurs because the compiler treats (ptr + size), with size bigger than
PTRDIFF_MAX, as undefined behavior. To address this, switch to the macro
virt_to_maddr(), given the future plans to eliminate boot_phys_offset.
eclair: move function and macro properties outside ECLAIR
Function and macro properties contained in ECLAIR/call_properties.ecl are of
general interest: this patch moves these annotations into a generic JSON file
in docs. This way, they can be exploited for other purposes (e.g. documentation,
other tools).
Add an rst file explaining how to update function_macro_properties.json.
Add a script to convert the JSON file into ECL configurations.
Remove ECLAIR/call_properties.ecl: the file is now automatically generated from
the JSON file.
Jason Andryuk [Wed, 7 Feb 2024 12:46:52 +0000 (13:46 +0100)]
block-common: Fix same_vm for no targets
same_vm is broken when the two main domains do not have targets. otvm
and targetvm are both missing, which means they get set to -1 and then
converted to empty strings:
++10697+ local targetvm=-1
++10697+ local otvm=-1
++10697+ otvm=
++10697+ othervm=/vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4
++10697+ targetvm=
++10697+ local frontend_uuid=/vm/844dea4e-44f8-4e3e-8145-325132a31ca5
The final comparison returns true since the two empty strings match.
Replace -1 with distinct strings indicating the lack of a value and
remove the coalescing to empty strings. The strings themselves will no
longer match, and that is correct.
Michal Orzel [Tue, 6 Feb 2024 15:20:12 +0000 (16:20 +0100)]
automation: Switch yocto-qemux86-64 job to run on x86
At the moment, all Yocto jobs run on Arm64 runners. To address CI
capacity issues, move yocto-qemux86-64 job to x86. Reflect the change in
the makefile generating Yocto docker files, and fix the CONTAINER name
definition, which incorrectly expects the YOCTO_HOST variable to be set for the
x86 container too, even though that one does not have a platform name appended.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Frediano Ziglio [Tue, 6 Feb 2024 10:56:38 +0000 (11:56 +0100)]
x86/paging: Use more specific constant
__HYPERVISOR_arch_1 and __HYPERVISOR_paging_domctl_cont have the same
value for x86, but this function handles the "paging_domctl_cont"
hypercall, so using the latter mnemonic in the code is clearer.
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 6 Feb 2024 10:56:13 +0000 (11:56 +0100)]
amd-vi: fix IVMD memory type checks
The current code that parses the IVMD blocks is relaxed with regard to the
restriction that such unity regions should always fall into memory ranges
marked as reserved in the memory map.
However the type checks for the IVMD addresses are inverted, and as a result
IVMD ranges falling into RAM areas are accepted. Note that having such ranges
in the first place is a firmware bug, as IVMD should always fall into reserved
ranges.
Fixes: ed6c77ebf0c1 ('AMD/IOMMU: check / convert IVMD ranges for being / to be reserved')
Reported-by: Ox <oxjo@proton.me>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: oxjo <oxjo@proton.me>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Hongyan Xia [Tue, 6 Feb 2024 10:54:50 +0000 (11:54 +0100)]
acpi: vmap pages in acpi_os_alloc_memory
Also, introduce a wrapper around vmap that maps a contiguous range for
boot allocations. Unfortunately, the new helper cannot be a static inline
because the dependencies are a mess. We would need to re-include
asm/page.h (removed in aa4b9d1ee653 "include: don't use asm/page.h
from common headers"), and even that no longer looks to be enough,
because bits from asm/cpufeature.h are used in the definition of PAGE_NX.
Lastly, with the move to vmap(), it is now easier to find the size
of the mapping. So pass the whole area to init_boot_pages() rather than
just the first page.
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 6 Feb 2024 10:54:17 +0000 (11:54 +0100)]
xen/vmap: Introduce vmap_size() and use it
vunmap() and vfree() currently duplicate the (small) logic to find the
size of a vmap area. In a follow-up patch, we will want to introduce
another copy (this time externally).
So introduce a new helper vmap_size() that will return the number of
pages in the area starting at the given address. Take the opportunity
to replace the open-coded version.
Note that vfree() was storing the type of the area in a local variable.
But this seems to have never been used (even when it was introduced).
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Tue, 6 Feb 2024 10:53:04 +0000 (11:53 +0100)]
setup: Move vm_init() before acpi calls
After the direct map removal, pages from the boot allocator are not
going to be mapped in the direct map. Although we have map_domain_page(),
such mappings are ephemeral and less helpful for anything spanning more
than a page, so we want a mechanism to globally map a range of pages,
which is what vmap is for. Therefore, bring vm_init into the early boot
stage.
To allow vmap to be initialised and used in early boot, we need to
modify vmap to receive pages from the boot allocator during early boot
stage.
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Woodhouse <dwmw2@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Juergen Gross [Mon, 5 Feb 2024 10:49:57 +0000 (11:49 +0100)]
tools/xenstored: map stubdom interface
When running as stubdom, map the stubdom's Xenstore ring page in order
to support using the 9pfs frontend.
Use the same pattern as in dom0_init() when running as daemon in dom0
(introduce our own domain, then send an event to the client side to
signal Xenstore is ready to communicate).
Juergen Gross [Mon, 5 Feb 2024 10:49:56 +0000 (11:49 +0100)]
tools/xenstored: split domain_init()
Today domain_init() is called either just before calling dom0_init()
in case no live update is being performed, or it is called after
reading the global state from read_state_global(), as the event
channel fd is needed.
Split up domain_init() into a preparation part, which can be called
unconditionally, and a part setting up the event channel handle.
Note that there is no chance that chk_domain_generation() can be
called now before xc_handle has been set up, so there is no need for
the related special case anymore.
Juergen Gross [Mon, 5 Feb 2024 10:49:55 +0000 (11:49 +0100)]
tools/xenstored: rework ring page (un)map functions
When [un]mapping the ring page of a Xenstore client, different actions
are required for "normal" guests and dom0. Today this distinction is
made at call site.
Move this distinction into [un]map_interface() instead, avoiding code
duplication and preparing special handling for [un]mapping the stub
domain's ring page.
Juergen Gross [Mon, 5 Feb 2024 10:49:52 +0000 (11:49 +0100)]
tools/xenstored: move all log-pipe handling into posix.c
All of the log-pipe handling is needed only when running as daemon.
Move it into posix.c. This requires a service function in the
main event loop for handling the related requests, and one for setting
up the fds[] array, which is renamed to poll_fds to have a more specific
name. Use generic names for those functions, as socket handling can
be added to them later, too.
Juergen Gross [Mon, 5 Feb 2024 10:49:50 +0000 (11:49 +0100)]
tools/xenstored: add early_init() function
Some xenstored initialization needs to be done in the daemon case only,
so split it out into a new early_init() function, which is a stub in the
stubdom case.
Remove the call of talloc_enable_leak_report_full(), as it serves no
real purpose: the daemon only ever exits due to a crash, in which case
a log of talloc()ed memory hardly has any value.
Juergen Gross [Mon, 5 Feb 2024 10:49:47 +0000 (11:49 +0100)]
tools/xenstored: rename xenbus_evtchn()
Rename the xenbus_evtchn() function to get_xenbus_evtchn() in order to
avoid two externally visible symbols with the same name when Xenstore-
stubdom is being built with a Mini-OS with CONFIG_XENBUS set.
Cyril Rébert [Sun, 4 Feb 2024 10:19:40 +0000 (11:19 +0100)]
tools/xentop: fix sorting bug for some columns
Sorting doesn't work on columns VBD_OO, VBD_RD, VBD_WR and VBD_RSECT.
Fix by adjusting variable names in the compare functions.
Bug fix only. No functional change.
Fixes: 91c3e3dc91d6 ("tools/xentop: Display '-' when stats are not available.")
Signed-off-by: Cyril Rébert (zithro) <slack@rabbit.lu>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
We have two copies of __bitmap_weight() that differ by whether they make
hweight32() or hweight64() calls, yet we already have hweight_long() which is
the form that __bitmap_weight() wants.
Fix hweight_long() to return unsigned int like all the other hweight helpers,
and fix __bitmap_weight() to use unsigned integers.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 2 Feb 2024 17:57:37 +0000 (17:57 +0000)]
x86/ucode: Remove accidentally introduced tabs
Fixes: cf7fe8b72dea ("x86/ucode: Fix stability of the raw CPU Policy rescan")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 5 Feb 2024 09:48:11 +0000 (10:48 +0100)]
x86/CPU: convert vendor hook invocations to altcall
While not performance critical, these hook invocations still want
converting: this way all pre-filled struct cpu_dev instances can become
__initconst_cf_clobber, thus allowing a further 8 ENDBR to be eliminated
during the 2nd phase of alternatives patching (besides moving previously
resident data to .init.*).
Since all use sites need touching anyway, take the opportunity and also
address a Misra C:2012 Rule 5.5 violation: Rename the this_cpu static
variable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Feb 2024 09:45:31 +0000 (10:45 +0100)]
x86/guest: finish conversion to altcall
While .setup() and .e820_fixup() don't need fiddling with for being run
only very early, both .ap_setup() and .resume() want converting too:
this way both pre-filled struct hypervisor_ops instances can become
__initconst_cf_clobber, thus allowing up to 5 more ENDBR (configuration
dependent) to be eliminated during the 2nd phase of alternatives patching.
While fiddling with section annotations here, also move "ops" itself to
.data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 5 Feb 2024 09:44:46 +0000 (10:44 +0100)]
x86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()
While altcall is already used for them, the functions want announcing in
.init.rodata.cf_clobber, even if the resulting static variables aren't
otherwise used.
While doing this also move ctxt_switch_masking to .data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 26 Jan 2024 19:57:01 +0000 (19:57 +0000)]
x86: Remove gdbstub
In 13y of working on Xen, I've never seen it used. The implementation
was introduced (commit b69f92f3012e, Jul 28 2004) with known issues such as:
/* Resuming after we've stopped used to work, but more through luck
than any actual intention. It doesn't at the moment. */
which appear to have gone unfixed for the 20 years since.
Nowadays there are more robust ways of inspecting crashed state, such as a
kexec crash kernel, or running Xen in a VM.
This will allow us to clean up some hooks around the codebase which are
proving awkward for other tasks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests
The CPUID feature bit signals the presence of the BHI_DIS_S control in the
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests
The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests
The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Wed, 31 Jan 2024 10:42:48 +0000 (10:42 +0000)]
tools/ocaml: Bump minimum version to OCaml 4.05
Char.lowercase got removed in OCaml 5.0 (it has been deprecated since 2014),
and doesn't build any more.
Char.lowercase_ascii has existed since OCaml 4.03, so that is the new
minimum version for oxenstored.
However, OCaml 4.05 is the oldest new-enough version found in common distros,
so pick this as a baseline.
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Acked-by: Christian Lindig <christian.lindig@cloud.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
[Update CHANGELOG.md]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 31 Jan 2024 17:05:47 +0000 (17:05 +0000)]
xen/bitmap: Consistently use unsigned bits values
Right now, most of the static inline helpers take an unsigned nbits quantity,
and most of the library functions take a signed quantity. Because
BITMAP_LAST_WORD_MASK() is expressed as a divide, the compiler is forced to
emit two different paths to get the correct semantics for signed division.
Swap all signed bit-counts to being unsigned bit-counts for the simple cases.
This includes the return value of bitmap_weight().
Bloat-o-meter for a random x86 build reports:
add/remove: 0/0 grow/shrink: 8/19 up/down: 167/-413 (-246)
which all comes from the compiler not emitting "dead" logic paths for negative
bit counts.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 13:59:07 +0000 (13:59 +0000)]
x86/boot: Add braces in reloc.c
107 lines is an unreasonably large switch statement to live inside a
brace-less for loop. Drop the comment that's clumsily trying to cover the
fact that this logic has wrong-looking indentation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 20:44:34 +0000 (20:44 +0000)]
xen/sched: Fix UB shift in compat_set_timer_op()
Tamas reported this UBSAN failure from fuzzing:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/sched/compat.c:48:37
(XEN) left shift of negative value -2147425536
(XEN) ----[ Xen-4.19-unstable x86_64 debug=y ubsan=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
(XEN) [<ffff82d040308afb>] F __ubsan_handle_shift_out_of_bounds+0x11a/0x1c5
(XEN) [<ffff82d040307758>] F compat_set_timer_op+0x41/0x43
(XEN) [<ffff82d04040e4cc>] F hvm_do_multicall_call+0x77f/0xa75
(XEN) [<ffff82d040519462>] F arch_do_multicall_call+0xec/0xf1
(XEN) [<ffff82d040261567>] F do_multicall+0x1dc/0xde3
(XEN) [<ffff82d04040d2b3>] F hvm_hypercall+0xa00/0x149a
(XEN) [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
(XEN) [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200
Left-shifting any negative value is strictly undefined behaviour in C, and
the two parameters here come straight from the guest.
The fuzzer happened to choose lo 0xf, hi 0x8000e300.
Switch everything to be unsigned values, making the shift well defined.
As GCC documents:
As an extension to the C language, GCC does not use the latitude given in
C99 and C11 only to treat certain aspects of signed '<<' as undefined.
However, -fsanitize=shift (and -fsanitize=undefined) will diagnose such
cases.
this was deemed not to need an XSA.
Note: The unsigned -> signed conversion for do_set_timer_op()'s s_time_t
parameter is also well defined. C makes it implementation defined, and GCC
defines it as reduction modulo 2^N to be within range of the new type.
Fixes: 2942f45e09fb ("Enable compatibility mode operation for HYPERVISOR_sched_op and HYPERVISOR_set_timer_op.")
Reported-by: Tamas K Lengyel <tamas@tklengyel.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 18:13:14 +0000 (18:13 +0000)]
x86/hvm: Fix UBSAN failure in do_hvm_op() printk
Tamas reported this UBSAN failure from fuzzing:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/vsprintf.c:64:19
(XEN) negation of -9223372036854775808 cannot be represented in type 'long long int':
(XEN) ----[ Xen-4.19-unstable x86_64 debug=y ubsan=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
(XEN) [<ffff82d04030805d>] F __ubsan_handle_negate_overflow+0x99/0xce
(XEN) [<ffff82d04028868f>] F vsprintf.c#number+0x10a/0x93e
(XEN) [<ffff82d04028ac74>] F vsnprintf+0x19e2/0x1c56
(XEN) [<ffff82d04030a47a>] F console.c#vprintk_common+0x76/0x34d
(XEN) [<ffff82d04030a79e>] F printk+0x4d/0x4f
(XEN) [<ffff82d04040c42b>] F do_hvm_op+0x288e/0x28f5
(XEN) [<ffff82d04040d385>] F hvm_hypercall+0xad2/0x149a
(XEN) [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
(XEN) [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200
The problem is an unsigned -> signed conversion because of a bad
format specifier (%ld trying to format an unsigned long).
We could fix it by swapping to %lu, but this is a useless printk() even in
debug builds, so just drop it completely.
Reported-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Thu, 1 Feb 2024 17:35:22 +0000 (17:35 +0000)]
xen/arm: Properly clean update to init_ttbr and smp_up_cpu
Recent rework to the secondary boot code modified how init_ttbr and
smp_up_cpu are accessed. Rather than directly accessing them, we
are using a pointer to them.
The helper clean_dcache() is expected to take the variable as parameter
and then clean its content. As we now pass a pointer to the variable,
we will clean the area storing the address rather than the content itself.
Switch to use clean_dcache_va_range() to avoid casting the pointer.
Fixes: a5ed59e62c6f ("arm/mmu: Move init_ttbr to a new section .data.idmap")
Fixes: 9a5114074b04 ("arm/smpboot: Move smp_up_cpu to a new section .data.idmap")
Reported-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jan Beulich [Thu, 1 Feb 2024 15:21:04 +0000 (16:21 +0100)]
IOMMU: iommu_use_hap_pt() implies CONFIG_HVM
Allow the compiler a little more room for DCE by moving the compile-time-
constant condition into the predicate (from the one place where it was
added in an open-coded fashion for XSA-450).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jason Andryuk [Thu, 1 Feb 2024 15:19:36 +0000 (16:19 +0100)]
xenpm: Print message for disabled commands
xenpm get-cpufreq-states currently just prints no output when cpufreq is
disabled or HWP is running. Have it print an appropriate message. The
cpufreq disabled one mirrors the cpuidle disabled one.
cpufreq disabled:
$ xenpm get-cpufreq-states
Either Xen cpufreq is disabled or no valid information is registered!
Under HWP:
$ xenpm get-cpufreq-states
P-State information not supported. Try 'get-cpufreq-average' or 'start'.
Also allow xenpm to handle EOPNOTSUPP from the pmstat hypercalls.
EOPNOTSUPP is returned when HWP is active in some cases and allows the
differentiation from cpufreq being disabled.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 1 Feb 2024 15:18:28 +0000 (16:18 +0100)]
x86/PoD: simplify / improve p2m_pod_cache_add()
Avoid recurring MFN -> page or page -> MFN translations. Drop the pretty
pointless local variable "p". Make sure the MFN logged in a debugging
error message is actually the offending one. Return negative errno
values rather than -1 (presently no caller really cares, but imo this
should change). Adjust style.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@cloud.com>
Andrew Cooper [Tue, 30 Jan 2024 13:29:15 +0000 (14:29 +0100)]
VT-d: Fix "else" vs "#endif" misplacement
In domain_pgd_maddr() the "#endif" is misplaced with respect to "else". This
generates incorrect logic when CONFIG_HVM is compiled out, as the "else" body
is executed unconditionally.
Rework the logic to use IS_ENABLED() instead of explicit #ifdef-ary, as it's
clearer to follow. This in turn involves adjusting p2m_get_pagetable() to
compile when CONFIG_HVM is disabled.
This is XSA-450 / CVE-2023-46840.
Fixes: 033ff90aa9c1 ("x86/P2M: p2m_{alloc,free}_ptp() and p2m_alloc_table() are HVM-only")
Reported-by: Teddy Astie <teddy.astie@vates.tech>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 30 Jan 2024 13:28:01 +0000 (14:28 +0100)]
pci: fail device assignment if phantom functions cannot be assigned
The current behavior is that no error is reported if (some) phantom functions
fail to be assigned during device add or assignment, so the operation succeeds
even if some phantom functions are not correctly set up.
This can lead to devices being successfully assigned to a domU while
some of the device's phantom functions are still assigned to dom0. Even when
the device is assigned to domIO before being assigned to a domU, phantom
functions might fail to be assigned to domIO, and also fail to be assigned to
the domU, leaving them assigned to dom0.
Since the device can generate requests using the IDs of those phantom
functions, given the scenario above a device in such state would be in control
of a domU, but still capable of generating transactions that use a context ID
targeting dom0 owned memory.
Modify device assignment to attempt to deassign the device if phantom
functions fail to be assigned.
Note that device addition is not modified in the same way, as in that case the
device is assigned to a trusted domain, and hence partial assign can lead to
device malfunction but not a security issue.
This is XSA-449 / CVE-2023-46839.
Fixes: 4e9950dc1bd2 ('IOMMU: add phantom function support')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Wed, 24 Jan 2024 17:29:52 +0000 (18:29 +0100)]
x86/iommu: switch hwdom IOMMU to use a rangeset
The current loop that iterates from 0 to the maximum RAM address in order to
setup the IOMMU mappings is highly inefficient, and it will get worse as the
amount of RAM increases. It's also not accounting for any reserved regions
past the last RAM address.
Instead of iterating over memory addresses, iterate over the memory map regions
and use a rangeset in order to keep track of which ranges need to be identity
mapped in the hardware domain physical address space.
On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
arch_iommu_hwdom_init() in nanoseconds is:
x old
+ new
N Min Max Median Avg Stddev
x 5 2.2364154e+10 2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
+ 5 1025012 1033036 1026188 1028276.2 3623.1194
Difference at 95.0% confidence
-2.26214e+10 +/- 4.42931e+08
-99.9955% +/- 9.05152e-05%
(Student's t, pooled s = 3.03701e+08)
Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.
Note there's a change for HVM domains (i.e. PVH dom0), which are switched to
creating the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible. Note that the interface of map_mmio_regions() doesn't
allow creating read-only mappings, but so far there are no such mappings
created for PVH dom0 in arch_iommu_hwdom_init().
No change intended in the resulting mappings that a hardware domain gets.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 25 Jan 2024 13:26:26 +0000 (14:26 +0100)]
x86/iommu: remove regions not to be mapped
Introduce the code to remove regions not to be mapped from the rangeset
that will be used to setup the IOMMU page tables for the hardware domain.
This change also introduces two new functions: remove_xen_ranges() and
vpci_subtract_mmcfg() that copy the logic in xen_in_range() and
vpci_is_mmcfg_address() respectively and remove the ranges that would otherwise
be intercepted by the original functions.
Note that the rangeset is still not populated.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 29 Jan 2024 08:23:43 +0000 (09:23 +0100)]
x86: purge NMI_IO_APIC
Even going back to the 3.2 source code, I can't spot how this watchdog mode
could ever have been enabled in Xen. The only effect its presence had
for all the years was the retention of a dead string literal.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 29 Jan 2024 08:22:35 +0000 (09:22 +0100)]
x86/APIC: purge {GET,SET}_APIC_DELIVERY_MODE()
The few uses we have can easily be replaced, eliminating the need for
redundant APIC_DM_* and APIC_MODE_* constants. Therefore also purge all
respective APIC_MODE_* constants, introducing APIC_DM_MASK anew instead.
This is further relevant since we have a different set of APIC_MODE_*,
which could otherwise end up confusing.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 29 Jan 2024 08:21:16 +0000 (09:21 +0100)]
NUMA: no need for asm/numa.h when !NUMA
There's no point in every architecture carrying the same stubs for the
case when NUMA isn't enabled (or even supported). Move all of that to
xen/numa.h; replace explicit uses of asm/numa.h in common code. Make
inclusion of asm/numa.h dependent upon NUMA=y.
Drop the no longer applicable "implement NUMA support" comments - in a
!NUMA section this simply makes no sense.
Roger Pau Monné [Fri, 26 Jan 2024 14:54:18 +0000 (15:54 +0100)]
x86/entry: fix jump into restore_all_guest without %rbx correctly set
e047b8d0fa05 went too far when limiting where the vCPU pointer is obtained.
While the code in ist_dispatch_done does indeed only need the vCPU pointer
when PV32 is enabled, the !PV32 path will end up jumping into
restore_all_guest, which does require %rbx == vCPU pointer.
Fix by moving the fetching of the vCPU pointer to be done outside of the PV32
code block.
Fixes: e047b8d0fa05 ('x86/entry: replace two GET_CURRENT() uses')
Reported-by: Edwin Torok <edwin.torok@cloud.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Thu, 25 Jan 2024 18:36:27 +0000 (18:36 +0000)]
xen/arm64: head: Use PRINT_ID() for secondary CPU MMU-off boot code
With the upcoming work to color Xen, the binary will no longer be
physically contiguous. This will be a problem during boot, as the
assembly code will need to work out where each piece of Xen resides.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, most of the early printk messages are using PRINT(), which
will add the message to .rodata. This is unlikely to be within the
same page as the rest of the idmap.
So replace all the PRINT() that can be reached by a secondary CPU with
the MMU off with PRINT_ID().
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Julien Grall [Thu, 25 Jan 2024 18:33:50 +0000 (18:33 +0000)]
arm/smpboot: Move smp_up_cpu to a new section .data.idmap
With the upcoming work to color Xen, the binary will no longer be
physically contiguous. This will be a problem during boot, as the
assembly code will need to work out where each piece of Xen resides.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, smp_up_cpu is used by secondary CPUs to wait their turn for
booting before the MMU is on. Yet it is currently in .data, which is
unlikely to be within the same page as the rest of the idmap.
Move smp_up_cpu to the recently created section .data.idmap. The idmap is
currently part of the text section and therefore will be mapped read-only
executable. This means that we need to temporarily remap
smp_up_cpu in order to update it.
Introduce a new function set_smp_up_cpu() for this purpose so the code
is not duplicated between when opening and closing the gate.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Julien Grall [Thu, 25 Jan 2024 18:32:38 +0000 (18:32 +0000)]
arm/mmu: Move init_ttbr to a new section .data.idmap
With the upcoming work to color Xen, the binary will no longer be
physically contiguous. This will be a problem during boot, as the
assembly code will need to work out where each piece of Xen resides.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, init_ttbr is used by secondary CPUs to find there page-tables
before the MMU is on. Yet it is currently in .data which is unlikely
to be within the same page as the rest of the idmap.
Create a new section .data.idmap that will be used for variables
accessed by the early boot code. The first one is init_ttbr.
The idmap is currently part of the text section and therefore will
be mapped read-only executable. This means that we need to temporarily
remap init_ttbr in order to update it.
Introduce a new function set_init_ttbr() for this purpose so the code
is not duplicated between arm64 and arm32.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Andrew Cooper [Fri, 10 Feb 2023 21:20:42 +0000 (21:20 +0000)]
x86/entry: Avoid register spilling in cr4_pv32_restore()
cr4_pv32_restore() needs two registers. Right now, it spills %rdx and
clobbers %rax.
However, %rcx is free to use at all callsites. Annotate CR4_PV32_RESTORE with
our usual clobber comments, and swap %rdx for %rcx in the non-fatal paths.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 25 Jan 2024 09:30:41 +0000 (10:30 +0100)]
tools: don't expose XENFEAT_hvm_pirqs by default
The HVM pirq feature allows routing interrupts from both physical and emulated
devices over event channels; this was done as a performance improvement.
However its usage is fully undocumented, and the only reference implementation
is in Linux. It defeats the purpose of local APIC hardware virtualization,
because when it is used interrupts bypass the local APIC altogether.
It has also been reported to not work properly with certain devices: at least
with some AMD GPUs, Linux attempts to route interrupts over event
channels, but Xen doesn't correctly detect such routing, which leads to the
hypervisor complaining with:
(XEN) d15v0: Unsupported MSI delivery mode 7 for Dom15
When an MSI is routed over an event channel the entry's delivery
mode is set to ExtINT, but Xen doesn't detect such routing and attempts to
inject the interrupt following the native MSI path, where the ExtINT delivery
mode is not supported.
Disable HVM PIRQs by default and provide a per-domain option in xl.cfg to
enable the feature. Also, for backwards compatibility, keep the feature
enabled for any resumed domains that don't have an explicit selection.
Note that the only user of the feature (Linux) is also able to handle native
interrupts fine, as the feature was already not used if Xen reported local APIC
hardware virtualization active.
Link: https://github.com/QubesOS/qubes-issues/issues/7971 Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Thu, 25 Jan 2024 09:30:40 +0000 (10:30 +0100)]
x86/hvm: make X86_EMU_USE_PIRQ optional
Allow selecting X86_EMU_USE_PIRQ for HVM guests, so it's no longer mandated
to always be on.
There's no restriction in Xen that forces the feature to be always on for HVM
guests; PVH guests, for example, don't support it. As such, allow the
toolstack to select whether to enable it on a per-domain basis.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 23 Jan 2024 20:24:22 +0000 (20:24 +0000)]
x86/ucode: Fix stability of the raw CPU Policy rescan
Always run microcode_update_helper() on the BSP, so that the updated Raw CPU
policy doesn't get non-BSP topology details included.
Have calculate_raw_cpu_policy() clear the instantaneous XSTATE sizes. The
value XCR0 | MSR_XSS had when we scanned the policy isn't terribly interesting
to report.
When CPUID Masking is active, it affects CPUID instructions issued by Xen
too. Transiently disable masking to get a clean scan.
Fixes: 694d79ed5aac ("x86/ucode: Refresh raw CPU policy after microcode load") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jason Andryuk [Thu, 25 Jan 2024 15:11:49 +0000 (16:11 +0100)]
pmstat: Limit hypercalls under HWP
When HWP is active, the cpufreq P-state information is not updated. In
that case, return -EOPNOTSUPP instead of bogus, incomplete info.
Similarly, set_cpufreq_para() is not applicable when HWP is active.
Many of the options already checked the governor and were inaccessible,
but SCALING_MIN/MAX_FREQ was still accessible (though it would do
nothing). Add an earlier HWP check to handle all cases.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 25 Jan 2024 15:10:58 +0000 (16:10 +0100)]
x86/entry: replace two GET_CURRENT() uses
Now that we have %r14 set up using GET_STACK_END() in a number of
places, in two places we can eliminate the redundancy of GET_CURRENT()
also invoking that macro. In handle_ist_exception() actually go a step
further and avoid using %rbx altogether when retrieving the processor
ID: Obtain the current vCPU pointer only in the PV32-specific code
actually needing it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 25 Jan 2024 15:10:06 +0000 (16:10 +0100)]
x86/NMI: refine "watchdog stuck" log message
Observing
"Testing NMI watchdog on all CPUs: 0 stuck"
it felt not quite right, but I still read it as "no CPU stuck;
all good", when really the system suffered from what 6bdb965178bb
("x86/intel: ensure Global Performance Counter Control is setup
correctly") works around. Convert this to
"Testing NMI watchdog on all CPUs: {0} stuck"
or, with multiple CPUs having an issue, e.g.
"Testing NMI watchdog on all CPUs: {0,40} stuck"
to make more obvious that a lone number is not a count of CPUs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 22 Jan 2024 14:50:10 +0000 (14:50 +0000)]
x86/entry: Fix ELF metadata for NMI and handle_ist_exception
handle_ist_exception isn't part of the NMI handler, just like handle_exception
isn't part of #PF.
Fixes: b3a9037550df ("x86: annotate entry points with type and size") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 28 Oct 2021 19:03:21 +0000 (20:03 +0100)]
x86/kexec: Drop compatibility_mode_far
LJMP is (famously?) incompatible between Intel and AMD CPUs, and while we're
using one of the compatible forms, we've got a good stack and LRET is the far
more common way of doing this.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 23 Jan 2024 11:03:23 +0000 (12:03 +0100)]
IRQ: generalize [gs]et_irq_regs()
Move functions (and their data) to common code, and invoke the functions
on Arm as well. This is in preparation for dropping the register
parameters from handler functions.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Michal Orzel [Tue, 23 Jan 2024 11:02:44 +0000 (12:02 +0100)]
lib{fdt,elf}: move lib{fdt,elf}-temp.o and their deps to $(targets)
At the moment, trying to run xencov read/reset (calling SYSCTL_coverage_op
under the hood) results in a crash. This is due to a profiler trying to
access data in the .init.* sections (libfdt for Arm and libelf for x86)
that are stripped after boot. Normally, the build system compiles any
*.init.o file without COV_FLAGS. However, these two libraries are
handled differently as sections will be renamed to init after linking.
To override COV_FLAGS to empty for these libraries, lib{fdt,elf}.o were
added to nocov-y. This worked until e321576f4047 ("xen/build: start using
if_changed") that added lib{fdt,elf}-temp.o and their deps to extra-y.
This way, even though these objects appear as prerequisites of
lib{fdt,elf}.o and the settings should propagate to them, make can also
build them as a prerequisite of __build, in which case COV_FLAGS would
still have the unwanted flags. Fix it by switching to $(targets) instead.
Also, for libfdt, append libfdt.o to nocov-y only if CONFIG_OVERLAY_DTB
is not set. Otherwise, there is no section renaming and we should be able
to run the coverage.
Fixes: e321576f4047 ("xen/build: start using if_changed") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 22 Jan 2024 12:55:11 +0000 (13:55 +0100)]
RISC-V: annotate entry points with type and size
Use the generic framework in xen/linkage.h. No change in generated code
except of course the converted symbols change to be hidden ones and gain
a valid size.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Jan Beulich [Mon, 22 Jan 2024 12:54:34 +0000 (13:54 +0100)]
Arm: annotate entry points with type and size
Use the generic framework in xen/linkage.h. No change in generated code
except for the changed padding value (noticeable when config.gz isn't a
multiple of 4 in size). Plus of course the converted symbols change to
be hidden ones.
Note that ASM_INT() is switched to DATA(), not DATA_LOCAL(), as the only
use site wants the symbol global anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Mon, 22 Jan 2024 12:52:13 +0000 (13:52 +0100)]
x86: also mark assembler globals hidden
Let's have assembler symbols be consistent with C ones. In principle
there are (a few) cases where gas can produce smaller code this way,
just that for now there's a gas bug causing smaller code to be emitted
even when that shouldn't be the case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:51:31 +0000 (13:51 +0100)]
x86: annotate entry points with type and size
Use the generic framework in xen/linkage.h.
For switch_to_kernel() and restore_all_guest() so far implicit alignment
(from being first in their respective sections) is being made explicit
(as in: using FUNC() without 2nd argument). Whereas for
{,compat}create_bounce_frame() and autogen_entrypoints[] alignment is
newly arranged for.
Except for the added/adjusted alignment padding (including their
knock-on effects) no change in generated code/data. Note that the basis
for support of weak definitions is added despite them not having any use
right now.
Note that ASM_INT() is switched to DATA(), not DATA_LOCAL(), as the only
use site wants the symbol global anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:50:40 +0000 (13:50 +0100)]
common: assembly entry point type/size annotations
Recent gas versions generate minimalistic Dwarf debug info for items
annotated as functions and having their sizes specified [1]. Furthermore
generating live patches wants items properly annotated. "Borrow" Arm's
END() and (remotely) derive other annotation infrastructure from
Linux's, for all architectures to use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
[1] https://sourceware.org/git?p=binutils-gdb.git;a=commitdiff;h=591cc9fbbfd6d51131c0f1d4a92e7893edcc7a28
Jan Beulich [Mon, 22 Jan 2024 12:41:07 +0000 (13:41 +0100)]
x86/MCE: switch some callback invocations to altcall
While not performance critical, these hook invocations would still better
be converted: This way all pre-filled (and newly introduced)
struct mce_callback instances can become __initconst_cf_clobber, thus
allowing the elimination of another 9 ENDBR during the 2nd phase of
alternatives patching.
While this means registering callbacks a little earlier, doing so is
perhaps even advantageous, for having pointers be non-NULL earlier on.
Only one set of callbacks would ever be registered anyway, and
neither of the respective initialization functions can (subsequently)
fail.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:40:32 +0000 (13:40 +0100)]
x86/MCE: separate BSP-only initialization
Several function pointers are registered over and over again, when
setting them once on the BSP suffices. Arrange for this in the vendor
init functions and mark involved registration functions __init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:40:00 +0000 (13:40 +0100)]
x86/PV: avoid indirect call for I/O emulation quirk hook
This way ioemul_handle_proliant_quirk() won't need ENDBR anymore.
While touching this code, also
- arrange for it to not be built at all when !PV,
- add "const" to the last function parameter and bring the definition
in sync with the declaration (for Misra).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 22 Jan 2024 12:39:23 +0000 (13:39 +0100)]
x86/MTRR: avoid several indirect calls
The use of (supposedly) vendor-specific hooks is a relic from the days
when Xen could still be built as a 32-bit binary. There's no
expectation that a new need for such an abstraction would arise. Convert
mtrr_if to a mere boolean and all prior calls through it to direct ones,
thus allowing the elimination of 6 ENDBR from .text.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 22 Dec 2023 21:06:16 +0000 (21:06 +0000)]
xen/livepatch: Make check_for_livepatch_work() faster in the common case
When livepatching is enabled, this function is used all the time. Really do
check the fastpath first, and annotate it likely() as this is the right answer
100% of the time (to many significant figures). This cuts out 3 pointer
dereferences in the "nothing to do" path.
However, GCC still needs some help to persuade it not to set up the full
stack frame (6 spilled registers, 3 slots of locals) even on the fastpath.
Create a new check_for_livepatch_work() with the fastpath only, and make the
"new" do_livepatch_work() noinline. This causes the fastpath to need no stack
frame, making it faster still.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 27 Oct 2023 16:02:21 +0000 (17:02 +0100)]
x86/vmx: Disallow the use of inactivity states
Right now, vvmx will blindly copy L12's ACTIVITY_STATE into the L02 VMCS and
enter the vCPU. Luckily for us, nested-virt is explicitly unsupported for
security bugs.
The inactivity states are HLT, SHUTDOWN and WAIT-FOR-SIPI, and as noted by the
SDM in Vol3 27.7 "Special Features of VM Entry":
If VM entry ends with the logical processor in an inactive activity state,
the VM entry generates any special bus cycle that is normally generated when
that activity state is entered from the active state.
Also,
Some activity states unconditionally block certain events.
I.e. a VMEntry with ACTIVITY=SHUTDOWN will initiate a platform reset, while a
VMEntry with ACTIVITY=WAIT-FOR-SIPI will really block everything other than
SIPIs.
Both of these activity states are for the TXT ACM to use, not for regular
hypervisors, and Xen doesn't support dropping the HLT intercept either.
There are two paths in Xen which operate on ACTIVITY_STATE.
1) The vmx_{get,set}_nonreg_state() helpers for VM-Fork.
As regular VMs can't use any inactivity states, this is just duplicating
the 0 from construct_vmcs(). Retain the ability to query activity_state,
but crash the domain on any attempt to set an inactivity state.
2) Nested virt, because of ACTIVITY_STATE in vmcs_gstate_field[].
Explicitly hide the inactivity states in the guest's view of MSR_VMX_MISC,
and remove ACTIVITY_STATE from vmcs_gstate_field[].
In virtual_vmentry(), we should trigger a VMEntry failure for the use of
any inactivity states, but there's no support for that in the code at all
so leave a TODO for when we finally start working on nested-virt in
earnest.
Reported-by: Reima Ishii <ishiir@g.ecc.u-tokyo.ac.jp> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com>