Juergen Gross [Thu, 15 Feb 2024 13:04:49 +0000 (14:04 +0100)]
tools: add a new xen 9pfs daemon
Add "xen-9pfsd", a new 9pfs daemon meant to support infrastructure
domains (e.g. xenstore-stubdom) to access files in dom0.
For now only add the code needed for starting the daemon and
registering it with Xenstore via a new "libxl/xen-9pfs/state" node by
writing the "running" state to it.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Juergen Gross [Thu, 8 Feb 2024 16:05:15 +0000 (17:05 +0100)]
docs: add a best practices coding guide
Today the CODING_STYLE contains a section "Handling unexpected
conditions" specific to the hypervisor. This section is kind of
misplaced for a coding style. It should rather be part of a "Coding
best practices" guide.
Add such a guide as docs/process/coding-best-practices.pandoc and
move the mentioned section from CODING_STYLE to the new file, while
converting the format to pandoc.
Roger Pau Monné [Wed, 14 Feb 2024 13:18:06 +0000 (14:18 +0100)]
iommu/x86: fix IVMD/RMRR range checker loop increment
mfn_add() doesn't store the incremented value in the parameter, and instead
returns it to the caller. As a result, the loop in iommu_unity_region_ok()
didn't make progress. Fix it by storing the incremented value.
Fixes: e45801dea17b ('iommu/x86: introduce a generic IVMD/RMRR range validity helper')
Coverity-ID: 1592056 Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Wed, 14 Feb 2024 09:41:59 +0000 (10:41 +0100)]
tools: add access macros for unaligned data
Add the basic access macros for unaligned data to common-macros.h.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Cyril Rébert [Wed, 14 Feb 2024 09:41:43 +0000 (10:41 +0100)]
tools/xentop: add option to display dom0 first
Add a command line option to xentop to be able to display dom0 first, on top of the list.
This is unconditional, so sorting domains with the S option will also ignore dom0.
Signed-off-by: Cyril Rébert (zithro) <slack@rabbit.lu> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jason Andryuk [Wed, 14 Feb 2024 09:41:17 +0000 (10:41 +0100)]
libxl: Allow Phy backend for CDROM devices
A Linux HVM domain ignores PV block devices with type cdrom. The
Windows PV drivers also ignore device-type != "disk". Therefore QEMU's
emulated CD-ROM support is used. This allows ejection and other CD-ROM
features to work.
With a stubdom, QEMU is running in the stubdom. A PV disk is still
connected into the stubdom, and then QEMU can emulate the CD-ROM into
the guest. Phy support has been enhanced to provide a placeholder file
forempty disks, so it is usable as a CDROM backend as well. Allow Phy
to pass the check as well.
(Bypassing just for a linux-based stubdom doesn't work because
libxl__device_disk_setdefault() gets called early in domain creation
before xenstore is populated with relevant information for the stubdom
type. The build information isn't readily available and won't exist in
some call trees, so it isn't usable either.)
Let disk_try_backend() allow format empty for Phy cdrom drives.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jason Andryuk [Wed, 14 Feb 2024 09:40:56 +0000 (10:40 +0100)]
libxl: Create empty file for Phy cdrom
With a device model stubdom, dom0 exports a PV disk to the stubdom.
Inside the stubdom, QEMU emulates a cdrom to the guest with a
host_device pointing at the PV frontend (/dev/xvdc)
An empty cdrom drive causes problems booting the stubdom. The PV disk
protocol isn't designed to support no media. That can be partially
hacked around, but the stubdom kernel waits for all block devices to
transition to Connected. Since the backend never connects empty media,
stubdom launch times out and it is destroyed.
Empty media and the PV disks not connecting is fine at runtime since the
stubdom keeps running irrespective of the disk state.
Empty media can be worked around my providing an empty file to the
stubdom for the PV disk source. This works as the disk is exposed as a
zero-size disk. Dynamically create the empty file as needed and remove
in the stubdom cleanup.
libxl__device_disk_set_backend() needs to allow through these "empty"
disks with a pdev_path.
Fixup the params writing since scripts have trouble with an empty params
field.
This works for non-stubdom HVMs as well.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Wed, 14 Feb 2024 09:38:38 +0000 (10:38 +0100)]
Argo: drop meaningless mfn_valid() check
Holding a valid struct page_info * in hands already means the referenced
MFN is valid; there's no need to check that again. Convert the checking
logic to a switch(), to help keeping the extra (and questionable) x86-
only check in somewhat tidy shape.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
Nicola Vetrini [Thu, 8 Feb 2024 15:50:09 +0000 (16:50 +0100)]
docs/misra: add asm-offset.c to exclude-list
These files contain several deliberate violations of MISRA C rules such
as:
* R20.12 for macros DEFINE and OFFSET, where the second argument
of OFFSET is a macro and is used as a normal parameter and a
stringification operand.
* R2.1 because the file is not linked. That said it was decided to
deviate the rule itself to address that aspect).
Roger Pau Monné [Tue, 13 Feb 2024 08:37:20 +0000 (09:37 +0100)]
iommu/vt-d: switch to common RMRR checker
Use the newly introduced generic unity map checker.
Also drop the message recommending the usage of iommu_inclusive_mapping: the
ranges would end up being mapped anyway even if some of the checks above
failed, regardless of whether iommu_inclusive_mapping is set. Plus such option
is not supported for PVH, and it's deprecated.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 13 Feb 2024 08:36:14 +0000 (09:36 +0100)]
x86/HVM: tidy state on hvmemul_map_linear_addr()'s error path
While in the vast majority of cases failure of the function will not
be followed by re-invocation with the same emulation context, a few
very specific insns - involving multiple independent writes, e.g. ENTER
and PUSHA - exist where this can happen. Since failure of the function
only signals to the caller that it ought to try an MMIO write instead,
such failure also cannot be assumed to result in wholesale failure of
emulation of the current insn. Instead we have to maintain internal
state such that another invocation of the function with the same
emulation context remains possible. To achieve that we need to reset MFN
slots after putting page references on the error path.
Note that all of this affects debugging code only, in causing an
assertion to trigger (higher up in the function). There's otherwise no
misbehavior - such a "leftover" slot would simply be overwritten by new
contents in a release build.
Also extend the related unmap() assertion, to further check for MFN 0.
Fixes: 8cbd4fb0b7ea ("x86/hvm: implement hvmemul_write() using real mappings") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Paul Durrant <paul@xen.org>
asm/iommu.h shouldn't need to be included when CONFIG_HAS_PASSTHROUGH
isn't enabled.
As <asm/iommu.h> is ifdef-ed by CONFIG_HAS_PASSTHROUGH it should
be also ifdef-ed field "struct arch_iommu arch" in struct domain_iommu
as definition of arch_iommu is located in <asm/iommu.h>.
These amount of changes are just enough to avoid generation of empty
asm/iommu.h for now.
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jason Andryuk [Tue, 13 Feb 2024 08:28:58 +0000 (09:28 +0100)]
libxl: Add support for blktap vbd3
This patch re-introduces blktap support to libxl. Unlike earlier
versions, it does not link against any blktap library. libxl changes
are needed to write to the vbd3 backend XenStore nodes.
blktap has three components. tapdisk is a daemon implementing the disk
IO, NBD (Network Block Device), and Xen PV interfaces. tap-ctl is a
tool to control tapdisks - creating, starting, stopping and freeing.
tapback manages the XenStore operations and instructs tapdisk to
connect.
It is notable that tapdisk performs the grant and event channel ops, but
doesn't interact with XenStore. tapback performs XenStore operations
and notifies tapdisks of values and changes.
The flow is: libxl writes to the "vbd3" XenStore nodes and runs the
block-tap script. The block-tap script runs tap-ctl to create a tapdisk
instance as the physical device. tapback then sees the tapdisk and
instructs the tapdisk to connect up the PV blkif interface.
This is expected to work without the kernel blktap driver, so the
block-tap script is modified accordingly to write the UNIX NBD path.
backendtype=tap was not fully removed previously, but it would never
succeed since it would hit the hardcoded error in disk_try_backend().
It is reused now.
An example command to attach a vhd:
xl block-attach vm 'vdev=xvdf,backendtype=tap,format=vhd,target=/srv/target.vhd'
Format raw also works to run an "aio:" tapdisk.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Tue, 13 Feb 2024 08:27:47 +0000 (09:27 +0100)]
x86/p2m: make p2m_get_page_from_gfn() handle grant case correctly
The 'fast' path of p2m_get_page_from_gfn handles three cases: normal ram,
foreign p2m entries, and grant map entries. For normal ram and grant table
entries, get_page() is called, but for foreign entries,
page_get_owner_and_reference() is called, since the current domain is
expected not to be the owner.
Unfortunately, grant maps are *also* generally expected to be owned by
foreign domains; so this function will fail for any p2m entry containing a
grant map that doesn't happen to be local.
Have grant maps take the same path as foreign entries. Since grants may
actually be either foreign or local, adjust the assertion to allow for this.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@cloud.com> Reviewed-by: George Dunlap <george.dunlap@cloud.com>
Petr Beneš [Mon, 12 Feb 2024 08:37:58 +0000 (09:37 +0100)]
x86/hvm: Fix fast singlestep state persistence
This patch addresses an issue where the fast singlestep setting would persist
despite xc_domain_debug_control being called with XEN_DOMCTL_DEBUG_OP_SINGLE_STEP_OFF.
Specifically, if fast singlestep was enabled in a VMI session and that session
stopped before the MTF trap occurred, the fast singlestep setting remained
active even though MTF itself was disabled. This led to a situation where, upon
starting a new VMI session, the first event to trigger an EPT violation would
cause the corresponding EPT event callback to be skipped due to the lingering
fast singlestep setting.
The fix ensures that the fast singlestep setting is properly reset when
disabling single step debugging operations.
Signed-off-by: Petr Beneš <w1benny@gmail.com> Reviewed-by: Tamas K Lengyel <tamas@tklengyel.com>
Jan Beulich [Mon, 12 Feb 2024 08:37:18 +0000 (09:37 +0100)]
x86/PV32: restore PAE-extended-CR3 logic
While the PAE-extended-CR3 VM assist is a 32-bit only concept, it still
applies to guests also when run on a 64-bit hypervisor: The "extended
CR3" format has to be used there as well, to fit the address in the only
32-bit wide register there. As a result it was a mistake that the check
was never enabled for that case, and was then mistakenly deleted in the
course of removal of 32-bit-Xen code (218adf199e68 ["x86: We can assume
CONFIG_PAGING_LEVELS==4"]).
Similarly during Dom0 construction kernel awareness needs to be taken
into account, and respective code was again mistakenly never enabled for
32-bit Dom0 when running on 64-bit Xen (and thus wrongly deleted by 5d1181a5ea5e ["xen: Remove x86_32 build target"]).
At the same time restrict enabling of the assist for Dom0 to just the
32-bit case. Furthermore there's no need for an atomic update there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
The failure is reported for the following line:
(paddr_t)(uintptr_t)(_start + boot_phys_offset)
This occurs because the compiler treats (ptr + size) with size bigger than
PTRDIFF_MAX as undefined behavior. To address this, switch to macro
virt_to_maddr(), given the future plans to eliminate boot_phys_offset.
eclair: move function and macro properties outside ECLAIR
Function and macro properties contained in ECLAIR/call_properties.ecl are of
general interest: this patch moves these annotations in a generaric JSON file
in docs. In this way, they can be exploited for other purposes (i.e. documentation,
other tools).
Add rst file containing explanation on how to update function_macro_properties.json.
Add script to convert the JSON file in ECL configurations.
Remove ECLAIR/call_properties.ecl: the file is now automatically generated from
the JSON file.
Jason Andryuk [Wed, 7 Feb 2024 12:46:52 +0000 (13:46 +0100)]
block-common: Fix same_vm for no targets
same_vm is broken when the two main domains do not have targets. otvm
and targetvm are both missing, which means they get set to -1 and then
converted to empty strings:
++10697+ local targetvm=-1
++10697+ local otvm=-1
++10697+ otvm=
++10697+ othervm=/vm/cc97bc2f-3a91-43f7-8fbc-4cb92f90b4e4
++10697+ targetvm=
++10697+ local frontend_uuid=/vm/844dea4e-44f8-4e3e-8145-325132a31ca5
The final comparison returns true since the two empty strings match:
Replace -1 with distinct strings indicating the lack of a value and
remove the collescing to empty stings. The strings themselves will no
longer match, and that is correct.
Michal Orzel [Tue, 6 Feb 2024 15:20:12 +0000 (16:20 +0100)]
automation: Switch yocto-qemux86-64 job to run on x86
At the moment, all Yocto jobs run on Arm64 runners. To address CI
capacity issues, move yocto-qemux86-64 job to x86. Reflect the change in
the makefile generating Yocto docker files and fix CONTAINER name
definition that incorrectly expects YOCTO_HOST variable to be set for x86
container as well, which does not have a platform name appended.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Frediano Ziglio [Tue, 6 Feb 2024 10:56:38 +0000 (11:56 +0100)]
x86/paging: Use more specific constant
__HYPERVISOR_arch_1 and __HYPERVISOR_paging_domctl_cont for x86
have the same value but this function is handling
"paging_domctl_cont" hypercall so using the latter mnemonic in
the code is more clear.
Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 6 Feb 2024 10:56:13 +0000 (11:56 +0100)]
amd-vi: fix IVMD memory type checks
The current code that parses the IVMD blocks is relaxed with regard to the
restriction that such unity regions should always fall into memory ranges
marked as reserved in the memory map.
However the type checks for the IVMD addresses are inverted, and as a result
IVMD ranges falling into RAM areas are accepted. Note that having such ranges
in the first place is a firmware bug, as IVMD should always fall into reserved
ranges.
Fixes: ed6c77ebf0c1 ('AMD/IOMMU: check / convert IVMD ranges for being / to be reserved') Reported-by: Ox <oxjo@proton.me> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: oxjo <oxjo@proton.me> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Hongyan Xia [Tue, 6 Feb 2024 10:54:50 +0000 (11:54 +0100)]
acpi: vmap pages in acpi_os_alloc_memory
Also, introduce a wrapper around vmap that maps a contiguous range for
boot allocations. Unfortunately, the new helper cannot be a static inline
because the dependencies are a mess. We would need to re-include
asm/page.h (was removed in aa4b9d1ee653 "include: don't use asm/page.h
from common headers") and it doesn't look to be enough anymore
because bits from asm/cpufeature.h is used in the definition of PAGE_NX.
Lastly, with the move to vmap(), it is now easier to find the size
of the mapping. So pass the whole area to init_boot_pages() rather than
just the first page.
Signed-off-by: Hongyan Xia <hongyxia@amazon.com> Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Elias El Yandouzi <eliasely@amazon.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 6 Feb 2024 10:54:17 +0000 (11:54 +0100)]
xen/vmap: Introduce vmap_size() and use it
vunmap() and vfree() currently duplicate the (small) logic to find the
size of an vmap area. In a follow-up patch, we will want to introduce
another one (this time externally).
So introduce a new helper vmap_size() that will return the number of
pages in the area starting at the given address. Take the opportunity
to replace the open-coded version.
Note that vfree() was storing the type of the area in a local variable.
But this seems to have never been used (even when it was introduced).
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Elias El Yandouzi <eliasely@amazon.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Tue, 6 Feb 2024 10:53:04 +0000 (11:53 +0100)]
setup: Move vm_init() before acpi calls
After the direct map removal, pages from the boot allocator are not
going to be mapped in the direct map. Although we have map_domain_page,
they are ephemeral and are less helpful for mappings that are more than a
page, so we want a mechanism to globally map a range of pages, which is
what vmap is for. Therefore, we bring vm_init into early boot stage.
To allow vmap to be initialised and used in early boot, we need to
modify vmap to receive pages from the boot allocator during early boot
stage.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David Woodhouse <dwmw2@amazon.com> Signed-off-by: Hongyan Xia <hongyxia@amazon.com> Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Elias El Yandouzi <eliasely@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Juergen Gross [Mon, 5 Feb 2024 10:49:57 +0000 (11:49 +0100)]
tools/xenstored: map stubdom interface
When running as stubdom, map the stubdom's Xenstore ring page in order
to support using the 9pfs frontend.
Use the same pattern as in dom0_init() when running as daemon in dom0
(introduce the own domain, then send an event to the client side to
signal Xenstore is ready to communicate).
Juergen Gross [Mon, 5 Feb 2024 10:49:56 +0000 (11:49 +0100)]
tools/xenstored: split domain_init()
Today domain_init() is called either just before calling dom0_init()
in case no live update is being performed, or it is called after
reading the global state from read_state_global(), as the event
channel fd is needed.
Split up domain_init() into a preparation part which can be called
unconditionally, and in a part setting up the event channel handle.
Note that there is no chance that chk_domain_generation() can be
called now before xc_handle has been setup, so there is no need for
the related special case anymore.
Juergen Gross [Mon, 5 Feb 2024 10:49:55 +0000 (11:49 +0100)]
tools/xenstored: rework ring page (un)map functions
When [un]mapping the ring page of a Xenstore client, different actions
are required for "normal" guests and dom0. Today this distinction is
made at call site.
Move this distinction into [un]map_interface() instead, avoiding code
duplication and preparing special handling for [un]mapping the stub
domain's ring page.
Juergen Gross [Mon, 5 Feb 2024 10:49:52 +0000 (11:49 +0100)]
tools/xenstored: move all log-pipe handling into posix.c
All of the log-pipe handling is needed only when running as daemon.
Move it into posix.c. This requires to have a service function in the
main event loop for handling the related requests and one for setting
the fds[] array, which is renamed to poll_fds to have a more specific
name. Use a generic name for those functions, as socket handling can
be added to them later, too.
Juergen Gross [Mon, 5 Feb 2024 10:49:50 +0000 (11:49 +0100)]
tools/xenstored: add early_init() function
Some xenstored initialization needs to be done in the daemon case only,
so split it out into a new early_init() function being a stub in the
stubdom case.
Remove the call of talloc_enable_leak_report_full(), as it serves no
real purpose: the daemon only ever exits due to a crash, in which case
a log of talloc()ed memory hardly has any value.
Juergen Gross [Mon, 5 Feb 2024 10:49:47 +0000 (11:49 +0100)]
tools/xenstored: rename xenbus_evtchn()
Rename the xenbus_evtchn() function to get_xenbus_evtchn() in order to
avoid two externally visible symbols with the same name when Xenstore-
stubdom is being built with a Mini-OS with CONFIG_XENBUS set.
Cyril Rébert [Sun, 4 Feb 2024 10:19:40 +0000 (11:19 +0100)]
tools/xentop: fix sorting bug for some columns
Sort doesn't work on columns VBD_OO, VBD_RD, VBD_WR and VBD_RSECT.
Fix by adjusting variables names in compare functions.
Bug fix only. No functional change.
Fixes: 91c3e3dc91d6 ("tools/xentop: Display '-' when stats are not available.") Signed-off-by: Cyril Rébert (zithro) <slack@rabbit.lu> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
We have two copies of __bitmap_weight() that differ by whether they make
hweight32() or hweight64() calls, yet we already have hweight_long() which is
the form that __bitmap_weight() wants.
Fix hweight_long() to return unsigned int like all the other hweight helpers,
and fix __bitmap_weight() to used unsigned integers.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 2 Feb 2024 17:57:37 +0000 (17:57 +0000)]
x86/ucode: Remove accidentally introduced tabs
Fixes: cf7fe8b72dea ("x86/ucode: Fix stability of the raw CPU Policy rescan") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 5 Feb 2024 09:48:11 +0000 (10:48 +0100)]
x86/CPU: convert vendor hook invocations to altcall
While not performance critical, these hook invocations still want
converting: This way all pre-filled struct cpu_dev instances can become
__initconst_cf_clobber, thus allowing to eliminate further 8 ENDBR
during the 2nd phase of alternatives patching (besides moving previously
resident data to .init.*).
Since all use sites need touching anyway, take the opportunity and also
address a Misra C:2012 Rule 5.5 violation: Rename the this_cpu static
variable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Feb 2024 09:45:31 +0000 (10:45 +0100)]
x86/guest: finish conversion to altcall
While .setup() and .e820_fixup() don't need fiddling with for being run
only very early, both .ap_setup() and .resume() want converting too:
This way both pre-filled struct hypervisor_ops instances can become
__initconst_cf_clobber, thus allowing to eliminate up to 5 more ENDBR
(configuration dependent) during the 2nd phase of alternatives patching.
While fiddling with section annotations here, also move "ops" itself to
.data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 5 Feb 2024 09:44:46 +0000 (10:44 +0100)]
x86: arrange for ENDBR zapping from <vendor>_ctxt_switch_masking()
While altcall is already used for them, the functions want announcing in
.init.rodata.cf_clobber, even if the resulting static variables aren't
otherwise used.
While doing this also move ctxt_switch_masking to .data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 26 Jan 2024 19:57:01 +0000 (19:57 +0000)]
x86: Remove gdbstub
In 13y of working on Xen, I've never seen seen it used. The implementation
was introduced (commit b69f92f3012e, Jul 28 2004) with known issues such as:
/* Resuming after we've stopped used to work, but more through luck
than any actual intention. It doesn't at the moment. */
which appear to have gone unfixed for the 20 years since.
Nowadays there are more robust ways of inspecting crashed state, such as a
kexec crash kernel, or running Xen in a VM.
This will allow us to clean up some hooks around the codebase which are
proving awkward for other tasks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:14:00 +0000 (10:14 +0100)]
x86/spec-ctrl: Expose BHI_CTRL to guests
The CPUID feature bit signals the presence of the BHI_DIS_S control in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:13:59 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose RRSBA_CTRL to guests
The CPUID feature bit signals the presence of the RRSBA_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 30 Jan 2024 09:13:58 +0000 (10:13 +0100)]
x86/spec-ctrl: Expose IPRED_CTRL to guests
The CPUID feature bit signals the presence of the IPRED_DIS_{U,S} controls in
SPEC_CTRL MSR, first available in Intel AlderLake and Sapphire Rapids CPUs.
Xen already knows how to context switch MSR_SPEC_CTRL properly between guest
and hypervisor context.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Wed, 31 Jan 2024 10:42:48 +0000 (10:42 +0000)]
tools/ocaml: Bump minimum version to OCaml 4.05
Char.lowercase got removed in OCaml 5.0 (it has been deprecated since 2014),
and doesn't build any more.
Char.lowercase_ascii has existed since OCaml 4.03, so that is the new
minimum version for oxenstored.
However, OCaml 4.05 is the oldest new-enough version found in common distros,
so pick this as a baseline.
Signed-off-by: Edwin Török <edwin.torok@cloud.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
[Update CHANGELOG.md] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 31 Jan 2024 17:05:47 +0000 (17:05 +0000)]
xen/bitmap: Consistently use unsigned bits values
Right now, most of the static inline helpers take an unsigned nbits quantity,
and most of the library functions take a signed quanity. Because
BITMAP_LAST_WORD_MASK() is expressed as a divide, the compiler is forced to
emit two different paths to get the correct semantics for signed division.
Swap all signed bit-counts to being unsigned bit-counts for the simple cases.
This includes the return value of bitmap_weight().
Bloat-o-meter for a random x86 build reports:
add/remove: 0/0 grow/shrink: 8/19 up/down: 167/-413 (-246)
which all comes from compiler not emitting "dead" logic paths for negative bit
counts.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 13:59:07 +0000 (13:59 +0000)]
x86/boot: Add braces in reloc.c
107 lines is an unreasonably large switch statement to live inside a
brace-less for loop. Drop the comment that's clumsily trying to cover the
fact that this logic has wrong-looking indentation.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 20:44:34 +0000 (20:44 +0000)]
xen/sched: Fix UB shift in compat_set_timer_op()
Tamas reported this UBSAN failure from fuzzing:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/sched/compat.c:48:37
(XEN) left shift of negative value -2147425536
(XEN) ----[ Xen-4.19-unstable x86_64 debug=y ubsan=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
(XEN) [<ffff82d040308afb>] F __ubsan_handle_shift_out_of_bounds+0x11a/0x1c5
(XEN) [<ffff82d040307758>] F compat_set_timer_op+0x41/0x43
(XEN) [<ffff82d04040e4cc>] F hvm_do_multicall_call+0x77f/0xa75
(XEN) [<ffff82d040519462>] F arch_do_multicall_call+0xec/0xf1
(XEN) [<ffff82d040261567>] F do_multicall+0x1dc/0xde3
(XEN) [<ffff82d04040d2b3>] F hvm_hypercall+0xa00/0x149a
(XEN) [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
(XEN) [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200
Left-shifting any negative value is strictly undefined behaviour in C, and
the two parameters here come straight from the guest.
The fuzzer happened to choose lo 0xf, hi 0x8000e300.
Switch everything to be unsigned values, making the shift well defined.
As GCC documents:
As an extension to the C language, GCC does not use the latitude given in
C99 and C11 only to treat certain aspects of signed '<<' as undefined.
However, -fsanitize=shift (and -fsanitize=undefined) will diagnose such
cases.
this was deemed not to need an XSA.
Note: The unsigned -> signed conversion for do_set_timer_op()'s s_time_t
parameter is also well defined. C makes it implementation defined, and GCC
defines it as reduction modulo 2^N to be within range of the new type.
Fixes: 2942f45e09fb ("Enable compatibility mode operation for HYPERVISOR_sched_op and HYPERVISOR_set_timer_op.") Reported-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 30 Jan 2024 18:13:14 +0000 (18:13 +0000)]
x86/hvm: Fix UBSAN failure in do_hvm_op() printk
Tamas reported this UBSAN failure from fuzzing:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/vsprintf.c:64:19
(XEN) negation of -9223372036854775808 cannot be represented in type 'long long int':
(XEN) ----[ Xen-4.19-unstable x86_64 debug=y ubsan=y Not tainted ]----
...
(XEN) Xen call trace:
(XEN) [<ffff82d040307c1c>] R ubsan.c#ubsan_epilogue+0xa/0xd9
(XEN) [<ffff82d04030805d>] F __ubsan_handle_negate_overflow+0x99/0xce
(XEN) [<ffff82d04028868f>] F vsprintf.c#number+0x10a/0x93e
(XEN) [<ffff82d04028ac74>] F vsnprintf+0x19e2/0x1c56
(XEN) [<ffff82d04030a47a>] F console.c#vprintk_common+0x76/0x34d
(XEN) [<ffff82d04030a79e>] F printk+0x4d/0x4f
(XEN) [<ffff82d04040c42b>] F do_hvm_op+0x288e/0x28f5
(XEN) [<ffff82d04040d385>] F hvm_hypercall+0xad2/0x149a
(XEN) [<ffff82d0403cd072>] F vmx_vmexit_handler+0x1596/0x279c
(XEN) [<ffff82d0403d909b>] F vmx_asm_vmexit_handler+0xdb/0x200
The problem is an unsigned -> signed converstion because of a bad
formatter (%ld trying to format an unsigned long).
We could fix it by swapping to %lu, but this is a useless printk() even in
debug builds, so just drop it completely.
Reported-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Thu, 1 Feb 2024 17:35:22 +0000 (17:35 +0000)]
xen/arm: Properly clean update to init_ttbr and smp_up_cpu
Recent rework to the secondary boot code modified how init_ttbr and
smp_up_cpu are accessed. Rather than directly accessing them, we
are using a pointer to them.
The helper clean_dcache() is expected to take the variable in parameter
and then clean its content. As we now pass a pointer to the variable,
we will clean the area storing the address rather than the content itself.
Switch to use clean_dcache_va_range() to avoid casting the pointer.
Fixes: a5ed59e62c6f ("arm/mmu: Move init_ttbr to a new section .data.idmap") Fixes: 9a5114074b04 ("arm/smpboot: Move smp_up_cpu to a new section .data.idmap) Reported-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Signed-off-by: Julien Grall <jgrall@amazon.com> Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jan Beulich [Thu, 1 Feb 2024 15:21:04 +0000 (16:21 +0100)]
IOMMU: iommu_use_hap_pt() implies CONFIG_HVM
Allow the compiler a little more room on DCE by moving the compile-time-
constant condition into the predicate (from the one place where it was
added in an open-coded fashion for XSA-450).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jason Andryuk [Thu, 1 Feb 2024 15:19:36 +0000 (16:19 +0100)]
xenpm: Print message for disabled commands
xenpm get-cpufreq-states currently just prints no output when cpufreq is
disabled or HWP is running. Have it print an appropriate message. The
cpufreq disabled one mirrors the cpuidle disabled one.
cpufreq disabled:
$ xenpm get-cpufreq-states
Either Xen cpufreq is disabled or no valid information is registered!
Under HWP:
$ xenpm get-cpufreq-states
P-State information not supported. Try 'get-cpufreq-average' or 'start'.
Also allow xenpm to handle EOPNOTSUPP from the pmstat hypercalls.
EOPNOTSUPP is returned when HWP is active in some cases and allows the
differentiation from cpufreq being disabled.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 1 Feb 2024 15:18:28 +0000 (16:18 +0100)]
x86/PoD: simplify / improve p2m_pod_cache_add()
Avoid recurring MFN -> page or page -> MFN translations. Drop the pretty
pointless local variable "p". Make sure the MFN logged in a debugging
error message is actually the offending one. Return negative errno
values rather than -1 (presently no caller really cares, but imo this
should change). Adjust style.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@cloud.com>
Andrew Cooper [Tue, 30 Jan 2024 13:29:15 +0000 (14:29 +0100)]
VT-d: Fix "else" vs "#endif" misplacement
In domain_pgd_maddr() the "#endif" is misplaced with respect to "else". This
generates incorrect logic when CONFIG_HVM is compiled out, as the "else" body
is executed unconditionally.
Rework the logic to use IS_ENABLED() instead of explicit #ifdef-ary, as it's
clearer to follow. This in turn involves adjusting p2m_get_pagetable() to
compile when CONFIG_HVM is disabled.
This is XSA-450 / CVE-2023-46840.
Fixes: 033ff90aa9c1 ("x86/P2M: p2m_{alloc,free}_ptp() and p2m_alloc_table() are HVM-only") Reported-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 30 Jan 2024 13:28:01 +0000 (14:28 +0100)]
pci: fail device assignment if phantom functions cannot be assigned
The current behavior is that no error is reported if (some) phantom functions
fail to be assigned during device add or assignment, so the operation succeeds
even if some phantom functions are not correctly setup.
This can lead to devices possibly being successfully assigned to a domU while
some of the device phantom functions are still assigned to dom0. Even when the
device is assigned domIO before being assigned to a domU phantom functions
might fail to be assigned to domIO, and also fail to be assigned to the domU,
leaving them assigned to dom0.
Since the device can generate requests using the IDs of those phantom
functions, given the scenario above a device in such state would be in control
of a domU, but still capable of generating transactions that use a context ID
targeting dom0 owned memory.
Modify device assign in order to attempt to deassign the device if phantom
functions failed to be assigned.
Note that device addition is not modified in the same way, as in that case the
device is assigned to a trusted domain, and hence partial assign can lead to
device malfunction but not a security issue.
This is XSA-449 / CVE-2023-46839
Fixes: 4e9950dc1bd2 ('IOMMU: add phantom function support') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Wed, 24 Jan 2024 17:29:52 +0000 (18:29 +0100)]
x86/iommu: switch hwdom IOMMU to use a rangeset
The current loop that iterates from 0 to the maximum RAM address in order to
setup the IOMMU mappings is highly inefficient, and it will get worse as the
amount of RAM increases. It's also not accounting for any reserved regions
past the last RAM address.
Instead of iterating over memory addresses, iterate over the memory map regions
and use a rangeset in order to keep track of which ranges need to be identity
mapped in the hardware domain physical address space.
On an AMD EPYC 7452 with 512GiB of RAM, the time to execute
arch_iommu_hwdom_init() in nanoseconds is:
x old
+ new
N Min Max Median Avg Stddev
x 5 2.2364154e+10 2.338244e+10 2.2474685e+10 2.2622409e+10 4.2949869e+08
+ 5 1025012103303610261881028276.2 3623.1194
Difference at 95.0% confidence
-2.26214e+10 +/- 4.42931e+08
-99.9955% +/- 9.05152e-05%
(Student's t, pooled s = 3.03701e+08)
Execution time of arch_iommu_hwdom_init() goes down from ~22s to ~0.001s.
Note there's a change for HVM domains (ie: PVH dom0) that get switched to
create the p2m mappings using map_mmio_regions() instead of
p2m_add_identity_entry(), so that ranges can be mapped with a single function
call if possible. Note that the interface of map_mmio_regions() doesn't
allow creating read-only mappings, but so far there are no such mappings
created for PVH dom0 in arch_iommu_hwdom_init().
No change intended in the resulting mappings that a hardware domain gets.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 25 Jan 2024 13:26:26 +0000 (14:26 +0100)]
x86/iommu: remove regions not to be mapped
Introduce the code to remove regions not to be mapped from the rangeset
that will be used to setup the IOMMU page tables for the hardware domain.
This change also introduces two new functions: remove_xen_ranges() and
vpci_subtract_mmcfg() that copy the logic in xen_in_range() and
vpci_is_mmcfg_address() respectively and remove the ranges that would otherwise
be intercepted by the original functions.
Note that the rangeset is still not populated.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 29 Jan 2024 08:23:43 +0000 (09:23 +0100)]
x86: purge NMI_IO_APIC
Even going back to 3.2 source code, I can't spot how this watchdog mode
could ever have been enabled in Xen. The only effect its presence had
for all the years was the retaining of a dead string literal.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 29 Jan 2024 08:22:35 +0000 (09:22 +0100)]
x86/APIC: purge {GET,SET}_APIC_DELIVERY_MODE()
The few uses we have can easily be replaced, eliminating the need for
redundant APIC_DM_* and APIC_MODE_* constants. Therefore also purge all
respective APIC_MODE_* constants, introducing APIC_DM_MASK anew instead.
This is further relevant since we have a different set of APIC_MODE_*,
which could otherwise end up confusing.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 29 Jan 2024 08:21:16 +0000 (09:21 +0100)]
NUMA: no need for asm/numa.h when !NUMA
There's no point in every architecture carrying the same stubs for the
case when NUMA isn't enabled (or even supported). Move all of that to
xen/numa.h; replace explicit uses of asm/numa.h in common code. Make
inclusion of asm/numa.h dependent upon NUMA=y.
Drop the no longer applicable "implement NUMA support" comments - in a
!NUMA section this simply makes no sense.
Roger Pau Monné [Fri, 26 Jan 2024 14:54:18 +0000 (15:54 +0100)]
x86/entry: fix jump into restore_all_guest without %rbx correctly set
e047b8d0fa05 went too far when limiting obtaining the vCPU pointer. While the
code in ist_dispatch_done does indeed only need the vCPU pointer when PV32 is
enabled, the !PV32 path will end up jumping into restore_all_guest which does
require rbx == vCPU pointer.
Fix by moving the fetching of the vCPU pointer to be done outside of the PV32
code block.
Fixes: e047b8d0fa05 ('x86/entry: replace two GET_CURRENT() uses') Reported-by: Edwin Torok <edwin.torok@cloud.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Thu, 25 Jan 2024 18:36:27 +0000 (18:36 +0000)]
xen/arm64: head: Use PRINT_ID() for secondary CPU MMU-off boot code
With the upcoming work to color Xen, the binary will not be anymore
physically contiguous. This will be a problem during boot as the
assembly code will need to work out where each piece of Xen reside.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, most of the early printk messages are using PRINT() which
will add the message in .rodata. This is unlikely to be within the
same page as the rest of the idmap.
So replace all the PRINT() that can be reachable by the secondary
CPU with MMU-off with PRINT_ID().
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Julien Grall [Thu, 25 Jan 2024 18:33:50 +0000 (18:33 +0000)]
arm/smpboot: Move smp_up_cpu to a new section .data.idmap
With the upcoming work to color Xen, the binary will not be anymore
physically contiguous. This will be a problem during boot as the
assembly code will need to work out where each piece of Xen reside.
An easy way to solve the issue is to have all code/data accessed
by the secondary CPUs while the MMU is off within a single page.
Right now, smp_up_cpu is used by secondary CPUs to wait their turn for
booting before the MMU is on. Yet it is currently in .data which is
unlikely to be within the same page as the rest of the idmap.
Move smp_up_cpu to the recently created section .data.idmap. The idmap is
currently part of the text section and therefore will be mapped read-only
executable. This means that we need to temporarily remap
smp_up_cpu in order to update it.
Introduce a new function set_smp_up_cpu() for this purpose so the code
is not duplicated between when opening and closing the gate.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>