In preparation for reactivating the presently dead 2M page path of the
function, also deal with the case of replacing an L1 page table all in
one go. Note that the prior comparing of MFNs to bypass the removal of
shadows was insufficient (but kind of benign, for being dead code so
far) - at the very least the R/W bit also needs considering there (to be
on the safe side, compare the full [virtual] PTEs).
While adjusting the first conditional in the loop for the use of the new
local variable "nflags", also drop mfn_valid(): If anything we'd need to
compare against INVALID_MFN, but that won't come out of l1e_get_mfn().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Pull common checks out of the switch(). This includes extending a
_PAGE_PRESENT check to L1 as well, which presumably was deemed redundant
with p2m_is_valid() || p2m_is_grant(), but I think we are better off
being explicit in all cases. Note that for L2 (or higher) the grant
check isn't strictly necessary, as grants are only ever single pages.
Leave a respective assertion.
With _PAGE_PRESENT checked uniformly, the suspicious mfn_valid(omfn)
checks can be dropped rather than moved/folded - if anything we'd need
to compare against INVALID_MFN, but that won't come out of l1e_get_mfn().
For L1 replace the moved out condition with a PTE comparison: There's
no need for any update or flushing when the two match.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Replace a p2m_is_ram() check in the 2M case by an explicit _PAGE_PRESENT
one, to make more obvious that the subsequent l1e_get_mfn() actually
retrieves something that really is an MFN. It doesn't really matter
whether it's RAM, as the subsequent comparison with the original MFN is
going to lead to zapping of everything except the "same MFN again" case.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Anthony PERARD [Thu, 18 Aug 2022 07:25:50 +0000 (09:25 +0200)]
tools/libxl: Replace deprecated -soundhw on QEMU command line
-soundhw is deprecated since 825ff02911c9 ("audio: add soundhw
deprecation notice"), QEMU v5.1, and has been removed for the upcoming v7.1
by 039a68373c45 ("introduce -audio as a replacement for -soundhw").
Instead we can just add the sound card with "-device", for most options
that "-soundhw" could handle. "-device" is an option that existed
before QEMU 1.0, and could already be used to add audio hardware.
The list of possible options for libxl's "soundhw" is taken from the
list in QEMU 7.0.
The options for "soundhw" are listed in order of preference in
the manual. The first three (hda, ac97, es1370) are PCI devices and
easy to test on Linux, and the last four are ISA devices which don't
seem to work out of the box on Linux.
The sound card 'pcspk' isn't listed even if it used to be accepted by
'-soundhw' because QEMU crashes when trying to add it to a Xen domain.
Also, it wouldn't work with "-device"; it might need to be "-machine
pcspk-audiodev=default" instead.
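As a rough illustration of the replacement (not libxl's actual code), the
three PCI cards named above could be mapped to "-device" arguments as in the
sketch below. The struct, table and function names are hypothetical. The QEMU
device model names (intel-hda plus hda-duplex, AC97, ES1370) are an assumption
here and should be checked against "-device help" of the QEMU in use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup table; not taken from libxl. */
struct soundhw_map {
    const char *soundhw;   /* name formerly passed to "-soundhw" */
    const char *device[2]; /* "-device" arguments; unused slots are NULL */
};

static const struct soundhw_map soundhw_table[] = {
    { "hda",    { "intel-hda", "hda-duplex" } }, /* controller plus codec */
    { "ac97",   { "AC97", NULL } },
    { "es1370", { "ES1370", NULL } },
};

/* Return the entry for a soundhw name, or NULL if it is not in the table. */
const struct soundhw_map *soundhw_lookup(const char *name)
{
    size_t i;

    assert(name != NULL);

    for (i = 0; i < sizeof(soundhw_table) / sizeof(soundhw_table[0]); i++)
        if (strcmp(soundhw_table[i].soundhw, name) == 0)
            return &soundhw_table[i];

    return NULL; /* e.g. "pcspk", which is deliberately not listed */
}
```

A caller would emit one "-device <name>" pair per non-NULL entry and reject
the configuration when the lookup returns NULL.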
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
Anthony PERARD [Wed, 17 Aug 2022 15:21:06 +0000 (16:21 +0100)]
build: Fix missing MAKEFLAGS --no-print-directory
While we already have "--no-print-directory" added to the make flags
in some cases, there's one case where the flag is missing: when doing
an out-of-tree build with O=, e.g.
cd xen; make O=build
Without it, we just have loads of "Entering directory" and "Leaving
directory" with the same directory.
The comment and location in the Makefile are copied from Linux.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 12 Aug 2022 17:25:55 +0000 (18:25 +0100)]
x86/traps: Make nmi_show_execution_state() more useful
* Always emit current. It's critically important.
* Do not render (0000000000000000) for the symbol in guest context. It's
just line-noise. Instead, explicitly identify Xen vs guest context.
* Try to tabulate the data, because there is often lots of it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Penny Zheng [Tue, 16 Aug 2022 09:23:56 +0000 (11:23 +0200)]
xen/arm: rename PGC_reserved to PGC_static
PGC_reserved could be ambiguous, and we have to tell what the pages are
reserved for, so this commit intends to rename PGC_reserved to
PGC_static, which clearly indicates the page is reserved for static
memory.
drivers/char: add support for selecting specific xhci
Handle parameters similar to dbgp=ehci.
Implement this by not resetting dbc->sbdf again in dbc_init_xhc(), but
using a value found there if non-zero. Additionally, add xue->xhc_num to
select the n-th controller.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
drivers/char: reset XHCI ports when initializing dbc
Reset ports, to force the host system to re-enumerate devices. Otherwise it
will require the cable to be re-plugged, or will wait in the
"configuring" state indefinitely.
Trick and code copied from Linux:
drivers/usb/early/xhci-dbc.c:xdbc_start()->xdbc_reset_debug_port()
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Connor Davis [Tue, 16 Aug 2022 09:20:01 +0000 (11:20 +0200)]
drivers/char: add support for USB3 DbC debugger
[Connor]
Xue is a cross-platform USB 3 debugger that drives the Debug
Capability (DbC) of xHCI-compliant host controllers. This patch
implements the operations needed for xue to initialize the host
controller's DbC and communicate with it. It also implements a struct
uart_driver that uses xue as a backend. Note that only target -> host
communication is supported for now. To use Xue as a console, add
'console=dbgp dbgp=xhci' to the command line.
[Marek]
The Xue driver is taken from https://github.com/connojd/xue and heavily
refactored to fit into Xen code base. Major changes include:
- rename to xhci_dbc
- drop support for non-Xen systems
- drop xue_ops abstraction
- use Xen's native helper functions for PCI access
- move all the code to xue.c, drop "inline"
- build for x86 only
- annotate functions with cf_check
- adjust for Xen's code style
At this stage, only the first xHCI is considered, and only output is
supported. Later patches add support for choosing a specific device, and
input handling.
The driver is initialized before the memory allocator works, so all the
transfer buffers (about 230KiB of them) are allocated statically and will
use memory even if the XUE console is not selected. The driver can be
disabled at build time to reclaim this memory.
Most of this memory is shared with the controller via DMA. A later patch
will adjust the placement of the structures to keep anything else from
being placed on those DMA-reachable pages. This also means str_buf cannot
use a static initializer, without reserving (at least) a whole page in
.data (or more, when combined with other structures).
Signed-off-by: Connor Davis <davisc@ainfosec.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Tue, 16 Aug 2022 09:18:39 +0000 (11:18 +0200)]
tools/flask/utils: list build targets in $(TARGETS)
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Henry Wang <Henry.Wang@arm.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Roger Pau Monné [Mon, 15 Aug 2022 07:58:55 +0000 (09:58 +0200)]
amd/msr: implement VIRT_SPEC_CTRL for HVM guests using legacy SSBD
Expose VIRT_SSBD to guests if the hardware supports setting SSBD in
the LS_CFG MSR (a.k.a. non-architectural way). Different AMD CPU
families use different bits in LS_CFG, so exposing VIRT_SPEC_CTRL.SSBD
allows for a unified way of exposing SSBD support to guests on AMD
hardware that's compatible migration-wise, regardless of what
underlying mechanism is used to set SSBD.
Note that on AMD Family 17h and Hygon Family 18h processors the value
of SSBD in LS_CFG is shared between threads on the same core, so
there's extra logic in order to synchronize the value and have SSBD
set as long as one of the threads in the core requires it to be set.
Such logic also requires extra storage for each thread state, which is
allocated at initialization time.
Do the context switching of the SSBD selection in LS_CFG between
hypervisor and guest in the same handler that's already used to switch
the value of VIRT_SPEC_CTRL.
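The core-shared behaviour described above can be sketched as a per-core
counter: SSBD stays set while at least one sibling thread wants it. This is
a simplified, single-threaded sketch with hypothetical names (struct
core_ssbd, core_set_ssbd()) and an assumed two threads per core. The locking
and the actual LS_CFG MSR write that real code needs are omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define THREADS_PER_CORE 2 /* assumption for this sketch */

struct core_ssbd {
    bool want[THREADS_PER_CORE]; /* per-thread requested state */
    unsigned int count;          /* number of threads requesting SSBD */
    bool hw_ssbd;                /* stands in for the SSBD bit in LS_CFG */
};

/* Record one thread's request and update the (simulated) shared bit. */
void core_set_ssbd(struct core_ssbd *core, unsigned int thread, bool enable)
{
    assert(thread < THREADS_PER_CORE);

    if (core->want[thread] == enable)
        return; /* nothing changes for this thread */

    core->want[thread] = enable;
    if (enable)
        core->count++;
    else
        core->count--;

    /* SSBD must remain set as long as any thread on the core needs it. */
    core->hw_ssbd = core->count > 0;
}
```

The per-thread "want" array is the extra storage mentioned above: without
it, a repeated request from the same thread would be counted twice.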
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Mon, 15 Aug 2022 07:58:08 +0000 (09:58 +0200)]
amd/msr: allow passthrough of VIRT_SPEC_CTRL for HVM guests
Allow HVM guests access to MSR_VIRT_SPEC_CTRL if the platform Xen is
running on has support for it. This requires adding logic in the
vm{entry,exit} paths for SVM in order to context switch between the
hypervisor value and the guest one. The added handlers for context
switch will also be used for the legacy SSBD support.
Introduce a new synthetic feature leaf (X86_FEATURE_VIRT_SC_MSR_HVM)
to signal whether VIRT_SPEC_CTRL needs to be handled on guest
vm{entry,exit}.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Mon, 15 Aug 2022 07:57:23 +0000 (09:57 +0200)]
amd/msr: implement VIRT_SPEC_CTRL for HVM guests on top of SPEC_CTRL
Use the logic to set shadow SPEC_CTRL values in order to implement
support for VIRT_SPEC_CTRL (signaled by VIRT_SSBD CPUID flag) for HVM
guests. This includes using the spec_ctrl vCPU MSR variable to store
the guest set value of VIRT_SPEC_CTRL.SSBD, which will be OR'ed with
any SPEC_CTRL values being set by the guest.
On hardware having SPEC_CTRL VIRT_SPEC_CTRL will not be offered by
default to guests. VIRT_SPEC_CTRL will only be part of the max CPUID
policy so it can be enabled for compatibility purposes.
Use '!' to annotate the feature in order to express that the presence
of the bit is not directly tied to its value in the host policy.
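A minimal sketch of the OR'ing described above, assuming SSBD is bit 2 of
SPEC_CTRL. The helper name and its boolean argument are hypothetical and do
not reflect Xen's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_SSBD (UINT64_C(1) << 2) /* assumption: SSBD is bit 2 */

/* Value to run the guest with: guest SPEC_CTRL plus any VIRT_SPEC_CTRL.SSBD. */
uint64_t effective_spec_ctrl(uint64_t guest_spec_ctrl, bool virt_ssbd)
{
    return guest_spec_ctrl | (virt_ssbd ? SPEC_CTRL_SSBD : 0);
}

/* Usage example: SSBD requested through either interface ends up set. */
void effective_spec_ctrl_example(void)
{
    assert(effective_spec_ctrl(0x1, true) == 0x5);
    assert(effective_spec_ctrl(0x1, false) == 0x1);
    assert(effective_spec_ctrl(0x4, false) == 0x4);
}
```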
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:25 +0000 (08:55 +0200)]
tools/xentop: rework makefile
Add "xentop" to "TARGETS" because this variable will be useful later.
Always define all the targets, even when configured with
--disable-monitor; instead, don't visit the subdirectory.
This means xentop/ isn't visited anymore during "make clean"; that's how
most other subdirs in tools/ work.
Also add missing "xentop" rules. It only works without them because we
still have make's built-in rules and variables, but fix this to not
have to rely on them.
Use $(TARGETS) with $(INSTALL_PROG), and thus install into the
directory rather than spelling the program name.
In the "clean" rule, use $(RM) and remove all "*.o" instead of just
one object.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:21 +0000 (08:55 +0200)]
tools/libfsimage: Cleanup makefiles
Remove the need for "fs-*" targets by creating a "common.mk" which
has flags that are common to libfsimage/common/ and the other
libfsimage/*/ directories.
In common.mk, make $(PIC_OBJS) a recursively expanded variable so it
doesn't matter where $(LIB_SRCS-y) is defined, and remove the extra
$(PIC_OBJS) from libfsimage/common/Makefile.
Use a $(TARGETS) variable to list things to be built. And $(TARGETS)
can be used in the clean target in common.mk.
iso9660/:
Remove the explicit dependency between fsys_iso9660.c and
iso9660.h; this is handled automatically by the .*.d dependency files,
and iso9660.h already exists.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Rework dependencies of all objects. We don't need to add dependencies
for headers that $(CC) is capable of generating, we only need to
include $(DEPS_INCLUDE). Some dependencies are still needed so make
knows to generate symlinks for them.
We remove the use of "vpath" for cpuid.c. While it works fine for now,
when we will convert this makefile to subdirmk, vpath will not be
usable. Also, "-iquote" is now needed to build "cpuid.o".
Replace "-I." by "-iquote .", so it applies to double-quote includes
only.
Rather than checking if a symlink exists, always regenerate the
symlink. So if the source tree changed location, the symlink is
updated.
Since we are creating a new .gitignore for the symlink, also move the
entry to it.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:14 +0000 (08:55 +0200)]
tools/firmware/hvmloader: rework Makefile
Setup proper dependencies with libacpi so we don't need to run "make
hvmloader" in the "all" target. ("build.o" new prerequisite isn't
exactly proper but a side effect of building the $(DSDT_FILES) is to
generate the "ssdt_*.h" needed by "build.o".)
Make use of "-iquote" instead of a plain "-I".
For "roms.inc" target, use "$(SHELL)" instead of plain "sh". And use
full path to "mkhex" instead of a relative one. Lastly, add "-f" flag
to "mv" to avoid a prompt in case the target already exists and we
don't have write permission.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 15 Aug 2022 06:53:11 +0000 (08:53 +0200)]
x86/mm: re-arrange type check around _get_page_type()'s TLB flush
Checks dependent on only d and x can be pulled out, thus allowing the
flush mask calculation to be skipped.
(Also-)Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 14 Apr 2022 09:33:05 +0000 (10:33 +0100)]
x86/build: Don't convert boot/{cmdline,head}.bin back to .S
There's no point wasting time converting binaries back to asm source. Just
use .incbin directly. Explain in head.S what these binaries are.
Also, explicitly align the blobs. They contain 4-byte objects, and happen to
be 4-byte aligned currently because of the position of `lret` and the size of
cmdline.S but this is incredibly fragile.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
It's not clear why for x86-64 a different approach was used than the
(shorter) one x86-32 has been using. Move the setting to the respective
OS files and reuse x86-32's approach for x86-64, while at the same time
using an OS-independent variable name (thus avoiding the indirection
through $(XEN_OS)).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 12 Aug 2022 06:37:50 +0000 (08:37 +0200)]
PCI: bring pci_get_real_pdev() in line with pci_get_pdev()
Fold the three parameters into a single pci_sbdf_t one.
No functional change intended, despite the "(8 - stride)" ->
"stride" replacement (not really sure why it was written the more
complicated way originally).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com>
Jan Beulich [Fri, 12 Aug 2022 06:37:09 +0000 (08:37 +0200)]
PCI: fold pci_get_pdev{,_by_domain}()
Rename the latter, subsuming the functionality of the former when passed
NULL as first argument.
Since this requires touching all call sites anyway, take the opportunity
and fold the remaining three parameters into a single pci_sbdf_t one.
No functional change intended. In particular the locking related
assertion needs to continue to be kept silent when a non-NULL domain
pointer is passed - both vpci_read() and vpci_write() call the function
without holding the lock (adding respective locking to vPCI [or finding
an alternative to doing so] is the topic of a separate series).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com>
The last "wildcard" use of either function went away with f591755823a7
("IOMMU/PCI: don't let domain cleanup continue when device de-assignment
failed"). Don't allow them to be called this way anymore. Besides
simplifying the code this also fixes two bugs:
1) When seg != -1, the outer loops should have been terminated after the
first iteration, or else a device with the same BDF but on another
segment could be found / returned.
Reported-by: Rahul Singh <rahul.singh@arm.com>
2) When seg == -1 calling get_pseg() is bogus. The function (taking a
u16) would look for segment 0xffff, which might exist. If it exists,
we might then find / return a wrong device.
In pci_get_pdev_by_domain() also switch from using the per-segment list
to using the per-domain one, with the exception of the hardware domain
(see the code comment there).
While there also constify "pseg" and drop "pdev"'s already previously
unnecessary initializer.
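Bug 2 above comes down to integer conversion: passing -1 to a function
taking a 16-bit unsigned parameter silently turns the wildcard into segment
0xffff. A minimal standalone sketch follows. Both function names are
hypothetical and only mimic the parameter type of get_pseg().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in: reports the segment number the callee really sees. */
unsigned int seg_seen_by_callee(uint16_t seg)
{
    return seg;
}

/* Callers used an int, with -1 meaning "any segment". */
unsigned int lookup_with_wildcard(int seg)
{
    /* Implicit int -> uint16_t conversion: -1 becomes 0xffff. */
    return seg_seen_by_callee(seg);
}

/* Usage example: the wildcard is indistinguishable from segment 0xffff. */
void wildcard_example(void)
{
    assert(lookup_with_wildcard(-1) == 0xffff);
    assert(lookup_with_wildcard(0xffff) == 0xffff);
    assert(lookup_with_wildcard(3) == 3);
}
```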
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com>
Jan Beulich [Thu, 11 Aug 2022 15:45:12 +0000 (17:45 +0200)]
build/x86: suppress GNU ld 2.39 warning about RWX load segments
Commit 68f5aac012b9 ("build: suppress future GNU ld warning about RWX
load segments") didn't quite cover all the cases: Apparently I missed
ones in the building of 32-bit helper objects because of only looking at
incremental builds (where those wouldn't normally be re-built). Clone
the workaround there to the specific Makefile in question.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ross Lagerwall [Thu, 11 Aug 2022 15:44:26 +0000 (17:44 +0200)]
x86/amd: only call setup_force_cpu_cap for boot CPU
This should only be called for the boot CPU to avoid calling _init code
after it has been unloaded.
Fixes: 062868a5a8b4 ("x86/amd: Work around CLFLUSH ordering on older parts") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Xenia Ragiadakou [Thu, 11 Aug 2022 09:47:34 +0000 (11:47 +0200)]
arm/vgic: fix coding style in macro REG_RANK_INDEX()
Add parentheses around the macro parameter 's' to protect against unintended
expansions. This also resolves a MISRA C 2012 Rule 20.7 violation warning.
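To see why the parentheses matter, consider a pair of hypothetical macros
(not the actual REG_RANK_INDEX() definition) invoked with an expression
argument:

```c
#include <assert.h>

/* Hypothetical macros, for illustration only. */
#define HALF_UNSAFE(s) (s / 2)   /* parameter not parenthesised */
#define HALF_SAFE(s)   ((s) / 2) /* parameter parenthesised, as Rule 20.7 asks */

int half_unsafe(int a, int b)
{
    return HALF_UNSAFE(a + b); /* expands to (a + b / 2) */
}

int half_safe(int a, int b)
{
    return HALF_SAFE(a + b);   /* expands to ((a + b) / 2) */
}

/* Usage example: same argument, different results. */
void macro_example(void)
{
    assert(half_unsafe(4, 4) == 6); /* 4 + (4 / 2) */
    assert(half_safe(4, 4) == 4);   /* (4 + 4) / 2 */
}
```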
Anthony PERARD [Thu, 11 Aug 2022 09:47:11 +0000 (11:47 +0200)]
tools/libxl: Replace deprecated -sdl option on QEMU command line
"-sdl" is deprecated upstream since 6695e4c0fd9e ("softmmu/vl:
Deprecate the -sdl and -curses option"), QEMU v6.2, and the option is
removed by 707d93d4abc6 ("ui: Remove deprecated options "-sdl" and
"-curses""), in upcoming QEMU v7.1.
Instead, use "-display sdl", available since 1472a95bab1e ("Introduce
-display argument"), before QEMU v1.0.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
Dario Faggioli [Thu, 11 Aug 2022 09:46:22 +0000 (11:46 +0200)]
xen/sched: setup dom0 vCPUs affinity only once
Right now, affinity for dom0 vCPUs is set up in two steps. This is a
problem as, at least in Credit2, unit_insert() sees and uses the
"intermediate" affinity, and places the vCPUs on CPUs where they cannot
be run. And this in turn results in boot hangs, if the "dom0_nodes"
parameter is used.
Fix this by setting up the affinity properly once and for all, in
sched_init_vcpu() called by create_vcpu().
Note that, unless a soft-affinity is explicitly specified for dom0 (by
using the relaxed mode of "dom0_nodes") we set it to the default, which
is all CPUs, instead of computing it based on hard affinity (if any).
This is because hard and soft affinity should be considered as
independent user controlled properties. In fact, if we do derive dom0's
soft-affinity from its boot-time hard-affinity, such computed value will
continue to be used even if later the user changes the hard-affinity.
And this could result in the vCPUs behaving differently than what the
user wanted and expects.
Fixes: dafd936dddbd ("Make credit2 the default scheduler") Reported-by: Olaf Hering <ohering@suse.de> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/arm: vreg: Fix MISRA C 2012 Rule 20.7 violation
In VREG_REG_HELPERS(), the macro parameter 'offmask' is used as an expression
and should therefore be enclosed in parentheses to protect against
unintended expansions.
xen/arm: regs: Fix MISRA C 2012 Rule 20.7 violation
In macro psr_mode(), the macro parameter 'm' is used as an expression and
should therefore be enclosed in parentheses to protect against
unintended expansions.
Jason Andryuk [Tue, 19 Jul 2022 20:08:15 +0000 (16:08 -0400)]
x86: Expose more MSR_ARCH_CAPS to hwdom
commit e46474278a0e ("x86/intel: Expose MSR_ARCH_CAPS to dom0") started
exposing MSR_ARCH_CAPS to dom0. More bits in MSR_ARCH_CAPS have since
been defined, but they haven't been exposed. Update the list to allow
them through.
As one example, this allows a Linux Dom0 to know that it has the
appropriate microcode via FB_CLEAR. Notably, and with the updated
microcode, this changes dom0's
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data from:
"Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown"
to:
"Mitigation: Clear CPU buffers; SMT Host state unknown"
This exposes the MMIO Stale Data and Intel Branch History Injection
(BHI) controls as well as the page size change MCE issue bit.
Fixes: commit 2ebe8fe9b7e0 ("x86/spec-ctrl: Enumeration for MMIO Stale Data controls") Fixes: commit cea9ae062295 ("x86/spec-ctrl: Enumeration for new Intel BHI controls") Fixes: commit 59e89cdabc71 ("x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207") Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
In MASK_DECLARE_ macros, the macro parameter 'x' is used as an expression and
should therefore be enclosed in parentheses to protect against
unintended expansions.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
While there add the blanks missing around the + operators involved.
Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jane Malalane [Tue, 9 Aug 2022 09:49:43 +0000 (11:49 +0200)]
x86/kexec: Add the '.L_' prefix to is_* and call_* labels
These are local symbols and shouldn't be externally visible.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
automation: qemu-smoke-arm64: Run ping test over a pv network interface
This patch modifies the test in the following way:
- Dom0 is booted with an alpine linux rootfs with the xen tools.
- Once Dom0 is booted, it starts xenstored, calls init-dom0less to set up
the xenstore interface for the dom0less Dom1, sets up the bridged network
and attaches a pv network interface to Dom1.
- In the meantime, Dom1 in its init script tries to assign an ip to eth0
and ping Dom0.
- If Dom1 manages to ping Dom0, it prints 'passed'.
Use kernel 5.19 to unblock testing dom0less enhanced.
This kernel version has the necessary patches for deferring xenbus probe
until xenstore is fully initialized.
Also, build kernel with bridging and xen netback support enabled because
it will be used for testing network connectivity between Dom0 and Dom1
over a pv network interface.
automation: disable xen,enhanced in qemu-smoke-arm64
Disable xen,enhanced because we don't use PV drivers in this test and
also because the kernel used for testing is old and unpatched and would
break if xen,enhanced is passed.
Edwin Török [Fri, 29 Jul 2022 17:53:25 +0000 (18:53 +0100)]
tools/ocaml/*/Makefile: generate paths.ml from configure
paths.ml contains various paths known to configure, and currently is generated
via a Makefile rule. Simplify this and generate it through configure, similar
to how oxenstored.conf is generated from oxenstored.conf.in.
This will allow reusing the generated file more easily with Dune.
No functional change.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Andrew Cooper [Tue, 2 Aug 2022 13:30:30 +0000 (14:30 +0100)]
x86/spec-ctrl: Use IST RSB protection for !SVM systems
There is a corner case where a VT-x guest which manages to reliably trigger
non-fatal #MC's could evade the rogue RSB speculation protections that were
supposed to be in place.
This is a lack of defence in depth; Xen does not architecturally execute more
RET than CALL instructions, so an attacker would have to locate a different
gadget (e.g. SpectreRSB) first to execute a transient path of excess RET
instructions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/hypfs: check the return value of snprintf to avoid leaking stack accidentally
The function snprintf() returns the number of characters that would have been
written in the buffer if the buffer size had been sufficiently large,
not counting the terminating null character.
Hence, the value returned is not guaranteed to be smaller than the buffer size.
Check the return value of snprintf() to prevent leaking stack contents to the
guest by accident.
Also, for debug builds, add an assertion to ensure that the assumption made on
the size of the destination buffer still holds.
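The pattern can be sketched in standalone C as below. The function name and
its "return 0 on truncation" convention are hypothetical and are not the
hypfs code itself. The point is only that the return value has to be compared
against the buffer size before it is used as a length.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format 'value' into buf. Return the number of bytes (including the
 * terminating NUL) that are valid to copy out, or 0 on error or truncation.
 */
size_t format_checked(char *buf, size_t size, unsigned long value)
{
    int ret = snprintf(buf, size, "%lu", value);

    /* ret is the length that WOULD have been written, so it may be >= size. */
    if (ret < 0 || (size_t)ret >= size)
        return 0;

    return (size_t)ret + 1;
}

/* Usage example: a too-small buffer is detected instead of over-read. */
void format_checked_example(void)
{
    char small[4], big[32];

    assert(format_checked(small, sizeof(small), 123456UL) == 0); /* truncated */
    assert(format_checked(big, sizeof(big), 123456UL) == 7);     /* "123456" + NUL */
}
```

Using the unchecked return value as the copy length would, in the "small"
case, read 7 bytes from a 4-byte buffer.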
xen/compiler: fix MISRA C 2012 Rule 20.7 violation
In __must_be_array(), the macro parameter 'a' is used as an expression and
should therefore be enclosed in parentheses to protect against
unintended expansions.
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 5 Aug 2022 06:36:54 +0000 (08:36 +0200)]
tools/xenstore: add documentation for new set/get-feature commands
Add documentation for two new Xenstore wire commands SET_FEATURE and
GET_FEATURE used to set or query the Xenstore features visible in the
ring page of a given domain.
When calling python tools to convert misra documentation or merge
cppcheck xml files, use $(PYTHON).
While there fix misra document conversion script to be executable.
Fixes: 57caa5375321 ("xen: Add MISRA support to cppcheck make rule") Fixes: 43aa3f6e72d3 ("xen/build: Add cppcheck and cppcheck-html make rules") Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Add git command examples that can be used to generate fixes, and explain
how to use the pretty configuration for git.
This should make it easier for contributors to have the right format.
Jan Beulich [Wed, 3 Aug 2022 10:10:26 +0000 (12:10 +0200)]
evtchn: convert domain event lock to an r/w one
Especially for the use in evtchn_move_pirqs() (called when moving a vCPU
across pCPU-s) and the ones in EOI handling in PCI pass-through code,
serializing perhaps an entire domain isn't helpful when no state (which
isn't e.g. further protected by the per-channel lock) changes.
Unfortunately this implies dropping of lock profiling for this lock,
until r/w locks may get enabled for such functionality.
While ->notify_vcpu_id is now meant to be consistently updated with the
per-channel lock held, an extension applies to ECS_PIRQ: The field is
also guaranteed to not change with the per-domain event lock held for
writing. Therefore the link_pirq_port() call from evtchn_bind_pirq()
could in principle be moved out of the per-channel locked regions, but
this further code churn didn't seem worth it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Hongda Deng [Fri, 29 Jul 2022 08:36:02 +0000 (16:36 +0800)]
arm/vgic-v3: fix virq offset in the rank when storing irouter
When vGIC performs irouter registers emulation, to get the target vCPU
via virq conveniently, Xen doesn't store the irouter value directly,
instead it will use the value (affinities) in irouter to calculate the
target vCPU, and then save the target vCPU in irq rank->vcpu[offset].
When vGIC tries to get the target vCPU, it first calculates the target
vCPU index via
int target = read_atomic(&rank->vcpu[virq & INTERRUPT_RANK_MASK]);
and then it gets the target vCPU via
v->domain->vcpu[target];
When vGIC tries to store irouter for one virq, the target vCPU index
in the rank is computed as
offset &= virq & INTERRUPT_RANK_MASK;
finally it gets the target vCPU via
d->vcpu[read_atomic(&rank->vcpu[offset])];
There is a difference between them when getting the target vCPU index
in the rank. Actually (virq & INTERRUPT_RANK_MASK) would already get
the target vCPU index in the rank; it's wrong to add '&' before '=' when
calculating the offset.
For example, the target vCPU index in the rank should be 6 for virq 38,
but vGIC will get offset=0 when vGIC stores the irouter for this virq,
and finally vGIC will access the wrong target vCPU index in the rank
when updating the irouter.
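The virq 38 example can be reproduced in standalone C. The sketch assumes 32
interrupts per rank (INTERRUPT_RANK_MASK == 0x1f) and that, on the store
path, 'offset' initially holds the register byte offset with 8 bytes per
irouter register. Both values and all function names are assumptions made
for illustration.

```c
#include <assert.h>

#define INTERRUPT_RANK_MASK  0x1fU /* assumption: 32 interrupts per rank */
#define NR_BYTES_PER_IROUTER 8U    /* assumption: 64-bit irouter registers */

/* Index used on the read path. */
unsigned int rank_index_read(unsigned int virq)
{
    return virq & INTERRUPT_RANK_MASK;
}

/* Buggy store path: '&=' also masks with the old byte offset. */
unsigned int rank_index_store_buggy(unsigned int offset)
{
    unsigned int virq = offset / NR_BYTES_PER_IROUTER;

    offset &= virq & INTERRUPT_RANK_MASK;
    return offset;
}

/* Fixed store path: plain assignment. */
unsigned int rank_index_store_fixed(unsigned int offset)
{
    unsigned int virq = offset / NR_BYTES_PER_IROUTER;

    offset = virq & INTERRUPT_RANK_MASK;
    return offset;
}

/* Usage example for virq 38, i.e. byte offset 38 * 8 = 304 (0x130). */
void rank_index_example(void)
{
    assert(rank_index_read(38) == 6);
    assert(rank_index_store_buggy(38 * NR_BYTES_PER_IROUTER) == 0); /* 0x130 & 0x6 */
    assert(rank_index_store_fixed(38 * NR_BYTES_PER_IROUTER) == 6);
}
```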
Fixes: 5d495f4349b5 ("xen/arm: vgic: Optimize the way to store the target vCPU in the rank") Signed-off-by: Hongda Deng <Hongda.Deng@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
xen/efi: efibind: fix MISRA C 2012 Directive 4.10 violation
Prevent header file from being included more than once by adding ifndef guard.
In order to be close to gnu-efi code
- for x86_64, use the same guard
- for arm64, where there is no guard in gnu-efi, for consistency
use a similar format and position to the x86_64 guard
automation: qemu-smoke-arm64.sh: Rename the device tree to avoid confusion
Rename the device tree from virt-gicv3 to virt-gicv2 to avoid confusion
since the version of the generic interrupt controller used for this test
is the v2 and not the v3.
xen/arm: domain: Fix MISRA C 2012 Rule 8.7 violation
The function idle_loop() is referenced only in domain.c.
Change its linkage from external to internal by adding the storage-class
specifier static to its definitions.
Add the function as a 'fake' input operand to the inline assembly statement,
to make the compiler aware that the function is used.
Fake means that the function is not actually used as an operand by the asm code.
That is because there is not a suitable gcc arm32 asm constraint for labels.
Declare return_to_new_vcpu32() and return_to_new_vcpu64() that are also
referenced by this inline asm statement.
Also, this patch resolves indirectly a MISRA C 2012 Rule 8.4 violation warning.
xen/arm: mm: Reduce the area that xen_second covers
At the moment, xen_second is used to cover the first 2GB of the
virtual address space. With the recent rework of the page-tables,
only the first 1GB region (where Xen resides) is effectively used.
In addition to that, I would like to reshuffle the memory layout.
So Xen mappings may no longer be in the first 2GB of the virtual
address space.
Therefore, rework xen_second so it only covers the 1GB region where
Xen will reside.
With this change, xen_second no longer covers the xenheap area
on arm32. So, we first need to add memory to the boot allocator before
setting up the xenheap mappings.
Take the opportunity to update the comments on top of xen_fixmap and
xen_xenmap.
xen/arm: mm: Move domain_{,un}map_* helpers in a separate file
The file xen/arch/arm/mm.c has been growing quite a lot. It now contains
various independent parts of the MM subsystem.
One of them is the helpers to map/unmap a page, which are only used
by arm32 and protected by CONFIG_ARCH_MAP_DOMAIN_PAGE. Move them to a
new file xen/arch/arm/domain_page.c.
xen: Rename CONFIG_DOMAIN_PAGE to CONFIG_ARCH_MAP_DOMAIN_PAGE and...
move it to Kconfig.
The define CONFIG_DOMAIN_PAGE indicates whether the architecture provides
helpers to map/unmap a domain page. Rename it to CONFIG_ARCH_MAP_DOMAIN_PAGE
so it is clearer that support for domain page is not something that
can be disabled in Xen.
Take the opportunity to move CONFIG_ARCH_MAP_DOMAIN_PAGE to Kconfig as it
will soon be necessary to use it in the Makefile.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> #arm part
xen/arm32: mm: Consolidate the domheap mappings initialization
At the moment, the domheap mappings initialization is done separately for
the boot CPU and secondary CPUs. The main difference is for the former
the pages are part of Xen binary whilst for the latter they are
dynamically allocated.
It would be good to have a single helper so it is easier to rework
how the domheap is initialized.
For CPU0, we still need to use pre-allocated pages because the
allocators may use domain_map_page(), so we need to have the domheap
area ready first. But we can still delay the initialization to setup_mm().
Introduce a new helper init_domheap_mappings() that will be called
from setup_mm() for the boot CPU and from init_secondary_pagetables()
for secondary CPUs.
At the moment, *_VIRT_END may either point to the address after the end
or the last address of the region.
The lack of consistency makes it quite difficult to reason about them.
Furthermore, there is a risk of overflow in the case where the address
points past the end. I am not aware of any cases, so this is only a
latent bug.
Start to solve the problem by removing all the *_VIRT_END exclusively used
by the Arm code and add *_VIRT_SIZE when it is not present.
Take the opportunity to rename BOOT_FDT_SLOT_SIZE to BOOT_FDT_VIRT_SIZE
for better consistency and use _AT(vaddr_t, ).
Also take the opportunity to fix the coding style of the comment touched
in mm.c.
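As a minimal sketch of the overflow concern (not the actual Xen code; the
32-bit address type and all DEMO_* / demo_* names are assumptions made up for
illustration): when a region reaches the top of the address space, an
exclusive end wraps to 0, whereas a start plus size description stays
well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 32-bit virtual address type and region, for illustration only. */
typedef uint32_t vaddr32_t;

#define DEMO_VIRT_START  UINT32_C(0xF0000000)
#define DEMO_VIRT_SIZE   UINT32_C(0x10000000) /* region reaches the top of the space */

/* Exclusive end: START + SIZE wraps around to 0 in 32-bit arithmetic. */
static inline vaddr32_t demo_exclusive_end(void)
{
    return (vaddr32_t)(DEMO_VIRT_START + DEMO_VIRT_SIZE);
}

/* Fragile check: compares against the wrapped exclusive end. */
static inline bool in_region_by_end(vaddr32_t va)
{
    return va >= DEMO_VIRT_START && va < demo_exclusive_end();
}

/* Size-based check: the subtraction cannot wrap once va >= START holds. */
static inline bool in_region_by_size(vaddr32_t va)
{
    return va >= DEMO_VIRT_START && (va - DEMO_VIRT_START) < DEMO_VIRT_SIZE;
}
```

With these made-up values the end-based check rejects every address, while
the size-based one accepts exactly the addresses inside the region.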
xsm/dummy: fix MISRA C 2012 Directive 4.10 violation
Protect header file from being included more than once by adding ifndef guard.
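For illustration only, an include guard generally has the following shape.
The guard macro and the structure below are hypothetical and are not taken
from the header touched by this patch.

```c
/* Hypothetical header contents; names are invented for this sketch. */
#ifndef DEMO_XSM_DUMMY_H
#define DEMO_XSM_DUMMY_H

/* Anything between the guard lines is seen at most once per translation unit. */
struct demo_policy {
    int enforcing;
};

#endif /* DEMO_XSM_DUMMY_H */
```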
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
Jan Beulich [Fri, 29 Jul 2022 06:50:25 +0000 (08:50 +0200)]
x86/shadow: drop CONFIG_HVM conditionals from sh_update_cr3()
Now that we're not building multi.c anymore for 2 and 3 guest levels
when !HVM, there's no point in having these conditionals anymore. (As
somewhat a special case, the last of the removed conditionals really
builds on shadow_mode_external() always returning false when !HVM.) This
way the code becomes a tiny bit more readable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 Jul 2022 06:48:26 +0000 (08:48 +0200)]
x86/shadow: properly handle get_page() failing
We should not blindly (in a release build) insert the new entry in the
hash if a reference to the guest page cannot be obtained, or else an
excess reference would be put when removing the hash entry again. Crash
the domain in that case instead. The sole caller doesn't further care
about the state of the guest page: All it does is return the
corresponding shadow page (which was obtained successfully before) to
its caller.
To compensate we further need to adjust hash removal: Since the shadow
page already has had its backlink set, domain cleanup code would try to
destroy the shadow, and hence still cause a put_page() without
corresponding get_page(). Leverage that the failed get_page() leads to
no hash insertion, making shadow_hash_delete() no longer assume it will
find the requested entry. Instead return whether the entry was
found. This way delete_shadow_status() can avoid calling put_page() in
the problem scenario.
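The idea can be sketched roughly as follows. This is a toy table, not the
shadow hash itself: every demo_* name is hypothetical, keys are assumed to be
non-zero, and the counter merely stands in for the guest page's reference
count.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SLOTS 8

/* Toy table: a key of 0 marks an empty slot, so real keys must be non-zero. */
static unsigned long demo_table[DEMO_SLOTS];
static int demo_refs; /* stand-in for the guest page's reference count */

/* Report whether the key was present, rather than asserting that it is. */
static bool demo_hash_delete(unsigned long key)
{
    for ( size_t i = 0; i < DEMO_SLOTS; i++ )
        if ( demo_table[i] == key )
        {
            demo_table[i] = 0;
            return true;
        }

    return false;
}

/* Drop the reference only if an entry existed, i.e. a reference was taken. */
static void demo_delete_status(unsigned long key)
{
    if ( demo_hash_delete(key) )
        demo_refs--;
}
```

A second deletion of the same key then leaves the counter alone, which is
the property the failed-get_page() case relies on.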
For the other caller of shadow_hash_delete() simply reinstate the
otherwise dropped assertion at the call site.
While touching the conditionals in {set,delete}_shadow_status() anyway,
also switch around their two pre-existing parts, to have the cheap one
first (this frequently avoids evaluating the expensive one, which is
costly due to evaluate_nospec(), altogether).
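A small sketch of why the ordering matters, assuming nothing beyond C's
short-circuit evaluation of &&. The counter and function names are invented;
expensive_check() only stands in for a costly predicate and does not model
evaluate_nospec() itself.

```c
#include <stdbool.h>

/* Counts how often the expensive predicate gets evaluated. */
static unsigned int expensive_calls;

/* Stand-in for a costly check; purely illustrative. */
static bool expensive_check(int x)
{
    ++expensive_calls;
    return x > 0;
}

/* Cheap operand first: && skips expensive_check() whenever 'cheap' is false. */
static bool both_hold(bool cheap, int x)
{
    return cheap && expensive_check(x);
}
```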
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
automation: arm64: Create a test job for testing static allocation on qemu
Enable CONFIG_STATIC_MEMORY in the existing arm64 build.
Create a new test job, called qemu-smoke-arm64-gcc-staticmem.
Adjust the qemu-smoke-arm64.sh script to accommodate the static memory test as a
new test variant. The test variant is determined based on the first argument
passed to the script. For testing static memory, the argument is 'static-mem'.
The test configures DOM1 with a static memory region and adds a check in the
init script.
The check consists of comparing the contents of the /proc/device-tree
memory entry with the static memory range with which DOM1 was configured.
If the memory layout is correct, a message gets printed by DOM1.
At the end of the qemu run, the script searches for the specific message
in the logs and fails if not found.
The EXPERT config option can no longer be selected via the environment
variable XEN_CONFIG_EXPERT. Remove stale references to XEN_CONFIG_EXPERT
from the automation code.
libxl/arm: Create specific IOMMU node to be referred by virtio-mmio device
Reuse generic IOMMU device tree bindings to communicate Xen specific
information for the virtio devices for which the restricted memory
access using Xen grant mappings needs to be enabled.
Insert an "iommus" property pointing to the IOMMU node with the
"xen,grant-dma" compatible into all virtio devices whose backends are
going to run in non-hardware domains (which are non-trusted by default).
Based on device-tree binding from Linux:
Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml
This patch introduces helpers to allocate Virtio MMIO params
(IRQ and memory region) and create specific device node in
the Guest device-tree with allocated params. In order to deal
with multiple Virtio devices, reserve corresponding ranges.
For now, we reserve 1MB for memory regions and 10 SPIs.
As these helpers should be used for every Virtio device attached
to the Guest, call them for Virtio disk(s).
Please note, with statically allocated Virtio IRQs there is
a risk of a clash with the physical IRQs of passthrough devices.
For the first version, it's fine, but we should consider allocating
the Virtio IRQs automatically. Thankfully, we know in advance which
IRQs will be used for passthrough to be able to choose non-clashing
ones.
This patch adds basic support for configuring and assisting virtio-mmio
based virtio-disk backend (emulator) which is intended to run out of
Qemu and could be run in any domain.
Although the Virtio block device is quite different from traditional
Xen PV block device (vbd) from the toolstack's point of view:
- as the frontend is virtio-blk which is not a Xenbus driver, nothing
written to Xenstore is fetched by the frontend currently ("vdev"
is not passed to the frontend). But this might need to be revised
in future, so frontend data might be written to Xenstore in order to
support hotplugging virtio devices or passing the backend domain id
on architectures where the device-tree is not available.
- the ring-ref/event-channel are not used for the backend<->frontend
communication, the proposed IPC for Virtio is IOREQ/DM
it is still a "block device" and ought to be integrated in existing
"disk" handling. So, re-use (and adapt) "disk" parsing/configuration
logic to deal with Virtio devices as well.
For the immediate purpose and an ability to extend that support for
other use-cases in future (Qemu, virtio-pci, etc) perform the following
actions:
- Add new disk backend type (LIBXL_DISK_BACKEND_STANDALONE) and reflect
that in the configuration
- Introduce new disk "specification" and "transport" fields to struct
libxl_device_disk. Both are written to the Xenstore. The transport
field is only used for the specification "virtio" and it assumes
only "mmio" value for now.
- Introduce new "specification" option with "xen" communication
protocol being default value.
- Add new device kind (LIBXL__DEVICE_KIND_VIRTIO_DISK) as current
one (LIBXL__DEVICE_KIND_VBD) doesn't fit into Virtio disk model
An example of domain configuration for Virtio disk:
disk = [ 'phy:/dev/mmcblk0p3, xvda1, backendtype=standalone, specification=virtio']
Nothing has changed for default Xen disk configuration.
Please note, this patch is not enough for virtio-disk to work
on Xen (Arm), as for every Virtio device (including disk) we need
to allocate Virtio MMIO params (IRQ and memory region) and pass
them to the backend, also update Guest device-tree. The subsequent
patch will add these missing bits. For the current patch,
the default "irq" and "base" are just written to the Xenstore.
This is not an ideal splitting, but this way we avoid breaking
the bisectability.
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Tested-by: Jiamei Xie <jiamei.xie@arm.com>
Jan Beulich [Wed, 27 Jul 2022 11:00:08 +0000 (13:00 +0200)]
x86/PV: correct post-preemption progress recording in iommu_memory_setup()
Coverity validly points out that the mfn_add() as used was dead code.
Coverity ID: 1507475 Fixes: c1e1564c8995 ("IOMMU/x86: perform PV Dom0 mappings in batches") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 27 Jul 2022 10:58:50 +0000 (12:58 +0200)]
mm: enforce return value checking on get_page()
It's hard to imagine a case where an error may legitimately be ignored
here. It's bad enough that in at least one case (set_shadow_status())
the return value was checked only by way of ASSERT()ing.
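A rough sketch of the mechanism, assuming a GCC or Clang compatible compiler.
The macro name, structure and functions below are local to this example and
are not Xen's definitions.

```c
#include <stdbool.h>

/* Compiler attribute that warns when a caller ignores the result. */
#define demo_must_check __attribute__((__warn_unused_result__))

struct demo_page {
    unsigned int count;
    unsigned int max;
};

/* Hypothetical stand-in for get_page(): fails once the limit is reached. */
static demo_must_check bool demo_get_page(struct demo_page *pg)
{
    if ( pg->count >= pg->max )
        return false;

    pg->count++;
    return true;
}

/* A bare 'demo_get_page(pg);' would draw a warning; act on failure instead. */
static int demo_use_page(struct demo_page *pg)
{
    if ( !demo_get_page(pg) )
        return -1;

    return 0;
}
```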
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Wed, 27 Jul 2022 10:58:16 +0000 (12:58 +0200)]
x86/shadow: drop shadow_prepare_page_type_change()'s 3rd parameter
As of 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in
_get_page_type()") this no longer needs passing separately - the type
can now be read from struct page_info, as the call now happens after its
writing.
While there also constify the 2nd parameter.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Wed, 27 Jul 2022 10:57:10 +0000 (12:57 +0200)]
x86/msr: fix X2APIC_LAST
The latest Intel manual now says the X2APIC reserved range is only
0x800 to 0x8ff (NOT 0xbff).
This changed between SDM 68 (Nov 2018) and SDM 69 (Jan 2019).
The AMD manual documents 0x800-0x8ff too.
There are non-X2APIC MSRs in the 0x900-0xbff range now:
e.g. 0x981 is IA32_TME_CAPABILITY, an architectural MSR.
The new MSR in this range appears to have been introduced in Icelake,
so this commit should be backported to Xen versions supporting Icelake.
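As a sketch of the effect on a range check (the bounds come from the text
above; the DEMO_* macro and function names are illustrative rather than the
identifiers used in the tree):

```c
#include <stdbool.h>
#include <stdint.h>

/* Reserved X2APIC MSR range as described above. */
#define DEMO_X2APIC_FIRST  0x800u
#define DEMO_X2APIC_LAST   0x8ffu /* previously 0xbff */

/* True only for MSR indices inside the reserved X2APIC range. */
static bool demo_is_x2apic_msr(uint32_t msr)
{
    return msr >= DEMO_X2APIC_FIRST && msr <= DEMO_X2APIC_LAST;
}
```

With the corrected upper bound, 0x981 (IA32_TME_CAPABILITY) is no longer
classified as an X2APIC MSR.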
Backport: 4.13+
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 26 Jul 2022 13:11:33 +0000 (14:11 +0100)]
x86/vpmu: Fix build following vmfork addition
GCC with IBT extensions complains:
arch/x86/cpu/vpmu.c:351:15: error: conflicting types for 'vpmu_save_force'; have 'void(void *)' with implied 'nocf_check' attribute
351 | void cf_check vpmu_save_force(void *arg)
| ^~~~~~~~~~~~~~~
In file included from ./arch/x86/include/asm/domain.h:10,
from ./include/xen/domain.h:8,
from ./include/xen/sched.h:11,
from ./include/xen/event.h:12,
from arch/x86/cpu/vpmu.c:23:
./arch/x86/include/asm/vpmu.h:117:6: note: previous declaration of 'vpmu_save_force' with type 'void(void *)'
117 | void vpmu_save_force(void *arg);
| ^~~~~~~~~~~~~~~
Adjust the declaration.
Fixes: 755087eb9b10 ("xen/mem_sharing: support forks with active vPMU state") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 19 Jul 2022 20:37:43 +0000 (21:37 +0100)]
x86/pv: Inject #GP for implicit grant unmaps
This is a debug behaviour to identify buggy kernels. Crashing the domain is
the most unhelpful thing to do, because it discards the relevant context.
Instead, inject #GP[0] like other permission errors in x86. In particular,
this lets the kernel provide a backtrace which is more likely to be helpful to
a developer.
As a bugfix, this always injects #GP[0] to current, not l1e_owner. It is not
l1e_owner's fault if dom0 using superpowers triggers an implicit unmap.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Jul 2022 12:54:34 +0000 (14:54 +0200)]
x86/mm: correct TLB flush condition in _get_page_type()
When this logic was moved, it was moved across the point where nx is
updated to hold the new type for the page. IOW originally it was
equivalent to using x (and perhaps x would better have been used), but
now it isn't anymore. Switch to using x, which then brings things in
line again with the slightly earlier comment there (now) talking about
transitions _from_ writable.
I have to confess though that I cannot make a direct connection between
the reported observed behavior of guests leaving several pages around
with pending general references and the change here. Repeated testing,
nevertheless, confirms the reported issue is no longer there.
This is CVE-2022-33745 / XSA-408.
Reported-by: Charles Arnold <carnold@suse.com> Fixes: 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
In common/memory.c the ifdef code surrounding ptdom_max_order is
using HAS_PASSTHROUGH instead of CONFIG_HAS_PASSTHROUGH. Fix the
problem by using the correct macro.
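To illustrate why such a mistake goes unnoticed, here is a minimal sketch
that assumes only the CONFIG_-prefixed symbol is ever defined. The DEMO_*
macros are invented for the example.

```c
/* Assumption: the build system defines only the CONFIG_-prefixed symbol. */
#define CONFIG_HAS_PASSTHROUGH 1

/* Wrong: HAS_PASSTHROUGH is never defined, so this branch silently drops out. */
#ifdef HAS_PASSTHROUGH
# define DEMO_WRONG_GUARD_TAKEN 1
#else
# define DEMO_WRONG_GUARD_TAKEN 0
#endif

/* Right: test the symbol that is actually defined. */
#ifdef CONFIG_HAS_PASSTHROUGH
# define DEMO_RIGHT_GUARD_TAKEN 1
#else
# define DEMO_RIGHT_GUARD_TAKEN 0
#endif
```

No diagnostic is issued for testing an undefined macro with #ifdef, so the
guarded code simply disappears from the build.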
Fixes: e0d44c1f9461 ("build: convert HAS_PASSTHROUGH use to Kconfig") Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Jul 2022 06:33:10 +0000 (08:33 +0200)]
page-alloc: fix initialization of cross-node regions
Quite obviously to determine the split condition successive pages'
attributes need to be evaluated, not always those of the initial page.
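The bug pattern can be sketched like this. It is a simplified model, not the
allocator code: node IDs are given as a plain array, all names are
hypothetical, and n is assumed to be at least 1.

```c
#include <stddef.h>

/*
 * Return the length of the leading run of pages on the same node as page 0.
 * With 'buggy' set, the loop keeps inspecting page 0 and so never sees a
 * node boundary.
 */
static size_t demo_run_len(const int *node, size_t n, int buggy)
{
    size_t i;

    for ( i = 1; i < n; i++ )
    {
        /* The correct variant looks at the successive page i. */
        int cur = buggy ? node[0] : node[i];

        if ( cur != node[0] )
            break;
    }

    return i;
}
```

For a region whose pages sit on nodes { 0, 0, 1, 1 }, the correct variant
stops after two pages, while the buggy one treats all four as one run.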
Fixes: 72b02bc75b47 ("xen/heap: pass order to free_heap_pages() in heap init") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Mon, 25 Jul 2022 13:46:21 +0000 (15:46 +0200)]
include: correct re-building conditions around hypercall-defs.h
For a .cmd file to be picked up, the respective target needs to be
listed in $(targets). This wasn't the case for hypercall-defs.i, leading
to permanent re-building even on an entirely unchanged tree (because of
the command apparently having changed).
In exchange the target doesn't need naming in $(clean-files) anymore.
Fixes: eca1f00d0227 ("xen: generate hypercall interface related code") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>