xenbits.xensource.com Git - xen.git/log
Roger Pau Monné [Mon, 15 Aug 2022 07:58:55 +0000 (09:58 +0200)]
amd/msr: implement VIRT_SPEC_CTRL for HVM guests using legacy SSBD

Expose VIRT_SSBD to guests if the hardware supports setting SSBD in
the LS_CFG MSR (a.k.a. non-architectural way). Different AMD CPU
families use different bits in LS_CFG, so exposing VIRT_SPEC_CTRL.SSBD
allows for a unified way of exposing SSBD support to guests on AMD
hardware that is compatible migration-wise, regardless of what
underlying mechanism is used to set SSBD.

Note that on AMD Family 17h and Hygon Family 18h processors the value
of SSBD in LS_CFG is shared between threads on the same core, so
there's extra logic in order to synchronize the value and have SSBD
set as long as one of the threads in the core requires it to be set.
Such logic also requires extra storage for each thread state, which is
allocated at initialization time.

Do the context switching of the SSBD selection in LS_CFG between
hypervisor and guest in the same handler that's already used to switch
the value of VIRT_SPEC_CTRL.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Henry Wang <Henry.Wang@arm.com>
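A minimal C sketch of the per-core synchronization described in the commit above: the shared SSBD bit stays set as long as at least one thread on the core wants it. The names `core_ssbd`, `core_set_ssbd` and `THREADS_PER_CORE` are hypothetical and not taken from the patch. Locking and the actual LS_CFG write are left out.

```c
#include <stdbool.h>

#define THREADS_PER_CORE 2  /* assumption: two SMT threads share LS_CFG */

/* Hypothetical per-core state: what each thread last requested. */
struct core_ssbd {
    bool wanted[THREADS_PER_CORE];
    bool hw_ssbd;           /* stands in for the shared SSBD bit in LS_CFG */
};

/* Record one thread's request and recompute the shared bit. */
static bool core_set_ssbd(struct core_ssbd *core, unsigned int thread,
                          bool enable)
{
    bool any = false;

    core->wanted[thread] = enable;
    for (unsigned int i = 0; i < THREADS_PER_CORE; i++)
        any = any || core->wanted[i];

    /* Real code would write the MSR here while holding a per-core lock. */
    core->hw_ssbd = any;
    return core->hw_ssbd;
}
```

With this scheme, clearing the request on one thread leaves the bit set until the sibling thread has cleared its request as well, which is why per-thread storage is needed.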
Roger Pau Monné [Mon, 15 Aug 2022 07:58:08 +0000 (09:58 +0200)]
amd/msr: allow passthrough of VIRT_SPEC_CTRL for HVM guests

Allow HVM guests access to MSR_VIRT_SPEC_CTRL if the platform Xen is
running on has support for it.  This requires adding logic in the
vm{entry,exit} paths for SVM in order to context switch between the
hypervisor value and the guest one.  The added handlers for context
switch will also be used for the legacy SSBD support.

Introduce a new synthetic feature leaf (X86_FEATURE_VIRT_SC_MSR_HVM)
to signal whether VIRT_SPEC_CTRL needs to be handled on guest
vm{entry,exit}.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Mon, 15 Aug 2022 07:57:23 +0000 (09:57 +0200)]
amd/msr: implement VIRT_SPEC_CTRL for HVM guests on top of SPEC_CTRL

Use the logic to set shadow SPEC_CTRL values in order to implement
support for VIRT_SPEC_CTRL (signaled by VIRT_SSBD CPUID flag) for HVM
guests. This includes using the spec_ctrl vCPU MSR variable to store
the guest set value of VIRT_SPEC_CTRL.SSBD, which will be OR'ed with
any SPEC_CTRL values being set by the guest.

On hardware that has SPEC_CTRL, VIRT_SPEC_CTRL will not be offered to
guests by default. VIRT_SPEC_CTRL will only be part of the max CPUID
policy so it can be enabled for compatibility purposes.

Use '!' to annotate the feature in order to express that the presence
of the bit is not directly tied to its value in the host policy.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
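The OR'ing described in the commit above can be sketched in C as follows. `effective_spec_ctrl` is a hypothetical helper, not a function from the patch. The only assumption about hardware is that SSBD is bit 2 of SPEC_CTRL.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_SSBD (UINT64_C(1) << 2)   /* SSBD is bit 2 of SPEC_CTRL */

/*
 * Hypothetical helper: combine the value the guest wrote to SPEC_CTRL with
 * the SSBD selection it made through VIRT_SPEC_CTRL.
 */
static uint64_t effective_spec_ctrl(uint64_t guest_spec_ctrl, bool virt_ssbd)
{
    return guest_spec_ctrl | (virt_ssbd ? SPEC_CTRL_SSBD : 0);
}
```

Because the two sources are OR'ed, SSBD stays enabled when either interface requested it, and clearing it through one interface does not undo a request made through the other.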
Anthony PERARD [Mon, 15 Aug 2022 06:55:36 +0000 (08:55 +0200)]
libs/libs.mk: Rework target headers.chk dependencies

There is no need to call the "headers.chk" target when it isn't
wanted, so it never needs to be .PHONY.

Also, there is no more reason to separate the prerequisites from the
recipe.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:34 +0000 (08:55 +0200)]
libs/libs.mk: Remove the need for $(PKG_CONFIG_INST)

We can simply use $(PKG_CONFIG) to set the parameters, and add it to
$(TARGETS) as necessary.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:32 +0000 (08:55 +0200)]
libs/libs.mk: Rename $(LIB) to $(TARGETS)

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:30 +0000 (08:55 +0200)]
tools/libs/util: cleanup Makefile

Remove -I. from CFLAGS; it isn't necessary.

Remove $(AUTOSRCS); it isn't used.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:27 +0000 (08:55 +0200)]
.gitignore: Cleanup ignores of tools/libs/*/{headers.chk,*.pc}

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:25 +0000 (08:55 +0200)]
tools/xentop: rework makefile

Add "xentop" to "TARGETS" because this variable will be useful later.

Always define all the targets, even when configured with
--disable-monitor; instead, don't visit the subdirectory.
This means xentop/ isn't visited anymore during "make clean", which is
how most other subdirs in tools/ work.

Also add the missing "xentop" rules. It only works without them because
we still have make's built-in rules and variables, but fix this so we
don't have to rely on them.

Use $(TARGETS) with $(INSTALL_PROG), and thus install into the
directory rather than spelling the program name.

In the "clean" rule, use $(RM) and remove all "*.o" instead of just
one object.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:23 +0000 (08:55 +0200)]
tools/xenpaging: Rework makefile

- Rename $(SRCS) to $(OBJS-y), we don't need to collect sources.
- Rename $(IBINS) to $(TARGETS)
- Stop cleaning "xen" and non-set variable $(LIB).

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:21 +0000 (08:55 +0200)]
tools/libfsimage: Cleanup makefiles

Remove the need for "fs-*" targets by creating a "common.mk" which
has flags that are common to libfsimage/common/ and the other
libfsimage/*/ directories.

In common.mk, make $(PIC_OBJS) a recursively expanded variable so it
doesn't matter where $(LIB_SRCS-y) is defined, and remove the extra
$(PIC_OBJS) from libfsimage/common/Makefile.

Use a $(TARGETS) variable to list things to be built. $(TARGETS)
can then be used in the clean target in common.mk.

iso9660/:
    Remove the explicit dependency between fsys_iso9660.c and
    iso9660.h; this is handled automatically by the .*.d dependency
    files, and iso9660.h already exists.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:19 +0000 (08:55 +0200)]
tools/hotplug: cleanup Makefiles

Remove "build" targets.

Use simply expanded variables when recursively expanded variables
aren't needed. (Use ":=" instead of "=".)

Don't check if a directory already exists when installing, just create
it.

Fix $(HOTPLUGPATH), it shouldn't have any double-quote.

Some reindentation.

FreeBSD: "hotplugpath.sh" is already installed by common/.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:16 +0000 (08:55 +0200)]
tools/fuzz/x86_instruction_emulator: rework makefile

Rework dependencies of all objects. We don't need to add dependencies
for headers that $(CC) is capable of generating, we only need to
include $(DEPS_INCLUDE). Some dependencies are still needed so make
knows to generate symlinks for them.

We remove the use of "vpath" for cpuid.c. While it works fine for now,
vpath will not be usable once we convert this makefile to subdirmk.
Also, "-iquote" is now needed to build "cpuid.o".

Replace "-I." by "-iquote .", so it applies to double-quote includes
only.

Rather than checking if a symlink exists, always regenerate the
symlink, so that it is updated if the source tree has changed
location.

Since we are creating a new .gitignore for the symlink, also move the
entry to it.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Mon, 15 Aug 2022 06:55:14 +0000 (08:55 +0200)]
tools/firmware/hvmloader: rework Makefile

Setup proper dependencies with libacpi so we don't need to run "make
hvmloader" in the "all" target. ("build.o" new prerequisite isn't
exactly proper but a side effect of building the $(DSDT_FILES) is to
generate the "ssdt_*.h" needed by "build.o".)

Make use of "-iquote" instead of a plain "-I".

For "roms.inc" target, use "$(SHELL)" instead of plain "sh". And use
full path to "mkhex" instead of a relative one. Lastly, add "-f" flag
to "mv" to avoid a prompt in case the target already exists and we
don't have write permission.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 15 Aug 2022 06:53:11 +0000 (08:53 +0200)]
x86/mm: re-arrange type check around _get_page_type()'s TLB flush

Checks dependent on only d and x can be pulled out, thus allowing the
flush mask calculation to be skipped.

(Also-)Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 14 Apr 2022 10:47:47 +0000 (11:47 +0100)]
x86/build: Clean up boot/Makefile

There are no .S intermediate files, so rework in terms of head-bin-objs.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Andrew Cooper [Thu, 14 Apr 2022 09:33:05 +0000 (10:33 +0100)]
x86/build: Don't convert boot/{cmdline,head}.bin back to .S

There's no point wasting time converting binaries back to asm source.  Just
use .incbin directly.  Explain in head.S what these binaries are.

Also, explicitly align the blobs.  They contain 4-byte objects, and happen to
be 4-byte aligned currently because of the position of `lret` and the size of
cmdline.S but this is incredibly fragile.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 11 Aug 2022 16:12:22 +0000 (17:12 +0100)]
x86/msi: Switch msi_info to using pci_sbdf_t

This reorders the fields in msi_info, but removes all the under-the-hood
parameter shuffling required to call pci_get_pdev().

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 12 Aug 2022 06:54:33 +0000 (08:54 +0200)]
config/x86: tidy {Free,Open}BSD LDFLAGS_DIRECT handling

It's not clear why for x86-64 a different approach was used than the
(shorter) one x86-32 has been using. Move the setting to the respective
OS files and reuse x86-32's approach for x86-64, while at the same time
using an OS-independent variable name (thus avoiding the indirection
through $(XEN_OS)).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 12 Aug 2022 06:37:50 +0000 (08:37 +0200)]
PCI: bring pci_get_real_pdev() in line with pci_get_pdev()

Fold the three parameters into a single pci_sbdf_t one.

No functional change intended, despite the "(8 - stride)" ->
"stride" replacement (not really sure why it was written the more
complicated way originally).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Tested-by: Rahul Singh <rahul.singh@arm.com>
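To illustrate what folding segment, bus and devfn into a single value can look like, here is a small C sketch. The type `sbdf_t` and its helpers are hypothetical and the real pci_sbdf_t layout may differ. The bit positions follow the conventional PCI encoding (devfn = device << 3 | function).

```c
#include <stdint.h>

/* Hypothetical packed type; the real pci_sbdf_t layout may differ. */
typedef struct { uint32_t raw; } sbdf_t;

/* Pack segment, bus, device and function into one 32-bit value. */
static sbdf_t make_sbdf(uint16_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
    sbdf_t s = { ((uint32_t)seg << 16) | ((uint32_t)bus << 8) |
                 ((uint32_t)(dev & 0x1f) << 3) | (fn & 0x7u) };
    return s;
}

static uint16_t sbdf_seg(sbdf_t s)   { return (uint16_t)(s.raw >> 16); }
static uint8_t  sbdf_bus(sbdf_t s)   { return (uint8_t)(s.raw >> 8); }
static uint8_t  sbdf_devfn(sbdf_t s) { return (uint8_t)s.raw; }
```

A function that used to take three separate parameters can then take one `sbdf_t` and extract the fields it needs, which removes the parameter shuffling at call sites.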
Jan Beulich [Fri, 12 Aug 2022 06:37:09 +0000 (08:37 +0200)]
PCI: fold pci_get_pdev{,_by_domain}()

Rename the latter, subsuming the functionality of the former when passed
NULL as first argument.

Since this requires touching all call sites anyway, take the opportunity
and fold the remaining three parameters into a single pci_sbdf_t one.

No functional change intended. In particular the locking related
assertion needs to continue to be kept silent when a non-NULL domain
pointer is passed - both vpci_read() and vpci_write() call the function
without holding the lock (adding respective locking to vPCI [or finding
an alternative to doing so] is the topic of a separate series).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Tested-by: Rahul Singh <rahul.singh@arm.com>
Jan Beulich [Fri, 12 Aug 2022 06:34:33 +0000 (08:34 +0200)]
PCI: simplify (and thus correct) pci_get_pdev{,_by_domain}()

The last "wildcard" use of either function went away with f591755823a7
("IOMMU/PCI: don't let domain cleanup continue when device de-assignment
failed"). Don't allow them to be called this way anymore. Besides
simplifying the code this also fixes two bugs:

1) When seg != -1, the outer loops should have been terminated after the
   first iteration, or else a device with the same BDF but on another
   segment could be found / returned.

Reported-by: Rahul Singh <rahul.singh@arm.com>

2) When seg == -1, calling get_pseg() is bogus. The function (taking a
   u16) would look for segment 0xffff, which might exist. If it exists,
   we might then find / return a wrong device.

In pci_get_pdev_by_domain() also switch from using the per-segment list
to using the per-domain one, with the exception of the hardware domain
(see the code comment there).

While there also constify "pseg" and drop "pdev"'s already previously
unnecessary initializer.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Tested-by: Rahul Singh <rahul.singh@arm.com>
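The second bug in the commit above comes down to an integer conversion: passing -1 where a 16-bit unsigned segment is expected yields 0xffff. A minimal C sketch, with `seg_as_u16` as a hypothetical stand-in for the parameter conversion:

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for passing an int "wildcard" segment to a function
 * whose parameter is a 16-bit unsigned type.
 */
static uint16_t seg_as_u16(int seg)
{
    return (uint16_t)seg;   /* -1 wraps to 0xffff */
}
```

So a lookup written this way would search for segment 0xffff rather than treating -1 as "any segment".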
Jan Beulich [Thu, 11 Aug 2022 15:45:12 +0000 (17:45 +0200)]
build/x86: suppress GNU ld 2.39 warning about RWX load segments

Commit 68f5aac012b9 ("build: suppress future GNU ld warning about RWX
load segments") didn't quite cover all the cases: Apparently I missed
ones in the building of 32-bit helper objects because of only looking at
incremental builds (where those wouldn't normally be re-built). Clone
the workaround there to the specific Makefile in question.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ross Lagerwall [Thu, 11 Aug 2022 15:44:26 +0000 (17:44 +0200)]
x86/amd: only call setup_force_cpu_cap for boot CPU

This should only be called for the boot CPU to avoid calling _init code
after it has been unloaded.

Fixes: 062868a5a8b4 ("x86/amd: Work around CLFLUSH ordering on older parts")
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 29 Jul 2022 13:22:53 +0000 (14:22 +0100)]
x86/spec-ctrl: Enumeration for PBRSB_NO

The PBRSB_NO bit indicates that the CPU is not vulnerable to the Post-Barrier
RSB speculative vulnerability.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Xenia Ragiadakou [Thu, 11 Aug 2022 09:48:12 +0000 (11:48 +0200)]
arm/gic: fix MISRA C 2012 Rule 20.7 violation

In GIC_PRI_TO_GUEST(), add parentheses around the macro parameter 'pri' to
protect against unintended expansions, and realign the comment.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
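The kind of unintended expansion that MISRA C 2012 Rule 20.7 guards against can be shown with a small C sketch. The macros `SCALE_UNSAFE` and `SCALE_SAFE` are hypothetical examples, not the macros touched by the commit above.

```c
/* Hypothetical macros, not the ones touched by the commit. */
#define SCALE_UNSAFE(x)  (x * 2)      /* parameter not parenthesized */
#define SCALE_SAFE(x)    ((x) * 2)    /* parameter parenthesized */

/* Expands to (a + b * 2): the multiplication binds only to b. */
static int scale_unsafe(int a, int b) { return SCALE_UNSAFE(a + b); }

/* Expands to ((a + b) * 2): the whole argument is scaled. */
static int scale_safe(int a, int b)   { return SCALE_SAFE(a + b); }
```

With a = 1 and b = 2 the first version evaluates 1 + 2 * 2, while the second evaluates (1 + 2) * 2.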
Xenia Ragiadakou [Thu, 11 Aug 2022 09:47:34 +0000 (11:47 +0200)]
arm/vgic: fix coding style in macro REG_RANK_INDEX()

Add parentheses around the macro parameter 's' to protect against unintended
expansions. This also resolves a MISRA C 2012 Rule 20.7 violation warning.

Add white spaces around the subtraction operator.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Anthony PERARD [Thu, 11 Aug 2022 09:47:11 +0000 (11:47 +0200)]
tools/libxl: Replace deprecated -sdl option on QEMU command line

"-sdl" is deprecated upstream since 6695e4c0fd9e ("softmmu/vl:
Deprecate the -sdl and -curses option"), QEMU v6.2, and the option is
removed by 707d93d4abc6 ("ui: Remove deprecated options "-sdl" and
"-curses""), in upcoming QEMU v7.1.

Instead, use "-display sdl", available since 1472a95bab1e ("Introduce
-display argument"), before QEMU v1.0.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
Dario Faggioli [Thu, 11 Aug 2022 09:46:22 +0000 (11:46 +0200)]
xen/sched: setup dom0 vCPUs affinity only once

Right now, affinity for dom0 vCPUs is set up in two steps. This is a
problem as, at least in Credit2, unit_insert() sees and uses the
"intermediate" affinity, and places the vCPUs on CPUs where they cannot
be run. This in turn results in boot hangs if the "dom0_nodes"
parameter is used.

Fix this by setting up the affinity properly once and for all, in
sched_init_vcpu() called by create_vcpu().

Note that, unless a soft-affinity is explicitly specified for dom0 (by
using the relaxed mode of "dom0_nodes"), we set it to the default, which
is all CPUs, instead of computing it based on hard affinity (if any).
This is because hard and soft affinity should be considered as
independent user controlled properties. In fact, if we did derive dom0's
soft-affinity from its boot-time hard-affinity, such computed value would
continue to be used even if the user later changed the hard-affinity.
This could result in the vCPUs behaving differently from what the
user wanted and expected.
Fixes: dafd936dddbd ("Make credit2 the default scheduler")
Reported-by: Olaf Hering <ohering@suse.de>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
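The argument above for keeping hard and soft affinity independent can be sketched in C. Everything here is hypothetical: the `cpumask` type, the helpers, and the assumed preference rule (use the intersection of hard and soft affinity when it is non-empty, otherwise hard affinity alone) are simplifications for illustration, not the scheduler's actual code.

```c
#include <stdint.h>

typedef uint64_t cpumask;            /* hypothetical: one bit per CPU */
#define ALL_CPUS UINT64_MAX

struct vcpu_affinity { cpumask hard, soft; };

/* Default from the text: soft affinity is all CPUs unless set explicitly. */
static struct vcpu_affinity init_affinity(cpumask hard)
{
    struct vcpu_affinity a = { hard, ALL_CPUS };
    return a;
}

/* Assumed preference rule: hard & soft if non-empty, else hard alone. */
static cpumask preferred_cpus(const struct vcpu_affinity *a)
{
    cpumask both = a->hard & a->soft;

    return both ? both : a->hard;
}
```

If soft affinity were instead copied from the boot-time hard affinity (say CPUs 0-1) and the user later widened the hard affinity to CPUs 0-3, the stale soft mask would still steer the vCPU towards CPUs 0-1. With the all-CPUs default, the preferred set follows the new hard affinity.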
Jan Beulich [Thu, 11 Aug 2022 09:45:23 +0000 (11:45 +0200)]
x86/CPUID: AVX512-FP16 definitions

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Xenia Ragiadakou [Tue, 9 Aug 2022 09:30:48 +0000 (12:30 +0300)]
xen/arm: vreg: Fix MISRA C 2012 Rule 20.7 violation

In VREG_REG_HELPERS(), the macro parameter 'offmask' is used as an expression
and should therefore be enclosed in parentheses to protect against
unintended expansions.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Xenia Ragiadakou [Mon, 8 Aug 2022 09:48:37 +0000 (12:48 +0300)]
xen/arm: regs: Fix MISRA C 2012 Rule 20.7 violation

In macro psr_mode(), the macro parameter 'm' is used as an expression and
should therefore be enclosed in parentheses to protect against
unintended expansions.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Rahul Singh <rahul.singh@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Jason Andryuk [Tue, 19 Jul 2022 20:08:15 +0000 (16:08 -0400)]
x86: Expose more MSR_ARCH_CAPS to hwdom

commit e46474278a0e ("x86/intel: Expose MSR_ARCH_CAPS to dom0") started
exposing MSR_ARCH_CAPS to dom0.  More bits in MSR_ARCH_CAPS have since
been defined, but they haven't been exposed.  Update the list to allow
them through.

As one example, this allows a Linux Dom0 to know that it has the
appropriate microcode via FB_CLEAR.  Notably, and with the updated
microcode, this changes dom0's
/sys/devices/system/cpu/vulnerabilities/mmio_stale_data from:

  "Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown"

to:

  "Mitigation: Clear CPU buffers; SMT Host state unknown"

This exposes the MMIO Stale Data and Intel Branch History Injection
(BHI) controls as well as the page size change MCE issue bit.

Fixes: commit 2ebe8fe9b7e0 ("x86/spec-ctrl: Enumeration for MMIO Stale Data controls")
Fixes: commit cea9ae062295 ("x86/spec-ctrl: Enumeration for new Intel BHI controls")
Fixes: commit 59e89cdabc71 ("x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207")
Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
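Exposing a subset of MSR bits typically amounts to masking the host value with an allowlist. The C sketch below shows the idea. `HWDOM_ARCH_CAPS_MASK` and `hwdom_arch_caps` are hypothetical names, the list is deliberately incomplete, and the two bit positions are given as I understand the published MSR_ARCH_CAPS layout and should be checked against the vendor documentation.

```c
#include <stdint.h>

/* Bit positions as commonly documented; verify before relying on them. */
#define ARCH_CAPS_PSCHANGE_MC_NO (UINT64_C(1) << 6)
#define ARCH_CAPS_FB_CLEAR       (UINT64_C(1) << 17)

/* Hypothetical, incomplete allowlist of bits the hardware domain may see. */
#define HWDOM_ARCH_CAPS_MASK (ARCH_CAPS_PSCHANGE_MC_NO | ARCH_CAPS_FB_CLEAR)

/* Filter the host MSR value down to the allowed bits. */
static uint64_t hwdom_arch_caps(uint64_t host_val)
{
    return host_val & HWDOM_ARCH_CAPS_MASK;
}
```

A bit that is defined by hardware but missing from the mask is invisible to the domain, which is how newly defined bits can go unexposed until the list is updated.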
Jan Beulich [Tue, 9 Aug 2022 09:52:49 +0000 (11:52 +0200)]
docs: correct x86 MCE command line option info

Not even the types were correct, let alone defaults being spelled out or
the purpose of the options actually mentioned in any way.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Xenia Ragiadakou [Tue, 9 Aug 2022 09:52:06 +0000 (11:52 +0200)]
xen/cpu: undefine MASK_DECLARE_ macros after their usage

MASK_DECLARE_ macros have only a limited scope. Remove their definitions
immediately after their usage.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
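The pattern of undefining a helper macro right after its only use can be sketched in C as follows. `ENTRY` and `bit_table` are hypothetical and only loosely modeled on the idea in the commit above.

```c
/* Hypothetical helper macro, only needed to build the table below. */
#define ENTRY(x) [x] = 1UL << (x)

static const unsigned long bit_table[] = {
    ENTRY(0), ENTRY(1), ENTRY(2), ENTRY(3),
};

/* The macro has served its purpose; keep it out of the rest of the file. */
#undef ENTRY
```

After the `#undef`, a later accidental use of `ENTRY` fails at compile time instead of silently picking up a definition meant for one table.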
Xenia Ragiadakou [Tue, 9 Aug 2022 09:51:14 +0000 (11:51 +0200)]
xen/cpu: fix MISRA C 2012 Rule 20.7 violation

In MASK_DECLARE_ macros, the macro parameter 'x' is used as an expression and
should therefore be enclosed in parentheses to protect against
unintended expansions.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
While there, add the blanks missing around the + operators involved.

Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jane Malalane [Tue, 9 Aug 2022 09:49:43 +0000 (11:49 +0200)]
x86/kexec: Add the '.L_' prefix to is_* and call_* labels

These are local symbols and shouldn't be externally visible.

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jane Malalane <jane.malalane@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Xenia Ragiadakou [Mon, 8 Aug 2022 18:39:52 +0000 (21:39 +0300)]
automation: qemu-smoke-arm64: Run ping test over a pv network interface

This patch modifies the test in the following way:
- Dom0 is booted with an Alpine Linux rootfs with the Xen tools.
- Once Dom0 is booted, it starts xenstored, calls init-dom0less to set up
the xenstore interface for the dom0less Dom1, sets up the bridged network
and attaches a pv network interface to Dom1.
- In the meantime, Dom1 in its init script tries to assign an IP address
to eth0 and ping Dom0.
- If Dom1 manages to ping Dom0, it prints 'passed'.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Xenia Ragiadakou [Mon, 8 Aug 2022 18:39:51 +0000 (21:39 +0300)]
automation: qemu-smoke-arm64: Use kernel 5.19

Use kernel 5.19 to unblock testing dom0less enhanced.
This kernel version has the necessary patches for deferring xenbus probe
until xenstore is fully initialized.
Also, build kernel with bridging and xen netback support enabled because
it will be used for testing network connectivity between Dom0 and Dom1
over a pv network interface.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Stefano Stabellini [Fri, 29 Jul 2022 00:05:57 +0000 (17:05 -0700)]
automation: disable xen,enhanced in qemu-smoke-arm64

Disable xen,enhanced because we don't use PV drivers in this test and
also because the kernel used for testing is old and unpatched and would
break if xen,enhanced is passed.

This patch unbreaks gitlab-ci.

Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Reviewed-by: Ayan Kumar Halder <ayankuma@amd.com>
Tested-by: Ayan Kumar Halder <ayankuma@amd.com>
Edwin Török [Fri, 29 Jul 2022 17:53:29 +0000 (18:53 +0100)]
tools/ocaml/libs/xb: hide type of Xb.t

Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Fri, 29 Jul 2022 17:53:28 +0000 (18:53 +0100)]
tools/ocaml: fix compiler warnings

Fix compiler warnings about:
* unused value
* ambiguous documentation comment
* non-principal type inference (compiler version dependent)

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Fri, 29 Jul 2022 17:53:25 +0000 (18:53 +0100)]
tools/ocaml/*/Makefile: generate paths.ml from configure

paths.ml contains various paths known to configure, and currently is generated
via a Makefile rule.  Simplify this and generate it through configure, similar
to how oxenstored.conf is generated from oxenstored.conf.in.

This will make it easier to reuse the generated file with Dune.

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Andrew Cooper [Tue, 14 Jun 2022 15:18:36 +0000 (16:18 +0100)]
xen/wait: Describe RSB safety

It turns out that we do in fact have RSB safety here, but not for obvious
reasons.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 2 Aug 2022 13:30:30 +0000 (14:30 +0100)]
x86/spec-ctrl: Use IST RSB protection for !SVM systems

There is a corner case where a VT-x guest which manages to reliably trigger
non-fatal #MC's could evade the rogue RSB speculation protections that were
supposed to be in place.

This is a lack of defence in depth; Xen does not architecturally execute more
RET than CALL instructions, so an attacker would have to locate a different
gadget (e.g. SpectreRSB) first to execute a transient path of excess RET
instructions.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 5 Aug 2022 06:39:02 +0000 (08:39 +0200)]
ChangeLog: mention IOMMU superpage support

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Henry Wang <Henry.Wang@arm.com>
Xenia Ragiadakou [Fri, 5 Aug 2022 06:38:23 +0000 (08:38 +0200)]
xen/hypfs: check the return value of snprintf to avoid leaking stack accidently

The function snprintf() returns the number of characters that would have been
written in the buffer if the buffer size had been sufficiently large,
not counting the terminating null character.
Hence, the value returned is not guaranteed to be smaller than the buffer size.
Check the return value of snprintf() to prevent leaking stack contents to the
guest by accident.

Also, for debug builds, add an assertion to ensure that the assumption made on
the size of the destination buffer still holds.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
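The snprintf() behavior described in the commit above can be sketched in C. `format_checked` is a hypothetical helper, not the code from the patch. It refuses to report a length when the output was truncated, so a caller cannot copy out more bytes than were actually written.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format an unsigned value into buf and return the
 * number of valid bytes (including the terminating NUL), or 0 if the
 * output was truncated or an error occurred.
 */
static size_t format_checked(char *buf, size_t size, unsigned long val)
{
    int n = snprintf(buf, size, "%lu", val);

    /* n is the length that WOULD have been written, so it can be >= size. */
    if (n < 0 || (size_t)n >= size)
        return 0;

    return (size_t)n + 1;
}
```

Using the raw return value as a copy length without this check could read past the end of the buffer, which is how stack contents might leak.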
Xenia Ragiadakou [Fri, 5 Aug 2022 06:37:13 +0000 (08:37 +0200)]
xen/compiler: fix MISRA C 2012 Rule 20.7 violation

In __must_be_array(), the macro parameter 'a' is used as an expression and
should therefore be enclosed in parentheses to protect against
unintended expansions.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 5 Aug 2022 06:36:54 +0000 (08:36 +0200)]
tools/xenstore: add documentation for new set/get-feature commands

Add documentation for two new Xenstore wire commands SET_FEATURE and
GET_FEATURE used to set or query the Xenstore features visible in the
ring page of a given domain.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Xenia Ragiadakou [Wed, 3 Aug 2022 07:09:58 +0000 (10:09 +0300)]
xen/char: mvebu-uart: Fix MISRA C 2012 Rule 20.7 violation

The macro parameters 'off' and 'uart' are used as expressions and should
be enclosed in parentheses to protect against unintended expansion.

For the 'uart' case, in mvebu3700_write(), correct the second parenthesis,
which seems to have been accidentally misplaced.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Xenia Ragiadakou [Tue, 2 Aug 2022 07:54:33 +0000 (10:54 +0300)]
xen/char: imx-lpuart: Fix MISRA C 2012 Rule 20.7 violation

The macro parameter 'off' is used as an expression and should be
enclosed in parentheses to protect against unintended expansion.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Bertrand Marquis [Wed, 3 Aug 2022 11:57:02 +0000 (12:57 +0100)]
tools: use $(PYTHON) to call cppcheck tools

When calling python tools to convert misra documentation or merge
cppcheck xml files, use $(PYTHON).
While there, make the misra document conversion script executable.

Fixes: 57caa5375321 ("xen: Add MISRA support to cppcheck make rule")
Fixes: 43aa3f6e72d3 ("xen/build: Add cppcheck and cppcheck-html make rules")
Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 years agodoc: Add git commands to generate Fixes
Bertrand Marquis [Wed, 3 Aug 2022 14:43:04 +0000 (15:43 +0100)]
doc: Add git commands to generate Fixes

Add examples of git commands that can be used to generate Fixes tags, and
explain how to use the pretty configuration for git.
This should make it easier for contributors to use the right format.

Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Julien Grall <jgrall@amazon.com>
2 years agosched: dom0_vcpus_pin should only affect dom0
Dario Faggioli [Wed, 3 Aug 2022 10:14:01 +0000 (12:14 +0200)]
sched: dom0_vcpus_pin should only affect dom0

If dom0_vcpus_pin is used, make sure the pinning is only done for
dom0 vcpus, instead of for the hardware domain (which might not be
dom0 at all!).

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 years agotools/ocaml: do not run ocamldep during make clean
Edwin Török [Wed, 3 Aug 2022 10:13:39 +0000 (12:13 +0200)]
tools/ocaml: do not run ocamldep during make clean

Trying to include .ocamldep.make will cause it to be generated if it
doesn't exist.
We do not want this during make clean: we would remove it anyway.

Speeds up make clean.

Before (measured on f732240fd3bac25116151db5ddeb7203b62e85ce, July 2022):
```
Parsing /home/edwin/xen2/tools/ocaml/libs/xl/../../../../tools/libs/light/libxl_types.idl
Parsing /home/edwin/xen2/tools/ocaml/libs/xl/../../../../tools/libs/light/libxl_types.idl
Parsing /home/edwin/xen2/tools/ocaml/libs/xl/../../../../tools/libs/light/libxl_types.idl
Parsing /home/edwin/xen2/tools/ocaml/libs/xl/../../../../tools/libs/light/libxl_types.idl
Parsing /home/edwin/xen2/tools/ocaml/libs/xl/../../../../tools/libs/light/libxl_types.idl

 Performance counter stats for 'make clean -j8 -s' (5 runs):

            4.2233 +- 0.0208 seconds time elapsed  ( +-  0.49% )
```

After:
```
perf stat -r 5 --null make clean -j8 -s

 Performance counter stats for 'make clean -j8 -s' (5 runs):

            2.7325 +- 0.0138 seconds time elapsed  ( +-  0.51% )
```

No functional change.

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
2 years agoevtchn: convert domain event lock to an r/w one
Jan Beulich [Wed, 3 Aug 2022 10:10:26 +0000 (12:10 +0200)]
evtchn: convert domain event lock to an r/w one

Especially for the use in evtchn_move_pirqs() (called when moving a vCPU
across pCPU-s) and the ones in EOI handling in PCI pass-through code,
serializing perhaps an entire domain isn't helpful when no state (which
isn't e.g. further protected by the per-channel lock) changes.

Unfortunately this implies dropping of lock profiling for this lock,
until r/w locks may get enabled for such functionality.

While ->notify_vcpu_id is now meant to be consistently updated with the
per-channel lock held, an extension applies to ECS_PIRQ: The field is
also guaranteed to not change with the per-domain event lock held for
writing. Therefore the link_pirq_port() call from evtchn_bind_pirq()
could in principle be moved out of the per-channel locked regions, but
this further code churn didn't seem worth it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
2 years agoarm/vgic-v3: fix virq offset in the rank when storing irouter
Hongda Deng [Fri, 29 Jul 2022 08:36:02 +0000 (16:36 +0800)]
arm/vgic-v3: fix virq offset in the rank when storing irouter

When vGIC performs irouter registers emulation, to get the target vCPU
via virq conveniently, Xen doesn't store the irouter value directly,
instead it will use the value (affinities) in irouter to calculate the
target vCPU, and then save the target vCPU in irq rank->vcpu[offset].

When vGIC tries to get the target vCPU, it first calculates the target
vCPU index via
  int target = read_atomic(&rank->vcpu[virq & INTERRUPT_RANK_MASK]);
and then it gets the target vCPU via
  v->domain->vcpu[target];

When vGIC tries to store irouter for one virq, the target vCPU index
in the rank is computed as
  offset &= virq & INTERRUPT_RANK_MASK;
finally it gets the target vCPU via
  d->vcpu[read_atomic(&rank->vcpu[offset])];

The two paths differ in how they compute the target vCPU index in the
rank. (virq & INTERRUPT_RANK_MASK) already yields the target vCPU index
in the rank, so it is wrong to put '&' before '=' when calculating the
offset.

For example, the target vCPU index in the rank should be 6 for virq 38,
but vGIC will get offset=0 when vGIC stores the irouter for this virq,
and finally vGIC will access the wrong target vCPU index in the rank
when updating the irouter.
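
As an illustration only, the difference between the two assignments can be
sketched as below. The mask value (32 interrupts per rank), the 8-byte
IROUTER width and the idea that the incoming offset is the byte offset of
the register are assumptions made for this sketch; they are consistent with
the virq 38 example above but are not quoted from the vGIC code:

```c
/* Assumed value: 32 interrupts per rank, so virq 38 maps to index 6. */
#define INTERRUPT_RANK_MASK   0x1fU
/* Assumption: each IROUTER register is 8 bytes wide. */
#define NR_BYTES_PER_IROUTER  8U

/* Buggy form: the incoming byte offset is ANDed into the result. */
unsigned int ex_rank_index_buggy(unsigned int offset)
{
    unsigned int virq = offset / NR_BYTES_PER_IROUTER;

    offset &= virq & INTERRUPT_RANK_MASK;
    return offset;
}

/* Fixed form: the index depends only on the virq. */
unsigned int ex_rank_index_fixed(unsigned int offset)
{
    unsigned int virq = offset / NR_BYTES_PER_IROUTER;

    offset = virq & INTERRUPT_RANK_MASK;
    return offset;
}
```

For virq 38 the byte offset under these assumptions is 304 (0x130); the
buggy form computes 0x130 & 6 = 0, the fixed form returns 6.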

Fixes: 5d495f4349b5 ("xen/arm: vgic: Optimize the way to store the target vCPU in the rank")
Signed-off-by: Hongda Deng <Hongda.Deng@arm.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
2 years agoxen/efi: efibind: fix MISRA C 2012 Directive 4.10 violation
Xenia Ragiadakou [Mon, 1 Aug 2022 12:21:18 +0000 (15:21 +0300)]
xen/efi: efibind: fix MISRA C 2012 Directive 4.10 violation

Prevent header file from being included more than once by adding ifndef guard.

In order to stay close to the gnu-efi code:
- for x86_64, use the same guard
- for arm64, where there is no guard in gnu-efi, for consistency,
use a similar format and position to the x86_64 guard

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
2 years agoautomation: qemu-smoke-arm64.sh: Fix the number of cpus in the device tree
Xenia Ragiadakou [Fri, 29 Jul 2022 14:52:29 +0000 (17:52 +0300)]
automation: qemu-smoke-arm64.sh: Fix the number of cpus in the device tree

Qemu VM is configured with 2 cpus but the device tree passed has only 1.
Generate a device tree with 2 cpus.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 years agoautomation: qemu-smoke-arm64.sh: Rename the device tree to avoid confusion
Xenia Ragiadakou [Fri, 29 Jul 2022 14:52:28 +0000 (17:52 +0300)]
automation: qemu-smoke-arm64.sh: Rename the device tree to avoid confusion

Rename the device tree from virt-gicv3 to virt-gicv2 to avoid confusion
since the version of the generic interrupt controller used for this test
is the v2 and not the v3.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 years agoautomation: qemu-smoke-arm64.sh: Remove some stale comments
Xenia Ragiadakou [Fri, 29 Jul 2022 14:52:27 +0000 (17:52 +0300)]
automation: qemu-smoke-arm64.sh: Remove some stale comments

Remove comment "# Install QEMU" because qemu is not installed, it is taken
from a test-artifacts container.

Change comment "# Busybox Dom0" to "# Busybox" because busybox is not used
only for the Dom0 but also for the DomU.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 years agoxen/arm: domain: Fix MISRA C 2012 Rule 8.7 violation
Xenia Ragiadakou [Thu, 28 Jul 2022 16:21:51 +0000 (19:21 +0300)]
xen/arm: domain: Fix MISRA C 2012 Rule 8.7 violation

The function idle_loop() is referenced only in domain.c.
Change its linkage from external to internal by adding the storage-class
specifier static to its definitions.

Add the function as a 'fake' input operand to the inline assembly statement,
to make the compiler aware that the function is used.
Fake means that the function is not actually used as an operand by the asm code.
That is because there is not a suitable gcc arm32 asm constraint for labels.

Declare return_to_new_vcpu32() and return_to_new_vcpu64() that are also
referenced by this inline asm statement.

Also, this patch resolves indirectly a MISRA C 2012 Rule 8.4 violation warning.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
[add noreturn]
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
2 years agoxen/arm: mm: Reduce the area that xen_second covers
Julien Grall [Fri, 29 Jul 2022 21:59:49 +0000 (22:59 +0100)]
xen/arm: mm: Reduce the area that xen_second covers

At the moment, xen_second is used to cover the first 2GB of the
virtual address space. With the recent rework of the page-tables,
only the first 1GB region (where Xen resides) is effectively used.

In addition to that, I would like to reshuffle the memory layout,
so Xen mappings may no longer be in the first 2GB of the virtual
address space.

Therefore, rework xen_second so it only covers the 1GB region where
Xen will reside.

With this change, xen_second no longer covers the xenheap area
on arm32. So, we first need to add memory to the boot allocator before
setting up the xenheap mappings.

Take the opportunity to update the comments on top of xen_fixmap and
xen_xenmap.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
2 years agoxen/arm: mm: Move domain_{,un}map_* helpers in a separate file
Julien Grall [Fri, 29 Jul 2022 21:59:01 +0000 (22:59 +0100)]
xen/arm: mm: Move domain_{,un}map_* helpers in a separate file

The file xen/arch/arm/mm.c has been growing quite a lot. It now contains
various independent parts of the MM subsystem.

One of them is the set of helpers to map/unmap a page, which are only used
by arm32 and protected by CONFIG_ARCH_MAP_DOMAIN_PAGE. Move them to a
new file xen/arch/arm/domain_page.c.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
2 years agoxen: Rename CONFIG_DOMAIN_PAGE to CONFIG_ARCH_MAP_DOMAIN_PAGE and...
Julien Grall [Fri, 29 Jul 2022 21:53:10 +0000 (22:53 +0100)]
xen: Rename CONFIG_DOMAIN_PAGE to CONFIG_ARCH_MAP_DOMAIN_PAGE and...

move it to Kconfig.

The define CONFIG_DOMAIN_PAGE indicates whether the architecture provides
helpers to map/unmap a domain page. Rename it to CONFIG_ARCH_MAP_DOMAIN_PAGE
so it is clearer that support for domain page is not something that
can be disabled in Xen.

Take the opportunity to move CONFIG_ARCH_MAP_DOMAIN_PAGE to Kconfig as it
will soon be needed in the Makefile.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> #arm part
2 years agoxen/arm32: mm: Consolidate the domheap mappings initialization
Julien Grall [Fri, 29 Jul 2022 21:48:16 +0000 (22:48 +0100)]
xen/arm32: mm: Consolidate the domheap mappings initialization

At the moment, the domheap mappings initialization is done separately for
the boot CPU and secondary CPUs. The main difference is for the former
the pages are part of Xen binary whilst for the latter they are
dynamically allocated.

It would be good to have a single helper so it is easier to rework
how the domheap is initialized.

For CPU0, we still need to use pre-allocated pages because the
allocators may use domain_map_page(), so we need to have the domheap
area ready first. But we can still delay the initialization to setup_mm().

Introduce a new helper init_domheap_mappings() that will be called
from setup_mm() for the boot CPU and from init_secondary_pagetables()
for secondary CPUs.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Tested-by: Luca Fancellu <luca.fancellu@arm.com>
2 years agoxen/arm: Remove most of the *_VIRT_END defines
Julien Grall [Fri, 29 Jul 2022 21:41:43 +0000 (22:41 +0100)]
xen/arm: Remove most of the *_VIRT_END defines

At the moment, *_VIRT_END may either point to the address after the end
or the last address of the region.

The lack of consistency makes it quite difficult to reason about them.

Furthermore, there is a risk of overflow in the case where the address
points past the end. I am not aware of any such cases, so this is only a
latent bug.

Start to solve the problem by removing all the *_VIRT_END exclusively used
by the Arm code and add *_VIRT_SIZE when it is not present.
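
As an illustration only, the overflow risk of an exclusive end address can
be sketched as below. The region, its placement at the top of a 32-bit
address space and all EXAMPLE_* names are hypothetical and do not come from
the Arm memory layout:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t vaddr_t;   /* assumption: a 32-bit virtual address space */

/* Hypothetical region occupying the last 1MB of the address space. */
#define EXAMPLE_VIRT_START  UINT32_C(0xfff00000)
#define EXAMPLE_VIRT_SIZE   UINT32_C(0x00100000)

/* Exclusive end: START + SIZE wraps around to 0 for this region. */
bool ex_in_region_by_end(vaddr_t va)
{
    vaddr_t end = EXAMPLE_VIRT_START + EXAMPLE_VIRT_SIZE; /* wraps to 0 */

    return va >= EXAMPLE_VIRT_START && va < end;
}

/* Size-based check: no address past the end is ever formed. */
bool ex_in_region_by_size(vaddr_t va)
{
    return va >= EXAMPLE_VIRT_START &&
           (va - EXAMPLE_VIRT_START) < EXAMPLE_VIRT_SIZE;
}
```

The end-based check rejects every address in this region because the end
wrapped to 0, while the size-based check behaves as expected.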

Take the opportunity to rename BOOT_FDT_SLOT_SIZE to BOOT_FDT_VIRT_SIZE
for better consistency and use _AT(vaddr_t, ).

Also take the opportunity to fix the coding style of the comment touched
in mm.c.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Tested-By: Luca Fancellu <luca.fancellu@arm.com>
2 years agoarm/atomic: fix MISRA C 2012 Rule 20.7 violation
Xenia Ragiadakou [Fri, 29 Jul 2022 06:51:31 +0000 (08:51 +0200)]
arm/atomic: fix MISRA C 2012 Rule 20.7 violation

The macro parameter 'p' is used as an expression and needs to be enclosed in
parentheses.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 years agoxsm/dummy: fix MISRA C 2012 Directive 4.10 violation
Xenia Ragiadakou [Fri, 29 Jul 2022 06:50:58 +0000 (08:50 +0200)]
xsm/dummy: fix MISRA C 2012 Directive 4.10 violation

Protect header file from being included more than once by adding ifndef guard.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
2 years agox86/shadow: drop CONFIG_HVM conditionals from sh_update_cr3()
Jan Beulich [Fri, 29 Jul 2022 06:50:25 +0000 (08:50 +0200)]
x86/shadow: drop CONFIG_HVM conditionals from sh_update_cr3()

Now that we're not building multi.c anymore for 2 and 3 guest levels
when !HVM, there's no point in having these conditionals anymore. (As
somewhat a special case, the last of the removed conditionals really
builds on shadow_mode_external() always returning false when !HVM.) This
way the code becomes a tiny bit more readable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agox86/shadow: don't open-code shadow_remove_all_shadows()
Jan Beulich [Fri, 29 Jul 2022 06:49:48 +0000 (08:49 +0200)]
x86/shadow: don't open-code shadow_remove_all_shadows()

Let's use the existing inline wrapper instead of repeating respective
commentary at every site.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agox86/shadow: exclude HVM-only code from sh_remove_shadows() when !HVM
Jan Beulich [Fri, 29 Jul 2022 06:49:06 +0000 (08:49 +0200)]
x86/shadow: exclude HVM-only code from sh_remove_shadows() when !HVM

In my (debug) build this amounts to well over 500 bytes of dead code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agox86/shadow: properly handle get_page() failing
Jan Beulich [Fri, 29 Jul 2022 06:48:26 +0000 (08:48 +0200)]
x86/shadow: properly handle get_page() failing

We should not blindly (in a release build) insert the new entry in the
hash if a reference to the guest page cannot be obtained, or else an
excess reference would be put when removing the hash entry again. Crash
the domain in that case instead. The sole caller doesn't further care
about the state of the guest page: All it does is return the
corresponding shadow page (which was obtained successfully before) to
its caller.

To compensate we further need to adjust hash removal: Since the shadow
page already has had its backlink set, domain cleanup code would try to
destroy the shadow, and hence still cause a put_page() without
corresponding get_page(). Leverage that the failed get_page() leads to
no hash insertion, making shadow_hash_delete() no longer assume it will
find the requested entry. Instead return back whether the entry was
found. This way delete_shadow_status() can avoid calling put_page() in
the problem scenario.

For the other caller of shadow_hash_delete() simply reinstate the
otherwise dropped assertion at the call site.

While touching the conditionals in {set,delete}_shadow_status() anyway,
also switch around their two pre-existing parts, to have the cheap one
first (frequently allowing to avoid evaluation of the expensive - due to
evaluate_nospec() - one altogether).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agoautomation: arm64: Create a test job for testing static allocation on qemu
Xenia Ragiadakou [Thu, 28 Jul 2022 07:58:56 +0000 (10:58 +0300)]
automation: arm64: Create a test job for testing static allocation on qemu

Enable CONFIG_STATIC_MEMORY in the existing arm64 build.

Create a new test job, called qemu-smoke-arm64-gcc-staticmem.

Adjust the qemu-smoke-arm64.sh script to accommodate the static memory test as a
new test variant. The test variant is determined based on the first argument
passed to the script. For testing static memory, the argument is 'static-mem'.

The test configures DOM1 with a static memory region and adds a check in the
init script.
The check consists in comparing the contents of the /proc/device-tree
memory entry with the static memory range with which DOM1 was configured.
If the memory layout is correct, a message gets printed by DOM1.

At the end of the qemu run, the script searches for the specific message
in the logs and fails if not found.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Reviewed-by: Penny Zheng <penny.zheng@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 years agoautomation: Remove XEN_CONFIG_EXPERT leftovers
Xenia Ragiadakou [Thu, 28 Jul 2022 07:58:55 +0000 (10:58 +0300)]
automation: Remove XEN_CONFIG_EXPERT leftovers

The EXPERT config option can no longer be selected via the environment
variable XEN_CONFIG_EXPERT. Remove stale references to XEN_CONFIG_EXPERT
from the automation code.

Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2 years agolibxl/arm: Create specific IOMMU node to be referred by virtio-mmio device
Oleksandr Tyshchenko [Fri, 15 Jul 2022 19:20:26 +0000 (22:20 +0300)]
libxl/arm: Create specific IOMMU node to be referred by virtio-mmio device

Reuse generic IOMMU device tree bindings to communicate Xen specific
information for the virtio devices for which the restricted memory
access using Xen grant mappings needs to be enabled.

Insert an "iommus" property pointing to the IOMMU node with "xen,grant-dma"
compatible into all virtio devices whose backends are going to run in
non-hardware domains (which are non-trusted by default).

Based on device-tree binding from Linux:
Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml

The example of generated nodes:

xen_iommu {
    compatible = "xen,grant-dma";
    #iommu-cells = <0x01>;
    phandle = <0xfde9>;
};

virtio@2000000 {
    compatible = "virtio,mmio";
    reg = <0x00 0x2000000 0x00 0x200>;
    interrupts = <0x00 0x01 0xf01>;
    interrupt-parent = <0xfde8>;
    dma-coherent;
    iommus = <0xfde9 0x01>;
};

virtio@2000200 {
    compatible = "virtio,mmio";
    reg = <0x00 0x2000200 0x00 0x200>;
    interrupts = <0x00 0x02 0xf01>;
    interrupt-parent = <0xfde8>;
    dma-coherent;
    iommus = <0xfde9 0x01>;
};

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
2 years agolibxl: Introduce basic virtio-mmio support on Arm
Julien Grall [Fri, 15 Jul 2022 19:20:25 +0000 (22:20 +0300)]
libxl: Introduce basic virtio-mmio support on Arm

This patch introduces helpers to allocate Virtio MMIO params
(IRQ and memory region) and create specific device node in
the Guest device-tree with allocated params. In order to deal
with multiple Virtio devices, reserve corresponding ranges.
For now, we reserve 1MB for memory regions and 10 SPIs.

As these helpers should be used for every Virtio device attached
to the Guest, call them for Virtio disk(s).

Please note, with statically allocated Virtio IRQs there is
a risk of a clash with the physical IRQs of passthrough devices.
For the first version, it's fine, but we should consider allocating
the Virtio IRQs automatically. Thankfully, we know in advance which
IRQs will be used for passthrough, so we are able to choose
non-clashing ones.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
2 years agolibxl: Add support for Virtio disk configuration
Oleksandr Tyshchenko [Fri, 15 Jul 2022 19:20:24 +0000 (22:20 +0300)]
libxl: Add support for Virtio disk configuration

This patch adds basic support for configuring and assisting a virtio-mmio
based virtio-disk backend (emulator) which is intended to run outside of
Qemu and could be run in any domain.
The Virtio block device is quite different from the traditional
Xen PV block device (vbd) from the toolstack's point of view:
 - as the frontend is virtio-blk, which is not a Xenbus driver, nothing
   written to Xenstore is currently fetched by the frontend ("vdev"
   is not passed to the frontend). This might need to be revised
   in future, so that frontend data can be written to Xenstore in order to
   support hotplugging virtio devices or passing the backend domain id
   on architectures where the device-tree is not available.
 - the ring-ref/event-channel are not used for the backend<->frontend
   communication; the proposed IPC for Virtio is IOREQ/DM.
Nevertheless, it is still a "block device" and ought to be integrated into
the existing "disk" handling. So, re-use (and adapt) the "disk"
parsing/configuration logic to deal with Virtio devices as well.

For the immediate purpose and an ability to extend that support for
other use-cases in future (Qemu, virtio-pci, etc) perform the following
actions:
- Add new disk backend type (LIBXL_DISK_BACKEND_STANDALONE) and reflect
  that in the configuration
- Introduce new disk "specification" and "transport" fields to struct
  libxl_device_disk. Both are written to the Xenstore. The transport
  field is only used for the specification "virtio" and it assumes
  only "mmio" value for now.
- Introduce new "specification" option with "xen" communication
  protocol being default value.
- Add new device kind (LIBXL__DEVICE_KIND_VIRTIO_DISK) as current
  one (LIBXL__DEVICE_KIND_VBD) doesn't fit into Virtio disk model

An example of domain configuration for Virtio disk:
disk = [ 'phy:/dev/mmcblk0p3, xvda1, backendtype=standalone, specification=virtio']

Nothing has changed for default Xen disk configuration.

Please note, this patch is not enough for virtio-disk to work
on Xen (Arm), as for every Virtio device (including disk) we need
to allocate Virtio MMIO params (IRQ and memory region) and pass
them to the backend, also update Guest device-tree. The subsequent
patch will add these missing bits. For the current patch,
the default "irq" and "base" are just written to the Xenstore.
This is not an ideal splitting, but this way we avoid breaking
the bisectability.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Tested-by: Jiamei Xie <jiamei.xie@arm.com>
2 years agox86/PV: correct post-preemption progress recording in iommu_memory_setup()
Jan Beulich [Wed, 27 Jul 2022 11:00:08 +0000 (13:00 +0200)]
x86/PV: correct post-preemption progress recording in iommu_memory_setup()

Coverity validly points out that the mfn_add() as used was dead code.

Coverity ID: 1507475
Fixes: c1e1564c8995 ("IOMMU/x86: perform PV Dom0 mappings in batches")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agomm: enforce return value checking on get_page()
Jan Beulich [Wed, 27 Jul 2022 10:58:50 +0000 (12:58 +0200)]
mm: enforce return value checking on get_page()

It's hard to imagine a case where an error may legitimately be ignored
here. It's bad enough that in at least one case (set_shadow_status())
the return value was checked only by way of ASSERT()ing.
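
As an illustration only, compiler-enforced return value checking can be
sketched with the GCC/Clang warn_unused_result attribute. The structure and
function names below are hypothetical stand-ins, not Xen's get_page():

```c
#include <stdbool.h>

/* GCC/Clang attribute that warns when a caller drops the return value. */
#define ex_must_check __attribute__((__warn_unused_result__))

/* Hypothetical page with a bounded reference count. */
struct ex_page {
    unsigned int refcount;
    unsigned int max_refs;
};

/* Taking a reference can fail, so callers must look at the result. */
ex_must_check bool ex_get_page(struct ex_page *pg)
{
    if ( pg->refcount >= pg->max_refs )
        return false;
    pg->refcount++;
    return true;
}

void ex_put_page(struct ex_page *pg)
{
    pg->refcount--;
}

/* A caller that handles failure instead of only asserting on it. */
int ex_use_page(struct ex_page *pg)
{
    if ( !ex_get_page(pg) )
        return -1;
    /* ... the page would be used here ... */
    ex_put_page(pg);
    return 0;
}
```

A bare call such as `ex_get_page(pg);` would draw a compiler warning with
this attribute in place.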

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <jgrall@amazon.com>
2 years agox86/shadow: drop shadow_prepare_page_type_change()'s 3rd parameter
Jan Beulich [Wed, 27 Jul 2022 10:58:16 +0000 (12:58 +0200)]
x86/shadow: drop shadow_prepare_page_type_change()'s 3rd parameter

As of 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in
_get_page_type()") this no longer needs passing separately - the type
can now be read from struct page_info, as the call now happens after its
writing.

While there also constify the 2nd parameter.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agox86/msr: fix X2APIC_LAST
Edwin Török [Wed, 27 Jul 2022 10:57:10 +0000 (12:57 +0200)]
x86/msr: fix X2APIC_LAST

The latest Intel manual now says the X2APIC reserved range is only
0x800 to 0x8ff (NOT 0xbff).
This changed between SDM 68 (Nov 2018) and SDM 69 (Jan 2019).
The AMD manual documents 0x800-0x8ff too.

There are non-X2APIC MSRs in the 0x900-0xbff range now:
e.g. 0x981 is IA32_TME_CAPABILITY, an architectural MSR.
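
As an illustration only, the effect of the narrower range can be sketched
as below. The EX_* constants and helper names are hypothetical; only the
numeric values come from the text above:

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_MSR_X2APIC_FIRST         0x800U
#define EX_MSR_X2APIC_LAST_OLD      0xbffU  /* range assumed before the fix */
#define EX_MSR_X2APIC_LAST          0x8ffU  /* range per the current manuals */
#define EX_MSR_IA32_TME_CAPABILITY  0x981U

/* Old check: treats 0x900-0xbff as x2APIC as well. */
bool ex_is_x2apic_msr_old(uint32_t msr)
{
    return msr >= EX_MSR_X2APIC_FIRST && msr <= EX_MSR_X2APIC_LAST_OLD;
}

/* New check: only 0x800-0x8ff is x2APIC. */
bool ex_is_x2apic_msr(uint32_t msr)
{
    return msr >= EX_MSR_X2APIC_FIRST && msr <= EX_MSR_X2APIC_LAST;
}
```

With the old bound, IA32_TME_CAPABILITY (0x981) is misclassified as an
x2APIC MSR; with the new bound it is not.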

The new MSR in this range appears to have been introduced in Icelake,
so this commit should be backported to Xen versions supporting Icelake.

Backport: 4.13+

Signed-off-by: Edwin Török <edvin.torok@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 years agox86/vpmu: Fix build following vmfork addition
Andrew Cooper [Tue, 26 Jul 2022 13:11:33 +0000 (14:11 +0100)]
x86/vpmu: Fix build following vmfork addition

GCC with IBT extensions complains:

  arch/x86/cpu/vpmu.c:351:15: error: conflicting types for 'vpmu_save_force'; have 'void(void *)' with implied 'nocf_check' attribute
    351 | void cf_check vpmu_save_force(void *arg)
        |               ^~~~~~~~~~~~~~~
  In file included from ./arch/x86/include/asm/domain.h:10,
                   from ./include/xen/domain.h:8,
                   from ./include/xen/sched.h:11,
                   from ./include/xen/event.h:12,
                   from arch/x86/cpu/vpmu.c:23:
  ./arch/x86/include/asm/vpmu.h:117:6: note: previous declaration of 'vpmu_save_force' with type 'void(void *)'
    117 | void vpmu_save_force(void *arg);
        |      ^~~~~~~~~~~~~~~

Adjust the declaration.

Fixes: 755087eb9b10 ("xen/mem_sharing: support forks with active vPMU state")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 years agox86/pv: Inject #GP for implicit grant unmaps
Andrew Cooper [Tue, 19 Jul 2022 20:37:43 +0000 (21:37 +0100)]
x86/pv: Inject #GP for implicit grant unmaps

This is a debug behaviour to identify buggy kernels.  Crashing the domain is
the most unhelpful thing to do, because it discards the relevant context.

Instead, inject #GP[0] like other permission errors in x86.  In particular,
this lets the kernel provide a backtrace which is more likely to be helpful to
a developer.

As a bugfix, this always injects #GP[0] to current, not l1e_owner.  It is not
l1e_owner's fault if dom0 using superpowers triggers an implicit unmap.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
2 years agox86/mm: correct TLB flush condition in _get_page_type()
Jan Beulich [Tue, 26 Jul 2022 12:54:34 +0000 (14:54 +0200)]
x86/mm: correct TLB flush condition in _get_page_type()

When this logic was moved, it was moved across the point where nx is
updated to hold the new type for the page. IOW originally it was
equivalent to using x (and perhaps x would better have been used), but
now it isn't anymore. Switch to using x, which then brings things in
line again with the slightly earlier comment there (now) talking about
transitions _from_ writable.

I have to confess though that I cannot make a direct connection between
the reported observed behavior of guests leaving several pages around
with pending general references and the change here. Repeated testing,
nevertheless, confirms the reported issue is no longer there.

This is CVE-2022-33745 / XSA-408.

Reported-by: Charles Arnold <carnold@suse.com>
Fixes: 8cc5036bc385 ("x86/pv: Fix ABAC cmpxchg() race in _get_page_type()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agocommon/memory: Fix ifdefs for ptdom_max_order
Luca Fancellu [Tue, 26 Jul 2022 06:33:46 +0000 (08:33 +0200)]
common/memory: Fix ifdefs for ptdom_max_order

In common/memory.c the ifdef code surrounding ptdom_max_order is
using HAS_PASSTHROUGH instead of CONFIG_HAS_PASSTHROUGH, fix the
problem using the correct macro.
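
As an illustration only, the reason such a mistake goes unnoticed is that
an #ifdef on an undefined name silently selects the other branch. In the
sketch below the #define stands in for what Kconfig would generate, and the
function names are hypothetical:

```c
/* Stand-in for the Kconfig-generated symbol; only the CONFIG_ name exists. */
#define CONFIG_HAS_PASSTHROUGH 1

/* Wrong macro name: never defined, so the guarded code is compiled out. */
int ex_ptdom_code_built_wrong(void)
{
#ifdef HAS_PASSTHROUGH
    return 1;
#else
    return 0;
#endif
}

/* Correct macro name: the guarded code is compiled in. */
int ex_ptdom_code_built_right(void)
{
#ifdef CONFIG_HAS_PASSTHROUGH
    return 1;
#else
    return 0;
#endif
}
```

Neither variant produces a diagnostic, which is why the stale name could
survive the Kconfig conversion.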

Fixes: e0d44c1f9461 ("build: convert HAS_PASSTHROUGH use to Kconfig")
Signed-off-by: Luca Fancellu <luca.fancellu@arm.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
2 years agopage-alloc: fix initialization of cross-node regions
Jan Beulich [Tue, 26 Jul 2022 06:33:10 +0000 (08:33 +0200)]
page-alloc: fix initialization of cross-node regions

Quite obviously to determine the split condition successive pages'
attributes need to be evaluated, not always those of the initial page.
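
As an illustration only, the idea of evaluating each successive page can be
sketched as below. The node_of[] array is a hypothetical stand-in for the
per-page node lookup and does not reflect the actual heap initialization
code:

```c
#include <stddef.h>

/*
 * Return how many pages, starting at page 0, sit on the same node as
 * page 0. node_of[i] is the (hypothetical) NUMA node of page i.
 */
size_t ex_same_node_run(const unsigned int *node_of, size_t nr_pages)
{
    size_t i;

    if ( nr_pages == 0 )
        return 0;

    for ( i = 1; i < nr_pages; i++ )
        /* Inspect page i itself; re-testing page 0 would never split. */
        if ( node_of[i] != node_of[0] )
            break;

    return i;
}
```

For a region whose pages sit on nodes {0, 0, 0, 1, 1} the run length is 3,
so the region would be split after the third page.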

Fixes: 72b02bc75b47 ("xen/heap: pass order to free_heap_pages() in heap init")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
2 years agoinclude: correct re-building conditions around hypercall-defs.h
Jan Beulich [Mon, 25 Jul 2022 13:46:21 +0000 (15:46 +0200)]
include: correct re-building conditions around hypercall-defs.h

For a .cmd file to be picked up, the respective target needs to be
listed in $(targets). This wasn't the case for hypercall-defs.i, leading
to permanent re-building even on an entirely unchanged tree (because of
the command apparently having changed).

In exchange the target doesn't need naming in $(clean-files) anymore.

Fixes: eca1f00d0227 ("xen: generate hypercall interface related code")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
2 years agoArm32: restore proper name of .dtb section start symbol
Jan Beulich [Mon, 25 Jul 2022 13:45:31 +0000 (15:45 +0200)]
Arm32: restore proper name of .dtb section start symbol

This addresses a build failure when CONFIG_DTB_FILE evaluates to a non-
empty string.

Fixes: d07358f2dccd ("xen/arm32: head.S: Introduce a macro to load the physical address of a symbol")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
2 years agoxen/mem_sharing: support forks with active vPMU state
Tamas K Lengyel [Mon, 25 Jul 2022 13:44:33 +0000 (15:44 +0200)]
xen/mem_sharing: support forks with active vPMU state

Currently the vPMU state from a parent isn't copied to VM forks. To enable the
vPMU state to be copied to a fork VM we export certain vPMU functions. First,
the vPMU context needs to be allocated for the fork if the parent has one. For
this we introduce vpmu->allocate_context, which has previously only been called
when the guest enables the PMU on itself. Furthermore, we export
vpmu_save_force so that the PMU context can be saved on-demand even if no
context switch took place on the parent's CPU yet. Additionally, we make sure
all relevant configuration MSRs are saved in the vPMU context so the copy is
complete and the fork starts with the same PMU config as the parent.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
2 years agogolang/xenlight: Update generated code
Oleksandr Tyshchenko [Mon, 25 Jul 2022 13:44:17 +0000 (15:44 +0200)]
golang/xenlight: Update generated code

Re-generate golang bindings to reflect changes to libxl_types.idl
from the following commit:
54d8f27d0477 tools/libxl: report trusted backend status to frontends

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
2 years ago VT-d: fold dma_pte_clear_one() into its only caller
Jan Beulich [Mon, 25 Jul 2022 13:43:35 +0000 (15:43 +0200)]
VT-d: fold dma_pte_clear_one() into its only caller

This way intel_iommu_unmap_page() ends up quite a bit more similar to
intel_iommu_map_page().

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago IOMMU/x86: add perf counters for page table splitting / coalescing
Jan Beulich [Mon, 25 Jul 2022 13:42:33 +0000 (15:42 +0200)]
IOMMU/x86: add perf counters for page table splitting / coalescing

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago VT-d: replace all-contiguous page tables by superpage mappings
Jan Beulich [Mon, 25 Jul 2022 13:41:48 +0000 (15:41 +0200)]
VT-d: replace all-contiguous page tables by superpage mappings

When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.
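
As a rough illustration of the condition described above, the following sketch scans a full table and, if every entry is present, carries identical attributes, and maps successive frames from a suitably aligned base, builds the entry that would replace the table one level up. The entry layout (DEMO_PRESENT, DEMO_SUPERPAGE, DEMO_ATTR_MASK) and the function name are made up for this example and are not the VT-d format or Xen's code; the actual patch relies on contiguity markers rather than a full scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout, for illustration only. */
#define DEMO_ENTRIES    512u
#define DEMO_PAGE_SHIFT 12
#define DEMO_PRESENT    (UINT64_C(1) << 0)
#define DEMO_SUPERPAGE  (UINT64_C(1) << 7)
#define DEMO_ATTR_MASK  UINT64_C(0xfff)

/*
 * Return true and fill *parent_entry when the whole table can be
 * expressed as one superpage entry at the next higher level.
 * The caller would then queue the now unused table for freeing.
 */
static bool demo_try_coalesce(const uint64_t table[DEMO_ENTRIES],
                              uint64_t *parent_entry)
{
    uint64_t attrs, base;

    assert(table != NULL && parent_entry != NULL);

    attrs = table[0] & DEMO_ATTR_MASK;
    base = table[0] & ~DEMO_ATTR_MASK;

    if (!(attrs & DEMO_PRESENT))
        return false;

    /* The first frame must be aligned to the size the table covers. */
    if (base & (((uint64_t)DEMO_ENTRIES << DEMO_PAGE_SHIFT) - 1))
        return false;

    /* Every further slot must map the next frame with the same attributes. */
    for (unsigned int i = 1; i < DEMO_ENTRIES; ++i)
        if (table[i] != ((base + ((uint64_t)i << DEMO_PAGE_SHIFT)) | attrs))
            return false;

    *parent_entry = base | attrs | DEMO_SUPERPAGE;
    return true;
}
```

A single differing attribute bit in any slot is enough to make the function refuse, which mirrors the "including all identical attributes" requirement.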

The adjustment to LEVEL_MASK is merely to avoid leaving a latent trap
for whenever we (and obviously hardware) start supporting 512G mappings.

Note that cache sync-ing is likely more strict than necessary. This is
both to be on the safe side as well as to maintain the pattern of all
updates of (potentially) live tables being accompanied by a flush (if so
needed).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago AMD/IOMMU: replace all-contiguous page tables by superpage mappings
Jan Beulich [Mon, 25 Jul 2022 13:41:12 +0000 (15:41 +0200)]
AMD/IOMMU: replace all-contiguous page tables by superpage mappings

When a page table ends up with all contiguous entries (including all
identical attributes), it can be replaced by a superpage entry at the
next higher level. The page table itself can then be scheduled for
freeing.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago VT-d: free all-empty page tables
Jan Beulich [Mon, 25 Jul 2022 13:40:41 +0000 (15:40 +0200)]
VT-d: free all-empty page tables

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.
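
The pruning step described above can be sketched roughly as follows. This is a simplified illustration with a hypothetical entry layout (DEMO_PRESENT) and a hypothetical function name; it re-scans the table after every clear, whereas the patch uses the contiguity markers to avoid exactly that kind of scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry layout, for illustration only. */
#define DEMO_ENTRIES 512u
#define DEMO_PRESENT (UINT64_C(1) << 0)

/*
 * Clear one slot. If no present entry is left afterwards, also replace
 * the parent's reference to this table by a non-present entry.
 * Returns true when the caller should queue the table for freeing.
 */
static bool demo_clear_and_prune(uint64_t table[DEMO_ENTRIES],
                                 unsigned int idx,
                                 uint64_t *parent_entry)
{
    assert(table != NULL && parent_entry != NULL);
    assert(idx < DEMO_ENTRIES);

    table[idx] = 0;

    for (unsigned int i = 0; i < DEMO_ENTRIES; ++i)
        if (table[i] & DEMO_PRESENT)
            return false;        /* still in use, keep the table */

    *parent_entry = 0;           /* non-present at the next higher level */
    return true;
}
```

Freeing is deliberately left to the caller in this sketch, matching the wording that the table is only "scheduled" for freeing.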

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.

Note further that while pt_update_contig_markers() updates perhaps
several PTEs within the table, since these are changes to "avail" bits
only I do not think that cache flushing would be needed afterwards. Such
cache flushing (of entire pages, unless adding yet more logic to be more
selective) would be quite noticeable performance-wise (very prominent
during Dom0 boot).

Also note that cache sync-ing is likely more strict than necessary. This
is both to be on the safe side as well as to maintain the pattern of all
updates of (potentially) live tables being accompanied by a flush (if so
needed).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago AMD/IOMMU: free all-empty page tables
Jan Beulich [Mon, 25 Jul 2022 13:40:00 +0000 (15:40 +0200)]
AMD/IOMMU: free all-empty page tables

When a page table ends up with no present entries left, it can be
replaced by a non-present entry at the next higher level. The page table
itself can then be scheduled for freeing.

Note that while its output isn't used there yet,
pt_update_contig_markers() right away needs to be called in all places
where entries get updated, not just the one where entries get cleared.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago IOMMU/x86: prefill newly allocated page tables
Jan Beulich [Mon, 25 Jul 2022 13:38:22 +0000 (15:38 +0200)]
IOMMU/x86: prefill newly allocated page tables

Page tables are used for two purposes after allocation: They either
start out all empty, or they are filled to replace a superpage.
Subsequently, to replace all empty or fully contiguous page tables,
contiguous sub-regions will be recorded within individual page tables.
Install the initial set of markers immediately after allocation. Make
sure to retain these markers when further populating a page table in
preparation for it to replace a superpage.

The markers are simply 4-bit fields holding the order value of
contiguous entries. To demonstrate this, if a page table had just 16
entries, this would be the initial (fully contiguous) set of markers:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, and identical
attributes, but also a respective number of all non-present (zero except
for the markers) entries.
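
The 16-entry table above follows a simple pattern: slot 0 holds the order of the whole table, and every other slot holds the number of trailing zero bits of its index. A minimal sketch that reproduces it is shown below; the function names are hypothetical and the markers are kept in a separate byte array here, whereas the actual code stores them in ignored bits of the page table entries.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Initial marker for slot idx of a table with (1 << table_order) entries:
 * the order of the largest naturally aligned block starting at idx.
 */
static unsigned int demo_initial_marker(unsigned int idx,
                                        unsigned int table_order)
{
    unsigned int order = 0;

    assert(table_order <= 15);              /* must fit a 4-bit field */
    assert(idx < (1u << table_order));

    if (idx == 0)
        return table_order;                 /* slot 0 covers the whole table */

    while (!(idx & 1)) {                    /* count trailing zero bits */
        idx >>= 1;
        ++order;
    }

    return order;
}

/* Fill markers[] for a freshly allocated (fully contiguous) table. */
static void demo_fill_initial_markers(unsigned char *markers,
                                      unsigned int table_order)
{
    assert(markers != NULL);

    for (unsigned int i = 0; i < (1u << table_order); ++i)
        markers[i] = (unsigned char)demo_initial_marker(i, table_order);
}
```

Calling demo_fill_initial_markers() with table_order 4 yields the marker row shown above; a real 512-entry table would use table_order 9.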

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago x86: introduce helper for recording degree of contiguity in page tables
Jan Beulich [Mon, 25 Jul 2022 13:37:34 +0000 (15:37 +0200)]
x86: introduce helper for recording degree of contiguity in page tables

This is a re-usable helper (kind of a template) which gets introduced
without users so that the individual subsequent patches introducing such
users can get committed independently of one another.

See the comment at the top of the new file. To demonstrate the effect,
if a page table had just 16 entries, this would be the set of markers
for a page table with fully contiguous mappings:

index  0 1 2 3 4 5 6 7 8 9 A B C D E F
marker 4 0 1 0 2 0 1 0 3 0 1 0 2 0 1 0

"Contiguous" here means not only present entries with successively
increasing MFNs, each one suitably aligned for its slot, but also a
respective number of all non-present entries.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
2 years ago VT-d: allow use of superpage mappings
Jan Beulich [Mon, 25 Jul 2022 13:36:33 +0000 (15:36 +0200)]
VT-d: allow use of superpage mappings

... depending on feature availability (and absence of quirks).

Also make the page table dumping function aware of superpages.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>