Roger Pau Monne [Wed, 22 Jun 2022 14:47:26 +0000 (16:47 +0200)]
x86/irq: do not set nr_irqs based on nr_irqs_gsi in APIC mode
When using an APIC do not set nr_irqs based on a factor of nr_irqs_gsi
(currently x8), and instead do so exclusively based on the amount of
available vectors on the system.
There's no point in setting nr_irqs to a value higher than the
available set of vectors, as vector allocation will fail anyway.
Roger Pau Monne [Wed, 22 Jun 2022 14:52:56 +0000 (16:52 +0200)]
x86/irq: print nr_irqs as limit on the number of MSI(-X) interrupts
Using nr_irqs minus nr_irqs_gsi is misleading, as GSI interrupts are
not allocated unless requested by the hardware domain, so a hardware
domain could not use any GSI (or just one for the ACPI SCI), and hence
(almost) all nr_irqs will be available for MSI(-X) usage.
No functional difference, just affects the printed message.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 22 Jun 2022 12:51:15 +0000 (14:51 +0200)]
x86/irq: fix setting irq limits
Current code to calculate nr_irqs assumes the APIC destination mode to
be physical, so all vectors on each possible CPU is available for use
by a different interrupt source. This is not true when using Logical
(Cluster) destination mode, where CPUs in the same cluster share the
vector space.
Fix by calculating the maximum Cluster ID and use it to derive the
number of clusters in the system. Note the code assumes Cluster IDs to
be contiguous, or else we will set nr_irqs to a number higher than the
real amount of vectors (still not fatal).
The number of clusters is then used instead of the number of present
CPUs when calculating the value of nr_irqs.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 22 Jun 2022 08:07:23 +0000 (10:07 +0200)]
x86/setup: init nr_irqs after having detected x2APIC support
Logic in ioapic_init() that sets the number of available vectors for
external interrupts requires knowing the x2APIC Destination Mode. As
such move the call after x2APIC BSP setup.
Do it as part of init_irq_data(), which is called just after x2APIC
BSP init and also makes use of nr_irqs itself.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 22 Jun 2022 12:27:55 +0000 (14:27 +0200)]
x86/x2apic: use physical destination mode by default
Using cluster mode by default greatly limits the amount of vectors
available, as then vector space is shared amongst all the CPUs in the
logical cluster.
This can lead to vector shortage issues on boxes with not a huge
amount of CPUs but with a non-trivial amount of devices, there are
reports of boxes with 32 CPUs (2 logical clusters, and thus only 414
dynamic vectors) that run out of vectors and fail to setup interrupts
for dom0.
This could be considered as a regression when switching from xAPIC
mode, as when using xAPIC only physical mode is supported.
Switch default Kconfig selection to use x2APIC physical mode.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monné [Mon, 20 Jun 2022 12:35:00 +0000 (14:35 +0200)]
tools/include: drop leading underscore from xen_list header
A leading underscore is used to indicate auto generated headers, and
the clean use of 'rm -f _*.h' will remove those. _xen_list.h also
uses a leading underscore, but is checked in the repo and as such
cannot be removed as part of the clean rule.
Fix this by dropping the leading underscore, so that the header is not
removed.
Wei Chen [Fri, 10 Jun 2022 05:53:16 +0000 (13:53 +0800)]
xen/x86: use INFO level for node's without memory log message
In previous code, Xen was using KERN_WARNING for log message
when Xen found a node without memory. Xen will print this
warning message, and said that this may be an BIOS Bug or
mis-configured hardware. But actually, this warning is bogus,
because in an NUMA setting, nodes can only have processors,
and with 0 bytes memory. So it is unreasonable to warn about
BIOS or hardware corruption based on the detection of node
with 0 bytes memory.
So in this patch, we remove the warning messages, but just
keep an info message to info users that there is one or more
nodes with 0 bytes memory in the system.
Signed-off-by: Wei Chen <wei.chen@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Wei Chen [Fri, 10 Jun 2022 05:53:15 +0000 (13:53 +0800)]
xen/x86: add detection of memory interleaves for different nodes
One NUMA node may contain several memory blocks. In current Xen
code, Xen will maintain a node memory range for each node to cover
all its memory blocks. But here comes the problem, in the gap of
one node's two memory blocks, if there are some memory blocks don't
belong to this node (remote memory blocks). This node's memory range
will be expanded to cover these remote memory blocks.
One node's memory range contains other nodes' memory, this is
obviously not very reasonable. This means current NUMA code only
can support node has no interleaved memory blocks. However, on a
physical machine, the addresses of multiple nodes can be interleaved.
So in this patch, we add code to detect memory interleaves of
different nodes. NUMA initialization will be failed and error
messages will be printed when Xen detect such hardware configuration.
As we have checked the node's range before, for a non-empty node,
the "nd->end == end && nd->start == start" check is unnecesary.
So we remove it from conflicting_memblks as well.
Wei Chen [Fri, 10 Jun 2022 05:53:14 +0000 (13:53 +0800)]
xen/x86: use paddr_t for addresses in NUMA node structure
NUMA node structure "struct node" is using u64 as node memory
range. In order to make other architectures can reuse this
NUMA node relative code, we replace the u64 to paddr_t. And
use pfn_to_paddr and paddr_to_pfn to replace explicit shift
operations. The relate PRIx64 in print messages have been
replaced by PRIpaddr at the same time. And some being-phased-out
types like u64 in the lines we have touched also have been
converted to uint64_t or unsigned long.
Wei Chen [Fri, 10 Jun 2022 05:53:13 +0000 (13:53 +0800)]
xen/arm: use !CONFIG_NUMA to keep fake NUMA API
We have introduced CONFIG_NUMA in a previous patch. And this
option is enabled only on x86 at the current stage. In a follow
up patch, we will enable this option for Arm. But we still
want users to be able to disable the CONFIG_NUMA via Kconfig. In
this case, keep the fake NUMA API, will make Arm code still
able to work with NUMA aware memory allocation and scheduler.
Wei Chen [Fri, 10 Jun 2022 05:53:12 +0000 (13:53 +0800)]
xen: decouple NUMA from ACPI in Kconfig
In current Xen code only implements x86 ACPI-based NUMA support.
So in Xen Kconfig system, NUMA equals to ACPI_NUMA. x86 selects
NUMA by default, and CONFIG_ACPI_NUMA is hardcode in config.h.
In a follow-up patch, we will introduce support for NUMA using
the device tree. That means we will have two NUMA implementations,
so in this patch we decouple NUMA from ACPI based NUMA in Kconfig.
Make NUMA as a common feature, that device tree based NUMA also
can select it.
Wei Chen [Fri, 10 Jun 2022 05:53:11 +0000 (13:53 +0800)]
xen: introduce an arch helper for default dma zone status
In current code, when Xen is running in a multiple nodes
NUMA system, it will set dma_bitsize in end_boot_allocator
to reserve some low address memory as DMA zone.
There are some x86 implications in the implementation.
Because on x86, memory starts from 0. On a multiple-nodes
NUMA system, if a single node contains the majority or all
of the DMA memory, x86 prefers to give out memory from
non-local allocations rather than exhausting the DMA memory
ranges. Hence x86 uses dma_bitsize to set aside some largely
arbitrary amount of memory for DMA zone. The allocations
from DMA zone would happen only after exhausting all other
nodes' memory.
But the implications are not shared across all architectures.
For example, Arm cannot guarantee the availability of memory
below a certain boundary for DMA limited-capability devices
either. But currently, Arm doesn't need a reserved DMA zone
in Xen. Because there is no DMA device in Xen. And for guests,
Xen Arm only allows Dom0 to have DMA operations without IOMMU.
Xen will try to allocate memory under 4GB or memory range that
is limited by dma_bitsize for Dom0 in boot time. For DomU, even
Xen can passthrough devices to DomU without IOMMU, but Xen Arm
doesn't guarantee their DMA operations. So, Xen Arm doesn't
need a reserved DMA zone to provide DMA memory for guests.
In this patch, we introduce an arch_want_default_dmazone helper
for different architectures to determine whether they need to
set dma_bitsize for DMA zone reservation or not.
At the same time, when x86 Xen is built with CONFIG_PV=n could
probably leverage this new helper to actually not trigger DMA
zone reservation.
Wei Chen [Fri, 10 Jun 2022 05:53:10 +0000 (13:53 +0800)]
xen/arm: Keep memory nodes in device tree when Xen boots from EFI
In current code, when Xen is booting from EFI, it will delete
all memory nodes in device tree. This would work well in current
stage, because Xen can get memory map from EFI system table.
However, EFI system table cannot completely replace memory nodes
of device tree. EFI system table doesn't contain memory NUMA
information. Xen depends on ACPI SRAT or device tree memory nodes
to parse memory blocks' NUMA mapping. So in EFI + DTB boot, Xen
doesn't have any method to get numa-node-id for memory blocks any
more. This makes device tree based NUMA support become impossible
for Xen in EFI + DTB boot.
So in this patch, we will keep memory nodes in device tree for
NUMA code to parse memory numa-node-id later.
As a side effect, if we still parse boot memory information in
early_scan_node, bootmem.info will calculate memory ranges in
memory nodes twice. So we have to prevent early_scan_node to
parse memory nodes in EFI boot.
Wei Chen [Fri, 10 Jun 2022 05:53:09 +0000 (13:53 +0800)]
xen: reuse x86 EFI stub functions for Arm
x86 is using compiler feature testing to decide EFI build
enable or not. When EFI build is disabled, x86 will use an
efi/stub.c file to replace efi/runtime.c for build objects.
Following this idea, we introduce a stub file for Arm, but
use CONFIG_ARM_EFI to decide EFI build enable or not.
And the most functions in x86 EFI stub.c can be reused for
other architectures, like Arm. So we move them to common
and keep the x86 specific function in x86/efi/stub.c.
To avoid the symbol link conflict error when linking common
stub files to x86/efi. We add a regular file check in efi
stub files' link script. Depends on this check we can bypass
the link behaviors for existed stub files in x86/efi.
As there is no Arm specific EFI stub function for Arm in
current stage, Arm still can use the existed symbol link
method for EFI stub files.
Anthony PERARD [Fri, 25 Feb 2022 15:13:21 +0000 (15:13 +0000)]
tools/ocaml: fix build dependency target
They are two competiting spelling for the variable holding the path to
"tools/ocaml", $(TOPLEVEL) and $(OCAML_TOPLEVEL). The "Makefile.rules"
which is included in all ocaml Makefiles have one rule which make use
of that variable which is then sometime unset. When building
"ocaml/xenstored", make isn't capable of generating ".ocamldep.make"
because $(TOPLEVEL) isn't defined in this case.
This can fail with an error like this when paths.ml have been
regenerated:
Error: Files define.cmx and paths.cmx
make inconsistent assumptions over interface Paths
This patch fix ".ocamldep.make" rule by always spelling the variable
$(OCAML_TOPLEVEL).
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
When we will convert tools/ build system, their will be a need to
replace some use of "vpath". This will done making symbolic links.
Those symlinks are not wanted by stubdom build system when making a
linkfarm for the Xen libraries. To avoid them, we will use `find`
instead of plain shell globbing.
For example, there will be a link to "xen/lib/x86/cpuid.o" in
"tools/libs/guest/".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Anthony PERARD [Fri, 25 Feb 2022 15:13:16 +0000 (15:13 +0000)]
stubdom: introduce xenlibs.mk
This new makefile will be used to build libraries that provides
"Makefile.common".
At some point, we will be converting Makefile in tools/ to "subdirmk"
and stubdom build will not be able to use those new makefiles, so we
will put the necessary information for stubdom to build the xen
libraries into a new Makefile.common and xenlibs.mk will use it.
We only need to build static libraries and don't need anything else.
The check for the presence of "Makefile.common" will go aways once
there is one for all libraries used by stubdom build.
Also remove DESTDIR= from "clean" targets, we don't do installation in
this recipe so the value of DESTDIR doesn't matter.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Anthony PERARD [Fri, 25 Feb 2022 15:13:14 +0000 (15:13 +0000)]
libs/stat: Fix and rework perl-binding build
For PERL_FLAGS, use make's shell rather than a backquote.
Rather than relying on the VCS to create an empty directory for us,
we can create one before generating the *.c file for the bindings.
Make use of generic variable names to build a shared library from a
source file: CFLAGS, LDFLAGS, and LDLIBS.
To build a shared library, we need to build the source file with
"-fPIC", which was drop by 6d0ec05390 (tools: split libxenstat into
new tools/libs/stat directory).
The source file generated by swig seems to be missing many prototype for
many functions, so we need "-Wno-missing-prototypes" in order to
build it. Also, one of the prototype is deemed malformed, so we also
need "-Wno-strict-prototypes".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Fri, 25 Feb 2022 15:13:13 +0000 (15:13 +0000)]
libs/stat: Fix and rework python-bindings build
Fix the dependency on the library, $(SHLIB) variable doesn't exist
anymore.
Rework dependency on the include file, we can let `swig` generate the
dependency for us with the use of "-M*" flags.
The xenstat.h file has moved so we need to fix the include location.
Rather than relaying on the VCS to create an empty directory for us,
we can create one before generating the *.c file for the bindings.
Make use of generic variable names to build a shared library from a
source file: CFLAGS, LDFLAGS, and LDLIBS.
Fix python's specific *flags by using python-config, and add them to
generic flags variables: CFLAGS, LDLIBS.
To build a shared library, we need to build the source file with
"-fPIC", which was drop by 6d0ec05390 (tools: split libxenstat into
new tools/libs/stat directory).
The source file generated by swig seems to be missing a prototype for
the "init" function, so we need "-Wno-missing-prototypes" in order to
build it.
Add some targets to .PHONY.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Fri, 25 Feb 2022 15:13:09 +0000 (15:13 +0000)]
libs: Rename $(SRCS-y) to $(OBJS-y)
The only thing done thing done with $(SRCS-y) is to replace ".c" by
".o". It is more useful to collect which object we want to build as
make will figure out how to build it and from which source file.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Fri, 25 Feb 2022 15:13:03 +0000 (15:13 +0000)]
tools/xenstore: Cleanup makefile
Regroup *FLAGS together, use $(LDLIBS).
Remove $(LDLIBS_xenstored) which was the wrong name name as it doesn't
decribe how to link to a potential libxenstored.so, instead add the
value to $(LDLIBS) of xenstored.
Add SYSTEMD_LIBS into $(LDLIBS) instead of $(LDFLAGS).
Remove the "-I." from $(CFLAGS), it shouldn't be needed.
Removed $(CFLAGS-y) and $(LDFLAGS-y). $(CFLAGS-y) is already included
in $(CFLAGS) and both aren't used anyway.
Rename ALL_TARGETS to TARGETS.
Only add programmes we want to build in $(TARGETS), not phony-targets
(replace "clients").
Store all `xenstored` objs into $(XENSTORED_OBJS-y).
Replace one $< by $^ even if there's only one dependency,
(xenstore-control).
clean: "init-xenstore-domain" isn't built here any more, so stop
trying to remove it, remove $(TARGETS). Also regroup all files to be
removed in one command, using $(RM).
Drop $(MAJOR) and $(MINOR), they aren't used anymore.
Drop ".SECONDARY:", it doesn't appear there's intermediate files that
would be deleted anymore.
Drop "tarball:" target.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Fixes: 322ecbe4ac85 ("console: add EHCI debug port based serial console") Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 13 Jun 2022 18:18:32 +0000 (19:18 +0100)]
x86/spec-ctrl: Add spec-ctrl=unpriv-mmio
Per Xen's support statement, PCI passthrough should be to trusted domains
because the overall system security depends on factors outside of Xen's
control.
As such, Xen, in a supported configuration, is not vulnerable to DRPW/SBDR.
However, users who have risk assessed their configuration may be happy with
the risk of DoS, but unhappy with the risk of cross-domain data leakage. Such
users should enable this option.
On CPUs vulnerable to MDS, the existing mitigations are the best we can do to
mitigate MMIO cross-domain data leakage.
On CPUs fixed to MDS but vulnerable MMIO stale data leakage, this option:
* On CPUs susceptible to FBSDP, mitigates cross-domain fill buffer leakage
using FB_CLEAR.
* On CPUs susceptible to SBDR, mitigates RNG data recovery by engaging the
srb-lock, previously used to mitigate SRBDS.
Both mitigations require microcode from IPU 2022.1, May 2022.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 20 Sep 2021 17:47:49 +0000 (18:47 +0100)]
x86/spec-ctrl: Enumeration for MMIO Stale Data controls
The three *_NO bits indicate non-susceptibility to the SSDP, FBSDP and PSDP
data movement primitives.
FB_CLEAR indicates that the VERW instruction has re-gained it's Fill Buffer
flushing side effect. This is only enumerated on parts where VERW had
previously lost it's flushing side effect due to the MDS/TAA vulnerabilities
being fixed in hardware.
FB_CLEAR_CTRL is available on a subset of FB_CLEAR parts where the Fill Buffer
clearing side effect of VERW can be turned off for performance reasons.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 13 Jun 2022 15:19:01 +0000 (16:19 +0100)]
x86/spec-ctrl: Make VERW flushing runtime conditional
Currently, VERW flushing to mitigate MDS is boot time conditional per domain
type. However, to provide mitigations for DRPW (CVE-2022-21166), we need to
conditionally use VERW based on the trustworthiness of the guest, and the
devices passed through.
Remove the PV/HVM alternatives and instead issue a VERW on the return-to-guest
path depending on the SCF_verw bit in cpuinfo spec_ctrl_flags.
Introduce spec_ctrl_init_domain() and d->arch.verw to calculate the VERW
disposition at domain creation time, and context switch the SCF_verw bit.
For now, VERW flushing is used and controlled exactly as before, but later
patches will add per-domain cases too.
No change in behaviour.
This is part of XSA-404.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Bertrand Marquis [Mon, 13 Jun 2022 12:53:14 +0000 (13:53 +0100)]
arm: Define kconfig symbols used by arm64 cpufeatures
Define kconfig symbols which are used by arm64 cpufeatures to prevent
using undefined symbols and rely on IS_ENABLED returning false.
All the features related to those symbols are unsupported by Xen:
- pointer authentication
- sve
- memory tagging
- branch target identification
Bertrand Marquis [Mon, 13 Jun 2022 12:53:13 +0000 (13:53 +0100)]
arm: add ISAR2, MMFR0 and MMFR1 fields in cpufeature
Complete AA64ISAR2 and AA64MMFR[0-1] with more fields.
While there add a comment for MMFR bitfields as for other registers in
the cpuinfo structure definition.
Bertrand Marquis [Mon, 13 Jun 2022 12:53:12 +0000 (13:53 +0100)]
xen/arm: Add sb instruction support
This patch is adding sb instruction support when it is supported by a
CPU on arm64.
A new cpu feature capability system is introduced to enable alternative
code using sb instruction when it is supported by the processor. This is
decided based on the isa64 system register value and use a new hardware
capability ARM_HAS_SB.
The sb instruction is encoded using its hexadecimal value to avoid
recursive macro and support old compilers not having support for sb
instruction.
Arm32 instruction support is added but it is not enabled at the moment
due to the lack of hardware supporting it.
Bertrand Marquis [Mon, 13 Jun 2022 12:53:11 +0000 (13:53 +0100)]
xen/arm: Sync sysregs and cpuinfo with Linux 5.18-rc3
Sync existing ID registers sanitization with the status of Linux kernel
version 5.18-rc3 and add sanitization of ISAR2 registers.
Sync sysregs.h bit shift defintions with the status of Linux kernel
version 5.18-rc3.
Changes in this patch are splitted in a number of patches in Linux
kernel and, as the previous synchronisation point was not clear, the
changes are done in one patch with a status possible to compare easily
by diffing Xen files to Linux kernel files.
Anthony PERARD [Wed, 15 Jun 2022 08:24:06 +0000 (10:24 +0200)]
build: remove auto.conf prerequisite from compat/xlat.h target
Now that the command line generating "xlat.h" is check on rebuild, the
header will be regenerated whenever the list of xlat headers changes
due to change in ".config". We don't need to force a regeneration for
every changes in ".config".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 15 Jun 2022 08:23:16 +0000 (10:23 +0200)]
build: fix exporting for make 3.82
GNU make 3.82 apparently has a quirk where exporting an undefined
variable prevents its value from subsequently being updated. This
situation can arise due to our adding of -rR to MAKEFLAGS, which takes
effect also on make simply re-invoking itself. Once these flags are in
effect, CC (in particular) is empty (undefined), and would be defined
only via Config.mk including StdGNU.mk or alike. With the quirk, CC
remains empty, yet with an empty CC the compiler minimum version check
fails, breaking the build.
Move the exporting of the various tool stack component variables past
where they gain their (final) values.
See also be63d9d47f57 ("build: tweak variable exporting for make 3.82").
Fixes: 15a0578ca4b0 ("build: shuffle main Makefile") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jane Malalane [Wed, 15 Jun 2022 08:21:08 +0000 (10:21 +0200)]
x86/hvm: widen condition for is_hvm_pv_evtchn_domain() and report fix in CPUID
Have is_hvm_pv_evtchn_domain() return true for vector callbacks for
evtchn delivery set up on a per-vCPU basis via
HVMOP_set_evtchn_upcall_vector.
Assume that if vCPU0 uses HVMOP_set_evtchn_upcall_vector, all
remaining vCPUs will too and thus remove is_hvm_pv_evtchn_vcpu() and
replace sole caller with is_hvm_pv_evtchn_domain().
is_hvm_pv_evtchn_domain() returning true is a condition for setting up
physical IRQ to event channel mappings. Therefore, also add a CPUID
bit so that guests know whether the check in is_hvm_pv_evtchn_domain()
will fail when using HVMOP_set_evtchn_upcall_vector. This matters for
guests that route PIRQs over event channels since
is_hvm_pv_evtchn_domain() is a condition in physdev_map_pirq().
The naming of the CPUID bit is quite generic about upcall support
being available. That's done so that the define name doesn't become
overly long.
A guest that doesn't care about physical interrupts routed over event
channels can just test for the availability of the hypercall directly
(HVMOP_set_evtchn_upcall_vector) without checking the CPUID bit.
Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Wed, 15 Jun 2022 08:19:32 +0000 (10:19 +0200)]
IOMMU/x86: work around bogus gcc12 warning in hvm_gsi_eoi()
As per [1] the expansion of the pirq_dpci() macro causes a -Waddress
controlled warning (enabled implicitly in our builds, if not by default)
tying the middle part of the involved conditional expression to the
surrounding boolean context. Work around this by introducing a local
inline function in the affected source file.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102967
Julien Grall [Sat, 11 Jun 2022 11:26:18 +0000 (12:26 +0100)]
xen/arm: mm: Rework setup_xenheap_mappings()
The current implementation of setup_xenheap_mappings() is using 1GB
mappings. This can lead to unexpected result because the mapping
may alias a non-cachable region (such as device or reserved regions).
For more details see B2.8 in ARM DDI 0487H.a.
map_pages_to_xen() was recently reworked to allow superpage mappings,
support contiguous mapping and deal with the use of page-tables before
they are mapped.
Most of the code in setup_xenheap_mappings() is now replaced with a
single call to map_pages_to_xen().
Julien Grall [Sat, 11 Jun 2022 11:25:32 +0000 (12:25 +0100)]
xen/arm64: mm: Add memory to the boot allocator first
Currently, memory is added to the boot allocator after the xenheap
mappings are done. This will break if the first mapping is more than
512GB of RAM.
In addition to that, a follow-up patch will rework setup_xenheap_mappings()
to use smaller mappings (e.g. 2MB, 4KB). So it will be necessary to have
memory in the boot allocator earlier.
Only free memory (e.g. not reserved or modules) can be added to the boot
allocator. It might be possible that some regions (including the first
one) will have no free memory.
So we need to add all the free memory to the boot allocator first
and then add do the mappings.
Populating the boot allocator is nearly the same between arm32 and
arm64. The only difference is on the former we need to exclude the
xenheap for the boot allocator. Gate the difference with CONFIG_ARM_32
so the code be re-used on arm64.
Julien Grall [Sat, 11 Jun 2022 11:24:49 +0000 (12:24 +0100)]
xen/arm32: setup: Move out the code to populate the boot allocator
In a follow-up patch, we will want to populate the boot allocator
separately for arm64. The code will end up to be very similar to the one
on arm32. So move out the code in a new helper populate_boot_allocator().
For now the code is still protected by CONFIG_ARM_32 to avoid any build
failure on arm64.
Take the opportunity to replace mfn_add(xen_mfn_start, xenheap_pages) with
xenheap_mfn_end as they are equivalent.
Julien Grall [Sat, 11 Jun 2022 11:21:17 +0000 (12:21 +0100)]
xen/arm: mm: Use the PMAP helpers in xen_{,un}map_table()
During early boot, it is not possible to use xen_{,un}map_table()
if the page tables are not residing the Xen binary.
This is a blocker to switch some of the helpers to use xen_pt_update()
as we may need to allocate extra page tables and access them before
the domheap has been initialized (see setup_xenheap_mappings()).
xen_{,un}map_table() are now updated to use the PMAP helpers for early
boot map/unmap. Note that the special case for page-tables residing
in Xen binary has been dropped because it is "complex" and was
only added as a workaround in 8d4f1b8878e0 ("xen/arm: mm: Allow
generic xen page-tables helpers to be called early").
Julien Grall [Sat, 11 Jun 2022 11:21:09 +0000 (12:21 +0100)]
xen/arm: mm: Clean-up the includes and order them
The numbers of includes in mm.c has been growing quite a lot. However
some of them (e.g. xen/device_tree.h, xen/softirq.h) doesn't look
to be directly used by the file or other will be included by
larger headers (e.g asm/flushtlb.h will be included by xen/mm.h).
So trim down the number of includes. Take the opportunity to order
them with the xen headers first, then asm headers and last public
headers.
Wei Liu [Sat, 11 Jun 2022 11:20:33 +0000 (12:20 +0100)]
xen/arm: add Persistent Map (PMAP) infrastructure
The basic idea is like Persistent Kernel Map (PKMAP) in Linux. We
pre-populate all the relevant page tables before the system is fully
set up.
We will need it on Arm in order to rework the arm64 version of
xenheap_setup_mappings() as we may need to use pages allocated from
the boot allocator before they are effectively mapped.
This infrastructure is not lock-protected therefore can only be used
before smpboot. After smpboot, map_domain_page() has to be used.
This is based on the x86 version [1] that was originally implemented
by Wei Liu.
The PMAP infrastructure is implemented in common code with some
arch helpers to set/clear the page-table entries and convertion
between a fixmap slot to a virtual address...
As mfn_to_xen_entry() now needs to be exported, take the opportunity
to swich the parameter attr from unsigned to unsigned int.
Jan Beulich [Fri, 10 Jun 2022 08:24:21 +0000 (10:24 +0200)]
x86emul/test: encourage compiler to use more embedded broadcast
For one it was an oversight to leave dup_{hi,lo}() undefined for 512-bit
vector size. And then in FMA testing we can also arrange for the
compiler to (hopefully) recognize broadcasting potential. Plus we can
replace the broadcast(1) use in the addsub() surrogate with inline
assembly explicitly using embedded broadcast (even gcc12 still doesn't
support broadcast for any of the addsub/subadd builtins).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 10 Jun 2022 08:21:06 +0000 (10:21 +0200)]
x86/mm: account for PGT_pae_xen_l2 in recently added assertion
While PGT_pae_xen_l2 will be zapped once the type refcount of an L2 page
reaches zero, it'll be retained as long as the type refcount is non-
zero. Hence any checking against the requested type needs to either zap
the bit from the type or include it in the used mask.
Fixes: 9186e96b199e ("x86/pv: Clean up _get_page_type()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 9 Jun 2022 12:23:37 +0000 (14:23 +0200)]
x86/pv: Track and flush non-coherent mappings of RAM
There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with
devices that make non-coherent writes. The Linux sound subsystem makes
extensive use of this technique.
For such usecases, the guest's DMA buffer is mapped and consistently used as
WC, and Xen doesn't interact with the buffer.
However, a mischevious guest can use WC mappings to deliberately create
non-coherency between the cache and RAM, and use this to trick Xen into
validating a pagetable which isn't actually safe.
Allocate a new PGT_non_coherent to track the non-coherency of mappings. Set
it whenever a non-coherent writeable mapping is created. If the page is used
as anything other than PGT_writable_page, force a cache flush before
validation. Also force a cache flush before the page is returned to the heap.
This is CVE-2022-26364, part of XSA-402.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 9 Jun 2022 12:23:07 +0000 (14:23 +0200)]
x86/amd: Work around CLFLUSH ordering on older parts
On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakely ordered with everything,
including reads and writes to the address, and LFENCE/SFENCE instructions.
This creates a multitude of problematic corner cases, laid out in the manual.
Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering.
This is part of XSA-402.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 9 Jun 2022 12:22:38 +0000 (14:22 +0200)]
x86: Split cache_flush() out of cache_writeback()
Subsequent changes will want a fully flushing version.
Use the new helper rather than opencoding it in flush_area_local(). This
resolves an outstanding issue where the conditional sfence is on the wrong
side of the clflushopt loop. clflushopt is ordered with respect to older
stores, not to younger stores.
Rename gnttab_cache_flush()'s helper to avoid colliding in name.
grant_table.c can see the prototype from cache.h so the build fails
otherwise.
This is part of XSA-402.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 9 Jun 2022 12:22:08 +0000 (14:22 +0200)]
x86: Don't change the cacheability of the directmap
Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings
in response to guest mapping requests") attempted to keep the cacheability
consistent between different mappings of the same page.
The reason wasn't described in the changelog, but it is understood to be in
regards to a concern over machine check exceptions, owing to errata when using
mixed cacheabilities. It did this primarily by updating Xen's mapping of the
page in the direct map when the guest mapped a page with reduced cacheability.
Unfortunately, the logic didn't actually prevent mixed cacheability from
occurring:
* A guest could map a page normally, and then map the same page with
different cacheability; nothing prevented this.
* The cacheability of the directmap was always latest-takes-precedence in
terms of guest requests.
* Grant-mapped frames with lesser cacheability didn't adjust the page's
cacheattr settings.
* The map_domain_page() function still unconditionally created WB mappings,
irrespective of the page's cacheattr settings.
Additionally, update_xen_mappings() had a bug where the alias calculation was
wrong for mfn's which were .init content, which should have been treated as
fully guest pages, not Xen pages.
Worse yet, the logic introduced a vulnerability whereby necessary
pagetable/segdesc adjustments made by Xen in the validation logic could become
non-coherent between the cache and main memory. The CPU could subsequently
operate on the stale value in the cache, rather than the safe value in main
memory.
The directmap contains primarily mappings of RAM. PAT/MTRR conflict
resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser
cacheability resolves to being coherent. The special case is WC mappings,
which are non-coherent against MTRR=WB regions (except for fully-coherent
CPUs).
Xen must not have any WC cacheability in the directmap, to prevent Xen's
actions from creating non-coherency. (Guest actions creating non-coherency is
dealt with in subsequent patches.) As all memory types for MTRR=WB ranges
inter-operate coherently, so leave Xen's directmap mappings as WB.
Only PV guests with access to devices can use reduced-cacheability mappings to
begin with, and they're trusted not to mount DoSs against the system anyway.
Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them.
Shift the later PGC_* constants up, to gain 3 extra bits in the main reference
count. Retain the check in get_page_from_l1e() for special_pages() because a
guest has no business using reduced cacheability on these.
Andrew Cooper [Thu, 9 Jun 2022 12:21:04 +0000 (14:21 +0200)]
x86/pv: Fix ABAC cmpxchg() race in _get_page_type()
_get_page_type() suffers from a race condition where it incorrectly assumes
that because 'x' was read and a subsequent a cmpxchg() succeeds, the type
cannot have changed in-between. Consider:
CPU A:
1. Creates an L2e referencing pg
`-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page
2. Issues flush_tlb_mask()
CPU B:
3. Creates a writeable mapping of pg
`-> _get_page_type(pg, PGT_writable_page), count increases to 1
4. Writes into new mapping, creating a TLB entry for pg
5. Removes the writeable mapping of pg
`-> _put_page_type(pg), count goes back down to 0
CPU A:
7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table
CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and
suitably protected (i.e. read-only). The TLB flush in step 2 must be deferred
until after the guest is prohibited from creating new writeable mappings,
which is after step 7.
Defer all safety actions until after the cmpxchg() has successfully taken the
intended typeref, because that is what prevents concurrent users from using
the old type.
Also remove the early validation for writeable and shared pages. This removes
race conditions where one half of a parallel mapping attempt can return
successfully before:
* The IOMMU pagetables are in sync with the new page type
* Writeable mappings to shared pages have been torn down
This is part of XSA-401 / CVE-2022-26362.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Thu, 9 Jun 2022 12:20:36 +0000 (14:20 +0200)]
x86/pv: Clean up _get_page_type()
Various fixes for clarity, ahead of making complicated changes.
* Split the overflow check out of the if/else chain for type handling, as
it's somewhat unrelated.
* Comment the main if/else chain to explain what is going on. Adjust one
ASSERT() and state the bit layout for validate-locked and partial states.
* Correct the comment about TLB flushing, as it's backwards. The problem
case is when writeable mappings are retained to a page becoming read-only,
as it allows the guest to bypass Xen's safety checks for updates.
* Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not
valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat
all reads as explicitly volatile. The only thing preventing the validated
wait-loop being infinite is the compiler barrier hidden in cpu_relax().
* Replace one page_get_owner(page) with the already-calculated 'd' already in
scope.
No functional change.
This is part of XSA-401 / CVE-2022-26362.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 9 Jun 2022 08:56:08 +0000 (10:56 +0200)]
VT-d: fold iommu_flush_iotlb{,_pages}()
With iommu_flush_iotlb_all() gone, iommu_flush_iotlb_pages() is merely a
wrapper around the not otherwise called iommu_flush_iotlb(). Fold both
functions.
No functional change intended.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Thu, 9 Jun 2022 08:54:30 +0000 (10:54 +0200)]
IOMMU: fold flush-all hook into "flush one"
Having a separate flush-all hook has always been puzzling me some. We
will want to be able to force a full flush via accumulated flush flags
from the map/unmap functions. Introduce a respective new flag and fold
all flush handling to use the single remaining hook.
Note that because of the respective comments in SMMU and IPMMU-VMSA
code, I've folded the two prior hook functions into one. For SMMU-v3,
which lacks a comment towards incapable hardware, I've left both
functions in place on the assumption that selective and full flushes
will eventually want separating.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # IPMMU-VMSA, SMMU-V2 Reviewed-by: Rahul Singh <rahul.singh@arm.com> # SMMUv3 Acked-by: Julien Grall <jgrall@amazon.com> # Arm Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Thu, 9 Jun 2022 08:54:01 +0000 (10:54 +0200)]
SUPPORT.md: extend security support for x86 hosts to 12 TiB of memory
c49ee0329ff3 ("SUPPORT.md: limit security support for hosts with very
much memory"), as a result of XSA-385, restricted security support to
8 TiB of host memory. While subsequently further restricted for Arm,
extend this to 12 TiB on x86, putting in place a guest restriction to
8 TiB (or yet less for Arm) in exchange.
A 12 TiB x86 host was certified successfully for use with Xen 4.14 as
per https://www.suse.com/nbswebapp/yesBulletin.jsp?bulletinNumber=150753.
This in particular included running as many guests (2 TiB each) as
possible in parallel, to actually prove that all the memory can be used
like this. It may be relevant to note that the Optane memory there was
used in memory-only mode, with DRAM acting as cache.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Wed, 8 Jun 2022 15:03:32 +0000 (17:03 +0200)]
IOMMU/x86: perform PV Dom0 mappings in batches
For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).
The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.
There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Wed, 8 Jun 2022 15:02:19 +0000 (17:02 +0200)]
IOMMU/x86: restrict IO-APIC mappings for PV Dom0
While already the case for PVH, there's no reason to treat PV
differently here, though of course the addresses get taken from another
source in this case. Except that, to match CPU side mappings, by default
we permit r/o ones. This then also means we now deal consistently with
IO-APICs whose MMIO is or is not covered by E820 reserved regions.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Anthony PERARD [Wed, 8 Jun 2022 15:00:29 +0000 (17:00 +0200)]
build: xen/include: use if_changed
Use "define" for the headers*_chk commands. That allow us to keep
writing "#include" in the Makefile without having to replace that by
"$(pound)include" which would be a bit less obvious about the command
line purpose.
Adding several .PRECIOUS as without them `make` deletes the
intermediate targets. This is an issue because the macro $(if_changed,)
check if the target exist in order to decide whether to recreate the
target.
Removing the call to `mkdir` from the commands. Those aren't needed
anymore because a rune in Rules.mk creates the directory for each
$(targets).
Remove "export PYTHON" as it is already exported.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Commit 92ea9c54fc81 ("arm/dom0less: assign dom0less guests to cpupools")
introduced a way to start a domain directly on a certain cpupool,
adding a "cpupool_id" member to struct xen_domctl_createdomain.
This was done to be able to start dom0less guests in different pools than
cpupool0, but the toolstack can benefit from it because it can now use
the struct member directly instead of creating the guest in cpupool0
and then moving it to the target cpupool.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Michal Orzel [Mon, 23 May 2022 09:13:24 +0000 (11:13 +0200)]
xen/arm: Allow setting the number of CPUs to activate at runtime
Introduce a command line parameter "maxcpus" on Arm to allow adjusting
the number of CPUs to activate. Currently the limit is defined by the
config option CONFIG_NR_CPUS. Such parameter already exists on x86.
Define a parameter "maxcpus" and a corresponding static variable
max_cpus in Arm smpboot.c. Modify function smp_get_max_cpus to take
max_cpus as a limit and to return proper unsigned int instead of int.
Take the opportunity to remove redundant variable cpus from start_xen
function and to directly assign the return value from smp_get_max_cpus
to nr_cpu_ids (global variable in Xen used to store the number of CPUs
actually activated).
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Julien Grall [Fri, 20 May 2022 12:09:30 +0000 (13:09 +0100)]
xen/arm: Move fixmap definitions in a separate header
To use properly the fixmap definitions, their user would need
also new to include <xen/acpi.h>. This is not very great when
the user itself is not meant to directly use ACPI definitions.
Including <xen/acpi.h> in <asm/config.h> is not option because
the latter header is included by everyone. So move out the fixmap
entries definition in a new header.
Take the opportunity to also move {set, clear}_fixmap() prototypes
in the new header.
Note that most of the definitions in <xen/acpi.h> now need to be
surrounded with #ifndef __ASSEMBLY__ because <asm/fixmap.h> will
be used in assembly (see EARLY_UART_VIRTUAL_ADDRESS).
The split will become more helpful in a follow-up patch where new
fixmap entries will be defined.
Julien Grall [Fri, 20 May 2022 12:09:29 +0000 (13:09 +0100)]
xen/arm: mm: Allow page-table allocation from the boot allocator
At the moment, page-table can only be allocated from domheap. This means
it is not possible to create mapping in the page-tables via
map_pages_to_xen() if page-table needs to be allocated.
In order to avoid open-coding page-tables update in early boot, we need
to be able to allocate page-tables much earlier. Thankfully, we have the
boot allocator for those cases.
create_xen_table() is updated to cater early boot allocation by using
alloc_boot_pages().
Note, this is not sufficient to bootstrap the page-tables (i.e mapping
before any memory is actually mapped). This will be addressed
separately.
Julien Grall [Fri, 20 May 2022 12:09:28 +0000 (13:09 +0100)]
xen/arm: mm: Allocate xen page tables in domheap rather than xenheap
xen_{un,}map_table() already uses the helper to map/unmap pages
on-demand (note this is currently a NOP on arm64). So switching to
domheap don't have any disavantage.
But this as the benefit:
- to keep the page tables unmapped if an arch decided to do so
- reduce xenheap use on arm32 which can be pretty small
Julien Grall [Fri, 20 May 2022 12:09:25 +0000 (13:09 +0100)]
xen/arm: mm: Don't open-code Xen PT update in remove_early_mappings()
Now that xen_pt_update_entry() is able to deal with different mapping
size, we can replace the open-coding of the page-tables update by a call
to modify_xen_mappings().
As the function is not meant to fail, a BUG_ON() is added to check the
return.
Note that we don't use destroy_xen_mappings() because the helper doesn't
allow us to pass a flags. In theory we could add an extra parameter to
the function, however there are no other expected users. Hence why
modify_xen_mappings() is used.
Julien Grall [Fri, 20 May 2022 12:09:24 +0000 (13:09 +0100)]
xen/arm: mm: Avoid flushing the TLBs when mapping are inserted
Currently, the function xen_pt_update() will flush the TLBs even when
the mappings are inserted. This is a bit wasteful because we don't
allow mapping replacement. Even if we were, the flush would need to
happen earlier because mapping replacement should use Break-Before-Make
when updating the entry.
A single call to xen_pt_update() can perform a single action. IOW, it
is not possible to, for instance, mix inserting and removing mappings.
Therefore, we can use `flags` to determine what action is performed.
This change will be particularly help to limit the impact of switching
boot time mapping to use xen_pt_update().
Julien Grall [Fri, 20 May 2022 12:09:23 +0000 (13:09 +0100)]
xen/arm: mm: Add support for the contiguous bit
In follow-up patches, we will use xen_pt_update() (or its callers)
to handle large mappings (e.g. frametable, xenheap). They are also
not going to be modified once created.
The page-table entries have an hint to indicate that whether an
entry is contiguous to another 16 entries (assuming 4KB). When the
processor support the hint, one TLB entry will be created per
contiguous region.
For now this is tied to _PAGE_BLOCK. We can untie it in the future
if there are use-cases where we may want to use _PAGE_BLOCK without
setting the contiguous (couldn't think of any yet).
Note that to avoid extra complexity, mappings with the contiguous
bit set cannot be removed. Given the expected use, this restriction
ought to be fine.
Julien Grall [Fri, 20 May 2022 12:09:22 +0000 (13:09 +0100)]
xen/arm: mm: Allow other mapping size in xen_pt_update_entry()
At the moment, xen_pt_update_entry() only supports mapping at level 3
(i.e 4KB mapping). While this is fine for most of the runtime helper,
the boot code will require to use superpage mapping.
We don't want to allow superpage mapping by default as some of the
callers may expect small mappings (i.e populate_pt_range()) or even
expect to unmap only a part of a superpage.
To keep the code simple, a new flag _PAGE_BLOCK is introduced to
allow the caller to enable superpage mapping.
As the code doesn't support all the combinations, xen_pt_check_entry()
is extended to take into account the cases we don't support when
using block mapping:
- Replacing a table with a mapping. This may happen if region was
first mapped with 4KB mapping and then later on replaced with a 2MB
(or 1GB mapping).
- Removing/modifying a table. This may happen if a caller try to
remove a region with _PAGE_BLOCK set when it was created without it.
Note that the current restriction means that the caller must ensure that
_PAGE_BLOCK is consistently set/cleared across all the updates on a
given virtual region. This ought to be fine with the expected use-cases.
More rework will be necessary if we wanted to remove the restrictions.
Note that nr_mfns is now marked const as it is used for flushing the
TLBs and we don't want it to be modified.
Henry Wang [Sat, 7 May 2022 02:54:34 +0000 (10:54 +0800)]
xen/common: Use enhanced ASSERT_ALLOC_CONTEXT in xmalloc()
xmalloc() will use a pool for allocation smaller than a page.
The pool is extended only when there are no more space. At which
point, alloc_xenheap_pages() is called to add more memory.
xmalloc() must be protected by ASSERT_ALLOC_CONTEXT. It should not
rely on pool expanding to trigger the ASSERT_ALLOC_CONTEXT in
alloc_xenheap_pages(). Hence, this commit moves the definition of
ASSERT_ALLOC_CONTEXT to header and uses the ASSERT_ALLOC_CONTEXT
to replace the original assertion in xmalloc().
For consistency, the same assertion should be used in xfree(),
and the position of the assertion should be at the beginning of
the xfree().
Also take the opportunity to enhance the non-static functions
xmem_pool_{alloc,free}() with the same assertion so that future
callers of these two functions can be benefited.
Reported-by: Wei Chen <Wei.Chen@arm.com> Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Tested-by: Julien Grall <jgrall@amazon.com> Acked-by: Julien Grall <jgrall@amazon.com>
David Vrabel [Tue, 26 Apr 2022 08:33:01 +0000 (10:33 +0200)]
page_alloc: assert IRQs are enabled in heap alloc/free
Heap pages can only be safely allocated and freed with interrupts
enabled as they may require a TLB flush which may send IPIs (on x86).
Normally spinlock debugging would catch calls from the incorrect
context, but not from stop_machine_run() action functions as these are
called with spin lock debugging disabled.
Enhance the assertions in alloc_xenheap_pages() and
alloc_domheap_pages() to check interrupts are enabled. For consistency
the same asserts are used when freeing heap pages.
As an exception, when only 1 PCPU is online, allocations are permitted
with interrupts disabled as any TLB flushes would be local only. This
is necessary during early boot.
Signed-off-by: David Vrabel <dvrabel@amazon.co.uk> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Introduce a list of MISRA C rules that apply to the Xen hypervisor. The
list is in RST format.
Specify that rules deviations need to be documented. Introduce a
documentation tag for in-code comments to mark them as deviations. Also
mention that other documentation mechanisms are work-in-progress.
Jan Beulich [Wed, 1 Jun 2022 07:19:25 +0000 (09:19 +0200)]
x86: harden use of calc_ler_msr()
Avoid calling the function more than once, thus making sure we won't,
under any unusual circumstances, attempt to enable XEN_LER late (which
can't work, for setup_force_cpu_cap() being __init. In turn this then
allows making the function itself __init, too.
While fiddling with section attributes in this area, also move the two
involved variables to .data.ro_after_init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jason Andryuk [Wed, 1 Jun 2022 07:18:49 +0000 (09:18 +0200)]
libxl: don't leak self pipes
libxl is leaking self pipes to child processes. These can be seen when
running with env var _LIBXL_DEBUG_EXEC_FDS=1:
libxl: debug: libxl_aoutils.c:593:libxl__async_exec_start: forking to execute: /etc/xen/scripts/vif-bridge online
[Detaching after fork from child process 5099]
libxl: execing /etc/xen/scripts/vif-bridge: fd 4 is open to pipe:[46805] with flags 0
libxl: execing /etc/xen/scripts/vif-bridge: fd 13 is open to pipe:[46807] with flags 0
libxl: execing /etc/xen/scripts/vif-bridge: fd 14 is open to pipe:[46807] with flags 0
libxl: execing /etc/xen/scripts/vif-bridge: fd 19 is open to pipe:[48570] with flags 0
libxl: execing /etc/xen/scripts/vif-bridge: fd 20 is open to pipe:[48570] with flags 0
(fd 3 is also open, but the check only starts at 4 for some reason.)
For xl, this is the poller created by libxl_ctx_alloc, the poller
created by do_domain_create -> libxl__ao_create, and the self pipe for
libxl__sigchld_needed. Set CLOEXEC on the FDs so they are not leaked
into children.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Tamas K Lengyel [Wed, 1 Jun 2022 07:18:30 +0000 (09:18 +0200)]
tools/libs/ctrl: rename and export do_memory_op as xc_memory_op
Make the do_memory_op function accessible to tools linking with libxc.
Similar functions are already available for both domctl and sysctl. As part
of this patch we also change the input 'cmd' to be unsigned int to accurately
reflect what the hypervisor expects.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
snprintf returns the number of characters that would have been written
to the final string if enough space had been available. A return value
of size or more means that the output was truncated.
Juergen Gross [Wed, 25 May 2022 10:55:49 +0000 (12:55 +0200)]
tools/xenstore: fix event sending in introduce_domain()
Commit fc2b57c9af46 ("xenstored: send an evtchn notification on
introduce_domain") introduced a potential NULL dereference in case of
Xenstore live update.
Fix that by adding an appropriate check.
Coverity-Id: 1504572 Fixes: fc2b57c9af46 ("xenstored: send an evtchn notification on introduce_domain") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Wed, 25 May 2022 09:09:46 +0000 (11:09 +0200)]
x86/flushtlb: remove flush_area check on system state
Booting with Shadow Stacks leads to the following assert on a debug
hypervisor:
Assertion 'local_irq_is_enabled()' failed at arch/x86/smp.c:265
----[ Xen-4.17.0-10.24-d x86_64 debug=y Not tainted ]----
CPU: 0
RIP: e008:[<ffff82d040345300>] flush_area_mask+0x40/0x13e
[...]
Xen call trace:
[<ffff82d040345300>] R flush_area_mask+0x40/0x13e
[<ffff82d040338a40>] F modify_xen_mappings+0xc5/0x958
[<ffff82d0404474f9>] F arch/x86/alternative.c#_alternative_instructions+0xb7/0xb9
[<ffff82d0404476cc>] F alternative_branches+0xf/0x12
[<ffff82d04044e37d>] F __start_xen+0x1ef4/0x2776
[<ffff82d040203344>] F __high_start+0x94/0xa0
This is due to SYS_STATE_smp_boot being set before calling
alternative_branches(), and the flush in modify_xen_mappings() then
using flush_area_all() with interrupts disabled. Note that
alternative_branches() is called before APs are started, so the flush
must be a local one (and indeed the cpumask passed to
flush_area_mask() just contains one CPU).
Take the opportunity to simplify a bit the logic and make flush_area()
an alias of flush_area_all() in mm.c, taking into account that
cpu_online_map just contains the BSP before APs are started. This
requires widening the assert in flush_area_mask() to allow being
called with interrupts disabled as long as it's strictly a local only
flush.
The overall result is that a conditional can be removed from
flush_area().
While there also introduce an ASSERT to check that a vCPU state flush
is not issued for the local CPU only.
Fixes: 78e072bc37 ('x86/mm: avoid inadvertently degrading a TLB flush to local only') Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Wed, 25 May 2022 09:08:28 +0000 (11:08 +0200)]
x86/mm: rename FLUSH_FORCE_IPI to FLUSH_NO_ASSIST
Rename the flag to better note that it's not actually forcing any IPIs
to be issued if none is required, but merely avoiding the usage of TLB
flush assistance (which itself can avoid the sending of IPIs to remote
processors).
No functional change expected.
Requested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Bertrand Marquis [Wed, 25 May 2022 09:07:46 +0000 (11:07 +0200)]
build: fix make warning if there is no cppcheck
If cppcheck is not present, the following warning appears during build:
which: no cppcheck in ([...])
/bin/sh: cppcheck: command not found
Fix the problem by using shell code inside the cppcheck-version rule to
also prevent unneeded call of which when something else than cppcheck is
built.
Julien Grall [Tue, 24 May 2022 23:38:15 +0000 (16:38 -0700)]
xen/arm: setup: nr_banks should be unsigned int
It is not possible to have a negative number of banks. So switch to
unsigned int.
The type change is also propagated to any users of nr_banks that were
using "int" (there are not that many).
Note that fdt_num_mem_rsv() can actually returns a negative value in
case of an error. So the return should be checked before assigning the
result to an unsigned variable.
Luca Miccio [Fri, 13 May 2022 21:07:29 +0000 (14:07 -0700)]
tools: add example application to initialize dom0less PV drivers
Add an example application that can be run in dom0 to complete the
dom0less domains initialization so that they can get access to xenstore
and use PV drivers.
The application sets "connection" to XENSTORE_RECONNECT on the xenstore
page before calling xs_introduce_domain to signal that the connection is
not ready yet to be used. XENSTORE_RECONNECT is reset soon after by
xenstored.