Define a sanitize_cpu function to be called on secondary cores to
sanitize the system cpuinfo structure.
The safest value is taken when possible and the system is marked tainted
if we encounter values which are incompatible with each other.
Call the update_system_features function on all secondary cores that are
kept running and taint the system if different midr are found between
cores but hmp-unsafe=true was passed on Xen command line.
This is only supported on arm64 so update_system_features is an empty
static inline on arm32.
The patch is adding a new TAINT_CPU_OUT_OF_SPEC to warn the user if
Xen is running on a system with features differences between cores which
are not supported.
The patch is disabling CTR_EL0, DCZID_EL0 and ZCRusing #if 0 with a TODO
as this patch is not handling sanitization of those registers.
CTR_EL0/DCZID will be handled in a future patch to properly handle
different cache attributes when possible.
ZCR should be sanitize once we add support for SVE in Xen.
As we will sanitize the content of boot_cpu_data it will not really
contain the boot cpu information but the system sanitize information.
Rename the structure to system_cpuinfo so the user is informed that this
is the system wide available feature and not anymore the features of the
boot cpu.
The original boot cpu data is still available in cpu_data.
Import structures declared in Linux file arch/arm64/kernel/cpufeature.c
and the required types from arch/arm64/include/asm/cpufeature.h.
Current code has been imported from Linux 5.13-rc5 (Commit ID cd1245d75ce93b8fd206f4b34eb58bcfe156d5e9) and copied into cpufeature.c
in arm64 code and cpufeature.h in arm64 specific headers.
Those structure will be used to sanitize the cpu features available to
the ones availble on all cores of a system even if we are on an
heterogeneous platform (from example a big/LITTLE).
For each feature field of all ID registers, those structures define what
is the safest value and if we can allow to have different values in
different cores.
This patch is introducing Linux code without any changes to it.
Jan Beulich [Thu, 16 Sep 2021 09:02:48 +0000 (11:02 +0200)]
VT-d: skip IOMMU bitmap cleanup for phantom devices
Doing the cleanup also for phantom devices is at best redundant with
doing it for the corresponding real device. I couldn't force myself into
checking all the code paths whether it really is: It seems better to
explicitly skip this step in such cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Thu, 16 Sep 2021 09:02:08 +0000 (11:02 +0200)]
VT-d: defer "no DRHD" check when (un)mapping devices
If devices are to be skipped anyway (which is the case in particular for
host bridges), there's no point complaining about a missing DRHD (and
hence a missing association with an IOMMU).
While there convert assignments to initializers and constify "drhd"
local variables.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Daniel P. Smith [Thu, 16 Sep 2021 08:59:40 +0000 (10:59 +0200)]
xsm: refactor xsm_ops handling
This renames the `struct xsm_operations` to the shorter `struct xsm_ops` and
converts the global xsm_ops from being a pointer to an explicit instance. As
part of this conversion, it reworks the XSM modules init function to return
their xsm_ops struct which is copied in to the global xsm_ops instance.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Daniel P. Smith [Thu, 16 Sep 2021 08:58:59 +0000 (10:58 +0200)]
xsm: apply coding style
Instead of intermixing coding style changes with code changes as they
are come upon in this patch set, moving all coding style changes
into a single commit. The focus of coding style changes here are,
- move trailing comments to line above
- ensuring line length does not exceed 80 chars
- ensuring proper indentation for 80 char wrapping
- covert u32 type statements to uint32_t
- remove space between closing and opening parens
- drop extern on function declarations
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 16 Sep 2021 08:56:25 +0000 (10:56 +0200)]
IOMMU: page table dumping adjustments
For one none of the three IOMMU implementations on Arm specify a dumping
hook. Generalize VT-d's "don't dump shared page tables" to cover for
this.
Further in the past I was told that on Arm in principle there could be
multiple different IOMMUs, and hence different domains' platform_ops
pointers could differ. Use each domain's ops for calling the dump hook.
(In the long run all uses of iommu_get_ops() would likely need to
disappear for this reason.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
If the new gfn matches the previous one (ie: gpfn == old_gpfn)
xenmem_add_to_physmap_one will issue a duplicated call to
guest_physmap_remove_page with the same guest frame number, because
the get_gpfn_from_mfn call has been moved by commit f8582da041 to be
performed before the original page is removed. This leads to the
second guest_physmap_remove_page failing, which was not the case
before commit f8582da041.
Fix this by adding a check that prevents a second call to
guest_physmap_remove_page if the previous one has already removed the
backing page from that gfn.
Fixes: f8582da041 ('x86/mm: pull a sanity check earlier in xenmem_add_to_physmap_one()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86: quote section names when defining them in linker script
LLVM ld seems to require section names to be quoted at both definition
and when referencing them for a match to happen, or else we get the
following errors:
The original fix for GNU ld 2.37 only quoted the section name when
referencing it in the ADDR function. Fix by also quoting the section
names when declaring them.
Fixes: 58ad654ebce7 ("x86: work around build issue with GNU ld 2.37") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 15 Sep 2021 09:01:29 +0000 (11:01 +0200)]
x86/boot: properly "ignore" early evaluated "no-real-mode"
The option parser takes off "no-" prefixes before matching, so they also
shouldn't be specified to match against.
Fixes: e44d98608476 ("x86/setup: Ignore early boot parameters like no-real-mode") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 15 Sep 2021 09:00:40 +0000 (11:00 +0200)]
x86/ACPI: ignore processors which cannot be brought online
ACPI 6.3 introduced a flag allowing to tell MADT entries describing
hotpluggable processors from ones which are simply placeholders (often
used by firmware writers to simplify handling there).
Inspired by a Linux patch by Mario Limonciello <mario.limonciello@amd.com>.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
This commit introduces a new function allocate_static_memory to allocate
static memory as guest RAM for domains on Static Allocation.
It uses acquire_domstatic_pages to acquire pre-configured static memory
for the domain, and uses guest_physmap_add_pages to set up the P2M table.
These pre-defined static memory banks shall be mapped to the usual guest
memory addresses (GUEST_RAM0_BASE, GUEST_RAM1_BASE) defined by
xen/include/public/arch-arm.h.
In order to deal with the trouble of count-to-order conversion when page number
is not in a power-of-two, this commit exports p2m_insert_mapping and introduce
a new function guest_physmap_add_pages to cope with adding guest RAM p2m
mapping with nr_pages.
xen/arm: introduce acquire_staticmem_pages and acquire_domstatic_pages
New function acquire_staticmem_pages aims to acquire nr_mfns contiguous pages
of static memory, starting at #smfn. And it is the equivalent of
alloc_heap_pages for static memory.
For each page, it shall check if the page is reserved(PGC_reserved)
and free. It shall also do a set of necessary initialization, which are
mostly the same ones in alloc_heap_pages, like, following the same
cache-coherency policy and turning page status into PGC_state_inuse, etc.
New function acquire_domstatic_pages is the equivalent of alloc_domheap_pages
for static memory, and it is to acquire nr_mfns contiguous pages of
static memory and assign them to one specific domain.
It uses acquire_staticmem_pages to acquire nr_mfns pages of static memory.
Then on success, it will use assign_pages to assign those pages to one
specific domain.
In order to differentiate pages of static memory from those allocated from
heap, this patch introduces a new page flag PGC_reserved, then mark pages of
static memory PGC_reserved when initializing them.
xen: re-define assign_pages and introduce a new function assign_page
In order to deal with the trouble of count-to-order conversion when page number
is not in a power-of-two, this commit re-define assign_pages for nr pages and
assign_page for original page with a single order.
Backporting confusion could be helped by altering the order of assign_pages
parameters, such that the compiler would point out that adjustments at call
sites are needed.
[stefano: switch to unsigned int for nr] Signed-off-by: Penny Zheng <penny.zheng@arm.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
This commit defines a new helper mark_page_free to extract common code,
like following the same cache/TLB coherency policy, between free_heap_pages
and the new function free_staticmem_pages, which will be introduced later.
The PDX compression makes that conversion between the MFN and the page can
be potentially non-trivial. As the function is internal, pass the MFN and
the page. They are both expected to match.
Static Allocation refers to system or sub-system(domains) for which memory
areas are pre-defined by configuration using physical address ranges.
Those pre-defined memory, -- Static Memory, as parts of RAM reserved in the
beginning, shall never go to heap allocator or boot allocator for any use.
Memory can be statically allocated to a domain using the property "xen,static-
mem" defined in the domain configuration. The number of cells for the address
and the size must be defined using respectively the properties
"#xen,static-mem-address-cells" and "#xen,static-mem-size-cells".
The property 'memory' is still needed and should match the amount of memory
given to the guest. Currently, it either comes from static memory or lets Xen
allocate from heap. *Mixing* is not supported.
The static memory will be mapped in the guest at the usual guest memory
addresses (GUEST_RAM0_BASE, GUEST_RAM1_BASE) defined by
xen/include/public/arch-arm.h.
This patch introduces this new `xen,static-mem` feature, and also documents
and parses this new attribute at boot time.
This patch also introduces a new field "bool xen_domain" in "struct membank"
to tell whether the memory bank is reserved as the whole hardware resource,
or bind to a xen domain node, through "xen,static-mem"
Make the go build use APPEND_{C/LD}FLAGS when necessary, just like
other parts of the build.
Reported-by: Ting-Wei Lan <lantw44@gmail.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Nick Rosbrook <rosbrookn@ainfosec.com> Acked-by: Ian Jackson <iwj@xenproject.org>
Daniel P. Smith [Fri, 10 Sep 2021 20:12:57 +0000 (16:12 -0400)]
xsm: remove the ability to disable flask
On Linux when SELinux is put into permissive mode the discretionary access
controls are still in place. Whereas for Xen when the enforcing state of flask
is set to permissive, all operations for all domains would succeed, i.e. it
does not fall back to the default access controls. To provide a means to mimic
a similar but not equivalent behaviour, a flask op is present to allow a
one-time switch back to the default access controls, aka the "dummy policy".
While this may be desirable for an OS, Xen is a hypervisor and should not
allow the switching of which security policy framework is being enforced after
boot. This patch removes the flask op to enforce the desired XSM usage model
requiring a reboot of Xen to change the XSM policy module in use.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 10 Sep 2021 20:12:56 +0000 (16:12 -0400)]
xen: Implement xen/alternative-call.h for use in common code
The alternative call infrastructure is x86-only for now, but the common iommu
code has a variant and more common code wants to use the infrastructure.
Introduce CONFIG_ALTERNATIVE_CALL and a conditional implementation so common
code can use the optimisation when available, without requiring all
architectures to implement no-op stubs.
Write some documentation, which was thus far entirely absent, covering the
requirements for an architecture to implement this optimisation, and how to
use the infrastructure in general code.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 19 Aug 2019 15:40:06 +0000 (16:40 +0100)]
x86/svm: Intercept and terminate RDPRU with #UD
The RDPRU instruction isn't supported at all (and it is unclear how this can
ever be offered safely to guests). However, a guest which ignores CPUID and
blindly executes RDPRU will find that it functions.
Use the intercept and terminate with #UD. While at it, fold SKINIT into the
same "unconditionally disabled" path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 25 May 2018 15:13:02 +0000 (16:13 +0100)]
x86/msr: Cleanup of misc constants
Move two blocks of MSRs into the cleaned up section, updating the style as
they move.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Fri, 25 May 2018 15:12:05 +0000 (16:12 +0100)]
x86/msr: Clean up the MSR_EFER constants
There are no remaining users of the bit position constants. Move the used
constants into the cleaned-up area of msr-index.h and apply appropriate style.
Rename EFER_NX to EFER_NXE to match both the Intel and AMD specs.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Fri, 30 Nov 2018 17:17:38 +0000 (17:17 +0000)]
x86/amd: Use newer SSBD mechanisms if they exist
The opencoded legacy Memory Disambiguation logic in init_amd() neglected
Fam19h for the Zen3 microarchitecture. Further more, all Zen2 based system
have the architectural MSR_SPEC_CTRL and the SSBD bit within it, so shouldn't
be using MSR_AMD64_LS_CFG.
Implement the algorithm given in AMD's SSBD whitepaper, and leave a
printk_once() behind in the case that no controls can be found.
This now means that a user explicitly choosing `spec-ctrl=ssbd` will properly
turn off Memory Disambiguation on Fam19h/Zen3 systems.
This still remains a single system-wide setting (for now), and is not context
switched between vCPUs. As such, it doesn't interact with Intel's use of
MSR_SPEC_CTRL and default_xen_spec_ctrl (yet).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 12 Jul 2021 14:13:32 +0000 (15:13 +0100)]
x86/amd: Enumeration for speculative features/hints
There is a step change in speculation protections between the Zen1 and Zen2
microarchitectures.
Zen1 and older have no special support. Control bits in non-architectural
MSRs are used to make lfence be dispatch-serialising (Spectre v1), and to
disable Memory Disambiguation (Speculative Store Bypass). IBPB was
retrofitted in a microcode update, and software methods are required for
Spectre v2 protections.
Because the bit controlling Memory Disambiguation is model specific,
hypervisors are expected to expose a MSR_VIRT_SPEC_CTRL interface which
abstracts the model specific details.
Zen2 and later implement the MSR_SPEC_CTRL interface in hardware, and
virtualise the interface for HVM guests to use. A number of hint bits are
specified too to help guide OS software to the most efficient mitigation
strategy.
Zen3 introduced a new feature, Predictive Store Forwarding, along with a
control to disable it in sensitive code.
Add CPUID and VMCB details for all the new functionality.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 29 Jul 2021 10:59:22 +0000 (11:59 +0100)]
x86/spec-ctrl: Split the "Hardware features" diagnostic line
Separate the read-only hints from the features requiring active actions on
Xen's behalf.
Also take the opportunity split the IBRS/IBPB and IBPB mess. More features
with overlapping enumeration are on the way, and and it is not useful to split
them like this.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jane Malalane [Wed, 8 Sep 2021 12:39:18 +0000 (14:39 +0200)]
x86/cpuid: detect null segment behaviour on Zen2 CPUs
All Zen2 CPUs actually have this behaviour, but the CPUID bit couldn't
be introduced into Zen2 due to a lack of leaves. So, it was added in a
new leaf in Zen3. Nonetheless, hypervisors can synthesize the CPUID
bit in software.
So, Xen probes for NSCB (NullSelectorClearsBit) and
synthesizes the bit, if the behaviour is present.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 8 Sep 2021 12:38:33 +0000 (14:38 +0200)]
domain: try to address Coverity pointing out a missing "break" in domain_teardown()
Commit 806448806264 ("xen/domain: Fix label position in
domain_teardown()" has caused Coverity to report a _new_ supposedly
un-annotated fall-through in a switch(). I find this (once again)
puzzling; I'm having an increasingly hard time figuring what patterns
the tool is actually after. I would have expected that the tool would
either have spotted an issue also before this change, or not at all. Yet
if it had spotted one before, the statistics report should have included
an eliminated instance alongside the new one (because then the issue
would simply have moved by a few lines).
Hence the only thing I could guess is that the treatment of comments in
macro expansions might be subtly different. Therefore try whether
switching the comments to the still relatively new "fallthrough" pseudo
keyword actually helps.
Jan Beulich [Wed, 8 Sep 2021 12:37:45 +0000 (14:37 +0200)]
gnttab: deal with status frame mapping race
Once gnttab_map_frame() drops the grant table lock, the MFN it reports
back to its caller is free to other manipulation. In particular
gnttab_unpopulate_status_frames() might free it, by a racing request on
another CPU, thus resulting in a reference to a deallocated page getting
added to a domain's P2M.
Obtain a page reference in gnttab_map_frame() to prevent freeing of the
page until xenmem_add_to_physmap_one() has actually completed its acting
on the page. Do so uniformly, even if only strictly required for v2
status pages, to avoid extra conditionals (which then would all need to
be kept in sync going forward).
Jan Beulich [Tue, 7 Sep 2021 12:24:49 +0000 (14:24 +0200)]
x86/p2m-pt: fix p2m_flags_to_access()
The initial if() was inverted, invalidating all output from this
function. Which in turn means the mirroring of P2M mappings into the
IOMMU didn't always work as intended: Mappings may have got updated when
there was no need to. There would not have been too few (un)mappings;
what saves us is that alongside the flags comparison MFNs also get
compared, with non-present entries always having an MFN of 0 or
INVALID_MFN while present entries always have MFNs different from these
two (0 in the table also meant to cover INVALID_MFN):
OLD NEW
P W access MFN P W access MFN
0 0 r 0 0 0 n 0
0 1 rw 0 0 1 n 0
1 0 n non-0 1 0 r non-0
1 1 n non-0 1 1 rw non-0
present <-> non-present transitions are fine because the MFNs differ.
present -> present transitions as well as non-present -> non-present
ones are potentially causing too many map/unmap operations, but never
too few, because in that case old (bogus) and new access differ.
Fixes: d1bb6c97c31e ("IOMMU: also pass p2m_access_t to p2m_get_iommu_flags()) Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jane Malalane [Tue, 7 Sep 2021 07:40:25 +0000 (09:40 +0200)]
x86/cpuid: expose NullSelectorClearsBase CPUID bit to guests
AMD Zen3 adds the NullSelectorClearsBase bit to indicate that loading
a NULL segment selector zeroes the base and limit fields, as well as
just attributes.
Expose bit to all guests.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 7 Sep 2021 07:39:38 +0000 (09:39 +0200)]
x86/P2M: relax guarding of MMIO entries
One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. At least in
the case of PVH Dom0 hitting an RMRR covered by an E820 ACPI region,
this is too strict. Generally short-circuit requests establishing the
same kind of mapping (mfn, type), but allow permissions to differ.
While there, also add a log message to the other domain_crash()
invocation that did prevent PVH Dom0 from coming up after the XSA-378
changes.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 7 Sep 2021 07:38:42 +0000 (09:38 +0200)]
gnttab: maptrack handle shortage is not IOMMU related
Both comment and message string associated with GNTST_no_device_space
suggest a connection to the IOMMU. A lack of maptrack handles has
nothing to do with that; it's unclear to me why commit 6213b696ba65
("Grant-table interface redone") introduced it this way. Introduce a
new error indicator.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Tue, 7 Sep 2021 07:37:50 +0000 (09:37 +0200)]
gnttab: adjust unmap checking of dev_bus_addr
There's no point checking ->dev_bus_addr when GNTMAP_device_map isn't
set (and hence the field isn't going to be consumed). And if there is a
mismatch, use the so far unused GNTST_bad_dev_addr error indicator - if
not here, where else would this (so far unused) value be used?
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Tue, 7 Sep 2021 07:36:59 +0000 (09:36 +0200)]
ns16550: MMIO r/o ranges are maintained at page granularity
Passing byte granular values will not have the intended effect. Address
the immediate issue, but I don't think what we do is actually
sufficient: At least some devices allow access to their registers via
either I/O ports or MMIO. In such aliasing cases we'd need to protect
the MMIO range even when we use I/O port accesses to drive the port.
Note that this way we may write-protect MMIO ranges of unrelated devices
as well. To deal with this, faults resulting from this would need
handling, to emulate the accesses outside of the protected range. (An
alternative would be to relocate the BAR, but I'm afraid this might end
up even more challenging.)
Fixes: c9f8e0aee507 ("ns16550: Add support for UART present in Broadcom TruManage capable NetXtreme chips") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Anthony PERARD [Tue, 7 Sep 2021 07:30:25 +0000 (09:30 +0200)]
build: fix clean targets when subdir-y is used
The make variable $(subdir-y) isn't used yet but will be in a
following patch. Anything in $(subdir-y) doesn't to have a '/' as
suffix as we already now it's a directory.
Rework the rules so that it doesn't matter whether there is a '/' or
not. It also mimic more closely to the way Linux's Kbuild descend in
subdirectories.
FORCE phony target isn't needed anymore running clean, so it can be
removed.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Tue, 7 Sep 2021 07:28:43 +0000 (09:28 +0200)]
build,include: rework compat-build-source.py
Improvement are:
- give the path to xlat.lst as argument
- include `grep -v` in compat-build-source.py script, we don't need to
write this in several scripted language.
No changes in final compat/%.h headers.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Tue, 7 Sep 2021 07:14:32 +0000 (09:14 +0200)]
build: use if_changed on built_in.o
In the case where $(obj-y) is empty, we also replace $(c_flags) by
$(XEN_CFLAGS) to avoid generating an .%.d dependency file. This avoid
make trying to include %.h file in the ld command if $(obj-y) isn't
empty anymore on a second run.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Michal Orzel [Thu, 29 Jul 2021 10:42:58 +0000 (12:42 +0200)]
xen/arm64: Remove vreg_emulate_sysreg32
According to ARMv8A architecture, AArch64 registers are 64bit wide
even though in many cases the upper 32bit is reserved. Therefore there
is no need for function vreg_emulate_sysreg32 on arm64. This means
that we can have just one function vreg_emulate_sysreg using new
function pointer:
Modify vreg_emulate_cp32 to use the new function pointer as well.
This change allows to properly use 64bit registers in AArch64 state.
In case of AArch32 the documentation (D1.20.2, DDI 0487A.j) states
that "the upper 32 bits either become zero, or hold the value that
the same architectural register held before any AArch32 execution." As
the choice between them is IMPLEMENTATION DEFINED we cannot assume they
are zeroed. Xen should ensure that but currently it does not. This is
not a new bug and must be fixed as agreed during a discussion over this
patch.
Take the opportunity to switch CNTx_CTL_* to use UL to avoid any
surprise with the negation of any bits (as used in vtimer_cntp_ctl)
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Fri, 3 Sep 2021 13:10:43 +0000 (15:10 +0200)]
tests/xenstore: link in librt if necessary
Old enough glibc has clock_gettime() in librt.so, hence the library
needs to be specified to the linker. Newer glibc has the symbol
available in both libraries, so make sure that libc.so is preferred (to
avoid an unnecessary dependency on librt.so).
Fixes: 93c9edbef51b ("tests/xenstore: Rework Makefile") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Fri, 3 Sep 2021 13:10:24 +0000 (15:10 +0200)]
tools/libs: ROUNDUP() related adjustments
For one xc_private.h needlessly repeats xen-tools/libs.h's definition.
And then there are two suspicious uses (resulting from the inconsistency
with the respective 2nd parameter of DIV_ROUNDUP()): While the one in
tools/console/daemon/io.c - as per the code comment - intentionally uses
8 as the second argument (meaning to align to a multiple of 256), the
one in alloc_magic_pages_hvm() pretty certainly does not: There the goal
is to align to a uint64_t boundary, for the following module struct to
end up aligned.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Fri, 3 Sep 2021 13:09:48 +0000 (15:09 +0200)]
libxc: split xc_logdirty_control() from xc_shadow_control()
For log-dirty operations a 64-bit field is being truncated to become an
"int" return value. Seeing the large number of arguments the present
function takes, reduce its set of parameters to that needed for all
operations not involving the log-dirty bitmap, while introducing a new
wrapper for the log-dirty bitmap operations. This new function in turn
doesn't need an "mb" parameter, but has a 64-bit return type. (Using the
return value in favor of a pointer-type parameter is left as is, to
disturb callers as little as possible.)
While altering xc_shadow_control() anyway, also adjust the types of the
last two of the remaining parameters.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <iwj@xenproject.org>
Jan Beulich [Tue, 31 Aug 2021 15:43:36 +0000 (17:43 +0200)]
x86/PVH: de-duplicate mappings for first Mb of Dom0 memory
One of the changes comprising the fixes for XSA-378 disallows replacing
MMIO mappings by code paths not intended for this purpose. This means we
need to be more careful about the mappings put in place in this range -
mappings should be created exactly once:
- iommu_hwdom_init() comes first; it should avoid the first Mb,
- pvh_populate_p2m() should insert identity mappings only into ranges
not populated as RAM,
- pvh_setup_acpi() should again avoid the first Mb, which was already
dealt with at that point.
Fixes: 753cb68e6530 ("x86/p2m: guard (in particular) identity mapping entries") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 27 Aug 2021 13:46:52 +0000 (14:46 +0100)]
xen/domain: Fix label position in domain_teardown()
As explained in the comments, a progress label wants to be before the function
it refers to for the higher level logic to make sense. As it happens, the
effects are benign because gnttab_mappings is immediately adjacent to teardown
in terms of co-routine exit points.
There is and will always be a corner case with 0. Help alleviate this
visually (at least slightly) with a BUILD_BUG_ON() to ensure the property
which makes this function do anything useful.
There is also a visual corner case when changing from PROGRESS() to
PROGRESS_VCPU(). The important detail is to check that there is a "return
rc;" logically between each PROGRESS*() marker.
Fixes: b1ee10be5625 ("gnttab: add preemption check to gnttab_release_mappings()") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 19 Aug 2021 12:53:15 +0000 (13:53 +0100)]
x86/spec-ctrl: Skip RSB overwriting when safe to do so
In some configurations, it is safe to not overwrite the RSB on entry to Xen.
Both Intel and AMD have guidelines in this area, because of the performance
difference it makes for native kernels.
A simple microperf test, measuring the amount of time a XENVER_version
hypercall takes, shows the following improvements:
This is best case improvement, because no real workloads are making
XENVER_version hypercalls in a tight loop. However, this is the hypercall
used by PV kernels to force evtchn delivery if one is pending, so it is a
common hypercall to see, especially in dom0.
The avoidance of RSB-overwriting speeds up all interrupts, exceptions and
system calls from PV or Xen context. RSB-overwriting is still required on
VMExit from HVM guests for now.
In terms of more realistic testing, LMBench in dom0 on an AMD Rome system
shows improvements across the board, with the best improvement at 8% for
simple syscall and simple write.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 27 Aug 2021 08:54:46 +0000 (10:54 +0200)]
gnttab: avoid triggering assertion in radix_tree_ulong_to_ptr()
Relevant quotes from the C11 standard:
"Except where explicitly stated otherwise, for the purposes of this
subclause unnamed members of objects of structure and union type do not
participate in initialization. Unnamed members of structure objects
have indeterminate value even after initialization."
"If there are fewer initializers in a brace-enclosed list than there are
elements or members of an aggregate, [...], the remainder of the
aggregate shall be initialized implicitly the same as objects that have
static storage duration."
"If an object that has static or thread storage duration is not
initialized explicitly, then:
[...]
— if it is an aggregate, every member is initialized (recursively)
according to these rules, and any padding is initialized to zero
bits;
[...]"
"A bit-field declaration with no declarator, but only a colon and a
width, indicates an unnamed bit-field." Footnote: "An unnamed bit-field
structure member is useful for padding to conform to externally imposed
layouts."
"There may be unnamed padding within a structure object, but not at its
beginning."
Which makes me conclude:
- Whether an unnamed bit-field member is an unnamed member or padding is
unclear, and hence also whether the last quote above would render the
big endian case of the structure declaration invalid.
- Whether the number of members of an aggregate includes unnamed ones is
also not really clear.
- The initializer in map_grant_ref() initializes all fields of the "cnt"
sub-structure of the union, so assuming the second quote above applies
here (indirectly), the compiler isn't required to implicitly
initialize the rest (i.e. in particular any padding) like would happen
for static storage duration objects.
Gcc 7.4.1 can be observed (apparently in debug builds only) to translate
aforementioned initializer to a read-modify-write operation of a stack
variable, leaving unchanged the top two bits of whatever was previously
in that stack slot. Clearly if either of the two bits were set,
radix_tree_ulong_to_ptr()'s assertion would trigger.
Therefore, to be on the safe side, add an explicit padding field for the
non-big-endian-bitfields case and give a dummy name to both padding
fields.
Fixes: 9781b51efde2 ("gnttab: replace mapkind()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Aug 2021 08:53:48 +0000 (10:53 +0200)]
gnttab: drop GNTMAP_can_fail
There's neither documentation of what this flag is supposed to mean, nor
any implementation. Commit 4d45702cf0398 ("paging: Updates to public
grant table header file") suggests there might have been plans to use it
for interaction with mem-paging, but no such functionality has ever
materialized. With this, don't even bother enclosing the #define-s in a
__XEN_INTERFACE_VERSION__ conditional, but drop them altogether.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Aug 2021 08:53:11 +0000 (10:53 +0200)]
AMD/IOMMU: avoid recording each level's MFN when walking page table
Both callers only care about the target (level 1) MFN. I also cannot
see what we might need higher level MFNs for down the road. And even
modern gcc doesn't recognize the optimization potential.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 27 Aug 2021 08:52:15 +0000 (10:52 +0200)]
VT-d: fix caching mode IOTLB flushing
While for context cache entry flushing use of did 0 is indeed correct
(after all upon reading the context entry the IOMMU wouldn't know any
domain ID if the entry is not present, and hence a surrogate one needs
to be used), for IOTLB entries the normal domain ID (from the [present]
context entry) gets used. See sub-section "IOTLB" of section "Address
Translation Caches" in the VT-d spec.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Julien Grall [Wed, 25 Aug 2021 12:19:31 +0000 (14:19 +0200)]
xen/arm: Restrict the amount of memory that dom0less domU and dom0 can allocate
Currently, both dom0less domUs and dom0 can allocate an "unlimited"
amount of memory because d->max_pages is set to ~0U.
In particular, the former are meant to be unprivileged. Therefore the
memory they could allocate should be bounded. As the domain are not yet
officially aware of Xen (we don't expose advertise it in the DT, yet
the hypercalls are accessible), they should not need to allocate more
than the initial amount. So cap set d->max_pages directly the amount of
memory we are meant to allocate.
Take the opportunity to also restrict the memory for dom0 as the
domain is direct mapped (e.g. MFN == GFN) and therefore cannot
allocate outside of the pre-allocated region.
Jan Beulich [Wed, 25 Aug 2021 12:19:09 +0000 (14:19 +0200)]
gnttab: fix array capacity check in gnttab_get_status_frames()
The number of grant frames is of no interest here; converting the passed
in op.nr_frames this way means we allow for 8 times as many GFNs to be
written as actually fit in the array. We would corrupt xlat areas of
higher vCPU-s (after having faulted many times while trying to write to
the guard pages between any two areas) for 32-bit PV guests. For HVM
guests we'd simply crash as soon as we hit the first guard page, as
accesses to the xlat area are simply memcpy() there.
This is CVE-2021-28699 / XSA-382.
Fixes: 18b1be5e324b ("gnttab: make resource limits per domain") Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 25 Aug 2021 12:18:39 +0000 (14:18 +0200)]
gnttab: replace mapkind()
mapkind() doesn't scale very well with larger maptrack entry counts,
using a brute force linear search through all entries, with the only
option of an early loop exit if a matching writable entry was found.
Introduce a radix tree alongside the main maptrack table, thus
allowing much faster MFN-based lookup. To avoid the need to actually
allocate space for the individual nodes, encode the two counters in the
node pointers themselves, thus limiting the number of permitted
simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) /
2¹⁵-1 (32-bit) each.
To avoid enforcing an unnecessarily low bound on the number of
simultaneous mappings of a single MFN, introduce
radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling
radix_tree_{int_to_ptr,ptr_to_int}.
As a consequence locking changes are also applicable: With there no
longer being any inspection of the remote domain's active entries,
there's also no need anymore to hold the remote domain's grant table
lock. And since we're no longer iterating over the local domain's map
track table, the lock in map_grant_ref() can also be dropped before the
new maptrack entry actually gets populated.
As a nice side effect this also reduces the number of IOMMU operations
in unmap_common(): Previously we would have "established" a readable
mapping whenever we didn't find a writable entry anymore (yet, of
course, at least one readable one). But we only need to do this if we
actually dropped the last writable entry, not if there were none already
before.
This is part of CVE-2021-28698 / XSA-380.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Wed, 25 Aug 2021 12:18:18 +0000 (14:18 +0200)]
gnttab: add preemption check to gnttab_release_mappings()
A guest may die with many grant mappings still in place, or simply with
a large maptrack table. Iterating through this may take more time than
is reasonable without intermediate preemption (to run softirqs and
perhaps the scheduler).
Move the invocation of the function to the section where other
restartable functions get invoked, and have the function itself check
for preemption every once in a while. Have it iterate the table
backwards, such that decreasing the maptrack limit is all it takes to
convey restart information.
In domain_teardown() introduce PROG_none such that inserting at the
front will be easier going forward.
This is part of CVE-2021-28698 / XSA-380.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Wed, 25 Aug 2021 12:17:56 +0000 (14:17 +0200)]
x86/mm: widen locked region in xenmem_add_to_physmap_one()
For pages which can be made part of the P2M by the guest, but which can
also later be de-allocated (grant table v2 status pages being the
present example), it is imperative that they be mapped at no more than a
single GFN. We therefore need to make sure that of two parallel
XENMAPSPACE_grant_table requests for the same status page one completes
before the second checks at which other GFN the underlying MFN is
presently mapped.
Pull ahead the respective get_gfn() and push down the respective
put_gfn(). This leverages that gfn_lock() really aliases p2m_lock(), but
the function makes this assumption already anyway: In the
XENMAPSPACE_gmfn case lock nesting constraints for both involved GFNs
would otherwise need to be enforced to avoid ABBA deadlocks.
This is CVE-2021-28697 / XSA-379.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Wed, 25 Aug 2021 12:17:32 +0000 (14:17 +0200)]
x86/p2m: guard (in particular) identity mapping entries
Such entries, created by set_identity_p2m_entry(), should only be
destroyed by clear_identity_p2m_entry(). However, similarly, entries
created by set_mmio_p2m_entry() should only be torn down by
clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as
the entry type (separation between "ordinary" and 1:1 mappings would
require a further indicator to tell apart the two).
As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH:
allow guest_remove_page to remove p2m_mmio_direct pages"), which
introduced the call to clear_mmio_p2m_entry(), claimed this was done for
hwdom only without this actually having been the case. However, this
code shouldn't be there in the first place, as MMIO entries shouldn't be
dropped this way. Avoid triggering the warning again that 48dfb297a20a
silenced by an adjustment to xenmem_add_to_physmap_one() instead.
Note that guest_physmap_mark_populate_on_demand() gets tightened beyond
the immediate purpose of this change.
Note also that I didn't inspect code which isn't security supported,
e.g. sharing, paging, or altp2m.
This is CVE-2021-28694 / part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 25 Aug 2021 12:17:07 +0000 (14:17 +0200)]
x86/p2m: introduce p2m_is_special()
Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO
handling, introduce a new P2M type group named "special" (as in "needing
special accessors to create/destroy").
Also use -EPERM instead of other error codes on the two domain_crash()
paths touched.
This is part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 25 Aug 2021 12:16:46 +0000 (14:16 +0200)]
AMD/IOMMU: re-arrange exclusion range and unity map recording
The spec makes no provisions for OS behavior here to depend on the
amount of RAM found on the system. While the spec may not sufficiently
clearly distinguish both kinds of regions, they are surely meant to be
separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should
be candidates for putting in the exclusion range registers. (As there's
only a single such pair of registers per IOMMU, secondary non-adjacent
regions with the flag set already get converted to unity mapped
regions.)
First of all, drop the dependency on max_page. With commit b4f042236ae0
("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the
use of it here was stale anyway; it was bogus already before, as it
didn't account for max_page getting increased later on. Simply try an
exclusion range registration first, and if it fails (for being
unsuitable or non-mergeable), register a unity mapping range.
With this various local variables become unnecessary and hence get
dropped at the same time.
With the max_page boundary dropped for using unity maps, the minimum
page table tree height now needs both recording and enforcing in
amd_iommu_domain_init(). Since we can't predict which devices may get
assigned to a domain, our only option is to uniformly force at least
that height for all domains, now that the height isn't dynamic anymore.
Further don't make use of the exclusion range unless ACPI data says so.
Note that exclusion range registration in
register_range_for_all_devices() is on a best effort basis. Hence unity
map entries also registered are redundant when the former succeeded, but
they also do no harm. Improvements in this area can be done later imo.
Also adjust types where suitable without touching extra lines.
This is part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Prior to the assignment step having completed successfully, devices
should not get associated with their new owner. Hand the device to DomIO
(perhaps temporarily), until after the de-assignment step has completed.
De-assignment of a device (from other than Dom0) as well as failure of
reassign_device() during assignment should result in unity mappings
getting torn down. This in turn requires switching to a refcounted
mapping approach, as was already used by VT-d for its RMRRs, to prevent
unmapping a region used by multiple devices.
This is CVE-2021-28696 / part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 25 Aug 2021 12:15:57 +0000 (14:15 +0200)]
IOMMU: generalize VT-d's tracking of mapped RMRR regions
In order to re-use it elsewhere, move the logic to vendor independent
code and strip it of RMRR specifics.
Note that the prior "map" parameter gets folded into the new "p2ma" one
(which AMD IOMMU code will want to make use of), assigning alternative
meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and
p2m_get_iommu_flags() for getting passed access types other than
p2m_access_rw (in the latter case just for p2m_mmio_direct requests).
Note also that, to be on the safe side, an overlap check gets added to
the main loop of iommu_identity_mapping().
This is part of XSA-378.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 25 Aug 2021 12:15:11 +0000 (14:15 +0200)]
AMD/IOMMU: correct device unity map handling
Blindly assuming all addresses between any two such ranges, specified by
firmware in the ACPI tables, should also be unity-mapped can't be right.
Nor can it be correct to merge ranges with differing permissions. Track
ranges individually; don't merge at all, but check for overlaps instead.
This requires bubbling up error indicators, such that IOMMU init can be
failed when allocation of a new tracking struct wasn't possible, or an
overlap was detected.
At this occasion also stop ignoring
amd_iommu_reserve_domain_unity_map()'s return value.
This is part of XSA-378 / CVE-2021-28695.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Wed, 25 Aug 2021 12:12:13 +0000 (14:12 +0200)]
AMD/IOMMU: correct global exclusion range extending
Besides unity mapping regions, the AMD IOMMU spec also provides for
exclusion ranges (areas of memory not to be subject to DMA translation)
to be specified by firmware in the ACPI tables. The spec does not put
any constraints on the number of such regions.
Blindly assuming all addresses between any two such ranges should also
be excluded can't be right. Since hardware has room for just a single
such range (comprised of the Exclusion Base Register and the Exclusion
Range Limit Register), combine only adjacent or overlapping regions (for
now; this may require further adjustment in case table entries aren't
sorted by address) with matching exclusion_allow_all settings. This
requires bubbling up error indicators, such that IOMMU init can be
failed when concatenation wasn't possible.
Furthermore, since the exclusion range specified in IOMMU registers
implies R/W access, reject requests asking for less permissions (this
will be brought closer to the spec by a subsequent change).
This is part of XSA-378 / CVE-2021-28695.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Michal Orzel [Fri, 20 Aug 2021 09:39:24 +0000 (11:39 +0200)]
xen/public: arch-arm: Add mention of argo_op hypercall
Commit 1ddc0d43c20cb1c1125d4d6cefc78624b2a9ccb7 introducing
argo_op hypercall forgot to add a mention of it in the
comment listing supported hypercalls. Fix that.
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com> Acked-by: Julien Grall <jgrall@amazon.com>
When a device is assigned/de-assigned it is required to properly set
IOMMU domain used to protect the device. This assignment was missing,
thus it was not possible to de-assign the device:
(XEN) Deassigning device 0000:03:00.0 from dom2
(XEN) smmu: 0000:03:00.0: not attached to domain 2
(XEN) d2: deassign (0000:03:00.0) failed (-3)
Fix this by assigning IOMMU domain on arm_smmu_assign_dev and reset it
to NULL on arm_smmu_deassign_dev.
ns16550: properly gate Exar PCIe UART cards support
Arm is about to get PCI passthrough support which means CONFIG_HAS_PCI
will be enabled, so this code will fail as Arm doesn't have ns16550
PCI support:
ns16550.c:313:5: error: implicit declaration of function 'enable_exar_enhanced_bits' [-Werror=implicit-function-declaration]
313 | enable_exar_enhanced_bits(uart);
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Fix this by gating Exar PCIe UART cards support with the above in mind.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 20 Aug 2021 10:30:35 +0000 (12:30 +0200)]
AMD/IOMMU: don't leave page table mapped when unmapping ...
... an already not mapped page. With all other exit paths doing the
unmap, I have no idea how I managed to miss that aspect at the time.
Fixes: ad591454f069 ("AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Besides standard UART setup, this device needs enabling
(vendor-specific) "Enhanced Control Bits" - otherwise disabling hardware
control flow (MCR[2]) is ignored. Add appropriate quirk to the
ns16550_setup_preirq(), similar to the handle_dw_usr_busy_quirk(). The
new function act on Exar 2-, 4-, and 8- port cards only. I have tested
the functionality on 2-port card but based on the Linux driver, the same
applies to other models too.
Additionally, Exar card supports fractional divisor (DLD[3:0] register,
at 0x02). This part is not supported here yet, and seems to not
be required for working 115200bps at the very least.
The specification for the 2-port card is available at:
https://www.maxlinear.com/product/interface/uarts/pcie-uarts/xr17v352
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 20 Aug 2021 10:28:07 +0000 (12:28 +0200)]
x86/PV: account for 32-bit Dom0 in mark_pv_pt_pages_rdonly()'s ASSERT()s
Clearly I neglected the special needs here, and also failed to test the
change with a debug build of Xen.
Fixes: 6b1ca51b1a91 ("x86/PV: assert page state in mark_pv_pt_pages_rdonly()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jane Malalane [Tue, 17 Aug 2021 15:19:24 +0000 (16:19 +0100)]
libs/guest: Move the guest ABI check earlier into xc_dom_parse_image()
Xen may not support 32-bit PV guest for a number of reasons (lack of
CONFIG_PV32, explicit pv=no-32 command line argument, or implicitly
due to CET being enabled) and advertises this to the toolstack via the
absence of xen-3.0-x86_32p ABI.
Currently, when trying to boot a 32-bit PV guest, the ABI check is too
late and the build explodes in the following manner yielding an
unhelpful error message:
xc: error: panic: xg_dom_boot.c:121: xc_dom_boot_mem_init: can't allocate low memory for domain: Out of memory
libxl: error: libxl_dom.c:586:libxl__build_dom: xc_dom_boot_mem_init failed: Operation not supported
libxl: error: libxl_create.c:1573:domcreate_rebuild_done: Domain 1:cannot (re-)build domain: -3
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 1:Non-existant domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 1:Unable to destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 1:Destruction of domain failed
Move the ABI check earlier into xc_dom_parse_image() along with other
ELF-note feature checks. With this adjustment, it now looks like
this:
xc: error: panic: xg_dom_boot.c:88: xc_dom_compat_check: guest type xen-3.0-x86_32p not supported by xen kernel, sorry: Invalid kernel
libxl: error: libxl_dom.c:571:libxl__build_dom: xc_dom_parse_image failed
domainbuilder: detail: xc_dom_release: called
libxl: error: libxl_create.c:1573:domcreate_rebuild_done: Domain 11:cannot (re-)build domain: -3
libxl: error: libxl_domain.c:1182:libxl__destroy_domid: Domain 11:Non-existant domain
libxl: error: libxl_domain.c:1136:domain_destroy_callback: Domain 11:Unable to destroy guest
libxl: error: libxl_domain.c:1063:domain_destroy_cb: Domain 11:Destruction of domain failed
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <iwj@xenproject.org>
Juergen Gross [Thu, 19 Aug 2021 11:38:31 +0000 (13:38 +0200)]
xen/sched: fix get_cpu_idle_time() for smt=0 suspend/resume
With smt=0 during a suspend/resume cycle of the machine the threads
which have been parked before will briefly come up again. This can
result in problems e.g. with cpufreq driver being active as this will
call into get_cpu_idle_time() for a cpu without initialized scheduler
data.
Fix that by letting get_cpu_idle_time() deal with this case. Drop a
redundant check in exchange.
Fixes: 132cbe8f35632fb2 ("sched: fix get_cpu_idle_time() with core scheduling") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
Jan Beulich [Thu, 19 Aug 2021 11:37:42 +0000 (13:37 +0200)]
Arm: relax iomem_access_permitted() check
Ranges checked by iomem_access_permitted() are inclusive; to permit a
mapping there's no need for access to also have been granted for the
subsequent page.
Fixes: 80f9c3167084 ("xen/arm: acpi: Map MMIO on fault in stage-2 page table for the hardware domain") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Thu, 19 Aug 2021 11:36:54 +0000 (13:36 +0200)]
x86: mark compat hypercall regs clobbering for intended fall-through
Oddly enough in the original report Coverity only complained about the
native hypercall related switch() statements. Now that it has seen those
fixed, it complains about (only HVM) compat ones. Hence the CIDs below
are all for the HVM side of things, yet while at it take care of the PV
side as well.
Coverity-ID: 1487105, 1487106, 1487107, 1487108, 1487109. Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 18 Aug 2021 07:44:14 +0000 (09:44 +0200)]
VT-d: Tylersburg errata apply to further steppings
While for 5500 and 5520 chipsets only B3 and C2 are mentioned in the
spec update, X58's also mentions B2, and searching the internet suggests
systems with this stepping are actually in use. Even worse, for X58
erratum #69 is marked applicable even to C2. Split the check to cover
all applicable steppings and to also report applicable errata numbers in
the log message. The splitting requires using the DMI port instead of
the System Management Registers device, but that's then in line (also
revision checking wise) with the spec updates.
Fixes: 6890cebc6a98 ("VT-d: deal with 5500/5520/X58 errata") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Wed, 18 Aug 2021 07:40:08 +0000 (09:40 +0200)]
x86/PV: assert page state in mark_pv_pt_pages_rdonly()
About every time I look at dom0_construct_pv()'s "calculation" of
nr_pt_pages I question (myself) whether the result is precise or merely
an upper bound. I think it is meant to be precise, but I think we would
be better off having some checking in place. Hence add ASSERT()s to
verify that
- all pages have a valid L1...Ln (currently L4) page table type and
- no other bits are set, in particular the type refcount is still zero.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citirx.com>