Jan Beulich [Mon, 25 Jul 2022 13:34:55 +0000 (15:34 +0200)]
IOMMU/x86: new command line option to suppress use of superpage mappings
Before actually enabling their use, provide a means to suppress it in
case of problems. Note that using the option can also affect the sharing
of page tables in the VT-d / EPT combination: If EPT would use large
page mappings but the option is in effect, page table sharing would be
suppressed (to properly fulfill the admin request).
Requested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 25 Jul 2022 13:33:34 +0000 (15:33 +0200)]
IOMMU/x86: support freeing of pagetables
For vendor specific code to support superpages we need to be able to
deal with a superpage mapping replacing an intermediate page table (or
hierarchy thereof). Consequently an iommu_alloc_pgtable() counterpart is
needed to free individual page tables while a domain is still alive.
Since the freeing needs to be deferred until after a suitable IOTLB
flush was performed, released page tables get queued for processing by a
tasklet.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org>
Jan Beulich [Mon, 25 Jul 2022 13:32:59 +0000 (15:32 +0200)]
IOMMU/x86: perform PV Dom0 mappings in batches
For large page mappings to be easily usable (i.e. in particular without
un-shattering of smaller page mappings) and for mapping operations to
then also be more efficient, pass batches of Dom0 memory to iommu_map().
In dom0_construct_pv() and its helpers (covering strict mode) this
additionally requires establishing the type of those pages (albeit with
zero type references).
The earlier establishing of PGT_writable_page | PGT_validated requires
the existing places where this gets done (through get_page_and_type())
to be updated: For pages which actually have a mapping, the type
refcount needs to be 1.
There is actually a related bug that gets fixed here as a side effect:
Typically the last L1 table would get marked as such only after
get_page_and_type(..., PGT_writable_page). While this is fine as far as
refcounting goes, the page did remain mapped in the IOMMU in this case
(when "iommu=dom0-strict").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
The loop in iommu_{,un}map() can be arbitrarily large, and as such it
needs to handle preemption. Introduce a new flag that signals whether
the function should do preemption checks, returning the number of pages
that have been processed in case a need for preemption was actually
found.
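
The preemption contract described above can be sketched as follows. This is an illustrative Python model, not Xen code: map_pages(), map_all(), map_one and need_preempt are hypothetical names standing in for the mapping loop, its caller, the per-page operation and the new "do preemption checks" flag.

```python
def map_pages(pages, map_one, need_preempt=None):
    """Map pages one at a time and return how many were processed.

    need_preempt models the new flag: when given, it is polled after
    each page. A result smaller than len(pages) tells the caller that
    the call was preempted and must be continued.
    """
    done = 0
    for page in pages:
        map_one(page)
        done += 1
        if need_preempt is not None and done < len(pages) and need_preempt():
            break
    return done


def map_all(pages, map_one, need_preempt):
    """Caller side: re-invoke until the whole range has been mapped."""
    start = 0
    while start < len(pages):
        start += map_pages(pages[start:], map_one, need_preempt)
        # A real caller would let other pending work run at this point.
    return start
```

Because a preempted call leaves part of the range mapped, any cleanup on error has to cover everything mapped by earlier calls as well, which is the caveat spelled out in the following paragraph.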
Note that the cleanup done in iommu_map() can now be incomplete if
preemption has happened, and hence callers would need to take care of
unmapping the whole range (i.e. ranges already mapped by previously
preempted calls). So far none of the callers care about having those
ranges unmapped, so error handling in arch_iommu_hwdom_init() can be
kept as-is.
Note that iommu_legacy_{un,}map() are left without preemption handling:
callers of those interfaces aren't going to be modified to pass bigger
chunks, and hence the functions won't be modified as they are legacy and
uses should be replaced with iommu_{un,}map() instead if preemption is
required.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Anthony PERARD [Thu, 21 Jul 2022 12:46:02 +0000 (13:46 +0100)]
automation: use "needs" instead of "dependencies" for test jobs
Like with "dependencies", the jobs will get artifacts from the jobs
listed in "needs". But the test jobs can run as soon as the build jobs
listed have finished.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Hongyan Xia [Wed, 24 Feb 2021 18:43:13 +0000 (18:43 +0000)]
xen/heap: pass order to free_heap_pages() in heap init
The idea is to split the range into multiple aligned power-of-2 regions,
each of which needs only a single call to free_heap_pages(). We check the least
significant set bit of the start address and use its bit index as the
order of this increment. This makes sure that each increment is both
power-of-2 and properly aligned, which can be safely passed to
free_heap_pages(). Of course, the order also needs to be sanity checked
against the upper bound and MAX_ORDER.
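
A minimal sketch of the splitting rule described above, written in Python for illustration only. split_range() is a hypothetical helper and the MAX_ORDER value is an assumption; the actual change operates on page frame numbers inside the heap initialisation code.

```python
MAX_ORDER = 20  # assumed value for illustration only


def split_range(start, end, max_order=MAX_ORDER):
    """Split page-frame range [start, end) into (start, order) chunks.

    Every chunk covers 2**order pages and is aligned to its own size,
    so each one could be handed over in a single call.
    """
    chunks = []
    while start < end:
        if start:
            # Bit index of the least significant set bit of the start.
            order = (start & -start).bit_length() - 1
        else:
            order = max_order  # frame 0 is aligned to every order
        order = min(order, max_order)
        # Sanity check against the upper bound of the range.
        while start + (1 << order) > end:
            order -= 1
        chunks.append((start, order))
        start += 1 << order
    return chunks
```

For example, split_range(5, 16) yields chunks of 1, 2 and 8 pages starting at frames 5, 6 and 8, so three calls replace eleven page-by-page ones.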
Tested on a nested environment on c5.metal with various amounts
of RAM and CONFIG_DEBUG=n. Time for end_boot_allocator() to complete:
            Before     After
    - 90GB: 1445 ms    96 ms
    -  8GB:  126 ms     8 ms
    -  4GB:   62 ms     4 ms
At the moment, init_heap_pages() will call free_heap_pages() page
by page. To reduce the time to initialize the heap, we will want
to provide multiple pages at the same time.
init_heap_pages() is now split in two parts:
- init_heap_pages(): will break down the range into multiple sets
of contiguous pages. For now, the criterion is that the pages should
belong to the same NUMA node.
- _init_heap_pages(): will initialize a set of pages belonging to
the same NUMA node. In a follow-up patch, new requirements will
be added (e.g. pages should belong to the same zone). For now the
pages are still passed one by one to free_heap_pages().
Note that the comment before init_heap_pages() is heavily outdated and
does not reflect the current code. So update it.
This patch is a merge/rework of patches from David Woodhouse and
Hongyan Xia.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/gnttab: Store frame GFN in struct page_info on Arm
Rework Arm implementation to store grant table frame GFN
in struct page_info directly instead of keeping it in
standalone status/shared arrays. This patch is based on
the assumption that a grant table page is a xenheap page.
To cover 64-bit/40-bit IPA on Arm64/Arm32 we need the space
to hold 52-bit/28-bit + extra bit value respectively. In order
to not grow the size of struct page_info, borrow the required
number of bits from type_info's count portion, which can spare
them in the current context (currently only 1 bit is used on Arm).
Please note, to minimize code changes and avoid introducing
extra #ifdef-s to the header, we keep the same number of
bits on both subarches, although the count portion on Arm64
could be wider, so we waste some bits here.
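
The bit widths quoted above follow from subtracting the page shift from the IPA width. The following Python lines only reproduce that arithmetic; gfn_bits() is a hypothetical helper, 4 KiB pages are assumed, and the purpose of the extra bit is not modelled.

```python
PAGE_SHIFT = 12  # assumes 4 KiB pages


def gfn_bits(ipa_bits, extra_bits=1):
    """Bits needed to hold a frame number for the given IPA width,
    plus the extra bit mentioned in the text above."""
    return ipa_bits - PAGE_SHIFT + extra_bits
```

With these assumptions a 64-bit IPA needs 52 + 1 bits and a 40-bit IPA needs 28 + 1 bits.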
Introduce corresponding PGT_* constructs and access macro
page_get(set)_xenheap_gfn. Please note, all accesses to
the GFN portion of type_info field should always be protected
by the P2M lock. In case when it is not feasible to satisfy
that requirement (risk of deadlock, lock inversion, etc)
it is important to make sure that all non-protected updates
to this field are atomic.
As several non-protected read accesses still exist within
current code (most calls to page_get_xenheap_gfn() are not
protected by the P2M lock) the subsequent patch will introduce
hardening code for p2m_remove_mapping() to be called with P2M
lock held in order to check any difference between what is
already mapped and what is requested to be unmapped.
Update existing gnttab macros to deal with GFN value according
to new location. Also update the use of count portion of type_info
field on Arm in share_xen_page_with_guest().
While at it, extend this simplified M2P-like approach for any
xenheap pages which are processed in xenmem_add_to_physmap_one()
except foreign ones. Update the code to set GFN portion after
establishing new mapping for the xenheap page in said function
and to clean GFN portion when putting a reference on that page
in p2m_put_l3_page().
And for everything to work correctly introduce arch-specific
initialization pattern PGT_TYPE_INFO_INITIALIZER to be applied
to type_info field during initialization at alloc_heap_pages()
and acquire_staticmem_pages(). The pattern's purpose on Arm
is to clear the GFN portion before use, on x86 it is just
a stub.
This patch is intended to fix the potential issue on Arm
which might happen when remapping grant-table frame.
A guest (or the toolstack) will unmap the grant-table frame
using XENMEM_remove_physmap. This is a generic hypercall,
so on x86, we are relying on the fact the M2P entry will
be cleared on removal. For architecture without the M2P,
the GFN would still be present in the grant frame/status
array. So on the next call to map the page, we will end up
requesting the P2M to remove whatever mapping was at the given GFN.
This could well be another mapping.
Please note, this patch also changes how the shared_info
page (which is a xenheap RAM page) is mapped in xenmem_add_to_physmap_one().
Now, we only allow the shared_info page to be mapped once. Subsequent
attempts to map it will result in -EBUSY. Doing that we mandate
the caller to first unmap the page before mapping it again. This is
to prevent Xen creating an unwanted hole in the P2M. For instance,
this could happen if the firmware stole a RAM address for mapping
the shared_info page into but forgot to unmap it afterwards.
Besides that, this patch simplifies arch code on Arm by
removing arrays and corresponding management code and
as a result gnttab_init_arch/gnttab_destroy_arch helpers
and struct grant_table_arch become useless and can be
dropped globally.
xen/arm: Harden the P2M code in p2m_remove_mapping()
Borrow the x86's check from p2m_remove_page() which was added
by the following commit: c65ea16dbcafbe4fe21693b18f8c2a3c5d14600e
"x86/p2m: don't assert that the passed in MFN matches for a remove"
and adjust it to the Arm code base.
Basically, this check will be strictly needed for the xenheap pages
after applying a subsequent commit which will introduce xenheap based
M2P approach on Arm. But, it will be a good opportunity to harden
the P2M code for *every* RAM page since it is possible to remove
any GFN - MFN mapping currently on Arm (even with the wrong helpers).
Jan Beulich [Wed, 20 Jul 2022 13:48:49 +0000 (15:48 +0200)]
x86: also suppress use of MMX insns
Passing -mno-sse alone is not enough: The compiler may still find
(questionable) reasons to use MMX insns. In particular with gcc12 use
of MOVD+PUNPCKLDQ+MOVQ was observed in an apparent attempt to auto-
vectorize the storing of two adjacent zeroes, 32 bits each.
Reported-by: ChrisD <chris@dalessio.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 20 Jul 2022 13:46:48 +0000 (15:46 +0200)]
x86emul: add memory operand low bits checks for ENQCMD{,S}
Already ISE rev 044 added text to this effect; rev 045 further dropped
leftover earlier text indicating the contrary:
- ENQCMD requires the low 32 bits of the memory operand to be clear,
- ENQCMDS requires bits 20...30 of the memory operand to be clear.
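
The two operand checks listed above amount to simple bit tests on the low 32 bits of the memory operand. A Python sketch under the assumption that the operand is available as a little-endian byte string; the function names and the mask constant are hypothetical, not the emulator's own.

```python
# Bits 20..30 inclusive, i.e. 0x7ff00000.
ENQCMDS_MUST_BE_CLEAR = ((1 << 31) - 1) & ~((1 << 20) - 1)


def enqcmd_operand_ok(operand):
    """ENQCMD: the low 32 bits of the memory operand must be clear."""
    low = int.from_bytes(operand[:4], "little")
    return low == 0


def enqcmds_operand_ok(operand):
    """ENQCMDS: bits 20..30 of the memory operand must be clear."""
    low = int.from_bytes(operand[:4], "little")
    return (low & ENQCMDS_MUST_BE_CLEAR) == 0
```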
Fixes: d27385968741 ("x86emul: support ENQCMD insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 18 Jul 2022 13:15:08 +0000 (14:15 +0100)]
x86/spec-ctrl: Make svm_vmexit_spec_ctrl conditional
The logic was written this way out of an abundance of caution, but the reality
is that AMD parts don't currently have the RAS-flushing side effect, nor do
they intend to gain it.
This removes one WRMSR from the VMExit path by default on Zen2 systems.
Fixes: 614cec7d79d7 ("x86/svm: VMEntry/Exit logic for MSR_SPEC_CTRL") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 30 Jun 2022 21:15:25 +0000 (22:15 +0100)]
x86/spec-ctrl: Consistently halt speculation using int3
The RSB stuffing loop and retpoline thunks date from the very beginning, when
halting speculation was a brand new field.
These days, we've largely settled on int3 for halting speculation in
non-architectural paths. It's a single byte, and is fully serialising - a
requirement for delivering #BP if it were to execute.
Update the thunks. Mostly for consistency across the codebase, but it does
shrink every entrypath in Xen by 6 bytes which is a marginal win.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
tools/xl: use sparse init for dom_info, remove duplicate vars
Rather than having shadow variables for every element of dom_info, it is
better to properly initialize dom_info at the start. This also removes
the misleading memset() in the middle of main_create().
Remove the dryrun element of domain_create as that has been displaced
by the global "dryrun_only" variable.
Signed-off-by: Elliott Mitchell <ehem+xen@m5p.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Tue, 19 Jul 2022 06:37:29 +0000 (08:37 +0200)]
x86: deal with gcc12 release build issues
While a number of issues we previously had with pre-release gcc12 were
fixed in the final release, we continue to have one issue (with multiple
instances) when doing release builds (i.e. at higher optimization
levels): The compiler takes issue with subtracting (always 1 in our
case) from artificial labels (expressed as arrays) marking the end of
certain regions. This isn't an unreasonable position to take. Simply
hide the "array-ness" by casting to an integer type. To keep things
looking consistent, apply the same cast also to the respective
expressions dealing with the starting addresses. (Note how
efi_arch_memory_setup()'s l2_table_offset() invocations avoid a similar
issue by already having the necessary casts.) In is_xen_fixed_mfn()
further switch from __pa() to virt_to_maddr() to better match the left
sides of the <= operators.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
The per-cpu variable last_tickle_cpu is referenced only in credit.c.
Change its linkage from external to internal by adding the storage-class
specifier static to its definitions.
Also, this patch aims to resolve indirectly a MISRA C 2012 Rule 8.4 violation
warning.
The function vm_event_wake() is referenced only in vm_event.c.
Change the linkage of the function from external to internal by adding
the storage-class specifier static to the function definition.
Also, this patch aims to resolve indirectly a MISRA C 2012 Rule 8.4 violation
warning.
Jan Beulich [Mon, 18 Jul 2022 15:48:40 +0000 (17:48 +0200)]
EFI: strip xen.efi when putting it on the EFI partition
With debug info retained, xen.efi can be quite large. Unlike for xen.gz
there's no intermediate step (mkelf32 there) involved which would strip
debug info kind of as a side effect. While the installing of xen.efi on
the EFI partition is an optional step (intended to be a courtesy to the
developer), adjust it also for the purpose of documenting what distros
would be expected to do during boot loader configuration (which is what
would normally put xen.efi into the EFI partition).
Model the control over stripping after Linux's module installation,
except that the stripped executable is constructed in the build area
instead of in the destination location. This is to conserve on space
used there - EFI partitions tend to be only a few hundred MB in size.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Henry Wang <Henry.Wang@arm.com> Tested-by: Wei Chen <Wei.Chen@arm.com> # arm Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Mon, 18 Jul 2022 15:48:18 +0000 (17:48 +0200)]
xl: move freemem()'s "credit expired" loop exit
Move the "credit expired" loop exit to the middle of the loop,
immediately after "return true". This way having reached the goal on the
last iteration would be reported as success to the caller, rather than
as "timed out".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
tools/xenstore: add documentation for extended watch command
Add documentation for an extension of the WATCH command used to limit
the scope of watched paths. Additionally, it allows receiving more
information in the events related to special watches (@introduceDomain
or @releaseDomain).
tools/xenstore: add documentation for new set/get-feature commands
Add documentation for two new Xenstore wire commands SET_FEATURE and
GET_FEATURE used to set or query the Xenstore features visible in the
ring page of a given domain.
Andrew Cooper [Fri, 15 Jul 2022 11:27:08 +0000 (12:27 +0100)]
xen/wait: Minor asm improvements
There is no point preserving all registers. Instead, preserve an arbitrary 6
registers, and list the rest as clobbered. This does not alter the register
scheduling at all, but does reduce the amount of state needing saving.
Use a named parameter for page size, instead of needing to parse which is
parameter 3.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 15 Jul 2022 12:39:29 +0000 (13:39 +0100)]
xen/wait: Drop vestigial remnants of TRAP_regs_partial
The preservation of entry_vector was introduced with ecf9846a6a20 ("x86:
save/restore only partial register state where possible") where
TRAP_regs_partial was introduced, but missed from f9eb74789af7 ("x86/entry:
Remove support for partial cpu_user_regs frames") where TRAP_regs_partial was
removed.
Fixes: f9eb74789af7 ("x86/entry: Remove support for partial cpu_user_regs frames") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 15 Jul 2022 11:53:09 +0000 (12:53 +0100)]
xen: Fix latent check-endbr.sh bug with 32bit build environments
While Xen's current VMA means it works, the mawk fix (i.e. using $((0xN)) in
the shell) isn't portable in 32bit shells. See the code comment for the fix.
The fix found a second latent bug. Recombining $vma_hi/lo should have used
printf "%s%08x" and only worked previously because $vma_lo had bits set in
its top nibble. Combining with the main fix, %08x becomes %07x.
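
The second latent bug is a zero-padding issue, which can be modelled as below. This Python sketch is illustrative only: recombine() and recombine_unpadded() are hypothetical names, and the example values are merely chosen to resemble a split 64-bit address.

```python
def recombine(vma_hi, vma_lo, lo_digits=8):
    """Rejoin the high hex digits (a string) with the numeric low part.

    The low part is zero-padded to its full width, mirroring
    printf "%s%08x" (or "%s%07x" for a 7-digit low part).
    """
    return "%s%0*x" % (vma_hi, lo_digits, vma_lo)


def recombine_unpadded(vma_hi, vma_lo):
    """Unpadded variant: leading zero digits of vma_lo are dropped."""
    return "%s%x" % (vma_hi, vma_lo)
```

With vma_lo = 0x40200000 both variants agree, which is why the unpadded form appeared to work; with vma_lo = 0x00200000 only the padded form returns a full 16-digit address.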
Fixes: b2ebe879a444 ("xen: Fix check-endbr.sh with mawk") Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Thu, 14 Jul 2022 14:39:06 +0000 (15:39 +0100)]
xen: Fix check-endbr.sh with mawk
check-endbr.sh works with gawk, but fails with mawk. The produced $ALL
file is smaller as it is missing 0x$vma_lo on every line. With mawk,
int(0x2A) just produces 0, instead of the expected value.
The use of hexadecimal-constant in awk is an optional part of the POSIX
spec, and mawk doesn't seem to implement it.
There is a way to convert a hexadecimal value to a number by putting it in a
string, and awk as I understand is supposed to use strtod() to convert
the string to a number when needed. The expression 'int("0x15") + 21'
would produce the expected value in `mawk` but now `gawk` won't convert
the string to a number unless we use the option "--non-decimal-data".
So let's convert the hexadecimal number before using it in the awk
script. The shell has no issue with dealing with hexadecimal-constant so
we'll simply use the expression "$(( 0x15 ))" to convert the value
before using it in awk.
Note: This does introduce a latent portability bug, which is fixed in a separate
change to avoid mixing complexity/explanations.
Fixes: 4d037425dc ("x86: Build check for embedded endbr64 instructions")
Resolves: xen-project/xen#26 Reported-by: Luca Fancellu <Luca.Fancellu@arm.com> Reported-by: Mathieu Tarral <mathieu.tarral@protonmail.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
xen/arm: mm: Add more ASSERT() in {destroy, modify}_xen_mappings()
Both destroy_xen_mappings() and modify_xen_mappings() take a range
[start, end[ as parameters. Both ends should be page aligned.
Add extra ASSERT() to ensure start and end are page aligned. Take the
opportunity to rename 'v' to 's' to be consistent with the other helper.
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
----
Changes in v2:
- Also modify prototype. Note that on x86, the first parameter
was not matching in the declaration and prototype.
- Add Bertrand's reviewed-by
xen/arm: head: Add missing isb after writing to SCTLR_EL2/HSCTLR
Write to SCTLR_EL2/HSCTLR may not be visible until the next context
synchronization. When initializing the CPU, we want the update to take
effect right now. So add an isb afterwards.
Spec references:
- AArch64: D13.1.2 ARM DDI 0406C.d
- AArch32 v8: G8.1.2 ARM DDI 0406C.d
- AArch32 v7: B5.6.3 ARM DDI 0406C.d
xen/arm32: head.S: Introduce a macro to load the physical address of a symbol
Many places in the ARM32 assembly need to load the physical address
of a symbol. Rather than open-coding the translation, introduce a new macro
that will load the physical address of a symbol.
Lastly, use the new macro to replace all the current open-coded versions.
Note that most of the comments associated to the code changed have been
removed because the code is now self-explanatory.
xen/arm: traps: Fix MISRA C 2012 Rule 8.4 violation
Add the function prototype of show_stack() in <asm/processor.h> header file
so that it is visible before its definition in traps.c.
Although show_stack() is referenced only in traps.c, it is declared with
external linkage because, during development, it is often called also by
other files for debugging purposes. Declaring it static would increase
development effort. Add an appropriate comment.
The three values are 64-bit and one (cval) is controlled by domain. In
theory, it would be possible that the domain has started a long time
after the system boot. So virt_time_base.offset - boot_count may be a
large number.
This means a domain may inadvertently set a cval so the result would
overflow. Consequently, the deadline would be set very far in the
future. This could result in the loss of timer interrupts or the vCPU
getting blocked "forever".
One way to solve the problem would be to separately
1) compute when the domain was created in ns
2) convert cval to ns
3) Add 1 and 2 together
The first part of the equation never changes (the value is set/known at
domain creation). So take the opportunity to store it in domain structure.
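
The three steps can be sketched as below. This is a Python model with hypothetical helper names and an explicit frequency parameter; Python integers do not overflow, so it only shows how the computation is split, not the 64-bit overflow itself.

```python
NS_PER_SEC = 1_000_000_000


def ticks_to_ns(ticks, freq_hz):
    """Convert timer ticks to nanoseconds for a counter running at freq_hz."""
    return ticks * NS_PER_SEC // freq_hz


def domain_created_ns(offset, boot_count, freq_hz):
    """Step 1: fixed at domain creation, so it can be computed once and stored."""
    return ticks_to_ns(offset - boot_count, freq_hz)


def deadline_ns(created_ns, cval, freq_hz):
    """Steps 2 and 3: convert the guest-controlled cval, then add the two."""
    return created_ns + ticks_to_ns(cval, freq_hz)
```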
Andrew Cooper [Mon, 27 Jun 2022 18:29:40 +0000 (19:29 +0100)]
x86/spec-ctrl: Mitigate Branch Type Confusion when possible
Branch Type Confusion affects AMD/Hygon CPUs on Zen2 and earlier. To
mitigate, we require SMT safety (STIBP on Zen2, no-SMT on Zen1), and to issue
an IBPB on each entry to Xen, to flush the BTB.
Due to performance concerns, dom0 (which is trusted in most configurations) is
excluded from protections by default.
Therefore:
* Use STIBP by default on Zen2 too, which now means we want it on by default
on all hardware supporting STIBP.
* Break the current IBPB logic out into a new function, extending it with
IBPB-at-entry logic.
* Change the existing IBPB-at-ctxt-switch boolean to be tristate, and disable
it by default when IBPB-at-entry is providing sufficient safety.
If all PV guests on the system are trusted, then it is recommended to boot
with `spec-ctrl=ibpb-entry=no-pv`, as this will provide an additional marginal
perf improvement.
This is part of XSA-407 / CVE-2022-23825.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 24 Feb 2022 13:44:33 +0000 (13:44 +0000)]
x86/spec-ctrl: Support IBPB-on-entry
We are going to need this to mitigate Branch Type Confusion on AMD/Hygon CPUs,
but as we've talked about using it in other cases too, arrange to support it
generally. However, this is also very expensive in some cases, so we're going
to want per-domain controls.
Introduce SCF_ist_ibpb and SCF_entry_ibpb controls, adding them to the IST and
DOM masks as appropriate. Also introduce X86_FEATURE_IBPB_ENTRY_{PV,HVM} to
patch the code blocks.
For SVM, the STGI is serialising enough to protect against Spectre-v1 attacks,
so no "else lfence" is necessary. VT-x will use the MSR host load list,
so doesn't need any code in the VMExit path.
For the IST path, we can't safely check CPL==0 to skip a flush, as we might
have hit an entry path before its IBPB. As IST hitting Xen is rare, flush
irrespective of CPL. A later path, SCF_ist_sc_msr, provides Spectre-v1
safety.
For the PV paths, we know we're interrupting CPL>0, while for the INTR paths,
we can safely check CPL==0. Only flush when interrupting guest context.
An "else lfence" is needed for safety, but we want to be able to skip it on
unaffected CPUs, so the block wants to be an alternative, which means the
lfence has to be inline rather than UNLIKELY() (the replacement block doesn't
have displacements fixed up for anything other than the first instruction).
As with SPEC_CTRL_ENTRY_FROM_INTR_IST, %rdx is 0 on entry so rely on this to
shrink the logic marginally. Update the comments to specify this new
dependency.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
We are shortly going to add a conditional IBPB in this path.
Therefore, we cannot hold spec_ctrl_flags in %eax, and rely on only clobbering
it after we're done with its contents. %rbx is available for use, and the
more normal register to hold preserved information in.
With %rax freed up, use it instead of %rdx for the RSB tmp register, and for
the adjustment to spec_ctrl_flags.
This leaves no use of %rdx, except as 0 for the upper half of WRMSR. In
practice, %rdx is 0 from SAVE_ALL on all paths and isn't likely to change in
the foreseeable future, so update the macro entry requirements to state this
dependency. This marginal optimisation can be revisited if circumstances
change.
No practical change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
We are shortly going to need to context switch new bits in both the vcpu and
S3 paths. Introduce SCF_IST_MASK and SCF_DOM_MASK, and rework d->arch.verw
into d->arch.spec_ctrl_flags to accommodate.
No functional change.
This is part of XSA-407.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 12 Jul 2022 13:25:00 +0000 (15:25 +0200)]
xl: relax freemem()'s retry calculation
While in principle possible also under other conditions as long as other
parallel operations potentially consuming memory aren't "locked out", in
particular with IOMMU large page mappings used in Dom0 (for PV when in
strict mode; for PVH when not sharing page tables with HAP) ballooning
out of individual pages can actually lead to less free memory available
afterwards. This is because to split a large page, one or more page
table pages are necessary (one per level that is split).
When rebooting a guest I've observed freemem() to fail: A single page
was required to be ballooned out (presumably because of heap
fragmentation in the hypervisor). This ballooning out of a single page
of course went fast, but freemem() then found that it would need to
balloon out another page. This repeating just one more time leads to the
function signalling failure to the caller - without having come anywhere
near the designated 30s that the whole process is allowed to not make
any progress at all.
Convert from a simple retry count to actually calculating elapsed time,
subtracting from an initial credit of 30s. Don't go as far as limiting
the "wait_secs" value passed to libxl_wait_for_memory_target(), though.
While this leads to the overall process now possibly taking longer (if
the previous iteration ended very close to the intended 30s), this
compensates to some degree for the value passed really meaning "allowed
to run for this long without making progress".
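
The switch from a retry count to a time credit can be modelled as below. This Python sketch is not the xl code: enough_free() and balloon_step() are hypothetical callbacks standing in for the libxl calls, and checking the goal before the credit is a choice made for this example.

```python
import time


def wait_for_free_memory(enough_free, balloon_step, credit_secs=30.0,
                         clock=time.monotonic):
    """Retry until enough memory is free or the time credit is used up.

    Only the time actually spent in balloon_step() is charged against
    the credit, so many fast iterations do not exhaust it early.
    """
    credit = credit_secs
    while True:
        if enough_free():
            return True
        if credit <= 0:
            return False
        start = clock()
        balloon_step()
        credit -= clock() - start
```

Passing a fake clock makes the behaviour easy to exercise: three steps of ten seconds each use up a 30 second credit, while any number of near-instant steps do not.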
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
George Dunlap [Tue, 12 Jul 2022 13:24:30 +0000 (15:24 +0200)]
MAINTAINERS: Make Daniel P. Smith sole XSM maintainer
While mail hasn't been bouncing, Daniel De Graaf has not been
responding to patch submissions or otherwise interacting with the
community for several years. Daniel Smith has at least been working
with the code, and is a regular member of our community; and he has
agreed to step up into the role.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
The EFI System Resource Table (ESRT) is necessary for fwupd to identify
firmware updates to install. According to the UEFI specification §23.4,
the ESRT shall be stored in memory of type EfiBootServicesData. However,
memory of type EfiBootServicesData is considered general-purpose memory
by Xen, so the ESRT needs to be moved somewhere where Xen will not
overwrite it. Copy the ESRT to memory of type EfiRuntimeServicesData,
which Xen will not reuse. dom0 can use the ESRT if (and only if) it is
in memory of type EfiRuntimeServicesData.
Earlier versions of this patch reserved the memory in which the ESRT was
located. This created awkward alignment problems, and required either
splitting the E820 table or wasting memory. It also would have required
a new platform op for dom0 to use to indicate if the ESRT is reserved.
By copying the ESRT into EfiRuntimeServicesData memory, the E820 table
does not need to be modified, and dom0 can just check the type of the
memory region containing the ESRT. The copy is only done if the ESRT is
not already in EfiRuntimeServicesData memory, avoiding memory leaks on
repeated kexec.
See https://lore.kernel.org/xen-devel/20200818184018.GN1679@mail-itl/T/
for details.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Tested-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Tue, 12 Jul 2022 06:38:51 +0000 (08:38 +0200)]
libxl: check return value of libxl__xs_directory in name2bdf
libxl__xs_directory() can potentially return NULL without setting `n`.
As `n` isn't initialised, we need to check libxl__xs_directory()
return value before checking `n`. Otherwise, `n` might be non-zero
with `bdfs` NULL which would lead to a segv.
Fixes: 57bff091f4 ("libxl: add 'name' field to 'libxl_device_pci' in the IDL...") Reported-by: "G.R." <firemeteor@users.sourceforge.net> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: "G.R." <firemeteor@users.sourceforge.net>
Andrew Cooper [Fri, 8 Jul 2022 15:11:40 +0000 (16:11 +0100)]
x86/spec-ctrl: Honour spec-ctrl=0 for unpriv-mmio sub-option
This was an oversight from when unpriv-mmio was introduced.
Fixes: 8c24b70fedcb ("x86/spec-ctrl: Add spec-ctrl=unpriv-mmio") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jane Malalane [Mon, 11 Jul 2022 10:15:05 +0000 (12:15 +0200)]
x86/HVM: allow per-domain usage of hardware virtualized APIC
Introduce a new per-domain creation x86 specific flag to
select whether hardware assisted virtualization should be used for
x{2}APIC.
A per-domain option is added to xl in order to select the usage of
x{2}APIC hardware assisted virtualization, as well as a global
configuration option.
Having all APIC interaction exit to Xen for emulation is slow and can
induce much overhead. Hardware can speed up x{2}APIC by decoding the
APIC access and providing a VM exit with a more specific exit reason
than a regular EPT fault or by altogether avoiding a VM exit.
On the other hand, being able to disable x{2}APIC hardware assisted
virtualization can be useful for testing and debugging purposes.
Note:
- vmx_install_vlapic_mapping doesn't require modifications regardless
of whether the guest has "Virtualize APIC accesses" enabled or not,
i.e., setting the APIC_ACCESS_ADDR VMCS field is fine so long as
virtualize_apic_accesses is supported by the CPU.
- Both per-domain and global assisted_x{2}apic options are not part of
the migration stream, unless explicitly set in the respective
configuration files. Default settings of assisted_x{2}apic done
internally by the toolstack, based on host capabilities at create
time, are not migrated.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: "Roger Pau Monné" <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Add XEN_SYSCTL_PHYSCAP_X86_ASSISTED_XAPIC and
XEN_SYSCTL_PHYSCAP_X86_ASSISTED_X2APIC to report accelerated xAPIC and
x2APIC, on x86 hardware. This is so that xAPIC and x2APIC virtualization
can subsequently be enabled on a per-domain basis.
No such features are currently implemented on AMD hardware.
HW assisted xAPIC virtualization will be reported if HW, at the
minimum, supports virtualize_apic_accesses as this feature alone means
that an access to the APIC page will cause an APIC-access VM exit. An
APIC-access VM exit provides a VMM with information about the access
causing the VM exit, unlike a regular EPT fault, thus simplifying some
internal handling.
HW assisted x2APIC virtualization will be reported if HW supports
virtualize_x2apic_mode and, at least, either apic_reg_virt or
virtual_intr_delivery. This also means that
sysctl follows the conditionals in vmx_vlapic_msr_changed().
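The two reporting conditions described above could be sketched as follows. The struct and function names are hypothetical and only model the four VMX controls named in the text; this is not the code in the sysctl handler.

```c
#include <stdbool.h>

/* Hypothetical container for the VMX controls named in the text. */
struct vmx_caps {
    bool virtualize_apic_accesses;
    bool virtualize_x2apic_mode;
    bool apic_reg_virt;
    bool virtual_intr_delivery;
};

/* xAPIC assistance: virtualize_apic_accesses alone is sufficient. */
static bool assisted_xapic_available(const struct vmx_caps *c)
{
    return c->virtualize_apic_accesses;
}

/* x2APIC assistance: x2APIC mode plus at least one of the other two. */
static bool assisted_x2apic_available(const struct vmx_caps *c)
{
    return c->virtualize_x2apic_mode &&
           (c->apic_reg_virt || c->virtual_intr_delivery);
}
```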
For that purpose, also add an arch-specific "capabilities" parameter
to struct xen_sysctl_physinfo.
Note that this interface is intended to be compatible with AMD so that
AVIC support can be introduced in a future patch. Unlike Intel, which
has multiple controls for APIC Virtualization, AMD has one global
'AVIC Enable' control bit, so fine-graining of APIC virtualization
control cannot be done on a common interface.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jane Malalane <jane.malalane@citrix.com> Reviewed-by: "Roger Pau Monné" <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
As the question comes up in basically every release cycle of Xen, add
to the release management documentation a reference to the discussion
of why the current release scheme was selected.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Henry Wang <Henry.Wang@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Anthony PERARD [Mon, 11 Jul 2022 10:13:24 +0000 (12:13 +0200)]
tools/examples: cleanup Makefile
Don't check whether a target exists before installing it. For directories,
install doesn't complain, and for files the check would prevent them from
being updated. Also remove the existing loop and instead install all files
with a single call to $(INSTALL_DATA).
Remove XEN_CONFIGS-y which isn't used.
Remove "build" target.
Add an empty line after the first comment. The comment isn't about
$(XEN_READMES), it is about the makefile as a whole.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
Anthony PERARD [Mon, 11 Jul 2022 10:13:07 +0000 (12:13 +0200)]
tools/console: have one Makefile per program/directory
Sources of both xenconsoled and xenconsole are already separated into
different directories and don't share anything in common. Having two
different Makefiles means it's easier to deal with *FLAGS.
Some common changes:
Rename $(BIN) to $(TARGETS), this will be useful later.
Stop removing *.so *.rpm *.a as they aren't created here.
Use $(OBJS-y) to list objects.
Update $(CFLAGS) for the directory rather than a single object.
daemon:
Remove the need for $(LDLIBS_xenconsoled), use $(LDLIBS) instead.
Remove the need for $(CONSOLE_CFLAGS-y) and use $(CFLAGS-y)
instead.
client:
Remove the unused $(LDLIBS_xenconsole)
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com>
xen/x86: remove cf_check attribute from hypercall handlers
Now that the hypercall handlers are all being called directly instead
of through a function vector, the "cf_check" attribute can be removed.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> # xsm parts Acked-by: Jan Beulich <jbeulich@suse.com> Tested-by: Téo Couprie Diaz <teo.coupriediaz@arm.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
xen/x86: call hypercall handlers via generated macro
Instead of using a function table use the generated macros for calling
the appropriate hypercall handlers.
This is beneficial to performance and avoids speculation issues.
With the handlers now being called with the correct number of parameters,
it is possible to do the parameter register clobbering in the NDEBUG
case after returning from the handler. With the additional generated
data the hard coded hypercall_args_table[] can be replaced by tables
using the generated number of parameters.
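As a rough illustration of the approach, a generated dispatch macro could expand to a switch with direct calls, each using the handler's own parameter count. All names below (do_op_a, do_op_b, CALL_HANDLER, dispatch) are hypothetical and do not reflect the macros actually produced by the generator script.

```c
#include <errno.h>

/* Hypothetical handlers with differing parameter counts. */
static long do_op_a(unsigned long a1)
{
    return (long)a1 + 1;
}

static long do_op_b(unsigned long a1, unsigned long a2)
{
    return (long)(a1 + a2);
}

/* Stand-in for a generated macro: direct calls, correct arity per case. */
#define CALL_HANDLER(num, a1, a2, ret)               \
    do {                                             \
        switch ( num )                               \
        {                                            \
        case 0:  (ret) = do_op_a(a1);     break;     \
        case 1:  (ret) = do_op_b(a1, a2); break;     \
        default: (ret) = -ENOSYS;         break;     \
        }                                            \
    } while ( 0 )

static long dispatch(unsigned int num, unsigned long a1, unsigned long a2)
{
    long ret;

    CALL_HANDLER(num, a1, a2, ret);

    return ret;
}
```

Compared with an array of function pointers, the switch contains no indirect call, which is the property the commit message refers to when mentioning speculation.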
Note that this change modifies behavior of clobbering registers in a
minor way: in case a hypercall is returning -ENOSYS (or the unsigned
equivalent thereof) for any reason the parameter registers will no
longer be clobbered. This should be of no real concern, as those cases
ought to be extremely rare and reuse of the registers in those cases
seems rather far fetched.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Instead of repeating similar data multiple times use a single source
file and a generator script for producing prototypes and call sequences
of the hypercalls.
As the script already knows the number of parameters used, also generate
a macro for populating an array with the number of parameters per
hypercall.
The priorities for the specific hypercalls are based on two benchmarks
performed in guests (PV and PVH):
- make -j 4 of the Xen hypervisor (resulting in cpu load with lots of
processes created)
- scp of a large file to the guest (network load)
With a small additional debug patch applied the number of the
different hypercalls in the guest and in dom0 (for looking at backend
activity related hypercalls) were counted while the benchmark in domU
was running:
HVM-hypercall       PVH-guest build   PVH-guest scp
vcpu_op                      277684            2324
event_channel_op             350233           57383
(the related dom0 counter values are in the same range as with the test
running in the PV guest)
Today most hypercall handlers have a return type of long, while the
compat ones return an int. There are a few exceptions from that rule,
however.
Get rid of the exceptions by letting compat handlers always return int
and others always return long, with the exception of the Arm specific
physdev_op handler.
For the compat hvm case use eax instead of rax for the stored result as
it should have been from the beginning.
Additionally move some prototypes to include/asm-x86/hypercall.h
as they are x86 specific. Move the compat_platform_op() prototype to
the common header.
Rename paging_domctl_continuation() to do_paging_domctl_cont() and add
a matching define for the associated hypercall.
Make do_callback_op() and compat_callback_op() more similar by adding
the const attribute to compat_callback_op()'s 2nd parameter.
Change the type of the cmd parameter for [do|compat]_kexec_op() to
unsigned int, as this is more appropriate for the compat case.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com> # argo
SUPPORT.md doesn't seem to explicitly say whether static memory is
supported, so this commit updates SUPPORT.md to list the static
allocation feature as tech preview for now.
The variable __mon_lengths is referenced only in time.c.
Change its linkage from external to internal by adding the storage-class
specifier static to its definitions.
Also, this patch resolves indirectly a MISRA C 2012 Rule 8.4 violation warning.
Jan Beulich [Wed, 6 Jul 2022 11:05:23 +0000 (13:05 +0200)]
Revert "EFI: preserve the System Resource Table for dom0"
This reverts commit 8d410ac2c178e1dd1001cadddbe9ca75a9738c95,
for breaking booting (on at least Arm64), apparently due to
incomplete refactoring from an earlier version.
tools/libxl: report trusted backend status to frontends
Allow administrators to notify a frontend driver that its backend
counterpart is not to be trusted, so the frontend can deploy whatever
mitigations are required in order to secure itself.
Allow such option for disk and network frontends only, as those are
the only hardened ones currently supported.
This is part of XSA-403
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Xen uses "-fshort-wchar" in CFLAGS for EFI common code. Arm32
is using stub.c of EFI common code for EFI stub functions. But
the "-fshort-wchar" CFLAG causes a warning when building stub.c
for Arm32:
"arm-linux-gnueabihf-ld: warning: arch/arm/efi/built_in.o uses
2-byte wchar_t yet the output is to use 4-byte wchar_t; use of
wchar_t values across objects may fail"
This is because the "-fshort-wchar" flag causes GCC to generate
code that is not binary compatible with code generated without
that flag. This warning hasn't been triggered on Arm64 because
Arm64 does not use the wchar type directly in any code for
parameters, variables or return values. And in EFI code, wchar
has been replaced by CHAR16 (the UEFI "abstraction" of wchar_t).
CHAR16 is typedef'ed to unsigned short, so the "-fshort-wchar"
flag does not affect CHAR16. Hence Arm64 object files are exactly
the same with "-fshort-wchar" and without "-fshort-wchar".
We are also not using wchar in Arm32 code, but Arm32 will embed
ABI information in the ".ARM.attributes" section. This section stores
some object file attributes, like ABI version, CPU arch, etc.
The wchar size is described in this section too, by "Tag_ABI_PCS_wchar_t".
Tag_ABI_PCS_wchar_t is 2 for object files built with "-fshort-wchar",
but 4 for object files built without "-fshort-wchar". The Arm32 GCC
ld checks this tag, and emits the above warning when it finds
that the object files have different Tag_ABI_PCS_wchar_t values.
Xen needs to keep "-fshort-wchar" in EFI code to force wchar to use
short integers (2 bytes) instead of integers (4 bytes), but this is
unnecessary for code outside of EFI. So in this patch, we add
"-fno-short-wchar" to override "-fshort-wchar" for Arm architectures
without EFI enabled, to remove the above warning.
Reported-and-Suggested-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Wei Chen <wei.chen@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Tue, 5 Jul 2022 11:11:51 +0000 (13:11 +0200)]
public: constify xsd_errors[]
While in principle this could break existing users, I think such users
deserve to be put in trouble. After all the table should have been const
from the very beginning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com>
tools/helpers: fix snprintf argument in init-dom0less.c
Fix snprintf argument in init-dom0less.c because two instances of
the function are using libxl_dominfo struct members that are uint64_t
types, so change "%lu" to "%"PRIu64 to handle it properly when
building on arm32 and arm64.
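A minimal sketch of the portable form follows. The helper name format_mem() is hypothetical and is not taken from init-dom0less.c; it only shows the PRIu64 idiom.

```c
#include <inttypes.h>
#include <stdio.h>

/*
 * Formats a 64-bit value portably. "%lu" would only match uint64_t on
 * targets where unsigned long is 64 bits wide, which is not the case
 * on arm32.
 */
static int format_mem(char *buf, size_t len, uint64_t kb)
{
    return snprintf(buf, len, "%" PRIu64, kb);
}
```

PRIu64 expands to the conversion specifier matching uint64_t on the target, so the same source builds cleanly on both arm32 and arm64.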
The EFI System Resource Table (ESRT) is necessary for fwupd to identify
firmware updates to install. According to the UEFI specification §23.4,
the ESRT shall be stored in memory of type EfiBootServicesData. However,
memory of type EfiBootServicesData is considered general-purpose memory
by Xen, so the ESRT needs to be moved somewhere where Xen will not
overwrite it. Copy the ESRT to memory of type EfiRuntimeServicesData,
which Xen will not reuse. dom0 can use the ESRT if (and only if) it is
in memory of type EfiRuntimeServicesData.
Earlier versions of this patch reserved the memory in which the ESRT was
located. This created awkward alignment problems, and required either
splitting the E820 table or wasting memory. It also would have required
a new platform op for dom0 to use to indicate if the ESRT is reserved.
By copying the ESRT into EfiRuntimeServicesData memory, the E820 table
does not need to be modified, and dom0 can just check the type of the
memory region containing the ESRT. The copy is only done if the ESRT is
not already in EfiRuntimeServicesData memory, avoiding memory leaks on
repeated kexec.
See https://lore.kernel.org/xen-devel/20200818184018.GN1679@mail-itl/T/
for details.
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel P. Smith [Mon, 4 Jul 2022 12:47:00 +0000 (14:47 +0200)]
flask: implement xsm_set_system_active
This commit implements full support for starting the idle domain privileged by
introducing a new flask label xenboot_t which the idle domain is labeled with
at creation. It then provides the implementation for the XSM hook
xsm_set_system_active to relabel the idle domain to the existing xen_t flask
label.
In the reference flask policy a new macro, xen_build_domain(target), is
introduced for creating policies for dom0less/hyperlaunch allowing the
hypervisor to create and assign the necessary resources for domain
construction.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Tested-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com>
Daniel P. Smith [Mon, 4 Jul 2022 12:46:02 +0000 (14:46 +0200)]
xsm: create idle domain privileged and demote after setup
There are new capabilities, dom0less and hyperlaunch, that introduce internal
hypervisor logic, which needs to make resource allocation calls that are
protected by XSM access checks. These resource allocations are
necessary for dom0less and hyperlaunch when they are constructing the initial
domain(s). This creates an issue as a subset of the hypervisor code is
executed under a system domain, the idle domain, that is represented by a
per-CPU non-privileged struct domain. To enable these new capabilities to
function correctly but in a controlled manner, this commit changes the idle
system domain to be created as a privileged domain under the default policy and
demoted before transitioning to running. A new XSM hook,
xsm_set_system_active(), is introduced to allow each XSM policy type to demote
the idle domain appropriately for that policy type. In the case of SILO, it
inherits the default policy's hook for xsm_set_system_active().
For flask, a stub is added to ensure that flask policy system will function
correctly with this patch until flask is extended with support for starting the
idle domain privileged and properly demoting it on the call to
xsm_set_system_active().
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Acked-by: Julien Grall <jgrall@amazon.com> # arm Reviewed-by: Rahul Singh <rahul.singh@arm.com> Tested-by: Rahul Singh <rahul.singh@arm.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Add instructions on how to build cppcheck, the version currently used
and an example of using the cppcheck integration to run the analysis on
the Xen codebase.
The cppcheck MISRA addon can be used to check for non-compliance with
some of the MISRA standard rules.
Add a CPPCHECK_MISRA variable that can be set to "y" using make command
line to generate a cppcheck report including cppcheck misra checks.
When MISRA checking is enabled, a file with a text description suitable
for cppcheck misra addon is generated out of Xen documentation file
which lists the rules followed by Xen (docs/misra/rules.rst).
By default MISRA checking is turned off.
While adding cppcheck-misra files to gitignore, also fix the missing /
for the htmlreport gitignore entry.
Signed-off-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Michal Orzel <michal.orzel@arm.com> Tested-by: Michal Orzel <michal.orzel@arm.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Dmytro Semenets [Thu, 23 Jun 2022 07:44:28 +0000 (10:44 +0300)]
xen: arm: Don't use stop_cpu() in halt_this_cpu()
When shutting down (or rebooting) the platform, Xen will call stop_cpu()
on all the CPUs but one. The last CPU will then request the system to
shutdown/restart.
On platforms using PSCI, stop_cpu() will call PSCI CPU off. Per the spec
(section 5.5.2 DEN0022D.b), the call could return DENIED if the Trusted
OS is resident on the CPU that is about to be turned off.
As Xen doesn't migrate off the trusted OS (which BTW may not be
migratable), it would be possible to hit the panic().
In the ideal situation, Xen should migrate the trusted OS or make sure
the CPU off is not called. However, when shutting down (or rebooting)
the platform, it is pointless to try to turn off all the CPUs (per
section 5.10.2, it is only required to put the core in a known state).
So solve the problem by open-coding stop_cpu() in halt_this_cpu() and
not calling PSCI CPU off.
Julien Grall [Thu, 30 Jun 2022 18:37:34 +0000 (19:37 +0100)]
public/io: xs_wire: Allow Xenstore to report EPERM
C Xenstored is using EPERM when the client is not allowed to change
the owner (see GET_PERMS). However, the xenstore protocol doesn't
describe EPERM so EINVAL will be sent to the client.
When writing tests, it would be useful to differentiate between EINVAL
(e.g. parsing error) and EPERM (i.e. no permission). So extend
xsd_errors[] to support returning EPERM.
Looking at the previous time xsd_errors was extended (8b2c441a1b), it was
considered to be safe to add a new error because at least Linux driver
and libxenstore treat an unknown error code as EINVAL.
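The fallback behaviour relied upon here can be sketched as below. The table is a hypothetical, shortened one in the spirit of xsd_errors[], and decode_error() is an illustrative name, not the actual Linux or libxenstore code.

```c
#include <errno.h>
#include <string.h>

struct err_entry {
    int errnum;
    const char *errstring;
};

/* Hypothetical, shortened table in the spirit of xsd_errors[]. */
static const struct err_entry known_errors[] = {
    { EINVAL, "EINVAL" },
    { EACCES, "EACCES" },
    { EPERM,  "EPERM"  },
};

/* Maps an error string from the wire to an errno value. */
static int decode_error(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(known_errors) / sizeof(known_errors[0]); i++)
        if (!strcmp(name, known_errors[i].errstring))
            return known_errors[i].errnum;

    /* Unknown strings degrade to EINVAL, so new entries stay compatible. */
    return EINVAL;
}
```

A client built against an older table would take the fallback path for "EPERM" and report EINVAL, which matches what it was sent before the extension.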
This statement doesn't cover other possible OSes, however I am not
aware of any breakage.
Roger Pau Monne [Thu, 30 Jun 2022 16:34:49 +0000 (18:34 +0200)]
x86/ept: fix shattering of special pages
The current logic in epte_get_entry_emt() will split any page marked
as special with order greater than zero, without checking whether the
super page is all special.
Fix this by splitting the page only if it's not all marked as
special, in order to prevent unneeded super page shattering.
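The intended decision can be modelled with a small sketch. This is a simplified, hypothetical model in which a boolean array stands for the per-4k "special" state of a super page; it is not the logic of epte_get_entry_emt() itself.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model: special[i] says whether 4k page i of the super
 * page is marked special. Returns true only for mixed ranges.
 */
static bool needs_split(const bool *special, size_t nr_pages)
{
    size_t i, nr_special = 0;

    for ( i = 0; i < nr_pages; i++ )
        nr_special += special[i];

    /* Uniform ranges (all special or none special) can stay intact. */
    return nr_special != 0 && nr_special != nr_pages;
}
```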
The unconditional special super page shattering has caused a
performance regression on some XenServer GPU pass through workloads.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Wed, 16 Mar 2022 13:07:40 +0000 (13:07 +0000)]
x86/spec-ctrl: Knobs for STIBP and PSFD, and follow hardware STIBP hint
STIBP and PSFD are slightly weird bits, because they're both implied by other
bits in MSR_SPEC_CTRL. Add fine grain controls for them, and take the
implications into account when setting IBRS/SSBD.
Rearrange the IBPB text/variables/logic to keep all the MSR_SPEC_CTRL bits
together, for consistency.
However, AMD have a hardware hint CPUID bit recommending that STIBP be set
unilaterally. This is advertised on Zen3, so follow the recommendation.
Furthermore, in such cases, set STIBP behind the guest's back for now. This
has negligible overhead for the guest, but saves a WRMSR on vmentry. This is
the only default change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 27 Jun 2022 10:54:27 +0000 (11:54 +0100)]
x86/spec-ctrl: Only adjust MSR_SPEC_CTRL for idle with legacy IBRS
Back at the time of the original Spectre-v2 fixes, it was recommended to clear
MSR_SPEC_CTRL when going idle. This is because of the side effects on the
sibling thread caused by the microcode IBRS and STIBP implementations which
were retrofitted to existing CPUs.
However, there are no relevant cross-thread impacts for the hardware
IBRS/STIBP implementations, so this logic should not be used on Intel CPUs
supporting eIBRS, or any AMD CPUs; doing so only adds unnecessary latency to
the idle path.
Furthermore, there's no point playing with MSR_SPEC_CTRL in the idle paths if
SMT is disabled for other reasons.
Fixes: 8d03080d2a33 ("x86/spec-ctrl: Cease using thunk=lfence on AMD") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Xenia Ragiadakou [Tue, 28 Jun 2022 15:08:51 +0000 (18:08 +0300)]
xen/arm: smmu-v3: Fix MISRA C 2012 Rule 1.3 violations
The expression 1 << 31 produces undefined behaviour because the type of integer
constant 1 is (signed) int and the result of shifting 1 by 31 bits is not
representable in the (signed) int type.
Change the type of 1 to unsigned int by adding the U suffix.
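A minimal sketch of the corrected form, assuming a 32-bit unsigned int. BIT31 and top_bit() are hypothetical names used only for illustration and do not appear in the SMMUv3 driver.

```c
#include <stdint.h>

/*
 * The constant 1 has type int; with a 32-bit int, 1 << 31 is not
 * representable and the shift is undefined behaviour. 1U has type
 * unsigned int, for which the shift is well defined.
 */
#define BIT31 (1U << 31)

static uint32_t top_bit(void)
{
    return BIT31;
}
```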