Leigh Brown [Tue, 23 Apr 2024 12:10:03 +0000 (14:10 +0200)]
tools/misc: xenwatchdogd enhancements
Add usage() function, the ability to run in the foreground, and
the ability to disarm the watchdog timer when exiting.
Add enhanced parameter parsing and validation, making use of
getopt_long(). Check that the number of parameters is correct, that the
timeout is at least two seconds (to allow a minimum sleep time of
one second), and that the sleep time is at least one second and less
than the watchdog timeout.
With these changes, the daemon will no longer instantly reboot the
domain if a zero timeout (or non-numeric parameter) is given, and will
no longer consume 100% of a CPU due to a zero sleep time.
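A minimal sketch of the validation rules described above (the function name and types are illustrative, not the actual xenwatchdogd code):

```c
/* Hypothetical sketch of the described checks: the timeout must be at
 * least 2 seconds (leaving a minimum sleep of 1 second), and the sleep
 * time must be at least 1 and strictly less than the timeout.
 * Returns 1 if the combination is valid, 0 otherwise. */
int watchdog_args_valid(int timeout, int sleep_secs)
{
    if (timeout < 2)
        return 0;
    if (sleep_secs < 1 || sleep_secs >= timeout)
        return 0;
    return 1;
}
```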
Signed-off-by: Leigh Brown <leigh@solinno.co.uk> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Leigh Brown [Tue, 23 Apr 2024 12:09:50 +0000 (14:09 +0200)]
tools/misc: xenwatchdogd: add parse_secs()
Create a new parse_secs() function to parse the timeout and sleep
parameters. This ensures that non-numeric parameters are not
accidentally treated as numbers.
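A hedged sketch of what a parse_secs()-style helper might look like (the exact signature and error convention in the patch may differ): strtoul() with an end-pointer check rejects non-numeric input instead of silently treating it as zero.

```c
#include <limits.h>
#include <stdlib.h>

/* Illustrative only: parse a seconds value, rejecting empty strings,
 * trailing junk and out-of-range values.  Returns -1 on error. */
long parse_secs(const char *arg)
{
    char *end;
    unsigned long val = strtoul(arg, &end, 10);

    if (end == arg || *end != '\0' || val > INT_MAX)
        return -1;    /* empty, non-numeric, or too large */

    return (long)val;
}
```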
Signed-off-by: Leigh Brown <leigh@solinno.co.uk> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Ross Lagerwall [Tue, 23 Apr 2024 12:09:18 +0000 (14:09 +0200)]
x86/rtc: Avoid UIP flag being set for longer than expected
In a test, OVMF reported an error initializing the RTC without
indicating the precise nature of the error. The only plausible
explanation I can find is as follows:
As part of the initialization, OVMF reads register C and then reads
register A repeatedly until the UIP flag is not set. If this takes longer
than 100 ms, OVMF fails and reports an error. This may happen with the
following sequence of events:
At guest time=0s, rtc_init() calls check_update_timer() which schedules
update_timer for t=(1 - 244us).
At t=1s, the update_timer function happens to have been called >= 244us
late. In the timer callback, it sets the UIP flag and schedules
update_timer2 for t=1s.
Before update_timer2 runs, the guest reads register C which calls
check_update_timer(). check_update_timer() stops the scheduled
update_timer2 and since the guest time is now outside of the update
cycle, it schedules update_timer for t=(2 - 244us).
The UIP flag will therefore be set for a whole second from t=1 to t=2
while the guest repeatedly reads register A waiting for the UIP flag to
clear. Fix it by clearing the UIP flag when scheduling update_timer.
I was able to reproduce this issue with a synthetic test and this
resolves the issue.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
PVH guests skip real mode VGA detection and never have a VGA device
available, hence the default VGA selection is not applicable, and at
worst can cause confusion when parsing the Xen boot log.
Zero the boot_vid_info structure when Xen is booted from the PVH entry point.
This fixes Xen incorrectly reporting:
(XEN) Video information:
(XEN) VGA is text mode 80x25, font 8x16
When booted as a PVH guest.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/video: add boot_video_info offset generation to asm-offsets
Currently the offsets into the boot_video_info struct are manually encoded in
video.S, which is fragile. Generate them in asm-offsets.c and switch the
current code to use those instead.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
automation/eclair_analysis: substitute deprecated service STD.emptrecd
The ECLAIR service STD.emptrecd (which checks for empty structures) is
being deprecated; hence, as a preventive measure, STD.anonstct (which
checks for structures with no named members, undefined behavior in C99)
is used instead. Since the latter is a more general case of the former,
this change does not affect the analysis. The new service is already
supported by the current version of ECLAIR.
MISRA C Rule 16.2 states:
"A switch label shall only be used when the most closely-enclosing
compound statement is the body of a switch statement".
The PROGRESS_VCPU local helper specifies a case that is directly
inside the compound statement of a for loop, hence violating the rule.
To avoid this, the construct is deviated with a text-based deviation.
No functional change.
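The shape of a Rule 16.2 violation and its compliant counterpart can be illustrated as follows (generic example, not the actual PROGRESS_VCPU code):

```c
/* A Rule 16.2 violation puts a case label inside a nested compound
 * statement, Duff's-device style:
 *
 *     switch (n & 3) {
 *         do {
 *     case 0: ...;
 *     case 3: ...;
 *         } while (--n > 0);
 *     }
 *
 * The compliant form below keeps every label directly in the switch
 * body; adjacent labels with no code between them remain allowed. */
int misra_16_2_compliant(int kind)
{
    int cost;

    switch (kind)
    {
    case 0:
        cost = 1;
        break;
    case 1:
    case 2:
        cost = 5;
        break;
    default:
        cost = -1;
        break;
    }

    return cost;
}
```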
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jason Andryuk [Mon, 22 Apr 2024 13:11:02 +0000 (15:11 +0200)]
x86/PVH: Use unsigned int for dom0 e820 index
Switch to unsigned int for the dom0 e820 index. This eliminates the
potential for array underflows, and the compiler might be able to
generate better code.
Requested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Acked-by: Jan Beulich <jbeulich@suse.com>
eclair_analysis: deviate x86 emulator for Rule 16.2
MISRA C Rule 16.2 states:
"A switch label shall only be used when the most closely-enclosing
compound statement is the body of a switch statement".
Since complying with this rule of the x86 emulator would lead to
a lot of code duplication, it is deemed better to exempt those
files for this guideline.
The header is taken from Linux 6.4.0-rc1 and is based on
arch/riscv/include/asm/mmio.h with the following changes:
- drop forcing of endianness for the read*() and write*() functions: no
matter what the CPU endianness is, the endianness a particular device
(and hence its MMIO region(s)) uses is entirely independent, so any
necessary conversion needs to occur a layer up.
Another reason to drop the endianness conversion here:
https://patchwork.kernel.org/project/linux-riscv/patch/20190411115623.5749-3-hch@lst.de/
One of the replies from the author of that commit:
And we don't know if Linux will be around if that ever changes.
The point is:
a) the current RISC-V spec is LE only
b) the current linux port is LE only except for this little bit
There is no point in leaving just this bitrotting code around. It
just confuses developers, (very very slightly) slows down compiles
and will bitrot. It also won't be any significant help to a future
developer down the road doing a hypothetical BE RISC-V Linux port.
- drop unused argument of __io_ar() macros.
- drop "#define _raw_{read,write}{b,w,l,d,q} _raw_{read,write}{b,w,l,d,q}"
as they are unnecessary.
- Adopt the Xen code style for this header, considering that significant changes
are not anticipated in the future.
In the event of any issues, adapting them to Xen style should be easily
manageable.
- drop unnecessary __r variables in macros read*_cpu()
- update the inline assembler constraints for the addr argument of
__raw_read{b,w,l,q} and __raw_write{b,w,l,q} to tell the compiler that
*addr will be accessed.
- add stubs for __raw_readq() and __raw_writeq() for RISCV_32
Additionally, definitions of ioremap_*() were added to the header.
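An architecture-neutral sketch (not the actual RISC-V inline assembly) of what __raw_readl()/__raw_writel() express: a byte-order-preserving MMIO access through a volatile pointer, with no endianness conversion, matching the reasoning above that any conversion belongs a layer up.

```c
#include <stdint.h>

/* Illustrative plain accessors: volatile forces an actual access of the
 * stated width, and the raw bytes are returned/stored unconverted. */
static inline uint32_t raw_readl(const volatile void *addr)
{
    return *(const volatile uint32_t *)addr;
}

static inline void raw_writel(uint32_t val, volatile void *addr)
{
    *(volatile uint32_t *)addr = val;
}
```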
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/MCE: move intel mcheck init code to separate file
Separate Intel nonfatal MCE initialization code from generic MCE code, the same
way it is done for AMD code. This is to be able to later make intel/amd MCE
code optional in the build.
Convert to Xen coding style. Clean up unused includes. Remove seemingly
outdated comment about MCE check period.
Daniel P. Smith [Wed, 17 Apr 2024 14:37:13 +0000 (10:37 -0400)]
xen/gzip: Remove custom memory allocator
All the other decompression routines use xmalloc_bytes(), thus there is no
reason for gzip to be handling its own allocation of memory. In fact, there is
a bug somewhere in the allocator as decompression started to break when adding
additional allocations. Instead of troubleshooting the allocator, replace it
with xmalloc_bytes().
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 16 Apr 2024 15:21:34 +0000 (16:21 +0100)]
xen/efi: Rewrite DOS/PE magic checking without memcmp()
Misra Rule 21.16 doesn't like the use of memcmp() against character arrays (a
string literal in this case). This is a rare piece of logic where we're
looking for a magic marker that just happens to make sense when expressed as
ASCII. Rewrite using plain compares.
No functional change.
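A hedged sketch of the idea (function name mine, not the patch's): test the DOS "MZ" magic with plain byte compares rather than memcmp() against a string literal, which is what Misra Rule 21.16 objects to.

```c
#include <stdint.h>

/* Illustrative only: the magic marker just happens to read as ASCII
 * "MZ", so compare the two bytes directly. */
int is_dos_mz_magic(const uint8_t *p)
{
    return p[0] == 'M' && p[1] == 'Z';
}
```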
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
altcall: fix __alt_call_maybe_initdata so it's safe for livepatch
Setting alternative call variables as __init is not safe for use with
livepatch, as livepatches can rightfully introduce new alternative calls to
structures marked as __alt_call_maybe_initdata (possibly just indirectly due to
replacing existing functions that use those). Attempting to resolve those
alternative calls then results in page faults as the variable that holds the
function pointer address has been freed.
When livepatch is supported use the __ro_after_init attribute instead of
__initdata for __alt_call_maybe_initdata.
Fixes: f26bb285949b ('xen: Implement xen/alternative-call.h for use in common code') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jason Andryuk [Sun, 7 Apr 2024 20:58:09 +0000 (16:58 -0400)]
libxl: devd: Spawn QEMU for 9pfs
Add support for xl devd to service 9pfs in a domU. devd needs to spawn
a pvqemu for the domain to service 9pfs as well as qdisk backends.
Rename num_qdisks to pvqemu_refcnt to be more generic.
Keep the qdisk-backend-pid xenstore key as well as the disk-%u log file.
They are externally visible, so they might be used by other tooling.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jason Andryuk [Sun, 7 Apr 2024 14:32:08 +0000 (10:32 -0400)]
libxl: Use vkb=[] for HVMs
xl/libxl only applies vkb=[] to PV & PVH guests. HVM gets only a single
vkb by default, but that can be disabled by the vkb_device boolean.
Notably the HVM vkb cannot be configured, so feature-abs-pointer or the
backend-type cannot be specified.
Re-arrange the logic so that vkb=[] is handled regardless of domain
type. If vkb is empty or unspecified, follow the vkb_device boolean for
HVMs. Nothing changes for PVH & PV. HVMs can now get a configured vkb
instead of just the default one.
The chance for regression is an HVM config with
vkb=["$something"]
vkb_device=false
which would now get a vkb.
This is useful for vGlass which provides a VKB to HVMs. vGlass wants to
specify feature-abs-pointer, but that is racily written by vGlass
instead of coming through the xl.cfg. Unhelpfully, Linux xen-kbdfront
reads the backend nodes without checking that the backend is in
InitWait.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
xen/include: move definition of ASM_INT() to xen/linkage.h
ASM_INT() is defined in arch/[arm|x86]/include/asm/asm_defns.h in
exactly the same way. Instead of replicating this definition for riscv
and ppc, move it to include/xen/linkage.h, where other arch agnostic
definitions for assembler code are living already.
Adapt the generation of assembler sources via tools/binfile to include
the new home of ASM_INT().
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
Ross Lagerwall [Tue, 9 Apr 2024 10:32:07 +0000 (11:32 +0100)]
MAINTAINERS: Update livepatch maintainers
Remove Konrad from the livepatch maintainers list as he hasn't been
active for a few years.
At the same time, add Roger as a new maintainer since he has been
actively working on it for a while.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
xen/acpi: Allow xen/acpi.h to be included on non-ACPI archs
Conditionalize xen/acpi.h's inclusion of acpi/acpi.h and asm/acpi.h on
CONFIG_ACPI and import ARM's !CONFIG_ACPI stub for acpi_disabled() so
that the header can be included on architectures without ACPI support,
like ppc.
This change revealed some missing #includes across the ARM tree, so fix
those as well.
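A hedged sketch of the pattern described above (the names come from the commit message; the exact stub shape in Xen may differ): gate the ACPI includes on CONFIG_ACPI and provide a stub otherwise, so the header can be included on !ACPI architectures.

```c
/* Illustrative only: when CONFIG_ACPI is not defined, fall back to a
 * stand-in for ARM's !CONFIG_ACPI acpi_disabled stub. */
#ifdef CONFIG_ACPI
#include <acpi/acpi.h>
#include <asm/acpi.h>
#else
#define acpi_disabled 1
#endif

int acpi_is_disabled(void)
{
    return acpi_disabled;
}
```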
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
[Fold Randconfig fix] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Refactor the switch so that a violation of
MISRA C Rule 16.2 is resolved (A switch label shall only be used
when the most closely-enclosing compound statement is the body of
a switch statement).
Note that the switch clause ending with the pseudo
keyword "fallthrough" is an allowed exception to Rule 16.3.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Refactor the switch so that a violation of
MISRA C Rule 16.2 is resolved (a switch label should be immediately
enclosed in the compound statement of the switch).
The switch clause ending with the pseudo
keyword "fallthrough" is an allowed exception to Rule 16.3.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
xen/domctl: address violations of MISRA C Rule 16.2
Refactor the first clauses so that a violation of
MISRA C Rule 16.2 is resolved (a switch label should be immediately
enclosed in the compound statement of the switch).
Note that the switch clause ending with the pseudo
keyword "fallthrough" is an allowed exception to Rule 16.3.
Convert fallthrough comments in other clauses to the pseudo-keyword
while at it.
No functional change.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/efi: tidy switch statement and address MISRA violation
Refactor the first clauses so that a violation of
MISRA C Rule 16.2 is resolved (a switch label, "default" in this
case, should be immediately enclosed in the compound statement
of the switch). Note that the switch clause ending with the pseudo
keyword "fallthrough" is an allowed exception to Rule 16.3.
Convert fallthrough comments in other clauses to the pseudo-keyword
while at it.
No functional change.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/irq: tidy switch statement and address MISRA violation
Refactor the clauses so that a MISRA C Rule 16.2 violation is resolved
(A switch label shall only be used when the most closely-enclosing
compound statement is the body of a switch statement).
Note that the switch clause ending with the pseudo keyword "fallthrough"
is an allowed exception to Rule 16.3.
No functional change.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Refactor the switch so that a violation of MISRA C Rule 16.2 is resolved
(A switch label shall only be used when the most closely-enclosing
compound statement is the body of a switch statement).
Note that the switch clause ending with the pseudo
keyword "fallthrough" is an allowed exception to Rule 16.3.
No functional change.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/vlapic: tidy switch statement and address MISRA violation
Refactor the last clauses so that a violation of
MISRA C Rule 16.2 is resolved (A switch label shall only be used
when the most closely-enclosing compound statement is the body of
a switch statement). The switch clause ending with the
pseudo keyword "fallthrough" is an allowed exception to Rule 16.3.
Andrew Cooper [Wed, 10 Apr 2024 19:32:24 +0000 (20:32 +0100)]
xen/spinlock: Adjust LOCK_DEBUG_INITVAL to placate MISRA
Resolves 160 MISRA R7.2 violations.
Fixes: c286bb93d20c ("xen/spinlock: support higher number of cpus") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Nicola Vetrini <nicola.vetrini@bugseng.com>
Fixes: 7ef0084418e1 ("x86/spinlock: introduce support for blocking speculation into critical regions") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Wed, 10 Apr 2024 10:26:24 +0000 (11:26 +0100)]
x86/hvm: Fix Misra Rule 19.1 regression
Despite noticing an impending Rule 19.1 violation, the adjustment made (the
uint32_t cast) wasn't sufficient to avoid it. Try again.
Subsequently noticed by Coverity too.
Fixes: 6a98383b0877 ("x86/HVM: clear upper halves of GPRs upon entry from 32-bit code")
Coverity-IDs: 1596289 thru 1596298 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Fri, 15 Mar 2024 17:18:42 +0000 (17:18 +0000)]
xen/virtual-region: Link the list build time
Given 3 statically initialised objects, it's easy to link the list at
build time. There's no need to do it at runtime during boot (and with
IRQs off, even).
As a consequence, register_virtual_region() can now move inside ifdef
CONFIG_LIVEPATCH like unregister_virtual_region().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
Andrew Cooper [Fri, 15 Mar 2024 18:43:53 +0000 (18:43 +0000)]
xen/virtual-region: Rework how bugframe linkage works
The start/stop1/etc linkage scheme predates struct virtual_region, and as
setup_virtual_regions() shows, it's awkward to express in the new scheme.
Change the linker to provide explicit start/stop symbols for each bugframe
type, and change virtual_region to have a stop pointer rather than a count.
This marginally simplifies both do_bug_frame()s and prepare_payload(), but it
massively simplifies setup_virtual_regions() by allowing the compiler to
initialise the .frame[] array at build time.
virtual_region.c is the only user of the linker symbols, and this is unlikely
to change given the purpose of struct virtual_region, so move their externs
out of bug.h
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
Andrew Cooper [Wed, 25 Oct 2023 13:18:15 +0000 (14:18 +0100)]
x86/Kconfig: Introduce CONFIG_{AMD,INTEL} and conditionalise ucode
We eventually want to be able to build a stripped down Xen for a single
platform. Make a start with CONFIG_{AMD,INTEL} (hidden behind EXPERT, but
available to randconfig), and adjust the microcode logic.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Tue, 24 Oct 2023 18:32:31 +0000 (19:32 +0100)]
x86/ucode: Move vendor specifics back out of early_microcode_init()
I know it was me who dropped microcode_init_{intel,amd}() in c/s dd5f07997f29 ("x86/ucode: Rationalise startup and family/model checks"), but
times have moved on. We've gained new conditional support, and a wish to
compile-time specialise Xen to single platform.
(Re)introduce ucode_probe_{amd,intel}() and move the recent vendor specific
additions back out. Encode the conditional support state in the NULL-ness of
hooks as it's already done on other paths.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
John Ernberg [Mon, 8 Apr 2024 16:11:35 +0000 (16:11 +0000)]
xen/drivers: imx-lpuart: Replace iMX8QM compatible with iMX8QXP
Allow the UART to also probe on the iMX8QXP. The IP block is the same
as in the QM.
Since the fsl,imx8qm-lpuart compatible in Linux exists in name only and
is not used in the driver, any iMX8QM device tree that can boot Linux
must set the fsl,imx8qxp-lpuart compatible as well as the QM one.
Thus we replace the compatible rather than adding just another one.
Signed-off-by: John Ernberg <john.ernberg@actia.se> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Peng Fan <peng.fan@nxp.com>
John Ernberg [Mon, 8 Apr 2024 16:11:35 +0000 (16:11 +0000)]
xen/arm: Add imx8q{m,x} platform glue
When using Linux for dom0 there are a bunch of drivers that need to do SMC
SIP calls into the firmware to enable certain hardware bits like the
watchdog.
Provide a basic platform glue that implements the needed SMC forwarding.
The format of these calls is as follows:
- reg 0: function ID
- reg 1: subfunction ID (when there's a subfunction)
remaining regs: args
For now we only allow Dom0 to make these calls as they are all managing
hardware. There is no specification for these SIP calls; the IDs and
names have been extracted from the upstream Linux kernel and the
vendor kernel.
We can reject CPUFREQ because Dom0 cannot make an informed decision
regarding CPU frequency scaling, WAKEUP_SRC is to wake up from suspend,
which Xen doesn't support at this time.
This leaves the TIME SIP, OTP SIPs which for now are allowed to Dom0.
NOTE: This code is based on code found in NXP Xen tree located here:
https://github.com/nxp-imx/imx-xen/blob/lf-5.10.y_4.13/xen/arch/arm/platforms/imx8qm.c
Signed-off-by: Peng Fan <peng.fan@nxp.com>
[jernberg: Add SIP call filtering] Signed-off-by: John Ernberg <john.ernberg@actia.se> Reviewed-by: Peng Fan <peng.fan@nxp.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
[stefano: commit message improvement] Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
Andrew Cooper [Tue, 9 Apr 2024 20:39:51 +0000 (21:39 +0100)]
x86/entry: Fix build with older toolchains
Binutils older than 2.29 doesn't know INCSSPD.
Fixes: 8e186f98ce0e ("x86: Use indirect calls in reset-stack infrastructure") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Thu, 8 Jun 2023 18:41:44 +0000 (19:41 +0100)]
x86/spec-ctrl: Software BHB-clearing sequences
Implement clear_bhb_{tsx,loops}() as per the BHI guidance. The loops variant
is set up as the "short" sequence.
Introduce SCF_entry_bhb and extend SPEC_CTRL_ENTRY_* with a conditional call
to selected clearing routine.
Note that due to a limitation in the ALTERNATIVE capability, the TEST/JZ can't
be included alongside a CALL in a single alternative block. This is going to
require further work to untangle.
The BHB sequences (if used) must be after the restoration of Xen's
MSR_SPEC_CTRL value, which must be accounted for when judging whether it is
safe to skip the safety LFENCEs.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Tue, 26 Mar 2024 19:01:37 +0000 (19:01 +0000)]
x86/spec-ctrl: Support BHI_DIS_S in order to mitigate BHI
Introduce a "bhi-dis-s" boolean to match the other options we have for
MSR_SPEC_CTRL values. Also introduce bhi_calculations().
Use BHI_DIS_S whenever possible.
Guests which are levelled to be migration compatible with older CPUs can't see
BHI_DIS_S, and Xen must fill in the difference to make the guest safe. Use
the virt MSR_SPEC_CTRL infrastructure to force BHI_DIS_S behind the guest's
back.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Sat, 6 Apr 2024 19:36:54 +0000 (20:36 +0100)]
x86/tsx: Expose RTM_ALWAYS_ABORT to guests
A TSX Abort is one option to mitigate Native-BHI, but a guest kernel
doesn't get to see this if Xen has turned RTM off using
MSR_TSX_{CTRL,FORCE_ABORT}.
Therefore, the meaning of RTM_ALWAYS_ABORT has been adjusted to "XBEGIN won't
fault", and it should be exposed to guests so they can make a better decision.
Expose it in the max policy for any RTM-capable system. Offer it by default
only if RTM has been disabled.
Update test-tsx to account for this new meaning. While adjusting the
logic in test_guest_policies(), take the opportunity to use feature
names (now that they're available) to make the logic easier to follow.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 22 Dec 2023 18:01:37 +0000 (18:01 +0000)]
x86: Drop INDIRECT_JMP
Indirect JMPs which are not tailcalls can lead to an unwelcome form of
speculative type confusion, and we've removed the uses of INDIRECT_JMP to
compensate. Remove the temptation to reintroduce new instances.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 22 Dec 2023 17:44:48 +0000 (17:44 +0000)]
x86: Use indirect calls in reset-stack infrastructure
Mixing up JMP and CALL indirect targets leads to a very fun form of
speculative type confusion. A target which is expecting to be CALLed
needs a return address on the stack, and an indirect JMP doesn't place
one there.
An indirect JMP which predicts to a target intending to be CALLed can
end up with a RET speculatively executing with a value from the JMPer's
stack frame.
There are several ways to get indirect JMPs in Xen.
* From tailcall optimisations. These are safe because the compiler has
arranged the stack to point at the callee's return address.
* From jump tables. These are unsafe, but Xen is built with -fno-jump-tables
to work around several compiler issues.
* From reset_stack_and_jump_ind(), which is particularly unsafe. Because of
the additional stack adjustment made, the value picked up off the stack is
regs->r15 of the next vCPU to run.
In order to mitigate this type confusion, we want to make all indirect targets
be CALL targets, and remove the use of indirect JMP except via tailcall
optimisation.
Luckily due to XSA-348, all C target functions of reset_stack_and_jump_ind()
are noreturn. {svm,vmx}_do_resume() exits via reset_stack_and_jump(); a
direct JMP with entirely different prediction properties. idle_loop() is an
infinite loop which eventually exits via reset_stack_and_jump_ind() from a new
schedule. i.e. These paths are all fine having one extra return address on
the stack.
This leaves continue_pv_domain(), which is expecting to be a JMP target.
Alter it to strip the return address off the stack, which is safe because
there isn't actually a RET expecting to return to its caller.
This allows us to change reset_stack_and_jump_ind() to
reset_stack_and_call_ind() in order to mitigate the speculative type
confusion.
This is part of XSA-456 / CVE-2024-2201.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Thu, 15 Feb 2024 16:46:53 +0000 (17:46 +0100)]
x86/vmx: Add support for virtualize SPEC_CTRL
The feature is defined in the tertiary exec control, and is available starting
from Sapphire Rapids and Alder Lake CPUs.
When enabled, two extra VMCS fields are used: SPEC_CTRL mask and shadow. Bits
set in mask are not allowed to be toggled by the guest (either set or clear)
and the value in the shadow field is the value the guest expects to be in the
SPEC_CTRL register.
By using it, the hypervisor can force the value of SPEC_CTRL bits
behind the guest's back without having to trap all accesses to
SPEC_CTRL; note that no bits are forced into the guest as part of this
patch. It also allows getting rid of
SPEC_CTRL in the guest MSR load list, since the value in the shadow field will
be loaded by the hardware on vmentry.
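A hedged model (my own formulation, not the VMX specification text) of the mask/shadow semantics described above: bits set in the mask are owned by the hypervisor, everything else remains under guest control.

```c
#include <stdint.h>

/* Illustrative only: compute the effective SPEC_CTRL value, taking
 * hypervisor-owned bits from the host-forced value and the remaining
 * bits from the guest's write. */
uint64_t effective_spec_ctrl(uint64_t host_forced, uint64_t mask,
                             uint64_t guest_val)
{
    return (guest_val & ~mask) | (host_forced & mask);
}
```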
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 22 Mar 2024 14:33:17 +0000 (14:33 +0000)]
x86/spec-ctrl: Simplify DO_COND_IBPB
With the prior refactoring, SPEC_CTRL_ENTRY_{PV,INTR} both load SCF into %ebx,
and handle the conditional safety including skipping if interrupting Xen.
Therefore, we can drop the maybexen parameter and the conditional safety.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Fri, 22 Mar 2024 11:41:41 +0000 (11:41 +0000)]
x86/spec-ctrl: Rework conditional safety for SPEC_CTRL_ENTRY_*
Right now, we have a mix of safety strategies in different blocks, making the
logic fragile and hard to follow.
Start addressing this by having a safety LFENCE at the end of the blocks,
which can be patched out if other safety criteria are met. This will allow us
to simplify the sub-blocks. For SPEC_CTRL_ENTRY_FROM_IST, simply leave an
LFENCE unconditionally at the end; the IST path is not a fast-path by any
stretch of the imagination.
For SPEC_CTRL_ENTRY_FROM_INTR, the existing description was incorrect. The
IRET #GP path is non-fatal but can occur with the guest's choice of
MSR_SPEC_CTRL. It is safe to skip the flush/barrier-like protections when
interrupting Xen, but we must run DO_SPEC_CTRL_ENTRY irrespective.
This will skip RSB stuffing which was previously unconditional even when
interrupting Xen.
AFAICT, this is a missing cleanup from commit 3fffaf9c13e9 ("x86/entry: Avoid
using alternatives in NMI/#MC paths") where we split the IST entry path out of
the main INTR entry path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Tue, 26 Mar 2024 22:47:25 +0000 (22:47 +0000)]
x86/spec-ctrl: Fix BTC/SRSO mitigations
We were looking for SCF_entry_ibpb in the wrong variable in the top-of-stack
block, and xen_spec_ctrl won't have had bit 5 set because Xen doesn't
understand SPEC_CTRL_RRSBA_DIS_U yet.
This is XSA-455 / CVE-2024-31142.
Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 9 Apr 2024 14:03:05 +0000 (15:03 +0100)]
x86/cpuid: Don't expose {IPRED,RRSBA,BHI}_CTRL to PV guests
All of these are prediction-mode (i.e. CPL) based. They don't operate as
advertised in PV context.
Fixes: 4dd676070684 ("x86/spec-ctrl: Expose IPRED_CTRL to guests") Fixes: 478e4787fa64 ("x86/spec-ctrl: Expose RRSBA_CTRL to guests") Fixes: 583f1d095052 ("x86/spec-ctrl: Expose BHI_CTRL to guests") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
x86/alternatives: fix .init section reference in _apply_alternatives()
The code in _apply_alternatives() will unconditionally attempt to read
__initdata_cf_clobber_{start,end} when called as part of applying alternatives
to a livepatch payload when Xen is using IBT.
That leads to a page-fault as __initdata_cf_clobber_{start,end} living in
.init section will have been unmapped by the time a livepatch gets loaded.
Fix by adding a check that limits the clobbering of endbr64 instructions to
boot time only.
Fixes: 37ed5da851b8 ('x86/altcall: Optimise away endbr64 instruction where possible') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Bjoern Doebel [Wed, 27 Mar 2024 17:31:38 +0000 (17:31 +0000)]
hypercall_xlat_continuation: Replace BUG_ON with domain_crash
Instead of crashing the host in case of unexpected hypercall parameters,
resort to only crashing the calling domain.
This is part of XSA-454 / CVE-2023-46842.
Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Bjoern Doebel <doebel@amazon.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Wed, 27 Mar 2024 17:31:38 +0000 (17:31 +0000)]
x86/HVM: clear upper halves of GPRs upon entry from 32-bit code
Hypercalls in particular can be the subject of continuations, and logic
there checks updated state against incoming register values. If the
guest manufactured a suitable argument register with a non-zero upper
half before entering compatibility mode and issuing a hypercall from
there, checks in hypercall_xlat_continuation() might trip.
Since for HVM we want to also be sure to not hit a corner case in the
emulator, initiate the clipping right from the top of
{svm,vmx}_vmexit_handler(). Also rename the invoked function, as it no
longer does only invalidation of fields.
Note that architecturally the upper halves of registers are undefined
after a switch between compatibility and 64-bit mode (either direction).
Hence once having entered compatibility mode, the guest can't assume
the upper half of any register to retain its value.
This is part of XSA-454 / CVE-2023-46842.
Fixes: b8a7efe8528a ("Enable compatibility mode operation for HYPERVISOR_memory_op") Reported-by: Manuel Andreas <manuel.andreas@tum.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
drivers: char: Enable OMAP UART driver for TI K3 devices
TI K3 devices (J721E, J721S2, AM62X, etc.) have the same variant
of UART as OMAP4. Add the compatible used in Linux device tree,
"ti,am654-uart" to the OMAP UART dt_match so that the driver can
be used with these devices. Also, enable the driver for ARM64
platforms.
Signed-off-by: Vaishnav Achath <vaishnav.a@ti.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
xen/compiler: address violation of MISRA C Rule 20.9
The rule states:
"All identifiers used in the controlling expression of #if or #elif
preprocessing directives shall be #define'd before evaluation".
In this case, using defined(identifier) is a MISRA-compliant
way to achieve the same effect.
Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jason Andryuk [Mon, 8 Apr 2024 07:22:56 +0000 (09:22 +0200)]
x86/PVH: Support relocatable dom0 kernels
Xen tries to load a PVH dom0 kernel at the fixed guest physical address
from the elf headers. For Linux, this defaults to 0x1000000 (16MB), but
it can be configured.
Unfortunately there exist firmwares that have reserved regions at this
address, so Xen fails to load the dom0 kernel since it's not RAM.
The PVH entry code is not relocatable - it loads from absolute
addresses, which fail when the kernel is loaded at a different address.
With a suitably modified kernel, a relocatable entry point is possible.
Add XEN_ELFNOTE_PHYS32_RELOC which specifies optional alignment,
minimum, and maximum addresses needed for the kernel. The presence of
the NOTE indicates the kernel supports a relocatable entry path.
Change the loading to check for an acceptable load address. If the
kernel is relocatable, support finding an alternate load address.
The primary motivation for an explicit align field is that Linux has a
configurable CONFIG_PHYSICAL_ALIGN field. This value is present in the
bzImage setup header, but not the ELF program headers p_align, which
report 2MB even when CONFIG_PHYSICAL_ALIGN is greater. Since a kernel
is only considered relocatable if the PHYS32_RELOC elf note is present,
the alignment constraints can just be specified within the note instead
of searching for an alignment value via a heuristic.
Load alignment uses the PHYS32_RELOC note value if specified.
Otherwise, the maximum ELF PHDR p_align value is selected if greater than
or equal to PAGE_SIZE. Finally, the fallback default is 2MB.
libelf-private.h includes common-macros.h to satisfy the fuzzer build.
The rwlock handling is limiting the number of cpus to 4095 today. The
main reason is the use of the atomic_t data type for the main lock
handling, which needs 2 bits for the locking state (writer waiting or
write locked), 12 bits for the id of a possible writer, and a 12 bit
counter for readers. The limit isn't 4096 due to an off-by-one sanity
check.
The atomic_t data type is 32 bits wide, so in theory 15 bits for the
writer's cpu id and 15 bits for the reader count seem to be fine, but
via read_trylock() more readers than cpus are possible.
This means that it is possible to raise the number of cpus to 16384
without changing the rwlock_t data structure. In order to avoid the
reader count wrapping to zero, don't let read_trylock() succeed in case
the highest bit of the reader's count is set already. This leaves enough
headroom for non-recursive readers to enter without risking a wrap.
While at it, calculate _QW_CPUMASK and _QR_SHIFT from _QW_SHIFT and
add a sanity check for not overflowing the atomic_t data type.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Allow 16 bits per cpu number, which is the limit imposed by
spinlock_tickets_t.
This will allow up to 65535 cpus, while increasing only the size of
recursive spinlocks in debug builds from 8 to 12 bytes.
The current Xen limit of 4095 cpus is imposed by SPINLOCK_CPU_BITS
being 12. There are machines available with more cpus than the current
Xen limit, so it makes sense to have the possibility to use more cpus.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/spinlock: split recursive spinlocks from normal ones
Recursive and normal spinlocks are sharing the same data structure for
representation of the lock. This has two major disadvantages:
- it is not clear from the definition of a lock whether it is intended
to be used recursively or not, while a mixture of both usage variants
needs to be supported
- in production builds (builds without CONFIG_DEBUG_LOCKS) the needed
data size of an ordinary spinlock is 8 bytes instead of 4, due to the
additional recursion data needed (associated with that the rwlock
data is using 12 instead of only 8 bytes)
Fix that by introducing a struct spinlock_recursive for recursive
spinlocks only, and switch recursive spinlock functions to require
pointers to this new struct.
This allows checking for correct usage at build time.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
In order to prepare a type-safe recursive spinlock structure, add
explicitly non-recursive locking functions to be used for non-recursive
locking of spinlocks, which are used recursively, too.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
Simone Ballarin [Thu, 28 Mar 2024 10:29:35 +0000 (11:29 +0100)]
MISRA C Rule 17.1 states: "The features of `<stdarg.h>' shall not be used"
The Xen community wants to avoid using variadic functions except in
specific circumstances where it is deemed appropriate via strict code review.
Functions hypercall_create_continuation and hypercall_xlat_continuation
are internal helper functions made to break long running hypercalls into
multiple calls. They take a variable number of arguments depending on the
original hypercall they are trying to continue.
Add SAF deviations for the aforementioned functions.
Nicola Vetrini [Fri, 29 Mar 2024 09:11:33 +0000 (10:11 +0100)]
automation/eclair: add deviations for Rule 20.7
These deviations deal with the following cases:
- macro arguments used directly as initializer list arguments;
- uses of the __config_enabled macro, which can't be brought
into compliance without breaking its functionality;
- exclude files that are out of scope (efi headers and cpu_idle);
- uses of alternative_{call,vcall}[0-9] macros.
The existing configuration for R20.7 is reordered so that it matches the
cases listed in its documentation comment.
Nicola Vetrini [Fri, 29 Mar 2024 09:11:30 +0000 (10:11 +0100)]
arm/public: address violations of MISRA C Rule 20.7
MISRA C Rule 20.7 states: "Expressions resulting from the expansion
of macro parameters shall be enclosed in parentheses". Therefore, some
macro definitions should gain additional parentheses to ensure that all
current and future users will be safe with respect to expansions that
can possibly alter the semantics of the passed-in macro parameter.
Andrew Cooper [Wed, 3 Apr 2024 16:43:42 +0000 (17:43 +0100)]
x86/tsx: Cope with RTM_ALWAYS_ABORT vs RTM mismatch
It turns out there is something wonky on some but not all CPUs with
MSR_TSX_FORCE_ABORT. The presence of RTM_ALWAYS_ABORT causes Xen to think
it's safe to offer HLE/RTM to guests, but in this case, XBEGIN instructions
genuinely #UD.
Spot this case and try to back out as cleanly as we can.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Jan Beulich <jbeulich@suse.com>