Roger Pau Monne [Tue, 12 May 2020 11:42:56 +0000 (13:42 +0200)]
x86/tlb: fix assisted flush usage
Commit e9aca9470ed86 introduced a regression when avoiding sending
IPIs for certain flush operations. Xen page fault handler
(spurious_page_fault) relies on blocking interrupts in order to
prevent handling TLB flush IPIs and thus preventing other CPUs from
removing page tables pages. Switching to assisted flushing avoided such
IPIs, and thus can result in pages belonging to the page tables being
removed (and possibly re-used) while __page_fault_type is being
executed.
Force some of the TLB flushes to use IPIs, thus avoiding the assisted
TLB flush. Those selected flushes are the page type change (when
switching from a page table type to a different one, ie: a page that
has been removed as a page table) and page allocation. This sadly has
a negative performance impact on the pvshim, as fewer assisted flushes
can be used.
Introduce a new flag (FLUSH_FORCE_IPI) and helper to force a TLB flush
using an IPI (flush_tlb_mask_sync). Note that the flag is only
meaningfully defined when the hypervisor supports PV or shadow paging
mode, as otherwise hardware assisted paging domains are in charge of
their page tables and won't share page tables with Xen, thus not
influencing the result of page walks performed by the spurious fault
handler.
Just passing this new flag when calling flush_area_mask prevents the
usage of the assisted flush without any other side effects.
Note the flag is not defined on Arm, and the introduced helper is just
a dummy alias to the existing flush_tlb_mask.
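
A minimal sketch of how such a flag could gate the choice of flush
method. The flag values, the enum and pick_flush_method() are
hypothetical and only illustrate the idea; they are not the code added
by this patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this sketch. */
#define FLUSH_TLB        0x100u
#define FLUSH_FORCE_IPI  0x200u

enum flush_method { FLUSH_VIA_ASSIST, FLUSH_VIA_IPI };

/*
 * Decide how a TLB flush is delivered.  The assisted flush is only used
 * when it is available and the caller has not demanded a real IPI.
 */
static enum flush_method pick_flush_method(unsigned int flags,
                                           bool assist_available)
{
    assert(flags & FLUSH_TLB); /* this sketch only covers TLB flushes */

    if ( assist_available && !(flags & FLUSH_FORCE_IPI) )
        return FLUSH_VIA_ASSIST;

    return FLUSH_VIA_IPI;
}
```

Under these assumptions a caller that must not race with the spurious
fault handler would pass FLUSH_TLB | FLUSH_FORCE_IPI, and every other
caller keeps getting the assisted flush when it is available.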
Fixes: e9aca9470ed86 ('x86/tlb: use Xen L0 assisted TLB flush when available') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v2:
- Always do a physical IPI triggered flush in
filtered_flush_tlb_mask, since it's always required by the current
callers of the function.
Changes since v1:
- Add a comment describing the usage of FLUSH_FORCE_IPI (and why no
modifications to flush_area_mask are required).
- Use PGT_root_page_table instead of PGT_l4_page_table.
- Also perform IPI flushes if configured with shadow paging support.
- Use ifdef instead of if.
Roger Pau Monne [Thu, 11 Jun 2020 15:04:48 +0000 (17:04 +0200)]
x86/hvm: enable emulated PIT for PVH dom0
Some video BIOS require a PIT in order to work properly, hence classic
PV dom0 gets partial access to the physical PIT as long as it's not in
use by Xen.
Since PVH dom0 is built on top of HVM support, there's already an
emulated PIT implementation available for use. Tweak the emulated PIT
code so it injects interrupts directly into the vIO-APIC if the legacy
PIC (i8259) is disabled. Make sure the GSI used matches the ISA IRQ 0
in the likely case there's an interrupt override in the MADT ACPI
table.
Finally prevent the passthrough of the GSI that belongs to the PIT,
since interrupts will be generated by the emulated PIT instead of the
physical one.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 11 Jun 2020 13:21:54 +0000 (15:21 +0200)]
x86/hvm: add hardware domain support to hvm_isa_irq_to_gsi
The current function has the ISA IRQ 0 hardcoded to GSI 2 for HVM
domUs. Allow such function to also be used by the hardware domain by
taking into account the ACPI interrupt overrides in order to get the
correct ISA to GSI mappings.
This requires passing a domain parameter to the helper, since it's not
guaranteed to always be called with current being the destination
vCPU.
No functional change intended.
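
A rough sketch of the shape such a helper could take. The struct, its
fields and sketch_isa_irq_to_gsi() are hypothetical stand-ins, and the
identity mapping for the remaining domU ISA IRQs is an assumption of
this sketch:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_ISA_IRQS 16u

/* Hypothetical stand-in for the domain; only what the sketch needs. */
struct sketch_domain {
    bool is_hardware_domain;
    /* ISA IRQ -> GSI map derived from the ACPI overrides (hwdom only). */
    unsigned int isa_to_gsi[NR_ISA_IRQS];
};

static unsigned int sketch_isa_irq_to_gsi(const struct sketch_domain *d,
                                          unsigned int isa_irq)
{
    assert(isa_irq < NR_ISA_IRQS);

    if ( d->is_hardware_domain )
        return d->isa_to_gsi[isa_irq];

    /* domU: ISA IRQ 0 is hardcoded to GSI 2, assumed identity otherwise. */
    return isa_irq == 0 ? 2 : isa_irq;
}
```

Taking the domain as an explicit parameter is what lets the lookup work
even when current is not the destination vCPU.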
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Fri, 12 Jun 2020 13:57:41 +0000 (15:57 +0200)]
x86/vpt: fix injection to remote vCPU
vpt timers are usually added to the per-vCPU list of the vCPU where
they get setup, but depending on the timer source type that vCPU might
be different than the one where the interrupt vector gets injected.
For example the PIT timer uses a PIC or IO-APIC pin in order to select
the destination vCPU and vector, which might not match the vCPU they
are configured from.
If such a situation happens pt_intr_post won't be called, and thus the
vpt will be left in a limbo where the next interrupt won't be
scheduled. Fix this by generalizing the special handling done to
IO-APIC level interrupts to be applied always when the destination
vCPU of the injected vector is different from the vCPU where the vpt
belongs to (ie: usually the one it's been configured from).
A further improvement as noted in a comment added to the code might be
to move the vpt so it's handled by the same vCPU where the vector gets
injected.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 11 Jun 2020 17:42:13 +0000 (19:42 +0200)]
x86/vpt: only try to resume timers belonging to enabled devices
Check whether the emulated device is actually enabled before trying to
resume the associated timers.
Thankfully all those structures are zeroed at initialization, and
since the devices are not enabled they are never populated, which
triggers the pt->vcpu check at the beginning of pt_resume forcing an
exit from the function.
While there limit the scope of i and make it unsigned.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 11 Jun 2020 12:38:15 +0000 (14:38 +0200)]
x86/hvm: fix ISA IRQ 0 handling when set as lowest priority mode in IO APIC
Lowest priority destination mode does allow the vIO APIC code to
select a vCPU to inject the interrupt to, but the selected vCPU must
be part of the possible destinations configured for such IO APIC pin.
Fix the code in order to only force vCPU 0 if it's part of the
listed destinations.
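
The intended selection rule can be sketched as follows. The function
name and the bitmask representation of the destinations are assumptions
of this sketch, and the fallback arbitration (lowest numbered allowed
vCPU) is only a placeholder for whatever the vIO APIC code really does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Pick the target vCPU for a lowest priority interrupt.  dest_mask has
 * bit N set when vCPU N is a possible destination of the pin.
 */
static unsigned int sketch_pick_target(uint32_t dest_mask, bool is_isa_irq0)
{
    unsigned int vcpu = 0;

    assert(dest_mask != 0);

    /* Prefer vCPU 0 for ISA IRQ 0, but only if the pin allows it. */
    if ( is_isa_irq0 && (dest_mask & 1u) )
        return 0;

    /* Placeholder arbitration: lowest numbered allowed vCPU. */
    while ( !(dest_mask & (1u << vcpu)) )
        vcpu++;

    return vcpu;
}
```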
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 11 Jun 2020 11:54:48 +0000 (13:54 +0200)]
x86/hvm: don't force vCPU 0 for IRQ 0 when using fixed destination mode
When the IO APIC pin mapped to the ISA IRQ 0 has been configured to
use fixed delivery mode do not forcefully route interrupts to vCPU 0,
as the OS might have setup those interrupts to be injected to a
different vCPU, and injecting to vCPU 0 can cause the OS to miss such
interrupts or errors to happen due to unexpected vectors being
injected on vCPU 0.
In order to fix remove such handling altogether for fixed destination
mode pins and just inject them according to the data setup in the
IO-APIC entry.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Wed, 10 Jun 2020 10:43:32 +0000 (12:43 +0200)]
x86/passthrough: introduce a flag for GSIs not requiring an EOI or unmask
There's no need to setup a timer for GSIs that are edge triggered,
since those don't require any EOI or unmask, and hence couldn't block
other interrupts.
Note this is only used by PVH dom0, that can setup the passthrough of
edge triggered interrupts from the vIO-APIC. One example of such kind
of interrupt that can be used by a PVH dom0 would be the RTC timer.
While there introduce an out label to do the unlock and reduce code
duplication.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
- Introduce an out label that does the unlock.
Roger Pau Monne [Wed, 10 Jun 2020 08:31:55 +0000 (10:31 +0200)]
x86/passthrough: do not assert edge triggered GSIs for PVH dom0
Edge triggered interrupts do not assert the line, so the handling done
in Xen should also avoid asserting it. Asserting the line prevents
further edge triggered interrupts on the same vIO-APIC pin from being
delivered, since the line is not de-asserted.
One case of such kind of interrupt is the RTC timer, which is edge
triggered and available to a PVH dom0. Note this should not affect
domUs, as it only modifies the behavior of IDENTITY_GSI kind of passed
through interrupts.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
- Compare the triggering against VIOAPIC_{EDGE/LEVEL}_TRIG.
Roger Pau Monne [Thu, 4 Jun 2020 17:09:55 +0000 (19:09 +0200)]
x86/rtc: provide mediated access to RTC for PVH dom0
Mediated access to the RTC was provided for PVHv1 dom0 using the PV
code paths (guest_io_{write/read}), but those accesses were never
implemented for PVHv2 dom0. This patch provides such mediated accesses
to the RTC for PVH dom0, just like it's provided for a classic PV
dom0.
Pull out some of the RTC logic from guest_io_{read/write} into
specific helpers that can be used by both PV and HVM guests. The
setup of the handlers for PVH is done in rtc_init, which is already
used to initialize the fully emulated RTC.
Without this a Linux PVH dom0 will read garbage when trying to access
the RTC, and one vCPU will be constantly looping in
rtc_timer_do_work.
Note that such issue doesn't happen on domUs because the ACPI
NO_CMOS_RTC flag is set in FADT, which prevents the OS from accessing
the RTC. Also the X86_EMU_RTC flag is not set for PVH dom0, as the
accesses are not emulated but rather forwarded to the physical
hardware.
No functional change expected for classic PV dom0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
for-4.14 reasoning: the fix is mostly isolated to PVH dom0, and as
such the risk is very low of causing issues to other guest types, but
without this fix one vCPU when using a Linux dom0 will be constantly
looping over rtc_timer_do_work with 100% CPU usage, at least when
using Linux 4.19 or newer.
---
Changes since v3:
- Reword comment.
- Add missing newline after break.
- Remove extra parentheses in the RTC ports check in
guest_io_{read/write}.
Changes since v2:
- Move the access check into the read/write handler.
- Allow access to the latched first RTC port by all PV guests.
- Register the handlers for HVM native accesses if vRTC is disabled.
Changes since v1:
- Share the code with PV.
- Add access checks to the IO ports.
Roger Pau Monne [Tue, 2 Jun 2020 08:58:00 +0000 (10:58 +0200)]
compilers/clang: always use _Static_assert with clang
All versions of clang used by Xen support _Static_assert, so use it
unconditionally when building Xen with clang.
No functional change expected.
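
As an illustration, a compile-time check macro selecting
_Static_assert could look roughly like the sketch below. The macro name
SKETCH_BUILD_BUG_ON and the helper are hypothetical, and the GCC version
threshold (4.6) is an assumption of this sketch rather than something
stated in this message:

```c
#if defined(__clang__) || __GNUC__ > 4 || \
    (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
/* _Static_assert is usable; with clang this branch is always taken. */
# define SKETCH_BUILD_BUG_ON(cond) _Static_assert(!(cond), "!(" #cond ")")
#else
/* Fallback for old compilers: a negative bit-field width breaks the build. */
# define SKETCH_BUILD_BUG_ON(cond) ((void)sizeof(struct { int:-!!(cond); }))
#endif

/* Example user: the build fails if unsigned int is narrower than 32 bits. */
static unsigned int sketch_checked_bits(void)
{
    SKETCH_BUILD_BUG_ON(sizeof(unsigned int) < 4);

    return 8 * sizeof(unsigned int);
}
```

Testing __clang__ first matters because a version check based only on
__GNUC__ and __GNUC_MINOR__ would see clang as GCC 4.2.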
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Not sure whether this fully qualifies as a bugfix, as the current
behavior should also work fine under clang. Note that all versions of
clang from 3.5 to trunk (11) seem to return __GNUC__ == 4 and
__GNUC_MINOR__ == 2.
Roger Pau Monne [Tue, 2 Jun 2020 08:57:47 +0000 (10:57 +0200)]
x86/cpu: fix build with clang 3.5
Clang 3.5 complains with:
common.c:794:24: error: statement expression not allowed at file scope
i < ARRAY_SIZE(this_cpu(tss_page).ist_ssp); ++i )
^
/build/xen/include/asm/percpu.h:14:7: note: expanded from macro 'this_cpu'
(*RELOC_HIDE(&per_cpu__##var, get_cpu_info()->per_cpu_offset))
^
/build/xen/include/xen/compiler.h:104:3: note: expanded from macro 'RELOC_HIDE'
({ unsigned long __ptr; \
^
/build/xen/include/xen/lib.h:68:69: note: expanded from macro 'ARRAY_SIZE'
#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]) + __must_be_array(x))
^
/build/xen/include/xen/compiler.h:85:57: note: expanded from macro '__must_be_array'
BUILD_BUG_ON_ZERO(__builtin_types_compatible_p(typeof(a), typeof(&a[0])))
^
/build/xen/include/xen/lib.h:39:57: note: expanded from macro 'BUILD_BUG_ON_ZERO'
#define BUILD_BUG_ON_ZERO(cond) sizeof(struct { int:-!!(cond); })
^
Work around this by defining the tss_page as a local variable. Adjust
other users of the per-cpu tss_page to also use the newly introduced
local variable.
No functional change expected.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 28 May 2020 14:12:00 +0000 (16:12 +0200)]
build32: don't discard .shstrtab in linker script
LLVM linker doesn't support discarding .shstrtab, and complains with:
ld -melf_i386_fbsd -N -T build32.lds -o reloc.lnk reloc.o
ld: error: discarding .shstrtab section is not allowed
Add explicit .shstrtab, .strtab and .symtab sections to the linker
script after the text section in order to make LLVM LD happy and match
the behavior of GNU LD.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes since v2:
- Also add .strtab and .symtab sections to match GNU behavior.
Roger Pau Monne [Mon, 4 May 2020 09:08:05 +0000 (11:08 +0200)]
x86/mm: do not attempt to convert _PAGE_GNTTAB to a boolean
Clang 10 complains with:
mm.c:1239:10: error: converting the result of '<<' to a boolean always evaluates to true
[-Werror,-Wtautological-constant-compare]
if ( _PAGE_GNTTAB && (l1e_get_flags(l1e) & _PAGE_GNTTAB) &&
^
xen/include/asm/x86_64/page.h:161:25: note: expanded from macro '_PAGE_GNTTAB'
#define _PAGE_GNTTAB (1U<<22)
^
Remove the conversion of _PAGE_GNTTAB to a boolean and instead use a
preprocessor conditional to check if _PAGE_GNTTAB is defined.
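
A minimal sketch of the pattern, using a hypothetical helper rather than
the real check in mm.c. Whether the actual change uses #ifdef or #if is
not stated here, so #ifdef is an assumption:

```c
#include <stdbool.h>

/* Value copied from the compiler note above; may be left undefined. */
#define _PAGE_GNTTAB (1U << 22)

/* Hypothetical helper: is this L1 entry flagged as a grant mapping? */
static bool sketch_is_grant_mapping(unsigned int l1e_flags)
{
#ifdef _PAGE_GNTTAB
    /* The constant is only used as a mask, never as a truth value. */
    return (l1e_flags & _PAGE_GNTTAB) != 0;
#else
    /* No grant flag available in this configuration. */
    return false;
#endif
}
```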
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
---
Changes since v2:
- Add comment.
Changes since v1:
- Use a preprocessor conditional.
x86/hvm: allow for more fine grained assisted flush
Improve the assisted flush by expanding the interface and allowing for
more fine grained TLB flushes to be issued using the HVMOP_flush_tlbs
hypercall. Support for such advanced flushes is signaled in CPUID
using the XEN_HVM_CPUID_ADVANCED_FLUSH flag.
The new features make use of the NULL parameter so far passed in the
hypercall in order to convey extra data to perform more selective
flushes: a virtual address, an order field, a flags field and finally a
vCPU bitmap. Note that not all features are implemented as part of
this patch, but are already added to the interface in order to avoid
having to introduce yet a new CPUID flag when the new features are
added.
The feature currently implemented is the usage of a guest provided
vCPU bitmap in order to signal which vCPUs require a TLB flush,
instead of assuming all vCPUs must be flushed. Note that not
implementing the rest of the features just makes the flush less
efficient, but it's still correct and safe.
Finally add support for Xen running in guest mode (Xen on Xen or PV
shim mode) to use the newly introduced flush options when available.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Thu, 23 Jan 2020 17:37:47 +0000 (18:37 +0100)]
x86/apic: simplify disconnect_bsp_APIC setup of LVT{0/1}
There's no need to read the current values of LVT{0/1} for the
purposes of the function, which seem to be to save the currently
selected vector: in the destination modes used (ExtINT and NMI) the
vector field is ignored and hence can be set to 0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Roger Pau Monne [Tue, 14 Jan 2020 18:06:26 +0000 (19:06 +0100)]
x86/hvmloader: round up memory BAR size to 4K
When placing memory BARs with sizes smaller than 4K multiple memory
BARs can end up mapped to the same guest physical address, and thus
won't work correctly.
Round up all memory BAR sizes to be at least 4K, so that they are
naturally aligned to a page size and thus don't end up sharing a page.
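
The rounding itself is simple because PCI BAR sizes are powers of two,
so "at least 4K" reduces to taking a maximum. A sketch, with
hypothetical names that are not taken from hvmloader:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 0x1000ull /* 4K */

/* Round a memory BAR size up so the BAR never shares a page. */
static uint64_t sketch_round_bar_size(uint64_t bar_sz)
{
    /* BAR sizes are non-zero powers of two. */
    assert(bar_sz != 0 && (bar_sz & (bar_sz - 1)) == 0);

    return bar_sz < SKETCH_PAGE_SIZE ? SKETCH_PAGE_SIZE : bar_sz;
}
```

A 256 byte BAR is then sized as 4K, while BARs of 4K or more are left
unchanged.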
Also add a couple of asserts to the current code to make sure the MMIO
hole is properly sized and aligned.
Note that the guest can still move the BARs around and create these
collisions, and that BARs not filling up a physical page might leak
access to other MMIO regions placed in the same host physical page.
This is however no worse than what's currently done, and hence should
be considered an improvement over the current state.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
--- Cc: Jason Andryuk <jandryuk@gmail.com>
---
Changes since v1:
- Do the round up when sizing the BARs, so that the MMIO hole is
correctly sized.
- Add some asserts that the hole is properly sized and size-aligned.
- Dropped Jason Tested-by since the code has changed.
---
Jason, can you give this a spin? Thanks.
Andrew Cooper [Sat, 30 May 2020 00:52:13 +0000 (01:52 +0100)]
xen/credit2: Fix build following c/s 8e2aa76dc (take 2)
OSSTest reports:
credit2.c: In function 'cpu_runqueue_siblings_match':
credit2.c:883:29: error: implicit declaration of function 'cpu_nr_siblings' [-Werror=implicit-function-declaration]
unsigned int nr_sibls = cpu_nr_siblings(cpu);
^~~~~~~~~~~~~~~
credit2.c:883:5: error: nested extern declaration of 'cpu_nr_siblings' [-Werror=nested-externs]
unsigned int nr_sibls = cpu_nr_siblings(cpu);
^~~~~~~~
cc1: all warnings being treated as errors
For whatever reason, cpufeature.h's inclusion is conditional, and missing for
arm32. Include it explicitly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 22 Apr 2020 12:44:37 +0000 (13:44 +0100)]
x86/shstk: Activate Supervisor Shadow Stacks
With all other plumbing in place, activate shadow stacks when possible.
Note that CET shares similar problems with SMEP/SMAP, with Ring1 being
supervisor to the processor, and that the layout of the shadow stack differs
between an IRET to Ring 1 and Ring 3. Therefore, we disable PV32 when CET is
enabled. Compatibility can be maintained if necessary via PV-Shim.
The BSP needs to wait until alternatives have run (to avoid interaction with
CR0.WP), and after the first reset_stack_and_jump() to avoid having a pristine
shadow stack interact in problematic ways with an in-use regular stack.
Activate shadow stack in reinit_bsp_stack().
APs have all infrastructure set up by the booting CPU, so enable shadow stacks
before entering C. Adjust the logic to call start_secondary rather than jump
to it, so stack traces make more sense.
The crash path needs to turn CET off to avoid interfering with the crash
kernel's environment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 24 Apr 2020 13:34:44 +0000 (14:34 +0100)]
x86/entry: Adjust guest paths to be shadow stack compatible
The SYSCALL/SYSENTER/SYSRET paths need to use {SET,CLR}SSBSY. The IRET to
guest paths must not. In the SYSRET path, re-position the mov which loads rip
into %rcx so we can use %rcx for CLRSSBSY, rather than spilling another
register to the stack.
While we can in principle detect shadow stack corruption and a failure to
clear the supervisor token busy bit in the SYSRET path (by inspecting the
carry flag following CLRSSBSY), we cannot detect similar problems for the IRET
path (IRET is specified not to fault in this case).
We will double fault at some point later, when next trying to enter Xen, due
to an already-set supervisor shadow stack busy bit. As SYSRET is an uncommon
path anyway, avoid the added complexity for no appreciable gain.
The IST switch onto the primary stack is not great as we have an instruction
boundary with no shadow stack. This is the least bad option available.
These paths are not used before shadow stacks are properly established, so can
use alternatives to avoid extra runtime CET detection logic.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 1 May 2020 17:10:00 +0000 (18:10 +0100)]
x86/alt: Adjust _alternative_instructions() to not create shadow stacks
The current alternatives algorithm clears CR0.WP and writes into .text. This
has a side effect of the mappings becoming shadow stacks once CET is active.
Adjust _alternative_instructions() to clean up after itself. This involves
extending the set of bits handled by modify_xen_mappings() to include Dirty
(and Accessed for good measure).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 30 Apr 2020 14:05:24 +0000 (15:05 +0100)]
x86/extable: Adjust extable handling to be shadow stack compatible
When adjusting an IRET frame to recover from a fault, an equivalent
adjustment needs making in the shadow IRET frame.
The adjustment in exception_with_ints_disabled() could in principle be an
alternative block rather than an ifdef, as the only two current users of
_PRE_EXTABLE() are IRET-to-guest instructions. However, this is not a
fastpath, and this form is more robust to future changes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 24 Apr 2020 13:19:52 +0000 (14:19 +0100)]
x86/spec-ctrl: Adjust DO_OVERWRITE_RSB to be shadow stack compatible
The 32 calls need dropping from the shadow stack as well as the regular stack.
To shorten the code, we can use the 32bit forms of RDSSP/INCSSP, but need to
double up the input to INCSSP to counter the operand size based multiplier.
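
The doubling follows from the operand size based multiplier: each CALL
pushes an 8 byte entry onto the shadow stack, while the 32bit form of
INCSSP advances SSP by 4 bytes per count and the 64bit form by 8. A
worked sketch of the arithmetic, using a hypothetical helper:

```c
#include <assert.h>

/*
 * Operand needed by INCSSP to drop nr_entries shadow stack entries.
 * Each entry is 8 bytes in 64bit mode; bytes_per_count is 4 for the
 * 32bit form of the instruction and 8 for the 64bit form.
 */
static unsigned int sketch_incssp_operand(unsigned int nr_entries,
                                          unsigned int bytes_per_count)
{
    assert(bytes_per_count == 4 || bytes_per_count == 8);

    return nr_entries * 8 / bytes_per_count;
}
```

For the 32 calls in DO_OVERWRITE_RSB this gives an operand of 64 with
the 32bit form, against 32 with the 64bit form.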
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 21 Feb 2020 17:56:57 +0000 (17:56 +0000)]
x86/cpu: Adjust enable_nmis() to be shadow stack compatible
When executing an IRET-to-self, the shadow stack must agree with the regular
stack. We can't manipulate SSP directly, so have to fake a shadow IRET frame
by executing 3 CALLs, then editing the result to look correct.
This is not a fastpath, is called on the BSP long before CET can be set up,
and may be called on the crash path after CET is disabled. Use the fact that
INCSSP is allocated from the hint nop space to construct a test for CET being
active which is safe on all processors.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 23 Apr 2020 19:20:59 +0000 (20:20 +0100)]
x86/shstk: Create shadow stacks
Introduce HYPERVISOR_SHSTK pagetable constants, which are Read-Only + Dirty.
Use these in place of _PAGE_RW for memguard_guard_stack(), to create real
shadow stacks on capable hardware.
Supervisor shadow stacks need a token written at the top, which is most easily
done before making the frame read only.
Allocate the shadow IST stack block in struct tss_page. It doesn't strictly
need to live here, but it is a convenient location (and XPTI-safe, for testing
purposes), and placing it ahead of the TSS doesn't risk colliding with a bad
IO Bitmap offset and turning into some IO port permissions.
Have load_system_tables() set up the shadow IST stack table when setting up
the regular IST in the TSS.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 23 Apr 2020 19:20:59 +0000 (20:20 +0100)]
x86/shstk: Rework the stack layout to support shadow stacks
We have two free pages in the current stack. A useful property of shadow
stacks and regular stacks is that they act as each other's guard pages as far
as OoB writes go. As wild OoB stack reads aren't likely, we don't lose any
meaningful protection from using read-only guard pages in general (rather than
non-present guard pages), but the result is far simpler for Xen as a whole by not
having a feature/mode dependent stack configuration.
Move the regular IST stacks up by one page, to allow their shadow stack page
to be in slot 0. The primary shadow stack uses slot 5.
As the shadow IST stacks are only 1k large, shuffle the order of IST vectors
to have #DF numerically highest, so there is no chance of a shadow stack
overflow clobbering the supervisor token.
The XPTI code already breaks the MEMORY_GUARD abstraction for stacks by
forcing it to be in effect (i.e. guard page not present). To avoid having too
many configurations, do away with the concept entirely, and unconditionally
map the pages in their read-only form.
A later change will turn these properly into shadow stacks. Some of the
comments written here are the intended result, and will become true in the
subsequent change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 21 Feb 2020 17:56:57 +0000 (17:56 +0000)]
x86/shstk: Introduce Supervisor Shadow Stack support
Introduce CONFIG_HAS_AS_CET_SS to determine whether CET Shadow Stack
instructions are supported in the assembler, and CONFIG_XEN_SHSTK as the main
build option.
Introduce cet={no-,}shstk for a user to select whether or not to use shadow
stacks at runtime, and X86_FEATURE_XEN_SHSTK to determine Xen's overall
enablement of shadow stacks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 30 Apr 2020 14:05:24 +0000 (15:05 +0100)]
x86/traps: Factor out extable_fixup() and make printing consistent
UD faults never had any diagnostics printed, and the others were inconsistent.
Don't use dprintk() because identifying traps.c is actively unhelpful in the
message, as it is the location of the fixup, not the fault. Use the new
vec_name() infrastructure, rather than leaving raw numbers for the log.
Andrew Cooper [Thu, 30 Apr 2020 14:05:24 +0000 (15:05 +0100)]
x86/traps: Clean up printing in {do_reserved,fatal}_trap()
For one, they render the vector in a different base.
Introduce X86_EXC_* constants and vec_name() to refer to exceptions by their
mnemonic, which starts bringing the code/diagnostics in line with the Intel
and AMD manuals.
Provide constants for every architecturally defined exception, even those not
implemented by Xen yet, as do_reserved_trap() is a catch-all handler.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 29 May 2020 18:28:00 +0000 (20:28 +0200)]
xen/build: fix xen/tools/binfile
xen/tools/binfile contains a bash specific command (let). This leads
to build failures on systems not using bash as /bin/sh.
Replace "let SHIFT=$OPTIND-1" by "SHIFT=$((OPTIND-1))".
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
The ARM side of the cpu_nr_siblings() was missing a return type. OSSTest
reports:
/home/osstest/build.150502.build-arm64-xsm/xen/xen/include/asm/cpufeature.h:67:15:
error: return type defaults to 'int' [-Werror=implicit-int]
static inline cpu_nr_siblings(unsigned int)
^~~~~~~~~~~~~~~
My local build test then reported:
/local/xen.git/xen/include/asm/cpufeature.h: In function ‘cpu_nr_siblings’:
/local/xen.git/xen/include/asm/cpufeature.h:67:1: error: parameter name omitted
static inline int cpu_nr_siblings(unsigned int)
^
Fix it up to match its x86 counterpart.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
toggle_guest_pt() is called in pairs, to read guest kernel data
structures when emulating a guest userspace action. Hence this doesn't
modify cr3 from the guest's point of view, and therefore doesn't need
any resync on the exit-to-guest path. Therefore move the updating of
->pv_cr3 and ->root_pgt_changed into toggle_guest_mode(), since undoing
the changes during the second of these invocations wouldn't be a safe
thing to do.
While at it, add a comment ahead of toggle_guest_pt() to clarify its
intended usage.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Fri, 29 May 2020 15:43:43 +0000 (16:43 +0100)]
xen/build: introduce CLANG_FLAGS for testing other CFLAGS
Commit 534519f0514f ("xen: Have Kconfig check $(CC)'s version")
introduced the use of CLANG_FLAGS in Kconfig which is used when
testing for other CFLAGS via $(cc-option ...) but CLANG_FLAGS doesn't
exist in the Xen build system. (It's a Linux/Kbuild variable that
hasn't been added yet.)
The missing CLANG_FLAGS isn't an issue for $(cc-option ..) but it
would be when $(as-instr ..) gets imported from Kbuild to test
assembly instructions. We need to know if we are going to use clang's
assembler or not.
CLANG_FLAGS needs to be calculated before we call Kconfig.
So, this patch adds CLANG_FLAGS which may contain two flags which are
needed for further testing of $(CC)'s capabilities:
-no-integrated-as
This flag isn't new, but simply tested earlier so that it can be
used in Kconfig. The flag is only added for x86 builds like
before.
-Werror=unknown-warning-option
This one is new and is to make sure that the warning is enabled,
even though it is by default but could be disabled in a particular
build of clang, see Linux's commit e8de12fb7cde ("kbuild: Check
for unknown options with cc-option usage in Kconfig and clang")
It is present in clang 3.0.0, according to Linux's commit 589834b3a009
("kbuild: Add -Werror=unknown-warning-option to CLANG_FLAGS").
(The "note" that says that the flag was only added once wasn't true
when tested on CentOS 6, so the patch uses $(or) and the flag will only
be added once.)
Fixes: 534519f0514f ("xen: Have Kconfig check $(CC)'s version") Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Dario Faggioli [Thu, 28 May 2020 21:29:44 +0000 (23:29 +0200)]
xen: credit2: limit the max number of CPUs in a runqueue
In Credit2 CPUs (can) share runqueues, depending on the topology. For
instance, with per-socket runqueues (the default) all the CPUs that are
part of the same socket share a runqueue.
On platforms with a huge number of CPUs per socket, that could be a
problem. An example is AMD EPYC2 servers, where we can have up to 128
CPUs in a socket.
It is of course possible to define other, still topology-based, runqueue
arrangements (e.g., per-LLC, per-DIE, etc). But that may still result in
runqueues with too many CPUs on other/future platforms. For instance, a
system with 96 CPUs and 2 NUMA nodes will end up having 48 CPUs per
runqueue. Not as bad, but still a lot!
Therefore, let's set a limit to the max number of CPUs that can share a
Credit2 runqueue. The actual value is configurable (at boot time), the
default being 16. If, for instance, there are more than 16 CPUs in a
socket, they'll be split among two (or more) runqueues.
Note: with core scheduling enabled, this parameter sets the max number
of *scheduling resources* that can share a runqueue. Therefore, with
granularity set to core (and assuming 2 threads per core), we will have
at most 16 cores per runqueue, which corresponds to 32 threads. But that
is fine, considering how core scheduling works.
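
The effect of the limit can be sketched as a ceiling division. This
only shows the arithmetic; how the real code spreads CPUs over the
resulting runqueues is not covered, and the function name is
hypothetical:

```c
#include <assert.h>

/* Number of runqueues needed so that none holds more than max_per_rq. */
static unsigned int sketch_nr_runqueues(unsigned int nr_units,
                                        unsigned int max_per_rq)
{
    assert(max_per_rq != 0);

    return (nr_units + max_per_rq - 1) / max_per_rq;
}
```

With the default of 16, a 128 CPU socket ends up with 8 runqueues and a
24 CPU socket with 2. Here nr_units counts scheduling resources, so with
core granularity each unit stands for a whole core.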
Dario Faggioli [Thu, 28 May 2020 21:29:37 +0000 (23:29 +0200)]
xen: cpupool: add a back-pointer from a scheduler to its pool
If we need to know within which pool a particular scheduler
is working, we can do that by querying the cpupool pointer
of any of the sched_resource-s (i.e., ~ any of the CPUs)
assigned to the scheduler itself.
Basically, we pick any sched_resource that we know uses that
scheduler, and we check its *cpupool pointer. If we really
know that the resource uses the scheduler, this is fine, as
it also means the resource is inside the pool we are
looking for.
But, of course, we can't do that for a pool/scheduler that has
not been given any sched_resource yet (or if we do not
know whether or not it has any sched_resource).
To overcome such limitation, add a back pointer from the
scheduler, to its own pool.
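
In data structure terms the idea is roughly the following. The struct
and field names are hypothetical and reduced to what the sketch needs:

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cpupool {
    unsigned int id;
};

struct sketch_scheduler {
    const char *name;
    struct sketch_cpupool *pool; /* the back pointer; NULL if unassigned */
};

/* Look up the pool directly, without going through any sched_resource. */
static struct sketch_cpupool *
sketch_sched_pool(const struct sketch_scheduler *s)
{
    assert(s != NULL);

    return s->pool;
}
```

The lookup then works even for a scheduler that has no sched_resource
assigned yet.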
Andrew Cooper [Tue, 1 Sep 2020 15:08:00 +0000 (16:08 +0100)]
docs/xl.cfg: Rewrite cpuid= section
This is partly to adjust the description of 'k' and 's' seeing as they have
changed, but mostly restructuring the information for clarity.
In particular, use indentation to clearly separate the areas discussing libxl
format from xend format. In addition, extend the xend format section to
discuss subleaf notation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Fri, 20 Dec 2019 19:38:26 +0000 (19:38 +0000)]
tools/libxc: Restore CPUID/MSR data found in the migration stream
With all other pieces in place, it is now safe to restore the CPUID and MSR
data in the migration stream, rather than discarding them and using the higher
level toolstack's compatibility logic.
While this is a small patch, it has large implications for migrated/resumed
domains. Most obviously, the CPU family/model/stepping data,
cache/tlb/etc. will no longer change behind the guest's back.
Another change is the interpretation of the Xend cpuid strings. The 'k'
option is not a sensible thing to have ever supported, and 's' is how the
stream will end up behaving.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Mon, 16 Dec 2019 19:03:14 +0000 (19:03 +0000)]
tools/libx[cl]: Plumb 'missing' through static_data_done() up into libxl
Pre Xen-4.14 streams will not contain any CPUID/MSR information. There is
nothing libxc can do about this, and will have to rely on the higher level
toolstack to provide backwards compatibility.
To facilitate this, extend the static_data_done() callback, highlighting the
missing information, and modify libxl to use it. At the libxc level, this
requires an arch-specific hook which, for now, always reports CPUID and MSR as
missing. This will be adjusted in a later change.
No overall functional change - this is just plumbing.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Tue, 17 Dec 2019 12:41:02 +0000 (12:41 +0000)]
libxc/save: Write X86_{CPUID,MSR}_DATA records
With the destination side now able to understand X86_{CPUID,MSR}_DATA
records (and compatibly handle their absence), update the sending logic to
obtain and forward this data from Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Thu, 2 Jan 2020 19:44:36 +0000 (19:44 +0000)]
tools/libxl: Re-position CPUID handling during domain construction
CPUID handling needs to be earlier in construction. Move it from its current
position in libxl__build_post() to libxl__build_pre() for fresh builds, and
libxl__srm_callout_callback_static_data_done() for the migration/resume case.
Later changes will make the migration/resume case conditional on whether CPUID
data was present in the migration stream, and the libxc layer took care of
restoring it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Tue, 17 Dec 2019 12:29:42 +0000 (12:29 +0000)]
libxc/save: Write a v3 stream
Introduce a new static_data() hook which is responsible for writing out
any static data records. The HVM side continues to be a no-op, while
the PV side moves write_x86_pv_info() into this earlier hook. The
common code writes out a STATIC_DATA_END record, and the stream version
is bumped to 3.
Update convert-legacy-stream to write a v3 stream, because this will
bypass the compatibility logic in libxc.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Mon, 16 Dec 2019 19:03:14 +0000 (19:03 +0000)]
libxc/restore: STATIC_DATA_END inference for v2 compatibility
A v3 stream can compatibly read a v2 stream by inferring the position of the
STATIC_DATA_END record.
v2 compatibility is only needed for x86. No other architectures exist yet,
but they will have a minimum of v3 when introduced.
The x86 HVM compatibility point being in handle_page_data() (which is common
code) is a bit awkward. However, as the two compatibility points are subtly
different, and it is (intentionally) not possible to call into arch specific
code from common code (except via the ops hooks), use some #ifdef-ary and
opencode the check, rather than make handle_page_data() a per-arch helper.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
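The inference described in the commit above (a v3 reader placing STATIC_DATA_END itself when reading a v2 stream) can be sketched roughly as below. This is an illustration under stated assumptions: the record kinds, the `restore_ctx` layout and the error handling are hypothetical, and only the "infer the marker before the first page data" idea is taken from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical record kinds; the real stream uses numeric type codes. */
enum rec_kind { REC_STATIC, REC_STATIC_DATA_END, REC_PAGE_DATA };

struct restore_ctx {
    unsigned int stream_version;     /* 2 or 3 */
    bool seen_static_data_end;
    unsigned int static_done_calls;  /* how often the callback ran */
};

/* Stand-in for the static_data_done() callback. */
void static_data_done(struct restore_ctx *ctx)
{
    ctx->seen_static_data_end = true;
    ctx->static_done_calls++;
}

/* Returns 0 on success, -1 on a malformed stream (assumed policy). */
int handle_record(struct restore_ctx *ctx, enum rec_kind kind)
{
    switch ( kind )
    {
    case REC_STATIC:
        /* Static data after the end marker is treated as an error here. */
        return ctx->seen_static_data_end ? -1 : 0;

    case REC_STATIC_DATA_END:
        if ( ctx->seen_static_data_end )
            return -1;
        static_data_done(ctx);
        return 0;

    case REC_PAGE_DATA:
        if ( !ctx->seen_static_data_end )
        {
            /* v3 mandates the marker; v2 has none, so infer it here. */
            if ( ctx->stream_version >= 3 )
                return -1;
            static_data_done(ctx);
        }
        return 0;
    }

    return -1;
}
```

The callback runs exactly once per stream in either case, which is what lets the higher-level toolstack treat v2 and v3 streams uniformly.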
Andrew Cooper [Thu, 5 Dec 2019 15:57:13 +0000 (15:57 +0000)]
docs/migration: Specify migration v3 and STATIC_DATA_END
Migration data can be split into two parts - that which is invariant of
guest execution, and that which is not. Separate these two with the
STATIC_DATA_END record.
In the short term, we want to move the x86 CPU Policy data into the stream.
In the longer term, we want to provisionally send the static data only
to the destination as a more robust compatibility check. In both cases,
we will want a callback into the higher level toolstack.
Mandate the presence of the STATIC_DATA_END record, and declare this v3,
along with instructions for how to compatibly interpret a v2 stream.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tamas K Lengyel [Fri, 29 May 2020 16:22:34 +0000 (17:22 +0100)]
tools/libxl: fix setting altp2m param broken by 1e9bc407cf0
The patch 1e9bc407cf0 mistakenly converted the altp2m config option to a
boolean. This is incorrect and breaks external-only use cases of altp2m,
which are selected by setting the option to a value of 2.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
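The effect of the boolean conversion described above can be shown with a small sketch. Both function names are hypothetical; the only fact taken from the text is that the external-only mode is selected with the value 2.

```c
#include <assert.h>
#include <stdbool.h>

/* Buggy conversion: going through a bool collapses every non-zero mode to 1. */
unsigned int param_from_bool(unsigned int mode)
{
    bool enabled = mode;

    return enabled;
}

/* Fixed conversion: the configured value is passed through unchanged. */
unsigned int param_from_value(unsigned int mode)
{
    return mode;
}
```

A configuration asking for mode 2 therefore ends up as 1 with the boolean version, which is why the external-only setting stopped working.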
Juergen Gross [Fri, 29 May 2020 11:37:09 +0000 (12:37 +0100)]
docs: update xenstore-migration.md
Update connection record details:
- make flags common for sockets and domains (makes it easier to have a
C union for conn-spec)
- add pending incoming data (needed for handling partially read
requests when doing live update)
- add partial response length (needed for proper split to individual
responses after live update)
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
Roger Pau Monné [Fri, 29 May 2020 15:52:06 +0000 (17:52 +0200)]
clang: don't define nocall
Clang doesn't support attribute error, and the possible equivalents
like diagnose_if don't seem to work well in this case as they trigger
even when the function is not called (just by being used by the
APPEND_CALL macro).
Define nocall to a noop on clang until a proper solution can be found.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
[jb: error -> __error__] Acked-by: Jan Beulich <jbeulich@suse.com>
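A rough sketch of the clang workaround above follows. The diagnostic string, the helper name `asm_only_helper` and the `helper_ref` pointer are hypothetical; only the idea of expanding nocall to nothing on clang and to the `__error__` attribute otherwise comes from the text.

```c
#include <assert.h>

#if defined(__clang__)
# define nocall   /* no-op: no usable equivalent of the error attribute */
#else
# define nocall __attribute__((__error__("must not be called from C")))
#endif

/* A function meant to be referenced, but never called directly from C. */
void nocall asm_only_helper(void);

/* Empty body so the sketch links; a real one would live in assembly. */
void asm_only_helper(void)
{
}

/* Taking the address is allowed; only a direct call is diagnosed by GCC. */
void (*const helper_ref)(void) = asm_only_helper;
```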
Juergen Gross [Fri, 29 May 2020 10:29:53 +0000 (11:29 +0100)]
tools: fix Rules.mk library make variables
Both SHDEPS_libxendevicemodel and SHDEPS_libxenhypfs have a bug by
adding $(SHLIB_xencall) instead of $(SHLIB_libxencall).
The former seems not to have any negative impact, probably because
it is not used anywhere in Xen without the correct $(SHLIB_libxencall)
being used, too.
Jan Beulich [Fri, 29 May 2020 15:35:09 +0000 (17:35 +0200)]
x86emul: support FXSAVE/FXRSTOR
Note that FPU selector handling as well as MXCSR mask saving for now
does not honor differences between host and guest visible featuresets.
While for Intel operation of the insns with CR4.OSFXSR=0 is
implementation dependent, use the easiest solution there: Simply don't
look at the bit in the first place. For AMD and alike the behavior is
well defined, so it gets handled together with FFXSR.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:34:31 +0000 (17:34 +0200)]
x86emul: support FLDENV and FRSTOR
While the Intel SDM claims that FRSTOR itself may raise #MF upon
completion, this was confirmed by Intel to be a doc error which will be
corrected in due course; behavior is like FLDENV, and like old hard copy
manuals describe it.
Re-arrange a switch() statement's case label order to allow for
fall-through from FLDENV handling to FNSTENV's.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:33:54 +0000 (17:33 +0200)]
x86emul: support FNSTENV and FNSAVE
To avoid introducing another boolean into emulator state, the
rex_prefix field gets (ab)used to convey the real/VM86 vs protected mode
info (affecting structure layout, albeit not size) to x86_emul_blk().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:32:55 +0000 (17:32 +0200)]
x86emul: support ENQCMD insns
Note that the ISA extensions document revision 038 doesn't specify
exception behavior for ModRM.mod == 0b11; assuming #UD here.
No tests are being added to the harness - this would be quite hard,
we can't just issue the insns against RAM. Their similarity with
MOVDIR64B should have the test case there be good enough to cover any
fundamental flaws.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:32:14 +0000 (17:32 +0200)]
x86emul: support MOVDIR{I,64B} insns
Introduce a new blk() hook, paralleling the rmw() one in a certain way,
but being intended for larger data sizes, and hence its HVM intermediate
handling function doesn't fall back to splitting the operation if the
requested virtual address can't be mapped.
Note that SDM revision 071 doesn't specify exception behavior for
ModRM.mod == 0b11; assuming #UD here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Andrew Cooper <andrew.cooper@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:31:13 +0000 (17:31 +0200)]
x86emul: disable FPU/MMX/SIMD insn emulation when !HVM
In a pure PV environment (the PV shim in particular) we don't really
need emulation of all these. To limit #ifdef-ary utilize some of the
CASE_*() macros we have, by providing variants expanding to
(effectively) nothing (really a label, which in turn requires passing
-Wno-unused-label to the compiler when building such configurations).
Due to the mixture of macro and #ifdef use, the placement of some of
the #ifdef-s is a little arbitrary.
The resulting object file's .text is less than half the size of the
original, and looks to also be compiling a little more quickly.
This is meant as a first step; more parts can likely be disabled down
the road.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Bregrudingly-acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:29:59 +0000 (17:29 +0200)]
x86emul: also test decoding and mem access / write logic
x86emul_is_mem_{access,write}() (and their interaction with
x86_decode()) have become sufficiently complex that we should have a way
to test this logic. Start by covering legacy encoded GPR insns, with the
exception of a few the main emulator doesn't support yet (left as
comments in the respective tables, or about to be added by subsequent
patches). This has already helped spot a few flaws in said logic,
addressed by (revised) earlier patches.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 29 May 2020 15:28:45 +0000 (17:28 +0200)]
x86emul: rework CMP and TEST emulation
Unlike similarly encoded insns these don't write their memory operands,
and hence x86_is_mem_write() should return false for them. However,
rather than adding special logic there, rework how their emulation gets
done, by making decoding attributes properly describe the r/o nature of
their memory operands:
- change the table entries for opcodes 0x38 and 0x39, with no other
adjustments to the attributes later on,
- for the other opcodes, leave the table entries as they are, and
override the attributes for the specific sub-cases (identified by
ModRM.reg).
For opcodes 0x38 and 0x39 the change of the table entries implies
changing the order of operands as passed to emulate_2op_SrcV(), hence
the splitting of the cases in the main switch().
Note how this also allows dropping custom LOCK prefix checks.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
First of all explain in comments what the functions' purposes are. Then
make them actually match their comments.
Note that fc6fa977be54 ("x86emul: extend x86_insn_is_mem_write()
coverage") didn't actually fix the function's behavior for {,V}STMXCSR:
Both are covered by generic code higher up in the function, due to
x86_decode_twobyte() already doing suitable adjustments. And VSTMXCSR
wouldn't have been covered anyway without a further X86EMUL_OPC_VEX()
case label. Keep the inner case label in a comment for reference.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Fri, 29 May 2020 10:20:31 +0000 (12:20 +0200)]
xen: remove XEN_SYSCTL_set_parameter support
The functionality of XEN_SYSCTL_set_parameter is available via hypfs
now, so it can be removed.
This allows removing the kernel_param structure for runtime parameters
by putting the now only used structure element into the hypfs node
structure of the runtime parameters.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 29 May 2020 10:18:36 +0000 (12:18 +0200)]
xen: add runtime parameter access support to hypfs
Add support to read and modify values of hypervisor runtime parameters
via the hypervisor file system.
As runtime parameters can be modified via a sysctl, too, this path has
to take the hypfs rw_lock as writer.
For custom runtime parameters the connection between the parameter
value and the file system is done via an init function which will set
the initial value (if needed) and the leaf properties.
Juergen Gross [Fri, 29 May 2020 10:14:51 +0000 (12:14 +0200)]
xen: provide version information in hypfs
Provide version and compile information in /buildinfo/ node of the
Xen hypervisor file system. As this information is accessible by dom0
only no additional security problem arises.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 29 May 2020 08:15:50 +0000 (10:15 +0200)]
xen: add basic hypervisor filesystem support
Add the infrastructure for the hypervisor filesystem.
This includes the hypercall interface and the base functions for
entry creation, deletion and modification.
In order not to have to repeat the same BUG_ON() pattern at every call
site where failing to add a new node is fatal, the helpers for adding a
node (hypfs_add_dir() and hypfs_add_leaf()) get a nofault parameter
causing a BUG() in case of a failure.
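The nofault pattern can be sketched as follows. This is not the hypfs implementation: `struct dir`, `dir_add()` and the fixed-size name table are hypothetical, and `abort()` stands in for BUG().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 4

/* Hypothetical directory node holding a few entry names. */
struct dir {
    const char *names[MAX_ENTRIES];
    unsigned int nr;
};

/* Returns 0 or -errno; with nofault set, any failure is fatal instead. */
int dir_add(struct dir *parent, const char *name, bool nofault)
{
    int ret = 0;
    unsigned int i;

    assert(parent != NULL && name != NULL);

    for ( i = 0; i < parent->nr; i++ )
        if ( !strcmp(parent->names[i], name) )
            ret = -EEXIST;

    if ( !ret && parent->nr == MAX_ENTRIES )
        ret = -ENOSPC;

    if ( !ret )
        parent->names[parent->nr++] = name;

    if ( ret && nofault )
        abort();   /* stand-in for BUG() */

    return ret;
}
```

Callers that cannot tolerate failure pass `true` and skip the error check; everyone else passes `false` and handles the return value.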
Juergen Gross [Fri, 29 May 2020 08:15:35 +0000 (10:15 +0200)]
docs: add feature document for Xen hypervisor sysfs-like support
At the 2019 Xen developer summit there was agreement that the Xen
hypervisor should gain support for a hierarchical name-value store
similar to the Linux kernel's sysfs.
In the beginning there should only be basic support: entries can be
added from the hypervisor itself only, there is a simple hypercall
interface to read the data.
Add a feature document for setting the base of a discussion regarding
the desired functionality and the entries to add.
George Dunlap [Thu, 28 May 2020 11:20:54 +0000 (12:20 +0100)]
golang/xenlight: Get rid of GOPATH-based build artefacts
The original build setup used a "fake GOPATH" in tools/golang to test
the mechanism of building from go package files installed on a
filesystem. With the move to modules, this isn't necessary, and leads
to potentially confusing directories being created. (I.e., it might
not be obvious that files under tools/golang/src shouldn't be edited.)
Get rid of the code that creates this (now unused) intermediate
directory. Add direct dependencies from 'build' onto the source
files.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com>
George Dunlap [Thu, 28 May 2020 11:20:53 +0000 (12:20 +0100)]
libxl: Generate golang bindings in libxl Makefile
The generated golang bindings (types.gen.go and helpers.gen.go) are
left checked in so that they can be fetched from xenbits using the
golang tooling. This means that they must be updated whenever
libxl_types.idl (or other dependencies) are updated. However, the
golang bindings are only built optionally; we can't assume that anyone
updating libxl_types.idl will also descend into the tools/golang tree
to re-generate the bindings.
Fix this by re-generating the golang bindings from the libxl Makefile
when the IDL dependencies are updated, so that anyone who updates
libxl_types.idl will also end up updating the golang generated files
as well.
- Make a variable for the generated files, and a target in
xenlight/Makefile which will only re-generate the files.
- Add a target in libxl/Makefile to call external idl generation
targets (currently only golang).
For ease of testing, also add a specific target in libxl/Makefile just
to check and update files generated from the IDL.
This does mean that there are two potential paths for generating the
files during a parallel build; but that shouldn't be an issue, since
tools/golang/xenlight should never be built until after tools/libxl
has completed building anyway.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Nick Rosbrook <rosbrookn@ainfosec.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Thu, 28 May 2020 10:03:25 +0000 (12:03 +0200)]
VT-x: extend LBR Broadwell errata coverage
For lbr_tsx_fixup_check() simply name a few more specific erratum
numbers.
For bdf93_fixup_check(), however, more models are affected. Oddly enough
despite being the same model and stepping, the erratum is listed for
Xeon E3 but not its Core counterpart. Apply the workaround uniformly,
and also for Xeon D, which only has the LBR-from one listed in its spec
update.
Seeing this broader applicability, rename anything BDF93-related to more
generic names.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Thu, 28 May 2020 10:00:24 +0000 (12:00 +0200)]
x86: relax LDT check in arch_set_info_guest()
It is wrong for us to check the base address when there's no LDT in the
first place. Once we don't do this check anymore we can also set the
base address to a non-canonical value when the LDT is empty.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
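The relaxed LDT check above amounts to validating the base only when the LDT has entries. A minimal sketch, assuming 48-bit virtual addressing and hypothetical helper names (the real check may cover more than canonicality):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is canonical for 48-bit virtual addressing. */
bool is_canonical(uint64_t addr)
{
    /* Bits 63..47 must all be equal: all clear or all set. */
    uint64_t top = addr >> 47;

    return top == 0 || top == 0x1ffff;
}

/* Hypothetical validity check: the base only matters if entries exist. */
bool ldt_ok(uint64_t ldt_base, unsigned int ldt_ents)
{
    if ( ldt_ents == 0 )
        return true;

    return is_canonical(ldt_base);
}
```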
xen/arm: call iomem_permit_access for passthrough devices
iomem_permit_access should be called for MMIO regions of devices
assigned to a domain. Currently it is not called for MMIO regions of
passthrough devices of Dom0less guests. This patch fixes it.
Andrew Cooper [Wed, 27 May 2020 12:48:45 +0000 (13:48 +0100)]
x86/boot: Fix load_system_tables() to be NMI/#MC-safe
During boot, load_system_tables() is used in reinit_bsp_stack() to switch the
virtual addresses used from their .data/.bss alias, to their directmap alias.
The structure assignment is implemented as a memset() to zero first, then a
copy-in of the new data. This causes the NMI/#MC stack pointers to
transiently become 0, at a point where we may have an NMI watchdog running.
Rewrite the logic using a volatile tss pointer (equivalent to, but more
readable than, using ACCESS_ONCE() for all writes).
This does drop the zeroing side effect for holes in the structure, but the
backing memory for the TSS is fully zeroed anyway, and architecturally, they
are all reserved.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
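The difference between a whole-structure assignment and field-by-field writes through a volatile pointer can be sketched as below. The trimmed-down `struct tss`, the IST indices and `tss_update()` are hypothetical; the point is only that no field passes through zero on the way to its new value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down TSS: only the fields needed for the sketch. */
struct tss {
    uint64_t rsp0;
    uint64_t ist[7];
};

enum { IST_NMI = 1, IST_MCE = 2 };   /* illustrative indices */

/*
 * Update fields one at a time.  A plain "*t = (struct tss){ ... }" may be
 * compiled as a zeroing memset followed by a copy, briefly leaving the
 * NMI/#MC stack pointers at 0; the volatile accesses rule that out.
 */
void tss_update(struct tss *t, uint64_t rsp0,
                uint64_t nmi_stk, uint64_t mce_stk)
{
    volatile struct tss *tss = t;

    tss->rsp0 = rsp0;
    tss->ist[IST_NMI] = nmi_stk;
    tss->ist[IST_MCE] = mce_stk;
}
```

Fields not written by the function keep whatever value they had, which matches the note about the zeroing side effect being dropped.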
Tamas K Lengyel [Wed, 27 May 2020 07:50:55 +0000 (09:50 +0200)]
x86/mem_sharing: gate enabling on cpu_has_vmx
It is unclear whether mem_sharing was ever made to work on other architectures
but at this time the only verified platform for it is vmx. No plans to support
or maintain it on other architectures. Make this explicit by checking during
initialization.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Wei Liu <wl@xen.org>
There have been reports of RDRAND issues after resuming from suspend on
some AMD family 15h and family 16h systems. This issue stems from a BIOS
not performing the proper steps during resume to ensure RDRAND continues
to function properly.
Update the CPU initialization to clear the RDRAND CPUID bit for any family
15h and 16h processor that supports RDRAND. If it is known that the family
15h or family 16h system does not have an RDRAND resume issue or that the
system will not be placed in suspend, the "cpuid=rdrand" kernel parameter
can be used to stop the clearing of the RDRAND CPUID bit.
Note that clearing the RDRAND CPUID bit does not prevent a processor
that normally supports the RDRAND instruction from executing it. So any
code that determined the support based on family and model won't #UD.
Warn if no explicit choice was given on affected hardware.
Check RDRAND functions at boot as well as after S3 resume (the retry
limit chosen is entirely arbitrary).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
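One way a functional check with a retry limit might look is sketched below. The criterion used here (two consecutive reads must differ within a small number of attempts) is an assumption, as are all names; the random source is passed in as a function pointer so the sketch runs without the RDRAND instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RNG_RETRIES 5   /* arbitrary limit, as the text notes */

/* Returns true if two consecutive reads differ within the retry budget. */
bool rng_seems_functional(uint64_t (*read_rng)(void))
{
    uint64_t prev = read_rng();
    unsigned int i;

    for ( i = 0; i < RNG_RETRIES; i++ )
    {
        uint64_t cur = read_rng();

        if ( cur != prev )
            return true;
        prev = cur;
    }

    return false;
}

/* Test doubles standing in for the hardware instruction. */
uint64_t stuck_rng(void)
{
    return ~0ULL;   /* a source that always reports the same value */
}

uint64_t counting_rng(void)
{
    static uint64_t n;

    return n++;
}
```

Running the same check at boot and again after resume would catch a source that only breaks once the system has been suspended.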