xen/arm: Implement workaround for Cortex-A57 and Cortex-A72 AT speculate
Both Cortex-A57 (erratum 1319537) and Cortex-A72 (erratum 1319367) can
end up with corrupted TLBs if they speculate an AT instruction while the
S1/S2 system registers are in an inconsistent state.
The workaround is the same as for Cortex-A76, implemented by commit a18be06aca "xen/arm: Implement workaround for Cortex-A76 erratum 1165522",
so it is only necessary to plumb it into the cpuerrata framework.
Julien Grall [Wed, 27 Mar 2019 18:45:23 +0000 (18:45 +0000)]
xen/arm: memaccess: Initialize correctly *access in __p2m_get_mem_access
Commit 8d84e701fd "xen/arm: initialize access" initializes
*access using the wrong enumeration type. This results in a warning
when building with clang:
mem_access.c:50:20: error: implicit conversion from enumeration type
'p2m_access_t' to different enumeration type 'xenmem_access_t'
[-Werror,-Wenum-conversion]
*access = p2m->default_access;
~ ~~~~~^~~~~~~~~~~~~~
The correct solution is to use the memaccess array, which will do the
conversion between the two enums.
Julien Grall [Wed, 15 May 2019 20:17:30 +0000 (21:17 +0100)]
xen/arm: traps: Avoid using BUG_ON() to check guest state in advance_pc()
The condition of the BUG_ON() in advance_pc() is pretty wrong because
the bits [26:25] and [15:10] have a different meaning between AArch32
and AArch64 state.
On AArch32, they are used to store PSTATE.IT. On AArch64, they are RES0
or used for new features (e.g. ARMv8.0-SSBS, ARMv8.5-BTI).
This means a 64-bit guest will hit the BUG_ON() if it is trying to use
any of these features.
More generally, RES0 means that the bits are reserved for future use. So
crashing the host is definitely not the right solution.
In this particular case, we only need to know whether the guest was using
32-bit mode and Thumb instructions. So replace the BUG_ON() with a proper
check.
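As a rough sketch of the resulting shape (helper and constant names here are illustrative, not necessarily Xen's exact ones), the check boils down to:

    /* Sketch: only touch PSTATE.IT when the guest was in 32-bit Thumb state. */
    static void advance_pc(struct cpu_user_regs *regs, const union hsr hsr)
    {
        unsigned long cpsr = regs->cpsr;
        bool is_thumb = psr_mode_is_32bit(cpsr) && (cpsr & PSR_THUMB);

        if ( is_thumb && (cpsr & PSR_IT_MASK) )
        {
            /* ... advance the IT state machine, as the code already did ... */
        }

        regs->pc += hsr.len ? 4 : 2;   /* 32-bit vs 16-bit trapped instruction */
    }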
On Arm64, system registers are always 64-bit including SCTLR_EL1.
However, Xen is assuming this is 32-bit because earlier revision of
Armv8 had the top 32-bit RES0 (see ARM DDI0595.b).
From Armv8.5, some bits in [63:32] will be defined and allowed to be
modified by the guest. So we would effectively reset those bits to 0
after each context switch. This means the guest may not function
correctly afterwards.
Rather than resetting to 0 the bits [63:32], preserve them across
context switch.
Note that the corresponding register on Arm32 (i.e. SCTLR) is always
32-bit, so we need to use register_t anywhere we deal with SCTLR{,_EL1}.
The outside interface is switched to use 64-bit to allow ABI compatibility
between 32-bit and 64-bit.
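A hedged illustration of the type change (structure layout reduced for the example):

    /* Sketch: register_t is 64-bit on Arm64 and 32-bit on Arm32, so storing
     * SCTLR_EL1 in it preserves bits [63:32] across context switch instead
     * of silently truncating them, as a uint32_t field would. */
    struct arch_vcpu_sketch {
        register_t sctlr;   /* was: uint32_t sctlr; */
    };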
Julien Grall [Wed, 15 May 2019 16:16:13 +0000 (17:16 +0100)]
xen/arm: traps: Avoid using BUG_ON() in _show_registers()
At the moment, _show_registers() is using a BUG_ON() to assert only
userspace will run 32-bit code in a 64-bit domain.
Such extra precaution is not necessary and can be avoided by only
checking the CPU mode to decide whether show_registers_64() or
show_registers_32() should be called.
This also has the nice advantage of avoiding nested ifs in the code.
Igor Druzhinin [Fri, 25 Oct 2019 09:43:49 +0000 (11:43 +0200)]
x86/efi: properly handle 0 in pixel reserved bitmask
In some graphics modes the firmware is allowed to return 0 in the pixel
reserved bitmask, which doesn't go against UEFI Spec 2.8 (12.9 Graphics Output Protocol).
Without this change non-TrueColor modes won't work, which will cause
GOP init to fail - observed while trying to boot EFI Xen with Cirrus VGA.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 521a1445510a30873aec471194045e7f4b5e8d75
master date: 2019-10-10 16:50:50 +0200
Roger Pau Monné [Fri, 25 Oct 2019 09:43:06 +0000 (11:43 +0200)]
pci: clear {host/guest}_maskall field on assign
The current implementation of host_maskall makes it sticky across
assign and deassign calls, which means that once a guest forces Xen to
set host_maskall the maskall bit is not going to be cleared until a
call to PHYSDEVOP_prepare_msix is performed. Such a call, however,
shouldn't be part of the normal flow when doing PCI passthrough, and
hence the flag needs to be cleared when assigning in order to prevent
host_maskall from being carried over from previous assignments.
Note that the entry maskbit is reset when the msix capability is
initialized, and the guest_maskall field is also cleared so that the
hardware value matches Xen's internal state (hardware maskall =
host_maskall | guest_maskall).
Also note that doing the reset of host_maskall there would allow the
guest to reset such field by enabling and disabling MSIX, which is not
intended.
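A minimal sketch of the assign-path reset described above (type and field layout reduced for illustration):

    /* Sketch: forget sticky maskall state when the device is assigned. */
    struct msix_sketch { bool host_maskall, guest_maskall; };

    static void reset_maskall_on_assign(struct msix_sketch *msix)
    {
        msix->host_maskall = false;   /* don't carry over a forced maskall */
        msix->guest_maskall = false;  /* keep hw maskall = host | guest consistent */
    }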
Igor Druzhinin [Fri, 25 Oct 2019 09:42:29 +0000 (11:42 +0200)]
efi/boot: make sure graphics mode is set while booting through MB2
If a bootloader is using a native driver instead of the EFI GOP it might
reset the graphics mode to be different from what was originally set
by the firmware. While booting through MB2, Xen either needs to parse the
video settings passed by MB2 and use them instead of what GOP reports, or
reset the mode to synchronise it with the firmware - prefer the latter.
Observed while booting Xen through MB2 with an EFI GRUB2 compiled with
all possible video drivers, where native drivers take priority over the firmware.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af9f357fb8dbceb9c5dd1c5cb8b4e198f6149456
master date: 2019-10-10 10:58:45 +0200
Igor Druzhinin [Fri, 25 Oct 2019 09:40:17 +0000 (11:40 +0200)]
x86/crash: force unlock console before printing on kexec crash
There is a small window where a shootdown NMI might arrive at a CPU
(e.g. in the serial interrupt handler) while the console lock is taken.
In order not to leave subsequent console prints waiting infinitely for
shot-down CPUs to free the lock, force unlock the console.
The race has been frequently observed while crashing nested Xen in
an HVM domain.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7d5247cee21aa38a16c4b21bc9243eda70c8aebd
master date: 2019-10-02 11:25:05 +0100
Juergen Gross [Fri, 25 Oct 2019 09:39:12 +0000 (11:39 +0200)]
sched: don't let XEN_RUNSTATE_UPDATE leak into vcpu_runstate_get()
vcpu_runstate_get() should never return a state entry time with
XEN_RUNSTATE_UPDATE set. To avoid this, let update_runstate_area()
operate on a local runstate copy.
As it is required to first set the XEN_RUNSTATE_UPDATE indicator in
guest memory, then update all the runstate data, and only then clear
XEN_RUNSTATE_UPDATE again, it is much less effort to keep a local copy
of the whole runstate data instead of only a copy of state_entry_time.
This problem was introduced with commit 2529c850ea48f036 ("add update
indicator to vcpu_runstate_info").
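A sketch of the update sequence with the local copy (the copy-to-guest helpers here are illustrative, not Xen's exact ones):

    /* Sketch: the guest-visible area only ever sees consistent data. */
    struct vcpu_runstate_info runstate = v->runstate;   /* local snapshot */

    runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
    write_guest_field(area, state_entry_time, runstate.state_entry_time);
    smp_wmb();
    write_guest_all(area, &runstate);                   /* bulk runstate data */
    smp_wmb();
    runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
    write_guest_field(area, state_entry_time, runstate.state_entry_time);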
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
master commit: f28c4c4c10bdacb1e49cc6e9de57eb1f973cbdf6
master date: 2019-09-26 18:04:09 +0200
Juergen Gross [Fri, 25 Oct 2019 09:38:39 +0000 (11:38 +0200)]
sched: fix freeing per-vcpu data in sched_move_domain()
In case of an allocation error of per-vcpu data in sched_move_domain()
the already allocated data is freed just using xfree(). This is wrong
as some schedulers need to do additional operations (e.g. the arinc653
scheduler needs to remove the vcpu-data from a list).
So instead of xfree(), make use of the sched_free_vdata() hook.
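A sketch of the error-path cleanup (loop and variable names assumed from context):

    /* Sketch: give the old scheduler a chance to unhook its per-vcpu data. */
    for_each_vcpu ( d, v )
        if ( vcpu_priv[v->vcpu_id] != NULL )
            sched_free_vdata(old_ops, vcpu_priv[v->vcpu_id]);  /* was: xfree() */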
Jan Beulich [Fri, 25 Oct 2019 09:37:20 +0000 (11:37 +0200)]
ACPI/cpuidle: bump maximum number of power states we support
Commit 4c6cd64519 ("mwait_idle: Skylake Client Support") added a table
with 8 entries, which - together with C0 - rendered the current limit
too low. It should have been accompanied by an increase of the constant;
do this now. Don't bump by too much though, as there are a number of on-
stack arrays which are dimensioned by this constant.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: ff22a91b4c45f9310d0ec0d7ee070d84a373dd87
master date: 2019-09-25 15:53:35 +0200
Jan Beulich [Fri, 25 Oct 2019 09:36:50 +0000 (11:36 +0200)]
libxc/x86: avoid certain overflows in CPUID APIC ID adjustments
Recent AMD processors may report up to 128 logical processors in CPUID
leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
as the respective field is only 8 bits wide. Suppress doubling the value
(and its leaf 0x80000008 counterpart) in such a case.
Note that while there's a similar overflow in intel_xc_cpuid_policy(),
that one is being left alone for now.
Note further that while it was considered to suppress the multiplication
by 2 altogether if the host topology already provides at least one bit
of thread ID within APIC IDs, it was decided to avoid more change here
than really needed at this point.
Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD, as it
should have been from the beginning.
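A hedged sketch of the overflow check, using the corrected one-bit-wider mask from the follow-up fix below (register handling simplified):

    /* Sketch: the logical processor count lives in CPUID.1 EBX[23:16],
     * an 8-bit field.  Only double it when that cannot overflow. */
    if ( !(ebx & 0x00800000U) )   /* bit 23 clear => count <= 127 */
        ebx = (ebx & ~0x00ff0000U) | ((ebx & 0x007f0000U) << 1);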
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxc/x86: correct overflow avoidance check in AMD CPUID handling
Commit df29d03f1d ("libxc/x86: avoid certain overflows in CPUID APIC ID
adjustments") introduced a one bit too narrow mask when checking whether
multiplying by 2 (in particular in leaf 1) would result in overflow.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df29d03f1d97bdde1bc0cea8ef8538d4f524b3ec
master date: 2019-09-24 10:50:33 +0200
master commit: c9c7ac508b3f65f7d5f9685893096a1b22d8b176
master date: 2019-09-25 15:50:58 +0200
The loop in FOR_EACH_IOREQ_SERVER is backwards, hence the cleanup on
failure needs to be done forwards.
Fixes: 97a5a3e30161 ('x86/hvm/ioreq: maintain an array of ioreq servers rather than a list') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 215f2576b0ac1bc18f3ff74e34f0d8379bda9040
master date: 2019-09-10 16:32:47 +0200
Andrew Cooper [Mon, 23 Sep 2019 12:25:35 +0000 (14:25 +0200)]
x86/cpuid: Fix handling of the CPUID.7[0].eax levelling MSR
7a0 is an integer field, not a mask - taking the logical and of the hardware
and policy values results in nonsense. Instead, take the policy value
directly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@cirtrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b50d78d0eaffb43d5f5ceeda55fa22c11f47d01b
master date: 2019-09-10 13:33:21 +0100
Jan Beulich [Mon, 23 Sep 2019 12:25:04 +0000 (14:25 +0200)]
x86/shadow: don't enable shadow mode with too small a shadow allocation (part 2)
Commit 2634b997af ("x86/shadow: don't enable shadow mode with too small
a shadow allocation") was incomplete: The adjustment done there to
shadow_enable() is also needed in shadow_one_bit_enable(). The (new)
problem report was (apparently) a failed PV guest migration followed by
another migration attempt for that same guest. Disabling log-dirty mode
after the first one had left a couple of shadow pages allocated (perhaps
something that also wants fixing), and hence the second enabling of
log-dirty mode wouldn't have allocated anything further.
Reported-by: James Wang <jnwang@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 8b25551baa3307af0aa1ef8f7f43403f01c2c5d7
master date: 2019-09-05 09:56:42 +0200
Jan Beulich [Mon, 23 Sep 2019 12:24:33 +0000 (14:24 +0200)]
x86: properly gate clearing of PKU feature
setup_clear_cpu_cap() is __init and hence may not be called post-boot.
Note that opt_pku nevertheless is not getting __initdata added - see
e.g. commit 43fa95ae6a ("mm: make opt_bootscrub non-init").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41c7700a00011ad08be3c9d71126b67e08e58ac3
master date: 2019-08-29 15:10:07 +0200
p2m/ept: pass correct level to atomic_write_ept_entry in ept_invalidate_emt
The level passed to ept_invalidate_emt corresponds to the EPT entry
passed as the mfn parameter, which is a pointer to an EPT page table,
hence the entries in that page table will have one level less than the
parent.
Fix the call to atomic_write_ept_entry to pass the correct level, ie:
one level less than the parent.
Fixes: 50fe6e73705 ('pvh dom0: add and remove foreign pages') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: b806c91275fb1ab7696ebf033b56631693056c90
master date: 2019-08-28 16:57:36 +0200
Igor Druzhinin [Mon, 23 Sep 2019 12:23:27 +0000 (14:23 +0200)]
x86/mm: correctly initialise M2P entries on boot
Since the guest resource management work it's now possible to have a page
assigned to a domain without a valid M2P entry. Some paths in the code
rely on the fact that a GFN returned from mfn_to_gfn() for such a page
is not valid as well, i.e. see arch_iommu_populate_page_table().
For systems without 512GB of contiguous RAM, M2P entries were already
correctly initialised on boot with INVALID_M2P_ENTRY (~0UL), but
on systems where the M2P could be covered by a single 1GB page directory
the 0x77 poison was used instead. That eventually resulted in a crash
during IOMMU construction on systems without shared PTs enabled.
While here fix up compat M2P entries as well.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c093931a765803cfc7b0df466ee032760cc8020
master date: 2019-08-27 13:40:42 +0100
x86/p2m: fix non-translated handling of iommu mappings
The current usage of need_iommu_pt_sync in p2m for non-translated
guests is wrong because it doesn't correctly handle a relaxed PV
hardware domain, which has need_sync set to false but still needs
entries to be added from calls to {set,clear}_identity_p2m_entry.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Tested-by: Roman Shaposhnik <roman@zededa.com>
master commit: 77a994f3f8eb0d3cb0f2bf314b0ebf6a1d37f623
master date: 2019-08-20 14:24:05 +0100
Michał Kowalczyk [Mon, 23 Sep 2019 12:22:09 +0000 (14:22 +0200)]
x86: Restore IA32_MISC_ENABLE on wakeup
Code in intel.c:early_init_intel() modifies IA32_MISC_ENABLE MSR. Those
modifications must be restored after resuming from S3 (see e.g. Linux wakeup
code), otherwise bad things may happen (e.g. wakeup code may cause #GP when
trying to set IA32_EFER.NXE [1]).
This bug was noticed on a ThinkPad x230 with NX disabled in the BIOS:
Xen could correctly boot, but crashed when resuming from suspend.
Applying this patch fixed the problem.
[1] Intel SDM vol 3: "If the execute-disable capability is not
available, a write to set IA32_EFER.NXE produces a #GP exception."
Signed-off-by: Michał Kowalczyk <mkow@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c3cfa5b3084d71bccd8360d044bea813688b587c
master date: 2019-08-19 15:07:34 +0100
Andrew Cooper [Mon, 23 Sep 2019 12:21:20 +0000 (14:21 +0200)]
x86/xpti: Don't leak TSS-adjacent percpu data via Meltdown
The XPTI work restricted the visibility of most of memory, but missed a few
aspects when it came to the TSS.
Given that the TSS is just an object in percpu data, the 4k mapping for it
created in setup_cpu_root_pgt() maps adjacent percpu data, making it all
leakable via Meltdown, even when XPTI is in use.
Furthermore, no care is taken to check that the TSS doesn't cross a page
boundary. As it turns out, struct tss_struct is aligned on its size, which
does prevent it from straddling a page boundary.
Rework the TSS types while making this change. Rename tss_struct to tss64, to
mirror the existing tss32 structure we have in HVM's Task Switch logic. Drop
tss64's alignment and __cacheline_filler[] field.
Introduce tss_page which contains a single tss64 and keeps the rest of the
page clear, so no adjacent data can be leaked. Move the definition from
setup.c to traps.c, which is a more appropriate place for it to live.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 7888440625617693487495a7842e6a991ead2647
master date: 2019-08-12 14:10:09 +0100
xen/page_alloc: Keep away MFN 0 from the buddy allocator
Combining of buddies happens only such that the resulting larger buddy
is still order-aligned. To cross a zone boundary while merging, the
implication is that both the buddy [0, 2^n-1] and the buddy
[2^n, 2^(n+1)-1] are free.
Ideally we want to fix the allocator, but for now we can just prevent
adding the MFN 0 in the allocator to avoid merging across zone
boundaries.
On x86, MFN 0 is already kept away from the buddy allocator, so the
bug can only happen on Arm platforms where the first memory bank
starts at 0.
As this is specific to the allocator, MFN 0 is removed in the common code
to cater for all architectures (current and future).
Andrew Cooper [Mon, 23 Sep 2019 12:19:45 +0000 (14:19 +0200)]
xen/link: Introduce .bss.percpu.page_aligned
Future changes are going to need to page align some percpu data.
Shuffle the exact link order of items within the BSS to give
.bss.percpu.page_aligned appropriate alignment, even on CPU0, which uses
.bss.percpu itself.
Insert explicit alignment such that there won't be a gap between
__per_cpu_start and the first actual per-CPU object. The POINTER_ALIGN
for __bss_end is to cover the lack of SMP_CACHE_BYTES alignment, as the
loops which zero the BSS use pointer-sized stores on all architectures.
Rework __DEFINE_PER_CPU() so the caller passes in all attributes, and
adjust DEFINE_PER_CPU{,_READ_MOSTLY}() to match. This has the added bonus
that it is now possible to grep for .bss.percpu and find all the users.
Finally, introduce DEFINE_PER_CPU_PAGE_ALIGNED() which specifies the
section attribute and verifies the type's alignment.
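An approximate sketch of the resulting macro (internals simplified; see the note below on who supplies the alignment):

    /* Sketch: place the object in the dedicated section and fail the build
     * if the type itself is not page-aligned. */
    #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name) \
        typedef char name ## _chk_t \
            [BUILD_BUG_ON_ZERO(__alignof(type) & (PAGE_SIZE - 1))]; \
        __DEFINE_PER_CPU(__section(".bss.percpu.page_aligned"), type, name)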
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Make DEFINE_PER_CPU_PAGE_ALIGNED() verify the alignment rather than
specifying it. It is the underlying type which should be suitably aligned.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c9639a72f0ca3a9430ef75f375877182281fdef
master date: 2019-08-09 16:36:58 +0200
Andrew Cooper [Mon, 23 Sep 2019 12:18:31 +0000 (14:18 +0200)]
x86/boot: Set Accessed bits in boot_cpu_{,compat_}gdt_table[]
There is no point causing the CPU to perform a locked update of the
descriptors on first use.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af292b41e9edc0a87f0205ece833e64808ec3883
master date: 2019-08-07 13:34:56 +0100
x86/apic: enable x2APIC mode before doing any setup
Current code calls apic_x2apic_probe which does some initialization
and setup before having enabled x2APIC mode (if it's not already
enabled by the firmware).
This can lead to issues if the APIC ID doesn't match the x2APIC ID, as
apic_x2apic_probe calls init_apic_ldr_x2apic_cluster which depending
on the APIC mode might set cpu_2_logical_apicid using the APIC ID
instead of the x2APIC ID (because x2APIC might not be enabled yet).
Fix this by enabling x2APIC before calling apic_x2apic_probe.
As a remark, this was discovered while I was trying to figure out why
one of my test boxes didn't report any iommu faults. The root cause
was that the iommu MSI address field was set using the stale value in
cpu_2_logical_apicid, and thus the iommu fault interrupt would get
lost. Even if the MSI address field gets set to a correct value
afterwards as soon as a single iommu fault is pending no further
interrupts would get injected, so losing a single iommu fault
interrupt is fatal.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 260940578de348c38f18cadc6fa53f499e57919c
master date: 2019-08-07 12:09:51 +0200
x86/microcode: always collect_cpu_info() during boot
Currently the cpu_sig struct is not updated during boot if no microcode blob
is specified by "ucode=[<integer>|scan]".
It will result in cpu_sig.rev being 0 which affects APIC's
check_deadline_errata() and retpoline_safe() functions.
Fix this by getting ucode revision early during boot and SMP bring up.
While at it, protect early_microcode_update_cpu() for cases when
microcode_ops is NULL.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2bb2c55cf870e78bc7f514784b2cd8c947d8729c
master date: 2019-08-01 18:45:32 +0100
James Wang [Mon, 23 Sep 2019 12:16:35 +0000 (14:16 +0200)]
xen/spec-ctrl: Speculative mitigation facilities report wrong status
Booting with spec-ctrl=0 results in Xen printing "None MD_CLEAR".
(XEN) Support for HVM VMs: None MD_CLEAR
(XEN) Support for PV VMs: None MD_CLEAR
Add a check of X86_FEATURE_MD_CLEAR to avoid printing "None".
Signed-off-by: James Wang <jnwang@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2adc580bd59f5c3034fd6ecacd5748678373f17a
master date: 2019-07-31 14:53:13 +0100
Andrew Cooper [Mon, 23 Sep 2019 12:15:49 +0000 (14:15 +0200)]
x86/boot: Fix build dependencies for reloc.c
c/s 201f852eaf added start_info.h and kconfig.h to reloc.c, but only updated
start_info.h in RELOC_DEPS.
This causes reloc.c to not be regenerated when Kconfig changes. It is most
noticeable when enabling CONFIG_PVH, with the resulting binary crashing
early with:
(d9) (XEN)
(d9) (XEN) ****************************************
(d9) (XEN) Panic on CPU 0:
(d9) (XEN) Magic value is wrong: c2c2c2c2
(d9) (XEN) ****************************************
(d9) (XEN)
(d9) (XEN) Reboot in five seconds...
(XEN) d9v0 Triple fault - invoking HVM shutdown action 1
Reported-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 78c0000c87ce498bf621914c0554b83fac3ee00d
master date: 2019-07-31 11:19:45 +0100
EPT differs from NPT and shadow when translating page orders to levels
in the physmap page tables: EPT page table levels for order 0 pages
start at 0, while NPT and shadow start at 1.
Fix the p2m_entry_modify call in atomic_write_ept_entry to always add
one to the level, in order to match NPT and shadow usage.
While there, also add a check to ensure p2m_entry_modify is never
called with level == 0. That should allow catching future errors
related to the level parameter.
Fixes: c7a4c088ad1c ('x86/mm: split p2m ioreq server pages special handling into helper') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: ec2ab491b52815c1daedfdf3d95d13cfe25fb38e
master date: 2019-07-16 09:05:28 +0200
On some machines (for example a Thinkpad P52), the UEFI GOP reports a
framebuffer located above 4GB (0x4000000000 on that machine). This
address does not fit in the {xen,dom0}_vga_console_info.u.vesa_lfb.lfb_base
field, which is 32-bit. The overflow here causes all kinds of memory
corruption when anything tries to write something on the screen,
starting with the zeroing of the whole framebuffer in vesa_init().
Fix this similarly to how it's done in Linux: add an ext_lfb_base field at
the end of the structure to hold the upper 32 bits of the address. Since the
field is added at the end of the structure, it will work with older
Linux versions too (other than them using a possibly truncated address - no
worse than without this change). Thanks to the ABI containing the size of the
structure (start_info.console.dom0.info_size), Linux can detect when
this field is present and use it appropriately.
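A consumer-side sketch (presence check simplified; only ext_lfb_base is the real field name from this change):

    /* Sketch: reassemble the 64-bit framebuffer address. */
    uint64_t lfb = vlfb->lfb_base;                  /* low 32 bits */

    if ( info_size >= offsetof(struct dom0_vga_console_info,
                               u.vesa_lfb.ext_lfb_base) + sizeof(uint32_t) )
        lfb |= (uint64_t)vlfb->ext_lfb_base << 32;  /* high 32 bits, if present */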
Since this changes the public interface and uses __XEN_INTERFACE_VERSION__,
bump __XEN_LATEST_INTERFACE_VERSION__.
Note: if/when backporting this change to Xen <= 4.12, #if in xen.h needs
to be extended with " || defined(__XEN__)".
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9cf11fdcd91ff8e9cd038f8336cf21f0701e8b7b
master date: 2019-05-17 14:48:23 +0200
Juergen Gross [Wed, 26 Jun 2019 13:37:26 +0000 (14:37 +0100)]
libxl: fix pci device re-assigning after domain reboot
After a reboot of a guest only the first pci device configuration will
be retrieved from Xenstore, resulting in the loss of any further assigned
passed-through pci devices.
The main reason is that all passed through pci devices reside under a
common root device "0" in Xenstore. So when the device list is rebuilt
from Xenstore after a reboot the sub-devices below that root device
need to be selected instead of using the root device number as a
selector.
Fix that by adding a new member to struct libxl_device_type which when
set is used to get the number of devices. Add such a member for pci to
get the correct number of pci devices instead of implying it from the
number of pci root devices (which will always be 1).
While at it fix the type of libxl__device_pci_from_xs_be() to match
the one of the .from_xenstore member of struct libxl_device_type. This
fixes a latent bug checking the return value of a function returning
void.
Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit c19434d9284e93e6f9aaec9a70f5f361adbfaba6) Backport-requested-by: Pasi Kärkkäinen <pasik@iki.fi> Backport-requested-by: Roman Shaposhnik <roman@zededa.com> Backport-requested-by: Rich Persaud <persaur@gmail.com>
Andrew Cooper [Fri, 26 Jul 2019 09:02:22 +0000 (11:02 +0200)]
passthrough/vtd: Don't DMA to the stack in queue_invalidate_wait()
DMA-ing to the stack is considered bad practice. In this case, if a
timeout occurs because of a sluggish device which is processing the
request, the completion notification will corrupt the stack of a
subsequent deeper call tree.
Place the poll_slot in a percpu area and DMA to that instead.
Fix the declaration of saddr in struct qinval_entry, to avoid a shift by
two. The requirement here is that the DMA address is dword aligned,
which is covered by poll_slot's type.
This change does not address other issues. Correlating completions
after a timeout with their request is a more complicated change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 8970834eb95586d87b064e8c7fc49ee8d2875db4
master date: 2019-07-24 14:40:10 +0100
Igor Druzhinin [Fri, 26 Jul 2019 09:01:52 +0000 (11:01 +0200)]
x86/crash: fix kexec transition breakage
Following 6ff560f7f ("x86/SMP: don't try to stop already stopped CPUs")
an incorrect condition was placed into the kexec transition path,
leaving the crashing CPU always online and breaking entry into the
kdump kernel.
Correct it by unifying the condition with smp_send_stop().
Jan Beulich [Fri, 26 Jul 2019 09:01:24 +0000 (11:01 +0200)]
AMD/IOMMU: process softirqs while dumping IRTs
When there are sufficiently many devices listed in the ACPI tables (no
matter if they actually exist), output may take way longer than the
watchdog would like.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: df2030c34cc9161dd9e35c0e8c55057e101ac81a
master date: 2019-07-22 12:03:46 +0200
Jan Beulich [Fri, 26 Jul 2019 09:00:50 +0000 (11:00 +0200)]
AMD/IOMMU: free more memory when cleaning up after error
The interrupt remapping in-use bitmaps were leaked in all cases. The
ring buffers and the mapping of the MMIO space were leaked for any IOMMU
that hadn't been enabled yet.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 6ae22e7aac8fb0d39318eb27eab439dd97521174
master date: 2019-07-22 11:59:01 +0200
Some logging messages made more sense as argo debug
logs rather than standard Xen logs. Use argo_dprintk
to only print this info if argo DEBUG is enabled.
Signed-off-by: Nicholas Tsirakis <tsirakisn@ainfosec.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
master commit: 7abd7c21b9c456c9f83d0d09ccea5627ae62c3d4
master date: 2019-06-19 21:14:05 +0100
When a message is requeued in Xen's internal queue, the queue
entry contains the length of the message so that Xen knows to
send a VIRQ to the respective domain when enough space frees up
in the ring. Due to a small bug, however, Xen doesn't populate
the length of the msg if a given write fails, so this length is
always reported as zero. This causes Xen to spuriously wake up
a domain even when the ring doesn't have enough space.
This patch makes sure that the msg len is properly reported by
populating it in the event of a write failure.
Signed-off-by: Nicholas Tsirakis <tsirakisn@ainfosec.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
master commit: 8966a3e9ab485f3a9d2adf66b71265163f8fb8eb
master date: 2019-06-12 21:06:18 +0100
In its current state, if the destination ring is full, sendv()
will requeue the message and return the rc of pending_requeue(),
which will return 0 on success. This prevents the caller from
distinguishing between a successful write and a
message that needs to be resent at a later time.
Instead, capture the -EAGAIN value returned from ringbuf_insert()
and *only* overwrite it if the rc of pending_requeue() is non-zero.
This allows the caller to make intelligent decisions on -EAGAIN and
still be alerted if the pending message fails to requeue.
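A sketch of the resulting control flow (argument lists assumed; ringbuf_insert() and pending_requeue() are the functions named above):

    int ret = ringbuf_insert(dst_d, ring_info, msg);

    if ( ret == -EAGAIN )                /* destination ring currently full */
    {
        int rc = pending_requeue(dst_d, ring_info, msg_len);

        if ( rc )
            ret = rc;                    /* only overwrite -EAGAIN on failure */
    }
    return ret;                          /* -EAGAIN tells the caller to retry */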
Signed-off-by: Nicholas Tsirakis <tsirakisn@ainfosec.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
master commit: 480800c76969b38f13b6909eb679b23571417538
master date: 2019-06-11 20:27:28 +0100
Andrew Cooper [Fri, 19 Jul 2019 14:14:08 +0000 (16:14 +0200)]
x86/ctxt-switch: Document and improve GDT handling
Calling virt_to_mfn() in the context switch path is a lot
of wasted cycles for a result which is constant after boot.
Begin by documenting how Xen handles the GDTs across context switch.
The loop in write_full_gdt_ptes() is unnecessary, because
NR_RESERVED_GDT_PAGES is 1. Dropping it makes the code substantially
more clear, and with it dropped, write_full_gdt_ptes() becomes more
obviously a poor name, so rename it to update_xen_slot_in_full_gdt().
Furthermore, load_full_gdt() is completely independent of the current
CPU, and load_default_gdt() only needs the current CPU's regular
GDT. (This is a change in behaviour, as previously it may have used the
compat GDT, but either will do.)
Add two extra per-cpu variables which cache the L1e for the regular and compat
GDT, calculated in cpu_smpboot_alloc()/trap_init() as appropriate, so
update_xen_slot_in_full_gdt() doesn't need to waste time performing the same
calculation on every context switch.
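A hedged sketch of the caching (variable and type names assumed; the real change adds one such variable each for the regular and compat GDT):

    /* Sketch: compute the GDT's L1e once, not on every context switch. */
    static DEFINE_PER_CPU_READ_MOSTLY(l1_pgentry_t, gdt_l1e);

    static void cache_gdt_l1e(unsigned int cpu, struct desc_struct *gdt)
    {
        per_cpu(gdt_l1e, cpu) =
            l1e_from_pfn(virt_to_mfn(gdt), __PAGE_HYPERVISOR_RW);
    }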
One performance scenario of Juergen's (time to build the hypervisor on
an 8 CPU system, with two single-vCPU MiniOS VMs constantly interrupting
dom0 with events) shows the following, average over 5 measurements:
            elapsed     user   system
Unpatched     66.51   232.93   109.21
Patched       57.00   225.47   105.47
which is a substantial improvement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 564d261687c071c5a77fa81d693b7ca3d8e83b48
master date: 2019-07-10 09:40:25 -0500
x86: make loading of GDT at context switch more modular
In preparation for core scheduling, carve out the GDT related
functionality (writing GDT related PTEs, loading default of full GDT)
into sub-functions.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 12dce7ea5a84e0f107710f8df1cfb2dfe306c793
master date: 2019-07-04 16:02:52 +0200
Andrew Cooper [Fri, 19 Jul 2019 14:12:42 +0000 (16:12 +0200)]
x86/svm: Fix svm_vmcb_dump() when used in current context
VMExit doesn't switch all state. The FS/GS/TS/LDTR/GSBASE segment
information, and SYSCALL/SYSENTER MSRs may still be cached in hardware, rather
than up-to-date in the VMCB.
Export svm_sync_vmcb() via svmdebug.h so svm_vmcb_dump() can use it, and bring
the VMCB into sync in current context.
As a minor optimisation, switch svm_sync_vmcb() to use svm_vm{load,save}_pa(),
as svm->vmcb_pa is always correct, and this avoids a redundant __pa()
translation behind the scenes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 7d161f6537557520b52c2c7fb8321460f37ff933
master date: 2019-06-19 19:54:22 +0100
Andrew Cooper [Fri, 19 Jul 2019 14:12:06 +0000 (16:12 +0200)]
x86/clear_page: Update clear_page_sse2() after dropping 32bit Xen
This code was never updated when the 32bit build of Xen was dropped.
* Expand the now-redundant ptr_reg macro.
* The number of iterations in the loop can be halved by using 64bit writes,
without consuming any extra execution resource in the pipeline. Adjust all
numbers/offsets appropriately.
* Replace dec with sub to avoid an eflags stall, and position it to be
macro-fused with the related jmp.
* With no need to preserve eflags across the body of the loop, replace lea
with add, which has 1/3rd the latency on basically all 64bit hardware.
A quick userspace perf test on my Haswell dev box indicates that the old
version takes ~1385 cycles on average (ignoring outliers), and the new version
takes ~1060 cycles, or about 77% of the time.
Reported-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 260acc521db4c29df4aa9b7a67f42cf967871fd3
master date: 2019-06-18 15:47:52 +0100
Tamas K Lengyel [Fri, 19 Jul 2019 14:11:20 +0000 (16:11 +0200)]
x86/altp2m: cleanup p2m_altp2m_lazy_copy
The p2m_altp2m_lazy_copy function is responsible for lazily populating an
altp2m view when the guest traps out due to no EPT entry being present
in the active view. Currently, in addition to taking a number of
unused arguments, the whole calling convention has a number of
redundant p2m lookups: the function reads the hostp2m, even though the
caller has just read the same hostp2m entry; and then the caller
re-reads the altp2m entry that the function has just read (and possibly set).
Rework this function to make it a bit more rational. Specifically:
- Pass the current hostp2m entry values we have just read for it to
use to populate the altp2m entry if it finds the entry empty.
- If the altp2m entry is not empty, pass out the values we've read so
the caller doesn't need to re-walk the tables
- Either way, return with the gfn 'locked', to make clean-up handling
more consistent.
Rename the function to better reflect this functionality.
While we're here, change bool_t to bool, and return true/false rather
than 1/0.
It's a bit grating to do both the p2m_lock() and the get_gfn(),
knowing that they boil down to the same thing at the moment; but we
have to maintain the fiction until such time as we decide to get rid
of it entirely.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/altp2m: Fix style errors introduced with c/s 9abcac7ff
Drop introduced trailing whitespace, excessively long lines, mis-indentation,
superfluous use of PRI macros for int-or-smaller types, and incorrect PRI
macros for gfns and mfns.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9abcac7ff14506b934e55d1cfd86575f182b77b7
master date: 2019-05-28 14:10:36 +0100
master commit: df4e4cafd28d63be64db06493e310ac0217d2c5b
master date: 2019-05-29 14:39:28 +0100
Andrew Cooper [Fri, 19 Jul 2019 14:10:43 +0000 (16:10 +0200)]
x86/boot: Don't leak the module_map allocation in __start_xen()
Ever since its introduction in c/s 436fb462 "x86/microcode: enable boot
time (pre-Dom0) loading", the allocation has gone un-freed, and has its final
use as part of constructing dom0.
Xen already considers it an error to have more than a single unaccounted-for
module (again, logic from the same change), and will only pass the first one
to dom0 as the initrd.
Instead of having an 8 byte pointer to a bitmap which won't exceed 4 bits wide
in any production scenario (dom0 kernel, initrd, XSM blob and microcode blob),
allocate module_map[] on the stack and add a sanity bound for mbi->mods_count.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 9b757bdc1794d012f5d784de54d5884e425622e0
master date: 2019-05-13 10:35:37 +0100
Andrew Cooper [Fri, 19 Jul 2019 14:10:02 +0000 (16:10 +0200)]
x86/hvm: Fix altp2m_op hypercall continuations
c/s 9383de210 "x86/altp2m: support for setting restrictions for an array of
pages" introduced this logic, but do_hvm_op() was already capable of handling
-ERESTART correctly.
More problematic however is a continuation from compat_altp2m_op(). The arg
written back into register state points into the hypercall XLAT area, not at
the original parameter passed by the guest. It may be truncated by the
vmentry, but definitely won't be correct on the next invocation.
Delete the hypercall_create_continuation() call, and return -ERESTART, which
will cause the compat case to start working correctly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8228577ad1ba9f4b49370b76c90b75fb9243ee2f
master date: 2019-04-09 19:34:41 +0100
Andrew Cooper [Fri, 19 Jul 2019 14:09:30 +0000 (16:09 +0200)]
x86/msr: Fix handling of MSR_AMD_PATCHLEVEL/MSR_IA32_UCODE_REV
There are a number of bugs. There are no read/write hooks on the HVM side, so
guest accesses fall into the "read/write-discard" defaults, which bypass the
correct faulting behaviour and the Intel special case.
For the PV side, writes are discarded (again, bypassing proper faulting),
except for a pinned dom0, which is permitted to actually write the values
other than 0. This is pointless with the read hook implementing the Intel
special case.
However, implementing the Intel special case is itself pointless. First of
all, OS software can't guarantee to read back 0 in the first place, because a)
this behaviour isn't guaranteed in the SDM, and b) there are SMM handlers
which use the CPUID instruction. Secondly, when a guest executes CPUID, this
doesn't typically result in Xen executing a CPUID instruction in practice.
With the dom0 special case removed, there are now no writes to this MSR other
than Xen's microcode loading facilities, which means that the value held in
the MSR will be properly up-to-date. Forward it directly, without jumping
through any hoops.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 013896cb8b2f070dc452bd1b91fc5b842a538367
master date: 2019-04-05 11:09:08 +0100
Jan Beulich [Fri, 19 Jul 2019 14:08:28 +0000 (16:08 +0200)]
tools: re-sync CPUID leaf 7 tables
Bring libxl's in line with the public header, and update xen-cpuid's to
the latest information available in Intel's documentation (SDM ver 068
and ISA extensions ver 035), with (as before) the exception on MAWAU.
Some pre-existing strings get changed to match SDM naming. This should
be benign in xen-cpuid, and I hope it's also acceptable in libxl, where
people actually using the slightly wrong names would have to update
their guest config files.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Backport restricted to just xen-cpuid.c]
master commit: 787619a0640ed79650305fc21f70d48e0726e7c7
master date: 2019-03-14 16:38:39 +0100
Andrew Cooper [Fri, 19 Jul 2019 14:07:06 +0000 (16:07 +0200)]
x86/xstate: Don't special case feature collection
The logic in xstate_init() is a remnant of the pre-featuremask days.
Collect the xstate features in generic_identify(), like all other feature
leaves, after which identify_cpu() will apply the known_feature[] mask derived
from the automatically generated CPUID information.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 65c165d6595f2762da577cb428e3bc61e32e0899
master date: 2019-03-12 13:57:13 +0000
Thus the loop termination condition is dereferencing a struct pointer that
is being incremented by the loop.
A block of MSI entries stores the number of vectors in entry[0].msi.nvec,
with all subsequent entries using a value of 0. Therefore, a block of
two or more MSIs will terminate the loop early, as entry[1].msi.nvec is 0.
However, for a single MSI, ++entry moves the pointer out of bounds, and a
bogus read is used for the termination condition. In the case that the
loop body gets entered, there are subsequent OoB writes which clobber
adjacent memory in the heap.
This patch simply initializes a stack variable to the value of
entry->msi.nvec before starting the loop and then uses that in the
termination condition instead.
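In sketch form (surrounding teardown loop assumed):

    unsigned int i, nr = entry->msi.nvec;   /* only entry[0] holds the count */

    for ( i = 0; i < nr; i++, entry++ )
    {
        /* ... per-vector cleanup; the condition no longer reads
         *     entry->msi.nvec after the pointer has moved ... */
    }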
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 56ad626532eb7addeef2bb2f5f67a15756b5cee2
master date: 2019-07-02 12:00:42 +0100
Igor Druzhinin [Fri, 5 Jul 2019 08:24:30 +0000 (10:24 +0200)]
x86/cpuid: leak OSXSAVE only when XSAVE is not clear in policy
This fixes booting of old non-PV-OPS kernels which historically
looked for the OSXSAVE bit instead of the XSAVE bit in CPUID to check
whether the XSAVE feature is enabled. If such a guest is started on
an XSAVE-enabled CPU while the feature is explicitly cleared in the
policy, the OSXSAVE bit leaked from Xen will lead to a guest crash
early in boot.
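A hedged sketch of the gating (policy and register names approximate the cpuid policy code; guest_cr4 is an assumed placeholder for the guest's CR4 view):

    /* Sketch: only leak the host's OSXSAVE state when the policy
     * actually exposes XSAVE to the guest. */
    if ( p->basic.xsave && (guest_cr4 & X86_CR4_OSXSAVE) )
        res->c |= cpufeat_mask(X86_FEATURE_OSXSAVE);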
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 902888922e6feda2c485cc4bdeffd0d6e6c26e14
master date: 2019-06-28 13:17:53 +0100
Jan Beulich [Fri, 5 Jul 2019 08:23:34 +0000 (10:23 +0200)]
x86/SMP: don't try to stop already stopped CPUs
In particular with an enabled IOMMU (but not really limited to this
case), trying to invoke fixup_irqs() after having already done
disable_IO_APIC() -> clear_IO_APIC() is a rather bad idea:
Don't call fixup_irqs() and don't send any IPI if there's only one
online CPU anyway, and don't call __stop_this_cpu() at all when the CPU
we're on was already marked offline (by a prior invocation of
__stop_this_cpu()).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Extend this to the kexec/crash path as well.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6ff560f7f1f214fb89baaf97812c4c943e44a642
master date: 2019-06-18 16:35:35 +0200
Jan Beulich [Fri, 5 Jul 2019 08:23:01 +0000 (10:23 +0200)]
x86/AMD: limit C1E disable family range
Just like for other family values of 0x17 (see "x86/AMD: correct certain
Fam17 checks"), commit 3157bb4e13 ("Add MSR support for various feature
AMD processor families") made the original check for Fam11 here include
families all the way up to Fam17. The involved MSR (0xC0010055),
however, is fully reserved starting from Fam16, and the two bits of
interest are reserved for Fam12 and onwards (albeit I admit I wasn't
able to find any Fam13 doc). Restore the upper bound to be Fam11.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5c2926f576c9127a8d47217e0cafe00cc741c452
master date: 2019-06-18 16:34:51 +0200
Jan Beulich [Fri, 5 Jul 2019 08:22:27 +0000 (10:22 +0200)]
x86/AMD: correct certain Fam17 checks
Commit 3157bb4e13 ("Add MSR support for various feature AMD processor
families") converted certain checks for Fam11 to include families all
the way up to Fam17. The commit having no description, it is hard to
tell whether this was a mechanical dec->hex conversion mistake, or
indeed intended. In any event the NB_CFG handling needs to be restricted
to Fam16 and below: Fam17 doesn't really have such an MSR anymore. As
per observation it's read-zero / write-discard now, so make PV uniformly
(with the exception of pinned Dom0 vCPU-s) behave so, just like HVM
already does.
Mirror the NB_CFG behavior to MSR_FAM10H_MMIO_CONF_BASE as well, except
that here the vendor/model check is kept in place (for now at least).
A non-MMCFG extended config space access mechanism still appears to
exist, but code to deal with it will need to be written down the road,
when it can actually be tested.
Reported-by: Pu Wen <puwen@hygon.cn> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e0fbf3bf9871b00fa526c4ed893604e7ad6c3090
master date: 2019-06-18 16:33:53 +0200
Andrew Cooper [Fri, 5 Jul 2019 08:20:39 +0000 (10:20 +0200)]
x86/spec-ctrl: Knights Landing/Mill are retpoline-safe
They are both Airmont-based and should have been included in c/s 17f74242ccf
"x86/spec-ctrl: Extend repoline safey calcuations for eIBRS and Atom parts".
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: e2105180f99d22aad47ee57113015e11d7397e54
master date: 2019-05-31 19:11:29 +0100
Paul Durrant [Fri, 5 Jul 2019 08:19:12 +0000 (10:19 +0200)]
x86/vhpet: avoid 'small' time diff test on resume
It appears that even 64-bit versions of Windows 10, when not using
synthetic timers, will use 32-bit HPET non-periodic timers. There is a test
in hpet_set_timer(), specific to 32-bit timers, that tries to disambiguate
between a comparator value that is in the past and one that is sufficiently
far in the future that it wraps. This is done by assuming that the delta
between the main counter and comparator will be 'small' [1], if the
comparator value is in the past. Unfortunately, more often than not, this
is not the case if the timer is being re-started after a migrate and so
the timer is set to fire far in the future (in excess of a minute in
several observed cases) rather than set to fire immediately. This has a
rather odd symptom where the guest console is alive enough to be able to
deal with mouse pointer re-rendering, but any keyboard activity or mouse
clicks yield no response.
This patch simply adds an extra check of 'creation_finished' into
hpet_set_timer() so that the 'small' time test is omitted when the function
is called to restart timers after migration, and thus any negative delta
causes a timer to fire immediately.
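A sketch of the adjusted test (identifier names follow the commit text where possible, otherwise they are illustrative):

    /* Sketch: after migration (creation_finished not yet set), treat any
     * negative delta as 'comparator in the past' and fire immediately. */
    if ( (int32_t)diff < 0 )
        diff = (timer_is_32bit(h, tn) && d->creation_finished &&
                (-diff > HPET_TINY_TIME_SPAN))
               ? (uint32_t)diff    /* far-future wrap of a 32-bit timer */
               : 0;                /* fire immediately */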
[1] The number of ticks that equate to 0.9765625 milliseconds
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b144cf45d50b603c2909fc32c6abf7359f86f1aa
master date: 2019-05-31 11:40:52 +0200
xen/arm: grant-table: Protect gnttab_clear_flag against guest misbehavior
The function gnttab_clear_flag is used to clear the access flags. On
Arm, it is implemented using a loop and guest_cmpxchg.
It is possible that guest_cmpxchg will always return a value different
from old. This can happen if the guest updates the memory before Xen has
time to do the exchange. Because of that, there is no way to promise the
loop will end.
It is possible to make the current code safe by re-using the same
principle as applied in the guest atomic helpers. However this patch
takes a different approach that should lead to more efficient code in
the default case.
A new helper is introduced to clear a set of bits on a 16-bit word.
This should avoid an extra loop to check that cmpxchg succeeded.
Note that a mask is used instead of a bit, so the helper can be re-used
later on for clearing multiple flags at the same time.
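At the call sites this reduces the per-bit cmpxchg loops to a single call, along these lines (helper name assumed):

    /* Sketch: clear both access flags in one guest-safe atomic operation. */
    gnttab_clear_flags(rd, GTF_writing | GTF_reading, status);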
This is part of XSA-295.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
xen: Use guest atomics helpers when atomically modifying guest memory
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
This patch replaces all the atomic operations on memory shared with
a guest by the new guest atomics helpers. The x86 code was not audited
to find where guest atomics helpers could be used; I will leave that
to the x86 folks.
Note that some rework was required in order to plumb the new guest
atomics into the event channel and grant-table code.
Because guest_test_bit ignores the parameter "d" for now, a lot of
places do not need to drop the const. We may want to revisit this in
the future if the parameter "d" becomes necessary.
xen/cmpxchg: Provide helper to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
This patch adds a new helper that will update the guest memory safely.
For x86, it is already possible to use the current helper safely. So
just wrap it.
For Arm, we will first attempt to update the guest memory with the
loop bounded by a maximum number of iterations. If it fails, we will
pause the domain and try again.
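A conceptual sketch of that algorithm (helper names illustrative; the bounded primitive itself is per-size assembly on Arm):

    /* Sketch: bounded attempt first; pause the domain if it keeps failing. */
    static unsigned long guest_safe_cmpxchg(struct domain *d,
                                            volatile unsigned long *ptr,
                                            unsigned long old,
                                            unsigned long new)
    {
        unsigned long ret;

        if ( __cmpxchg_timeout(ptr, &old, new, this_cpu(guest_max_try)) )
            return old;                /* succeeded within the iteration budget */

        domain_pause_nosync(d);        /* stop the guest's vCPUs ... */
        ret = cmpxchg(ptr, old, new);  /* ... so this cannot livelock */
        domain_unpause(d);

        return ret;
    }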
Note that this heuristic assumes that a page can only be shared between
Xen and one domain, not Xen and multiple domains.
The maximum number of iterations is based on how many times atomic_inc()
can be executed in 1uS. The maximum value is per-CPU to cater for
big.LITTLE and is calculated when the CPU is booting. The heuristic was
chosen arbitrarily and can be modified if it impacts well-behaved guests
too much.
xen/bitops: Provide helpers to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
This patch adds a new set of helpers that will update guest memory
safely. For x86, it is already possible to use the current helpers
safely, so just wrap them.
For Arm, we will first attempt to update the guest memory with the loop
bounded by a maximum number of iterations. If it fails, we will pause the
domain and try again.
Note that this heuristic assumes that a page can only be shared between
Xen and one domain, not Xen and multiple domains.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value is
per-CPU to cater for big.LITTLE and is calculated when the CPU is booting.
The heuristic was chosen arbitrarily and can be modified if it impacts
well-behaved guests too much.
Note that, while test_bit does not require the use of an atomic
operation, a wrapper for test_bit was added for completeness. In this
case, the domain stays constified to avoid major rework in the callers
for the time being.
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when updating shared memory with a guest atomically.
Recent patches introduced new helpers to update shared memory with a
guest atomically. Those helpers rely on a memory region being shared
with Xen and a single guest.
At the moment, nothing prevents a guest from sharing a page with Xen as
well as with another guest (e.g. via the grant table).
For the scope of the XSA, the quickest way is to deny communication
between unprivileged guests. So this patch enables and uses SILO
mode by default on Arm.
Users wanting a finer-grained policy can write their own Flask policy.
This is part of XSA-295.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Wed, 22 May 2019 20:39:17 +0000 (13:39 -0700)]
xen/arm: cmpxchg: Provide a new helper that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all the guests are trusted, it may be possible to DoS
Xen when updating shared memory with a guest atomically.
To prevent the infinite loop, we introduce a new helper that can timeout.
The timeout is based on the maximum number of iterations.
It will be used in a follow-up patch to make atomic operations on shared
memory safe.
xen/arm: bitops: Implement a new set of helpers that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all the guests are trusted, it may be possible to DoS
Xen when updating shared memory with a guest atomically.
To prevent the infinite loop, we introduce a new set of helpers that can
timeout. The timeout is based on the maximum number of iterations.
They will be used in a follow-up patch to make atomic operations
on shared memory safe.
xen/arm32: cmpxchg: Simplify the cmpxchg implementation
The only difference between each case of the cmpxchg is the size
used. Rather than duplicating the code, provide a macro to generate each
case.
This makes the code easier to read and modify.
While doing the rework, the case for 64-bit cmpxchg is removed. It is
unused today (already commented out) and it would not be possible to use
it directly.
xen/grant_table: Rework the prototype of _set_status* for legibility
It is not clear from the parameter names whether domid and gt_version
correspond to the local or the remote domain. A follow-up patch will make
them more confusing.
So rename domid (resp. gt_version) to ldomid (resp. rgt_version). At
the same time re-order the parameters to hopefully make it more
readable.
This is part of XSA-295.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
xen/arm: Add an isb() before reading CNTPCT_EL0 to prevent re-ordering
Per D8.2.1 in ARM DDI 0487C.a, "a read to CNTPCT_EL0 can occur
speculatively and out of order relative to other instructions executed
on the same PE."
Add an instruction barrier to get an accurate number of cycles when
requested in get_cycles(). For the other users of CNTPCT_EL0, replace
with a call to get_cycles().
Jan Beulich [Tue, 12 Mar 2019 13:40:24 +0000 (14:40 +0100)]
events: drop arch_evtchn_inject()
Have the only user call vcpu_mark_events_pending() instead, at the same
time arranging for correct ordering of the writes (evtchn_pending_sel
should be written before evtchn_upcall_pending).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
Igor Druzhinin [Tue, 4 Jun 2019 13:40:37 +0000 (15:40 +0200)]
libacpi: report PCI slots as enabled only for hotpluggable devices
The DSDT for qemu-xen lacks a _STA method in its PCI slot objects. If the
_STA method doesn't exist then the slot is assumed to be always present
and active, which in conjunction with the _EJ0 method makes every device
ejectable for an OS, even if that's not the case.
qemu-kvm is able to dynamically add the _EJ0 method only to those slots
that either have hotpluggable devices or are free for PCI passthrough.
As Xen lacks this capability, we cannot use their way.
The qemu-xen-traditional DSDT has a _STA method which only reports the
slot as present if there is a PCI device hotplugged there. This is done
by querying its PCI hotplug controller. qemu-xen has a similar capability
that reports whether a device is "hotpluggable or absent", which we can
use to achieve the same result.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6761965243b113230bed900d6105be05b28f5cea
master date: 2019-05-24 10:30:21 +0200
Jan Beulich [Tue, 4 Jun 2019 13:40:07 +0000 (15:40 +0200)]
x86/IO-APIC: fix build with gcc9
There are a number of pointless __packed attributes which cause gcc 9 to
legitimately warn:
utils.c: In function 'vtd_dump_iommu_info':
utils.c:287:33: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
287 | remap = (struct IO_APIC_route_remap_entry *) &rte;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
intremap.c: In function 'ioapic_rte_to_remap_entry':
intremap.c:343:25: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
343 | remap_rte = (struct IO_APIC_route_remap_entry *) old_rte;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Simply drop these attributes. Take the liberty and also re-format the
structure definitions at the same time.
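The shape of the change, on a stand-in structure (the real ones are the
IO-APIC route/remap entries):

    /* Before: __packed lowers the alignment to 1, so casting a pointer
     * to it into a normally-aligned type legitimately warns. */
    struct rte_packed {
        uint32_t lo, hi;
    } __packed;

    /* After: identical layout, natural alignment, no gcc9 warning. */
    struct rte {
        uint32_t lo, hi;
    };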
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ca9310b24e6205de5387e5982ccd42c35caf89d4
master date: 2019-05-24 10:19:59 +0200
Juergen Gross [Tue, 4 Jun 2019 13:39:37 +0000 (15:39 +0200)]
xen/sched: fix csched2_deinit_pdata()
Commit 753ba43d6d16e688 ("xen/sched: fix credit2 smt idle handling")
introduced a regression when switching cpus between cpupools.
When assigning a cpu to a cpupool with credit2 being the default
scheduler, csched2_deinit_pdata() is called for the credit2 private data
after the new scheduler's private data has been hooked to the per-cpu
scheduler data. Unfortunately csched2_deinit_pdata() will cycle through
all per-cpu scheduler areas it knows of to remove the cpu from the
respective sibling masks, including the area of the cpu just moved. This
will (depending on the new scheduler) either clobber the data of the new
scheduler or, in the case of sched_rt, lead to a crash.
Avoid that by removing the cpu from the list of active cpus in credit2
data first.
The opposite problem occurs when removing a cpu from a cpupool:
init_pdata() of credit2 will access the per-cpu data of the old
scheduler.
Jan Beulich [Tue, 4 Jun 2019 13:39:04 +0000 (15:39 +0200)]
x86emul: add support for missing {,V}PMADDWD insns
Their pre-AVX512 incarnations have clearly been overlooked during much
earlier work. Their memory access pattern is entirely standard, so no
specific tests get added to the harness.
Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1a48bdd599b268a2d9b7d0c45f1fd40c4892186e
master date: 2019-05-16 13:43:17 +0200
Jan Beulich [Tue, 4 Jun 2019 13:38:17 +0000 (15:38 +0200)]
x86/IRQ: avoid UB (or worse) in trace_irq_mask()
Dynamically allocated CPU mask objects may be smaller than cpumask_t, so
copying has to be restricted to the actual allocation size. This is
particularly important since the function doesn't bail early when tracing
is not active, so even production builds would be affected by potential
misbehavior here.
Take the opportunity and also
- use initializers instead of assignment + memset(),
- constify the cpumask_t input pointer,
- u32 -> uint32_t.
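Roughly the resulting shape (the field layout here is an approximation,
not the committed code):

    static void trace_irq_mask(uint32_t event, int irq, int vector,
                               const cpumask_t *mask)
    {
        struct {
            unsigned int irq:16, vec:16;
            uint32_t mask[6];
        } d = { .irq = irq, .vec = vector };   /* initializer, no memset() */

        /* Copy no more than what was actually allocated for the mask. */
        memcpy(d.mask, mask,
               min(sizeof(d.mask), BITS_TO_LONGS(nr_cpu_ids) * sizeof(long)));
        trace_var(event, 1, sizeof(d), &d);
    }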
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6fafb8befa99620a2d7323b9eca5c387bad1f59f
master date: 2019-05-13 16:41:03 +0200
Andrew Cooper [Tue, 4 Jun 2019 13:37:47 +0000 (15:37 +0200)]
x86/boot: Fix latent memory corruption with early_boot_opts_t
c/s ebb26b509f "xen/x86: make VGA support selectable" added an #ifdef
CONFIG_VIDEO into the middle of the backing space for early_boot_opts_t,
but didn't adjust the structure definition in cmdline.c.
This only functions correctly because the affected fields are at the end
of the structure, and cmdline.c doesn't write to them in this case.
To retain the slimming effect of compiling out CONFIG_VIDEO, adjust
cmdline.c with enough #ifdef-ary to make C's idea of the structure match
the declaration in asm. This requires adding __maybe_unused annotations
to two helper functions.
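Illustratively (field names are stand-ins for the real layout):

    /* The C view must guard exactly the trailing fields that the asm
     * backing space guards, or the two ideas of the structure diverge. */
    typedef struct __packed {
        uint8_t skip_realmode;
        uint8_t opt_edd;
        uint8_t opt_edid;
    #ifdef CONFIG_VIDEO
        uint16_t boot_vid_mode;     /* trailing fields: omitting them   */
        uint16_t vesa_width;        /* only "worked" because nothing    */
        uint16_t vesa_height;       /* wrote to them sans CONFIG_VIDEO  */
    #endif
    } early_boot_opts_t;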
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30596213617fcf4dd7b71d244e16c8fc0acf456b
master date: 2019-05-13 10:35:38 +0100
Andrew Cooper [Tue, 4 Jun 2019 13:37:18 +0000 (15:37 +0200)]
x86/svm: Fix handling of ICEBP intercepts
c/s 9338a37d "x86/svm: implement debug events" added support for introspecting
ICEBP debug exceptions, but didn't account for the fact that
svm_get_insn_len() (previously __get_instruction_length) can fail and may
already have raised #GP with the guest.
If svm_get_insn_len() fails, return back to guest context rather than
continuing and mistaking a trap-style VMExit for a fault-style one.
Spotted by Coverity.
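The corrected flow, sketched (names follow the description; the trap
handling itself is omitted):

    static void svm_vmexit_do_icebp(struct vcpu *v)
    {
        unsigned int inst_len = svm_get_insn_len(v, INSTR_ICEBP);

        if ( inst_len == 0 )
            return;   /* #GP already raised with the guest: just resume. */

        /* ... safe to handle as a trap-style debug event from here ... */
    }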
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 1495b4ff9b4af2b9c0f12cdb6491082cecf34f86
master date: 2019-05-13 10:35:37 +0100
The 1900x1200 limit does not match real-world devices (1900 looks like
a typo for 1920). But in practice the limits are arbitrary and serve no
real purpose. As discussed in the "Increase framebuffer size to todays
standards" thread, drop them completely.
This fixes the graphical console on devices with a 3840x2160 native
resolution.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
drivers/video: drop unused limits
MAX_BPP, MAX_FONT_W, MAX_FONT_H are not used in the code at all.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 19600eb75aa9b1df3e4b0a4e55a5d08b957e1fd9
master date: 2019-05-13 10:13:24 +0200
master commit: 343459e34a6d32ba44a21f8b8fe4c1f69b1714c2
master date: 2019-05-13 10:12:56 +0200
When bitmap_fill(..., 0) is called, do not try to write anything. Before
this patch, it tried to write almost LONG_MAX, surely overwriting
something.
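A sketch of the guard (BITS_TO_LONGS() and BITMAP_LAST_WORD_MASK() are
the existing helpers; the committed fix may be structured differently):

    static inline void bitmap_fill(unsigned long *dst, unsigned int nbits)
    {
        size_t nlongs = BITS_TO_LONGS(nbits);

        if ( !nbits )
            return;     /* previously nlongs - 1 wrapped around here */

        memset(dst, 0xff, (nlongs - 1) * sizeof(unsigned long));
        dst[nlongs - 1] = BITMAP_LAST_WORD_MASK(nbits);
    }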
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 93df28be2d4f620caf18109222d046355ac56327
master date: 2019-05-13 10:12:00 +0200
Tamas K Lengyel [Tue, 4 Jun 2019 13:34:58 +0000 (15:34 +0200)]
x86/vmx: correctly gather gs_shadow value for current vCPU
Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.
Refresh the shadow_gs value when the context being saved is for the current vCPU.
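Sketch of the refresh in the VMX context-save path (the field path and
MSR name are assumptions based on the Xen sources of that era):

    static void vmx_save_guest_msrs(struct vcpu *v)
    {
        /* The cached value is stale while v is running here: read the
         * live MSR instead of reporting the last schedule-out snapshot. */
        if ( v == current )
            rdmsrl(MSR_SHADOW_GS_BASE, v->arch.hvm_vmx.shadow_gs);
    }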
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: f69fc1c2f36e8a74ba54c9c8fa5c904ea1ad319e
master date: 2019-05-13 09:55:59 +0200
Igor Druzhinin [Tue, 4 Jun 2019 13:34:21 +0000 (15:34 +0200)]
x86/mtrr: recalculate P2M type for domains with iocaps
This change reflects the logic in epte_get_entry_emt() and allows
changes in guest MTRRs to be reflected in EPT for domains having
direct access to certain hardware memory regions but without IOMMU
context assigned (e.g. XenGT).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f3d880bf2be92534c5bacf11de2f561cbad550fb
master date: 2019-05-13 09:54:45 +0200
Jan Beulich [Tue, 4 Jun 2019 13:33:53 +0000 (15:33 +0200)]
AMD/IOMMU: disable previously enabled IOMMUs upon init failure
If any IOMMUs were successfully initialized before encountering failure,
the successfully enabled ones should be disabled again before cleaning
up their resources.
Move disable_iommu() next to enable_iommu() to avoid a forward
declaration, and take the opportunity to remove stray blank lines ahead
of both functions' final closing braces.
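The error path might then look like (the iterator and flag names are
assumptions):

    static void disable_enabled_iommus(void)
    {
        struct amd_iommu *iommu;

        for_each_amd_iommu ( iommu )
            if ( iommu->enabled )
                disable_iommu(iommu);  /* undo a successful enable_iommu() */
    }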
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 87a3347d476443c66c79953d77d6aef1d2bb3bbd
master date: 2019-05-13 09:52:43 +0200
Jan Beulich [Tue, 4 Jun 2019 13:32:55 +0000 (15:32 +0200)]
trace: fix build with gcc9
While I've not observed this myself, gcc 9 (imo validly) reportedly may
complain
trace.c: In function '__trace_hypercall':
trace.c:826:19: error: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Werror=address-of-packed-member]
826 | uint32_t *a = d.args;
and the fix is rather simple - remove the __packed attribute. Introduce
a BUILD_BUG_ON() as replacement, for the unlikely case that Xen might
get ported to an architecture where array alignment is higher than that
of its elements.
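Inside __trace_hypercall(), the replacement might be (the committed
check may be phrased differently):

    struct {
        uint32_t op;
        uint32_t args[6];
    } d;                        /* no longer __packed */
    uint32_t *a = d.args;       /* now a properly aligned pointer */

    /* Safety net replacing the attribute: fail the build should padding
     * ever appear, i.e. if array alignment exceeded element alignment. */
    BUILD_BUG_ON(sizeof(d) != sizeof(d.op) + sizeof(d.args));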
Reported-by: Martin Liška <martin.liska@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 3fd3b266d4198c06e8e421ca515d9ba09ccd5155
master date: 2019-05-13 09:51:23 +0200
Igor Druzhinin [Tue, 9 Apr 2019 12:01:58 +0000 (13:01 +0100)]
tools/xl: use libxl_domain_info to get domain type for vcpu-pin
Parsing the config seems to be overkill for this particular task, and
the config might simply be absent. The type returned from
libxl_domain_info should be either LIBXL_DOMAIN_TYPE_HVM or
LIBXL_DOMAIN_TYPE_PV, but in that context the distinction between PVH
and HVM should be irrelevant.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 48dab9767d2eb173495707cb1fd8ceaf73604ac1)
Jan Beulich [Wed, 15 May 2019 07:42:17 +0000 (09:42 +0200)]
x86: fix build race when generating temporary object files
The rules to generate xen-syms and xen.efi may run in parallel, but both
recursively invoke $(MAKE) to build symbol/relocation table temporary
object files. These recursive builds would both re-generate the .*.d2
files (where needed). Both would in turn invoke the same rule, thus
allowing for a race on the .*.d2.tmp intermediate files.
The dependency files of the temporary .xen*.o files live in xen/ rather
than xen/arch/x86/ anyway, so won't be included no matter what. Take the
opportunity and delete them, as the just re-generated .xen*.S files will
trigger a proper re-build of the .xen*.o ones anyway.
Empty the DEPS variable in case the set of goals consists of just those
temporary object files, thus eliminating the race.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 761bb575ce97255029d2d2249b2719e54bc76825
master date: 2019-04-11 10:25:05 +0200