Ross Lagerwall [Mon, 14 May 2018 11:03:00 +0000 (13:03 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R540
When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.
BIOS Information
Vendor: Dell Inc.
Version: 1.3.7
Release Date: 02/09/2018
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R540
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 20 Apr 2018 09:54:00 +0000 (11:54 +0200)]
x86/setup: properly update PTEs if src/dst overlaps when relocating Xen image
Commit 0d31d16 (x86/setup: do not relocate Xen over current Xen image
placement) disallowed src/dst images overlaps when relocating Xen image.
Though it deliberately allowed destination region between __image_base__
and (__image_base__ + XEN_IMG_OFFSET) overlaps with the end of source
image. And here is the problem. If anything between __page_tables_start
and __page_tables_end in source image lands in the overlap then some or
even all page table entries may not be updated. This usually means boom
in early boot which will be difficult to the investigate. So, I think
that we have three choices to fix the issue:
- drop XEN_IMG_OFFSET from
if ( (end > s) && (end - reloc_size + XEN_IMG_OFFSET >= __pa(_end)) )
- add XEN_IMG_OFFSET to xen_phys_start in PFN_DOWN(xen_phys_start)
used in loops as one of conditions and replace ">" with ">=",
- change PFN_DOWN(xen_phys_start) to PFN_DOWN(xen_remap_end_pfn)
proposed in earlier version of this patch.
This patch implements the second option. This way we still allow source
and destination partial overlap as described above but PTEs are properly
updated now.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Olaf Hering [Tue, 12 Jun 2018 14:11:00 +0000 (16:11 +0200)]
unmodified_drivers: unplug the emulated devices at resume time
Since qemu-2.10 it is required to unplug emulated devices again after
a live migration. If this is not done, qemu's block-backend driver
will be unable to open the backing disk image because it is still busy
by qemu's IDE driver. As a result the domUs block-frontend driver will
be unable to access the disks, and the domU has to be destroyed.
libxl is unable to detect the situation.
Apply the same workaround for this qemu bug that was done already
years ago in linux.git with commit 512b109ec962 ("xen: unplug the
emulated devices at resume time") to make sure xenlinux based domUs
can be migrated to unfixed hosts.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Wed, 27 Jun 2018 14:33:00 +0000 (16:33 +0200)]
x86/cpuid: fix generation of auto cpuid header
The makefile rule to generate the cpuid-autogen.h header passes the
whole list of dependencies to gen-cpuid.py but only the first
dependency is actually needed.
So far this seems to be harmless.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 07:08:04 +0000 (09:08 +0200)]
x86: guard against #NM
Just in case we still don't get CR0.TS handling right, prevent a host
crash by honoring exception fixups in do_device_not_available(). This
would in particular cover emulator stubs raising #NM.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 07:07:06 +0000 (09:07 +0200)]
x86/HVM: don't cause #NM to be raised in Xen
The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.
{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.
With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Jackson [Wed, 13 Jun 2018 14:54:53 +0000 (15:54 +0100)]
libxl: restore passing "readonly=" to qemu for SCSI disks
A read-only check was introduced for XSA-142, commit ef6cb76026 ("libxl:
relax readonly check introduced by XSA-142 fix") added the passing of
the extra setting, but commit dab0539568 ("Introduce COLO mode and
refactor relevant function") dropped the passing of the setting again,
quite likely due to improper re-basing.
Restore the readonly= parameter to SCSI disks. For IDE disks this is
supposed to be rejected; add an assert. And there is a bare ad-hoc
disk drive string in libxl__build_device_model_args_new, which we also
update.
This is XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Wed, 13 Jun 2018 14:51:36 +0000 (15:51 +0100)]
libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config
The generated configurations are identical apart from, in some cases,
reordering of the id=%s element. So, overall, no functional change.
This is part of XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Andrew Cooper [Thu, 28 Jun 2018 07:04:20 +0000 (09:04 +0200)]
x86: Refine checks in #DB handler for faulting conditions
One of the fix for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.
While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to. Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware). A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.
Instruction breakpoints are more tricky. The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.
Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context. However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.
This is XSA-265
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Jun 2018 13:23:08 +0000 (15:23 +0200)]
x86/EFI: further correct FPU state handling around runtime calls
We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.
Rename the function at the same time to better reflect its purpose, as
the patches touches all of its occurences anyway.
The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Roger Pau Monné [Tue, 26 Jun 2018 06:48:14 +0000 (08:48 +0200)]
x86/dom0: add extra RAM regions as UNUSABLE for PVH memory map
When running as PVH Dom0 the native memory map is used in order to
craft a tailored memory map for Dom0 taking into account it's memory
limit.
Dom0 memory is always going to be smaller than the total amount
of memory present on the host, so in order to prevent Dom0 from
relocating PCI BARs over RAM regions mark all the RAM regions not
available to Dom0 as UNUSABLE in the memory map.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 26 Jun 2018 06:47:17 +0000 (08:47 +0200)]
x86/HVM: alter completion-needed checking
The function only looks at the ioreq_t, so pass it a pointer to just
that. Also use it in hvmemul_do_io().
Suggested-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Tue, 26 Jun 2018 06:41:08 +0000 (08:41 +0200)]
x86/HVM: attempts to emulate FPU insns need to set fpu_initialised
My original way of thinking here was that this would be set anyway at
the point state gets reloaded after the adjustments hvmemul_put_fpu()
does, but the flag should already be set before that - after all the
guest may never again touch the FPU before e.g. getting migrated/saved.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Thu, 21 Jun 2018 09:35:46 +0000 (11:35 +0200)]
x86/EFI: fix FPU state handling around runtime calls
There are two issues. First, the nonlazy xstates were never restored
after returning from the runtime call.
Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.
Fix both of these issues by calling vcpu_restore_fpu_eager(). As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.
Introduce a "curr" local variable and replace other uses of "current"
at the same time.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Juergen Gross <jgross@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Mon, 18 Jun 2018 07:18:56 +0000 (09:18 +0200)]
tools/libxc: retry hypercall in case of EFAULT
A hypercall issued via the privcmd driver can very rarely return
-EFAULT even if the hypercall buffers are locked in memory. This
happens for hypercall buffers in user memory when the Linux kernel
is doing memory scans e.g. for page migration or compaction.
Retry the getpageframeinfo3 hypercall up to 2 times in case
-EFAULT is returned and the hypervisor might see invalid PTEs for
user hypercall buffers (which should be the case only if the kernel
doesn't offer a /dev/xen/hypercall node).
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Juergen Gross [Mon, 18 Jun 2018 07:18:54 +0000 (09:18 +0200)]
tools/libxencall: use hypercall buffer device if available
Instead of using anonymous memory for hypercall buffers which is then
locked into memory, use the hypercall buffer device of the Linux
privcmd driver if available.
This has the advantage of needing just a single mmap() for allocating
the buffer and page migration or compaction can't make the buffer
unaccessible for the hypervisor.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Fri, 15 Jun 2018 09:49:06 +0000 (11:49 +0200)]
x86/HVM: account for fully eager FPU mode in emulation
In fully eager mode we must not clear fpu_dirtied, set CR0.TS, or invoke
the fpu_leave() hook. Instead do what the mode's name says: Restore
state right away.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU
Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.
On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Display and input protocols define "unique-id" XenBus field as string
which is much more flexible in defining unique identifiers comparing
to integer used by sound protocol. For example, this allows to provide
UUIDs as unique ID's. Align sound protocol with display and input
and redefine "unique-id" field as string.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
If frontend is configured to expose multiple connectors then backend may
require a way to uniquely identify concrete virtual connector within the
frontend. This is useful for use-cases where connector needs to be
matched to physical display connector.
Add XenBus "unique-id" node parameter, so this sort of use-cases can
be implemented.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
If frontend is configured to expose multiple input device instances
then backend may require a way to uniquely identify concrete input
device within the frontend. This is useful for use-cases where
virtual input device needs to be matched to physical input device.
Add XenBus "unique-id" node parameter, so this sort of use-cases can
be implemented.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
xen/kbdif: Move multi-touch device parameters to backend nodes
In current kbdif protocol definition multi-touch device parameters
are described as a part of frontend's XenBus configuration nodes while
they belong to backend's configuration. Fix this by moving
the parameters to the proper section.
Andrew Cooper [Thu, 31 May 2018 15:57:47 +0000 (16:57 +0100)]
x86/VT-x: Fix printing of EFER in vmcs_dump_vcpu()
This is essentially a "take 2" of c/s 82540b66ce "x86/VT-x: Fix determination
of EFER.LMA in vmcs_dump_vcpu()" because in hindight, that change was more
problematic than useful.
The original reason was to fix the logic for determining when not to print the
PDPTE pointers. However, mutating the efer variable (particularly LME and
LMA) before printing it interferes with diagnosing vmentry failures.
Instead of modifying efer, change the PDPTE conditional to use
VM_ENTRY_IA32E_MODE.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Marcello Seri [Thu, 31 May 2018 13:05:37 +0000 (14:05 +0100)]
ocaml/xenstored: reduce use of unsafe conversions
The rationalisation of the Xs_ring interface in the xb library
allows to further reduce the unsafe calls withouth introducing
copies. This patch also contains some further code cleanups.
Signed-off-by: Marcello Seri <marcello.seri@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Marcello Seri [Thu, 31 May 2018 13:05:36 +0000 (14:05 +0100)]
ocaml/libs/xb: Use bytes in place of strings for mutable buffers
Since Ocaml 4.06.0, that made safe-string on by default, the compiler is
allowed to perform optimisations on immutable strings. They should no
longer be used as mutable buffers, and bytes should be used instead.
The C stubs for Xs_ring have been updated to use bytes, and the interface
rationalised mimicking the new Unix module in the standard library (the
implementation of Unix.write_substring uses unsafe_of_string in the exact same
way, and both the write implementations are using the bytes as an immutable
payload for the write).
Signed-off-by: Marcello Seri <marcello.seri@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Fri, 1 Jun 2018 13:08:59 +0000 (14:08 +0100)]
x86/traps: Fix error handling of the pv %dr7 shadow state
c/s "x86/pv: Introduce and use x86emul_write_dr()" fixed a bug with IO shadow
handling, in that it remained stale and visible until %dr7.L/G got set again.
However, it neglected the -EPERM return inbetween these two hunks, introducing
a different bug in which a write to %dr7 which tries to set IO breakpoints
without %cr4.DE being set clobbers the IO state, rather than leaves it alone.
Instead, move the zeroing slightly later, which guarentees that the shadow
gets written exactly once, on a successful update to %dr7.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 29 May 2018 10:39:24 +0000 (12:39 +0200)]
x86/CPUID: don't override tool stack decision to hide STIBP
Other than in the feature sets, where we indeed want to offer the
feature even if not enumerated on hardware, we shouldn't dictate the
feature being available if tool stack or host admin have decided to not
expose it (for whatever [questionable?] reason). That feature set side
override is sufficient to achieve the intended guest side safety
property (in offering - by default - STIBP independent of actual
availability in hardware).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 29 May 2018 10:38:52 +0000 (12:38 +0200)]
x86: correct default_xen_spec_ctrl calculation
Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 29 May 2018 10:38:09 +0000 (12:38 +0200)]
x86: suppress sync when XPTI is disabled for a domain
Now that we have a per-domain flag we can and should control sync-ing in
a more fine grained manner: Only domains having XPTI enabled need the
sync to occur.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
tools/kdd: alternative way of muting spurious gcc warning
Older gcc does not support #pragma GCC diagnostics, so use alternative
approach - change variable type to uint32_t (this code handle 32-bit
requests only anyway), which apparently also avoid gcc complaining about
this (otherwise correct) code.
Ian Jackson [Wed, 13 Dec 2017 11:58:00 +0000 (11:58 +0000)]
docs/process/xen-release-management: Lesson to learn
The 4.10 release preparation was significantly more hairy than ideal.
(We seem to have a good overall outcome despite, rather than because
of, our approach.)
This is the second time (at least) that we have come close to failure
by committing to a release date before the exact code to be released
is known and has been made and tested.
Evidently our docs makes it insufficiently clear not to do that.
CC: Julien Grall <julien.grall@arm.com> Acked-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Lars Kurth <lars.kurth@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Andrew Cooper [Thu, 24 May 2018 14:06:16 +0000 (15:06 +0100)]
x86/traps: Dump the instruction stream even for double faults
This helps debug #DF's which occur in alternative patches
Reported-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Mon, 28 May 2018 09:20:26 +0000 (11:20 +0200)]
x86/XPTI: fix S3 resume (and CPU offlining in general)
We should index an L1 table with an L1 index.
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 22 May 2018 14:01:26 +0000 (16:01 +0200)]
x86/HVM: correct mtrr_pat_not_equal()
The two vCPU-s differing in MTRR-enabled state means MTRR settings are
not equal. Both vCPU-s having MTRRs disabled means only PAT needs to be
compared. Along those lines for fixed range MTRRs. Differing variable
range counts likewise mean settings are different overall (even if
that's not a very reasonable setup to have).
Constify types and convert bool_t to bool.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 22 May 2018 14:00:32 +0000 (16:00 +0200)]
x86: correct vCPU dirty CPU handling
Commit df8234fd2c ("replace vCPU's dirty CPU mask by numeric ID") was
too lax in two respects: First of all it didn't consider the case of a
vCPU not having a valid dirty CPU in the descriptor table TLB flush
case. This is the issue Manual has run into with NetBSD.
Additionally reads of ->dirty_cpu for other than the current vCPU are at
risk of racing with scheduler actions, i.e. single atomic reads need to
be used there. Obviously the non-init write sites then better also use
atomic writes.
Having to touch the descriptor table TLB flush code here anyway, take
the opportunity and switch it to be at most one flush_tlb_mask()
invocation.
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Wed, 28 Mar 2018 14:21:39 +0000 (15:21 +0100)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:
* A feature bit to indicate that Speculative Store Bypass Disable is
supported.
* A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
in the pipeline.
* A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
indicating that the hardware is not susceptible to Speculative Store Bypass
sidechannels.
For contemporary processors, this interface will be implemented via a
microcode update.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 26 Apr 2018 09:56:28 +0000 (10:56 +0100)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.
Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.
For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 8 May 2018 06:47:30 +0000 (08:47 +0200)]
doc: correct livepatch.markdown syntax
"make -C docs all" fails due to incorrect markdown syntax in
livepatch.markdown. Correct it.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Misc fixes:
* Insert real URLs
* Drop trailing whitespace
* Consistent alignment and indentation for code blocks and lists
* Consistent capitalisation
* Consistent use of `` blocks for command line arguments and function names
* Rearrange things not to leave < and > in the text
No change in content. The document now reads rather more consistently in HTML
and PDF form.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Olaf Hering [Tue, 3 Apr 2018 11:14:11 +0000 (13:14 +0200)]
xl: show full value of cpu_khz in xl info output
The exact value of cpu_khz can only be obtained via 'xl dmesg', and
therefore can be lost after some time. 'xl info' truncates the value to
full MHz. Adjust the output to show the full khz value.
This helps the host admin to track how a host has calibrated itself. The
value of cpu_khz is used during live migration for the decision if
access to TSC should be emualted.
Commit eb5277a30e ("bitkeeper revision 1.959.1.4
(40d04a87acOb29u-5Y5OxMhHvP2x9g)" gives no hint why cpu_mhz instead of
cpu_khz was chosen.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Fri, 18 May 2018 16:17:54 +0000 (17:17 +0100)]
Config.mk: Update QEMU to include build fixes
This tag includes two build fixes:
- dump: Fix build with newer gcc
Fix build with GCC-8
- Fix libusb-1.0.22 deprecated libusb_set_debug with libusb_set_option
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
xen/kbdif: Add features to disable keyboard and pointer
It is now not fully possible to control if and which virtual devices
are created by the frontend, e.g. keyboard and pointer devices
are always created and multi-touch device is created if the
backend advertises multi-touch support. In some cases this
behavior is not desirable and better control over the frontend's
configuration is required.
Add new XenStore feature fields, so it is possible to individually
control set of exposed virtual devices for each guest OS:
- set feature-disable-keyboard to 1 if no keyboard device needs
to be created
- set feature-disable-pointer to 1 if no pointer device needs
to be created
Keep old behavior by default.
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
I manually tested all options and the most common combinations
on Mac.
Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: George Dunlap <George.Dunlap@eu.citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Julien Grall <julien.grall@arm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tim Deegan <tim@xen.org> Cc: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Lars Kurth <lars.kurth@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Lars Kurth <lars.kurth@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Roger Pau Monné [Wed, 16 May 2018 14:28:46 +0000 (16:28 +0200)]
vpci/msi: fix unbind loop
The current unbind loop on failure in vpci_msi_enable is wrong and
will only work correctly if the initial pirq is 0. Fix this by adding
a proper bound.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 26 Apr 2018 09:52:55 +0000 (10:52 +0100)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.
One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains. In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit. To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.
For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings. See the xen-command-line.markdown for
full details of the new options.
While changing the command line options, take the time to change how the data
is reported to the user. The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.
Sample output from booting with `spec-ctrl=no-pv` looks like:
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 1 May 2018 10:59:03 +0000 (11:59 +0100)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features
If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised. It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.
Another useful option in some usecases is to offer IBPB without IBRS. When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS. This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.
Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Wed, 9 May 2018 12:59:56 +0000 (13:59 +0100)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value. Explicitly set Xen's default during the BSP
and AP boot paths.
For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so. Oracle report that this
speeds up boots of some hardware by 50s.
"when safe to do so" is based on whether we are virtualised. A native boot
won't have any other code running in a position to mount an attack.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 7 May 2018 13:06:16 +0000 (14:06 +0100)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
Requested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 30 Apr 2018 13:20:23 +0000 (14:20 +0100)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.
Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.
There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
All 3 bits of information here are control flags for the entry/exit code
behaviour. Treat them as such, rather than having two different variables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 17 Apr 2018 13:15:04 +0000 (14:15 +0100)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed. One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.
Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 26 Apr 2018 11:21:00 +0000 (12:21 +0100)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions. Fix an RSBA typo while moving the
affected comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
CC xenctrl_stubs.o
xenctrl_stubs.c: In function 'failwith_xc':
xenctrl_stubs.c:65:17: error: 'snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
"%d: %s: %s", error->code,
^
xenctrl_stubs.c:64:4: note: 'snprintf' output 6 or more bytes (assuming 1029) into a destination of size 1028
snprintf(error_str, sizeof(error_str),
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
"%d: %s: %s", error->code,
~~~~~~~~~~~~~~~~~~~~~~~~~~
xc_error_code_to_desc(error->code),
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
error->message);
~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[8]: *** [/build/xen-git/src/xen/tools/ocaml/libs/xc/../../Makefile.rules:37: xenctrl_stubs.o] Error 1
m
Signed-off-by: John Thomson <git@johnthomson.fastmail.com.au> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Ian Jackson [Tue, 15 May 2018 14:41:14 +0000 (15:41 +0100)]
docs/parse-support-md: Correctly process caveats in multi-status sections
When SUPPORT.md uses the syntax
Status, <some thing>: <support status>
the caveats were lost (not footnoted) because they were attached
only to <some thing>.
Caveats occur in running text, so they are necessarily part of a real
section, not an individual status line like that. So attach them to
the RealSectNode, and look there for them.
Reported-by: Lars Kurth <lars.kurth@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Lars Kurth <Lars.kurth@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Paul Durrant [Fri, 11 May 2018 14:48:32 +0000 (15:48 +0100)]
viridian: fix cpuid leaf 0x40000003
The response to viridian leaf 3 needs to split a 64-bit mask across EAX and
EBX, with the low order 32 bits in EAX and the high order 32 bits in EBX.
To facilitate this a union of two uint32_t values and the mask (type
HV_PARTITION_PRIVILEGE_MASK) is allocated on stack as follows:
union {
HV_PARTITION_PRIVILEGE_MASK mask;
uint32_t lo, hi;
} u;
This, of course, is incorrect as both lo and hi will alias the low order
32 bits of the mask.
This patch wraps lo and hi in an anonmymous struct to achieve the desired
effect.
NOTE: Fixing this also stops Windows making the HvGetPartitionId hypercall
which was previously considered erroneous behaviour. Thus the
hypercall handler is also modified to stop squashing the
'unimplemented' warning for this hypercall.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
New versions of iasl have introduced improved C file generation, as
reported in the changelog:
iASL: Enhanced the -tc option (which creates an AML hex file in C,
suitable for import into a firmware project):
1) Create a unique name for the table, to simplify use of multiple
SSDTs.
2) Add a protection #ifdef in the file, similar to a .h header file.
Fix the build with newer versions of iasl by stripping the '_aml_code'
suffix from the variable name on generated files.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 8 May 2018 17:12:56 +0000 (18:12 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways
In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop. This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.
Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
exception of the transition to STATE_IOREQ_NONE).
This is XSA-262.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/vpt: add support for IO-APIC routed interrupts
And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.
To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.
Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.
Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worse case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.
This is part of XSA-261.
Address actual and potential compiler warnings. Fix formatting.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 23 Mar 2018 17:03:42 +0000 (17:03 +0000)]
x86/traps: Fix handling of #DB exceptions in hypervisor context
The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting. Swap it out for a ratelimited printk with just
enough information to work out what is going on.
Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take. We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.
This is part of XSA-260 / CVE-2018-8897
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 22 Mar 2018 11:27:03 +0000 (11:27 +0000)]
x86/pv: Move exception injection into {,compat_}test_all_events()
This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.
The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.
This is part of XSA-260 / CVE-2018-8897
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Mar 2018 08:02:34 +0000 (09:02 +0100)]
x86/traps: Fix %dr6 handing in #DB handler
Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB. Have the handler follow the manuals guidance, which
avoids leaking hypervisor debugging activities into guest context.
This is part of XSA-260 / CVE-2018-8897
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 12:45:45 +0000 (13:45 +0100)]
x86/domain: Drop the only-written smap_check_policy infrastructure
c/s 4c5d78a10d "x86/pagewalk: Re-implement the pagetable walker" dropped the
consumer of smap_policy. Looking at c/s 31ae587e6f which introduced the
smap_check logic, it exists only to work around a bug in guest_walk_tables()
was resolved by the aformentioned commit.
Remove the unused variables and associated infrastructure.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 20 Mar 2018 19:36:40 +0000 (19:36 +0000)]
x86/pv: Hide more EFER bits from PV guests
We don't advertise SVM in CPUID so a PV guest shouldn't be under the
impression that it can use SVM functionality, but despite this, it really
shouldn't see SVME set when reading EFER.
On Intel processors, 32bit PV guests don't see, and can't use SYSCALL.
Introduce EFER_KNOWN_MASK to whitelist the features Xen knows about, and use
this to clamp the guests view.
Take the opportunity to reuse the mask to simplify svm_vmcb_isvalid(), and
change "undefined" to "unknown" in the print message, as there is at least
EFER.TCE (Translation Cache Extension) defined but unknown to Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Mon, 7 May 2018 07:12:16 +0000 (09:12 +0200)]
SVM: introduce a VM entry helper
Neither the register values copying nor the trace entry generation need
doing in assembly. The VMLOAD invocation can also be further deferred
(and centralized). Therefore replace the svm_asid_handle_vmrun()
invocation with one of the new helper.
Similarly move the VM exit side register value copying into
svm_vmexit_handler().
Now that we always make it out to guest context after VMLOAD,
svm_sync_vmcb() no longer overrides vmcb_needs_vmsave, making
svm_vmexit_handler() setting the field early unnecessary.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Mon, 7 May 2018 07:11:15 +0000 (09:11 +0200)]
SVM: re-work VMCB sync-ing
While the main problem to be addressed here is the issue of what so far
was named "vmcb_in_sync" starting out with the wrong value (should have
been true instead of false, to prevent performing a VMSAVE without ever
having VMLOADed the vCPU's state), go a step further and make the
sync-ed state a tristate: CPU and memory may be in sync or an update
may be required in either direction. Rename the field and introduce an
enum. Callers of svm_sync_vmcb() now indicate the intended new state
(with a slight "anomaly" when requesting VMLOAD: we could store
vmcb_needs_vmsave in those cases as the callers request, but the VMCB
really is in sync at that point, and hence there's no need to VMSAVE in
case we don't make it out to guest context), and all syncing goes
through that function.
With that, there's no need to VMLOAD the state perhaps multiple times;
all that's needed is loading it once before VM entry.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.
We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:
- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.
As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.
With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: use flag byte for decision whether xen_cr3 is valid
Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.
Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.
This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: disable global pages for domains with XPTI active
Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.
In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).
This requires to use switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.
Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/xpti: avoid copying L4 page table contents when possible
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.
Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.
So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.
A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:
- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 3 May 2018 15:35:51 +0000 (17:35 +0200)]
x86: fix return value checks of set_guest_{machinecheck,nmi}_trapbounce
Commit 0142064421 ("x86/traps: move set_guest_{machine,nmi}_trapbounce")
converted the functions' return types from int to bool without also
correcting the checks in assembly code: The ABI does not guarantee sub-
32-bit return values to be promoted to 32 bits.
Take the liberty and also adjust the number of spaces used in the compat
code, such that both code sequences end up similar (they already are in
the non-compat case).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Tue, 1 May 2018 17:13:27 +0000 (18:13 +0100)]
xen/schedule: Fix races in vcpu migration
The current sequence to initiate vcpu migration is inefficent and error-prone:
- The initiator sets VPF_migraging with the lock held, then drops the
lock and calls vcpu_sleep_nosync(), which immediately grabs the lock
again
- A number of places unnecessarily check for v->pause_flags in between
those two
- Every call to vcpu_migrate() must be prefaced with
vcpu_sleep_nosync() or introduce a race condition; this code
duplication is error-prone
- In the event that v->is_running is true at the beginning of
vcpu_migrate(), it's almost certain that vcpu_migrate() will end up
being called in context_switch() as well; we might as well simply
let it run there and save the duplicated effort (which will be
non-negligible).
The result is that Credit1 has several races which result in runqueue
<-> v->processor invariants being violated (triggering ASSERTs in
debug builds and strange bugs in production builds).
Instead, introduce vcpu_migrate_start() to initiate the process.
vcpu_migrate_start() is called with the scheduling lock held. It not
only sets VPF_migrating, but also calls vcpu_sleep_nosync_locked()
(which will automatically do nothing if there's nothing to do).
Rename vcpu_migrate() to vcpu_migrate_finish(). Check for v->is_running and
pause_flags & VPF_migrating at the top and return if appropriate.
Then the way to initiate migration is consistently:
George Dunlap [Wed, 2 May 2018 10:09:18 +0000 (11:09 +0100)]
xen: Introduce vcpu_sleep_nosync_locked()
There are a lot of places which release a lock before calling
vcpu_sleep_nosync(), which then just grabs the lock again. This is
not only a waste of time, but leads to more code duplication (since
you have to copy-and-paste recipes rather than calling a unified
function), which in turn leads to an increased chance of bugs.
Introduce vcpu_sleep_nosync_locked(), which can be called if you
already hold the schedule lock.