Jan Beulich [Mon, 30 Jul 2018 12:16:15 +0000 (14:16 +0200)]
cmdline: fix parse_boolean() for NULL incoming end pointer
Use the calculated lengths instead of pointers, as 'e' being NULL will
otherwise cause undue parsing failures.
Reported-by: Karl Johnson <karljohnson.it@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 10:28:21 +0000 (12:28 +0200)]
x86/HVM: don't cause #NM to be raised in Xen
The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.
{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.
With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 10:27:56 +0000 (12:27 +0200)]
x86/EFI: further correct FPU state handling around runtime calls
We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.
Rename the function at the same time to better reflect its purpose, as
the patches touches all of its occurences anyway.
The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 10:27:34 +0000 (12:27 +0200)]
x86/EFI: fix FPU state handling around runtime calls
There are two issues. First, the nonlazy xstates were never restored
after returning from the runtime call.
Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.
Fix both of these issues by calling vcpu_restore_fpu_eager(). As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.
Introduce a "curr" local variable and replace other uses of "current"
at the same time.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 28 Jun 2018 10:26:54 +0000 (12:26 +0200)]
x86: Refine checks in #DB handler for faulting conditions
One of the fix for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.
While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to. Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware). A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.
Instruction breakpoints are more tricky. The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.
Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context. However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.
This is XSA-265
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 28 Jun 2018 10:25:43 +0000 (12:25 +0200)]
x86: correct default_xen_spec_ctrl calculation
Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d6239f64713df819278bf048446d3187c6ac4734
master date: 2018-05-29 12:38:52 +0200
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU
Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.
On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 243435bf67e8159495194f623b9e4d8c90140384)
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86: Support fully eager FPU context switching
This is controlled on a per-vcpu bases for flexibility.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 146dfe9277c2b4a8c399b229e00d819065e3167b)
Jan Beulich [Wed, 30 May 2018 06:38:06 +0000 (08:38 +0200)]
x86: re-enable XPTI/PCID as needed in switch_native()
Additionally avoid accessing d->arch.pv_domain for PVH domains (running
in a HVM container).
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.
We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:
- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.
As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.
With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: use flag byte for decision whether xen_cr3 is valid
Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.
Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.
This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: disable global pages for domains with XPTI active
Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.
In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).
This requires to use switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 29 May 2018 09:27:32 +0000 (11:27 +0200)]
x86/xpti: avoid copying L4 page table contents when possible
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.
Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.
So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.
A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:
- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 23 Jan 2018 09:43:39 +0000 (10:43 +0100)]
x86: move invocations of hvm_flush_guest_tlbs()
Their need is not tied to the actual flushing of TLBs, but the ticking
of the TLB clock. Make this more obvious by folding the two invocations
into a single one in pre_flush().
Also defer the latching of CR4 in write_cr3() until after pre_flush()
(and hence implicitly until after IRQs are off), making operation
sequence the same in both cases (eliminating the theoretical risk of
pre_flush() altering CR4). This then also improves register allocation,
as the compiler doesn't need to use a callee-saved register for "cr4"
anymore.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 29 May 2018 09:08:34 +0000 (11:08 +0200)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:
* A feature bit to indicate that Speculative Store Bypass Disable is
supported.
* A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
in the pipeline.
* A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
indicating that the hardware is not susceptible to Speculative Store Bypass
sidechannels.
For contemporary processors, this interface will be implemented via a
microcode update.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9df52a25e0e95a0b9971aa2fc26c5c6a5cbdf4ef
master date: 2018-05-21 14:20:06 +0100
Andrew Cooper [Tue, 29 May 2018 09:08:10 +0000 (11:08 +0200)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.
Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.
For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c0e338086f060eba31d37b83fbdb883928aa085
master date: 2018-05-21 14:20:06 +0100
Andrew Cooper [Tue, 29 May 2018 09:07:29 +0000 (11:07 +0200)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.
One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains. In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit. To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.
For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings. See the xen-command-line.markdown for
full details of the new options.
While changing the command line options, take the time to change how the data
is reported to the user. The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.
Sample output from booting with `spec-ctrl=no-pv` looks like:
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3352afc26c497d26ecb70527db3cb29daf7b1422
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 09:06:56 +0000 (11:06 +0200)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features
If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised. It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.
Another useful option in some usecases is to offer IBPB without IBRS. When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS. This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.
Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb06b308ec71b23f37a44f5e2351fe2cae0306e9
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 09:06:30 +0000 (11:06 +0200)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value. Explicitly set Xen's default during the BSP
and AP boot paths.
For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so. Oracle report that this
speeds up boots of some hardware by 50s.
"when safe to do so" is based on whether we are virtualised. A native boot
won't have any other code running in a position to mount an attack.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb8c12020307b39a89273d7699e89000451987ab
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 09:00:29 +0000 (11:00 +0200)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa9eb09d446a1279f5e861e6b84fa8675dabf148
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 09:00:04 +0000 (11:00 +0200)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
Requested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94df6e8588e35cc2028ccb3fd2921c6e6360605e
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:59:39 +0000 (10:59 +0200)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.
Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d9822b8a38114e96e4516dc998f4055249364d5d
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:59:11 +0000 (10:59 +0200)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.
There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af949407eaba7af71067f23d5866cd0bf1f1144d
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:58:44 +0000 (10:58 +0200)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
All 3 bits of information here are control flags for the entry/exit code
behaviour. Treat them as such, rather than having two different variables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5262ba2e7799001402dfe139ff944e035dfff928
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:58:17 +0000 (10:58 +0200)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed. One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.
Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66dfae0f32bfbc899c2f3446d5ee57068cb7f957
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:57:41 +0000 (10:57 +0200)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions. Fix an RSBA typo while moving the
affected comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6c65187252a6c1810fd24c4d46f812840de8d3c
master date: 2018-05-16 12:19:10 +0100
Juergen Gross [Fri, 18 May 2018 11:32:05 +0000 (13:32 +0200)]
xpti: fix bug in double fault handling
When entering the hypervisor via the double fault handler resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following exception, so repair this in order to allow
exception handling to work even after a double fault.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d80af845de7a4db01a4a3b4d779e0e0dcb5e738b
master date: 2018-04-23 16:13:01 +0200
The 'RSB Alternative' bit in MSR_ARCH_CAPABILITIES may be set by a hypervisor
to indicate that the virtual machine may migrate to a processor which isn't
retpoline-safe. Introduce a shortened name (to reduce code volume), treat it
as authorative in retpoline_safe(), and print its value along with the other
ARCH_CAPS bits.
The exact processor models which do have RSB semantics which fall back to BTB
predictions are enumerated, and include Kabylake and Coffeelake. Leave a
printk() in the default case to help identify cases which aren't covered.
The exact microcode versions from Broadwell RSB-safety are taken from the
referenced microcode update file (adjusting for the known-bad microcode
versions). Despite the exact wording of the text, it is only Broadwell
processors which need a microcode check.
In practice, this means that all Broadwell hardware with up-to-date microcode
will use retpoline in preference to IBRS, which will be a performance
improvement for desktop and server systems which would previously always opt
for IBRS over retpoline.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix typo in ARCH_CAPS decode
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 1232378bd2fef45f613db049b33852fdf84d7ddf
master date: 2018-04-19 17:28:23 +0100
master commit: 27170adb54a558e11defcd51989326a9beb95afe
master date: 2018-04-24 13:34:12 +0100
Andrew Cooper [Fri, 18 May 2018 11:31:01 +0000 (13:31 +0200)]
x86/msr: Correct the emulation behaviour of MSR_PRED_CMD
Experimentally, the behaviour of reserved bits in MSR_PRED_CMD changed between
beta and production microcode, and now raises a #GP fault for set reserved
bits. The AMD spec for future hardware also specifies this behaviour, and it
is the more sensible behaviour to implement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/msr: further correct the emulation behaviour of MSR_PRED_CMD
Following commit a6aa678fa3 ("x86/msr: Correct the emulation behaviour
of MSR_PRED_CMD") we may end up writing the low bit with the wrong
value. While it's unlikely for a guest to want to write zero there, we
should still permit (this without incurring the overhead of an actual
barrier). Correcting this right away will also help whenever further
bits in the MSR might become defined.
Jan Beulich [Fri, 18 May 2018 11:30:30 +0000 (13:30 +0200)]
x86: suppress BTI mitigations around S3 suspend/resume
NMI and #MC can occur at any time after S3 resume, yet the MSR_SPEC_CTRL
may become available only once we're reloaded microcode. Make
SPEC_CTRL_ENTRY_FROM_INTR_IST and DO_SPEC_CTRL_EXIT_TO_XEN no-ops for
the critical period of time.
Also set the MSR back to its intended value.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: Use spec_ctrl_{enter,exit}_idle() in the S3/S5 path
The main purpose of this patch is to avoid opencoding the recovery logic at
the end, but also has the positive side effect of relaxing the SPEC_CTRL
mitigations when working to shut the final CPU down.
Jan Beulich [Fri, 18 May 2018 11:30:05 +0000 (13:30 +0200)]
x86: correct ordering of operations during S3 resume
Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the
first place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As micrcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().
cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb2a4a449dfd50af309a333aa805835015fbc8c8
master date: 2018-04-16 14:08:30 +0200
Jan Beulich [Tue, 8 May 2018 17:28:20 +0000 (18:28 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways
In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop. This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.
Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
exception of the transition to STATE_IOREQ_NONE).
This is XSA-262.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/vpt: add support for IO-APIC routed interrupts
And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.
To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.
Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.
Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worse case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.
This is part of XSA-261.
Address actual and potential compiler warnings. Fix formatting.
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Fix handling of #DB exceptions in hypervisor context
The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting. Swap it out for a ratelimited printk with just
enough information to work out what is going on.
Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take. We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/pv: Move exception injection into {,compat_}test_all_events()
This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.
The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Fix %dr6 handing in #DB handler
Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB. Have the handler follow the manuals guidance, which
avoids leaking hypervisor debugging activities into guest context.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 25 Apr 2018 12:52:52 +0000 (14:52 +0200)]
x86: fix slow int80 path after XPTI additions
For the int80 slow path to jump to handle_exception_saved, %r14 needs to
be set up suitably for XPTI purposes. This is because of the difference
in nature between the int80 path (which is synchronous WRT guest
actions) and the exception path which is potentially asynchronous.
This is XSA-259.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5a5c368faf45ced8a8c6235f4fbf5cdb38ec939f
master date: 2018-04-25 14:39:41 +0200
Anthony PERARD [Wed, 25 Apr 2018 12:52:35 +0000 (14:52 +0200)]
libxl: Specify format of inserted cdrom
Without this extra parameter on the QMP command, QEMU will guess the
format of the new file.
This is XSA-258.
Reported-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: d8f65e68a7c1047fad97206a6282c281247fadc2
master date: 2018-04-25 14:38:47 +0200
Jan Beulich [Thu, 22 Mar 2018 09:26:28 +0000 (10:26 +0100)]
x86/PV: also cover Dom0 in SPEC_CTRL / PRED_CMD emulation
Introduce a helper wrapping the pv_cpuid()-style domain_cpuid() /
cpuid_count() (or alike) invocations, and use it instead of plain
domain_cpuid() in MSR access emulation.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jason Andryuk <jandryuk@gmail.com>
Ross Lagerwall [Thu, 22 Mar 2018 09:25:44 +0000 (10:25 +0100)]
x86: Move microcode loading earlier
Move microcode loading earlier for the boot CPU and secondary CPUs so
that it takes place before identify_cpu() is called for each CPU.
Without this, the detected features may be wrong if the new microcode
loading adjusts the feature bits. That could mean that some fixes (e.g. d6e9f8d4f35d ("x86/vmx: fix vmentry failure with TSX bits in LBR"))
don't work as expected.
Previously during boot, the microcode loader was invoked for each
secondary CPU started and then again for each CPU as part of an
initcall. Simplify the code so that it is invoked exactly once for each
CPU during boot.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f97838bbd980a0104e16c4a12fbf514f9fa805f1
master date: 2017-04-19 17:08:01 +0100
Jason Andryuk [Tue, 20 Mar 2018 13:51:23 +0000 (14:51 +0100)]
x86/entry: Fix passing 6th argument for compat hypercalls
Commit ec05090403ef4d760fbe701e31afd0f0edc414d5 ("x86/entry: Erase guest
GPR state on entry to Xen") zero-ed %rbp, compat arg 6, but it is not
restored before passing to hypercalls. We need to pass the saved compat
arg 6 to the hypercall in r9, the 6th function argument.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 6 Mar 2018 15:27:02 +0000 (16:27 +0100)]
x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST
DO_OVERWRITE_RSB clobbers %rax, meaning in practice that the bti_ist_info
field gets zeroed. Older versions of this code had the DO_OVERWRITE_RSB
register selectable, so reintroduce this ability and use it to cause the
INTR_IST path to use %rdx instead.
The use of %dl for the %cs.rpl check means that when an IST interrupt hits
Xen, we try to load 1 into the high 32 bits of MSR_SPEC_CTRL, suffering a #GP
fault instead.
Also, drop an unused label which was a copy/paste mistake.
Andrew Cooper [Thu, 16 Nov 2017 21:10:00 +0000 (21:10 +0000)]
tools/libxc: Fix restoration of PV MSRs after migrate
There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
didn't test this bit of Migration v2 very well when writing it...
vcpu->msrsz is always expected to be a multiple of xen_domctl_vcpu_msr_t
records in a spec-compliant stream, so the modulo yields 0 for the msr_count,
rather than the actual number sent in the stream.
Passing 0 for the msr_count causes the hypercall to exit early, and hides the
fact that the guest handle is inserted into the wrong field in the domctl
union.
The reason that these bugs have gone unnoticed for so long is that the only
MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
in fairly modern hardware, and whose use doesn't appear to be implemented in
any contemporary PV guests.
Andrew Cooper [Thu, 30 Mar 2017 16:32:32 +0000 (17:32 +0100)]
tools/libxc: Avoid generating inappropriate zero-content records
The code as written attempted to elide zero-content records, as such records
serve no purpose but come with a performance hit. Unfortunately, in the case
where the hypervisor reported max size is non-zero, but the actual size is
zero, the record is not elided.
This previously tripped up the sanity checks in the restore side of migration,
but as the underlying reasons for eliding the records in the first place are
still valid, fix the elision logic.
Jan Beulich [Tue, 27 Feb 2018 13:37:39 +0000 (14:37 +0100)]
gnttab: don't blindly free status pages upon version change
There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
the second (when the first fully passed) freeing the pages. Return an
error if any pages still have pending references.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38bfcc165dda5f4284d7c218b91df9e144ddd88d
master date: 2018-02-27 14:07:12 +0100
Jan Beulich [Tue, 27 Feb 2018 13:37:19 +0000 (14:37 +0100)]
gnttab/ARM: don't corrupt shared GFN array
... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.
While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d2f8f9c65d4da35437f50ed9e812a2c5ab313e2
master date: 2018-02-27 14:04:44 +0100
Jan Beulich [Tue, 27 Feb 2018 13:36:46 +0000 (14:36 +0100)]
memory: don't implicitly unpin for decrease-reservation
It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.
This is XSA-252.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d798a0952903db9d8ee0a580e03f214d2b49b7d7
master date: 2018-02-27 14:03:27 +0100
Andrew Cooper [Fri, 23 Feb 2018 09:22:48 +0000 (10:22 +0100)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.
_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.
To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
A typical failure on the destination side looks like:
(XEN) HVM4 restore: CPU_MSR 0
(XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
(XEN) HVM4 restore: failed to load entry 20/0
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d2f86bf604698806d311cc251c1b66fbb752673c
master date: 2017-11-21 11:19:02 +0000
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs
The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPUs (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always use the boot CPU MIDR.
Fix is_affected_midr_range to use the current CPU MIDR.
Julien Grall [Fri, 2 Feb 2018 14:19:25 +0000 (14:19 +0000)]
xen/arm32: entry: Document the purpose of r11 in the traps handler
It took me a bit of time to understand why __DEFINE_TRAP_ENTRY is
storing the original stack pointer in r11. It is working in pair with
return_traps_entry where sp will be restored from r11.
This is fine because per the AAPCS r11 must be preserved by the
subroutine. So in return_from_trap, r11 will still contain the original
stack pointer.
Add some documentation in the code to point the 2 sides to each other.
Julien Grall [Fri, 2 Feb 2018 14:19:24 +0000 (14:19 +0000)]
xen/arm32: Invalidate icache on guest exist for Cortex-A15
In order to avoid aliasing attacks against the branch predictor on
Cortex A-15, let's invalidate the BTB on guest exit, which can only be
done by invalidating the icache (with ACTLR[0] being set).
We use the same hack as for A12/A17 to perform the vector decoding.
This is based on Linux patch from the kpti branch in [1].
Julien Grall [Fri, 2 Feb 2018 14:19:23 +0000 (14:19 +0000)]
xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12
In order to avoid aliasing attackes agains the branch predictor, let's
invalidate the BTB on guest exist. This is made complicated by the fact
that we cannot take a branch invalidating the BTB.
This is based on the fourth version posted by Marc Zyngier on Linux-arm
mailing list (see [1]).
Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Aliasing attacked against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initiatial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.
Most of mitigations will have to be applied when entering to the
hypervisor from the guest context.
Because the attack is against branch predictor, it is not possible to
safely use branch instruction before the mitigation is applied.
Therefore this has to be done in the vector entry before jump to the
helper handling a given exception.
However, on arm32, each vector contain a single instruction. This means
that the hardened vector tables may rely on the state of registers that
does not hold when in the hypervisor (e.g SP is 8 bytes aligned).
Therefore hypervisor code running with guest vectors table should be
minimized and always have IRQs and SErrors masked to reduce the risk to
use them.
This patch provides an infrastructure to switch vector tables before
entering to the guest and when leaving it.
Note that alternative could have been used, but older Xen (4.8 or
earlier) doesn't have support. So avoid using alternative to ease
backporting.
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry
At the moment, the reset vector is defined as .word 0 (e.g andeq r0, r0,
r0).
This is rather unintuitive and will result to execute the trap
undefined. Instead introduce trap helpers for reset and will generate an
error message in the unlikely case that reset will be called.
Julien Grall [Tue, 16 Jan 2018 14:23:37 +0000 (14:23 +0000)]
xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Cortex-A57, A72, A73 and A75 are susceptible to branch predictor
aliasing and can theoritically be attacked by malicious code.
This patch implements a PSCI-based mitigation for these CPUs when
available. The call into firmware will invalidate the branch predictor
state, preventing any malicious entries from affection other victim
contexts.
Ported from Linux git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
This is part of XSA-254.
Julien Grall [Tue, 16 Jan 2018 14:23:36 +0000 (14:23 +0000)]
xen/arm64: Add skeleton to harden the branch predictor aliasing attacks
Aliasing attacked against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initial skeleton code behind a new Kconfig option to
enable implementation-specific mitigations against these attacks for
CPUs that are affected.
Most of the mitigations will have to be applied when entering to the
hypervisor from the guest context. For safety, it is applied at every
exception entry. So there are potential for optimizing when receiving
an exception at the same level.
Because the attack is against branch predictor, it is not possible to
safely use branch instruction before the mitigation is applied.
Therefore, this has to be done in the vector entry before jump to the
helper handling a given exception.
On Arm64, each vector can hold 32 instructions. This leave us 31
instructions for the mitigation. The last one is the branch instruction
to the helper.
Because a platform may have CPUs with different micro-architectures,
per-CPU vector table needs to be provided. Realistically, only a few
different mitigations will be necessary. So provide a small set of
vector tables. They will be re-used and patch with the mitigations
on-demand.
This is based on the work done in Linux (see [1]).
Julien Grall [Tue, 16 Jan 2018 14:23:33 +0000 (14:23 +0000)]
xen/arm: Introduce enable callback to enable a capabilities on each online CPU
Once Xen knows what features/workarounds present on the platform, it
might be necessary to configure each online CPU.
Introduce a new callback "enable" that will be called on each online CPU to
configure the "capability".
The code is based on Linux v4.14 (where cpufeature.c comes from), the
explanation of why using stop_machine_run is kept as we have similar
problem in the future.
Lastly introduce enable_errata_workaround that will be called once CPUs
have booted and before the hardware domain is created.
xen/arm: Detect silicon revision and set cap bits accordingly
After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need from hypervisor code patches.
For each bug there is a function which checks if that a particular CPU is
affected. This needs to be done on every CPU to cover heterogenous
systems properly.
If a certain erratum has been detected, the capability bit will be set.
In the case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.
The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.
xen/arm: cpufeature: Provide an helper to check if a capability is supported
The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic to go through a set of capabilities
and find which one should be enabled.
The parameter "info" is used to display the kind of capability updated (e.g
workaround, feature...).
Julien Grall [Wed, 22 Jun 2016 11:15:18 +0000 (12:15 +0100)]
xen/arm: Add macros to handle the MIDR
Add new macros to easily get different parts of the register and to
check if a given MIDR match a CPU model range. The latter will be really
useful to handle errata later.
The macros have been imported from the header
arch/arm64/include/asm/cputype.h in Linux v4.6-rc3.
Andrew Cooper [Wed, 14 Feb 2018 12:44:33 +0000 (13:44 +0100)]
x86/idle: Clear SPEC_CTRL while idle
On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads. It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.
Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline. Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4c7e478d597b0346eef3a256cfd6794ac778b608
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:44:01 +0000 (13:44 +0100)]
x86/ctxt: Issue a speculation barrier between vcpu contexts
Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.
The cost of IBPB is substantial, and skipped on transition to idle, as Xen's
idle code is robust already. All transitions into vcpu context are fully
serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try and skip the flush.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a2ed643ed783020f885035432e9c0919756921d1
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:43:36 +0000 (13:43 +0100)]
x86/boot: Calculate the most appropriate BTI mitigation to use
See the logic and comments in init_speculation_mitigations() for further
details.
There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix determination of when to use IBRS
The original version of this logic was:
/*
* On Intel hardware, we'd like to use retpoline in preference to
* IBRS, but only if it is safe on this hardware.
*/
else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
{
if ( retpoline_safe() )
thunk = THUNK_RETPOLINE;
else
ibrs = true;
}
but it was changed by a request during review. Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.
This in practice means that on repoline-unsafe hardware without IBRS, we
select THUNK_JUMP despite intending to select THUNK_RETPOLINE.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2713715305ca516f698d58cec5e0b322c3b2c4eb
master date: 2018-01-26 14:10:21 +0000
master commit: 30cbd0c83ef3d0edac2d5bcc41a9a2b7a843ae58
master date: 2018-02-06 18:32:58 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:43:00 +0000 (13:43 +0100)]
x86/entry: Avoid using alternatives in NMI/#MC paths
This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.
For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.
Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing. This also covers
the safety of RSB conditional, as execution it is guaranteed to either hit the
WRMSR or LFENCE.
One downside of not using alternatives is that there unconditionally an LFENCE
in the IST path in cases where we are not using the features from IBRS-capable
microcode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3fffaf9c13e9502f09ad4ab1aac3f8b7b9398f6f
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:42:20 +0000 (13:42 +0100)]
x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code. Guests can take advantage of this in two ways:
1) If they can find a path in Xen which executes more ret instructions than
call instructions. (At least one in the waitqueue infrastructure,
probably others.)
2) Use the fact that the RSB/RAS in hardware is actually a circular stack
without a concept of empty. (When it logically empties, stale values
will start being used.)
To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e6c0128e9ab25bf66df11377a33ee5584d7f99e3
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:41:31 +0000 (13:41 +0100)]
x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context. See the documentation in
asm-x86/spec_ctrl_asm.h for details.
With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set. Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.
Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.
Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.
There is a semi-unrelated bugfix, where various asm_defn.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e7962901131186d3514528ed57c7a9901a15a3e
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:40:38 +0000 (13:40 +0100)]
x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
For performance reasons, HVM guests should have direct access to these MSRs
when possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5a2fe171144ebcc908ea1fca45058d6010f6a286
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:40:10 +0000 (13:40 +0100)]
x86/migrate: Move MSR_SPEC_CTRL on migrate
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0cf2a4eb769302b7d7d7835540e7b2f15006df30
master date: 2018-01-26 14:10:21 +0000
MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea58a679a6190e714a592f1369b660769a48a80c
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 12:39:01 +0000 (13:39 +0100)]
x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit. AMD specifies IBPB alone in a 3rd bit.
AMD's IBPB is a subset of Intel's combined IBRS/IBPB. For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.
The behaviour of STIBP is more complicated.
It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware. However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.
For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available. This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.
Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d297b56682e730d598e2529cc6998151d3b6f6f8
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 11:44:17 +0000 (12:44 +0100)]
x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB
Instead of gaining yet another top level boolean, introduce a more generic
cpuid= option. Also introduce a helper function to parse a generic boolean
value.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cmdline: Fix parse_boolean() for unadorned values
A command line such as "cpuid=no-ibrsb,no-stibp" tickles a bug in
parse_boolean() because the separating comma fails the NUL case.
Instead, check for slen == nlen which accounts for the boundary (if any)
passed via the 'e' parameter.
Andrew Cooper [Wed, 14 Feb 2018 11:43:37 +0000 (12:43 +0100)]
x86/feature: Definitions for Indirect Branch Controls
Contemporary processors are gaining Indirect Branch Controls via microcode
updates. Intel are introducing one bit to indicate IBRS and IBPB support, and
a second bit for STIBP. AMD are introducing IBPB only, so enumerate it with a
separate bit.
Furthermore, depending on compiler and microcode availability, we may want to
run Xen with IBRS set, or clear.
To use these facilities, we synthesise separate IBRS and IBPB bits for
internal use. A lot of infrastructure is required before these features are
safe to offer to guests.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 0d703a701cc4bc47773986b2796eebd28b1439b5
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 11:42:31 +0000 (12:42 +0100)]
x86/amd: Try to set lfence as being Dispatch Serialising
This property is required for the AMD's recommended mitigation for Branch
Target Injection, but Xen needs to cope with being unable to detect or modify
the MSR.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe3ee5530a8d0d0b6a478167125d00c40f294a86
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 11:41:00 +0000 (12:41 +0100)]
x86: Support indirect thunks from assembly code
Introduce INDIRECT_CALL and INDIRECT_JMP which either degrade to a normal
indirect branch, or dispatch to the __x86_indirect_thunk_* symbols.
Update all the manual indirect branches in to use the new thunks. The
indirect branches in the early boot and kexec path are left intact as we can't
use the compiled-in thunks at those points.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7c508612f7a5096b4819d4ef2ce566e01bd66c0c
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 11:40:13 +0000 (12:40 +0100)]
x86: Support compiling with indirect branch thunks
Use -mindirect-branch=thunk-extern/-mindirect-branch-register when available.
To begin with, use the retpoline thunk. Later work will add alternative
thunks which can be selected at boot time.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 3659f0f4bcc6ca08103d1a7ae4e97535ecc978be
master date: 2018-01-16 17:45:50 +0000