xenbits.xensource.com Git - xen.git/log
6 years ago  x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use
Andrew Cooper [Tue, 29 May 2018 09:08:58 +0000 (11:08 +0200)]
x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use

Almost all infrastructure is already in place.  Update the reserved bits
calculation in guest_wrmsr(), and offer SSBD to guests by default.
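
A sketch of the resulting check (illustrative; the real mask is derived
from the guest's CPUID policy, and the field names here are stand-ins):

    /* Which MSR_SPEC_CTRL bits may this guest set? */
    uint64_t mask = (cp->ibrsb ? (SPEC_CTRL_IBRS | SPEC_CTRL_STIBP) : 0) |
                    (cp->ssbd  ? SPEC_CTRL_SSBD : 0);

    if ( val & ~mask )
        goto gp_fault;               /* Reserved bits set: raise #GP. */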

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cd53023df952cf0084be9ee3d15a90f8837049c2
master date: 2018-05-21 14:20:06 +0100

6 years ago  x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
Andrew Cooper [Tue, 29 May 2018 09:08:34 +0000 (11:08 +0200)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass

To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:

 * A feature bit to indicate that Speculative Store Bypass Disable is
   supported.
 * A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
   in the pipeline.
 * A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
   indicating that the hardware is not susceptible to Speculative Store Bypass
   sidechannels.

For contemporary processors, this interface will be implemented via a
microcode update.
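
For reference, the architectural bit positions involved (the identifiers
below are illustrative; the specification defines the bits, not the names):

    #define SPEC_CTRL_SSBD    (1u << 2)  /* MSR_SPEC_CTRL: disable memory
                                            disambiguation when set.     */
    #define ARCH_CAPS_SSB_NO  (1u << 4)  /* MSR_ARCH_CAPABILITIES: not
                                            susceptible to Spec Store Bypass. */
    /* Support itself is enumerated by CPUID.(EAX=7,ECX=0):EDX[31]. */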

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9df52a25e0e95a0b9971aa2fc26c5c6a5cbdf4ef
master date: 2018-05-21 14:20:06 +0100

6 years ago  x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
Andrew Cooper [Tue, 29 May 2018 09:08:10 +0000 (11:08 +0200)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass

AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.

Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.

For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.
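
For example, appended to the hypervisor's line in a GRUB2 configuration
(illustrative; paths and the remaining options will vary):

    multiboot2 /boot/xen.gz ... spec-ctrl=ssbd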

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c0e338086f060eba31d37b83fbdb883928aa085
master date: 2018-05-21 14:20:06 +0100

6 years ago  x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
Andrew Cooper [Tue, 29 May 2018 09:07:29 +0000 (11:07 +0200)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`

In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.

One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains.  In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit.  To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.

For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings.  See the xen-command-line.markdown for
full details of the new options.

While changing the command line options, take the time to change how the data
is reported to the user.  The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.

Sample output from booting with `spec-ctrl=no-pv` looks like:

  (XEN) Speculative mitigation facilities:
  (XEN)   Hardware features: IBRS/IBPB STIBP IBPB
  (XEN)   Compiled-in support: INDIRECT_THUNK
  (XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS-, Other: IBPB
  (XEN)   Support for VMs: PV: None, HVM: MSR_SPEC_CTRL RSB
  (XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3352afc26c497d26ecb70527db3cb29daf7b1422
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/cpuid: Improvements to guest policies for speculative sidechannel features
Andrew Cooper [Tue, 29 May 2018 09:06:56 +0000 (11:06 +0200)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features

If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised.  It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.

Another useful option in some usecases is to offer IBPB without IBRS.  When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS.  This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.

Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb06b308ec71b23f37a44f5e2351fe2cae0306e9
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
Andrew Cooper [Tue, 29 May 2018 09:06:30 +0000 (11:06 +0200)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value

With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value.  Explicitly set Xen's default during the BSP
and AP boot paths.

For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so.  Oracle report that this
speeds up boots of some hardware by 50s.

"when safe to do so" is based on whether we are virtualised.  A native boot
won't have any other code running in a position to mount an attack.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb8c12020307b39a89273d7699e89000451987ab
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
Andrew Cooper [Tue, 29 May 2018 09:00:29 +0000 (11:00 +0200)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants

In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features is active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa9eb09d446a1279f5e861e6b84fa8675dabf148
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
Andrew Cooper [Tue, 29 May 2018 09:00:04 +0000 (11:00 +0200)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible

If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
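
A minimal sketch of the resulting logic (the xen_spec_ctrl field lives in
the per-CPU info block per the preceding patches; details are elided):

    /* Entering idle wants MSR_SPEC_CTRL == 0.  If Xen's own value is
     * already 0, the (expensive) WRMSR can be elided entirely. */
    static inline void spec_ctrl_enter_idle(struct cpu_info *info)
    {
        if ( info->xen_spec_ctrl != 0 )
            wrmsr(MSR_SPEC_CTRL, 0, 0);
    }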

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94df6e8588e35cc2028ccb3fd2921c6e6360605e
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
Andrew Cooper [Tue, 29 May 2018 08:59:39 +0000 (10:59 +0200)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT

In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.

Take the opportunity to fix some stale documentation in spec_ctrl_asm.h.  The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching.  Note this in the comment.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d9822b8a38114e96e4516dc998f4055249364d5d
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Andrew Cooper [Tue, 29 May 2018 08:59:11 +0000 (10:59 +0200)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together

Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.

There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af949407eaba7af71067f23d5866cd0bf1f1144d
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
Andrew Cooper [Tue, 29 May 2018 08:58:44 +0000 (10:58 +0200)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags

All 3 bits of information here are control flags for the entry/exit code
behaviour.  Treat them as such, rather than having two different variables.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5262ba2e7799001402dfe139ff944e035dfff928
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
Andrew Cooper [Tue, 29 May 2018 08:58:17 +0000 (10:58 +0200)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable

At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed.  One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.

Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66dfae0f32bfbc899c2f3446d5ee57068cb7f957
master date: 2018-05-16 12:19:10 +0100

6 years ago  x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Andrew Cooper [Tue, 29 May 2018 08:57:41 +0000 (10:57 +0200)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once

Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions.  Fix an RSBA typo while moving the
affected comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6c65187252a6c1810fd24c4d46f812840de8d3c
master date: 2018-05-16 12:19:10 +0100

6 years ago  xpti: fix bug in double fault handling
Juergen Gross [Fri, 18 May 2018 11:32:05 +0000 (13:32 +0200)]
xpti: fix bug in double fault handling

When entering the hypervisor via the double fault handler, resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following exception, so repair this in order to allow
exception handling to work even after a double fault.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d80af845de7a4db01a4a3b4d779e0e0dcb5e738b
master date: 2018-04-23 16:13:01 +0200

6 years ago  x86/spec_ctrl: Updates to retpoline-safety decision making
Andrew Cooper [Fri, 18 May 2018 11:31:33 +0000 (13:31 +0200)]
x86/spec_ctrl: Updates to retpoline-safety decision making

All of this is as recommended by the Intel whitepaper:

https://software.intel.com/sites/default/files/managed/1d/46/Retpoline-A-Branch-Target-Injection-Mitigation.pdf

The 'RSB Alternative' bit in MSR_ARCH_CAPABILITIES may be set by a hypervisor
to indicate that the virtual machine may migrate to a processor which isn't
retpoline-safe.  Introduce a shortened name (to reduce code volume), treat it
as authoritative in retpoline_safe(), and print its value along with the other
ARCH_CAPS bits.

The exact processor models which do have RSB semantics which fall back to BTB
predictions are enumerated, and include Kabylake and Coffeelake.  Leave a
printk() in the default case to help identify cases which aren't covered.

The exact microcode versions for Broadwell RSB-safety are taken from the
referenced microcode update file (adjusting for the known-bad microcode
versions).  Despite the exact wording of the text, it is only Broadwell
processors which need a microcode check.
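
A condensed sketch of the resulting decision (illustrative: the real
retpoline_safe() enumerates many more models, and the Broadwell microcode
revision check sits behind a hypothetical helper here):

    static bool retpoline_safe(uint64_t caps)
    {
        /* 'RSB Alternative' is authoritative: we may migrate to a
         * processor which isn't retpoline-safe. */
        if ( caps & ARCH_CAPS_RSBA )
            return false;

        switch ( boot_cpu_data.x86_model )
        {
        case 0x4e: case 0x5e:                /* Skylake */
        case 0x8e: case 0x9e:                /* Kabylake / Coffeelake */
            return false;                    /* RSB falls back to BTB. */

        case 0x3d: case 0x47: case 0x4f: case 0x56:  /* Broadwell */
            return broadwell_ucode_safe();   /* hypothetical helper */

        default:
            printk("Unrecognised CPU model %#x - assuming not retpoline safe\n",
                   boot_cpu_data.x86_model);
            return false;
        }
    }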

In practice, this means that all Broadwell hardware with up-to-date microcode
will use retpoline in preference to IBRS, which will be a performance
improvement for desktop and server systems which would previously always opt
for IBRS over retpoline.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix typo in ARCH_CAPS decode

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 1232378bd2fef45f613db049b33852fdf84d7ddf
master date: 2018-04-19 17:28:23 +0100
master commit: 27170adb54a558e11defcd51989326a9beb95afe
master date: 2018-04-24 13:34:12 +0100

6 years ago  x86/msr: Correct the emulation behaviour of MSR_PRED_CMD
Andrew Cooper [Fri, 18 May 2018 11:31:01 +0000 (13:31 +0200)]
x86/msr: Correct the emulation behaviour of MSR_PRED_CMD

Experimentally, the behaviour of reserved bits in MSR_PRED_CMD changed between
beta and production microcode, and now raises a #GP fault for set reserved
bits.  The AMD spec for future hardware also specifies this behaviour, and it
is the more sensible behaviour to implement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/msr: further correct the emulation behaviour of MSR_PRED_CMD

Following commit a6aa678fa3 ("x86/msr: Correct the emulation behaviour
of MSR_PRED_CMD") we may end up writing the low bit with the wrong
value. While it's unlikely for a guest to want to write zero there, we
should still permit this (without incurring the overhead of an actual
barrier). Correcting this right away will also help whenever further
bits in the MSR might become defined.
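
Taken together, the emulation ends up shaped roughly like this (a sketch;
PRED_CMD_IBPB is architecturally bit 0 of the MSR):

    case MSR_PRED_CMD:
        if ( val & ~PRED_CMD_IBPB )
            goto gp_fault;               /* Reserved bits set. */

        /* Permit writes of 0 without incurring a real barrier. */
        if ( val & PRED_CMD_IBPB )
            wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
        break;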

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a6aa678fa380e9369cc44701a181142322b3a4b0
master date: 2018-04-16 13:18:19 +0100
master commit: a996273d1fc10d14598985703227bfa35a91f681
master date: 2018-04-18 11:16:37 +0200

6 years ago  x86: suppress BTI mitigations around S3 suspend/resume
Jan Beulich [Fri, 18 May 2018 11:30:30 +0000 (13:30 +0200)]
x86: suppress BTI mitigations around S3 suspend/resume

NMI and #MC can occur at any time after S3 resume, yet MSR_SPEC_CTRL
may become available only once we've reloaded microcode. Make
SPEC_CTRL_ENTRY_FROM_INTR_IST and DO_SPEC_CTRL_EXIT_TO_XEN no-ops for
the critical period of time.

Also set the MSR back to its intended value.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: Use spec_ctrl_{enter,exit}_idle() in the S3/S5 path

The main purpose of this patch is to avoid opencoding the recovery logic at
the end, but it also has the positive side effect of relaxing the SPEC_CTRL
mitigations when working to shut the final CPU down.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 710a8ebf2bc111a34bba04d1c85b6d07ed3d9389
master date: 2018-04-16 14:09:55 +0200
master commit: ef3ab46493f650b7e5cca2b2578a99ca0cbff195
master date: 2018-04-19 10:55:59 +0100

6 years ago  x86: correct ordering of operations during S3 resume
Jan Beulich [Fri, 18 May 2018 11:30:05 +0000 (13:30 +0200)]
x86: correct ordering of operations during S3 resume

Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the
first place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As microcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().

cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb2a4a449dfd50af309a333aa805835015fbc8c8
master date: 2018-04-16 14:08:30 +0200

6 years ago  x86/pv: Protect multicalls against Spectre v2 - Branch Target Injection
Andrew Cooper [Fri, 18 May 2018 11:26:15 +0000 (13:26 +0200)]
x86/pv: Protect multicalls against Spectre v2 - Branch Target Injection

This is a missing adjustment in c/s 88602190f69 "x86: Support indirect thunks
from assembly code".

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years ago  x86/emul: Fix emulator test harness build following a backport of 7c508612
Andrew Cooper [Wed, 9 May 2018 17:06:46 +0000 (18:06 +0100)]
x86/emul: Fix emulator test harness build following a backport of 7c508612

The x86 emulator doesn't need to employ any Spectre v2 mitigations.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years ago  x86/emul: Fix emulator test harness build following the backport of ff555d59e8a
Andrew Cooper [Wed, 9 May 2018 15:54:32 +0000 (16:54 +0100)]
x86/emul: Fix emulator test harness build following the backport of ff555d59e8a

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years ago  x86/HVM: guard against emulator driving ioreq state in weird ways
Jan Beulich [Tue, 8 May 2018 17:28:20 +0000 (18:28 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways

In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop.  This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.

Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
  exception of the transition to STATE_IOREQ_NONE).

This is XSA-262.
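
A simplified sketch of one iteration of the hardened loop (illustrative;
p, d and prev_state come from the surrounding hvm_wait_for_io() context):

    unsigned int state = p->state;       /* read p->state exactly once  */

    if ( state == prev_state )
        continue;                        /* no transition observed yet  */

    /* Insist states only move "forward" (reset to NONE excepted). */
    if ( state < prev_state && state != STATE_IOREQ_NONE )
    {
        domain_crash(d);                 /* compromised emulator        */
        break;
    }

    prev_state = state;                  /* re-used by the next test    */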

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
7 years ago  x86/vpt: add support for IO-APIC routed interrupts
Xen Project Security Team [Mon, 23 Apr 2018 15:56:47 +0000 (16:56 +0100)]
x86/vpt: add support for IO-APIC routed interrupts

And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts, when not in legacy mode, should be
injected from the IO-APIC.

To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.

Since the HPET interrupt source can change between ISA and IO-APIC,
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.

Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worst case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.

This is part of XSA-261.

Address actual and potential compiler warnings. Fix formatting.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/traps: Fix handling of #DB exceptions in hypervisor context
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Fix handling of #DB exceptions in hypervisor context

The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting.  Swap it out for a ratelimited printk with just
enough information to work out what is going on.

Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take.  We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/traps: Use an Interrupt Stack Table for #DB
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Use an Interrupt Stack Table for #DB

PV guests can use architectural corner cases to cause #DB to be raised after
transitioning into supervisor mode.

Use an interrupt stack table for #DB to prevent the exception being taken with
a guest controlled stack pointer.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/pv: Move exception injection into {,compat_}test_all_events()
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/pv: Move exception injection into {,compat_}test_all_events()

This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.

The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/traps: Fix %dr6 handling in #DB handler
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Fix %dr6 handling in #DB handler

Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB.  Have the handler follow the manual's guidance, which
avoids leaking hypervisor debugging activities into guest context.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86: fix slow int80 path after XPTI additions
Jan Beulich [Wed, 25 Apr 2018 12:52:52 +0000 (14:52 +0200)]
x86: fix slow int80 path after XPTI additions

For the int80 slow path to jump to handle_exception_saved, %r14 needs to
be set up suitably for XPTI purposes. This is because of the difference
in nature between the int80 path (which is synchronous WRT guest
actions) and the exception path which is potentially asynchronous.

This is XSA-259.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5a5c368faf45ced8a8c6235f4fbf5cdb38ec939f
master date: 2018-04-25 14:39:41 +0200

7 years ago  libxl: Specify format of inserted cdrom
Anthony PERARD [Wed, 25 Apr 2018 12:52:35 +0000 (14:52 +0200)]
libxl: Specify format of inserted cdrom

Without this extra parameter on the QMP command, QEMU will guess the
format of the new file.

This is XSA-258.

Reported-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: d8f65e68a7c1047fad97206a6282c281247fadc2
master date: 2018-04-25 14:38:47 +0200

7 years ago  x86/PV: also cover Dom0 in SPEC_CTRL / PRED_CMD emulation
Jan Beulich [Thu, 22 Mar 2018 09:26:28 +0000 (10:26 +0100)]
x86/PV: also cover Dom0 in SPEC_CTRL / PRED_CMD emulation

Introduce a helper wrapping the pv_cpuid()-style domain_cpuid() /
cpuid_count() (or alike) invocations, and use it instead of plain
domain_cpuid() in MSR access emulation.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
7 years ago  x86: Move microcode loading earlier
Ross Lagerwall [Thu, 22 Mar 2018 09:25:44 +0000 (10:25 +0100)]
x86: Move microcode loading earlier

Move microcode loading earlier for the boot CPU and secondary CPUs so
that it takes place before identify_cpu() is called for each CPU.
Without this, the detected features may be wrong if the new microcode
loading adjusts the feature bits. That could mean that some fixes (e.g.
d6e9f8d4f35d ("x86/vmx: fix vmentry failure with TSX bits in LBR"))
don't work as expected.

Previously during boot, the microcode loader was invoked for each
secondary CPU started and then again for each CPU as part of an
initcall. Simplify the code so that it is invoked exactly once for each
CPU during boot.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f97838bbd980a0104e16c4a12fbf514f9fa805f1
master date: 2017-04-19 17:08:01 +0100

7 years ago  x86/entry: Fix passing 6th argument for compat hypercalls
Jason Andryuk [Tue, 20 Mar 2018 13:51:23 +0000 (14:51 +0100)]
x86/entry: Fix passing 6th argument for compat hypercalls

Commit ec05090403ef4d760fbe701e31afd0f0edc414d5 ("x86/entry: Erase guest
GPR state on entry to Xen") zero-ed %rbp, compat arg 6, but it is not
restored before passing to hypercalls.  We need to pass the saved compat
arg 6 to the hypercall in r9, the 6th function argument.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST
Andrew Cooper [Tue, 6 Mar 2018 15:27:02 +0000 (16:27 +0100)]
x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST

DO_OVERWRITE_RSB clobbers %rax, meaning in practice that the bti_ist_info
field gets zeroed.  Older versions of this code had the DO_OVERWRITE_RSB
register selectable, so reintroduce this ability and use it to cause the
INTR_IST path to use %rdx instead.

The use of %dl for the %cs.rpl check means that when an IST interrupt hits
Xen, we try to load 1 into the high 32 bits of MSR_SPEC_CTRL, suffering a #GP
fault instead.

Also, drop an unused label which was a copy/paste mistake.

Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2b08fbed388f18235fda5ba1655c1483ef3e215
master date: 2018-02-14 13:22:15 +0000

7 years ago  xen/arm: Flush TLBs before turning on the MMU to avoid stale entries
Julien Grall [Tue, 27 Feb 2018 11:15:57 +0000 (11:15 +0000)]
xen/arm: Flush TLBs before turning on the MMU to avoid stale entries

We don't know the state of the TLBs when booting Xen. To avoid
stale entries, it is necessary to flush the TLBs before turning on the
MMU.

Reported-by: Iain Hunter <iain@hunterembedded.co.uk>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 1c473c42199a8f4d70533c202e1c57ecd1dad35b)

7 years ago  tools/libxc: Fix restoration of PV MSRs after migrate
Andrew Cooper [Thu, 16 Nov 2017 21:10:00 +0000 (21:10 +0000)]
tools/libxc: Fix restoration of PV MSRs after migrate

There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
didn't test this bit of Migration v2 very well when writing it...

vcpu->msrsz is always expected to be a multiple of xen_domctl_vcpu_msr_t
records in a spec-compliant stream, so the modulo yields 0 for the msr_count,
rather than the actual number sent in the stream.

Passing 0 for the msr_count causes the hypercall to exit early, and hides the
fact that the guest handle is inserted into the wrong field in the domctl
union.
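
In sketch form (illustrative; buffer stands in for the mapped MSR blob),
the two fixes amount to:

    /* Bug 1: modulo of an exact multiple is always 0 ... */
    msr_count = vcpu->msrsz % sizeof(xen_domctl_vcpu_msr_t);
    /* ... where the record count was intended: */
    msr_count = vcpu->msrsz / sizeof(xen_domctl_vcpu_msr_t);

    /* Bug 2: the guest handle belongs in the vcpu_msrs member of the
     * domctl union, not whichever field was previously written. */
    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);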

The reason that these bugs have gone unnoticed for so long is that the only
MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
in fairly modern hardware, and whose use doesn't appear to be implemented in
any contemporary PV guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit f1a0a8c3fe2fb37c77ec1fe43618feef412427b5)
(cherry picked from commit 56d203b0f0e28a5d5935889587ca47a34606c556)
(cherry picked from commit 03f947472fde01f438ec057439d8d30456210a1c)
(cherry picked from commit bbd12188fa94640717deb6b4e6e4abc0b90843e3)

7 years ago  tools/libxc: Avoid generating inappropriate zero-content records
Andrew Cooper [Thu, 30 Mar 2017 16:32:32 +0000 (17:32 +0100)]
tools/libxc: Avoid generating inappropriate zero-content records

The code as written attempted to elide zero-content records, as such records
serve no purpose but come with a performance hit.  Unfortunately, in the case
where the hypervisor reported max size is non-zero, but the actual size is
zero, the record is not elided.

This previously tripped up the sanity checks in the restore side of migration,
but as the underlying reasons for eliding the records in the first place are
still valid, fix the elision logic.
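
Sketched with hypothetical helper names, the corrected elision logic is:

    if ( max_size == 0 )
        return 0;                /* hypervisor reports nothing to save */

    actual_size = query_record_content(xch, domid, buf);  /* hypothetical */
    if ( actual_size == 0 )
        return 0;                /* previously missed: max != 0, content empty */

    return write_record(ctx, &rec);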

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 72efb1df629421037e2795f5529210aaa95ec72e)
(cherry picked from commit c31070f3505fb12f78d5b67498c6b1e460209c9a)
(cherry picked from commit 60e129725abe6163e63e838121d8a1c608710a20)

7 years ago  x86: two fixes to Spectre v2 backports
Jan Beulich [Tue, 27 Feb 2018 13:38:18 +0000 (14:38 +0100)]
x86: two fixes to Spectre v2 backports

- convert another (importantish) indirect call
- check the full guest value for PV SPEC_CTRL writes

Signed-off-by: Jan Beulich <jbeulich@suse.com>
7 years ago  gnttab: don't blindly free status pages upon version change
Jan Beulich [Tue, 27 Feb 2018 13:37:39 +0000 (14:37 +0100)]
gnttab: don't blindly free status pages upon version change

There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
a second one (run only once the first has fully passed) freeing the
pages. Return an error if any pages still have pending references.
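
Structurally (a sketch; page_has_mappings() stands in for the real
reference checks):

    /* Pass 1: only inspect - nothing has been freed if we bail out. */
    for ( i = 0; i < nr_status_frames(gt); i++ )
        if ( page_has_mappings(gt->status[i]) )
            return -EBUSY;               /* still in active use */

    /* Pass 2: pass 1 fully succeeded, so freeing is now safe. */
    for ( i = 0; i < nr_status_frames(gt); i++ )
        free_xenheap_page(gt->status[i]);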

This is part of XSA-255.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38bfcc165dda5f4284d7c218b91df9e144ddd88d
master date: 2018-02-27 14:07:12 +0100

7 years ago  gnttab/ARM: don't corrupt shared GFN array
Jan Beulich [Tue, 27 Feb 2018 13:37:19 +0000 (14:37 +0100)]
gnttab/ARM: don't corrupt shared GFN array

... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.

While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.

This is part of XSA-255.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d2f8f9c65d4da35437f50ed9e812a2c5ab313e2
master date: 2018-02-27 14:04:44 +0100

7 years ago  memory: don't implicitly unpin for decrease-reservation
Jan Beulich [Tue, 27 Feb 2018 13:36:46 +0000 (14:36 +0100)]
memory: don't implicitly unpin for decrease-reservation

It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.

This is XSA-252.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d798a0952903db9d8ee0a580e03f214d2b49b7d7
master date: 2018-02-27 14:03:27 +0100

7 years ago  x86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Andrew Cooper [Fri, 23 Feb 2018 09:22:48 +0000 (10:22 +0100)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record

Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.

_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.

To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
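
A sketch of the fix (illustrative; HVM_CPU_MSR_SIZE() computes the bytes
needed for a given record count, and desc points at the descriptor
written earlier by _hvm_init_entry()):

    if ( ctxt->count )
    {
        /* The descriptor was sized for msr_count_max records; rewrite
         * its length to cover only what was actually written. */
        desc->length = HVM_CPU_MSR_SIZE(ctxt->count);
        h->cur += HVM_CPU_MSR_SIZE(ctxt->count);
    }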

A typical failure on the destination side looks like:

    (XEN) HVM4 restore: CPU_MSR 0
    (XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
    (XEN) HVM4 restore: failed to load entry 20/0

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d2f86bf604698806d311cc251c1b66fbb752673c
master date: 2017-11-21 11:19:02 +0000

7 years ago  x86/PV: correctly count MSRs to migrate
Jan Beulich [Fri, 23 Feb 2018 09:22:08 +0000 (10:22 +0100)]
x86/PV: correctly count MSRs to migrate

Signed-off-by: Jan Beulich <jbeulich@suse.com>
7 years ago  xen/arm: cpuerrata: Actually check errata on non-boot CPUs
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs

The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPUs (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always uses the boot CPU MIDR.

Fix is_affected_midr_range to use the current CPU MIDR.
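
The fix is essentially a one-word change (sketch; field names follow the
Linux-derived errata framework):

    static bool is_affected_midr_range(const struct arm_cpu_capabilities *entry)
    {
        /* Was: boot_cpu_data.midr.bits - always the boot CPU's MIDR. */
        return MIDR_IS_CPU_MODEL_RANGE(current_cpu_data.midr.bits,
                                       entry->midr_model,
                                       entry->midr_range_min,
                                       entry->midr_range_max);
    }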

Reported-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 27196d4cc917d91b5b5daee50173565139ca9c9d)

7 years ago  xen/arm32: entry: Document the purpose of r11 in the traps handler
Julien Grall [Fri, 2 Feb 2018 14:19:25 +0000 (14:19 +0000)]
xen/arm32: entry: Document the purpose of r11 in the traps handler

It took me a bit of time to understand why __DEFINE_TRAP_ENTRY is
storing the original stack pointer in r11. It works in tandem with
return_traps_entry, where sp will be restored from r11.

This is fine because per the AAPCS r11 must be preserved by the
subroutine. So in return_from_trap, r11 will still contain the original
stack pointer.

Add some documentation in the code to point the 2 sides to each other.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit dd855aa430f2da9b677c145f0c625a82aaa97110)

7 years ago  xen/arm32: Invalidate icache on guest exit for Cortex-A15
Julien Grall [Fri, 2 Feb 2018 14:19:24 +0000 (14:19 +0000)]
xen/arm32: Invalidate icache on guest exit for Cortex-A15

In order to avoid aliasing attacks against the branch predictor on
Cortex-A15, let's invalidate the BTB on guest exit, which can only be
done by invalidating the icache (with ACTLR[0] being set).

We use the same hack as for A12/A17 to perform the vector decoding.

This is based on Linux patch from the kpti branch in [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 665c4b6aa79eb21b1aada9f7f98fb5cb7f03743a)

7 years ago  xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12
Julien Grall [Fri, 2 Feb 2018 14:19:23 +0000 (14:19 +0000)]
xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12

In order to avoid aliasing attacks against the branch predictor, let's
invalidate the BTB on guest exit. This is made complicated by the fact
that we cannot take a branch invalidating the BTB.

This is based on the fourth version posted by Marc Zyngier on the
Linux-arm mailing list (see [1]).

This is part of XSA-254.

[1] https://www.spinics.net/lists/arm-kernel/msg632062.html

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 05e0690d03dc6177e614e060ae78001d4f2abde2)

Renamed trap_hypervisor_call to trap_supervisor_call

7 years ago  xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks

Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.

This patch adds initial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.

Most of the mitigations will have to be applied when entering the
hypervisor from the guest context.

Because the attack is against the branch predictor, it is not possible to
safely use a branch instruction before the mitigation is applied.
Therefore this has to be done in the vector entry before jump to the
helper handling a given exception.

However, on arm32, each vector contains a single instruction. This means
that the hardened vector tables may rely on register state that does not
hold when running in the hypervisor (e.g. SP is 8-byte aligned).
Therefore, hypervisor code running with the guest vector tables should be
minimized and should always have IRQs and SErrors masked to reduce the
risk of using them.

This patch provides an infrastructure to switch vector tables before
entering the guest and when leaving it.

Note that alternatives could have been used, but older Xen (4.8 or
earlier) doesn't support them, so avoid using alternatives to ease
backporting.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9bd4463b5c7cc026a07b9bbd41a6a7122a95647e)

7 years ago  xen/arm32: entry: Add missing trap_reset entry
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry

At the moment, the reset vector is defined as .word 0 (e.g. andeq r0, r0,
r0).

This is rather unintuitive and will result in executing the undefined
trap. Instead, introduce a trap helper for reset which will generate an
error message in the unlikely case that reset is called.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 00268cc91270c7b0aa3a1906bf7e7702db9c61c1)

Conflicts:
xen/arch/arm/arm32/traps.c

7 years ago  xen/arm32: Add missing MIDR values for Cortex-A17 and A12
Julien Grall [Fri, 2 Feb 2018 14:19:20 +0000 (14:19 +0000)]
xen/arm32: Add missing MIDR values for Cortex-A17 and A12

Cortex-A17 and A12 MIDR will be used in a follow-up patch for hardening
the branch predictor.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 340367bca5360f3e3d263341b58234d0efe5ced2)

7 years ago  xen/arm32: entry: Consolidate DEFINE_TRAP_ENTRY_* macros
Julien Grall [Wed, 7 Feb 2018 16:52:44 +0000 (08:52 -0800)]
xen/arm32: entry: Consolidate DEFINE_TRAP_ENTRY_* macros

The only difference between all the DEFINE_TRAP_ENTRY_* macros is the
set of interrupts (Asynchronous Abort, IRQ, FIQ) unmasked.

Rather than duplicating the code, introduce a __DEFINE_TRAP_ENTRY macro
that takes the list of interrupts to unmask.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 3bd8fd751e50dd981b7055fb33cdc8aa29537673)

7 years ago  xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Julien Grall [Tue, 16 Jan 2018 14:23:37 +0000 (14:23 +0000)]
xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs

Cortex-A57, A72, A73 and A75 are susceptible to branch predictor
aliasing and can theoretically be attacked by malicious code.

This patch implements a PSCI-based mitigation for these CPUs when
available. The call into firmware will invalidate the branch predictor
state, preventing any malicious entries from affecting other victim
contexts.

Ported from Linux git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit e730f8e41e8537f1db9770b9464f9523c28857b9)

Conflicts:
xen/arch/arm/cpuerrata.c

7 years ago  xen/arm64: Add skeleton to harden the branch predictor aliasing attacks
Julien Grall [Tue, 16 Jan 2018 14:23:36 +0000 (14:23 +0000)]
xen/arm64: Add skeleton to harden the branch predictor aliasing attacks

Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.

This patch adds initial skeleton code behind a new Kconfig option to
enable implementation-specific mitigations against these attacks for
CPUs that are affected.

Most of the mitigations will have to be applied when entering the
hypervisor from the guest context. For safety, they are applied at every
exception entry, so there is potential for optimizing when receiving an
exception at the same level.

Because the attack is against the branch predictor, it is not possible to
safely use a branch instruction before the mitigation is applied.
Therefore, this has to be done in the vector entry before jumping to the
helper handling a given exception.

On Arm64, each vector can hold 32 instructions. This leaves us 31
instructions for the mitigation. The last one is the branch instruction
to the helper.

Because a platform may have CPUs with different micro-architectures,
per-CPU vector tables need to be provided. Realistically, only a few
different mitigations will be necessary, so provide a small set of
vector tables. They will be re-used and patched with the mitigations
on demand.

This is based on the work done in Linux (see [1]).

This is part of XSA-254.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 4c4fddc166cf528aca49540bcc9ee4f196b01dac)

Conflicts:
xen/arch/arm/cpuerrata.c
xen/include/asm-arm/cpuerrata.h
xen/include/asm-arm/cpufeature.h
xen/arch/arm/Kconfig
xen/arch/arm/arm64/Makefile

7 years ago  xen/arm: cpuerrata: Add MIDR_ALL_VERSIONS
Julien Grall [Tue, 16 Jan 2018 14:23:35 +0000 (14:23 +0000)]
xen/arm: cpuerrata: Add MIDR_ALL_VERSIONS

Introduce a new macro MIDR_ALL_VERSIONS to match all variants/revisions of
a given CPU model.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit ba73070af43a38d200413f446d6a718e108867b6)

Conflicts:
xen/arch/arm/cpuerrata.c

7 years ago  xen/arm64: Add missing MIDR values for Cortex-A72, A73 and A75
Julien Grall [Tue, 16 Jan 2018 14:23:34 +0000 (14:23 +0000)]
xen/arm64: Add missing MIDR values for Cortex-A72, A73 and A75

Cortex-A72, A73 and A75 MIDR will be used in a follow-up patch for hardening
the branch predictor.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7975bff524c4e2c30efbf144de753f151d974e53)

Conflicts:
xen/include/asm-arm/processor.h

Add missing definitions of A15, A53, and A57.

7 years ago  xen/arm: Introduce enable callback to enable capabilities on each online CPU
Julien Grall [Tue, 16 Jan 2018 14:23:33 +0000 (14:23 +0000)]
xen/arm: Introduce enable callback to enable capabilities on each online CPU

Once Xen knows what features/workarounds are present on the platform, it
might be necessary to configure each online CPU.

Introduce a new callback "enable" that will be called on each online CPU to
configure the "capability".

The code is based on Linux v4.14 (where cpufeature.c comes from); the
explanation of why stop_machine_run is used is kept, as we may face a
similar problem in the future.

Lastly introduce enable_errata_workaround that will be called once CPUs
have booted and before the hardware domain is created.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7500495155aacce437878cb576f45224ae984f40)

Conflicts:
xen/include/asm-arm/cpufeature.h
xen/arch/arm/setup.c

7 years ago  xen/arm: Detect silicon revision and set cap bits accordingly
Julien Grall [Wed, 27 Jul 2016 16:37:07 +0000 (17:37 +0100)]
xen/arm: Detect silicon revision and set cap bits accordingly

After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need hypervisor code patches.

For each bug there is a function which checks whether a particular CPU is
affected. This needs to be done on every CPU to cover heterogeneous
systems properly.

If a certain erratum has been detected, the capability bit will be set.
In the case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.

The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 8b01f6364f50f3b416105cc5f1ee2ca4f13d43eb)

7 years ago  xen/arm: cpufeature: Provide a helper to check if a capability is supported
Julien Grall [Wed, 27 Jul 2016 16:37:06 +0000 (17:37 +0100)]
xen/arm: cpufeature: Provide a helper to check if a capability is supported

The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic helper to go through a set of
capabilities and find which ones should be enabled.

The parameter "info" is used to display the kind of capability updated
(e.g. workaround, feature, ...).

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 64eb606000f1258267f27e6819c59848e2226773)

7 years ago  xen/arm: Add cpu_hwcap bitmap
Julien Grall [Wed, 22 Jun 2016 11:15:19 +0000 (12:15 +0100)]
xen/arm: Add cpu_hwcap bitmap

This will be used to know if a feature which Xen cares about is available
across all the CPUs.

This code is a light version of arch/arm64/kernel/cpufeature.c from
Linux v4.6-rc3.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit bfb489629c0c5f36c60dec879383cbed4080a509)

Conflicts:
xen/arch/arm/Makefile

7 years ago  xen/arm: Add macros to handle the MIDR
Julien Grall [Wed, 22 Jun 2016 11:15:18 +0000 (12:15 +0100)]
xen/arm: Add macros to handle the MIDR

Add new macros to easily get different parts of the register and to
check if a given MIDR matches a CPU model range. The latter will be really
useful to handle errata later.

The macros have been imported from the header
arch/arm64/include/asm/cputype.h in Linux v4.6-rc3.

Also remove MIDR_MASK which is unused.
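
For reference, the architectural MIDR layout these macros decode
(definitions modelled on Linux's cputype.h):

    /* MIDR: [31:24] implementer, [23:20] variant, [19:16] architecture,
     *       [15:4] part number, [3:0] revision. */
    #define MIDR_REVISION(midr)     ((midr) & 0xf)
    #define MIDR_PARTNUM(midr)      (((midr) >> 4) & 0xfff)
    #define MIDR_VARIANT(midr)      (((midr) >> 20) & 0xf)
    #define MIDR_IMPLEMENTOR(midr)  (((midr) >> 24) & 0xff)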

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 96c53eaa8cd32f86700b749065eaa37bf4cdc24c)

7 years ago  x86/idle: Clear SPEC_CTRL while idle
Andrew Cooper [Wed, 14 Feb 2018 12:44:33 +0000 (13:44 +0100)]
x86/idle: Clear SPEC_CTRL while idle

On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads.  It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.

Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline.  Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4c7e478d597b0346eef3a256cfd6794ac778b608
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/ctxt: Issue a speculation barrier between vcpu contexts
Andrew Cooper [Wed, 14 Feb 2018 12:44:01 +0000 (13:44 +0100)]
x86/ctxt: Issue a speculation barrier between vcpu contexts

Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.

The cost of IBPB is substantial, and skipped on transition to idle, as Xen's
idle code is robust already.  All transitions into vcpu context are fully
serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try and skip the flush.
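
In sketch form (simplified; the real condition also accounts for IBPB
enumeration and the domain being switched to):

    /* Flush the BTB when switching into a new vcpu context ... */
    if ( opt_ibpb && !is_idle_vcpu(next) )
        wrmsrl(MSR_PRED_CMD, PRED_CMD_IBPB);
    /* ... but skip the substantial cost for transitions to idle. */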

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a2ed643ed783020f885035432e9c0919756921d1
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/boot: Calculate the most appropriate BTI mitigation to use
Andrew Cooper [Wed, 14 Feb 2018 12:43:36 +0000 (13:43 +0100)]
x86/boot: Calculate the most appropriate BTI mitigation to use

See the logic and comments in init_speculation_mitigations() for further
details.

There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix determination of when to use IBRS

The original version of this logic was:

    /*
     * On Intel hardware, we'd like to use retpoline in preference to
     * IBRS, but only if it is safe on this hardware.
     */
    else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
    {
        if ( retpoline_safe() )
            thunk = THUNK_RETPOLINE;
        else
            ibrs = true;
    }

but it was changed by a request during review.  Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.

This in practice means that on retpoline-unsafe hardware without IBRS, we
select THUNK_JUMP despite intending to select THUNK_RETPOLINE.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2713715305ca516f698d58cec5e0b322c3b2c4eb
master date: 2018-01-26 14:10:21 +0000
master commit: 30cbd0c83ef3d0edac2d5bcc41a9a2b7a843ae58
master date: 2018-02-06 18:32:58 +0000

7 years ago  x86/entry: Avoid using alternatives in NMI/#MC paths
Andrew Cooper [Wed, 14 Feb 2018 12:43:00 +0000 (13:43 +0100)]
x86/entry: Avoid using alternatives in NMI/#MC paths

This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.

For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.

Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing.  This also covers
the safety of the RSB conditional, as execution is guaranteed to hit either
the WRMSR or the LFENCE.

One downside of not using alternatives is that there is unconditionally an
LFENCE in the IST path in cases where we are not using the features from
IBRS-capable microcode.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3fffaf9c13e9502f09ad4ab1aac3f8b7b9398f6f
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
Andrew Cooper [Wed, 14 Feb 2018 12:42:20 +0000 (13:42 +0100)]
x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen

ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code.  Guests can take advantage of this in two ways:

  1) If they can find a path in Xen which executes more ret instructions than
     call instructions.  (At least one in the waitqueue infrastructure,
     probably others.)

  2) Use the fact that the RSB/RAS in hardware is actually a circular stack
     without a concept of empty.  (When it logically empties, stale values
     will start being used.)

To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e6c0128e9ab25bf66df11377a33ee5584d7f99e3
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
Andrew Cooper [Wed, 14 Feb 2018 12:41:31 +0000 (13:41 +0100)]
x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point

We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context.  See the documentation in
asm-x86/spec_ctrl_asm.h for details.

With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set.  Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.

Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.

Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.
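
As a sketch, the resulting ordering on the context switch path is (illustrative
only, not the actual assembly):

    wrmsrl(MSR_SPEC_CTRL, 0);   /* Xen's preferred value: IBRS clear */
    write_cr3(next_cr3);        /* the %cr3 write is cheaper with IBRS clear */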

There is a semi-unrelated bugfix, where various asm_defns.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e7962901131186d3514528ed57c7a9901a15a3e
master date: 2018-01-26 14:10:21 +0000

7 years agox86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
Andrew Cooper [Wed, 14 Feb 2018 12:40:38 +0000 (13:40 +0100)]
x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}

For performance reasons, HVM guests should have direct access to these MSRs
when possible.
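
Conceptually, once the guest's CPUID policy offers the feature, the MSR
intercepts are dropped so reads and writes no longer trap.  A hypothetical
sketch (guest_has_ibrsb() and disable_msr_intercept() are illustrative names,
not necessarily the real helpers):

    if ( guest_has_ibrsb(v) )
    {
        disable_msr_intercept(v, MSR_SPEC_CTRL);
        disable_msr_intercept(v, MSR_PRED_CMD);
    }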

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5a2fe171144ebcc908ea1fca45058d6010f6a286
master date: 2018-01-26 14:10:21 +0000

7 years agox86/migrate: Move MSR_SPEC_CTRL on migrate
Andrew Cooper [Wed, 14 Feb 2018 12:40:10 +0000 (13:40 +0100)]
x86/migrate: Move MSR_SPEC_CTRL on migrate

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0cf2a4eb769302b7d7d7835540e7b2f15006df30
master date: 2018-01-26 14:10:21 +0000

7 years agox86/msr: Emulation of MSR_{SPEC_CTRL,PRED_CMD} for guests
Andrew Cooper [Wed, 14 Feb 2018 12:39:40 +0000 (13:39 +0100)]
x86/msr: Emulation of MSR_{SPEC_CTRL,PRED_CMD} for guests

As per the spec currently available here:

https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
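
A sketch of the interim handling, in the style of an MSR dispatch switch
(details approximate):

    case MSR_ARCH_CAPABILITIES:
        /* Not implemented yet: fail the access so nothing leaks when
         * hardware defining the MSR arrives. */
        goto gp_fault;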

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea58a679a6190e714a592f1369b660769a48a80c
master date: 2018-01-26 14:10:21 +0000

7 years agox86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
Andrew Cooper [Wed, 14 Feb 2018 12:39:01 +0000 (13:39 +0100)]
x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests

Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit.  AMD specifies IBPB alone in a 3rd bit.

AMD's IBPB is a subset of Intel's combined IBRS/IBPB.  For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.

The behaviour of STIBP is more complicated.

It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware.  However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.

For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available.  This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.

Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d297b56682e730d598e2529cc6998151d3b6f6f8
master date: 2018-01-26 14:10:21 +0000

7 years agox86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB
Andrew Cooper [Wed, 14 Feb 2018 11:44:17 +0000 (12:44 +0100)]
x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB

Instead of gaining yet another top level boolean, introduce a more generic
cpuid= option.  Also introduce a helper function to parse a generic boolean
value.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cmdline: Fix parse_boolean() for unadorned values

A command line such as "cpuid=no-ibrsb,no-stibp" tickles a bug in
parse_boolean() because the separating comma fails the NUL case.

Instead, check for slen == nlen which accounts for the boundary (if any)
passed via the 'e' parameter.
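
A simplified sketch of the corrected helper (omitting the "name=value" form
for brevity):

    int parse_boolean(const char *name, const char *s, const char *e)
    {
        size_t slen, nlen;
        int val = !!strncmp(s, "no-", 3);

        if ( !val )
            s += 3;                               /* skip the "no-" prefix */

        slen = e ? (size_t)(e - s) : strlen(s);
        nlen = strlen(name);

        /* slen == nlen copes with ','- and NUL-terminated tokens alike. */
        if ( slen == nlen && !strncmp(s, name, nlen) )
            return val;

        return -1;                                /* not this option */
    }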

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7850b1c00749df834ea2ad0c1f5d9364c4838795
master date: 2018-01-16 17:45:50 +0000
master commit: ac37ec1ddef234eeba6f438c29ff687c64962ebd
master date: 2018-01-31 10:47:12 +0000

7 years agox86/feature: Definitions for Indirect Branch Controls
Andrew Cooper [Wed, 14 Feb 2018 11:43:37 +0000 (12:43 +0100)]
x86/feature: Definitions for Indirect Branch Controls

Contemporary processors are gaining Indirect Branch Controls via microcode
updates.  Intel are introducing one bit to indicate IBRS and IBPB support, and
a second bit for STIBP.  AMD are introducing IBPB only, so enumerate it with a
separate bit.

Furthermore, depending on compiler and microcode availability, we may want to
run Xen with IBRS set, or clear.

To use these facilities, we synthesise separate IBRS and IBPB bits for
internal use.  A lot of infrastructure is required before these features are
safe to offer to guests.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 0d703a701cc4bc47773986b2796eebd28b1439b5
master date: 2018-01-16 17:45:50 +0000

7 years agox86: Introduce alternative indirect thunks
Andrew Cooper [Wed, 14 Feb 2018 11:43:05 +0000 (12:43 +0100)]
x86: Introduce alternative indirect thunks

Depending on hardware and microcode availability, we will want to replace
IND_THUNK_RETPOLINE with other implementations.

For AMD hardware, choose IND_THUNK_LFENCE in preference to retpoline if lfence
is known to be (or was successfully made) dispatch serialising.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 858cba0d4c6b6b45180afcb41561fd6585ad51a3
master date: 2018-01-16 17:45:50 +0000

7 years agox86/amd: Try to set lfence as being Dispatch Serialising
Andrew Cooper [Wed, 14 Feb 2018 11:42:31 +0000 (12:42 +0100)]
x86/amd: Try to set lfence as being Dispatch Serialising

This property is required for AMD's recommended mitigation for Branch
Target Injection, but Xen needs to cope with being unable to detect or modify
the MSR.
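
The mechanism, per AMD's public guidance for this mitigation, is a bit in an
AMD-specific configuration MSR; a sketch using the names common in other open
source projects (not necessarily Xen's):

    #define MSR_AMD64_DE_CFG        0xc0011029
    #define DE_CFG_LFENCE_SERIALISE (1ULL << 1)

    uint64_t val;

    if ( rdmsr_safe(MSR_AMD64_DE_CFG, val) == 0 &&
         !(val & DE_CFG_LFENCE_SERIALISE) )
    {
        wrmsr_safe(MSR_AMD64_DE_CFG, val | DE_CFG_LFENCE_SERIALISE);
        rdmsr_safe(MSR_AMD64_DE_CFG, val);  /* re-read: the write may not stick */
    }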

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe3ee5530a8d0d0b6a478167125d00c40f294a86
master date: 2018-01-16 17:45:50 +0000

7 years agox86/boot: Report details of speculative mitigations
Andrew Cooper [Wed, 14 Feb 2018 11:41:53 +0000 (12:41 +0100)]
x86/boot: Report details of speculative mitigations

Nothing very interesting at the moment, but the logic will grow as new
mitigations are added.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 31d6c53adf6417bf449ca50e8416e41b64d46803
master date: 2018-01-16 17:45:50 +0000

7 years agox86: Support indirect thunks from assembly code
Andrew Cooper [Wed, 14 Feb 2018 11:41:00 +0000 (12:41 +0100)]
x86: Support indirect thunks from assembly code

Introduce INDIRECT_CALL and INDIRECT_JMP which either degrade to a normal
indirect branch, or dispatch to the __x86_indirect_thunk_* symbols.

Update all the manual indirect branches to use the new thunks.  The
indirect branches in the early boot and kexec path are left intact as we can't
use the compiled-in thunks at those points.
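
For reference, a thunk such as __x86_indirect_thunk_rax has the classic
retpoline shape; an illustrative sketch as a toplevel asm block:

    asm ( ".global __x86_indirect_thunk_rax\n"
          "__x86_indirect_thunk_rax:\n"
          "    call 2f\n"             /* push the address of the trap below */
          "1:  pause\n"
          "    lfence\n"
          "    jmp 1b\n"              /* capture rogue speculation */
          "2:  mov %rax, (%rsp)\n"    /* replace return address with the target */
          "    ret\n" );              /* architecturally: jump to %rax */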

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7c508612f7a5096b4819d4ef2ce566e01bd66c0c
master date: 2018-01-16 17:45:50 +0000

7 years agox86: Support compiling with indirect branch thunks
Andrew Cooper [Wed, 14 Feb 2018 11:40:13 +0000 (12:40 +0100)]
x86: Support compiling with indirect branch thunks

Use -mindirect-branch=thunk-extern/-mindirect-branch-register when available.
To begin with, use the retpoline thunk.  Later work will add alternative
thunks which can be selected at boot time.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 3659f0f4bcc6ca08103d1a7ae4e97535ecc978be
master date: 2018-01-16 17:45:50 +0000

7 years agocommon/wait: Clarifications to wait infrastructure
Andrew Cooper [Wed, 14 Feb 2018 11:39:16 +0000 (12:39 +0100)]
common/wait: Clarifications to wait infrastructure

This logic is not as clear as it could be.  Add some comments to help.

Rearrange the asm block in __prepare_to_wait() to separate the GPR
saving/restoring from the internal logic.

While tweaking, add an unreachable() following the jmp in
check_wakeup_from_wait().

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d1c82261d966735e82e5971eddb63ba3c565a37
master date: 2018-01-05 19:57:08 +0000

7 years agox86/entry: Erase guest GPR state on entry to Xen
Andrew Cooper [Wed, 14 Feb 2018 11:38:48 +0000 (12:38 +0100)]
x86/entry: Erase guest GPR state on entry to Xen

This reduces the number of code gadgets which can be attacked with arbitrary
guest-controlled GPR values.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 03bd8c3a70d101fc2f8f36f1e171b7594462a4cd
master date: 2018-01-05 19:57:08 +0000

7 years agox86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit
Andrew Cooper [Wed, 14 Feb 2018 11:38:07 +0000 (12:38 +0100)]
x86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit

No practical change.

One side effect in debug builds is that %rbp is inverted in the manner
expected by the stack unwinder to indicate an interrupt frame.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 13682ca8c94bd5612a44f7f1edc1fd8ff675dacb
master date: 2018-01-05 19:57:08 +0000

7 years agox86/entry: Rearrange RESTORE_ALL to restore registers in stack order
Andrew Cooper [Wed, 14 Feb 2018 11:37:33 +0000 (12:37 +0100)]
x86/entry: Rearrange RESTORE_ALL to restore registers in stack order

Results in a more predictable (i.e. linear) memory access pattern.

No functional change.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: f85d105e27735f0e20aa30d77f03774f3ed55ae5
master date: 2018-01-05 19:57:08 +0000

7 years agox86/alt: Introduce ALTERNATIVE{,_2} macros
Andrew Cooper [Wed, 14 Feb 2018 11:36:39 +0000 (12:36 +0100)]
x86/alt: Introduce ALTERNATIVE{,_2} macros

To help create alternative frames in assembly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4711428f5e2a9bfff9f8d75b6a696072118c19a4
master date: 2018-01-05 19:57:07 +0000

7 years agox86/alt: Break out alternative-asm into a separate header file
Andrew Cooper [Wed, 14 Feb 2018 11:35:39 +0000 (12:35 +0100)]
x86/alt: Break out alternative-asm into a separate header file

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 9d7b4351d3bb5c744db311cffa57ba3ebb583327
master date: 2018-01-05 19:57:07 +0000

7 years agox86/microcode: Add support for fam17h microcode loading
Tom Lendacky [Wed, 14 Feb 2018 11:33:54 +0000 (12:33 +0100)]
x86/microcode: Add support for fam17h microcode loading

The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes.  Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
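
The Linux-side change is essentially the following (names from the referenced
Linux commit; the other family-specific cases are elided):

    #define F17H_MPB_MAX_SIZE 3200  /* family 17h Microcode Patch Block size */

    switch ( family )
    {
    case 0x17:
        max_size = F17H_MPB_MAX_SIZE;
        break;
    default:
        max_size = F1XH_MPB_MAX_SIZE;   /* the 2048-byte default */
        break;
    }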

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Linux commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf]

Ported to Xen.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d458ba8c171809e8dd9abd19339c87f3f934ca
master date: 2017-12-13 14:30:10 +0000

7 years agox86: allow Meltdown band-aid to be disabled
Jan Beulich [Wed, 17 Jan 2018 16:32:03 +0000 (17:32 +0100)]
x86: allow Meltdown band-aid to be disabled

First of all we don't need it on AMD systems. Additionally allow its use
to be controlled by a command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve
the intended effect - while we likely want it, this will be a later
follow-up.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e871e80c38547d9faefc6604532ba3e985e65873
master date: 2018-01-16 17:50:59 +0100

7 years agox86: Meltdown band-aid against malicious 64-bit PV guests
Jan Beulich [Wed, 17 Jan 2018 16:31:16 +0000 (17:31 +0100)]
x86: Meltdown band-aid against malicious 64-bit PV guests

This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.

The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.

This isn't providing full isolation yet, but it should be covering all
pieces of information, exposure of which would otherwise require an XSA.

There is certainly much room for improvement, especially of performance,
here - first and foremost suppressing all the negative effects on AMD
systems. But in the interest of backportability (including to really old
hypervisors, which may not even have alternative patching) any such is
being left out here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
master date: 2018-01-16 17:49:03 +0100

7 years agox86/mm: Always set _PAGE_ACCESSED on L4e updates
Andrew Cooper [Wed, 17 Jan 2018 16:30:05 +0000 (17:30 +0100)]
x86/mm: Always set _PAGE_ACCESSED on L4e updates

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bd61fe94bee0556bc2f64999a4a8315b93f90f21
master date: 2018-01-15 13:53:16 +0000

7 years agox86/entry: Remove support for partial cpu_user_regs frames
Andrew Cooper [Wed, 17 Jan 2018 16:29:31 +0000 (17:29 +0100)]
x86/entry: Remove support for partial cpu_user_regs frames

Save all GPRs on entry to Xen.

The entry_int82() path is via a DPL1 gate, only usable by 32bit PV guests, so
can get away with only saving the 32bit registers.  All other entrypoints can
be reached from 32 or 64bit contexts.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f9eb74789af77e985ae653193f3622263499f674
master date: 2018-01-05 19:57:07 +0000

7 years agox86/paging: don't unconditionally BUG() on finding SHARED_M2P_ENTRY
Jan Beulich [Tue, 12 Dec 2017 14:09:11 +0000 (15:09 +0100)]
x86/paging: don't unconditionally BUG() on finding SHARED_M2P_ENTRY

PV guests can fully control the values written into the P2M.

This is XSA-251.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b4d0218cff66b7eaa9c9b8dc9bd71e7b089b016d
master date: 2017-12-12 14:30:17 +0100

7 years agox86/shadow: fix ref-counting error handling
Jan Beulich [Tue, 12 Dec 2017 14:08:46 +0000 (15:08 +0100)]
x86/shadow: fix ref-counting error handling

The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.

In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.

This is XSA-250.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 10be8001de7d87be1f0ccdda75cc70e922e56d03
master date: 2017-12-12 14:29:45 +0100

7 years agox86/shadow: fix refcount overflow check
Jan Beulich [Tue, 12 Dec 2017 14:08:13 +0000 (15:08 +0100)]
x86/shadow: fix refcount overflow check

Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.

Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
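
The shape of the fix is to derive the overflow check from the same constant
that sizes the bitfield, so the two cannot drift apart again (an illustrative
sketch; names approximate):

    #define PAGE_SH_REFCOUNT_WIDTH 25   /* width of the shadow refcount field */

    /* Refuse a new reference if the count is about to wrap: */
    if ( nx >= (1UL << PAGE_SH_REFCOUNT_WIDTH) )
        return 0;                       /* sh_get_ref() fails */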

This is XSA-249.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 54e2292e8df7a1a7b041192be9d6d797b6d00869
master date: 2017-12-12 14:29:13 +0100

7 years agox86/mm: don't wrongly set page ownership
Jan Beulich [Tue, 12 Dec 2017 14:07:47 +0000 (15:07 +0100)]
x86/mm: don't wrongly set page ownership

PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally.  At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.

While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.

There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.

This is XSA-248.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ff2a793e15bb0b6254bc849ef8e83e1c284c3583
master date: 2017-12-12 14:28:36 +0100

7 years agox86: don't wrongly trigger linear page table assertion (2)
Jan Beulich [Tue, 12 Dec 2017 14:07:17 +0000 (15:07 +0100)]
x86: don't wrongly trigger linear page table assertion (2)

_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the type which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).

This is part of XSA-240.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: e40b0219a8c77741ae48989efb520f4a762a5be3
master date: 2017-12-12 14:27:34 +0100

7 years agop2m: Check return value of p2m_set_entry() when decreasing reservation
George Dunlap [Tue, 28 Nov 2017 12:49:22 +0000 (13:49 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation

If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.

Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail.  In that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.

Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.

Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.

Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order.  Nonetheless,
check the return value and crash if our assumption turns out to be
wrong.
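
A sketch of the added error handling (signature and names approximate):

    if ( p2m_set_entry(p2m, gfn, INVALID_MFN, order,
                       p2m_invalid, p2m->default_access) )
    {
        /* With order > 0 the entry may be partially changed; crashing
         * the domain is the safest remaining option. */
        if ( order )
            domain_crash(d);
        return false;   /* caller falls back to guest_remove_page() */
    }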

This is part of XSA-247.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a3d64de8e86f5812917d2d0af28298f80debdf9a
master date: 2017-11-28 13:13:26 +0100

7 years agop2m: Always check to see if removing a p2m entry actually worked
George Dunlap [Tue, 28 Nov 2017 12:48:55 +0000 (13:48 +0100)]
p2m: Always check to see if removing a p2m entry actually worked

The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.

Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.

The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m.  If the guest
afterwards balloons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.

There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before.  Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.

While we're here, use PAGE_ORDER_2M rather than a magic constant.
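
The backstop pattern, roughly (names approximate):

    /* Rewriting an entry at its existing order is not expected to fail... */
    if ( p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_2M, type, access) )
    {
        ASSERT_UNREACHABLE();   /* flag the broken assumption on debug builds */
        domain_crash(d);        /* ...but never trust that in release builds */
    }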

This is part of XSA-247.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 92790672dedf2eab042e04ecc277c19d40fd348a
master date: 2017-11-28 13:13:03 +0100

7 years agox86/pod: prevent infinite loop when shattering large pages
Julien Grall [Tue, 28 Nov 2017 12:48:13 +0000 (13:48 +0100)]
x86/pod: prevent infinite loop when shattering large pages

When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).

p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated,
resulting in the callers retrying the PoD infinitely.

Prevent the infinite loop by returning false when it is not possible to
shatter the large mapping.

This is XSA-246.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a1c6c6768971ea387d7eba0803908ef0928b43ac
master date: 2017-11-28 13:11:55 +0100

7 years agox86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()
Andrew Cooper [Thu, 16 Nov 2017 11:14:57 +0000 (12:14 +0100)]
x86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()

The fix for XSA-243 / CVE-2017-15592 (c/s bf2b4eadcf379) introduced a change
in behaviour for sh_guess_wrmap(), where it had to cope with no shadow linear
mapping being present.

As the name suggests, guest_vtable is a mapping of the guest's pagetable, not
Xen's pagetable, meaning that it isn't the pagetable we need to check for the
shadow linear slot in.

The practical upshot is that a shadow HVM vcpu which switches into 4-level
paging mode, with an L4 pagetable that contains a mapping which aliases Xen's
SH_LINEAR_PT_VIRT_START, will fool the safety check for whether a SHADOW_LINEAR
mapping is present.  As the check passes (when it should have failed), Xen
subsequently falls over the missing mapping with a pagefault such as:

    (XEN) Pagetable walk from ffff8140a0503880:
    (XEN)  L4[0x102] = 000000046c218063 ffffffffffffffff
    (XEN)  L3[0x102] = 000000046c218063 ffffffffffffffff
    (XEN)  L2[0x102] = 000000046c218063 ffffffffffffffff
    (XEN)  L1[0x103] = 0000000000000000 ffffffffffffffff

This is part of XSA-243.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

7 years agox86: don't wrongly trigger linear page table assertion
Jan Beulich [Thu, 16 Nov 2017 11:14:13 +0000 (12:14 +0100)]
x86: don't wrongly trigger linear page table assertion

_put_page_type() may do multiple iterations until its cmpxchg()
succeeds. It invokes set_tlbflush_timestamp() on the first
iteration, however. Code inside the function takes care of this, but
- the assertion in _put_final_page_type() would trigger on the second
  iteration if time stamps in a debug build are permitted to be
  sufficiently much wider than the default 6 bits (see WRAP_MASK in
  flushtlb.c),
- it returning -EINTR (for a continuation to be scheduled) would leave
  the page in an inconsistent state (until the re-invocation completes).
Make the set_tlbflush_timestamp() invocation conditional, bypassing it
(for now) only in the case where we really can't tolerate the stamp being
stored.

This is part of XSA-240.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

7 years agox86/mm: Make PV linear pagetables optional
George Dunlap [Thu, 16 Nov 2017 11:13:11 +0000 (12:13 +0100)]
x86/mm: Make PV linear pagetables optional

Allowing pagetables to point to other pagetables of the same level
(often called 'linear pagetables') has been included in Xen since its
inception; but recently it has been the source of a number of subtle
reference-counting bugs.

It is not used by Linux or MiniOS; but it is used by NetBSD and Novell
Netware.  There are significant numbers of people who are never going
to use the feature, along with significant numbers who need the
feature.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

7 years agognttab: fix pin count / page reference race
Jan Beulich [Tue, 24 Oct 2017 14:51:51 +0000 (16:51 +0200)]
gnttab: fix pin count / page reference race

Dropping page references before decrementing pin counts is a bad idea
if assumptions are being made that a non-zero pin count implies a valid
page. Fix the order of operations in gnttab_copy_release_buf(), but at
the same time also remove the assertion that was found to trigger:
map_grant_ref() also has the potential of causing a race here, and
changing the order of operations there would likely be quite a bit more
involved.
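
The corrected ordering in gnttab_copy_release_buf(), in outline (field and
helper names approximate):

    /* Drop the pin first, so a non-zero pin count always implies a live page. */
    if ( buf->have_grant )
    {
        release_grant_for_copy(buf->domain, buf->ref, buf->read_only);
        buf->have_grant = 0;
    }
    if ( buf->have_page )
    {
        put_page(buf->page);    /* only now drop the page reference */
        buf->have_page = 0;
    }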

This is CVE-2017-15597 / XSA-236.

Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e008f7619dcd6d549727c9635b3f9f3c7ee483ed
master date: 2017-10-24 16:01:33 +0200

7 years agox86/cpu: Fix IST handling during PCPU bringup
Andrew Cooper [Thu, 12 Oct 2017 13:41:57 +0000 (15:41 +0200)]
x86/cpu: Fix IST handling during PCPU bringup

Clear IST references in newly allocated IDTs.  Nothing good will come of
having them set before the TSS is suitably constructed (although the chances
of the CPU surviving such an IST interrupt/exception are extremely slim).

Uniformly set the IST references after the TSS is in place.  This fixes an
issue on AMD hardware, where onlining a PCPU while PCPU0 is in HVM context
will cause IST_NONE to be copied into the new IDT, making that PCPU vulnerable
to privilege escalation from PV guests until it subsequently schedules an HVM
guest.
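
In outline, the fixed bringup sequence (helper and constant names approximate):

    /* Newly allocated IDT: no IST references until the TSS exists. */
    for ( i = 0; i < IDT_ENTRIES; i++ )
        set_ist(&idt[i], IST_NONE);

    /* ... construct and load the TSS ... */

    /* Only now is it safe to point entries back at their IST stacks: */
    set_ist(&idt[TRAP_double_fault],  IST_DF);
    set_ist(&idt[TRAP_nmi],           IST_NMI);
    set_ist(&idt[TRAP_machine_check], IST_MCE);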

This is XSA-244.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cc08c73c8c1f5ba5ed0f8274548db6725e1c3157
master date: 2017-10-12 14:50:31 +0200