Jan Beulich [Tue, 23 Jan 2018 09:43:39 +0000 (10:43 +0100)]
x86: move invocations of hvm_flush_guest_tlbs()
Their need is not tied to the actual flushing of TLBs, but the ticking
of the TLB clock. Make this more obvious by folding the two invocations
into a single one in pre_flush().
Also defer the latching of CR4 in write_cr3() until after pre_flush()
(and hence implicitly until after IRQs are off), making operation
sequence the same in both cases (eliminating the theoretical risk of
pre_flush() altering CR4). This then also improves register allocation,
as the compiler doesn't need to use a callee-saved register for "cr4"
anymore.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 29 May 2018 07:49:01 +0000 (09:49 +0200)]
x86/XPTI: fix S3 resume (and CPU offlining in general)
We should index an L1 table with an L1 index.
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6b9562dac1746014ab376bd2cf8ba400acf34c6d
master date: 2018-05-28 11:20:26 +0200
Andrew Cooper [Tue, 29 May 2018 07:48:21 +0000 (09:48 +0200)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.
One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains. In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit. To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.
For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings. See the xen-command-line.markdown for
full details of the new options.
While changing the command line options, take the time to change how the data
is reported to the user. The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.
Sample output from booting with `spec-ctrl=no-pv` looks like:
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3352afc26c497d26ecb70527db3cb29daf7b1422
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:47:57 +0000 (09:47 +0200)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features
If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised. It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.
Another useful option in some usecases is to offer IBPB without IBRS. When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS. This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.
Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb06b308ec71b23f37a44f5e2351fe2cae0306e9
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:47:32 +0000 (09:47 +0200)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value. Explicitly set Xen's default during the BSP
and AP boot paths.
For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so. Oracle report that this
speeds up boots of some hardware by 50s.
"when safe to do so" is based on whether we are virtualised. A native boot
won't have any other code running in a position to mount an attack.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb8c12020307b39a89273d7699e89000451987ab
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:47:01 +0000 (09:47 +0200)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa9eb09d446a1279f5e861e6b84fa8675dabf148
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:46:31 +0000 (09:46 +0200)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
Requested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94df6e8588e35cc2028ccb3fd2921c6e6360605e
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:45:57 +0000 (09:45 +0200)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.
Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d9822b8a38114e96e4516dc998f4055249364d5d
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:45:26 +0000 (09:45 +0200)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.
There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af949407eaba7af71067f23d5866cd0bf1f1144d
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:44:55 +0000 (09:44 +0200)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
All 3 bits of information here are control flags for the entry/exit code
behaviour. Treat them as such, rather than having two different variables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5262ba2e7799001402dfe139ff944e035dfff928
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:44:12 +0000 (09:44 +0200)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed. One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.
Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66dfae0f32bfbc899c2f3446d5ee57068cb7f957
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 07:43:28 +0000 (09:43 +0200)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions. Fix an RSBA typo while moving the
affected comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6c65187252a6c1810fd24c4d46f812840de8d3c
master date: 2018-05-16 12:19:10 +0100
Paul Durrant [Fri, 18 May 2018 10:01:31 +0000 (12:01 +0200)]
viridian: fix cpuid leaf 0x40000003
The response to viridian leaf 3 needs to split a 64-bit mask across EAX and
EBX, with the low order 32 bits in EAX and the high order 32 bits in EBX.
To facilitate this a union of two uint32_t values and the mask (type
HV_PARTITION_PRIVILEGE_MASK) is allocated on stack as follows:
union {
HV_PARTITION_PRIVILEGE_MASK mask;
uint32_t lo, hi;
} u;
This, of course, is incorrect as both lo and hi will alias the low order
32 bits of the mask.
This patch wraps lo and hi in an anonmymous struct to achieve the desired
effect.
NOTE: Fixing this also stops Windows making the HvGetPartitionId hypercall
which was previously considered erroneous behaviour. Thus the
hypercall handler is also modified to stop squashing the
'unimplemented' warning for this hypercall.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 29fc0493d8eabdd63f5bbff9e3069253053addca
master date: 2018-05-14 12:57:13 +0100
New versions of iasl have introduced improved C file generation, as
reported in the changelog:
iASL: Enhanced the -tc option (which creates an AML hex file in C,
suitable for import into a firmware project):
1) Create a unique name for the table, to simplify use of multiple
SSDTs.
2) Add a protection #ifdef in the file, similar to a .h header file.
Andrew Cooper [Fri, 18 May 2018 10:00:15 +0000 (12:00 +0200)]
x86/pv: Hide more EFER bits from PV guests
We don't advertise SVM in CPUID so a PV guest shouldn't be under the
impression that it can use SVM functionality, but despite this, it really
shouldn't see SVME set when reading EFER.
On Intel processors, 32bit PV guests don't see, and can't use SYSCALL.
Introduce EFER_KNOWN_MASK to whitelist the features Xen knows about, and use
this to clamp the guests view.
Take the opportunity to reuse the mask to simplify svm_vmcb_isvalid(), and
change "undefined" to "unknown" in the print message, as there is at least
EFER.TCE (Translation Cache Extension) defined but unknown to Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 589263031c04e2ba527783b4e04e8df27d364769
master date: 2018-05-07 11:52:57 +0100
George Dunlap [Fri, 18 May 2018 09:59:40 +0000 (11:59 +0200)]
xen/schedule: Fix races in vcpu migration
The current sequence to initiate vcpu migration is inefficent and error-prone:
- The initiator sets VPF_migraging with the lock held, then drops the
lock and calls vcpu_sleep_nosync(), which immediately grabs the lock
again
- A number of places unnecessarily check for v->pause_flags in between
those two
- Every call to vcpu_migrate() must be prefaced with
vcpu_sleep_nosync() or introduce a race condition; this code
duplication is error-prone
- In the event that v->is_running is true at the beginning of
vcpu_migrate(), it's almost certain that vcpu_migrate() will end up
being called in context_switch() as well; we might as well simply
let it run there and save the duplicated effort (which will be
non-negligible).
The result is that Credit1 has several races which result in runqueue
<-> v->processor invariants being violated (triggering ASSERTs in
debug builds and strange bugs in production builds).
Instead, introduce vcpu_migrate_start() to initiate the process.
vcpu_migrate_start() is called with the scheduling lock held. It not
only sets VPF_migrating, but also calls vcpu_sleep_nosync_locked()
(which will automatically do nothing if there's nothing to do).
Rename vcpu_migrate() to vcpu_migrate_finish(). Check for v->is_running and
pause_flags & VPF_migrating at the top and return if appropriate.
Then the way to initiate migration is consistently:
George Dunlap [Fri, 18 May 2018 09:59:09 +0000 (11:59 +0200)]
xen: Introduce vcpu_sleep_nosync_locked()
There are a lot of places which release a lock before calling
vcpu_sleep_nosync(), which then just grabs the lock again. This is
not only a waste of time, but leads to more code duplication (since
you have to copy-and-paste recipes rather than calling a unified
function), which in turn leads to an increased chance of bugs.
Introduce vcpu_sleep_nosync_locked(), which can be called if you
already hold the schedule lock.
That is, Dom0's ACPI processor driver is in the process of uploading Px
and Cx data. Looking at the ticket lock state in CPU1's registers, it is
waiting for ticket 0x0000 to have its turn, while the supposed current
owner's ticket is 0x0001, which is an invalid state (and neither of the
other two CPUs holds the lock anyway). Hence I can only conclude that
cpuidle_init_cpu(1) ran on CPU 0 while some other CPU held the lock (the
unlock then put the lock in the state that CPU1 is observing).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2f64a251fa10dd4d62f84967e3dafa709f5e96ab
master date: 2018-04-27 14:35:35 +0200
Andrew Cooper [Fri, 18 May 2018 09:58:01 +0000 (11:58 +0200)]
x86/SVM: Fix intercepted {RD,WR}MSR for the SYS{CALL,ENTER} MSRs
By default, the SYSCALL MSRs are not intercepted, and accesses are completed
by hardware. The SYSENTER MSRs are intercepted for cross-vendor
purposes (albeit needlessly in the common case), and are fully emulated.
However, {RD,WR}MSR instructions which happen to be emulated (FEP,
introspection, or older versions of Xen which intercepted #UD), or when the
MSRs are explicitly intercepted (introspection), will be completed
incorrectly.
svm_msr_read_intercept() appears to return the correct values, but only
because of the default read-everything case (which is going to disappear), and
that in vcpu context, hardware should have the guest values in context.
Update the read path to explicitly sync the VMCB and complete the accesses,
rather than falling all the way through to the default case.
svm_msr_write_intercept() silently discard all updates. Synchronise the VMCB
for all applicable MSRs, and implement suitable checks. The actual behaviour
of AMD hardware is to truncate the SYSENTER and SFMASK MSRs at 32 bits, but
this isn't implemented yet to remain compatible with the cross-vendor case.
Drop one bit of trailing whitespace while modifing this area of the code.
Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: c04c1866e5131e450ddcd114e32401477c60b816
master date: 2018-04-25 13:08:13 +0100
Juergen Gross [Fri, 18 May 2018 09:57:06 +0000 (11:57 +0200)]
xpti: fix bug in double fault handling
When entering the hypervisor via the double fault handler resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following exception, so repair this in order to allow
exception handling to work even after a double fault.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d80af845de7a4db01a4a3b4d779e0e0dcb5e738b
master date: 2018-04-23 16:13:01 +0200
Jan Beulich [Fri, 18 May 2018 09:56:29 +0000 (11:56 +0200)]
x86/HVM: never retain emulated insn cache when exiting back to guest
Commit 5fcb26e69e ("x86/HVM: don't retain emulated insn cache when
exiting back to guest") didn't go quite far enough: The insn emulator
may itself decide to return X86EMUL_RETRY (currently for certain
CMPXCHG failures and AVX2 gather insns), in which case we'd also exit
back to guest context. Tie the caching to whether we have an I/O
completion pending, instead of x86_emulate()'s return value.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 25b0dad541e31bd892d57cbeafe8e0c0bf4e8385
master date: 2018-04-23 11:01:09 +0200
CPUs may share an in-use channel. Hence clearing of a bit from the
cpumask (in hpet_broadcast_exit()) as well as setting one (in
hpet_broadcast_enter()) must not race evaluation of that same cpumask.
Therefore avoid evaluating the cpumask twice in hpet_detach_channel().
Otherwise cpumask_empty() may e.g.return false while the subsequent
cpumask_first() could return nr_cpu_ids, which then triggers the
assertion in cpumask_of() reached through set_channel_irq_affinity().
Signed-off-by: David Wang <davidwang@zhaoxin.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c02a19230502a9522b097ee15742599091064aa
master date: 2018-04-23 11:00:07 +0200
The 'RSB Alternative' bit in MSR_ARCH_CAPABILITIES may be set by a hypervisor
to indicate that the virtual machine may migrate to a processor which isn't
retpoline-safe. Introduce a shortened name (to reduce code volume), treat it
as authorative in retpoline_safe(), and print its value along with the other
ARCH_CAPS bits.
The exact processor models which do have RSB semantics which fall back to BTB
predictions are enumerated, and include Kabylake and Coffeelake. Leave a
printk() in the default case to help identify cases which aren't covered.
The exact microcode versions from Broadwell RSB-safety are taken from the
referenced microcode update file (adjusting for the known-bad microcode
versions). Despite the exact wording of the text, it is only Broadwell
processors which need a microcode check.
In practice, this means that all Broadwell hardware with up-to-date microcode
will use retpoline in preference to IBRS, which will be a performance
improvement for desktop and server systems which would previously always opt
for IBRS over retpoline.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix typo in ARCH_CAPS decode
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 1232378bd2fef45f613db049b33852fdf84d7ddf
master date: 2018-04-19 17:28:23 +0100
master commit: 27170adb54a558e11defcd51989326a9beb95afe
master date: 2018-04-24 13:34:12 +0100
Andrew Cooper [Fri, 18 May 2018 09:54:43 +0000 (11:54 +0200)]
x86/pv: Introduce and use x86emul_write_dr()
set_debugreg() has several bugs:
* %dr4/5 should function correctly as aliases of %dr6/7 when CR4.DE is clear.
* Attempting to set the upper 32 bits of %dr6/7 should fail with #GP[0]
rather than be silently corrected and complete.
* For emulation, the #UD and #GP[0] cases need properly distinguishing. Use
-ENODEV for #UD cases, leaving -EINVAL (bad bits) and -EPERM (not allowed to
use that valid bit) as before for hypercall callers.
* A write which clears %dr7.L/G leaves the IO shadow intact, meaning that
subsequent reads of %dr7 will see stale IO watchpoint configuration.
Implement x86emul_write_dr() as a thin wrapper around set_debugreg().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f539ae27061c6811fd5e80e0755bf0514e22b977
master date: 2018-04-17 15:12:36 +0100
Andrew Cooper [Fri, 18 May 2018 09:54:05 +0000 (11:54 +0200)]
x86/pv: Introduce and use x86emul_read_dr()
do_get_debugreg() has several bugs:
* The %cr4.de condition is inverted. %dr4/5 should be accessible only when
%cr4.de is disabled.
* When %cr4.de is disabled, emulation should yield #UD rather than complete
with zero.
* Using -EINVAL for errors is a broken ABI, as it overlaps with valid values
near the top of the address space.
Introduce a common x86emul_read_dr() handler (as we will eventually want to
add HVM support) which separates its success/failure indication from the data
value, and have do_get_debugreg() call into the handler.
The ABI of do_get_debugreg() remains broken, but switches from -EINVAL to
-ENODEV for compatibility with the changes in the following patch.
Take the opportunity to add a missing local variable block to x86_emulate.c
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 881f8dc4314809293efc6f66f9af49734994bf0e
master date: 2018-04-17 15:12:36 +0100
Jan Beulich [Fri, 18 May 2018 09:53:13 +0000 (11:53 +0200)]
x86: suppress BTI mitigations around S3 suspend/resume
NMI and #MC can occur at any time after S3 resume, yet the MSR_SPEC_CTRL
may become available only once we're reloaded microcode. Make
SPEC_CTRL_ENTRY_FROM_INTR_IST and DO_SPEC_CTRL_EXIT_TO_XEN no-ops for
the critical period of time.
Also set the MSR back to its intended value.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: Use spec_ctrl_{enter,exit}_idle() in the S3/S5 path
The main purpose of this patch is to avoid opencoding the recovery logic at
the end, but also has the positive side effect of relaxing the SPEC_CTRL
mitigations when working to shut the final CPU down.
Jan Beulich [Fri, 18 May 2018 09:52:29 +0000 (11:52 +0200)]
x86: correct ordering of operations during S3 resume
Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the
first place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As micrcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().
cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb2a4a449dfd50af309a333aa805835015fbc8c8
master date: 2018-04-16 14:08:30 +0200
Jan Beulich [Tue, 8 May 2018 17:18:58 +0000 (18:18 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways
In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop. This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.
Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
exception of the transition to STATE_IOREQ_NONE).
This is XSA-262.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/vpt: add support for IO-APIC routed interrupts
And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.
To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.
Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.
Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worse case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.
This is part of XSA-261.
Address actual and potential compiler warnings. Fix formatting.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:16:37 +0000 (18:16 +0100)]
x86/traps: Fix handling of #DB exceptions in hypervisor context
The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting. Swap it out for a ratelimited printk with just
enough information to work out what is going on.
Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take. We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:16:37 +0000 (18:16 +0100)]
x86/pv: Move exception injection into {,compat_}test_all_events()
This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.
The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:16:37 +0000 (18:16 +0100)]
x86/traps: Fix %dr6 handing in #DB handler
Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB. Have the handler follow the manuals guidance, which
avoids leaking hypervisor debugging activities into guest context.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 25 Apr 2018 12:47:02 +0000 (14:47 +0200)]
x86: fix slow int80 path after XPTI additions
For the int80 slow path to jump to handle_exception_saved, %r14 needs to
be set up suitably for XPTI purposes. This is because of the difference
in nature between the int80 path (which is synchronous WRT guest
actions) and the exception path which is potentially asynchronous.
This is XSA-259.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5a5c368faf45ced8a8c6235f4fbf5cdb38ec939f
master date: 2018-04-25 14:39:41 +0200
Anthony PERARD [Wed, 25 Apr 2018 12:46:44 +0000 (14:46 +0200)]
libxl: Specify format of inserted cdrom
Without this extra parameter on the QMP command, QEMU will guess the
format of the new file.
This is XSA-258.
Reported-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: d8f65e68a7c1047fad97206a6282c281247fadc2
master date: 2018-04-25 14:38:47 +0200
Andrew Cooper [Wed, 18 Apr 2018 14:43:23 +0000 (16:43 +0200)]
x86/msr: Correct the emulation behaviour of MSR_PRED_CMD
Experimentally, the behaviour of reserved bits in MSR_PRED_CMD changed between
beta and production microcode, and now raises a #GP fault for set reserved
bits. The AMD spec for future hardware also specifies this behaviour, and it
is the more sensible behaviour to implement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/msr: further correct the emulation behaviour of MSR_PRED_CMD
Following commit a6aa678fa3 ("x86/msr: Correct the emulation behaviour
of MSR_PRED_CMD") we may end up writing the low bit with the wrong
value. While it's unlikely for a guest to want to write zero there, we
should still permit (this without incurring the overhead of an actual
barrier). Correcting this right away will also help whenever further
bits in the MSR might become defined.
Jan Beulich [Wed, 18 Apr 2018 14:42:17 +0000 (16:42 +0200)]
x86/HVM: suppress I/O completion for port output
We don't break up port requests in case they cross emulation entity
boundaries, and a write to an I/O port is necessarily the last
operation of an instruction instance, so there's no need to re-invoke
the full emulation path upon receiving the result from an external
emulator.
In case we want to properly split port accesses in the future, this
change will need to be reverted, as it would prevent things working
correctly when e.g. the first part needs to go to an external emulator,
while the second part is to be handled internally.
While this addresses the reported problem of Windows paging out the
buffer underneath an in-process REP OUTS, it does not address the wider
problem of the re-issued insn (to the insn emulator) being prone to
raise an exception (#PF) during a replayed, previously successful memory
access (we only record prior MMIO accesses).
Leaving aside the problem tried to be worked around here, I think the
performance aspect alone is a good reason to change the behavior.
Also take the opportunity and change bool_t -> bool as
hvm_vcpu_io_need_completion()'s return type.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 91afb8139f954a06e564d4915bc7d6a8575e2812
master date: 2018-04-11 10:42:24 +0200
Andrew Cooper [Wed, 18 Apr 2018 14:41:47 +0000 (16:41 +0200)]
x86/pv: Fix up erroneous segments for 32bit syscall entry
The existing FLAT_KERNEL_SS expands to the correct value, 0xe02b, but is the
wrong constant to use. Switch to FLAT_USER_SS32.
For compat domains however, the reported values are entirely bogus.
FLAT_USER_SS32 (value 0xe02b) is FLAT_RING3_CS in the 32bit ABI, while
FLAT_USER_CS32 (value 0xe023) is FLAT_RING1_DS with an RPL of 3.
The guests SYSCALL callback is invoked with a broken iret frame, and if left
unmodified by the guest, will fail on the way back out when Xen's iret tries
to load a code segment into %ss.
In practice, this is only a problem for 32bit PV guests on AMD hardware, as
Intel hardware doesn't permit the SYSCALL instruction outside of 64bit mode.
This appears to have been broken ever since 64bit support was added to Xen,
and has gone unnoticed because Linux doesn't use SYSCALL in 32bit builds.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dba899de14989b3dff78009404ed891da7fefdc1
master date: 2018-04-09 13:12:18 +0100
Jan Beulich [Wed, 18 Apr 2018 14:41:16 +0000 (16:41 +0200)]
x86/XPTI: reduce .text.entry
This exposes less code pieces and at the same time reduces the range
covered from slightly above 3 pages to a little below 2 of them.
The code being moved is unchanged, except for the removal of trailing
blanks, insertion of blanks between operands, and a pointless q suffix
from "retq".
A few more small pieces could be moved, but it seems better to me to
leave them where they are to not make it overly hard to follow code
paths.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 454efb2a31b64b98e3dd55c083ce41b87375faa6
master date: 2018-04-05 15:48:23 +0100
Jan Beulich [Wed, 18 Apr 2018 14:40:50 +0000 (16:40 +0200)]
x86: log XPTI enabled status
At the same time also report the state of the two defined
ARCH_CAPABILITIES MSR bits. To avoid further complicating the
conditional around that printk(), drop it (it's a debug level one only
anyway).
Issue the main message without any XENLOG_*, and also drop XENLOG_INFO
from the respective BTI message, to make sure they're visible at default
log level also in release builds.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 442b303cdaf7d774c0be8096fe5dbab68701abd3
master date: 2018-04-05 15:48:23 +0100
Andrew Cooper [Wed, 18 Apr 2018 14:39:38 +0000 (16:39 +0200)]
x86/pv: Fix the handing of writes to %dr7
c/s 65e35549 "x86/PV: support data breakpoint extension registers"
accidentally broke the handing of writes. The call to activate_debugregs()
doesn't write %dr7 as v->arch.debugreg[7] hasn't been updated yet, and the
break skips the intended write to %dr7.
Remove the break, causing execution to hit the write_debugreg(7, value); in
context at the bottom of the hunk, which in turn causes hardware to be updated
appropriately.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: adf8feba1afa040f3a84a82953e18af02060884a
master date: 2018-03-29 15:12:21 +0100
Andrew Cooper [Wed, 18 Apr 2018 14:38:50 +0000 (16:38 +0200)]
x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST
DO_OVERWRITE_RSB clobbers %rax, meaning in practice that the bti_ist_info
field gets zeroed. Older versions of this code had the DO_OVERWRITE_RSB
register selectable, so reintroduce this ability and use it to cause the
INTR_IST path to use %rdx instead.
The use of %dl for the %cs.rpl check means that when an IST interrupt hits
Xen, we try to load 1 into the high 32 bits of MSR_SPEC_CTRL, suffering a #GP
fault instead.
Also, drop an unused label which was a copy/paste mistake.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2b08fbed388f18235fda5ba1655c1483ef3e215
master date: 2018-02-14 13:22:15 +0000
Liran Alon [Tue, 20 Mar 2018 13:35:02 +0000 (14:35 +0100)]
x86/vlapic: clear TMR bit upon acceptance of edge-triggered interrupt to IRR
According to Intel SDM section "Interrupt Acceptance for Fixed Interrupts":
"The trigger mode register (TMR) indicates the trigger mode of the
interrupt (see Figure 10-20). Upon acceptance of an interrupt
into the IRR, the corresponding TMR bit is cleared for
edge-triggered interrupts and set for level-triggered interrupts.
If a TMR bit is set when an EOI cycle for its corresponding
interrupt vector is generated, an EOI message is sent to
all I/O APICs."
Before this patch TMR-bit was cleared on LAPIC EOI which is not what
real hardware does. This was also confirmed in KVM upstream commit a0c9a822bf37 ("KVM: dont clear TMR on EOI").
Behavior after this patch is aligned with both Intel SDM and KVM
implementation.
Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 12a50030a81a14a3c7be672ddfde707b961479ec
master date: 2018-03-15 16:59:52 +0100
Jan Beulich [Tue, 20 Mar 2018 13:34:24 +0000 (14:34 +0100)]
cpufreq/ondemand: fix race while offlining CPU
Offlining a CPU involves stopping the cpufreq governor. The on-demand
governor will kill the timer before letting generic code proceed, but
since that generally isn't happening on the subject CPU,
cpufreq_dbs_timer_resume() may run in parallel. If that managed to
invoke the timer handler, that handler needs to run to completion before
dbs_timer_exit() may safely exit.
Make the "stoppable" field a tristate, changing it from +1 to -1 around
the timer function invocation, and make dbs_timer_exit() wait for it to
become non-negative (still writing zero if it's +1).
Also adjust coding style in cpufreq_dbs_timer_resume().
Reported-by: Martin Cerveny <martin@c-home.cz> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Martin Cerveny <martin@c-home.cz> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 185413355fe331cbc926d48568838227234c9a20
master date: 2018-03-09 17:30:49 +0100
Jan Beulich [Tue, 20 Mar 2018 13:33:47 +0000 (14:33 +0100)]
x86: remove CR reads from exit-to-guest path
CR3 is - during normal operation - only ever loaded from v->arch.cr3,
so there's no need to read the actual control register. For CR4 we can
generally use the cached value on all synchronous entry end exit paths.
Drop the write_cr3 macro, as the two use sites are probably easier to
follow without its use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 31bf55cb5fe3796cf6a4efbcfc0a9418bb1c783f
master date: 2018-03-06 16:49:36 +0100
Jan Beulich [Tue, 20 Mar 2018 13:33:15 +0000 (14:33 +0100)]
x86: slightly reduce Meltdown band-aid overhead
I'm not sure why I didn't do this right away: By avoiding the use of
global PTEs in the cloned directmap, there's no need to fiddle with
CR4.PGE on any of the entry paths. Only the exit paths need to flush
global mappings.
The reduced flushing, however, requires that we now have interrupts off
on all entry paths until after the page table switch, so that flush IPIs
can't be serviced while on the restricted pagetables, leaving a window
where a potentially stale guest global mapping can be brought into the
TLB. Along those lines the "sync" IPI after L4 entry updates now needs
to become a real (and global) flush IPI, so that inside Xen we'll also
pick up such changes.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: correct EFLAGS.IF in SYSENTER frame
Commit 9d1d31ad94 ("x86: slightly reduce Meltdown band-aid overhead")
moved the STI past the PUSHF. While this isn't an active problem (as we
force EFLAGS.IF to 1 before exiting to guest context), let's not risk
internal confusion by finding a PV guest frame with interrupts
apparently off.
Jan Beulich [Tue, 20 Mar 2018 13:32:36 +0000 (14:32 +0100)]
x86/xpti: don't map stack guard pages
Other than for the main mappings, don't even do this in release builds,
as there are no huge page shattering concerns here.
Note that since we don't run on the restructed page tables while HVM
guests execute, the non-present mappings won't trigger the triple fault
issue AMD SVM is susceptible to with our current placement of STGI vs
TR loading.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d303784b68237ff3050daa184f560179dda21b8c
master date: 2018-03-06 16:46:57 +0100
Andrew Cooper [Tue, 20 Mar 2018 13:31:33 +0000 (14:31 +0100)]
x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings
The current XPTI implementation isolates the directmap (and therefore a lot of
guest data), but a large quantity of CPU0's state (including its stack)
remains visible.
Furthermore, an attacker able to read .text is in a vastly superior position
to normal when it comes to fingerprinting Xen for known vulnerabilities, or
scanning for ROP/Spectre gadgets.
Collect together the entrypoints in .text.entry (currently 3x4k frames, but
can almost certainly be slimmed down), and create a common mapping which is
inserted into each per-cpu shadow. The stubs are also inserted into this
mapping by pointing at the in-use L2. This allows stubs allocated later (SMP
boot, or CPU hotplug) to work without further changes to the common mappings.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/xpti: really hide almost all of Xen image
Commit 422588e885 ("x86/xpti: Hide almost all of .text and all
.data/.rodata/.bss mappings") carefully limited the Xen image cloning to
just entry code, but then overwrote the just allocated and populated L3
entry with the normal one again covering both Xen image and stubs.
Drop the respective code in favor of an explicit clone_mapping()
invocation. This in turn now requires setup_cpu_root_pgt() to run after
stub setup in all cases. Additionally, with (almost) no unintended
mappings left, the BSP's IDT now also needs to be page aligned.
The moving ahead of cleanup_cpu_root_pgt() is not strictly necessary
for functionality, but things are more logical this way, and we retain
cleanup being done in the inverse order of setup.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/traps: Put idt_table[] back into .bss
c/s d1d6fc97d "x86/xpti: really hide almost all of Xen image" accidentially
moved idt_table[] from .bss to .data by virtue of using the page_aligned
section. We also have .bss.page_aligned, so use that.
Jan Beulich [Fri, 16 Mar 2018 16:17:23 +0000 (17:17 +0100)]
x86: ignore guest microcode loading attempts
The respective MSRs are write-only, and hence attempts by guests to
write to these are - as of 1f1d183d49 ("x86/HVM: don't give the wrong
impression of WRMSR succeeding") no longer ignored. Restore original
behavior for the two affected MSRs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 59c0983e10d70ea2368085271b75fb007811fe52
master date: 2018-03-15 12:44:24 +0100
Daniel Sabogal [Fri, 25 Aug 2017 21:35:47 +0000 (17:35 -0400)]
libxl/arm: Fix build on arm64 + acpi
With musl, the build fails with the following errors:
actypes.h:202:2: error: #error unknown ACPI_MACHINE_WIDTH
#error unknown ACPI_MACHINE_WIDTH
^~~~~
actypes.h:207:9: error: unknown type name ‘acpi_native_uint’
typedef acpi_native_uint acpi_size;
^~~~~~~~~~~~~~~~
actypes.h:617:3: error: unknown type name ‘acpi_io_address’
acpi_io_address pblk_address;
^~~~~~~~~~~~~~~
This likely went undetected with glibc builds since glibc
indirectly pulls __BITS_PER_LONG from the linux headers
through a standard header. For musl, this is not the case.
Instead, use BITS_PER_LONG to fix the build.
Signed-off-by: Daniel Sabogal <dsabogalcc@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 6657e938bf769768b56ba0c86cd4775b010538a8)
Jan Beulich [Tue, 6 Mar 2018 15:06:40 +0000 (16:06 +0100)]
x86/PV: fix off-by-one in I/O bitmap limit check
With everyone having their tags below agreeing that putting things the
other way around in the comparison makes things easier to understand, do
that rearrangement while changing the line anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.apu@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c6527bc66b6dd7a8dadaebb1047c8e52c6c5793c
master date: 2018-02-27 14:10:00 +0100
Jan Beulich [Tue, 6 Mar 2018 15:06:11 +0000 (16:06 +0100)]
x86/HVM: don't give the wrong impression of WRMSR succeeding
... for non-existent MSRs: wrmsr_hypervisor_regs()'s comment clearly
says that the function returns 0 for unrecognized MSRs, so
{svm,vmx}_msr_write_intercept() should not convert this into success. We
don't want to unconditionally fail the access though, as we can't be
certain the list of handled MSRs is complete enough for the guest types
we care about, so instead mirror what we do on the read paths and probe
the MSR to decide whether to raise #GP.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 1f1d183d49008794b087cf043fc77f724a45af98
master date: 2018-02-27 15:12:23 +0100
Andrew Cooper [Tue, 6 Mar 2018 15:04:47 +0000 (16:04 +0100)]
x86/pv: Avoid leaking other guests' MSR_TSC_AUX values into PV context
If the CPU pipeline supports RDTSCP or RDPID, a guest can observe the value in
MSR_TSC_AUX, irrespective of whether the relevant CPUID features are
advertised/hidden.
At the moment, paravirt_ctxt_switch_to() only writes to MSR_TSC_AUX if
TSC_MODE_PVRDTSCP mode is enabled, but this is not the default mode.
Therefore, default PV guests can read the value from a previously scheduled
HVM vcpu, or TSC_MODE_PVRDTSCP-enabled PV guest.
Alter the PV path to always write to MSR_TSC_AUX, using 0 in the common case.
To amortise overhead cost, introduce wrmsr_tsc_aux() which performs a lazy
update of the MSR, and use this function consistently across the codebase.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: cc0e45db277922b5723a7b1d9657d6f744230cf1
master date: 2018-02-27 10:47:23 +0000
Igor Druzhinin [Tue, 6 Mar 2018 15:04:13 +0000 (16:04 +0100)]
x86/nmi: start NMI watchdog on CPU0 after SMP bootstrap
We're noticing a reproducible system boot hang on certain
Skylake platforms where the BIOS is configured in legacy
boot mode with x2APIC disabled. The system stalls immediately
after writing the first SMP initialization sequence into APIC ICR.
The cause of the problem is watchdog NMI handler execution -
somewhere near the end of NMI handling (after it's already
rescheduled the next NMI) it tries to access IO port 0x61
to get the actual NMI reason on CPU0. Unfortunately, this
port is emulated by BIOS using SMIs and this emulation for
some reason takes more time than we expect during INIT-SIPI-SIPI
sequence. As the result, the system is constantly moving between
NMI and SMI handler and not making any progress.
To avoid this, initialize the watchdog after SMP bootstrap on
CPU0 and, additionally, protect the NMI handler by moving
IO port access before NMI re-scheduling. The latter should also
help in case of post boot CPU onlining. Although we're running
watchdog at much lower frequency at this point, it's neveretheless
possible we may trigger the issue anyway.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a44f1697968e04fcc6145e3bd51c748b57047240
master date: 2018-02-20 10:16:56 +0100
Jan Beulich [Tue, 6 Mar 2018 15:03:40 +0000 (16:03 +0100)]
x86/srat: fix end calculation in nodes_cover_memory()
Along the lines of commit 7226486767 ("x86/srat: fix the end pfn check
in valid_numa_range()") nodes_cover_memory() also doesn't consistently
use "end": It's set to an inclusive value initially, but then compared
to the exclusive "end" field of struct node and also possibly set to
nodes[j].start, making it exclusive too. Change the initialization to
make the variable consistently exclusive.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fdbed42649eb064e7c6d1bae2bdd4f46e7b2a160
master date: 2018-02-15 18:17:32 +0100
Ross Lagerwall [Tue, 6 Mar 2018 15:03:07 +0000 (16:03 +0100)]
x86/hvm/dmop: only copy what is needed to/from the guest
dm_op() fails with -EFAULT if the struct xen_dm_op given by the guest is
smaller than Xen's struct xen_dm_op. This is a problem because DMOP is
meant to be a stable ABI but it breaks whenever the size of struct
xen_dm_op changes.
To fix this, change how the copying to and from the guest is done. When
copying from the guest, first copy the header and inspect the op. Then,
only copy the correct amount needed for that op. When copying to the
guest, don't copy the header. Rather, copy only the correct amount
needed for that particular op.
So now the dm_op() will fail if the guest does not supply enough bytes
for the specific op. It will not fail if the guest supplies too many
bytes for the specific op, but Xen will not copy the extra bytes.
Remove some now unused macros and helper functions.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 85cb15dfe4d13b9b8b0f39a9cb257525c0b74c60
master date: 2018-02-15 18:16:17 +0100
Haozhong Zhang [Tue, 6 Mar 2018 15:01:11 +0000 (16:01 +0100)]
x86/srat: fix the end pfn check in valid_numa_range()
... and fix the coding style on fly.
valid_numa_range(..., epfn << PAGE_SHIFT, ...) and its only caller
memory_add(..., epfn, pxm) interpret epfn inconsistently. The former
interprets epfn as the last pfn, while the latter interprets it as the
last pfn plus one. Fix this inconsistency in valid_numa_range(), since
most of other places use the latter interpretation.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 722648676751fda39086f54d961640f88174360b
master date: 2018-02-12 11:08:33 +0000
Jan Beulich [Tue, 6 Mar 2018 15:00:32 +0000 (16:00 +0100)]
x86: reduce Meltdown band-aid IPI overhead
In case we can detect single-threaded guest processes (by checking
whether we can account for all root page table uses locally on the vCPU
that's running), there's no point in issuing a sync IPI upon an L4 entry
update, as no other vCPU of the guest will have that page table loaded.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a22320e32dca0918ed23799583f470afe4c24330
master date: 2018-02-07 16:31:41 +0100
Andrew Cooper [Tue, 6 Mar 2018 14:59:13 +0000 (15:59 +0100)]
x86/emul: Fix the emulation of invlpga
The instruction requires EFER.SVME set to be usable in the first place.
Furthermore, the emulation doesn't handle ASIDs, so avoid giving the
impression that they work. Permit ASID 0 which is reserved for non-root
mode (in which case the instruction is identical to invlpg), but raise #UD for
any other ASID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a91b2ec337a45d5d98e5a4387aa6563bc5cdc4c9
master date: 2018-02-05 18:17:22 +0000
Julien Grall [Fri, 16 Feb 2018 14:59:56 +0000 (14:59 +0000)]
xen/arm: vgic: Make sure the number of SPIs is a multiple of 32
The vGIC relies on having a pending_irq available for every IRQs
described in the ranks. As each rank describes 32 interrupts, we need to
make sure the number of SPIs is a multiple of 32.
Andrew Cooper [Thu, 16 Nov 2017 21:10:00 +0000 (21:10 +0000)]
tools/libxc: Fix restoration of PV MSRs after migrate
There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
didn't test this bit of Migration v2 very well when writing it...
vcpu->msrsz is always expected to be a multiple of xen_domctl_vcpu_msr_t
records in a spec-compliant stream, so the modulo yields 0 for the msr_count,
rather than the actual number sent in the stream.
Passing 0 for the msr_count causes the hypercall to exit early, and hides the
fact that the guest handle is inserted into the wrong field in the domctl
union.
The reason that these bugs have gone unnoticed for so long is that the only
MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
in fairly modern hardware, and whose use doesn't appear to be implemented in
any contemporary PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit f1a0a8c3fe2fb37c77ec1fe43618feef412427b5)
Andrew Cooper [Tue, 27 Feb 2018 13:24:51 +0000 (14:24 +0100)]
x86/hvm: Disallow the creation of HVM domains without Local APIC emulation
There are multiple problems, not necesserily limited to:
* Guests which configure event channels via hvmop_set_evtchn_upcall_vector(),
or which hit %cr8 emulation will cause Xen to fall over a NULL vlapic->regs
pointer.
* On Intel hardware, disabling the TPR_SHADOW execution control without
reenabling CR8_{LOAD,STORE} interception means that the guests %cr8
accesses interact with the real TPR. Amongst other things, setting the
real TPR to 0xf blocks even IPIs from interrupting this CPU.
* On hardware which sets up the use of Interrupt Posting, including
IOMMU-Posting, guests run without the appropriate non-root configuration,
which at a minimum will result in dropped interrupts.
Whether no-LAPIC mode is of any use at all remains to be seen.
This is XSA-256.
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0aa6158b674c5d083b75ac8fcd1e7ae92d0c39ae
master date: 2018-02-27 14:08:36 +0100
Jan Beulich [Tue, 27 Feb 2018 13:24:00 +0000 (14:24 +0100)]
gnttab: don't blindly free status pages upon version change
There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
the second (when the first fully passed) freeing the pages. Return an
error if any pages still have pending references.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38bfcc165dda5f4284d7c218b91df9e144ddd88d
master date: 2018-02-27 14:07:12 +0100
Jan Beulich [Tue, 27 Feb 2018 13:23:32 +0000 (14:23 +0100)]
gnttab/ARM: don't corrupt shared GFN array
... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.
While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d2f8f9c65d4da35437f50ed9e812a2c5ab313e2
master date: 2018-02-27 14:04:44 +0100
Jan Beulich [Tue, 27 Feb 2018 13:22:48 +0000 (14:22 +0100)]
memory: don't implicitly unpin for decrease-reservation
It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.
This is XSA-252.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d798a0952903db9d8ee0a580e03f214d2b49b7d7
master date: 2018-02-27 14:03:27 +0100
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs
The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPUs (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always use the boot CPU MIDR.
Fix is_affected_midr_range to use the current CPU MIDR.
Andrew Cooper [Thu, 8 Feb 2018 11:32:14 +0000 (12:32 +0100)]
x86/idle: Clear SPEC_CTRL while idle
On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads. It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.
Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline. Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4c7e478d597b0346eef3a256cfd6794ac778b608
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:31:39 +0000 (12:31 +0100)]
x86/cpuid: Offer Indirect Branch Controls to guests
With all infrastructure in place, it is now safe to let guests see and use
these features.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 67c6838ddacfa646f9d1ae802bd0f16a935665b8
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:31:14 +0000 (12:31 +0100)]
x86/ctxt: Issue a speculation barrier between vcpu contexts
Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.
The cost of IBPB is substantial, and skipped on transition to idle, as Xen's
idle code is robust already. All transitions into vcpu context are fully
serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try and skip the flush.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a2ed643ed783020f885035432e9c0919756921d1
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:30:46 +0000 (12:30 +0100)]
x86/boot: Calculate the most appropriate BTI mitigation to use
See the logic and comments in init_speculation_mitigations() for further
details.
There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix determination of when to use IBRS
The original version of this logic was:
/*
* On Intel hardware, we'd like to use retpoline in preference to
* IBRS, but only if it is safe on this hardware.
*/
else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
{
if ( retpoline_safe() )
thunk = THUNK_RETPOLINE;
else
ibrs = true;
}
but it was changed by a request during review. Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.
This in practice means that on repoline-unsafe hardware without IBRS, we
select THUNK_JUMP despite intending to select THUNK_RETPOLINE.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2713715305ca516f698d58cec5e0b322c3b2c4eb
master date: 2018-01-26 14:10:21 +0000
master commit: 30cbd0c83ef3d0edac2d5bcc41a9a2b7a843ae58
master date: 2018-02-06 18:32:58 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:30:06 +0000 (12:30 +0100)]
x86/entry: Avoid using alternatives in NMI/#MC paths
This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.
For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.
Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing. This also covers
the safety of RSB conditional, as execution it is guaranteed to either hit the
WRMSR or LFENCE.
One downside of not using alternatives is that there unconditionally an LFENCE
in the IST path in cases where we are not using the features from IBRS-capable
microcode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3fffaf9c13e9502f09ad4ab1aac3f8b7b9398f6f
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:29:28 +0000 (12:29 +0100)]
x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code. Guests can take advantage of this in two ways:
1) If they can find a path in Xen which executes more ret instructions than
call instructions. (At least one in the waitqueue infrastructure,
probably others.)
2) Use the fact that the RSB/RAS in hardware is actually a circular stack
without a concept of empty. (When it logically empties, stale values
will start being used.)
To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e6c0128e9ab25bf66df11377a33ee5584d7f99e3
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:28:35 +0000 (12:28 +0100)]
x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context. See the documentation in
asm-x86/spec_ctrl_asm.h for details.
With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set. Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.
Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.
Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.
There is a semi-unrelated bugfix, where various asm_defn.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e7962901131186d3514528ed57c7a9901a15a3e
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:27:50 +0000 (12:27 +0100)]
x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
For performance reasons, HVM guests should have direct access to these MSRs
when possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5a2fe171144ebcc908ea1fca45058d6010f6a286
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:27:18 +0000 (12:27 +0100)]
x86/migrate: Move MSR_SPEC_CTRL on migrate
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0cf2a4eb769302b7d7d7835540e7b2f15006df30
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:26:54 +0000 (12:26 +0100)]
x86: Avoid corruption on migrate for vcpus using CPUID Faulting
Xen 4.8 and later virtualises CPUID Faulting support for guests. However, the
value of MSR_MISC_FEATURES_ENABLES is omitted from the vcpu state, meaning
that the current cpuid faulting setting is lost on migrate/suspend/resume.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b90f86be161c74df8cb69c98d9f22885d9d87114
master date: 2017-12-01 18:09:48 +0000
MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea58a679a6190e714a592f1369b660769a48a80c
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:21:25 +0000 (12:21 +0100)]
x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit. AMD specifies IBPB alone in a 3rd bit.
AMD's IBPB is a subset of Intel's combined IBRS/IBPB. For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.
The behaviour of STIBP is more complicated.
It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware. However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.
For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available. This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.
Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d297b56682e730d598e2529cc6998151d3b6f6f8
master date: 2018-01-26 14:10:21 +0000
Wei Liu [Thu, 8 Feb 2018 11:20:45 +0000 (12:20 +0100)]
x86: fix GET_STACK_END
AIUI the purpose of having the .if directive is to make GET_STACK_END
work with any general purpose registers. The code as-is would produce
the wrong result for r8. Fix it.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8155476765a5bdecea1534b46562cf28e0113a9a
master date: 2018-01-25 11:34:17 +0000
Roger Pau Monné [Thu, 8 Feb 2018 11:20:19 +0000 (12:20 +0100)]
x86/acpi: process softirqs while printing CPU ACPI data
Or else the watchdog triggers on boxes with a huge number of CPUs
Reported-by: Simon Crowe <simon.crowe@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: a5579ee79ef8546dd47abe34d73dc9a69a14bbda
master date: 2018-01-24 18:02:14 +0100
Andrew Cooper [Thu, 8 Feb 2018 11:19:40 +0000 (12:19 +0100)]
x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB
Instead of gaining yet another top level boolean, introduce a more generic
cpuid= option. Also introduce a helper function to parse a generic boolean
value.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cmdline: Fix parse_boolean() for unadorned values
A command line such as "cpuid=no-ibrsb,no-stibp" tickles a bug in
parse_boolean() because the separating comma fails the NUL case.
Instead, check for slen == nlen which accounts for the boundary (if any)
passed via the 'e' parameter.
Andrew Cooper [Thu, 8 Feb 2018 11:18:57 +0000 (12:18 +0100)]
x86/feature: Definitions for Indirect Branch Controls
Contemporary processors are gaining Indirect Branch Controls via microcode
updates. Intel are introducing one bit to indicate IBRS and IBPB support, and
a second bit for STIBP. AMD are introducing IBPB only, so enumerate it with a
separate bit.
Furthermore, depending on compiler and microcode availability, we may want to
run Xen with IBRS set, or clear.
To use these facilities, we synthesise separate IBRS and IBPB bits for
internal use. A lot of infrastructure is required before these features are
safe to offer to guests.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 0d703a701cc4bc47773986b2796eebd28b1439b5
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:17:42 +0000 (12:17 +0100)]
x86/amd: Try to set lfence as being Dispatch Serialising
This property is required for the AMD's recommended mitigation for Branch
Target Injection, but Xen needs to cope with being unable to detect or modify
the MSR.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe3ee5530a8d0d0b6a478167125d00c40f294a86
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:15:37 +0000 (12:15 +0100)]
x86: Support indirect thunks from assembly code
Introduce INDIRECT_CALL and INDIRECT_JMP which either degrade to a normal
indirect branch, or dispatch to the __x86_indirect_thunk_* symbols.
Update all the manual indirect branches in to use the new thunks. The
indirect branches in the early boot and kexec path are left intact as we can't
use the compiled-in thunks at those points.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7c508612f7a5096b4819d4ef2ce566e01bd66c0c
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:14:35 +0000 (12:14 +0100)]
x86: Support compiling with indirect branch thunks
Use -mindirect-branch=thunk-extern/-mindirect-branch-register when available.
To begin with, use the retpoline thunk. Later work will add alternative
thunks which can be selected at boot time.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 3659f0f4bcc6ca08103d1a7ae4e97535ecc978be
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:13:15 +0000 (12:13 +0100)]
x86/entry: Erase guest GPR state on entry to Xen
This reduces the number of code gadgets which can be attacked with arbitrary
guest-controlled GPR values.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 03bd8c3a70d101fc2f8f36f1e171b7594462a4cd
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Thu, 8 Feb 2018 11:12:44 +0000 (12:12 +0100)]
x86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit
No practical change.
One side effect in debug builds is that %rbp is inverted in the manner
expected by the stack unwinder to indicate a interrupt frame.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 13682ca8c94bd5612a44f7f1edc1fd8ff675dacb
master date: 2018-01-05 19:57:08 +0000