Andrew Cooper [Mon, 30 Jul 2018 12:13:09 +0000 (14:13 +0200)]
x86/spec-ctrl: Fix the parsing of xpti= on fixed Intel hardware
The calls to xpti_init_default() in parse_xpti() are buggy. The CPUID data
hasn't been fetched that early, and boot_cpu_has(X86_FEATURE_ARCH_CAPS) will
always evaluate false.
As a result, the default case won't disable XPTI on Intel hardware which
advertises ARCH_CAPABILITIES_RDCL_NO.
Simplify parse_xpti() to solely the setting of opt_xpti according to the
passed string, and have init_speculation_mitigations() call
xpti_init_default() if appropiate. Drop the force parameter, and pass caps
instead, to avoid redundant re-reading of MSR_ARCH_CAPS.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: be5e2ff6f54e0245331ed360b8786760f82fd673
master date: 2018-07-24 11:25:54 +0100
Jan Beulich [Mon, 30 Jul 2018 12:12:39 +0000 (14:12 +0200)]
x86: command line option to avoid use of secondary hyper-threads
Shared resources (L1 cache and TLB in particular) present a risk of
information leak via side channels. Provide a means to avoid use of
hyperthreads in such cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d8f974f1a646c0200b97ebcabb808324b288fadb
master date: 2018-07-19 13:43:33 +0100
Jan Beulich [Mon, 30 Jul 2018 12:11:42 +0000 (14:11 +0200)]
x86: possibly bring up all CPUs even if not all are supposed to be used
Reportedly Intel CPUs which can't broadcast #MC to all targeted
cores/threads because some have CR4.MCE clear will shut down. Therefore
we want to keep CR4.MCE enabled when offlining a CPU, and we need to
bring up all CPUs in order to be able to set CR4.MCE in the first place.
The use of clear_in_cr4() in cpu_mcheck_disable() was ill advised
anyway, and to avoid future similar mistakes I'm removing clear_in_cr4()
altogether right here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 8797d20a6ec2dd75195585a107ce345c51c0a59a
master date: 2018-07-19 13:43:33 +0100
Jan Beulich [Mon, 30 Jul 2018 12:10:58 +0000 (14:10 +0200)]
x86: distinguish CPU offlining from CPU removal
In order to be able to service #MC on offlined CPUs, the GDT, IDT,
stack, and per-CPU data (which includes the TSS) need to be kept
allocated. They should only be freed upon CPU removal (which we
currently don't support, so some code is becoming effectively dead for
the moment).
Note that for now park_offline_cpus doesn't get set to true anywhere -
this is going to be the subject of a subsequent patch.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2e6c8f182c9c50129b1c7a620242861e6ad6a9fb
master date: 2018-07-19 13:43:33 +0100
Jan Beulich [Mon, 30 Jul 2018 12:10:30 +0000 (14:10 +0200)]
x86/AMD: distinguish compute units from hyper-threads
Fam17 replaces CUs by HTs, which we should reflect accordingly, even if
the difference is not very big. The most relevant change (requiring some
code restructuring) is that the topoext feature no longer means there is
a valid CU ID.
Take the opportunity and convert wrongly plain int variables in
set_cpu_sibling_map() to unsigned int.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9429b07a0af7f92a5f25e4068e11db881e157495
master date: 2018-07-19 09:42:42 +0200
Jan Beulich [Mon, 30 Jul 2018 12:09:57 +0000 (14:09 +0200)]
cpupools: fix state when downing a CPU failed
While I've run into the issue with further patches in place which no
longer guarantee the per-CPU area to start out as all zeros, the
CPU_DOWN_FAILED processing looks to have the same issue: By not zapping
the per-CPU cpupool pointer, cpupool_cpu_add()'s (indirect) invocation
of schedule_cpu_switch() will trigger the "c != old_pool" assertion
there.
Clearing the field during CPU_DOWN_PREPARE is too early (afaict this
should not happen before cpu_disable_scheduler()). Clearing it in
CPU_DEAD and CPU_DOWN_FAILED would be an option, but would take the same
piece of code twice. Since the field's value shouldn't matter while the
CPU is offline, simply clear it (implicitly) for CPU_ONLINE and
CPU_DOWN_FAILED, but only for other than the suspend/resume case (which
gets specially handled in cpupool_cpu_remove()).
By adjusting the conditional in cpupool_cpu_add() CPU_DOWN_FAILED
handling in the suspend case should now also be handled better.
Jan Beulich [Mon, 30 Jul 2018 12:09:19 +0000 (14:09 +0200)]
allow cpu_down() to be called earlier
The function's use of the stop-machine logic has so far prevented its
use ahead of the processing of the "ordinary" initcalls. Since at this
early time we're in a controlled environment anyway, there's no need for
such a heavy tool. Additionally this ought to have less of a performance
impact especially on large systems, compared to the alternative of
making stop-machine functionality available earlier.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5894c0a2da66243a89088d309c7e1ea212ab28d6
master date: 2018-07-16 15:15:12 +0200
Jan Beulich [Mon, 30 Jul 2018 12:08:42 +0000 (14:08 +0200)]
x86/spec-ctrl: command line handling adjustments
For one, "no-xen" should not imply "no-eager-fpu", as "eager FPU" mode
is to guard guests, not Xen itself, which is also expressed so by
print_details().
And then opt_ssbd, despite being off by default, should also be cleared
by the "no" and "no-xen" sub-options.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ac3f9a72141a48d40fabfff561d5a7dc0e1b810d
master date: 2018-07-10 12:22:31 +0200
Jan Beulich [Mon, 30 Jul 2018 12:06:44 +0000 (14:06 +0200)]
cmdline: fix parse_boolean() for NULL incoming end pointer
Use the calculated lengths instead of pointers, as 'e' being NULL will
otherwise cause undue parsing failures.
Reported-by: Karl Johnson <karljohnson.it@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Jun 2018 09:31:48 +0000 (11:31 +0200)]
x86/HVM: don't cause #NM to be raised in Xen
The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.
{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.
With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 488efc29e4e996bb3805c982200f65061390cdce
master date: 2018-06-28 09:07:06 +0200
Ian Jackson [Thu, 28 Jun 2018 09:31:34 +0000 (11:31 +0200)]
libxl: restore passing "readonly=" to qemu for SCSI disks
A read-only check was introduced for XSA-142, commit ef6cb76026 ("libxl:
relax readonly check introduced by XSA-142 fix") added the passing of
the extra setting, but commit dab0539568 ("Introduce COLO mode and
refactor relevant function") dropped the passing of the setting again,
quite likely due to improper re-basing.
Restore the readonly= parameter to SCSI disks. For IDE disks this is
supposed to be rejected; add an assert. And there is a bare ad-hoc
disk drive string in libxl__build_device_model_args_new, which we also
update.
This is XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: dd64d3c41a2d15139c3a35d22d4cb6b78f4c5c59
master date: 2018-06-28 09:05:06 +0200
Ian Jackson [Thu, 28 Jun 2018 09:31:21 +0000 (11:31 +0200)]
libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config
The generated configurations are identical apart from, in some cases,
reordering of the id=%s element. So, overall, no functional change.
This is part of XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: 724e5aa31b58d1e430ad36b484cf0ec021497399
master date: 2018-06-28 09:04:55 +0200
Andrew Cooper [Thu, 28 Jun 2018 09:30:56 +0000 (11:30 +0200)]
x86: Refine checks in #DB handler for faulting conditions
One of the fix for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.
While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to. Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware). A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.
Instruction breakpoints are more tricky. The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.
Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context. However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.
This is XSA-265
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 17bf51297220dcd74da29de99320b6b1c72d1fa5
master date: 2018-06-28 09:04:20 +0200
Jan Beulich [Thu, 28 Jun 2018 09:29:48 +0000 (11:29 +0200)]
x86/EFI: further correct FPU state handling around runtime calls
We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.
Rename the function at the same time to better reflect its purpose, as
the patches touches all of its occurences anyway.
The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 23839a0fa0bbe78c174cd2bb49083e153f0f99df
master date: 2018-06-26 15:23:08 +0200
Jan Beulich [Thu, 28 Jun 2018 09:29:21 +0000 (11:29 +0200)]
x86/EFI: fix FPU state handling around runtime calls
There are two issues. First, the nonlazy xstates were never restored
after returning from the runtime call.
Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.
Fix both of these issues by calling vcpu_restore_fpu_eager(). As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.
Introduce a "curr" local variable and replace other uses of "current"
at the same time.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Juergen Gross <jgross@suse.com>
master commit: 437211cb696515ee5bd5dae0ab72866c9f382a33
master date: 2018-06-21 11:35:46 +0200
Jan Beulich [Thu, 28 Jun 2018 09:28:39 +0000 (11:28 +0200)]
x86: correct default_xen_spec_ctrl calculation
Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d6239f64713df819278bf048446d3187c6ac4734
master date: 2018-05-29 12:38:52 +0200
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU
Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.
On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 243435bf67e8159495194f623b9e4d8c90140384)
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86: Support fully eager FPU context switching
This is controlled on a per-vcpu bases for flexibility.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 146dfe9277c2b4a8c399b229e00d819065e3167b)
Jan Beulich [Wed, 30 May 2018 06:37:19 +0000 (08:37 +0200)]
x86: re-enable XPTI/PCID as needed in switch_native()
Additionally avoid accessing d->arch.pv_domain for PVH domains (running
in a HVM container).
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.
We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:
- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.
As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.
With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: use flag byte for decision whether xen_cr3 is valid
Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.
Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.
This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: disable global pages for domains with XPTI active
Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.
In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).
This requires to use switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.
Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 29 May 2018 08:42:44 +0000 (10:42 +0200)]
x86/xpti: avoid copying L4 page table contents when possible
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.
Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.
So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.
A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:
- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 23 Jan 2018 09:43:39 +0000 (10:43 +0100)]
x86: move invocations of hvm_flush_guest_tlbs()
Their need is not tied to the actual flushing of TLBs, but the ticking
of the TLB clock. Make this more obvious by folding the two invocations
into a single one in pre_flush().
Also defer the latching of CR4 in write_cr3() until after pre_flush()
(and hence implicitly until after IRQs are off), making operation
sequence the same in both cases (eliminating the theoretical risk of
pre_flush() altering CR4). This then also improves register allocation,
as the compiler doesn't need to use a callee-saved register for "cr4"
anymore.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 29 May 2018 08:40:37 +0000 (10:40 +0200)]
x86/XPTI: fix S3 resume (and CPU offlining in general)
We should index an L1 table with an L1 index.
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6b9562dac1746014ab376bd2cf8ba400acf34c6d
master date: 2018-05-28 11:20:26 +0200
Andrew Cooper [Tue, 29 May 2018 08:39:07 +0000 (10:39 +0200)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:
* A feature bit to indicate that Speculative Store Bypass Disable is
supported.
* A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
in the pipeline.
* A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
indicating that the hardware is not susceptible to Speculative Store Bypass
sidechannels.
For contemporary processors, this interface will be implemented via a
microcode update.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9df52a25e0e95a0b9971aa2fc26c5c6a5cbdf4ef
master date: 2018-05-21 14:20:06 +0100
Andrew Cooper [Tue, 29 May 2018 08:38:25 +0000 (10:38 +0200)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.
Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.
For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c0e338086f060eba31d37b83fbdb883928aa085
master date: 2018-05-21 14:20:06 +0100
Andrew Cooper [Tue, 29 May 2018 08:36:49 +0000 (10:36 +0200)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.
One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains. In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit. To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.
For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings. See the xen-command-line.markdown for
full details of the new options.
While changing the command line options, take the time to change how the data
is reported to the user. The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.
Sample output from booting with `spec-ctrl=no-pv` looks like:
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3352afc26c497d26ecb70527db3cb29daf7b1422
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:36:12 +0000 (10:36 +0200)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features
If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised. It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.
Another useful option in some usecases is to offer IBPB without IBRS. When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS. This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.
Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb06b308ec71b23f37a44f5e2351fe2cae0306e9
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:35:47 +0000 (10:35 +0200)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value. Explicitly set Xen's default during the BSP
and AP boot paths.
For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so. Oracle report that this
speeds up boots of some hardware by 50s.
"when safe to do so" is based on whether we are virtualised. A native boot
won't have any other code running in a position to mount an attack.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb8c12020307b39a89273d7699e89000451987ab
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:33:09 +0000 (10:33 +0200)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa9eb09d446a1279f5e861e6b84fa8675dabf148
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:32:36 +0000 (10:32 +0200)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
Requested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94df6e8588e35cc2028ccb3fd2921c6e6360605e
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:32:04 +0000 (10:32 +0200)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.
Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d9822b8a38114e96e4516dc998f4055249364d5d
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:31:25 +0000 (10:31 +0200)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.
There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af949407eaba7af71067f23d5866cd0bf1f1144d
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:25:46 +0000 (10:25 +0200)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
All 3 bits of information here are control flags for the entry/exit code
behaviour. Treat them as such, rather than having two different variables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5262ba2e7799001402dfe139ff944e035dfff928
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:25:10 +0000 (10:25 +0200)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed. One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.
Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66dfae0f32bfbc899c2f3446d5ee57068cb7f957
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:24:43 +0000 (10:24 +0200)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions. Fix an RSBA typo while moving the
affected comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6c65187252a6c1810fd24c4d46f812840de8d3c
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:23:47 +0000 (10:23 +0200)]
x86: Fix "x86: further CPUID handling adjustments"
c/s 62b187969 "x86: further CPUID handling adjustments" make some adjustments.
However, it breaks levelling of guests, making it impossible for the toolstack
to hide STIBP or IBPB from guests on hardware with up-to-date microcode.
The dom0 issue referenced in the commit message was fixed by the hunk
adjusting the zeroing alone. STIBP and IBPB don't need (and indeed, must not
be for levelling purposes) OR'd into the leaf.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 18 May 2018 11:22:28 +0000 (13:22 +0200)]
xpti: fix bug in double fault handling
When entering the hypervisor via the double fault handler resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next following exception, so repair this in order to allow
exception handling to work even after a double fault.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d80af845de7a4db01a4a3b4d779e0e0dcb5e738b
master date: 2018-04-23 16:13:01 +0200
The 'RSB Alternative' bit in MSR_ARCH_CAPABILITIES may be set by a hypervisor
to indicate that the virtual machine may migrate to a processor which isn't
retpoline-safe. Introduce a shortened name (to reduce code volume), treat it
as authorative in retpoline_safe(), and print its value along with the other
ARCH_CAPS bits.
The exact processor models which do have RSB semantics which fall back to BTB
predictions are enumerated, and include Kabylake and Coffeelake. Leave a
printk() in the default case to help identify cases which aren't covered.
The exact microcode versions from Broadwell RSB-safety are taken from the
referenced microcode update file (adjusting for the known-bad microcode
versions). Despite the exact wording of the text, it is only Broadwell
processors which need a microcode check.
In practice, this means that all Broadwell hardware with up-to-date microcode
will use retpoline in preference to IBRS, which will be a performance
improvement for desktop and server systems which would previously always opt
for IBRS over retpoline.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix typo in ARCH_CAPS decode
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 1232378bd2fef45f613db049b33852fdf84d7ddf
master date: 2018-04-19 17:28:23 +0100
master commit: 27170adb54a558e11defcd51989326a9beb95afe
master date: 2018-04-24 13:34:12 +0100
Jan Beulich [Fri, 18 May 2018 11:20:40 +0000 (13:20 +0200)]
x86: suppress BTI mitigations around S3 suspend/resume
NMI and #MC can occur at any time after S3 resume, yet the MSR_SPEC_CTRL
may become available only once we're reloaded microcode. Make
SPEC_CTRL_ENTRY_FROM_INTR_IST and DO_SPEC_CTRL_EXIT_TO_XEN no-ops for
the critical period of time.
Also set the MSR back to its intended value.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: Use spec_ctrl_{enter,exit}_idle() in the S3/S5 path
The main purpose of this patch is to avoid opencoding the recovery logic at
the end, but also has the positive side effect of relaxing the SPEC_CTRL
mitigations when working to shut the final CPU down.
Jan Beulich [Fri, 18 May 2018 11:20:10 +0000 (13:20 +0200)]
x86: correct ordering of operations during S3 resume
Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the
first place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As micrcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().
cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb2a4a449dfd50af309a333aa805835015fbc8c8
master date: 2018-04-16 14:08:30 +0200
Jan Beulich [Fri, 18 May 2018 11:19:34 +0000 (13:19 +0200)]
x86: log XPTI enabled status
At the same time also report the state of the two defined
ARCH_CAPABILITIES MSR bits. To avoid further complicating the
conditional around that printk(), drop it (it's a debug level one only
anyway).
Issue the main message without any XENLOG_*, and also drop XENLOG_INFO
from the respective BTI message, to make sure they're visible at default
log level also in release builds.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 442b303cdaf7d774c0be8096fe5dbab68701abd3
master date: 2018-04-05 15:48:23 +0100
Sergey Dyasli [Fri, 18 May 2018 11:16:03 +0000 (13:16 +0200)]
x86/cpuid: fix raw FEATURESET_7d0 reporting
Commit 62b1879693e0 ("x86: further CPUID handling adjustments") added
FEATURESET_7d0 reporting but forgot to update calculate_raw_featureset()
function. As result, the value reported by xen-cpuid contains 0.
Fix that by properly filling raw_featureset[FEATURESET_7d0].
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 8 May 2018 17:23:21 +0000 (18:23 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways
In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop. This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.
Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
exception of the transition to STATE_IOREQ_NONE).
This is XSA-262.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/vpt: add support for IO-APIC routed interrupts
And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.
To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.
Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.
Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worse case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.
This is part of XSA-261.
Address actual and potential compiler warnings. Fix formatting.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:23:01 +0000 (18:23 +0100)]
x86/traps: Fix handling of #DB exceptions in hypervisor context
The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting. Swap it out for a ratelimited printk with just
enough information to work out what is going on.
Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take. We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:23:01 +0000 (18:23 +0100)]
x86/pv: Move exception injection into {,compat_}test_all_events()
This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.
The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 8 May 2018 17:23:01 +0000 (18:23 +0100)]
x86/traps: Fix %dr6 handing in #DB handler
Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB. Have the handler follow the manuals guidance, which
avoids leaking hypervisor debugging activities into guest context.
This is part of XSA-260 / CVE-2018-8897.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 25 Apr 2018 12:50:35 +0000 (14:50 +0200)]
x86: fix slow int80 path after XPTI additions
For the int80 slow path to jump to handle_exception_saved, %r14 needs to
be set up suitably for XPTI purposes. This is because of the difference
in nature between the int80 path (which is synchronous WRT guest
actions) and the exception path which is potentially asynchronous.
This is XSA-259.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5a5c368faf45ced8a8c6235f4fbf5cdb38ec939f
master date: 2018-04-25 14:39:41 +0200
Anthony PERARD [Wed, 25 Apr 2018 12:50:19 +0000 (14:50 +0200)]
libxl: Specify format of inserted cdrom
Without this extra parameter on the QMP command, QEMU will guess the
format of the new file.
This is XSA-258.
Reported-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: d8f65e68a7c1047fad97206a6282c281247fadc2
master date: 2018-04-25 14:38:47 +0200
Andrew Cooper [Wed, 18 Apr 2018 14:56:22 +0000 (16:56 +0200)]
x86/msr: Correct the emulation behaviour of MSR_PRED_CMD
Experimentally, the behaviour of reserved bits in MSR_PRED_CMD changed between
beta and production microcode, and now raises a #GP fault for set reserved
bits. The AMD spec for future hardware also specifies this behaviour, and it
is the more sensible behaviour to implement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/msr: further correct the emulation behaviour of MSR_PRED_CMD
Following commit a6aa678fa3 ("x86/msr: Correct the emulation behaviour
of MSR_PRED_CMD") we may end up writing the low bit with the wrong
value. While it's unlikely for a guest to want to write zero there, we
should still permit (this without incurring the overhead of an actual
barrier). Correcting this right away will also help whenever further
bits in the MSR might become defined.
Jan Beulich [Wed, 18 Apr 2018 14:55:18 +0000 (16:55 +0200)]
x86/HVM: suppress I/O completion for port output
We don't break up port requests in case they cross emulation entity
boundaries, and a write to an I/O port is necessarily the last
operation of an instruction instance, so there's no need to re-invoke
the full emulation path upon receiving the result from an external
emulator.
In case we want to properly split port accesses in the future, this
change will need to be reverted, as it would prevent things working
correctly when e.g. the first part needs to go to an external emulator,
while the second part is to be handled internally.
While this addresses the reported problem of Windows paging out the
buffer underneath an in-process REP OUTS, it does not address the wider
problem of the re-issued insn (to the insn emulator) being prone to
raise an exception (#PF) during a replayed, previously successful memory
access (we only record prior MMIO accesses).
Leaving aside the problem tried to be worked around here, I think the
performance aspect alone is a good reason to change the behavior.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 91afb8139f954a06e564d4915bc7d6a8575e2812
master date: 2018-04-11 10:42:24 +0200
Andrew Cooper [Wed, 18 Apr 2018 14:54:52 +0000 (16:54 +0200)]
x86/pv: Fix up erroneous segments for 32bit syscall entry
The existing FLAT_KERNEL_SS expands to the correct value, 0xe02b, but is the
wrong constant to use. Switch to FLAT_USER_SS32.
For compat domains however, the reported values are entirely bogus.
FLAT_USER_SS32 (value 0xe02b) is FLAT_RING3_CS in the 32bit ABI, while
FLAT_USER_CS32 (value 0xe023) is FLAT_RING1_DS with an RPL of 3.
The guests SYSCALL callback is invoked with a broken iret frame, and if left
unmodified by the guest, will fail on the way back out when Xen's iret tries
to load a code segment into %ss.
In practice, this is only a problem for 32bit PV guests on AMD hardware, as
Intel hardware doesn't permit the SYSCALL instruction outside of 64bit mode.
This appears to have been broken ever since 64bit support was added to Xen,
and has gone unnoticed because Linux doesn't use SYSCALL in 32bit builds.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: dba899de14989b3dff78009404ed891da7fefdc1
master date: 2018-04-09 13:12:18 +0100
Andrew Cooper [Wed, 18 Apr 2018 14:54:22 +0000 (16:54 +0200)]
x86/pv: Fix the handing of writes to %dr7
c/s 65e35549 "x86/PV: support data breakpoint extension registers"
accidentally broke the handing of writes. The call to activate_debugregs()
doesn't write %dr7 as v->arch.debugreg[7] hasn't been updated yet, and the
break skips the intended write to %dr7.
Remove the break, causing execution to hit the write_debugreg(7, value); in
context at the bottom of the hunk, which in turn causes hardware to be updated
appropriately.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: adf8feba1afa040f3a84a82953e18af02060884a
master date: 2018-03-29 15:12:21 +0100
Andrew Cooper [Mon, 9 Apr 2018 10:25:18 +0000 (12:25 +0200)]
x86/emul: Fix backport of "x86/emul: Fix the decoding of segment overrides in 64bit mode"
The logic in c/s b7dce29d is correct for master, but depends on c/s 216c29765
"x86emul: don't assume a memory operand", and in particular changed the type
of override_seg and the constant used to signify "no override".
When backported, the "no override in place" case causes x86_seg_none to be
used in place of x86_seg_ds, which has implications for the correctness of
emulated memory reads and writes.
Problems from this manifest as a reliable failure to boot shadow guests in the
following manner:
(d3) [15145.989017] Press F12 for boot menu.
(d3) [15145.989730]
(d3) [15146.000273] Boot device: CD-Rom5814MB medium detected
(d3) [15146.013234] - success.
(XEN) [15146.324244] paging.c:694:d0v3 Tried to do a paging op on itself.
(XEN) [15204.719457] sh error: sh_remove_shadows(): can't find all shadows of mfn 112cd54 (shadow_flags=00002000)
(XEN) [15204.719462] domain_crash called from common.c:2884
(XEN) [15204.719466] Domain 3 (vcpu#0) crashed on cpu#31:
Other problems with regular MMIO emulation haven't been observed, but
shouldn't be ruled out.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 22 Mar 2018 09:24:16 +0000 (10:24 +0100)]
x86/PV: also cover Dom0 in SPEC_CTRL / PRED_CMD emulation
Introduce a helper wrapping the pv_cpuid()-style domain_cpuid() /
cpuid_count() (or alike) invocations, and use it instead of plain
domain_cpuid() in MSR access emulation.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Ross Lagerwall [Thu, 22 Mar 2018 09:23:21 +0000 (10:23 +0100)]
x86: Move microcode loading earlier
Move microcode loading earlier for the boot CPU and secondary CPUs so
that it takes place before identify_cpu() is called for each CPU.
Without this, the detected features may be wrong if the new microcode
loading adjusts the feature bits. That could mean that some fixes (e.g. d6e9f8d4f35d ("x86/vmx: fix vmentry failure with TSX bits in LBR"))
don't work as expected.
Previously during boot, the microcode loader was invoked for each
secondary CPU started and then again for each CPU as part of an
initcall. Simplify the code so that it is invoked exactly once for each
CPU during boot.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f97838bbd980a0104e16c4a12fbf514f9fa805f1
master date: 2017-04-19 17:08:01 +0100
Jason Andryuk [Tue, 20 Mar 2018 13:49:26 +0000 (14:49 +0100)]
x86/entry: Fix passing 6th argument for compat hypercalls
Commit ec05090403ef4d760fbe701e31afd0f0edc414d5 ("x86/entry: Erase guest
GPR state on entry to Xen") zero-ed %rbp, compat arg 6, but it is not
restored before passing to hypercalls. We need to pass the saved compat
arg 6 to the hypercall in r9, the 6th function argument.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Liran Alon [Tue, 20 Mar 2018 13:48:57 +0000 (14:48 +0100)]
x86/vlapic: clear TMR bit upon acceptance of edge-triggered interrupt to IRR
According to Intel SDM section "Interrupt Acceptance for Fixed Interrupts":
"The trigger mode register (TMR) indicates the trigger mode of the
interrupt (see Figure 10-20). Upon acceptance of an interrupt
into the IRR, the corresponding TMR bit is cleared for
edge-triggered interrupts and set for level-triggered interrupts.
If a TMR bit is set when an EOI cycle for its corresponding
interrupt vector is generated, an EOI message is sent to
all I/O APICs."
Before this patch TMR-bit was cleared on LAPIC EOI which is not what
real hardware does. This was also confirmed in KVM upstream commit a0c9a822bf37 ("KVM: dont clear TMR on EOI").
Behavior after this patch is aligned with both Intel SDM and KVM
implementation.
Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 12a50030a81a14a3c7be672ddfde707b961479ec
master date: 2018-03-15 16:59:52 +0100
Jan Beulich [Tue, 20 Mar 2018 13:48:08 +0000 (14:48 +0100)]
cpufreq/ondemand: fix race while offlining CPU
Offlining a CPU involves stopping the cpufreq governor. The on-demand
governor will kill the timer before letting generic code proceed, but
since that generally isn't happening on the subject CPU,
cpufreq_dbs_timer_resume() may run in parallel. If that managed to
invoke the timer handler, that handler needs to run to completion before
dbs_timer_exit() may safely exit.
Make the "stoppable" field a tristate, changing it from +1 to -1 around
the timer function invocation, and make dbs_timer_exit() wait for it to
become non-negative (still writing zero if it's +1).
Also adjust coding style in cpufreq_dbs_timer_resume().
Reported-by: Martin Cerveny <martin@c-home.cz> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Martin Cerveny <martin@c-home.cz> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 185413355fe331cbc926d48568838227234c9a20
master date: 2018-03-09 17:30:49 +0100
Jan Beulich [Tue, 20 Mar 2018 13:47:33 +0000 (14:47 +0100)]
x86: remove CR reads from exit-to-guest path
CR3 is - during normal operation - only ever loaded from v->arch.cr3,
so there's no need to read the actual control register. For CR4 we can
generally use the cached value on all synchronous entry end exit paths.
Drop the write_cr3 macro, as the two use sites are probably easier to
follow without its use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 31bf55cb5fe3796cf6a4efbcfc0a9418bb1c783f
master date: 2018-03-06 16:49:36 +0100
Jan Beulich [Tue, 20 Mar 2018 13:47:04 +0000 (14:47 +0100)]
x86: slightly reduce Meltdown band-aid overhead
I'm not sure why I didn't do this right away: By avoiding the use of
global PTEs in the cloned directmap, there's no need to fiddle with
CR4.PGE on any of the entry paths. Only the exit paths need to flush
global mappings.
The reduced flushing, however, requires that we now have interrupts off
on all entry paths until after the page table switch, so that flush IPIs
can't be serviced while on the restricted pagetables, leaving a window
where a potentially stale guest global mapping can be brought into the
TLB. Along those lines the "sync" IPI after L4 entry updates now needs
to become a real (and global) flush IPI, so that inside Xen we'll also
pick up such changes.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: correct EFLAGS.IF in SYSENTER frame
Commit 9d1d31ad94 ("x86: slightly reduce Meltdown band-aid overhead")
moved the STI past the PUSHF. While this isn't an active problem (as we
force EFLAGS.IF to 1 before exiting to guest context), let's not risk
internal confusion by finding a PV guest frame with interrupts
apparently off.
Jan Beulich [Tue, 20 Mar 2018 13:46:22 +0000 (14:46 +0100)]
x86/xpti: don't map stack guard pages
Other than for the main mappings, don't even do this in release builds,
as there are no huge page shattering concerns here.
Note that since we don't run on the restructed page tables while HVM
guests execute, the non-present mappings won't trigger the triple fault
issue AMD SVM is susceptible to with our current placement of STGI vs
TR loading.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d303784b68237ff3050daa184f560179dda21b8c
master date: 2018-03-06 16:46:57 +0100
Andrew Cooper [Tue, 20 Mar 2018 13:45:50 +0000 (14:45 +0100)]
x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings
The current XPTI implementation isolates the directmap (and therefore a lot of
guest data), but a large quantity of CPU0's state (including its stack)
remains visible.
Furthermore, an attacker able to read .text is in a vastly superior position
to normal when it comes to fingerprinting Xen for known vulnerabilities, or
scanning for ROP/Spectre gadgets.
Collect together the entrypoints in .text.entry (currently 3x4k frames, but
can almost certainly be slimmed down), and create a common mapping which is
inserted into each per-cpu shadow. The stubs are also inserted into this
mapping by pointing at the in-use L2. This allows stubs allocated later (SMP
boot, or CPU hotplug) to work without further changes to the common mappings.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/xpti: really hide almost all of Xen image
Commit 422588e885 ("x86/xpti: Hide almost all of .text and all
.data/.rodata/.bss mappings") carefully limited the Xen image cloning to
just entry code, but then overwrote the just allocated and populated L3
entry with the normal one again covering both Xen image and stubs.
Drop the respective code in favor of an explicit clone_mapping()
invocation. This in turn now requires setup_cpu_root_pgt() to run after
stub setup in all cases. Additionally, with (almost) no unintended
mappings left, the BSP's IDT now also needs to be page aligned.
The moving ahead of cleanup_cpu_root_pgt() is not strictly necessary
for functionality, but things are more logical this way, and we retain
cleanup being done in the inverse order of setup.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/traps: Put idt_table[] back into .bss
c/s d1d6fc97d "x86/xpti: really hide almost all of Xen image" accidentially
moved idt_table[] from .bss to .data by virtue of using the page_aligned
section. We also have .bss.page_aligned, so use that.
Quan Xu [Tue, 20 Mar 2018 13:44:45 +0000 (14:44 +0100)]
x86/apicv: fix wrong IPI suppression during posted interrupt delivery
__vmx_deliver_posted_interrupt() wrongly used a softirq bit to decide whether
to suppress an IPI. Its logic was: the first time an IPI was sent, we set
the softirq bit. Next time, we would check that softirq bit before sending
another IPI. If the 1st IPI arrived at the pCPU which was in
non-root mode, the hardware would consume the IPI and sync PIR to vIRR.
During the process, no one (both hardware and software) will clear the
softirq bit. As a result, the following IPI would be wrongly suppressed.
This patch discards the suppression check, always sending an IPI.
The softirq also need to be raised. But there is a little change.
This patch moves the place where we raise a softirq for
'cpu != smp_processor_id()' case to the IPI interrupt handler.
Namely, don't raise a softirq for this case and set the interrupt handler
to pi_notification_interrupt()(in which a softirq is raised) regardless of
VT-d PI enabled or not. The only difference is when an IPI arrives at the
pCPU which is happened in non-root mode, the code will not raise a useless
softirq since the IPI is consumed by hardware rather than raise a softirq
unconditionally.
Signed-off-by: Quan Xu <xuquan8@huawei.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b151125b4d89d7ec139ac34470e3c709fb4b1b4d
master date: 2017-03-03 12:00:35 +0100
Jan Beulich [Fri, 16 Mar 2018 16:20:59 +0000 (17:20 +0100)]
x86: ignore guest microcode loading attempts
The respective MSRs are write-only, and hence attempts by guests to
write to these are - as of 1f1d183d49 ("x86/HVM: don't give the wrong
impression of WRMSR succeeding") no longer ignored. Restore original
behavior for the two affected MSRs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 59c0983e10d70ea2368085271b75fb007811fe52
master date: 2018-03-15 12:44:24 +0100
Jan Beulich [Tue, 6 Mar 2018 15:24:41 +0000 (16:24 +0100)]
x86/HVM: don't give the wrong impression of WRMSR succeeding
... for non-existent MSRs: wrmsr_hypervisor_regs()'s comment clearly
says that the function returns 0 for unrecognized MSRs, so
{svm,vmx}_msr_write_intercept() should not convert this into success. We
don't want to unconditionally fail the access though, as we can't be
certain the list of handled MSRs is complete enough for the guest types
we care about, so instead mirror what we do on the read paths and probe
the MSR to decide whether to raise #GP.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 1f1d183d49008794b087cf043fc77f724a45af98
master date: 2018-02-27 15:12:23 +0100
Jan Beulich [Tue, 6 Mar 2018 15:24:01 +0000 (16:24 +0100)]
x86/PV: fix off-by-one in I/O bitmap limit check
With everyone having their tags below agreeing that putting things the
other way around in the comparison makes things easier to understand, do
that rearrangement while changing the line anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.apu@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c6527bc66b6dd7a8dadaebb1047c8e52c6c5793c
master date: 2018-02-27 14:10:00 +0100
Andrew Cooper [Tue, 6 Mar 2018 15:22:54 +0000 (16:22 +0100)]
x86/pv: Avoid leaking other guests' MSR_TSC_AUX values into PV context
If the CPU pipeline supports RDTSCP or RDPID, a guest can observe the value in
MSR_TSC_AUX, irrespective of whether the relevant CPUID features are
advertised/hidden.
At the moment, paravirt_ctxt_switch_to() only writes to MSR_TSC_AUX if
TSC_MODE_PVRDTSCP mode is enabled, but this is not the default mode.
Therefore, default PV guests can read the value from a previously scheduled
HVM vcpu, or TSC_MODE_PVRDTSCP-enabled PV guest.
Alter the PV path to always write to MSR_TSC_AUX, using 0 in the common case.
To amortise overhead cost, introduce wrmsr_tsc_aux() which performs a lazy
update of the MSR, and use this function consistently across the codebase.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: cc0e45db277922b5723a7b1d9657d6f744230cf1
master date: 2018-02-27 10:47:23 +0000
Igor Druzhinin [Tue, 6 Mar 2018 15:22:09 +0000 (16:22 +0100)]
x86/nmi: start NMI watchdog on CPU0 after SMP bootstrap
We're noticing a reproducible system boot hang on certain
Skylake platforms where the BIOS is configured in legacy
boot mode with x2APIC disabled. The system stalls immediately
after writing the first SMP initialization sequence into APIC ICR.
The cause of the problem is watchdog NMI handler execution -
somewhere near the end of NMI handling (after it's already
rescheduled the next NMI) it tries to access IO port 0x61
to get the actual NMI reason on CPU0. Unfortunately, this
port is emulated by BIOS using SMIs and this emulation for
some reason takes more time than we expect during INIT-SIPI-SIPI
sequence. As the result, the system is constantly moving between
NMI and SMI handler and not making any progress.
To avoid this, initialize the watchdog after SMP bootstrap on
CPU0 and, additionally, protect the NMI handler by moving
IO port access before NMI re-scheduling. The latter should also
help in case of post boot CPU onlining. Although we're running
watchdog at much lower frequency at this point, it's neveretheless
possible we may trigger the issue anyway.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a44f1697968e04fcc6145e3bd51c748b57047240
master date: 2018-02-20 10:16:56 +0100
Jan Beulich [Tue, 6 Mar 2018 15:21:26 +0000 (16:21 +0100)]
x86/srat: fix end calculation in nodes_cover_memory()
Along the lines of commit 7226486767 ("x86/srat: fix the end pfn check
in valid_numa_range()") nodes_cover_memory() also doesn't consistently
use "end": It's set to an inclusive value initially, but then compared
to the exclusive "end" field of struct node and also possibly set to
nodes[j].start, making it exclusive too. Change the initialization to
make the variable consistently exclusive.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fdbed42649eb064e7c6d1bae2bdd4f46e7b2a160
master date: 2018-02-15 18:17:32 +0100
Andrew Cooper [Tue, 6 Mar 2018 15:19:35 +0000 (16:19 +0100)]
x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST
DO_OVERWRITE_RSB clobbers %rax, meaning in practice that the bti_ist_info
field gets zeroed. Older versions of this code had the DO_OVERWRITE_RSB
register selectable, so reintroduce this ability and use it to cause the
INTR_IST path to use %rdx instead.
The use of %dl for the %cs.rpl check means that when an IST interrupt hits
Xen, we try to load 1 into the high 32 bits of MSR_SPEC_CTRL, suffering a #GP
fault instead.
Also, drop an unused label which was a copy/paste mistake.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2b08fbed388f18235fda5ba1655c1483ef3e215
master date: 2018-02-14 13:22:15 +0000
Haozhong Zhang [Tue, 6 Mar 2018 15:19:03 +0000 (16:19 +0100)]
x86/srat: fix the end pfn check in valid_numa_range()
... and fix the coding style on fly.
valid_numa_range(..., epfn << PAGE_SHIFT, ...) and its only caller
memory_add(..., epfn, pxm) interpret epfn inconsistently. The former
interprets epfn as the last pfn, while the latter interprets it as the
last pfn plus one. Fix this inconsistency in valid_numa_range(), since
most of other places use the latter interpretation.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 722648676751fda39086f54d961640f88174360b
master date: 2018-02-12 11:08:33 +0000
Jan Beulich [Tue, 6 Mar 2018 15:18:28 +0000 (16:18 +0100)]
x86: reduce Meltdown band-aid IPI overhead
In case we can detect single-threaded guest processes (by checking
whether we can account for all root page table uses locally on the vCPU
that's running), there's no point in issuing a sync IPI upon an L4 entry
update, as no other vCPU of the guest will have that page table loaded.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a22320e32dca0918ed23799583f470afe4c24330
master date: 2018-02-07 16:31:41 +0100
Andrew Cooper [Tue, 6 Mar 2018 15:17:48 +0000 (16:17 +0100)]
x86/emul: Fix the emulation of invlpga
The instruction requires EFER.SVME set to be usable in the first place.
Furthermore, the emulation doesn't handle ASIDs, so avoid giving the
impression that they work. Permit ASID 0 which is reserved for non-root
mode (in which case the instruction is identical to invlpg), but raise #UD for
any other ASID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a91b2ec337a45d5d98e5a4387aa6563bc5cdc4c9
master date: 2018-02-05 18:17:22 +0000