Jan Beulich [Wed, 7 Nov 2018 08:51:44 +0000 (09:51 +0100)]
x86: work around HLE host lockup erratum
XACQUIRE prefixed accesses to the 4Mb range of memory starting at 1Gb
are liable to lock up the processor. Disallow use of this memory range.
Unfortunately the available Core Gen7 and Gen8 spec updates are pretty
old, so I can only guess that they're similarly affected when Core Gen6
is and the Xeon counterparts are, too.
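A minimal sketch of the kind of overlap test that excluding this range implies. The bounds come from the description above (4 MiB starting at 1 GiB); the macro and function names are hypothetical and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bounds, from the commit description: 4 MiB starting at 1 GiB. */
#define HLE_BAD_START  (UINT64_C(1) << 30)
#define HLE_BAD_SIZE   (UINT64_C(4) << 20)

/*
 * Hypothetical helper: does the half-open physical range
 * [start, start + size) intersect the range hit by the erratum?
 * The caller is assumed to pass a range which does not wrap.
 */
static bool overlaps_hle_bad_range(uint64_t start, uint64_t size)
{
    uint64_t end = start + size;                      /* exclusive */
    uint64_t bad_end = HLE_BAD_START + HLE_BAD_SIZE;  /* exclusive */

    return size != 0 && start < bad_end && end > HLE_BAD_START;
}
```

A memory allocator could skip any candidate region for which such a check returns true.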
This is part of XSA-282.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cc76410d20aff2cc07b268b0713dc1d2740c6e12
master date: 2018-11-07 09:33:24 +0100
In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 46029da12e5efeca6d957e5793bd34f2965fa0a1
master date: 2018-10-24 14:43:05 +0100
In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.
Introduce arch_vcpu_regs_init() to set various architectural defaults, and
reuse this in the hvm_vcpu_reset_state() path.
Architecturally, %edx's init state contains the processor's model information,
and 0xf looks to be a remnant of the old Intel processors. We clearly have no
software which cares, seeing as it is wrong for the last decade's worth of
Intel hardware and for all other vendors, so let's use the value 0 for
simplicity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/domain: Fix build with GCC 4.3.x
GCC 4.3.x can't initialise the user_regs structure like this.
Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: dfba4d2e91f63a8f40493c4fc2db03fd8287f6cb
master date: 2018-10-24 14:43:05 +0100
master commit: 0a1fa635029d100d4b6b7eddb31d49603217cab7
master date: 2018-10-30 13:26:21 +0000
Andrew Cooper [Mon, 5 Nov 2018 15:16:45 +0000 (16:16 +0100)]
x86/boot: Initialise the debug registers correctly
In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.
Move X86_DR6_DEFAULT into x86-defns.h along with the other architectural
register constants, and introduce a new X86_DR7_DEFAULT. Use the existing
write_debugreg() helper, rather than opencoded inline assembly.
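A small sketch of why 0 is a bad initial value for %dr6. The constant values are the architectural reset values as I understand them from the Intel SDM (an assumption, not copied from the patch), and the helper name is hypothetical. The RTM bit (bit 16) has inverted polarity: it reads as 0 when a debug exception occurred inside a transaction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed architectural reset values; names mirror those in the commit. */
#define X86_DR6_DEFAULT  UINT32_C(0xffff0ff0)
#define X86_DR7_DEFAULT  UINT32_C(0x00000400)

/* %dr6 bit 16 (RTM): clear means "#DB was raised inside a transaction". */
#define X86_DR6_RTM      (UINT32_C(1) << 16)

/* Hypothetical helper: does this %dr6 value report a transactional #DB? */
static bool dr6_reports_rtm_debug(uint32_t dr6)
{
    return !(dr6 & X86_DR6_RTM);
}
```

With these definitions, a %dr6 of 0 reports a transactional debug exception that never happened, while X86_DR6_DEFAULT does not.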
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 721da6d41a70fe08b3fcd9c31a62f6709a54c6ba
master date: 2018-10-24 14:43:05 +0100
Jan Beulich [Mon, 5 Nov 2018 15:14:50 +0000 (16:14 +0100)]
x86: fix "xpti=" and "pv-l1tf=" yet again
While commit 2a3b34ec47 ("x86/spec-ctrl: Yet more fixes for xpti=
parsing") indeed fixed "xpti=dom0", it broke "xpti=no-dom0", in that
this then became equivalent to "xpti=no". In particular, the presence
of "xpti=" alone on the command line means nothing as to which default
is to be overridden; "xpti=no-dom0", for example, ought to have no
effect for DomU-s, as this is distinct from both "xpti=no-dom0,domu"
and "xpti=no-dom0,no-domu".
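The intended semantics can be illustrated with a tri-state sketch, where -1 means "not specified, keep the default". This is not Xen's parser: the struct, the function name and the restriction to the dom0/domu sub-options (no plain boolean form) are assumptions made for illustration.

```c
#include <string.h>

/* -1 = not specified (default applies); 0 = disabled; 1 = enabled. */
struct xpti_opts_sketch {
    int dom0;
    int domu;
};

/*
 * Hypothetical parser for a comma-separated list of "dom0", "domu",
 * "no-dom0" and "no-domu".  Only the settings actually named are
 * overridden.  Returns 0 on success, -1 on an unrecognised token.
 */
static int parse_xpti_sketch(const char *s, struct xpti_opts_sketch *o)
{
    o->dom0 = o->domu = -1;

    while ( *s )
    {
        size_t len = strcspn(s, ",");   /* length of the current token */
        const char *tok = s;
        size_t tlen = len;
        int val = 1;

        if ( tlen > 3 && !strncmp(tok, "no-", 3) )
        {
            val = 0;
            tok += 3;
            tlen -= 3;
        }

        if ( tlen == 4 && !strncmp(tok, "dom0", 4) )
            o->dom0 = val;
        else if ( tlen == 4 && !strncmp(tok, "domu", 4) )
            o->domu = val;
        else
            return -1;

        s += len;
        if ( *s == ',' )
            s++;
    }

    return 0;
}
```

In this model "no-dom0" leaves the DomU setting at -1, so whatever default applies to DomU-s is untouched, which is the distinction the message draws against "no-dom0,domu" and "no-dom0,no-domu".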
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8743d2dea539617e237c77556a91dc357098a8af
master date: 2018-10-04 14:49:56 +0200
Jan Beulich [Mon, 5 Nov 2018 15:13:09 +0000 (16:13 +0100)]
x86: silence false log messages for plain "xpti" / "pv-l1tf"
While commit 2a3b34ec47 ("x86/spec-ctrl: Yet more fixes for xpti=
parsing") claimed to have got rid of the 'parameter "xpti" has invalid
value "", rc=-22!' log message for "xpti" alone on the command line,
this wasn't the case (the option took effect nevertheless).
Fix this there as well as for plain "pv-l1tf".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fb57e4beefeda923446b73f88b392e59b07d847
master date: 2018-09-28 17:12:14 +0200
Ian Jackson [Tue, 18 Sep 2018 10:25:20 +0000 (11:25 +0100)]
stubdom/grub.patches: Drop docs changes, for licensing reasons
The patch file 00cvs is an import of a new upstream version of
grub1 from upstream CVS.
Unfortunately, in the period covered by the update, upstream changed
the documentation licence from a simple permissive licence, to the GNU
"Free Documentation Licence" with Front and Back Cover Texts.
The Debian Project is of the view that use of the Front and Back Cover
Texts feature of the GFDL makes the resulting document not Free
Software, because of the mandatory redistribution of these immutable
texts. (Personally, I agree.)
This is awkward because Debian do not want to ship non-free content.
So the Debian maintainers need to launder the upstream source code, to
remove the troublesome files. This is an extra step when
incorporating new upstream versions. It's particularly annoying for
security response, which often involves rebasing onto a new upstream
release.
grub1 is obsolete and the last change to Xen's PV grub1 stubdom code
was in 2016. Furthermore, the grub1 documentation is not built and
installed by the Xen pv-grub stubdom Makefiles.
Therefore, remove all docs changes from stubdom/grub.patches. This
means that there are now no longer any GFDL-licenced grub docs in
xen.git.
There is no user impact, and Debian is helped. This change would
complicate any attempts to update to a new version of upstream grub1,
but it seems unlikely that such a thing will ever happen.
Paul Durrant [Mon, 8 Oct 2018 12:51:33 +0000 (14:51 +0200)]
x86/hvm/emulate: make sure rep I/O emulation does not cross GFN boundaries
When emulating a rep I/O operation it is possible that the ioreq will
describe a single operation that spans multiple GFNs. This is fine as long
as all those GFNs fall within an MMIO region covered by a single device
model, but unfortunately the higher levels of the emulation code do not
guarantee that. This is something that should almost certainly be fixed,
but in the meantime this patch makes sure that MMIO is truncated at GFN
boundaries and hence the appropriate device model is re-evaluated for each
target GFN.
NOTE: This patch does not deal with the case of a single MMIO operation
spanning a GFN boundary. That is more complex to deal with and is
deferred to a subsequent patch.
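A sketch of the truncation idea, assuming 4k pages and a non-zero access size. The function and macro names are hypothetical, and the arithmetic is an illustration rather than the patch's code. A single access which itself straddles a boundary is left alone, matching the NOTE above.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u   /* assumed page size */

/*
 * Hypothetical helper: limit 'reps' so that a rep access starting at
 * 'gpa', 'size' bytes per iteration, stays within the page holding gpa.
 * 'df' set means the address decreases on each iteration.
 */
static uint32_t clamp_reps_to_gfn(uint64_t gpa, uint32_t size,
                                  uint32_t reps, bool df)
{
    uint32_t off = (uint32_t)(gpa & (PAGE_SIZE_SKETCH - 1));
    uint32_t fit;

    if ( !df )
        fit = (PAGE_SIZE_SKETCH - off) / size;  /* room up to the page end */
    else
        fit = (off + size) / size;              /* room down to the page start */

    if ( fit == 0 )
        fit = 1;   /* one access crossing the boundary: out of scope here */

    return reps < fit ? reps : fit;
}
```

The caller would then issue the remaining iterations as a fresh request, at which point the target device model is looked up again for the next GFN.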
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Convert calculations to be 32-bit only.
Ross Lagerwall [Mon, 8 Oct 2018 12:51:03 +0000 (14:51 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R540
When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.
BIOS Information
Vendor: Dell Inc.
Version: 1.3.7
Release Date: 02/09/2018
System Information
Manufacturer: Dell Inc.
Product Name: PowerEdge R540
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 328ca55b7bd47e1324b75cce2a6c461308ecf93d
master date: 2018-06-28 09:29:13 +0200
Ross Lagerwall [Mon, 8 Oct 2018 12:50:16 +0000 (14:50 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R740
When EFI booting the Dell PowerEdge R740, it consistently wanders into the
weeds and gets an invalid opcode in the EFI ResetSystem call.
Quirk this hardware to use the ACPI reboot method instead.
Jan Beulich [Fri, 14 Sep 2018 11:36:32 +0000 (13:36 +0200)]
x86: assorted array_index_nospec() insertions
Don't chance having Spectre v1 (including BCBS) gadgets. In some of the
cases the insertions are more of a precautionary nature rather than there
provably being a gadget, but I think we should err on the safe (secure)
side here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3f2002614af51dfd507168a1696658bac91155ce
master date: 2018-09-03 17:50:10 +0200
Jan Beulich [Fri, 14 Sep 2018 11:35:27 +0000 (13:35 +0200)]
rangeset: make inquiry functions tolerate NULL inputs
Rather than special casing the ->iomem_caps check in x86's
get_page_from_l1e() for the dom_xen case, let's be more tolerant in
general, along the lines of rangeset_is_empty(): A never allocated
rangeset can't possibly contain or overlap any range.
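The idea can be sketched with a simplified stand-in type. The struct below holds a single inclusive range and is not Xen's rangeset; all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in: one inclusive range [s, e], valid if 'populated'. */
struct rangeset_sketch {
    unsigned long s, e;
    bool populated;
};

static bool sketch_contains_range(const struct rangeset_sketch *r,
                                  unsigned long s, unsigned long e)
{
    if ( !r )
        return false;   /* never allocated: cannot contain anything */

    return r->populated && r->s <= s && e <= r->e;
}

static bool sketch_overlaps_range(const struct rangeset_sketch *r,
                                  unsigned long s, unsigned long e)
{
    if ( !r )
        return false;   /* never allocated: cannot overlap anything */

    return r->populated && r->s <= e && s <= r->e;
}
```

Callers can then query a possibly-absent set directly instead of checking the pointer first.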
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: ad0a9f273d6d6f0545cd9b708b2d4be581a6cadd
master date: 2018-08-17 13:54:40 +0200
Andrew Cooper [Fri, 14 Sep 2018 11:34:57 +0000 (13:34 +0200)]
x86/setup: Avoid OoB E820 lookup when calculating the L1TF safe address
A number of corner cases (most obviously, no-real-mode and no Multiboot memory
map) can end up with e820_raw.nr_map being 0, at which point the L1TF
calculation will underflow.
Spotted by Coverity.
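A sketch of the hazard, using a hypothetical entry type and helper rather than Xen's e820 structures: with an unsigned count of 0, indexing with count - 1 wraps around and reads far outside the array, so the empty case needs an explicit guard.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory map entry. */
struct e820_entry_sketch {
    uint64_t addr;
    uint64_t size;
};

/* End address of the last map entry, or 0 when the map is empty. */
static uint64_t top_of_map_sketch(const struct e820_entry_sketch *map,
                                  unsigned int nr)
{
    if ( nr == 0 )
        return 0;   /* map[nr - 1] would be an out-of-bounds access */

    return map[nr - 1].addr + map[nr - 1].size;
}
```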
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 3e4ec07e14bce81f6ae22c31ff1302d1f297a226
master date: 2018-08-16 18:10:07 +0100
Paul Durrant [Fri, 14 Sep 2018 11:34:26 +0000 (13:34 +0200)]
x86/hvm/ioreq: MMIO range checking completely ignores direction flag
hvm_select_ioreq_server() is used to route an ioreq to the appropriate
ioreq server. For MMIO this is done by comparing the range of the ioreq
to the ranges registered by the device models of each ioreq server.
Unfortunately the calculation of the range of the ioreq completely ignores
the direction flag and thus may calculate the wrong range for comparison.
Thus the ioreq may either be routed to the wrong server or erroneously
terminated by null_ops.
NOTE: The patch also fixes whitespace in the switch statement to make it
style compliant.
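A sketch of a direction-aware range calculation. The struct is a hypothetical cut-down ioreq and the helper names are illustrative; 'count' is assumed to be at least 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of an ioreq. */
struct ioreq_sketch {
    uint64_t addr;    /* address of the first access */
    uint32_t size;    /* bytes per access */
    uint32_t count;   /* number of accesses, >= 1 */
    bool df;          /* direction flag: set means addresses decrease */
};

/* Lowest byte address touched by the request. */
static uint64_t mmio_first_byte_sketch(const struct ioreq_sketch *p)
{
    return p->df ? p->addr - (uint64_t)(p->count - 1) * p->size
                 : p->addr;
}

/* Highest byte address touched by the request. */
static uint64_t mmio_last_byte_sketch(const struct ioreq_sketch *p)
{
    return p->df ? p->addr + p->size - 1
                 : p->addr + (uint64_t)p->count * p->size - 1;
}
```

With the flag set, the range extends below the starting address rather than above it, which is what a flag-blind calculation gets wrong.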
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 60a56dc0064a00830663ffe48215dcd080cb9504
master date: 2018-08-15 14:14:06 +0200
Andrew Cooper [Fri, 14 Sep 2018 11:33:59 +0000 (13:33 +0200)]
x86/vlapic: Bugfixes and improvements to vlapic_{read,write}()
Firstly, there is no 'offset' boundary check on the non-32-bit write path
before the call to vlapic_read_aligned(), which allows an attacker to read
beyond the end of vlapic->regs->data[], which is only 1024 bytes long.
However, as the backing memory is a domheap page, and misaligned accesses get
chunked down to single bytes across page boundaries, I can't spot any
XSA-worthy problems which occur from the overrun.
On real hardware, bad accesses don't instantly crash the machine. Their
behaviour is undefined, but the domain_crash() prohibits sensible testing.
Behave more like other x86 MMIO and terminate bad accesses with appropriate
defaults.
While making these changes, clean up and simplify the smaller-access
handling. In particular, avoid pointer based mechanisms for 1/2-byte reads so
as to avoid forcing the value to be spilled to the stack.
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-175 (-175)
function old new delta
vlapic_read 211 142 -69
vlapic_write 304 198 -106
Finally, there are a plethora of read/write functions in the vlapic namespace,
so rename these to vlapic_mmio_{read,write}() to make their purpose more
clear.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b6f43c14cef3af8477a9eca4efab87dd150a2885
master date: 2018-08-10 13:27:24 +0100
Wei Liu [Tue, 7 Aug 2018 14:35:34 +0000 (15:35 +0100)]
xl.conf: Add global affinity masks
XSA-273 involves one hyperthread being able to use Spectre-like
techniques to "spy" on another thread. The details are somewhat
complicated, but the upshot is that after all Xen-based mitigations
have been applied:
* PV guests cannot spy on sibling threads
* HVM guests can spy on sibling threads
(NB that for purposes of this vulnerability, PVH and HVM guests are
identical. Whenever this comment refers to 'HVM', this includes PVH.)
There are many possible mitigations to this, including disabling
hyperthreading entirely. But another solution would be:
* Specify some cores as PV-only, others as PV or HVM
* Allow HVM guests to only run on thread 0 of the "HVM-or-PV" cores
* Allow PV guests to run on the above cores, as well as any thread of the PV-only cores.
For example, suppose you had 16 threads across 8 cores (0-7). You
could specify 0-3 as PV-only, and 4-7 as HVM-or-PV. Then you'd set
the affinity of the HVM guests as follows (binary representation):
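One way such masks could be derived for the 16-thread example is sketched below. It assumes that thread t of core c is CPU number c * 2 + t, that bit n of a mask stands for CPU n, and that PV guests are likewise kept to thread 0 of the shared cores. The helper names are hypothetical, and real topologies may number threads differently.

```c
#include <stdint.h>

#define THREADS_PER_CORE 2u   /* assumption: CPU number = core * 2 + thread */

/* Mask with only thread 0 of every core in [first_core, last_core]. */
static uint32_t thread0_mask(unsigned int first_core, unsigned int last_core)
{
    uint32_t mask = 0;

    for ( unsigned int c = first_core; c <= last_core; c++ )
        mask |= UINT32_C(1) << (c * THREADS_PER_CORE);

    return mask;
}

/* Mask with every thread of every core in [first_core, last_core]. */
static uint32_t all_threads_mask(unsigned int first_core,
                                 unsigned int last_core)
{
    uint32_t mask = 0;

    for ( unsigned int c = first_core; c <= last_core; c++ )
        for ( unsigned int t = 0; t < THREADS_PER_CORE; t++ )
            mask |= UINT32_C(1) << (c * THREADS_PER_CORE + t);

    return mask;
}

/* Cores 0-3 are PV-only, cores 4-7 are HVM-or-PV. */
static uint32_t hvm_mask_sketch(void)
{
    return thread0_mask(4, 7);
}

static uint32_t pv_mask_sketch(void)
{
    return all_threads_mask(0, 3) | thread0_mask(4, 7);
}
```

Under these assumptions HVM guests end up on CPUs 8, 10, 12 and 14, and PV guests on CPUs 0-7 plus those same four.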
In order to make this easy, this patch introduces three "global affinity
masks", placed in xl.conf:
vm.cpumask
vm.hvm.cpumask
vm.pv.cpumask
These are parsed just like the 'cpus' and 'cpus_soft' options in the
per-domain xl configuration files. The resulting mask is AND-ed with
whatever mask results at the end of the xl configuration file.
`vm.cpumask` would be applied to all guest types, `vm.hvm.cpumask`
would be applied to HVM and PVH guest types, and `vm.pv.cpumask`
would be applied to PV guest types.
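The AND-ing described above can be sketched as follows. Masks are modelled as plain 32-bit integers, an unset global mask is assumed to be passed as all-ones, and every name here is hypothetical rather than taken from xl.

```c
#include <stdint.h>

/* PVH is treated the same as HVM for the purposes of these masks. */
enum guest_type_sketch { GUEST_PV_SKETCH, GUEST_HVM_SKETCH };

/*
 * Combine the mask from the per-domain configuration with the global
 * masks: vm.cpumask always applies, then the type-specific one.
 */
static uint32_t effective_mask_sketch(uint32_t domain_mask,
                                      enum guest_type_sketch type,
                                      uint32_t vm_mask,
                                      uint32_t hvm_mask,
                                      uint32_t pv_mask)
{
    uint32_t mask = domain_mask & vm_mask;

    mask &= (type == GUEST_HVM_SKETCH) ? hvm_mask : pv_mask;

    return mask;
}
```

Because the combination is a pure intersection, a global mask can only narrow what a domain's own configuration asks for, never widen it.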
The idea would be that to implement the above mask across all your
VMs, you'd simply add the following two lines to the configuration
file:
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit aa67b97ed34279c43a43d9ca46727b5746caa92e)
PVH guest type in toolstack is not available in this version of Xen.
Change code and manpage to cope. Also, xl is still part of libxl in
this version, so manually backport code to relevant places.
Jan Beulich [Mon, 13 Aug 2018 11:07:23 +0000 (05:07 -0600)]
x86: Make "spec-ctrl=no" a global disable of all mitigations
In order to have a simple and easy to remember means to suppress all the
more or less recent workarounds for hardware vulnerabilities, force
settings not controlled by "spec-ctrl=" also to their original defaults,
unless they've been forced to specific values already by earlier command
line options.
This is part of XSA-273.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit d8800a82c3840b06b17672eddee4878bbfdacc6d)
Andrew Cooper [Tue, 29 May 2018 17:44:16 +0000 (18:44 +0100)]
x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests
This mitigation requires up-to-date microcode, and is enabled by default on
affected hardware if available, and is used for HVM guests.
The default for SMT/Hyperthreading is far more complicated to reason about,
not least because we don't know if the user is going to want to run any HVM
guests to begin with. If an explicit default isn't given, nag the user to
perform a risk assessment and choose an explicit default, and leave other
configuration to the toolstack.
This is part of XSA-273 / CVE-2018-3620.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3bd36952dab60290f33d6791070b57920e10754b)
Andrew Cooper [Fri, 13 Apr 2018 15:34:01 +0000 (15:34 +0000)]
x86/msr: Virtualise MSR_FLUSH_CMD for guests
Guests (outside of the nested virt case, which isn't supported yet) don't need
L1D_FLUSH for their L1TF mitigations, but offering/emulating MSR_FLUSH_CMD is
easy and doesn't pose an issue for Xen.
The MSR is offered to HVM guests only. PV guests attempting to use it would
trap for emulation, and the L1D cache would fill long before the return to
guest context. As such, PV guests can't make any use of the L1D_FLUSH
functionality.
This is part of XSA-273 / CVE-2018-3646.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fd9823faf9df057a69a9a53c2e100691d3f4267c)
Andrew Cooper [Wed, 28 Mar 2018 14:21:39 +0000 (15:21 +0100)]
x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH
This is part of XSA-273 / CVE-2018-3646.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3563fc2b2731a63fd7e8372ab0f5cef205bf8477)
x86/pv: Force a guest into shadow mode when it writes an L1TF-vulnerable PTE
See the comment in shadow.h for an explanation of L1TF and the safety
consideration of the PTEs.
In the case that CONFIG_SHADOW_PAGING isn't compiled in, crash the domain
instead. This allows well-behaved PV guests to function, while preventing
L1TF from being exploited. (Note: PV guest kernels which haven't been updated
with L1TF mitigations will likely be crashed as soon as they try paging a
piece of userspace out to disk.)
This is part of XSA-273 / CVE-2018-3620.
Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 06e8b622d3f3c0fa5075e91b041c6f45549ad70a)
Andrew Cooper [Mon, 23 Jul 2018 06:11:40 +0000 (08:11 +0200)]
x86/mm: Plumbing to allow any PTE update to fail with -ERESTART
Switching to shadow mode is performed in tasklet context. To facilitate this,
we schedule the tasklet, then create a hypercall continuation to allow the
switch to take place.
As a consequence, the x86 mm code needs to cope with an L1e operation being
continuable. do_mmu{,ext}_op() may no longer assert that a continuation
doesn't happen on the final iteration.
To handle the arguments correctly on continuation, compat_update_va_mapping*()
may no longer call into their non-compat counterparts. Move the compat
functions into mm.c rather than exporting __do_update_va_mapping() and
{get,put}_pg_owner(), and fix an unsigned long/int inconsistency with
compat_update_va_mapping_otherdomain().
This is part of XSA-273 / CVE-2018-3620.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c612481d1c9232c6abf91b03ec655e92f808805f)
x86/shadow: Infrastructure to force a PV guest into shadow mode
To mitigate L1TF, we cannot alter an architecturally-legitimate PTE a PV guest
chooses to write, but we can force the PV domain into shadow mode so Xen
controls the PTEs which are reachable by the CPU pagewalk.
Introduce new shadow mode, PG_SH_forced, and a tasklet to perform the
transition. Later patches will introduce the logic to enable this mode at the
appropriate time.
To simplify vcpu cleanup, make tasklet_kill() idempotent with respect to
tasklet_init(), which involves adding a helper to check for an uninitialised
list head.
This is part of XSA-273 / CVE-2018-3620.
Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b76ec3946bf6caca2c3950b857c008bc8db6723f)
Andrew Cooper [Mon, 23 Jul 2018 13:46:10 +0000 (13:46 +0000)]
x86/spec-ctrl: Introduce an option to control L1TF mitigation for PV guests
Shadowing a PV guest is only available when shadow paging is compiled in.
When shadow paging isn't available, guests can be crashed instead as
mitigation from Xen's point of view.
Ideally, dom0 would also be potentially-shadowed-by-default, but dom0 has
never been shadowed before, and there are some stability issues under
investigation.
This is part of XSA-273 / CVE-2018-3620.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 66a4e986819a86ba66ca2fe9d925e62a4fd30114)
Andrew Cooper [Wed, 25 Jul 2018 12:10:19 +0000 (12:10 +0000)]
x86/spec-ctrl: Calculate safe PTE addresses for L1TF mitigations
Safe PTE addresses for L1TF mitigations are ones which are within the L1D
address width (may be wider than reported in CPUID), and above the highest
cacheable RAM/NVDIMM/BAR/etc.
All logic here is best-effort heuristics, which should in practice be fine for
most hardware. Future work will see about disentangling the SRAT handling
further, as well as having L0 pass this information down to lower levels when
virtualised.
This is part of XSA-273 / CVE-2018-3620.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b03a57c9383b32181e60add6b6de12b473652aa4)
Christian Lindig [Mon, 13 Aug 2018 16:26:56 +0000 (17:26 +0100)]
tools/oxenstored: Make evaluation order explicit
In Store.path_write(), Path.apply_modify() updates the node_created
reference and both the value of apply_modify() and node_created are
returned by path_write().
At least with OCaml 4.06.1 this leads to the value of node_created being
returned *before* it is updated by apply_modify(). This in turn leads
to the quota for a domain not being updated in Store.write(). Hence, a
guest can create an unlimited number of entries in xenstore.
The fix is to make evaluation order explicit.
This is XSA-272.
Signed-off-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Rob Hoes <rob.hoes@citrix.com>
(cherry picked from commit 73392c7fd14c59f8c96e0b2eeeb329e4ae9086b6)
Andrew Cooper [Mon, 18 Jun 2018 08:12:39 +0000 (16:12 +0800)]
x86/vtx: Fix the checking for unknown/invalid MSR_DEBUGCTL bits
The VPMU_MODE_OFF early-exit in vpmu_do_wrmsr() introduced by c/s 11fe998e56 bypasses all reserved bit checking in the general case. As a
result, a guest can enable BTS when it shouldn't be permitted to, and
lock up the entire host.
With vPMU active (not a security supported configuration, but useful for
debugging), the reserved bit checking is broken, caused by the original
BTS changeset 1a8aa75ed.
From a correctness standpoint, it is not possible to have two different
pieces of code responsible for different parts of value checking, if
there isn't an accumulation of bits which have been checked. A
practical upshot of this is that a guest can set any value it
wishes (usually resulting in a vmentry failure for bad guest state).
Therefore, fix this by implementing all the reserved bit checking in the
main MSR_DEBUGCTL block, and removing all handling of DEBUGCTL from the
vPMU MSR logic.
This is XSA-269.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a8a8e99feb950504559196521bc9fd63ed3a962)
Andrew Cooper [Tue, 14 Aug 2018 10:20:53 +0000 (11:20 +0100)]
common/gnttab: Introduce command line feature controls
This patch was originally released as part of XSA-226. It retains the same
command line syntax (as various downstreams are mitigating XSA-226 using this
mechanism) but the defaults have been updated due to the revised XSA-226
patches, after which transitive grants are believed to be functioning
properly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit dc96c65ed6d7ffd4c95487373df708d97443cf77)
Jan Beulich [Thu, 19 Jul 2018 09:54:45 +0000 (11:54 +0200)]
VMX: fix vmx_{find,del}_msr() build
Older gcc at -O2 (and perhaps higher) does not recognize that apparently
uninitialized variables aren't really uninitialized. Pull out the
assignments used by two of the three case blocks and make them
initializers of the variables, as I think I had suggested during review.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 97cb0516a322ecdf0032fa9d8aa1525c03d7772f)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support load-only guest MSR list entries
Currently, the VMX_MSR_GUEST type maintains completely symmetric guest load
and save lists, by pointing VM_EXIT_MSR_STORE_ADDR and VM_ENTRY_MSR_LOAD_ADDR
at the same page, and setting VM_EXIT_MSR_STORE_COUNT and
VM_ENTRY_MSR_LOAD_COUNT to the same value.
However, for MSRs which we won't let the guest have direct access to, having
hardware save the current value on VMExit is unnecessary overhead.
To avoid this overhead, we must make the load and save lists asymmetric. By
making the entry load count greater than the exit store count, we can maintain
two adjacent lists of MSRs, the first of which is saved and restored, and the
second of which is only restored on VMEntry.
For simplicity:
* Both adjacent lists are still sorted by MSR index.
* It is undefined behaviour to insert the same MSR into both lists.
* The total size of both lists is still limited at 256 entries (one 4k page).
Split the current msr_count field into msr_{load,save}_count, and introduce a
new VMX_MSR_GUEST_LOADONLY type, and update vmx_{add,find}_msr() to calculate
which sublist to search, based on type. VMX_MSR_HOST has no logical sublist,
whereas VMX_MSR_GUEST has a sublist between 0 and the save count, while
VMX_MSR_GUEST_LOADONLY has a sublist between the save count and the load
count.
One subtle point is that inserting an MSR into the load-save list involves
moving the entire load-only list, and updating both counts.
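The two adjacent sublists and the "subtle point" about insertion can be sketched as below. The types and names are hypothetical, the lookup is a linear scan for brevity, and the entry layout (16 bytes, so 256 per 4k page) is an assumption consistent with the limit quoted above.

```c
#include <stdint.h>
#include <string.h>

#define MSR_LIST_MAX_SKETCH 256u   /* one 4k page of 16-byte entries */

struct msr_entry_sketch {
    uint32_t index;
    uint32_t mbz;
    uint64_t data;
};

struct msr_list_sketch {
    struct msr_entry_sketch ent[MSR_LIST_MAX_SKETCH];
    unsigned int save_count;   /* [0, save_count): saved and loaded */
    unsigned int load_count;   /* [save_count, load_count): load-only */
};

enum list_kind_sketch { LIST_LOAD_SAVE_SKETCH, LIST_LOAD_ONLY_SKETCH };

/* Insert or update 'msr' in the chosen sublist.  Returns 0, or -1 if full. */
static int msr_list_add_sketch(struct msr_list_sketch *l,
                               enum list_kind_sketch kind,
                               uint32_t msr, uint64_t val)
{
    unsigned int start = kind == LIST_LOAD_SAVE_SKETCH ? 0 : l->save_count;
    unsigned int end = kind == LIST_LOAD_SAVE_SKETCH ? l->save_count
                                                     : l->load_count;
    unsigned int pos = start;

    /* Find the sorted position within the sublist. */
    while ( pos < end && l->ent[pos].index < msr )
        pos++;

    if ( pos < end && l->ent[pos].index == msr )
    {
        l->ent[pos].data = val;   /* already present: just update */
        return 0;
    }

    if ( l->load_count >= MSR_LIST_MAX_SKETCH )
        return -1;

    /*
     * Shift everything from 'pos' to the end of the whole array.  For a
     * load/save insertion this moves the entire load-only list as well.
     */
    memmove(&l->ent[pos + 1], &l->ent[pos],
            (l->load_count - pos) * sizeof(l->ent[0]));

    l->ent[pos] = (struct msr_entry_sketch){ .index = msr, .data = val };

    l->load_count++;
    if ( kind == LIST_LOAD_SAVE_SKETCH )
        l->save_count++;          /* the boundary moves up by one, too */

    return 0;
}
```

In this model the exit-store count would correspond to save_count and the entry-load count to load_count, both describing the same array.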
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 1ac46b55632626aeb935726e1b0a71605ef6763a)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Pass an MSR value into vmx_msr_add()
The main purpose of this change is to allow us to set a specific MSR value,
without needing to know whether there is already a load/save list slot for it.
Previously, callers wanting this property needed to call both vmx_add_*_msr()
and vmx_write_*_msr() to cover both cases, and there are no callers which want
the old behaviour of being a no-op if an entry already existed for the MSR.
As a result of this API improvement, the default value for guest MSRs need not
be 0, and the default for host MSRs need not be passed via hardware register.
In practice, this cleans up the VPMU allocation logic, and avoids an MSR read
as part of vcpu construction.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ee7689b94ac7094b975ab4a023cfeae209da0a36)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Improvements to LBR MSR handling
The main purpose of this patch is to only ever insert the LBR MSRs into the
guest load/save list once, as a future patch wants to change the behaviour of
vmx_add_guest_msr().
The repeated processing of lbr_info and the guest's MSR load/save list is
redundant, and a guest using LBR itself will have to re-enable
MSR_DEBUGCTL.LBR in its #DB handler, meaning that Xen will repeat this
redundant processing every time the guest gets a debug exception.
Rename lbr_fixup_enabled to lbr_flags to be a little more generic, and use one
bit to indicate that the MSRs have been inserted into the load/save list.
Shorten the existing FIXUP* identifiers to reduce code volume.
Furthermore, handing the guest #MC on an error isn't a legitimate action. Two
of the three failure cases are definitely hypervisor bugs, and the third is a
boundary case which shouldn't occur in practice. The guest also won't execute
correctly, so handle errors by cleanly crashing the guest.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit be73a842e642772d7372004c9c105de35b771020)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support remote access to the MSR lists
At the moment, all modifications of the MSR lists are in current context.
However, future changes may need to put MSR_EFER into the lists from domctl
hypercall context.
Plumb a struct vcpu parameter down through the infrastructure, and use
vmx_vmcs_{enter,exit}() for safe access to the VMCS in vmx_add_msr(). Use
assertions to ensure that access is either in current context, or while the
vcpu is paused.
Note these expectations beside the fields in arch_vmx_struct, and reorder the
fields to avoid unnecessary padding.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 80599f0b770199116aa753bfdfac9bfe2e8ea86a)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Factor locate_msr_entry() out of vmx_find_msr() and vmx_add_msr()
Instead of having multiple algorithms searching the MSR lists, implement a
single one. It has the semantics required by vmx_add_msr(), to identify the
position in which an MSR should live, if it isn't already present.
There will be a marginal improvement for vmx_find_msr() by avoiding the
function pointer calls to vmx_msr_entry_key_cmp(), and a major improvement for
vmx_add_msr() by using a binary search instead of a linear search.
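A sketch of a binary search with these semantics: it returns the matching entry if there is one, and otherwise the position at which the MSR would have to be inserted to keep the list sorted. The entry type and function name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

struct msr_ent_sketch {
    uint32_t index;
    uint64_t data;
};

/*
 * Search the sorted range [start, end) for 'msr'.  Returns a pointer to
 * the entry if present, else to the first entry with a larger index
 * (which may be 'end').
 */
static struct msr_ent_sketch *locate_entry_sketch(struct msr_ent_sketch *start,
                                                  struct msr_ent_sketch *end,
                                                  uint32_t msr)
{
    while ( start < end )
    {
        struct msr_ent_sketch *mid = start + (end - start) / 2;

        if ( msr == mid->index )
            return mid;

        if ( msr < mid->index )
            end = mid;
        else
            start = mid + 1;
    }

    return start;
}
```

A find operation checks whether the returned pointer is inside the range and has a matching index; an add operation can insert at the returned position directly.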
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 4d94828cf11104256dccea1fa7762f00575dfaa0)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Internal cleanup for MSR load/save infrastructure
* Use an arch_vmx_struct local variable to reduce later code volume.
* Use start/total instead of msr_area/msr_count. This is in preparation for
more finegrained handling with later changes.
* Use ent/end pointers (again for preparation), and to make the vmx_add_msr()
logic easier to follow.
* Make the memory allocation block of vmx_add_msr() unlikely, and calculate
virt_to_maddr() just once.
No practical change to functionality.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 94fda356fcdcc847662a4c9f6cc63511f25c1247)
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: API improvements for MSR load/save infrastructure
Collect together related infrastructure in vmcs.h, rather than having it
spread out. Turn vmx_{read,write}_guest_msr() into static inlines, as they
are simple enough.
Replace 'int type' with 'enum vmx_msr_list_type', and use switch statements
internally. Later changes are going to introduce a new type.
Rename the type identifiers for consistency with the other VMX_MSR_*
constants.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit f54b63e8617ada823be43d60467a43c8224b7909)
Andrew Cooper [Mon, 28 May 2018 14:02:34 +0000 (15:02 +0100)]
x86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()
paging_update_paging_modes() and vmx_vlapic_msr_changed() both operate on the
VMCS being constructed. Avoid dropping and re-acquiring the reference
multiple times.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit f30e3cf34042846e391e3f8361fc6a76d181a7ee)
Andrew Cooper [Thu, 24 May 2018 17:20:09 +0000 (17:20 +0000)]
x86/vmx: Fix handling of MSR_DEBUGCTL on VMExit
Currently, whenever the guest writes a nonzero value to MSR_DEBUGCTL, Xen
updates a host MSR load list entry with the current hardware value of
MSR_DEBUGCTL.
On VMExit, hardware automatically resets MSR_DEBUGCTL to 0. Later, when the
guest writes to MSR_DEBUGCTL, the current value in hardware (0) is fed back
into the guest load list. As a practical result, `ler` debugging gets lost on any
PCPU which has ever scheduled an HVM vcpu, and in the common case when `ler`
debugging isn't active, guest actions result in an unnecessary load list entry
repeating the MSR_DEBUGCTL reset.
Restoration of Xen's debugging setting needs to happen from the very first
vmexit. Due to the automatic reset, Xen need take no action in the general
case, and only needs to load a value when debugging is active.
This could be fixed by using a host MSR load list entry set up during
construct_vmcs(). However, a more efficient option is to use an alternative
block in the VMExit path, keyed on whether hypervisor debugging has been
enabled.
In order to set this up, drop the per cpu ler_msr variable (as there is no
point having it per cpu when it will be the same everywhere), and use a single
read_mostly variable instead. Split calc_ler_msr() out of percpu_traps_init()
for clarity.
Finally, clean up do_debug(). Reinstate LBR early to help catch cascade
errors, which allows for the removal of the out label.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 730dc8d2c9e1b6402e66973cf99a7c56bc78be4c)
Andrew Cooper [Thu, 9 Aug 2018 16:22:17 +0000 (17:22 +0100)]
x86/spec-ctrl: Yet more fixes for xpti= parsing
As it currently stands, 'xpti=dom0' is indistinguishable from the default
value, which means it will be overridden by ARCH_CAPABILITIES_RDCL_NO on fixed
hardware.
Switch opt_xpti to use -1 as a default like all our other related options, and
clobber it as soon as we have a string to parse.
In addition, 'xpti' alone should be interpreted in its positive boolean form,
rather than resulting in a parse error.
(XEN) parameter "xpti" has invalid value "", rc=-22!
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a3b34ec47817048ab59586855cf0709fc77487e)
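A minimal sketch of the "-1 means default" pattern described in the commit above. The function names, the flag values and the single-token parsing are all simplifying assumptions; the real parser in Xen accepts more forms than shown here.

```c
#include <stdbool.h>
#include <string.h>

#define OPT_XPTI_DOM0 0x01   /* hypothetical flag values */
#define OPT_XPTI_DOMU 0x02

static int opt_xpti = -1;    /* -1: nothing given on the command line */

/* Hypothetical parser for a single "xpti=" token. */
static int parse_xpti_sketch(const char *s)
{
    opt_xpti = 0;            /* clobber the default once a string exists */

    if ( *s == '\0' || !strcmp(s, "true") )
        opt_xpti = OPT_XPTI_DOM0 | OPT_XPTI_DOMU;  /* bare "xpti" is positive */
    else if ( !strcmp(s, "false") )
        opt_xpti = 0;
    else if ( !strcmp(s, "dom0") )
        opt_xpti = OPT_XPTI_DOM0;
    else if ( !strcmp(s, "domu") )
        opt_xpti = OPT_XPTI_DOMU;
    else
        return -22;          /* -EINVAL */

    return 0;
}

/* Apply the hardware-dependent default only if the user gave no choice. */
static int resolve_xpti_sketch(bool rdcl_no)
{
    if ( opt_xpti < 0 )
        opt_xpti = rdcl_no ? 0 : (OPT_XPTI_DOM0 | OPT_XPTI_DOMU);

    return opt_xpti;
}
```

With this shape, an explicit `xpti=dom0` survives on RDCL_NO hardware because the value is no longer -1 by the time the default is considered.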
Andrew Cooper [Mon, 30 Jul 2018 10:10:58 +0000 (12:10 +0200)]
x86/spec-ctrl: Fix the parsing of xpti= on fixed Intel hardware
The calls to xpti_init_default() in parse_xpti() are buggy. The CPUID data
hasn't been fetched that early, and boot_cpu_has(X86_FEATURE_ARCH_CAPS) will
always evaluate false.
As a result, the default case won't disable XPTI on Intel hardware which
advertises ARCH_CAPABILITIES_RDCL_NO.
Simplify parse_xpti() to solely the setting of opt_xpti according to the
passed string, and have init_speculation_mitigations() call
xpti_init_default() if appropriate. Drop the force parameter, and pass caps
instead, to avoid redundant re-reading of MSR_ARCH_CAPS.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: be5e2ff6f54e0245331ed360b8786760f82fd673
master date: 2018-07-24 11:25:54 +0100
Andrew Cooper [Mon, 30 Jul 2018 10:10:18 +0000 (12:10 +0200)]
x86/hvm: Disallow unknown MSR_EFER bits
It turns out that nothing ever prevented HVM guests from trying to set unknown
EFER bits. Generally, this results in a vmentry failure.
For Intel hardware, all implemented bits are covered by the checks.
For AMD hardware, the only EFER bit which isn't covered by the checks is TCE
(which AFAICT is specific to AMD Fam15/16 hardware). We never advertise TCE
in CPUID, but it isn't a security problem to have TCE unexpectedly enabled in
guest context.
Disallow the setting of bits outside of the EFER_KNOWN_MASK, which prevents
any vmentry failures for guests, yielding #GP instead.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: ef0269c6215d642a709866f04ba1a1f9f13f3614
master date: 2018-07-24 11:25:53 +0100
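The known-mask check from the commit above can be illustrated like this. The bit positions are the architectural ones; the exact composition of `EFER_KNOWN_MASK` and the helper name `efer_write_faults()` are assumptions made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFER bit positions. */
#define EFER_SCE   (1ULL << 0)   /* SYSCALL enable */
#define EFER_LME   (1ULL << 8)   /* long mode enable */
#define EFER_LMA   (1ULL << 10)  /* long mode active */
#define EFER_NX    (1ULL << 11)  /* no-execute enable */
#define EFER_SVME  (1ULL << 12)  /* AMD: SVM enable */
#define EFER_LMSLE (1ULL << 13)  /* AMD: long mode segment limit enable */
#define EFER_FFXSE (1ULL << 14)  /* AMD: fast FXSAVE/FXRSTOR */
#define EFER_TCE   (1ULL << 15)  /* AMD: translation cache extension */

/* Assumed contents of the mask; TCE is deliberately left out. */
#define EFER_KNOWN_MASK (EFER_SCE | EFER_LME | EFER_LMA | EFER_NX | \
                         EFER_SVME | EFER_LMSLE | EFER_FFXSE)

/* Hypothetical helper: true if a guest write of val should raise #GP. */
static bool efer_write_faults(uint64_t val)
{
    return (val & ~EFER_KNOWN_MASK) != 0;
}
```

Rejecting the write up front turns what would have been a vmentry failure into an ordinary #GP that the guest can handle.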
Andrew Cooper [Mon, 30 Jul 2018 10:09:46 +0000 (12:09 +0200)]
x86/xstate: Make errors in xstate calculations more obvious by crashing the domain
If xcr0_max exceeds xfeature_mask, then something is broken with the CPUID
policy derivation or auditing logic. If hardware rejects new_bv, then
something is broken with Xen's xstate logic.
In both cases, crash the domain with an obvious error message, to help
highlight the issues.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6371ccb93012db4ad6615fe666205b86308cb4e
master date: 2018-07-19 19:57:26 +0100
Andrew Cooper [Mon, 30 Jul 2018 10:09:18 +0000 (12:09 +0200)]
x86/xstate: Use a guest's CPUID policy, rather than allowing all features
It turns out that Xen has never enforced that a domain remain within the
xstate features advertised in CPUID.
The check of new_bv against xfeature_mask ensures that a domain stays within
the set of features that Xen has enabled in hardware (and therefore isn't a
security problem), but this does mean that attempts to level a guest for
migration safety might not be effective if the guest ignores CPUID.
Check the CPUID policy in validate_xstate() (for incoming migration) and in
handle_xsetbv() (for guest XSETBV instructions). This subsumes the PKRU check
for PV guests in handle_xsetbv() (and also demonstrates that I should have
spotted this problem while reviewing c/s fbf9971241f).
For migration, this is correct despite the current (mis)ordering of data
because d->arch.cpuid is the applicable max policy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 361b835fa00d9f45167c50a60e054ccf22c065d7
master date: 2018-07-19 19:57:26 +0100
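The two-level check described in the commit above (hardware mask first, then the guest's CPUID policy) might look roughly like this. `xcr0_valid()` and its parameters are hypothetical, and only two of the architectural XCR0 consistency rules are included.

```c
#include <stdbool.h>
#include <stdint.h>

#define XSTATE_FP  (1ULL << 0)   /* x87 state, must always be set in XCR0 */
#define XSTATE_SSE (1ULL << 1)
#define XSTATE_YMM (1ULL << 2)   /* AVX state, requires SSE */

/*
 * Hypothetical validation of a proposed XCR0 value:
 *  - xfeature_mask: features Xen has enabled in hardware
 *  - policy_max:    features the guest's CPUID policy advertises
 */
static bool xcr0_valid(uint64_t new_bv, uint64_t xfeature_mask,
                       uint64_t policy_max)
{
    if ( !(new_bv & XSTATE_FP) )
        return false;
    if ( (new_bv & XSTATE_YMM) && !(new_bv & XSTATE_SSE) )
        return false;
    if ( new_bv & ~xfeature_mask )   /* beyond what hardware has enabled */
        return false;
    if ( new_bv & ~policy_max )      /* beyond what the guest was told */
        return false;

    return true;
}
```

The last check is the new one: a guest levelled down to x87+SSE can no longer enable AVX state just because the host supports it.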
Andrew Cooper [Mon, 30 Jul 2018 10:08:43 +0000 (12:08 +0200)]
x86/vmx: Don't clobber %dr6 while debugging state is lazy
c/s 4f36452b63 introduced a write to %dr6 in the #DB intercept case, but the
guest's debug registers may be lazy at this point, in which case the guest's
later attempt to read %dr6 will discard this value and use the older stale
value.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3cdac2805692c7accde2f405d81cc0be799aee48
master date: 2018-07-19 14:06:48 +0100
Jan Beulich [Mon, 30 Jul 2018 10:08:14 +0000 (12:08 +0200)]
x86: command line option to avoid use of secondary hyper-threads
Shared resources (L1 cache and TLB in particular) present a risk of
information leak via side channels. Provide a means to avoid use of
hyperthreads in such cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d8f974f1a646c0200b97ebcabb808324b288fadb
master date: 2018-07-19 13:43:33 +0100
Jan Beulich [Mon, 30 Jul 2018 10:07:43 +0000 (12:07 +0200)]
x86: possibly bring up all CPUs even if not all are supposed to be used
Reportedly Intel CPUs which can't broadcast #MC to all targeted
cores/threads because some have CR4.MCE clear will shut down. Therefore
we want to keep CR4.MCE enabled when offlining a CPU, and we need to
bring up all CPUs in order to be able to set CR4.MCE in the first place.
The use of clear_in_cr4() in cpu_mcheck_disable() was ill advised
anyway, and to avoid future similar mistakes I'm removing clear_in_cr4()
altogether right here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 8797d20a6ec2dd75195585a107ce345c51c0a59a
master date: 2018-07-19 13:43:33 +0100
Jan Beulich [Mon, 30 Jul 2018 10:07:09 +0000 (12:07 +0200)]
x86: distinguish CPU offlining from CPU removal
In order to be able to service #MC on offlined CPUs, the GDT, IDT,
stack, and per-CPU data (which includes the TSS) need to be kept
allocated. They should only be freed upon CPU removal (which we
currently don't support, so some code is becoming effectively dead for
the moment).
Note that for now park_offline_cpus doesn't get set to true anywhere -
this is going to be the subject of a subsequent patch.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2e6c8f182c9c50129b1c7a620242861e6ad6a9fb
master date: 2018-07-19 13:43:33 +0100
Jan Beulich [Mon, 30 Jul 2018 10:06:39 +0000 (12:06 +0200)]
x86/AMD: distinguish compute units from hyper-threads
Fam17 replaces CUs by HTs, which we should reflect accordingly, even if
the difference is not very big. The most relevant change (requiring some
code restructuring) is that the topoext feature no longer means there is
a valid CU ID.
Take the opportunity and convert wrongly plain int variables in
set_cpu_sibling_map() to unsigned int.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9429b07a0af7f92a5f25e4068e11db881e157495
master date: 2018-07-19 09:42:42 +0200
Jan Beulich [Mon, 30 Jul 2018 10:06:08 +0000 (12:06 +0200)]
cpupools: fix state when downing a CPU failed
While I've run into the issue with further patches in place which no
longer guarantee the per-CPU area to start out as all zeros, the
CPU_DOWN_FAILED processing looks to have the same issue: By not zapping
the per-CPU cpupool pointer, cpupool_cpu_add()'s (indirect) invocation
of schedule_cpu_switch() will trigger the "c != old_pool" assertion
there.
Clearing the field during CPU_DOWN_PREPARE is too early (afaict this
should not happen before cpu_disable_scheduler()). Clearing it in
CPU_DEAD and CPU_DOWN_FAILED would be an option, but would take the same
piece of code twice. Since the field's value shouldn't matter while the
CPU is offline, simply clear it (implicitly) for CPU_ONLINE and
CPU_DOWN_FAILED, but only for other than the suspend/resume case (which
gets specially handled in cpupool_cpu_remove()).
By adjusting the conditional in cpupool_cpu_add() CPU_DOWN_FAILED
handling in the suspend case should now also be handled better.
Jan Beulich [Mon, 30 Jul 2018 10:05:36 +0000 (12:05 +0200)]
allow cpu_down() to be called earlier
The function's use of the stop-machine logic has so far prevented its
use ahead of the processing of the "ordinary" initcalls. Since at this
early time we're in a controlled environment anyway, there's no need for
such a heavy tool. Additionally this ought to have less of a performance
impact especially on large systems, compared to the alternative of
making stop-machine functionality available earlier.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5894c0a2da66243a89088d309c7e1ea212ab28d6
master date: 2018-07-16 15:15:12 +0200
Ian Jackson [Mon, 30 Jul 2018 10:05:00 +0000 (12:05 +0200)]
xen: oprofile/nmi_int.c: Drop unwanted sexual reference
This is not really very nice.
This line doesn't have much value in itself. The rest of this comment
block is pretty clear what it wants to convey. So delete it.
(While we are here, adopt the CODING_STYLE-mandated formatting.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Lars Kurth <lars.kurth.xen@gmail.com> Acked-by: George Dunlap <dunlapg@umich.edu> Acked-by: Jan Beulich <JBeulich@suse.com>
master commit: 41cb2db62627a7438d938aae487550c3f4acb1da
master date: 2018-07-12 16:38:30 +0100
Jan Beulich [Mon, 30 Jul 2018 10:04:28 +0000 (12:04 +0200)]
x86/spec-ctrl: command line handling adjustments
For one, "no-xen" should not imply "no-eager-fpu", as "eager FPU" mode
is to guard guests, not Xen itself, which is also expressed so by
print_details().
And then opt_ssbd, despite being off by default, should also be cleared
by the "no" and "no-xen" sub-options.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ac3f9a72141a48d40fabfff561d5a7dc0e1b810d
master date: 2018-07-10 12:22:31 +0200
Jan Beulich [Mon, 30 Jul 2018 10:03:28 +0000 (12:03 +0200)]
x86: correctly set nonlazy_xstate_used when loading full state
In this case, just like xcr0_accum, nonlazy_xstate_used should always be
set to the intended new value, rather than possibly leaving the flag set
from a prior state load.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f46bf0e101ca63118b9db2616e8f51e972d7f563
master date: 2018-07-09 10:51:02 +0200
Jan Beulich [Thu, 28 Jun 2018 09:23:33 +0000 (11:23 +0200)]
x86/HVM: don't cause #NM to be raised in Xen
The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.
{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.
With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 488efc29e4e996bb3805c982200f65061390cdce
master date: 2018-06-28 09:07:06 +0200
Ian Jackson [Thu, 28 Jun 2018 09:23:11 +0000 (11:23 +0200)]
libxl: restore passing "readonly=" to qemu for SCSI disks
A read-only check was introduced for XSA-142, commit ef6cb76026 ("libxl:
relax readonly check introduced by XSA-142 fix") added the passing of
the extra setting, but commit dab0539568 ("Introduce COLO mode and
refactor relevant function") dropped the passing of the setting again,
quite likely due to improper re-basing.
Restore the readonly= parameter to SCSI disks. For IDE disks this is
supposed to be rejected; add an assert. And there is a bare ad-hoc
disk drive string in libxl__build_device_model_args_new, which we also
update.
This is XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: dd64d3c41a2d15139c3a35d22d4cb6b78f4c5c59
master date: 2018-06-28 09:05:06 +0200
Ian Jackson [Thu, 28 Jun 2018 09:22:55 +0000 (11:22 +0200)]
libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config
The generated configurations are identical apart from, in some cases,
reordering of the id=%s element. So, overall, no functional change.
This is part of XSA-266.
Reported-by: Andrew Reimers <andrew.reimers@orionvm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
master commit: 724e5aa31b58d1e430ad36b484cf0ec021497399
master date: 2018-06-28 09:04:55 +0200
Andrew Cooper [Thu, 28 Jun 2018 09:22:30 +0000 (11:22 +0200)]
x86: Refine checks in #DB handler for faulting conditions
One of the fixes for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.
While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to. Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware). A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.
Instruction breakpoints are more tricky. The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.
Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context. However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.
This is XSA-265.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 17bf51297220dcd74da29de99320b6b1c72d1fa5
master date: 2018-06-28 09:04:20 +0200
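The "breakpoint enabled" check mentioned in the commit above can be sketched as below. The %dr6/%dr7 bit layout is architectural; the helper name and the decision to return a simple boolean are assumptions, not the actual do_debug() logic.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if %dr6 reports a hit on a breakpoint which
 * %dr7 has enabled (L<n> or G<n> set) and configured as an instruction
 * breakpoint (R/W<n> == 00).  %dr6.B<n> on its own is not trusted, as
 * it may be set for breakpoints that are not enabled.
 */
static bool enabled_insn_bp_hit(uint64_t dr6, uint64_t dr7)
{
    unsigned int i;

    for ( i = 0; i < 4; i++ )
    {
        bool hit     = dr6 & (1ULL << i);            /* B0..B3 */
        bool enabled = dr7 & (3ULL << (i * 2));      /* L<n> | G<n> */
        bool is_insn = ((dr7 >> (16 + i * 4)) & 3) == 0;

        if ( hit && enabled && is_insn )
            return true;
    }

    return false;
}
```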
Jan Beulich [Thu, 28 Jun 2018 09:20:24 +0000 (11:20 +0200)]
x86/EFI: further correct FPU state handling around runtime calls
We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.
Rename the function at the same time to better reflect its purpose, as
the patch touches all of its occurrences anyway.
The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 23839a0fa0bbe78c174cd2bb49083e153f0f99df
master date: 2018-06-26 15:23:08 +0200
Jan Beulich [Thu, 28 Jun 2018 09:19:43 +0000 (11:19 +0200)]
x86/EFI: fix FPU state handling around runtime calls
There are two issues. First, the nonlazy xstates were never restored
after returning from the runtime call.
Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.
Fix both of these issues by calling vcpu_restore_fpu_eager(). As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.
Introduce a "curr" local variable and replace other uses of "current"
at the same time.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Juergen Gross <jgross@suse.com>
master commit: 437211cb696515ee5bd5dae0ab72866c9f382a33
master date: 2018-06-21 11:35:46 +0200
Jan Beulich [Thu, 28 Jun 2018 09:18:50 +0000 (11:18 +0200)]
x86: correct default_xen_spec_ctrl calculation
Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d6239f64713df819278bf048446d3187c6ac4734
master date: 2018-05-29 12:38:52 +0200
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU
Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.
On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 243435bf67e8159495194f623b9e4d8c90140384)
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86: Support fully eager FPU context switching
This is controlled on a per-vcpu basis for flexibility.
This is part of XSA-267 / CVE-2018-3665
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 146dfe9277c2b4a8c399b229e00d819065e3167b)
Jan Beulich [Wed, 30 May 2018 06:35:36 +0000 (08:35 +0200)]
x86: re-enable XPTI/PCID as needed in switch_native()
Additionally avoid accessing d->arch.pv_domain for PVH domains (running
in a HVM container).
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.
We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:
- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)
We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.
As the non-XPTI case has been shown not to perform better with PCID, at least
on some machines, the default is to use PCID only for domains subject to
XPTI.
With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
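To illustrate the four PCID values listed above, here is a sketch of how a %cr3 value could be composed. The PCID field (bits 11:0) and the no-flush bit (bit 63) are architectural; the enumerator names, their numbering and `make_cr3()` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR3_NOFLUSH   (1ULL << 63)  /* keep TLB entries for this PCID */
#define X86_CR3_PCID_MASK 0xfffULL      /* PCID lives in bits 11:0 */

/* Hypothetical numbering of the four cases from the commit message. */
enum pv_pcid {
    PCID_GUEST_KERN     = 0,  /* guest active, kernel mode */
    PCID_GUEST_USER     = 1,  /* guest active, user mode */
    PCID_XEN_GUEST_USER = 2,  /* hypervisor active, guest in user mode */
    PCID_XEN_GUEST_KERN = 3,  /* hypervisor active, guest in kernel mode */
};

/* Combine a page-aligned root table address with a PCID. */
static uint64_t make_cr3(uint64_t root_maddr, enum pv_pcid pcid, bool noflush)
{
    uint64_t cr3 = (root_maddr & ~X86_CR3_PCID_MASK) | (uint64_t)pcid;

    if ( noflush )
        cr3 |= X86_CR3_NOFLUSH;

    return cr3;
}
```

Because each context gets its own tag, switching %cr3 with the no-flush bit set leaves the other contexts' TLB entries in place instead of discarding everything.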
xen/x86: use flag byte for decision whether xen_cr3 is valid
Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.
Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.
This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/x86: disable global pages for domains with XPTI active
Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.
In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).
This requires using switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows modifying the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.
Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 29 May 2018 08:15:33 +0000 (10:15 +0200)]
x86/xpti: avoid copying L4 page table contents when possible
For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.
Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.
So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.
A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:
- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
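The copy-avoidance idea in the commit above boils down to a dirty flag. A minimal single-CPU sketch, with hypothetical names throughout and without the cross-CPU synchronization flag the commit also adds:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define L4_ENTRIES 512

/* Hypothetical per-CPU state. */
struct cpu_sketch {
    uint64_t root_pgt[L4_ENTRIES];  /* CPU-local copy of the guest L4 */
    bool root_pgt_changed;          /* set when a re-copy is needed */
    unsigned int copies;            /* bookkeeping for this example only */
};

/* Called when %cr3 is loaded or the guest L4 table is modified. */
static void mark_l4_changed(struct cpu_sketch *c)
{
    c->root_pgt_changed = true;
}

/* Called on each entry to the guest: copy only if something changed. */
static void enter_guest(struct cpu_sketch *c, const uint64_t *guest_l4)
{
    if ( !c->root_pgt_changed )
        return;

    memcpy(c->root_pgt, guest_l4, sizeof(c->root_pgt));
    c->root_pgt_changed = false;
    c->copies++;
}
```

Interrupts and hypercalls that never touch the L4 table leave the flag clear, so the 4 KiB copy is skipped on the way back to the guest.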
Jan Beulich [Tue, 23 Jan 2018 09:43:39 +0000 (10:43 +0100)]
x86: move invocations of hvm_flush_guest_tlbs()
Their need is not tied to the actual flushing of TLBs, but the ticking
of the TLB clock. Make this more obvious by folding the two invocations
into a single one in pre_flush().
Also defer the latching of CR4 in write_cr3() until after pre_flush()
(and hence implicitly until after IRQs are off), making operation
sequence the same in both cases (eliminating the theoretical risk of
pre_flush() altering CR4). This then also improves register allocation,
as the compiler doesn't need to use a callee-saved register for "cr4"
anymore.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 29 May 2018 08:11:53 +0000 (10:11 +0200)]
x86/XPTI: fix S3 resume (and CPU offlining in general)
We should index an L1 table with an L1 index.
Reported-by: Simon Gaiser <simon@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6b9562dac1746014ab376bd2cf8ba400acf34c6d
master date: 2018-05-28 11:20:26 +0200
Andrew Cooper [Tue, 29 May 2018 08:10:23 +0000 (10:10 +0200)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:
* A feature bit to indicate that Speculative Store Bypass Disable is
supported.
* A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
in the pipeline.
* A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
indicating that the hardware is not susceptible to Speculative Store Bypass
sidechannels.
For contemporary processors, this interface will be implemented via a
microcode update.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9df52a25e0e95a0b9971aa2fc26c5c6a5cbdf4ef
master date: 2018-05-21 14:20:06 +0100
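The three pieces of the interface described in the commit above fit together roughly as follows. The bit positions (SSBD as bit 2 of MSR_SPEC_CTRL, SSB_NO as bit 4 of MSR_ARCH_CAPABILITIES) are the published ones; the helper and its policy are a hypothetical sketch, not Xen's actual decision logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPEC_CTRL_SSBD   (1u << 2)    /* MSR_SPEC_CTRL: disable store bypass */
#define ARCH_CAPS_SSB_NO (1ULL << 4)  /* MSR_ARCH_CAPABILITIES: not affected */

/*
 * Hypothetical policy: set SSBD only when it was requested, the CPU
 * enumerates the feature, and the hardware does not claim immunity.
 */
static uint32_t apply_ssbd(uint32_t spec_ctrl, bool cpu_has_ssbd,
                           uint64_t arch_caps, bool want_ssbd)
{
    if ( want_ssbd && cpu_has_ssbd && !(arch_caps & ARCH_CAPS_SSB_NO) )
        spec_ctrl |= SPEC_CTRL_SSBD;

    return spec_ctrl;
}
```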
Andrew Cooper [Tue, 29 May 2018 08:09:57 +0000 (10:09 +0200)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.
Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.
For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c0e338086f060eba31d37b83fbdb883928aa085
master date: 2018-05-21 14:20:06 +0100
Andrew Cooper [Tue, 29 May 2018 08:08:43 +0000 (10:08 +0200)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.
One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains. In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit. To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.
For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings. See the xen-command-line.markdown for
full details of the new options.
While changing the command line options, take the time to change how the data
is reported to the user. The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.
Sample output from booting with `spec-ctrl=no-pv` looks like:
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3352afc26c497d26ecb70527db3cb29daf7b1422
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:08:21 +0000 (10:08 +0200)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features
If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised. It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.
Another useful option in some usecases is to offer IBPB without IBRS. When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS. This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.
Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb06b308ec71b23f37a44f5e2351fe2cae0306e9
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:07:58 +0000 (10:07 +0200)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value. Explicitly set Xen's default during the BSP
and AP boot paths.
For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so. Oracle report that this
speeds up boots of some hardware by 50s.
"when safe to do so" is based on whether we are virtualised. A native boot
won't have any other code running in a position to mount an attack.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb8c12020307b39a89273d7699e89000451987ab
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:07:22 +0000 (10:07 +0200)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa9eb09d446a1279f5e861e6b84fa8675dabf148
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:06:53 +0000 (10:06 +0200)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
Requested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94df6e8588e35cc2028ccb3fd2921c6e6360605e
master date: 2018-05-16 12:19:10 +0100
Andrew Cooper [Tue, 29 May 2018 08:06:19 +0000 (10:06 +0200)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.
Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d9822b8a38114e96e4516dc998f4055249364d5d
master date: 2018-05-16 12:19:10 +0100