xenbits.xensource.com Git - xen.git/log
xen.git
5 years ago hvmloader: avoid tests when they would clobber used memory stable-4.6 staging-4.6
Jan Beulich [Fri, 21 Jun 2019 09:58:24 +0000 (11:58 +0200)]
hvmloader: avoid tests when they would clobber used memory

First of all limit the memory range used for testing to 4Mb: There's no
point placing page tables right above 8Mb when they can equally well
live at the bottom of the chunk at 4Mb - rep_io_test() cares about the
5Mb...7Mb range only anyway. In a subsequent patch this will then also
allow simply looking for an unused 4Mb range (instead of using a build
time determined one).

Extend the "skip tests" condition beyond the "is there enough memory"
question.
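
A minimal sketch of the extended "skip tests" condition, assuming the 4Mb test chunk spans 4Mb...8Mb and that a single in-use range is known to the caller. `should_skip_tests()` and its parameters are hypothetical names for illustration, not hvmloader's actual code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed test chunk: 4Mb...8Mb (page tables at the bottom, I/O test above). */
#define TEST_START (4ULL << 20)
#define TEST_END   (8ULL << 20)

/*
 * Hypothetical helper: skip the tests if there is not enough memory, or if
 * the in-use range [used_start, used_end) overlaps the test chunk.
 */
bool should_skip_tests(uint64_t mem_end, uint64_t used_start, uint64_t used_end)
{
    if (mem_end < TEST_END)
        return true;                       /* not enough memory */

    /* Half-open interval overlap check against the test chunk. */
    return used_start < TEST_END && used_end > TEST_START;
}
```

An empty used range (start equal to end) never overlaps, so only the memory-size check applies in that case.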

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Gary Lin <glin@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d6968635ce51a8ed7508ddcf17b3d13a462cb27
master date: 2017-05-19 16:04:38 +0200

5 years ago libxl: compilation warning fix for arm & aarch64
Chris Patterson [Wed, 27 Jul 2016 20:01:26 +0000 (16:01 -0400)]
libxl: compilation warning fix for arm & aarch64

GCC 6 will warn on unused static const variables in C modules:
https://gcc.gnu.org/ml/gcc-patches/2015-09/msg00847.html

When compiling with LIBXL_HAVE_NO_SUSPEND_RESUME set (arm & aarch64),
the compiler emits the following errors:
  xl_cmdimpl.c:101:19: error: 'migrate_report'
      defined but not used [-Werror=unused-const-variable=]
  xl_cmdimpl.c:99:19: error: 'migrate_permission_to_go'
      defined but not used [-Werror=unused-const-variable=]
  xl_cmdimpl.c:97:19: error: 'migrate_receiver_ready'
      defined but not used [-Werror=unused-const-variable=]
  xl_cmdimpl.c:95:19: error: 'migrate_receiver_banner'
      defined but not used [-Werror=unused-const-variable=]

These const variables are only used in functions which sit inside the
ifndef block:
   #ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
   ...
   #endif

Wrap the same ifndef around these variables.
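
A minimal sketch of the pattern, assuming a placeholder string and a hypothetical user function `banner_length()`; the real definitions in xl_cmdimpl.c differ:

```c
#include <string.h>

/* Defined by the build on arm & aarch64; left undefined in this sketch. */
/* #define LIBXL_HAVE_NO_SUSPEND_RESUME 1 */

#ifndef LIBXL_HAVE_NO_SUSPEND_RESUME
/* Placeholder contents: only the guarding matters here. */
static const char migrate_receiver_banner[] = "example banner\n";

/* Hypothetical user of the variable, living inside the same guard. */
int banner_length(void)
{
    return (int)strlen(migrate_receiver_banner);
}
#endif
```

With the variable and its only users under the same guard, defining LIBXL_HAVE_NO_SUSPEND_RESUME removes both, so nothing is left "defined but not used".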

Signed-off-by: Chris Patterson <pattersonc@ainfosec.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 5f28de0b0e474e01931b719fa27ca30b8aa446e0)
(cherry picked from commit c51a3a5826a64f0807545460bfc35022dc9c8428)

5 years ago ipxe: update to newer commit
Wei Liu [Mon, 10 Oct 2016 12:50:58 +0000 (13:50 +0100)]
ipxe: update to newer commit

The current commit in tree is rather old. It has come to a point that
cherry-picking commits from upstream isn't trivial anymore.

There is a long-term plan to track ipxe upstream, but for the 4.8 release
we should just update ipxe to a newer commit (upstream is using a rolling
release model now).

Forward-port the one boot prompt patch that is still relevant and retire
the rest which are already in upstream.

Reported-by: Juergen Schinker <ba1020@homie.homelinux.net>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 38ab99b26bf4298a33105ec66f3f6a3f7e05a326)

5 years ago hvmloader: use xen/errno.h rather than the host system's errno.h
Andrew Cooper [Mon, 7 Mar 2016 16:46:03 +0000 (17:46 +0100)]
hvmloader: use xen/errno.h rather than the host system's errno.h

hvmloader is unhosted, and shouldn't use the system errno.h.  It already has
to use Xen's errno.h for other hypercalls.  The use of public/io/xs_wire.h
requires the use of un-prefixed errno values.

This fixes the build on stricter toolchains where requesting -fno-builtin does
reduce the include path as much as it can.

Reported-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 305e957ffee94fc06c4ba53ef5562f1b8c1c6b02)

5 years ago public/errno: Reduce complexity of inclusion
Andrew Cooper [Mon, 7 Mar 2016 16:45:13 +0000 (17:45 +0100)]
public/errno: Reduce complexity of inclusion

The inclusion rules for errno.h were unnecessarily complicated, and
required the includer to jump through hoops if they wished to avoid getting
multiple namespaces worth of constants.

Simplify the logic, and document what is going on.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 67790205df26e7c3dfeef8b8e64194ebc279220d)

5 years ago errno: declare aliases using XEN_ERRNO()
Andrew Cooper [Thu, 3 Mar 2016 08:50:11 +0000 (09:50 +0100)]
errno: declare aliases using XEN_ERRNO()

Otherwise a custom XEN_ERRNO definition will not end up creating appropriately
namespaced constants for the aliases.
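
A minimal sketch of why this matters, assuming a hypothetical list macro `ERRNO_LIST` and a hypothetical `MY_` prefix; the real header is laid out differently. Because the alias goes through the same macro as every other entry, a custom definition namespaces it too:

```c
/*
 * Hypothetical errno list. The alias EWOULDBLOCK is declared through the
 * same XEN_ERRNO() hook as the other entries instead of a plain #define.
 */
#define ERRNO_LIST(XEN_ERRNO)      \
    XEN_ERRNO(EPERM,        1)     \
    XEN_ERRNO(EAGAIN,      11)     \
    XEN_ERRNO(EWOULDBLOCK, 11)

/* A custom definition that places every constant in its own namespace. */
#define AS_PREFIXED_ENUM(name, value) MY_##name = (value),

enum my_errno {
    ERRNO_LIST(AS_PREFIXED_ENUM)
};
```

Had the alias been written as a plain `#define` of one name to another, the custom definition would never see it and no `MY_EWOULDBLOCK` would exist.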

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 03720ea541382a3ca80eaaec2aa11932b03aacaf)

5 years ago errno: introduce EISDIR/EROFS/ENOTEMPTY to the ABI
Andrew Cooper [Thu, 3 Mar 2016 07:56:48 +0000 (08:56 +0100)]
errno: introduce EISDIR/EROFS/ENOTEMPTY to the ABI

These POSIX errnos are expected by other areas of the Xen public interface,
specifically public/io/xs_wire.h

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 7825ae12df1f6d48c4d009cbbdf5a55aff27291b)

5 years ago stubdom: fix stubdom-vtpm build
Juergen Gross [Mon, 31 Oct 2016 09:04:18 +0000 (10:04 +0100)]
stubdom: fix stubdom-vtpm build

stubdom-vtpm needs gmp and expects it under
stubdom/cross-root-x86_64/x86_64-xen-elf/lib while gmp seems to install
it under stubdom/cross-root-x86_64/x86_64-xen-elf/lib64 at least in an
openSUSE environment.

Modify gmp configure parameters to explicitly specify --libdir.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7791790c7ab97c85306dce749c6c8eb56d1dc0da)

5 years ago stubdom: make GMP aware that it's being cross-compiled
Wei Liu [Sat, 29 Oct 2016 17:22:38 +0000 (18:22 +0100)]
stubdom: make GMP aware that it's being cross-compiled

Append --build and --host flags to GMP's configure script so that it
knows it is being cross-compiled.

This should fix the issue that GMP doesn't compile with gcc 6, because
the configure script won't try to test the host environment anymore.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 668e4edf92fcf7cb929eed221059a3eeb02722c3)

5 years ago libxl: replace deprecated readdir_r() with readdir()
Chris Patterson [Fri, 3 Jun 2016 16:50:09 +0000 (12:50 -0400)]
libxl: replace deprecated readdir_r() with readdir()

Replace the usage of readdir_r() with readdir() to address a
compilation error under glibc due to the deprecation of readdir_r
for their next release (2.24) [1, 2].

Remove code specific to usage of readdir_r which is no longer required,
such as zalloc_dirent().

--

From the GNU libc manual [3]:
"
 It is expected that future versions of POSIX will obsolete readdir_r and
 mandate the level of thread safety for readdir which is provided by the
 GNU C Library and other implementations today.
"

There is a filed bug in the Austin Group Defect Tracker [4]  in which 'dalias'
proposes (in comment 0001632) that:
"
   I would like to propose an alternate solution. For readdir, replace the text:
    "The readdir() function need not be thread-safe."
   with:
    "If multiple threads call the readdir() function with the same directory
    stream argument and without synchronization to preclude simultaneous
    access, then the behavior is undefined."

   With this change, the clunky readdir_r function is no longer needed or
   useful, and should probably be deprecated. As the only reasonable way
   to meet the implementation requirements for readdir is to have the dirent
   buffer in the DIR structure, this change should not require any change to
   existing implementations.
"

[1] https://sourceware.org/ml/libc-alpha/2016-02/msg00093.html
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=19056
[3] https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
[4] http://austingroupbugs.net/view.php?id=696

Signed-off-by: Chris Patterson <pattersonc@ainfosec.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b9daff9d811285f1e40669bc621c2241793f7a95)

5 years ago libfsimage: replace deprecated readdir_r() with readdir()
Chris Patterson [Fri, 3 Jun 2016 16:50:10 +0000 (12:50 -0400)]
libfsimage: replace deprecated readdir_r() with readdir()

Replace the usage of readdir_r() with readdir() to address a
compilation error under glibc due to the deprecation of readdir_r
for their next release (2.24) [1, 2].

Add new error checking on readdir(), and fail if error occurs.
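
A minimal sketch of a readdir() loop with the kind of error checking described, using a hypothetical helper `count_dir_entries()` rather than the libfsimage code itself. readdir() returns NULL both at end of directory and on error, so errno is cleared before each call and inspected afterwards:

```c
#include <dirent.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical helper: count the entries in a directory, -1 on error. */
int count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);
    int count = 0;

    if (dir == NULL)
        return -1;

    for (;;) {
        struct dirent *ent;

        errno = 0;                 /* distinguish end-of-directory from error */
        ent = readdir(dir);
        if (ent == NULL)
            break;
        count++;
    }

    if (errno != 0) {              /* readdir() failed: report it */
        int saved = errno;

        closedir(dir);
        errno = saved;
        return -1;
    }

    closedir(dir);
    return count;
}
```

The count includes the "." and ".." entries on systems that report them.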

--

From the GNU libc manual [3]:
"
 It is expected that future versions of POSIX will obsolete readdir_r and
 mandate the level of thread safety for readdir which is provided by the
 GNU C Library and other implementations today.
"

There is a filed bug in the Austin Group Defect Tracker [4]  in which 'dalias'
proposes (in comment 0001632) that:
"
   I would like to propose an alternate solution. For readdir, replace the text:
    "The readdir() function need not be thread-safe."
   with:
    "If multiple threads call the readdir() function with the same directory
    stream argument and without synchronization to preclude simultaneous
    access, then the behavior is undefined."

   With this change, the clunky readdir_r function is no longer needed or
   useful, and should probably be deprecated. As the only reasonable way
   to meet the implementation requirements for readdir is to have the dirent
   buffer in the DIR structure, this change should not require any change to
   existing implementations.
"

[1] https://sourceware.org/ml/libc-alpha/2016-02/msg00093.html
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=19056
[3] https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
[4] http://austingroupbugs.net/view.php?id=696

Signed-off-by: Chris Patterson <pattersonc@ainfosec.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c2a17869d5dcd845d646bf4db122cad73596a2be)

5 years ago x86emul/test: don't use *_len symbols
Jan Beulich [Tue, 24 Jan 2017 16:22:03 +0000 (16:22 +0000)]
x86emul/test: don't use *_len symbols

... as they don't work as intended with -fPIC.

I did prefer them over *_end ones at the time because older gcc would
cause .L* symbols to be public, due to issuing .globl for all
referenced externals. And labels at the end of instructions collide
with the ones at the start of the next instruction, making disassembly
harder to grok. Luckily recent gcc no longer issues those .globl
directives, and hence .L* labels, staying local by default, no longer
get in the way.

Reported-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Tested-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 9315fa0ef736d1153c98ce42bff5853da5ec697f)

6 years ago Fix misleading indentation warnings
Cédric Bosdonnat [Thu, 10 Nov 2016 09:23:31 +0000 (10:23 +0100)]
Fix misleading indentation warnings

A GCC 6 build reports misleading indentation as warnings. Fix a few
warnings in stubdom.

Signed-off-by: Cédric Bosdonnat <cbosdonnat@suse.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Quan Xu <xuquan8@huawei.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 9fdffbbab3ada427bac07076f042f0265e5ae05f)
(cherry picked from commit 7c8db58d3739c805f4c0f773b65157f306b00c2a)

6 years ago xenalyze: remove cr3_compare_total
Ian Campbell [Fri, 22 Jan 2016 14:27:29 +0000 (14:27 +0000)]
xenalyze: remove cr3_compare_total

gcc-6 complains:
xenalyze.c:4132:9: error: 'cr3_compare_total' defined but not used [-Werror=unused-function]
     int cr3_compare_total(const void *_a, const void *_b) {
         ^~~~~~~~~~~~~~~~~

I believe it is correct.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
(cherry picked from commit 87761277d7f079ce278323b45da279f2bd25d31b)

6 years ago xenalyze: fix misleading indentation.
Ian Campbell [Fri, 22 Jan 2016 14:27:28 +0000 (14:27 +0000)]
xenalyze: fix misleading indentation.

gcc-6 adds -Wmisleading-indentation which found these issues.

xenalyze.c: In function 'weighted_percentile':
xenalyze.c:2136:18: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
             L=I; L_weight = I_weight;
                  ^~~~~~~~

xenalyze.c:2135:9: note: ...this 'if' clause, but it is not
         if(J_weight<K_weight)
         ^~

xenalyze.c:2138:18: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
             R=J; R_weight = J_weight;
                  ^~~~~~~~

xenalyze.c:2137:9: note: ...this 'if' clause, but it is not
         if(K_weight<I_weight)
         ^~

xenalyze.c: In function 'self_weighted_percentile':
xenalyze.c:2215:18: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
             L=I; L_weight = I_weight;
                  ^~~~~~~~

xenalyze.c:2214:9: note: ...this 'if' clause, but it is not
         if(J_weight<K_weight)
         ^~

xenalyze.c:2217:18: error: statement is indented as if it were guarded by... [-Werror=misleading-indentation]
             R=J; R_weight = J_weight;
                  ^~~~~~~~

xenalyze.c:2216:9: note: ...this 'if' clause, but it is not
         if(K_weight<I_weight)
         ^~

I've modified according to what I think the intention is, i.e. added braces
rather than moving the line in question out a level.
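
A minimal sketch of the brace-adding fix, assuming a hypothetical `struct bound` and helper `update_bound()` in place of the L/L_weight variables in xenalyze.c:

```c
/* Hypothetical pair of values that must be updated together. */
struct bound {
    int idx;
    long weight;
};

/*
 * The warned-about shape was:
 *
 *     if (cond)
 *         b->idx = idx; b->weight = weight;
 *
 * where only the first assignment is guarded. Adding braces makes both
 * assignments conditional, matching what the indentation suggested.
 */
void update_bound(struct bound *b, int cond, int idx, long weight)
{
    if (cond) {
        b->idx = idx;
        b->weight = weight;
    }
}
```

Moving the second statement out a level instead would keep the old behaviour (weight always overwritten) while silencing the warning.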

I have only build tested the result.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
(cherry picked from commit ebdba150bff1d914805d60efa576337bbef0c305)

6 years ago tools/firmware: update OVMF Makefile, when necessary
Wei Liu [Wed, 28 Nov 2018 17:43:33 +0000 (17:43 +0000)]
tools/firmware: update OVMF Makefile, when necessary

[ This is two commits from master aka staging-4.12: ]

OVMF has become dependent on OpenSSL, which is included as a
submodule.  Initialise submodules before building.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit b16281870e06f5f526029a4e69634a16dc38e8e4)

tools: only call git when necessary in OVMF Makefile

Users may choose to export a snapshot of OVMF and build it
with xen.git supplied ovmf-makefile. In that case we don't
need to call `git submodule`.

Fixes b16281870e.

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 68292c94a60eab24514ab4a8e4772af24dead807)
(cherry picked from commit e983e8ae84efd5e43045a3d20a820f13cb4a75bf)
(cherry picked from commit 5a81de4c6b6036974f29e2330a493f23a8f0c1f0)
(cherry picked from commit 63d9330ba9fdec7c8e9346e6d85360747d61c947)
(cherry picked from commit e9d860f1f657a198d990bdae3e295001bd19223c)
(cherry picked from commit 7835644d5141d0f28ec221eda40fcbf2fc03be23)

6 years ago x86/spec-ctrl: adjust backport of b76ec3946b
Andrew Cooper [Fri, 14 Sep 2018 09:33:12 +0000 (11:33 +0200)]
x86/spec-ctrl: adjust backport of b76ec3946b

Refreshing XenServer's patchqueue has shown that I missed this adjustment in
the upstream backports of the final version of the XSA-273 fixes.

The code does work in 4.7 and earlier, but only because the eventual value of
(opt_pv_l1tf & OPT_PV_L1TF_DOMx) is within range of a char.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

6 years ago amend "x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH"
Jan Beulich [Wed, 15 Aug 2018 12:27:40 +0000 (14:27 +0200)]
amend "x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH"

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

6 years ago amend "x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass"
Jan Beulich [Wed, 15 Aug 2018 12:27:22 +0000 (14:27 +0200)]
amend "x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass"

This is part of CVE-2018-3639 / XSA-263.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

6 years ago x86: write to correct variable in parse_pv_l1tf()
Jan Beulich [Wed, 15 Aug 2018 12:26:02 +0000 (14:26 +0200)]
x86: write to correct variable in parse_pv_l1tf()

Apparently a copy-and-paste mistake.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 57c554f8a6e06894f601d977d18b3017d2a60f40
master date: 2018-08-15 14:15:30 +0200

6 years ago arm: constify atomic_read parameter
Wei Liu [Wed, 15 Aug 2018 08:30:09 +0000 (09:30 +0100)]
arm: constify atomic_read parameter

7d98594e6e added const to atomic_read. Do the same for Arm
counterpart.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>

6 years ago xl.conf: Add global affinity masks
Wei Liu [Tue, 7 Aug 2018 14:35:34 +0000 (15:35 +0100)]
xl.conf: Add global affinity masks

XSA-273 involves one hyperthread being able to use Spectre-like
techniques to "spy" on another thread.  The details are somewhat
complicated, but the upshot is that after all Xen-based mitigations
have been applied:

* PV guests cannot spy on sibling threads
* HVM guests can spy on sibling threads

(NB that for purposes of this vulnerability, PVH and HVM guests are
identical.  Whenever this comment refers to 'HVM', this includes PVH.)

There are many possible mitigations to this, including disabling
hyperthreading entirely.  But another solution would be:

* Specify some cores as PV-only, others as PV or HVM
* Allow HVM guests to only run on thread 0 of the "HVM-or-PV" cores
* Allow PV guests to run on the above cores, as well as any thread of the PV-only cores.

For example, suppose you had 16 threads across 8 cores (0-7).  You
could specify 0-3 as PV-only, and 4-7 as HVM-or-PV.  Then you'd set
the affinity of the HVM guests as follows (binary representation):

0000000010101010

And the affinity of the PV guests as follows:

1111111110101010

In order to make this easy, this patch introduces three "global affinity
masks", placed in xl.conf:

    vm.cpumask
    vm.hvm.cpumask
    vm.pv.cpumask

These are parsed just like the 'cpus' and 'cpus_soft' options in the
per-domain xl configuration files.  The resulting mask is AND-ed with
whatever mask results at the end of the xl configuration file.
`vm.cpumask` would be applied to all guest types, `vm.hvm.cpumask`
would be applied to HVM and PVH guest types, and `vm.pv.cpumask`
would be applied to PV guest types.

The idea would be that to implement the above mask across all your
VMs, you'd simply add the following two lines to the configuration
file:

    vm.hvm.cpumask=8,10,12,14
    vm.pv.cpumask=0-8,10,12,14

See xl.conf manpage for details.

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit aa67b97ed34279c43a43d9ca46727b5746caa92e)

PVH guest type in toolstack is not available in this version of Xen.
Change code and manpage to cope. Also, xl is still part of libxl in
this version, so manually backport the code to the relevant places.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>

6 years ago x86: Make "spec-ctrl=no" a global disable of all mitigations
Jan Beulich [Mon, 13 Aug 2018 11:07:23 +0000 (05:07 -0600)]
x86: Make "spec-ctrl=no" a global disable of all mitigations

In order to have a simple and easy to remember means to suppress all the
more or less recent workarounds for hardware vulnerabilities, force
settings not controlled by "spec-ctrl=" also to their original defaults,
unless they've been forced to specific values already by earlier command
line options.

This is part of XSA-273.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit d8800a82c3840b06b17672eddee4878bbfdacc6d)

6 years ago x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests
Andrew Cooper [Tue, 29 May 2018 17:44:16 +0000 (18:44 +0100)]
x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests

This mitigation requires up-to-date microcode.  It is enabled by default on
affected hardware if available, and is used for HVM guests.

The default for SMT/Hyperthreading is far more complicated to reason about,
not least because we don't know if the user is going to want to run any HVM
guests to begin with.  If an explicit default isn't given, nag the user to
perform a risk assessment and choose an explicit default, and leave other
configuration to the toolstack.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3bd36952dab60290f33d6791070b57920e10754b)

6 years ago x86/msr: Virtualise MSR_FLUSH_CMD for guests
Andrew Cooper [Fri, 13 Apr 2018 15:34:01 +0000 (15:34 +0000)]
x86/msr: Virtualise MSR_FLUSH_CMD for guests

Guests (outside of the nested virt case, which isn't supported yet) don't need
L1D_FLUSH for their L1TF mitigations, but offering/emulating MSR_FLUSH_CMD is
easy and doesn't pose an issue for Xen.

The MSR is offered to HVM guests only.  PV guests attempting to use it would
trap for emulation, and the L1D cache would fill long before the return to
guest context.  As such, PV guests can't make any use of the L1D_FLUSH
functionality.

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fd9823faf9df057a69a9a53c2e100691d3f4267c)

6 years ago x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH
Andrew Cooper [Wed, 28 Mar 2018 14:21:39 +0000 (15:21 +0100)]
x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3563fc2b2731a63fd7e8372ab0f5cef205bf8477)

6 years ago x86/pv: Force a guest into shadow mode when it writes an L1TF-vulnerable PTE
Juergen Gross [Mon, 23 Jul 2018 06:11:40 +0000 (08:11 +0200)]
x86/pv: Force a guest into shadow mode when it writes an L1TF-vulnerable PTE

See the comment in shadow.h for an explanation of L1TF and the safety
consideration of the PTEs.

In the case that CONFIG_SHADOW_PAGING isn't compiled in, crash the domain
instead.  This allows well-behaved PV guests to function, while preventing
L1TF from being exploited.  (Note: PV guest kernels which haven't been updated
with L1TF mitigations will likely be crashed as soon as they try paging a
piece of userspace out to disk.)

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 06e8b622d3f3c0fa5075e91b041c6f45549ad70a)

6 years ago x86/mm: Plumbing to allow any PTE update to fail with -ERESTART
Andrew Cooper [Mon, 23 Jul 2018 06:11:40 +0000 (08:11 +0200)]
x86/mm: Plumbing to allow any PTE update to fail with -ERESTART

Switching to shadow mode is performed in tasklet context.  To facilitate this,
we schedule the tasklet, then create a hypercall continuation to allow the
switch to take place.

As a consequence, the x86 mm code needs to cope with an L1e operation being
continuable.  do_mmu{,ext}_op() may no longer assert that a continuation
doesn't happen on the final iteration.

To handle the arguments correctly on continuation, compat_update_va_mapping*()
may no longer call into their non-compat counterparts.  Move the compat
functions into mm.c rather than exporting __do_update_va_mapping() and
{get,put}_pg_owner(), and fix an unsigned long/int inconsistency with
compat_update_va_mapping_otherdomain().

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c612481d1c9232c6abf91b03ec655e92f808805f)

6 years ago x86/shadow: Infrastructure to force a PV guest into shadow mode
Juergen Gross [Mon, 23 Jul 2018 06:11:40 +0000 (07:11 +0100)]
x86/shadow: Infrastructure to force a PV guest into shadow mode

To mitigate L1TF, we cannot alter an architecturally-legitimate PTE a PV guest
chooses to write, but we can force the PV domain into shadow mode so Xen
controls the PTEs which are reachable by the CPU pagewalk.

Introduce a new shadow mode, PG_SH_forced, and a tasklet to perform the
transition.  Later patches will introduce the logic to enable this mode at the
appropriate time.

To simplify vcpu cleanup, make tasklet_kill() idempotent with respect to
tasklet_init(), which involves adding a helper to check for an uninitialised
list head.
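
A minimal sketch of the idempotency idea, assuming a simplified tasklet structure and a hypothetical helper `list_head_is_null()`; Xen's real tasklet and list code is more involved:

```c
#include <stdbool.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical helper: true if the head was zeroed but never initialised. */
static bool list_head_is_null(const struct list_head *l)
{
    return l->next == NULL && l->prev == NULL;
}

/* Simplified stand-in for a tasklet. */
struct tasklet {
    struct list_head list;
    bool is_dead;
};

void tasklet_init(struct tasklet *t)
{
    t->list.next = t->list.prev = &t->list;   /* empty, self-linked */
    t->is_dead = false;
}

/* Safe to call whether or not tasklet_init() ever ran on a zeroed struct. */
void tasklet_kill(struct tasklet *t)
{
    if (list_head_is_null(&t->list))
        return;                               /* nothing to tear down */

    /* Unlink from whatever list the tasklet is on, then self-link again. */
    t->list.prev->next = t->list.next;
    t->list.next->prev = t->list.prev;
    t->list.next = t->list.prev = &t->list;
    t->is_dead = true;
}
```

Cleanup paths can then call the kill function unconditionally, without tracking whether initialisation was reached.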

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b76ec3946bf6caca2c3950b857c008bc8db6723f)

6 years ago x86/spec-ctrl: Introduce an option to control L1TF mitigation for PV guests
Andrew Cooper [Mon, 23 Jul 2018 13:46:10 +0000 (13:46 +0000)]
x86/spec-ctrl: Introduce an option to control L1TF mitigation for PV guests

Shadowing a PV guest is only available when shadow paging is compiled in.
When shadow paging isn't available, guests can be crashed instead as
mitigation from Xen's point of view.

Ideally, dom0 would also be potentially-shadowed-by-default, but dom0 has
never been shadowed before, and there are some stability issues under
investigation.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 66a4e986819a86ba66ca2fe9d925e62a4fd30114)

6 years ago x86/spec-ctrl: Calculate safe PTE addresses for L1TF mitigations
Andrew Cooper [Wed, 25 Jul 2018 12:10:19 +0000 (12:10 +0000)]
x86/spec-ctrl: Calculate safe PTE addresses for L1TF mitigations

Safe PTE addresses for L1TF mitigations are ones which are within the L1D
address width (may be wider than reported in CPUID), and above the highest
cacheable RAM/NVDIMM/BAR/etc.
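
One reading of this definition, as a minimal sketch: an address is safe if it lies within the L1D address width and at or above the top of all cacheable ranges. `l1tf_addr_is_safe()` and its parameters are hypothetical; the real heuristics for discovering both inputs are not shown:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check. l1d_maxphysaddr is the L1D address width in bits
 * (possibly wider than CPUID reports); top_of_cacheable is the first
 * address above the highest cacheable RAM/NVDIMM/BAR.
 */
bool l1tf_addr_is_safe(uint64_t paddr, unsigned int l1d_maxphysaddr,
                       uint64_t top_of_cacheable)
{
    assert(l1d_maxphysaddr < 64);

    return paddr >= top_of_cacheable &&
           paddr < (1ULL << l1d_maxphysaddr);
}
```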

All logic here is best-effort heuristics, which should in practice be fine for
most hardware.  Future work will see about disentangling the SRAT handling
further, as well as having L0 pass this information down to lower levels when
virtualised.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b03a57c9383b32181e60add6b6de12b473652aa4)

6 years ago tools/oxenstored: Make evaluation order explicit
Christian Lindig [Mon, 13 Aug 2018 16:26:56 +0000 (17:26 +0100)]
tools/oxenstored: Make evaluation order explicit

In Store.path_write(), Path.apply_modify() updates the node_created
reference and both the value of apply_modify() and node_created are
returned by path_write().

At least with OCaml 4.06.1 this leads to the value of node_created being
returned *before* it is updated by apply_modify().  This in turn leads
to the quota for a domain not being updated in Store.write().  Hence, a
guest can create an unlimited number of entries in xenstore.

The fix is to make evaluation order explicit.
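
The same hazard exists in C, so a C analogue can illustrate the shape of the fix; this is not the OCaml code from oxenstored. The names mirror the description above but the bodies are hypothetical. Binding the result of the side-effecting call first, and only then reading the flag it updates, removes any dependence on evaluation order:

```c
/* Flag updated as a side effect of apply_modify(). */
static int node_created;

/* Hypothetical stand-in: modifies the store and records the creation. */
static int apply_modify(void)
{
    node_created = 1;
    return 42;
}

struct write_result {
    int value;
    int created;
};

struct write_result path_write(void)
{
    struct write_result r;

    node_created = 0;

    /*
     * Writing both expressions in one initialiser list would leave their
     * relative order up to the compiler. Two separate statements force
     * the call to happen before the flag is read.
     */
    r.value = apply_modify();
    r.created = node_created;
    return r;
}
```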

This is XSA-272.

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Rob Hoes <rob.hoes@citrix.com>
(cherry picked from commit 73392c7fd14c59f8c96e0b2eeeb329e4ae9086b6)

6 years ago x86/vtx: Fix the checking for unknown/invalid MSR_DEBUGCTL bits
Andrew Cooper [Mon, 18 Jun 2018 08:12:39 +0000 (16:12 +0800)]
x86/vtx: Fix the checking for unknown/invalid MSR_DEBUGCTL bits

The VPMU_MODE_OFF early-exit in vpmu_do_wrmsr() introduced by c/s
11fe998e56 bypasses all reserved bit checking in the general case.  As a
result, a guest can enable BTS when it shouldn't be permitted to, and
lock up the entire host.

With vPMU active (not a security supported configuration, but useful for
debugging), the reserved bit checking is broken, caused by the original
BTS changeset 1a8aa75ed.

From a correctness standpoint, it is not possible to have two different
pieces of code responsible for different parts of value checking, if
there isn't an accumulation of bits which have been checked.  A
practical upshot of this is that a guest can set any value it
wishes (usually resulting in a vmentry failure for bad guest state).

Therefore, fix this by implementing all the reserved bit checking in the
main MSR_DEBUGCTL block, and removing all handling of DEBUGCTL from the
vPMU MSR logic.
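
A minimal sketch of single-site reserved bit checking, assuming a hypothetical helper `debugctl_write_ok()` and an `extra_allowed` parameter standing in for feature-dependent bits; the constant names are illustrative, not Xen's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names for two low DEBUGCTL bits (LBR and BTF). */
#define DEBUGCTL_LBR (1ULL << 0)
#define DEBUGCTL_BTF (1ULL << 1)

/*
 * Hypothetical check: accumulate every permitted bit into one mask and
 * reject the write if anything outside that mask is set. Because all
 * checking happens here, no other code path can let a bit through.
 */
bool debugctl_write_ok(uint64_t val, uint64_t extra_allowed)
{
    uint64_t allowed = DEBUGCTL_LBR | DEBUGCTL_BTF | extra_allowed;

    return (val & ~allowed) == 0;
}
```

Splitting the check across two places only works if the set of already-checked bits is passed between them, which is the accumulation the message refers to.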

This is XSA-269.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a8a8e99feb950504559196521bc9fd63ed3a962)

6 years ago ARM: disable grant table v2
Stefano Stabellini [Tue, 14 Aug 2018 10:20:53 +0000 (11:20 +0100)]
ARM: disable grant table v2

It was never expected to work; the implementation is incomplete.

As a side effect, it also prevents guests from triggering a
"BUG_ON(page_get_owner(pg) != d)" in gnttab_unpopulate_status_frames().

This is XSA-268.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9a5c16a3e75778c8a094ca87784d93b74676f46c)

6 years ago common/gnttab: Introduce command line feature controls
Andrew Cooper [Tue, 14 Aug 2018 10:20:53 +0000 (11:20 +0100)]
common/gnttab: Introduce command line feature controls

This patch was originally released as part of XSA-226.  It retains the same
command line syntax (as various downstreams are mitigating XSA-226 using this
mechanism) but the defaults have been updated due to the revised XSA-226
patches, after which transitive grants are believed to be functioning
properly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit dc96c65ed6d7ffd4c95487373df708d97443cf77)

6 years ago VMX: fix vmx_{find,del}_msr() build
Jan Beulich [Thu, 19 Jul 2018 09:54:45 +0000 (11:54 +0200)]
VMX: fix vmx_{find,del}_msr() build

Older gcc at -O2 (and perhaps higher) does not recognize that apparently
uninitialized variables aren't really uninitialized. Pull out the
assignments used by two of the three case blocks and make them
initializers of the variables, as I think I had suggested during review.
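
A generic sketch of the pattern, with hypothetical names (`enum kind`, `span_length()`) rather than the actual vmx_{find,del}_msr() code. Values shared by two of the three cases become initializers, so every path visibly starts from defined values:

```c
enum kind { KIND_A, KIND_B, KIND_C };

/* Hypothetical function: length of the span selected by kind. */
unsigned int span_length(enum kind k, unsigned int a_count,
                         unsigned int b_count)
{
    /*
     * Previously assigned inside two of the case blocks. As initializers
     * they also stop older compilers from reporting the variables as
     * possibly uninitialized.
     */
    unsigned int start = 0, end = a_count;

    switch (k) {
    case KIND_A:
    case KIND_B:
        break;                    /* defaults apply */

    case KIND_C:
        start = a_count;          /* the one case that differs */
        end = b_count;
        break;
    }

    return end - start;
}
```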

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 97cb0516a322ecdf0032fa9d8aa1525c03d7772f)

6 years ago x86/vmx: Support load-only guest MSR list entries
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support load-only guest MSR list entries

Currently, the VMX_MSR_GUEST type maintains completely symmetric guest load
and save lists, by pointing VM_EXIT_MSR_STORE_ADDR and VM_ENTRY_MSR_LOAD_ADDR
at the same page, and setting VM_EXIT_MSR_STORE_COUNT and
VM_ENTRY_MSR_LOAD_COUNT to the same value.

However, for MSRs which we won't let the guest have direct access to, having
hardware save the current value on VMExit is unnecessary overhead.

To avoid this overhead, we must make the load and save lists asymmetric.  By
making the entry load count greater than the exit store count, we can maintain
two adjacent lists of MSRs, the first of which is saved and restored, and the
second of which is only restored on VMEntry.

For simplicity:
 * Both adjacent lists are still sorted by MSR index.
 * It is undefined behaviour to insert the same MSR into both lists.
 * The total size of both lists is still limited at 256 entries (one 4k page).

Split the current msr_count field into msr_{load,save}_count, and introduce a
new VMX_MSR_GUEST_LOADONLY type, and update vmx_{add,find}_msr() to calculate
which sublist to search, based on type.  VMX_MSR_HOST has no logical sublist,
whereas VMX_MSR_GUEST has a sublist between 0 and the save count, while
VMX_MSR_GUEST_LOADONLY has a sublist between the save count and the load
count.

One subtle point is that inserting an MSR into the load-save list involves
moving the entire load-only list, and updating both counts.
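
The sublist selection described above can be sketched as follows.  This is a
minimal illustration rather than the code from the patch: the enum constants
and the msr_{load,save}_count names follow this commit message, while
struct msr_counts, host_msr_count and sublist_bounds() are hypothetical names
chosen for the example.

```c
/* List types named in the commit message. */
enum vmx_msr_list_type {
    VMX_MSR_HOST,
    VMX_MSR_GUEST,
    VMX_MSR_GUEST_LOADONLY,
};

/* Hypothetical container for the counts (not the real arch_vmx_struct). */
struct msr_counts {
    unsigned int msr_load_count; /* entries loaded on VMEntry */
    unsigned int msr_save_count; /* entries saved on VMExit, <= load count */
    unsigned int host_msr_count; /* assumed count for the separate host list */
};

/* Compute the half-open index range [*start, *end) to search for a type. */
static int sublist_bounds(const struct msr_counts *c,
                          enum vmx_msr_list_type type,
                          unsigned int *start, unsigned int *end)
{
    switch ( type )
    {
    case VMX_MSR_HOST:            /* no sublist: the whole host list */
        *start = 0;
        *end = c->host_msr_count;
        return 0;

    case VMX_MSR_GUEST:           /* saved and restored: [0, save count) */
        *start = 0;
        *end = c->msr_save_count;
        return 0;

    case VMX_MSR_GUEST_LOADONLY:  /* restored only: [save count, load count) */
        *start = c->msr_save_count;
        *end = c->msr_load_count;
        return 0;
    }

    return -1;
}
```

With a save count of 3 and a load count of 5, entries 0 to 2 form the
load-save sublist and entries 3 to 4 form the load-only sublist.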

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 1ac46b55632626aeb935726e1b0a71605ef6763a)

6 years agox86/vmx: Support remote access to the MSR lists
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support remote access to the MSR lists

At the moment, all modifications of the MSR lists are in current context.
However, future changes may need to put MSR_EFER into the lists from domctl
hypercall context.

Plumb a struct vcpu parameter down through the infrastructure, and use
vmx_vmcs_{enter,exit}() for safe access to the VMCS in vmx_add_msr().  Use
assertions to ensure that access is either in current context, or while the
vcpu is paused.

Note these expectations beside the fields in arch_vmx_struct, and reorder the
fields to avoid unnecessary padding.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 80599f0b770199116aa753bfdfac9bfe2e8ea86a)

6 years agox86/vmx: Factor locate_msr_entry() out of vmx_find_msr() and vmx_add_msr()
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Factor locate_msr_entry() out of vmx_find_msr() and vmx_add_msr()

Instead of having multiple algorithms searching the MSR lists, implement a
single one.  It has the semantics required by vmx_add_msr(), to identify the
position in which an MSR should live, if it isn't already present.

There will be a marginal improvement for vmx_find_msr() by avoiding the
function pointer calls to vmx_msr_entry_key_cmp(), and a major improvement for
vmx_add_msr() by using a binary search instead of a linear search.
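
A search with these semantics might look roughly like the sketch below.  It
is an assumption-laden illustration, not the patch itself: struct msr_entry
is a simplified stand-in for the hardware entry format, and locate_entry() is
a hypothetical name.

```c
#include <stdint.h>

/* Simplified entry; the hardware format also carries reserved bits. */
struct msr_entry {
    uint32_t index;
    uint64_t data;
};

/*
 * Binary search of the sorted range [start, end).  Returns a pointer to the
 * entry whose index matches msr, or to the position where such an entry
 * would have to be inserted to keep the range sorted (possibly end).
 */
static struct msr_entry *locate_entry(struct msr_entry *start,
                                      struct msr_entry *end, uint32_t msr)
{
    while ( start < end )
    {
        struct msr_entry *mid = start + (end - start) / 2;

        if ( msr < mid->index )
            end = mid;
        else if ( msr > mid->index )
            start = mid + 1;
        else
            return mid;
    }

    return start;
}
```

A caller that only wants to find an entry checks that the result is inside
the range and that its index matches; a caller that wants to add one can
shift the tail of the list up from the returned position.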

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 4d94828cf11104256dccea1fa7762f00575dfaa0)

6 years agox86/vmx: Internal cleanup for MSR load/save infrastructure
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Internal cleanup for MSR load/save infrastructure

 * Use an arch_vmx_struct local variable to reduce later code volume.
 * Use start/total instead of msr_area/msr_count.  This is in preparation for
   more finegrained handling with later changes.
 * Use ent/end pointers (again for preparation), and to make the vmx_add_msr()
   logic easier to follow.
 * Make the memory allocation block of vmx_add_msr() unlikely, and calculate
   virt_to_maddr() just once.

No practical change to functionality.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 94fda356fcdcc847662a4c9f6cc63511f25c1247)

6 years agox86/vmx: API improvements for MSR load/save infrastructure
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: API improvements for MSR load/save infrastructure

Collect together related infrastructure in vmcs.h, rather than having it
spread out.  Turn vmx_{read,write}_guest_msr() into static inlines, as they
are simple enough.

Replace 'int type' with 'enum vmx_msr_list_type', and use switch statements
internally.  Later changes are going to introduce a new type.

Rename the type identifiers for consistency with the other VMX_MSR_*
constants.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit f54b63e8617ada823be43d60467a43c8224b7909)

6 years agox86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()
Andrew Cooper [Mon, 28 May 2018 14:02:34 +0000 (15:02 +0100)]
x86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()

paging_update_paging_modes() and vmx_vlapic_msr_changed() both operate on the
VMCS being constructed.  Avoid dropping and re-acquiring the reference
multiple times.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit f30e3cf34042846e391e3f8361fc6a76d181a7ee)

6 years agox86/spec-ctrl: Yet more fixes for xpti= parsing
Andrew Cooper [Thu, 9 Aug 2018 16:22:17 +0000 (17:22 +0100)]
x86/spec-ctrl: Yet more fixes for xpti= parsing

As it currently stands, 'xpti=dom0' is indistinguishable from the default
value, which means it will be overridden by ARCH_CAPABILITIES_RDCL_NO on fixed
hardware.

Switch opt_xpti to use -1 as a default like all our other related options, and
clobber it as soon as we have a string to parse.

In addition, 'xpti' alone should be interpreted in its positive boolean form,
rather than resulting in a parse error.

  (XEN) parameter "xpti" has invalid value "", rc=-22!
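
The tri-state handling can be illustrated with the sketch below.  It is a
simplified single-token parser under stated assumptions: the flag names
XPTI_DOM0 and XPTI_DOMU and the function parse_xpti_sketch() are hypothetical,
and the real parser accepts more forms than shown here.

```c
#include <string.h>

/* Hypothetical flag bits; the real names and values may differ. */
#define XPTI_DOM0 (1 << 0)
#define XPTI_DOMU (1 << 1)

/* -1 means "no choice on the command line, pick a default later". */
static int opt_xpti = -1;

/* Simplified single-token parser; returns 0 on success, -1 on error. */
static int parse_xpti_sketch(const char *s)
{
    /* An explicit option was given, so the default no longer applies. */
    if ( opt_xpti == -1 )
        opt_xpti = 0;

    if ( *s == '\0' )                  /* bare "xpti": positive boolean */
        opt_xpti = XPTI_DOM0 | XPTI_DOMU;
    else if ( !strcmp(s, "dom0") )
        opt_xpti |= XPTI_DOM0;
    else if ( !strcmp(s, "no-dom0") )
        opt_xpti &= ~XPTI_DOM0;
    else
        return -1;

    return 0;
}
```

Because the default is -1 rather than a valid bit pattern, "xpti=dom0" leaves
a value that later code can tell apart from "nothing was specified".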

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a3b34ec47817048ab59586855cf0709fc77487e)

6 years agox86/spec-ctrl: Fix the parsing of xpti= on fixed Intel hardware
Andrew Cooper [Mon, 30 Jul 2018 12:21:41 +0000 (14:21 +0200)]
x86/spec-ctrl: Fix the parsing of xpti= on fixed Intel hardware

The calls to xpti_init_default() in parse_xpti() are buggy.  The CPUID data
hasn't been fetched that early, and boot_cpu_has(X86_FEATURE_ARCH_CAPS) will
always evaluate false.

As a result, the default case won't disable XPTI on Intel hardware which
advertises ARCH_CAPABILITIES_RDCL_NO.

Simplify parse_xpti() to solely the setting of opt_xpti according to the
passed string, and have init_speculation_mitigations() call
xpti_init_default() if appropriate.  Drop the force parameter, and pass caps
instead, to avoid redundant re-reading of MSR_ARCH_CAPS.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: be5e2ff6f54e0245331ed360b8786760f82fd673
master date: 2018-07-24 11:25:54 +0100

6 years agox86: command line option to avoid use of secondary hyper-threads
Jan Beulich [Mon, 30 Jul 2018 12:21:07 +0000 (14:21 +0200)]
x86: command line option to avoid use of secondary hyper-threads

Shared resources (L1 cache and TLB in particular) present a risk of
information leak via side channels. Provide a means to avoid use of
hyperthreads in such cases.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d8f974f1a646c0200b97ebcabb808324b288fadb
master date: 2018-07-19 13:43:33 +0100

6 years agox86: possibly bring up all CPUs even if not all are supposed to be used
Jan Beulich [Mon, 30 Jul 2018 12:20:22 +0000 (14:20 +0200)]
x86: possibly bring up all CPUs even if not all are supposed to be used

Reportedly Intel CPUs which can't broadcast #MC to all targeted
cores/threads because some have CR4.MCE clear will shut down. Therefore
we want to keep CR4.MCE enabled when offlining a CPU, and we need to
bring up all CPUs in order to be able to set CR4.MCE in the first place.

The use of clear_in_cr4() in cpu_mcheck_disable() was ill advised
anyway, and to avoid future similar mistakes I'm removing clear_in_cr4()
altogether right here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 8797d20a6ec2dd75195585a107ce345c51c0a59a
master date: 2018-07-19 13:43:33 +0100

6 years agox86: distinguish CPU offlining from CPU removal
Jan Beulich [Mon, 30 Jul 2018 12:19:38 +0000 (14:19 +0200)]
x86: distinguish CPU offlining from CPU removal

In order to be able to service #MC on offlined CPUs, the GDT, IDT,
stack, and per-CPU data (which includes the TSS) need to be kept
allocated. They should only be freed upon CPU removal (which we
currently don't support, so some code is becoming effectively dead for
the moment).

Note that for now park_offline_cpus doesn't get set to true anywhere -
this is going to be the subject of a subsequent patch.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2e6c8f182c9c50129b1c7a620242861e6ad6a9fb
master date: 2018-07-19 13:43:33 +0100

6 years agox86/AMD: distinguish compute units from hyper-threads
Jan Beulich [Mon, 30 Jul 2018 12:19:05 +0000 (14:19 +0200)]
x86/AMD: distinguish compute units from hyper-threads

Fam17 replaces CUs by HTs, which we should reflect accordingly, even if
the difference is not very big. The most relevant change (requiring some
code restructuring) is that the topoext feature no longer means there is
a valid CU ID.

Take the opportunity to convert variables in set_cpu_sibling_map() that are
wrongly plain int to unsigned int.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9429b07a0af7f92a5f25e4068e11db881e157495
master date: 2018-07-19 09:42:42 +0200

6 years agocpupools: fix state when downing a CPU failed
Jan Beulich [Mon, 30 Jul 2018 12:18:40 +0000 (14:18 +0200)]
cpupools: fix state when downing a CPU failed

While I've run into the issue with further patches in place which no
longer guarantee the per-CPU area to start out as all zeros, the
CPU_DOWN_FAILED processing looks to have the same issue: By not zapping
the per-CPU cpupool pointer, cpupool_cpu_add()'s (indirect) invocation
of schedule_cpu_switch() will trigger the "c != old_pool" assertion
there.

Clearing the field during CPU_DOWN_PREPARE is too early (afaict this
should not happen before cpu_disable_scheduler()). Clearing it in
CPU_DEAD and CPU_DOWN_FAILED would be an option, but would take the same
piece of code twice. Since the field's value shouldn't matter while the
CPU is offline, simply clear it (implicitly) for CPU_ONLINE and
CPU_DOWN_FAILED, but only for other than the suspend/resume case (which
gets specially handled in cpupool_cpu_remove()).

By adjusting the conditional in cpupool_cpu_add() CPU_DOWN_FAILED
handling in the suspend case should now also be handled better.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: cb1ae9a27819cea0c5008773c68a7be6f37eb0e5
master date: 2018-07-19 09:41:55 +0200

6 years agoallow cpu_down() to be called earlier
Jan Beulich [Mon, 30 Jul 2018 12:18:13 +0000 (14:18 +0200)]
allow cpu_down() to be called earlier

The function's use of the stop-machine logic has so far prevented its
use ahead of the processing of the "ordinary" initcalls. Since at this
early time we're in a controlled environment anyway, there's no need for
such a heavy tool. Additionally this ought to have less of a performance
impact especially on large systems, compared to the alternative of
making stop-machine functionality available earlier.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5894c0a2da66243a89088d309c7e1ea212ab28d6
master date: 2018-07-16 15:15:12 +0200

6 years agox86/spec-ctrl: command line handling adjustments
Jan Beulich [Mon, 30 Jul 2018 12:17:45 +0000 (14:17 +0200)]
x86/spec-ctrl: command line handling adjustments

For one, "no-xen" should not imply "no-eager-fpu", as "eager FPU" mode
guards guests, not Xen itself, which is also how print_details() reports
it.

And then opt_ssbd, despite being off by default, should also be cleared
by the "no" and "no-xen" sub-options.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ac3f9a72141a48d40fabfff561d5a7dc0e1b810d
master date: 2018-07-10 12:22:31 +0200

6 years agoxen: Port the array_index_nospec() infrastructure from Linux
Andrew Cooper [Mon, 30 Jul 2018 12:17:11 +0000 (14:17 +0200)]
xen: Port the array_index_nospec() infrastructure from Linux

This is as the infrastructure appeared in Linux 4.17, adapted slightly for
Xen.
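
The core idea of the generic mask calculation can be sketched in plain C as
below.  This is an illustration only: the function names are hypothetical,
the real infrastructure adds compiler barriers and per-architecture variants
that this sketch omits, and it relies on right shifts of negative values
being arithmetic, which is implementation-defined in C but holds on the
common compilers.

```c
#include <limits.h>

/*
 * Returns ~0UL when index < size and 0 otherwise, without a conditional
 * branch.  Assumes size is non-zero and no larger than LONG_MAX.
 */
static unsigned long index_mask_nospec(unsigned long index,
                                       unsigned long size)
{
    /*
     * For an in-range index, neither index nor (size - 1 - index) has its
     * top bit set, so the inverted value is negative and the shift smears
     * the sign bit across the whole word.
     */
    return ~(long)(index | (size - 1 - index)) >>
           (sizeof(long) * CHAR_BIT - 1);
}

/* Clamp index to 0 when it is out of bounds, including under speculation. */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
    return index & index_mask_nospec(index, size);
}
```

The intended use is after an ordinary bounds check: the check rejects bad
indexes architecturally, and the masking keeps a mispredicted path from
reading beyond the array.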

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2ddfae51d8b1d7b8cd33a4f6ad4d16d27cb869ae
master date: 2018-07-06 16:49:57 +0100

6 years agocmdline: fix parse_boolean() for NULL incoming end pointer
Jan Beulich [Mon, 30 Jul 2018 12:16:15 +0000 (14:16 +0200)]
cmdline: fix parse_boolean() for NULL incoming end pointer

Use the calculated lengths instead of pointers, as 'e' being NULL will
otherwise cause undue parsing failures.
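
A length-based comparison of this kind can be sketched as follows.  The
function parse_boolean_sketch() is a hypothetical, reduced stand-in that only
recognises "name" and "no-name"; it is not the function being fixed.

```c
#include <string.h>

/*
 * Returns 1 for "name", 0 for "no-name", and -1 if the token in [s, e) is
 * something else.  A NULL e means the token runs to the end of the string.
 */
static int parse_boolean_sketch(const char *name, const char *s,
                                const char *e)
{
    size_t slen, nlen = strlen(name);
    int val = 1;

    /* Work with lengths so that a NULL end pointer needs no special case. */
    slen = e ? (size_t)(e - s) : strlen(s);

    if ( slen >= 3 && !strncmp(s, "no-", 3) )
    {
        val = 0;
        s += 3;
        slen -= 3;
    }

    if ( slen != nlen || strncmp(s, name, nlen) )
        return -1;

    return val;
}
```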

Reported-by: Karl Johnson <karljohnson.it@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86/HVM: don't cause #NM to be raised in Xen
Jan Beulich [Thu, 28 Jun 2018 10:28:21 +0000 (12:28 +0200)]
x86/HVM: don't cause #NM to be raised in Xen

The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.

{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.

With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86/EFI: further correct FPU state handling around runtime calls
Jan Beulich [Thu, 28 Jun 2018 10:27:56 +0000 (12:27 +0200)]
x86/EFI: further correct FPU state handling around runtime calls

We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.

Rename the function at the same time to better reflect its purpose, as
the patch touches all of its occurrences anyway.

The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
6 years agox86/EFI: fix FPU state handling around runtime calls
Jan Beulich [Thu, 28 Jun 2018 10:27:34 +0000 (12:27 +0200)]
x86/EFI: fix FPU state handling around runtime calls

There are two issues.  First, the nonlazy xstates were never restored
after returning from the runtime call.

Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.

Fix both of these issues by calling vcpu_restore_fpu_eager().  As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.

Introduce a "curr" local variable and replace other uses of "current"
at the same time.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Juergen Gross <jgross@suse.com>
6 years agox86: Refine checks in #DB handler for faulting conditions
Andrew Cooper [Thu, 28 Jun 2018 10:26:54 +0000 (12:26 +0200)]
x86: Refine checks in #DB handler for faulting conditions

One of the fixes for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.

While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to.  Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware).  A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.

Instruction breakpoints are more tricky.  The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.

Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context.  However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.
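
The "breakpoint enabled" check can be sketched as below.  The bit layout is
the architectural one (match bits B0-B3 in %dr6 bits 0-3, the L/G enable pair
for breakpoint i in %dr7 bits 2i and 2i+1, and its R/W type in bits 16+4i and
17+4i, where 0 means instruction execution); the function name is
hypothetical and the real handler is structured differently.

```c
#include <stdbool.h>

/*
 * True if %dr6 reports a match for at least one breakpoint which %dr7 has
 * enabled (locally or globally) and configured as an instruction breakpoint.
 */
static bool enabled_insn_bp_hit(unsigned long dr6, unsigned long dr7)
{
    for ( unsigned int i = 0; i < 4; i++ )
    {
        bool matched = dr6 & (1UL << i);
        bool enabled = dr7 & (3UL << (i * 2));
        bool insn    = ((dr7 >> (16 + i * 4)) & 3) == 0;

        if ( matched && enabled && insn )
            return true;
    }

    return false;
}
```

A match bit set for a breakpoint that is not enabled, which the Intel manual
warns can happen, is ignored by this test.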

This is XSA-265.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agox86/mm: don't bypass preemption checks
Jan Beulich [Thu, 28 Jun 2018 10:26:25 +0000 (12:26 +0200)]
x86/mm: don't bypass preemption checks

While unlikely, it is not impossible for a multi-vCPU guest to leverage
bypasses of preemption checks to drive Xen into an unbounded loop.

This is XSA-264.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86: correct default_xen_spec_ctrl calculation
Jan Beulich [Thu, 28 Jun 2018 10:25:43 +0000 (12:25 +0200)]
x86: correct default_xen_spec_ctrl calculation

Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d6239f64713df819278bf048446d3187c6ac4734
master date: 2018-05-29 12:38:52 +0200

6 years agox86/spec-ctrl: Mitigations for LazyFPU
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86/spec-ctrl: Mitigations for LazyFPU

Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.

On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.

This is part of XSA-267 / CVE-2018-3665

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 243435bf67e8159495194f623b9e4d8c90140384)

6 years agox86: Support fully eager FPU context switching
Andrew Cooper [Thu, 7 Jun 2018 16:00:37 +0000 (17:00 +0100)]
x86: Support fully eager FPU context switching

This is controlled on a per-vcpu basis for flexibility.

This is part of XSA-267 / CVE-2018-3665

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 146dfe9277c2b4a8c399b229e00d819065e3167b)

6 years agox86: don't enable XPTI on idle domain
Jan Beulich [Wed, 30 May 2018 11:38:03 +0000 (13:38 +0200)]
x86: don't enable XPTI on idle domain

While the involved code (in pv_domain_initialise()) sits behind an
!is_idle_domain() check already in 4.10, we need to add one here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86: re-enable XPTI/PCID as needed in switch_native()
Jan Beulich [Wed, 30 May 2018 06:38:06 +0000 (08:38 +0200)]
x86: re-enable XPTI/PCID as needed in switch_native()

Additionally avoid accessing d->arch.pv_domain for PVH domains (running
in a HVM container).

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agoxen/x86: use PCID feature
Juergen Gross [Thu, 26 Apr 2018 11:33:18 +0000 (13:33 +0200)]
xen/x86: use PCID feature

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has been shown not to perform better with PCID, at
least on some machines, the default is to use PCID only for domains subject
to XPTI.

With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: add some cr3 helpers
Juergen Gross [Thu, 26 Apr 2018 11:33:17 +0000 (13:33 +0200)]
xen/x86: add some cr3 helpers

Add some helper macros to access the address and pcid parts of cr3.

Use those helpers where appropriate.
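
Helpers of this kind could look like the sketch below.  The names are
hypothetical and may not match the ones introduced by the patch; the layout
is the architectural one with CR4.PCIDE set (PCID in bits 0-11, page table
base in bits 12-51, and bit 63 as the "no flush" hint on writes).

```c
#include <stdint.h>

#define CR3_PCID_MASK  UINT64_C(0x0000000000000fff)  /* bits 0..11 */
#define CR3_ADDR_MASK  UINT64_C(0x000ffffffffff000)  /* bits 12..51 */
#define CR3_NOFLUSH    (UINT64_C(1) << 63)

/* Physical address of the root page table held in a cr3 value. */
static inline uint64_t cr3_pa(uint64_t cr3)
{
    return cr3 & CR3_ADDR_MASK;
}

/* PCID held in a cr3 value. */
static inline unsigned int cr3_pcid(uint64_t cr3)
{
    return (unsigned int)(cr3 & CR3_PCID_MASK);
}

/* Combine a page-aligned root table address and a PCID into a cr3 value. */
static inline uint64_t make_cr3(uint64_t pa, unsigned int pcid)
{
    return (pa & CR3_ADDR_MASK) | (pcid & CR3_PCID_MASK);
}
```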

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: convert pv_guest_cr4_to_real_cr4() to a function
Juergen Gross [Thu, 26 Apr 2018 11:33:16 +0000 (13:33 +0200)]
xen/x86: convert pv_guest_cr4_to_real_cr4() to a function

pv_guest_cr4_to_real_cr4() is becoming more and more complex. Convert
it from a macro to an ordinary function.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: use flag byte for decision whether xen_cr3 is valid
Juergen Gross [Thu, 26 Apr 2018 11:33:15 +0000 (13:33 +0200)]
xen/x86: use flag byte for decision whether xen_cr3 is valid

Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: disable global pages for domains with XPTI active
Juergen Gross [Thu, 26 Apr 2018 11:33:14 +0000 (13:33 +0200)]
xen/x86: disable global pages for domains with XPTI active

Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).

This requires using switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: use invpcid for flushing the TLB
Juergen Gross [Thu, 26 Apr 2018 11:33:13 +0000 (13:33 +0200)]
xen/x86: use invpcid for flushing the TLB

If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being required for mtrr
loading, as this will be required later anyway.

Add a command line option "invpcid" for controlling the use of
INVPCID (default to true).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: support per-domain flag for xpti
Juergen Gross [Thu, 26 Apr 2018 11:33:12 +0000 (13:33 +0200)]
xen/x86: support per-domain flag for xpti

Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows modifying the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agoxen/x86: add a function for modifying cr3
Juergen Gross [Thu, 26 Apr 2018 11:33:11 +0000 (13:33 +0200)]
xen/x86: add a function for modifying cr3

Instead of having multiple places with more or less identical asm
statements just have one function doing a write to cr3.

As this function should be named write_cr3() rename the current
write_cr3() function to switch_cr3().

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agox86/xpti: avoid copying L4 page table contents when possible
Juergen Gross [Tue, 29 May 2018 09:27:32 +0000 (11:27 +0200)]
x86/xpti: avoid copying L4 page table contents when possible

For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table.  This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.
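
The flag-based approach can be sketched as follows.  This is a toy model
under stated assumptions: struct cpu_state and both functions are
hypothetical, the whole table is copied rather than just the guest-controlled
part, and the cross-cpu synchronization flag mentioned above is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define L4_ENTRIES 512

/* Hypothetical per-cpu state for the sketch. */
struct cpu_state {
    uint64_t root_pgt[L4_ENTRIES]; /* cpu local root page table */
    bool root_pgt_changed;         /* copy needed on next guest entry */
    unsigned long copies;          /* statistics, for illustration only */
};

/* Called when a new %cr3 is loaded or the guest L4 table is modified. */
static void note_l4_update(struct cpu_state *cpu)
{
    cpu->root_pgt_changed = true;
}

/* Called on the way back into the guest: copy only when flagged. */
static void enter_guest(struct cpu_state *cpu, const uint64_t *guest_l4)
{
    if ( !cpu->root_pgt_changed )
        return;

    memcpy(cpu->root_pgt, guest_l4, sizeof(cpu->root_pgt));
    cpu->root_pgt_changed = false;
    cpu->copies++;
}
```

Interrupts and hypercalls that never call note_l4_update() return to the
guest without paying for the copy.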

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years agox86: invpcid support
Wei Liu [Fri, 2 Mar 2018 16:23:38 +0000 (16:23 +0000)]
x86: invpcid support

Provide the functions needed for different modes. Add cpu_has_invpcid.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86: move invocations of hvm_flush_guest_tlbs()
Jan Beulich [Tue, 23 Jan 2018 09:43:39 +0000 (10:43 +0100)]
x86: move invocations of hvm_flush_guest_tlbs()

Their need is not tied to the actual flushing of TLBs, but the ticking
of the TLB clock. Make this more obvious by folding the two invocations
into a single one in pre_flush().

Also defer the latching of CR4 in write_cr3() until after pre_flush()
(and hence implicitly until after IRQs are off), making operation
sequence the same in both cases (eliminating the theoretical risk of
pre_flush() altering CR4). This then also improves register allocation,
as the compiler doesn't need to use a callee-saved register for "cr4"
anymore.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
6 years agox86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use
Andrew Cooper [Tue, 29 May 2018 09:08:58 +0000 (11:08 +0200)]
x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use

Almost all infrastructure is already in place.  Update the reserved bits
calculation in guest_wrmsr(), and offer SSBD to guests by default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cd53023df952cf0084be9ee3d15a90f8837049c2
master date: 2018-05-21 14:20:06 +0100

6 years agox86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass
Andrew Cooper [Tue, 29 May 2018 09:08:34 +0000 (11:08 +0200)]
x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass

To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:

 * A feature bit to indicate that Speculative Store Bypass Disable is
   supported.
 * A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
   in the pipeline.
 * A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
   indicating that the hardware is not susceptible to Speculative Store Bypass
   sidechannels.

For contemporary processors, this interface will be implemented via a
microcode update.
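
The three bits listed above can be sketched as constants plus a small
decision helper.  The bit positions are the ones published by Intel
(CPUID.7.0:EDX bit 31, MSR_SPEC_CTRL bit 2, MSR_ARCH_CAPABILITIES bit 4); the
macro and function names are hypothetical, and the policy of leaving SSBD off
when the hardware reports SSB_NO is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_7_0_EDX_SSBD  (UINT32_C(1) << 31) /* SSBD is supported */
#define SPEC_CTRL_SSBD      (UINT64_C(1) << 2)  /* disable memory disambiguation */
#define ARCH_CAPS_SSB_NO    (UINT64_C(1) << 4)  /* hardware not susceptible */

/*
 * Return the MSR_SPEC_CTRL value to use, given the current value and
 * whether the administrator asked for SSBD.
 */
static uint64_t spec_ctrl_with_ssbd(uint32_t cpuid_7_0_edx,
                                    uint64_t arch_caps,
                                    uint64_t spec_ctrl, bool want_ssbd)
{
    bool supported = cpuid_7_0_edx & CPUID_7_0_EDX_SSBD;
    bool immune = arch_caps & ARCH_CAPS_SSB_NO;

    if ( want_ssbd && supported && !immune )
        spec_ctrl |= SPEC_CTRL_SSBD;

    return spec_ctrl;
}
```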

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9df52a25e0e95a0b9971aa2fc26c5c6a5cbdf4ef
master date: 2018-05-21 14:20:06 +0100

6 years agox86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass
Andrew Cooper [Tue, 29 May 2018 09:08:10 +0000 (11:08 +0200)]
x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass

AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.

Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.

For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c0e338086f060eba31d37b83fbdb883928aa085
master date: 2018-05-21 14:20:06 +0100

6 years agox86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`
Andrew Cooper [Tue, 29 May 2018 09:07:29 +0000 (11:07 +0200)]
x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`

In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.

One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains.  In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit.  To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.
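
As a rough illustration of the `spec-ctrl=no-pv` behaviour described above, the
following is a minimal sketch of a comma-separated suboption parser.  It is not
Xen's parser: `struct sketch_opts` and `sketch_parse_spec_ctrl()` are
hypothetical names, and only the `no-pv` suboption named in this message is
handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result structure, for illustration only. */
struct sketch_opts {
    bool pv_protect;   /* speculative protections for PV guests  */
    bool hvm_protect;  /* speculative protections for HVM guests */
};

/*
 * Parse a comma-separated suboption string.  Returns 0 on success and -1 on
 * an unrecognised suboption.  Both guest types start out protected.
 */
static int sketch_parse_spec_ctrl(const char *s, struct sketch_opts *o)
{
    o->pv_protect = true;
    o->hvm_protect = true;

    while (*s) {
        size_t len = strcspn(s, ",");   /* length of the current token */

        if (len == 5 && strncmp(s, "no-pv", 5) == 0)
            o->pv_protect = false;      /* HVM protections stay intact */
        else if (len != 0)
            return -1;                  /* unknown suboption */

        s += len;
        if (*s == ',')
            s++;                        /* step over the separator */
    }

    return 0;
}
```

Passing "no-pv" to the sketch clears only the PV flag, which mirrors the
PV-dom0 plus HVM-domU deployment described above.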

For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings.  See the xen-command-line.markdown for
full details of the new options.

While changing the command line options, take the time to change how the data
is reported to the user.  The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.

Sample output from booting with `spec-ctrl=no-pv` looks like:

  (XEN) Speculative mitigation facilities:
  (XEN)   Hardware features: IBRS/IBPB STIBP IBPB
  (XEN)   Compiled-in support: INDIRECT_THUNK
  (XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS-, Other: IBPB
  (XEN)   Support for VMs: PV: None, HVM: MSR_SPEC_CTRL RSB
  (XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3352afc26c497d26ecb70527db3cb29daf7b1422
master date: 2018-05-16 12:19:10 +0100

6 years agox86/cpuid: Improvements to guest policies for speculative sidechannel features
Andrew Cooper [Tue, 29 May 2018 09:06:56 +0000 (11:06 +0200)]
x86/cpuid: Improvements to guest policies for speculative sidechannel features

If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised.  It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.

Another useful option in some usecases is to offer IBPB without IBRS.  When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS.  This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.

Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb06b308ec71b23f37a44f5e2351fe2cae0306e9
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value
Andrew Cooper [Tue, 29 May 2018 09:06:30 +0000 (11:06 +0200)]
x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value

With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value.  Explicitly set Xen's default during the BSP
and AP boot paths.

For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so.  Oracle report that this
speeds up boots of some hardware by 50s.

"when safe to do so" is based on whether we are virtualised.  A native boot
won't have any other code running in a position to mount an attack.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cb8c12020307b39a89273d7699e89000451987ab
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants
Andrew Cooper [Tue, 29 May 2018 09:00:29 +0000 (11:00 +0200)]
x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants

In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features is active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa9eb09d446a1279f5e861e6b84fa8675dabf148
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible
Andrew Cooper [Tue, 29 May 2018 09:00:04 +0000 (11:00 +0200)]
x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible

If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.
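
The idea can be sketched as follows.  This is an illustrative model rather than
the real spec_ctrl_{enter,exit}_idle(): `write_spec_ctrl()`, `msr_writes` and
the `sketch_*` helpers are hypothetical stand-ins, and the MSR write is
simulated with a plain variable.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the real MSR; illustration only. */
static unsigned int msr_writes;      /* counts simulated MSR writes */
static uint32_t current_spec_ctrl;   /* simulated MSR_SPEC_CTRL content */

static void write_spec_ctrl(uint32_t val)
{
    current_spec_ctrl = val;
    msr_writes++;
}

/* Entering idle: only relax the MSR if Xen runs with a non-zero value. */
static void sketch_enter_idle(uint32_t xen_val)
{
    if (xen_val != 0)
        write_spec_ctrl(0);
}

/* Leaving idle: only restore the MSR if there is a non-zero value to restore. */
static void sketch_exit_idle(uint32_t xen_val)
{
    if (xen_val != 0)
        write_spec_ctrl(xen_val);
}
```

With a Xen value of 0 the enter/exit pair performs no writes at all; with a
non-zero value it performs exactly two.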

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 94df6e8588e35cc2028ccb3fd2921c6e6360605e
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT
Andrew Cooper [Tue, 29 May 2018 08:59:39 +0000 (10:59 +0200)]
x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT

In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.

Take the opportunity to fix some stale documentation in spec_ctrl_asm.h.  The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching.  Note this in the comment.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d9822b8a38114e96e4516dc998f4055249364d5d
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together
Andrew Cooper [Tue, 29 May 2018 08:59:11 +0000 (10:59 +0200)]
x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together

Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and choose between IBRS or not by
doubling up the entire alternative block.

There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af949407eaba7af71067f23d5866cd0bf1f1144d
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags
Andrew Cooper [Tue, 29 May 2018 08:58:44 +0000 (10:58 +0200)]
x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags

All 3 bits of information here are control flags for the entry/exit code
behaviour.  Treat them as such, rather than having two different variables.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5262ba2e7799001402dfe139ff944e035dfff928
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable
Andrew Cooper [Tue, 29 May 2018 08:58:17 +0000 (10:58 +0200)]
x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable

At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed.  One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.

Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 66dfae0f32bfbc899c2f3446d5ee57068cb7f957
master date: 2018-05-16 12:19:10 +0100

6 years agox86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once
Andrew Cooper [Tue, 29 May 2018 08:57:41 +0000 (10:57 +0200)]
x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once

Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions.  Fix an RSBA typo while moving the
affected comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6c65187252a6c1810fd24c4d46f812840de8d3c
master date: 2018-05-16 12:19:10 +0100

6 years agoxpti: fix bug in double fault handling
Juergen Gross [Fri, 18 May 2018 11:32:05 +0000 (13:32 +0200)]
xpti: fix bug in double fault handling

When entering the hypervisor via the double fault handler, resetting
xen_cr3 was missing. This led to switching to pv_cr3 when returning
from the next exception, so repair this in order to allow
exception handling to work even after a double fault.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d80af845de7a4db01a4a3b4d779e0e0dcb5e738b
master date: 2018-04-23 16:13:01 +0200

6 years agox86/spec_ctrl: Updates to retpoline-safety decision making
Andrew Cooper [Fri, 18 May 2018 11:31:33 +0000 (13:31 +0200)]
x86/spec_ctrl: Updates to retpoline-safety decision making

All of this is as recommended by the Intel whitepaper:

https://software.intel.com/sites/default/files/managed/1d/46/Retpoline-A-Branch-Target-Injection-Mitigation.pdf

The 'RSB Alternative' bit in MSR_ARCH_CAPABILITIES may be set by a hypervisor
to indicate that the virtual machine may migrate to a processor which isn't
retpoline-safe.  Introduce a shortened name (to reduce code volume), treat it
as authoritative in retpoline_safe(), and print its value along with the other
ARCH_CAPS bits.

The exact processor models which do have RSB semantics which fall back to BTB
predictions are enumerated, and include Kabylake and Coffeelake.  Leave a
printk() in the default case to help identify cases which aren't covered.

The exact microcode versions for Broadwell RSB-safety are taken from the
referenced microcode update file (adjusting for the known-bad microcode
versions).  Despite the exact wording of the text, it is only Broadwell
processors which need a microcode check.

In practice, this means that all Broadwell hardware with up-to-date microcode
will use retpoline in preference to IBRS, which will be a performance
improvement for desktop and server systems which would previously always opt
for IBRS over retpoline.
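
The decision order described above can be sketched roughly as below.  This is
not the real retpoline_safe(): the `cpu_class` buckets, the function name and
the `min_safe_rev` parameter are hypothetical, no actual model numbers or
microcode revisions are encoded, and returning false for unrecognised models
is an assumption (the message only says a printk() is left in the default
case).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of processor models, for illustration only. */
enum cpu_class {
    CPU_PRE_BROADWELL,   /* older models: no microcode check needed     */
    CPU_BROADWELL,       /* safe only with new enough microcode         */
    CPU_RSB_FALLBACK,    /* RSB falls back to BTB predictions: not safe */
    CPU_UNKNOWN          /* not covered by the enumerated models        */
};

static bool sketch_retpoline_safe(bool rsba, enum cpu_class cls,
                                  uint32_t ucode_rev, uint32_t min_safe_rev)
{
    /* The RSB Alternative bit is authoritative: it overrides model checks. */
    if (rsba)
        return false;

    switch (cls) {
    case CPU_PRE_BROADWELL:
        return true;
    case CPU_BROADWELL:
        return ucode_rev >= min_safe_rev;   /* the only microcode check */
    case CPU_RSB_FALLBACK:
        return false;
    default:
        return false;   /* assumption: be conservative for unknown models */
    }
}
```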

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix typo in ARCH_CAPS decode

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 1232378bd2fef45f613db049b33852fdf84d7ddf
master date: 2018-04-19 17:28:23 +0100
master commit: 27170adb54a558e11defcd51989326a9beb95afe
master date: 2018-04-24 13:34:12 +0100

6 years agox86/msr: Correct the emulation behaviour of MSR_PRED_CMD
Andrew Cooper [Fri, 18 May 2018 11:31:01 +0000 (13:31 +0200)]
x86/msr: Correct the emulation behaviour of MSR_PRED_CMD

Experimentally, the behaviour of reserved bits in MSR_PRED_CMD changed between
beta and production microcode, and now raises a #GP fault for set reserved
bits.  The AMD spec for future hardware also specifies this behaviour, and it
is the more sensible behaviour to implement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/msr: further correct the emulation behaviour of MSR_PRED_CMD

Following commit a6aa678fa3 ("x86/msr: Correct the emulation behaviour
of MSR_PRED_CMD") we may end up writing the low bit with the wrong
value. While it's unlikely for a guest to want to write zero there, we
should still permit this (without incurring the overhead of an actual
barrier). Correcting this right away will also help whenever further
bits in the MSR might become defined.
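
The combined behaviour of the two changes can be sketched as follows.  This is
a simplified model, not the actual emulation code: `sketch_wrmsr_pred_cmd()`,
`issue_ibpb()` and `barriers_issued` are hypothetical names, and the barrier is
simulated with a counter.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_CMD_IBPB (1ULL << 0)   /* bit 0 of MSR_PRED_CMD */

static unsigned int barriers_issued;   /* hypothetical counter for the sketch */

static void issue_ibpb(void)
{
    barriers_issued++;   /* stands in for the real prediction barrier */
}

/* Returns true on success, false if the guest should receive #GP. */
static bool sketch_wrmsr_pred_cmd(uint64_t val)
{
    if (val & ~PRED_CMD_IBPB)
        return false;        /* reserved bits set: fault */

    if (val & PRED_CMD_IBPB)
        issue_ibpb();        /* only pay for the barrier when asked */

    return true;             /* a write of zero succeeds with no barrier */
}
```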

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a6aa678fa380e9369cc44701a181142322b3a4b0
master date: 2018-04-16 13:18:19 +0100
master commit: a996273d1fc10d14598985703227bfa35a91f681
master date: 2018-04-18 11:16:37 +0200

6 years agox86: suppress BTI mitigations around S3 suspend/resume
Jan Beulich [Fri, 18 May 2018 11:30:30 +0000 (13:30 +0200)]
x86: suppress BTI mitigations around S3 suspend/resume

NMI and #MC can occur at any time after S3 resume, yet the MSR_SPEC_CTRL
may become available only once we've reloaded microcode. Make
SPEC_CTRL_ENTRY_FROM_INTR_IST and DO_SPEC_CTRL_EXIT_TO_XEN no-ops for
the critical period of time.

Also set the MSR back to its intended value.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: Use spec_ctrl_{enter,exit}_idle() in the S3/S5 path

The main purpose of this patch is to avoid opencoding the recovery logic at
the end, but also has the positive side effect of relaxing the SPEC_CTRL
mitigations when working to shut the final CPU down.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 710a8ebf2bc111a34bba04d1c85b6d07ed3d9389
master date: 2018-04-16 14:09:55 +0200
master commit: ef3ab46493f650b7e5cca2b2578a99ca0cbff195
master date: 2018-04-19 10:55:59 +0100

6 years agox86: correct ordering of operations during S3 resume
Jan Beulich [Fri, 18 May 2018 11:30:05 +0000 (13:30 +0200)]
x86: correct ordering of operations during S3 resume

Microcode loading needs to happen before re-enabling interrupts, in case
only updated microcode allows the use of e.g. the SPEC_{CTRL,CMD} MSRs.
Otoh it doesn't need to happen at all when we didn't suspend in the
first place. It needs to happen before spin_debug_enable() though, as it
acquires a lock and hence would otherwise make
common/spinlock.c:check_lock() unhappy. As microcode loading can be
pretty verbose, also make sure it only runs after console_end_sync().

cpufreq_add_cpu() doesn't need calling on the only "goto enable_cpu"
path, which sits ahead of cpufreq_del_cpu().

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb2a4a449dfd50af309a333aa805835015fbc8c8
master date: 2018-04-16 14:08:30 +0200

6 years agox86/pv: Protect multicalls against Spectre v2 - Branch Target Injection
Andrew Cooper [Fri, 18 May 2018 11:26:15 +0000 (13:26 +0200)]
x86/pv: Protect multicalls against Spectre v2 - Branch Target Injection

This is a missing adjustment in c/s 88602190f69 "x86: Support indirect thunks
from assembly code".

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

6 years agox86/emul: Fix emulator test harness build following a backport of 7c508612
Andrew Cooper [Wed, 9 May 2018 17:06:46 +0000 (18:06 +0100)]
x86/emul: Fix emulator test harness build following a backport of 7c508612

The x86 emulator doesn't need to employ any Spectre v2 mitigations.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

6 years agox86/emul: Fix emulator test harness build following the backport of ff555d59e8a
Andrew Cooper [Wed, 9 May 2018 15:54:32 +0000 (16:54 +0100)]
x86/emul: Fix emulator test harness build following the backport of ff555d59e8a

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

6 years agox86/HVM: guard against emulator driving ioreq state in weird ways
Jan Beulich [Tue, 8 May 2018 17:28:20 +0000 (18:28 +0100)]
x86/HVM: guard against emulator driving ioreq state in weird ways

In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop.  This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.

Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
  exception of the transition to STATE_IOREQ_NONE).
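
The three points above can be sketched as a small model.  This is not the real
hvm_wait_for_io(): the function name is hypothetical, the `ST_*` constants are
local illustrative values, and successive reads of the shared p->state field
are modelled as entries in a trace array so that each loop iteration consumes
exactly one snapshot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative state values, ordered so that "forward" means "increasing". */
enum { ST_NONE, ST_READY, ST_INPROCESS, ST_RESP_READY };

/*
 * Walk a trace of state snapshots, one per simulated read of the shared
 * field.  Returns true if the request completes (or nothing is pending) and
 * false if a backwards or unknown transition is seen, or the trace runs out.
 */
static bool sketch_wait_for_io(const uint8_t *trace, size_t n)
{
    uint8_t prev = ST_NONE;

    for (size_t i = 0; i < n; i++) {
        uint8_t state = trace[i];   /* exactly one read per iteration */

        if (state == ST_NONE)
            return true;            /* the one permitted "backwards" move */

        if (state < prev || state > ST_RESP_READY)
            return false;           /* weird transition: bail out */

        if (state == ST_RESP_READY)
            return true;            /* response available, done */

        prev = state;               /* re-use this snapshot next time round */
    }

    return false;                   /* trace ended without completion */
}
```

Because the comparison always uses the saved snapshot rather than a second read
of the shared field, an emulator flipping the state back cannot keep the loop
spinning; the reversal is rejected instead.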

This is XSA-262.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

6 years agox86/vpt: add support for IO-APIC routed interrupts
Xen Project Security Team [Mon, 23 Apr 2018 15:56:47 +0000 (16:56 +0100)]
x86/vpt: add support for IO-APIC routed interrupts

And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.

To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.
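
To make the out-of-bounds problem concrete, here is a minimal sketch of the
range check implied by the paragraph above.  It is not the actual patch:
`enum irq_source`, `sketch_irq_in_range()` and the `NR_IOAPIC_PINS` value are
assumptions made for illustration; only the ISA range of 16 lines is taken as
given.

```c
#include <stdbool.h>

#define NR_ISA_IRQS    16   /* legacy ISA lines 0..15, delivered via the vPIC */
#define NR_IOAPIC_PINS 48   /* assumed number of vIO-APIC pins */

/* Hypothetical tag describing how an interrupt is routed. */
enum irq_source { SRC_ISA, SRC_IOAPIC };

/* Returns true if 'irq' is a valid index for the given routing type. */
static bool sketch_irq_in_range(enum irq_source src, unsigned int irq)
{
    switch (src) {
    case SRC_ISA:
        return irq < NR_ISA_IRQS;      /* 20..23 must be rejected here */
    case SRC_IOAPIC:
        return irq < NR_IOAPIC_PINS;   /* 20..23 are fine as IO-APIC pins */
    default:
        return false;
    }
}
```

Routing values 20..23 fail the ISA check but pass the IO-APIC one, which is why
treating HPET interrupts as ISA led to the out-of-bounds access.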

Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.

Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worst case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.

This is part of XSA-261.

Address actual and potential compiler warnings. Fix formatting.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>

6 years agox86/traps: Fix handling of #DB exceptions in hypervisor context
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Fix handling of #DB exceptions in hypervisor context

The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting.  Swap it out for a ratelimited printk with just
enough information to work out what is going on.

Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take.  We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

6 years agox86/traps: Use an Interrupt Stack Table for #DB
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/traps: Use an Interrupt Stack Table for #DB

PV guests can use architectural corner cases to cause #DB to be raised after
transitioning into supervisor mode.

Use an interrupt stack table for #DB to prevent the exception being taken with
a guest controlled stack pointer.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

6 years agox86/pv: Move exception injection into {,compat_}test_all_events()
Andrew Cooper [Tue, 8 May 2018 17:28:03 +0000 (18:28 +0100)]
x86/pv: Move exception injection into {,compat_}test_all_events()

This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.

The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.

This is part of XSA-260 / CVE-2018-8897.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>