xenbits.xensource.com Git - people/iwj/xen.git/log

x86/shutdown: use ACPI reboot method for Dell PowerEdge R540

When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.

BIOS Information
    Vendor: Dell Inc.
    Version: 1.3.7
    Release Date: 02/09/2018
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge R540

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/setup: properly update PTEs if src/dst overlaps when relocating Xen image

Commit 0d31d16 (x86/setup: do not relocate Xen over current Xen image
placement) disallowed src/dst images overlaps when relocating Xen image.
Though it deliberately allowed destination region between __image_base__
and (__image_base__ + XEN_IMG_OFFSET) overlaps with the end of source
image. And here is the problem. If anything between __page_tables_start
and __page_tables_end in source image lands in the overlap then some or
even all page table entries may not be updated. This usually means boom
in early boot which will be difficult to the investigate. So, I think
that we have three choices to fix the issue:
  - drop XEN_IMG_OFFSET from
    if ( (end > s) && (end - reloc_size + XEN_IMG_OFFSET >= __pa(_end)) )
  - add XEN_IMG_OFFSET to xen_phys_start in PFN_DOWN(xen_phys_start)
    used in loops as one of conditions and replace ">" with ">=",
  - change PFN_DOWN(xen_phys_start) to PFN_DOWN(xen_remap_end_pfn)
    proposed in earlier version of this patch.

This patch implements the second option. This way we still allow source
and destination partial overlap as described above but PTEs are properly
updated now.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

unmodified_drivers: unplug the emulated devices at resume time

Since qemu-2.10 it is required to unplug emulated devices again after
a live migration. If this is not done, qemu's block-backend driver
will be unable to open the backing disk image because it is still busy
by qemu's IDE driver. As a result the domUs block-frontend driver will
be unable to access the disks, and the domU has to be destroyed.
libxl is unable to detect the situation.

Apply the same workaround for this qemu bug that was done already
years ago in linux.git with commit 512b109ec962 ("xen: unplug the
emulated devices at resume time") to make sure xenlinux based domUs
can be migrated to unfixed hosts.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Jan Beulich <jbeulich@suse.com>

build: remove stray .*.d2 files during clean/distclean

Otherwise e.g. xen/..xen-syms.0.o.d2 and xen/..xen-syms.1.o.d2 files
stay untouched because they are not listed in DEPS_RM variable.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/cpuid: fix generation of auto cpuid header

The makefile rule to generate the cpuid-autogen.h header passes the
whole list of dependencies to gen-cpuid.py but only the first
dependency is actually needed.

So far this seems to be harmless.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

x86/idle: don't mix up ACPI and APIC IDs

Correct a log message and, to clarify code as well, rename the
respective function parameter too.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86: guard against #NM

Just in case we still don't get CR0.TS handling right, prevent a host
crash by honoring exception fixups in do_device_not_available(). This
would in particular cover emulator stubs raising #NM.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/HVM: don't cause #NM to be raised in Xen

The changes for XSA-267 did not touch management of CR0.TS for HVM
guests. In fully eager mode this bit should never be set when
respective vCPU-s are active, or else hvmemul_get_fpu() might leave it
wrongly set, leading to #NM in hypervisor context.

{svm,vmx}_enter() and {svm,vmx}_fpu_dirty_intercept() become unreachable
this way. Explicit {svm,vmx}_fpu_leave() invocations need to be guarded
now.

With no CR0.TS management necessary in fully eager mode, there's also no
need anymore to intercept #NM.

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

libxl: restore passing "readonly=" to qemu for SCSI disks

A read-only check was introduced for XSA-142, commit ef6cb76026 ("libxl:
relax readonly check introduced by XSA-142 fix") added the passing of
the extra setting, but commit dab0539568 ("Introduce COLO mode and
refactor relevant function") dropped the passing of the setting again,
quite likely due to improper re-basing.

Restore the readonly= parameter to SCSI disks. For IDE disks this is
supposed to be rejected; add an assert. And there is a bare ad-hoc
disk drive string in libxl__build_device_model_args_new, which we also
update.

This is XSA-266.

Reported-by: Andrew Reimers <andrew.reimers@orionvm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

libxl: qemu_disk_scsi_drive_string: Break out common parts of disk config

The generated configurations are identical apart from, in some cases,
reordering of the id=%s element. So, overall, no functional change.

This is part of XSA-266.

Reported-by: Andrew Reimers <andrew.reimers@orionvm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

x86: Refine checks in #DB handler for faulting conditions

One of the fix for XSA-260 (c/s 75d6828bc2 "x86/traps: Fix handling of #DB
exceptions in hypervisor context") added some safety checks to help avoid
livelocks of #DB faults.

While a General Detect #DB exception does have fault semantics, hardware
clears %dr7.gd on entry to the handler, meaning that it is actually safe to
return to.  Furthermore, %dr6.gd is guest controlled and sticky (never cleared
by hardware).  A malicious PV guest can therefore trigger the fatal_trap() and
crash Xen.

Instruction breakpoints are more tricky.  The breakpoint match bits in %dr6
are not sticky, but the Intel manual warns that they may be set for
non-enabled breakpoints, so add a breakpoint enabled check.

Beyond that, because of the restriction on the linear addresses PV guests can
set, and the fault (rather than trap) nature of instruction breakpoints
(i.e. can't be deferred by a MovSS shadow), there should be no way to
encounter an instruction breakpoint in Xen context.  However, for extra
robustness, deal with this situation by clearing the breakpoint configuration,
rather than crashing.

This is XSA-265

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/mm: don't bypass preemption checks

While unlikely, it is not impossible for a multi-vCPU guest to leverage
bypasses of preemption checks to drive Xen into an unbounded loop.

This is XSA-264.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/cpuid: Fix up stale comments

* There is no legacy path any more. All static information is retrieved in
the first pass.
* d->arch.cpuids[] doesn't exist any more.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/EFI: further correct FPU state handling around runtime calls

We must not leave a vCPU with CR0.TS clear when it is not in fully eager
mode and has not touched non-lazy state. Instead of adding a 3rd
invocation of stts() to vcpu_restore_fpu_eager(), consolidate all of
them into a single one done at the end of the function.

Rename the function at the same time to better reflect its purpose, as
the patches touches all of its occurences anyway.

The new function parameter is not really well named, but
"need_stts_if_not_fully_eager" seemed excessive to me.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

libxl: fix return code in qmp_synchronous_send

Use error code from libxl namespace, a plain -1 is not valid in this context.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>

stubdom/vtpm: fix memcmp in TPM_ChangeAuthAsymFinish

gcc8 spotted this error:
error: 'memcmp' reading 20 bytes from a region of size 8 [-Werror=stringop-overflow=]

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>

x86/dom0: add extra RAM regions as UNUSABLE for PVH memory map

When running as PVH Dom0 the native memory map is used in order to
craft a tailored memory map for Dom0 taking into account it's memory
limit.

Dom0 memory is always going to be smaller than the total amount
of memory present on the host, so in order to prevent Dom0 from
relocating PCI BARs over RAM regions mark all the RAM regions not
available to Dom0 as UNUSABLE in the memory map.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

x86/HVM: alter completion-needed checking

The function only looks at the ioreq_t, so pass it a pointer to just
that. Also use it in hvmemul_do_io().

Suggested-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>

x86/HVM: attempts to emulate FPU insns need to set fpu_initialised

My original way of thinking here was that this would be set anyway at
the point state gets reloaded after the adjustments hvmemul_put_fpu()
does, but the flag should already be set before that - after all the
guest may never again touch the FPU before e.g. getting migrated/saved.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Paul Durrant <paul.durrant@citrix.com>

configure: Rerun autogen.sh (on stretch)

This is just a version number update.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

README, Makefiles, Config.mk: Update for branching 4.11 vs 4.12-unstable

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

x86/EFI: fix FPU state handling around runtime calls

There are two issues. First, the nonlazy xstates were never restored
after returning from the runtime call.

Secondly, with the fully_eager_fpu mitigation for XSA-267 / LazyFPU, the
unilateral stts() is no longer correct, and hits an assertion later when
a lazy state restore tries to occur for a fully eager vcpu.

Fix both of these issues by calling vcpu_restore_fpu_eager(). As EFI
runtime services can be used in the idle context, the idle assertion
needs to move until after the fully_eager_fpu check.

Introduce a "curr" local variable and replace other uses of "current"
at the same time.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Juergen Gross <jgross@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

tools/libxc: retry hypercall in case of EFAULT

A hypercall issued via the privcmd driver can very rarely return
-EFAULT even if the hypercall buffers are locked in memory. This
happens for hypercall buffers in user memory when the Linux kernel
is doing memory scans e.g. for page migration or compaction.

Retry the getpageframeinfo3 hypercall up to 2 times in case
-EFAULT is returned and the hypervisor might see invalid PTEs for
user hypercall buffers (which should be the case only if the kernel
doesn't offer a /dev/xen/hypercall node).

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

tools/libxencalls: add new function to query hypercall buffer safety

Add a new function to query whether hypercall buffers are always safe
to access by the hypervisor or might result in EFAULT.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

tools/libxencall: use hypercall buffer device if available

Instead of using anonymous memory for hypercall buffers which is then
locked into memory, use the hypercall buffer device of the Linux
privcmd driver if available.

This has the advantage of needing just a single mmap() for allocating
the buffer and page migration or compaction can't make the buffer
unaccessible for the hypervisor.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

x86/HVM: account for fully eager FPU mode in emulation

In fully eager mode we must not clear fpu_dirtied, set CR0.TS, or invoke
the fpu_leave() hook. Instead do what the mode's name says: Restore
state right away.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec-ctrl: Mitigations for LazyFPU

Intel Core processors since at least Nehalem speculate past #NM, which is the
mechanism by which lazy FPU context switching is implemented.

On affected processors, Xen must use fully eager FPU context switching to
prevent guests from being able to read FPU state (SSE/AVX/etc) from previously
scheduled vcpus.

This is part of XSA-267 / CVE-2018-3665

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: Support fully eager FPU context switching

This is controlled on a per-vcpu bases for flexibility.

This is part of XSA-267 / CVE-2018-3665

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

scripts/add_maintainers.pl: Don't call get_maintainers.pl with -f

The option -f of scripts/get_maintainers.pl will return the maintainers
of a given file, *not* the list of maintainers if the file was a patch.

The output expected of add_maintainers is the latter, so drop the option
-f.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

xen/sndif: Change stream's unique-id to string

Display and input protocols define "unique-id" XenBus field as string
which is much more flexible in defining unique identifiers comparing
to integer used by sound protocol. For example, this allows to provide
UUIDs as unique ID's. Align sound protocol with display and input
and redefine "unique-id" field as string.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/displif: Add unique display connector identifier

If frontend is configured to expose multiple connectors then backend may
require a way to uniquely identify concrete virtual connector within the
frontend. This is useful for use-cases where connector needs to be
matched to physical display connector.
Add XenBus "unique-id" node parameter, so this sort of use-cases can
be implemented.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/kbdif: Add unique input device identifier

If frontend is configured to expose multiple input device instances
then backend may require a way to uniquely identify concrete input
device within the frontend. This is useful for use-cases where
virtual input device needs to be matched to physical input device.
Add XenBus "unique-id" node parameter, so this sort of use-cases can
be implemented.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/kbdif: Move multi-touch device parameters to backend nodes

In current kbdif protocol definition multi-touch device parameters
are described as a part of frontend's XenBus configuration nodes while
they belong to backend's configuration. Fix this by moving
the parameters to the proper section.

Fixes: b7a3ce49d528 ("xen/kbdif: add multi-touch support")
Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reported-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Reviewed-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

x86/VT-x: Fix printing of EFER in vmcs_dump_vcpu()

This is essentially a "take 2" of c/s 82540b66ce "x86/VT-x: Fix determination
of EFER.LMA in vmcs_dump_vcpu()" because in hindight, that change was more
problematic than useful.

The original reason was to fix the logic for determining when not to print the
PDPTE pointers. However, mutating the efer variable (particularly LME and
LMA) before printing it interferes with diagnosing vmentry failures.

Instead of modifying efer, change the PDPTE conditional to use
VM_ENTRY_IA32E_MODE.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

ocaml/xenstored: reduce use of unsafe conversions

The rationalisation of the Xs_ring interface in the xb library
allows to further reduce the unsafe calls withouth introducing
copies. This patch also contains some further code cleanups.

Signed-off-by: Marcello Seri <marcello.seri@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

ocaml/libs/xb: Use bytes in place of strings for mutable buffers

Since Ocaml 4.06.0, that made safe-string on by default, the compiler is
allowed to perform optimisations on immutable strings. They should no
longer be used as mutable buffers, and bytes should be used instead.

The C stubs for Xs_ring have been updated to use bytes, and the interface
rationalised mimicking the new Unix module in the standard library (the
implementation of Unix.write_substring uses unsafe_of_string in the exact same
way, and both the write implementations are using the bytes as an immutable
payload for the write).

Signed-off-by: Marcello Seri <marcello.seri@citrix.com>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/traps: Fix error handling of the pv %dr7 shadow state

c/s "x86/pv: Introduce and use x86emul_write_dr()" fixed a bug with IO shadow
handling, in that it remained stale and visible until %dr7.L/G got set again.

However, it neglected the -EPERM return inbetween these two hunks, introducing
a different bug in which a write to %dr7 which tries to set IO breakpoints
without %cr4.DE being set clobbers the IO state, rather than leaves it alone.

Instead, move the zeroing slightly later, which guarentees that the shadow
gets written exactly once, on a successful update to %dr7.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/CPUID: don't override tool stack decision to hide STIBP

Other than in the feature sets, where we indeed want to offer the
feature even if not enumerated on hardware, we shouldn't dictate the
feature being available if tool stack or host admin have decided to not
expose it (for whatever [questionable?] reason). That feature set side
override is sufficient to achieve the intended guest side safety
property (in offering - by default - STIBP independent of actual
availability in hardware).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: correct default_xen_spec_ctrl calculation

Even with opt_msr_sc_{pv,hvm} both false we should set up the variable
as usual, to ensure proper one-time setup during boot and CPU bringup.
This then also brings the code in line with the comment immediately
ahead of the printk() being modified saying "irrespective of guests".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: suppress sync when XPTI is disabled for a domain

Now that we have a per-domain flag we can and should control sync-ing in
a more fine grained manner: Only domains having XPTI enabled need the
sync to occur.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libxc/x86/PV: don't hand through CPUID leaf 0x80000008 as is

Just like for HVM the feature set should be used for EBX output, while
EAX should be restricted to the low 16 bits and ECX/EDX should be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

tools/kdd: alternative way of muting spurious gcc warning

Older gcc does not support #pragma GCC diagnostics, so use alternative
approach - change variable type to uint32_t (this code handle 32-bit
requests only anyway), which apparently also avoid gcc complaining about
this (otherwise correct) code.

Fixes 437e00fea04becc91c1b6bc1c0baa636b067a5cc "tools/kdd: mute spurious
gcc warning"

Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Tim Deegan <tim@xen.org>

docs/process/xen-release-management: Lesson to learn

The 4.10 release preparation was significantly more hairy than ideal.
(We seem to have a good overall outcome despite, rather than because
of, our approach.)

This is the second time (at least) that we have come close to failure
by committing to a release date before the exact code to be released
is known and has been made and tested.

Evidently our docs makes it insufficiently clear not to do that.

CC: Julien Grall <julien.grall@arm.com>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

docs/process: Add RUBRIC

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>

x86/traps: Dump the instruction stream even for double faults

This helps debug #DF's which occur in alternative patches

Reported-by: George Dunlap <george.dunlap@eu.citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/XPTI: fix S3 resume (and CPU offlining in general)

We should index an L1 table with an L1 index.

Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/HVM: correct mtrr_pat_not_equal()

The two vCPU-s differing in MTRR-enabled state means MTRR settings are
not equal. Both vCPU-s having MTRRs disabled means only PAT needs to be
compared. Along those lines for fixed range MTRRs. Differing variable
range counts likewise mean settings are different overall (even if
that's not a very reasonable setup to have).

Constify types and convert bool_t to bool.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86: correct vCPU dirty CPU handling

Commit df8234fd2c ("replace vCPU's dirty CPU mask by numeric ID") was
too lax in two respects: First of all it didn't consider the case of a
vCPU not having a valid dirty CPU in the descriptor table TLB flush
case. This is the issue Manual has run into with NetBSD.

Additionally reads of ->dirty_cpu for other than the current vCPU are at
risk of racing with scheduler actions, i.e. single atomic reads need to
be used there. Obviously the non-init write sites then better also use
atomic writes.

Having to touch the descriptor table TLB flush code here anyway, take
the opportunity and switch it to be at most one flush_tlb_mask()
invocation.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec-ctrl: Rename ARCH_CAPS.SSBD_NO to SSB_NO

A last-minute rename of the feature occured, and the patch committed to
staging was unfortunately stale. Correct it.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for guests to use

Almost all infrastructure is already in place. Update the reserved bits
calculation in guest_wrmsr(), and offer SSBD to guests by default.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/Intel: Mitigations for GPZ SP4 - Speculative Store Bypass

To combat GPZ SP4 "Speculative Store Bypass", Intel have extended their
speculative sidechannel mitigations specification as follows:

* A feature bit to indicate that Speculative Store Bypass Disable is
   supported.
* A new bit in MSR_SPEC_CTRL which, when set, disables memory disambiguation
   in the pipeline.
* A new bit in MSR_ARCH_CAPABILITIES, which will be set in future hardware,
   indicating that the hardware is not susceptible to Speculative Store Bypass
   sidechannels.

For contemporary processors, this interface will be implemented via a
microcode update.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/AMD: Mitigations for GPZ SP4 - Speculative Store Bypass

AMD processors will execute loads and stores with the same base register in
program order, which is typically how a compiler emits code.

Therefore, by default no mitigating actions are taken, despite there being
corner cases which are vulnerable to the issue.

For performance testing, or for users with particularly sensitive workloads,
the `spec-ctrl=ssbd` command line option is available to force Xen to disable
Memory Disambiguation on applicable hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

doc: correct livepatch.markdown syntax

"make -C docs all" fails due to incorrect markdown syntax in
livepatch.markdown. Correct it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Misc fixes:
* Insert real URLs
* Drop trailing whitespace
* Consistent alignment and indentation for code blocks and lists
* Consistent capitalisation
* Consistent use of `` blocks for command line arguments and function names
* Rearrange things not to leave < and > in the text

No change in content. The document now reads rather more consistently in HTML
and PDF form.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>

xl: show full value of cpu_khz in xl info output

The exact value of cpu_khz can only be obtained via 'xl dmesg', and
therefore can be lost after some time. 'xl info' truncates the value to
full MHz. Adjust the output to show the full khz value.
This helps the host admin to track how a host has calibrated itself. The
value of cpu_khz is used during live migration for the decision if
access to TSC should be emualted.

Commit eb5277a30e ("bitkeeper revision 1.959.1.4
(40d04a87acOb29u-5Y5OxMhHvP2x9g)" gives no hint why cpu_mhz instead of
cpu_khz was chosen.

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

Config.mk: Update QEMU to include build fixes

This tag includes two build fixes:
- dump: Fix build with newer gcc
Fix build with GCC-8
- Fix libusb-1.0.22 deprecated libusb_set_debug with libusb_set_option

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>

xen/kbdif: Add features to disable keyboard and pointer

It is now not fully possible to control if and which virtual devices
are created by the frontend, e.g. keyboard and pointer devices
are always created and multi-touch device is created if the
backend advertises multi-touch support. In some cases this
behavior is not desirable and better control over the frontend's
configuration is required.

Add new XenStore feature fields, so it is possible to individually
control set of exposed virtual devices for each guest OS:
- set feature-disable-keyboard to 1 if no keyboard device needs
to be created
- set feature-disable-pointer to 1 if no pointer device needs
to be created

Keep old behavior by default.

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

scripts/add_maintainers.pl: New script

This provides a much better workflow when using git format-patch and
git send-email, with get_maintainer.pl.

The tool covers step 2 of the following workflow

  Step 1: git format-patch ... -o <patchdir> ...
  Step 2: ./scripts/add_maintainers.pl -d <patchdir>
          This overwrites  *.patch files in <patchdir>
  Step 3: git send-email -to xen-devel@lists.xenproject.org <patchdir>/*.patchxm

I manually tested all options and the most common combinations
on Mac.

Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Julien Grall <julien.grall@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Lars Kurth <lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <lars.kurth@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>

vpci/msi: fix unbind loop

The current unbind loop on failure in vpci_msi_enable is wrong and
will only work correctly if the initial pirq is 0. Fix this by adding
a proper bound.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Introduce a new `spec-ctrl=` command line argument to replace `bti=`

In hindsight, the options for `bti=` aren't as flexible or useful as expected
(including several options which don't appear to behave as intended).
Changing the behaviour of an existing option is problematic for compatibility,
so introduce a new `spec-ctrl=` in the hopes that we can do better.

One common way of deploying Xen is with a single PV dom0 and all domUs being
HVM domains.  In such a setup, an administrator who has weighed up the risks
may wish to forgo protection against malicious PV domains, to reduce the
overall performance hit.  To cater for this usecase, `spec-ctrl=no-pv` will
disable all speculative protection for PV domains, while leaving all
speculative protection for HVM domains intact.

For coding clarity as much as anything else, the suboptions are grouped by
logical area; those which affect the alternatives blocks, and those which
affect Xen's in-hypervisor settings.  See the xen-command-line.markdown for
full details of the new options.

While changing the command line options, take the time to change how the data
is reported to the user.  The three DEBUG printks are upgraded to unilateral,
as they are all relevant pieces of information, and the old "mitigations:"
line is split in the two logical areas described above.

Sample output from booting with `spec-ctrl=no-pv` looks like:

  (XEN) Speculative mitigation facilities:
  (XEN)   Hardware features: IBRS/IBPB STIBP IBPB
  (XEN)   Compiled-in support: INDIRECT_THUNK
  (XEN)   Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS-, Other: IBPB
  (XEN)   Support for VMs: PV: None, HVM: MSR_SPEC_CTRL RSB
  (XEN)   XPTI (64-bit PV only): Dom0 enabled, DomU enabled

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/cpuid: Improvements to guest policies for speculative sidechannel features

If Xen isn't virtualising MSR_SPEC_CTRL for guests, IBRSB shouldn't be
advertised.  It is not currently possible to express this via the existing
command line options, but such an ability will be introduced.

Another useful option in some usecases is to offer IBPB without IBRS.  When a
guest kernel is known to be compatible (uses retpoline and knows about the AMD
IBPB feature bit), an administrator with pre-Skylake hardware may wish to hide
IBRS.  This allows the VM to have full protection, without Xen or the VM
needing to touch MSR_SPEC_CTRL, which can reduce the overhead of Spectre
mitigations.

Break the logic common to both PV and HVM CPUID calculations into a common
helper, to avoid duplication.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Explicitly set Xen's default MSR_SPEC_CTRL value

With the impending ability to disable MSR_SPEC_CTRL handling on a
per-guest-type basis, the first exit-from-guest may not have the side effect
of loading Xen's choice of value.  Explicitly set Xen's default during the BSP
and AP boot paths.

For the BSP however, delay setting a non-zero MSR_SPEC_CTRL default until
after dom0 has been constructed when safe to do so.  Oracle report that this
speeds up boots of some hardware by 50s.

"when safe to do so" is based on whether we are virtualised.  A native boot
won't have any other code running in a position to mount an attack.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Split X86_FEATURE_SC_MSR into PV and HVM variants

In order to separately control whether MSR_SPEC_CTRL is virtualised for PV and
HVM guests, split the feature used to control runtime alternatives into two.
Xen will use MSR_SPEC_CTRL itself if either of these features are active.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Elide MSR_SPEC_CTRL handling in idle context when possible

If Xen is virtualising MSR_SPEC_CTRL handling for guests, but using 0 as its
own MSR_SPEC_CTRL value, spec_ctrl_{enter,exit}_idle() need not write to the
MSR.

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Rename bits of infrastructure to avoid NATIVE and VMEXIT

In hindsight, using NATIVE and VMEXIT as naming terminology was not clever.
A future change wants to split SPEC_CTRL_EXIT_TO_GUEST into PV and HVM
specific implementations, and using VMEXIT as a term is completely wrong.

Take the opportunity to fix some stale documentation in spec_ctrl_asm.h. The
IST helpers were missing from the large comment block, and since
SPEC_CTRL_ENTRY_FROM_INTR_IST was introduced, we've gained a new piece of
functionality which currently depends on the fine grain control, which exists
in lieu of livepatching. Note this in the comment.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Fold the XEN_IBRS_{SET,CLEAR} ALTERNATIVES together

Currently, the SPEC_CTRL_{ENTRY,EXIT}_* macros encode Xen's choice of
MSR_SPEC_CTRL as an immediate constant, and chooses between IBRS or not by
doubling up the entire alternative block.

There is now a variable holding Xen's choice of value, so use that and
simplify the alternatives.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Merge bti_ist_info and use_shadow_spec_ctrl into spec_ctrl_flags

All 3 bits of information here are control flags for the entry/exit code
behaviour. Treat them as such, rather than having two different variables.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Express Xen's choice of MSR_SPEC_CTRL value as a variable

At the moment, we have two different encodings of Xen's MSR_SPEC_CTRL value,
which is a side effect of how the Spectre series developed. One encoding is
via an alias with the bottom bit of bti_ist_info, and can encode IBRS or not,
but not other configurations such as STIBP.

Break Xen's value out into a separate variable (in the top of stack block for
XPTI reasons) and use this instead of bti_ist_info in the IST path.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/spec_ctrl: Read MSR_ARCH_CAPABILITIES only once

Make it available from the beginning of init_speculation_mitigations(), and
pass it into appropriate functions. Fix an RSBA typo while moving the
affected comment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

tools/ocaml/libs/xc fix gcc-8 format-truncation warning

CC       xenctrl_stubs.o
xenctrl_stubs.c: In function 'failwith_xc':
xenctrl_stubs.c:65:17: error: 'snprintf' output may be truncated before the last format character [-Werror=format-truncation=]
      "%d: %s: %s", error->code,
                 ^
xenctrl_stubs.c:64:4: note: 'snprintf' output 6 or more bytes (assuming 1029) into a destination of size 1028
    snprintf(error_str, sizeof(error_str),
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      "%d: %s: %s", error->code,
      ~~~~~~~~~~~~~~~~~~~~~~~~~~
      xc_error_code_to_desc(error->code),
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      error->message);
      ~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
make[8]: *** [/build/xen-git/src/xen/tools/ocaml/libs/xc/../../Makefile.rules:37: xenctrl_stubs.o] Error 1
m

Signed-off-by: John Thomson <git@johnthomson.fastmail.com.au>
Acked-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/kbdif: Add string constants for raw pointer

Add missing string constants for {feature|request}-raw-pointer
to align with the rest of the interface file.

Fixes 7868654ff7fe ("kbdif: Define "feature-raw-pointer" and "request-raw-pointer")

Signed-off-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/parse-support-md: Correctly process caveats in multi-status sections

When SUPPORT.md uses the syntax
Status, <some thing>: <support status>
the caveats were lost (not footnoted) because they were attached
only to <some thing>.

Caveats occur in running text, so they are necessarily part of a real
section, not an individual status line like that. So attach them to
the RealSectNode, and look there for them.

Reported-by: Lars Kurth <lars.kurth@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/parse-support-md: Provide $sectnode->{RealSectNode}

No functional change yet.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs/parse-support-md: Rename RealSect to RealInSect

This makes the distinction between insections and sectnodes clearer.

No functional change.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Lars Kurth <Lars.kurth@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

viridian: fix cpuid leaf 0x40000003

The response to viridian leaf 3 needs to split a 64-bit mask across EAX and
EBX, with the low order 32 bits in EAX and the high order 32 bits in EBX.
To facilitate this a union of two uint32_t values and the mask (type
HV_PARTITION_PRIVILEGE_MASK) is allocated on stack as follows:

union {
    HV_PARTITION_PRIVILEGE_MASK mask;
    uint32_t lo, hi;
} u;

This, of course, is incorrect as both lo and hi will alias the low order
32 bits of the mask.

This patch wraps lo and hi in an anonmymous struct to achieve the desired
effect.

NOTE: Fixing this also stops Windows making the HvGetPartitionId hypercall
      which was previously considered erroneous behaviour. Thus the
      hypercall handler is also modified to stop squashing the
      'unimplemented' warning for this hypercall.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

libacpi: fixes for iasl >= 20180427

New versions of iasl have introduced improved C file generation, as
reported in the changelog:

iASL: Enhanced the -tc option (which creates an AML hex file in C,
suitable for import into a firmware project):
1) Create a unique name for the table, to simplify use of multiple
SSDTs.
2) Add a protection #ifdef in the file, similar to a .h header file.

The net effect of that on generated files is:

-unsigned char AmlCode[] =
+#ifndef __SSDT_S4_HEX__
+#define __SSDT_S4_HEX__
+
+unsigned char ssdt_s4_aml_code[] =

The above example is from ssdt_s4.asl.

Fix the build with newer versions of iasl by stripping the '_aml_code'
suffix from the variable name on generated files.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

x86/HVM: guard against emulator driving ioreq state in weird ways

In the case where hvm_wait_for_io() calls wait_on_xen_event_channel(),
p->state ends up being read twice in succession: once to determine that
state != p->state, and then again at the top of the loop. This gives a
compromised emulator a chance to change the state back between the two
reads, potentially keeping Xen in a loop indefinitely.

Instead:
* Read p->state once in each of the wait_on_xen_event_channel() tests,
* re-use that value the next time around,
* and insist that the states continue to transition "forward" (with the
exception of the transition to STATE_IOREQ_NONE).

This is XSA-262.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

x86/vpt: add support for IO-APIC routed interrupts

And modify the HPET code to make use of it. Currently HPET interrupts
are always treated as ISA and thus injected through the vPIC. This is
wrong because HPET interrupts when not in legacy mode should be
injected from the IO-APIC.

To make things worse, the supported interrupt routing values are set
to [20..23], which clearly falls outside of the ISA range, thus
leading to an ASSERT in debug builds or memory corruption in non-debug
builds because the interrupt injection code will write out of the
bounds of the arch.hvm_domain.vpic array.

Since the HPET interrupt source can change between ISA and IO-APIC
always destroy the timer before changing the mode, or else Xen risks
changing it while the timer is active.

Note that vpt interrupt injection is racy in the sense that the
vIO-APIC RTE entry can be written by the guest in between the call to
pt_irq_masked and hvm_ioapic_assert, or the call to pt_update_irq and
pt_intr_post. Those are not deemed to be security issues, but rather
quirks of the current implementation. In the worse case the guest
might lose interrupts or get multiple interrupt vectors injected for
the same timer source.

This is part of XSA-261.

Address actual and potential compiler warnings. Fix formatting.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/traps: Fix handling of #DB exceptions in hypervisor context

The WARN_ON() can be triggered by guest activities, and emits a full stack
trace without rate limiting. Swap it out for a ratelimited printk with just
enough information to work out what is going on.

Not all #DB exceptions are traps, so blindly continuing is not a safe action
to take. We don't let PV guests select these settings in the real %dr7 to
begin with, but for added safety against unexpected situations, detect the
fault cases and crash in an obvious manner.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/traps: Use an Interrupt Stack Table for #DB

PV guests can use architectural corner cases to cause #DB to be raised after
transitioning into supervisor mode.

Use an interrupt stack table for #DB to prevent the exception being taken with
a guest controlled stack pointer.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/pv: Move exception injection into {,compat_}test_all_events()

This allows paths to jump straight to {,compat_}test_all_events() and have
injection of pending exceptions happen automatically, rather than requiring
all calling paths to handle exceptions themselves.

The normal exception path is simplified as a result, and
compat_post_handle_exception() is removed entirely.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/traps: Fix %dr6 handing in #DB handler

Most bits in %dr6 accumulate, rather than being set directly based on the
current source of #DB. Have the handler follow the manuals guidance, which
avoids leaking hypervisor debugging activities into guest context.

This is part of XSA-260 / CVE-2018-8897

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/domain: Drop the only-written smap_check_policy infrastructure

c/s 4c5d78a10d "x86/pagewalk: Re-implement the pagetable walker" dropped the
consumer of smap_policy. Looking at c/s 31ae587e6f which introduced the
smap_check logic, it exists only to work around a bug in guest_walk_tables()
was resolved by the aformentioned commit.

Remove the unused variables and associated infrastructure.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

doc: add credit2_cap_period_ms boot parameter description

credit2_cap_period_ms isn't mentioned in xen-command-line.markdown.
Add a description.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

doc: add architecture qualifier to boot parameter entries

Many of the architecture specific boot parameters are not qualified
as such. Correct that. Reorder PKU to be alphabetical.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/pv: Hide more EFER bits from PV guests

We don't advertise SVM in CPUID so a PV guest shouldn't be under the
impression that it can use SVM functionality, but despite this, it really
shouldn't see SVME set when reading EFER.

On Intel processors, 32bit PV guests don't see, and can't use SYSCALL.

Introduce EFER_KNOWN_MASK to whitelist the features Xen knows about, and use
this to clamp the guests view.

Take the opportunity to reuse the mask to simplify svm_vmcb_isvalid(), and
change "undefined" to "unknown" in the print message, as there is at least
EFER.TCE (Translation Cache Extension) defined but unknown to Xen.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

SVM: introduce a VM entry helper

Neither the register values copying nor the trace entry generation need
doing in assembly. The VMLOAD invocation can also be further deferred
(and centralized). Therefore replace the svm_asid_handle_vmrun()
invocation with one of the new helper.

Similarly move the VM exit side register value copying into
svm_vmexit_handler().

Now that we always make it out to guest context after VMLOAD,
svm_sync_vmcb() no longer overrides vmcb_needs_vmsave, making
svm_vmexit_handler() setting the field early unnecessary.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

SVM: re-work VMCB sync-ing

While the main problem to be addressed here is the issue of what so far
was named "vmcb_in_sync" starting out with the wrong value (should have
been true instead of false, to prevent performing a VMSAVE without ever
having VMLOADed the vCPU's state), go a step further and make the
sync-ed state a tristate: CPU and memory may be in sync or an update
may be required in either direction. Rename the field and introduce an
enum. Callers of svm_sync_vmcb() now indicate the intended new state
(with a slight "anomaly" when requesting VMLOAD: we could store
vmcb_needs_vmsave in those cases as the callers request, but the VMCB
really is in sync at that point, and hence there's no need to VMSAVE in
case we don't make it out to guest context), and all syncing goes
through that function.

With that, there's no need to VMLOAD the state perhaps multiple times;
all that's needed is loading it once before VM entry.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

docs: fix xpti command line option doc

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/x86: use PCID feature

Avoid flushing the complete TLB when switching %cr3 for mitigation of
Meltdown by using the PCID feature if available.

We are using 4 PCID values for a 64 bit pv domain subject to XPTI and
2 values for the non-XPTI case:

- guest active and in kernel mode
- guest active and in user mode
- hypervisor active and guest in user mode (XPTI only)
- hypervisor active and guest in kernel mode (XPTI only)

We use PCID only if PCID _and_ INVPCID are supported. With PCID in use
we disable global pages in cr4. A command line parameter controls in
which cases PCID is being used.

As the non-XPTI case has shown not to perform better with PCID at least
on some machines the default is to use PCID only for domains subject to
XPTI.

With PCID enabled we always disable global pages. This avoids having to
either flush the complete TLB or do a cycle through all PCID values
when invalidating a single global page.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: add some cr3 helpers

Add some helper macros to access the address and pcid parts of cr3.

Use those helpers where appropriate.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: convert pv_guest_cr4_to_real_cr4() to a function

pv_guest_cr4_to_real_cr4() is becoming more and more complex. Convert
it from a macro to an ordinary function.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: use flag byte for decision whether xen_cr3 is valid

Today cpu_info->xen_cr3 is either 0 to indicate %cr3 doesn't need to
be switched on entry to Xen, or negative for keeping the value while
indicating not to restore %cr3, or positive in case %cr3 is to be
restored.

Switch to use a flag byte instead of a negative xen_cr3 value in order
to allow %cr3 values with the high bit set in case we want to keep TLB
entries when using the PCID feature.

This reduces the number of branches in interrupt handling and results
in better performance (e.g. parallel make of the Xen hypervisor on my
system was using about 3% less system time).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: disable global pages for domains with XPTI active

Instead of flushing the TLB from global pages when switching address
spaces with XPTI being active just disable global pages via %cr4
completely when a domain subject to XPTI is active. This avoids the
need for extra TLB flushes as loading %cr3 will remove all TLB
entries.

In order to avoid states with cr3/cr4 having inconsistent values
(e.g. global pages being activated while cr3 already specifies a XPTI
address space) move loading of the new cr4 value to write_ptbase()
(actually to switch_cr3_cr4() called by write_ptbase()).

This requires to use switch_cr3_cr4() instead of write_ptbase() when
building dom0 in order to avoid setting cr4 with cr4.smap set.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: use invpcid for flushing the TLB

If possible use the INVPCID instruction for flushing the TLB instead of
toggling cr4.pge for that purpose.

While at it remove the dependency on cr4.pge being required for mtrr
loading, as this will be required later anyway.

Add a command line option "invpcid" for controlling the use of
INVPCID (default to true).

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: support per-domain flag for xpti

Instead of switching XPTI globally on or off add a per-domain flag for
that purpose. This allows to modify the xpti boot parameter to support
running dom0 without Meltdown mitigations. Using "xpti=no-dom0" as boot
parameter will achieve that.

Move the xpti boot parameter handling to xen/arch/x86/pv/domain.c as
it is pv-domain specific.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

xen/x86: add a function for modifying cr3

Instead of having multiple places with more or less identical asm
statements just have one function doing a write to cr3.

As this function should be named write_cr3() rename the current
write_cr3() function to switch_cr3().

Suggested-by: Andrew Copper <andrew.cooper3@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86/xpti: avoid copying L4 page table contents when possible

For mitigation of Meltdown the current L4 page table is copied to the
cpu local root page table each time a 64 bit pv guest is entered.

Copying can be avoided in cases where the guest L4 page table hasn't
been modified while running the hypervisor, e.g. when handling
interrupts or any hypercall not modifying the L4 page table or %cr3.

So add a per-cpu flag indicating whether the copying should be
performed and set that flag only when loading a new %cr3 or modifying
the L4 page table. This includes synchronization of the cpu local
root page table with other cpus, so add a special synchronization flag
for that case.

A simple performance check (compiling the hypervisor via "make -j 4")
in dom0 with 4 vcpus shows a significant improvement:

- real time drops from 112 seconds to 103 seconds
- system time drops from 142 seconds to 131 seconds

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

x86: fix return value checks of set_guest_{machinecheck,nmi}_trapbounce

Commit 0142064421 ("x86/traps: move set_guest_{machine,nmi}_trapbounce")
converted the functions' return types from int to bool without also
correcting the checks in assembly code: The ABI does not guarantee sub-
32-bit return values to be promoted to 32 bits.

Take the liberty and also adjust the number of spaces used in the compat
code, such that both code sequences end up similar (they already are in
the non-compat case).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen/schedule: Fix races in vcpu migration

The current sequence to initiate vcpu migration is inefficent and error-prone:

- The initiator sets VPF_migraging with the lock held, then drops the
  lock and calls vcpu_sleep_nosync(), which immediately grabs the lock
  again

- A number of places unnecessarily check for v->pause_flags in between
  those two

- Every call to vcpu_migrate() must be prefaced with
  vcpu_sleep_nosync() or introduce a race condition; this code
  duplication is error-prone

- In the event that v->is_running is true at the beginning of
  vcpu_migrate(), it's almost certain that vcpu_migrate() will end up
  being called in context_switch() as well; we might as well simply
  let it run there and save the duplicated effort (which will be
  non-negligible).

The result is that Credit1 has several races which result in runqueue
<-> v->processor invariants being violated (triggering ASSERTs in
debug builds and strange bugs in production builds).

Instead, introduce vcpu_migrate_start() to initiate the process.
vcpu_migrate_start() is called with the scheduling lock held.  It not
only sets VPF_migrating, but also calls vcpu_sleep_nosync_locked()
(which will automatically do nothing if there's nothing to do).

Rename vcpu_migrate() to vcpu_migrate_finish().  Check for v->is_running and
pause_flags & VPF_migrating at the top and return if appropriate.

Then the way to initiate migration is consistently:

* Grab lock
* vcpu_migrate_start()
* Release lock
* vcpu_migrate_finish()

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Tested-by: Olaf Hering <olaf@aepfle.de>
Release-acked-by: Juergen Gross <jgross@suse.com>

xen: Introduce vcpu_sleep_nosync_locked()

There are a lot of places which release a lock before calling
vcpu_sleep_nosync(), which then just grabs the lock again. This is
not only a waste of time, but leads to more code duplication (since
you have to copy-and-paste recipes rather than calling a unified
function), which in turn leads to an increased chance of bugs.

Introduce vcpu_sleep_nosync_locked(), which can be called if you
already hold the schedule lock.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>