xenbits.xensource.com Git - xen.git/log
xen.git
5 years ago  ARM: arm64: activate atomic 64-bit accessors
Andre Przywara [Thu, 16 Mar 2017 11:20:10 +0000 (11:20 +0000)]
ARM: arm64: activate atomic 64-bit accessors

For some reason (probably because there was no user before) the 64-bit
atomic access wrappers were commented out so far.
As we will need them in the next patch, activate (and fix) them now.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
5 years ago  events: drop arch_evtchn_inject()
Jan Beulich [Thu, 23 May 2019 17:42:29 +0000 (10:42 -0700)]
events: drop arch_evtchn_inject()

Have the only user call vcpu_mark_events_pending() instead, at the same
time arranging for correct ordering of the writes (evtchn_pending_sel
should be written before evtchn_upcall_pending).
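
The ordering requirement can be illustrated with a minimal C11 sketch. The struct and function names below are hypothetical stand-ins, not Xen's definitions; only the two field names are taken from the message above, and the use of C11 atomics is an assumption made for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the two per-vCPU fields named above. */
struct vcpu_info_sketch {
    _Atomic uint64_t evtchn_pending_sel;    /* one bit per word of pending events */
    _Atomic uint8_t  evtchn_upcall_pending; /* "go and look at the selector" flag */
};

/* Publish the selector bit first, then raise the upcall flag (word < 64). */
static void mark_events_pending_sketch(struct vcpu_info_sketch *vi,
                                       unsigned int word)
{
    atomic_fetch_or_explicit(&vi->evtchn_pending_sel,
                             UINT64_C(1) << word, memory_order_relaxed);
    /* Release ordering: a reader that acquires the flag also sees the bit. */
    atomic_store_explicit(&vi->evtchn_upcall_pending, 1,
                          memory_order_release);
}
```

A consumer that observes the flag before the selector bit is visible would find nothing to process, which is why the selector has to be written first.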

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
5 years ago  x86: fix build race when generating temporary object files
Jan Beulich [Wed, 15 May 2019 07:49:35 +0000 (09:49 +0200)]
x86: fix build race when generating temporary object files

The rules to generate xen-syms and xen.efi may run in parallel, but both
recursively invoke $(MAKE) to build symbol/relocation table temporary
object files. These recursive builds would both re-generate the .*.d2
files (where needed). Both would in turn invoke the same rule, thus
allowing for a race on the .*.d2.tmp intermediate files.

The dependency files of the temporary .xen*.o files live in xen/ rather
than xen/arch/x86/ anyway, so won't be included no matter what. Take the
opportunity and delete them, as the just re-generated .xen*.S files will
trigger a proper re-build of the .xen*.o ones anyway.

Empty the DEPS variable in case the set of goals consists of just those
temporary object files, thus eliminating the race.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 761bb575ce97255029d2d2249b2719e54bc76825
master date: 2019-04-11 10:25:05 +0200

(cherry picked from commit 0ab95a98fea75535d11dc5f06290d923feb27dd1)
(cherry picked from commit ac90240785b8e5f6b40ee36739bb8ea9c645bf4b)

5 years ago  tools/ocaml: Dup2 /dev/null to stdin in daemonize()
Christian Lindig [Wed, 27 Feb 2019 10:33:42 +0000 (10:33 +0000)]
tools/ocaml: Dup2 /dev/null to stdin in daemonize()

Don't close stdin in daemonize() but dup2 /dev/null instead.  Otherwise, fd 0
gets reused later:

  [root@idol ~]# ls -lav /proc/`pgrep xenstored`/fd
  total 0
  dr-x------ 2 root root  0 Feb 28 11:02 .
  dr-xr-xr-x 9 root root  0 Feb 27 15:59 ..
  lrwx------ 1 root root 64 Feb 28 11:02 0 -> /dev/xen/evtchn
  l-wx------ 1 root root 64 Feb 28 11:02 1 -> /dev/null
  l-wx------ 1 root root 64 Feb 28 11:02 2 -> /dev/null
  lrwx------ 1 root root 64 Feb 28 11:02 3 -> /dev/xen/privcmd
  ...
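
The patch itself is against the OCaml tools; the same idea can be sketched in C as follows. The function name is hypothetical, and the code is an illustration of the technique rather than a rendering of the actual change.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Point fd 0 at /dev/null instead of closing it, so that a later open()
 * cannot be handed descriptor 0.  Returns 0 on success, -1 on failure.
 */
static int stdin_to_devnull(void)
{
    int fd = open("/dev/null", O_RDONLY);

    if (fd < 0)
        return -1;
    if (fd != STDIN_FILENO) {
        if (dup2(fd, STDIN_FILENO) < 0) {
            close(fd);
            return -1;
        }
        close(fd); /* fd 0 now refers to /dev/null; drop the temporary */
    }
    return 0;
}
```

After the call, the next descriptor handed out by open() is greater than 0, so a device node such as /dev/xen/evtchn can no longer end up as stdin.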

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 677e64dbe315343620c3b266e9eb16623b118038)
(cherry picked from commit 4b72470175a592fb5c0a5d10ed505de73778e10f)
(cherry picked from commit 5cfbc0ffd563a2ee3abfcce74eb3c20d82a7a035)
(cherry picked from commit 3db28b0babfb3db7b5bbf9799da6884453290312)

5 years ago  x86emul/test: don't use *_len symbols
Jan Beulich [Tue, 24 Jan 2017 16:22:03 +0000 (16:22 +0000)]
x86emul/test: don't use *_len symbols

... as they don't work as intended with -fPIC.

I did prefer them over *_end ones at the time because older gcc would
cause .L* symbols to be public, due to issuing .globl for all
referenced externals. And labels at the end of instructions collide
with the ones at the start of the next instruction, making disassembly
harder to grok. Luckily recent gcc no longer issues those .globl
directives, and hence .L* labels, staying local by default, no longer
get in the way.

Reported-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Tested-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 9315fa0ef736d1153c98ce42bff5853da5ec697f)

5 years ago  x86/spec-ctrl: Introduce options to control VERW flushing
Andrew Cooper [Wed, 12 Dec 2018 19:22:15 +0000 (19:22 +0000)]
x86/spec-ctrl: Introduce options to control VERW flushing

The Microarchitectural Data Sampling vulnerability is split into categories
with subtly different properties:

 MLPDS - Microarchitectural Load Port Data Sampling
 MSBDS - Microarchitectural Store Buffer Data Sampling
 MFBDS - Microarchitectural Fill Buffer Data Sampling
 MDSUM - Microarchitectural Data Sampling Uncacheable Memory

MDSUM is a special case of the other three, and isn't distinguished further.

These issues pertain to three microarchitectural buffers: the Load Ports, the
Store Buffers and the Fill Buffers.  Each of these structures is flushed by
the new enhanced VERW functionality, but the conditions under which flushing
is necessary vary.

For this concise overview of the issues and default logic, the abbreviations
SB (Store Buffer), FB (Fill Buffer), LP (Load Port) and HT (Hyperthreading) are
used for brevity:

 * Vulnerable hardware is divided into two categories - parts which suffer
   from SB only, and parts with any other combination of vulnerabilities.

 * SB only has an HT interaction when the thread goes idle, due to the static
   partitioning of resources.  LP and FB have HT interactions at all points,
   due to the competitive sharing of resources.  All issues potentially leak
   data across the return-to-guest transition.

 * The microcode which implements VERW flushing also extends MSR_FLUSH_CMD, so
   we don't need to do both on the HVM return-to-guest path.  However, some
   parts are not vulnerable to L1TF (therefore have no MSR_FLUSH_CMD), but are
   vulnerable to MDS, so do require VERW on the HVM path.

Note that we deliberately support mds=1 even without MD_CLEAR in case the
microcode has been updated but the feature bit not exposed.

This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3c04c258ab40405a74e194d9889a4cbc7abe94b4)

5 years ago  x86/spec-ctrl: Infrastructure to use VERW to flush pipeline buffers
Andrew Cooper [Wed, 12 Dec 2018 19:22:15 +0000 (19:22 +0000)]
x86/spec-ctrl: Infrastructure to use VERW to flush pipeline buffers

Three synthetic features are introduced, as we need individual control of
each, depending on circumstances.  A later change will enable them at
appropriate points.

The verw_sel field doesn't strictly need to live in struct cpu_info.  It lives
there because there is a convenient hole it can fill, and it reduces the
complexity of the SPEC_CTRL_EXIT_TO_{PV,HVM} assembly by avoiding the need for
any temporary stack maintenance.

This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 548a932ac786d6bf3584e4b54f2ab993e1117710)

5 years ago  x86/spec-ctrl: CPUID/MSR definitions for Microarchitectural Data Sampling
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: CPUID/MSR definitions for Microarchitectural Data Sampling

The MD_CLEAR feature can be automatically offered to guests.  No
infrastructure is needed in Xen to support the guest making use of it.

This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4f6116c080dc013cd1204c4d8ceb95e5f278689)

5 years ago  x86/spec-ctrl: Misc non-functional cleanup
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: Misc non-functional cleanup

 * Identify BTI in the spec_ctrl_{enter,exit}_idle() comments, as other
   mitigations will shortly appear.
 * Use alternative_input() and cover the lack of memory clobber with a further
   barrier.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b62eba6c429c327e1507816bef403ccc87357ae)

5 years ago  x86/boot: Detect the firmware SMT setting correctly on Intel hardware
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (13:26 +0100)]
x86/boot: Detect the firmware SMT setting correctly on Intel hardware

While boot_cpu_data.x86_num_siblings is an accurate value to use on AMD
hardware, it isn't on Intel when the user has disabled Hyperthreading in the
firmware.  As a result, a user who has chosen to disable HT still gets
nagged on L1TF-vulnerable hardware when they haven't chosen an explicit
smt=<bool> setting.

Make use of the largely-undocumented MSR_INTEL_CORE_THREAD_COUNT which in
practice has existed since Nehalem, when booting on real hardware.  Fall back to
using the ACPI table APIC IDs.

While adjusting this logic, fix a latent bug in amd_get_topology().  The
thread count field in CPUID.0x8000001e.ebx is documented as 8 bits wide,
rather than 2 bits wide.
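
The effect of the narrower mask can be shown with a small C sketch. The function names are hypothetical, and the field position (bits 15:8, encoding the count minus one) is an assumption of this sketch rather than something stated in the message above.

```c
#include <stdint.h>

/*
 * Threads per core from CPUID.0x8000001e.ebx, assuming the field sits in
 * bits 15:8 and encodes "count minus one" (an assumption of this sketch).
 */
static unsigned int threads_per_core(uint32_t ebx)
{
    return ((ebx >> 8) & 0xff) + 1;
}

/* The too-narrow variant: a 2-bit mask silently truncates larger values. */
static unsigned int threads_per_core_2bit(uint32_t ebx)
{
    return ((ebx >> 8) & 0x3) + 1;
}
```

Both variants agree for small counts, which is why the bug is latent: they only diverge once the field holds a value that does not fit in two bits.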

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b12fec4a125950240573ea32f65c61fb9afa74c3)

5 years ago  x86/msr: Definitions for MSR_INTEL_CORE_THREAD_COUNT
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (12:26 +0000)]
x86/msr: Definitions for MSR_INTEL_CORE_THREAD_COUNT

This is a model specific register which details the current configuration of
cores and threads in the package.  Because of how Hyperthread and Core
configuration works in firmware, the MSR is de-facto constant and
will remain unchanged until the next system reset.

It is a read only MSR (so unilaterally reject writes), but for now retain its
leaky-on-read properties.  Further CPUID/MSR work is required before we can
start virtualising a consistent topology to the guest, and retaining the old
behaviour is the safest course of action.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4120936bcd1695faf5b575f1259c58e31d2b18b)

5 years ago  x86/spec-ctrl: Reposition the XPTI command line parsing logic
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: Reposition the XPTI command line parsing logic

It has ended up in the middle of the mitigation calculation logic.  Move it to
be beside the other command line parsing.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c2c2bb0d60c642e64a5243a79c8b1548ffb7bc5b)

5 years ago  x86/tsx: Implement controls for RTM force-abort mode
Andrew Cooper [Mon, 18 Mar 2019 16:08:25 +0000 (17:08 +0100)]
x86/tsx: Implement controls for RTM force-abort mode

The CPUID bit and MSR are deliberately not exposed to guests, because they
won't exist on newer processors.  As vPMU isn't security supported, the
misbehaviour of PCR3 isn't expected to impact production deployments.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6be613f29b4205349275d24367bd4c82fb2960dd
master date: 2019-03-12 17:05:21 +0000

6 years ago  tools/firmware: update OVMF Makefile, when necessary
Wei Liu [Wed, 28 Nov 2018 17:43:33 +0000 (17:43 +0000)]
tools/firmware: update OVMF Makefile, when necessary

[ This is two commits from master aka staging-4.12: ]

OVMF has become dependent on OpenSSL, which is included as a
submodule.  Initialise submodules before building.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
(cherry picked from commit b16281870e06f5f526029a4e69634a16dc38e8e4)

tools: only call git when necessary in OVMF Makefile

Users may choose to export a snapshot of OVMF and build it
with xen.git supplied ovmf-makefile. In that case we don't
need to call `git submodule`.

Fixes b16281870e.

Reported-by: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 68292c94a60eab24514ab4a8e4772af24dead807)
(cherry picked from commit e983e8ae84efd5e43045a3d20a820f13cb4a75bf)
(cherry picked from commit 5a81de4c6b6036974f29e2330a493f23a8f0c1f0)
(cherry picked from commit 63d9330ba9fdec7c8e9346e6d85360747d61c947)

6 years ago  x86/pv: _toggle_guest_pt() may not skip TLB flush for shadow mode guests
Jan Beulich [Tue, 5 Mar 2019 14:46:51 +0000 (15:46 +0100)]
x86/pv: _toggle_guest_pt() may not skip TLB flush for shadow mode guests

For shadow mode guests (e.g. PV ones forced into that mode as L1TF
mitigation, or during migration) update_cr3() -> sh_update_cr3() may
result in a change to the (shadow) root page table (compared to the
previous one when running the same vCPU with the same PCID). This can,
first and foremost, be a result of memory pressure on the shadow memory
pool of the domain. Shadow code legitimately relies on the original
(prior to commit 5c81d260c2 ["xen/x86: use PCID feature"]) behavior of
the subsequent CR3 write to flush the TLB of entries still left from
walks with an earlier, different (shadow) root page table.

Restore the flushing behavior, also for the second CR3 write on the exit
path to guest context when XPTI is active. For the moment accept that
this will introduce more flushes than are strictly necessary - no flush
would be needed when the (shadow) root page table doesn't actually
change, but this information isn't readily (i.e. without introducing a
layering violation) available here.

This is XSA-294.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 329b00e4d49f70185561d7cc4b076c77869888a0
master date: 2019-03-05 13:54:42 +0100

6 years ago  x86/pv: Don't have %cr4.fsgsbase active behind a guest kernels back
Andrew Cooper [Tue, 5 Mar 2019 14:46:16 +0000 (15:46 +0100)]
x86/pv: Don't have %cr4.fsgsbase active behind a guest kernels back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * load_segments() is updated to use the VMLOAD optimisation if FSGSBASE is
   unavailable, even if only gs_shadow needs updating.  As a minor perf
   improvement, check cpu_has_svm first to short circuit a context-dependent
   conditional on Intel hardware.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293.

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: eccc170053e46b4ab1d9e7485c09e210be15bbd7
master date: 2019-03-05 13:54:05 +0100

6 years ago  x86/pv: Rewrite guest %cr4 handling from scratch
Andrew Cooper [Tue, 5 Mar 2019 14:45:37 +0000 (15:45 +0100)]
x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced; bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.
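
The "start from Xen's value, alter only a minimal subset" approach can be sketched in C as below. The bit positions are the architectural ones, but the function name and the contents of the guest-owned mask are purely illustrative assumptions; they are not the masks the patch defines.

```c
#include <stdint.h>

#define CR4_TSD  (UINT64_C(1) << 2)
#define CR4_DE   (UINT64_C(1) << 3)
#define CR4_VMXE (UINT64_C(1) << 13)

/* Hypothetical mask: bits whose value is taken from the guest's choice. */
#define GUEST_OWNED_SKETCH (CR4_TSD | CR4_DE)

/* Start from Xen's value and replace only the guest-owned bits. */
static uint64_t make_cr4_sketch(uint64_t xen_cr4, uint64_t guest_cr4)
{
    return (xen_cr4 & ~GUEST_OWNED_SKETCH) | (guest_cr4 & GUEST_OWNED_SKETCH);
}
```

Because the result depends only on the two inputs, it is the same no matter which value happens to be loaded in hardware when the function runs.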

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't have a practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b2dd00574a4fc87ca964177f8e752a968c27efb2
master date: 2019-03-05 13:53:32 +0100

6 years ago  x86/pv: Improve pv_cpuid()'s API
Andrew Cooper [Tue, 5 Mar 2019 14:45:01 +0000 (15:45 +0100)]
x86/pv: Improve pv_cpuid()'s API

pv_cpuid()'s API is awkward to use.  There are already two callers jumping
through hoops to use it, and a third is on its way.

Change the API to take each parameter individually (like its counterpart,
hvm_cpuid(), already does), and introduce a new pv_cpuid_regs() wrapper
implementing the old API.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
6 years ago  x86/mm: properly flush TLB in switch_cr3_cr4()
Jan Beulich [Tue, 5 Mar 2019 14:44:38 +0000 (15:44 +0100)]
x86/mm: properly flush TLB in switch_cr3_cr4()

The CR3 values used for contexts run with PCID enabled uniformly have
CR3.NOFLUSH set, so the CR3 write itself does not cause any
flushing at all. When the second CR4 write is skipped or doesn't do any
flushing, there's nothing so far which would purge TLB entries which may
have accumulated again if the PCID doesn't change; the "just in case"
flush only affects the case where the PCID actually changes. (There may
be particularly many TLB entries re-accumulated in case of a watchdog
NMI kicking in during the critical time window.)

Suppress the no-flush behavior of the CR3 write in this particular case.
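
Suppressing the no-flush behavior amounts to clearing bit 63 of the value written to CR3. A minimal C sketch, with a hypothetical helper name and a caller-supplied flag standing in for the real condition:

```c
#include <stdbool.h>
#include <stdint.h>

#define CR3_NOFLUSH (UINT64_C(1) << 63) /* bit 63: keep TLB entries on write */

/* Value to load into CR3: drop NOFLUSH whenever a flush is required. */
static uint64_t cr3_for_write(uint64_t cr3, bool need_flush)
{
    return need_flush ? (cr3 & ~CR3_NOFLUSH) : cr3;
}
```

The sketch only computes the value; the actual write and the decision about when a flush is needed are outside its scope.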

Similarly the second CR4 write may not cause any flushing of TLB entries
established again while the original PCID was still in use - it may get
performed because of unrelated bits changing. The flush of the old PCID
needs to happen nevertheless.

At the same time also eliminate a possible race with lazy context
switch: Just like for CR4, CR3 may change at any time while interrupts
are enabled, due to the __sync_local_execstate() invocation from the
flush IPI handler. It is for that reason that the CR3 read, just like
the CR4 one, must happen only after interrupts have been turned off.

This is XSA-292.

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
master commit: 6e5f22ba437d78c0a84b9673f7e2cfefdbc62f4b
master date: 2019-03-05 13:52:44 +0100

6 years ago  x86/mm: don't retain page type reference when IOMMU operation fails
Jan Beulich [Tue, 5 Mar 2019 14:44:06 +0000 (15:44 +0100)]
x86/mm: don't retain page type reference when IOMMU operation fails

The IOMMU update in _get_page_type() happens between recording of the
new reference and validation of the page for its new type (if
necessary). If the IOMMU operation fails, there's no point in actually
carrying out validation. Furthermore, with this resulting in failure
getting indicated to the caller, the recorded type reference also needs
to be dropped again.

Note that in case of failure of alloc_page_type() there's no need to
undo the IOMMU operation: Only special types get handed to the function.
The function, upon failure, clears ->u.inuse.type_info, effectively
converting the page to PGT_none. The IOMMU mapping, however, solely
depends on whether the type is PGT_writable_page.

This is XSA-291.

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fad0de986220c46e70be2f83279961aad7394af0
master date: 2019-03-05 13:52:15 +0100

6 years ago  x86/mm: add explicit preemption checks to L3 (un)validation
Jan Beulich [Tue, 5 Mar 2019 14:42:59 +0000 (15:42 +0100)]
x86/mm: add explicit preemption checks to L3 (un)validation

When recursive page tables are used at the L3 level, unvalidation of a
single L4 table may incur unvalidation of two levels of L3 tables, i.e.
a maximum iteration count of 512^3 for unvalidating an L4 table. The
preemption check in free_l2_table() as well as the one in
_put_page_type() may never be reached, so explicit checking is needed in
free_l3_table().

When recursive page tables are used at the L4 level, the iteration count
at L4 alone is capped at 512^2. As soon as a present L3 entry is hit
which itself needs unvalidation (and hence requiring another nested loop
with 512 iterations), the preemption checks added here kick in, so no
further preemption checking is needed at L4 (until we decide to permit
5-level paging for PV guests).
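
The general shape of an explicit preemption check inside a 512-entry loop can be sketched in C as follows. All names, the progress record and the error value are hypothetical placeholders for this sketch; it does not reproduce free_l3_table().

```c
#include <stdbool.h>

#define ENTRIES_PER_TABLE 512
#define ERESTART_SKETCH   85 /* placeholder value for this sketch */

/* Hypothetical progress record so a restarted call resumes where it stopped. */
struct table_progress {
    unsigned int next; /* index of the first entry not yet processed */
};

/* Example preemption check that always asks the caller to yield. */
static bool always_preempt(void)
{
    return true;
}

/*
 * Process entries [next, 512).  Returns 0 when done, or -ERESTART_SKETCH if
 * the preemption check fired before the table was finished.
 */
static int free_table_sketch(struct table_progress *p,
                             bool (*preempt_check)(void))
{
    bool progressed = false;

    while (p->next < ENTRIES_PER_TABLE) {
        /* Only yield after at least one entry, so each call makes progress. */
        if (progressed && preempt_check())
            return -ERESTART_SKETCH;
        /* ... per-entry unvalidation work would go here ... */
        p->next++;
        progressed = true;
    }
    return 0;
}
```

Recording the resume point is what bounds the work done per invocation, even when nested tables would otherwise multiply the iteration count.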

The validation side additions are done just for symmetry.

This is part of XSA-290.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: bac4567a67d5e8b916801ea5a04cf8b443dfb245
master date: 2019-03-05 13:51:46 +0100

6 years ago  x86/mm: also allow L2 (un)validation to be fully preemptible
Jan Beulich [Tue, 5 Mar 2019 14:42:32 +0000 (15:42 +0100)]
x86/mm: also allow L2 (un)validation to be fully preemptible

Commit c612481d1c ("x86/mm: Plumbing to allow any PTE update to fail
with -ERESTART") added assertions next to the {alloc,free}_l2_table()
invocations to document (and validate in debug builds) that L2
(un)validations are always preemptible.

The assertion in free_page_type() was now observed to trigger when
recursive L2 page tables get cleaned up.

In particular put_page_from_l2e()'s assumption that _put_page_type()
would always succeed is now wrong, resulting in a partially un-validated
page left in a domain, which has no other means of getting cleaned up
later on. If not causing any problems earlier, this would ultimately
trigger the check for ->u.inuse.type_info having a zero count when
freeing the page during cleanup after the domain has died.

As a result it should be considered a mistake to not have extended
preemption fully to L2 when it was added to L3/L4 table handling, which
this change aims to correct.

The validation side additions are done just for symmetry.

This is part of XSA-290.

Reported-by: Manuel Bouyer <bouyer@antioche.eu.org>
Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 176ebf9c8bc2828f6637eb61cc1cf166e302c699
master date: 2019-03-05 13:51:18 +0100

6 years ago  xen: Make coherent PV IOMMU discipline
George Dunlap [Tue, 5 Mar 2019 14:42:07 +0000 (15:42 +0100)]
xen: Make coherent PV IOMMU discipline

In order for a PV domain to set up DMA from a passed-through device to
one of its pages, the page must be mapped in the IOMMU.  On the other
hand, before a PV page may be used as a "special" page type (such as a
pagetable or descriptor table), it _must not_ be writable in the IOMMU
(otherwise a malicious guest could DMA arbitrary page tables into the
memory, bypassing Xen's safety checks); and Xen's current rule is to
have such pages not in the IOMMU at all.

At the moment, in order to accomplish this, the code borrows HVM
domain's "physmap" concept: When a page is assigned to a guest,
guest_physmap_add_entry() is called, which for PV guests, will create
a writable IOMMU mapping; and when a page is removed,
guest_physmap_remove_entry() is called, which will remove the mapping.

Additionally, when a page gains the PGT_writable page type, the page
will be added into the IOMMU; and when the page changes away from a
PGT_writable type, the page will be removed from the IOMMU.

Unfortunately, borrowing the "physmap" concept from HVM domains is
problematic.  HVM domains have a lock on their p2m tables, ensuring
synchronization between modifications to the p2m; and all hypercall
parameters must first be translated through the p2m before being used.

Trying to mix this locked-and-gated approach with PV's lock-free
approach leads to several races and inconsistencies:

* A race between a page being assigned and it being put into the
  physmap; for example:
  - P1: call populate_physmap() { A = allocate_domheap_pages() }
  - P2: Guess page A's mfn, and call decrease_reservation(A).  A is owned by the domain,
        and so Xen will clear the PGC_allocated bit and free the page
  - P1: finishes populate_physmap() { guest_physmap_add_entry() }

  Now the domain has a writable IOMMU mapping to a page it no longer owns.

* Pages start out as type PGT_none, but with a writable IOMMU mapping.
  If a guest uses a page as a page table without ever having created a
  writable mapping, the IOMMU mapping will not be removed; the guest
  will have a writable IOMMU mapping to a page it is currently using
  as a page table.

* A newly-allocated page can be DMA'd into with no special actions on
  the part of the guest; however, if a page is promoted to a
  non-writable type, the page must be mapped with a writable type before
  DMA'ing to it again, or the transaction will fail.

To fix this, do away with the "PV physmap" concept entirely, and
replace it with the following IOMMU discipline for PV guests:
 - (type == PGT_writable) <=> in iommu (even if type_count == 0)
 - Upon a final put_page(), check to see if type is PGT_writable; if so,
   iommu_unmap.

In order to achieve that:

- Remove PV IOMMU related code from guest_physmap_*

- Repurpose cleanup_page_cacheattr() into a general
  cleanup_page_mappings() function, which will both fix up Xen
  mappings for pages with special cache attributes, and also check for
  a PGT_writable type and remove pages if appropriate.

- For compatibility with current guests, grab-and-release a
  PGT_writable_page type for PV guests in guest_physmap_add_entry().
  This will cause most "normal" guest pages to start out life with
  PGT_writable_page type (and thus an IOMMU mapping), but no type
  count (so that they can be used as special cases at will).

Also, note that there is one exception to the "PGT_writable => in
iommu" rule: xenheap pages shared with guests may be given a
PGT_writable type with one type reference.  This reference prevents
the type from changing, which in turn prevents the page from gaining an
IOMMU mapping in get_page_type().  It's not clear whether this was
intentional or not, but it's not something to change in a security
update.

This is XSA-288.

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: fe21b78ef99a1b505cfb6d3789ede9591609dd70
master date: 2019-03-05 13:48:32 +0100

6 years ago  steal_page: Get rid of bogus struct page states
George Dunlap [Tue, 5 Mar 2019 14:41:40 +0000 (15:41 +0100)]
steal_page: Get rid of bogus struct page states

The original rules for `struct page` required the following invariants
at all times:

- refcount > 0 implies owner != NULL
- PGC_allocated implies refcount > 0

steal_page, in a misguided attempt to protect against unknown races,
violates both of these rules, thus introducing other races:

- Temporarily, the count_info has the refcount go to 0 while
  PGC_allocated is set

- It explicitly returns the page with PGC_allocated set, but owner == NULL
  and page not on the page_list.

The second one meant that page_get_owner_and_reference() could return
NULL even after having successfully grabbed a reference on the page,
leading the caller to leak the reference (since "couldn't get ref" and
"got ref but no owner" look the same).

Furthermore, rather than grabbing a page reference to ensure that the
owner doesn't change under its feet, it appears to rely on holding
d->page_alloc lock to prevent this.

Unfortunately, this is ineffective: page->owner remains non-NULL for
some time after the count has been set to 0; meaning that it would be
entirely possible for the page to be freed and re-allocated to a
different domain between the page_get_owner() check and the count_info
check.

Modify steal_page to instead follow the appropriate access discipline,
taking the page through a series of states similar to being freed and
then re-allocated with MEMF_no_owner:

- Grab an extra reference to make sure we don't race with anyone else
  freeing the page

- Drop both references and PGC_allocated atomically, so that (if
  successful), anyone else trying to grab a reference will fail

- Attempt to reset Xen's mappings

- Reset the rest of the state.
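
The "drop both references and PGC_allocated atomically" step can be sketched with a C11 compare-and-exchange. The flag's bit position and the function name are hypothetical, and the sketch assumes the word holds nothing but the flag and the count, which is a simplification.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PGC_ALLOCATED_SKETCH (UINT64_C(1) << 63) /* hypothetical flag bit */

/*
 * Succeeds only if the word is exactly "allocated + 2 references" (the
 * allocation reference plus the extra one taken by the caller), and then
 * clears it in a single compare-and-exchange.
 */
static bool drop_refs_and_allocated(_Atomic uint64_t *count_info)
{
    uint64_t expected = PGC_ALLOCATED_SKETCH | 2;

    return atomic_compare_exchange_strong(count_info, &expected, 0);
}
```

If anyone else holds a reference the exchange fails and the word is left untouched; once it succeeds the count is zero, so a later attempt to take a reference has nothing to increment from.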

Then, modify the two callers appropriately:

- Leave count_info alone (it's already been cleared)
- Call free_domheap_page() directly if appropriate
- Call assign_pages() rather than open-coding a partial assign

With all callers to assign_pages() now passing in pages with the
type_info field clear, tighten the respective assertion there.

This is XSA-287.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 3d4868a481eebed232eeacba36ea28e5dee5e946
master date: 2019-03-05 13:48:08 +0100

6 years ago  IOMMU/x86: fix type ref-counting race upon IOMMU page table construction
Jan Beulich [Tue, 5 Mar 2019 14:41:08 +0000 (15:41 +0100)]
IOMMU/x86: fix type ref-counting race upon IOMMU page table construction

When arch_iommu_populate_page_table() gets invoked for an already
running guest, simply looking at page types once isn't enough, as they
may change at any time. Add logic to re-check the type after having
mapped the page, unmapping it again if needed.
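
The check / map / re-check pattern can be sketched as follows. The
callback names (is_mappable, iommu_map, iommu_unmap) are hypothetical
and only illustrate the ordering; they are not the functions used by
the patch.

```python
def populate_page_table(pages, is_mappable, iommu_map, iommu_unmap):
    """Map every page whose type allows it, tolerating concurrent type changes."""
    for page in pages:
        if not is_mappable(page):
            continue
        iommu_map(page)
        # The guest is running, so the type may have changed since the
        # first check; look again and back the mapping out if it did.
        if not is_mappable(page):
            iommu_unmap(page)
```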

This is XSA-285.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tentatively-Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1f0b0bb7773d537bcf169e021495d0986d9809fc
master date: 2019-03-05 13:47:36 +0100

6 years agognttab: set page refcount for copy-on-grant-transfer
Jan Beulich [Tue, 5 Mar 2019 14:40:27 +0000 (15:40 +0100)]
gnttab: set page refcount for copy-on-grant-transfer

Commit 5cc77f9098 ("32-on-64: Fix domain address-size clamping,
implement"), which introduced this functionality, took care of clearing
the old page's PGC_allocated, but failed to set the bit (and install the
associated reference) on the newly allocated one. Furthermore the "mfn"
local variable was never updated, and hence the wrong MFN was passed to
guest_physmap_add_page() (and back to the destination domain) in this
case, leading to an IOMMU mapping into an unowned page.

Ideally the code would use assign_pages(), but the call to
gnttab_prepare_for_transfer() sits in the middle of the actions
mirroring that function.

This is XSA-284.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6d4f36c3fecc0a6a0991716199612c81d909316e
master date: 2019-03-05 13:45:58 +0100

6 years agoupdate Xen version to 4.8.5 RELEASE-4.8.5
Jan Beulich [Mon, 3 Dec 2018 09:03:57 +0000 (10:03 +0100)]
update Xen version to 4.8.5

6 years agoVMX: allow migration of guests with SSBD enabled
Jan Beulich [Fri, 23 Nov 2018 10:52:54 +0000 (11:52 +0100)]
VMX: allow migration of guests with SSBD enabled

The backport of cd53023df9 ("x86/msr: Virtualise MSR_SPEC_CTRL.SSBD for
guests to use") did not mirror the PV side change into the HVM (VMX-
specific) code path.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

6 years agox86/dom0: Fix shadowing of PV guests with 2M superpages
Andrew Cooper [Tue, 20 Nov 2018 14:59:55 +0000 (15:59 +0100)]
x86/dom0: Fix shadowing of PV guests with 2M superpages

This is a minimal backport of pieces of:

 c/s 28d9a9a2d41759b9e5163037b759ac557aea767c
 c/s 4c5d78a10dc89427140a50a1df5a0b8e9f073e82

to fix a PV shadowing problem which I hadn't anticipated at the time these
fixes were first accepted.

Having opt_allow_superpage disabled causes guest_supports_superpages() to
return false for PV guests.  Returning false causes guest_walk_tables() to
ignore L2 superpages, and read under them.

This ignoring behaviour is correct for 2-level paging when CR4.PSE is clear,
but isn't correct for 3- or 4-level paging.

When opt_allow_superpage is clear, PV domU's can't have superpages, but dom0
will still have its initial P2M constructed with 2M superpages.

The end result is that, if dom0 becomes shadowed (e.g. PV-L1TF), the next
memory access touching a P2M superpage will cause the shadow code to read
under the P2M superpage and attempt to shadow junk.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

6 years agox86/dom0: Avoid using 1G superpages if shadowing may be necessary
Andrew Cooper [Tue, 20 Nov 2018 14:59:17 +0000 (15:59 +0100)]
x86/dom0: Avoid using 1G superpages if shadowing may be necessary

The shadow code doesn't support 1G superpages, and will hand #PF[RSVD] back to
guests.

For dom0's with 512GB of RAM or more (and subject to the P2M alignment), Xen's
domain builder might use 1G superpages.

Avoid using 1G superpages (falling back to 2M superpages instead) if there is
a reasonable chance that we may have to shadow dom0.  This assumes that there
are no circumstances where we will activate logdirty mode on dom0.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 96f6ee15ad7ca96472779fc5c083b4149495c584
master date: 2018-11-12 11:26:04 +0000

6 years agox86/shadow: shrink struct page_info's shadow_flags to 16 bits
Jan Beulich [Tue, 20 Nov 2018 14:58:38 +0000 (15:58 +0100)]
x86/shadow: shrink struct page_info's shadow_flags to 16 bits

This is to avoid it overlapping the linear_pt_count field needed for PV
domains. Introduce a separate, HVM-only pagetable_dying field to replace
the sole one left in the upper 16 bits.

Note that the accesses to ->shadow_flags in shadow_{pro,de}mote() get
switched to non-atomic, non-bitops operations, as {test,set,clear}_bit()
are not allowed on uint16_t fields and hence their use would have
required ugly casts. This is fine because all updates of the field ought
to occur with the paging lock held, and other updates of it use |= and
&= as well (i.e. using atomic operations here didn't really guard
against potentially racing updates elsewhere).

This is part of XSA-280.

Reported-by: Prgmr.com Security <security@prgmr.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 789589968ed90e82a832dbc60e958c76b787be7e
master date: 2018-11-20 14:59:54 +0100

6 years agox86/shadow: move OOS flag bit positions
Jan Beulich [Tue, 20 Nov 2018 14:57:50 +0000 (15:57 +0100)]
x86/shadow: move OOS flag bit positions

In preparation of reducing struct page_info's shadow_flags field to 16
bits, lower the bit positions used for SHF_out_of_sync and
SHF_oos_may_write.

Instead of also adjusting the open coded use in _get_page_type(),
introduce shadow_prepare_page_type_change() to contain knowledge of the
bit positions to shadow code.

This is part of XSA-280.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: d68e1070c3e8f4af7a31040f08bdd98e6d6eac1d
master date: 2018-11-20 14:59:13 +0100

6 years agox86/mm: Don't perform flush after failing to update a guests L1e
Andrew Cooper [Tue, 20 Nov 2018 14:57:06 +0000 (15:57 +0100)]
x86/mm: Don't perform flush after failing to update a guests L1e

If the L1e update hasn't occurred, the flush cannot do anything useful.  This
skips the potentially expensive vcpumask_to_pcpumask() conversion, and
broadcast TLB shootdown.

More importantly however, we might be in the error path due to a bad va
parameter from the guest, and this should not propagate into the TLB flushing
logic.  The INVPCID instruction for example raises #GP for a non-canonical
address.
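
To illustrate both points, here is a small sketch. It assumes 48-bit
virtual addresses (hardware with larger address widths exists), and
update_va_mapping(), update_l1e and flush are hypothetical names used
only to show the early return.

```python
def is_canonical(va, va_bits=48):
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = va >> (va_bits - 1)
    return top == 0 or top == (1 << (64 - va_bits + 1)) - 1


def update_va_mapping(va, update_l1e, flush):
    """Flush only when the L1e update actually happened."""
    rc = update_l1e(va)
    if rc != 0:
        return rc      # nothing changed, and va may not even be canonical
    flush(va)
    return 0
```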

This is XSA-279.

Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6c8d50288722672ecc8e19b0741a31b521d01706
master date: 2018-11-20 14:58:41 +0100

6 years agoAMD/IOMMU: suppress PTE merging after initial table creation
Jan Beulich [Tue, 20 Nov 2018 14:56:29 +0000 (15:56 +0100)]
AMD/IOMMU: suppress PTE merging after initial table creation

The logic is not fit for this purpose, so simply disable its use until
it can be fixed / replaced. Note that this re-enables merging for the
table creation case, which was disabled as a (perhaps unintended) side
effect of the earlier "amd/iommu: fix flush checks". It relies on no
page getting mapped more than once (with different properties) in this
process, as that would still be beyond what the merging logic can cope
with. But arch_iommu_populate_page_table() guarantees this afaict.

This is part of XSA-275.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 937ef32565fa3a81fdb37b9dd5aa99a1b87afa75
master date: 2018-11-20 14:55:14 +0100

6 years agoamd/iommu: fix flush checks
Roger Pau Monné [Tue, 20 Nov 2018 14:55:51 +0000 (15:55 +0100)]
amd/iommu: fix flush checks

Flush checking for AMD IOMMU didn't check whether the previous entry
was present, or whether the flags (writable/readable) changed in order
to decide whether a flush should be executed.

Fix this by taking the writable/readable/next-level fields into account,
together with the present bit.
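
One plausible reading of that rule, as a sketch: a flush is only
needed when an entry that was present is changed in any of the listed
fields. The Entry type and needs_flush() are made up for illustration,
and including the address in the comparison is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    present: bool = False
    addr: int = 0
    readable: bool = False
    writable: bool = False
    next_level: int = 0


def needs_flush(old, new):
    """A flush is needed only if a previously present entry is being changed."""
    if not old.present:
        return False
    return old != new   # compares present, addr, permissions and next level
```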

Along these lines the flushing in amd_iommu_map_page() must not be
omitted for PV domains. The comment there was simply wrong: Mappings may
very well change, both their addresses and their permissions. Ultimately
this should honor iommu_dont_flush_iotlb, but to achieve this
amd_iommu_ops first needs to gain an .iotlb_flush hook.

Also make clear_iommu_pte_present() static, to demonstrate there's no
caller omitting the (subsequent) flush.

This is part of XSA-275.

Reported-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 1a7ffe466cd057daaef245b0a1ab6b82588e4c01
master date: 2018-11-20 14:52:12 +0100

6 years agostubdom/vtpm: fix memcmp in TPM_ChangeAuthAsymFinish
Olaf Hering [Mon, 18 Jun 2018 12:55:36 +0000 (14:55 +0200)]
stubdom/vtpm: fix memcmp in TPM_ChangeAuthAsymFinish

gcc8 spotted this error:
error: 'memcmp' reading 20 bytes from a region of size 8 [-Werror=stringop-overflow=]

Signed-off-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
(cherry picked from commit 22bf5be3237cb482a2ffd772ffd20ce37285eebf)
(cherry picked from commit dea9fc0e02d92f5e6d46680aa0a52fa758eca9c4)
(cherry picked from commit e907460fd61c350487ffee5d8aa375bef56bc81c)
Conflicts:
stubdom/Makefile
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit f13983db120f5e56dfefbee5d56678d2d43e2914)

6 years agox86: work around HLE host lockup erratum
Jan Beulich [Wed, 7 Nov 2018 08:51:44 +0000 (09:51 +0100)]
x86: work around HLE host lockup erratum

XACQUIRE prefixed accesses to the 4Mb range of memory starting at 1Gb
are liable to lock up the processor. Disallow use of this memory range.

Unfortunately the available Core Gen7 and Gen8 spec updates are pretty
old, so I can only guess that they're similarly affected when Core Gen6
is and the Xeon counterparts are, too.

This is part of XSA-282.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cc76410d20aff2cc07b268b0713dc1d2740c6e12
master date: 2018-11-07 09:33:24 +0100

6 years agox86: extend get_platform_badpages() interface
Jan Beulich [Wed, 7 Nov 2018 08:50:58 +0000 (09:50 +0100)]
x86: extend get_platform_badpages() interface

Use a structure so along with an address (now frame number) an order can
also be specified.
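
As a worked example of the frame number plus order representation,
take the range from the HLE erratum workaround, reading its sizes as
4 MiB at 1 GiB and assuming 4 KiB pages. The BadPage type and the
helper name are hypothetical.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12   # 4 KiB pages (assumption)


@dataclass(frozen=True)
class BadPage:
    mfn: int      # first machine frame number of the range
    order: int    # the range covers 2**order frames


def range_to_badpage(start, size):
    """Convert a naturally aligned power-of-two byte range to (mfn, order)."""
    assert size >= (1 << PAGE_SHIFT) and size & (size - 1) == 0
    assert start % size == 0
    return BadPage(start >> PAGE_SHIFT, (size >> PAGE_SHIFT).bit_length() - 1)


hle_range = range_to_badpage(1 << 30, 4 << 20)
```

Here hle_range comes out as frame 0x40000 with order 10, i.e. 1024
frames.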

This is part of XSA-282.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8617e69fb8307b372eeff41d55ec966dbeba36eb
master date: 2018-11-07 09:32:08 +0100

6 years agotools/dombuilder: Initialise vcpu debug registers correctly
Andrew Cooper [Mon, 5 Nov 2018 15:17:56 +0000 (16:17 +0100)]
tools/dombuilder: Initialise vcpu debug registers correctly

In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 46029da12e5efeca6d957e5793bd34f2965fa0a1
master date: 2018-10-24 14:43:05 +0100

6 years agox86/domain: Initialise vcpu debug registers correctly
Andrew Cooper [Mon, 5 Nov 2018 15:17:26 +0000 (16:17 +0100)]
x86/domain: Initialise vcpu debug registers correctly

In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.

Introduce arch_vcpu_regs_init() to set various architectural defaults, and
reuse this in the hvm_vcpu_reset_state() path.

Architecturally, %edx's init state contains the processor's model information,
and 0xf looks to be a remnant of the old Intel processors.  We clearly have no
software which cares, seeing as it is wrong for the last decade's worth of
Intel hardware and for all other vendors, so let's use the value 0 for
simplicity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>

x86/domain: Fix build with GCC 4.3.x

GCC 4.3.x can't initialise the user_regs structure like this.

Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: dfba4d2e91f63a8f40493c4fc2db03fd8287f6cb
master date: 2018-10-24 14:43:05 +0100
master commit: 0a1fa635029d100d4b6b7eddb31d49603217cab7
master date: 2018-10-30 13:26:21 +0000

6 years agox86/boot: Initialise the debug registers correctly
Andrew Cooper [Mon, 5 Nov 2018 15:16:45 +0000 (16:16 +0100)]
x86/boot: Initialise the debug registers correctly

In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.

Move X86_DR6_DEFAULT into x86-defns.h along with the other architectural
register constants, and introduce a new X86_DR7_DEFAULT.  Use the existing
write_debugreg() helper, rather than opencoded inline assembly.
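
A short sketch of why 0 is the wrong value for %dr6. The constants are
the architectural reset values as I understand them from the vendor
manuals; the DR6_RTM name and the helper are assumptions made for
illustration.

```python
X86_DR6_DEFAULT = 0xFFFF0FF0   # architectural reset value of %dr6
X86_DR7_DEFAULT = 0x00000400   # architectural reset value of %dr7
DR6_RTM = 1 << 16              # inverted polarity: a CLEAR bit reports the event


def db_seen_in_transaction(dr6):
    """True if %dr6 claims a debug exception occurred inside an RTM region."""
    return not (dr6 & DR6_RTM)
```

With this, db_seen_in_transaction(0) is true although nothing has
happened, while the default value reports no such exception.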

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 721da6d41a70fe08b3fcd9c31a62f6709a54c6ba
master date: 2018-10-24 14:43:05 +0100

6 years agox86/boot: enable NMIs after traps init
Sergey Dyasli [Mon, 5 Nov 2018 15:16:19 +0000 (16:16 +0100)]
x86/boot: enable NMIs after traps init

In certain scenarios, NMIs might be disabled during the Xen boot process.
Such a situation will cause alternative_instructions() to:

    panic("Timed out waiting for alternatives self-NMI to hit\n");

This bug was originally seen when using Tboot to boot Xen 4.11.

To prevent this from happening, enable NMIs during cpu_init() and
during __start_xen() for BSP.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 072e054359a4d4a4f6c3fa09585667472c4f0f1d
master date: 2018-10-23 12:33:54 +0100

6 years agovtd: add missing check for shared EPT...
Paul Durrant [Mon, 5 Nov 2018 15:15:17 +0000 (16:15 +0100)]
vtd: add missing check for shared EPT...

...in intel_iommu_unmap_page().

This patch also includes some non-functional modifications in
intel_iommu_map_page().

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: e30c47cd8be8ba73cfc1ec7b1ebd036464708a24
master date: 2018-10-04 14:53:57 +0200

6 years agox86: fix "xpti=" and "pv-l1tf=" yet again
Jan Beulich [Mon, 5 Nov 2018 15:14:50 +0000 (16:14 +0100)]
x86: fix "xpti=" and "pv-l1tf=" yet again

While commit 2a3b34ec47 ("x86/spec-ctrl: Yet more fixes for xpti=
parsing") indeed fixed "xpti=dom0", it broke "xpti=no-dom0", in that
this then became equivalent to "xpti=no". In particular, the presence
of "xpti=" alone on the command line means nothing as to which default
is to be overridden; "xpti=no-dom0", for example, ought to have no
effect for DomU-s, as this is distinct from both "xpti=no-dom0,domu"
and "xpti=no-dom0,no-domu".
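
The intended semantics can be sketched with two independent tri-state
settings, where None means "leave the default alone". This is not
Xen's parser: the function name is hypothetical, and the accepted
spellings beyond those quoted above are assumptions.

```python
def parse_xpti(arg, dom0=None, domu=None):
    """Return (dom0, domu); None means the default was not overridden."""
    for tok in arg.split(","):
        if not tok:
            continue                      # plain "xpti=" overrides nothing
        if tok in ("no", "false", "0", "off"):
            dom0 = domu = False
        elif tok in ("yes", "true", "1", "on"):
            dom0 = domu = True
        elif tok in ("dom0", "no-dom0"):
            dom0 = not tok.startswith("no-")
        elif tok in ("domu", "no-domu"):
            domu = not tok.startswith("no-")
        else:
            raise ValueError("unknown xpti token: " + tok)
    return dom0, domu
```

Keeping the two settings separate is what lets "no-dom0" leave the
DomU default untouched instead of collapsing into "no".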

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8743d2dea539617e237c77556a91dc357098a8af
master date: 2018-10-04 14:49:56 +0200

6 years agox86: split opt_pv_l1tf
Jan Beulich [Mon, 5 Nov 2018 15:14:25 +0000 (16:14 +0100)]
x86: split opt_pv_l1tf

Use separate tracking variables for the hardware domain and DomU-s.

No functional change intended, but adjust the comment in
init_speculation_mitigations() to match prior as well as resulting code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0b89643ef6ef14e2c2b731ca675d23e405ed69b1
master date: 2018-10-04 14:49:19 +0200

6 years agox86: split opt_xpti
Jan Beulich [Mon, 5 Nov 2018 15:13:55 +0000 (16:13 +0100)]
x86: split opt_xpti

Use separate tracking variables for the hardware domain and DomU-s.

No functional change intended.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 51e0cb45932d80d4eeb59994ee2c3f3c597b0212
master date: 2018-10-04 14:48:18 +0200

6 years agox86: silence false log messages for plain "xpti" / "pv-l1tf"
Jan Beulich [Mon, 5 Nov 2018 15:13:09 +0000 (16:13 +0100)]
x86: silence false log messages for plain "xpti" / "pv-l1tf"

While commit 2a3b34ec47 ("x86/spec-ctrl: Yet more fixes for xpti=
parsing") claimed to have got rid of the 'parameter "xpti" has invalid
value "", rc=-22!' log message for "xpti" alone on the command line,
this wasn't the case (the option took effect nevertheless).

Fix this there as well as for plain "pv-l1tf".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fb57e4beefeda923446b73f88b392e59b07d847
master date: 2018-09-28 17:12:14 +0200

6 years agostubdom/grub.patches: Drop docs changes, for licensing reasons
Ian Jackson [Tue, 18 Sep 2018 10:25:20 +0000 (11:25 +0100)]
stubdom/grub.patches: Drop docs changes, for licensing reasons

The patch file 00cvs is an import of a new upstream version of
grub1 from upstream CVS.

Unfortunately, in the period covered by the update, upstream changed
the documentation licence from a simple permissive licence, to the GNU
"Free Documentation Licence" with Front and Back Cover Texts.

The Debian Project is of the view that use of the Front and Back Cover
Texts feature of the GFDL makes the resulting document not Free
Software, because of the mandatory redistribution of these immutable
texts.  (Personally, I agree.)

This is awkward because Debian do not want to ship non-free content.
So the Debian maintainers need to launder the upstream source code, to
remove the troublesome files.  This is an extra step when
incorporating new upstream versions.  It's particularly annoying for
security response, which often involves rebasing onto a new upstream
release.

grub1 is obsolete and the last change to Xen's PV grub1 stubdom code
was in 2016.  Furthermore, the grub1 documentation is not built and
installed by the Xen pv-grub stubdom Makefiles.

Therefore, remove all docs changes from stubdom/grub.patches.  This
means that there are now no longer any GFDL-licenced grub docs in
xen.git.

There is no user impact, and Debian is helped.  This change would
complicate any attempts to update to a new version of upstream grub1,
but it seems unlikely that such a thing will ever happen.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Doug Goldstein <cardoe@cardoe.com>
CC: Juergen Gross <jgross@suse.com>
CC: pkg-xen-devel@lists.alioth.debian.org
Acked-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
(cherry picked from commit c62c53d61477dfeb63a47b0673c389082112babc)
(cherry picked from commit 94fba9f438a2c36ad9bf3a481a6013ddc7cf8cd9)
(cherry picked from commit ed024ef538cd10ec33c9edacd5e5f2016a5964d2)
(cherry picked from commit 782ca9b94f77026875dd98d6288fc1f8dcc7ce19)

6 years agox86/hvm/emulate: make sure rep I/O emulation does not cross GFN boundaries
Paul Durrant [Mon, 8 Oct 2018 12:51:33 +0000 (14:51 +0200)]
x86/hvm/emulate: make sure rep I/O emulation does not cross GFN boundaries

When emulating a rep I/O operation it is possible that the ioreq will
describe a single operation that spans multiple GFNs. This is fine as long
as all those GFNs fall within an MMIO region covered by a single device
model, but unfortunately the higher levels of the emulation code do not
guarantee that. This is something that should almost certainly be fixed,
but in the meantime this patch makes sure that MMIO is truncated at GFN
boundaries and hence the appropriate device model is re-evaluated for each
target GFN.

NOTE: This patch does not deal with the case of a single MMIO operation
      spanning a GFN boundary. That is more complex to deal with and is
      deferred to a subsequent patch.
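
The truncation amounts to clamping the repeat count so that every
element stays in the page of the first one. A minimal sketch, assuming
4 KiB pages and that the first element itself does not straddle a
boundary (the case deferred above); clamp_reps() is a hypothetical
name.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB guest frames


def clamp_reps(addr, size, reps, df=False):
    """Largest rep count whose accesses all stay within addr's page."""
    off = addr & (PAGE_SIZE - 1)
    assert off + size <= PAGE_SIZE, "single access crosses a page boundary"
    if df:
        # Elements at addr, addr - size, ... down to the start of the page.
        fit = off // size + 1
    else:
        # Elements at addr, addr + size, ... up to the end of the page.
        fit = (PAGE_SIZE - off) // size
    return min(reps, fit)
```

The remaining repetitions are then handled in a later pass, at which
point the device model is looked up again for the next GFN.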

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Convert calculations to be 32-bit only.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 7626edeaca972e3e823535dcc44338f6b2f0b21f
master date: 2018-08-16 09:27:30 +0200

6 years agox86/shutdown: use ACPI reboot method for Dell PowerEdge R540
Ross Lagerwall [Mon, 8 Oct 2018 12:51:03 +0000 (14:51 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R540

When EFI booting the Dell PowerEdge R540 it consistently wanders into
the weeds and gets an invalid opcode in the EFI ResetSystem call. This
is the same bug which affects the PowerEdge R740 so fix it in the same
way: quirk this hardware to use the ACPI reboot method instead.

BIOS Information
    Vendor: Dell Inc.
    Version: 1.3.7
    Release Date: 02/09/2018
System Information
    Manufacturer: Dell Inc.
    Product Name: PowerEdge R540

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 328ca55b7bd47e1324b75cce2a6c461308ecf93d
master date: 2018-06-28 09:29:13 +0200

6 years agox86/shutdown: use ACPI reboot method for Dell PowerEdge R740
Ross Lagerwall [Mon, 8 Oct 2018 12:50:16 +0000 (14:50 +0200)]
x86/shutdown: use ACPI reboot method for Dell PowerEdge R740

When EFI booting the Dell PowerEdge R740, it consistently wanders into the
weeds and gets an invalid opcode in the EFI ResetSystem call.
Quirk this hardware to use the ACPI reboot method instead.

Example stack trace:

----[ Xen-4.11-unstable  x86_64  debug=n   Not tainted ]----
CPU:    0
RIP:    e008:[<0000000000000017>] 0000000000000017
RFLAGS: 0000000000010202   CONTEXT: hypervisor
rax: 0000000066eb2ff0   rbx: ffff83005f627c20   rcx: 000000006c54e100
rdx: 0000000000000000   rsi: 0000000000000065   rdi: 000000107355f000
rbp: ffff83005f627c70   rsp: ffff83005f627b48   r8:  ffff83005f627b90
r9:  0000000000000000   r10: ffff83005f627c88   r11: 0000000000000000
r12: 0000000000000000   r13: 0000000000000cf9   r14: 0000000000000065
r15: ffff830000000000   cr0: 0000000080050033   cr4: 00000000003526e0
cr3: 000000107355f000   cr2: ffffc90000cff000
fsb: 0000000000000000   gsb: ffff88019f600000   gss: 0000000000000000
ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
Xen code around <0000000000000017> (0000000000000017):
 f0 d8 dd 00 f0 54 ff 00 <f0> 50 dd 00 f0 d8 dd 00 f0 a5 fe 00 f0 87 e9 00
Xen stack trace from rsp=ffff83005f627b48:
   ffff83005f627b50 ffffffffffffffda 000000006c547aaa ffff82d000000001
   ffff83005f627bec 000000107355f000 000000006c546fb8 ffff83107ffe3240
   0000000000000000 0000000000000000 8000000000000002 0000000000000000
   000000006c546b95 000000006c54c700 ffff83005f627bdc ffff83005f627be8
   000000005f616000 ffff83005f627c20 0000000000000000 0000000000000cf9
   ffff820080350001 000000000000000b ffff82d080351eda 0000000000000000
   0000000000000000 0000000000000000 0000000000000000 000000005f616000
   0000000000000000 ffff82d08095ff60 ffff82d08095ff60 000000f100000000
   ffff82d080296097 000000000000e008 0000000000000000 ffff83005f627c88
   0000000000000000 00000000fffffffe ffff82d0802959d2 ffff82d0802959d2
   000000008095f300 000000005f627c9c 00000000000000f8 0000000000000000
   00000000000000f8 ffff82d080932c00 0000000000000000 ffff82d08095f7c8
   ffff82d080932c00 0000000000000000 0000000000000000 ffff82d080295a9b
   ffff83005f627d98 ffff82d0802361f3 ffff82d080932c00 0000000080000000
   ffff83005f627d98 ffff82d080279a19 ffff82d08095f02c ffff82d080000000
   0000000000000000 00000000000000fb 0000000000000000 00000071484e54f6
   ffff831073542098 ffff82d08093ac78 ffff831072befd30 0000000000000000
   0000000000000000 0000000000000000 0000000000000000 0000000000000000
   0000000000000000 ffff82d08034f185 ffff82d080949460 0000000000000000
   ffff82d08095f270 0000000000000008 ffff83107357ae20 0000007146ce4bd3
Xen call trace:
   [<0000000000000017>] 0000000000000017
   [<ffff82d080351eda>] efi_reset_system+0x5a/0x90
   [<ffff82d080296097>] smp_send_stop+0x97/0xa0
   [<ffff82d0802959d2>] machine_restart+0x212/0x2d0
   [<ffff82d0802959d2>] machine_restart+0x212/0x2d0
   [<ffff82d080295a9b>] shutdown.c#__machine_restart+0xb/0x10
   [<ffff82d0802361f3>] smp_call_function_interrupt+0x53/0x80
   [<ffff82d080279a19>] do_IRQ+0x259/0x660
   [<ffff82d08034f185>] common_interrupt+0x85/0x90
   [<ffff82d0802c6152>] mwait-idle.c#mwait_idle+0x242/0x390
   [<ffff82d08026b446>] domain.c#idle_loop+0x86/0xc0

****************************************
Panic on CPU 0:
FATAL TRAP: vector = 6 (invalid opcode)
****************************************

dmidecode info:

BIOS Information:
    Vendor: Dell Inc.
    Version: 1.2.11
    Release Date: 10/19/2017
    BIOS Revision: 1.2
System Information:
    Manufacturer: Dell Inc.
    Product Name: PowerEdge R740

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f97f774b5aa6b471d1fed1c451c89ec7457dadf2
master date: 2018-01-24 18:01:00 +0100

6 years agox86: assorted array_index_nospec() insertions
Jan Beulich [Fri, 14 Sep 2018 11:36:32 +0000 (13:36 +0200)]
x86: assorted array_index_nospec() insertions

Don't chance having Spectre v1 (including BCBS) gadgets. In some of the
cases the insertions are more of a precautionary nature rather than there
provably being a gadget, but I think we should err on the safe (secure)
side here.
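
For readers unfamiliar with the idea, the arithmetic behind such a
clamp can be sketched as below. Only the arithmetic carries over:
Python offers no guarantees about speculation, the function names are
hypothetical, and the formula is the commonly used generic mask rather
than necessarily what Xen emits on x86. Inputs are assumed to satisfy
0 <= index < 2**63 and 0 < size <= 2**63.

```python
BITS = 64


def index_mask(index, size):
    """-1 (all ones) when index < size, else 0, computed without a comparison."""
    return ~(index | (size - 1 - index)) >> (BITS - 1)


def clamp_index(index, size):
    """Return index unchanged when in bounds, 0 otherwise."""
    return index & index_mask(index, size)
```

An out-of-bounds index is thus forced to 0 by data flow rather than by
a branch that could be mispredicted.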

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3f2002614af51dfd507168a1696658bac91155ce
master date: 2018-09-03 17:50:10 +0200

6 years agoVT-d/dmar: iommu mem leak fix
Zhenzhong Duan [Fri, 14 Sep 2018 11:35:54 +0000 (13:35 +0200)]
VT-d/dmar: iommu mem leak fix

Release memory allocated for drhd iommu in error path.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fd07b6648c4c8891dca5bd0f7ef174b6831f80b2
master date: 2018-08-27 11:37:24 +0200

6 years agorangeset: make inquiry functions tolerate NULL inputs
Jan Beulich [Fri, 14 Sep 2018 11:35:27 +0000 (13:35 +0200)]
rangeset: make inquiry functions tolerate NULL inputs

Rather than special casing the ->iomem_caps check in x86's
get_page_from_l1e() for the dom_xen case, let's be more tolerant in
general, along the lines of rangeset_is_empty(): A never allocated
rangeset can't possibly contain or overlap any range.
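
A sketch of the resulting behaviour, modelling a rangeset as a list of
inclusive (start, end) pairs and a never allocated one as None. The
data representation is an assumption made for illustration only.

```python
def rangeset_contains_range(r, s, e):
    """True if [s, e] lies entirely inside one range of r."""
    if r is None:
        return False          # never allocated: contains nothing
    return any(lo <= s and e <= hi for lo, hi in r)


def rangeset_overlaps_range(r, s, e):
    """True if [s, e] intersects any range of r."""
    if r is None:
        return False          # never allocated: overlaps nothing
    return any(s <= hi and lo <= e for lo, hi in r)
```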

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: ad0a9f273d6d6f0545cd9b708b2d4be581a6cadd
master date: 2018-08-17 13:54:40 +0200

6 years agox86/setup: Avoid OoB E820 lookup when calculating the L1TF safe address
Andrew Cooper [Fri, 14 Sep 2018 11:34:57 +0000 (13:34 +0200)]
x86/setup: Avoid OoB E820 lookup when calculating the L1TF safe address

A number of corner cases (most obviously, no-real-mode and no Multiboot memory
map) can end up with e820_raw.nr_map being 0, at which point the L1TF
calculation will underflow.
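
The underflow is easy to see when the index arithmetic is modelled
explicitly. The sketch assumes nr_map is an unsigned 32-bit quantity
and represents the map as a list of (addr, size) pairs; both helper
names are hypothetical.

```python
U32_MASK = 0xFFFFFFFF


def last_index_unchecked(nr_map):
    """Index of the last entry as unsigned 32-bit arithmetic computes it."""
    return (nr_map - 1) & U32_MASK


def top_of_map(entries):
    """End address of the last map entry, or None for an empty map."""
    if not entries:
        return None           # guard: there is no last entry to look at
    addr, size = entries[-1]
    return addr + size
```

With nr_map equal to 0 the unchecked index becomes 0xFFFFFFFF, far
outside the array, which is the out-of-bounds lookup being avoided.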

Spotted by Coverity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 3e4ec07e14bce81f6ae22c31ff1302d1f297a226
master date: 2018-08-16 18:10:07 +0100

6 years agox86/hvm/ioreq: MMIO range checking completely ignores direction flag
Paul Durrant [Fri, 14 Sep 2018 11:34:26 +0000 (13:34 +0200)]
x86/hvm/ioreq: MMIO range checking completely ignores direction flag

hvm_select_ioreq_server() is used to route an ioreq to the appropriate
ioreq server. For MMIO this is done by comparing the range of the ioreq
to the ranges registered by the device models of each ioreq server.
Unfortunately the calculation of the range of the ioreq completely ignores
the direction flag and thus may calculate the wrong range for comparison.
Thus the ioreq may either be routed to the wrong server or erroneously
terminated by null_ops.
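
How the direction flag changes the range can be sketched as follows,
assuming addr names the first element accessed and that a set flag
makes later elements sit at lower addresses. ioreq_range() is a
hypothetical helper, not the function touched by the patch.

```python
def ioreq_range(addr, size, count, df):
    """Inclusive (first, last) byte range touched by a repeated access."""
    if df:
        first = addr - (count - 1) * size   # elements run downwards from addr
    else:
        first = addr                        # elements run upwards from addr
    return first, first + size * count - 1
```

For four 4-byte elements at 0x1000 this gives 0x1000-0x100f going up
but 0xff4-0x1003 going down, so ignoring the flag compares the wrong
range against the registered ones.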

NOTE: The patch also fixes whitespace in the switch statement to make it
      style compliant.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 60a56dc0064a00830663ffe48215dcd080cb9504
master date: 2018-08-15 14:14:06 +0200

6 years agox86/vlapic: Bugfixes and improvements to vlapic_{read,write}()
Andrew Cooper [Fri, 14 Sep 2018 11:33:59 +0000 (13:33 +0200)]
x86/vlapic: Bugfixes and improvements to vlapic_{read,write}()

Firstly, there is no 'offset' boundary check on the non-32-bit write path
before the call to vlapic_read_aligned(), which allows an attacker to read
beyond the end of vlapic->regs->data[], which is only 1024 bytes long.

However, as the backing memory is a domheap page, and misaligned accesses get
chunked down to single bytes across page boundaries, I can't spot any
XSA-worthy problems which occur from the overrun.

On real hardware, bad accesses don't instantly crash the machine.  Their
behaviour is undefined, but the domain_crash() prohibits sensible testing.
Behave more like other x86 MMIO and terminate bad accesses with appropriate
defaults.

While making these changes, clean up and simplify the smaller-access
handling.  In particular, avoid pointer based mechanisms for 1/2-byte reads so
as to avoid forcing the value to be spilled to the stack.

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-175 (-175)
  function                                     old     new   delta
  vlapic_read                                  211     142     -69
  vlapic_write                                 304     198    -106

Finally, there are a plethora of read/write functions in the vlapic namespace,
so rename these to vlapic_mmio_{read,write}() to make their purpose more
clear.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b6f43c14cef3af8477a9eca4efab87dd150a2885
master date: 2018-08-10 13:27:24 +0100

6 years agox86/vmx: Avoid hitting BUG_ON() after EPTP-related domain_crash()
Andrew Cooper [Fri, 14 Sep 2018 11:33:20 +0000 (13:33 +0200)]
x86/vmx: Avoid hitting BUG_ON() after EPTP-related domain_crash()

If the EPTP pointer can't be located in the altp2m list, the domain
is (legitimately) crashed.

Under those circumstances, execution will continue and is guaranteed to hit the
BUG_ON(idx >= MAX_ALTP2M) (unfortunately, just out of context).

Return from vmx_vmexit_handler() after the domain_crash(), which also has the
side effect of reentering the scheduler more promptly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 48dbb2dbe9d9f92a2890a15bb48a0598c065b9f8
master date: 2018-08-02 10:10:43 +0100

6 years agox86: write to correct variable in parse_pv_l1tf()
Jan Beulich [Wed, 15 Aug 2018 12:24:19 +0000 (14:24 +0200)]
x86: write to correct variable in parse_pv_l1tf()

Apparently a copy-and-paste mistake.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 57c554f8a6e06894f601d977d18b3017d2a60f40
master date: 2018-08-15 14:15:30 +0200

6 years agoxl.conf: Add global affinity masks
Wei Liu [Tue, 7 Aug 2018 14:35:34 +0000 (15:35 +0100)]
xl.conf: Add global affinity masks

XSA-273 involves one hyperthread being able to use Spectre-like
techniques to "spy" on another thread.  The details are somewhat
complicated, but the upshot is that after all Xen-based mitigations
have been applied:

* PV guests cannot spy on sibling threads
* HVM guests can spy on sibling threads

(NB that for purposes of this vulnerability, PVH and HVM guests are
identical.  Whenever this comment refers to 'HVM', this includes PVH.)

There are many possible mitigations to this, including disabling
hyperthreading entirely.  But another solution would be:

* Specify some cores as PV-only, others as PV or HVM
* Allow HVM guests to only run on thread 0 of the "HVM-or-PV" cores
* Allow PV guests to run on the above cores, as well as any thread of the PV-only cores.

For example, suppose you had 16 threads across 8 cores (0-7).  You
could specify 0-3 as PV-only, and 4-7 as HVM-or-PV.  Then you'd set
the affinity of the HVM guests as follows (binary representation):

0000000010101010

And the affinity of the PV guests as follows:

1111111110101010

In order to make this easy, this patch introduces three "global affinity
masks", placed in xl.conf:

    vm.cpumask
    vm.hvm.cpumask
    vm.pv.cpumask

These are parsed just like the 'cpus' and 'cpus_soft' options in the
per-domain xl configuration files.  The resulting mask is AND-ed with
whatever mask results at the end of the xl configuration file.
`vm.cpumask` would be applied to all guest types, `vm.hvm.cpumask`
would be applied to HVM and PVH guest types, and `vm.pv.cpumask`
would be applied to PV guest types.

The idea would be that to implement the above mask across all your
VMs, you'd simply add the following two lines to the configuration
file:

    vm.hvm.cpumask=8,10,12,14
    vm.pv.cpumask=0-8,10,12,14

See xl.conf manpage for details.
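
As a rough illustration of the arithmetic in the example above (not xl's
actual parser), the following Python sketch expands a cpu list such as
0-8,10,12,14 into a set of thread numbers and intersects it with a
per-domain setting.  The helper names parse_cpulist and to_bits are
hypothetical, and only the plain "N" and "N-M" forms are handled.

```python
def parse_cpulist(spec):
    """Expand a string like "0-8,10,12,14" into a set of CPU numbers.

    Hypothetical helper: handles only "N" and "N-M" items, not the full
    syntax accepted by the 'cpus' option.
    """
    cpus = set()
    for item in spec.split(","):
        if "-" in item:
            lo, hi = item.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(item))
    return cpus


def to_bits(cpus, nr_cpus):
    """Render a CPU set with thread 0 leftmost, as in the message above."""
    return "".join("1" if c in cpus else "0" for c in range(nr_cpus))


hvm_mask = parse_cpulist("8,10,12,14")
pv_mask = parse_cpulist("0-8,10,12,14")

# A per-domain 'cpus' setting is AND-ed with the matching global mask.
domain_cpus = parse_cpulist("0-15")
effective_hvm = domain_cpus & hvm_mask

print(to_bits(hvm_mask, 16))
print(to_bits(pv_mask, 16))
```

Because the masks are combined by intersection, a global mask can only
narrow what a domain's own configuration requests, never widen it.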

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit aa67b97ed34279c43a43d9ca46727b5746caa92e)

PVH guest type in toolstack is not available in this version of Xen.
Change code and manpage to cope. Also xl is still part of libxl in
this version, manually backport code to relevant places.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>

6 years agox86: Make "spec-ctrl=no" a global disable of all mitigations
Jan Beulich [Mon, 13 Aug 2018 11:07:23 +0000 (05:07 -0600)]
x86: Make "spec-ctrl=no" a global disable of all mitigations

In order to have a simple and easy to remember means to suppress all the
more or less recent workarounds for hardware vulnerabilities, force
settings not controlled by "spec-ctrl=" also to their original defaults,
unless they've been forced to specific values already by earlier command
line options.

This is part of XSA-273.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit d8800a82c3840b06b17672eddee4878bbfdacc6d)

6 years agox86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests
Andrew Cooper [Tue, 29 May 2018 17:44:16 +0000 (18:44 +0100)]
x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests

This mitigation requires up-to-date microcode.  It is enabled by default on
affected hardware if available, and is used for HVM guests.

The default for SMT/Hyperthreading is far more complicated to reason about,
not least because we don't know if the user is going to want to run any HVM
guests to begin with.  If an explicit default isn't given, nag the user to
perform a risk assessment and choose an explicit default, and leave other
configuration to the toolstack.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3bd36952dab60290f33d6791070b57920e10754b)

6 years agox86/msr: Virtualise MSR_FLUSH_CMD for guests
Andrew Cooper [Fri, 13 Apr 2018 15:34:01 +0000 (15:34 +0000)]
x86/msr: Virtualise MSR_FLUSH_CMD for guests

Guests (outside of the nested virt case, which isn't supported yet) don't need
L1D_FLUSH for their L1TF mitigations, but offering/emulating MSR_FLUSH_CMD is
easy and doesn't pose an issue for Xen.

The MSR is offered to HVM guests only.  PV guests attempting to use it would
trap for emulation, and the L1D cache would fill long before the return to
guest context.  As such, PV guests can't make any use of the L1D_FLUSH
functionality.

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit fd9823faf9df057a69a9a53c2e100691d3f4267c)

6 years agox86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH
Andrew Cooper [Wed, 28 Mar 2018 14:21:39 +0000 (15:21 +0100)]
x86/spec-ctrl: CPUID/MSR definitions for L1D_FLUSH

This is part of XSA-273 / CVE-2018-3646.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3563fc2b2731a63fd7e8372ab0f5cef205bf8477)

6 years agox86/pv: Force a guest into shadow mode when it writes an L1TF-vulnerable PTE
Juergen Gross [Mon, 23 Jul 2018 06:11:40 +0000 (08:11 +0200)]
x86/pv: Force a guest into shadow mode when it writes an L1TF-vulnerable PTE

See the comment in shadow.h for an explanation of L1TF and the safety
consideration of the PTEs.

In the case that CONFIG_SHADOW_PAGING isn't compiled in, crash the domain
instead.  This allows well-behaved PV guests to function, while preventing
L1TF from being exploited.  (Note: PV guest kernels which haven't been updated
with L1TF mitigations will likely be crashed as soon as they try paging a
piece of userspace out to disk.)

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 06e8b622d3f3c0fa5075e91b041c6f45549ad70a)

6 years agox86/mm: Plumbing to allow any PTE update to fail with -ERESTART
Andrew Cooper [Mon, 23 Jul 2018 06:11:40 +0000 (08:11 +0200)]
x86/mm: Plumbing to allow any PTE update to fail with -ERESTART

Switching to shadow mode is performed in tasklet context.  To facilitate this,
we schedule the tasklet, then create a hypercall continuation to allow the
switch to take place.

As a consequence, the x86 mm code needs to cope with an L1e operation being
continuable.  do_mmu{,ext}_op() may no longer assert that a continuation
doesn't happen on the final iteration.

To handle the arguments correctly on continuation, compat_update_va_mapping*()
may no longer call into their non-compat counterparts.  Move the compat
functions into mm.c rather than exporting __do_update_va_mapping() and
{get,put}_pg_owner(), and fix an unsigned long/int inconsistency with
compat_update_va_mapping_otherdomain().

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c612481d1c9232c6abf91b03ec655e92f808805f)

6 years agox86/shadow: Infrastructure to force a PV guest into shadow mode
Juergen Gross [Mon, 23 Jul 2018 06:11:40 +0000 (07:11 +0100)]
x86/shadow: Infrastructure to force a PV guest into shadow mode

To mitigate L1TF, we cannot alter an architecturally-legitimate PTE a PV guest
chooses to write, but we can force the PV domain into shadow mode so Xen
controls the PTEs which are reachable by the CPU pagewalk.

Introduce a new shadow mode, PG_SH_forced, and a tasklet to perform the
transition.  Later patches will introduce the logic to enable this mode at the
appropriate time.

To simplify vcpu cleanup, make tasklet_kill() idempotent with respect to
tasklet_init(), which involves adding a helper to check for an uninitialised
list head.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b76ec3946bf6caca2c3950b857c008bc8db6723f)

6 years agox86/spec-ctrl: Introduce an option to control L1TF mitigation for PV guests
Andrew Cooper [Mon, 23 Jul 2018 13:46:10 +0000 (13:46 +0000)]
x86/spec-ctrl: Introduce an option to control L1TF mitigation for PV guests

Shadowing a PV guest is only available when shadow paging is compiled in.
When shadow paging isn't available, guests can be crashed instead as
mitigation from Xen's point of view.

Ideally, dom0 would also be potentially-shadowed-by-default, but dom0 has
never been shadowed before, and there are some stability issues under
investigation.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 66a4e986819a86ba66ca2fe9d925e62a4fd30114)

6 years agox86/spec-ctrl: Calculate safe PTE addresses for L1TF mitigations
Andrew Cooper [Wed, 25 Jul 2018 12:10:19 +0000 (12:10 +0000)]
x86/spec-ctrl: Calculate safe PTE addresses for L1TF mitigations

Safe PTE addresses for L1TF mitigations are ones which are within the L1D
address width (may be wider than reported in CPUID), and above the highest
cacheable RAM/NVDIMM/BAR/etc.

All logic here is best-effort heuristics, which should in practice be fine for
most hardware.  Future work will see about disentangling the SRAT handling
further, as well as having L0 pass this information down to lower levels when
virtualised.

This is part of XSA-273 / CVE-2018-3620.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b03a57c9383b32181e60add6b6de12b473652aa4)

6 years agotools/oxenstored: Make evaluation order explicit
Christian Lindig [Mon, 13 Aug 2018 16:26:56 +0000 (17:26 +0100)]
tools/oxenstored: Make evaluation order explicit

In Store.path_write(), Path.apply_modify() updates the node_created
reference and both the value of apply_modify() and node_created are
returned by path_write().

At least with OCaml 4.06.1 this leads to the value of node_created being
returned *before* it is updated by apply_modify().  This in turn leads
to the quota for a domain not being updated in Store.write().  Hence, a
guest can create an unlimited number of entries in xenstore.

The fix is to make evaluation order explicit.
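
The hazard can be modelled outside OCaml.  The Python sketch below is only
an analogy: Python evaluates tuple elements left to right, so the flag is
deliberately read first to mimic the order observed with OCaml 4.06.1.
The names Ref, path_write_fragile and path_write_explicit are
hypothetical and do not correspond to oxenstored code.

```python
class Ref:
    """Minimal stand-in for an OCaml 'ref' cell."""

    def __init__(self, value):
        self.value = value


def apply_modify(node_created):
    # Side effect: record that a node was created, then return a result.
    node_created.value = True
    return "new-root"


def path_write_fragile():
    node_created = Ref(False)
    # The flag is read before the call that updates it, so it is stale.
    return (node_created.value, apply_modify(node_created))


def path_write_explicit():
    node_created = Ref(False)
    # Bind the result first, so the side effect has happened ...
    root = apply_modify(node_created)
    # ... and only then read the flag.
    return (node_created.value, root)
```

In the fragile variant the caller sees "no node created" and would skip
the quota update; binding the call result first removes the dependence on
evaluation order.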

This is XSA-272.

Signed-off-by: Christian Lindig <christian.lindig@citrix.com>
Reviewed-by: Rob Hoes <rob.hoes@citrix.com>
(cherry picked from commit 73392c7fd14c59f8c96e0b2eeeb329e4ae9086b6)

6 years agox86/vtx: Fix the checking for unknown/invalid MSR_DEBUGCTL bits
Andrew Cooper [Mon, 18 Jun 2018 08:12:39 +0000 (16:12 +0800)]
x86/vtx: Fix the checking for unknown/invalid MSR_DEBUGCTL bits

The VPMU_MODE_OFF early-exit in vpmu_do_wrmsr() introduced by c/s
11fe998e56 bypasses all reserved bit checking in the general case.  As a
result, a guest can enable BTS when it shouldn't be permitted to, and
lock up the entire host.

With vPMU active (not a security supported configuration, but useful for
debugging), the reserved bit checking is broken, caused by the original
BTS changeset 1a8aa75ed.

From a correctness standpoint, it is not possible to have two different
pieces of code responsible for different parts of value checking, if
there isn't an accumulation of bits which have been checked.  A
practical upshot of this is that a guest can set any value it
wishes (usually resulting in a vmentry failure for bad guest state).

Therefore, fix this by implementing all the reserved bit checking in the
main MSR_DEBUGCTL block, and removing all handling of DEBUGCTL from the
vPMU MSR logic.
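
A minimal sketch of the "one place accumulates the permitted bits" idea,
written in Python for brevity.  The LBR and BTF bit positions are the
architectural ones; the extra_allowed parameter and the function name are
assumptions of this sketch, not Xen's interface.

```python
DEBUGCTL_LBR = 1 << 0  # last-branch recording
DEBUGCTL_BTF = 1 << 1  # single-step on branches


def debugctl_write_ok(value, extra_allowed=0):
    """Return True if 'value' sets no bit outside the permitted set.

    extra_allowed models bits contributed by an optional feature (for
    example BTS when vPMU is active).  It is a hypothetical parameter.
    """
    allowed = DEBUGCTL_LBR | DEBUGCTL_BTF | extra_allowed
    # A single check against the accumulated mask: nothing is skipped
    # because some other code path was "responsible" for those bits.
    return (value & ~allowed) == 0
```

With the mask built in one place, a bit is either explicitly permitted or
rejected; there is no path on which it goes unchecked.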

This is XSA-269.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a8a8e99feb950504559196521bc9fd63ed3a962)

6 years agoARM: disable grant table v2
Stefano Stabellini [Tue, 14 Aug 2018 10:20:53 +0000 (11:20 +0100)]
ARM: disable grant table v2

It was never expected to work, the implementation is incomplete.

As a side effect, it also prevents guests from triggering a
"BUG_ON(page_get_owner(pg) != d)" in gnttab_unpopulate_status_frames().

This is XSA-268.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9a5c16a3e75778c8a094ca87784d93b74676f46c)

6 years agocommon/gnttab: Introduce command line feature controls
Andrew Cooper [Tue, 14 Aug 2018 10:20:53 +0000 (11:20 +0100)]
common/gnttab: Introduce command line feature controls

This patch was originally released as part of XSA-226.  It retains the same
command line syntax (as various downstreams are mitigating XSA-226 using this
mechanism) but the defaults have been updated due to the revised XSA-226
patches, after which transitive grants are believed to be functioning
properly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit dc96c65ed6d7ffd4c95487373df708d97443cf77)

6 years agoVMX: fix vmx_{find,del}_msr() build
Jan Beulich [Thu, 19 Jul 2018 09:54:45 +0000 (11:54 +0200)]
VMX: fix vmx_{find,del}_msr() build

Older gcc at -O2 (and perhaps higher) does not recognize that apparently
uninitialized variables aren't really uninitialized. Pull out the
assignments used by two of the three case blocks and make them
initializers of the variables, as I think I had suggested during review.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 97cb0516a322ecdf0032fa9d8aa1525c03d7772f)

6 years agox86/vmx: Support load-only guest MSR list entries
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support load-only guest MSR list entries

Currently, the VMX_MSR_GUEST type maintains completely symmetric guest load
and save lists, by pointing VM_EXIT_MSR_STORE_ADDR and VM_ENTRY_MSR_LOAD_ADDR
at the same page, and setting VM_EXIT_MSR_STORE_COUNT and
VM_ENTRY_MSR_LOAD_COUNT to the same value.

However, for MSRs which we won't let the guest have direct access to, having
hardware save the current value on VMExit is unnecessary overhead.

To avoid this overhead, we must make the load and save lists asymmetric.  By
making the entry load count greater than the exit store count, we can maintain
two adjacent lists of MSRs, the first of which is saved and restored, and the
second of which is only restored on VMEntry.

For simplicity:
 * Both adjacent lists are still sorted by MSR index.
 * It is undefined behaviour to insert the same MSR into both lists.
 * The total size of both lists is still limited at 256 entries (one 4k page).

Split the current msr_count field into msr_{load,save}_count, and introduce a
new VMX_MSR_GUEST_LOADONLY type, and update vmx_{add,find}_msr() to calculate
which sublist to search, based on type.  VMX_MSR_HOST has no logical sublist,
whereas VMX_MSR_GUEST has a sublist between 0 and the save count, while
VMX_MSR_GUEST_LOADONLY has a sublist between the save count and the load
count.

One subtle point is that inserting an MSR into the load-save list involves
moving the entire load-only list, and updating both counts.
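
The layout described above can be modelled with a toy Python class.  This
is a sketch under stated assumptions: only MSR indices are stored, the
class and method names are hypothetical, and the MSR numbers used with it
are arbitrary.

```python
import bisect


class MsrList:
    """Toy model: one array holding two adjacent sorted sublists.

    entries[0:save_count]          loaded on entry and saved on exit
    entries[save_count:load_count] loaded on entry only
    """

    MAX_ENTRIES = 256  # limit stated in the commit message

    def __init__(self):
        self.entries = []  # MSR indices only; values are omitted
        self.save_count = 0
        self.load_count = 0

    def add(self, msr, load_only=False):
        # Pick the sublist to search, based on the requested type.
        if load_only:
            lo, hi = self.save_count, self.load_count
        else:
            lo, hi = 0, self.save_count
        pos = bisect.bisect_left(self.entries, msr, lo, hi)
        if pos < hi and self.entries[pos] == msr:
            return  # already present in this sublist
        if self.load_count >= self.MAX_ENTRIES:
            raise MemoryError("MSR list full")
        # Inserting into the load/save part shifts the whole load-only
        # part up by one slot, so both counts have to change.
        self.entries.insert(pos, msr)
        if not load_only:
            self.save_count += 1
        self.load_count += 1
```

For example, adding 0x1d9, then 0x48 as load-only, then 0x10 leaves the
array as [0x10, 0x1d9, 0x48] with a save count of 2 and a load count of 3.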

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 1ac46b55632626aeb935726e1b0a71605ef6763a)

6 years agox86/vmx: Pass an MSR value into vmx_msr_add()
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Pass an MSR value into vmx_msr_add()

The main purpose of this change is to allow us to set a specific MSR value,
without needing to know whether there is already a load/save list slot for it.

Previously, callers wanting this property needed to call both vmx_add_*_msr()
and vmx_write_*_msr() to cover both cases, and there are no callers which want
the old behaviour of being a no-op if an entry already existed for the MSR.

As a result of this API improvement, the default value for guest MSRs need not
be 0, and the default for host MSRs need not be passed via hardware register.
In practice, this cleans up the VPMU allocation logic, and avoids an MSR read
as part of vcpu construction.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit ee7689b94ac7094b975ab4a023cfeae209da0a36)

6 years agox86/vmx: Improvements to LBR MSR handling
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Improvements to LBR MSR handling

The main purpose of this patch is to only ever insert the LBR MSRs into the
guest load/save list once, as a future patch wants to change the behaviour of
vmx_add_guest_msr().

The repeated processing of lbr_info and the guest's MSR load/save list is
redundant, and a guest using LBR itself will have to re-enable
MSR_DEBUGCTL.LBR in its #DB handler, meaning that Xen will repeat this
redundant processing every time the guest gets a debug exception.

Rename lbr_fixup_enabled to lbr_flags to be a little more generic, and use one
bit to indicate that the MSRs have been inserted into the load/save list.
Shorten the existing FIXUP* identifiers to reduce code volume.

Furthermore, handing the guest #MC on an error isn't a legitimate action.  Two
of the three failure cases are definitely hypervisor bugs, and the third is a
boundary case which shouldn't occur in practice.  The guest also won't execute
correctly, so handle errors by cleanly crashing the guest.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit be73a842e642772d7372004c9c105de35b771020)

6 years agox86/vmx: Support remote access to the MSR lists
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Support remote access to the MSR lists

At the moment, all modifications of the MSR lists are in current context.
However, future changes may need to put MSR_EFER into the lists from domctl
hypercall context.

Plumb a struct vcpu parameter down through the infrastructure, and use
vmx_vmcs_{enter,exit}() for safe access to the VMCS in vmx_add_msr().  Use
assertions to ensure that access is either in current context, or while the
vcpu is paused.

Note these expectations beside the fields in arch_vmx_struct, and reorder the
fields to avoid unnecessary padding.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 80599f0b770199116aa753bfdfac9bfe2e8ea86a)

6 years agox86/vmx: Factor locate_msr_entry() out of vmx_find_msr() and vmx_add_msr()
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Factor locate_msr_entry() out of vmx_find_msr() and vmx_add_msr()

Instead of having multiple algorithms searching the MSR lists, implement a
single one.  It has the semantics required by vmx_add_msr(), to identify the
position in which an MSR should live, if it isn't already present.

There will be a marginal improvement for vmx_find_msr() by avoiding the
function pointer calls to vmx_msr_entry_key_cmp(), and a major improvement for
vmx_add_msr() by using a binary search instead of a linear search.
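
The "find it, or find where it belongs" semantics amount to a lower-bound
binary search.  A Python sketch follows; the function names are
hypothetical and the list holds bare MSR indices rather than real entries.

```python
def locate_entry(entries, msr):
    """Return the index of 'msr' in the sorted list 'entries', or the
    index at which it would have to be inserted to keep the list sorted."""
    lo, hi = 0, len(entries)
    while lo < hi:
        mid = (lo + hi) // 2
        if entries[mid] < msr:
            lo = mid + 1  # target lies strictly to the right of mid
        else:
            hi = mid      # mid is a candidate; keep searching leftwards
    return lo


def find(entries, msr):
    """Lookup built on top of locate_entry(); None if not present."""
    pos = locate_entry(entries, msr)
    if pos < len(entries) and entries[pos] == msr:
        return pos
    return None
```

A lookup and an insertion can then share one search routine: the lookup
compares the entry at the returned position, while an insertion uses the
position directly.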

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 4d94828cf11104256dccea1fa7762f00575dfaa0)

6 years agox86/vmx: Internal cleanup for MSR load/save infrastructure
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: Internal cleanup for MSR load/save infrastructure

 * Use an arch_vmx_struct local variable to reduce later code volume.
 * Use start/total instead of msr_area/msr_count.  This is in preparation for
   more finegrained handling with later changes.
 * Use ent/end pointers (again for preparation), and to make the vmx_add_msr()
   logic easier to follow.
 * Make the memory allocation block of vmx_add_msr() unlikely, and calculate
   virt_to_maddr() just once.

No practical change to functionality.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 94fda356fcdcc847662a4c9f6cc63511f25c1247)

6 years agox86/vmx: API improvements for MSR load/save infrastructure
Andrew Cooper [Mon, 7 May 2018 10:57:00 +0000 (11:57 +0100)]
x86/vmx: API improvements for MSR load/save infrastructure

Collect together related infrastructure in vmcs.h, rather than having it
spread out.  Turn vmx_{read,write}_guest_msr() into static inlines, as they
are simple enough.

Replace 'int type' with 'enum vmx_msr_list_type', and use switch statements
internally.  Later changes are going to introduce a new type.

Rename the type identifiers for consistency with the other VMX_MSR_*
constants.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit f54b63e8617ada823be43d60467a43c8224b7909)

6 years agox86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()
Andrew Cooper [Mon, 28 May 2018 14:02:34 +0000 (15:02 +0100)]
x86/vmx: Defer vmx_vmcs_exit() as long as possible in construct_vmcs()

paging_update_paging_modes() and vmx_vlapic_msr_changed() both operate on the
VMCS being constructed.  Avoid dropping and re-acquiring the reference
multiple times.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit f30e3cf34042846e391e3f8361fc6a76d181a7ee)

6 years agox86/vmx: Fix handing of MSR_DEBUGCTL on VMExit
Andrew Cooper [Thu, 24 May 2018 17:20:09 +0000 (17:20 +0000)]
x86/vmx: Fix handing of MSR_DEBUGCTL on VMExit

Currently, whenever the guest writes a nonzero value to MSR_DEBUGCTL, Xen
updates a host MSR load list entry with the current hardware value of
MSR_DEBUGCTL.

On VMExit, hardware automatically resets MSR_DEBUGCTL to 0.  Later, when the
guest writes to MSR_DEBUGCTL, the current value in hardware (0) is fed back
into guest load list.  As a practical result, `ler` debugging gets lost on any
PCPU which has ever scheduled an HVM vcpu, and in the common case when `ler`
debugging isn't active, guest actions result in an unnecessary load list entry
repeating the MSR_DEBUGCTL reset.

Restoration of Xen's debugging setting needs to happen from the very first
vmexit.  Due to the automatic reset, Xen need take no action in the general
case, and only needs to load a value when debugging is active.

This could be fixed by using a host MSR load list entry set up during
construct_vmcs().  However, a more efficient option is to use an alternative
block in the VMExit path, keyed on whether hypervisor debugging has been
enabled.

In order to set this up, drop the per cpu ler_msr variable (as there is no
point having it per cpu when it will be the same everywhere), and use a single
read_mostly variable instead.  Split calc_ler_msr() out of percpu_traps_init()
for clarity.

Finally, clean up do_debug().  Reinstate LBR early to help catch cascade
errors, which allows for the removal of the out label.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
(cherry picked from commit 730dc8d2c9e1b6402e66973cf99a7c56bc78be4c)

6 years agox86/spec-ctrl: Yet more fixes for xpti= parsing
Andrew Cooper [Thu, 9 Aug 2018 16:22:17 +0000 (17:22 +0100)]
x86/spec-ctrl: Yet more fixes for xpti= parsing

As it currently stands, 'xpti=dom0' is indistinguishable from the default
value, which means it will be overridden by ARCH_CAPABILITIES_RDCL_NO on fixed
hardware.

Switch opt_xpti to use -1 as a default like all our other related options, and
clobber it as soon as we have a string to parse.
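
The value of a distinct "not specified" default can be shown with a small
Python model.  It is deliberately simplified: only boolean forms are
parsed (the real option also takes forms such as 'dom0'), and the names
OPT_DEFAULT, parse_bool_option and resolve_xpti are hypothetical.

```python
OPT_DEFAULT = -1  # "nothing given on the command line"


def parse_bool_option(value):
    """Hypothetical parser: an empty value means the option was given bare
    and is read in its positive boolean form."""
    if value in ("", "1", "yes", "true", "on"):
        return 1
    if value in ("0", "no", "false", "off"):
        return 0
    raise ValueError("invalid value %r" % value)


def resolve_xpti(opt, hw_is_fixed):
    """Pick the effective setting once hardware capabilities are known."""
    if opt == OPT_DEFAULT:
        # Only an untouched default may be overridden by the hardware.
        return 0 if hw_is_fixed else 1
    return opt
```

An explicit request now survives on fixed hardware, because it is no
longer indistinguishable from the untouched default.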

In addition, 'xpti' alone should be interpreted in its positive boolean form,
rather than resulting in a parse error.

  (XEN) parameter "xpti" has invalid value "", rc=-22!

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2a3b34ec47817048ab59586855cf0709fc77487e)

6 years agox86/spec-ctrl: Fix the parsing of xpti= on fixed Intel hardware
Andrew Cooper [Mon, 30 Jul 2018 10:10:58 +0000 (12:10 +0200)]
x86/spec-ctrl: Fix the parsing of xpti= on fixed Intel hardware

The calls to xpti_init_default() in parse_xpti() are buggy.  The CPUID data
hasn't been fetched that early, and boot_cpu_has(X86_FEATURE_ARCH_CAPS) will
always evaluate false.

As a result, the default case won't disable XPTI on Intel hardware which
advertises ARCH_CAPABILITIES_RDCL_NO.

Simplify parse_xpti() to solely the setting of opt_xpti according to the
passed string, and have init_speculation_mitigations() call
xpti_init_default() if appropriate.  Drop the force parameter, and pass caps
instead, to avoid redundant re-reading of MSR_ARCH_CAPS.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: be5e2ff6f54e0245331ed360b8786760f82fd673
master date: 2018-07-24 11:25:54 +0100

6 years agox86/hvm: Disallow unknown MSR_EFER bits
Andrew Cooper [Mon, 30 Jul 2018 10:10:18 +0000 (12:10 +0200)]
x86/hvm: Disallow unknown MSR_EFER bits

It turns out that nothing ever prevented HVM guests from trying to set unknown
EFER bits.  Generally, this results in a vmentry failure.

For Intel hardware, all implemented bits are covered by the checks.

For AMD hardware, the only EFER bit which isn't covered by the checks is TCE
(which AFAICT is specific to AMD Fam15/16 hardware).  We never advertise TCE
in CPUID, but it isn't a security problem to have TCE unexpectedly enabled in
guest context.

Disallow the setting of bits outside of the EFER_KNOWN_MASK, which prevents
any vmentry failures for guests, yielding #GP instead.
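
In outline the check is a single mask test.  The Python sketch below uses
the architectural bit positions for a few EFER bits, but KNOWN_MASK here
is an illustrative subset rather than Xen's EFER_KNOWN_MASK, and the
exception class merely stands in for injecting #GP.

```python
EFER_SCE = 1 << 0   # SYSCALL enable
EFER_LME = 1 << 8   # long mode enable
EFER_LMA = 1 << 10  # long mode active
EFER_NX = 1 << 11   # no-execute enable
EFER_TCE = 1 << 15  # translation cache extension (AMD)

# Illustrative subset only; the real mask covers more bits.
KNOWN_MASK = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX


class GeneralProtectionFault(Exception):
    """Stand-in for raising #GP in the guest."""


def check_efer_write(value):
    unknown = value & ~KNOWN_MASK
    if unknown:
        # Reject up front instead of failing later at vmentry.
        raise GeneralProtectionFault("unknown EFER bits %#x" % unknown)
    return value
```

Rejecting the write at the point of the MSR access gives the guest a
fault it can handle, rather than an unrecoverable vmentry failure.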

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: ef0269c6215d642a709866f04ba1a1f9f13f3614
master date: 2018-07-24 11:25:53 +0100

6 years agox86/xstate: Make errors in xstate calculations more obvious by crashing the domain
Andrew Cooper [Mon, 30 Jul 2018 10:09:46 +0000 (12:09 +0200)]
x86/xstate: Make errors in xstate calculations more obvious by crashing the domain

If xcr0_max exceeds xfeature_mask, then something is broken with the CPUID
policy derivation or auditing logic.  If hardware rejects new_bv, then
something is broken with Xen's xstate logic.

In both cases, crash the domain with an obvious error message, to help
highlight the issues.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d6371ccb93012db4ad6615fe666205b86308cb4e
master date: 2018-07-19 19:57:26 +0100

6 years agox86/xstate: Use a guests CPUID policy, rather than allowing all features
Andrew Cooper [Mon, 30 Jul 2018 10:09:18 +0000 (12:09 +0200)]
x86/xstate: Use a guests CPUID policy, rather than allowing all features

It turns out that Xen has never enforced that a domain remain within the
xstate features advertised in CPUID.

The check of new_bv against xfeature_mask ensures that a domain stays within
the set of features that Xen has enabled in hardware (and therefore isn't a
security problem), but this does mean that attempts to level a guest for
migration safety might not be effective if the guest ignores CPUID.

Check the CPUID policy in validate_xstate() (for incoming migration) and in
handle_xsetbv() (for guest XSETBV instructions).  This subsumes the PKRU check
for PV guests in handle_xsetbv() (and also demonstrates that I should have
spotted this problem while reviewing c/s fbf9971241f).

For migration, this is correct despite the current (mis)ordering of data
because d->arch.cpuid is the applicable max policy.
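
The two layers of checking can be sketched as follows.  This Python model
covers only the mask comparisons; the other architectural rules for XCR0
are not modelled, and the function and parameter names are hypothetical.

```python
XSTATE_FP = 1 << 0   # x87 state
XSTATE_SSE = 1 << 1  # SSE state
XSTATE_YMM = 1 << 2  # AVX state


def xcr0_write_ok(new_bv, xfeature_mask, policy_xcr0_max):
    """Return True if a requested XCR0 value passes both checks."""
    # Hardware limit: features Xen has enabled on this host.
    if new_bv & ~xfeature_mask:
        return False
    # Policy limit: features advertised to this guest in CPUID.
    if new_bv & ~policy_xcr0_max:
        return False
    return True


host = XSTATE_FP | XSTATE_SSE | XSTATE_YMM
levelled_guest = XSTATE_FP | XSTATE_SSE  # AVX hidden for migration safety
```

With only the first check, a guest levelled to hide AVX could still
enable it on a capable host; the second check is what makes the levelling
effective.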

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 361b835fa00d9f45167c50a60e054ccf22c065d7
master date: 2018-07-19 19:57:26 +0100

6 years agox86/vmx: Don't clobber %dr6 while debugging state is lazy
Andrew Cooper [Mon, 30 Jul 2018 10:08:43 +0000 (12:08 +0200)]
x86/vmx: Don't clobber %dr6 while debugging state is lazy

c/s 4f36452b63 introduced a write to %dr6 in the #DB intercept case, but the
guest's debug registers may be lazy at this point, in which case the guest's
later attempt to read %dr6 will discard this value and use the older stale
value.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3cdac2805692c7accde2f405d81cc0be799aee48
master date: 2018-07-19 14:06:48 +0100

6 years agox86: command line option to avoid use of secondary hyper-threads
Jan Beulich [Mon, 30 Jul 2018 10:08:14 +0000 (12:08 +0200)]
x86: command line option to avoid use of secondary hyper-threads

Shared resources (L1 cache and TLB in particular) present a risk of
information leak via side channels. Provide a means to avoid use of
hyperthreads in such cases.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d8f974f1a646c0200b97ebcabb808324b288fadb
master date: 2018-07-19 13:43:33 +0100

6 years agox86: possibly bring up all CPUs even if not all are supposed to be used
Jan Beulich [Mon, 30 Jul 2018 10:07:43 +0000 (12:07 +0200)]
x86: possibly bring up all CPUs even if not all are supposed to be used

Reportedly Intel CPUs which can't broadcast #MC to all targeted
cores/threads because some have CR4.MCE clear will shut down. Therefore
we want to keep CR4.MCE enabled when offlining a CPU, and we need to
bring up all CPUs in order to be able to set CR4.MCE in the first place.

The use of clear_in_cr4() in cpu_mcheck_disable() was ill advised
anyway, and to avoid future similar mistakes I'm removing clear_in_cr4()
altogether right here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 8797d20a6ec2dd75195585a107ce345c51c0a59a
master date: 2018-07-19 13:43:33 +0100

6 years agox86: distinguish CPU offlining from CPU removal
Jan Beulich [Mon, 30 Jul 2018 10:07:09 +0000 (12:07 +0200)]
x86: distinguish CPU offlining from CPU removal

In order to be able to service #MC on offlined CPUs, the GDT, IDT,
stack, and per-CPU data (which includes the TSS) need to be kept
allocated. They should only be freed upon CPU removal (which we
currently don't support, so some code is becoming effectively dead for
the moment).

Note that for now park_offline_cpus doesn't get set to true anywhere -
this is going to be the subject of a subsequent patch.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2e6c8f182c9c50129b1c7a620242861e6ad6a9fb
master date: 2018-07-19 13:43:33 +0100

6 years agox86/AMD: distinguish compute units from hyper-threads
Jan Beulich [Mon, 30 Jul 2018 10:06:39 +0000 (12:06 +0200)]
x86/AMD: distinguish compute units from hyper-threads

Fam17 replaces CUs by HTs, which we should reflect accordingly, even if
the difference is not very big. The most relevant change (requiring some
code restructuring) is that the topoext feature no longer means there is
a valid CU ID.

Take the opportunity to convert the variables in set_cpu_sibling_map()
that were wrongly plain int to unsigned int.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Brian Woods <brian.woods@amd.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9429b07a0af7f92a5f25e4068e11db881e157495
master date: 2018-07-19 09:42:42 +0200

6 years ago cpupools: fix state when downing a CPU failed
Jan Beulich [Mon, 30 Jul 2018 10:06:08 +0000 (12:06 +0200)]
cpupools: fix state when downing a CPU failed

While I've run into the issue with further patches in place which no
longer guarantee the per-CPU area to start out as all zeros, the
CPU_DOWN_FAILED processing looks to have the same issue: By not zapping
the per-CPU cpupool pointer, cpupool_cpu_add()'s (indirect) invocation
of schedule_cpu_switch() will trigger the "c != old_pool" assertion
there.

Clearing the field during CPU_DOWN_PREPARE is too early (afaict this
should not happen before cpu_disable_scheduler()). Clearing it in
CPU_DEAD and CPU_DOWN_FAILED would be an option, but would take the same
piece of code twice. Since the field's value shouldn't matter while the
CPU is offline, simply clear it (implicitly) for CPU_ONLINE and
CPU_DOWN_FAILED, but only for other than the suspend/resume case (which
gets specially handled in cpupool_cpu_remove()).

By adjusting the conditional in cpupool_cpu_add(), the CPU_DOWN_FAILED
case during suspend should now also be handled better.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: cb1ae9a27819cea0c5008773c68a7be6f37eb0e5
master date: 2018-07-19 09:41:55 +0200

6 years ago allow cpu_down() to be called earlier
Jan Beulich [Mon, 30 Jul 2018 10:05:36 +0000 (12:05 +0200)]
allow cpu_down() to be called earlier

The function's use of the stop-machine logic has so far prevented its
use ahead of the processing of the "ordinary" initcalls. Since at this
early time we're in a controlled environment anyway, there's no need for
such a heavy tool. Additionally this ought to have less of a performance
impact especially on large systems, compared to the alternative of
making stop-machine functionality available earlier.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5894c0a2da66243a89088d309c7e1ea212ab28d6
master date: 2018-07-16 15:15:12 +0200

6 years ago xen: oprofile/nmi_int.c: Drop unwanted sexual reference
Ian Jackson [Mon, 30 Jul 2018 10:05:00 +0000 (12:05 +0200)]
xen: oprofile/nmi_int.c: Drop unwanted sexual reference

This is not really very nice.

This line doesn't have much value in itself.  The rest of this comment
block is pretty clear about what it wants to convey.  So delete it.

(While we are here, adopt the CODING_STYLE-mandated formatting.)

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Lars Kurth <lars.kurth.xen@gmail.com>
Acked-by: George Dunlap <dunlapg@umich.edu>
Acked-by: Jan Beulich <JBeulich@suse.com>
master commit: 41cb2db62627a7438d938aae487550c3f4acb1da
master date: 2018-07-12 16:38:30 +0100

6 years ago x86/spec-ctrl: command line handling adjustments
Jan Beulich [Mon, 30 Jul 2018 10:04:28 +0000 (12:04 +0200)]
x86/spec-ctrl: command line handling adjustments

For one, "no-xen" should not imply "no-eager-fpu", as "eager FPU" mode
exists to guard guests, not Xen itself, which is also how print_details()
reports it.

And then opt_ssbd, despite being off by default, should also be cleared
by the "no" and "no-xen" sub-options.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ac3f9a72141a48d40fabfff561d5a7dc0e1b810d
master date: 2018-07-10 12:22:31 +0200

6 years ago x86: correctly set nonlazy_xstate_used when loading full state
Jan Beulich [Mon, 30 Jul 2018 10:03:28 +0000 (12:03 +0200)]
x86: correctly set nonlazy_xstate_used when loading full state

In this case, just like xcr0_accum, nonlazy_xstate_used should always be
set to the intended new value, rather than possibly leaving the flag set
from a prior state load.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f46bf0e101ca63118b9db2616e8f51e972d7f563
master date: 2018-07-09 10:51:02 +0200
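
The difference between a "sticky" flag and one that tracks the newly loaded
state can be sketched as below. This is only an illustration of the pattern
described above, not the actual Xen code: struct vcpu_sketch,
NONLAZY_MASK_SKETCH and both function names are hypothetical, and the mask
bits are chosen arbitrarily.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mask of components that must be restored eagerly. */
#define NONLAZY_MASK_SKETCH ((UINT64_C(1) << 3) | (UINT64_C(1) << 4))

/* Hypothetical, heavily reduced stand-in for the per-vCPU state. */
struct vcpu_sketch {
    uint64_t xcr0_accum;
    bool nonlazy_xstate_used;
};

/* Sticky pattern: the flag is only ever set, so a stale 'true' survives. */
static void load_full_state_sticky(struct vcpu_sketch *v, uint64_t new_accum)
{
    assert(v != NULL);
    v->xcr0_accum = new_accum;
    if ( new_accum & NONLAZY_MASK_SKETCH )
        v->nonlazy_xstate_used = true;
}

/* Intended pattern: both fields are assigned from the incoming state. */
static void load_full_state(struct vcpu_sketch *v, uint64_t new_accum)
{
    assert(v != NULL);
    v->xcr0_accum = new_accum;
    v->nonlazy_xstate_used = (new_accum & NONLAZY_MASK_SKETCH) != 0;
}
```

With the sticky variant, loading a state without any non-lazy component after
one that had such a component leaves the flag set; the second variant clears
it again.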

6 years ago xen: Port the array_index_nospec() infrastructure from Linux
Andrew Cooper [Mon, 30 Jul 2018 10:01:55 +0000 (12:01 +0200)]
xen: Port the array_index_nospec() infrastructure from Linux

This is as the infrastructure appeared in Linux 4.17, adapted slightly for
Xen.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2ddfae51d8b1d7b8cd33a4f6ad4d16d27cb869ae
master date: 2018-07-06 16:49:57 +0100

6 years ago cmdline: fix parse_boolean() for NULL incoming end pointer
Jan Beulich [Mon, 30 Jul 2018 10:00:59 +0000 (12:00 +0200)]
cmdline: fix parse_boolean() for NULL incoming end pointer

Use the calculated lengths instead of pointers, as 'e' being NULL will
otherwise cause undue parsing failures.
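
Why lengths are more robust than pointers here can be sketched as follows.
This is a simplified illustration, not the Xen implementation:
parse_boolean_sketch() is a hypothetical name, and "name=value" forms are
deliberately left out. The point is that a pointer test along the lines of
"&s[nlen] == e" can never be true when e is NULL, whereas a comparison of
computed lengths works for both bounded and NUL-terminated input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * s points at the option text; e points one past its end, or is NULL when
 * the text is NUL-terminated. Returns 1 for "name", 0 for "no-name", and
 * -1 for anything else.
 */
static int parse_boolean_sketch(const char *name, const char *s,
                                const char *e)
{
    int val = strncmp(s, "no-", 3) != 0;
    size_t slen, nlen;

    if ( !val )
        s += 3;

    /* Compute the remaining length once, whichever way the end is given. */
    if ( e )
    {
        assert(e >= s);
        slen = (size_t)(e - s);
    }
    else
        slen = strlen(s);
    nlen = strlen(name);

    /* Does s start with name? */
    if ( slen < nlen || strncmp(s, name, nlen) != 0 )
        return -1;

    /* Exact match: decided by lengths, so a NULL e is handled as well. */
    if ( slen == nlen )
        return val;

    /* Anything longer (including "name=...") is outside this sketch. */
    return -1;
}
```

Both parse_boolean_sketch("xen", "no-xen", NULL) and a call with an explicit
end pointer into a longer, comma-separated string take the same path through
the length comparison.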

Reported-by: Karl Johnson <karljohnson.it@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>