xen/arm: grant-table: Protect gnttab_clear_flag against guest misbehavior
The function gnttab_clear_flag is used to clear the access flags. On
Arm, it is implemented using a loop and guest_cmpxchg.
It is possible for guest_cmpxchg to always return a value different from
old. This can happen if the guest updates the memory before Xen has time
to do the exchange. Because of that, there is no way to guarantee the
loop will end.
It is possible to make the current code safe by re-using the same
principle as applied in the guest atomics helpers. However, this patch
takes a different approach that should lead to more efficient code in
the default case.
A new helper is introduced to clear a set of bits on a 16-bit word.
This should avoid an extra loop to check whether the cmpxchg succeeded.
Note that a mask is used instead of a bit, so the helper can be re-used
later on for clearing multiple flags at the same time.
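As an illustration of the idea only (not the exact Xen implementation, which is written against the Arm guest atomics), clearing a whole mask maps onto a single atomic fetch-and in portable C, with no caller-visible retry loop:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Illustrative sketch: clear every bit in 'mask' with one atomic RMW,
     * so no outer cmpxchg-retry loop is needed and a guest writer cannot
     * stall the operation indefinitely. */
    static inline void clear_mask16(_Atomic uint16_t *word, uint16_t mask)
    {
        atomic_fetch_and(word, (uint16_t)~mask);
    }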
This is part of XSA-295.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
xen: Use guest atomics helpers when atomically modifying guest memory
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when atomically updating memory shared with a guest.
This patch replaces all the atomic operations on memory shared with
a guest by the new guest atomics helpers. The x86 code was not audited
to determine where the guest atomics helpers could be used; I will leave
that to the x86 folks.
Note that some rework was required in order to plumb the new guest
atomics into the event channel and grant table code.
Because guest_test_bit ignores the parameter "d" for now, a lot of
places do not need to drop the const. We may want to revisit this in
the future if the parameter "d" becomes necessary.
xen/cmpxchg: Provide helper to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when atomically updating memory shared with a guest.
This patch adds a new helper that will update the guest memory safely.
For x86, it is already possible to use the current helper safely. So
just wrap it.
For Arm, we will first attempt to update the guest memory with a loop
bounded by a maximum number of iterations. If that fails, we will pause
the domain and try again.
Note that this heuristic assumes that a page can only be shared between
Xen and one domain, not between Xen and multiple domains.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value
is per-CPU, to cater for big.LITTLE, and is calculated when the CPU
boots. The heuristic was chosen somewhat arbitrarily and can be modified
if it impacts well-behaved guests too much.
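A minimal sketch of that strategy, using illustrative names and signatures rather than Xen's exact code (cmpxchg_timeout(), this_cpu_max_try() and the pause primitives are stand-ins here):

    /* Sketch of the described strategy: bounded attempt first, then pause
     * the domain so its vCPUs cannot interfere, and retry. */
    static uint32_t guest_safe_cmpxchg(struct domain *d,
                                       volatile uint32_t *ptr,
                                       uint32_t old, uint32_t new)
    {
        uint32_t seen = old;

        /* Fast path: per-CPU bound on exclusive load-store retries. */
        if ( cmpxchg_timeout(ptr, &seen, new, this_cpu_max_try()) )
            return seen;

        /* Slow path: with the domain paused, the cmpxchg must complete. */
        domain_pause(d);
        seen = cmpxchg(ptr, old, new);
        domain_unpause(d);

        return seen;
    }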
xen/bitops: Provide helpers to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when atomically updating memory shared with a guest.
This patch adds a new set of helpers that will update the guest memory
safely. For x86, it is already possible to use the current helpers
safely, so just wrap them.
For Arm, we will first attempt to update the guest memory with a loop
bounded by a maximum number of iterations. If that fails, we will pause
the domain and try again.
Note that this heuristic assumes that a page can only be shared between
Xen and one domain, not between Xen and multiple domains.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value is
per-CPU, to cater for big.LITTLE, and is calculated when the CPU boots.
The heuristic was chosen somewhat arbitrarily and can be modified if it
impacts well-behaved guests too much.
Note that, while test_bit does not require atomic operations, a
wrapper for test_bit was added for completeness. In this case, the
domain stays constified to avoid major rework in the callers for the
time being.
xen/arm: Turn on SILO mode by default on Arm
On Arm, exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS Xen
when atomically updating memory shared with a guest.
Recent patches introduced new helpers to update memory shared with a
guest atomically. Those helpers rely on the memory region being shared
between Xen and a single guest only.
At the moment, nothing prevents a guest from sharing a page with Xen as
well as with another guest (e.g. via the grant table).
For the scope of the XSA, the quickest way is to deny communication
between unprivileged guests. So this patch enables and uses SILO
mode by default on Arm.
Users who want a finer-grained policy can write their own Flask policy.
This is part of XSA-295.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Wed, 22 May 2019 20:39:17 +0000 (13:39 -0700)]
xen/arm: cmpxchg: Provide a new helper that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS
Xen when atomically updating memory shared with a guest.
To prevent the infinite loop, we introduce a new helper that can time
out. The timeout is based on a maximum number of iterations.
It will be used in a follow-up patch to make atomic operations on shared
memory safe.
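A portable sketch of such a helper, assuming C11 atomics rather than the Arm exclusive load-store implementation the commit adds:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of a cmpxchg bounded by a maximum number of attempts.
     * Returns false only if every attempt failed spuriously (the analogue
     * of a guest repeatedly stealing the exclusive monitor on Arm). */
    static bool cmpxchg_timeout32(_Atomic uint32_t *ptr, uint32_t *old,
                                  uint32_t new, unsigned int max_try)
    {
        uint32_t expected = *old;

        while ( max_try-- )
        {
            /* A weak CAS may fail spuriously, like a failed STXR on Arm. */
            if ( atomic_compare_exchange_weak(ptr, &expected, new) )
                return true;

            if ( expected != *old )
            {
                /* Genuine value mismatch: the operation completed. */
                *old = expected;
                return true;
            }
        }

        return false; /* out of attempts: caller takes the slow path */
    }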
xen/arm: bitops: Implement a new set of helpers that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all guests are trusted, it may be possible to DoS
Xen when atomically updating memory shared with a guest.
To prevent the infinite loop, we introduce a new set of helpers that can
time out. The timeout is based on a maximum number of iterations.
They will be used in a follow-up patch to make atomic operations
on shared memory safe.
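The same idea for a bitop, again as a hedged portable sketch rather than the actual Arm implementation:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Sketch of a set_bit variant that gives up after max_try attempts,
     * so a guest hammering the same cache line cannot trap Xen in the
     * retry loop. */
    static bool set_bit_timeout(_Atomic unsigned long *word,
                                unsigned int bit, unsigned int max_try)
    {
        unsigned long old = atomic_load(word);

        while ( max_try-- )
        {
            if ( atomic_compare_exchange_weak(word, &old,
                                              old | (1UL << bit)) )
                return true;
            /* 'old' was refreshed by the failed CAS; retry with it. */
        }

        return false; /* timed out: caller falls back (e.g. pauses the domain) */
    }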
xen/arm32: cmpxchg: Simplify the cmpxchg implementation
The only difference between the cmpxchg cases is the size used. Rather
than duplicating the code, provide a macro to generate each case.
This makes the code easier to read and modify.
While doing the rework, the case for 64-bit cmpxchg is removed. It is
unused today (it was already commented out) and it would not be possible
to use it directly.
xen/grant_table: Rework the prototype of _set_status* for readability
It is not clear from the parameter names whether domid and gt_version
correspond to the local or the remote domain. A follow-up patch will
make them even more confusing.
So rename domid (resp. gt_version) to ldomid (resp. rgt_version). At
the same time re-order the parameters to hopefully make it more
readable.
This is part of XSA-295.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
xen/arm: Add an isb() before reading CNTPCT_EL0 to prevent re-ordering
Per D8.2.1 in ARM DDI 0487C.a, "a read to CNTPCT_EL0 can occur
speculatively and out of order relative to other instructions executed
on the same PE."
Add an instruction barrier to get an accurate number of cycles when
requested in get_cycles(). For the other users of CNTPCT_EL0, replace
the direct reads with a call to get_cycles().
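A sketch of the resulting read on AArch64 (function name taken from the text; the exact Xen code may differ):

    #include <stdint.h>

    /* AArch64 sketch: the ISB prevents the counter read from being
     * hoisted and executed speculatively ahead of prior instructions. */
    static inline uint64_t get_cycles(void)
    {
        uint64_t cval;

        asm volatile ( "isb" : : : "memory" );
        asm volatile ( "mrs %0, cntpct_el0" : "=r" (cval) );

        return cval;
    }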
Jan Beulich [Tue, 12 Mar 2019 13:40:24 +0000 (14:40 +0100)]
events: drop arch_evtchn_inject()
Have the only user call vcpu_mark_events_pending() instead, at the same
time arranging for correct ordering of the writes (evtchn_pending_sel
should be written before evtchn_upcall_pending).
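A sketch of that ordering with illustrative types (smp_wmb() in Xen corresponds to the release fence here):

    #include <stdatomic.h>
    #include <stdint.h>

    struct vcpu_info_sketch {               /* illustrative subset */
        uint8_t       evtchn_upcall_pending;
        unsigned long evtchn_pending_sel;
    };

    /* Publish the selector word before the global pending flag, so the
     * guest never observes the flag without the selector bit. */
    static void mark_events_pending(struct vcpu_info_sketch *vi,
                                    unsigned int word)
    {
        __atomic_fetch_or(&vi->evtchn_pending_sel, 1UL << word,
                          __ATOMIC_RELAXED);
        atomic_thread_fence(memory_order_release); /* sel before pending */
        __atomic_store_n(&vi->evtchn_upcall_pending, 1, __ATOMIC_RELAXED);
    }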
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
Igor Druzhinin [Tue, 4 Jun 2019 13:40:37 +0000 (15:40 +0200)]
libacpi: report PCI slots as enabled only for hotpluggable devices
DSDT for qemu-xen lacks _STA method of PCI slot object. If _STA method
doesn't exist then the slot is assumed to be always present and active
which in conjunction with _EJ0 method makes every device ejectable for
an OS even if it's not the case.
qemu-kvm is able to dynamically add an _EJ0 method only to those slots
that either have hotpluggable devices or are free for PCI passthrough.
As Xen lacks this capability, we cannot use their approach.
The qemu-xen-traditional DSDT has a _STA method which reports the slot
as present only if a PCI device is hotplugged there.
This is done by querying its PCI hotplug controller.
qemu-xen has a similar capability that reports whether a device is
"hotpluggable or absent", which we can use to achieve the same result.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6761965243b113230bed900d6105be05b28f5cea
master date: 2019-05-24 10:30:21 +0200
Jan Beulich [Tue, 4 Jun 2019 13:40:07 +0000 (15:40 +0200)]
x86/IO-APIC: fix build with gcc9
There are a number of pointless __packed attributes which cause gcc 9 to
legitimately warn:
utils.c: In function 'vtd_dump_iommu_info':
utils.c:287:33: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
287 | remap = (struct IO_APIC_route_remap_entry *) &rte;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
intremap.c: In function 'ioapic_rte_to_remap_entry':
intremap.c:343:25: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
343 | remap_rte = (struct IO_APIC_route_remap_entry *) old_rte;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Simply drop these attributes. Take the liberty and also re-format the
structure definitions at the same time.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ca9310b24e6205de5387e5982ccd42c35caf89d4
master date: 2019-05-24 10:19:59 +0200
Juergen Gross [Tue, 4 Jun 2019 13:39:37 +0000 (15:39 +0200)]
xen/sched: fix csched2_deinit_pdata()
Commit 753ba43d6d16e688 ("xen/sched: fix credit2 smt idle handling")
introduced a regression when switching cpus between cpupools.
When assigning a cpu to a cpupool with credit2 being the default
scheduler csched2_deinit_pdata() is called for the credit2 private data
after the new scheduler's private data has been hooked to the per-cpu
scheduler data. Unfortunately csched2_deinit_pdata() will cycle through
all per-cpu scheduler areas it knows of for removing the cpu from the
respective sibling masks including the area of the just moved cpu. This
will (depending on the new scheduler) either clobber the data of the
new scheduler or in case of sched_rt lead to a crash.
Avoid that by removing the cpu from the list of active cpus in credit2
data first.
The opposite problem is occurring when removing a cpu from a cpupool:
init_pdata() of credit2 will access the per-cpu data of the old
scheduler.
Jan Beulich [Tue, 4 Jun 2019 13:39:04 +0000 (15:39 +0200)]
x86emul: add support for missing {,V}PMADDWD insns
Their pre-AVX512 incarnations have clearly been overlooked during much
earlier work. Their memory access pattern is entirely standard, so no
specific tests get added to the harness.
Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1a48bdd599b268a2d9b7d0c45f1fd40c4892186e
master date: 2019-05-16 13:43:17 +0200
Jan Beulich [Tue, 4 Jun 2019 13:38:17 +0000 (15:38 +0200)]
x86/IRQ: avoid UB (or worse) in trace_irq_mask()
Dynamically allocated CPU mask objects may be smaller than cpumask_t, so
copying has to be restricted to the actual allocation size. This is
particularly important since the function doesn't bail early when tracing
is not active, so even production builds would be affected by potential
misbehavior here.
Take the opportunity and also
- use initializers instead of assignment + memset(),
- constify the cpumask_t input pointer,
- u32 -> uint32_t.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6fafb8befa99620a2d7323b9eca5c387bad1f59f
master date: 2019-05-13 16:41:03 +0200
Andrew Cooper [Tue, 4 Jun 2019 13:37:47 +0000 (15:37 +0200)]
x86/boot: Fix latent memory corruption with early_boot_opts_t
c/s ebb26b509f "xen/x86: make VGA support selectable" added an #ifdef
CONFIG_VIDEO into the middle of the backing space for early_boot_opts_t,
but didn't adjust the structure definition in cmdline.c.
This only functions correctly because the affected fields are at the end
of the structure, and cmdline.c doesn't write to them in this case.
To retain the slimming effect of compiling out CONFIG_VIDEO, adjust
cmdline.c with enough #ifdef-ary to make C's idea of the structure match
the declaration in asm. This requires adding __maybe_unused annotations
to two helper functions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30596213617fcf4dd7b71d244e16c8fc0acf456b
master date: 2019-05-13 10:35:38 +0100
Andrew Cooper [Tue, 4 Jun 2019 13:37:18 +0000 (15:37 +0200)]
x86/svm: Fix handling of ICEBP intercepts
c/s 9338a37d "x86/svm: implement debug events" added support for introspecting
ICEBP debug exceptions, but didn't account for the fact that
svm_get_insn_len() (previously __get_instruction_length) can fail and may
already have raised #GP with the guest.
If svm_get_insn_len() fails, return back to guest context rather than
continuing and mistaking a trap-style VMExit for a fault-style one.
Spotted by Coverity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 1495b4ff9b4af2b9c0f12cdb6491082cecf34f86
master date: 2019-05-13 10:35:37 +0100
drivers/video: drop framebuffer size constraints
The 1900x1200 limit does not match real-world devices (1900 looks like a
typo; it should be 1920). But in practice the limits are arbitrary and do
not serve any real purpose. As discussed in the "Increase framebuffer size
to todays standards" thread, drop them completely.
This fixes the graphic console on devices with a 3840x2160 native resolution.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
drivers/video: drop unused limits
MAX_BPP, MAX_FONT_W, MAX_FONT_H are not used in the code at all.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 19600eb75aa9b1df3e4b0a4e55a5d08b957e1fd9
master date: 2019-05-13 10:13:24 +0200
master commit: 343459e34a6d32ba44a21f8b8fe4c1f69b1714c2
master date: 2019-05-13 10:12:56 +0200
When bitmap_fill(..., 0) is called, do not try to write anything. Before
this patch, it tried to write almost LONG_MAX, surely overwriting
something.
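A sketch of the guard (helper and macro names are illustrative; tail-bit masking of the final word is elided):

    #include <string.h>

    #define BITS_PER_LONG    (8 * sizeof(unsigned long))
    #define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

    /* With nbits == 0 there is nothing to write, so bail out before any
     * size arithmetic can underflow. */
    static void bitmap_fill(unsigned long *dst, unsigned int nbits)
    {
        if ( !nbits )
            return;
        memset(dst, 0xff, BITS_TO_LONGS(nbits) * sizeof(unsigned long));
    }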
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 93df28be2d4f620caf18109222d046355ac56327
master date: 2019-05-13 10:12:00 +0200
Tamas K Lengyel [Tue, 4 Jun 2019 13:34:58 +0000 (15:34 +0200)]
x86/vmx: correctly gather gs_shadow value for current vCPU
Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.
Refresh shadow_gs value when the context being saved is for the current vCPU.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: f69fc1c2f36e8a74ba54c9c8fa5c904ea1ad319e
master date: 2019-05-13 09:55:59 +0200
Igor Druzhinin [Tue, 4 Jun 2019 13:34:21 +0000 (15:34 +0200)]
x86/mtrr: recalculate P2M type for domains with iocaps
This change reflects the logic in epte_get_entry_emt() and allows
changes in guest MTRRs to be reflected in the EPT for domains that have
direct access to certain hardware memory regions but no IOMMU
context assigned (e.g. XenGT).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f3d880bf2be92534c5bacf11de2f561cbad550fb
master date: 2019-05-13 09:54:45 +0200
Jan Beulich [Tue, 4 Jun 2019 13:33:53 +0000 (15:33 +0200)]
AMD/IOMMU: disable previously enabled IOMMUs upon init failure
If any IOMMUs were successfully initialized before encountering failure,
the successfully enabled ones should be disabled again before cleaning
up their resources.
Move disable_iommu() next to enable_iommu() to avoid a forward
declaration, and take the opportunity to remove stray blank lines ahead
of both functions' final closing braces.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 87a3347d476443c66c79953d77d6aef1d2bb3bbd
master date: 2019-05-13 09:52:43 +0200
Jan Beulich [Tue, 4 Jun 2019 13:32:55 +0000 (15:32 +0200)]
trace: fix build with gcc9
While I've not observed this myself, gcc 9 (imo validly) reportedly may
complain
trace.c: In function '__trace_hypercall':
trace.c:826:19: error: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Werror=address-of-packed-member]
826 | uint32_t *a = d.args;
and the fix is rather simple - remove the __packed attribute. Introduce
a BUILD_BUG_ON() as a replacement, for the unlikely case that Xen might
get ported to an architecture where array alignment is higher than that
of its elements.
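A standalone illustration of the replacement check, using C11's _Static_assert in place of Xen's BUILD_BUG_ON() (struct layout hypothetical):

    #include <stdint.h>

    /* Once __packed is gone, assert that the record's alignment does not
     * exceed its elements', which is what 'uint32_t *a = d.args' relies on. */
    struct trace_args_sketch {
        uint32_t op;
        uint32_t args[6];
    };

    _Static_assert(_Alignof(struct trace_args_sketch) == _Alignof(uint32_t),
                   "unexpected alignment for trace argument block");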
Reported-by: Martin Liška <martin.liska@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 3fd3b266d4198c06e8e421ca515d9ba09ccd5155
master date: 2019-05-13 09:51:23 +0200
Igor Druzhinin [Tue, 9 Apr 2019 12:01:58 +0000 (13:01 +0100)]
tools/xl: use libxl_domain_info to get domain type for vcpu-pin
Parsing the config seems to be overkill for this particular task,
and the config might simply be absent. The type returned from
libxl_domain_info should be either LIBXL_DOMAIN_TYPE_HVM or
LIBXL_DOMAIN_TYPE_PV, but in that context the distinction between PVH
and HVM should be irrelevant.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 48dab9767d2eb173495707cb1fd8ceaf73604ac1)
Jan Beulich [Wed, 15 May 2019 07:42:17 +0000 (09:42 +0200)]
x86: fix build race when generating temporary object files
The rules to generate xen-syms and xen.efi may run in parallel, but both
recursively invoke $(MAKE) to build symbol/relocation table temporary
object files. These recursive builds would both re-generate the .*.d2
files (where needed). Both would in turn invoke the same rule, thus
allowing for a race on the .*.d2.tmp intermediate files.
The dependency files of the temporary .xen*.o files live in xen/ rather
than xen/arch/x86/ anyway, so won't be included no matter what. Take the
opportunity and delete them, as the just re-generated .xen*.S files will
trigger a proper re-build of the .xen*.o ones anyway.
Empty the DEPS variable in case the set of goals consists of just those
temporary object files, thus eliminating the race.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 761bb575ce97255029d2d2249b2719e54bc76825
master date: 2019-04-11 10:25:05 +0200
VT-d: posted interrupts require interrupt remapping
Initially I had just noticed the unnecessary indirection in the call
from pi_update_irte(). The generic wrapper having an iommu_intremap
conditional made me look at the setup code though. So first of all
enforce the necessary dependency.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c54663786d9f1ed04153867687c158675e7277d
master date: 2019-04-09 15:12:07 +0200
Igor Druzhinin [Wed, 15 May 2019 07:40:45 +0000 (09:40 +0200)]
x86/vmx: Fixup removals of MSR load/save list entries
Commit 540d5422 ("x86/vmx: Support removing MSRs from the host/guest
load/save lists") introduced infrastructure finally exposed by
commit fd32dcfe ("x86/vmx: Don't leak EFER.NXE into guest context")
that led to a functional regression on Harpertown and earlier cores
(Gen 1 VT-x) due to MSR count being incorrectly set in VMCS.
As a result, as soon as the guest EFER becomes equal to the Xen EFER
(which eventually happens in almost every 64-bit VM) and its MSR
entry is supposed to be removed, a stale version of EFER is loaded
into the guest instead, causing almost immediate guest failure.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: e28c0ee3356f52f589bbae54e89aaed25c1f599d
master date: 2019-04-09 10:58:18 +0100
Andrew Cooper [Wed, 15 May 2019 07:40:00 +0000 (09:40 +0200)]
xen/timers: Fix memory leak with cpu unplug/plug
timer_softirq_action() realloc's itself a larger timer heap whenever
necessary, which includes bootstrapping from the empty dummy_heap. Nothing
ever freed this allocation.
CPU plug and unplug has the side effect of zeroing the percpu data area, which
clears ts->heap. This in turn causes new timers to be put on the list rather
than the heap, and for timer_softirq_action() to bootstrap itself again.
This in practice leaks ts->heap every time a CPU is unplugged and replugged.
Implement free_percpu_timers() which includes freeing ts->heap when
appropriate, and update the notifier callback with the recent cpu parking
logic and free-avoidance across suspend.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cpu: Fix ARM build following c/s 597fbb8
c/s 597fbb8 "xen/timers: Fix memory leak with cpu unplug/plug" broke the ARM
build by being the first patch to add park_offline_cpus to common code.
While it is currently specific to Intel hardware (for reasons of being able to
handle machine check exceptions without an immediate system reset), it isn't
inherently architecture specific, so define it to be false on ARM for now.
Add a comment in both smp.h headers explaining the intended behaviour of the
option.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
timers: move back migrate_timers_from_cpu() invocation
Commit 597fbb8be6 ("xen/timers: Fix memory leak with cpu unplug/plug")
went a little too far: Migrating timers away from a CPU being offlined
needs to happen independently of whether it gets parked or fully offlined.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
xen/timers: Fix memory leak with cpu unplug/plug (take 2)
Previous attempts to fix this leak did not identify the root cause, and
so ultimately failed. The cause is the CPU_UP_PREPARE case (re)initialising
ts->heap back to dummy_heap, which leaks the previous allocation.
Rearrange the logic to only initialise ts once. This also avoids the
redundant (but benign, due to ts->inactive always being empty) initialising of
the other ts fields.
Jan Beulich [Wed, 15 May 2019 07:38:59 +0000 (09:38 +0200)]
x86emul: don't read mask register on AVX512F-incapable platforms
Nor when register state isn't sufficiently enabled.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6cb7e52edf823fd89fe14da94f9bf3e5cf99d1ff
master date: 2019-04-05 17:27:13 +0200
Petre Pircalabu [Wed, 15 May 2019 07:38:13 +0000 (09:38 +0200)]
vm_event: fix XEN_VM_EVENT_RESUME domctl
Make XEN_VM_EVENT_RESUME return 0 in case of success, instead of
-EINVAL.
Remove vm_event_resume from the vm_event.h header and make the function
static, as it is used only in vm_event.c.
Move the vm_event_check_ring test inside vm_event_resume in order to
simplify the code.
Jan Beulich [Wed, 15 May 2019 07:36:09 +0000 (09:36 +0200)]
x86emul: suppress general register update upon AVX gather failures
While destination and mask registers may indeed need updating in this
case, the rIP update in particular needs to be avoided, as well as e.g.
raising a single step trap.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 74f299bbd7d5cc52325b5866c17b44dd0bd1c5a2
master date: 2019-04-03 10:14:32 +0200
Juergen Gross [Wed, 15 May 2019 07:34:57 +0000 (09:34 +0200)]
xen/sched: fix credit2 smt idle handling
Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
identify idle cores where vcpus can be moved to. A core is thought to
be idle when all siblings are known to have the idle vcpu running on
them.
Unfortunately the information of a vcpu running on a cpu is per
runqueue. So in case not all siblings are in the same runqueue a core
will never be regarded to be idle, as the sibling not in the runqueue
is never known to run the idle vcpu.
Use a credit2 specific cpumask of siblings with only those cpus
being marked which are in the same runqueue as the cpu in question.
Andrew Cooper [Wed, 12 Dec 2018 19:22:15 +0000 (19:22 +0000)]
x86/spec-ctrl: Introduce options to control VERW flushing
The Microarchitectural Data Sampling vulnerability is split into categories
with subtly different properties:
MLPDS - Microarchitectural Load Port Data Sampling
MSBDS - Microarchitectural Store Buffer Data Sampling
MFBDS - Microarchitectural Fill Buffer Data Sampling
MDSUM - Microarchitectural Data Sampling Uncacheable Memory
MDSUM is a special case of the other three, and isn't distinguished further.
These issues pertain to three microarchitectural buffers. The Load Ports, the
Store Buffers and the Fill Buffers. Each of these structures are flushed by
the new enhanced VERW functionality, but the conditions under which flushing
is necessary vary.
For this concise overview of the issues and default logic, the abbreviations
SP (Store Port), FB (Fill Buffer), LP (Load Port) and HT (Hyperthreading) are
used for brevity:
* Vulnerable hardware is divided into two categories - parts which suffer
from SP only, and parts with any other combination of vulnerabilities.
* SP only has an HT interaction when the thread goes idle, due to the static
partitioning of resources. LP and FB have HT interactions at all points,
due to the competitive sharing of resources. All issues potentially leak
data across the return-to-guest transition.
* The microcode which implements VERW flushing also extends MSR_FLUSH_CMD, so
we don't need to do both on the HVM return-to-guest path. However, some
parts are not vulnerable to L1TF (therefore have no MSR_FLUSH_CMD), but are
vulnerable to MDS, so do require VERW on the HVM path.
Note that we deliberately support mds=1 even without MD_CLEAR in case the
microcode has been updated but the feature bit not exposed.
This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3c04c258ab40405a74e194d9889a4cbc7abe94b4)
Andrew Cooper [Wed, 12 Dec 2018 19:22:15 +0000 (19:22 +0000)]
x86/spec-ctrl: Infrastructure to use VERW to flush pipeline buffers
Three synthetic features are introduced, as we need individual control of
each, depending on circumstances. A later change will enable them at
appropriate points.
The verw_sel field doesn't strictly need to live in struct cpu_info. It lives
there because there is a convenient hole it can fill, and it reduces the
complexity of the SPEC_CTRL_EXIT_TO_{PV,HVM} assembly by avoiding the need for
any temporary stack maintenance.
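A sketch of the flush primitive itself (the selector must reference a valid, writable data segment; names are illustrative):

    #include <stdint.h>

    /* VERW with a suitable selector triggers the buffer-clearing side
     * effect on MD_CLEAR-capable microcode.  'verw_sel' stands in for
     * the field the commit adds to struct cpu_info. */
    static inline void verw_flush(const uint16_t *verw_sel)
    {
        asm volatile ( "verw %0" :: "m" (*verw_sel) : "cc" );
    }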
This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 548a932ac786d6bf3584e4b54f2ab993e1117710)
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: CPUID/MSR definitions for Microarchitectural Data Sampling
The MD_CLEAR feature can be automatically offered to guests. No
infrastructure is needed in Xen to support the guest making use of it.
This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4f6116c080dc013cd1204c4d8ceb95e5f278689)
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: Misc non-functional cleanup
* Identify BTI in the spec_ctrl_{enter,exit}_idle() comments, as other
mitigations will shortly appear.
* Use alternative_input() and cover the lack of a memory clobber with a further
barrier.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b62eba6c429c327e1507816bef403ccc87357ae)
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (13:26 +0100)]
x86/boot: Detect the firmware SMT setting correctly on Intel hardware
While boot_cpu_data.x86_num_siblings is an accurate value to use on AMD
hardware, it isn't on Intel when the user has disabled Hyperthreading in the
firmware. As a result, a user who has chosen to disable HT still gets
nagged on L1TF-vulnerable hardware when they haven't chosen an explicit
smt=<bool> setting.
Make use of the largely-undocumented MSR_INTEL_CORE_THREAD_COUNT which in
practice exists since Nehalem, when booting on real hardware. Fall back to
using the ACPI table APIC IDs.
While adjusting this logic, fix a latent bug in amd_get_topology(). The
thread count field in CPUID.0x8000001e.ebx is documented as 8 bits wide,
rather than 2 bits wide.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b12fec4a125950240573ea32f65c61fb9afa74c3)
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (12:26 +0000)]
x86/msr: Definitions for MSR_INTEL_CORE_THREAD_COUNT
This is a model specific register which details the current configuration
of cores and threads in the package. Because of how Hyperthreading and Core
configuration works in firmware, the MSR is de-facto constant and
will remain unchanged until the next system reset.
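A sketch of decoding the MSR value; the register address and the field layout (threads in bits 15:0, cores in bits 31:16) are stated here as assumptions from Intel's documentation:

    #include <stdint.h>

    #define MSR_INTEL_CORE_THREAD_COUNT 0x00000035

    /* Decode a raw rdmsr() result into thread and core counts. */
    static inline void decode_core_thread_count(uint64_t val,
                                                unsigned int *threads,
                                                unsigned int *cores)
    {
        *threads = val & 0xffff;         /* bits 15:0 */
        *cores   = (val >> 16) & 0xffff; /* bits 31:16 */
    }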
It is a read only MSR (so unilaterally reject writes), but for now retain its
leaky-on-read properties. Further CPUID/MSR work is required before we can
start virtualising a consistent topology to the guest, and retaining the old
behaviour is the safest course of action.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4120936bcd1695faf5b575f1259c58e31d2b18b)
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: Reposition the XPTI command line parsing logic
It has ended up in the middle of the mitigation calculation logic. Move it to
be beside the other command line parsing.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c2c2bb0d60c642e64a5243a79c8b1548ffb7bc5b)
Andrew Cooper [Fri, 3 May 2019 08:45:45 +0000 (10:45 +0200)]
x86/spec-ctrl: Extend retpoline safety calculations for eIBRS and Atom parts
All currently-released Atom processors are in practice retpoline-safe, because
they don't fall back to a BTB prediction on RSB underflow.
However, an additional meaning of Enhanced IBRS is that the processor may not
be retpoline-safe. The Gemini Lake platform, based on the Goldmont Plus
microarchitecture, is the first Atom processor to support eIBRS.
Until Xen gets full eIBRS support, Gemini Lake will still be safe using
regular IBRS.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 17f74242ccf0ce6e51c03a5860947865c0ef0dc2
master date: 2019-03-18 16:26:40 +0000
Igor Druzhinin [Fri, 3 May 2019 08:44:23 +0000 (10:44 +0200)]
x86/hvm: finish IOREQs correctly on completion path
Since the introduction of linear_{read,write}() helpers in 3bdec530a5
(x86/HVM: split page straddling emulated accesses in more cases) the
completion path for IOREQs has been broken: if there is an IOREQ in
progress but hvm_copy_{to,from}_guest_linear() returns HVMTRANS_okay
(e.g. when P2M type of source/destination has been changed by IOREQ
handler) the execution will never re-enter hvmemul_do_io() where
IOREQs are completed. This usually results in a domain crash upon
the execution of the next IOREQ entering hvmemul_do_io() and finding
the remnants of the previous IOREQ in the state machine.
This particular issue has been discovered in relation to the p2m_ioreq_server
type, where an emulator changed the memory type between p2m_ioreq_server
and p2m_ram_rw while responding to an IOREQ, which made
hvm_copy_..() behave differently on the way back.
Fix it for now by checking whether IOREQ completion is required (which
can be identified by querying the MMIO cache) before trying to finish
a memory access immediately through hvm_copy_..(), and re-entering
hvmemul_do_io() otherwise. This change alone only addresses the IOREQ
completion issue for a P2M type changing from MMIO to RAM in the
middle of emulation, and leaves the case where new IOREQs might be
introduced by P2M changes from RAM to MMIO (which is less likely
to occur in practice); that requires more substantial changes in the
MMIO emulation code.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 522a2f3c5c89cc78c0e2b05af924b76cef7d4bff
master date: 2019-03-18 16:29:21 +0100
Igor Druzhinin [Fri, 3 May 2019 08:43:49 +0000 (10:43 +0200)]
x86/hvm: split all linear reads and writes at page boundary
Ruling out page straddling at linear level makes it easier to
distinguish chunks that require proper handling as MMIO access
and not complete them as page straddling memory transactions
prematurely. This doesn't change the general behavior.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d527ba310dc6695bba2df118ff9e053f7e40c82
master date: 2019-03-18 16:28:45 +0100
Jan Beulich [Fri, 3 May 2019 08:43:13 +0000 (10:43 +0200)]
x86/e820: fix build with gcc9
e820.c: In function ‘clip_to_limit’:
.../xen/include/asm/string.h:10:26: error: ‘__builtin_memmove’ offset [-16, -36] is out of the bounds [0, 20484] of object ‘e820’ with type ‘struct e820map’ [-Werror=array-bounds]
10 | #define memmove(d, s, n) __builtin_memmove(d, s, n)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
e820.c:404:13: note: in expansion of macro ‘memmove’
404 | memmove(&e820.map[i], &e820.map[i+1],
| ^~~~~~~
e820.c:36:16: note: ‘e820’ declared here
36 | struct e820map e820;
| ^~~~
While I can't see where the negative offsets would come from, converting
the loop index to unsigned type helps. Take the opportunity and also
convert several other local variables and copy_e820_map()'s second
parameter to unsigned int (and bool in one case).
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 22e2f8dddf5fbed885b5e4db3ffc9e1101be9ec0
master date: 2019-03-18 11:38:36 +0100
Andrew Cooper [Fri, 3 May 2019 08:40:10 +0000 (10:40 +0200)]
x86/pv: Fix construction of 32bit dom0's
dom0_construct_pv() has logic to transition dom0 into a compat domain when
booting an ELF32 image.
One aspect which is missing is the CPUID policy recalculation, meaning that a
32bit dom0 sees a 64bit policy, which differ by the Long Mode feature flag in
particular. Another missing item is the x87_fip_width initialisation.
Update dom0_construct_pv() to use switch_compat(), rather than retaining the
open-coding. Position the call to switch_compat() such that the compat32 local
variable can disappear entirely.
The 32bit monitor table is now created by setup_compat_l4(), avoiding the need
for manual creation later.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 356f437171c5bb90701ac9dd7ba4dbbd05988e38
master date: 2019-03-15 14:59:27 +0000
Andrew Cooper [Fri, 3 May 2019 08:39:29 +0000 (10:39 +0200)]
x86/tsx: Implement controls for RTM force-abort mode
The CPUID bit and MSR are deliberately not exposed to guests, because they
won't exist on newer processors. As vPMU isn't security supported, the
misbehaviour of PCR3 isn't expected to impact production deployments.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6be613f29b4205349275d24367bd4c82fb2960dd
master date: 2019-03-12 17:05:21 +0000
Andrew Cooper [Fri, 3 May 2019 08:38:53 +0000 (10:38 +0200)]
x86/vtd: Don't include control register state in the table pointers
iremap_maddr and qinval_maddr point to the base of a block of contiguous RAM,
allocated by the driver, holding the Interrupt Remapping table, and the Queued
Invalidation ring.
Despite their name, they are actually the values of the hardware register,
including control metadata in the lower 12 bits. While uses of these fields
do appear to correctly shift out the metadata, this is very subtle behaviour
and confusing to follow.
Nothing uses the metadata, so make the fields actually point at the base of
the relevant tables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: a9a05aeee10a5a3763a41305a9f38112dd1fcc82
master date: 2019-03-12 13:57:13 +0000
Jan Beulich [Fri, 3 May 2019 08:37:58 +0000 (10:37 +0200)]
x86/HVM: don't crash guest in hvmemul_find_mmio_cache()
Commit 35a61c05ea ("x86emul: adjust handling of AVX2 gathers") builds
upon the fact that the domain will actually survive running out of MMIO
result buffer space. Drop the domain_crash() invocation. Also delay
incrementing of the usage counter, such that the function can't possibly
use/return an out-of-bounds slot/pointer in case execution subsequently
makes it into the function again without a prior reset of state.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: a43c1dec246bdee484e6a3de001cc6850a107abe
master date: 2019-03-12 14:39:46 +0100
Igor Druzhinin [Fri, 3 May 2019 08:36:47 +0000 (10:36 +0200)]
iommu: leave IOMMU enabled by default during kexec crash transition
It's unsafe to disable the IOMMU on a live system - which is the case
if we're crashing - since remapping hardware doesn't usually know what
to do with ongoing bus transactions and frequently raises NMI/MCE/SMI,
etc. (depending on the firmware configuration) to signal these abnormalities.
This, in turn, doesn't play well with the kexec transition process, as there
is no handling available at the moment for this kind of event, resulting
in failures to enter the kernel.
Modern Linux kernels have been taught to copy all the necessary DMAR/IR
tables on kexec from the previous kernel (Xen in our case) - so it's
currently normal to keep the IOMMU enabled. It might require minor changes
to the kdump command line to enable the IOMMU drivers (e.g. intel_iommu=on /
intremap=on), but recent kernels don't require any additional changes for
the transition to be transparent.
A fallback option is still left for compatibility with ancient crash
kernels which didn't like to have IOMMU active under their feet on boot.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 12c36f577d454996c882ecdc5da8113ca2613646
master date: 2019-03-12 14:38:12 +0100
Andrew Cooper [Thu, 21 Mar 2019 19:36:48 +0000 (19:36 +0000)]
passthrough/vtd: Drop the "workaround_bios_bug" logic entirely
It turns out that this code was previously dead.
c/s dcf41790 " x86/mmcfg/drhd: Move acpi_mmcfg_init() call before calling
acpi_parse_dmar()" resulted in PCI segment 0 now having been initialised
enough for acpi_parse_one_drhd() to not take the
/* Skip checking if segment is not accessible yet. */
path unconditionally. However, some systems have DMAR tables which list
devices which are disabled by user choice (in particular, Dell PowerEdge R740
with I/O AT DMA disabled), and turning off all IOMMU functionality in this
case is entirely unhelpful behaviour.
Leave the warning which identifies the problematic devices, but drop the
remaining logic. This leaves the system in better overall state, and working
in the same way that it did in previous releases.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 74dadb8556c6a0972fa422b5ae346589ace404b6)
Wei Liu [Wed, 20 Mar 2019 15:43:38 +0000 (15:43 +0000)]
libxc: fix HVM core dump
f969bc9fc96 forbade the get_address_size call on HVM guests, because that
didn't make sense. It broke core dump functionality on HVM because
libxc unconditionally asked for the guest width.
Force guest_width to a sensible value.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 59e9783ddf18e650622e0573cad4f08db65592e4)
George Dunlap [Tue, 5 Mar 2019 12:48:52 +0000 (12:48 +0000)]
README: Document python2 dependency
Much of the tools and configure makefile actually have a python2
dependency; specify this. It also assumes that `python` points to `python2`;
document how to work around this on systems where this is false.
Also update the second version requirement listed to match the first.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 5 Mar 2019 17:04:23 +0000 (18:04 +0100)]
x86/cpuid: add missing PCLMULQDQ dependency
Since we can't seem to be able to settle our discussion for the wider
adjustment previously posted, let's at least add the missing dependency
for 4.12. I'm not convinced though that attaching it to SSE is correct.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Tue, 5 Mar 2019 16:41:14 +0000 (17:41 +0100)]
x86/dom0: propagate PVH vlapic EOIs to hardware
The current check for MSI EOI is missing a special case for PVH Dom0,
which doesn't have a hvm_irq_dpci struct but requires EOIs to be
forwarded to the physical lapic for passed-through devices.
Add a short-circuit to allow EOIs from PVH Dom0 to be propagated.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Ian Jackson [Tue, 5 Mar 2019 15:31:32 +0000 (15:31 +0000)]
tools/libfsimage: Add `XEN' to environment variable name
This library, which is private to Xen and was properly namespaced in
1a814711881beb17f073f5f57e27e5bd4da1b956 ("tools/libfsimage: Add `xen'
to .h names and principal .so name"), honours an environment variable
to override the directory where shared objects (ie filesystem plugins)
are to be loaded from.
Rename that variable from FSIMAGE_FSDIR to XEN_FSIMAGE_FSDIR, to give
it a proper namespace prefix.
Nothing in xen.git sets this variable. The three hits for the string
`FSIMAGE_FSDIR' are this getenv, and two references to a compile-time
manifest constant which provides the default value (the -D which sets
it, and the place it is used).
I have also checked the current Debian Xen package in buster and the
variable is not set there either.
CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Jan Beulich <JBeulich@suse.com> CC: George Dunlap <george.dunlap@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 5 Mar 2019 16:02:36 +0000 (17:02 +0100)]
x86/mm: fix #GP(0) in switch_cr3_cr4()
With "pcid=no-xpti" and opposite XPTI settings in two 64-bit PV domains
(achievable with one of "xpti=no-dom0" or "xpti=no-domu"), switching
from a PCID-disabled to a PCID-enabled 64-bit PV domain fails to set
CR4.PCIDE in time, as CR4.PGE would not be set in either (see
pv_fixup_guest_cr4(), in particular as used by write_ptbase()), and
hence the early CR4 write would be skipped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 5 Mar 2019 12:54:42 +0000 (13:54 +0100)]
x86/pv: _toggle_guest_pt() may not skip TLB flush for shadow mode guests
For shadow mode guests (e.g. PV ones forced into that mode as L1TF
mitigation, or during migration) update_cr3() -> sh_update_cr3() may
result in a change to the (shadow) root page table (compared to the
previous one when running the same vCPU with the same PCID). This can,
first and foremost, be a result of memory pressure on the shadow memory
pool of the domain. Shadow code legitimately relies on the original
(prior to commit 5c81d260c2 ["xen/x86: use PCID feature"]) behavior of
the subsequent CR3 write to flush the TLB of entries still left from
walks with an earlier, different (shadow) root page table.
Restore the flushing behavior, also for the second CR3 write on the exit
path to guest context when XPTI is active. For the moment accept that
this will introduce more flushes than are strictly necessary - no flush
would be needed when the (shadow) root page table doesn't actually
change, but this information isn't readily (i.e. without introducing a
layering violation) available here.
This is XSA-294.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 5 Mar 2019 12:54:05 +0000 (13:54 +0100)]
x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back
Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware. Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they
are disabled.
The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions. Xen's current behaviour undermines this
expectation.
In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4. This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.
* Delete the cpu_has_fsgsbase helper. It is no longer safe, as users need to
check %cr4 directly.
* The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
is set. Comment this property.
* The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
the current %cr4 value to determine which mechanism to use.
* toggle_guest_mode() and save_segments() are updated to avoid reading
fs/gsbase if the values in hardware cannot be stale WRT struct vcpu. A
consequence of this is that the write_cr() path needs to cache the current
bases, as subsequent context switches will skip saving the values.
* write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
observed in a safe way WRT the hardware setting, if an interrupt happens to
hit in the middle.
* load_segments() is updated to use the VMLOAD optimisation if FSGSBASE is
unavailable, even if only gs_shadow needs updating. As a minor perf
improvement, check cpu_has_svm first to short circuit a context-dependent
conditional on Intel hardware.
* pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
choice of FSGSBASE.
This is part of XSA-293.
Reported-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 5 Mar 2019 12:53:32 +0000 (13:53 +0100)]
x86/pv: Rewrite guest %cr4 handling from scratch
The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).
The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.
Rewrite the cr4 logic to be invariant of the current value in hardware.
First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts. mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.
For PV guests, curr->arch.pv.ctrlreg[4] remains the guests view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.
Two masks are introduced; bits which the guest has control over, and bits
which are forwarded from Xen's settings. One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.
pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.
The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days. pv_fixup_guest_cr4() gains a related
derivation from the policy.
Another guest visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4. While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.
Overall, this patch shouldn't have a practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.
This is part of XSA-293.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 5 Mar 2019 12:52:44 +0000 (13:52 +0100)]
x86/mm: properly flush TLB in switch_cr3_cr4()
The CR3 values used for contexts run with PCID enabled uniformly have
CR3.NOFLUSH set, resulting in the CR3 write itself not causing any
flushing at all. When the second CR4 write is skipped or doesn't do any
flushing, there's nothing so far which would purge TLB entries which may
have accumulated again if the PCID doesn't change; the "just in case"
flush only affects the case where the PCID actually changes. (There may
be particularly many TLB entries re-accumulated in case of a watchdog
NMI kicking in during the critical time window.)
Suppress the no-flush behavior of the CR3 write in this particular case.
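In outline (the bit position is the architectural one; the helper name is illustrative):

    #include <stdint.h>

    #define X86_CR3_NOFLUSH (1ULL << 63)

    /* Clearing the no-flush hint turns this particular CR3 write back
     * into a flushing one, purging entries accumulated under the
     * unchanged PCID. */
    static inline uint64_t make_flushing_cr3(uint64_t cr3)
    {
        return cr3 & ~X86_CR3_NOFLUSH;
    }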
Similarly the second CR4 write may not cause any flushing of TLB entries
established again while the original PCID was still in use - it may get
performed because of unrelated bits changing. The flush of the old PCID
needs to happen nevertheless.
At the same time also eliminate a possible race with lazy context
switch: Just like for CR4, CR3 may change at any time while interrupts
are enabled, due to the __sync_local_execstate() invocation from the
flush IPI handler. It is for that reason that the CR3 read, just like
the CR4 one, must happen only after interrupts have been turned off.
This is XSA-292.
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Jan Beulich [Tue, 5 Mar 2019 12:52:15 +0000 (13:52 +0100)]
x86/mm: don't retain page type reference when IOMMU operation fails
The IOMMU update in _get_page_type() happens between recording of the
new reference and validation of the page for its new type (if
necessary). If the IOMMU operation fails, there's no point in actually
carrying out validation. Furthermore, with this resulting in failure
getting indicated to the caller, the recorded type reference also needs
to be dropped again.
Note that in case of failure of alloc_page_type() there's no need to
undo the IOMMU operation: Only special types get handed to the function.
The function, upon failure, clears ->u.inuse.type_info, effectively
converting the page to PGT_none. The IOMMU mapping, however, solely
depends on whether the type is PGT_writable_page.
This is XSA-291.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 5 Mar 2019 12:51:46 +0000 (13:51 +0100)]
x86/mm: add explicit preemption checks to L3 (un)validation
When recursive page tables are used at the L3 level, unvalidation of a
single L4 table may incur unvalidation of two levels of L3 tables, i.e.
a maximum iteration count of 512^3 for unvalidating an L4 table. The
preemption check in free_l2_table() as well as the one in
_put_page_type() may never be reached, so explicit checking is needed in
free_l3_table().
When recursive page tables are used at the L4 level, the iteration count
at L4 alone is capped at 512^2. As soon as a present L3 entry is hit
which itself needs unvalidation (and hence requiring another nested loop
with 512 iterations), the preemption checks added here kick in, so no
further preemption checking is needed at L4 (until we decide to permit
5-level paging for PV guests).
The validation side additions are done just for symmetry.
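In outline, the added checks follow the usual Xen continuation idiom (hypercall_preempt_check() and -ERESTART are existing mechanisms; the loop body and surrounding declarations are elided):

    /* Outline of an explicitly preemptible L3 (un)validation loop;
     * i, rc and L3_PAGETABLE_ENTRIES as in the surrounding Xen code. */
    for ( i = 0; i < L3_PAGETABLE_ENTRIES; i++ )
    {
        /* Skip the check on iteration 0 so progress is always made. */
        if ( i && hypercall_preempt_check() )
        {
            rc = -ERESTART;   /* caller records 'i' and re-enters later */
            break;
        }

        /* ... (un)validate slot i ... */
    }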
This is part of XSA-290.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 5 Mar 2019 12:51:18 +0000 (13:51 +0100)]
x86/mm: also allow L2 (un)validation to be fully preemptible
Commit c612481d1c ("x86/mm: Plumbing to allow any PTE update to fail
with -ERESTART") added assertions next to the {alloc,free}_l2_table()
invocations to document (and validate in debug builds) that L2
(un)validations are always preemptible.
The assertion in free_page_type() was now observed to trigger when
recursive L2 page tables get cleaned up.
In particular put_page_from_l2e()'s assumption that _put_page_type()
would always succeed is now wrong, resulting in a partially un-validated
page left in a domain, which has no other means of getting cleaned up
later on. If not causing any problems earlier, this would ultimately
trigger the check for ->u.inuse.type_info having a zero count when
freeing the page during cleanup after the domain has died.
As a result it should be considered a mistake to not have extended
preemption fully to L2 when it was added to L3/L4 table handling, which
this change aims to correct.
The validation side additions are done just for symmetry.
This is part of XSA-290.
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Wed, 23 Jan 2019 11:57:46 +0000 (11:57 +0000)]
xen: Make coherent PV IOMMU discipline
In order for a PV domain to set up DMA from a passed-through device to
one of its pages, the page must be mapped in the IOMMU. On the other
hand, before a PV page may be used as a "special" page type (such as a
pagetable or descriptor table), it _must not_ be writable in the IOMMU
(otherwise a malicious guest could DMA arbitrary page tables into the
memory, bypassing Xen's safety checks); and Xen's current rule is to
have such pages not in the IOMMU at all.
At the moment, in order to accomplish this, the code borrows the HVM
domains' "physmap" concept: When a page is assigned to a guest,
guest_physmap_add_entry() is called, which for PV guests will create
a writable IOMMU mapping; and when a page is removed,
guest_physmap_remove_entry() is called, which will remove the mapping.
Additionally, when a page gains the PGT_writable page type, the page
will be added into the IOMMU; and when the page changes away from a
PGT_writable type, the page will be removed from the IOMMU.
Unfortunately, borrowing the "physmap" concept from HVM domains is
problematic. HVM domains have a lock on their p2m tables, ensuring
synchronization between modifications to the p2m; and all hypercall
parameters must first be translated through the p2m before being used.
Trying to mix this locked-and-gated approach with PV's lock-free
approach leads to several races and inconsistencies:
* A race between a page being assigned and it being put into the
physmap; for example:
- P1: call populate_physmap() { A = allocate_domheap_pages() }
- P2: Guess page A's mfn, and call decrease_reservation(A). A is owned by the domain,
and so Xen will clear the PGC_allocated bit and free the page
- P1: finishes populate_physmap() { guest_physmap_add_entry() }
Now the domain has a writable IOMMU mapping to a page it no longer owns.
* Pages start out as type PGT_none, but with a writable IOMMU mapping.
If a guest uses a page as a page table without ever having created a
writable mapping, the IOMMU mapping will not be removed; the guest
will have a writable IOMMU mapping to a page it is currently using
as a page table.
* A newly-allocated page can be DMA'd into with no special actions on
the part of the guest; however, if a page is promoted to a
non-writable type, the page must gain a writable type (and hence an
IOMMU mapping) before DMA'ing to it again, or the transaction will fail.
To fix this, do away with the "PV physmap" concept entirely, and
replace it with the following IOMMU discipline for PV guests:
- (type == PGT_writable) <=> in iommu (even if type_count == 0)
- Upon a final put_page(), check to see if type is PGT_writable; if so,
iommu_unmap.
In order to achieve that:
- Remove PV IOMMU related code from guest_physmap_*
- Repurpose cleanup_page_cacheattr() into a general
cleanup_page_mappings() function, which will both fix up Xen
mappings for pages with special cache attributes, and also check for
a PGT_writable type and remove the IOMMU mapping if appropriate
(sketched after this list).
- For compatibility with current guests, grab-and-release a
PGT_writable_page type for PV guests in guest_physmap_add_entry().
This will cause most "normal" guest pages to start out life with
PGT_writable_page type (and thus an IOMMU mapping), but no type
count (so that they can be used as special cases at will).
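Condensed, the new end-of-life handling looks like this (a sketch only;
cleanup_page_mappings() is the function named above, while
iommu_unmap_page() merely stands in for the actual unmap call):
    static void cleanup_page_mappings(struct page_info *page)
    {
        /* ... existing fixups for special cache attributes ... */

        /*
         * (type == PGT_writable) <=> in iommu: a page freed while
         * still PGT_writable will see no further type change, so
         * its IOMMU mapping must be torn down here.
         */
        if ( (page->u.inuse.type_info & PGT_type_mask) ==
             PGT_writable_page )
            iommu_unmap_page(page_get_owner(page), page);
    }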
Also, note that there is one exception to the "PGT_writable => in
iommu" rule: xenheap pages shared with guests may be given a
PGT_writable type with one type reference. This reference prevents
the type from changing, which in turn prevents the page from gaining
an IOMMU mapping in get_page_type(). It's not clear whether this was
intentional or not, but it's not something to change in a security
update.
This is XSA-288.
Reported-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
steal_page, in a misguided attempt to protect against unknown races,
violates both of these rules, thus introducing other races:
- Temporarily, the count_info has the refcount go to 0 while
PGC_allocated is set
- It explicitly returns the page with PGC_allocated set, but owner == NULL
and the page not on the page_list.
The second one meant that page_get_owner_and_reference() could return
NULL even after having successfully grabbed a reference on the page,
leading the caller to leak the reference (since "couldn't get ref" and
"got ref but no owner" look the same).
Furthermore, rather than grabbing a page reference to ensure that the
owner doesn't change under its feet, it appears to rely on holding
d->page_alloc lock to prevent this.
Unfortunately, this is ineffective: page->owner remains non-NULL for
some time after the count has been set to 0; meaning that it would be
entirely possible for the page to be freed and re-allocated to a
different domain between the page_get_owner() check and the count_info
check.
Modify steal_page to instead follow the appropriate access discipline,
taking the page through a series of states similar to being freed and
then re-allocated with MEMF_no_owner (sketched after the list below):
- Grab an extra reference to make sure we don't race with anyone else
freeing the page
- Drop both references and PGC_allocated atomically, so that (if
successful), anyone else trying to grab a reference will fail
- Attempt to reset Xen's mappings
- Reset the rest of the state.
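Condensed, that sequence is (close in spirit to the real steal_page(),
but simplified and with error handling abbreviated):
    /* 1: extra reference, so nobody can free the page under us. */
    if ( !get_page(page, d) )
        return -EINVAL;

    /*
     * 2: require exactly our reference plus PGC_allocated, and drop
     * both atomically; a racing page_get_owner_and_reference() now
     * fails cleanly instead of succeeding with a NULL owner.
     */
    y = page->count_info;
    do {
        x = y;
        if ( (x & (PGC_count_mask | PGC_allocated)) !=
             (2 | PGC_allocated) )
        {
            put_page(page);
            return -EINVAL;
        }
        y = cmpxchg(&page->count_info, x,
                    x & ~(PGC_count_mask | PGC_allocated));
    } while ( y != x );

    /* 3 and 4: reset Xen's mappings, then the remaining state. */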
Then, modify the two callers appropriately:
- Leave count_info alone (it's already been cleared)
- Call free_domheap_page() directly if appropriate
- Call assign_pages() rather than open-coding a partial assign
With all callers to assign_pages() now passing in pages with the
type_info field clear, tighten the respective assertion there.
This is XSA-287.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 5 Mar 2019 12:47:36 +0000 (13:47 +0100)]
IOMMU/x86: fix type ref-counting race upon IOMMU page table construction
When arch_iommu_populate_page_table() gets invoked for an already
running guest, simply looking at page types once isn't enough, as they
may change at any time. Add logic to re-check the type after having
mapped the page, unmapping it again if needed.
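The essence of the added logic (an illustrative sketch; the map/unmap
helper names approximate the interfaces of the time, and IOMMUF_* are
the real flag names):
    /* Map first, then look at the type again: it may have changed
     * while (or after) the mapping was established. */
    rc = iommu_map_page(d, gfn, mfn, IOMMUF_readable | IOMMUF_writable);
    if ( !rc &&
         (page->u.inuse.type_info & PGT_type_mask) != PGT_writable_page )
        rc = iommu_unmap_page(d, gfn);    /* type changed: undo */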
This is XSA-285.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tentatively-Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 5 Mar 2019 12:45:58 +0000 (13:45 +0100)]
gnttab: set page refcount for copy-on-grant-transfer
Commit 5cc77f9098 ("32-on-64: Fix domain address-size clamping,
implement"), which introduced this functionality, took care of clearing
the old page's PGC_allocated, but failed to set the bit (and install the
associated reference) on the newly allocated one. Furthermore the "mfn"
local variable was never updated, and hence the wrong MFN was passed to
guest_physmap_add_page() (and back to the destination domain) in this
case, leading to an IOMMU mapping into an unowned page.
Ideally the code would use assign_pages(), but the call to
gnttab_prepare_for_transfer() sits in the middle of the actions
mirroring that function.
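Paraphrased, the fix amounts to two previously missing steps after
copying into the replacement page (a simplification, not the literal
patch; in particular the count_info update is shown naively):
    page = new_page;
    page->count_info |= PGC_allocated | 1;  /* bit + its reference */
    mfn = page_to_mfn(page);                /* hand the new MFN on  */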
This is XSA-284.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Mon, 25 Feb 2019 13:06:22 +0000 (13:06 +0000)]
tools/tests: Drop obsolete test infrastructure
The regression/ directory was identified as already broken in 2012 (c/s 953953cc5). The logic is intended to test *.py files in the Xen tree against
different versions of python, but every python version is obsolete as well.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
The current npt and shadow get-entry code will always return
INVALID_MFN for foreign entries. Allow the entry mfn to be returned
for foreign entries, like it's done for grant table entries.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Roger Pau Monne [Wed, 27 Feb 2019 11:09:04 +0000 (12:09 +0100)]
x86/mm: handle foreign mappings in p2m_entry_modify
So that the specific handling can be removed from
atomic_write_ept_entry and be shared with npt and shadow code.
This commit also removes the check that prevented non-ept PVH dom0 from
mapping foreign pages.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Roger Pau Monne [Wed, 27 Feb 2019 11:09:02 +0000 (12:09 +0100)]
x86/mm: split p2m ioreq server pages special handling into helper
So that it can be shared by ept, npt and shadow code alike, instead of
duplicating it.
No change in functionality intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monne [Wed, 27 Feb 2019 11:09:01 +0000 (12:09 +0100)]
x86/p2m: pass the p2m to write_p2m_entry handlers
Current callers pass the p2m to paging_write_p2m_entry but, due to the
handling done in paging_write_p2m_entry, the implementation-specific
handlers of the write_p2m_entry hook get a domain struct instead of a
p2m.
Change the code so that the implementations of write_p2m_entry take a
p2m instead of a domain.
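Schematically (the function name and parameter list here are
illustrative, not the exact prototypes):
    /* Before: implementations received the domain ... */
    int npt_write_p2m_entry(struct domain *d, unsigned long gfn,
                            l1_pgentry_t *p, l1_pgentry_t new,
                            unsigned int level);

    /* ... after: they receive the p2m itself. */
    int npt_write_p2m_entry(struct p2m_domain *p2m, unsigned long gfn,
                            l1_pgentry_t *p, l1_pgentry_t new,
                            unsigned int level);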
This is a non-functional change, but will be used by follow up
patches.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Igor Druzhinin [Tue, 26 Feb 2019 17:44:06 +0000 (17:44 +0000)]
x86/nmi: correctly check MSB of P6 performance counter MSR in watchdog
The logic currently tries to work out if a recent overflow (that indicates
that NMI comes from the watchdog) happened by checking MSB of performance
counter MSR that is initially sign extended from a negative value
that we program it to. A possibly incorrect assumption here is that
the MSB is always bit 32, while on modern hardware it's usually bit 47,
and the actual bit-width is reported through CPUID. Checking bit 32 for
overflows is usually fine since we never program it to anything
exceeding 32-bits and NMI is handled shortly after overflow occurs.
A problematic scenario that we saw occurs on systems where SMIs taking
significant time are possible. In that case, NMI handling is deferred to
the point firmware exits SMI which might take enough time for the counter
to go through bit 32 and set it to 1 again. So the logic described above
will misread it and report an unknown NMI erroneously.
Fortunately, we can use the actual MSB, which is usually higher than the
currently hardcoded 32, and treat this case correctly at least on modern
hardware.
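The counter width is architecturally enumerated, so the true MSB can be
computed rather than hardcoded. A small stand-alone C illustration
(Xen's patch reads the same CPUID data through its own plumbing):
    #include <stdio.h>
    #include <cpuid.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 0xa: Intel architectural performance monitoring. */
        if ( !__get_cpuid(0xa, &eax, &ebx, &ecx, &edx) )
            return 1;

        /* EAX[23:16] is the bit width of the general-purpose
         * counters, e.g. 48 -> the MSB is bit 47, not bit 32. */
        printf("perfctr MSB: bit %u\n", ((eax >> 16) & 0xff) - 1);
        return 0;
    }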
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 28 Feb 2019 12:49:13 +0000 (12:49 +0000)]
x86/altp2m: Fix build with !CONFIG_HVM
c/s 0ec9b4ef3148 "x86/vmx: Fix security issue when a guest balloons out the #VE
info page" introduced a caller of altp2m_vcpu_disable_ve() in a common path,
but c/s e72ecc761541 "x86/altp2m: Rework #VE enable/disable paths" didn't have
a suitable prototype in the !CONFIG_HVM case.
Introduce one to fix the build.
Spotted by Travis.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
A triple fault is fatal to the domain under all circumstances (so will print
at most once), and in practice is always an error condition rather than a
reboot fallback.
Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Christian Lindig [Wed, 27 Feb 2019 10:33:42 +0000 (10:33 +0000)]
tools/ocaml: Dup2 /dev/null to stdin in daemonize()
Don't close stdin in daemonize() but dup2 /dev/null instead. Otherwise, fd 0
gets reused later:
[root@idol ~]# ls -lav /proc/`pgrep xenstored`/fd
total 0
dr-x------ 2 root root 0 Feb 28 11:02 .
dr-xr-xr-x 9 root root 0 Feb 27 15:59 ..
lrwx------ 1 root root 64 Feb 28 11:02 0 -> /dev/xen/evtchn
l-wx------ 1 root root 64 Feb 28 11:02 1 -> /dev/null
l-wx------ 1 root root 64 Feb 28 11:02 2 -> /dev/null
lrwx------ 1 root root 64 Feb 28 11:02 3 -> /dev/xen/privcmd
...
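The same idea expressed in C (the actual change is to the OCaml
daemonize(); this merely illustrates the pattern):
    #include <fcntl.h>
    #include <unistd.h>

    static void stdin_to_devnull(void)
    {
        int fd = open("/dev/null", O_RDONLY);

        if ( fd >= 0 )
        {
            dup2(fd, STDIN_FILENO);      /* keep fd 0 occupied */
            if ( fd != STDIN_FILENO )
                close(fd);
        }
    }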
Signed-off-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 11 Feb 2019 13:31:02 +0000 (13:31 +0000)]
x86/vmx: Properly flush the TLB when an altp2m is modified
Modifications to an altp2m mark the p2m as needing flushing, but this was
never wired up in the return-to-guest path. As a result, stale TLB entries
can remain after resuming the guest.
In practice, this manifests as a missing EPT_VIOLATION or #VE exception when
the guest subsequently accesses a page which has had its permissions reduced.
vmx_vmenter_helper() now has 11 p2ms to potentially invalidate, but issuing 11
INVEPT instructions isn't clever. Instead, count how many contexts need
invalidating, and use INVEPT_ALL_CONTEXT if two or more are in need of
flushing.
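The decision logic, in outline (helper names are illustrative, not the
literal Xen code):
    unsigned int i, dirty = 0, last = 0;

    for ( i = 0; i < MAX_ALTP2M; i++ )
        if ( altp2m_needs_flush(d, i) )   /* hypothetical predicate */
        {
            dirty++;
            last = i;
        }

    if ( dirty >= 2 )
        invept_all_context();             /* one INVEPT covers all  */
    else if ( dirty == 1 )
        invept_single_context(eptp_of(d, last));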
This doesn't have an XSA because altp2m is not yet a security-supported
feature.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 17 Jan 2019 12:26:17 +0000 (12:26 +0000)]
x86/vmx: Fix security issue when a guest balloons out the #VE info page
The logic in altp2m_vcpu_{en,dis}able_ve() and vmx_vcpu_update_vmfunc_ve() is
dangerous. After #VE has been set up, the guest can balloon out and free the
nominated GFN, after which the processor may write to it. Also, the unlocked
GFN query means the MFN is stale by the time it is used. Alternatively, a
guest can race two disable calls to cause one VMCS to still reference the
nominated GFN after the tracking information was dropped.
Rework the logic from scratch to make it safe.
Hold an extra page reference on the underlying frame, to account for the
VMCS's reference. This means that if the GFN gets ballooned out, it isn't
freed back to Xen until #VE is disabled, and the VMCS no longer refers to the
page.
A consequence of this is that altp2m_vcpu_disable_ve() needs to be called
during the domain_kill() path, to drop the reference for domains which shut
down with #VE still enabled.
For domains using altp2m, we expect a single enable call and no disable for
the remaining lifetime of the domain. However, to avoid problems with
concurrent calls, use cmpxchg() to locklessly maintain safety.
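The lockless pattern, in outline (the ve_info field name is an
assumption made for illustration; the real structure layout may
differ):
    /* Enable: only the winner of the cmpxchg() installs the page. */
    if ( cmpxchg(&v->arch.hvm.vmx.ve_info, NULL, pg) != NULL )
    {
        put_page(pg);            /* lost a race with another enable */
        return -EEXIST;
    }

    /* Disable: detach first, so a concurrent disable sees NULL and
     * leaves the reference alone. */
    pg = xchg(&v->arch.hvm.vmx.ve_info, NULL);
    if ( pg )
        put_page(pg);            /* drop the VMCS's reference */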
This doesn't have an XSA because altp2m is not yet a security-supported
feature.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Juergen Gross <jgross@suse.com>
When interacting with the IO APIC, a guest can specify values that are
used as indexes into structures, and whose values are not compared
against upper bounds to prevent speculative out-of-bound accesses. This
change prevents these speculative accesses.
Furthermore, variables are initialized, and the compiler is asked not
to optimize these initializations away, as the uninitialized variables
might be used in a speculative out-of-bound access. Out of the four
initialized variables, two are potentially problematic, namely the ones
in the functions vioapic_irq_positive_edge and vioapic_get_trigger_mode.
As the two problematic variables are both used in the common function
gsi_vioapic, the mitigation is implemented there. As the access pattern
of the currently non-guest-controlled functions might change in the
future, the other variables are initialized as well.
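The core pattern used by the series (array_index_nospec() is the real
macro from Xen's nospec.h; the surrounding lookup is illustrative):
    if ( gsi >= vioapic->nr_pins )
        return NULL;
    /* Clamp the index so it stays in range even under speculation. */
    gsi = array_index_nospec(gsi, vioapic->nr_pins);
    return &vioapic->redirtbl[gsi];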
Norbert Manthey [Tue, 26 Feb 2019 15:57:18 +0000 (16:57 +0100)]
evtchn: block speculative out-of-bound accesses
Guests can issue event channel operations with guest-specified data.
To avoid speculative out-of-bound accesses, we use the nospec macros,
or the domain_vcpu function. Where appropriate, we use the vcpu_id of
the selected vcpu instead of the parameter that can be influenced by
the guest, so that only one access needs to be protected.
Jan Beulich [Tue, 26 Feb 2019 15:56:26 +0000 (16:56 +0100)]
x86/shadow: don't use map_domain_page_global() on paths that may not fail
The assumption (according to one comment) and hope (according to
another) that map_domain_page_global() can't fail are both wrong on
large enough systems. Do away with the guest_vtable field altogether,
and establish / tear down the desired mapping as necessary.
The alternatives, discarded as being undesirable, would have been to
either crash the guest in sh_update_cr3() when the mapping fails, or to
bubble up an error indicator, which upper layers would have a hard time
dealing with (other than, again, by crashing the guest).
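The replacement pattern, schematically (a sketch; the real change
touches several shadow paths):
    /* Map around each use instead of caching a global mapping;
     * map_domain_page(), unlike map_domain_page_global(), cannot
     * fail. */
    guest_l4e_t *gl4e = map_domain_page(gmfn);

    /* ... consult the guest's top-level entries ... */

    unmap_domain_page(gl4e);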
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Paul Durrant [Tue, 26 Feb 2019 15:55:06 +0000 (16:55 +0100)]
viridian: fix the HvFlushVirtualAddress/List hypercall implementation
The current code uses hvm_asid_flush_vcpu() but this is insufficient for
a guest running in shadow mode, which results in guest crashes early in
boot if the 'hcall_remote_tlb_flush' option is enabled.
This patch, instead of open coding a new flush algorithm, adapts the one
already used by the HVMOP_flush_tlbs Xen hypercall. The implementation is
modified to allow TLB flushing a subset of a domain's vCPUs. A callback
function determines whether or not a vCPU requires flushing. This mechanism
was chosen because, while it is the case that the currently implemented
viridian hypercalls specify a vCPU mask, there are newer variants which
specify a sparse HV_VP_SET and thus use of a callback will avoid needing to
expose details of this outside of the viridian subsystem if and when those
newer variants are implemented.
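The shape of the callback interface (a paraphrase of the idea, not the
exact Xen prototype):
    typedef bool (*flush_filter_fn)(void *ctxt, struct vcpu *v);

    static bool need_flush(void *ctxt, struct vcpu *v)
    {
        const unsigned long *vcpu_mask = ctxt;

        /* Today: driven by the vCPU mask from the viridian call.  A
         * future HV_VP_SET variant only needs a different predicate. */
        return test_bit(v->vcpu_id, vcpu_mask);
    }

    /* The common flush routine then walks the domain's vCPUs and
     * flushes just those for which the callback returns true. */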
NOTE: Use of the common flush function requires that the hypercalls are
restartable and so, with this patch applied, viridian_hypercall()
can now return HVM_HCALL_preempted. This is safe as no modification
to struct cpu_user_regs is done before the return.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Julien Grall [Mon, 18 Feb 2019 10:21:06 +0000 (10:21 +0000)]
xen/arm: domain_build: Panic message should end with a newline
Since commit 25eb5eec79 "xen: Fix inconsistent callers of panic()" all
panic messages should end with a newline. Unfortunately, some
commits pushed afterwards do not follow the rule.
Modify the offending panic messages to avoid more inconsistency.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Juergen Gross <jgross@suse.com>