tools/xenstore: fix a use after free problem in xenstored
Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object
twice") introduced a potential use after free problem in
domain_cleanup(): after calling talloc_unlink() for domain->conn
domain->conn is set to NULL. The problem is that domain is registered
as talloc child of domain->conn, so it might be freed by the
talloc_unlink() call.
With Xenstore being single threaded there are normally no concurrent
memory allocations running and freeing a virtual memory area normally
doesn't result in that area no longer being accessible. A problem
could occur only in case either a signal received results in some
memory allocation done in the signal handler (SIGHUP is a primary
candidate leading to reopening the log file), or in case the talloc
framework would do some internal memory allocation during freeing of
the memory (which would lead to clobbering of the freed domain
structure).
Andrew Cooper [Thu, 5 Mar 2020 10:35:17 +0000 (11:35 +0100)]
xen/pvh: Fix segment selector ABI
The written ABI states that %es will be set up, but libxc doesn't do so. In
practice, it breaks `rep movs` inside guests before they reload %es.
The written ABI doesn't mention %ss, but libxc does set it up. Having %ds
different to %ss is obnoxous to work with, as different registers have
different implicit segments.
Modify the spec to state that %ss is set up as a flat read/write segment.
This a) matches the Multiboot 1 spec, b) matches what is set up in practice,
and c) is the more sane behaviour for guests to use.
Fixes: 68e1183411b ('libxc: introduce a xc_dom_arch for hvm-3.0-x86_32 guests') Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/pvh: Adjust dom0's starting state
Fixes: b25fb1a04e "xen/pvh: Fix segment selector ABI" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b25fb1a04e99cc03359eade1affb56ef0eee766f
master date: 2020-02-10 15:26:09 +0000
master commit: 6ee10313623c1f41fc72fe12372e176e744463c1
master date: 2020-02-11 11:04:26 +0000
Jan Beulich [Tue, 14 Apr 2020 13:14:21 +0000 (15:14 +0200)]
gnttab: fix GNTTABOP_copy continuation handling
The XSA-226 fix was flawed - the backwards transformation on rc was done
too early, causing a continuation to not get invoked when the need for
preemption was determined at the very first iteration of the request.
This in particular means that all of the status fields of the individual
operations would be left untouched, i.e. set to whatever the caller may
or may not have initialized them to.
Ross Lagerwall [Tue, 14 Apr 2020 13:13:24 +0000 (15:13 +0200)]
xen/gnttab: Fix error path in map_grant_ref()
Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets,
changing the logic. If the _set_status() call fails, the grant_map hypercall
would fail with a status of 1 (rc != GNTST_okay) instead of the expected
negative GNTST_* error.
This error path can be taken due to bad guest state, and causes net/blk-back
in Linux to crash.
This is XSA-316.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: da0c66c8f48042a0186799014af69db0303b1da5
master date: 2020-04-14 14:41:02 +0200
xen/rwlock: Add missing memory barrier in the unlock path of rwlock
The rwlock unlock paths are using atomic_sub() to release the lock.
However the implementation of atomic_sub() rightfully doesn't contain a
memory barrier. On Arm, this means a processor is allowed to re-order
the memory access with the preceeding access.
In other words, the unlock may be seen by another processor before all
the memory accesses within the "critical" section.
The rwlock paths already contains barrier indirectly, but they are not
very useful without the counterpart in the unlock paths.
The memory barriers are not necessary on x86 because loads/stores are
not re-ordered with lock instructions.
So add arch_lock_release_barrier() in the unlock paths that will only
add memory barrier on Arm.
Take the opportunity to document each lock paths explaining why a
barrier is not necessary.
Jan Beulich [Tue, 14 Apr 2020 13:10:48 +0000 (15:10 +0200)]
xenoprof: limit consumption of shared buffer data
Since a shared buffer can be written to by the guest, we may only read
the head and tail pointers from there (all other fields should only ever
be written to). Furthermore, for any particular operation the two values
must be read exactly once, with both checks and consumption happening
with the thus read values. (The backtrace related xenoprof_buf_space()
use in xenoprof_log_event() is an exception: The values used there get
re-checked by every subsequent xenoprof_add_sample().)
Since that code needed touching, also fix the double increment of the
lost samples count in case the backtrace related xenoprof_add_sample()
invocation in xenoprof_log_event() fails.
Where code is being touched anyway, add const as appropriate, but take
the opportunity to entirely drop the now unused domain parameter of
xenoprof_buf_space().
This is part of XSA-313.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 50ef9a3cb26e2f9383f6fdfbed361d8f174bae9f
master date: 2020-04-14 14:33:19 +0200
Jan Beulich [Tue, 14 Apr 2020 13:08:26 +0000 (15:08 +0200)]
xenoprof: clear buffer intended to be shared with guests
alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen
internally used allocations, but buffers allocated to be shared with
(unpriviliged) guests need to be zapped of their prior content.
This is part of XSA-313.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0763a7ebfcdad66cf9e5475a1301eefb29bae9ed
master date: 2020-04-14 14:32:33 +0200
Julien Grall [Thu, 19 Dec 2019 08:12:21 +0000 (08:12 +0000)]
xen/arm: Place a speculation barrier sequence following an eret instruction
Some CPUs can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.
Newer CPUs may implement a new SB barrier instruction which acts
as an architected speculation barrier. For current CPUs, the sequence
DSB; ISB is known to prevent speculation.
The latter sequence is heavier than SB but it would never be executed
(this is speculation after all!).
Introduce a new macro 'sb' that could be used when a speculation barrier
is required. For now it is using dsb; isb but this could easily be
updated to cater SB in the future.
Andrew Cooper [Wed, 11 Dec 2019 14:44:09 +0000 (15:44 +0100)]
AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables
update_paging_mode() has multiple bugs:
1) Booting with iommu=debug will cause it to inform you that that it called
without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
table(s) in the case of a further error.
Furthermore, the choice of default level for a domain has issues:
1) All HVM guests grow from 2 to 3 levels during construction because of the
position of the VRAM just below the 4G boundary, so defaulting to 2 is a
waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
isn't dynamic at runtime like HVM guests. This means that a PV guest may
get RAM which it can't map in the IOMMU.
The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement. Remove
the complexity by removing the dynamic height.
PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.
HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.
The overhead of this extra level is not expected to be noticeable. It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.
This is XSA-311.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100
George Dunlap [Wed, 11 Dec 2019 14:43:44 +0000 (15:43 +0100)]
x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial
The PGT_partial bit in page->type_info holds both a type count and a
general ref count. During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial. When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():
As far as we can tell, this page underflow cannot be exploited any any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().
(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)
Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 66bdc16aeed8ddb2ae724adc5ea6bde0dea78c3d
master date: 2019-12-11 14:55:08 +0100
George Dunlap [Wed, 11 Dec 2019 14:43:09 +0000 (15:43 +0100)]
x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR
When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.
One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).
Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR. This turns out to be true for
alloc_lN_table() and free_lN_table, but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set. In these cases, the pages PGT_partial will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.
Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.
NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i. On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4e70f4476c0c543559f971faecdd5f1300cddb0a
master date: 2019-12-11 14:54:43 +0100
George Dunlap [Wed, 11 Dec 2019 14:42:33 +0000 (15:42 +0100)]
x86/mm: Set old_guest_table when destroying vcpu pagetables
Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set. If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.
One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set. The
result was that if such an operation were interrupted, Xen would hit a
BUG().
Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately
While here, do some refactoring:
- Move clearing of arch.cr3 to the top of the function
- Now that clearing is unconditional, move the unmap to the same
conditional as the l4tab mapping. This also allows us to reduce
the scope of the l4tab variable.
- Avoid code duplication by looping to drop references on
guest_table_user
This is part of XSA-310.
Reported-by: Sarah Newman <srn@prgmr.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ececa12b2c4c8e4433e4f9be83f5c668ae36fe08
master date: 2019-12-11 14:54:13 +0100
George Dunlap [Wed, 11 Dec 2019 14:41:54 +0000 (15:41 +0100)]
x86/mm: Don't reset linear_pt_count on partial validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to another page
table, to prevent both from happening at the same time.
Unfortunately, the original commit introducing this reset this count
when resuming validation of a partially-validated pagetable, dropping
some "linear_pt_entry" counts.
On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:
Assertion 'oc > 0' failed at mm.c:874
Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other abitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.
This is XSA-309.
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7473efd12fb7a6548f5303f1f4c5cb521543a813
master date: 2019-12-11 14:10:27 +0100
Andrew Cooper [Wed, 11 Dec 2019 14:41:22 +0000 (15:41 +0100)]
x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures
See patch comment for technical details.
Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.
After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.
A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, and proving this to be a
security issue.
This is XSA-308
Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1d3eb8259804e5bec991a3462d69ba6bd80bb40e
master date: 2019-12-11 14:09:30 +0100
Jan Beulich [Wed, 11 Dec 2019 14:40:39 +0000 (15:40 +0100)]
x86+Arm32: make find_next_{,zero_}bit() have well defined behavior
These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.
Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.
Jan Beulich [Wed, 11 Dec 2019 14:39:07 +0000 (15:39 +0100)]
AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
Unmapping a page which has never been mapped should be a no-op (note how
it already is in case there was no root page table allocated). There's
in particular no need to grow the number of page table levels in use,
and there's also no need to allocate intermediate page tables except
when needing to split a large page.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad591454f069647c36a7daaa9ec23384c0263f0b
master date: 2019-11-12 11:08:34 +0100
Jan Beulich [Tue, 26 Nov 2019 17:03:41 +0000 (18:03 +0100)]
IOMMU: default to always quarantining PCI devices
XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.
Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.
Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).
This is XSA-306.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: ba2ab00bbb8c74e311a252d816d68dee47c779a0
master date: 2019-11-26 14:15:01 +0100
Andrew Cooper [Tue, 26 Nov 2019 17:01:17 +0000 (18:01 +0100)]
x86/vvmx: Fix livelock with XSA-304 fix
It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.
Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.
Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the oppotunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.
When nested virt is enabled on a domain, force it to using executable
superpages and rebuild the p2m.
Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 183f354e1430087879de071f0c7122e42703916e
master date: 2019-11-23 14:06:24 +0000
Andrew Cooper [Wed, 19 Jun 2019 17:16:03 +0000 (18:16 +0100)]
x86/tsx: Introduce tsx= to use MSR_TSX_CTRL when available
To protect against the TSX Async Abort speculative vulnerability, Intel have
released new microcode for affected parts which introduce the MSR_TSX_CTRL
control, which allows TSX to be turned off. This will be architectural on
future parts.
Introduce tsx= to provide a global on/off for TSX, including its enumeration
via CPUID. Provide stub virtualisation of this MSR, as it is not exposed to
guests at the moment.
VMs may have booted before microcode is loaded, or before hosts have rebooted,
and they still want to migrate freely. A VM which booted seeing TSX can
migrate safely to hosts with TSX disabled - TSX will start unconditionally
aborting, but still behave in a manner compatible with the ABI.
The guest-visible behaviour is equivalent to late loading the microcode and
setting the RTM_DISABLE bit in the course of live patching.
This is part of XSA-305 / CVE-2019-11135
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 8 Nov 2019 16:36:50 +0000 (16:36 +0000)]
x86/vtx: Allow runtime modification of the exec-sp setting
See patch for details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Thu, 20 Dec 2018 17:25:29 +0000 (17:25 +0000)]
x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207
CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation. HVM guest
kernels can trigger this to DoS the host.
To mitigate, in affected hardware, all EPT superpages are marked NX. When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored. This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.
This does come with a workload-dependent performance overhead, caused by
increased TLB pressure. Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.
This is part of XSA-304 / CVE-2018-12207
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 24 Oct 2019 13:09:01 +0000 (14:09 +0100)]
x86/vtd: Hide superpage support for SandyBridge IOMMUs
Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered. The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.
This is part of XSA-304 / CVE-2018-12207
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Mon, 4 Nov 2019 13:54:09 +0000 (14:54 +0100)]
xen/arm64: Don't blindly unmask interrupts on trap without a change of level
Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless the state of them in the
interrupted context.
One of the consequences is IRQ will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result to unexpected
behavior such as deadlock (if a lock was shared with interrupts).
In a nutshell, interrupts should only be unmasked when it is safe to
do. Xen only unmask IRQ and Abort interrupts, so the logic can stay
simple:
- hyp_error: All the interrupts are now kept masked. SError should
be pretty rare and if ever happen then we most likely want to
avoid any other interrupts to be generated. The potential main
"caller" is during virtual SError synchronization on the exit
path from the guest (see check_pending_vserror).
- hyp_sync: The interrupts state is inherited from the interrupted
context.
- hyp_irq: All the interrupts but IRQ state are inherited from the
interrupted context. IRQ is kept masked.
Julien Grall [Mon, 4 Nov 2019 13:53:46 +0000 (14:53 +0100)]
xen/arm32: Don't blindly unmask interrupts on trap without a change of level
Exception vectors will unmask interrupts regardless the state of them in
the interrupted context.
One of the consequences is IRQ will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result to unexpected behavior such as deadlock (if a lock was
shared with interrupts).
In a nutshell, interrupts should only be unmasked when it is safe to do.
Xen only unmask IRQ and Abort interrupts, so the logic can stay simple.
As vectors exceptions may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.
On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.
On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.
The interrupts will be kept unmasked when the vector cannot used by
either guest or hypervisor.
Note that each vector is not anymore preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.
Julien Grall [Mon, 4 Nov 2019 13:53:27 +0000 (14:53 +0100)]
xen/arm32: entry: Fold the macro SAVE_ALL in the macro vector
Follow-up rework will require the macro vector to distinguish between
a trap from a guest vs while in the hypervisor.
The macro SAVE_ALL already has code to distinguish between the two and
it is only called by the vector macro. So fold the former into the
latter. This will help to avoid duplicating the check.
Julien Grall [Mon, 4 Nov 2019 13:53:02 +0000 (14:53 +0100)]
xen/arm32: entry: Split __DEFINE_ENTRY_TRAP in two
The preprocessing macro __DEFINE_ENTRY_TRAP is used to generate trap
entry function. While the macro is fairly small today, follow-up patches
will increase the size signicantly.
In general, assembly macros are more readable as they allow you to name
parameters and avoid '\'. So the actual implementation of the trap is
now switched to an assembly macro.
Paul Durrant [Mon, 4 Nov 2019 13:52:31 +0000 (14:52 +0100)]
passthrough: quarantine PCI devices
When a PCI device is assigned to an untrusted domain, it is possible for
that domain to program the device to DMA to an arbitrary address. The
IOMMU is used to protect the host from malicious DMA by making sure that
the device addresses can only target memory assigned to the guest. However,
when the guest domain is torn down the device is assigned back to dom0,
thus allowing any in-flight DMA to potentially target critical host data.
This patch introduces a 'quarantine' for PCI devices using dom_io. When
the toolstack makes a device assignable (by binding it to pciback), it
will now also assign it to DOMID_IO and the device will only be assigned
back to dom0 when the device is made unassignable again. Whilst device is
assignable it will only ever transfer between dom_io and guest domains.
dom_io is actually only used as a sentinel domain for quarantining purposes;
it is not configured with any IOMMU mappings. Assignment to dom_io simply
means that the device's initiator (requestor) identifier is not present in
the IOMMU's device table and thus any DMA transactions issued will be
terminated with a fault condition.
In addition, a fix to assignment handling is made for VT-d. Failure
during the assignment step should not lead to a device still being
associated with its prior owner. Hand the device to DomIO temporarily,
until the assignment step has completed successfully. Remove the PI
hooks from the source domain then earlier as well.
Failure of the recovery reassign_device_ownership() may not go silent:
There e.g. may still be left over RMRR mappings in the domain assignment
to which has failed, and hence we can't allow that domain to continue
executing.
NOTE: This patch also includes one printk() cleanup; the
"XEN_DOMCTL_assign_device: " tag is dropped in iommu_do_pci_domctl(),
since similar printk()-s elsewhere also don't log such a tag.
This is XSA-302.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 319f9a0ba94c7db505cd5dd9cb0b037ab1aa8e12
master date: 2019-10-31 16:20:05 +0100
Julien Grall [Mon, 4 Nov 2019 13:52:06 +0000 (14:52 +0100)]
xen/arm: p2m: Don't check the return of p2m_get_root_pointer() with BUG_ON()
It turns out that the BUG_ON() was actually reachable with well-crafted
hypercalls. The BUG_ON() is here to prevent catch logical error, so
crashing Xen is a bit over the top.
While all the holes should now be fixed, it would be better to downgrade
the BUG_ON() to something less fatal to prevent any more DoS.
The BUG_ON() in p2m_get_entry() is now replaced by ASSERT_UNREACHABLE()
to catch mistake in debug build and return INVALID_MFN for production
build. The interface also requires to set page_order to give an idea of
the size of "hole". So 'level' is now set so we report a hole of size of
the an entry of the root page-table. This stays inline with what happen
when the GFN is higher than p2m->max_mapped_gfn.
The BUG_ON() in p2m_resolve_translation_fault() is now replaced by
ASSERT_UNREACHABLE() to catch mistake in debug build and just report a
fault for producion build.
Julien Grall [Mon, 4 Nov 2019 13:51:47 +0000 (14:51 +0100)]
xen/arm: p2m: Avoid off-by-one check on p2m->max_mapped_gfn
The code base is using inconsistently the field p2m->max_mapped_gfn.
Some of the useres expect that p2m->max_guest_gfn contain the highest
mapped GFN while others expect highest + 1.
p2m->max_guest_gfn is set as highest + 1, because of that the sanity
check on the GFN in p2m_resolved_translation_fault() and
p2m_get_entry() can be bypassed when GFN == p2m->max_guest_gfn.
p2m_get_root_pointer(p2m->max_guest_gfn) may return NULL if it is
outside of address range supported and therefore the BUG_ON() could be
hit.
The current value hold in p2m->max_mapped_gfn is inconsistent with the
expectation of the common code (see domain_get_maximum_gpfn()) and also
the documentation of the field.
Rather than changing the check in p2m_translation_fault() and
p2m_get_entry(), p2m->max_mapped_gfn is now containing the highest
mapped GFN and the callers assuming "highest + 1" are now adjusted.
Take the opportunity to use 1UL rather than 1 as page_order could
theoritically big enough to overflow a 32-bit integer.
Lastly, the documentation of the field max_guest_gfn to reflect how it
is computed.
Julien Grall [Mon, 4 Nov 2019 13:51:21 +0000 (14:51 +0100)]
xen/arm: p2m: Avoid aliasing guest physical frame
The P2M helpers implementation is quite lax and will end up to ignore
the unused top bits of a guest physical frame.
This effectively means that p2m_set_entry() will create a mapping for a
different frame (it is always equal to gfn & (mask unused bits)). Yet
p2m->max_mapped_gfn will be updated using the original frame.
At the moment, p2m_get_entry() and p2m_resolve_translation_fault()
assume that p2m_get_root_pointer() will always return a non-NULL pointer
when the GFN is smaller than p2m->max_mapped_gfn.
Unfortunately, because of the aliasing described above, it would be
possible to set p2m->max_mapped_gfn high enough so it covers frame that
would lead p2m_get_root_pointer() to return NULL.
As we don't sanity check the guest physical frame provided by a guest, a
malicious guest could craft a series of hypercalls that will hit the
BUG_ON() and therefore DoS Xen.
To prevent aliasing, the function p2m_get_root_pointer() is now reworked
to return NULL If any of the unused top bits are not zero. The caller
can then decide what's the appropriate action to do. Since the two paths
(i.e. P2M_ROOT_PAGES == 1 and P2M_ROOT_PAGES != 1) are now very
similarly, take the opportunity to consolidate them making the code a
bit simpler.
With this change, p2m_get_entry() will not try to insert a mapping as
the root pointer is invalid.
Note that root_table is now switch to unsigned long as unsigned int is
not enough to hold part of a GFN.
George Dunlap [Mon, 4 Nov 2019 13:50:47 +0000 (14:50 +0100)]
x86/mm: Don't drop a type ref unless you held a ref to begin with
Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible. This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately. Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count. During de-validation,
put_page_type() is called on partially validated entries.
Unfortunately, there are a number of issues with the current algorithm.
First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page. Some examples are listed in the
appendix.
The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.
What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated
Fix this by telling put_page_type() which of the two activities you
intend.
When cleaning up a partial de/validation, take no action unless you
find a page partially validated.
If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
In put_page_from_lNe, pass partial_flags on to _put_page_type().
old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries). Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
While here, delete stray trailing whitespace.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:
Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.
P1: PIN_L3_TABLE
A -> PGT_l3_table | 1 | valid
B -> PGT_l2_table | 1 | valid
P1: UNPIN_TABLE
> Arrange to interrupt after B has been de-validated
B:
type_info -> PGT_l2_table | 0
A:
type_info -> PGT_l3_table | 1 | partial
nr_validated_enties -> (less than x)
P2: mod_l4_entry to point to A
> Arrange for this to be interrupted while B is being validated
B:
type_info -> PGT_l2_table | 1 | partial
(nr_validated_entires &c set as appropriate)
A:
type_info -> PGT_l3_table | 1 | partial
nr_validated_entries -> x
partial_pte = 1
P3: mod_l3_entry some other unrelated l3 to point to B:
B:
type_info -> PGT_l2_table | 1
P1: Restart UNPIN_TABLE
At this point, since A.nr_validate_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3
A similar issue arises with old_guest_table. Consider the following
scenario:
Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.
V1: PIN_L2_TABLE(A)
<Validate until we try to validate [x], get -EINVAL>
A -> PGT_l2_table | 1 | PGT_partial
V1 -> old_guest_table = A
<delayed>
V2: PIN_L2_TABLE(A)
<Pick up where V1 left off, try to re-validate [x], get -EINVAL>
A -> PGT_l2_table | 1 | PGT_partial
V2 -> old_guest_table = A
<restart>
put_old_guest_table()
_put_page_type(A)
A -> PGT_l2_table | 0
Indeed, it is possible to engineer for old_guest_table for every vcpu
a guest has to point to the same page.
master commit: c40b33d72630dcfa506d6fd856532d6152cb97dc
master date: 2019-10-31 16:16:37 +0100
George Dunlap [Mon, 4 Nov 2019 13:50:22 +0000 (14:50 +0100)]
x86/mm: Fix nested de-validation on error
If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.
The invariant for partial pages should be:
* Entries [0, nr_validated_ptes) should be completely validated;
put_page_type() will de-validate these.
* If [nr_validated_ptes] is partially validated, partial_flags should
set PTF_partiaL_set. put_page_type() will be called on this page to
finish off devalidation, and the appropriate refcount adjustments
will be done.
alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.
Unfortunately, this is mishandled.
Take the case where validating lNe[x] returns an error.
First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be. nr_validated_ptes
is set to x; and since PFT_partial_set clear, de-validation resumes at
nr_validated_ptes-1. This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2. (Any other page type will
fail.)
Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1. When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if the put_page_type() is never called, but instead
get_page_type() is called, validation will pick up at [x+1],
neglecting to validate [x]. If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.
Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.
While here, add some safety catches:
- old_guest_table must point to the page contained in
[nr_validated_ptes].
- alloc_l1_page shouldn't set old_guest_table
If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question. If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop. Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3c15a2d8cc1981f369cc9542f028054d0dfb325b
master date: 2019-10-31 16:16:13 +0100
George Dunlap [Mon, 4 Nov 2019 13:50:00 +0000 (14:50 +0100)]
x86/mm: Properly handle linear pagetable promotion failures
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.
Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped. (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.) This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.
Fix this by setting page->partial_flags to the partial_flags local
variable.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix
Suppose A and B can both be promoted to L2 pages, and A[x] points to B.
V1: MOD_L3_ENTRY pointing something to A.
In the process of validating A[x], grab an extra type / ref on B:
B.type_count = 2 | PGT_validated
B.count = 3 | PGC_allocated
A.type_count = 1 | PGT_validated
A.count = 2 | PGC_allocated
V1: MOD_L3_ENTRY removing the reference to A.
De-validate A, down to A[x], which points to B.
Drop the final type on B. Arrange to be interrupted.
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = -1
V2: MOD_L3_ENTRY adds a reference to A.
At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
master commit: 2f126247ef49c2ba52bae29a2ab371059ede67c0
master date: 2019-10-31 16:15:48 +0100
George Dunlap [Mon, 4 Nov 2019 13:49:01 +0000 (14:49 +0100)]
x86/mm: Always retain a general ref on partial
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.
PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PFT_partial_set is set but the page in
question has been assigned to another domain. A sketch is provided in
the appendix.
Fix this by having the parent page table entry hold a general
reference count whenever PFT_partial_set is set. (For clarity of
change, keep two separate flags. These will be collapsed in a
subsequent changeset.)
This has two basic implications. On the put_page_from_lNe() side,
this mean that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.
Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.
(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)
On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.
NB that since l1 promotion can't be preempted, that get_page_from_l2e
can't return -ERESTART.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
foreign domain
Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B. B has
PGC_allocated set but no other general references.
V1: PIN_L3 A.
A is validated, B is validated.
A.type_count = 1 | PGT_validated | PGT_pinned
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated (A[x] holds a general ref)
V1: UNPIN A.
A begins de-validation.
Arrange to be interrupted when i < x
V1->old_guest_table = A
V1->old_guest_table_ref_held = false
A.type_count = 1 | PGT_partial
A.nr_validated_entries = i < x
B.type_count = 0
B.count = 1 | PGC_allocated
V2: MOD_L4_ENTRY to point some l4e to A.
Picks up re-validation of A.
Arrange to be interrupted halfway through B's validation
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = PTF_partial_set
V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
Validates B.
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated ("other l3e" holds a general ref)
V3: MOD_L3_ENTRY to clear l3e pointing to B.
Devalidates B.
B.type_count = 0
B.count = 1 | PGC_allocated
V3: decrease_reservation(B)
Clears PGC_allocated
B.count = 0 => B is freed
B gets assigned to a different domain
V1: Restarts UNPIN of A
put_old_guest_table(A)
...
free_l3_table(A)
Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.
If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happend on A[x].
master commit: 18b0ab697830a46ce3dacaf9210799322cb3732c
master date: 2019-10-31 16:14:36 +0100
George Dunlap [Mon, 4 Nov 2019 13:48:41 +0000 (14:48 +0100)]
x86/mm: Have alloc_l[23]_table clear partial_flags when preempting
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.
PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
Unfortunately, when alloc_l[23]_table check hypercall_preempt_check()
and return -ERESTART, they set nr_entries_validated, but don't clear
partial_flags.
If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated. This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.
Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].
In a sense, the real issue here is code duplication. Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.
Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ff0b9a5d69b744a99e8bbeac820a985db5a3bf8e
master date: 2019-10-31 16:14:14 +0100
Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.
The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.
No functional change intended.
NB that alloc_lN_table() mishandle the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2aab06d742e13d7a9d248f1fc7f0ec62b295ada1
master date: 2019-10-31 16:13:23 +0100
George Dunlap [Mon, 4 Nov 2019 13:47:56 +0000 (14:47 +0100)]
x86/mm: Use flags for _put_page_type rather than a boolean
This is in mainly in preparation for _put_page_type taking the
partial_flags value in the future. It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).
No functional change intended.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0121588ec0f6950ed65d906d860df49be2c8e655
master date: 2019-10-31 16:12:53 +0100
George Dunlap [Mon, 4 Nov 2019 13:47:23 +0000 (14:47 +0100)]
x86/mm: Separate out partial_pte tristate into individual flags
At the moment, partial_pte is a tri-state that contains two distinct bits
of information:
1. If zero, the pte at index [nr_validated_ptes] is un-validated. If
non-zero, the pte was last seen with PGT_partial set.
2. If positive, the pte at index [nr_validated_ptes] does not hold a
general reference count. If negative, it does.
To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`). These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together. In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.
NB that this does mean checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.
Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].
No functional change intended.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1b6fa638d21006d3c0a3038132c6cb326d8bba08
master date: 2019-10-31 16:12:14 +0100
George Dunlap [Mon, 4 Nov 2019 13:47:04 +0000 (14:47 +0100)]
x86/mm: Don't re-set PGT_pinned on a partially de-validated page
When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will
pickedup again when the hypercall restarts.
This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated). However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.
This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.
Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART
In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
While here, clean up some trainling whitespace.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bf656e02d8e7f49b484e2587aef4f18deda6e2ab
master date: 2019-10-31 16:11:46 +0100
George Dunlap [Mon, 4 Nov 2019 13:46:41 +0000 (14:46 +0100)]
x86/mm: L1TF checks don't leave a partial entry
On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.
However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it. This causes problems in several places.
For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page. If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.
Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial. When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page. This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.
For L2+ tables, setting rc to ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set. This will cause it to set partial_pte = 1. If
relinqush_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.
Rather than indicating -ERESTART, indicate -EINTR. This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).
mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3165ffef09e89d38f84d26051f606d2c1421aea3
master date: 2019-10-31 16:11:12 +0100
Jan Beulich [Mon, 4 Nov 2019 13:46:09 +0000 (14:46 +0100)]
x86/PV: check GDT/LDT limits during emulation
Accesses beyond the LDT limit originating from emulation would trigger
the ASSERT() in pv_map_ldt_shadow_page(). On production builds such
accesses would cause an attempt to promote the touched page (offset from
the present LDT base address) to a segment descriptor one. If this
happens to succeed, guest user mode would be able to elevate its
privileges to that of the guest kernel. This is particularly easy when
there's no LDT at all, in which case the LDT base stored internally to
Xen is simply zero.
Also adjust the ASSERT() that was triggering: It was off by one to
begin with, and for production builds we also better use
ASSERT_UNREACHABLE() instead with suitable recovery code afterwards.
This is XSA-298.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93021cbe880a8013691a48d0febef8ed7d3e3ebd
master date: 2019-10-31 16:08:16 +0100
Andrew Cooper [Mon, 4 Nov 2019 13:45:34 +0000 (14:45 +0100)]
xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()
Since c/s 1d429034 "hypercall: update vcpu_op to take an unsigned vcpuid",
which incorrectly swapped 'i' for 'u' in the parameter type list, guests have
been able to hit the BUG() in next_args()'s default case.
Correct these back to 'i'.
In addition, make adjustments to prevent this class of issue from occurring in
the future - crashing Xen is not an appropriate form of parameter checking.
Capitalise NEXT_ARG() to catch all uses, to highlight that it is a macro doing
non-function-like things behind the scenes, and undef it when appropriate.
Implement a bad_fmt: block which prints an error, asserts unreachable, and
crashes the guest.
On the ARM side, drop all parameter checking of p. It is asymmetric with the
x86 side, and akin to expecting memcpy() or sprintf() to check their src/fmt
parameter before use. A caller passing "" or something other than a string
literal will be obvious during code review.
This is XSA-296.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 0bf9f8d3e399a0e1d2b717f71b4776172446184b
master date: 2019-10-31 16:07:11 +0100
Jan Beulich [Fri, 21 Jun 2019 10:17:16 +0000 (12:17 +0200)]
XSM: adjust Kconfig names
Since the Kconfig option renaming was not backported, the new uses of
involved CONFIG_* settings should have been adopted to the existing
names in the XSA-295 series. Do this now, also changing XSM_SILO to just
SILO to better match its FLASK counterpart.
To avoid breaking the Kconfig menu structure also adjust XSM_POLICY's
dependency (as was also silently done on master during the renaming).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
Julien Grall [Thu, 20 Jun 2019 17:47:06 +0000 (18:47 +0100)]
xen/arm: time: cycles_t should be an uint64_t and not unsigned long
Since commit ca73ac8e7d "xen/arm: Add an isb() before reading CNTPCT_EL0
to prevent re-ordering", get_cycles() is now returning the number of
cycles and used in more callers.
While the counter registers is always 64-bit, get_cycles() will only
reutrn a 32-bit on Arm32 and therefore truncate the value. This will
result to weird behavior by both Xen and the Guest as the timer will not
be setup correctly.
This could be resolved by switch cycles_t from unsigned long to
unsigned int.
xen/arm: grant-table: Protect gnttab_clear_flag against guest misbehavior
The function gnttab_clear_flag is used to clear the access flags. On
Arm, it is implemented using a loop and guest_cmpxchg.
It is possible that guest_cmpxchg will always return a different value
than old. This can happen if the guest updated the memory before Xen has
time to do the exchange. Because of that, there are no way for to
promise the loop will end.
It is possible to make the current code safe by re-using the same
principle as applied on the guest atomic helper. However this patch
takes a different approach that should lead to more efficient code in
the default case.
A new helper is introduced to clear a set of bits on a 16-bits word.
This should avoid a an extra loop to check cmpxchg succeeded.
Note that a mask is used instead of a bit, so the helper can be re-used
later on for clearing multiple flags at the same time.
This is part of XSA-295.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
xen: Use guest atomics helpers when modifying atomically guest memory
On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.
This patch replaces all the atomics operations on shared memory with
a guest by the new guest atomics helpers. The x86 code was not audited
to know where guest atomics helpers could be used. I will leave that
to the x86 folks.
Note that some rework was required in order to plumb use the new guest
atomics in event channel and grant-table.
Because guest_test_bit is ignoring the parameter "d" for now, it
means there a lot of places do not need to drop the const. We may want
to revisit this in the future if the parameter "d" becomes necessary.
xen/cmpxchg: Provide helper to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.
This patch adds a new helper that will update the guest memory safely.
For x86, it is already possible to use the current helper safely. So
just wrap it.
For Arm, we will first attempt to update the guest memory with the
loop bounded by a maximum number of iterations. If it fails, we will
pause the domain and try again.
Note that this heuristics assumes that a page can only
be shared between Xen and one domain. Not Xen and multiple domain.
The maximum number of iterations is based on how many times atomic_inc()
can be executed in 1uS. The maximum value is per-CPU to cater big.LITTLE
and calculated when the CPU is booting.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum
value is per-CPU to cater big.LITTLE and calculated when the CPU is
booting. The heuristic was randomly chosen and can be modified if
impact too much good-behaving guest.
xen/bitops: Provide helpers to safely modify guest memory atomically
On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.
This patch adds a new set of helper that will update the guest memory
safely. For x86, it is already possible to use the current helpers
safely. So just wrap them.
For Arm, we will first attempt to update the guest memory with the loop
bounded by a maximum number of iterations. If it fails, we will pause the
domain and try again.
Note that this heuristics assumes that a page can only be shared between
Xen and one domain. Not Xen and multiple domain.
The maximum number of iterations is based on how many times a simple
load-store atomic operation can be executed in 1uS. The maximum value is
per-CPU to cater big.LITTLE and calculated when the CPU is booting. The
heuristic was randomly chosen and can be modified if impact too much
good-behaving guest.
Note, while test_bit does not requires to use atomic operation, a
wrapper for test_bit was added for completeness. In this case, the
domain stays constified to avoid major rework in the caller for the
time-being.
On Arm, exclusive load-store atomics should only be used between trusted
thread. As not all the guests are trusted, it may be possible to DoS Xen
when updating shared memory with guest atomically.
Recent patches introduced new helpers to update shared memory with guest
atomically. Those helpers relies on a memory region to be be shared with
Xen and a single guest.
At the moment, nothing prevent a guest sharing a page with Xen and as
well with another guest (e.g via grant table).
For the scope of the XSA, the quickest way is to deny communications
between unprivileged guest. So this patch is enabling and using SILO
mode by default on Arm.
Users wanted finer graine policy could wrote their own Flask policy.
This is part of XSA-295.
Signed-off-by: Julien Grall <julien.grall@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Xin Li [Tue, 9 Oct 2018 09:33:19 +0000 (17:33 +0800)]
xen/xsm: Introduce new boot parameter xsm
Introduce new boot parameter xsm to choose which xsm module is enabled,
and set default to dummy. And add new option in Kconfig to choose the
default XSM implementation.
Signed-off-by: Xin Li <xin.li@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Wed, 22 May 2019 20:39:17 +0000 (13:39 -0700)]
xen/arm: cmpxchg: Provide a new helper that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all the guests are trusted, it may be possible to DoS
Xen when updating shared memory with guest atomically.
To prevent the infinite loop, we introduce a new helper that can timeout.
The timeout is based on the maximum number of iterations.
It will be used in follow-up patch to make atomic operations on shared
memory safe.
xen/arm: bitops: Implement a new set of helpers that can timeout
Exclusive load-store atomics should only be used between trusted
threads. As not all the guests are trusted, it may be possible to DoS
Xen when updating shared memory with guest atomically.
To prevent the infinite loop, we introduce a new set of helpers that can
timeout. The timeout is based on the maximum number of iterations.
They will be used in follow-up patch to make atomic operations
on shared memory safe.
xen/arm32: cmpxchg: Simplify the cmpxchg implementation
The only difference between each case of the cmpxchg is the size of
used. Rather than duplicating the code, provide a macro to generate each
cases.
This makes the code easier to read and modify.
While doing the rework, the case for 64-bit cmpxchg is removed. This is
unused today (already commented) and it would not be possible to use
it directly.
xen/grant_table: Rework the prototype of _set_status* for lisibility
It is not clear from the parameters name whether domid and gt_version
correspond to the local or remote domain. A follow-up patch will make
them more confusing.
So rename domid (resp. gt_version) to ldomid (resp. rgt_version). At
the same time re-order the parameters to hopefully make it more
readable.
This is part of XSA-295.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
xen/arm: Add an isb() before reading CNTPCT_EL0 to prevent re-ordering
Per D8.2.1 in ARM DDI 0487C.a, "a read to CNTPCT_EL0 can occur
speculatively and out of order relative to other instructions executed
on the same PE."
Add an instruction barrier to get accurate number of cycles when
requested in get_cycles(). For the other users of CNPCT_EL0, replace by
a call to get_cycles().
Jan Beulich [Thu, 23 May 2019 17:42:29 +0000 (10:42 -0700)]
events: drop arch_evtchn_inject()
Have the only user call vcpu_mark_events_pending() instead, at the same
time arranging for correct ordering of the writes (evtchn_pending_sel
should be written before evtchn_upcall_pending).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
Igor Druzhinin [Tue, 4 Jun 2019 13:54:58 +0000 (15:54 +0200)]
libacpi: report PCI slots as enabled only for hotpluggable devices
DSDT for qemu-xen lacks _STA method of PCI slot object. If _STA method
doesn't exist then the slot is assumed to be always present and active
which in conjunction with _EJ0 method makes every device ejectable for
an OS even if it's not the case.
qemu-kvm is able to dynamically add _EJ0 method only to those slots
that either have hotpluggable devices or free for PCI passthrough.
As Xen lacks this capability we cannot use their way.
qemu-xen-traditional DSDT has _STA method which only reports that
the slot is present if there is a PCI devices hotplugged there.
This is done through querying of its PCI hotplug controller.
qemu-xen has similar capability that reports if device is "hotpluggable
or absent" which we can use to achieve the same result.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6761965243b113230bed900d6105be05b28f5cea
master date: 2019-05-24 10:30:21 +0200
Jan Beulich [Tue, 4 Jun 2019 13:54:23 +0000 (15:54 +0200)]
x86/IO-APIC: fix build with gcc9
There are a number of pointless __packed attributes which cause gcc 9 to
legitimately warn:
utils.c: In function 'vtd_dump_iommu_info':
utils.c:287:33: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
287 | remap = (struct IO_APIC_route_remap_entry *) &rte;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
intremap.c: In function 'ioapic_rte_to_remap_entry':
intremap.c:343:25: error: converting a packed 'struct IO_APIC_route_entry' pointer (alignment 1) to a 'struct IO_APIC_route_remap_entry' pointer (alignment 8) may result in an unaligned pointer value [-Werror=address-of-packed-member]
343 | remap_rte = (struct IO_APIC_route_remap_entry *) old_rte;
| ^~~~~~~~~~~~~~~~~~~~~~~~~
Simply drop these attributes. Take the liberty and also re-format the
structure definitions at the same time.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ca9310b24e6205de5387e5982ccd42c35caf89d4
master date: 2019-05-24 10:19:59 +0200
Juergen Gross [Tue, 4 Jun 2019 13:53:50 +0000 (15:53 +0200)]
xen/sched: fix csched2_deinit_pdata()
Commit 753ba43d6d16e688 ("xen/sched: fix credit2 smt idle handling")
introduced a regression when switching cpus between cpupools.
When assigning a cpu to a cpupool with credit2 being the default
scheduler csched2_deinit_pdata() is called for the credit2 private data
after the new scheduler's private data has been hooked to the per-cpu
scheduler data. Unfortunately csched2_deinit_pdata() will cycle through
all per-cpu scheduler areas it knows of for removing the cpu from the
respective sibling masks including the area of the just moved cpu. This
will (depending on the new scheduler) either clobber the data of the
new scheduler or in case of sched_rt lead to a crash.
Avoid that by removing the cpu from the list of active cpus in credit2
data first.
The opposite problem is occurring when removing a cpu from a cpupool:
init_pdata() of credit2 will access the per-cpu data of the old
scheduler.
Jan Beulich [Tue, 4 Jun 2019 13:53:18 +0000 (15:53 +0200)]
x86emul: add support for missing {,V}PMADDWD insns
Their pre-AVX512 incarnations have clearly been overlooked during much
earlier work. Their memory access pattern is entirely standard, so no
specific tests get added to the harness.
Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1a48bdd599b268a2d9b7d0c45f1fd40c4892186e
master date: 2019-05-16 13:43:17 +0200
Jan Beulich [Tue, 4 Jun 2019 13:52:51 +0000 (15:52 +0200)]
x86/IRQ: avoid UB (or worse) in trace_irq_mask()
Dynamically allocated CPU mask objects may be smaller than cpumask_t, so
copying has to be restricted to the actual allocation size. This is
particulary important since the function doesn't bail early when tracing
is not active, so even production builds would be affected by potential
misbehavior here.
Take the opportunity and also
- use initializers instead of assignment + memset(),
- constify the cpumask_t input pointer,
- u32 -> uint32_t.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6fafb8befa99620a2d7323b9eca5c387bad1f59f
master date: 2019-05-13 16:41:03 +0200
Andrew Cooper [Tue, 4 Jun 2019 13:52:23 +0000 (15:52 +0200)]
x86/boot: Fix latent memory corruption with early_boot_opts_t
c/s ebb26b509f "xen/x86: make VGA support selectable" added an #ifdef
CONFIG_VIDEO into the middle the backing space for early_boot_opts_t,
but didn't adjust the structure definition in cmdline.c
This only functions correctly because the affected fields are at the end
of the structure, and cmdline.c doesn't write to them in this case.
To retain the slimming effect of compiling out CONFIG_VIDEO, adjust
cmdline.c with enough #ifdef-ary to make C's idea of the structure match
the declaration in asm. This requires adding __maybe_unused annotations
to two helper functions.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 30596213617fcf4dd7b71d244e16c8fc0acf456b
master date: 2019-05-13 10:35:38 +0100
The limit 1900x1200 do not match real world devices (1900 looks like a
typo, should be 1920). But in practice the limits are arbitrary and do
not serve any real purpose. As discussed in "Increase framebuffer size
to todays standards" thread, drop them completely.
This fixes graphic console on device with 3840x2160 native resolution.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
drivers/video: drop unused limits
MAX_BPP, MAX_FONT_W, MAX_FONT_H are not used in the code at all.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 19600eb75aa9b1df3e4b0a4e55a5d08b957e1fd9
master date: 2019-05-13 10:13:24 +0200
master commit: 343459e34a6d32ba44a21f8b8fe4c1f69b1714c2
master date: 2019-05-13 10:12:56 +0200
When bitmap_fill(..., 0) is called, do not try to write anything. Before
this patch, it tried to write almost LONG_MAX, surely overwriting
something.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 93df28be2d4f620caf18109222d046355ac56327
master date: 2019-05-13 10:12:00 +0200
Tamas K Lengyel [Tue, 4 Jun 2019 13:50:37 +0000 (15:50 +0200)]
x86/vmx: correctly gather gs_shadow value for current vCPU
Currently the gs_shadow value is only cached when the vCPU is being scheduled
out by Xen. Reporting this (usually) stale value through vm_event is incorrect,
since it doesn't represent the actual state of the vCPU at the time the event
was recorded. This prevents vm_event subscribers from correctly finding kernel
structures in the guest when it is trapped while in ring3.
Refresh shadow_gs value when the context being saved is for the current vCPU.
Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: f69fc1c2f36e8a74ba54c9c8fa5c904ea1ad319e
master date: 2019-05-13 09:55:59 +0200
Igor Druzhinin [Tue, 4 Jun 2019 13:50:03 +0000 (15:50 +0200)]
x86/mtrr: recalculate P2M type for domains with iocaps
This change reflects the logic in epte_get_entry_emt() and allows
changes in guest MTTRs to be reflected in EPT for domains having
direct access to certain hardware memory regions but without IOMMU
context assigned (e.g. XenGT).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f3d880bf2be92534c5bacf11de2f561cbad550fb
master date: 2019-05-13 09:54:45 +0200
Jan Beulich [Tue, 4 Jun 2019 13:49:30 +0000 (15:49 +0200)]
AMD/IOMMU: disable previously enabled IOMMUs upon init failure
If any IOMMUs were successfully initialized before encountering failure,
the successfully enabled ones should be disabled again before cleaning
up their resources.
Move disable_iommu() next to enable_iommu() to avoid a forward
declaration, and take the opportunity to remove stray blank lines ahead
of both functions' final closing braces.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
master commit: 87a3347d476443c66c79953d77d6aef1d2bb3bbd
master date: 2019-05-13 09:52:43 +0200
Jan Beulich [Tue, 4 Jun 2019 13:48:05 +0000 (15:48 +0200)]
trace: fix build with gcc9
While I've not observed this myself, gcc 9 (imo validly) reportedly may
complain
trace.c: In function '__trace_hypercall':
trace.c:826:19: error: taking address of packed member of 'struct <anonymous>' may result in an unaligned pointer value [-Werror=address-of-packed-member]
826 | uint32_t *a = d.args;
and the fix is rather simple - remove the __packed attribute. Introduce
a BUILD_BUG_ON() as replacement, for the unlikely case that Xen might
get ported to an architecture where array alignment higher that that of
its elements.
Reported-by: Martin Liška <martin.liska@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 3fd3b266d4198c06e8e421ca515d9ba09ccd5155
master date: 2019-05-13 09:51:23 +0200
Andrew Cooper [Wed, 3 Oct 2018 09:32:54 +0000 (10:32 +0100)]
oxenstored: Don't re-open a xenctrl handle for every domain introduction
Currently, an xc handle is opened in main() which is used for cleanup
activities, and a new xc handle is temporarily opened every time a domain is
introduced. This is inefficient, and amongst other things, requires full root
privileges for the lifetime of oxenstored.
All code using the Xenctrl handle is in domains.ml, so initialise xc as a
global (now happens just before main() is called) and drop it as a parameter
from Domains.create and Domains.cleanup.
Andrew Cooper [Thu, 29 Nov 2018 18:10:38 +0000 (18:10 +0000)]
tools/libxc: Fix issues with libxc and Xen having different featureset lengths
In almost all cases, Xen and libxc will agree on the featureset length,
because they are built from the same source.
However, there are circumstances (e.g. security hotfixes) where the featureset
gets longer and dom0 will, after installing updates, be running with an old
Xen but new libxc. Despite writing the code with this scenario in mind, there
were some bugs.
First, xen-cpuid's get_featureset() erroneously allocates a buffer based on
Xen's featureset length, but records libxc's length, which may be longer.
In this situation, the hypercall bounce buffer code reads/writes the recorded
length, which is beyond the end of the allocated object, and a later free()
encounters corrupt heap metadata. Fix this by recording the same length that
we allocate.
Secondly, get_cpuid_domain_info() has a related bug when the passed-in
featureset is a different length to libxc's.
A large amount of the libxc cpuid functionality depends on info->featureset
being as long as expected, and it is allocated appropriately. However, in the
case that a shorter external featureset is passed in, the logic to check for
trailing nonzero bits may read off the end of it. Rework the logic to use the
correct upper bound.
In addition, leave a comment next to the fields in struct cpuid_domain_info
explaining the relationship between the various lengths, and how to cope with
different lengths.
Igor Druzhinin [Tue, 9 Apr 2019 12:01:58 +0000 (13:01 +0100)]
tools/xl: use libxl_domain_info to get domain type for vcpu-pin
Parsing the config seems to be an overkill for this particular task
and the config might simply be absent. Type returned from libxl_domain_info
should be either LIBXL_DOMAIN_TYPE_HVM or LIBXL_DOMAIN_TYPE_PV but in
that context distinction between PVH and HVM should be irrelevant.
Juergen Gross [Fri, 31 Aug 2018 15:22:04 +0000 (17:22 +0200)]
tools/libxl: correct vcpu affinity output with sparse physical cpu map
With not all physical cpus online (e.g. with smt=0) the output of hte
vcpu affinities is wrong, as the affinity bitmaps are capped after
nr_cpus bits, instead of using max_cpu_id.
Christian Lindig [Wed, 27 Feb 2019 10:33:42 +0000 (10:33 +0000)]
tools/ocaml: Dup2 /dev/null to stdin in daemonize()
Don't close stdin in daemonize() but dup2 /dev/null instead. Otherwise, fd 0
gets reused later:
[root@idol ~]# ls -lav /proc/`pgrep xenstored`/fd
total 0
dr-x------ 2 root root 0 Feb 28 11:02 .
dr-xr-xr-x 9 root root 0 Feb 27 15:59 ..
lrwx------ 1 root root 64 Feb 28 11:02 0 -> /dev/xen/evtchn
l-wx------ 1 root root 64 Feb 28 11:02 1 -> /dev/null
l-wx------ 1 root root 64 Feb 28 11:02 2 -> /dev/null
lrwx------ 1 root root 64 Feb 28 11:02 3 -> /dev/xen/privcmd
...
Signed-off-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit 677e64dbe315343620c3b266e9eb16623b118038)
(cherry picked from commit 4b72470175a592fb5c0a5d10ed505de73778e10f)
tools/misc/xenpm: fix getting info when some CPUs are offline
Use physinfo.max_cpu_id instead of physinfo.nr_cpus to get max CPU id.
This fixes for example 'xenpm get-cpufreq-para' with smt=off, which
otherwise would miss half of the cores.
Jan Beulich [Wed, 15 May 2019 07:55:25 +0000 (09:55 +0200)]
x86: fix build race when generating temporary object files
The rules to generate xen-syms and xen.efi may run in parallel, but both
recursively invoke $(MAKE) to build symbol/relocation table temporary
object files. These recursive builds would both re-generate the .*.d2
files (where needed). Both would in turn invoke the same rule, thus
allowing for a race on the .*.d2.tmp intermediate files.
The dependency files of the temporary .xen*.o files live in xen/ rather
than xen/arch/x86/ anyway, so won't be included no matter what. Take the
opportunity and delete them, as the just re-generated .xen*.S files will
trigger a proper re-build of the .xen*.o ones anyway.
Empty the DEPS variable in case the set of goals consists of just those
temporary object files, thus eliminating the race.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 761bb575ce97255029d2d2249b2719e54bc76825
master date: 2019-04-11 10:25:05 +0200
Initially I had just noticed the unnecessary indirection in the call
from pi_update_irte(). The generic wrapper having an iommu_intremap
conditional made me look at the setup code though. So first of all
enforce the necessary dependency.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6c54663786d9f1ed04153867687c158675e7277d
master date: 2019-04-09 15:12:07 +0200
Petre Pircalabu [Wed, 15 May 2019 07:54:00 +0000 (09:54 +0200)]
vm_event: fix XEN_VM_EVENT_RESUME domctl
Make XEN_VM_EVENT_RESUME return 0 in case of success, instead of
-EINVAL.
Remove vm_event_resume form vm_event.h header and set the function's
visibility to static as is used only in vm_event.c.
Move the vm_event_check_ring test inside vm_event_resume in order to
simplify the code.
Andrew Cooper [Wed, 15 May 2019 07:53:14 +0000 (09:53 +0200)]
xen/timers: Fix memory leak with cpu unplug/plug
timer_softirq_action() realloc's itself a larger timer heap whenever
necessary, which includes bootstrapping from the empty dummy_heap. Nothing
ever freed this allocation.
CPU plug and unplug has the side effect of zeroing the percpu data area, which
clears ts->heap. This in turn causes new timers to be put on the list rather
than the heap, and for timer_softirq_action() to bootstrap itself again.
This in practice leaks ts->heap every time a CPU is unplugged and replugged.
Implement free_percpu_timers() which includes freeing ts->heap when
appropriate, and update the notifier callback with the recent cpu parking
logic and free-avoidance across suspend.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cpu: Fix ARM build following c/s 597fbb8
c/s 597fbb8 "xen/timers: Fix memory leak with cpu unplug/plug" broke the ARM
build by being the first patch to add park_offline_cpus to common code.
While it is currently specific to Intel hardware (for reasons of being able to
handle machine check exceptions without an immediate system reset), it isn't
inherently architecture specific, so define it to be false on ARM for now.
Add a comment in both smp.h headers explaining the intended behaviour of the
option.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
timers: move back migrate_timers_from_cpu() invocation
Commit 597fbb8be6 ("xen/timers: Fix memory leak with cpu unplug/plug")
went a little too far: Migrating timers away from a CPU being offlined
needs to heppen independent of whether it get parked or fully offlined.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
xen/timers: Fix memory leak with cpu unplug/plug (take 2)
Previous attempts to fix this leak failed to identify the root cause, and
ultimately failed. The cause is the CPU_UP_PREPARE case (re)initialising
ts->heap back to dummy_heap, which leaks the previous allocation.
Rearrange the logic to only initialise ts once. This also avoids the
redundant (but benign, due to ts->inactive always being empty) initialising of
the other ts fields.
Juergen Gross [Wed, 15 May 2019 07:52:15 +0000 (09:52 +0200)]
xen/sched: fix credit2 smt idle handling
Credit2's smt_idle_mask_set() and smt_idle_mask_clear() are used to
identify idle cores where vcpus can be moved to. A core is thought to
be idle when all siblings are known to have the idle vcpu running on
them.
Unfortunately the information of a vcpu running on a cpu is per
runqueue. So in case not all siblings are in the same runqueue a core
will never be regarded to be idle, as the sibling not in the runqueue
is never known to run the idle vcpu.
Use a credit2 specific cpumask of siblings with only those cpus
being marked which are in the same runqueue as the cpu in question.
Andrew Cooper [Wed, 12 Dec 2018 19:22:15 +0000 (19:22 +0000)]
x86/spec-ctrl: Introduce options to control VERW flushing
The Microarchitectural Data Sampling vulnerability is split into categories
with subtly different properties:
MLPDS - Microarchitectural Load Port Data Sampling
MSBDS - Microarchitectural Store Buffer Data Sampling
MFBDS - Microarchitectural Fill Buffer Data Sampling
MDSUM - Microarchitectural Data Sampling Uncacheable Memory
MDSUM is a special case of the other three, and isn't distinguished further.
These issues pertain to three microarchitectural buffers. The Load Ports, the
Store Buffers and the Fill Buffers. Each of these structures are flushed by
the new enhanced VERW functionality, but the conditions under which flushing
is necessary vary.
For this concise overview of the issues and default logic, the abbreviations
SP (Store Port), FB (Fill Buffer), LP (Load Port) and HT (Hyperthreading) are
used for brevity:
* Vulnerable hardware is divided into two categories - parts which suffer
from SP only, and parts with any other combination of vulnerabilities.
* SP only has an HT interaction when the thread goes idle, due to the static
partitioning of resources. LP and FB have HT interactions at all points,
due to the competitive sharing of resources. All issues potentially leak
data across the return-to-guest transition.
* The microcode which implements VERW flushing also extends MSR_FLUSH_CMD, so
we don't need to do both on the HVM return-to-guest path. However, some
parts are not vulnerable to L1TF (therefore have no MSR_FLUSH_CMD), but are
vulnerable to MDS, so do require VERW on the HVM path.
Note that we deliberately support mds=1 even without MD_CLEAR in case the
microcode has been updated but the feature bit not exposed.
This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3c04c258ab40405a74e194d9889a4cbc7abe94b4)
Andrew Cooper [Wed, 12 Dec 2018 19:22:15 +0000 (19:22 +0000)]
x86/spec-ctrl: Infrastructure to use VERW to flush pipeline buffers
Three synthetic features are introduced, as we need individual control of
each, depending on circumstances. A later change will enable them at
appropriate points.
The verw_sel field doesn't strictly need to live in struct cpu_info. It lives
there because there is a convenient hole it can fill, and it reduces the
complexity of the SPEC_CTRL_EXIT_TO_{PV,HVM} assembly by avoiding the need for
any temporary stack maintenance.
This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 548a932ac786d6bf3584e4b54f2ab993e1117710)
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: CPUID/MSR definitions for Microarchitectural Data Sampling
The MD_CLEAR feature can be automatically offered to guests. No
infrastructure is needed in Xen to support the guest making use of it.
This is part of XSA-297, CVE-2018-12126, CVE-2018-12127, CVE-2018-12130, CVE-2019-11091.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit d4f6116c080dc013cd1204c4d8ceb95e5f278689)
Andrew Cooper [Wed, 12 Sep 2018 13:36:00 +0000 (14:36 +0100)]
x86/spec-ctrl: Misc non-functional cleanup
* Identify BTI in the spec_ctrl_{enter,exit}_idle() comments, as other
mitigations will shortly appear.
* Use alternative_input() and cover the lack of memory cobber with a further
barrier.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 9b62eba6c429c327e1507816bef403ccc87357ae)
Andrew Cooper [Fri, 5 Apr 2019 12:26:30 +0000 (13:26 +0100)]
x86/boot: Detect the firmware SMT setting correctly on Intel hardware
While boot_cpu_data.x86_num_siblings is an accurate value to use on AMD
hardware, it isn't on Intel when the user has disabled Hyperthreading in the
firmware. As a result, a user which has chosen to disable HT still gets
nagged on L1TF-vulnerable hardware when they haven't chosen an explicit
smt=<bool> setting.
Make use of the largely-undocumented MSR_INTEL_CORE_THREAD_COUNT which in
practice exists since Nehalem, when booting on real hardware. Fall back to
using the ACPI table APIC IDs.
While adjusting this logic, fix a latent bug in amd_get_topology(). The
thread count field in CPUID.0x8000001e.ebx is documented as 8 bits wide,
rather than 2 bits wide.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit b12fec4a125950240573ea32f65c61fb9afa74c3)