Jan Beulich [Tue, 12 Dec 2017 14:13:09 +0000 (15:13 +0100)]
x86/shadow: fix ref-counting error handling
The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.
In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.
This is XSA-250.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 10be8001de7d87be1f0ccdda75cc70e922e56d03
master date: 2017-12-12 14:29:45 +0100
Jan Beulich [Tue, 12 Dec 2017 14:12:45 +0000 (15:12 +0100)]
x86/shadow: fix refcount overflow check
Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.
Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
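The shape of the fix can be illustrated with a small sketch (hypothetical names and field layout, not the actual Xen shadow code): derive the overflow check from the same constant that defines the count field's width, so the two can never drift apart again.

    #include <stdbool.h>
    #include <stdint.h>

    #define SH_REFCOUNT_BITS 25                     /* width of the count field */
    #define SH_REFCOUNT_MAX  ((1u << SH_REFCOUNT_BITS) - 1)

    struct sh_page_sketch {
        uint32_t count : SH_REFCOUNT_BITS;          /* reference count */
        uint32_t other : 32 - SH_REFCOUNT_BITS;     /* unrelated bits */
    };

    /* Take a reference, refusing once the counter would overflow the field. */
    static bool sh_get_ref_sketch(struct sh_page_sketch *sp)
    {
        if (sp->count >= SH_REFCOUNT_MAX)
            return false;                           /* caller must handle failure */
        sp->count++;
        return true;
    }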
This is XSA-249.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 54e2292e8df7a1a7b041192be9d6d797b6d00869
master date: 2017-12-12 14:29:13 +0100
Jan Beulich [Tue, 12 Dec 2017 14:12:24 +0000 (15:12 +0100)]
x86/mm: don't wrongly set page ownership
PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally. At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.
While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.
There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.
This is XSA-248.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ff2a793e15bb0b6254bc849ef8e83e1c284c3583
master date: 2017-12-12 14:28:36 +0100
Jan Beulich [Tue, 12 Dec 2017 14:11:44 +0000 (15:11 +0100)]
x86: don't wrongly trigger linear page table assertion (2)
_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the type which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: e40b0219a8c77741ae48989efb520f4a762a5be3
master date: 2017-12-12 14:27:34 +0100
George Dunlap [Tue, 28 Nov 2017 12:53:39 +0000 (13:53 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation
If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.
Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail. In that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.
Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.
Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.
Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order. Nonetheless,
check the return value and crash if our assumption turns out to be
wrong.
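A rough sketch of that error-handling pattern, with stand-in stubs (simplified signatures, not the actual Xen interfaces):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for the hypervisor primitives discussed above. */
    static int p2m_set_entry_stub(unsigned long gfn, unsigned int order)
    {
        (void)gfn; (void)order;
        return -1;                      /* e.g. p2m memory pool exhausted */
    }
    static void domain_crash_stub(void)
    {
        fprintf(stderr, "domain crashed\n");
    }

    /* Never assume the p2m update succeeded; report failure to the caller. */
    static bool decrease_reservation_sketch(unsigned long gfn, unsigned int order)
    {
        if (p2m_set_entry_stub(gfn, order) != 0) {
            /* With order > 0 the entry may have been partially changed,
             * so crashing the domain is the safest response. */
            if (order > 0)
                domain_crash_stub();
            return false;               /* caller falls back to guest_remove_page() */
        }
        return true;
    }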
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a3d64de8e86f5812917d2d0af28298f80debdf9a
master date: 2017-11-28 13:13:26 +0100
George Dunlap [Tue, 28 Nov 2017 12:53:14 +0000 (13:53 +0100)]
p2m: Always check to see if removing a p2m entry actually worked
The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.
Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.
The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m. If the guest
afterwards balloons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.
There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before. Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.
While we're here, use PAGE_ORDER_2M rather than a magic constant.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 92790672dedf2eab042e04ecc277c19d40fd348a
master date: 2017-11-28 13:13:03 +0100
Julien Grall [Tue, 28 Nov 2017 12:52:22 +0000 (13:52 +0100)]
x86/pod: prevent infinite loop when shattering large pages
When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).
p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated,
so the callers keep retrying the PoD operation indefinitely.
Prevent the infinite loop by returning false when it is not possible to
shatter the large mapping.
This is XSA-246.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a1c6c6768971ea387d7eba0803908ef0928b43ac
master date: 2017-11-28 13:11:55 +0100
Andrew Cooper [Thu, 16 Nov 2017 11:19:11 +0000 (12:19 +0100)]
x86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()
The fix for XSA-243 / CVE-2017-15592 (c/s bf2b4eadcf379) introduced a change
in behaviour for sh_guess_wrmap(), where it had to cope with no shadow linear
mapping being present.
As the name suggests, guest_vtable is a mapping of the guest's pagetable, not
Xen's pagetable, meaning that it isn't the pagetable we need to check for the
shadow linear slot in.
The practical upshot is that a shadow HVM vcpu which switches into 4-level
paging mode, with an L4 pagetable that contains a mapping which aliases Xen's
SH_LINEAR_PT_VIRT_START will fool the safety check for whether a SHADOW_LINEAR
mapping is present. As the check passes (when it should have failed), Xen
subsequently falls over the missing mapping with a page fault.
Jan Beulich [Thu, 16 Nov 2017 11:18:27 +0000 (12:18 +0100)]
x86: don't wrongly trigger linear page table assertion
_put_page_type() may do multiple iterations until its cmpxchg()
succeeds. It invokes set_tlbflush_timestamp() on the first
iteration, however. Code inside the function takes care of this, but
- the assertion in _put_final_page_type() would trigger on the second
iteration if time stamps in a debug build are permitted to be
sufficiently much wider than the default 6 bits (see WRAP_MASK in
flushtlb.c),
- it returning -EINTR (for a continuation to be scheduled) would leave
the page in an inconsistent state (until the re-invocation completes).
Make the set_tlbflush_timestamp() invocation conditional, bypassing it
(for now) only in the case where we really can't tolerate the stamp
being stored.
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Thu, 16 Nov 2017 11:17:15 +0000 (12:17 +0100)]
x86/mm: Make PV linear pagetables optional
Allowing pagetables to point to other pagetables of the same level
(often called 'linear pagetables') has been included in Xen since its
inception; but recently it has been the source of a number of subtle
reference-counting bugs.
It is not used by Linux or MiniOS; but it is used by NetBSD and Novell
NetWare. There are significant numbers of people who are never going
to use the feature, along with significant numbers who need the
feature.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 24 Oct 2017 14:53:36 +0000 (16:53 +0200)]
gnttab: fix pin count / page reference race
Dropping page references before decrementing pin counts is a bad idea
if assumptions are being made that a non-zero pin count implies a valid
page. Fix the order of operations in gnttab_copy_release_buf(), but at
the same time also remove the assertion that was found to trigger:
map_grant_ref() also has the potential of causing a race here, and
changing the order of operations there would likely be quite a bit more
involved.
This is CVE-2017-15597 / XSA-236.
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e008f7619dcd6d549727c9635b3f9f3c7ee483ed
master date: 2017-10-24 16:01:33 +0200
Jan Beulich [Thu, 12 Oct 2017 14:04:34 +0000 (16:04 +0200)]
x86/HVM: prefill partially used variable on emulation paths
Certain handlers ignore the access size (vioapic_write() being the
example this was found with), perhaps leading to subsequent reads
seeing data that wasn't actually written by the guest. For
consistency and extra safety also do this on the read path of
hvm_process_io_intercept(), even if this doesn't directly affect what
guests get to see, as we've supposedly already dealt with read handlers
leaving data completely uninitialized.
This is XSA-239.
Reported-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 0d4732ac29b63063764c29fa3bd8946daf67d6f3
master date: 2017-10-12 14:43:26 +0200
Andrew Cooper [Thu, 12 Oct 2017 14:02:36 +0000 (16:02 +0200)]
x86/cpu: Fix IST handling during PCPU bringup
Clear IST references in newly allocated IDTs. Nothing good will come of
having them set before the TSS is suitably constructed (although the chances
of the CPU surviving such an IST interrupt/exception are extremely slim).
Uniformly set the IST references after the TSS is in place. This fixes an
issue on AMD hardware, where onlining a PCPU while PCPU0 is in HVM context
will cause IST_NONE to be copied into the new IDT, making that PCPU vulnerable
to privilege escalation from PV guests until it subsequently schedules an HVM
guest.
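A simplified sketch of that ordering (hypothetical structure and vector numbers, not the real IDT/TSS code):

    #include <stddef.h>
    #include <stdint.h>

    #define IST_NONE 0

    struct idt_entry_sketch {
        uint8_t ist;                    /* IST index, 0 = no stack switch */
        /* remaining gate fields omitted */
    };

    /* Newly allocated IDT: no IST references before the TSS exists. */
    static void init_idt_sketch(struct idt_entry_sketch *idt, size_t nr)
    {
        for (size_t i = 0; i < nr; i++)
            idt[i].ist = IST_NONE;
    }

    /* Only once the TSS is loaded may the critical vectors use IST stacks. */
    static void enable_ist_sketch(struct idt_entry_sketch *idt)
    {
        idt[2].ist  = 1;                /* e.g. NMI */
        idt[8].ist  = 2;                /* e.g. #DF */
        idt[18].ist = 3;                /* e.g. #MC */
    }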
This is XSA-244.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cc08c73c8c1f5ba5ed0f8274548db6725e1c3157
master date: 2017-10-12 14:50:31 +0200
Andrew Cooper [Thu, 12 Oct 2017 14:02:10 +0000 (16:02 +0200)]
x86/shadow: Don't create self-linear shadow mappings for 4-level translated guests
When initially creating a monitor table for 4-level translated guests, don't
install a shadow-linear mapping. This mapping is actually self-linear, and
trips up the writeable heuristic logic into following Xen's mappings, not the
guest's shadows it was expecting to follow.
A consequence of this is that sh_guess_wrmap() needs to cope with there being
no shadow-linear mapping present, which in practice occurs once each time a
vcpu switches to 4-level paging from a different paging mode.
An appropriate shadow-linear slot will be inserted into the monitor table
either while constructing lower level monitor tables, or by sh_update_cr3().
While fixing this, clarify the safety of the other mappings. Despite
appearing unsafe, it is correct to create a guest-linear mapping for
translated domains; this is self-linear and doesn't point into the translated
domain. Drop a dead clause for translate != external guests.
This is XSA-243.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: bf2b4eadcf379d0361b38de9725ea5f7a18a5205
master date: 2017-10-12 14:50:07 +0200
Jan Beulich [Thu, 12 Oct 2017 14:01:18 +0000 (16:01 +0200)]
x86: don't allow page_unlock() to drop the last type reference
Only _put_page_type() does the necessary cleanup, and hence not all
domain pages can be released during guest cleanup (leaving around
zombie domains) if we get this wrong.
Jan Beulich [Thu, 12 Oct 2017 14:00:47 +0000 (16:00 +0200)]
x86: don't store possibly stale TLB flush time stamp
While the timing window is extremely narrow, it is theoretically
possible for an update to the TLB flush clock and a subsequent flush
IPI to happen between the read and write parts of the update of the
per-page stamp. Exclude this possibility by disabling interrupts
across the update, preventing the IPI from being serviced in the middle.
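The shape of the fix, as a sketch with stand-in primitives (not the actual Xen helpers):

    #include <stdint.h>

    /* Stand-ins for the real interrupt-masking and flush-clock primitives. */
    static unsigned long local_irq_save_stub(void) { return 0; }
    static void local_irq_restore_stub(unsigned long flags) { (void)flags; }
    static uint32_t tlbflush_clock_stub(void) { return 0; }

    struct page_sketch {
        uint32_t tlbflush_timestamp;
    };

    static void page_set_tlbflush_timestamp_sketch(struct page_sketch *pg)
    {
        /* Mask interrupts so the flush IPI cannot be serviced between
         * reading the flush clock and storing the per-page stamp. */
        unsigned long flags = local_irq_save_stub();

        pg->tlbflush_timestamp = tlbflush_clock_stub();
        local_irq_restore_stub(flags);
    }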
This is XSA-241.
Reported-by: Jann Horn <jannh@google.com> Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 23a183607a427572185fc51c76cc5ab11c00c4cc
master date: 2017-10-12 14:48:25 +0200
Jan Beulich [Thu, 12 Oct 2017 13:59:44 +0000 (15:59 +0200)]
x86: limit linear page table use to a single level
That's the only way that they're meant to be used. Without such a
restriction arbitrarily long chains of same-level page tables can be
built, tearing down of which may then cause arbitrarily deep recursion,
causing a stack overflow. To facilitate this restriction, a counter is
being introduced to track both the number of same-level entries in a
page table as well as the number of uses of a page table in another
same-level one (counting into positive and negative direction
respectively, utilizing the fact that both counts can't be non-zero at
the same time).
Note that the added accounting introduces a restriction on the number
of times a page can be used in other same-level page tables - more than
32k of such uses are no longer possible.
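One way to picture the counter counting in two directions, as a sketch (hypothetical field name and limits, not the actual struct page_info layout):

    #include <stdbool.h>
    #include <stdint.h>

    struct pt_sketch {
        int16_t linear_pt_count;        /* > 0: same-level entries contained,
                                           < 0: uses in other same-level tables */
    };

    /* This page table gains an entry pointing at another same-level table. */
    static bool inc_linear_entries(struct pt_sketch *pt)
    {
        if (pt->linear_pt_count < 0 || pt->linear_pt_count == INT16_MAX)
            return false;               /* already a linear target, or full */
        pt->linear_pt_count++;
        return true;
    }

    /* This page table becomes referenced from another same-level table. */
    static bool inc_linear_uses(struct pt_sketch *pt)
    {
        if (pt->linear_pt_count > 0 || pt->linear_pt_count == INT16_MIN)
            return false;               /* already contains such entries, or full */
        pt->linear_pt_count--;
        return true;
    }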
Note also that some put_page_and_type[_preemptible]() calls are
replaced with open-coded equivalents. This seemed preferable to
adding "parent_table" to the matrix of functions.
Note further that cross-domain same-level page table references are no
longer permitted (they probably never should have been).
This is XSA-240.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6987fc7558bdbab8119eabf026e3cdad1053f0e5
master date: 2017-10-12 14:44:34 +0200
A misbehaving device model can pass incorrect XEN_DMOP_map/
unmap_io_range_to_ioreq_server arguments, namely end < start when
specifying an address range. When this happens we hit ASSERT(s <= e) in
rangeset_contains_range()/rangeset_overlaps_range() with debug builds.
Production builds will not trap right away but may misbehave later
while handling such bogus ranges.
Jan Beulich [Thu, 12 Oct 2017 13:58:52 +0000 (15:58 +0200)]
x86/FLASK: fix unmap-domain-IRQ XSM hook
The caller and the FLASK implementation of xsm_unmap_domain_irq()
disagreed about what the "data" argument points to in the MSI case:
Change both sides to pass/take a PCI device.
This is part of XSA-237.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6f17f5c43a3bd28d27ed8133b2bf513e2eab7d59
master date: 2017-10-12 14:37:56 +0200
Jan Beulich [Thu, 12 Oct 2017 13:58:00 +0000 (15:58 +0200)]
x86/MSI: disallow redundant enabling
At the moment, Xen attempts to allow redundant enabling of MSI by
having pci_enable_msi() return 0, and point to the existing MSI
descriptor, when the MSI already exists.
Unfortunately, if subsequent errors are encountered, the cleanup
paths assume pci_enable_msi() had done full initialization, and
hence undo everything that was assumed to be done by that
function without also undoing other setup that would normally
occur only after that function was called (in map_domain_pirq()
itself).
Rather than try to make the redundant enabling case work properly, just
forbid it entirely by having pci_enable_msi() return -EEXIST when MSI
is already set up.
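The behavioural change, sketched with a toy state structure (not the real pci_enable_msi() signature):

    #include <errno.h>

    struct msi_dev_sketch {
        int msi_enabled;
    };

    static int pci_enable_msi_sketch(struct msi_dev_sketch *dev)
    {
        if (dev->msi_enabled)
            return -EEXIST;             /* previously: return 0 and point at
                                           the existing descriptor */
        dev->msi_enabled = 1;           /* full initialization happens only here */
        return 0;
    }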
This is part of XSA-237.
Reported-by: HW42 <hw42@ipsumj.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a46126fec20e0cf4f5442352ef45efaea8c89646
master date: 2017-10-12 14:36:58 +0200
Jan Beulich [Thu, 12 Oct 2017 13:56:33 +0000 (15:56 +0200)]
x86: enforce proper privilege when (un)mapping pIRQ-s
(Un)mapping of IRQs, just like other RESOURCE__ADD* / RESOURCE__REMOVE*
actions (in FLASK terms) should be XSM_DM_PRIV rather than XSM_TARGET.
This in turn requires bypassing the XSM check in physdev_unmap_pirq()
for the HVM emuirq case just like is being done in physdev_map_pirq().
The primary goal security wise, however, is to no longer allow HVM
guests, by specifying their own domain ID instead of DOMID_SELF, to
enter code paths intended for PV guests and the control domains of HVM
guests only.
This is part of XSA-237.
Reported-by: HW42 <hw42@ipsumj.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: db72faf69c94513e180568006a9d899ed422ff90
master date: 2017-10-12 14:36:30 +0200
Jan Beulich [Thu, 12 Oct 2017 13:55:58 +0000 (15:55 +0200)]
x86: don't allow MSI pIRQ mapping on unowned device
MSI setup should be permitted only for existing devices owned by the
respective guest (the operation may still be carried out by the domain
controlling that guest).
This is part of XSA-237.
Reported-by: HW42 <hw42@ipsumj.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3308374b1be7d43e23bd2e9eaf23ec06d7959882
master date: 2017-10-12 14:35:14 +0200
Julien Grall [Fri, 6 Oct 2017 13:22:47 +0000 (15:22 +0200)]
xen/arm: Correctly report the memory region in the dummy NUMA helpers
NUMA is currently not supported on Arm. Because common code is
NUMA-aware, dummy helpers are instead provided to expose a single node.
Those helpers are for instance used to know the region to scrub.
However the memory region is not reported correctly. Indeed, the
frametable may not be at the beginning of the memory and there might be
multiple memory banks. This will lead to parts of the memory not being
scrubbed.
The memory information can be found using:
* first_valid_mfn as the start of the memory
* max_page - first_valid_mfn as the spanned pages
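A sketch of dummy helpers reporting the node span this way (stand-in names, assuming only the two variables described above):

    extern unsigned long first_valid_mfn;   /* first usable machine frame */
    extern unsigned long max_page;          /* one past the last valid frame */

    /* Single dummy node covering all of memory. */
    static unsigned long node_start_pfn_sketch(unsigned int nid)
    {
        (void)nid;
        return first_valid_mfn;
    }

    static unsigned long node_spanned_pages_sketch(unsigned int nid)
    {
        (void)nid;
        return max_page - first_valid_mfn;
    }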
Note that first_valid_mfn is now being exported. The prototype has been
added to asm-arm/numa.h and not to a common header because I would
expect the variable to become static once NUMA is fully supported on
Arm.
Julien Grall [Fri, 6 Oct 2017 13:22:08 +0000 (15:22 +0200)]
xen/page_alloc: Cover memory unreserved after boot in first_valid_mfn
On Arm, some regions (e.g. initramfs, Dom0 kernel, ...) are marked as
reserved until the hardware domain is built and they are copied into its
memory. Therefore, they will not be added in the boot allocator via
init_boot_pages.
Instead, init_xenheap_pages will be called once the regions are not used
anymore.
Update first_valid_mfn in both init_heap_pages and init_boot_pages
(where the update already exists) to cover all the cases.
Jan Beulich [Tue, 12 Sep 2017 13:18:04 +0000 (15:18 +0200)]
gnttab: also validate PTE permissions upon destroy/replace
In order for PTE handling to match up with the reference counting done
by common code, presence and writability of grant mapping PTEs must
also be taken into account; validating just the frame number is not
enough. This is in particular relevant if a guest fiddles with grant
PTEs via non-grant hypercalls.
Note that the flags being passed to replace_grant_host_mapping()
already happen to be those of the existing mapping, so no new function
parameter is needed.
This is CVE-2017-14319 / XSA-234.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 16b1414de91b5a82a0996c67f6db3af7d7e32873
master date: 2017-09-12 14:45:13 +0200
tools/xenstore: don't unlink connection object twice
A connection object of a domain with associated stubdom has two
parents: the domain and the stubdom. When cleaning up the list of
active domains in domain_cleanup() make sure not to unlink the
connection twice from the same domain. This could happen when the
domain and its stubdom are being destroyed at the same time leading
to the domain loop being entered twice.
Additionally don't use talloc_free() in this case as it will remove
a random parent link, leading eventually to a memory leak. Use
talloc_unlink() instead specifying the context from which the
connection object should be removed.
This is CVE-2017-14317 / XSA-233.
Reported-by: Eric Chanudet <chanudete@ainfosec.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 562a1c0f7ef3fbf3c122c3dfa4f2ad9dd51da9fe
master date: 2017-09-12 14:44:56 +0200
Andrew Cooper [Tue, 12 Sep 2017 13:17:22 +0000 (15:17 +0200)]
grant_table: fix GNTTABOP_cache_flush handling
Don't fall over a NULL grant_table pointer when the owner of the domain
is a system domain (DOMID_{XEN,IO} etc).
This is CVE-2017-14318 / XSA-232.
Reported-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c3d830b244998b3686e2eb64db95996be5eb5e5c
master date: 2017-09-12 14:44:11 +0200
George Dunlap [Tue, 12 Sep 2017 13:16:48 +0000 (15:16 +0200)]
xen/mm: make sure node is less than MAX_NUMNODES
The output of MEMF_get_node(memflags) can be as large as nodeid_t can
hold (currently 255). This is then used as an index into arrays of size
MAX_NUMNODES, which is 64 on x86 and 1 on ARM. The value can be passed
in by an untrusted guest (via memory_exchange and increase_reservation)
and is not currently bounds-checked.
Check the value in page_alloc.c before using it, and also check the
value in the hypercall call sites and return -EINVAL if appropriate.
Don't permit domains other than the hardware or control domain to
allocate node-constrained memory.
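The bounds check being added, in sketch form (example constant; the real MAX_NUMNODES comes from the Xen headers):

    #include <stdbool.h>

    typedef unsigned char nodeid_t;     /* can hold values up to 255 */

    #define MAX_NUMNODES_SKETCH 64      /* the x86 value quoted above */

    /* Reject out-of-range node IDs before they are used as array indices. */
    static bool node_is_valid_sketch(nodeid_t node)
    {
        return node < MAX_NUMNODES_SKETCH;
    }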
This is CVE-2017-14316 / XSA-231.
Reported-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fece35303529395bfea6b03d2268380ef682c93
master date: 2017-09-12 14:43:16 +0200
Jan Beulich [Wed, 23 Aug 2017 15:55:38 +0000 (17:55 +0200)]
arm/mm: release grant lock on xenmem_add_to_physmap_one() error paths
Commit 55021ff9ab ("xen/arm: add_to_physmap_one: Avoid to map mfn 0 if
an error occurs") introduced error paths not releasing the grant table
lock. Replace them by a suitable check after the lock was dropped.
This is XSA-235.
Reported-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
master commit: 59546c1897a90fe9af5ebbbb05ead8d98b4d17b9
master date: 2017-08-23 17:45:45 +0200
Jan Beulich [Thu, 17 Aug 2017 13:15:55 +0000 (15:15 +0200)]
gnttab: fix transitive grant handling
Processing of transitive grants must not use the fast path, or else
reference counting breaks due to the skipped recursive call to
__acquire_grant_for_copy() (its __release_grant_for_copy()
counterpart occurs independent of original pin count). Furthermore
after re-acquiring temporarily dropped locks we need to verify no grant
properties changed if the original pin count was non-zero; checking
just the pin counts is sufficient only for well-behaved guests. As a
result, __release_grant_for_copy() needs to mirror that new behavior.
Furthermore a __release_grant_for_copy() invocation was missing on the
retry path of __acquire_grant_for_copy(), and gnttab_set_version() also
needs to bail out upon encountering a transitive grant.
This is part of XSA-226.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad48fb963dbff02762d2db5396fa655ac0c432c7
master date: 2017-08-17 14:40:31 +0200
Jan Beulich [Thu, 17 Aug 2017 13:15:35 +0000 (15:15 +0200)]
gnttab: don't use possibly unbounded tail calls
There is no guarantee that the compiler would actually translate them
to branches instead of calls, so only ones with a known recursion limit
are okay:
- __release_grant_for_copy() can call itself only once, as
__acquire_grant_for_copy() won't permit use of multi-level transitive
grants,
- __acquire_grant_for_copy() is fine to call itself with the last
argument false, as that prevents further recursion,
- __acquire_grant_for_copy() must not call itself to recover from an
observed change to the active entry's pin count.
This is part of XSA-226.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 999d2ccb7f73408aa22656e1ba2f98b077eaa1c2
master date: 2017-08-17 14:39:18 +0200
Jan Beulich [Tue, 15 Aug 2017 13:35:46 +0000 (15:35 +0200)]
gnttab: correct pin status fixup for copy
Even though copy operations only set GNTPIN_hst*, GNTPIN_dev* also
needs to be taken into account when deciding whether to clear
_GTF_{read,writ}ing. At least for consistency with code elsewhere the
read part is better off not using any mask at all.
This is XSA-230.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6e2a4c73564ab907b732059adb317d6ca2d138a2
master date: 2017-08-15 15:08:03 +0200
Andrew Cooper [Tue, 15 Aug 2017 13:33:09 +0000 (15:33 +0200)]
x86/grant: disallow misaligned PTEs
Pagetable entries must be aligned to function correctly. Disallow attempts
from the guest to have a grant PTE created at a misaligned address, which
would result in corruption of the L1 table with largely-guest-controlled
values.
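The alignment requirement, sketched (hypothetical helper; on x86-64 an L1 entry is 8 bytes):

    #include <stdbool.h>
    #include <stdint.h>

    /* A grant PTE update address must be naturally aligned to the entry size,
     * otherwise the write would straddle two L1 entries. */
    static bool grant_pte_addr_ok_sketch(uint64_t addr)
    {
        return (addr & (sizeof(uint64_t) - 1)) == 0;
    }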
This is CVE-2017-12137 / XSA-227.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ce442926c2530da9376199dcc769436376ad2386
master date: 2017-08-15 15:06:45 +0200
Jan Beulich [Tue, 20 Jun 2017 14:59:16 +0000 (16:59 +0200)]
gnttab: __gnttab_unmap_common_complete() is all-or-nothing
All failures have to be detected in __gnttab_unmap_common(); the
completion function must not skip part of its processing. In particular
the GNTMAP_device_map related putting of page references and adjustment
of pin count must not occur if __gnttab_unmap_common() signaled an
error. Furthermore the function must not make adjustments to global
state (here: clearing GNTMAP_device_map) before all possibly failing
operations have been performed.
There's one exception for IOMMU related failures: As IOMMU manipulation
occurs after GNTMAP_*_map have been cleared already, the related page
reference and pin count adjustments need to be done nevertheless. A
fundamental requirement for the correctness of this is that
iommu_{,un}map_page() crash any affected DomU in case of failure.
The version check appears to be pointless (or could perhaps be a
BUG_ON() or ASSERT()), but for the moment also move it.
George Dunlap [Tue, 20 Jun 2017 14:58:54 +0000 (16:58 +0200)]
gnttab: correct logic to get page references during map requests
The rules for reference counting are somewhat complicated:
* Each of GNTTAB_host_map and GNTTAB_device_map need their own
reference count
* If the mapping is writeable:
- GNTTAB_host_map needs a type count under only some conditions
- GNTTAB_device_map always needs a type count
If the mapping succeeds, we need to keep all of these; if the mapping
fails, we need to release whatever references we have acquired so far.
Additionally, the code that does a lot of this calculation "inherits"
a reference as part of the process of finding out who the owner is.
Finally, if the grant is mapped as writeable (without the
GNTMAP_readonly flag), but the hypervisor cannot grab a
PGT_writeable_page type, the entire operation should fail.
Unfortunately, the current code has several logic holes:
* If a grant is mapped only GNTTAB_device_map, and with a writeable
mapping, but in conditions where a *host* type count is not
necessary, the code will fail to grab the necessary type count.
* If a grant is mapped both GNTTAB_device_map and GNTTAB_host_map,
with a writeable mapping, in conditions where the host type count is
not necessary, *and* where the page cannot be changed to type
PGT_writeable, the condition will not be detected.
In both cases, this means that on success, the type count will be
erroneously reduced when the grant is unmapped. In the second case,
the type count will be erroneously reduced on the failure path as
well. (In the first case the failure path logic has the same hole
as the reference grabbing logic.)
Additionally, the return value of get_page() is not checked; but this
may fail even if the first get_page() succeeded due to a reference
counting overflow.
First of all, simplify the restoration logic by explicitly counting
the reference and type references acquired.
Consider each mapping type separately, explicitly marking the
'incoming' reference as used so we know when we need to grab a second
one.
Finally, always check the return value of get_page[_type]() and go to
the failure path if appropriate.
This is part of XSA-224.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 75b384ece635adc55c2bafbdc2d8959c10542c31
master date: 2017-06-20 14:46:21 +0200
Jan Beulich [Tue, 20 Jun 2017 14:58:35 +0000 (16:58 +0200)]
gnttab: never create host mapping unless asked to
We shouldn't create a host mapping unless asked to even in the case of
mapping a granted MMIO page. In particular the mapping wouldn't be torn
down when processing the matching unmap request.
George Dunlap [Tue, 20 Jun 2017 14:58:12 +0000 (16:58 +0200)]
gnttab: fix handling of dev_bus_addr during unmap
If a grant has been mapped with the GNTTAB_device_map flag, calling
grant_unmap_ref() with dev_bus_addr set to zero should cause the
GNTTAB_device_map part of the mapping to be left alone.
Unfortunately, at the moment, op->dev_bus_addr is implicitly checked
before clearing the map and adjusting the pin count, but only the bits
above 12; and it is not checked at all before dropping page
references. This means a guest can repeatedly make such a call to
cause the reference count to drop to zero, causing the page to be
freed and re-used, even though it's still mapped in its pagetables.
To fix this, always check op->dev_bus_addr explicitly for being
non-zero, as well as op->flag & GNTMAP_device_map, before doing
operations on the device_map.
While we're here, make the logic a bit cleaner:
* Always initialize op->frame to zero and set it from act->frame, to reduce the
chance of untrusted input being used
* Explicitly check the full dev_bus_addr against act->frame <<
PAGE_SHIFT, rather than ignoring the lower 12 bits
This is part of XSA-224.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8fdfcb2b6bcd074776560e76843815f124d587f1
master date: 2017-06-20 14:45:33 +0200
Julien Grall [Tue, 20 Jun 2017 14:57:41 +0000 (16:57 +0200)]
arm: vgic: Don't update the LR when the IRQ is not enabled
gic_raise_inflight_irq will be called if the IRQ is already inflight
(i.e. the IRQ is injected to the guest). If the IRQ is already in
the LRs, then the associated LR will be updated.
To know if the interrupt is already in the LR, the function checks whether the
interrupt is queued. However, if the interrupt is not enabled then the
interrupt may be neither queued nor in the LR. So gic_update_one_lr may be
called (if we inject on the current vCPU) and read the LR.
Because the interrupt is not in the LR, Xen will either read:
* LR 0 if the interrupt was never injected before
* LR 255 (GIC_INVALID_LR) if the interrupt was injected once. This
is because gic_update_one_lr will reset p->lr.
Reading LR 0 may result in updating the wrong interrupt and in the LRs
getting out of sync with Xen.
Reading LR 255 will result in:
* Crashing Xen on GICv3, as the LR index is bigger than supported (see
gicv3_ich_read_lr).
* Always reading/writing GICH_LR + 255 * 4, which is not part of the
mapped memory.
The problem can be prevented by checking whether the interrupt is
enabled in gic_raise_inflight_irq before calling gic_update_one_lr.
A follow-up of this patch is expected to mitigate the issue in the
future.
Jan Beulich [Tue, 20 Jun 2017 14:57:14 +0000 (16:57 +0200)]
guest_physmap_remove_page() needs its return value checked
Callers, namely those subsequently freeing the page, must not blindly
assume success - the function may fail when it needs to shatter a
super page but there is no memory available for the then required
intermediate page table.
As it happens, guest_remove_page() callers now also all check the
return value.
Furthermore a missed put_gfn() on an error path in gnttab_transfer() is
also being taken care of.
This is part of XSA-222.
Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0cce6048d010a30ac82f8db7787bbf9aada64f4
master date: 2017-06-20 14:41:16 +0200
Andrew Cooper [Tue, 20 Jun 2017 14:56:28 +0000 (16:56 +0200)]
memory: fix return value handing of guest_remove_page()
Despite the description in mm.h, guest_remove_page() previously returned 0 for
paging errors.
Switch guest_remove_page() to having regular 0/-error semantics, and propagate
the return values from clear_mmio_p2m_entry() and mem_sharing_unshare_page()
to the callers (although decrease_reservation() is the only caller which
currently cares).
This is part of XSA-222.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b614f642c35da5184416787352f51a6379a92628
master date: 2017-06-20 14:39:56 +0200
Jan Beulich [Tue, 20 Jun 2017 14:56:04 +0000 (16:56 +0200)]
evtchn: avoid NULL derefs
Commit fbbd5009e6 ("evtchn: refactor low-level event channel port ops")
added a de-reference of the struct evtchn pointer for a port without
first making sure the bucket pointer is non-NULL. This de-reference is
actually entirely unnecessary, as all relevant callers (beyond the
problematic do_poll()) already hold the port number in their hands, and
the actual leaf functions need nothing else.
For FIFO event channels there's a second problem in that the ordering
of reads and updates to ->num_evtchns and ->event_array[] was so far
undefined (the read side isn't always holding the domain's event lock).
Add respective barriers.
Jan Beulich [Tue, 20 Jun 2017 14:55:09 +0000 (16:55 +0200)]
x86: avoid leaking BND* between vCPU-s
For MPX (BND<n>, BNDCFGU, and BNDSTATUS) the situation is less clear,
and the SDM has not entirely consistent information for that case.
While experimentally the instructions don't change register state as
long as the two XCR0 bits aren't both 1, be on the safe side and enable
both if BNDCFGS.EN is being set the first time.
This is XSA-220.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: de20bb6c4f65c4161e0931402613f9ffac86302d
master date: 2017-06-20 14:36:51 +0200
Andrew Cooper [Tue, 20 Jun 2017 14:54:16 +0000 (16:54 +0200)]
x86/shadow: hold references for the duration of emulated writes
The (misnamed) emulate_gva_to_mfn() function translates a linear address to an
mfn, but releases its page reference before returning the mfn to its caller.
sh_emulate_map_dest() uses the results of one or two translations to construct
a virtual mapping to the underlying frames, completes an emulated
write/cmpxchg, then unmaps the virtual mappings.
The page references need holding until the mappings are unmapped, or the
frames can change ownership before the writes occur.
This is XSA-219.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 26217aff67ae1538d4e1b2226afab6993cdbe772
master date: 2017-06-20 14:36:11 +0200
Jan Beulich [Tue, 20 Jun 2017 14:52:39 +0000 (16:52 +0200)]
gnttab: correct maptrack table access
In order to observe a consistent (limit,pointer-table) pair, the reader
needs to either hold the grant table lock or both sides need to order
their accesses suitably (the writer side barrier is already there). Add
the missing barrier.
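The required pairing can be expressed as a C11 sketch (stand-in structure; the hypervisor uses its own barrier primitives rather than <stdatomic.h>):

    #include <stdatomic.h>

    struct maptrack_sketch {
        _Atomic(unsigned int) limit;    /* number of valid entries */
        _Atomic(void *) table;          /* pointer table */
    };

    /* Writer: install the new table before publishing the enlarged limit. */
    static void maptrack_grow_sketch(struct maptrack_sketch *m,
                                     void *new_table, unsigned int new_limit)
    {
        atomic_store_explicit(&m->table, new_table, memory_order_relaxed);
        atomic_store_explicit(&m->limit, new_limit, memory_order_release);
    }

    /* Reader: load the limit with acquire ordering, so any entry below it is
     * backed by the table pointer published together with that limit. */
    static void *maptrack_snapshot_sketch(struct maptrack_sketch *m,
                                          unsigned int *limit)
    {
        *limit = atomic_load_explicit(&m->limit, memory_order_acquire);
        return atomic_load_explicit(&m->table, memory_order_relaxed);
    }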
This is part of XSA-218.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4b78efa91c8ae3c42e14b8eaeaad773c5eb3b71a
master date: 2017-06-20 14:34:34 +0200
George Dunlap [Tue, 20 Jun 2017 14:52:18 +0000 (16:52 +0200)]
gnttab: Avoid potential double-put of maptrack entry
Each grant mapping for a particular domain is tracked by an in-Xen
"maptrack" entry. This entry is is referenced by a "handle", which is
given to the guest when it calls gnttab_map_grant_ref().
There are two types of mapping a particular handle can refer to:
GNTMAP_host_map and GNTMAP_device_map. A given
gnttab_unmap_grant_ref() call can remove either only one or both of
these entries. When a particular handle has no entries left, it must
be freed.
gnttab_unmap_grant_ref() loops through its grant unmap request list
twice. It first removes entries from any host pagetables and (if
appropriate) iommus; then it does a single domain TLB flush; then it
does the clean-up, including telling the granter that entries are no
longer being used (if appropriate).
At the moment, it's during the first pass that the maptrack flags are
cleared, but the second pass that the maptrack entry is freed.
Unfortunately this allows the following race, which results in a
double-free:
A: (pass 1) clear host_map
B: (pass 1) clear device_map
A: (pass 2) See that maptrack entry has no mappings, free it
B: (pass 2) See that maptrack entry has no mappings, free it
Unfortunately, unlike the active entry pinning update, we can't simply
move the maptrack flag changes to the second half, because the
maptrack flags are used to determine if iommu entries need to be
added: a domain's iommu must never have fewer permissions than the
maptrack flags indicate, or a subsequent map_grant_ref() might fail to
add the necessary iommu entries.
Instead, free the maptrack entry in the first pass if there are no
further mappings.
This is part of XSA-218.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: b7f6cbb9d43f7384e1f38f8764b9a48216c8a525
master date: 2017-06-20 14:33:13 +0200
Jan Beulich [Tue, 20 Jun 2017 14:51:58 +0000 (16:51 +0200)]
gnttab: fix unmap pin accounting race
Once all {writable} mappings of a grant entry have been unmapped, the
hypervisor informs the guest that the grant entry has been released by
clearing the _GTF_{reading,writing} usage flags in the guest's grant
table as appropriate.
Unfortunately, at the moment, the code that updates the accounting
happens in a different critical section than the one which updates the
usage flags; this means that under the right circumstances, there may be
a window in time after the hypervisor reported the grant as being free
during which the grantee still had access to the page.
Move the grant accounting code into the same critical section as the
reporting code to make sure this kind of race can't happen.
Quan Xu [Tue, 20 Jun 2017 14:51:32 +0000 (16:51 +0200)]
IOMMU: handle IOMMU mapping and unmapping failures
Treat IOMMU mapping and unmapping failures as fatal to the DomU.
If an IOMMU mapping or unmapping operation fails, crash the DomU and
propagate the error up the call trees.
No spamming of the log can occur. For DomU, we avoid logging any
message for already dying domains. For Dom0, that'll still be more
verbose than we'd really like, but it at least wouldn't outright
flood the console.
Signed-off-by: Quan Xu <quan.xu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 834c97baebb3743c54bcae228e984ae1b9692e6a
master date: 2016-06-14 15:10:57 +0200
Jan Beulich [Tue, 20 Jun 2017 14:50:57 +0000 (16:50 +0200)]
x86/mm: disallow page stealing from HVM domains
The operation's success can't be controlled by the guest, as the device
model may have an active mapping of the page. If we nevertheless
permitted this operation, we'd have to add further TLB flushing to
prevent scenarios like
"Domains A (HVM), B (PV), C (PV); B->target==A
Steps:
1. B maps page X from A as writable
2. B unmaps page X without a TLB flush
3. A sends page X to C via GNTTABOP_transfer
4. C maps page X as pagetable (potentially causing a TLB flush in C,
but not in B)
At this point, X would be mapped as a pagetable in C while being
writable through a stale TLB entry in B."
A similar scenario could be constructed for A using XENMEM_exchange and
some arbitrary PV domain C then having this page allocated.
This is XSA-217.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: fae7d5be8bb8b7a5b7005c4f3b812a47661a721e
master date: 2017-06-20 14:29:51 +0200
Jan Beulich [Tue, 2 May 2017 13:08:11 +0000 (15:08 +0200)]
x86: correct create_bounce_frame
We may push up to 96 bytes on the guest (kernel) stack, so we should
also cover as much in the early range check. Note that this is the
simplest possible patch, which has the theoretical potential of
breaking a guest: We only really push 96 bytes when invoking the
failsafe callback, ordinary exceptions only have 56 or 64 bytes pushed
(without / with error code respectively). There is, however, no PV OS
known to place a kernel stack there.
This is XSA-215.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 2 May 2017 13:07:30 +0000 (15:07 +0200)]
x86: discard type information when stealing pages
While a page having just a single general reference left necessarily
has a zero type reference count too, its type may still be valid (and
in validated state; at present this is only possible and relevant for
PGT_seg_desc_page, as page tables have their type forcibly zapped when
their type reference count drops to zero, and
PGT_{writable,shared}_page pages don't require any validation). In
such a case when the page is being re-used with the same type again,
validation is being skipped. As validation criteria differ between
32- and 64-bit guests, pages to be transferred between guests need to
have their validation indicator zapped (and with it we zap all other
type information at once).
This is XSA-214.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eaf537342c909875c10f49b06e17493655410681
master date: 2017-05-02 14:46:58 +0200
Jan Beulich [Tue, 2 May 2017 13:06:54 +0000 (15:06 +0200)]
multicall: deal with early exit conditions
In particular changes to guest privilege level require the multicall
sequence to be aborted, as hypercalls are permitted from kernel mode
only. While likely not very useful in a multicall, also properly handle
the return value in the HYPERVISOR_iret case (which should be the guest
specified value).
This is XSA-213.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 22c096c99d8c05833c3c19870e36efb2dd4e8013
master date: 2017-05-02 14:45:02 +0200
Thomas Sanders [Tue, 28 Mar 2017 17:57:52 +0000 (18:57 +0100)]
oxenstored: trim history in the frequent_ops function
We were trimming the history of commits only at the end of each
transaction (regardless of how it ended).
Therefore if non-transactional writes were being made but no
transactions were being ended, the history would grow
indefinitely. Now we trim the history at regular intervals.
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Thomas Sanders [Mon, 27 Mar 2017 13:36:34 +0000 (14:36 +0100)]
oxenstored transaction conflicts: improve logging
For information related to transaction conflicts, potentially frequent
logging at "info" priority has been changed to "debug" priority, and
once per two minutes there is an "info" priority summary.
Additional detailed logging has been added at "debug" priority.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Thomas Sanders [Fri, 24 Mar 2017 19:55:03 +0000 (19:55 +0000)]
oxenstored: don't wake to issue no conflict-credit
In the main loop, when choosing the timeout for the select function
call, we were setting it so as to wake up to issue conflict-credit to
any domains that could accept it. When xenstore is idle, this would
mean waking up every 50ms (by default) to do no work. With this
commit, we check whether any domain is below its cap, and if not then
we set the timeout for longer (the same timeout as before the
conflict-protection feature was added).
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Thomas Sanders [Fri, 24 Mar 2017 16:16:10 +0000 (16:16 +0000)]
oxenstored: do not commit read-only transactions
The packet telling us to end the transaction has always carried an
argument telling us whether to commit.
If the transaction made no modifications to the tree, now we ignore
that argument and do not commit: it is just a waste of effort.
This makes read-only transactions immune to conflicts, and means that
we do not need to store any of their details in the history that is
used for assigning blame for conflicts.
We count a transaction as a read-only transaction only if it contains
no operations that modified the tree.
This means that (for example) a transaction that creates a new node
then deletes it would NOT count as read-only, even though it makes no
change overall. A more sophisticated algorithm could judge the
transaction based on comparison of its initial and final states, but
this would add complexity and computational cost.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Thomas Sanders [Thu, 23 Mar 2017 19:06:54 +0000 (19:06 +0000)]
oxenstored: allow self-conflicts
We already avoid inter-domain conflicts but now allow intra-domain
conflicts. Although there are no known practical examples of a domain
that might perform operations that conflict with its own transactions,
this is conceivable, so here we avoid changing those semantics
unnecessarily.
When a transaction commit fails with a conflict and we look through
the history of commits to see which connection(s) to blame, ignore
historical commits that were made by the same connection as the
failing commit.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Jonathan Davies [Thu, 23 Mar 2017 14:28:16 +0000 (14:28 +0000)]
oxenstored: blame the connection that caused a transaction conflict
Blame each connection found to have made a commit that would cause this
transaction to fail. Each blamed connection is penalised by having its
conflict-credit decremented.
Note the change in semantics for the replay function: we no longer stop after
finding the first operation that can't be replayed. This allows us to identify
all operations that conflicted with this transaction, not just the one that
conflicted first.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
v1 Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Changes since v1:
* use correct log levels for informational messages
Changes since v2:
* fix the blame algorithm and improve logging
(fix was reviewed by Jonathan Davies)
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Thomas Sanders [Thu, 23 Mar 2017 14:25:16 +0000 (14:25 +0000)]
oxenstored: discard old commit-history on txn end
The history of commits is to be used for working out which historical
commit(s) (including atomic writes) caused conflicts with a
currently-failing commit of a transaction. Any commit that was made
before the current transaction started cannot be relevant. Therefore
we never need to keep history from before the start of the
longest-running transaction that is open at any given time: whenever a
transaction ends (with or without a commit) then if it was the
longest-running open transaction we can delete history up until the
start of the next-longest-running open transaction.
Some transactions might stay open for a very long time, so if any
transaction exceeds conflict_max_history_seconds then we remove it
from consideration in this context, and will not guarantee to keep
remembering about historical commits made during such a transaction.
We implement this by keeping a list of all open transactions that have
not been open too long. When a transaction ends, we remove it from the
list, along with any that have been open longer than the maximum; then
we delete any history from before the start of the longest-running
transaction remaining in the list.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Jonathan Davies [Thu, 23 Mar 2017 14:20:33 +0000 (14:20 +0000)]
oxenstored: only record operations with side-effects in history
There is no need to record "read" operations as they will never cause another
transaction to fail.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Backport 4.6 -> 4.5 by removing reference to XS_RESET_WATCHES.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jonathan Davies [Tue, 14 Mar 2017 12:17:38 +0000 (12:17 +0000)]
oxenstored: add transaction info relevant to history-tracking
Specifically:
* retain the original store (not just the root) in full transactions
* store commit count at the time of the start of the transaction
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: ignore domains with no conflict-credit
When processing connections, skip those from domains with no remaining
conflict-credit.
Also, issue a point of conflict-credit at regular intervals, the
period being set by the configuration option "conflict-max-history-
seconds". When issuing conflict-credit, we give a point either to
every domain at once (one each) or only to the single domain at the
front of the queue, depending on the configuration option
"conflict-rate-limit-is-aggregate".
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: handling of domain conflict-credit
This commit gives each domain a conflict-credit variable, which will
later be used for limiting how often a domain can cause other domains'
transaction-commits to fail.
This commit also provides functions and data for manipulating domains
and their conflict-credit, and checking whether they have credit.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: comments explaining some variables
It took a while of reading and reasoning to work out what these are
for, so here are comments to make life easier for everyone reading
this code in future.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com> Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Jonathan Davies [Thu, 23 Mar 2017 17:30:49 +0000 (17:30 +0000)]
oxenstored: log request and response during transaction replay
During a transaction replay, the replayed requests and the new responses are
logged in the same way as the original requests and the original responses.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:30:42 +0000 (17:30 +0000)]
oxenstored: replay transaction upon conflict
The existing transaction merge algorithm keeps track of the least upper bound
(longest common prefix) of all the nodes which have been read and written, and
will re-combine two stores which have disjoint upper bounds. This works well for
small transactions but causes unnecessary conflicts for ones that span a large
subtree, such as the following ones used by the xapi toolstack:
* VM start: creates /vm/... /vss/... /local/domain/...
The least upper bound of this transaction is / and so all
these transactions conflict with everything.
* Device hotplug: creates /local/domain/0/... /local/domain/n/...
The least upper bound of this transaction is /local/domain so
all these transactions conflict with each other.
If the existing merge algorithm cannot merge and commit, we attempt
a /replay/ of the failed transaction against the new store.
When we replay the requests we check whether the response sent to the client is
the same as during the first attempt at the transaction. If the responses are
all the same then the transaction replay can be committed. If any differ then
the transaction replay must be aborted and the client must retry.
This algorithm uses the intuition that the transactions made by the toolstack
are designed to be for separate domains, and should fundamentally not conflict
in the sense that they don't read or write any shared keys. By replaying the
transaction on the server side we do what the client would have to do anyway,
only we can do it quickly without allowing any other requests to interfere.
Performing 300 parallel simulated VM start and shutdowns without this code:
300 parallel starts and shutdowns: 268.92
Performing 300 parallel simulated VM start and shutdowns with this code:
300 parallel starts and shutdowns: 3.80
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Dave Scott <dave@recoil.org> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
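As a rough illustration of the replay-and-compare idea described above (oxenstored itself is written in OCaml; the types and the process() callback below are hypothetical stand-ins, not the daemon's code):

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical, simplified types standing in for oxenstored's store,
     * request and response values. */
    struct store;
    struct request  { const char *op; const char *path; };
    struct response { char text[256]; };

    /* Callback that performs one request against a store and records the
     * response that would be sent to the client. */
    typedef void process_fn(struct store *s, const struct request *req,
                            struct response *rsp);

    /*
     * Replay a failed transaction against the current store.  The
     * transaction may only be committed if every replayed request produces
     * exactly the response already sent to the client during the first
     * attempt; any divergence means the replay is aborted and the client
     * must retry.
     */
    static bool replay_transaction(struct store *current, process_fn *process,
                                   const struct request *reqs,
                                   const struct response *original_rsps,
                                   unsigned int nr)
    {
        for ( unsigned int i = 0; i < nr; i++ )
        {
            struct response rsp = { "" };

            process(current, &reqs[i], &rsp);
            if ( strcmp(rsp.text, original_rsps[i].text) != 0 )
                return false;   /* responses differ: abort, client retries */
        }

        return true;            /* all responses match: safe to commit */
    }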
Jonathan Davies [Thu, 23 Mar 2017 17:13:03 +0000 (17:13 +0000)]
oxenstored: move functions that process simple operations
Separate the functions which process operations that can be done as part of a
transaction. Specifically, these operations are: read, write, rm, getperms,
setperms, getdomainpath, directory, mkdir.
Also split function_of_type into two functions: one for processing the simple
operations and one for processing the rest.
This will help allow replay of transactions, allowing us to invoke the functions
that process the simple operations as part of the processing of transaction_end.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Backporting to 4.5:
- Removed references to Reset_watches, which was introduced in 4.6.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Jonathan Davies [Thu, 23 Mar 2017 17:12:56 +0000 (17:12 +0000)]
oxenstored: keep track of each transaction's operations
A list of (request, response) pairs from the operations performed within the
transaction will be useful to support transaction replay.
Since this consumes memory, the number of requests per transaction must not be
left unbounded. Hence a new quota for this is introduced. This quota, configured
via the configuration key 'quota-maxrequests', limits the size of transactions
initiated by domUs.
After the maximum number of requests has been exhausted, any further requests
will result in EQUOTA errors. The client may then choose to end the transaction;
a successful commit will result in the retention of only the prior requests.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:12:50 +0000 (17:12 +0000)]
oxenstored: refactor request processing
Encapsulate the request in a record that is passed from do_input to
process_packet and input_handle_error.
This will be helpful when keeping track of the requests made as part of a
transaction.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:12:41 +0000 (17:12 +0000)]
oxenstored: remove some unused parameters
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:05:22 +0000 (17:05 +0000)]
oxenstored: refactor putting response on wire
Previously, the functions reply_{ack,data,data_or_ack} and input_handle_error
put the response on the wire by invoking Connection.send_{ack,reply,error}.
Instead, these functions now return a value indicating what needs to be put on
the wire, and that action is done by a send_response function called
afterwards.
This refactoring gives us a chance to store the value of the response, useful
for replaying transactions.
Reported-by: Juergen Gross <jgross@suse.com> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jan Beulich [Tue, 4 Apr 2017 13:05:00 +0000 (15:05 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling
The use of guest_handle_okay() here (as introduced by the XSA-29 fix)
is insufficient; guest_handle_subrange_okay() needs to be used
instead.
Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
number-of-entries input (leaving aside the fact that this can be
called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
index zero.
This is CVE-2017-7228 / XSA-212.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 938fd2586eb081bcbd694f4c1f09ae6a263b0d90
master date: 2017-04-04 14:47:46 +0200
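A simplified sketch of what a subrange check has to do (not Xen's actual guest_handle_subrange_okay(); the function and parameter names are illustrative): validation must start at the actual first index of the continued hypercall, and the end-address arithmetic must not wrap.

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Validate that array elements [first, last] of a guest buffer lie
     * within the guest-accessible range [base, guest_limit).  Checking
     * from index 0 is not enough once a hypercall continuation restarts
     * at a non-zero index.
     */
    static bool subrange_okay(uint64_t base, uint64_t first, uint64_t last,
                              uint64_t elem_size, uint64_t guest_limit)
    {
        uint64_t end;

        if ( elem_size == 0 || first > last || last == UINT64_MAX )
            return false;

        /* Guard against wrap-around when computing the end address. */
        if ( last + 1 > (UINT64_MAX - base) / elem_size )
            return false;

        end = base + (last + 1) * elem_size;

        return end <= guest_limit;
    }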
There is a possible scenario when (d)->need_iommu remains unset
during guest domain execution, for example when no devices
were assigned to it. Taking into account that the teardown callback
is not called when (d)->need_iommu is unset, we might have unreleased
resources after destroying the domain.
So, always call the teardown callback to roll back actions
that were performed in the init callback.
This is XSA-207.
Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jan Beulich <jbeulich@suse.com> Tested-by: Julien Grall <julien.grall@arm.com>
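A minimal sketch of the pattern described above, with hypothetical names (not Xen's iommu code):

    #include <stdbool.h>

    struct domain { bool need_iommu; /* ... */ };

    struct iommu_ops {
        void (*teardown)(struct domain *d);
    };

    static void iommu_domain_destroy(struct domain *d,
                                     const struct iommu_ops *ops)
    {
        /* Previously: if ( !d->need_iommu ) return; -- leaked whatever the
         * init callback had set up.  Now teardown runs unconditionally. */
        if ( ops && ops->teardown )
            ops->teardown(d);
    }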
I backported 3658f7a0bdd8fcda217559927e25263a07398c27
libxl: fix libxl_set_memory_target
but this was not necessary, because the commit it fixes is not in
Xen 4.6.
So the backport introduces a duplicate call to xc_domain_getinfolist
and also breaks the build with
libxl.c:4908:5: error: ‘r’ undeclared (first use in this function)
Wei Liu [Thu, 29 Dec 2016 16:36:31 +0000 (16:36 +0000)]
libxl: fix libxl_set_memory_target
Commit 26dbc93a ("libxl: Remove pointless hypercall from
libxl_set_memory_target") removed the call to xc_domain_getinfolist, but
it failed to notice that "info" was actually needed later.
Put that back. While at it, make the code conform to coding style
requirement.
Jan Beulich [Wed, 21 Dec 2016 16:47:08 +0000 (17:47 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests
Guest kernels modifying instructions in the process of being emulated
for another of their vCPUs may cause EFLAGS.IF to be cleared upon
next exiting to guest context, by converting the instruction being
emulated into CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.
This is CVE-2016-10024 / XSA-202.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100
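A sketch of the sanitisation described above (not the actual exit path; the helper name is hypothetical, the flag bit positions are architectural):

    #include <stdint.h>

    #define X86_EFLAGS_IF    (1u << 9)    /* interrupt enable    */
    #define X86_EFLAGS_IOPL  (3u << 12)   /* I/O privilege level */
    #define X86_EFLAGS_NT    (1u << 14)   /* nested task         */
    #define X86_EFLAGS_VM    (1u << 17)   /* virtual-8086 mode   */

    /*
     * Sanitise the EFLAGS image that will be restored when returning to a
     * PV guest, so a racing rewrite of the emulated instruction cannot
     * leave the guest running with interrupts disabled or with other
     * dangerous bits set.
     */
    static uint64_t sanitise_guest_eflags(uint64_t eflags)
    {
        eflags |= X86_EFLAGS_IF;                        /* force IF on */
        eflags &= ~(uint64_t)(X86_EFLAGS_IOPL | X86_EFLAGS_NT |
                              X86_EFLAGS_VM);
        return eflags;
    }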
Jan Beulich [Tue, 13 Dec 2016 13:29:02 +0000 (14:29 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix
Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).
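For reference, a sketch of the architectural semantics the emulator has to honour regardless of any operand-size prefix (this is not the emulator's code; it only shows that the comparison failure case must write the full 64-bit memory value back into rDX:rAX):

    #include <stdbool.h>
    #include <stdint.h>

    /* CMPXCHG8B m64: compare EDX:EAX with the 64-bit memory operand; on a
     * match store ECX:EBX to memory, on a mismatch load memory into
     * EDX:EAX.  Returns the resulting ZF. */
    static bool cmpxchg8b(uint64_t *mem, uint32_t *eax, uint32_t *edx,
                          uint32_t ebx, uint32_t ecx)
    {
        uint64_t cmp = ((uint64_t)*edx << 32) | *eax;

        if ( *mem == cmp )
        {
            *mem = ((uint64_t)ecx << 32) | ebx;
            return true;                   /* ZF = 1 */
        }

        /* Mismatch: the full memory value goes back in EDX:EAX. */
        *eax = (uint32_t)*mem;
        *edx = (uint32_t)(*mem >> 32);
        return false;                      /* ZF = 0 */
    }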
arm64: fix incorrect memory region size in TCR_EL2
The maximum and minimum values for TxSZ depend on level of
translation as per AArch64 Virtual Memory System Architecture.
According to ARM specification DDI0487A_h (sec D4.2.2, page 1752),
the minimum TxSZ value is 16. If TxSZ is programmed to a value
smaller than 16 then the behaviour is IMPLEMENTATION DEFINED.
This patch sets T0SZ to (64 - 48) bits, since Xen uses all 4 translation
levels to cover a 48-bit (256TB) virtual address space, instead of the
previous value of zero.
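The arithmetic is simple: T0SZ encodes the number of unused upper address bits. A sketch (macro and helper names are illustrative; T0SZ occupies bits [5:0] of TCR_EL2 per DDI0487):

    #include <stdint.h>

    #define VA_BITS          48
    #define TCR_T0SZ(bits)   ((uint64_t)(64 - (bits)) << 0)  /* bits [5:0] */

    static uint64_t tcr_el2_t0sz(void)
    {
        return TCR_T0SZ(VA_BITS);   /* == 16 for a 48-bit (256TB) VA space */
    }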
Ian Jackson [Tue, 22 Nov 2016 13:30:27 +0000 (14:30 +0100)]
pygrub: Properly quote results, when returning them to the caller:
* When the caller wants sexpr output, use `repr()'
This is what Xend expects.
The returned S-expressions are now escaped and quoted by Python,
generally using '...'. Previously kernel and ramdisk were unquoted
and args was quoted with "..." but without proper escaping. This
change may break toolstacks which do not properly dequote the
returned S-expressions.
* When the caller wants "simple" output, crash if the delimiter is
contained in the returned value.
With --output-format=simple it does not seem like this could ever
happen, because the bootloader config parsers all take line-based
input from the various bootloader config files.
With --output-format=simple0, this can happen if the bootloader
config file contains nul bytes.
This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100
Andrew Cooper [Tue, 22 Nov 2016 13:29:50 +0000 (14:29 +0100)]
x86/svm: fix injection of software interrupts
The non-NextRip logic in c/s 36ebf14eb "x86/emulate: support for emulating
software event injection" was based on an older version of the AMD software
manual. The manual was later corrected, following findings from that series.
I took the original wording of "not supported without NextRIP" to mean that
X86_EVENTTYPE_SW_INTERRUPT was not eligible for use. It turns out that this
is not the case, and the new wording is clearer on the matter.
Despite testing the original patch series on non-NRip hardware, the
swint-emulation XTF test case focuses on the debug vectors; it never ended up
executing an `int $n` instruction for a vector which wasn't also an exception.
During a vmentry, the use of X86_EVENTTYPE_HW_EXCEPTION comes with a vector
check to ensure that it is only used with exception vectors. Xen's use of
X86_EVENTTYPE_HW_EXCEPTION for `int $n` injection has always been buggy on AMD
hardware.
Fix this by always using X86_EVENTTYPE_SW_INTERRUPT.
Print and decode the eventinj information in svm_vmcb_dump(), as it has
several invalid combinations which cause vmentry failures.
This is CVE-2016-9378 / part of XSA-196.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 920edccd41db6cb0145545afa1850edf5e7d098e
master date: 2016-11-22 13:51:16 +0100
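A sketch of the injection choice described above, with hypothetical constant names (the EVENTINJ field layout - vector in bits 7:0, event type in bits 10:8, valid bit 31 - follows the AMD APM; this is not Xen's svm code):

    #include <stdint.h>

    #define EVENTINJ_VALID          (1u << 31)
    #define EVENTINJ_TYPE_EXCEPTION (3u << 8)   /* hardware exception  */
    #define EVENTINJ_TYPE_SW_INT    (4u << 8)   /* software interrupt  */

    /*
     * Build the eventinj value for injecting `int $vector`.  Using the
     * exception type here trips the vmentry consistency check that
     * restricts it to real exception vectors, so software interrupts
     * must always use the software-interrupt type.
     */
    static uint32_t eventinj_for_swint(uint8_t vector)
    {
        return EVENTINJ_VALID | EVENTINJ_TYPE_SW_INT | vector;
    }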
Andrew Cooper [Tue, 22 Nov 2016 13:29:26 +0000 (14:29 +0100)]
x86/emul: correct the IDT entry calculation in inject_swint()
The logic, as introduced in c/s 36ebf14ebe "x86/emulate: support for emulating
software event injection" is buggy. The size of an IDT entry depends on long
mode being active, not the width of the code segment currently in use.
In particular, this means that a compatibility code segment which hits
emulation for software event injection will end up using an incorrect offset
in the IDT for DPL/Presence checking. In practice, this only occurs on old
AMD hardware lacking NRip support; all newer AMD hardware, and all Intel
hardware bypass this path in the emulator.
While here, fix a minor issue with reading the IDT entry. The return value
from ops->read() wasn't checked, but in reality the only failure case is if a
pagefault occurs. This is not a realistic problem as the kernel will almost
certainly crash with a double fault if this setup actually occurred.
This is CVE-2016-9377 / part of XSA-196.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 255e8fe95f22ded5186fd75244ffcfb9d5dbc855
master date: 2016-11-22 13:50:49 +0100
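The corrected calculation reduces to a one-liner: IDT gates are 16 bytes when long mode is active and 8 bytes otherwise, independent of the current code segment's width. A sketch (helper name is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    /* Offset of a vector's gate within the IDT. */
    static uint64_t idt_entry_offset(uint8_t vector, bool long_mode)
    {
        return (uint64_t)vector * (long_mode ? 16 : 8);
    }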
Jan Beulich [Tue, 22 Nov 2016 13:29:03 +0000 (14:29 +0100)]
x86emul: fix huge bit offset handling
We must never chop off the high 32 bits.
This is CVE-2016-9383 / XSA-195.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100
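A sketch of the effective-address adjustment involved (illustrative helper, not the emulator's code): for BT/BTS/BTR/BTC with a register bit offset and a memory destination, the signed bit offset selects a byte relative to the operand address plus a bit within that byte, and the arithmetic must be done at full 64-bit width.

    #include <stdint.h>

    static void bit_op_target(uint64_t base, int64_t bit_offset,
                              uint64_t *byte_addr, unsigned int *bit)
    {
        /* Arithmetic shift keeps the sign for negative bit offsets;
         * truncating bit_offset to 32 bits would point at the wrong byte. */
        *byte_addr = base + (uint64_t)(bit_offset >> 3);
        *bit       = (unsigned int)(bit_offset & 7);
    }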
Jan Beulich [Tue, 22 Nov 2016 13:28:40 +0000 (14:28 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses
Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.
Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, on the other hand, already get properly checked in set_debugreg() (albeit
we discard the error there).
The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.
This is CVE-2016-9385 / XSA-193.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100
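For a 48-bit virtual address space, "canonical" means bits 63:47 all equal bit 47; values failing this make WRMSR/WR{F,G}SBASE raise #GP, hence the need to reject them up front. A sketch of such a check (helper name is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    static bool is_canonical_address(uint64_t addr)
    {
        /* Canonical iff sign-extending from bit 47 reproduces the value. */
        return (int64_t)addr == ((int64_t)(addr << 16) >> 16);
    }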
Jan Beulich [Tue, 22 Nov 2016 13:28:12 +0000 (14:28 +0100)]
x86/HVM: don't load LDTR with VM86 mode attrs during task switch
Just like TR, LDTR is purely a protected mode facility and hence needs
to be loaded accordingly. Also move its loading to where it
architecturally belongs.
This is CVE-2016-9382 / XSA-192.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93aa42b85ae0084ba7b749d0e990c94fbf0c17e3
master date: 2016-11-22 13:45:44 +0100
Andrew Cooper [Tue, 22 Nov 2016 13:27:40 +0000 (14:27 +0100)]
x86/hvm: Fix the handling of non-present segments
In 32bit, the data segments may be NULL to indicate that the segment is
ineligible for use. In both 32bit and 64bit, the LDT selector may be NULL to
indicate that the entire LDT is ineligible for use. However, nothing in Xen
actually checks for this condition when performing other segmentation
checks. (Note however that limit and writeability checks are correctly
performed).
Neither Intel nor AMD specify the exact behaviour of loading a NULL segment.
Experimentally, AMD zeroes all attributes but leaves the base and limit
unmodified. Intel zeroes the base, sets the limit to 0xfffffff and resets the
attributes to just .G and .D/B.
The use of the segment information in the VMCB/VMCS is equivalent to a native
pipeline interacting with the segment cache. The present bit can therefore
have a subtly different meaning, and it is now cooked to uniformly indicate
whether the segment is usable or not.
GDTR and IDTR don't have access rights like the other segments, but for
consistency, they are treated as being present so no special casing is needed
elsewhere in the segmentation logic.
AMD hardware does not consider the present bit for %cs and %tr, and will
function as if they were present. They are therefore unconditionally set to
present when reading information from the VMCB, to maintain the new meaning of
usability.
Intel hardware has a separate unusable bit in the VMCS segment attributes.
This bit is inverted and stored in the present field, so the hvm code can work
with architecturally-common state.
This is CVE-2016-9386 / XSA-191.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 04beafa8e6c66f5cd814c00e2d2b51cfbc41cb8a
master date: 2016-11-22 13:44:50 +0100
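A sketch of the Intel-side "cooking" described above (simplified struct, not Xen's layout; the unusable bit is bit 16 of the VMCS access-rights word):

    #include <stdbool.h>
    #include <stdint.h>

    struct seg_attrs {
        bool     present;   /* cooked: "is this segment usable at all?" */
        uint16_t raw;       /* architectural attribute bits             */
    };

    /*
     * Invert the VMCS "unusable" bit into the common 'present' field so
     * the rest of the segmentation code has one uniform notion of
     * usability.  (On AMD, %cs and %tr are forced to present when read
     * out of the VMCB, since the hardware behaves as if they were.)
     */
    static void cook_intel_attrs(uint32_t vmcs_ar, struct seg_attrs *attrs)
    {
        attrs->raw     = (uint16_t)vmcs_ar;
        attrs->present = !(vmcs_ar & (1u << 16));   /* !unusable => usable */
    }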
Jan Beulich [Tue, 4 Oct 2016 13:06:11 +0000 (14:06 +0100)]
x86emul: honor guest CR0.TS and CR0.EM
We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.
Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.
This is XSA-190.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
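A sketch of the architectural check this implies before the emulator touches FPU/MMX/SSE state on the guest's behalf (helper and enum names are illustrative): with CR0.EM set, x87 instructions take #NM and MMX/SSE instructions take #UD; with CR0.TS set, the instruction takes #NM.

    #include <stdbool.h>
    #include <stdint.h>

    #define X86_CR0_EM  (1u << 2)   /* emulate coprocessor */
    #define X86_CR0_TS  (1u << 3)   /* task switched       */

    enum fault { FAULT_NONE, FAULT_UD, FAULT_NM };

    static enum fault check_fpu_access(uint32_t guest_cr0, bool is_x87)
    {
        if ( guest_cr0 & X86_CR0_EM )
            return is_x87 ? FAULT_NM : FAULT_UD;
        if ( guest_cr0 & X86_CR0_TS )
            return FAULT_NM;
        return FAULT_NONE;   /* safe to touch the state for the guest */
    }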