Jan Beulich [Tue, 28 Nov 2017 12:14:10 +0000 (13:14 +0100)]
x86: replace bad ASSERT() in xenmem_add_to_physmap_one()
There are no locks being held, i.e. it is possible to be triggered by
racy hypercall invocations. Subsequent code doesn't really depend on the
checked values, so this is not a security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
George Dunlap [Tue, 28 Nov 2017 12:13:26 +0000 (13:13 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation
If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.
Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail. It that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.
Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.
Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.
Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order. Nonetheless,
check the return value and crash if our assumption turns otu to be
wrong.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Tue, 28 Nov 2017 12:13:03 +0000 (13:13 +0100)]
p2m: Always check to see if removing a p2m entry actually worked
The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.
Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.
The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m. If the guest
afterwards ballons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.
There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before. Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.
While we're here, use PAGE_ORDER_2M rather than a magic constant.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 28 Nov 2017 12:11:55 +0000 (13:11 +0100)]
x86/pod: prevent infinite loop when shattering large pages
When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).
p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated
resulting to the callers to retry infinitely the PoD.
Prevent the infinite loop by return false when it is not possible to
shatter the large mapping.
This is XSA-246.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Wed, 22 Nov 2017 19:19:02 +0000 (19:19 +0000)]
SUPPORT.md: Add x86-specific virtual hardware
x86-specific virtual hardware provided by the hypervisor, toolstack,
or QEMU.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
George Dunlap [Wed, 22 Nov 2017 19:19:01 +0000 (19:19 +0000)]
SUPPORT.md: Toolstack core
For now only include xl-specific features, or interaction with the
system. Feature support matrix will be added when features are
mentioned.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
George Dunlap [Thu, 23 Nov 2017 17:32:14 +0000 (17:32 +0000)]
Introduce skeleton SUPPORT.md
Add a machine-readable file to describe what features are in what
state of being 'supported', as well as information about how long this
release will be supported, and so on.
The document should be formatted using "semantic newlines" [1], to make
changes easier.
Begin with the basic framework.
Signed-off-by: Ian Jackson <ian.jackson@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
[1] http://rhodesmill.org/brandon/2012/one-sentence-per-line/
Jan Beulich [Thu, 23 Nov 2017 10:40:31 +0000 (11:40 +0100)]
x86emul/test: keep compiler from using {x,y,z}mm registers itself
Since the emulator acts on the live hardware registers, we need to
prevent the compiler from using them e.g. for inlined memcpy() /
memset() (as gcc7 does). We can't, however, set this from the command
line, as otherwise the 64-bit build would face issues with functions
returning floating point values and being declared in standard headers.
As the pragma isn't available prior to gcc6, we need to invoke it
conditionally. Luckily up to gcc6 we haven't seen generated code access
SIMD registers beyond what our asm()s do.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 23 Nov 2017 10:38:22 +0000 (11:38 +0100)]
sync CPU state upon final domain destruction
See the code comment being added for why we need this.
This is being placed here to balance between the desire to prevent
future similar issues (the risk of which would grow if it was put
further down the call stack, e.g. in vmx_vcpu_destroy()) and the
intention to limit the performance impact (otherwise it could also go
into rcu_do_batch(), paralleling the use in do_tasklet_work()).
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 16 Nov 2017 21:34:02 +0000 (21:34 +0000)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.
_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.
To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
A typical failure on the destination side looks like:
(XEN) HVM4 restore: CPU_MSR 0
(XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
(XEN) HVM4 restore: failed to load entry 20/0
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 16 Nov 2017 21:10:00 +0000 (21:10 +0000)]
tools/libxc: Fix restoration of PV MSRs after migrate
There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
didn't test this bit of Migration v2 very well when writing it...
vcpu->msrsz is always expected to be a multiple of xen_domctl_vcpu_msr_t
records in a spec-compliant stream, so the modulo yields 0 for the msr_count,
rather than the actual number sent in the stream.
Passing 0 for the msr_count causes the hypercall to exit early, and hides the
fact that the guest handle is inserted into the wrong field in the domctl
union.
The reason that these bugs have gone unnoticed for so long is that the only
MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
in fairly modern hardware, and whose use doesn't appear to be implemented in
any contemporary PV guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
The altp2m_vcpu_enable_notify subop handler might skip calling
rcu_unlock_domain() after rcu_lock_current_domain(). Albeit since both
rcu functions are no-ops when run on the current domain, this doesn't
really have repercussions.
The second change is adding a missing break that would have potentially
enabled #VE for the current domain even if it had intended to enable it
for another one (not a supported functionality).
Signed-off-by: Adrian Pop <apop@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 16 Nov 2017 09:38:14 +0000 (10:38 +0100)]
x86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()
The fix for XSA-243 / CVE-2017-15592 (c/s bf2b4eadcf379) introduced a change
in behaviour for sh_guest_wrmap(), where it had to cope with no shadow linear
mapping being present.
As the name suggests, guest_vtable is a mapping of the guests pagetable, not
Xen's pagetable, meaning that it isn't the pagetable we need to check for the
shadow linear slot in.
The practical upshot is that a shadow HVM vcpu which switches into 4-level
paging mode, with an L4 pagetable that contains a mapping which aliases Xen's
SH_LINEAR_PT_VIRT_START will fool the safety check for whether a SHADOW_LINEAR
mapping is present. As the check passes (when it should have failed), Xen
subsequently falls over the missing mapping with a pagefault such as:
Jan Beulich [Thu, 16 Nov 2017 09:37:29 +0000 (10:37 +0100)]
x86: don't wrongly trigger linear page table assertion
_put_page_type() may do multiple iterations until its cmpxchg()
succeeds. It invokes set_tlbflush_timestamp() on the first
iteration, however. Code inside the function takes care of this, but
- the assertion in _put_final_page_type() would trigger on the second
iteration if time stamps in a debug build are permitted to be
sufficiently much wider than the default 6 bits (see WRAP_MASK in
flushtlb.c),
- it returning -EINTR (for a continuation to be scheduled) would leave
the page inconsistent state (until the re-invocation completes).
Make the set_tlbflush_timestamp() invocation conditional, bypassing it
(for now) only in the case we really can't tolerate the stamp to be
stored.
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Julien Grall [Wed, 15 Nov 2017 19:34:14 +0000 (19:34 +0000)]
xen/arm: p2m: Add more debug in get_page_from_gva
The function get_page_from_gva is used by copy_*_guest helpers to
translate a guest virtual address to a machine physical address and take
reference on the page.
There are a couple of errors paths that will return the same value making
it difficult to know the exact error. Add more debug in each error patch
only for debug-build.
This should help narrowing down the intermittent failure with the
hypercall GNTTABOP_copy (see [1]).
Julien Grall [Wed, 15 Nov 2017 19:34:13 +0000 (19:34 +0000)]
xen/arm: mm: Change the return value of gvirt_to_maddr
Currently, gvirt_to_maddr return -EFAULT when the translation failed.
It might be useful to return the PAR_EL1 (Physical Address Register)
in such a case to get a better idea of the reason.
So modify the return value to use 0 on success or return the PAR on
failure.
The callers are modified to reflect the change of the return value.
Note that with the change in gvirt_to_maddr, ma needs to be initialized
to avoid GCC been confused (i.e value may be uninitialized) with the new
construction.
Yu Zhang [Tue, 14 Nov 2017 16:11:26 +0000 (17:11 +0100)]
x86/mm: fix race condition in modify_xen_mappings()
In modify_xen_mappings(), a L1/L2 page table shall be freed,
if all entries of this page table are empty. Corresponding
L2/L3 PTE will need be cleared in such scenario.
However, concurrent paging structure modifications on different
CPUs may cause the L2/L3 PTEs to be already be cleared or set
to reference a superpage.
Therefore the logic to enumerate the L1/L2 page table and to
reset the corresponding L2/L3 PTE need to be protected with
spinlock. And the _PAGE_PRESENT and _PAGE_PSE flags need be
checked after the lock is obtained.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Min He [Tue, 14 Nov 2017 16:10:56 +0000 (17:10 +0100)]
x86/mm: fix race conditions in map_pages_to_xen()
In map_pages_to_xen(), a L2 page table entry may be reset to point to
a superpage, and its corresponding L1 page table need be freed in such
scenario, when these L1 page table entries are mapping to consecutive
page frames and having the same mapping flags.
However, variable `pl1e` is not protected by the lock before L1 page table
is enumerated. A race condition may happen if this code path is invoked
simultaneously on different CPUs.
For example, `pl1e` value on CPU0 may hold an obsolete value, pointing
to a page which has just been freed on CPU1. Besides, before this page
is reused, it will still be holding the old PTEs, referencing consecutive
page frames. Consequently the `free_xen_pagetable(l2e_to_l1e(ol2e))` will
be triggered on CPU0, resulting the unexpected free of a normal page.
This patch fixes the above problem by protecting the `pl1e` with the lock.
Also, there're other potential race conditions. For instance, the L2/L3
entry may be modified concurrently on different CPUs, by routines such as
map_pages_to_xen(), modify_xen_mappings() etc. To fix this, this patch will
check the _PAGE_PRESENT and _PAGE_PSE flags, after the spinlock is obtained,
for the corresponding L2/L3 entry.
Signed-off-by: Min He <min.he@intel.com> Signed-off-by: Yi Zhang <yi.z.zhang@intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Eric Chanudet [Tue, 14 Nov 2017 16:09:50 +0000 (17:09 +0100)]
x86/hvm: do not register hpet mmio during s3 cycle
Do it once at domain creation (hpet_init).
Sleep -> Resume cycles will end up crashing an HVM guest with hpet as
the sequence during resume takes the path:
-> hvm_s3_suspend
-> hpet_reset
-> hpet_deinit
-> hpet_init
-> register_mmio_handler
-> hvm_next_io_handler
register_mmio_handler will use a new io handler each time, until
eventually it reaches NR_IO_HANDLERS, then hvm_next_io_handler calls
domain_crash.
Signed-off-by: Eric Chanudet <chanudete@ainfosec.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
tools/xenstored: Check number of strings passed to do_control()
It is possible to send a zero-string message body to xenstore's
XS_CONTROL handling function. Then the number of strings is used
for an array allocation. This leads to a crash in strcmp() in a
CONTROL sub-command invocation loop.
The output of xs_count_string() should be verified and all 0 or
negative values should be rejected with an EINVAL. At least the
sub-command name must be specified.
The xenstore crash can only be triggered from within dom0 (there
is a check in do_control() rejecting all non-dom0 requests with
an EACCES).
Testing: reproduced with the following command:
python -c 'print 16*"\x00"' | nc -U $XENSTORED_RUNDIR/socket
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Julien Grall [Fri, 10 Nov 2017 17:10:50 +0000 (17:10 +0000)]
libs/evtchn: Remove active handler on clean-up or failure
Commit 89d55473ed16543044a31d1e0d4660cf5a3f49df "xentoolcore_restrict_all:
Implement for libxenevtchn" added a call to register allowing to
restrict the event channel.
However, the call to deregister the handler was not performed if open
failed or when closing the event channel. This will result to corrupt
the list of handlers and potentially crash the application later one.
Fix it by calling xentoolcore_deregister_active_handle on failure and
closure.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Anthony PERARD [Mon, 13 Nov 2017 12:27:32 +0000 (12:27 +0000)]
Config.mk: Update QEMU changeset
New commits:
- xen/pt: allow QEMU to request MSI unmasking at bind time
To fix a passthrough bug.
- ui/gtk: Fix deprecation of vte_terminal_copy_clipboard
A build fix.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Wed, 8 Nov 2017 12:40:40 +0000 (13:40 +0100)]
x86/cpuid: minor fixups missed from previous work
* Add more feature names to ./xen-cpuid
* Vertically align the magic comments in cpufeatureset.h
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:09 +0000 (20:19 +0100)]
tools/dombuilder: Prevent failures of xc_dom_gnttab_init()
Recent changes in grant table configuration have caused calls to
xc_dom_gnttab_init() to fail if not proceeded with a call to
xc_domain_set_gnttab_limits(). This is backwards from the point of view of
3rd party dombuilder users.
Add max_{grant,maptrack}_frames parameters to struct xc_dom_image, and require
them to be set by callers using xc_dom_gnttab_init(). Libxl, which uses
xc_dom_gnttab_init() itself is updated appropriately.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:08 +0000 (20:19 +0100)]
tools/dombuilder: Fix asymmetry when setting up console and xenstore rings
libxl always uses xc_dom_gnttab_init(), which internally calls
xc_dom_gnttab{_hvm,}_seed() to set up the grants point at the console and
xenstore rings. For HVM guests, libxl then asks Xen for the information set
up previously, and calls xc_dom_gnttab_hvm_seed() a second time, which is
wasteful. ARM construction expects libxl to have set up
dom->{console,xenstore}_evtchn earlier, so only actually functions because of
this second call.
Rationalise everything and make it consistent for all guests.
1) Users of the domain builder are expected to provide
dom->{console,xenstore}_{evtchn,domid} unconditionally. This is checked
by setting invalid values in xc_dom_allocate(), and checking in
xc_dom_boot_image().
2) For x86 HVM and ARM guests, the event channels are given to Xen at the
same time as the ring gfns. ARM already did this, but x86 is updated to
match. x86 PV already provides this information in the start_info page.
3) Libxl is updated to drop all relevant functionality from
hvm_build_set_params(), and behave consistently with PV guests when it
comes to the handling of dom->{console,xenstore}_{evtchn,domid,gfn}.
This removes several redundant hypercalls (including a foreign mapping) from
the x86 HVM and ARM construction paths.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Wei Liu [Thu, 12 Oct 2017 19:19:07 +0000 (20:19 +0100)]
tools/dombuilder: Switch to using gfn terminology for console and xenstore rings
The sole use of xc_dom_translated() and xc_dom_p2m() outside of the domain
builder is for libxl_dom() to translate the console and xenstore pfns back
into useful values. PV guest pfns are only interesting to the domain builder,
and gfns are the address space used by all other hypercalls.
Renaming the fields in xc_dom_image is deliberate, as it will cause
out-of-tree users of the dombuilder to notice the different semantics.
Correct the terminology throughout xc_dom_gnttab{_hvm,}_seed(), which are all
using gfns despite the existing variable names.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
[ wei: fix stubdom build ] Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Tue, 31 Oct 2017 17:07:41 +0000 (17:07 +0000)]
common/multicall: Increase debugability for bad hypercalls
While investigating an issue (in a new codepath I'd introduced, as it turns
out), leaving interrupts disabled manifested as a subsequent op in the
multicall failing a check_lock() test.
The codepath would have hit the ASSERT_NOT_IN_ATOMIC on the return-to-guest
path, had it not hit the check_lock() first.
Call ASSERT_NOT_IN_ATOMIC() after each operation in the multicall, to make
failures more obvious.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Mon, 30 Oct 2017 17:42:52 +0000 (17:42 +0000)]
common/spinlock: Improve the output from check_lock() if it trips
If check_lock() triggers, a crash will occur. Instead of simply identifying
"the irq context was different", indicate the expected and current irq
context.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:06 +0000 (20:19 +0100)]
tools/dombuilder: Remove clear_page() from xc_dom_boot.c
pfn 0 is a legitimate (albeit unlikely) frame to use, so skipping it is wrong.
This behaviour appears to exists simply to cover the fact that zero is the
default value of an uninitialised field in dom.
ARM already clears the frames at the point that the pfns are allocated,
meaning that the added clear_page() is wasteful. Alter x86 to match ARM and
clear the page when it is allocated.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:05 +0000 (20:19 +0100)]
tools/dombuilder: Drop more PVH v1 leftovers
alloc_magic_pages() is renamed to alloc_magic_pages_pv() to mirror its
alloc_magic_pages_hvm() counterpart. Delete a redundant comment, introduce
some newlines clarity, and remove a logically dead allocation of shared info.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
With the current SBSA UART emulation, streaming larger amounts of data
(as caused by "find /", for instance) can lead to character losses.
This is due to the OUT ring buffer getting full, because we change the
TX interrupt bit only when the FIFO is actually full, and not already
when it's half-way filled, as the Linux driver expects.
The SBSA spec does not explicitly state this, but we assume that an
SBSA compliant UART uses the PL011 default "interrupt FIFO level select
register" value of "1/2 way". The Linux driver certainly makes this
assumption, so it expect to be able to write a number of characters
after the TX interrupt line has been asserted.
On a similar issue we have the same wrong behaviour on the receive side.
However changing the RX interrupt to trigger on reaching half of the FIFO
level will lead to lag, because the guest would not be notified of incoming
characters until the FIFO is half way filled. This leads to inacceptible
lags when typing on a terminal.
Real hardware solves this issue by using the "receive timeout
interrupt" (RTI), which is triggered when character reception stops for
32 baud cycles. As we cannot and do not want to emulate any timing here,
we slightly abuse the timeout interrupt to notify the guest of new
characters: when a new character comes in, the RTI is asserted, when
the FIFO is cleared, the interrupt gets cleared.
So this patch changes the emulated interrupt trigger behaviour to come
as close to real hardware as possible: the RX and TX interrupt trigger
when the FIFO gets half full / half empty, and the RTI interrupt signals
new incoming characters.
Bhupinder Thakur [Tue, 24 Oct 2017 17:09:21 +0000 (18:09 +0100)]
arm/xen: vpl011: Fix the slow early console SBSA UART output
The early console output uses pl011_early_write() to write data. This
function waits for BUSY bit to get cleared before writing the next byte.
In the SBSA UART emulation logic, the BUSY bit was set as soon one
byte was written in the FIFO and it remained set until the FIFO was
emptied. This meant that the output was delayed as each character needed
the BUSY to get cleared.
Since the SBSA UART is getting emulated in Xen using ring buffers, it
ensures that once the data is enqueued in the FIFO, it will be received
by xenconsole so it is safe to set the BUSY bit only when FIFO becomes
full. This will ensure that pl011_early_write() is not delayed unduly
to write the data.
Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org> Reviewed-by: Andre Przywara <andre.przywara@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Roger Pau Monne [Thu, 26 Oct 2017 09:19:30 +0000 (10:19 +0100)]
gcov: return ENOSYS for unimplemented gcov domctl
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ross Lagerwall [Wed, 18 Oct 2017 13:42:33 +0000 (14:42 +0100)]
xentoolcore_restrict_all: Implement for libxenevtchn
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Ross Lagerwall [Wed, 18 Oct 2017 13:42:32 +0000 (14:42 +0100)]
tools/libs/evtchn: Add support for restricting a handle
Implement support for restricting evtchn handles to a particular domain
on Linux by calling the IOCTL_EVTCHN_RESTRICT_DOMID ioctl (support added
in Linux v4.8).
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Changeset 2b1cde7783 introduced "batch mode" to afl-harness, which allowed
the handling of several inputs in sequence.
Unfortunately, it introduced a file pointer leak when the file was
larger than the maximum size. Restructure the code to always close fp
if we opened it.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
George Dunlap [Fri, 27 Oct 2017 13:26:27 +0000 (14:26 +0100)]
x86/mm: Make PV linear pagetables optional
Allowing pagetables to point to other pagetables of the same level
(often called 'linear pagetables') has been included in Xen since its
inception; but recently it has been the source of a number of subtle
reference-counting bugs.
It is not used by Linux or MiniOS; but it is used by NetBSD and Novell
Netware. There are significant numbers of people who are never going
to use the feature, along with significant numbers who need the
feature.
Add a Kconfig option for the feature (default to 'y'). Also add a
command-line option to control whether PV linear pagetables are
allowed (default to 'true').
NB that we leave linear_pt_count in the page struct. It's in a union,
so its presence doesn't increase the size of the data struct.
Changing the layout of the other elements based on configuration
options is asking for trouble however; so we'll just leave it there
and ASSERT that it's zero.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Boris Ostrovsky [Tue, 24 Oct 2017 23:30:20 +0000 (19:30 -0400)]
x86/vpmu: Remove unnecessary call to do_interrupt()
This call was left during PVHv1 removal (commit 33e5c32559e1 ("x86:
remove PVHv1 code")):
- if ( is_pvh_vcpu(sampling) &&
- !(vpmu_mode & XENPMU_MODE_ALL) &&
+ if ( !(vpmu_mode & XENPMU_MODE_ALL) &&
!vpmu->arch_vpmu_ops->do_interrupt(regs) )
return;
As result of this extra call VPMU no longer works for PV guests on Intel
because we effectively lose value of MSR_CORE_PERF_GLOBAL_STATUS.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 26 Oct 2017 07:57:31 +0000 (01:57 -0600)]
x86: fix asm() constraint for GS selector update
Exception fixup code may alter the operand, which ought to be reflected
in the constraint.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 26 Oct 2017 07:57:04 +0000 (01:57 -0600)]
x86: don't latch wrong (stale) GS base addresses
load_segments() writes selector registers before doing any of the base
address updates. Any of these selector loads can cause a page fault in
case it references the LDT, and the LDT page accessed was only recently
installed. Therefore the call tree map_ldt_shadow_page() ->
guest_get_eff_kern_l1e() -> toggle_guest_mode() would in such a case
wrongly latch the outgoing vCPU's GS.base into the incoming vCPU's
recorded state.
Split page table toggling from GS handling - neither
guest_get_eff_kern_l1e() nor guest_io_okay() need more than the page
tables being the kernel ones for the memory access they want to do.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
libxc: remove stale error check for domain size in xc_sr_save_x86_hvm.c
Long ago domains to be saved were limited to 1TB size due to the
migration stream v1 limitations which used a 32 bit value for the
PFN and the frame type (4 bits) leaving only 28 bits for the PFN.
Migration stream V2 uses a 64 bit value for this purpose, so there
is no need to refuse saving (or migrating) domains larger than 1 TB.
For 32 bit toolstacks there is still a size limit, as domains larger
than about 1TB will lead to an exhausted virtual address space of the
saving process. So keep the test for 32 bit, but don't base it on the
page type macros. As a migration could lead to the situation where a
32 bit toolstack would have to handle such a large domain (in case the
sending side is 64 bit) the same test should be added for restoring a
domain.
Jan Beulich [Tue, 24 Oct 2017 16:13:13 +0000 (18:13 +0200)]
x86: also show FS/GS base addresses when dumping registers
Their state may be important to figure the reason for a crash. To not
further grow duplicate code, break out a helper function.
I realize that (ab)using the control register array here may not be
considered the nicest solution, but it seems easier (and less overall
overhead) to do so compared to the alternative of introducing another
helper structure.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 24 Oct 2017 16:12:31 +0000 (18:12 +0200)]
x86: fix GS-base-dirty determination
load_segments() writes the two MSRs in their "canonical" positions
(GS_BASE for the user base, SHADOW_GS_BASE for the kernel one) and uses
SWAPGS to switch them around if the incoming vCPU is in kernel mode. In
order to not leave a stale kernel address in GS_BASE when the incoming
guest is in user mode, the check on the outgoing vCPU needs to be
dependent upon the mode it is currently in, rather than blindly looking
at the user base.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Chao Gao [Tue, 24 Oct 2017 14:02:40 +0000 (16:02 +0200)]
x86/vpt: guarantee the return value of pt_update_irq() set in vIRR or PIR
pt_update_irq() is expected to return the vector number of periodic
timer interrupt, which should be set in vIRR of vlapic or in PIR.
Otherwise it would trigger the assertion in vmx_intr_assist(), please
seeing https://lists.xenproject.org/archives/html/xen-devel/2017-10/msg00915.html.
But it fails to achieve that in the following two case:
1. hvm_isa_irq_assert() may not set the corresponding bit in vIRR for
mask field of IOAPIC RTE is set. Please refer to the call tree
vmx_intr_assist() -> pt_update_irq() -> hvm_isa_irq_assert() ->
assert_irq() -> assert_gsi() -> vioapic_irq_positive_edge(). The patch
checks whether the vector is set or not in vIRR of vlapic or PIR before
returning.
2. someone changes the vector field of IOAPIC RTE between asserting
the irq and getting the vector of the irq, leading to setting the
old vector number but returning a different vector number. This patch
allows hvm_isa_irq_assert() to accept a callback which can get the
interrupt vector with irq_lock held. Thus, no one can change the vector
between the two operations.
BTW, the first argument of pi_test_and_set_pir() should be uint8_t
and I take this chance to fix it.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 24 Oct 2017 14:01:33 +0000 (16:01 +0200)]
gnttab: fix pin count / page reference race
Dropping page references before decrementing pin counts is a bad idea
if assumptions are being made that a non-zero pin count implies a valid
page. Fix the order of operations in gnttab_copy_release_buf(), but at
the same time also remove the assertion that was found to trigger:
map_grant_ref() also has the potential of causing a race here, and
changing the order of operations there would likely be quite a bit more
involved.
This is CVE-2017-15597 / XSA-236.
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Jackson [Fri, 20 Oct 2017 10:42:42 +0000 (11:42 +0100)]
libxl: Replace open-coded __attribute__ with NN() macro
Inspired by
#define __nonnull(...) __attribute__((__nonnull__(__VA_ARGS__)))
which is used in the hypervisor.
These annotations may well become very common in libxl, so we choose a
short name.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Wei Liu <wei.liu2@citrix.com> CC: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Mon, 16 Oct 2017 14:04:10 +0000 (15:04 +0100)]
libxl: annotate s to be nonnull in libxl__enum_from_string
Hope this can placate coverity.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Anthony PERARD [Thu, 19 Oct 2017 14:29:56 +0000 (15:29 +0100)]
tools/Makefile: unset MAKELEVEL before building QEMU
Since QEMU commits aef45d51d1204f3335fb99de6658e0c5612c2b67
"build: automatically handle GIT submodule checkout for dtc"
the QEMU makefiles rely on the variable MAKELEVEL to make a decision on
whether to update some git submodules or not. Since we call QEMU build
from within the Xen one, MAKELEVEL would already be greater than 0 and
the git submodules would not be updated and QEMU would fail to build.
Fix this by removing MAKELEVEL from the environment before trying to
build QEMU.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Fri, 20 Oct 2017 07:31:54 +0000 (09:31 +0200)]
gcov: support gcc 7.x
Taking Linux commit 0538421343 ("gcov: support GCC 7.1") as reference,
enable gcc 7 support requiring __gcov_exit() and having 9 counters.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Roger Pau Monné [Fri, 20 Oct 2017 07:30:13 +0000 (09:30 +0200)]
ubsan: add clang 5.0 support
clang 5.0 changed the layout of the type_mismatch_data structure and
introduced __ubsan_handle_type_mismatch_v1 and
__ubsan_handle_pointer_overflow.
This commit adds support for the new structure layout, adds the
missing handlers and the new types for type_check_kinds.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[jb: unconditionally emit always the same message in
__ubsan_handle_pointer_overflow()] Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
David Esler [Fri, 20 Oct 2017 07:29:29 +0000 (09:29 +0200)]
x86/boot: fix early error output
In 9180f5365524 a change was made to the send_chr function to take in
C-strings and output a character at a time until a NULL was encountered.
However, when there is no VGA there is no code to increment the current
character position resulting in an endless loop of the first character.
This moves the (implicit) increment such that it occurs in all cases.
Signed-off-by: David Esler <drumandstrum@gmail.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
[jb: correct title and description] Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
This is due to memmove doing a n-1+addr when n is 0. This is harmless
because the loop is bounded to 0. Fix this by returning earlier when n
is 0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
[jb: add return value and unlikely()] Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Julien Grall [Thu, 19 Oct 2017 17:09:05 +0000 (18:09 +0100)]
xen/arm: gic-v3: Make sure ICC_SRE_EL1 is restored before ICH_VMCR_EL2
Per 8.4.8 in ARM IHI 0069D, ICH_VMCR_EL2.VFIQEn is RES1 when
ICC_SRE_EL1.SRE is 1. This causes a Group 0 interrupt (as generated in
GICv2 mode) to be delivered as a FIQ to the guest, with potentially
consequence. So we must make sure that ICC_SRE_EL1 has been actually
programmed before at ICH_VMCR_EL2.
This was discovered when booting EFI in a GICv2 guest on a GICv3
hardware.
arm: configure interrupts to be in non-secure group1
Xen uses non-secure group1 interrupts, however it doesn't configure the
GICv3 accordingly. Xen needs to set GICD_IGROUPR for SPIs and
GICR_IGROUPR0 for local interrupt to "1" to specify that interrupts
belong to group1. This is particularly important if the system has
GICD_CTLR.DS set, also see commit 7c9b973061b03af62734f613f6abec46c0dd4a88 in Linux.
Andrew Cooper [Tue, 17 Oct 2017 14:11:23 +0000 (15:11 +0100)]
xen/public: Correct the definition of GNTTAB_CACHE_SOURCE_GREF
Discovered when running the XSA-232 PoC on a UBSAN-enabled hypervisor.
(d79) XSA-232 PoC
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in grant_table.c:3217:25
(XEN) left shift of 1 by 31 places cannot be represented in type 'int'
(XEN) ----[ Xen-4.10.0-rc x86_64 debug=y Tainted: H ]----
Update all of the GNTTAB_CACHE_* constants to be unsigned integers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Tue, 29 Aug 2017 10:35:31 +0000 (10:35 +0000)]
x86/mm: Consolidate all Xen L4 slot writing into init_xen_l4_slots()
There are currently three functions which write L4 pagetables for Xen, but
they all behave subtly differently. sh_install_xen_entries_in_l4() in
particular is catering for two different usecases, which makes the safety of
the linear mappings hard to follow.
By consolidating the L4 pagetable writing in a single function, the resulting
setup of Xen's virtual layout is easier to understand.
No practical changes to the resulting L4, although the logic has been
rearranged to avoid rewriting some slots. This changes the zap_ro_mpt
parameter to simply ro_mpt.
Both {hap,sh}_install_xen_entries_in_l4() get folded into their callers. The
hap side only a single caller, while the shadow side has two. The shadow
split helps highlight the correctness of the linear slots.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Julien Grall <julien.gral@linaro.org> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Tue, 29 Aug 2017 10:35:31 +0000 (11:35 +0100)]
x86/mm: Consolidate all Xen L2 slot writing into init_xen_pae_l2_slots()
Having all of this logic together makes it easier to follow Xen's virtual
setup across the whole system.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Julien Grall <julien.gral@linaro.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
The following patches (post XSA-243 fixes) requires init_guest_l4_table()
being common code.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Julien Grall <julien.gral@linaro.org>
Ian Jackson [Tue, 17 Oct 2017 16:52:02 +0000 (17:52 +0100)]
tools: libxendevicemodel: Restore symbol versions for 1.0
In 1462f9ea8f4219d520a530787b80c986e050aa98
"tools: libxendevicemodel: Provide xendevicemodel_shutdown"
we added a new version 1.1 to the symbol map and simply abolished
the old one. That is quite wrong.
Instead, we should have left the 1.0 map alone and added a new version
which simply adds the new symbol.
Fix this.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> CC: Stefano Stabellini <sstabellini@kernel.org> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 17 Oct 2017 10:23:53 +0000 (11:23 +0100)]
mm/shadow: fix declaration of fetch_type_names
fetch_type_names usage is guarded by SHADOW_DEBUG_PROPAGATE in
SHADOW_DEBUG, fix the declaration so it's also guarded by
SHADOW_DEBUG_PROPAGATE instead of DEBUG_TRACE_DUMP.
Observed while building with clang and ubsan enabled.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Mon, 16 Oct 2017 13:20:00 +0000 (13:20 +0000)]
xen/dom0: Fix latent dom0 construction bugs on all architectures
* x86 PV and ARM dom0's must not clear _VPF_down from v->pause_flags until
all state is actually set up. As it currently stands, d0v0 is eligible for
scheduling before its registers have been set. This is latent as we also
hold a systemcontroller pause reference at the time which prevents d0 from
being scheduled.
* x86 PVH previously was not setting v->is_initialised for d0v0, despite
setting the vcpu running eventually. Therefore, a later VCPUOP_initialise
hypercall will modify state under the feet of the running vcpu. This is
latent as PVH dom0 construction don't yet function.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
While I can't seem to find any users of this hypercall (being a likely
explanation of why the problem wasn't noticed so far), just like for
do_mmu_update() paged-out and shared page handling is needed here. Move
all this logic into mod_l1_entry(), which then also results in no
longer
- doing any of this handling for non-present PTEs,
- acquiring two temporary page references when one is already more than
enough.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Fri, 13 Oct 2017 10:42:43 +0000 (12:42 +0200)]
x86: request page table page-in for the correct domain
The domain passed to p2m_mem_paging_populate() should match the one
passed to the corresponding get_page_from_gfn().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Ian Jackson [Fri, 13 Oct 2017 10:21:57 +0000 (11:21 +0100)]
libxl: dm_restrict: DEFINE_USERLOOKUP_HELPER returned a pointer to an auto
When I converted the previous open-coded user lookup functionality
into DEFINE_USERLOOKUP_HELPER, I moved the struct passwd buffer into
the function generated by the macro. This is wrong because that
buffer is used by get{pw,gr}* for its return value, so the helper
function would contrive to return a pointer to the buffer on its own
stack.
Fix this by adding a buffer parameter to the generated helpers, that
the caller must supply, and updating all the call sites.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 12 Oct 2017 12:50:31 +0000 (14:50 +0200)]
x86/cpu: Fix IST handling during PCPU bringup
Clear IST references in newly allocated IDTs. Nothing good will come of
having them set before the TSS is suitably constructed (although the chances
of the CPU surviving such an IST interrupt/exception is extremely slim).
Uniformly set the IST references after the TSS is in place. This fixes an
issue on AMD hardware, where onlining a PCPU while PCPU0 is in HVM context
will cause IST_NONE to be copied into the new IDT, making that PCPU vulnerable
to privilege escalation from PV guests until it subsequently schedules an HVM
guest.
This is XSA-244.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 12 Oct 2017 12:50:07 +0000 (14:50 +0200)]
x86/shadow: Don't create self-linear shadow mappings for 4-level translated guests
When initially creating a monitor table for 4-level translated guests, don't
install a shadow-linear mapping. This mapping is actually self-linear, and
trips up the writeable heuristic logic into following Xen's mappings, not the
guests' shadows it was expecting to follow.
A consequence of this is that sh_guess_wrmap() needs to cope with there being
no shadow-linear mapping present, which in practice occurs once each time a
vcpu switches to 4-level paging from a different paging mode.
An appropriate shadow-linear slot will be inserted into the monitor table
either while constructing lower level monitor tables, or by sh_update_cr3().
While fixing this, clarify the safety of the other mappings. Despite
appearing unsafe, it is correct to create a guest-linear mapping for
translated domains; this is self-linear and doesn't point into the translated
domain. Drop a dead clause for translate != external guests.
This is XSA-243.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Wed, 27 Sep 2017 10:00:56 +0000 (11:00 +0100)]
x86: don't allow page_unlock() to drop the last type reference
Only _put_page_type() does the necessary cleanup, and hence not all
domain pages can be released during guest cleanup (leaving around
zombie domains) if we get this wrong.
Jan Beulich [Thu, 12 Oct 2017 12:48:25 +0000 (14:48 +0200)]
x86: don't store possibly stale TLB flush time stamp
While the timing window is extremely narrow, it is theoretically
possible for an update to the TLB flush clock and a subsequent flush
IPI to happen between the read and write parts of the update of the
per-page stamp. Exclude this possibility by disabling interrupts
across the update, preventing the IPI to be serviced in the middle.
This is XSA-241.
Reported-by: Jann Horn <jannh@google.com> Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Wed, 27 Sep 2017 10:46:52 +0000 (11:46 +0100)]
x86: limit linear page table use to a single level
That's the only way that they're meant to be used. Without such a
restriction arbitrarily long chains of same-level page tables can be
built, tearing down of which may then cause arbitrarily deep recursion,
causing a stack overflow. To facilitate this restriction, a counter is
being introduced to track both the number of same-level entries in a
page table as well as the number of uses of a page table in another
same-level one (counting into positive and negative direction
respectively, utilizing the fact that both counts can't be non-zero at
the same time).
Note that the added accounting introduces a restriction on the number
of times a page can be used in other same-level page tables - more than
32k of such uses are no longer possible.
Note also that some put_page_and_type[_preemptible]() calls are
replaced with open-coded equivalents. This seemed preferrable to
adding "parent_table" to the matrix of functions.
Note further that cross-domain same-level page table references are no
longer permitted (they probably never should have been).
This is XSA-240.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 12 Oct 2017 12:43:26 +0000 (14:43 +0200)]
x86/HVM: prefill partially used variable on emulation paths
Certain handlers ignore the access size (vioapic_write() being the
example this was found with), perhaps leading to subsequent reads
seeing data that wasn't actually written by the guest. For
consistency and extra safety also do this on the read path of
hvm_process_io_intercept(), even if this doesn't directly affect what
guests get to see, as we've supposedly already dealt with read handlers
leaving data completely unitialized.
This is XSA-239.
Reported-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Misbehaving device model can pass incorrect XEN_DMOP_map/
unmap_io_range_to_ioreq_server arguments, namely end < start when
specifying address range. When this happens we hit ASSERT(s <= e) in
rangeset_contains_range()/rangeset_overlaps_range() with debug builds.
Production builds will not trap right away but may misbehave later
while handling such bogus ranges.
This is XSA-238.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>