Zhenzhong Duan [Mon, 4 Dec 2017 10:01:24 +0000 (11:01 +0100)]
x86/physdev: remove redundant code in branch MAP_PIRQ_TYPE_MSI
Same code is already in allocate_and_map_msi_pirq()
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Joe Jin <joe.jin@oracle.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
David Esler [Mon, 4 Dec 2017 10:00:24 +0000 (11:00 +0100)]
x86/boot: rename send_chr to print_err
The send_chr function sends an entire C-string, not a single character, and
no longer necessarily sends it over the serial UART, so rename it to
print_err so that its name is closer to what it does.
Signed-off-by: David Esler <drumandstrum@gmail.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Brian Woods [Thu, 16 Nov 2017 22:11:15 +0000 (16:11 -0600)]
x86/svm: Add virtual GIF support
This patch detects and enables Virtual GIF if available. This allows
a nested hypervisor to perform STGIs and CLGIs without having to be
intercepted by the host hypervisor.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Brian Woods [Thu, 16 Nov 2017 22:11:14 +0000 (16:11 -0600)]
x86/svm: Add virtual GIF feature definition
Add support for enabling the virtual GIF feature.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 28 Nov 2017 18:48:07 +0000 (18:48 +0000)]
x86/traps: Drop redundant printk() in fatal_trap()
show_page_walk() already prints the linear address of the walk, and
show_execution_state() has printed a raw %cr2 value. This avoids having
two adjacent log lines with identical information.
Boris Ostrovsky [Thu, 9 Nov 2017 15:37:53 +0000 (10:37 -0500)]
x86/pvh: Do not add DSDT and FACS to PVH dom0 XSDT
These tables are pointed to from FADT. Adding them will
result in duplicate entries in the guest's tables.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Sergey Dyasli [Mon, 23 Oct 2017 09:33:02 +0000 (10:33 +0100)]
x86/vvmx: don't enable vmcs shadowing for nested guests
Running "./xtf_runner vvmx" in L1 Xen under L0 Xen produces the
following result on H/W with VMCS shadowing:
Test: vmxon
Failure in test_vmxon_in_root_cpl0()
Expected 0x8200000f: VMfailValid(15) VMXON_IN_ROOT
Got 0x82004400: VMfailValid(17408) <unknown>
Test result: FAILURE
This happens because the SDM allows vmentries with the VMCS shadowing
VM-execution control enabled and a VMCS link pointer value of ~0ull, but
the results of a nested VMREAD are undefined in such cases.
Fix this by not copying the value of VMCS shadowing control from vmcs01
to vmcs02.
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Brian Woods [Tue, 31 Oct 2017 22:03:08 +0000 (17:03 -0500)]
x86/svm: add virtual VMLOAD/VMSAVE support
On AMD family 17h server processors, there is a feature called virtual
VMLOAD/VMSAVE. This allows a nested hypervisor to perform a VMLOAD or
VMSAVE without needing to be intercepted by the host hypervisor.
Virtual VMLOAD/VMSAVE requires the host hypervisor to be in long mode
and nested page tables to be enabled. For more information about it
please see:
AMD64 Architecture Programmer’s Manual Volume 2: System Programming
http://support.amd.com/TechDocs/24593.pdf
Section: VMSAVE and VMLOAD Virtualization (Section 15.33.1)
This patch series adds support to check for and enable the virtual
VMLOAD/VMSAVE features if available.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Add support for enabling the virtual VMLOAD/VMSAVE feature.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Brian Woods [Tue, 31 Oct 2017 22:03:06 +0000 (17:03 -0500)]
x86/svm: rename lbr control field in vmcb
Rename the lbr_control field in the vmcb for future/upcoming changes.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Tue, 17 Oct 2017 17:06:23 +0000 (18:06 +0100)]
x86/vmx: Don't rewrite HOST_TR_SELECTOR on every context switch
TSS_ENTRY is a compile time constant, so HOST_TR_SELECTOR can be set up during
VMCS construction and left alone thereafter, rather than rewriting it on every
context switch.
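For illustration, this amounts to doing the write once at VMCS construction
time rather than on every switch (a sketch; the surrounding code is omitted):
    /* construct_vmcs(): TSS_ENTRY is a build-time constant, so the host TR
     * selector can be written exactly once and left alone thereafter. */
    __vmwrite(HOST_TR_SELECTOR, TSS_ENTRY << 3);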
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 16 Oct 2017 13:20:07 +0000 (13:20 +0000)]
xen/pv: Construct d0v0's GDT properly
c/s cf6d39f8199 "x86/PV: properly populate descriptor tables" changed the GDT
to reference zero_page for intermediate frames between the guest and Xen
frames.
Because dom0_construct_pv() doesn't call arch_set_info_guest(), some bits of
initialisation are missed, including the pv_destroy_gdt() which initially
fills the references to zero_page.
In practice, this means there is a window between starting and the first call
to HYPERCALL_set_gdt() where lar/lsl/verr/verw suffer non-architectural
behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
This probably wants backporting to Xen 4.7 and later.
Andrew Cooper [Mon, 2 Oct 2017 14:13:38 +0000 (14:13 +0000)]
x86/ldt: Alter how invalidate_shadow_ldt() deals with TLB flushes
Modify invalidate_shadow_ldt() to return a boolean indicating whether mappings
have been dropped, rather than taking a flush parameter. Tweak the internal
logic to be able to ASSERT() that v->arch.pv_vcpu.shadow_ldt_mapcnt matches
the number of PTEs removed.
This allows MMUEXT_SET_LDT to avoid a local TLB flush if no LDT entries had
been faulted in to begin with.
Finally, correct a comment in __get_page_type(). Under no circumstance is it
safe to forgo the TLB shootdown for GDT/LDT pages, as that would allow one
vcpu to gain a writeable mapping to a frame still mapped as a GDT/LDT by
another vcpu.
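A simplified sketch of the resulting calling pattern (not the literal code):
    /* Only flush if shadow LDT mappings were actually dropped. */
    if ( invalidate_shadow_ldt(v) )
        flush_tlb_local();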
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 2 Oct 2017 13:58:17 +0000 (13:58 +0000)]
xen/x86: Introduce static inline wrappers for l{idt,gdt,ldt,tr}()
This avoids indirection and parameter constraint issues. Doing so relaxes the
load_LDT() constraints from %ax to any general purpose register. The helpers
are upgraded to full compiler barriers, because nothing good will come of
having these reordered with respect to other segment accesses.
The triple-fault reboot method stays as is, to avoid the int3 possibly getting
moved relative to the lidt.
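As an illustration, such a wrapper looks roughly like this (the in-tree
version may differ in details):
    static inline void lldt(unsigned int sel)
    {
        /* "rm" allows any general purpose register or memory operand; the
         * "memory" clobber makes this a full compiler barrier. */
        asm volatile ( "lldt %w0" :: "rm" (sel) : "memory" );
    }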
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 28 Nov 2017 14:05:19 +0000 (14:05 +0000)]
x86/hvm: fix interaction between internal and external emulation
A call to handle_hvm_io_completion() is needed for completing I/O
that requires external emulation. Such completion should be requested when
hvm_vcpu_io_need_completion() returns true after hvm_emulate_once() has
completed. This is indicative of the underlying I/O emulation having
returned X86EMUL_RETRY and hence a re-emulation of the instruction is
needed to pick up the result of the I/O.
A call to handle_hvm_io_completion() is NOT needed when the underlying
I/O has not returned X86EMUL_RETRY since there will be no result to pick
up. Hence it is bogus to request such completion when mmio_retry is set,
since this can only happen if the underlying I/O emulation has returned
X86EMUL_OKAY (meaning the I/O has completed successfully).
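In simplified form, the intended logic is (function and field names are
approximate, not the literal patch):
    rc = hvm_emulate_one(&ctxt);
    /* hvm_vcpu_io_need_completion() is only true if the underlying I/O
     * returned X86EMUL_RETRY, i.e. external emulation is outstanding. */
    if ( hvm_vcpu_io_need_completion(vio) )
        vio->io_completion = HVMIO_mmio_completion;
    /* Note: no completion is requested merely because mmio_retry is set. */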
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Sat, 25 Nov 2017 15:17:14 +0000 (15:17 +0000)]
x86: Avoid corruption on migrate for vcpus using CPUID Faulting
Xen 4.8 and later virtualises CPUID Faulting support for guests. However, the
value of MSR_MISC_FEATURES_ENABLES is omitted from the vcpu state, meaning
that the current cpuid faulting setting is lost on migrate/suspend/resume.
Instead of following the MSR status quo, take the opportunity to make the
logic more generic, and in particular, trivial to extend for future MSRs.
This is done by discarding the notion of optional MSRs, and requiring the
toolstack to be prepared to move all of the MSRs, although only a subset will
typically need to move.
This allows for the use of guest_{rd,wr}msr() alone to evaluate whether an MSR
needs moving. This is a benefit because it means there is a single piece of
logic responsible for evaluating whether a guest can use an MSR, and which
values are acceptable.
One small adjustment to guest_wrmsr() is required to cope with being called in
toolstack context.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andre Przywara [Thu, 16 Nov 2017 12:02:35 +0000 (12:02 +0000)]
arm64: ITS: fix cacheability adjustment
If the host GICv3 redistributor reports that the pending table cannot
use shareable memory, we try to drop the cacheability attributes as
well. However we fail horribly in doing computer science 101 bit
masking, effectively clearing the whole register instead of just a few
bits.
Fix this by removing the one redundant masking operation and adding the
magic negation for the actually needed other operation.
Ian Jackson [Tue, 14 Nov 2017 12:15:42 +0000 (12:15 +0000)]
tools: xentoolcore_restrict_all: Do deregistration before close
Closing the fd before unhooking it from the list runs the risk that a
concurrent thread calling xentoolcore_restrict_all will operate on the
old fd value, which might refer to a new fd by then. So we need to do
it in the other order.
Sadly this weakens the guarantee provided by xentoolcore_restrict_all
slightly, but not (I think) in a problematic way. It would be
possible to implement the previous guarantee, but it would involve
replacing all of the close() calls in all of the individual osdep
parts of all of the individual libraries with calls to a new function
which does
dup2("/dev/null", thing->fd);
pthread_mutex_lock(&handles_lock);
thing->fd = -1;
pthread_mutex_unlock(&handles_lock);
close(fd);
which would be terribly tedious.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
As a follow-up to XSA-212 we should have addressed a similar issue here:
The handles being advanced at the top of xenmem_add_to_physmap_batch()
means we allow hypervisor space accesses (in particular, for "errs",
writes) with suitably crafted input arguments. This isn't a security
issue in this case because of the limited width of struct
xen_add_to_physmap_batch's size field: It being 16-bits wide, only the
r/o M2P area can be accessed. Still we can and should do better.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 28 Nov 2017 12:14:43 +0000 (13:14 +0100)]
x86: check paging mode earlier in xenmem_add_to_physmap_one()
There's no point in deferring this until after some initial processing,
and it's actively wrong for the XENMAPSPACE_gmfn_foreign handling to not
have such a check at all.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 28 Nov 2017 12:14:10 +0000 (13:14 +0100)]
x86: replace bad ASSERT() in xenmem_add_to_physmap_one()
There are no locks being held, i.e. it is possible to be triggered by
racy hypercall invocations. Subsequent code doesn't really depend on the
checked values, so this is not a security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
George Dunlap [Tue, 28 Nov 2017 12:13:26 +0000 (13:13 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation
If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.
Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail. In that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.
Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.
Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.
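Roughly (an illustrative sketch, not the literal patch):
    if ( p2m_set_entry(p2m, gfn, INVALID_MFN, order,
                       p2m_invalid, p2m->default_access) )
    {
        /* A partially modified large entry: crashing is the safest option. */
        if ( order != 0 )
            domain_crash(d);
        return false;   /* caller falls back to guest_remove_page() */
    }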
Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order. Nonetheless,
check the return value and crash if our assumption turns out to be
wrong.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Tue, 28 Nov 2017 12:13:03 +0000 (13:13 +0100)]
p2m: Always check to see if removing a p2m entry actually worked
The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.
Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.
The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m. If the guest
afterwards balloons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.
There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before. Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.
While we're here, use PAGE_ORDER_2M rather than a magic constant.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 28 Nov 2017 12:11:55 +0000 (13:11 +0100)]
x86/pod: prevent infinite loop when shattering large pages
When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).
p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated,
resulting in the callers retrying the PoD operation indefinitely.
Prevent the infinite loop by returning false when it is not possible to
shatter the large mapping.
This is XSA-246.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Wed, 22 Nov 2017 19:19:02 +0000 (19:19 +0000)]
SUPPORT.md: Add x86-specific virtual hardware
x86-specific virtual hardware provided by the hypervisor, toolstack,
or QEMU.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
George Dunlap [Wed, 22 Nov 2017 19:19:01 +0000 (19:19 +0000)]
SUPPORT.md: Toolstack core
For now only include xl-specific features, or interaction with the
system. Feature support matrix will be added when features are
mentioned.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
George Dunlap [Thu, 23 Nov 2017 17:32:14 +0000 (17:32 +0000)]
Introduce skeleton SUPPORT.md
Add a machine-readable file to describe what features are in what
state of being 'supported', as well as information about how long this
release will be supported, and so on.
The document should be formatted using "semantic newlines" [1], to make
changes easier.
Begin with the basic framework.
Signed-off-by: Ian Jackson <ian.jackson@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
[1] http://rhodesmill.org/brandon/2012/one-sentence-per-line/
Jan Beulich [Thu, 23 Nov 2017 10:40:31 +0000 (11:40 +0100)]
x86emul/test: keep compiler from using {x,y,z}mm registers itself
Since the emulator acts on the live hardware registers, we need to
prevent the compiler from using them e.g. for inlined memcpy() /
memset() (as gcc7 does). We can't, however, set this from the command
line, as otherwise the 64-bit build would face issues with functions
returning floating point values and being declared in standard headers.
As the pragma isn't available prior to gcc6, we need to invoke it
conditionally. Luckily up to gcc6 we haven't seen generated code access
SIMD registers beyond what our asm()s do.
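The invocation therefore ends up along these lines (a sketch; the exact
target string used in the tree may differ):
    /* The pragma only exists from gcc 6 onwards. */
    #if __GNUC__ >= 6
    # pragma GCC target("no-sse")
    #endif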
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 23 Nov 2017 10:38:22 +0000 (11:38 +0100)]
sync CPU state upon final domain destruction
See the code comment being added for why we need this.
This is being placed here to balance between the desire to prevent
future similar issues (the risk of which would grow if it was put
further down the call stack, e.g. in vmx_vcpu_destroy()) and the
intention to limit the performance impact (otherwise it could also go
into rcu_do_batch(), paralleling the use in do_tasklet_work()).
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 16 Nov 2017 21:34:02 +0000 (21:34 +0000)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.
_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.
To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
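In outline (a sketch; "desc_start" is a made-up name marking where the
descriptor was written into the context stream):
    struct hvm_save_descriptor *desc =
        (struct hvm_save_descriptor *)&h->data[desc_start];

    /* Shrink the descriptor to the number of MSR bytes actually written,
     * so the next descriptor starts immediately after them. */
    desc->length = h->cur - (desc_start + sizeof(*desc));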
A typical failure on the destination side looks like:
(XEN) HVM4 restore: CPU_MSR 0
(XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
(XEN) HVM4 restore: failed to load entry 20/0
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 16 Nov 2017 21:10:00 +0000 (21:10 +0000)]
tools/libxc: Fix restoration of PV MSRs after migrate
There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
didn't test this bit of Migration v2 very well when writing it...
vcpu->msrsz is always expected to be a whole multiple of the
xen_domctl_vcpu_msr_t record size in a spec-compliant stream, so taking the
modulo yields 0 for the msr_count, rather than the actual number of records
sent in the stream.
Passing 0 for the msr_count causes the hypercall to exit early, and hides the
fact that the guest handle is inserted into the wrong field in the domctl
union.
The reason that these bugs have gone unnoticed for so long is that the only
MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
in fairly modern hardware, and whose use doesn't appear to be implemented in
any contemporary PV guests.
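The corrected logic amounts to something like (a sketch using the domctl
interface names; "buffer" stands for the record array read from the stream):
    /* Divide, not modulo, to get the number of records ... */
    unsigned int msr_count = vcpu->msrsz / sizeof(xen_domctl_vcpu_msr_t);

    domctl.u.vcpu_msrs.msr_count = msr_count;
    /* ... and insert the guest handle into the vcpu_msrs member of the
     * domctl union, not some other field. */
    set_xen_guest_handle(domctl.u.vcpu_msrs.msrs, buffer);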
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
The altp2m_vcpu_enable_notify subop handler might skip calling
rcu_unlock_domain() after rcu_lock_current_domain(). Albeit since both
rcu functions are no-ops when run on the current domain, this doesn't
really have repercussions.
The second change is adding a missing break that would have potentially
enabled #VE for the current domain even if it had intended to enable it
for another one (not a supported functionality).
Signed-off-by: Adrian Pop <apop@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 16 Nov 2017 09:38:14 +0000 (10:38 +0100)]
x86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()
The fix for XSA-243 / CVE-2017-15592 (c/s bf2b4eadcf379) introduced a change
in behaviour for sh_guess_wrmap(), where it had to cope with no shadow linear
mapping being present.
As the name suggests, guest_vtable is a mapping of the guest's pagetable, not
Xen's pagetable, meaning that it isn't the pagetable we need to check for the
shadow linear slot in.
The practical upshot is that a shadow HVM vcpu which switches into 4-level
paging mode, with an L4 pagetable that contains a mapping which aliases Xen's
SH_LINEAR_PT_VIRT_START, will fool the safety check for whether a SHADOW_LINEAR
mapping is present. As the check passes (when it should have failed), Xen
subsequently falls over the missing mapping with a pagefault such as:
Jan Beulich [Thu, 16 Nov 2017 09:37:29 +0000 (10:37 +0100)]
x86: don't wrongly trigger linear page table assertion
_put_page_type() may do multiple iterations until its cmpxchg()
succeeds. It invokes set_tlbflush_timestamp() on the first
iteration, however. Code inside the function takes care of this, but
- the assertion in _put_final_page_type() would trigger on the second
iteration if time stamps in a debug build are permitted to be
sufficiently much wider than the default 6 bits (see WRAP_MASK in
flushtlb.c),
- it returning -EINTR (for a continuation to be scheduled) would leave
the page in an inconsistent state (until the re-invocation completes).
Make the set_tlbflush_timestamp() invocation conditional, bypassing it
(for now) only in the case we really can't tolerate the stamp to be
stored.
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Julien Grall [Wed, 15 Nov 2017 19:34:14 +0000 (19:34 +0000)]
xen/arm: p2m: Add more debug in get_page_from_gva
The function get_page_from_gva is used by copy_*_guest helpers to
translate a guest virtual address to a machine physical address and take a
reference on the page.
There are a couple of error paths that will return the same value, making
it difficult to know the exact error. Add more debug messages in each error
path, for debug builds only.
This should help narrowing down the intermittent failure with the
hypercall GNTTABOP_copy (see [1]).
Julien Grall [Wed, 15 Nov 2017 19:34:13 +0000 (19:34 +0000)]
xen/arm: mm: Change the return value of gvirt_to_maddr
Currently, gvirt_to_maddr returns -EFAULT when the translation fails.
It might be useful to return the PAR_EL1 (Physical Address Register)
in such a case to get a better idea of the reason.
So modify the return value to use 0 on success or return the PAR on
failure.
The callers are modified to reflect the change of the return value.
Note that with the change in gvirt_to_maddr, ma needs to be initialized
to avoid GCC getting confused (i.e. warning that the value may be used
uninitialized) with the new construction.
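A caller then looks roughly like this (a sketch; flag and message details
are illustrative):
    paddr_t maddr = 0;   /* initialised to keep GCC happy, as noted above */
    uint64_t par = gvirt_to_maddr(va, &maddr, GV2M_READ);

    if ( par )
    {
        /* Translation failed: the PAR value describes the fault. */
        dprintk(XENLOG_G_DEBUG, "gvirt_to_maddr failed, par=%#"PRIx64"\n", par);
        return -EFAULT;
    }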
Yu Zhang [Tue, 14 Nov 2017 16:11:26 +0000 (17:11 +0100)]
x86/mm: fix race condition in modify_xen_mappings()
In modify_xen_mappings(), an L1/L2 page table shall be freed if all
entries of this page table are empty. The corresponding L2/L3 PTE needs
to be cleared in such a scenario.
However, concurrent paging structure modifications on different CPUs may
cause the L2/L3 PTEs to already be cleared, or to be set to reference a
superpage.
Therefore the logic to enumerate the L1/L2 page table and to reset the
corresponding L2/L3 PTE needs to be protected with a spinlock, and the
_PAGE_PRESENT and _PAGE_PSE flags need to be checked after the lock is
obtained.
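The resulting pattern is roughly (an illustrative sketch, assuming the
existing map_pgdir_lock is used):
    spin_lock(&map_pgdir_lock);
    /* Re-check the flags now that the lock is held; another CPU may have
     * cleared the entry or turned it into a superpage meanwhile. */
    if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) &&
         !(l2e_get_flags(*pl2e) & _PAGE_PSE) )
    {
        /* Still a present L1 table: safe to enumerate and, if all entries
         * are empty, clear the L2 entry and free the L1 table. */
    }
    spin_unlock(&map_pgdir_lock);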
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Min He [Tue, 14 Nov 2017 16:10:56 +0000 (17:10 +0100)]
x86/mm: fix race conditions in map_pages_to_xen()
In map_pages_to_xen(), an L2 page table entry may be reset to point to a
superpage, and its corresponding L1 page table needs to be freed in such a
scenario, when those L1 page table entries map consecutive page frames and
have the same mapping flags.
However, variable `pl1e` is not protected by the lock before L1 page table
is enumerated. A race condition may happen if this code path is invoked
simultaneously on different CPUs.
For example, the `pl1e` value on CPU0 may hold an obsolete value, pointing
to a page which has just been freed on CPU1. Besides, before this page is
reused, it will still hold the old PTEs, referencing consecutive page
frames. Consequently, `free_xen_pagetable(l2e_to_l1e(ol2e))` will be
triggered on CPU0, resulting in the unexpected freeing of a normal page.
This patch fixes the above problem by protecting the `pl1e` with the lock.
Also, there are other potential race conditions. For instance, the L2/L3
entry may be modified concurrently on different CPUs, by routines such as
map_pages_to_xen(), modify_xen_mappings() etc. To fix this, this patch will
check the _PAGE_PRESENT and _PAGE_PSE flags, after the spinlock is obtained,
for the corresponding L2/L3 entry.
Signed-off-by: Min He <min.he@intel.com> Signed-off-by: Yi Zhang <yi.z.zhang@intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Eric Chanudet [Tue, 14 Nov 2017 16:09:50 +0000 (17:09 +0100)]
x86/hvm: do not register hpet mmio during s3 cycle
Do it once at domain creation (hpet_init).
Sleep -> Resume cycles will end up crashing an HVM guest with hpet as
the sequence during resume takes the path:
-> hvm_s3_suspend
-> hpet_reset
-> hpet_deinit
-> hpet_init
-> register_mmio_handler
-> hvm_next_io_handler
register_mmio_handler will use a new io handler each time, until
eventually it reaches NR_IO_HANDLERS, then hvm_next_io_handler calls
domain_crash.
Signed-off-by: Eric Chanudet <chanudete@ainfosec.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
tools/xenstored: Check number of strings passed to do_control()
It is possible to send a zero-string message body to xenstore's
XS_CONTROL handling function. Then the number of strings is used
for an array allocation. This leads to a crash in strcmp() in a
CONTROL sub-command invocation loop.
The output of xs_count_strings() should be verified and all 0 or
negative values should be rejected with an EINVAL. At least the
sub-command name must be specified.
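The added check is essentially (a sketch; local names approximate):
    num = xs_count_strings(in->buffer, in->used);
    if (num < 1)
        return EINVAL;   /* at least the sub-command name is required */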
The xenstore crash can only be triggered from within dom0 (there
is a check in do_control() rejecting all non-dom0 requests with
an EACCES).
Testing: reproduced with the following command:
python -c 'print 16*"\x00"' | nc -U $XENSTORED_RUNDIR/socket
Signed-off-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Julien Grall [Fri, 10 Nov 2017 17:10:50 +0000 (17:10 +0000)]
libs/evtchn: Remove active handler on clean-up or failure
Commit 89d55473ed16543044a31d1e0d4660cf5a3f49df "xentoolcore_restrict_all:
Implement for libxenevtchn" added a call to register allowing to
restrict the event channel.
However, the call to deregister the handler was not performed if open
failed or when closing the event channel. This will result in corruption
of the list of handlers and potentially crash the application later on.
Fix it by calling xentoolcore_deregister_active_handle on failure and
closure.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Anthony PERARD [Mon, 13 Nov 2017 12:27:32 +0000 (12:27 +0000)]
Config.mk: Update QEMU changeset
New commits:
- xen/pt: allow QEMU to request MSI unmasking at bind time
To fix a passthrough bug.
- ui/gtk: Fix deprecation of vte_terminal_copy_clipboard
A build fix.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Wed, 8 Nov 2017 12:40:40 +0000 (13:40 +0100)]
x86/cpuid: minor fixups missed from previous work
* Add more feature names to ./xen-cpuid
* Vertically align the magic comments in cpufeatureset.h
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:09 +0000 (20:19 +0100)]
tools/dombuilder: Prevent failures of xc_dom_gnttab_init()
Recent changes in grant table configuration have caused calls to
xc_dom_gnttab_init() to fail if not preceded by a call to
xc_domain_set_gnttab_limits(). This is backwards from the point of view of
3rd party dombuilder users.
Add max_{grant,maptrack}_frames parameters to struct xc_dom_image, and require
them to be set by callers using xc_dom_gnttab_init(). Libxl, which itself
uses xc_dom_gnttab_init(), is updated appropriately.
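A third party dombuilder user is then expected to do something along these
lines (a sketch; the frame counts are illustrative values only):
    struct xc_dom_image *dom = xc_dom_allocate(xch, cmdline, features);

    /* New fields introduced by this change. */
    dom->max_grant_frames = 32;
    dom->max_maptrack_frames = 1024;

    /* ... build the domain ... */
    rc = xc_dom_gnttab_init(dom);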
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Juergen Gross <jgross@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:08 +0000 (20:19 +0100)]
tools/dombuilder: Fix asymmetry when setting up console and xenstore rings
libxl always uses xc_dom_gnttab_init(), which internally calls
xc_dom_gnttab{_hvm,}_seed() to set up the grants pointing at the console and
xenstore rings. For HVM guests, libxl then asks Xen for the information set
up previously, and calls xc_dom_gnttab_hvm_seed() a second time, which is
wasteful. ARM construction expects libxl to have set up
dom->{console,xenstore}_evtchn earlier, so it only actually functions because
of this second call.
Rationalise everything and make it consistent for all guests.
1) Users of the domain builder are expected to provide
dom->{console,xenstore}_{evtchn,domid} unconditionally. This is checked
by setting invalid values in xc_dom_allocate(), and checking in
xc_dom_boot_image().
2) For x86 HVM and ARM guests, the event channels are given to Xen at the
same time as the ring gfns. ARM already did this, but x86 is updated to
match. x86 PV already provides this information in the start_info page.
3) Libxl is updated to drop all relevant functionality from
hvm_build_set_params(), and behave consistently with PV guests when it
comes to the handling of dom->{console,xenstore}_{evtchn,domid,gfn}.
This removes several redundant hypercalls (including a foreign mapping) from
the x86 HVM and ARM construction paths.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Wei Liu [Thu, 12 Oct 2017 19:19:07 +0000 (20:19 +0100)]
tools/dombuilder: Switch to using gfn terminology for console and xenstore rings
The sole use of xc_dom_translated() and xc_dom_p2m() outside of the domain
builder is for libxl_dom() to translate the console and xenstore pfns back
into useful values. PV guest pfns are only interesting to the domain builder,
and gfns are the address space used by all other hypercalls.
Renaming the fields in xc_dom_image is deliberate, as it will cause
out-of-tree users of the dombuilder to notice the different semantics.
Correct the terminology throughout xc_dom_gnttab{_hvm,}_seed(), which are all
using gfns despite the existing variable names.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
[ wei: fix stubdom build ] Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Tue, 31 Oct 2017 17:07:41 +0000 (17:07 +0000)]
common/multicall: Increase debugability for bad hypercalls
While investigating an issue (in a new codepath I'd introduced, as it turns
out), leaving interrupts disabled manifested as a subsequent op in the
multicall failing a check_lock() test.
The codepath would have hit the ASSERT_NOT_IN_ATOMIC on the return-to-guest
path, had it not hit the check_lock() first.
Call ASSERT_NOT_IN_ATOMIC() after each operation in the multicall, to make
failures more obvious.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Mon, 30 Oct 2017 17:42:52 +0000 (17:42 +0000)]
common/spinlock: Improve the output from check_lock() if it trips
If check_lock() triggers, a crash will occur. Instead of simply identifying
"the irq context was different", indicate the expected and current irq
context.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:06 +0000 (20:19 +0100)]
tools/dombuilder: Remove clear_page() from xc_dom_boot.c
pfn 0 is a legitimate (albeit unlikely) frame to use, so skipping it is wrong.
This behaviour appears to exist simply to cover the fact that zero is the
default value of an uninitialised field in dom.
ARM already clears the frames at the point that the pfns are allocated,
meaning that the added clear_page() is wasteful. Alter x86 to match ARM and
clear the page when it is allocated.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Andrew Cooper [Thu, 12 Oct 2017 19:19:05 +0000 (20:19 +0100)]
tools/dombuilder: Drop more PVH v1 leftovers
alloc_magic_pages() is renamed to alloc_magic_pages_pv() to mirror its
alloc_magic_pages_hvm() counterpart. Delete a redundant comment, introduce
some newlines for clarity, and remove a logically dead allocation of shared info.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Julien Grall <julien.grall@arm.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
With the current SBSA UART emulation, streaming larger amounts of data
(as caused by "find /", for instance) can lead to character losses.
This is due to the OUT ring buffer getting full, because we change the
TX interrupt bit only when the FIFO is actually full, and not already
when it's half-way filled, as the Linux driver expects.
The SBSA spec does not explicitly state this, but we assume that an
SBSA compliant UART uses the PL011 default "interrupt FIFO level select
register" value of "1/2 way". The Linux driver certainly makes this
assumption, so it expects to be able to write a number of characters
after the TX interrupt line has been asserted.
There is a similar wrong behaviour on the receive side.
However changing the RX interrupt to trigger on reaching half of the FIFO
level will lead to lag, because the guest would not be notified of incoming
characters until the FIFO is half way filled. This leads to unacceptable
lag when typing on a terminal.
Real hardware solves this issue by using the "receive timeout
interrupt" (RTI), which is triggered when character reception stops for
32 baud cycles. As we cannot and do not want to emulate any timing here,
we slightly abuse the timeout interrupt to notify the guest of new
characters: when a new character comes in, the RTI is asserted, when
the FIFO is cleared, the interrupt gets cleared.
So this patch changes the emulated interrupt trigger behaviour to come
as close to real hardware as possible: the RX and TX interrupt trigger
when the FIFO gets half full / half empty, and the RTI interrupt signals
new incoming characters.
Bhupinder Thakur [Tue, 24 Oct 2017 17:09:21 +0000 (18:09 +0100)]
arm/xen: vpl011: Fix the slow early console SBSA UART output
The early console output uses pl011_early_write() to write data. This
function waits for the BUSY bit to get cleared before writing the next byte.
In the SBSA UART emulation logic, the BUSY bit was set as soon as one
byte was written in the FIFO and it remained set until the FIFO was
emptied. This meant that the output was delayed as each character needed
the BUSY to get cleared.
Since the SBSA UART is getting emulated in Xen using ring buffers, it
ensures that once the data is enqueued in the FIFO, it will be received
by xenconsole, so it is safe to set the BUSY bit only when the FIFO becomes
full. This ensures that pl011_early_write() is not unduly delayed when
writing the data.
Signed-off-by: Bhupinder Thakur <bhupinder.thakur@linaro.org> Reviewed-by: Andre Przywara <andre.przywara@linaro.org> Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Roger Pau Monne [Thu, 26 Oct 2017 09:19:30 +0000 (10:19 +0100)]
gcov: return ENOSYS for unimplemented gcov domctl
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ross Lagerwall [Wed, 18 Oct 2017 13:42:33 +0000 (14:42 +0100)]
xentoolcore_restrict_all: Implement for libxenevtchn
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Ross Lagerwall [Wed, 18 Oct 2017 13:42:32 +0000 (14:42 +0100)]
tools/libs/evtchn: Add support for restricting a handle
Implement support for restricting evtchn handles to a particular domain
on Linux by calling the IOCTL_EVTCHN_RESTRICT_DOMID ioctl (support added
in Linux v4.8).
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Changeset 2b1cde7783 introduced "batch mode" to afl-harness, which allowed
the handling of several inputs in sequence.
Unfortunately, it introduced a file pointer leak when the file was
larger than the maximum size. Restructure the code to always close fp
if we opened it.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
George Dunlap [Fri, 27 Oct 2017 13:26:27 +0000 (14:26 +0100)]
x86/mm: Make PV linear pagetables optional
Allowing pagetables to point to other pagetables of the same level
(often called 'linear pagetables') has been included in Xen since its
inception; but recently it has been the source of a number of subtle
reference-counting bugs.
It is not used by Linux or MiniOS; but it is used by NetBSD and Novell
Netware. There are significant numbers of people who are never going
to use the feature, along with significant numbers who need the
feature.
Add a Kconfig option for the feature (default to 'y'). Also add a
command-line option to control whether PV linear pagetables are
allowed (default to 'true').
NB that we leave linear_pt_count in the page struct. It's in a union,
so its presence doesn't increase the size of the data struct.
Changing the layout of the other elements based on configuration
options is asking for trouble however; so we'll just leave it there
and ASSERT that it's zero.
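For reference, the runtime knob ends up looking roughly like this (a sketch;
the option and variable names follow the patch but may differ in detail):
    /* Default to allowing PV linear pagetables; can be turned off with the
     * "pv-linear-pt" command line option. */
    static bool __read_mostly opt_pv_linear_pt = true;
    boolean_param("pv-linear-pt", opt_pv_linear_pt);

    /* ... in the pagetable reference-taking path: */
    if ( !opt_pv_linear_pt )
        return 0;   /* linear pagetable mappings administratively disabled */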
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Boris Ostrovsky [Tue, 24 Oct 2017 23:30:20 +0000 (19:30 -0400)]
x86/vpmu: Remove unnecessary call to do_interrupt()
This call was left during PVHv1 removal (commit 33e5c32559e1 ("x86:
remove PVHv1 code")):
- if ( is_pvh_vcpu(sampling) &&
- !(vpmu_mode & XENPMU_MODE_ALL) &&
+ if ( !(vpmu_mode & XENPMU_MODE_ALL) &&
!vpmu->arch_vpmu_ops->do_interrupt(regs) )
return;
As a result of this extra call, VPMU no longer works for PV guests on Intel
because we effectively lose the value of MSR_CORE_PERF_GLOBAL_STATUS.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 26 Oct 2017 07:57:31 +0000 (01:57 -0600)]
x86: fix asm() constraint for GS selector update
Exception fixup code may alter the operand, which ought to be reflected
in the constraint.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Thu, 26 Oct 2017 07:57:04 +0000 (01:57 -0600)]
x86: don't latch wrong (stale) GS base addresses
load_segments() writes selector registers before doing any of the base
address updates. Any of these selector loads can cause a page fault in
case it references the LDT, and the LDT page accessed was only recently
installed. Therefore the call tree map_ldt_shadow_page() ->
guest_get_eff_kern_l1e() -> toggle_guest_mode() would in such a case
wrongly latch the outgoing vCPU's GS.base into the incoming vCPU's
recorded state.
Split page table toggling from GS handling - neither
guest_get_eff_kern_l1e() nor guest_io_okay() need more than the page
tables being the kernel ones for the memory access they want to do.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
libxc: remove stale error check for domain size in xc_sr_save_x86_hvm.c
Long ago domains to be saved were limited to 1TB size due to the
migration stream v1 limitations which used a 32 bit value for the
PFN and the frame type (4 bits) leaving only 28 bits for the PFN.
Migration stream V2 uses a 64 bit value for this purpose, so there
is no need to refuse saving (or migrating) domains larger than 1 TB.
For 32 bit toolstacks there is still a size limit, as domains larger
than about 1TB will lead to an exhausted virtual address space of the
saving process. So keep the test for 32 bit, but don't base it on the
page type macros. As a migration could lead to the situation where a
32 bit toolstack would have to handle such a large domain (in case the
sending side is 64 bit) the same test should be added for restoring a
domain.
Jan Beulich [Tue, 24 Oct 2017 16:13:13 +0000 (18:13 +0200)]
x86: also show FS/GS base addresses when dumping registers
Their state may be important to figure out the reason for a crash. To not
further grow duplicate code, break out a helper function.
I realize that (ab)using the control register array here may not be
considered the nicest solution, but it seems easier (and less overall
overhead) to do so compared to the alternative of introducing another
helper structure.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 24 Oct 2017 16:12:31 +0000 (18:12 +0200)]
x86: fix GS-base-dirty determination
load_segments() writes the two MSRs in their "canonical" positions
(GS_BASE for the user base, SHADOW_GS_BASE for the kernel one) and uses
SWAPGS to switch them around if the incoming vCPU is in kernel mode. In
order to not leave a stale kernel address in GS_BASE when the incoming
guest is in user mode, the check on the outgoing vCPU needs to be
dependent upon the mode it is currently in, rather than blindly looking
at the user base.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>