Andrew Cooper [Thu, 18 Jul 2019 16:53:03 +0000 (17:53 +0100)]
xen/trace: Add trace.h to MAINTAINERS
... to match the existing trace.c entry.
Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
To remove a device from a domain, a qmp command is sent to qemu. But it is
handled by qemu asynchronously. Even when the qmp command is reported as done,
the actual handling on the qemu side may happen later.
This behavior causes two problems:
1. Attaching a device back to a domain right after detaching the device from
that domain would fail with error:
libxl: error: libxl_qmp.c:341:qmp_handle_error_response: Domain 1:received an
error message from QMP server: Duplicate ID 'pci-pt-60_00.0' for device
2. Accesses to PCI configuration space in Qemu may overlap with later device
reset issued by 'xl' or by pciback.
In order to avoid the problems mentioned above, wait for the completion of
device removal by querying all pci devices using a qmp command and ensuring the
target device isn't listed. Only retry 5 times to avoid 'xl' potentially being
blocked by qemu.
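The bounded wait described above could be sketched roughly as follows. This is only an illustration, not the libxl implementation: wait_for_removal(), device_listed_fn, MAX_REMOVAL_RETRIES and the fake callback are hypothetical names, and the callback stands in for the qmp query that lists the pci devices.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical retry budget, matching the "retry 5 times" in the text. */
#define MAX_REMOVAL_RETRIES 5

/* Hypothetical callback: returns true while the device is still listed. */
typedef bool (*device_listed_fn)(void *ctx);

/*
 * Poll until the device is no longer listed or the retry budget is spent.
 * Returns 0 once the device is gone, -1 if it is still listed after the
 * last attempt. A real caller would also wait between attempts.
 */
static int wait_for_removal(device_listed_fn listed, void *ctx)
{
    for (int attempt = 0; attempt < MAX_REMOVAL_RETRIES; attempt++) {
        if (!listed(ctx))
            return 0;
    }
    return -1;
}

/* Fake query for illustration: reports "listed" for the next *ctx calls. */
static bool listed_for_n_queries(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return true;
    }
    return false;
}
```

Capping the number of attempts trades completeness for liveness: a device that never disappears produces an error instead of blocking the caller indefinitely.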
Signed-off-by: Chao Gao <chao.gao@intel.com>
Message-Id: <1562133373-19208-1-git-send-email-chao.gao@intel.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Daniel P. Smith [Thu, 18 Jul 2019 21:11:44 +0000 (22:11 +0100)]
golang/xenlight: Fixing compilation for go 1.11
This deals with two casting issues for compiling under go 1.11:
- explicitly cast to *C.xentoollog_logger for Ctx.logger pointer
- add cast to unsafe.Pointer for the C string cpath
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Mon, 8 Jul 2019 10:56:24 +0000 (06:56 -0400)]
MAINTAINERS: Make myself libxl golang binding maintainer
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Thu, 18 Jul 2019 13:29:35 +0000 (14:29 +0100)]
xen/trace: Fix build with !CONFIG_TRACEBUFFER
GCC reports:
In file included from hvm.c:24:0:
/local/xen.git/xen/include/xen/trace.h: In function ‘tb_control’:
/local/xen.git/xen/include/xen/trace.h:60:13: error: ‘ENOSYS’
undeclared (first use in this function)
return -ENOSYS;
^~~~~~
Include xen/errno.h to resolve the issue. While tweaking this, add comments
to the #else and #endif, as they are a fair distance apart.
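A minimal sketch of the stub pattern involved, assuming the standard <errno.h> in place of xen/errno.h so that it builds outside the hypervisor. tb_control_sketch() is a hypothetical stand-in for the real function, and CONFIG_TRACEBUFFER is deliberately left undefined so the stub branch is the one compiled.

```c
#include <errno.h>
#include <stddef.h>

#ifdef CONFIG_TRACEBUFFER

/* Real implementation would live in another translation unit. */
int tb_control_sketch(void *op);

#else /* !CONFIG_TRACEBUFFER */

/* Stub: without the errno header, ENOSYS below would be undeclared. */
static inline int tb_control_sketch(void *op)
{
    (void)op;
    return -ENOSYS;
}

#endif /* CONFIG_TRACEBUFFER */
```

The trailing comments on #else and #endif serve the purpose named in the message: they identify the condition when the two directives are far apart.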
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Sat, 13 Apr 2019 21:03:05 +0000 (22:03 +0100)]
x86/mm: Provide more useful information in diagnostics
* alloc_l?_table() should identify the failure, not just state that there is
one.
* get_page() should use %pd for the two domains, to render system domains in
a more obvious way.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 17 Jul 2019 13:46:08 +0000 (15:46 +0200)]
x86emul: add a PCLMUL/VPCLMUL test case to the harness
Also use this for AVX512_VBMI2 VPSH{L,R}D{,V}{D,Q,W} testing (only the
quad word right shifts get actually used; the assumption is that their
"left" counterparts as well as the double word and word forms then work
as well).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:43:57 +0000 (15:43 +0200)]
x86emul: restore ordering within main switch statement
Incremental additions and/or mistakes have led to some code blocks
sitting in "unexpected" places. Re-sort the case blocks (opcode space;
major opcode; 66/F3/F2 prefix; legacy/VEX/EVEX encoding).
As an exception the opcode space 0x0f EVEX-encoded VPEXTRW is left at
its current place, to keep it close to the "pextr" label.
Pure code movement.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:43:06 +0000 (15:43 +0200)]
x86emul: support GFNI insns
As to the feature dependency adjustment, while strictly speaking SSE is
a sufficient prereq (to have XMM registers), vectors of bytes and qwords
have got introduced only with SSE2. gcc, for example, uses a similar
connection in its respective intrinsics header.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:41:58 +0000 (15:41 +0200)]
x86emul: support VAES insns
As to the feature dependency adjustment, just like for VPCLMULQDQ while
strictly speaking AVX is a sufficient prereq (to have YMM registers),
256-bit vectors of integers have got fully introduced with AVX2 only.
A new test case (also covering AESNI) will be added to the harness by a
subsequent patch.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:41:20 +0000 (15:41 +0200)]
x86emul: support VPCLMULQDQ insns
As to the feature dependency adjustment, while strictly speaking AVX is
a sufficient prereq (to have YMM registers), 256-bit vectors of integers
have got fully introduced with AVX2 only. Sadly gcc can't be used as a
reference here: They don't provide any AVX512-independent built-in at
all.
Along the lines of PCLMULQDQ, since the insns here and in particular
their memory access patterns follow the usual scheme, I didn't think it
was necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:40:42 +0000 (15:40 +0200)]
x86emul: support AVX512_VNNI insns
Along the lines of the 4FMAPS case, convert the 4VNNIW-based table
entries to a decoder adjustment. Because of the current sharing of table
entries between different (implied) opcode prefixes and with the same
major opcodes being used for vp4dpwssd{,s}, which have a different
memory operand size and different Disp8 scaling, the pre-existing table
entries get converted to a decoder override. The table entries will now
represent the insns here, in line with other table entries preferably
representing the prefix-66 insns.
As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:39:54 +0000 (15:39 +0200)]
x86emul: support AVX512_4VNNIW insns
As in a few cases before, since the insns here and in particular their
memory access patterns follow the AVX512_4FMAPS scheme, I didn't think
it was necessary to add contrived tests specifically for them, beyond
the Disp8 scaling ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:39:10 +0000 (15:39 +0200)]
x86emul: support AVX512_4FMAPS insns
A decoder adjustment is needed here because of the current sharing of
table entries between different (implied) opcode prefixes: The same
major opcodes are used for vfmsub{132,213}{p,s}{s,d}, which have a
different memory operand size and different Disp8 scaling.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:38:35 +0000 (15:38 +0200)]
x86emul: support remaining AVX512_VBMI2 insns
As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:37:54 +0000 (15:37 +0200)]
x86emul: support of AVX512_IFMA insns
Once again take the liberty and also correct the (public interface) name
of the AVX512_IFMA feature flag to match the SDM, on the assumption that
no external consumer has actually been using that flag so far.
As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:37:00 +0000 (15:37 +0200)]
x86emul: support of AVX512* population count insns
Plus the only other AVX512_BITALG one.
As in a few cases before, since the insns here and in particular their
memory access patterns follow the usual scheme, I didn't think it was
necessary to add a contrived test specifically for them, beyond the
Disp8 scaling one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Xen internal running status(trace event at pre-defined trace point)
will be saved to trace memory when enabled.
Trace event data and config params can be read/changed
by system control hypercall at run time.
Can be disabled for smaller code footprint.
Signed-off-by: Baodong Chen <chenbaodong@mxnavi.com> Acked-by: George Dunlap <george.dunlap@citrix.com> [tracing] Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 17 Jul 2019 13:34:23 +0000 (15:34 +0200)]
dom_cow is needed for mem-sharing only
A couple of adjustments are needed to code checking for dom_cow, but
since there are pretty few it is probably better to adjust those than
to set up and keep around a never used domain.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 17 Jul 2019 13:33:05 +0000 (15:33 +0200)]
x86/PV: drop page table ownership check from emul-priv-op.c:read_cr()
We have such a check here but nowhere else. It shouldn't have been
added by af909e7e16 ("M2P translation cannot be handled through flat
table with") in the first place.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Mon, 15 Jul 2019 16:21:02 +0000 (17:21 +0100)]
x86/suspend: Don't save/restore %cr8
%cr8 is an alias of APIC_TASKPRI, which is handled by
lapic_{suspend,resume}() with the rest of the Local APIC state. Saving
and restoring the TPR state in isolation is not a clever idea.
Drop it all.
While editing wakeup_prot.S, trim its include list to just the headers
which are used, which is precisely none of them.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 11 Jul 2019 14:50:17 +0000 (09:50 -0500)]
x86/smpboot: Remove redundant order calculations
The GDT and IDT allocations are all order 0, and not going to change.
Use an explicit 0, instead of calling get_order_from_pages(). This
allows for the removal of the 'order' local parameter in both
cpu_smpboot_{alloc,free}().
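To illustrate why the calculation is redundant for a single page, here is a sketch of an order computation in the spirit of get_order_from_pages(). order_from_pages() is a hypothetical reimplementation written for this note, not the hypervisor's code.

```c
/*
 * Smallest order such that (1UL << order) pages cover nr_pages.
 * Hypothetical helper for illustration only.
 */
static unsigned int order_from_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    if (nr_pages == 0)
        return 0;

    /* Round up: work on nr_pages - 1 and count the bits needed. */
    nr_pages--;
    while (nr_pages) {
        nr_pages >>= 1;
        order++;
    }
    return order;
}
```

For one page the loop never runs and the result is always 0, which is why a literal 0 can replace the call.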
While making this adjustment, rearrange cpu_smpboot_free() to fold the
two "if ( remove )" clauses. There are no explicit requirements for the
order of free()s.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 16 Jul 2019 11:29:02 +0000 (13:29 +0200)]
mm.h: fix BUG_ON() condition in put_page_alloc_ref()
The BUG_ON() was misplaced when this function was introduced in commit ec83f825 "mm.h: add helper function to test-and-clear _PGC_allocated".
It will fire incorrectly if _PGC_allocated is already clear on entry. Thus
it should be moved after the if statement.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 16 Jul 2019 07:10:36 +0000 (09:10 +0200)]
mm.h: add helper function to test-and-clear _PGC_allocated
The _PGC_allocated flag is set on a page when it is assigned to a domain
along with an initial reference count of at least 1. To clear this
'allocation' reference it is necessary to test-and-clear _PGC_allocated and
then only drop the reference if the test-and-clear succeeds. This is open-
coded in many places. It is also unsafe to test-and-clear _PGC_allocated
unless the caller holds an additional reference.
This patch adds a helper function, put_page_alloc_ref(), to replace all the
open-coded test-and-clear/put_page occurrences. That helper function
incorporates a check that an additional page reference is held and will
BUG() if it is not.
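The test-and-clear pattern could look roughly like the sketch below. It is an assumption-laden illustration using C11 atomics: the struct layout, the flag bit position, the count mask and the use of abort() in place of BUG() are all hypothetical. The reference check sits inside the branch taken when the flag was set, so it cannot fire when the flag was already clear on entry.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical layout: top bit is the "allocated" flag, rest is a count. */
#define PGC_ALLOCATED_SKETCH  (1UL << 31)
#define PGC_COUNT_MASK_SKETCH ((1UL << 31) - 1)

struct page_sketch {
    atomic_ulong count_info;
};

/*
 * Atomically clear the allocated flag; only if it was set, drop the
 * allocation reference. Returns true if a reference was dropped.
 */
static bool put_page_alloc_ref_sketch(struct page_sketch *pg)
{
    unsigned long old =
        atomic_fetch_and(&pg->count_info, ~PGC_ALLOCATED_SKETCH);

    if (!(old & PGC_ALLOCATED_SKETCH))
        return false;          /* flag already clear: nothing to drop */

    /* The caller must hold an additional reference (count >= 2). */
    if ((old & PGC_COUNT_MASK_SKETCH) < 2)
        abort();               /* stands in for BUG() */

    atomic_fetch_sub(&pg->count_info, 1UL);
    return true;
}
```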
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 16 Jul 2019 07:09:44 +0000 (09:09 +0200)]
x86/hvm: make hvmemul_virtual_to_linear()'s reps parameter optional
A majority of callers want just a single iteration handled. Allow this to
be expressed by passing in a NULL pointer, instead of setting up a local
variable just to hold the "1" to pass in here.
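The optional-pointer convention can be sketched as below. total_bytes_sketch() is a hypothetical function written for this note and does not reflect the signature of hvmemul_virtual_to_linear().

```c
#include <stddef.h>

/*
 * 'reps' is optional: a NULL pointer means "exactly one iteration".
 * Hypothetical helper showing only the parameter convention.
 */
static unsigned long total_bytes_sketch(unsigned int bytes_per_rep,
                                        const unsigned long *reps)
{
    unsigned long count = reps ? *reps : 1;

    return count * bytes_per_rep;
}
```

Callers that only ever handle one iteration pass NULL and no longer need a local variable holding 1.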
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
EPT differs from NPT and shadow when translating page orders to levels
in the physmap page tables. The EPT page table level for order 0 pages is
0, while NPT and shadow instead use 1, i.e. EPT page table levels
start at 0 while NPT and shadow levels start at 1.
Fix the p2m_entry_modify call in atomic_write_ept_entry to always add
one to the level, in order to match NPT and shadow usage.
While there, also add a check to ensure p2m_entry_modify is never
called with level == 0. That should allow future errors related to the
level parameter to be caught.
Fixes: c7a4c088ad1c ('x86/mm: split p2m ioreq server pages special handling into helper') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Fri, 17 May 2019 10:08:56 +0000 (11:08 +0100)]
tools/xenstored: Drop mapping of the ring via foreign map
This is a vestigial remnant of the pre xenstored stub domain days.
Foreign mapping via MFN is a privileged operation which is not
necessary, because grant details are unconditionally set up during
domain construction. In practice, this means xenstored never uses its
ability to foreign map the ring.
Drop the ability completely, which removes the penultimate use of the
unstable libxc interface.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Fri, 17 May 2019 10:06:16 +0000 (11:06 +0100)]
tools/xenstored: Make gnttab interface mandatory
xenstored currently requires a libxc and evtchn interface, but leaves
the gnttab interface as optional.
gnttab is ubiquitous these days, and in practice mandatory in all cases
where xenstored isn't running as root in dom0 (due to the inability to
foreign map by MFN).
The toolstack has unconditionally set up grant details for many years
now, and longterm it would be good to phase out the use of libxc. This
requires that xenstored map the store ring by grant map, rather than
foreign map.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Juergen Gross [Wed, 26 Jun 2019 13:37:26 +0000 (14:37 +0100)]
libxl: fix pci device re-assigning after domain reboot
After a reboot of a guest only the first pci device configuration will
be retrieved from Xenstore resulting in loss of any further assigned
passed through pci devices.
The main reason is that all passed through pci devices reside under a
common root device "0" in Xenstore. So when the device list is rebuilt
from Xenstore after a reboot the sub-devices below that root device
need to be selected instead of using the root device number as a
selector.
Fix that by adding a new member to struct libxl_device_type which when
set is used to get the number of devices. Add such a member for pci to
get the correct number of pci devices instead of implying it from the
number of pci root devices (which will always be 1).
While at it fix the type of libxl__device_pci_from_xs_be() to match
the one of the .from_xenstore member of struct libxl_device_type. This
fixes a latent bug checking the return value of a function returning
void.
Andrew Cooper [Thu, 4 Jul 2019 15:13:32 +0000 (16:13 +0100)]
x86/ctxt-switch: Document and improve GDT handling
Calling virt_to_mfn() in the context switch path is a lot
of wasted cycles for a result which is constant after boot.
Begin by documenting how Xen handles the GDTs across context switch.
The loop in write_full_gdt_ptes() is unnecessary, because
NR_RESERVED_GDT_PAGES is 1. Dropping it makes the code substantially
more clear, and with it dropped, write_full_gdt_ptes() becomes more
obviously a poor name, so rename it to update_xen_slot_in_full_gdt().
Furthermore, load_full_gdt() is completely independent of the current
CPU, and load_default_gdt() only needs the current CPU's regular
GDT. (This is a change in behaviour, as previously it may have used the
compat GDT, but either will do.)
Add two extra per-cpu variables which cache the L1e for the regular and compat
GDT, calculated in cpu_smpboot_alloc()/trap_init() as appropriate, so
update_xen_slot_in_full_gdt() doesn't need to waste time performing the same
calculation on every context switch.
One performance scenario of Juergen's (time to build the hypervisor on
an 8 CPU system, with two single-vCPU MiniOS VMs constantly interrupting
dom0 with events) shows the following, averaged over 5 measurements:
             elapsed    user  system
  Unpatched    66.51  232.93  109.21
  Patched      57.00  225.47  105.47
which is a substantial improvement.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Tested-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Will Abele [Tue, 9 Jul 2019 13:22:23 +0000 (13:22 +0000)]
xen/arm: use correct device tree root node name
The root node of a device tree should not have a node name. This is
specified in section 2.2.1 of version 0.2 of the device tree
specification, available from devicetree.org.
Linux Kernel versions prior to 4.15 misinterpret flattened device trees
with a "/" as the name of the root node as an FDT version older than 16.
Linux then fails to parse the FDT.
Signed-off-by: Will Abele <will.abele@starlab.io> Reviewed-by: Julien Grall <julien.grall@arm.com>
xen/arm: optee: check if OP-TEE is virtualization-aware
This is a workaround for OP-TEE 3.5. This is the first OP-TEE release
which supports virtualization, but there is no way to tell if
OP-TEE was built with that support enabled. We can probe for it
by calling SMC that is available only when OP-TEE is built with
virtualization support.
xen/arm: tee: place OP-TEE Kconfig option right after TEE
It is nicer when options for particular TEE mediators (currently,
OP-TEE only) follow the generic "Enable TEE mediators support"
option in the menuconfig:
[*] Enable TEE mediators support
[ ] Enable OP-TEE mediator
Amit Singh Tomar [Sun, 23 Jun 2019 12:56:31 +0000 (18:26 +0530)]
xen/arm: domain_build: Black list devices using PPIs
Currently, the vGIC is not able to cope with hardware PPIs routed to guests.
One of the solutions to this problem is to skip any device that uses a PPI
source completely while building the domain itself.
This patch goes through all the interrupt sources of a device and skips it
if one of the interrupt sources is a PPI. It fixes Xen boot on i.MX8MQ by
skipping the PMU node.
Andrew Cooper [Mon, 8 Jul 2019 22:12:06 +0000 (23:12 +0100)]
x86/gnttab: Use explicit instruction size in gnttab_clear_flags()
The OpenSUSE Leap compilers complain about ambiguity:
In file included from grant_table.c:33:
In file included from ...xen/include/xen/grant_table.h:30:
...xen/include/asm/grant_table.h:67:19: error: ambiguous instructions require
an explicit suffix (could be 'andb', 'andw', 'andl', or 'andq')
asm volatile ("lock and %1,%0" : "+m" (*addr) : "ir" ((uint16_t)~mask));
^
<inline asm>:1:2: note: instantiated into assembly here
lock and $-17,(%rsi)
^
Full logs: https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/247600284
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 21 Nov 2018 18:38:41 +0000 (18:38 +0000)]
xen/gnttab: Refactor gnttab_clear_flag() to be gnttab_clear_flags()
To allow for further improvements, it is useful to be able to clear more than
a single flag at once. Rework gnttab_clear_flag() into gnttab_clear_flags()
which takes a bitmask rather than a bit number.
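A rough sketch of the bitmask form, using C11 atomics rather than the architecture-specific helpers. clear_flags_sketch() is a hypothetical name and the 16-bit flags word is an assumption made for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Atomically clear every bit set in 'mask'. With a bit-number interface,
 * clearing several flags would take one atomic operation per flag.
 */
static void clear_flags_sketch(_Atomic uint16_t *addr, uint16_t mask)
{
    atomic_fetch_and(addr, (uint16_t)~mask);
}
```

A caller wanting the old single-flag behaviour passes a mask with one bit set, e.g. (uint16_t)(1u << nr).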
No practical change yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Wed, 21 Nov 2018 18:38:41 +0000 (18:38 +0000)]
arm/gnttab: Implement stub helpers as static inlines
It is inefficient to call into a different translation unit for a stub
function, when a static inline will work fine. Replace an open-coded
printk_once() while moving it.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Paul Durrant [Mon, 8 Jul 2019 08:31:35 +0000 (10:31 +0200)]
xmalloc: add a Kconfig option to poison free pool memory
This patch adds XMEM_POOL_POISON to the Kconfig DEBUG options. If set,
free blocks (greater than MIN_BLOCK_SIZE) will be poisoned with 0xAA
bytes which will then be verified when memory is subsequently allocated.
This can help in spotting heap corruption, particularly use-after-free.
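The poison-and-verify idea can be sketched as follows. This is a simplified illustration, not the xmalloc code: poison_block() and poison_intact() are hypothetical helpers, and only the 0xAA fill value is taken from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0xAA

/* Fill a block with the poison pattern when it is freed. */
static void poison_block(void *buf, size_t len)
{
    memset(buf, POISON_BYTE, len);
}

/* On allocation, check that nobody wrote to the block while it was free. */
static bool poison_intact(const void *buf, size_t len)
{
    const unsigned char *p = buf;

    for (size_t i = 0; i < len; i++) {
        if (p[i] != POISON_BYTE)
            return false;
    }
    return true;
}
```

A write through a stale pointer changes at least one poisoned byte, so the check at the next allocation flags the corruption.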
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Mon, 8 Jul 2019 08:30:39 +0000 (10:30 +0200)]
xmalloc: remove struct xmem_pool init_region
This patch dispenses with the init_region. It's simply not necessary
(pools will still happily grow and shrink on demand in its absence) and the
code can be shortened by removing it. It also avoids the sole evaluation
of ADD_REGION without holding the pool lock (which is unsafe).
NOTE: The if statement that is removed from xmem_pool_destroy() has actually
been bogus since commit 6009f4dd "Transcendent memory ("tmem") for
Xen." when the allocation of the init_region was moved out of
xmem_pool_create().
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Suggested-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 5 Jul 2019 08:42:36 +0000 (10:42 +0200)]
x86emul: complete support of AVX512_VBMI insns
Also add testing of the ones for which support was added before. Sadly gcc's
command line option naming is not in line with Intel's naming of the
feature, which makes it necessary to mis-name things in the test harness.
Since the only new insn here and in particular its memory access pattern
follows the usual scheme, I didn't think it was necessary to add a
contrived test specifically for it, beyond the Disp8 scaling one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 5 Jul 2019 08:41:44 +0000 (10:41 +0200)]
x86emul: support AVX512CD insns
Since the insns here and in particular their memory access patterns
follow the usual scheme I didn't think it was necessary to add
contrived tests specifically for them, beyond the Disp8 scaling ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 5 Jul 2019 08:40:55 +0000 (10:40 +0200)]
x86emul: support AVX512PF insns
Some adjustments are necessary to the EVEX Disp8 scaling test code to
account for the zero byte reads/writes, which get issued for the test
harness only.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 5 Jul 2019 08:40:10 +0000 (10:40 +0200)]
x86emul: support AVX512F scatter insns
This completes support of AVX512F in the insn emulator.
Note that in the test harness there's a little bit of trickery needed to
get around the not fully consistent naming of AVX512VL gather and
scatter compiler built-ins. To suppress expansion of the "di" and "si"
tokens they get constructed by token concatenation in BS(), which is
different from BG().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 5 Jul 2019 08:39:41 +0000 (10:39 +0200)]
x86emul: add high register S/G test cases
In order to verify that in particular the index register decoding works
correctly in the S/G emulation paths, add dedicated (64-bit only) cases
preventing the compiler from using the lower registers. Other than in the
generic SIMD case, where occasional uses of %xmm or %ymm registers in
generated code cause various internal compiler errors when disallowing
use of all of the lower 16 registers (apparently due to insn templates
trying to use AVX2 encodings), doing so here in the AVX512F case looks
to be fine.
While the main goal here is the AVX512F case, add an AVX2 variant as
well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Jul 2019 15:43:57 +0000 (17:43 +0200)]
x86emul: test harness adjustments for AVX512F S/G insns
There was an encoding mistake in the EVEX Disp8 test code, which was
benign (due to %rdx getting set to zero) to all non-vSIB tests as it
mistakenly encoded <disp8>(%rdx,%rdx) instead of <disp8>(%rdx,%riz). In
the vSIB case this meant <disp8>(%rdx,%zmm2) instead of the intended
<disp8>(%rdx,%zmm4).
Likewise the access count check wasn't entirely correct for the S/G
case: In the quad-word-index but dword-data case only half the number
of full vector elements get accessed.
As an unrelated change in the main test harness source file distinguish
the "n/a" messages by bitness.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Jul 2019 15:43:09 +0000 (17:43 +0200)]
x86emul: prepare for AVX512F S/G insns
They require getting modrm_reg and sib_index set correctly in the EVEX
case, to account for the high 16 [XYZ]MM registers when used as
addressing index register. Extend the adjustments to modrm_rm as well,
such that x86_insn_modrm() would correctly report register numbers (this
was a latent issue only as we don't currently have callers of that
function which would care about an EVEX case).
The adjustment in turn requires dropping the assertion from decode_gpr()
as well as re-introducing the explicit masking, as we now need to
actively mask off the high bit when a GPR is meant.
_decode_gpr() invocations also need slight adjustments, when invoked in
generic code ahead of the main switch(). All other uses of modrm_reg and
modrm_rm already get suitably masked where necessary.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Jul 2019 15:32:53 +0000 (17:32 +0200)]
x86/IRQ: simplify and rename pirq_acktype()
Its only caller already has the IRQ descriptor in its hands, so there's
no need for the function to re-obtain it. As a result the leading p of
its name is no longer appropriate and hence gets dropped.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Jul 2019 14:07:01 +0000 (16:07 +0200)]
x86/vPIC: avoid speculative out of bounds accesses
Array indexes used in the I/O port read/write emulation functions are
derived from guest controlled values. Where this is not already done,
restrict their ranges to limit the side effects of speculative execution.
This is part of the speculative hardening effort.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Jul 2019 14:06:27 +0000 (16:06 +0200)]
x86/vMSI: avoid speculative out of bounds accesses
Array indexes used in the MMIO read/write emulation functions are
derived from guest controlled values. Restrict their ranges to limit the
side effects of speculative execution.
Note that the index into .msi_ad[] may also be speculatively out of
bounds, by exactly one (indexes 0...3 are possible while the array has
just 3 elements). This is not a problem with the current data layout, as
such overrun of the array would either touch the next element of the
parent array or (for the last entry of the parent array) access the
subsequent acc_valid bit array.
This is part of the speculative hardening effort.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 4 Jul 2019 14:05:18 +0000 (16:05 +0200)]
x86emul: avoid speculative out of bounds accesses
There are a few array accesses here the indexes of which are (at least
indirectly) driven by the guest. Use array_access_nospec() to bound
such accesses. In the {,_}decode_gpr() cases replace existing guarding
constructs.
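The general idea behind bounding an index without a conditional branch can be sketched as below. This is a generic branchless mask, not the implementation behind array_access_nospec(); the helper names are hypothetical. It assumes two's complement conversion to long and an arithmetic right shift of negative values, as provided by gcc and clang, and a compiler remains free to transform it.

```c
#include <limits.h>
#include <stddef.h>

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Assumes size is no larger than LONG_MAX.
 */
static unsigned long index_mask_sketch(unsigned long index,
                                       unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1 - index)) >>
                           (sizeof(long) * CHAR_BIT - 1));
}

/* In-range indexes pass through unchanged; others are forced to 0. */
static unsigned long clamp_index_sketch(unsigned long index,
                                        unsigned long size)
{
    return index & index_mask_sketch(index, size);
}
```

Because the clamp is a data dependency rather than a branch, a mispredicted bounds check cannot feed an out-of-range index into the subsequent array access.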
To deal with an otherwise occurring #include cycle, drop the inclusion
of asm/x86_emulate.h from asm/processor.h. This include had been
introduced for obtaining the struct cpuid_leaf declaration, which has
since moved into the x86 helper library.
This is part of the speculative hardening effort.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Anthony PERARD [Thu, 4 Jul 2019 14:04:33 +0000 (16:04 +0200)]
MAINTAINERS: add Anthony as libxl maintainer
Create a new section with only libxl.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Thu, 4 Jul 2019 14:03:47 +0000 (16:03 +0200)]
xmalloc: stop using a magic '1' in alignment padding
Alignment padding inserts a pseudo block header in front of the allocation,
sets its size field to the pad size and then ORs in 1, which is equivalent
to marking it as a free block, so that xfree() can distinguish it from a
real block header.
This patch simply replaces the magic '1' with the defined 'FREE_BLOCK' to
make it more obvious what's going on. Also, whilst in the neighbourhood,
it removes a stray space after a cast.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86: make loading of GDT at context switch more modular
In preparation for core scheduling, carve out the GDT related
functionality (writing GDT related PTEs, loading default or full GDT)
into sub-functions.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 2 Jul 2019 11:27:39 +0000 (13:27 +0200)]
AMD/IOMMU: restrict feature logging
The common case is all IOMMUs having the same features. Log them only
for the first IOMMU, or for any that have a differing feature set.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Thus the loop termination condition is dereferencing a struct pointer that
is being incremented by the loop.
A block of MSI entries stores the number of vectors in entry[0].msi.nvec,
with all subsequent entries using a value of 0. Therefore, a block of
two or more MSIs will terminate the loop early, as entry[1].msi.nvec is 0.
However, for a single MSI, ++entry moves the pointer out of bounds, and a
bogus read is used for the termination condition. In the case that the
loop body gets entered, there are subsequent OoB writes which clobber
adjacent memory in the heap.
This patch simply initializes a stack variable to the value of
entry->msi.nvec before starting the loop and then uses that in the
termination condition instead.
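The shape of the fix can be sketched as follows. The struct and function are hypothetical and heavily simplified; only the idea of reading the vector count once, before the loop, is taken from the description above.

```c
/* Hypothetical, simplified stand-in for a block of MSI entries. */
struct msi_entry_sketch {
    unsigned int nvec;    /* vector count, meaningful in entry[0] only */
    unsigned int masked;
};

/*
 * Mask every entry in the block. The count is read once from entry[0],
 * so the termination condition never dereferences a pointer that has
 * moved past the first entry (or past the end of the block).
 */
static unsigned int mask_msi_block(struct msi_entry_sketch *entry)
{
    const unsigned int nr = entry->nvec;

    for (unsigned int i = 0; i < nr; i++)
        entry[i].masked = 1;

    return nr;
}
```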
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 28 Jun 2019 13:18:21 +0000 (14:18 +0100)]
arm/optee: Fix arm32 build
A Travis randconfig build notices:
optee.c: In function ‘allocate_and_pin_shm_rpc’:
optee.c:383:13: error: format ‘%lx’ expects argument of type
‘long unsigned int’, but argument 5 has type ‘uint64_t’ [-Werror=format=]
gdprintk(XENLOG_WARNING, "Guest tries to use the same RPC SHM cookie %lx\n",
^
Use PRIx64 instead of %lx
Full logs https://travis-ci.org/andyhhp/xen/jobs/551754253
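For reference, a minimal sketch of the portable form. format_cookie() is a hypothetical helper using snprintf() rather than gdprintk(); only the PRIx64 macro from <inttypes.h> is the point here, as it expands to the correct length modifier for uint64_t on both 32-bit and 64-bit builds.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format a 64-bit cookie in hex. "%lx" is only right where long is
 * 64 bits wide; PRIx64 is correct wherever uint64_t exists.
 * Returns snprintf()'s result.
 */
static int format_cookie(char *buf, size_t len, uint64_t cookie)
{
    assert(buf != NULL);
    return snprintf(buf, len, "RPC SHM cookie %" PRIx64, cookie);
}
```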
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Thu, 20 Jun 2019 12:04:14 +0000 (13:04 +0100)]
x86/svm: Drop svm_vm{load,save}() helpers
Following on from c/s 7d161f6537 "x86/svm: Fix svm_vmcb_dump() when used in
current context", there is now only a single user of svm_vmsave() remaining in
the tree, with all other users moved to svm_vm{load,save}_pa().
nv->nv_n1vmcx has a matching nv->nv_n1vmcx_pa which is always correct, and
avoids a redundant __pa() translation behind the scenes.
With this gone, all VM{LOAD,SAVE} operations are using paddr_t's which is more
efficient, so drop the svm_vm{load,save}() helpers to avoid uses of them
reappearing in the future.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Brian Woods <brian.woods@amd.com>
Igor Druzhinin [Thu, 27 Jun 2019 19:41:54 +0000 (20:41 +0100)]
x86/cpuid: leak OSXSAVE only when XSAVE is not clear in policy
This fixes booting of old non-PV-OPS kernels which historically
looked for OSXSAVE instead of XSAVE bit in CPUID to check whether
XSAVE feature is enabled. If such a guest is started on an XSAVE
enabled CPU and the feature is explicitly cleared in policy, the
OSXSAVE bit leaked from Xen will lead to a guest crash early in
boot.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 21 Nov 2018 18:38:41 +0000 (18:38 +0000)]
xen/gnttab: Reduce code volume when using union grant_combo
There is no need for 'struct { ... } shorts' to be named. Convert it to being
an anonymous struct, and rename 'word' to the more common 'raw'.
For _set_status_v1() and gnttab_prepare_for_transfer() which use a bounded
cmpxchg loop, rename {prev,new}_scombo to {prev,new} and reduce their scope to
within the loop.
For _set_status_v2(), the flags and id variables are completely unnecessary.
Drop them.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 21 Nov 2018 18:38:41 +0000 (18:38 +0000)]
xen/gnttab: Reduce complexity when reading grant_entry_header_t
_set_status_v{1,2}() and gnttab_prepare_for_transfer() read the shared header
by always casting to u32. Despite grant_entry_header_t only having an
alignment of 2, this is actually safe because of the grant table ABI itself.
Switch to using an explicit uint32_t *, which removes all subsequent casting.
Furthermore, switch to using ACCESS_ONCE() for the read. There is nothing in
_set_status_v1() and gnttab_prepare_for_transfer() which prevents the
compiler from issuing multiple memory reads and creating a TOCTOU race around
the sanity checks, although the worst that can happen is Xen stamping a status
flag over a bad grant entry if the guest is misbehaving.
_set_status_v2() does use barrier() to try to avoid multiple reads, but this is
overkill. All that matters is that the shared header gets read in one go, and
this allows the compiler more room to optimise.
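The single-read pattern can be sketched as follows. This is an illustration under assumptions: the ACCESS_ONCE() definition shown is the common volatile-cast idiom (it relies on the GCC/Clang __typeof__ extension) and may differ from Xen's own, while union combo and check_entry() are hypothetical simplifications of the grant header handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common volatile-cast idiom; assumed here, not copied from Xen. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical view of a 32-bit shared header: raw word or two halves. */
union combo {
    uint32_t raw;
    struct {
        uint16_t flags;
        uint16_t domid;
    };
};

/*
 * Take one snapshot of the guest-writable header and run all checks on
 * the private copy, so the guest cannot change the value between the
 * check and the use (TOCTOU).
 * Returns 0 and stores the flags on success, -1 on a domid mismatch.
 */
static int check_entry(uint32_t *shared_raw, uint16_t expected_domid,
                       uint16_t *flags_out)
{
    union combo snap;

    assert(shared_raw != NULL && flags_out != NULL);
    snap.raw = ACCESS_ONCE(*shared_raw);   /* exactly one memory read */

    if (snap.domid != expected_domid)
        return -1;

    *flags_out = snap.flags;
    return 0;
}
```

Unlike a full barrier(), the volatile access only constrains this one load, which leaves the compiler free to optimise the surrounding code.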
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Sergey Dyasli [Thu, 29 Mar 2018 15:47:06 +0000 (16:47 +0100)]
x86/vvmx: set CR4 before CR0
Otherwise hvm_set_cr0() will check the wrong CR4 bits (L1 instead of L2
and vice-versa).
Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monne [Thu, 27 Jun 2019 09:33:34 +0000 (11:33 +0200)]
xen/link: handle .init.rodata.cst* sections in the linker script
Note that those sections when not prefixed with .init are already
handled by the more general .rodata.* matching pattern in the .rodata
output section.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Make .init.rodata consistent with .rodata] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Thu, 27 Jun 2019 09:33:33 +0000 (11:33 +0200)]
x86/linker: add a reloc section to ELF linker script
if the hypervisor has been built with EFI support (ie: multiboot2).
This allows the .reloc section to be positioned correctly in the output
binary.
Note that for the ELF output format the .reloc section is moved before
.bss because the data it contains is read-only, so it belongs with the
other sections containing read-only data.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 25 Jun 2019 13:07:23 +0000 (14:07 +0100)]
page-alloc: Rename the first_node local variable
first_node is the name of a local variable, and part of the nodemask API. The
only reason this compiles is because the nodemask API is implemented as a
macro rather than an inline function.
It is confusing to read, and breaks when the nodemask API is cleaned up.
Rename the local variable to just 'first' which is still clear in context.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 26 Jun 2019 16:50:06 +0000 (17:50 +0100)]
xen/Kconfig: Fix -Wformat-security when compiling with Clang
Clang observes:
tools/kconfig/conf.c:77:10:
warning: format string is not a string literal (potentially insecure)
[-Wformat-security]
printf(_("aborted!\n\n"));
^~~~~~~~~~~~~~~~~
And it is absolutely correct. gettext() can easily return a string with a %
in.
This could be fixed by switching to using printf("%s", _(...)), or by
switching to puts() (as there is no formatting going on), but the better
option is to follow Linux and remove localisation support.
The localization support is broken and appears unused.
There are no Google hits on the update-po-config target.
And there is no recent (5 years) activity related to the localization.
So let's just drop this as it is no longer used.
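For illustration, a minimal sketch of the "%s" alternative mentioned above, which the patch does not take since it removes localisation instead. translate() is a hypothetical stand-in for gettext(), deliberately returning a string containing '%', and format_message() writes to a buffer rather than stdout so the result can be inspected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for gettext(): a translation may contain '%'. */
static const char *translate(const char *msgid)
{
    (void)msgid;
    return "100% aborted!\n";
}

/*
 * Passing the translated text as the format string, e.g.
 *     snprintf(buf, len, translate(msgid));
 * would let "% a" be parsed as a conversion with no matching argument.
 * With a literal "%s" format the text is copied verbatim.
 */
static int format_message(char *buf, size_t len, const char *msgid)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%s", translate(msgid));
}
```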
Suggested-by: Ulf Magnusson <ulfalizer@gmail.com> Suggested-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
[Ported to Xen] Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Doug Goldstein <cardoe@cardoe.com>
Jan Beulich [Thu, 27 Jun 2019 10:34:24 +0000 (12:34 +0200)]
AMD/IOMMU: don't "add" IOMMUs
For find_iommu_for_device() to consistently (independent of ACPI tables)
return NULL for the PCI devices corresponding to IOMMUs, make sure
IOMMUs don't get mapped to themselves by ivrs_mappings[].
While amd_iommu_add_device() won't be called for IOMMUs from
pci_add_device(), as IOMMUs have got marked r/o,
_setup_hwdom_pci_devices() calls there nevertheless. Avoid issuing the
bogus debugging-only "No iommu for ...; cannot be handed to ..." log
message as well as the non-debugging "setup ... for ... failed (-19)"
one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>
Jan Beulich [Tue, 25 Jun 2019 15:34:53 +0000 (17:34 +0200)]
drop __get_cpu_var() and __get_cpu_ptr()
this_cpu{,_ptr}() are shorter, and have previously been marked as
preferred in Xen anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 25 Jun 2019 15:32:37 +0000 (17:32 +0200)]
x86/mcheck: allow varying bank counts per CPU
Up to now we've been assuming that all CPUs would have the same number
of reporting banks. However, on upcoming AMD CPUs this isn't the case,
and one can observe
(XEN) mce.c:666: Different bank number on cpu <N>
indicating that Machine Check support would not be enabled on the
affected CPUs. Convert the count variable to a per-CPU one, and adjust
code where needed to cope with the values not being the same. In
particular the mcabanks_alloc() invocations during AP bringup need to
now allocate maximum-size bitmaps, because the truly needed size can't
be known until we actually execute on that CPU, yet mcheck_init() gets
called too early to do any allocations itself.
Take the liberty and also
- make mca_cap_init() static,
- replace several __get_cpu_var() uses when a local variable suitable
for use with per_cpu() appears,
- correct which CPU's cpu_data[] entry x86_mc_msrinject_verify() uses,
- replace a BUG() by panic().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 25 Jun 2019 13:39:44 +0000 (15:39 +0200)]
config: don't hardcode toolchain binaries
Currently the names of the build toolchain binaries are hardcoded in
StdGNU.mk, and the values from the environment are ignored.
Switch StdGNU.mk to use '?=' instead of '=', so that values from the
environment are used if present, else default to the values provided
by the config file.
This change fixes the gitlab CI loop, which was relying on passing
custom values in the environment variables for the compiler and the
linker.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
The mask calculation in pdx_init_mask is wrong when the first bank
starts at address 0x0. The reason is that pdx_init_mask will do '0 - 1'
causing an underflow. As a result, the mask becomes 0xffffffffffffffff
which is the biggest possible mask and ends up causing a significant
memory waste in the frametable size computation.
For instance, on platforms that have a low memory bank starting at 0x0
and a high memory bank, the frametable will end up covering all the
holes in between.
The purpose of the mask is to be passed as a parameter to
pfn_pdx_hole_setup, which based on the mask parameter calculates
pfn_pdx_hole_shift, pfn_pdx_bottom_mask, etc. which are actually the
important masks for frametable initialization later on.
pfn_pdx_hole_setup never compresses addresses below MAX_ORDER bits (1GB
on ARM). Thus, it is safe to initialize mask passing 1ULL << (MAX_ORDER
+ PAGE_SHIFT) as start address to pdx_init_mask.
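The underflow can be reproduced with a small sketch. It assumes the mask is derived by subtracting one from the start address and then filling in all lower bits, which is what the description above implies; fill_mask_sketch() and init_mask_sketch() are illustrative names, and the MAX_ORDER and PAGE_SHIFT values are assumptions chosen to match the "1GB on ARM" figure.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumed: 4K pages */
#define MAX_ORDER  18   /* assumed: MAX_ORDER + PAGE_SHIFT = 30 bits, i.e. 1GB */

/* Set every bit below the highest set bit of 'mask'. */
static uint64_t fill_mask_sketch(uint64_t mask)
{
    while (mask & (mask + 1))
        mask |= mask + 1;
    return mask;
}

/*
 * Sketch of the mask initialisation: for base_addr == 0 the subtraction
 * wraps to all ones, which is already a "filled" mask and is returned
 * unchanged.
 */
static uint64_t init_mask_sketch(uint64_t base_addr)
{
    return fill_mask_sketch(base_addr - 1);
}
```

Under these assumptions, init_mask_sketch(0) yields 0xffffffffffffffff, whereas passing 1ULL << (MAX_ORDER + PAGE_SHIFT) as the start address yields 0x3fffffff, covering only the bits that are never compressed anyway.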
xen: Replace u64 with uint64_t in pdx_init_mask() and callers
Xen is phasing out the use of u64 in favor of uint64_t. Therefore, the
instances of u64 in pdx_init_mask() (and its callers) are now
replaced with uint64_t. Take the opportunity to also modify
srat_region_mask as this is used to store the result of pdx_init_mask().
Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: julien.grall@arm.com
Roger Pau Monne [Fri, 21 Jun 2019 16:38:00 +0000 (18:38 +0200)]
x86/linker: use DECL_SECTION uniformly
Replace the two open-coded EFI related section declarations with the
usage of DECL_SECTION. This is a preparatory change for also adding a
reloc section to the ELF binary.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Fri, 21 Jun 2019 15:57:51 +0000 (17:57 +0200)]
viridian: unify time sources
Currently, the time_ref_count enlightened time source maintains an offset
such that time is frozen when the domain is paused, but the reference_tsc
enlightened time source does not. After migrate, the reference_tsc source
may become invalidated (e.g. because of host cpu frequency mismatch) which
will cause Windows to fall back to time_ref_count. Thus, the guest will
observe a jump in time equivalent to the offset.
This patch unifies the two enlightened time sources such that the same
offset applies to both of them. Also, it's not really necessary to have
two different functions for calculating a 10MHz counter value, time_now() and
raw_trc_val(), so this patch removes the latter implementation. The
unification also allows removal of the reference_tsc_valid flag.
Whilst in the area, this patch also takes the opportunity to constify a few
pointers which were missed in earlier patches.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Alexandru Isaila [Fri, 21 Jun 2019 15:21:25 +0000 (17:21 +0200)]
MAINTAINERS: add myself as a designated reviewer to vm_event
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
Jan Beulich [Fri, 21 Jun 2019 15:16:52 +0000 (17:16 +0200)]
AMD/IOMMU: initialize IRQ tasklet only once
Don't do this once per IOMMU, nor after setting up the IOMMU interrupt
(which will want to schedule this tasklet). In fact it can be
initialized at build time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com>