Jan Beulich [Mon, 5 Nov 2018 10:13:59 +0000 (11:13 +0100)]
x86emul: VME and PVI modes require a #GP(0) check first thing
As explicitly spelled out by the SDM, EFLAGS.VIF and EFLAGS.VIP both being
set at the start of an instruction triggers #GP(0), independent of the
actual instruction.
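A minimal, self-contained sketch of the check being described (local constants
and the helper name are illustrative, not the actual x86_emulate() code):

    #include <stdbool.h>
    #include <stdint.h>

    #define X86_EFLAGS_VIF (1u << 19)   /* virtual interrupt flag */
    #define X86_EFLAGS_VIP (1u << 20)   /* virtual interrupt pending */

    /* Returns true if emulation must bail out with #GP(0) before looking
     * at the instruction itself. */
    static bool vme_pvi_gp_check(uint32_t eflags)
    {
        return (eflags & X86_EFLAGS_VIF) && (eflags & X86_EFLAGS_VIP);
    }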
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:13:09 +0000 (11:13 +0100)]
x86: deal with firmware setting bogus TSC_ADJUST values
The system Intel have handed me for AVX512 emulator work ("Gigabyte
Technology Co., Ltd. X299 AORUS Gaming 3 Pro/X299 AORUS Gaming 3
Pro-CF, BIOS F3 12/28/2017") would not come up under Xen - it hung in
the middle of Dom0 PCI initialization. As it turned out, Xen's time
management did not work because of the firmware setting (only) the boot
CPU's TSC_ADJUST MSR to a large negative value (on the order of -2^50).
Follow Linux (also shamelessly stealing their comments) in
- clearing the register for the boot CPU (we don't have a need for
exceptions here yet, as the only exception in Linux is a class of
systems Xen doesn't work on anyway as far as I'm aware),
- forcing non-negative values uniformly (commit 855615eee9 ["x86/tsc:
Remove the TSC_ADJUST clamp"] dropped this, but without this my
Haswell box won't boot anymore),
- syncing the registers within sockets.
Linux, prior to the aforementioned commit, capped the value at 0x7fffffff as
well, but as the description there says, that issue has been addressed with a
microcode update. Hence until someone runs into such a system without being
able to update its microcode, I think we should leave out that specific part.
In order to avoid making init_percpu_time() depend on running _before_
set_cpu_sibling_map() (and hence the booting CPU _not_ being accounted
in socket_cpumask[] yet), move that call slightly earlier in
start_secondary().
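A self-contained sketch of the policy described above (names are illustrative,
not Xen's actual code): clear the value on the boot CPU, never accept negative
values, and make all CPUs of a socket agree on the first sanitized value.

    #include <stdbool.h>
    #include <stdint.h>

    #define TSC_ADJUST_UNSET INT64_MIN  /* sentinel: no socket reference yet */

    /* Returns the value to write back into MSR_IA32_TSC_ADJUST (0x3b). */
    static int64_t sanitize_tsc_adjust(int64_t cur, bool bsp, int64_t *socket_ref)
    {
        int64_t adj = bsp ? 0 : (cur < 0 ? 0 : cur);  /* clear on BSP, clamp elsewhere */

        if ( *socket_ref == TSC_ADJUST_UNSET )
            *socket_ref = adj;          /* first CPU of the socket sets the value */

        return *socket_ref;             /* later CPUs in the socket adopt it */
    }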
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 5 Nov 2018 10:12:39 +0000 (11:12 +0100)]
x86/TSC: don't allow deadline timer to be used with unfixed errata
In preparation for writes to the TSC_ADJUST MSR, avoid the bad interaction
of writes to it and the TSC_DEADLINE one. Presumably the original Linux
commit bd9240a18e ("x86/apic: Add TSC_DEADLINE quirk due to errata") refers
to e.g. KBW092. (Of course this is an issue also without us writing the
TSC_ADJUST MSR, if firmware did so already.)
The errata checking can't be put in init_apic_mappings() as Linux does,
as that runs before we update microcode on the boot CPU. It needs to
happen before consumers of tdt_enabled, i.e.
- __setup_APIC_LVTT() <- setup_APIC_timer() <- setup_boot_APIC_clock()
- __setup_APIC_LVTT() <- calibrate_APIC_clock() <- setup_boot_APIC_clock()
- setup_boot_APIC_clock()
setup_boot_APIC_clock() gets called from smp_prepare_cpus(), which sits
after microcode loading (note that calibrate_APIC_clock() gets called
before setting tdt_enabled).
Also add an MFENCE as per Linux commit 5d7c631d92 ("x86/apic: Serialize
LVTT and TSC_DEADLINE writes"), but I see no reason to put a conditional
around it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Mon, 5 Nov 2018 10:11:39 +0000 (11:11 +0100)]
viridian: remove duplicate union types
The 'viridian_vp_assist', 'viridian_hypercall_gpa' and
'viridian_reference_tsc' union types are identical in layout. The layout
is also common throughout the specification [1].
This patch declares a common 'viridian_page_msr' type and converts the rest
of the code to use that type for both the hypercall and VP assist pages.
Also, rename 'viridian_guest_os_id' to 'viridian_guest_os_id_msr' since it
also is a union representing an MSR value.
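A sketch of the shared layout being consolidated (field names and widths
follow the TLFS description of these page MSRs; this is not necessarily Xen's
exact declaration):

    #include <stdint.h>

    union viridian_page_msr
    {
        uint64_t raw;
        struct
        {
            uint64_t enabled:1;               /* bit 0: page enabled */
            uint64_t reserved_preserved:11;   /* bits 1-11: preserve on write */
            uint64_t pfn:52;                  /* bits 12-63: guest page number */
        } fields;
    };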
Paul Durrant [Mon, 5 Nov 2018 10:10:55 +0000 (11:10 +0100)]
viridian: remove comments referencing section number in the spec
Microsoft has a habit of re-numbering sections in the spec., so avoid
referring to section numbers in comments. Also remove the URL for the
spec. from the boilerplate... Again, Microsoft has a habit of changing
these too.
This patch also cleans up some > 80 character lines.
Purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monne <roger.pau@citrix.com>
Wei Liu [Fri, 2 Nov 2018 12:34:12 +0000 (12:34 +0000)]
libxl/arm: fix guest type conversion
Commit 359970fd8b ("tools/libxl: Switch Arm guest type to PVH") missed
changing the type field in c_info. This issue didn't surface until ef72c93df9 which made creating PV guest on Arm unusable.
Create libxl__arch_domain_create_info_setdefault and switch the type
there.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
The may_defer var was left with the older bool_t type. This patch
changes the type to bool.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Fri, 2 Nov 2018 11:15:33 +0000 (12:15 +0100)]
VMX: fix vmx_handle_eoi()
In commit 303066fdb1e ("VMX: fix interaction of APIC-V and Viridian
emulation") I screwed up: Instead of clearing SVI, other ISR bits
should be taken into account.
Introduce a new helper set_svi(), split out of vmx_process_isr(), and
use it also from vmx_handle_eoi().
Following the problems in vmx_intr_assist() (see the still-present big
block of debugging code there), also warn (once) if the EOI'd vector and
the original SVI don't match.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Commit 81946a73dc975a7dafe9017a8e61d1e64fdbedbf removed
Xenctrl.with_intf based on its undesirable behaviour of opening and
closing a Xenctrl connection with every invocation. This commit
re-introduces with_intf but with an updated behaviour: it maintains a
global Xenctrl connection which is opened upon first usage and kept
open. This handle can be obtained by clients using new functions
get_handle() and close_handle().
The main motivation for re-introducing with_intf is that clients would
otherwise have to implement this functionality individually.
Signed-off-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools/misc/xenpm: fix getting info when some CPUs are offline
Use physinfo.max_cpu_id instead of physinfo.nr_cpus to get max CPU id.
This fixes for example 'xenpm get-cpufreq-para' with smt=off, which
otherwise would miss half of the cores.
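Illustration of the indexing change (simplified stand-in structure, not
xenpm's actual code): with smt=off, CPU IDs become sparse, so iterating up to
nr_cpus skips the higher-numbered ones.

    #include <stdio.h>

    struct physinfo { unsigned int nr_cpus, max_cpu_id; };  /* simplified stand-in */

    static void for_each_possible_cpu(const struct physinfo *pi)
    {
        /* Iterate by ID up to max_cpu_id, not by count (nr_cpus). */
        for ( unsigned int cpu = 0; cpu <= pi->max_cpu_id; cpu++ )
            printf("querying cpufreq parameters for CPU %u\n", cpu);
    }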
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Since udev is no longer used to call hotplug scripts (neither in dom0 nor
in a driver domain), this script is no longer referenced anywhere. libxl
(xl devd or otherwise) has its own cleanup code.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 31 Oct 2018 16:57:19 +0000 (17:57 +0100)]
use consistent values when consuming runtime-changeable parameters
There's no guarantee that e.g. a switch() control expression's memory
operand(s) get(s) read just once. Guard against the compiler producing
"unexpected" code by sprinkling around some ACCESS_ONCE().
I'm leaving alone opt_conswitch[]: It gets accessed in quite a few
places anyway, and an intermediate change won't have any severe effect
afaict.
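Minimal illustration of the pattern (the ACCESS_ONCE() definition matches the
classic Linux/Xen one; opt_example is a hypothetical parameter): snapshot the
value once, so a switch() cannot observe two different reads of it.

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    extern int opt_example;                  /* hypothetical runtime-changeable parameter */

    static void consume_param(void)
    {
        switch ( ACCESS_ONCE(opt_example) )  /* single read, then dispatch */
        {
        case 0:
            /* ... */
            break;
        default:
            /* ... */
            break;
        }
    }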
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen/arm: make platform specific code dependent on ALL32_PLAT or ALL64_PLAT
Compile platform code that doesn't have its own specific Kconfig option
only if ALL32_PLAT or ALL64_PLAT is selected, depending on the architecture.
The benefit is that choosing one of the platforms available as a menu
option lets the user avoid building other, unnecessary platform code.
Andrew Cooper [Tue, 30 Oct 2018 11:17:00 +0000 (11:17 +0000)]
x86/pv: Fix crash when using `xl set-parameter pcid=...`
"pcid=" is registered as a runtime parameter, which means that parse_pcid()
must not reside in .init, or the following happens when parse_params() tries
to call an unmapped function pointer.
Andrew Cooper [Thu, 25 Oct 2018 13:11:58 +0000 (14:11 +0100)]
x86/vvmx: Don't handle unknown nested vmexit reasons at L0
This is very dangerous from a security point of view, because a missing entry
will cause L2's action to be interpreted as L1's action.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 25 Oct 2018 14:17:50 +0000 (15:17 +0100)]
x86/vvmx: Drop the now-obsolete vmx_inst_check_privilege()
Now that nvmx_handle_vmx_insn() performs all VT-x instruction checks, there is
no need for redundant checking in vmx_inst_check_privilege(). Remove it, and
take out the vmxon_check boolean which was plumbed through decode_vmx_inst().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 25 Oct 2018 13:40:11 +0000 (14:40 +0100)]
x86/vvmx: Unconditionally initialise vmxon_region_pa during vcpu construction
This is a stopgap solution until the toolstack side of initialisation can be
sorted out, but it does result in the nvmx_vcpu_in_vmx() predicate working
correctly even when nested virt hasn't been enabled for the domain.
Update nvmx_handle_vmx_insn() to include the in-vmx mode check (for all
instructions other than VMXON) to complete the set of #UD checks.
In addition, sanity check that the nested vmexit handler has worked correctly,
and that we are only providing emulation of the VT-x instructions to L1
guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 25 Oct 2018 13:08:33 +0000 (14:08 +0100)]
x86/vvmx: Let L1 handle all the unconditional vmexit instructions
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 28 May 2018 14:22:49 +0000 (15:22 +0100)]
x86: Reorganise and rename debug register fields in struct vcpu
Reusing debugreg[5] for the PV emulated IO breakpoint information is confusing
to read. Instead, introduce a dr7_emul field in pv_vcpu for the purpose.
With the PV emulation out of the way, debugreg[4,5] are entirely unused and
don't need to be stored.
Rename debugreg[0..3] to dr[0..3] to reduce code volume, but keep them as an
array because their behaviour is identical and this helps simplify some of the
PV handling. Introduce dr6 and dr7 fields to replace debugreg[6,7] which
removes the storage for debugreg[4,5].
In arch_get_info_guest(), handle the merging of emulated dr7 state alongside
all other dr handling, rather than much later.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 29 Oct 2018 11:29:54 +0000 (11:29 +0000)]
x86/domain: Fix build with GCC 4.3.x
GCC 4.3.x can't initialise the user_regs structure like this.
Reported-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
arm,smmu: backport "Disable stalling faults for all endpoints"
Backport commit 3714ce1d6655098ee69ede632883e5874d67e4ab
"iommu/arm-smmu: Disable stalling faults for all endpoints" from the
Linux kernel. This works around Erratum #842869.
Original commit message:
Enabling stalling faults can result in hardware deadlock on poorly
designed systems, particularly those with a PCI root complex upstream of
the SMMU.
Although it's not really Linux's job to save hardware integrators from
their own misfortune, it *is* our job to stop userspace (e.g. VFIO
clients) from hosing the system for everybody else, even if they might
already be required to have elevated privileges.
Given that the fault handling code currently executes entirely in IRQ
context, there is nothing that can sensibly be done to recover from
things like page faults anyway, so let's rip this code out for now and
avoid the potential for deadlock.
Cc: <stable@vger.kernel.org> Fixes: 48ec83bcbcf5 ("iommu/arm-smmu: Add initial driver support for ARM SMMUv3 devices") Reported-by: Matt Evans <matt.evans@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com> Acked-by: Julien Grall <julien.grall@arm.com>
George Dunlap [Mon, 29 Oct 2018 14:51:51 +0000 (14:51 +0000)]
Make credit2 the default scheduler
Credit2 was declared "supported" in 4.8, and as of 4.10 had two other
critical features implemented (soft affinity / NUMA and caps).
Why change the default?
The code is better: more predictable, less jitter, easier to determine
how modifications will affect overall behavior, easier in the future
to make load-balancing behavior more subtle (e.g., taking into account
the cost of powering up extra cores, &c).
Overall performance compared to Credit1 is somewhat of a mixed bag.
Unfortunately most of what I have are tests using XenServer's internal
perf testing system, so I can't share the raw data (via links anyway).
Here is a summary of data from an internal e-mail Dario sent in the
past:
* DVDbench: On underloaded systems, credit2 outperformed credit1 by
about 4%. On overloaded systems, credit2 underperformed by about 3%.
* On a range of tests (unixbench, lmbench, &c), credit and credit2
perform within 5% of each other (up and down).
* Credit2 fairly consistently beats credit for TCP-style workloads.
* Credit2 is sometimes equal to, sometimes 5-15% worse than, credit for
synthetic CPU workloads (e.g., Dhrystone).
* On LoginVSI, credit2 fairly consistently outperforms credit by about 10%.
Credit2, like credit, has a number of workloads / setups for which
performance could be improved. Personally I think networking and
partially-loaded systems are going to be more representative of what
Xen is actually used for, so I think credit2 is on the whole the
better scheduler to use by default. And in any case, making those
improvements on credit2 will be easier than on credit.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
Jan Beulich [Mon, 29 Oct 2018 12:40:56 +0000 (13:40 +0100)]
x86emul: generalize vector length handling for AVX512/EVEX
To allow for some code sharing where possible, copy VEX.L into EVEX.LR
even for VEX (or XOP) encoded insns. Make operand size determination
use this right away, at the same time adding consistency checks for the
EVEX scalar insn cases (the non-scalar ones aren't uniform enough for
the checking to be done in a central place like this).
Note that the broadcast case is not handled here, but will be taken care
of elsewhere (in just a single place rather than at least two).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 19 Oct 2018 14:28:28 +0000 (15:28 +0100)]
x86: put some code in arch_set_info_guest under CONFIG_PV
This function is called by both PV and HVM. Unfortunately the code is
very convoluted. We can reason that code between the call to
hvm_set_info_guest and out label is PV only. Put that portion under
CONFIG_PV.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Fri, 19 Oct 2018 14:28:27 +0000 (15:28 +0100)]
x86: make mm.c build with !CONFIG_PV
Start by putting hypercall handlers which are supposed to be PV only
under CONFIG_PV. Shuffle some code around to avoid introducing
excessive numbers of CONFIG_PV.
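The pattern boils down to guarding PV-only handlers roughly like this
(do_mmu_update is one such PV hypercall handler in mm.c; the body and real
argument list are elided here, so this is a sketch rather than the actual
change):

    /* Sketch only: PV-only hypercall handlers are compiled out when
     * CONFIG_PV is not set, so mm.c still builds for HVM-only configs. */
    #ifdef CONFIG_PV
    long do_mmu_update(void) /* real signature's arguments elided */
    {
        /* PV page-table update handling lives here. */
        return 0;
    }
    #endif /* CONFIG_PV */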
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 26 Oct 2018 15:50:01 +0000 (17:50 +0200)]
x86emul: correct EVEX decoding
Fix an inverted pair of checks, drop an incorrect instance of #UD
raising for non-64-bit mode, and add further generic checks.
Note: Despite what SDM Vol 2 rev 067 states, EVEX.V' is _not_ ignored
outside of 64-bit mode when the field does not encode a register.
Just like EVEX.VVVV is required to be 0b1111 in that case, EVEX.V'
is required to be 1 there.
Also rename the bcst field to br, as #UD generation for individual insns
will need to consider both of its possible meanings.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:21:20 +0000 (15:21 +0200)]
x86emul/test: introduce eq()
In preparation for sensible to-boolean conversion on AVX512, wrap
another abstraction function around the present to_bool(<x> == <y>), to
get rid of the open-coded == (which will get in the way of using
built-in functions instead). For the future AVX512 use, scalar operands
can't be used anymore: use (vec_t){} when the operand is zero,
and broadcast (if available) otherwise (assume pre-AVX512 when broadcast
is not available, in which case a plain scalar is still fine).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:20:37 +0000 (15:20 +0200)]
x86emul: support AVX512 opmask insns
These are all VEX encoded, so the EVEX decoding logic continues to
remain unused at this point.
The new testcase is deliberately coded in assembly, as a C one would
have become almost unreadable due to the overwhelming amount of
__builtin_...() that would need to be used. After all the compiler has
no underlying type (yet) that could be operated on without builtins,
other than the vector types used for "normal" SIMD insns.
Note that outside of 64-bit mode and despite the SDM not currently
saying so, VEX.W is ignored for the KMOV{D,Q} encodings to/from GPRs,
just like e.g. for the similar VMOV{D,Q}.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:18:52 +0000 (15:18 +0200)]
x86: restrict HVMOP_pagetable_dying to current
This is not used (and probably was never meant to be) by the tool stack.
Limiting it to the current domain in particular allows eliminating a
bogus use of vCPU 0 in pagetable_dying().
Remove the now unnecessary domain/vCPU parameters from the wrapper/hook
functions at the same time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 26 Oct 2018 13:16:23 +0000 (15:16 +0200)]
x86: don't build guest-walk code without HVM and SHADOW_PAGING
It's dead code in that case.
We could go further, as we don't really need the 2- and 3-level walk
code in PV mode, but to drop their compilation requires quite a bit of
disentangling of shadow mode code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Wed, 10 Oct 2018 09:17:15 +0000 (09:17 +0000)]
x86/vvmx: Disallow the use of VT-x instructions when nested virt is disabled
c/s ac6a4500b "vvmx: set vmxon_region_pa of vcpu out of VMX operation to an
invalid address" was a real bugfix as described, but has a very subtle bug
which results in all VT-x instructions being usable by a guest.
The toolstack constructs a guest by issuing:
XEN_DOMCTL_createdomain
XEN_DOMCTL_max_vcpus
and optionally later, HVMOP_set_param to enable nested virt.
As a result, the call to nvmx_vcpu_initialise() in hvm_vcpu_initialise()
(which is what makes the above patch look correct during review) is actually
dead code. In practice, nvmx_vcpu_initialise() first gets called when nested
virt is enabled, which is typically never.
As a result, the zeroed memory of struct vcpu causes nvmx_vcpu_in_vmx() to
return true before nested virt is enabled for the guest.
Fixing the order of initialisation is a work in progress for other reasons,
but not viable for security backports.
A compounding factor is that the vmexit handlers for all instructions, other
than VMXON, pass 0 into vmx_inst_check_privilege()'s vmxop_check parameter,
which skips the CR4.VMXE check. (This is one of many reasons why nested virt
isn't a supported feature yet.)
However, the overall result is that when nested virt is not enabled by the
toolstack (i.e. the default configuration for all production guests), the VT-x
instructions (other than VMXON) are actually usable, and Xen very quickly
falls over the fact that the nvmx structure is uninitialised.
In order to fail safe in the supported case, re-implement all the VT-x
instruction handling using a single function with a common prologue, covering
all the checks which should cause #UD or #GP faults. This deliberately
doesn't use any state from the nvmx structure, in case there are other lurking
issues.
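A simplified, self-contained sketch of the idea (the helper name, parameters
and return codes here are made up; the real function also decodes and
dispatches the individual instruction): one common prologue performs the
#UD/#GP checks for every VT-x instruction intercept, without touching nvmx
state.

    #include <stdbool.h>

    /* Return values: 0 = proceed, -1 = inject #UD, -2 = inject #GP(0). */
    static int vmx_insn_common_checks(bool nested_virt_enabled, bool cr4_vmxe,
                                      bool in_vmx_operation, bool is_vmxon,
                                      unsigned int cpl)
    {
        /* #UD: nested virt disabled for the domain, CR4.VMXE clear, or not
         * in VMX operation (VMXON itself being the only exception). */
        if ( !nested_virt_enabled || !cr4_vmxe ||
             (!in_vmx_operation && !is_vmxon) )
            return -1;

        /* #GP(0): VT-x instructions are ring-0 only. */
        if ( cpl != 0 )
            return -2;

        return 0;
    }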
This is XSA-278
Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com>
In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.
Introduce arch_vcpu_regs_init() to set various architectural defaults, and
reuse this in the hvm_vcpu_reset_state() path.
Architecturally, %edx's init state contains the processor's model
information, and 0xf looks to be a remnant of old Intel processors. We
clearly have no software which cares, seeing as it is wrong for the last
decade's worth of Intel hardware and for all other vendors, so let's use
the value 0 for simplicity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 28 May 2018 14:18:17 +0000 (15:18 +0100)]
x86/boot: Initialise the debug registers correctly
In particular, initialising %dr6 with the value 0 is buggy, because on
hardware supporting Transactional Memory, it will cause the sticky RTM bit to
be asserted, even though a debug exception from a transaction hasn't actually
been observed.
Move X86_DR6_DEFAULT into x86-defns.h along with the other architectural
register constants, and introduce a new X86_DR7_DEFAULT. Use the existing
write_debugreg() helper, rather than opencoded inline assembly.
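For reference, a sketch of the defaults involved (the values are the
architectural reset values; write_debugreg() stands in for the existing Xen
helper mentioned above, so this fragment is illustrative rather than the
actual patch):

    #define X86_DR6_DEFAULT 0xffff0ff0ul   /* bit 16 (RTM) set => no TSX #DB */
    #define X86_DR7_DEFAULT 0x00000400ul   /* only the always-set bit 10 */

    static void init_debug_registers(void)
    {
        write_debugreg(6, X86_DR6_DEFAULT);   /* existing Xen helper, per the text */
        write_debugreg(7, X86_DR7_DEFAULT);
    }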
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
The extra 0. is harmless but ugly. We should be somewhat consistent.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Fri, 19 Oct 2018 14:28:36 +0000 (15:28 +0100)]
x86: don't setup legacy syscall vector when !CONFIG_PV
The code snippet is to switch between SYS_DESC_trap_gate and
SYS_DESC_irq_gate depending on whether XPTI is used. When PV is
disabled there is no need to switch.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Tue, 9 Oct 2018 08:25:48 +0000 (09:25 +0100)]
x86/hvm/ioreq: allow ioreq servers to use HVM_PARAM_[BUF]IOREQ_PFN
Since commit 2c257bd6 "x86/hvm: remove default ioreq server (again)" the
GFNs allocated by the toolstack and set in HVM_PARAM_IOREQ_PFN and
HVM_PARAM_BUFIOREQ_PFN have been unused. This patch allows them to be used
by (non-default) ioreq servers.
While in the area, also make sure HVM_PARAM_[BUF]IOREQ_PFN can only be set
once. These parameters should have always been in the 'set once' category
but this has, so far, not been enforced.
NOTE: This fixes a compatibility issue. A guest created on a version of
Xen that pre-dates the initial ioreq server implementation and then
migrated in will currently fail to resume because its migration
stream will lack values for HVM_PARAM_IOREQ_SERVER_PFN and
HVM_PARAM_NR_IOREQ_SERVER_PAGES *unless* the system has an
emulator domain that uses direct resource mapping (which depends
on the version of privcmd it happens to have) in which case it
will not require use of GFNs for the ioreq server shared
pages.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 5 Oct 2018 17:02:15 +0000 (17:02 +0000)]
x86/svm: Remove the pdpe fields from struct vmcb
These fields have existed since the SVM code was first introduced.
The earliest reference I can find is c/s d1bd157fbc9, which is unfortunately
a rebase & squash of a separate dev tree. Looking at the commit message, I'm
guessing it was introduced by:
> user: twoller@xen-trw1.site
> date: Tue Dec 13 19:49:53 2005 -0500
> files: ... xen/include/asm-x86/svm_vmcb.h ...
> description:
> Add SVM base files to repository.
Anyway, the AMD SDM has no mention of PDPE fields in the VMCB and marks this
part of the VMCB as reserved. The manual does explicitly say that 32bit PAE
paging may read the PDPE fields from memory rather than from the CPU
registers. Chances are very good that this is a vestigial remnant of an
early design.
Xen doesn't use the fields at all, except to copy them on virtual
vmentry/vmexit.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Thu, 4 Oct 2018 16:36:35 +0000 (16:36 +0000)]
x86/svm: Fix svm_update_guest_efer() for domains using shadow paging
When using shadow paging, EFER.NX is a Xen controlled bit, and is required by
the shadow pagefault handler to distinguish instruction fetches from data
accesses.
This can be observed by a guest which has NX and SMEP clear but SMAP active by
attempting to execute code on a user mapping. The first attempt to build the
target shadow will #PF so is handled by the shadow code, but when walking
the guest pagetables, the lack of PFEC_insn_fetch being signalled causes the
shadow code to mistake the instruction fetch for a data fetch, and believe
that it is a real guest fault. As a result, the guest receives #PF[-d-srP]
for an action which should complete successfully.
The suspicious-looking gymnastics with LME is actually a subtle corner case
with shadow paging. When dropping out of Long Mode, a guest's choice of LME
and Xen's choice of CR0.PG cause hardware to operate in Long Mode, but the
shadow code to operate in 2-on-3 mode.
In addition to describing this corner case in the SVM side, extend the comment
for the same fix on the VT-x side. (I have a suspicion that I've just worked
out why VT-x doesn't tolerate LMA != LME when Unrestricted Guest is clear.)
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Alexander Schulz [Wed, 17 Oct 2018 16:29:03 +0000 (17:29 +0100)]
Reservation of PCI device range 0xc200-0xc2ff to XCP-ng Project
We are the XCP-ng project (https://xcp-ng.org) and want to distribute our
own PV-Tools (maybe also via Windows Update), so we need an extra range.
We also registered a PCI-Device:
"XCP-ng Project PCI Device for Windows Update" ->
https://pci-ids.ucw.cz/read/PC/5853/c200
Signed-off-by: Alexander Schulz <code@schulzalex.de> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
George Dunlap [Thu, 27 Sep 2018 11:25:36 +0000 (12:25 +0100)]
mem_access: Fix npfec.kind propagation
The name of the "with_gla" flag is confusing; it has nothing to do
with the existence or lack thereof of a faulting GLA, but rather where
the fault originated. The npfec.kind value is always valid, and
should thus be propagated, regardless of whether gla_valid is set or
not.
In particular, gla_valid will never be set on AMD systems; but
npfec.kind will still be valid and should still be propagated.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
x86/altp2m: Add a subop for obtaining the mem access of a page
Currently there is a subop for setting the memaccess of a page, but not
for consulting it. The new HVMOP_altp2m_get_mem_access adds this
functionality.
Both altp2m get/set mem access functions use the struct
xen_hvm_altp2m_mem_access which has now dropped the `set' part and has
been renamed from xen_hvm_altp2m_set_mem_access.
Signed-off-by: Adrian Pop <apop@bitdefender.com> Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Ian Jackson [Tue, 9 Oct 2018 16:02:34 +0000 (17:02 +0100)]
tools/libfsimage: Bump soname to 4.12
This library does not have a stable ABI promise. As far as we know it
is used only by pygrub. Bump its soname to the Xen version (and
intend to change it each time).
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Thu, 4 Oct 2018 11:32:00 +0000 (12:32 +0100)]
pygrub fsimage.so: Honour LDFLAGS when building
This seems to have been simply omitted. Obviously this is needed when
building and not just when installing. Passing LDFLAGS only when installing
is ineffective.
Signed-off-by: Ian Jackson <ian.jackson@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Xin Li [Tue, 9 Oct 2018 09:33:19 +0000 (17:33 +0800)]
xen/xsm: Introduce new boot parameter xsm
Introduce a new boot parameter, xsm, to choose which XSM module is enabled,
with the default set to dummy. Also add a new Kconfig option to choose the
default XSM implementation.
Signed-off-by: Xin Li <xin.li@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Wed, 26 Sep 2018 13:44:07 +0000 (14:44 +0100)]
amd-iommu: use correct constants in amd_iommu_get_next_table_from_pte()
...and change the name to amd_iommu_get_address_from_pte() since the
address read is not necessarily the address of a next level page table.
(If the 'next level' field is not 1 - 6 then the address is a page
address).
The constants in use prior to this patch relate to device table entries
rather than page table entries. Although they do have the same value, it
makes the code confusing to read.
This patch also changes the PDE/PTE pointer argument to void *, and
removes any u32/uint32_t casts in the call sites. Unnecessary casts
surrounding call sites are also removed.
No functional change.
NOTE: The patch also adds emacs boilerplate to iommu_map.c
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Brian Woods <brian.woods@amd.com>
Roger Pau Monne [Wed, 10 Oct 2018 14:39:35 +0000 (16:39 +0200)]
tools/pvh: set coherent MTRR state for all vCPUs
Instead of just doing it for the BSP. This requires storing the
maximum number of possible vCPUs in xc_dom_image.
This has been a latent bug so far because PVH doesn't yet support
pci-passthrough, so the effective memory cache attribute is forced to
WB by the hypervisor. Note also that even without this in place vCPU#0
is preferred in certain scenarios in order to calculate the memory
cache attributes.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wei Liu [Tue, 9 Oct 2018 14:57:08 +0000 (15:57 +0100)]
x86/vtd: fix IOMMU share PT destruction path
Commit 2916951c1 ("mm / iommu: include need_iommu() test in
iommu_use_hap_pt()") included need_iommu() in iommu_use_hap_pt and 91d4eca7add ("mm / iommu: split need_iommu() into has_iommu_pt() and
need_iommu_pt_sync()") made things finer grain by spliting need_iommu
into three states.
The destruction path can't use iommu_use_hap_pt because at the point
platform op is called, IOMMU is either already switched to or has
always been in disabled state, and the shared PT test would always be
false.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
George Dunlap [Wed, 10 Oct 2018 11:36:25 +0000 (12:36 +0100)]
libxl: Restore scheduling parameters after migrate in best-effort fashion
Commit 3b4adba ("tools/libxl: include scheduler parameters in the
output of xl list -l") added scheduling parameters to the set of
information collected by libxl_retrieve_domain_configuration(), in
order to report that information in `xl list -l`.
Unfortunately, libxl_retrieve_domain_configuration() is also called by
the migration / save code, and the results passed to the restore /
receive code. This meant scheduler parameters were inadvertently
added to the migration stream, without proper consideration for how to
handle corner cases. The result was that if migrating from a host
running one scheduler to a host running a different scheduler, the
migration would fail with an error like the following:
Luckily there's a fairly straightforward way to set parameters in a
"best-effort" fashion. libxl provides a single struct containing the
parameters of all schedulers, as well as a parameter specifying which
scheduler. Parameters not used by a given scheduler are ignored.
Additionally, the struct contains a parameter to specify the
scheduler. If you specify a specific scheduler,
libxl_domain_sched_params_set() will fail if there's a different
scheduler. However, if you pass LIBXL_SCHEDULER_UNKNOWN, it will use
the value of the current scheduler for that domain.
In domcreate_stream_done(), before calling libxl__build_post(), set
the scheduler to LIBXL_SCHEDULER_UNKNOWN. This will propagate
scheduler parameters from the previous instantiation on a best-effort
basis.
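A minimal sketch of that call pattern (libxl type and function names as in
the tree; the wrapper function, variable names, and omission of error
handling are illustrative):

    #include <libxl.h>

    /* Best-effort restore: let libxl pick up the domain's current scheduler. */
    static void restore_sched_params_best_effort(libxl_ctx *ctx, uint32_t domid,
                                                 libxl_domain_sched_params p)
    {
        p.sched = LIBXL_SCHEDULER_UNKNOWN;          /* "whichever scheduler runs" */
        libxl_domain_sched_params_set(ctx, domid, &p);
    }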
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 9 Oct 2018 14:27:59 +0000 (16:27 +0200)]
x86: put_page_from_l2e() should honor _PAGE_RW
56fff3e5e9 ("x86: nuke PV superpage option and code") has introduced a
(luckily latent only) bug here, in that it didn't make reference
dropping dependent on whether the page was mapped writable. The only
current source of large page mappings for PV domains is the Dom0
builder, which only produces writeable ones.
Take the opportunity and also convert to bool both put_data_page()'s
respective parameter and the argument put_page_from_l3e() passes.
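A sketch of the corrected rule (put_page()/put_page_and_type() are Xen's
existing reference-dropping helpers; the function name and loop here are
illustrative, not the actual patch): the extra type reference is only dropped
when the superpage was actually mapped writable.

    static void put_data_pages(struct page_info *page, bool writeable,
                               unsigned int count)
    {
        for ( unsigned int i = 0; i < count; i++ )
        {
            if ( writeable )
                put_page_and_type(page + i);   /* drops the writable type ref too */
            else
                put_page(page + i);
        }
    }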
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monné [Tue, 9 Oct 2018 14:27:13 +0000 (16:27 +0200)]
x86/vtd: fix iommu_share_p2m_table
Commit 2916951c1 "mm / iommu: include need_iommu() test in
iommu_use_hap_pt()" changed the check in iommu_share_p2m_table to use
need_iommu(d) (as part of iommu_use_hap_pt) instead of iommu_enabled,
which broke the check because at the point in domain construction
where iommu_share_p2m_table is called need_iommu(d) will always return
false.
Fix this by reverting to the previous logic.
While there, turn the hap_enabled check into an ASSERT, since the only
caller of iommu_share_p2m_table already performs the hap_enabled check
before calling the function.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel De Graaf [Tue, 9 Oct 2018 14:26:54 +0000 (16:26 +0200)]
flask: sort io{port,mem}con entries
These entries are not always sorted by checkpolicy, so sort them during
policy load (as is already done for later ocontext additions).
Reported-by: Nicolas Poirot <nicolas.poirot@bertin.fr> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Tested-by: Nicolas Poirot <nicolas.poirot@bertin.fr> Reviewed-by: Nicolas Poirot <nicolas.poirot@bertin.fr>
Jan Beulich [Tue, 9 Oct 2018 14:25:35 +0000 (16:25 +0200)]
x86/HVM: move vendor independent CPU save/restore logic to shared code
A few pieces of the handling here are (no longer?) vendor specific, and
hence there's no point in replicating the code. Zero the full structure
before calling the save hook, eliminating the need for the hook
functions to zero individual fields.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>