Paul Durrant [Wed, 25 Jan 2017 09:41:35 +0000 (10:41 +0100)]
dm_op: convert HVMOP_*ioreq_server*
The definitions of HVM_IOREQSRV_BUFIOREQ_* have to persist as they are
already in use by callers of the libxc interface.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Paul Durrant [Wed, 25 Jan 2017 09:40:51 +0000 (10:40 +0100)]
public / x86: introduce __HYPERCALL_dm_op...
...as a set of hypercalls to be used by a device model.
As stated in the new docs/designs/dm_op.markdown:
"The aim of DMOP is to prevent a compromised device model from
compromising domains other then the one it is associated with. (And is
therefore likely already compromised)."
See that file for further information.
This patch simply adds the boilerplate for the hypercall.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Suggested-by: Ian Jackson <ian.jackson@citrix.com> Suggested-by: Jennifer Herbert <jennifer.herbert@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Elena Ufimtseva [Wed, 25 Jan 2017 09:38:05 +0000 (10:38 +0100)]
VT-d: add command line option for extra rmrrs
On some platforms firmware fails to specify RMRR regions in ACPI tables and
thus those regions will not be mapped in dom0 or guests and may cause IO
Page Faults and prevent dom0 from booting if "iommu=dom0-strict" option is
specified on the Xen command line.
New Xen command line option rmrr allows to specify such devices and
memory regions. These regions are added to the list of RMRR defined in ACPI
if the device is present in system. As a result, additional RMRRs will be
mapped 1:1 in dom0 with correct permissions.
The above mentioned problems were discovered during the PVH work with
ThinkCentre M and Dell 5600T. No official documentation was found so far
in regards to what devices and why cause this. Experiments show that
ThinkCentre M USB devices with enabled debug port generate DMA read
transactions to the regions of memory marked reserved in host e820 map.
For Dell 5600T the device and faulting addresses are not found yet.
For detailed history of the discussion please check following threads:
http://lists.Xen.org/archives/html/xen-devel/2015-02/msg01724.html
http://lists.Xen.org/archives/html/xen-devel/2015-01/msg02513.html
Format for rmrr Xen command line option:
rmrr=start<-end>=[s1]bdf1[,[s1]bdf2[,...]];start<-end>=[s2]bdf1[,[s2]bdf2[,...]]
For example, for Lenovo ThinkCentre M, use:
rmrr=0xd5d45=0:0:1d.0;0xd5d46=0:0:1a.0
If grub2 used and multiple ranges are specified, ';' should be
quoted/escaped, refer to grub2 manual for more information.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Elena Ufimtseva [Wed, 25 Jan 2017 09:37:43 +0000 (10:37 +0100)]
pci: add wrapper for parse_pci
For sbdf's parsing in RMRR command line, add parse_pci_seg with additional
parameter def_seg. parse_pci_seg will help to identify if segment was
found in string being parsed or default segment was used.
Make a wrapper parse_pci so the rest of the callers are not affected.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Elena Ufimtseva [Wed, 25 Jan 2017 09:37:14 +0000 (10:37 +0100)]
VT-d: separate rmrr addition function
In preparation for auxiliary RMRR data provided on Xen command line,
make RMRR adding a separate function.
Also free memery for rmrr device scope in error path.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com> Signed-off-by: Venu Busireddy <venu.busireddy@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Dario Faggioli [Tue, 17 Jan 2017 17:27:10 +0000 (18:27 +0100)]
xen: sched: simplify ACPI S3 resume path.
In fact, when domains are being unpaused:
- it's not necessary to put the vcpu to sleep, as
they are all already paused;
- it is not necessary to call vcpu_migrate(), as
the vcpus are still paused, and therefore won't
wakeup anyway.
Basically, the only important thing is to call
pick_cpu, to let the scheduler run and figure out
what would be the best initial placement (i.e., the
value stored in v->processor), for the vcpus, as
they come back up, one after another.
Note that this is consistent with what was happening
before this change, as vcpu_migrate() calls pick_cpu.
But much simpler and quicker.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 17 Jan 2017 17:27:03 +0000 (18:27 +0100)]
xen: sched: impove use of cpumask scratch space in Credit1.
It is ok to use just cpumask_scratch in csched_runq_steal().
In fact, the cpu parameter comes from the cpu local variable
in csched_load_balance(), which in turn comes from cpu in
csched_schedule(), which is smp_processor_id().
While there, also:
- move the comment about cpumask_scratch in the header
where the scratch space is declared;
- spell more clearly (in that same comment) what are the
serialization rules.
No functional change intended.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 17 Jan 2017 17:26:55 +0000 (18:26 +0100)]
xen: credit2: fix shutdown/suspend when playing with cpupools.
In fact, during shutdown/suspend, we temporarily move all
the vCPUs to the BSP (i.e., pCPU 0, as of now). For Credit2
domains, we call csched2_vcpu_migrate(), expects to find the
target pCPU in the domain's pool
Therefore, if Credit2 is the default scheduler and we have
removed pCPU 0 from cpupool0, shutdown/suspend fails like
this:
****************************************
Panic on CPU 8:
Assertion 'svc->vcpu->processor < nr_cpu_ids' failed at sched_credit2.c:1729
****************************************
On the other hand, if Credit2 is the scheduler of another
pool, when trying (still during shutdown/suspend) to move
the vCPUs of the Credit2 domains to pCPU 0, it figures
out that pCPU 0 is not a Credit2 pCPU, and fails like this:
The solution is to recognise the specific situation, inside
csched2_vcpu_migrate() and, considering it is something temporary,
which only happens during shutdown/suspend, quickly deal with it.
Then, in the resume path, in restore_vcpu_affinity(), things
are set back to normal, and a new v->processor is chosen, for
each vCPU, from the proper set of pCPUs (i.e., the ones of
the proper cpupool).
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 17 Jan 2017 17:26:46 +0000 (18:26 +0100)]
xen: credit2: never consider CPUs outside of our cpupool.
In fact, relying on the mask of what pCPUs belong to
which Credit2 runqueue is not enough. If we only do that,
when Credit2 is the boot scheduler, we may ASSERT() or
panic when moving a pCPU from Pool-0 to another cpupool.
This is because pCPUs outside of any pool are considered
part of cpupool0. This puts us at risk of crash when those
same pCPUs are added to another pool and something
different than the idle domain is found to be running
on them.
Note that, even if we prevent the above to happen (which
is the purpose of this patch), this is still pretty bad,
in fact, when we remove a pCPU from Pool-0:
- in Credit1, as we do *not* update prv->ncpus and
prv->credit, which means we're considering the wrong
total credits when doing accounting;
- in Credit2, the pCPU remains part of one runqueue,
and is hence at least considered during load balancing,
even if no vCPU should really run there.
In Credit1, this "only" causes skewed accounting and
no crashes because there is a lot of `cpumask_and`ing
going on with the cpumask of the domains' cpupool
(which, BTW, comes at a price).
A quick and not to involved (and easily backportable)
solution for Credit2, is to do exactly the same.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Tue, 17 Jan 2017 17:26:38 +0000 (18:26 +0100)]
xen: credit2: use the correct scratch cpumask.
In fact, there is one scratch mask per each CPU. When
you use the one of a CPU, it must be true that:
- the CPU belongs to your cpupool and scheduler,
- you own the runqueue lock (the one you take via
{v,p}cpu_schedule_lock()) for that CPU.
This was not the case within the following functions:
get_fallback_cpu(), csched2_cpu_pick(): as we can't be
sure we either are on, or hold the lock for, the CPU
that is in the vCPU's 'v->processor'.
migrate(): it's ok, when called from balance_load(),
because that comes from csched2_schedule(), which takes
the runqueue lock of the CPU where it executes. But it is
not ok when we come from csched2_vcpu_migrate(), which
can be called from other places.
The fix is to explicitly use the scratch space of the
CPUs for which we know we hold the runqueue lock.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reported-by: Jan Beulich <JBeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Tue, 24 Jan 2017 16:22:03 +0000 (16:22 +0000)]
x86emul/test: don't use *_len symbols
... as they don't work as intended with -fPIC.
I did prefer them over *_end ones at the time because older gcc would
cause .L* symbols to be public, due to issuing .globl for all
referenced externals. And labels at the end of instructions collide
with the ones at the start of the next instruction, making disassembly
harder to grok. Luckily recent gcc no longer issues those .globl
directives, and hence .L* labels, staying local by default, no longer
get in the way.
Reported-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Wei Liu <wei.liu2@citrix.com>
Joao Martins [Tue, 24 Jan 2017 11:37:36 +0000 (12:37 +0100)]
x86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed
Commit 6e03363 ("x86: Implement TSC adjust feature for HVM guest")
implemented TSC_ADJUST MSR for hvm guests. Though while booting
an HVM guest the boot CPU would have a value set with delta_tsc -
guest tsc while secondary CPUS would have 0. For example one can
observe:
$ xen-hvmctx 17 | grep tsc_adjust
TSC_ADJUST: tsc_adjust ff9377dfef47fe66
TSC_ADJUST: tsc_adjust 0
TSC_ADJUST: tsc_adjust 0
TSC_ADJUST: tsc_adjust 0
Upcoming Linux 4.10 now validates whether this MSR is correct and
adjusts them accordingly under the following conditions: values of < 0
(our case for CPU 0) or != 0 or values > 7FFFFFFF. In this conditions it
will force set to 0 and for the CPUs that the value doesn't match all
together. If this msr is not correct we would see messages such as:
And on HVM guests supporting TSC_ADJUST (requiring at least Haswell
Intel) it won't boot.
Our current vCPU 0 values are incorrect and according to Intel SDM which on
section "Time-Stamp Counter Adjustment" states that "On RESET, the value
of the IA32_TSC_ADJUST MSR is 0." hence we should set it 0 and be
consistent across multiple vCPUs. Perhaps this MSR should be only
changed by the guest which already happens through
hvm_set_guest_tsc_adjust(..) routines (see below). After this patch
guests running Linux 4.10 will see a valid IA32_TSC_ADJUST msr of value
0 for all CPUs and are able to boot.
On the same section of the spec ("Time-Stamp Counter Adjustment") it is
also stated:
"If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR
adds (or subtracts) value X from the TSC, the logical processor also
adds (or subtracts) value X from the IA32_TSC_ADJUST MSR.
Unlike the TSC, the value of the IA32_TSC_ADJUST MSR changes only in
response to WRMSR (either to the MSR itself, or to the
IA32_TIME_STAMP_COUNTER MSR). Its value does not otherwise change as
time elapses. Software seeking to adjust the TSC can do so by using
WRMSR to write the same value to the IA32_TSC_ADJUST MSR on each logical
processor."
This suggests these MSRs values should only be changed through guest i.e.
throught write intercept msrs. We keep IA32_TSC MSR logic such that writes
accomodate adjustments to TSC_ADJUST, hence no functional change in the
msr_tsc_adjust for IA32_TSC msr. Though, we do that in a separate routine
namely hvm_set_guest_tsc_msr instead of through hvm_set_guest_tsc(...).
Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 24 Jan 2017 11:36:55 +0000 (12:36 +0100)]
x86/HVM: make hvm_set_guest_tsc*() static
Other than hvm_set_guest_tsc(), neither needs to be exposed. And
hvm_get_guest_tsc_adjust() is pretty pointless as a seperate function
altogether, let alone a non-static one.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 24 Jan 2017 11:36:30 +0000 (12:36 +0100)]
x86/PVH: only set accessed/busy bits for present segments
Commit 366ff5f1b3 ("x86: segment attribute handling adjustments" went a
little too far: We must not do such adjustments for non-present segments.
Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Roger Pau Monné <roger.pau@citrix.com>
The current function pointers in struct vmx_domain for managing hvm
posted interrupt can be used also by SVM AVIC. Therefore, this patch
introduces the struct hvm_pi_ops in the struct hvm_domain to hold them.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Mon, 9 Jan 2017 13:42:02 +0000 (13:42 +0000)]
x86/hvm: Conditionally leave CPUID Faulting active in HVM context
If the hardware supports faulting, and the guest has chosen to use it, leave
faulting active in HVM context.
It is more efficient to have hardware convert CPUID to a #GP fault (which we
don't intercept), than to take a VMExit and have Xen re-inject a #GP fault.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Wed, 18 Jan 2017 18:10:41 +0000 (18:10 +0000)]
x86/cpuid: Hide VT-x/SVM from HVM-based control domains
The VT-x/SVM features are hidden from PV dom0 by the pv_featureset[] upper
mask, but nothing thus far has prevented the features being visible in
HVM-based control domains (where there is no toolstack decision to hide the
features).
As a side effect of calling nestedhvm_enabled() earlier during domain
creation, it needs to cope with the params[] array not having been allocated.
Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 20 Jan 2017 15:34:54 +0000 (15:34 +0000)]
x86/emul: Fix clang build following BMI1/BMI2/TBM instruction support
Travis reports that Clang objects to integer truncation during assignments to
a bitfield:
./x86_emulate/x86_emulate.c:6150:19: error: implicit truncation from 'int'
to bitfield changes value from -1 to 15 [-Werror,-Wbitfield-constant-conversion]
pxop->reg = ~0; /* rAX */
^ ~~
Use 0xf instead.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Eric DeVolder [Thu, 19 Jan 2017 17:10:53 +0000 (11:10 -0600)]
kexec: ensure kexec_status() return bit value of 0 or 1
When checking kexec_flags bit corresponding to the
requested image, ensure that 0 or 1 is returned.
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 20 Jan 2017 13:39:12 +0000 (14:39 +0100)]
x86: segment attribute handling adjustments
Null selector loads into SS (possible in 64-bit mode only, and only in
rings other than ring 3) must not alter SS.DPL. (This was found to be
an issue on KVM, and fixed in Linux commit 33ab91103b.)
Further arch_set_info_hvm_guest() didn't make sure that the ASSERT()s
in hvm_set_segment_register() wouldn't trigger: Add further checks, but
tolerate (adjust) clear accessed (CS, SS, DS, ES) and busy (TR) bits.
Finally the setting of the accessed bits for user segments was lost by
commit dd5c85e312 ("x86/hvm: Reposition the modification of raw segment
data from the VMCB/VMCS"), yet VMX requires them to be set for usable
segments. Add respective ASSERT()s (the only path not properly setting
them was arch_set_info_hvm_guest()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 20 Jan 2017 13:37:33 +0000 (14:37 +0100)]
x86emul: LOCK check adjustments
BT, being encoded as DstBitBase just like BT{C,R,S}, nevertheless does
not write its (register or memory) operand and hence also doesn't allow
a LOCK prefix to be used.
At the same time CLAC/STAC have no need to explicitly check lock_prefix
- this is being taken care of by generic code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 20 Jan 2017 13:36:58 +0000 (14:36 +0100)]
x86emul: CMPXCHG{8,16}B are memory writes
This fixes a regression introduced by commit ff913f68c9 ("x86/PV:
restrict permitted instructions during memory write emulation")
breaking namely 32-bit PV guests (which commonly use CMPXCHG8B for
certain page table updates).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Daniel Kiper [Fri, 20 Jan 2017 13:34:33 +0000 (14:34 +0100)]
x86/boot: implement early command line parser in C
Current early command line parser implementation in assembler
is very difficult to change to relocatable stuff using segment
registers. This requires a lot of changes in very weird and
fragile code. So, reimplement this functionality in C. This
way code will be relocatable out of the box (without playing
with segment registers) and much easier to maintain.
Additionally, put all common cmdline.c and reloc.c definitions
into defs.h header. This way we do not duplicate needlessly
some stuff.
And finally remove unused xen/include/asm-x86/config.h
header from reloc.c dependencies.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Juergen Gross [Thu, 19 Jan 2017 07:18:53 +0000 (08:18 +0100)]
tools/tests: add xenstore testing framework
Add tools/tests/xenstore for a framework to do tests of xenstore.
The aim is to test for correctness and performance.
Add a test program containing some tests meant to be run against any
xenstore implementation (xenstored, oxenstored, xenstore-stubdom).
It is using libxenstore for access to xenstore and supports multiple
tests to be either selected all or individually. All tests are using
/local/domain/<own-domid>/xenstore-test/<pid> as base for doing the
tests. This allows multiple instances of the program to run in
parallel.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 19 Jan 2017 09:33:55 +0000 (10:33 +0100)]
x86emul: rename the no_writeback label
This is to bring its name in line with what actually happens there.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 19 Jan 2017 09:28:28 +0000 (10:28 +0100)]
x86emul: support BMI2 insns
Note that the adjustment to the mode_64bit() definition is so that we
can avoid "#ifdef __x86_64__" around the 64-bit asm() portions. An
alternative would be single asm()s with a conditional branch over the
(manually encoded) REX64 prefix.
Note that RORX raising #UD when VEX.VVVV is not all ones is matching
observed behavior rather than what the SDM says.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Wed, 18 Jan 2017 18:54:08 +0000 (18:54 +0000)]
xen/arm: gic-v3: Make sure read from ICC_IAR1_EL1 is visible on the redistributor
"The effects of reading ICC_IAR0_EL1 and ICC_IAR1_EL1 on the state of a
returned INTID are not guaranteed to be visible until after the execution
of a DSB".
Because of the GIC is an external component, a dsb sy is required.
Without it the sysreg read may not have been made visible on the
redistributor.
Andrew Cooper [Mon, 9 Jan 2017 12:54:55 +0000 (12:54 +0000)]
x86/cpuid: Offer ITSC to domains which are automatically non-migrateable
Dom0 doesn't have a toolstack to explicitly decide that ITSC is safe to offer.
For domains which are automatically built with disable_migrate set, offer ITSC
automatically.
This is important for HVM-based dom0, and for when cpuid faulting is imposed
on the control domain.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 4 Jan 2017 15:07:02 +0000 (15:07 +0000)]
tools/libxc: Remove xsave calculations from libxc
libxc performs a lot of calculations for the xstate leaf when generating a
guests cpuid policy. To correctly construct a policy for an HVM guest, this
logic depends on native cpuid leaking through from real hardware.
In particular, the logic is potentially wrong for an HVM-based toolstack
domain (e.g. PVH dom0), and definitely wrong if cpuid faulting is applied to a
PV domain.
Xen now performs all the necessary calculations, using native values. The
only piece of information the toolstack need worry about is single xstate
feature leaf.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 16 Dec 2016 16:21:20 +0000 (16:21 +0000)]
x86/cpuid: Move all xstate leaf handling into guest_cpuid()
The xstate union now contains sanitised values, so it can be handled fully in
the non-legacy path.
c/s 1c0bc709d "x86/cpuid: Perform max_leaf calculations in guest_cpuid()"
accidentally introduced a boundary error for the subleaf check, although it
was masked by the correct logic in the legacy path.
Two dynamic adjustments need making, but a TODO and BUILD_BUG_ON() are left to
cover a latent bug which will present itself when Xen starts supporting XSS
states for guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 4 Jan 2017 15:00:23 +0000 (15:00 +0000)]
x86/cpuid: Introduce recalculate_xstate()
All data in the xstate union, other than the Da1 feature word, is derived from
other state; either feature bits from other words, or layout information which
has already been collected by Xen's xstate driver.
Recalculate the xstate information for each policy object when the feature
bits may have changed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 12 Jan 2017 11:45:10 +0000 (11:45 +0000)]
x86/cpuid: Move x86_vendor from arch_domain to cpuid_policy
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Thu, 12 Jan 2017 11:45:10 +0000 (11:45 +0000)]
x86/cpuid: Drop a guests cached x86 family and model information
The model information isn't used at all, and the family information is only
used once.
Make get_cpu_family() a static inline (as it is just basic calculation, and
the function call is probably more expensive than the function itself) and
rearange the logic to avoid calculating model entirely if the caller doesn't
want it.
Calculate a guests family only when necessary in hvm_select_ioreq_server().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Eric DeVolder [Tue, 17 Jan 2017 17:29:16 +0000 (11:29 -0600)]
kexec: implement STATUS hypercall to check if image is loaded
The tools that use kexec are asynchronous in nature and do not keep
state changes. As such provide an hypercall to find out whether an
image has been loaded for either type.
Note: No need to modify XSM as it has one size fits all check and
does not check for subcommands.
Note: No need to check KEXEC_FLAG_IN_PROGRESS (and error out of
kexec_status()) as this flag is set only once by the first/only
cpu on the crash path.
Note: This is just the Xen side of the hypercall, kexec-tools patch
to come separately.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 17 Jan 2017 09:32:25 +0000 (10:32 +0100)]
x86emul: VEX.B is ignored in compatibility mode
While VEX.R and VEX.X are guaranteed to be 1 in compatibility mode
(and hence a respective mode_64bit() check can be dropped), VEX.B can
be encoded as zero, but would be ignored by the processor. Since we
emulate instructions in 64-bit mode (except possibly in the test
harness), we need to force the bit to 1 in order to not act on the
wrong {X,Y,Z}MM register (which has no bad effect on 32-bit test
harness builds, as there the bit would again be ignored by the
hardware, and would by default be expected to be 1 anyway).
We must not, however, fiddle with the high bit of VEX.VVVV in the
decode phase, as that would undermine the checking of instructions
requiring the field to be all ones independent of mode. This is
being enforced in copy_REX_VEX() instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 17 Jan 2017 09:31:39 +0000 (10:31 +0100)]
x86emul: suppress memory writes after faulting FPU insns
FPU insns writing to memory must not touch memory if they latch #MF (to
be delivered on the next waiting FPU insn). Note that inspecting FSW.ES
needs to be avoided for all FNST* insns, as they don't raise exceptions
themselves, but may instead be invoked with the bit already set.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Fri, 13 Jan 2017 18:51:04 +0000 (18:51 +0000)]
x86/xstate: Fix array overrun on hardware with LWP
c/s da62246e4c "x86/xsaves: enable xsaves/xrstors/xsavec in xen" introduced
setup_xstate_features() to allocate and fill xstate_offsets[] and
xstate_sizes[].
However, fls() casts xfeature_mask to 32bits which truncates LWP out of the
calculation. As a result, the arrays are allocated too short, and the cpuid
infrastructure reads off the end of them when calculating xstate_size for the
guest.
On one test system, this results in 0x3fec83c0 being returned as the maximum
size of an xsave area, which surprisingly appears not to bother Windows or
Linux too much. I suspect they both use current size based on xcr0, which Xen
forwards from real hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 6 Jan 2017 20:05:36 +0000 (20:05 +0000)]
x86/pv: Check that emulate_privileged_op() don't change any unexpected flags
No bits, other than arithmetic ones and the resume flag (which will most
likely change from 1 to 0), can be changed by the instructions we permit.
Extend the check to cover other flags.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 13 Jan 2017 13:23:42 +0000 (13:23 +0000)]
x86/emul: Calculate not_64bit during instruction decode
... rather than repeating "generate_exception_if(mode_64bit(), EXC_UD);" in
the emulation switch statement.
Bloat-o-meter shows:
add/remove: 0/0 grow/shrink: 1/2 up/down: 8/-495 (-487)
function old new delta
per_cpu__state 98 106 +8
x86_decode 6782 6726 -56
x86_emulate 57160 56721 -439
The reason for x86_decode() getting smaller is that this change alters the
x86_decode_onebyte() switch statement from a chain of if()/else's to a jump
table. The jump table adds 250 bytes of data which bloat-o-meter clearly
can't see.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 13 Jan 2017 14:28:31 +0000 (15:28 +0100)]
x86emul: improve CR/DR access handling
- don't accept LOCK for DR accesses (it's undefined in the manuals)
- only accept LOCK for CR accesses when the respective feature flag is
set (which would not normally be the case for Intel)
- add (rather than or) 8 when LOCK is present; real hardware #UDs
when both REX.W and LOCK are present, implying that these would
rather access hypothetical CR16...23
- eliminate explicit decode_register() calls
- streamline remaining read/write code
No further functional change, i.e. not addressing the missing exception
generation (#UD for invalid CR/DR encodings, #GP(0) for invalid write
values, #DB for DR accesses with DR7.GD set).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 13 Jan 2017 14:24:45 +0000 (15:24 +0100)]
x86emul: conditionally clear BNDn for branches
Considering that we surface MPX to HVM guests, instructions we emulate
should also correctly deal with MPX state. While for now BND*
instructions don't get emulated, the effect of branches (which we do
emulate) without BND prefix should be taken care of.
No need to alter XABORT behavior: While not mentioned in the SDM so
far, this restores BNDn as they were at the XBEGIN, and since we make
XBEGIN abort right away, XABORT in the emulator is only a no-op.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 4 Jan 2017 12:46:09 +0000 (12:46 +0000)]
x86/cpuid: Effectively remove domain_cpuid()
The only callers of domain_cpuid() are the legacy cpuid path via
{pv,hvm}_cpuid(). Move domain_cpuid() to being private in cpuid.c, with an
adjusted API to use struct cpuid_leaf rather than individual pointers.
The ITSC clobbering logic is dropped. It is no longer necessary now that the
logic has moved into recalculate_cpuid_policy()
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 4 Jan 2017 12:20:51 +0000 (12:20 +0000)]
x86/cpuid: Store the toolstacks choice of hypervisor max leaf
This removes all dependencies on the legacy cpuids[] array from
cpuid_hypervisor_leaves(). Swap a BUG() to an ASSERT_UNREACHABLE(), because
in the unlikely case that we hit it, returning all zeros to the guest is fine.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 4 Jan 2017 12:43:57 +0000 (12:43 +0000)]
x86/domctl: Move all CPUID update logic into update_domain_cpuid_info()
This simplifies the XEN_DOMCTL_set_cpuid handling, splitting the safety logic
away from the internals of how an update is completed.
The legacy cpuids[] logic is left in alone in a fuction, as it wont survive
very long. update_domain_cpuid_info() gains a small performance optimisation
to skip all update activites for leaves which won't have any impact on the
guest. This is temporary until the new hypercall API is completed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 12 Jan 2017 16:14:56 +0000 (16:14 +0000)]
x86/cpuid: Fix feature flags reported to dom0
c/s a11e8c9 "x86/pv: Use per-domain policy information in pv_cpuid()" switched
PV domains from using a (hardware for dom0, toolstack-chosen from domU) value
masked against pv_featureset[], to actually using the value calculated by
recalculate_cpuid_policy().
For domU, this is no practical change as the content is still chosen by the
toolstack. For dom0 however, we no longer have two sources of information
potentially clearing bits. Modern Linux seems to care about having CMP_LEGACY
set in its view of CPUID on an Intel box.
The deliberate setting of HTT, X2APIC and CMP_LEGACY in {pv,hvm}_featureset[]
is necessary for domUs, as the toolstack may have (tried to) set up topology
information in a different representation than the hardware uses. The bits
therefore needed to be set in the masks used in the older logic, to avoid
clobbering the toolstacks information.
Move the HTT/X2APIC/CMP_LEGACY logic from calculate_{pv,hvm}_max_policy()
(where the meaning of {pv,hvm}_featureset[] has changed subtly) to
recalculate_cpuid_policy() where the masking logic now lives.
This will cause {pv,hvm}_max_policy to actually contain real hardware values
(so dom0 sees real hardware values), but still allows the toolstack to set
bits not present in real hardware for domUs.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Tue, 10 Jan 2017 16:13:38 +0000 (17:13 +0100)]
tools/xenstore: start with empty data base
Today xenstored tries to open a tdb data base file on disk when it is
started. As this is problematic in most cases the scripts used to start
xenstored ensure xenstored won't find such a file in order to start
with an empty xenstore.
A tdb data base file can't be used to restore all Xenstore state as
e.g. Xenstore watches are not kept in the tdb data base. The file is
meant to be used for debugging purposes after a xenstored crash only.
Instead of opening a Xenstore data base file found on disk always start
with an empty data base. This will avoid problems in case someone is
testing multiple xenstored versions without rebooting (which is not
supported but helps debugging in some cases).
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Mon, 9 Jan 2017 13:17:01 +0000 (13:17 +0000)]
tools/libxc: Fix the reported max_leaf values for PV guests
When iterating through CPUID leaves to generating a policy, libxc will clip
itself at the hardcoded maxima, meaning that no data outside of the hardcoded
maxima are provided to Xen (in turn, causing Xen to return zeros if these
leaves are requested.)
The HVM code also clips the max_leaf data reported to the guest, but the PV
side didn't.
This results in a PV guest using the emulated CPUID, or via Xen using CPUID
faulting, to observe a max_leaf higher than the toolstack wants, although with
zeros being returned in the intervening leaves.
Fix the PV side to behave like the HVM side, and clip the max_leaf values in
leaf 0 and 0x80000000.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 11 Jan 2017 12:43:04 +0000 (13:43 +0100)]
x86emul: correct EFLAGS.TF handling
For repeated string instructions we should not emulate multiple
iterations in one go when a single step trap needs injecting (which
needs to happen after every iteration).
For all non-branch instructions as well as not taken conditional
branches we additionally need to take DebugCtl.BTF into consideration.
For mov-to/pop-into %ss there should be no #DB at all (EFLAGS.TF
remaining set means there'll be #DB after the next instruction).
Additionally retire.sti should remain clear when retire.singlestep gets
set to true.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citirx.com>
Jan Beulich [Wed, 11 Jan 2017 12:40:49 +0000 (13:40 +0100)]
x86/HVM: restrict permitted instructions during special purpose emulation
Most invocations of the instruction emulator are for VM exits where the
set of legitimate instructions (i.e. ones capable of causing the
respective exit) is rather small. Restrict the permitted sets via a new
callback, at once eliminating the abuse of handle_mmio() for non-MMIO
operations.
A seemingly unrelated comment adjustment is being done here to keep
x86_emulate() in sync with x86_insn_is_mem_write() (in the context of
which this was found to be wrong).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/cpuid: Alter the legacy-path prototypes to match guest_cpuid()
This allows the compiler to have a far easier time inlining the legacy paths
into guest_cpuid(), and avoids the need to have a full struct cpu_user_regs in
the guest_cpuid() stack frame.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/cpuid: Effectively remove pv_cpuid() and hvm_cpuid()
All callers of pv_cpuid() and hvm_cpuid() (other than guest_cpuid() legacy
path) have been removed from the codebase. Move them into cpuid.c to avoid
any further use, leaving guest_cpuid() as the sole API to use.
This is purely code motion, with no functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/svm: Use guest_cpuid() rather than hvm_cpuid()
More work is required before LWP details can be read straight out of the
cpuid_policy block, but in the meantime hvm_cpuid() wants to disappear so
update the code to use the newer interface.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/hvm: Use guest_cpuid() rather than hvm_cpuid()
More work is required before maxphysaddr can be read straight out of the
cpuid_policy block, but in the meantime hvm_cpuid() wants to disappear so
update the code to use the newer interface.
Use the behaviour of max_leaf handling (returning all zeros) to avoid a double
call into guest_cpuid().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/cpuid: Move all leaf 7 handling into guest_cpuid()
All per-domain policy data concerning leaf 7 is accurate. Handle it all in
guest_cpuid() by reading out of the raw array block, and introduing a dynamic
adjustment for OSPKE.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/cpuid: Drop the temporary linear feature bitmap from struct cpuid_policy
With most uses of the *_featureset API removed, the remaining uses are only
during XEN_SYSCTL_get_cpu_featureset, init_guest_cpuid(), and
recalculate_cpuid_policy(), none of which are hot paths.
Drop the temporary infrastructure, and have the current users recreate the
linear bitmap using cpuid_policy_to_featureset(). This avoids storing
duplicated information in struct cpuid_policy.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/pv: Use per-domain policy information in pv_cpuid()
... rather than performing runtime adjustments. This is safe now that
recalculate_cpuid_policy() perfoms suitable sanitisation when the policy data
is loaded.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/svm: Improvements using named features
This avoids calling into hvm_cpuid() to obtain information which is directly
available. In particular, this avoids the need to overload flag_dr_dirty
because of hvm_cpuid() being unavailable in svm_save_dr().
flag_dr_dirty is returned to a boolean (as it was before c/s c097f549 which
introduced the need to overload it). While returning it to type bool, remove
the use of bool_t for the adjacent fields.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/vvmx: Use hvm_cr4_guest_valid_bits() to calculate MSR_IA32_VMX_CR4_FIXED1
Reuse the logic in hvm_cr4_guest_valid_bits() instead of duplicating it.
This fixes a bug to do with the handling of X86_CR4_PCE. The RDPMC
instruction predate the architectural performance feature, and has been around
since the P6. X86_CR4_PCE is like X86_CR4_TSD and only controls whether RDPMC
is available at cpl!=0, not whether RDPMC is generally unavailable.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/hvm: Improve hvm_efer_valid() using named features
Pick the appropriate cpuid_policy object rather than using hvm_cpuid() or
boot_cpu_data. This breaks the dependency on current.
As data is read straight out of cpuid_policy, there is no need to work around
the fact that X86_FEATURE_SYSCALL might be clear because of the dynamic
adjustment in hvm_cpuid(). This simplifies the SCE handling, as EFER.SCE can
be set in isolation in 32bit mode on Intel hardware.
Alter nestedhvm_enabled() to be const-correct, allowing hvm_efer_valid() to be
properly const-correct.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/cpuid: Introduce named feature bitfields
It greatly aids the readibility of code to express feature checks with their
direct name (e.g. p->basic.mtrr or p->extd.lm), rarther that by a field and a
bitmask. gen-cpuid.py is augmented to calculate a suitable declaration to
live in a union with the underlying feature word.
gen-cpuid.py doesn't know Xen's choice of naming for the feature word indicies
(and arguably shouldn't care), so provides the declarations in terms of their
numeric feature word index. The DECL_BITFIELD() macro (local to cpuid_policy)
takes a feature word index name and chooses the right declaration, to aid
clarity.
All X86_FEATURE_*'s are included in the naming, other than the features
fast-forwarded from other state (APIC, OSXSAVE, OSPKE), whose value cannot be
read out of the feature word.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 11 Jan 2017 11:59:02 +0000 (11:59 +0000)]
x86/cpuid: Dispatch cpuid_hypervisor_leaves() from guest_cpuid()
... rather than from the legacy path. Update the API to match guest_cpuid(),
and remove its dependence on current.
Make use of guest_cpuid() unconditionally zeroing res to avoid repeated
re-zeroing. To use a const struct domain, domain_cpuid() needs to be
const-corrected.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>