Dario Faggioli [Mon, 16 Jun 2014 10:13:03 +0000 (12:13 +0200)]
derive NUMA node affinity from hard and soft CPU affinity
If a domain's NUMA node-affinity (which is what controls
memory allocations) is provided by the user/toolstack, it
is simply left untouched. However, if the user does not say
anything, leaving it all to Xen, let's compute it in the
following way:
1. cpupool's cpus & hard-affinity & soft-affinity
2. if (1) is empty: cpupool's cpus & hard-affinity
This guarantees that memory is allocated from the narrowest
possible set of NUMA nodes, and makes it relatively easy to
set up NUMA-aware scheduling on top of soft affinity.
Note that such 'narrowest set' is guaranteed to be non-empty.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
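The two-step derivation described in the commit above can be sketched as follows. This is only an illustration under simplifying assumptions: it treats a single pair of hard/soft affinity sets rather than all the vcpus of a domain, and the names derive_node_affinity and cpu_to_node are hypothetical, not Xen identifiers.

```python
def derive_node_affinity(pool_cpus, hard, soft, cpu_to_node):
    """Return the NUMA nodes spanned by the narrowest non-empty CPU set."""
    cpus = pool_cpus & hard & soft      # step 1: cpupool & hard & soft
    if not cpus:
        cpus = pool_cpus & hard         # step 2: drop soft affinity
    return {cpu_to_node[c] for c in cpus}


# Hypothetical 2-node host: CPUs 0-3 on node 0, CPUs 4-7 on node 1.
cpu_to_node = {c: (0 if c < 4 else 1) for c in range(8)}
pool = set(range(8))

# Soft affinity overlaps hard affinity only on node 1.
print(derive_node_affinity(pool, {0, 1, 4, 5}, {4, 5, 6}, cpu_to_node))
# Soft affinity does not overlap hard affinity: fall back to step 2.
print(derive_node_affinity(pool, {0, 1}, {6, 7}, cpu_to_node))
```

Step 2 yields a non-empty set as long as the hard affinity intersects the cpupool, which is the precondition the sketch assumes.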
Dario Faggioli [Mon, 16 Jun 2014 10:12:28 +0000 (12:12 +0200)]
sched: introduce soft-affinity and use it instead d->node-affinity
Before this change, each vcpu had its own vcpu-affinity
(in v->cpu_affinity), representing the set of pcpus where
the vcpu is allowed to run. Since NUMA-aware scheduling
was introduced, the scheduler (credit1 only, for now) also
tries as much as it can to run all the vcpus of a domain
on one of the nodes that constitute the domain's
node-affinity.
The idea here is making the mechanism more general by:
* allowing for this 'preference' for some pcpus/nodes to be
expressed on a per-vcpu basis, rather than for the domain
as a whole. That is to say, each vcpu should have its own
set of preferred pcpus/nodes, rather than it being the
very same for all the vcpus of the domain;
* generalizing the idea of 'preferred pcpus' beyond NUMA
awareness and support. That is to say, independently of
whether it is (mostly) useful on NUMA systems, it should
be possible to specify, for each vcpu, a set of pcpus where
it prefers to run (in addition, and possibly unrelated to,
the set of pcpus where it is allowed to run).
We will be calling this set of *preferred* pcpus the vcpu's
soft affinity. This change introduces it and starts using it
for scheduling, replacing the indirect use of the domain's NUMA
node-affinity. This is more general, as soft affinity does not
have to be related to NUMA. Nevertheless, it allows achieving the
same results as NUMA-aware scheduling, just by making soft affinity
equal to the domain's node affinity for all the vCPUs (e.g.,
from the toolstack).
This also means renaming most of the NUMA-aware scheduling related
functions, in credit1, to something more generic, hinting toward
the concept of soft affinity rather than directly to NUMA awareness.
As a side effect, this simplifies the code quite a bit. In fact,
prior to this change, we needed to cache the translation of
d->node_affinity (which is a nodemask_t) to a cpumask_t, since that
is what scheduling decisions require (we used to keep it in
node_affinity_cpumask). This, and all the complicated logic
required to keep it updated, is not necessary any longer.
The high level description of NUMA placement and scheduling in
docs/misc/xl-numa-placement.markdown is being updated too, to match
the new architecture.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
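The relationship between hard affinity (where a vcpu may run) and soft affinity (where it prefers to run) from the commit above can be illustrated with a two-pass lookup. This is a minimal sketch, not the credit1 algorithm; pick_pcpu and its arguments are hypothetical names.

```python
def pick_pcpu(idle_pcpus, hard, soft):
    """Prefer an idle pcpu in the soft affinity, else any allowed idle pcpu."""
    # First pass: pcpus that are both allowed and preferred.
    # Second pass: pcpus that are merely allowed.
    for candidates in (hard & soft, hard):
        usable = sorted(candidates & idle_pcpus)
        if usable:
            return usable[0]
    return None  # no idle pcpu where the vcpu is allowed to run


hard = {0, 1, 2, 5}
soft = {5, 6}
print(pick_pcpu({2, 5}, hard, soft))  # preferred pcpu is idle
print(pick_pcpu({2}, hard, soft))     # only a non-preferred pcpu is idle
```

Note that pcpu 6 is never picked even though it is in the soft affinity, because soft affinity only expresses a preference within the hard affinity.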
Dario Faggioli [Mon, 16 Jun 2014 10:11:52 +0000 (12:11 +0200)]
sched: rename v->cpu_affinity into v->cpu_hard_affinity
in order to distinguish it from the cpu_soft_affinity which will
be introduced in a later commit ("xen: sched: introduce soft-affinity
and use it instead d->node-affinity").
This patch does not imply any functional change, it is basically
the result of something like the following:
Malcolm Crossley [Mon, 16 Jun 2014 10:02:00 +0000 (12:02 +0200)]
spread boot time page scrubbing across all available CPU's
The page scrubbing is done in 128MB chunks in lockstep across all the
non-SMT CPUs. This allows the boot CPU to hold the heap_lock whilst each
chunk is being scrubbed and then release the heap_lock when the CPUs have
finished scrubbing their individual chunks. This allows the heap_lock to
not be held continuously and pending softirqs to be serviced
periodically across the CPUs.
The page scrub memory chunks are allocated to the CPUs in a NUMA aware
fashion to reduce socket interconnect overhead and improve performance.
Specifically in the first phase we scrub at the same time on all the
NUMA nodes that have CPUs - we also weed out the SMT threads so that
we only use cores (that gives a 50% boost). The second phase is for NUMA
nodes that have no CPUs - for that we use the closest NUMA node's CPUs
(non-SMT again) to do the job.
This patch reduces the boot page scrub time on a 128GB 64 core AMD Opteron
6386 machine from 49 seconds to 3 seconds.
On an IvyBridge-EX 8 socket box with 1.5TB it cuts it down from 15 minutes
to 63 seconds.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
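The lockstep chunking described in the scrubbing commit above can be pictured as rounds in which every CPU of a node scrubs one chunk of its own slice. The sketch below is an assumption about how such a split could look, not the actual Xen code; scrub_rounds and its parameters are hypothetical, and only the 128MB chunk size comes from the commit message.

```python
CHUNK = 128 << 20  # 128MB, the chunk size named in the commit message


def scrub_rounds(node_bytes, ncpus, chunk=CHUNK):
    """Yield, for each lockstep round, the (start, end) range per CPU."""
    per_cpu = -(-node_bytes // ncpus)  # ceiling division: slice per CPU
    offset = 0
    while offset < per_cpu:
        step = min(chunk, per_cpu - offset)
        ranges = []
        for cpu in range(ncpus):
            start = min(cpu * per_cpu + offset, node_bytes)
            end = min(start + step, node_bytes)
            ranges.append((start, end))
        # In the scheme described above, the lock would be dropped here,
        # between rounds, so pending softirqs can be serviced.
        yield ranges
        offset += chunk


for number, ranges in enumerate(scrub_rounds(1 << 30, 4)):
    print(number, [(s >> 20, e >> 20) for s, e in ranges])
```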
x86/mce: don't spam the console with "CPUx: Temperature z"
If the machine has been quite busy it ends up with these messages
printed on the hypervisor console:
(XEN) CPU3: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
(XEN) CPU0: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
(XEN) CPU1: Temperature/speed normal
(XEN) CPU0: Temperature above threshold
(XEN) CPU0: Running in modulated clock mode
(XEN) CPU1: Temperature/speed normal
(XEN) CPU2: Temperature/speed normal
(XEN) CPU3: Temperature/speed normal
While the state changes are important, the non-altered state
information is not needed. As such add a latch mechanism to only print
the information if it has changed since the last update (and the
hardware doesn't properly suppress redundant notifications).
This was observed on Intel DQ67SW,
BIOS SWQ6710H.86A.0066.2012.1105.1504 11/05/2012
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christoph Egger <chegger@amazon.de>
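The latch mechanism mentioned in the commit above amounts to remembering the last reported state per CPU and staying silent when it has not changed. A minimal sketch, assuming a single boolean "hot" state per CPU; the class and method names are hypothetical and the message strings are taken from the console output quoted above.

```python
class ThermalLatch:
    """Remember the last reported state per CPU and report only changes."""

    def __init__(self):
        self.last = {}  # cpu number -> last reported "hot" state

    def report(self, cpu, hot):
        if self.last.get(cpu) == hot:
            return None  # unchanged state: print nothing
        self.last[cpu] = hot
        if hot:
            return f"CPU{cpu}: Temperature above threshold"
        return f"CPU{cpu}: Temperature/speed normal"


latch = ThermalLatch()
for cpu, hot in [(0, False), (0, False), (0, True), (0, True), (0, False)]:
    message = latch.report(cpu, hot)
    if message is not None:
        print("(XEN)", message)
```

In this sketch the first notification for a CPU is always printed, since no previous state has been latched yet.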
Ross Lagerwall [Mon, 16 Jun 2014 09:59:05 +0000 (11:59 +0200)]
cpuidle: improve perf for certain workloads
The existing mechanism of using interrupt frequency as a heuristic does
not work well for certain workloads. As an example, synchronous dd on a
small block size uses deep C-states because much of the time is spent
doing processing so the interrupt frequency is not too high, but when an
IOP is submitted, the interrupt occurs soon after going idle. This
causes exit latency to be a significant factor.
To fix this, add a new factor which limits the exit latency to be no
more than 10% of the decaying measured idle time. This improves
performance for workloads with a medium interrupt frequency but a short
idle duration.
In the workload given previously, throughput improves by 20% with this
patch.
This is not ported from the Linux menu governor since that uses load
average and number of IO wait processes to satisfy latency constraints.
If a process is in IO wait state, it compares the exit latency with the
predicted residency reduced by a factor of 10, which is somewhat similar
to what this patch does.
A side effect of this patch is to correctly limit the maximum idle time
used in the correction factor calculation. Previously data->measured_us
was used, and it was never set.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
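The new factor described in the cpuidle commit above can be read as an extra condition when walking the C-states from shallow to deep: stop as soon as a state's exit latency exceeds 10% of the average idle time. The following is a simplified sketch, not the menu governor; pick_cstate, the state tuples and the numbers are hypothetical, and avg_idle_us is assumed to be a decaying average maintained elsewhere.

```python
def pick_cstate(states, predicted_us, avg_idle_us, latency_factor=10):
    """Pick the deepest usable state.

    states: list of (name, exit_latency_us, target_residency_us),
    ordered from shallowest to deepest.
    """
    chosen = states[0]
    for state in states[1:]:
        _, exit_latency, residency = state
        if residency > predicted_us:
            break  # predicted idle period too short for this state
        if exit_latency * latency_factor > avg_idle_us:
            break  # exit latency would exceed 10% of the measured idle time
        chosen = state
    return chosen[0]


states = [("C1", 1, 1), ("C3", 50, 100), ("C6", 200, 800)]
print(pick_cstate(states, predicted_us=1000, avg_idle_us=300))
print(pick_cstate(states, predicted_us=1000, avg_idle_us=5000))
```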
Jan Beulich [Mon, 16 Jun 2014 09:52:34 +0000 (11:52 +0200)]
x86/EFI: improve boot time diagnostics (try 2)
To aid analysis of eventual errors, print EFI status codes with error
messages where available. Also remove a case where the status gets
stored into a local variable without being examined (which misguided
me to add an error check there in try 1 of this patch).
Jan Beulich [Mon, 16 Jun 2014 09:50:44 +0000 (11:50 +0200)]
pt-irq fixes and improvements
Tools side:
- don't silently ignore unrecognized PT_IRQ_TYPE_* values
- respect that the interface type contains a union (making the code at
once no longer depend on the hypervisor ignoring the bus field of the
PCI portion of the interface structure)
Hypervisor side:
- don't ignore the PCI bus number passed in
- don't store values (gsi, link) calculated from other stored values
- avoid calling xfree() with a spin lock held where easily possible
- have pt_irq_destroy_bind() respect the passed in type
- scope reduction and constification of various variables
- use switch instead of if/else-if chains
- formatting
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
xen/arm: Implement a dummy debug monitor for ARM32
XSA-93 (commit 0b18220 "xen/arm: Don't let guess access to Debug and Performance
Monitors registers") disabled access to the Debug Registers.
When CONFIG_PERF_EVENTS is enabled in the Linux kernel, it will try to
initialize the debug monitors. If an error occurs, Linux won't use this
feature.
This implementation makes Xen expose a minimal set of registers, which lets
the guest think HW debug won't work.
Signed-off-by: Julien Grall <julien.grall@linaro.org>
[ ijc -- s/DBGCR/DBGBCR/ to use correct register name ] Acked-by: Ian Campbell <ian.campbell@citrix.com>
xen/arm: Implement a dummy Performance Monitor for ARM32
XSA-93 (commit 0b18220 "xen/arm: Don't let guess access to Debug and Performance
Monitor registers") disabled the Performance Monitor.
When CONFIG_PERF_EVENTS is enabled in the Linux kernel, regardless of
ID_DFR0 (which tells whether the Performance Monitors Extension is implemented) the
kernel will try to access PMCR.
Therefore we tell the guest we have 0 counters. Unfortunately we must always
support PMCCNTR (the cycle counter): we just implement RAZ/WI for all PM registers,
which doesn't crash the kernel at least.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Karim Raslan [Wed, 11 Jun 2014 10:30:15 +0000 (11:30 +0100)]
mini-os: moved events code under arch
This is all code motion, except that we now initialise
the ev_actions array before calling the arch-specific code
to make it more robust against future changes.
Signed-off-by: Karim Allah Ahmed <karim.allah.ahmed@gmail.com>
[talex5@gmail.com: separated from big ARM commit] Signed-off-by: Thomas Leonard <talex5@gmail.com> Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
Karim Raslan [Wed, 11 Jun 2014 10:30:14 +0000 (11:30 +0100)]
mini-os: tidied up code
Signed-off-by: Karim Allah Ahmed <karim.allah.ahmed@gmail.com>
[talex5@gmail.com: separated from big ARM commit] Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
[talex5@gmail.com: use __func__ in DEBUG macro]
[talex5@gmail.com: drop text about "xm create"] Signed-off-by: Thomas Leonard <talex5@gmail.com>
Client requests are safe to compile into code for running outside of
valgrind. Therefore, enable client requests whenever autoconf can find
memcheck.h and debug builds are enabled.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- reran autogen.sh ]
Boris Ostrovsky [Wed, 11 Jun 2014 08:55:43 +0000 (10:55 +0200)]
x86/VPMU: mark context LOADED before registers are loaded
Because a PMU interrupt may be generated as soon as PMU registers are
loaded (or, more precisely, as soon as HW PMU is "armed") we don't want
to delay marking context as LOADED until after registers are loaded.
Otherwise during interrupt handling VPMU_CONTEXT_LOADED may not be set
and this could be confusing.
(Technically, only SVM needs this change right now since VMX will "arm"
PMU later, during VMRUN when global control register is loaded from
VMCS. However, both AMD and Intel code will require this patch when we
introduce PV VPMU.)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Tested-by: Dietmar Hahn <dietmar.hahn@ts.fujitsu.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Tue, 10 Jun 2014 21:21:40 +0000 (22:21 +0100)]
libxl: move some internal functions to libxl_internal.h
In 752f181f ("libxl_json: introduce parser functions for builtin types")
a bunch of parser functions are added to libxl_json.h, which breaks
GCC < 4.6.
These functions are internal and libxl_json.h is a public header, so move
them to libxl_internal.h.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Tue, 10 Jun 2014 14:07:59 +0000 (15:07 +0100)]
tools/libxc: Introduce ARRAY_SIZE() and replace handrolled examples
xen-hptool and xen-mfndump include xc_private.h. This is bad, but not trivial
to fix, so they gain a protective #undef and a stern comment.
MiniOS leaks ARRAY_SIZE into the libxc namespace as part of a stubdom build.
Therefore, xc_private.h gains an #ifndef until MiniOS is fixed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Despite my 'Reviewed-by' tag on c/s 65e3554908 "x86/PV: support data
breakpoint extension registers", I have re-evaluated my position as far as the
hypercall interface is concerned.
Previously, for the sake of not modifying the migration code in libxc,
XEN_DOMCTL_get_ext_vcpucontext would jump through hoops to return -ENOBUFS if
and only if MSRs were in use and no buffer was present.
This is fragile, and awkward from a toolstack point-of-view when actually
sending MSR content in the migration stream. It also complicates fixing a
further race condition, between querying the number of MSRs for a vcpu, and
the vcpu touching a new one.
As this code is still only in unstable, take this opportunity to redesign the
interface. This patch introduces the brand new XEN_DOMCTL_{get,set}_vcpu_msrs
subops.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Arianna Avanzini [Sun, 25 May 2014 10:51:42 +0000 (12:51 +0200)]
arch/arm: domain build: let dom0 access I/O memory of mapped devices
Currently, dom0 is allowed access to the I/O memory ranges used
to access devices exposed to it, but it doesn't have those
ranges in its iomem_caps. This commit implements the correct
bookkeeping in the generic function which actually maps a
device's I/O memory to the domain, adding the ranges to the
domain's iomem_caps.
NOTE: This commit suffers from the following limitations:
. with this patch, I/O memory ranges pertaining to disabled
devices are not mapped;
. the "iomem" option could be used to map memory ranges that
are not described in the device tree.
In both these cases, this patch does not allow the domain
the privileges needed to map the needed I/O memory ranges
afterwards.
Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Acked-by: Ian Campbell <Ian.Campbell@eu.citrix.com> Acked-by: Julien Grall <julien.grall@citrix.com> Cc: Dario Faggioli <dario.faggioli@citrix.com> Cc: Paolo Valente <paolo.valente@unimore.it> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Keir Fraser <keir@xen.org> Cc: Tim Deegan <tim@xen.org> Cc: Ian Jackson <Ian.Jackson@eu.citrix.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Eric Trudeau <etrudeau@broadcom.com> Cc: Viktor Kleinik <viktor.kleinik@globallogic.com>
Generate JSON map handle according to KeyedUnion discriminator.
The original JSON output for a keyed union is like:
    {
        ...
        "u" : { FIELDS }
        ...
    }
The discriminator is not generated, so the parser won't be able to
figure out the fields in the incoming stream.
So we need to change this to something more sensible. For example, for
keyed union libxl_domain_type, which has a discriminator called "type",
we generate the following for an HVM guest:
    {
        ...
        "type.hvm" : { HVM FIELDS }
        ...
    }
The parser can then know the type of this union and how to interpret the
incoming stream.
Note that we change the existing API here. However, the original output is
quite broken anyway: we cannot make sensible use of it and I doubt that
there is any existing user of the existing API. So we are actually fixing a
problem.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
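The "discriminator.value" key scheme described in the commit above lets a parser recover the union member from the key alone. The sketch below illustrates the idea with Python's json module; it is not the libxl IDL generator, and the helper names and the "name"/"pae" fields are made up for the example.

```python
import json


def gen_keyed_union(discriminator, kind, fields):
    """Emit the union under a '<discriminator>.<kind>' key."""
    return {f"{discriminator}.{kind}": fields}


def parse_keyed_union(obj, discriminator):
    """Return (kind, fields) for the first key carrying the discriminator."""
    prefix = discriminator + "."
    for key, value in obj.items():
        if key.startswith(prefix):
            return key[len(prefix):], value
    return None, None


doc = {"name": "guest0"}                                   # made-up field
doc.update(gen_keyed_union("type", "hvm", {"pae": True}))  # made-up field
text = json.dumps(doc)
print(text)
print(parse_keyed_union(json.loads(text), "type"))
```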
Wei Liu [Mon, 9 Jun 2014 12:43:18 +0000 (13:43 +0100)]
libxl IDL: rename json_fn to json_gen_fn
This json_fn is in fact used to generate the string representation of a JSON
data structure. We will introduce another json function to parse JSON
data structures in a later changeset, so rename json_fn to json_gen_fn to
clarify.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Wei Liu [Mon, 9 Jun 2014 12:43:17 +0000 (13:43 +0100)]
libxl: fix JSON generator for uint64_t
yajl_gen_integer cannot cope with uint64_t, because it takes a signed
long long. If we pass it a uint64_t number which is larger than the signed
maximum (LLONG_MAX), it generates a negative number. Later when we feed this
generated number into the parser, the result gets sign-extended, which is
wrong.
A new function called libxl__uint64_gen_json is introduced to handle
uint64_t. It utilises yajl_gen_number to generate numbers.
Also removed a duplicated definition of MemKB while I was there.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
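The problem fixed by the commit above can be reproduced outside of yajl: reinterpreting a large unsigned 64-bit value as a signed long long makes it negative, whereas emitting the decimal digits directly preserves it. This sketch uses Python's ctypes to mimic the C conversion and does not call yajl; both helper names are hypothetical.

```python
import ctypes


def as_signed_long_long(value):
    """Mimic passing a uint64_t where a signed long long is expected."""
    return ctypes.c_longlong(value).value


def gen_uint64(value):
    """Emit the number as its decimal digits, keeping the full range."""
    if not 0 <= value <= 2**64 - 1:
        raise ValueError("not a uint64_t value")
    return str(value)


big = 2**64 - 1
print(as_signed_long_long(big))  # the sign bit is set, so this is negative
print(gen_uint64(big))           # the full unsigned value survives
```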
Wei Liu [Mon, 9 Jun 2014 12:43:16 +0000 (13:43 +0100)]
xl: remove parsing of "vncviewer" option in xl domain config file
Print out a warning and suggest that the user use the "-V" option when invoking "xl
create". Also remove that option from the manpage. This will introduce a
minor functional regression but it's very easy to work around.
The rationale behind this change is that this option is actually not
part of the domain configuration. It just affects whether a vncviewer
should be automatically spawned, but has nothing to do with how a domain
should be constructed. This option is also bogus: if you
migrate a domain to a remote host and the receiver spawns a vncviewer on
the receiving side, then it either dies silently or occupies resources.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Mon, 9 Jun 2014 15:41:08 +0000 (16:41 +0100)]
tools/libxc: Use _Static_assert if available
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Mon, 9 Jun 2014 15:41:07 +0000 (16:41 +0100)]
tools/libxc: Annotate xc_osdep_log with __attribute__((format))
This helps the compiler spot printf formatting errors.
Fix up resulting errors in xenctrl_osdep_ENOSYS.c. Substitute %p for the
slightly less bad %lx when trying to format an opaque structure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Mon, 9 Jun 2014 15:41:06 +0000 (16:41 +0100)]
tools/libxc: Annotate xc_report_error with __attribute__((format))
This helps the compiler spot printf formatting errors.
Fix up all errors discovered.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Andrew Cooper [Tue, 10 Jun 2014 11:13:47 +0000 (13:13 +0200)]
x86/traps: const-correctness for IST handlers
NMI and MCE interrupt handlers have no right to modify their exception frame
or underlying vcpu registers. Apply liberal quantities of 'const' to 'struct
cpu_user_regs *' throughout the codebase.
The Double Fault handler, while an IST handler, reloads some extra
architectural state back into its regs parameter. As this is for printing
purposes and on a terminal error path, the const requirements for #DF are
relaxed.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 10 Jun 2014 11:12:05 +0000 (13:12 +0200)]
x86/HVM: refine SMEP/SMAP tests in HVM_CR4_GUEST_RESERVED_BITS()
Andrew validly points out that the use of the macro on the restore path
can't rely on the CPUID bits for the guest already being in place (as
their setting by the tool stack in turn requires the other restore
operations already having taken place). And even worse, using
hvm_cpuid() is invalid here because that function assumes to be used in
the context of the vCPU in question.
Reverting to the behavior prior to the change from checking
cpu_has_sm?p to hvm_vcpu_has_sm?p() would break the other (non-restore)
use of the macro. So let's revert to the prior behavior only for the
restore path, by adding a respective second parameter to the macro.
Obviously the two cpu_has_* uses in the macro should really also be
converted to hvm_cpuid() based checks at least for the non-restore
path.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: David Vrabel <david.vrabel@citrix.com>
Juergen Gross [Tue, 10 Jun 2014 10:04:08 +0000 (12:04 +0200)]
avoid crash on HVM domain destroy with PCI passthrough
c/s bac6334b5 "move domain to cpupool0 before destroying it" introduced a
problem when destroying a HVM domain with PCI passthrough enabled. The
moving of the domain to cpupool0 includes moving the pirqs to the cpupool0
cpus, but the event channel infrastructure already is unusable for the
domain. So just avoid moving pirqs for dying domains.
Andrew Cooper [Tue, 10 Jun 2014 10:03:16 +0000 (12:03 +0200)]
x86/domctl: further fix to XEN_DOMCTL_[gs]etvcpuextstate
Do not clobber errors from certain codepaths. Clobbering of -EINVAL from
failing "evc->size <= PV_XSAVE_SIZE(_xcr0_accum)" was a pre-existing bug.
However, clobbering -EINVAL/-EFAULT from the get codepath was a bug
unintentionally introduced by 090ca8c1 "x86/domctl: two functional fixes to
XEN_DOMCTL_[gs]etvcpuextstate".
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 5 Jun 2014 15:57:07 +0000 (17:57 +0200)]
x86/amd: protect set_cpuidmask() against #GP faults
Virtual environments such as Xen HVM containers and VirtualBox do not
necessarily provide support for feature masking MSRs.
As their presence is detected by model numbers alone, and their use predicated
on command line parameters, use the safe() variants of {wr,rd}msr() to avoid
dying with an early #GP fault.
In fact, use the password variants in all cases because:
a) they are safe to use even if not strictly required
b) they have a more useful function prototype for this purpose
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
If there's a guest using VMX/SVM when the hypervisor shuts down, it
can lead to the following crash due to VMX/SVM functions being called
after hvm_cpu_down has been called. In order to prevent that, check in
{svm/vmx}_ctxt_switch_from that the cpu virtualization extensions are
still enabled.
Andrew Cooper [Thu, 5 Jun 2014 15:52:57 +0000 (17:52 +0200)]
x86/domctl: two functional fixes to XEN_DOMCTL_[gs]etvcpuextstate
Interacting with the vcpu itself should be protected by vcpu_pause().
Buggy/naive toolstacks might encounter adverse interaction with a vcpu context
switch, or an increase of xcr0_accum. There are no such problems with current
in-tree code.
Explicitly permit a NULL guest handle as being a request for size. It is the
prevailing Xen style, and without it, valgrind's ioctl handler is unable to
determine whether evc->buffer actually got written to.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
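The "NULL guest handle means a request for size" convention from the commit above leads to the usual two-call pattern: ask for the size first, then call again with a buffer. A minimal sketch of that convention, with None standing in for the NULL handle; get_vcpu_ext_state is a hypothetical helper, not the domctl itself.

```python
def get_vcpu_ext_state(state, buffer=None):
    """Return the required size; copy the state only if a buffer is given."""
    size = len(state)
    if buffer is None:
        return size  # size query: nothing is written
    if len(buffer) < size:
        raise ValueError("buffer too small")
    buffer[:size] = state
    return size


saved = bytes(range(16))            # stand-in for a vcpu's xsave area
needed = get_vcpu_ext_state(saved)  # first call: query the size
buf = bytearray(needed)
get_vcpu_ext_state(saved, buf)      # second call: fetch the contents
print(needed, bytes(buf) == saved)
```

Because the size query never touches a buffer, a tool observing the call can tell unambiguously whether the buffer was written.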
Andrew Cooper [Thu, 5 Jun 2014 15:52:11 +0000 (17:52 +0200)]
x86/xsave: add fastpath for common xstate_ctxt_size() requests
xstate_ctxt_size(xfeature_mask) is a runtime constant after boot, and is used for bounds
checking when handling xsave state. Avoid reloading xcr0 twice to obtain a
number which has already been calculated.
Also annotate xfeature_mask as __read_mostly as it is only ever written once.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
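The fastpath in the commit above is a cache for the one argument that is asked for most often. The sketch below shows that shape only: the size is computed once for the boot-time mask and returned directly for later requests with the same mask. The helper names and the size formula are made up and unrelated to the real xsave layout.

```python
def make_xstate_ctxt_size(xfeature_mask, compute_size):
    """Build a size function with a fast path for the boot-time mask."""
    boot_size = compute_size(xfeature_mask)  # computed once, up front

    def xstate_ctxt_size(mask):
        if mask == xfeature_mask:
            return boot_size       # fast path: reuse the stored value
        return compute_size(mask)  # slow path for any other mask

    return xstate_ctxt_size


calls = []


def slow_size(mask):
    """Made-up stand-in for the expensive computation."""
    calls.append(mask)
    return 512 + 64 + 256 * bin(mask >> 2).count("1")


ctxt_size = make_xstate_ctxt_size(0b111, slow_size)
print(ctxt_size(0b111), len(calls))  # served from the stored value
```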
Jan Beulich [Thu, 5 Jun 2014 15:49:14 +0000 (17:49 +0200)]
VT-d: honor APEI firmware-first mode in XSA-59 workaround code
When firmware-first mode is being indicated by firmware, we shouldn't
be modifying AER registers - these are considered to be owned by
firmware in that case. Violating this is being reported to result in
SMI storms. While circumventing the workaround means re-exposing
affected hosts to the XSA-59 issues, this in any event seems better
than not booting at all. Respective messages are being issued to the
log, so the situation can be diagnosed.
The basic building blocks were taken from Linux 3.15-rc. Note that
this includes a block of code enclosed in #ifdef CONFIG_X86_MCE - we
don't define that symbol, and that code also wouldn't build without
suitable machine check side code added; that should happen eventually,
but isn't subject of this change.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Malcolm Crossley <malcolm.crossley@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Malcolm Crossley <malcolm.crossley@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Yang Zhang <yang.z.zhang@intel.com>
Jan Beulich [Thu, 5 Jun 2014 15:46:13 +0000 (17:46 +0200)]
x86/HVM: make vmsi_deliver() return proper error values
... and propagate this from hvm_inject_msi(). In the course of this I
spotted further room for cleanup:
- vmsi_inj_irq()'s struct domain * parameter was unused
- vmsi_deliver() pointlessly passed on dest_ExtINT to vmsi_inj_irq()
(which that one validly refused to handle)
- vmsi_inj_irq()'s sole caller guarantees a proper delivery mode (i.e.
rather than printing an obscure message we can just BUG())
- some formatting and log message quirks
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 5 Jun 2014 15:45:27 +0000 (17:45 +0200)]
x86/HVM: properly propagate errors from HVMOP_inject_msi
There are a number of ways this operation can go wrong, all of which
got ignored so far.
In the context of this I wonder whether map_domain_emuirq_pirq()
returning 0 in the "already mapped" case is really intended to be that
way (this is why the subsequent NULL check here can't be an ASSERT()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 3 Jun 2014 13:13:48 +0000 (14:13 +0100)]
docs: Support building pdfs from markdown using pandoc
The Xen command line parameters document is far more useful as an indexed pdf
than it is as unindexed html webpage.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- reran autogen.sh ]
Most of the functions follow the proper style, but these
two are the odd ones out.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 4 Jun 2014 13:58:38 +0000 (14:58 +0100)]
xen: arm: ensure we hold a reference to guest pages while we copy to/from them
This at once:
- prevents the page from being reassigned under our feet
- ensures that the domain owns the page, which stops a domain from giving a
grant mapping, MMIO region, other non-RAM as a hypercall input/output.
We need to hold the p2m lock while doing the lookup until we have the
reference.
This also requires that during domain 0 building current is set to an actual
dom0 vcpu, so take care of this at the same time as the p2m is temporarily
loaded.
Lastly when dumping the guest stack we need to make sure that the guest hasn't
pointed its sp off into the weeds and/or misaligned it, which could lead to
hypervisor traps. Solve this by using the new function and checking alignment
first.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Ian Campbell [Wed, 4 Jun 2014 13:58:36 +0000 (14:58 +0100)]
xen: arm: check permissions when copying to/from guest virtual addresses
In particular we need to make sure the guest has write permissions to buffers
which it passes as output buffers for hypercalls, otherwise the guest can
overwrite memory which it shouldn't be able to write (like r/o grant table
mappings).
This is XSA-98.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
Mukesh Rathor [Wed, 4 Jun 2014 09:27:50 +0000 (11:27 +0200)]
x86/PVH: avoid call to handle_mmio
handle_mmio() is currently unsafe for pvh guests. A call to it would
result in call to vioapic_range that will crash xen since the vioapic
ptr in struct hvm_domain is not initialized for pvh guests.
However, one path exists for such a call. If a pvh guest, dom0 or domU,
unintentionally touches non-existing memory, an EPT violation would occur.
This would result in unconditional call to hvm_hap_nested_page_fault. In
that function, because get_gfn_type_access returns p2m_mmio_dm for non
existing mfns by default, handle_mmio() will get called. This would result
in xen crash instead of the guest crash. This patch addresses that.
Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Use < instead of <= (which I wrongly suggested), return -ENODATA
instead of -EINVAL, and make description match code.
Jan Beulich [Wed, 4 Jun 2014 09:24:33 +0000 (11:24 +0200)]
VT-d: replace another fixmap use with ioremap()
... making the code more generic and limiting address space consumption
(however small it might be) to just those machines that need this
mapping (this is an erratum workaround after all).
At the same time properly map the full needed range from the base
address instead of just the third page and fix some formatting.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 3 Jun 2014 13:17:14 +0000 (15:17 +0200)]
x86/HVM: eliminate vulnerabilities from hvm_inject_msi()
- pirq_info() returns NULL for a non-allocated pIRQ, and hence we
mustn't unconditionally de-reference it, and we need to invoke it
another time after having called map_domain_emuirq_pirq()
- don't use printk(), namely without XENLOG_GUEST, for error reporting
Andrew Cooper [Tue, 3 Jun 2014 10:00:53 +0000 (12:00 +0200)]
x86/xsave: remove xfeat_mask checking from validate_xstate()
validate_xstate() is called on codepaths which load new vcpu xsave state from
XEN_DOMCTL_{setvcpuextstate,sethvmcontext}, usually as part of migration. In
both cases, this is the xfeature_mask of the saving Xen rather than the
restoring Xen.
Given that the xsave state itself is checked for consistency and validity on
the current cpu, checking whether it was valid for the cpu before migration is
not interesting (or indeed relevant, as the error can't be distinguished from
the other validity checking).
This change removes the need to pass the saving Xen's xfeature_mask,
simplifying the toolstack code and migration stream format in this area.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Tue, 27 May 2014 11:11:41 +0000 (12:11 +0100)]
xen/arm: grant: Add another entry to map MFN 1:1 in dom0 p2m
Grant mappings can be used for DMA requests. Currently the dev_bus_addr returned
by the hypercall is the MFN (not the IPA). The guest expects to be able to use the returned
address for DMA. When the device is protected by IOMMU the request will fail.
Therefore, we have to add 1:1 mapping in the domain p2m to allow DMA request
to work.
This is valid because DOM0 has its memory mapped 1:1 and therefore we know
that RAM and devices cannot clash.
If the guest only owns protected devices, the returned dev_bus_addr should be an
IPA. This will allow us to safely remove the 1:1 mapping and make grant mapping
work correctly in the guest. For now, this is not addressed by this patch.
The grant mapping code does the reference counting on every MFN and will
call iommu_{map,unmap}_page when necessary. This was already handled for x86
PV guests, so we can reuse the same code path for ARM guests.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
[ ijc s/ld/d/ in both arch's gnttab_need_iommu_mapping() ]
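The reference counting described above can be pictured with a minimal Python sketch. The class and method names are hypothetical and a plain set stands in for the IOMMU page table; this is an illustration of the idea (map on the first reference, unmap on the last), not the Xen grant table code.

```python
from collections import Counter


class GrantIommuTracker:
    """Hypothetical model: per-MFN refcount driving 1:1 IOMMU mappings."""

    def __init__(self):
        self.refs = Counter()   # MFN -> number of live grant mappings
        self.mapped = set()     # stand-in for the IOMMU page table

    def grant_map(self, mfn):
        self.refs[mfn] += 1
        if self.refs[mfn] == 1:
            # First reference: this is where iommu_map_page would be needed.
            self.mapped.add(mfn)

    def grant_unmap(self, mfn):
        if self.refs[mfn] == 0:
            raise ValueError("unmap of an MFN that is not grant mapped")
        self.refs[mfn] -= 1
        if self.refs[mfn] == 0:
            # Last reference gone: this is where iommu_unmap_page would be needed.
            self.mapped.discard(mfn)


tracker = GrantIommuTracker()
tracker.grant_map(0x1234)
tracker.grant_map(0x1234)
tracker.grant_unmap(0x1234)
print(0x1234 in tracker.mapped)  # True, one reference is still held
tracker.grant_unmap(0x1234)
print(0x1234 in tracker.mapped)  # False
```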
Julien Grall [Tue, 27 May 2014 11:11:40 +0000 (12:11 +0100)]
drivers/passthrough: arm: Add support for SMMU drivers
This patch adds support for the ARM architected SMMU driver. It's based on the
Linux driver (drivers/iommu/arm-smmu) commit 89ac23cd.
The major differences with the Linux driver are:
- Fault by default if the SMMU is enabled to translate an
address (Linux is bypassing the SMMU)
- Using P2M page table instead of creating new one
- Dropped stage-1 support
- Dropped chained SMMUs support for now
- Reworking device assignment and the different structures
Xen is programming each IOMMU by:
- Using stage-2 mode translation
- Sharing the page table with the processor
- Injecting a fault if the device has made a wrong translation
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Wed, 14 May 2014 09:10:04 +0000 (10:10 +0100)]
tools: Use SeaBIOS's defconfig
Compared with our local config this enables CONFIG_BOOTSPLASH and disables
CONFIG_ATA_DMA and CONFIG_ATA_PIO32.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Wed, 14 May 2014 09:10:03 +0000 (10:10 +0100)]
tools: update to seabios rel-1.7.4
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Tested-by: Fabio Fantoni <fabio.fantoni@m2r.biz> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Thu, 22 May 2014 09:46:44 +0000 (10:46 +0100)]
tools: arm: increase size of region set aside for guest grant table
The current size is sufficient for the default maximum grant table size
(32 frames), but increase the reserved region to 16M/4096 pages to allow for
the use of the gnttab_max_nr_frames command line option.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Thu, 22 May 2014 09:46:43 +0000 (10:46 +0100)]
tools: arm: support up to (almost) 1TB of guest RAM
This creates a second bank of RAM starting at 8GB and potentially
extending to the 1TB boundary, which is the limit imposed by our
current use of a 3 level p2m with 2 pages at level 0 (40 address bits, i.e. 2^40 bytes).
I've deliberately left a gap between the two banks just to
exercise those code paths.
The second bank is 1016GB in size which plus the 3GB below 4GB is
1019GB maximum guest RAM. At the point where the fact that this
is slightly less than a full TB starts to become an issue for
people then we can switch to a 4 level p2m, which would be needed
to support guests larger than 1TB anyhow.
Tested on 32-bit with 1, 4 and 6GB guests. Anything more than
~3GB requires an LPAE enabled kernel, or a 64-bit guest.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
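The bank arithmetic above can be checked with a short Python sketch. The constant names and the split_ram helper are hypothetical; only the figures (3GB below 4GB, second bank at 8GB, 1TB limit) come from the description.

```python
GB = 1 << 30

BANK0_SIZE = 3 * GB      # RAM below 4GB
BANK1_BASE = 8 * GB      # second bank starts at 8GB
P2M_LIMIT = 1 << 40      # 1TB boundary imposed by the 3 level p2m

BANK1_SIZE = P2M_LIMIT - BANK1_BASE
MAX_GUEST_RAM = BANK0_SIZE + BANK1_SIZE


def split_ram(total):
    """Hypothetical helper: fill the first bank, spill the rest into the second."""
    if total > MAX_GUEST_RAM:
        raise ValueError("guest RAM exceeds the two banks")
    bank0 = min(total, BANK0_SIZE)
    return bank0, total - bank0


print(BANK1_SIZE // GB)      # 1016
print(MAX_GUEST_RAM // GB)   # 1019
print(tuple(size // GB for size in split_ram(6 * GB)))  # (3, 3)
```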
Ian Campbell [Thu, 22 May 2014 09:46:42 +0000 (10:46 +0100)]
tools: arm: prepare guest FDT building for multiple RAM banks
This required exposing the sizes of the banks determined by the domain builder
up to libxl via xc_dom_image.
Since the domain build needs to know the size of the DTB we create placeholder
nodes for each possible bank and when we finalise the DTB we fill in the ones
which are actually populated and NOP out the rest.
Note that the number of guest RAM banks is still 1 after this change.
Also fixes a coding style violation in
libxl__arch_domain_finalise_hw_description while there.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
[ ijc -- minor coding style fix ]
Ian Campbell [Thu, 22 May 2014 09:46:41 +0000 (10:46 +0100)]
tools: arm: prepare domain builder for multiple banks of guest RAM
Prepare for adding more banks of guest RAM by renaming a bunch of defines
as RAM0 and replacing variables with arrays and introducing loops.
Also in preparation switch to using GUEST_RAM0_BASE explicitly instead of
implicitly via dom->rambase_pfn (while asserting that they must be the same).
This makes the multiple bank case cleaner (although it looks a bit odd for
now).
GUEST_RAM_BASE is defined as the address of the lowest RAM bank, it is used in
tools/libxl/libxl_dom.c to call xc_dom_rambase_init().
Lastly for now ramsize (total size) and rambank_size[0] (size of first bank)
are the same, but use the appropriate one for each context.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Thu, 22 May 2014 09:46:40 +0000 (10:46 +0100)]
tools: arm: refactor code to setup guest p2m and fill it with RAM
This will help when we have more guest RAM banks.
Mostly code motion of the p2m_host initialisation and allocation loop into the
new function populate_guest_memory, but in addition in the caller we now
initialise the p2m to all INVALID_MFN to handle any holes, although in this
patch we still fill in the entire allocated region.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
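A minimal Python sketch of the split described above, assuming a list as the p2m and a stand-in sentinel value. The hole at the end is hypothetical and only shows why the caller pre-initialises everything; as noted, this patch itself still fills the entire region.

```python
INVALID_MFN = -1  # stand-in sentinel, not the real libxc constant


def populate_guest_memory(p2m, start, nr_pages, base_pfn):
    """Fill nr_pages entries of p2m from index start with consecutive frames."""
    for i in range(nr_pages):
        p2m[start + i] = base_pfn + i


total_pages = 8
p2m = [INVALID_MFN] * total_pages   # caller marks every entry invalid first
populate_guest_memory(p2m, start=0, nr_pages=6, base_pfn=0x40000)

holes = [i for i, mfn in enumerate(p2m) if mfn == INVALID_MFN]
print(holes)  # [6, 7]
```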
Ian Campbell [Thu, 22 May 2014 09:46:39 +0000 (10:46 +0100)]
tools: arm: rearrange guest physical address space to increase max RAM
By switching things around we can manage to expose up to 3GB of RAM to guests.
I deliberately didn't place the RAM at address 0 to avoid coming to rely on
this, so the various peripherals, MMIO and magic pages etc all live in the
lower 1GB leaving the upper 3GB available for RAM.
It would likely have been possible to reduce the space used by the peripherals
etc and allow for 3.5 or 3.75GB but I decided to keep things simple and will
handle >3GB memory in a subsequent patch.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Campbell [Thu, 22 May 2014 09:46:38 +0000 (10:46 +0100)]
tools: arm: move magic pfns out of guest RAM region
Because toolstacks (at least libxl) only allow RAM to be specified in 1M
increments these two pages were effectively costing 1M of guest RAM space.
Since these pages don't actually need to live in RAM just move them out.
With this a guest can now use the full 768M of the address space reserved
for RAM. (ok, not that impressive, but it simplifies things later)
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
--
v3: make the size of the region explicit.
v2: remove spurious w/s change
tools: arm: make the size of the magic page region explicit
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Ian Campbell [Thu, 22 May 2014 09:46:37 +0000 (10:46 +0100)]
tools: arm: report an error if the guest RAM is too large
Due to the layout of the guest physical address space we cannot support more
than 768M of RAM before overrunning the area set aside for the grant table. Due
to the presence of the magic pages at the end of the RAM region guests are
actually limited to 767M.
Catch this case during domain build and fail gracefully instead of obscurely
later on.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
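The 767M figure can be reproduced with a short Python sketch. The names are hypothetical, and two assumptions are borrowed from elsewhere in this log: that there are two magic pages, and that the toolstack only accepts RAM sizes in 1M increments.

```python
MB = 1 << 20
PAGE_SIZE = 4096

RAM_REGION_SIZE = 768 * MB   # space before the grant table area
NR_MAGIC_PAGES = 2           # assumption: two magic pages at the end of RAM


def max_guest_ram_mb():
    """Usable RAM in whole megabytes once the magic pages are subtracted."""
    usable = RAM_REGION_SIZE - NR_MAGIC_PAGES * PAGE_SIZE
    return usable // MB


def check_ram_size(ram_mb):
    """Fail early, at domain build time, if the request cannot fit."""
    limit = max_guest_ram_mb()
    if ram_mb > limit:
        raise ValueError(f"guest RAM {ram_mb}M exceeds maximum of {limit}M")
    return ram_mb


print(max_guest_ram_mb())  # 767
```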
Ian Campbell [Thu, 22 May 2014 09:46:36 +0000 (10:46 +0100)]
tools: libxl: use uint64_t not unsigned long long for addresses
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Fri, 23 May 2014 10:32:01 +0000 (11:32 +0100)]
tools/xenstore: Fix memory leaks in the client
Free the expanding buffer and output buffer after use. Close the xenstore
handle after use.
The command line client is now valgrind-clean.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
xen/arm: vcpu: Correctly release resources when a VCPU fails to initialize
While I was adding new failing code at the end of the function, I noticed
that the vtimers are not freed which messes up all the timers and will crash
Xen quickly when the page is reused.
Currently neither vcpu_vgic_init nor vcpu_vtimer_init fails, so we
are safe for now. With the new GICv3 code, the former function will be able
to fail. This will result in a memory leak.
Call vcpu_destroy if the initialization has failed. We also need to add a
boolean to know if the vtimers are correctly setup as the timer common code
doesn't have any safeguard against removing a non-initialized timer.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
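The role of the boolean can be sketched in Python as follows. All names here are hypothetical and the timer class only models the "no safeguard" behaviour described above; it is an illustration of the cleanup pattern, not the Xen code.

```python
class VTimer:
    """Models a timer whose removal is unsafe unless it was initialised."""

    def __init__(self):
        self.active = False

    def init(self):
        self.active = True

    def kill(self):
        if not self.active:
            raise RuntimeError("removing a non-initialized timer")
        self.active = False


class Vcpu:
    def __init__(self):
        self.timer = VTimer()
        self.timers_initialized = False  # the boolean added for safe teardown


def vcpu_destroy(v):
    # Only touch the timer if initialisation got that far.
    if v.timers_initialized:
        v.timer.kill()
        v.timers_initialized = False


def vcpu_initialise(v, vgic_init_ok=True):
    """Return True on success; on failure release whatever was set up."""
    if not vgic_init_ok:
        vcpu_destroy(v)   # timers not set up yet, so nothing is killed
        return False
    v.timer.init()
    v.timers_initialized = True
    return True


v = Vcpu()
print(vcpu_initialise(v, vgic_init_ok=False))  # False, and no exception raised
```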
Jason Andryuk [Fri, 16 May 2014 20:48:16 +0000 (16:48 -0400)]
libvchan: Make raw_get_{data_ready, buffer_space} match
For writing into a vchan, raw_get_buffer_space used >, allowing the full
ring size to be written. On the read side, raw_get_data_ready compared
the ring size with >=. This mismatch means a completely filled buffer
cannot be read. Fix this by making the size checks identical.
Signed-off-by: Jason Andryuk <andryuk@aero.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
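The mismatch can be illustrated with a minimal Python sketch. The function names and the 32-bit wrap-around arithmetic are assumptions modelled on the description above, not the actual libvchan source.

```python
MASK32 = 0xFFFFFFFF  # indices are assumed to be free-running 32-bit counters


def buffer_space(prod, cons, ring_size):
    """Writer side: bytes that may still be written (check uses '>')."""
    space = (ring_size - ((prod - cons) & MASK32)) & MASK32
    if space > ring_size:
        return 0  # inconsistent indices, report no space
    return space


def data_ready_old(prod, cons, ring_size):
    """Reader side before the fix: '>=' rejects a completely full ring."""
    ready = (prod - cons) & MASK32
    if ready >= ring_size:
        return 0
    return ready


def data_ready_fixed(prod, cons, ring_size):
    """Reader side after the fix: same '>' comparison as the writer."""
    ready = (prod - cons) & MASK32
    if ready > ring_size:
        return 0
    return ready


RING = 1024
# An empty ring lets the writer fill all 1024 bytes...
print(buffer_space(prod=0, cons=0, ring_size=RING))          # 1024
# ...after which the old reader reports nothing to read.
print(data_ready_old(prod=RING, cons=0, ring_size=RING))     # 0
print(data_ready_fixed(prod=RING, cons=0, ring_size=RING))   # 1024
```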
Mukesh Rathor [Mon, 2 Jun 2014 08:31:49 +0000 (10:31 +0200)]
pvh dom0: allow get_pg_owner for translated domains if pvh
When creating a PV guest, the toolstack on a pvh dom0 will call do_mmuext_op
to pin guest tables. do_mmuext_op calls get_pg_owner, which must allow
foreign mappings for pvh.
Mukesh Rathor [Mon, 2 Jun 2014 08:30:47 +0000 (10:30 +0200)]
pvh dom0: add and remove foreign pages
In this patch, a new function, p2m_add_foreign(), is added
to map pages from a foreign guest into dom0 for various purposes
like domU creation, running xentrace, etc. Such pages are
typed p2m_map_foreign. Note, it is the nature of such pages
that a refcnt is held during their stay in the p2m. The
refcnt is added and released in the low level ept function
atomic_write_ept_entry. That macro is converted to a function to allow
for such refcounting, which only applies to leaf entries in the ept.
Furthermore, please note that paging/sharing is disabled if the
controlling or hardware domain is pvh. Any enabling of those features
would need to ensure refcnt are properly maintained for foreign types,
or paging/sharing is skipped for foreign types.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Tim Deegan <tim@xen.org>
Boris Ostrovsky [Mon, 2 Jun 2014 08:20:23 +0000 (10:20 +0200)]
x86: correctly report max number of hypervisor leaves
Commit def0bbd31 provided support for changing max number of
hypervisor cpuid leaves (in leaf 0x4000xx00). It also made the
hypervisor incorrectly report this number for guests that
use the default value (i.e. don't specify leaf 0x4000xx00 in config
file)
A failure would result in a log message like so:
(XEN) microcode: CPU0 update from revision 0x6000637 to 0x6000626 failed
^^^^^^^^^^^^^^^^^^^^^^
The above message has the revision numbers inverted. Fix this.
amd_k8.c did a lot of common work and very little K8
specific work. So merge init functions of amd_f10.c and
amd_k8.c and move it into the common amd_mcheck_init
handler. With that done, there is not much left in either
file, so fold all code into just one file - mce_amd.c
While at it, update the comments regarding documentation
with correct URLs and revision numbers.
Also, update copyright info.
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com> Acked-by: Christoph Egger <chegger@amazon.de>