Paul Durrant [Thu, 26 Nov 2015 14:48:41 +0000 (15:48 +0100)]
x86/viridian: flush remote tlbs by hypercall
The Microsoft Hypervisor Top Level Functional Spec. (section 3.4) defines
two bits in CPUID leaf 0x40000004:EAX for the hypervisor to recommend
whether or not to issue a hypercall for local or remote TLB flush.
Whilst it's doubtful whether using a hypercall for local TLB flush would
be any more efficient than a specific INVLPG VMEXIT, a remote TLB flush
may well be more efficiently done. This is because the alternative
mechanism is to IPI all the vCPUs in question which (in the absence of
APIC virtualisation) will require emulation and scheduling of the vCPUs
only to have them immediately VMEXIT for local TLB flush.
This patch therefore adds a viridian option which, if selected, enables
the hypercall for remote TLB flush and implements it using ASID
invalidation for targeted vCPUs followed by an IPI only to the set of
CPUs that happened to be running a targeted vCPU (which may be the empty
set). The flush may be more severe than requested: the hypercall can
request a flush of only a specific address space (CR3), but Xen neither
keeps a mapping of ASID to guest CR3 nor allows invalidation of a specific
ASID. Even so, on a host with contended CPUs performance is still likely
to be better than a more specific flush using IPIs.
The implementation of the patch introduces per-vCPU viridian_init() and
viridian_deinit() functions to allow a scratch cpumask to be allocated.
This avoids needing to put this potentially large data structure on stack
during hypercall processing. It also modifies the hypercall input and
output bit-fields to allow a check for the 'fast' calling convention,
and a white-space fix in the definition of HVMPV_feature_mask (to remove
hard tabs).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
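The flush strategy described in the commit above (invalidate the ASID of every targeted vCPU, then IPI only the CPUs currently running one) can be illustrated with a minimal sketch. The types and names below (struct toy_vcpu, toy_remote_flush) are hypothetical and are not taken from Xen; a 64-bit integer stands in for the scratch cpumask.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_VCPUS 64

/* Hypothetical per-vCPU state; this is not Xen's struct vcpu. */
struct toy_vcpu {
    int running_on;   /* pCPU index if currently running, -1 otherwise */
    int asid_stale;   /* set when the vCPU must pick up a fresh ASID   */
};

/*
 * Mark the ASID of every targeted vCPU stale and return the set of
 * pCPUs (as a bit mask) that would need an IPI because they are
 * currently running a targeted vCPU. The result may be 0, in which
 * case no IPI is needed at all.
 */
static uint64_t toy_remote_flush(struct toy_vcpu *vcpus, unsigned int nr,
                                 uint64_t vcpu_mask)
{
    uint64_t pcpu_mask = 0;

    assert(nr <= TOY_MAX_VCPUS);

    for (unsigned int i = 0; i < nr; i++) {
        if (!(vcpu_mask & (UINT64_C(1) << i)))
            continue;                       /* vCPU not targeted */
        vcpus[i].asid_stale = 1;
        if (vcpus[i].running_on >= 0)
            pcpu_mask |= UINT64_C(1) << vcpus[i].running_on;
    }

    return pcpu_mask;
}
```

A targeted vCPU that is not running costs nothing beyond the stale marker, which is where the saving over a plain IPI broadcast would come from.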
Peng Fan [Wed, 25 Nov 2015 16:26:09 +0000 (17:26 +0100)]
public/event_channel.h: correct comment
According to the definition of structure evtchn_alloc_unbound,
the field is "domid_t remote_dom", not "rdom". So
use "remote_dom" in the comments instead of "rdom".
Daniel Kiper [Wed, 25 Nov 2015 16:24:36 +0000 (17:24 +0100)]
x86/boot: check for not allowed sections before linking
Currently the check for disallowed sections is performed just after
compilation. However, if compilation succeeds and the check fails, then
a second build will create xen.gz/xen.efi without any visible error.
This happens because the %.o: %.c recipe created the object file during
the first run and make does not execute this recipe during the second
run. So, look for disallowed sections before linking. This way the check
will be executed every time.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Shuai Ruan [Wed, 25 Nov 2015 16:24:17 +0000 (17:24 +0100)]
libxc: expose xsaves/xgetbv1/xsavec to hvm guest
This patch exposes xsaves/xgetbv1/xsavec to hvm guest.
The reserved bits of eax/ebx/ecx/edx must be cleaned up
when calling cpuid(0dh) with leaf 1 or 2..63.
According to the spec the following bits must be reserved:
For leaf 1, bits 03-04/08-31 of ecx are reserved. Edx is reserved.
For leaf 2...63, bits 01-31 of ecx are reserved. Edx is reserved.
But as no XSS features are currently supported, even in HVM guests,
for leaf 2...63, ecx should be zero at the moment.
Signed-off-by: Shuai Ruan <shuai.ruan@intel.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
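As a rough illustration of the masking rules listed in the commit above, the following sketch clears the reserved ecx/edx bits for the two cases the message spells out. The structure and function names are hypothetical and do not reflect the actual libxc code; eax/ebx handling is left out because the message does not detail it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for CPUID output registers. */
struct toy_cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* Bits 3-4 and 8-31 of ecx, as listed in the commit text for leaf 1. */
#define TOY_LEAF1_ECX_RESERVED ((UINT32_C(0x3) << 3) | ~UINT32_C(0xff))

static void toy_clean_xstate_leaf(uint32_t subleaf, struct toy_cpuid_regs *r)
{
    assert(subleaf >= 1 && subleaf <= 63);

    if (subleaf == 1)
        r->ecx &= ~TOY_LEAF1_ECX_RESERVED;
    else
        r->ecx = 0;   /* no XSS features exposed, so ecx is zeroed */

    r->edx = 0;       /* edx is reserved in both cases */
}
```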
Shuai Ruan [Wed, 25 Nov 2015 16:20:05 +0000 (17:20 +0100)]
x86/xsaves: enable xsaves/xrstors/xsavec in xen
This patch uses xsaves/xrstors/xsavec instead of xsaveopt/xrstor
to perform the xsave_area switching so that xen itself
can benefit from them when available.
For xsaves/xrstors/xsavec only the compact format is used. Add format
conversion support for guest OS migration. Also, PV guests will not
support xsaves/xrstors.
Signed-off-by: Shuai Ruan <shuai.ruan@linux.intel.com>
[dropped redundant uses of XRSTOR_FIXUP and fix formatting] Signed-off-by: Jan Beulich <jbeulich@suse.com>
Shuai Ruan [Wed, 25 Nov 2015 16:19:45 +0000 (17:19 +0100)]
x86/xsaves: using named operand instead numbered operand in xrstor
This is a prerequisite for the later xsaves patch. It introduces
a macro to handle the restore fixup, and uses named operands instead of
numbered operands in the restore fixup code.
Signed-off-by: Shuai Ruan <shuai.ruan@intel.com>
[with the expectation of later doing some cleanup:] Acked-by: Jan Beulich <jbeulich@suse.com>
Dependency files were getting left behind in the xen
directory (since 8b6ef9c152edceabecc7f90c811cd538a7b7a110),
so append the $(DEPS) to the clean rule that runs in the
hypervisor directory.
Signed-off-by: Jonathan Creekmore <jonathan.creekmore@gmail.com>
Jan Beulich [Wed, 25 Nov 2015 16:18:21 +0000 (17:18 +0100)]
console: make printk() line continuation tracking per-CPU
This avoids cases where split messages (with other than the initial
part not carrying a log level; single line messages only of course)
issued on multiple CPUs interfere with each other, causing messages to
be issued which are supposed to be suppressed due to the log level
setting. E.g.
    CPU A                   CPU B
    XENLOG_G_DEBUG "abc"
                            XENLOG_G_DEBUG "def\n"
    "xyz\n"
would cause the last message to be logged despite this obviously not
being intended (at default log levels).
Suggested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
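The interleaving shown in the commit above can be modelled with a small sketch of per-CPU continuation tracking. This is a simplified toy model under stated assumptions: the names are hypothetical, a negative level stands for "no log level prefix", and a fresh message without a level is always logged, which simplifies Xen's real default-level handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NR_CPUS 4

/* Hypothetical per-CPU tracking state; Xen's real variables differ. */
struct toy_line_state {
    bool in_line;   /* a line was started on this CPU and not yet ended */
    bool print;     /* whether that line passed the log level check     */
};

static struct toy_line_state toy_state[TOY_NR_CPUS];

/*
 * Decide whether a message fragment issued on 'cpu' gets logged.
 * 'level' < 0 means the fragment carries no log level prefix; lower
 * numbers are more severe and 'threshold' is the highest level logged.
 */
static bool toy_should_log(unsigned int cpu, int level, int threshold,
                           const char *text)
{
    assert(cpu < TOY_NR_CPUS);

    struct toy_line_state *s = &toy_state[cpu];
    size_t len = strlen(text);
    bool print;

    if (s->in_line && level < 0)
        print = s->print;   /* continuation inherits the line's decision */
    else
        print = level < 0 || level <= threshold;

    /* The line stays open until a fragment ends with a newline. */
    s->in_line = len > 0 && text[len - 1] != '\n';
    s->print = print;

    return print;
}
```

With a single global state instead of the per-CPU array, the "def\n" fragment from CPU B would close the line started by CPU A, and "xyz\n" would then be treated as a fresh message and logged.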
Julien Grall [Wed, 18 Nov 2015 17:28:02 +0000 (17:28 +0000)]
xen/arm: vgic: Re-order the register emulations to match the memory map
It helps to find quickly whether we forgot to emulate a register or not.
At the same time add the missing reserved/implementation defined
registers. All other missing registers will be added in a follow-up if
necessary.
Note that only the distributor register map explicitly states the
size of a register (see 8.8 in ARM IHI 0069A). When the size is not
known, the implementation defined/reserved registers may not be emulated
correctly.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Wed, 18 Nov 2015 17:28:00 +0000 (17:28 +0000)]
xen/arm: vgic: Properly emulate the full register
The offset in the emulation is expressed in bytes. As most of the
registers are 32 or 64 bits wide, they span multiple bytes.
However, the current emulation only matches the first offset, so an
access to a register at any other offset is not emulated properly.
Introduce new macros to help implement accesses spanning multiple bytes
and use them throughout the vGIC emulation.
Note that I didn't convert the reserved/implementation defined
registers. It will be done in a follow-up.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
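The idea of matching every byte offset covered by a register, rather than only its first byte, can be sketched as follows. The helper name and its signature are hypothetical; the macros actually introduced by the patch may be shaped quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when 'offset' falls anywhere inside the register that
 * starts at byte offset 'reg' and is 'width' bytes wide (4 or 8).
 */
static bool toy_hits_reg(uint32_t offset, uint32_t reg, unsigned int width)
{
    assert(width == 4 || width == 8);
    return offset >= reg && offset - reg < width;
}
```

For a 32-bit register assumed to sit at offset 0x100, a byte access at 0x102 matches while an access at 0x104 belongs to the next register.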
Julien Grall [Wed, 18 Nov 2015 17:27:59 +0000 (17:27 +0000)]
xen/arm: vgic-v3: Only emulate identification registers required by the spec
Most of the identification registers space contains implementation
defined registers (see 8.1.13 in ARM IHI 0069A) and only GIC{D,R}_PIDR2
is required to be implemented.
Currently the emulation of those registers mimics the ARM implementation,
but it's untrue to say that we properly emulate such an implementation.
Keep only GIC{D,R}_PIDR2 implemented with the "implementation defined
bits" to zero and the ArchRev field (bits[7:4]) to 0x3 as we emulate a
GICv3.
Note that the emulation of the range wasn't valid anyway because the
registers are split in 2 sets (PIDR4-PIDR7 and PIDR0-PIDR2).
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
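The PIDR2 value described in the commit above (ArchRev in bits[7:4] set to 0x3, all implementation defined bits zero) works out as in this small sketch. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PIDR2_ARCHREV_SHIFT 4
#define TOY_PIDR2_ARCHREV_MASK  (UINT32_C(0xf) << TOY_PIDR2_ARCHREV_SHIFT)
#define TOY_ARCHREV_GICV3       UINT32_C(0x3)

/* Value returned for a read of GIC{D,R}_PIDR2 in this sketch. */
static uint32_t toy_pidr2_value(void)
{
    uint32_t v = TOY_ARCHREV_GICV3 << TOY_PIDR2_ARCHREV_SHIFT;

    /* Everything outside ArchRev stays zero. */
    assert((v & ~TOY_PIDR2_ARCHREV_MASK) == 0);
    return v;
}
```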
Each ITARGETSR register is 4-bytes wide and the offset is in bytes.
The current implementation computes the offsets of ICFGR1 and ICFG2
wrongly, so that only the first 2 bytes of the ICFGR<n> range are emulated
as read-only. The rest will be treated as read-write.
For convenience introduce ITARGETSR1 and ITARGETSR2.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- typoes in commit message ]
Julien Grall [Wed, 18 Nov 2015 16:42:43 +0000 (16:42 +0000)]
xen/arm: vgic-v3: Support 32-bit access for 64-bit registers
Based on 8.1.3 (IHI 0069A), unless stated otherwise, the 64-bit registers
support both 32-bit and 64-bit accesses.
All the registers we properly emulate (i.e. not RAZ/WI) support 32-bit access.
For RAZ/WI, it also seems to be the case but I'm not 100% sure. Anyway,
emulating 32-bit access for them doesn't hurt. Note that we will need
some extra care when they are implemented (for instance GICR_PROPBASER).
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Wed, 18 Nov 2015 16:42:41 +0000 (16:42 +0000)]
xen/arm: vgic: Optimize the way to store the target vCPU in the rank
Xen is currently directly storing the value of GICD_ITARGETSR register
(for GICv2) and GICD_IROUTER (for GICv3) in the rank. This makes the
emulation of the registers access very simple but makes the code to get
the target vCPU for a given vIRQ more complex.
While the target vCPU of a vIRQ is retrieved every time a vIRQ is
injected into the guest, the access to the register occurs less often.
So the data structure should be optimized for the most common case
rather than the inverse.
This patch introduces an array to store the target vCPU of
every interrupt in the rank. This makes retrieving the target
very quick. The emulation code now has to generate the GICD_ITARGETSR
and GICD_IROUTER register values on read access, and split them on write
access to store the targets in a convenient way.
With the new way to store the target vCPU, the structure vgic_irq_rank
is shrunk down from 320 bytes to 92 bytes. This is saving about 228
bytes of memory allocated separately per vCPU.
Note that with these changes, any read of those registers will list only
the target vCPU used by Xen. As the spec is not clear whether this is a
valid choice or not, OSes which have a different interpretation of the
spec (i.e. OSes which perform read-modify-write operations on these
registers) may not boot anymore on Xen. However, I think this is a fair
trade between memory usage in Xen (1KB less on a domain using 4 vCPUs
with no SPIs) and a strict interpretation of the spec (though not all
cases are clearly defined).
Furthermore, the implementation of the callback get_target_vcpu is now
exactly the same. Consolidate the implementation in the common vGIC code
and drop the callback.
Finally take the opportunity to fix coding style and replace "irq" by
"virq" to make clear that we are dealing with virtual IRQs in the
sections we are modifying.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
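The trade-off in the commit above (fast target lookup, register value rebuilt on read) can be sketched for the GICv2 case. The structure below is a hypothetical cut-down rank, not Xen's struct vgic_irq_rank, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_IRQS_PER_RANK 32

/* Hypothetical cut-down rank: one target vCPU ID per interrupt. */
struct toy_rank {
    uint8_t vcpu[TOY_IRQS_PER_RANK];
};

/* Common case: looking up the target is a plain array read. */
static unsigned int toy_get_target(const struct toy_rank *r, unsigned int virq)
{
    return r->vcpu[virq % TOY_IRQS_PER_RANK];
}

/*
 * Rare case: rebuild a GICv2-style ITARGETSR value for the four
 * interrupts starting at index 'first', one byte per interrupt with a
 * single bit set for the target vCPU.
 */
static uint32_t toy_read_itargetsr(const struct toy_rank *r, unsigned int first)
{
    uint32_t reg = 0;

    assert(first % 4 == 0 && first + 4 <= TOY_IRQS_PER_RANK);

    for (unsigned int i = 0; i < 4; i++) {
        assert(r->vcpu[first + i] < 8);   /* a target byte has 8 bits */
        reg |= (UINT32_C(1) << r->vcpu[first + i]) << (8 * i);
    }

    return reg;
}
```

Because only one bit per byte is ever reported, a guest that wrote a multi-CPU target list would read back a different value, which is the behaviour change the commit message warns about.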
Julien Grall [Wed, 18 Nov 2015 16:42:40 +0000 (16:42 +0000)]
xen/arm: vgic-v2: Don't ignore a write in ITARGETSR if one field is 0
The current implementation ignores the whole write if one of the fields is
0. However, based on the spec (4.3.12 IHI 0048B.b), 0 is a valid value
when:
- The interrupt is not wired in the distributor. From the Xen
point of view, it means that the corresponding bit is not set in
d->arch.vgic.allocated_irqs.
- The user wants to disable the IRQ forwarding in the distributor.
I.e. the IRQ stays pending in the distributor and is never received by
the guest.
Implementing the latter will require more work in Xen because we always
assume the interrupt is forwarded to a valid vCPU. So for now, ignore
any field where the value is 0.
The emulation of the write access of ITARGETSR has been reworked and
moved to a new function because it would have been difficult to
implement properly the behavior with the current code.
The new implementation breaks the register into 4 distinct bytes. For
each byte, it will check the validity of the target list, find the new
target, migrate the interrupt and store the value if necessary.
In the new implementation there is nearly no distinction based on the
access size, to avoid having too many different paths, which are harder
to test.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
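The byte-by-byte write handling described in the commit above might look roughly like the sketch below. All names are hypothetical, the migration step is reduced to a comment, and picking the lowest set bit as the new target is an assumption made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical storage: target vCPU of four consecutive interrupts. */
struct toy_targets {
    uint8_t vcpu[4];
};

/* Index of the lowest set bit in a non-zero byte. */
static unsigned int toy_lowest_bit(uint8_t b)
{
    unsigned int i = 0;

    assert(b != 0);
    while (!(b & 1)) {
        b >>= 1;
        i++;
    }
    return i;
}

/*
 * Apply a write to an ITARGETSR-like register. 'val' holds four target
 * list bytes; only bytes selected by 'byte_mask' (bit i selects byte i)
 * are considered. 'valid' is the bit mask of existing vCPUs.
 * Returns the number of fields whose target actually changed.
 */
static unsigned int toy_write_itargetsr(struct toy_targets *t, uint32_t val,
                                        unsigned int byte_mask, uint8_t valid)
{
    unsigned int updated = 0;

    for (unsigned int i = 0; i < 4; i++) {
        uint8_t list = (uint8_t)((val >> (8 * i)) & valid);

        if (!(byte_mask & (1u << i)) || list == 0)
            continue;   /* unselected byte or zero field: ignored */

        unsigned int target = toy_lowest_bit(list);

        if (t->vcpu[i] != target) {
            /* A real implementation would migrate the interrupt here. */
            t->vcpu[i] = (uint8_t)target;
            updated++;
        }
    }

    return updated;
}
```

A zero byte no longer discards the whole write; only that one field is skipped.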
Julien Grall [Wed, 18 Nov 2015 16:42:39 +0000 (16:42 +0000)]
xen/arm: vgic-v2: Handle correctly byte write in ITARGETSR
During a store, the byte is always in the low part of the register (i.e
[0:7]).
We are incorrectly masking the register by using a shift of the byte
offset in the ITARGETSR while the byte is always in r[0:7]. This will
result in a target list equal to 0 which is ignored by the emulation.
Because of that the guest will only be able to modify the first byte in
each ITARGETSR.
Furthermore, the body of the loop is retrieving the old target list
using the index of the byte.
To avoid modifying the loop too much, shift the stored byte to the correct
offset.
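The fix described above, moving the stored byte from r[0:7] to the position given by the byte offset, amounts to the following arithmetic. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * For a byte store, the data sits in bits [7:0] of 'r'. Shift it to
 * the byte lane selected by the low two bits of the register offset.
 */
static uint32_t toy_place_byte(uint32_t r, uint32_t offset)
{
    unsigned int shift = 8 * (offset & 0x3);

    assert(shift <= 24);
    return (r & 0xff) << shift;
}
```

Masking r with 0xff shifted by the byte offset, as the old code did, would read bits that a byte store never fills and so yield a target list of 0.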
Each ITARGETSR register is 4 bytes wide and the offset is in bytes.
The current implementation computes the end of the range wrongly,
so that only ITARGETSR{0,1} are emulated as read-only. The rest will be
treated as read-write.
As 8 registers should be read-only, the end of the range should be
ITARGETSR + (4 * 8) - 1.
For convenience introduce ITARGETSR7 and ITARGETSR8.
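The range computation above, ITARGETSR + (4 * 8) - 1, can be checked with a short sketch. The base offset 0x800 is an assumption made for this example and the names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ITARGETSR        UINT32_C(0x800)   /* assumed base offset */
/* Eight read-only registers of 4 bytes each; last byte is inclusive. */
#define TOY_ITARGETSR_RO_END (TOY_ITARGETSR + (4 * 8) - 1)

/* True when a byte offset at or above the base is in the read-only part. */
static bool toy_itargetsr_is_ro(uint32_t offset)
{
    assert(offset >= TOY_ITARGETSR);
    return offset <= TOY_ITARGETSR_RO_END;
}
```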
Call update_domain_wallclock_time at domain initialization.
Set time_offset_seconds to the number of seconds between physical boot
and domain initialization: it is going to be used to get/set the
wallclock time.
Add time_offset_seconds to system_time before calling do_settime,
so that system_time actually accounts for all the time in nsec between
machine boot and when the wallclock was set.
Expose xsm_platform_op to ARM.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Reviewed-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> CC: dgdegra@tycho.nsa.gov
Remove dummy arm implementation of wallclock_time.
Use shared_info() in common code rather than x86-ism to access it, when
possible.
Define the static variable wc_sec, and the local variable sec in
update_domain_wallclock_time, as uint64_t instead of unsigned long, to
avoid size issue on arm.
Take a uint64_t sec parameter in do_settime for the same reason.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> CC: JBeulich@suse.com CC: andrew.cooper3@citrix.com
[ ijc -- typoes in commit message ]
Paul Durrant [Wed, 25 Nov 2015 10:12:40 +0000 (11:12 +0100)]
public/io/netif.h: tidy up and remove duplicate comments
Now that requests and response types and extra info segments are
documented in block comments, we can get rid of the inline comments
in the structures. This has the happy side-effect of making the Linux
checkpatch.pl script make fewer complaints after import.
This patch also fixes a small whitespace issue in the initial boiler-
plate comment, and a typo in one of the ascii-art diagrams.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Paul Durrant [Wed, 25 Nov 2015 10:12:34 +0000 (11:12 +0100)]
public/io/netif.h: add definition of gso_prefix flag
This flag is defined here only for compatibility with the Linux variant of
this header. The feature has never been documented and should be
considered deprecated.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Paul Durrant [Wed, 25 Nov 2015 10:12:26 +0000 (11:12 +0100)]
public/io/netif.h: document the reality of netif_rx_request/reponse
Because GSO metadata is passed from backend to frontend using
netif_extra_info segments, which do not carry information stating which
netif_rx_request_t was consumed to free up their slot, frontends must
assume some form of identity relation between ring slot and request.
Hence, so that it is able to use GSO metadata, Linux netfront simply
assumes rx responses appear in the same ring slot as their corresponding
request.
This patch documents the assumption made by Linux netfront and the
necessity of the assumption (to support GSO) so that backends are coded
to be compatible.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Boris Ostrovsky [Tue, 24 Nov 2015 17:33:08 +0000 (18:33 +0100)]
x86/VPMU: Initialize VPMU's lvtpc vector
If a guest sets up performance counters so that they can generate
a PMC interrupt but does not initialize the APIC LVTPC register, the
resulting interrupt will cause an APIC error.
Note that a guest deciding to clear LVTPC in order to induce the error
will not be successful in achieving its goal: emulation code only
looks at the mask bit and always sets the vector to PMU_APIC_VECTOR.
Only the initial value of LVTPC (which is zero) that gets loaded into
APIC as result of PMC initialization is the problem.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Tue, 24 Nov 2015 16:41:04 +0000 (17:41 +0100)]
x86/kexec: hide more kexec infrastructure behind CONFIG_KEXEC
Experimenting with the kconfig series showed that various bits of kexec
infrastructure were still being unconditionally included. Make them
conditional on CONFIG_KEXEC.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Olaf Hering [Thu, 19 Nov 2015 08:32:52 +0000 (08:32 +0000)]
tools/hotplug: quote all variables in vif-bridge
Cosmetics: most of the variables used in vif-bridge are already quoted.
Add quoting also to the remaining shell variables.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Paul Durrant [Tue, 17 Nov 2015 11:32:05 +0000 (11:32 +0000)]
docs: Introduce xenstore paths for guest network address information
It is useful for a toolstack to be able to see the network addresses
in use by a domain for a particular vif in xenstore for display
purposes and, for example, so that a VNC session can be established
to the guest GUI.
This patch documents paths to allow a domain to advertise an interface
name, MAC (unicast and multicast) and IP (version 4 and 6) address
information.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Keir Fraser <keir@xen.org> Cc: Tim Deegan <tim@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Paul Durrant [Tue, 17 Nov 2015 11:32:04 +0000 (11:32 +0000)]
docs: Introduce xenstore paths for hotplug features
Without some indication from a guest it is not possible for a
toolstack to know whether instantiation of a new vbd or vif should
result in a new PV device of the appropriate type being brought online.
(In other words whether guest PV drivers are present and functioning).
This patch documents two paths which vif and vbd frontend drivers can
use to advertise their ability to respond to new vif or vbd
instantiations.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Keir Fraser <keir@xen.org> Cc: Tim Deegan <tim@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Paul Durrant [Tue, 17 Nov 2015 11:32:03 +0000 (11:32 +0000)]
docs: Introduce xenstore paths for PV driver information
For domain management purposes it is convenient to be able to see
information about PV drivers in xenstore. The XAPI toolstack in
XenServer has always created a ~/drivers path for this purpose.
This patch documents that path and also adds a specification of how
it should be used.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Keir Fraser <keir@xen.org> Cc: Tim Deegan <tim@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Paul Durrant [Tue, 17 Nov 2015 11:32:02 +0000 (11:32 +0000)]
docs: Introduce xenstore paths for PV control features
XenServer already makes use of ~/control/feature-suspend being written
to advertise guest capability of responding to 'suspend' when written to
~/control/shutdown and, since they are derived from XenServer drivers,
the Xen Project Windows PV drivers attempt to write this value. The write
currently fails for libxl provisioned VMs because ~/control is read-only
to the guest (only ~/control/shutdown is writable, for acknowledgement
purposes).
This patch documents feature-suspend and also a set of similar control
feature flags, so that they may be added to libxl provisioned
guests by subsequent patches:
feature-poweroff: PV drivers/agent can shut down the guest
feature-reboot: PV drivers/agent can reboot the guest
feature-s3: PV drivers/agent can trigger guest sleep (HVM only)
feature-s4: PV drivers/agent can trigger guest hibernate (HVM only)
The patch (because it adds features relating to S3 and S4 power states)
also clarifies that the initial set of platform properties mentioned are
booleans, and updates the specifier accordingly.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Keir Fraser <keir@xen.org> Cc: Tim Deegan <tim@xen.org>
Signed-off-by: Olaf Hering <olaf@aepfle.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Keir Fraser <keir@xen.org> Cc: Tim Deegan <tim@xen.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Julien Grall [Thu, 19 Nov 2015 12:46:09 +0000 (12:46 +0000)]
xen/arm: use masking operation instead of test_bit for MCSF bits
This is a follow-up to commit 90f2e2a307fc6a6258c39cc87b3b2bf9441c0fa7 "use
masking operation instead of test_bit for MCSF bits" where the ARM
changes were missing.
Signed-off-by: Julien Grall <julien.grall@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Dario Faggioli [Tue, 24 Nov 2015 13:50:30 +0000 (14:50 +0100)]
sched: get rid of the per domain vCPU list in Credit2
As, currently, there is no reason to bother having
it and keeping it updated.
In fact, it is only used for dumping and changing
vCPUs parameters, but that can be achieved easily with
for_each_vcpu.
While there, improve alignment of comments, and
add a const qualifier to a pointer, making things
more consistent with what happens everywhere else
in the source file.
This also allows us to kill one of the remaining
FIXMEs in the code, which is always good.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Dario Faggioli [Tue, 24 Nov 2015 13:50:09 +0000 (14:50 +0100)]
sched: get rid of the per domain vCPU list in RTDS
As, currently, there is no reason to bother having
it and keeping it updated.
In fact, it is only used for dumping and changing
vCPUs parameters, but that can be achieved easily with
for_each_vcpu.
While there, take care of the case when
XEN_DOMCTL_SCHEDOP_getinfo is called but no vCPUs have
been allocated yet (by returning the default scheduling
parameters).
Dario Faggioli [Tue, 24 Nov 2015 13:49:47 +0000 (14:49 +0100)]
sched: better handle (not) inserting idle vCPUs in runqueues
Idle vCPUs are set to run immediately, as a part of their
own initialization, so we shouldn't even try to put them
in a runqueue. In fact, no scheduler does that, even when
asked to (that is rather explicit in Credit2 and RTDS, a
bit less evident in Credit1).
Let's make things look as follows:
- in generic code, explicitly avoid even trying to
insert idle vCPUs in runqueues;
- in specific schedulers' code, enforce that.
Note that, as csched_vcpu_insert() is no longer being
called, during boot (from sched_init_vcpu()) we can
safely avoid saving the flags when taking the runqueue
lock.
Dario Faggioli [Tue, 24 Nov 2015 13:49:09 +0000 (14:49 +0100)]
sched: clarify use cases of schedule_cpu_switch()
schedule_cpu_switch() is meant to be only used for moving
pCPUs from a cpupool to no cpupool, and from there back
to a cpupool, *not* to move them directly from one cpupool
to another.
This is something inherent to the way the function is
implemented and called, but is not that clear, just by the
look of it.
Make it more evident by:
- adding commentary and ASSERT()s;
- update the cpupool per-CPU variable (mapping pCPUs to
pools) directly in schedule_cpu_switch(), rather than
in various places in cpupool.c.
Dario Faggioli [Tue, 24 Nov 2015 13:48:34 +0000 (14:48 +0100)]
sched: fix locking for insert_vcpu() in credit1 and RTDS
The insert_vcpu() hook is handled with inconsistent locking.
In fact, schedule_cpu_switch() calls the hook with runqueue
lock held, while sched_move_domain() relies on the hook
implementations to take the lock themselves (and, since that
is not done in Credit1 and RTDS, such operation is not safe
in those cases).
This is fixed as follows:
- take the lock in the hook implementations, in specific
schedulers' code;
- avoid calling insert_vcpu(), for the idle vCPU, in
schedule_cpu_switch(). In fact, idle vCPUs are set to run
immediately, and the various schedulers won't insert them
in their runqueues anyway, even when explicitly asked to.
While there, still in schedule_cpu_switch(), locking with
_irq() is enough (there's no need to do *_irqsave()).
Jan Beulich [Tue, 24 Nov 2015 11:31:13 +0000 (12:31 +0100)]
x86/HVM: type adjustments
- constify struct hvm_trap * function parameters
- width reduce and shuffle some struct hvm_trap members
- use bool_t for boolean fields struct hvm_function_table
- use unsigned for struct hvm_function_table's hap_capabilities field
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky<boris.ostrovsky@oracle.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 24 Nov 2015 11:30:31 +0000 (12:30 +0100)]
VMX: fix/adjust trap injection
In the course of investigating the 4.1.6 backport issue of the XSA-156
patch I realized that #DB injection has always been broken, but with it
now getting always intercepted the problem has got worse: Documentation
clearly states that neither DR7.GD nor DebugCtl.LBR get cleared before
the intercept, so this is something we need to do before reflecting the
intercepted exception.
While adjusting this (and also with 4.1.6's strange use of
X86_EVENTTYPE_SW_EXCEPTION for #DB in mind) I further realized that
the special casing of individual vectors shouldn't be done for
software interrupts (resulting from INT $nn).
And then some code movement: Setting of CR2 for #PF can be done in the
same switch() statement (no need for a separate if()), and reading of
intr_info is better done close to the consumption of the variable
(allowing the compiler to generate better code / use fewer registers
for variables).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
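The fix-up described in the commit above (clearing DR7.GD and DebugCtl.LBR before reflecting an intercepted #DB) can be sketched on plain integers. The structure and function are hypothetical stand-ins for the VMCS accesses a hypervisor would perform; the bit positions used are DR7 bit 13 for GD and DebugCtl bit 0 for LBR.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_DR7_GD        (UINT64_C(1) << 13)  /* general detect enable */
#define TOY_DEBUGCTL_LBR  (UINT64_C(1) << 0)   /* last branch record    */
#define TOY_VECTOR_DB     1u                   /* #DB exception vector  */

/* Hypothetical snapshot of the guest debug state. */
struct toy_dbg_state {
    uint64_t dr7;
    uint64_t debugctl;
};

/*
 * Before reflecting an intercepted hardware exception to the guest,
 * clear the bits that the CPU would have cleared on a real #DB
 * delivery. Other vectors leave the state untouched.
 */
static void toy_prepare_reflection(struct toy_dbg_state *s, unsigned int vector)
{
    assert(vector < 32);

    if (vector != TOY_VECTOR_DB)
        return;

    s->dr7 &= ~TOY_DR7_GD;
    s->debugctl &= ~TOY_DEBUGCTL_LBR;
}
```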
Naresh Bhat [Tue, 24 Nov 2015 11:18:02 +0000 (12:18 +0100)]
acpi/NUMA: build NUMA for x86 only
NUMA is currently not supported for ARM in Xen. Add a new compilation
option HAS_NUMA for NUMA. Configure and build NUMA only for x86
architecture now.
Feng Wu [Tue, 24 Nov 2015 11:10:36 +0000 (12:10 +0100)]
vmx: extend struct pi_desc to support VT-d Posted-Interrupts
Extend struct pi_desc according to VT-d Posted-Interrupts Spec.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Feng Wu [Tue, 24 Nov 2015 11:10:10 +0000 (12:10 +0100)]
VT-d Posted-Interrupts feature detection
VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Feng Wu [Tue, 24 Nov 2015 11:09:28 +0000 (12:09 +0100)]
iommu: add iommu_intpost to control VT-d Posted-Interrupts feature
VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
With VT-d Posted-Interrupts enabled, external interrupts from
direct-assigned devices can be delivered to guests without VMM
intervention when guest is running in non-root mode.
This patch adds the variable 'iommu_intpost' to control whether VT-d
posted interrupts are enabled in the generic IOMMU code.
Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 24 Nov 2015 11:07:27 +0000 (12:07 +0100)]
vVMX: use latched VMCS machine address
Instead of calling domain_page_map_to_mfn() over and over, latch the
guest VMCS machine address unconditionally (i.e. independent of whether
VMCS shadowing is supported by the hardware).
Since this requires altering the parameters of __[gs]et_vmcs{,_real}()
(and hence all their callers) anyway, take the opportunity to also drop
the bogus double underscores from their names (and from
__[gs]et_vmcs_virtual() as well).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Jan Beulich [Tue, 24 Nov 2015 11:06:26 +0000 (12:06 +0100)]
VMX: allocate VMCS pages from domain heap
There being only very few uses of the virtual address of a VMCS,
convert these cases to establish a mapping and lift the Xen heap
restriction from the VMCS allocation.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Andrew Cooper [Thu, 19 Nov 2015 14:45:41 +0000 (14:45 +0000)]
tools/libxc: Correct XC_DOM_PAGE_SIZE() to return a long long
c/s abdf3c5b "libxc: create p2m list outside of kernel mapping if supported"
introduces a use which Coverity objects to; an int used to mask a uint64_t.
The result needs to be signed to allow ~XC_DOM_PAGE_SIZE() to function
correctly, and long long to function properly in 32bit builds.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
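Why the type of a page size matters when its complement masks a uint64_t can be seen in a short sketch. The helpers below are hypothetical and do not reproduce the libxc macro; the unsigned variant is included only to show the truncation that a signed, 64-bit wide type avoids.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12

/* Unsigned int page size: the complement is zero-extended to 64 bits. */
static uint64_t toy_align_uint(uint64_t addr)
{
    unsigned int size = 1u << TOY_PAGE_SHIFT;

    return addr & ~(size - 1);        /* upper 32 bits of addr are lost */
}

/* Signed long long page size: the complement keeps all upper bits set. */
static uint64_t toy_align_ll(uint64_t addr)
{
    long long size = 1LL << TOY_PAGE_SHIFT;

    assert((size & (size - 1)) == 0); /* power of two */
    return addr & (uint64_t)~(size - 1);
}
```

For an address above 4GiB the first helper silently drops the high half, while the second only clears the low 12 bits.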
Juergen Gross [Thu, 19 Nov 2015 16:11:08 +0000 (17:11 +0100)]
libxl: correct bug in domain builder regarding page tables for pvh
Commit 81a76e4b12961a9f54f5021809074196dfe6dbba ("libxc: rework of
domain builder's page table handler") dropped a special case for pvh
resulting in page tables being mapped read-only. This led to a panic
of the domain in early boot.
Correct this error.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Fri, 20 Nov 2015 11:38:33 +0000 (12:38 +0100)]
x86/P2M: consolidate handling of types not requiring a valid MFN
As noted regarding the mixture of checks in p2m_pt_set_entry(),
introduce a new P2M type group that can be used everywhere we
just care about accepting operations with either a valid MFN or a type
permitted to be used without a (valid) MFN.
Note that p2m_mmio_dm is not included in P2M_NO_MFN_TYPES, as for the
intended purpose that one ought to be treated similar to p2m_invalid
(perhaps the two should ultimately get folded anyway).
Note further that PoD superpages now get INVALID_MFN used when creating
page table entries (was _mfn(0) before).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Fri, 20 Nov 2015 11:37:37 +0000 (12:37 +0100)]
x86/PoD: tighten conditions for checking super page
Since calling the function isn't cheap, try to avoid the call when we
know up front it won't help; see the code comment for details on those
conditions.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 19 Nov 2015 15:46:10 +0000 (16:46 +0100)]
x86/IO-APIC: fix setting of destinations
In commit a85da715cf ("x86/IO-APIC: adjust setting of destinations") I
made a pretty blatant mistake: get_apic_id() can be used there only
when running APICs in physical mode. For both flat and clustered modes
the change was wrong, causing different kinds of boot problems on
affected systems. Don't revert that change though, but use TARGET_CPUS
(equaling cpu_online_map, and with there only being a single online CPU
fulfilling the original commit's intention).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 19 Nov 2015 15:44:59 +0000 (16:44 +0100)]
x86: fixes to LAPIC probing
* Fix (unsafe) assumption that X86_FEATURE_APIC resided in feature word 0.
* All 64bit processors have local APICs; drop the vendor check.
* Unconditionally probe MSR_IA32_APICBASE (safely, to fail more gracefully in
broken situations) and avoid a redundant double rdmsr().
* Avoid repeatedly OR'ing APICBASE_ENABLE and DEFAULT_PHYS_BASE when
attempting to reenable the LAPIC.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
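The first point can be illustrated with a small sketch, assuming the common encoding where a feature number is (word * 32 + bit) and the APIC flag is CPUID.1:EDX bit 9. The helper names and the array size are hypothetical. Deriving the word index from the feature number avoids hard-coding word 0.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NCAPINTS 8                    /* hypothetical number of words */
#define DEMO_FEATURE_APIC (0 * 32 + 9)     /* word 0, bit 9 */

/* Test a feature by computing both the word and the bit from its number. */
static int demo_cpu_has(const uint32_t *caps, unsigned int feature)
{
    unsigned int word = feature / 32;
    unsigned int bit = feature % 32;

    assert(word < DEMO_NCAPINTS);
    return (caps[word] >> bit) & 1;
}

/* Set a feature in whichever capability word it belongs to. */
static void demo_set_cap(uint32_t *caps, unsigned int feature)
{
    assert(feature / 32 < DEMO_NCAPINTS);
    caps[feature / 32] |= UINT32_C(1) << (feature % 32);
}
```

The same helpers keep working if a feature is later assigned to a different word, which is the case the unsafe assumption would get wrong.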
Jan Beulich [Tue, 17 Nov 2015 12:23:11 +0000 (13:23 +0100)]
ns16550: limit mapped MMIO size
There's no point in mapping more than the memory we actually may need
to touch, and in fact the too large region could actually extend into
another device's one (which currently is benign on x86 since only a
single page gets mapped anyway, but which is a latent bug on ARM
whenever PCI support gets enabled there).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Juergen Gross [Thu, 12 Nov 2015 13:43:36 +0000 (14:43 +0100)]
libxc: create p2m list outside of kernel mapping if supported
In case the kernel of a new pv-domU indicates it is supporting a p2m
list outside the initial kernel mapping by specifying INIT_P2M, let
the domain builder allocate the memory for the p2m list from physical
guest memory only and map it to the address the kernel is expecting.
This will enable loading pv-domUs larger than 512 GB.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
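As a rough worked example of the sizes involved, the sketch below computes the size of a p2m list, assuming 4 KiB pages and one 8-byte entry per guest page (as for a 64-bit PV guest). The helper name and constants are hypothetical. Under those assumptions a 512 GB guest needs a 1 GiB list, which suggests why keeping the list inside the initial kernel mapping becomes a limiting factor.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT      12   /* assumed 4 KiB pages */
#define DEMO_P2M_ENTRY_BYTES 8    /* assumed one 64-bit entry per page */

/* Size in bytes of a p2m list covering guest_bytes of guest memory. */
static uint64_t demo_p2m_list_bytes(uint64_t guest_bytes)
{
    /* Guest memory is expected to be a whole number of pages. */
    assert((guest_bytes & ((UINT64_C(1) << DEMO_PAGE_SHIFT) - 1)) == 0);
    return (guest_bytes >> DEMO_PAGE_SHIFT) * DEMO_P2M_ENTRY_BYTES;
}
```

For instance, demo_p2m_list_bytes(512ULL << 30) evaluates to 1ULL << 30 under these assumptions.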
Juergen Gross [Thu, 12 Nov 2015 13:43:35 +0000 (14:43 +0100)]
libxc: rework of domain builder's page table handler
In order to prepare a p2m list outside of the initial kernel mapping
do a rework of the domain builder's page table handler. The goal is
to be able to use common helpers for page table allocation and setup
for initial kernel page tables and page tables mapping the p2m list.
This is achieved by supporting multiple mapping areas. The mapped
virtual addresses of the single areas must not overlap, while the
page tables of a new area added might already be partially present.
In particular, the top level page table exists only once, of course.
Currently restrict the number of mappings to 1 because the only mapping
now is the initial mapping created by the toolstack. No behaviour change
or guest-visible change should be introduced.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 12 Nov 2015 13:43:34 +0000 (14:43 +0100)]
libxc: split p2m allocation in domain builder from other magic pages
Carve out the p2m list allocation from the .alloc_magic_pages hook of
the domain builder in order to prepare allocating the p2m list outside
of the initial kernel mapping. This will be needed to support loading
domains with huge memory (>512 GB).
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 12 Nov 2015 13:43:33 +0000 (14:43 +0100)]
libxc: create unmapped initrd in domain builder if supported
In case the kernel of a new pv-domU indicates it is supporting an
unmapped initrd, don't waste precious virtual space for the initrd,
but allocate only guest physical memory for it.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 12 Nov 2015 13:43:31 +0000 (14:43 +0100)]
libxc: introduce domain builder architecture specific data
Reorganize struct xc_dom_image to contain a pointer to domain builder
architecture specific private data. This will abstract the architecture
or domain type specific data from the generally used data.
The new area is allocated as soon as the domain type is known.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 12 Nov 2015 13:43:30 +0000 (14:43 +0100)]
libxc: rename domain builder count_pgtables to alloc_pgtables
Rename the count_pgtables hook of the domain builder to alloc_pgtables
and do the allocation of the guest memory for page tables inside this
hook. This will remove the need for accessing the x86 specific pgtables
member of struct xc_dom_image in the generic domain builder code.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 12 Nov 2015 13:43:29 +0000 (14:43 +0100)]
xen: add generic flag to elf_dom_parms indicating support of unmapped initrd
Support of an unmapped initrd is indicated by the kernel of the domain
via elf notes. In order not to have to use raw elf data in the tools
for support of an unmapped initrd add a flag to the parsed data area
to indicate the kernel supporting this feature.
Switch to using this flag in the hypervisor domain builder.
Cc: andrew.cooper3@citrix.com Cc: jbeulich@suse.com Cc: keir@xen.org Suggested-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Guest memory allocation in the domain builder of libxc is done via
virtual addresses only. In order to be able to support preallocated
areas not virtually mapped reorganize the memory allocator to keep
track of allocated pages globally and in allocated segments.
This requires an interface change of the allocate callback of the
domain builder which currently is using the last mapped virtual
address as a parameter. This is no problem as the only user of this
callback is stubdom/grub/kexec.c using this virtual address to
calculate the last used pfn.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 16 Nov 2015 12:11:59 +0000 (13:11 +0100)]
x86/IO-APIC: adjust setting of destinations
setup_IO_APIC_irqs() runs before APs get brought up, so using
desc->arch.cpu_mask at best risks it being either empty or having bits
for CPUs other than the BP set. Just use the APIC ID of the only
online CPU directly.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 16 Nov 2015 12:11:08 +0000 (13:11 +0100)]
x86/IO-APIC: fix setup of Xen internally used IRQs (take 2)
..., namely that of a PCI serial card with an IRQ above the
legacy range. This had got broken by the switch to cpumask_any() in
cpu_mask_to_apicid_phys(). Fix this by allowing all CPUs for that IRQ
(via setup_vector_irq() properly updating a booting CPU's vector_irq[],
thus avoiding "No irq handler for vector" messages and the interrupt
not working).
Clean up coding style and types there at once.
While doing this I also noticed that io_apic_set_pci_routing() can't
be quite right: It sets up the destination _before_ getting a vector
allocated (which on systems other than those using the flat APIC mode
affects the possible destinations), and also didn't restrict affinity
to ->arch.cpu_mask (as established by assign_irq_vector()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 10 Nov 2015 10:46:44 +0000 (10:46 +0000)]
tools/ocaml/xb: Correct calculations of data/space in the ring
ml_interface_{read,write}() would miscalculate the quantity of
data/space in the ring if it crossed the ring boundary, and incorrectly
return a short read/write.
This causes a protocol stall, as either side of the ring ends up waiting
for what they believe to be the other side needing to take the next
action.
Correct the calculations to cope with crossing the ring boundary.
In addition, correct the error detection. It is a hard error if the
producer index gets more than a ring size ahead of the consumer, or if
the consumer ever overtakes the producer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Samuel Thibault <samuel.thibault@ens-lyon.org> Reviewed-by: David Scott <dave@recoil.org>
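The calculation described above can be sketched as follows. This is not the code from tools/ocaml/xb: it is a simplified single-threaded model with hypothetical names, assuming free-running 32-bit indices and a power-of-two ring size, and it omits the memory barriers a real shared ring needs. A transfer that crosses the ring boundary is copied in two pieces rather than cut short, and a producer more than one ring size ahead of the consumer (which, with unsigned arithmetic, also covers the consumer overtaking the producer) is a hard error.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_RING_SIZE 1024u                      /* must be a power of two */
#define DEMO_MASK(idx) ((idx) & (DEMO_RING_SIZE - 1))

struct demo_ring {
    uint8_t buf[DEMO_RING_SIZE];
    uint32_t cons, prod;                          /* free-running indices */
};

/* Read up to len bytes; returns the count, or -1 on corrupt indices. */
static int demo_ring_read(struct demo_ring *r, uint8_t *out, uint32_t len)
{
    uint32_t avail, first;

    assert(r && out);
    avail = r->prod - r->cons;                    /* wraps correctly */
    if (avail > DEMO_RING_SIZE)
        return -1;                                /* hard error */
    if (len > avail)
        len = avail;
    first = DEMO_RING_SIZE - DEMO_MASK(r->cons);  /* bytes up to the boundary */
    if (first > len)
        first = len;
    memcpy(out, r->buf + DEMO_MASK(r->cons), first);
    memcpy(out + first, r->buf, len - first);     /* remainder after wrapping */
    r->cons += len;
    return (int)len;
}

/* Write up to len bytes; returns the count, or -1 on corrupt indices. */
static int demo_ring_write(struct demo_ring *r, const uint8_t *in, uint32_t len)
{
    uint32_t used, space, first;

    assert(r && in);
    used = r->prod - r->cons;
    if (used > DEMO_RING_SIZE)
        return -1;                                /* hard error */
    space = DEMO_RING_SIZE - used;
    if (len > space)
        len = space;
    first = DEMO_RING_SIZE - DEMO_MASK(r->prod);
    if (first > len)
        first = len;
    memcpy(r->buf + DEMO_MASK(r->prod), in, first);
    memcpy(r->buf, in + first, len - first);
    r->prod += len;
    return (int)len;
}
```

A short return value here only ever means that the ring really is empty or full, so neither side ends up waiting on the other by mistake.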
Jim Fehlig [Fri, 13 Nov 2015 02:40:46 +0000 (19:40 -0700)]
libxl: relax readonly check introduced by XSA-142 fix
The fix for XSA-142 is quite a big hammer, rejecting readonly
disk configuration even when the requested backend is known to
support readonly. While it is true that qemu doesn't support
readonly for emulated IDE or AHCI disks
$ /usr/lib/xen/bin/qemu-system-i386 \
-drive file=/tmp/disk.raw,if=ide,media=disk,format=raw,readonly=on
qemu-system-i386: Can't use a read-only drive
$ /usr/lib/xen/bin/qemu-system-i386 -device ahci,id=ahci0 \
-drive file=/tmp/disk.raw,if=none,id=ahcidisk-0,format=raw,readonly=on \
-device ide-hd,bus=ahci0.0,unit=0,drive=ahcidisk-0
qemu-system-i386: -device ide-hd,bus=ahci0.0,unit=0,drive=ahcidisk-0:
Can't use a read-only drive
Inside a guest using such a disk, the SCSI kernel driver sees write
protect on
[ 7.339232] sd 2:0:1:0: [sdb] Write Protect is on
Also, PV drivers support readonly, but the patch rejects such
configuration even when PV drivers (vdev=xvd*) have been explicitly
specified and creation of an emulated twin is skipped.
This follow-up patch loosens the restriction to reject readonly when
creating an emulated IDE or AHCI disk, but allows it when the backend
is known to support readonly.
Signed-off-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
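The relaxed rule can be summarised in a short sketch. The enum and function below are hypothetical and do not reflect libxl's actual data structures; they only model the decision described above: a readonly disk is rejected when an emulated IDE or AHCI device would be created, and accepted otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of how the disk is presented to the guest. */
enum demo_disk_model {
    DEMO_DISK_PV,              /* PV only, e.g. vdev=xvd* */
    DEMO_DISK_EMULATED_IDE,
    DEMO_DISK_EMULATED_AHCI
};

/* Returns true if the configuration is acceptable under the relaxed check. */
static bool demo_disk_config_ok(enum demo_disk_model model, bool readonly)
{
    assert(model == DEMO_DISK_PV ||
           model == DEMO_DISK_EMULATED_IDE ||
           model == DEMO_DISK_EMULATED_AHCI);

    if (!readonly)
        return true;           /* writable disks are never affected */

    switch (model) {
    case DEMO_DISK_EMULATED_IDE:
    case DEMO_DISK_EMULATED_AHCI:
        return false;          /* qemu cannot emulate these readonly */
    default:
        return true;           /* backend is known to support readonly */
    }
}
```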
Juergen Gross [Fri, 23 Oct 2015 13:05:00 +0000 (15:05 +0200)]
libxc: remove most of tools/libxc/xc_dom_compat_linux.c
In tools/libxc/xc_dom_compat_linux.c xc_linux_build() is the only
domain building function used by an in-tree component (qemu-xen) which
is really necessary.
Remove the other domain building functions and the unused python
wrapper xc.linux_build() referencing one of the to-be-removed
functions.
Suggested-by: Ian Campbell <ian.campbell@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>