Roger Pau Monne [Mon, 10 Apr 2017 15:12:12 +0000 (16:12 +0100)]
clang: disable the gcc-compat warnings for read_atomic

clang's gcc-compat warnings can wrongly fire when certain constructs are
used, at least in the following flow:

switch ( ... )
{
case ...:
    while ( ({ int x; switch ( foo ) { case 1: x = 1; break; } x; }) )
    {
        ...

will cause clang to emit the warning "'break' is bound to loop, GCC
binds it to switch". This is a false positive: both gcc and clang bind
the break to the inner switch. In order to work around this issue,
disable the gcc-compat checks for the usage of the read_atomic macro.

This has been reported upstream as http://bugs.llvm.org/show_bug.cgi?id=32595.
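
A minimal sketch of this kind of workaround (the diagnostic-control pragmas
are real clang pragmas; the macro body below is a simplified stand-in, not
Xen's actual read_atomic):

    #ifdef __clang__
    # define GCC_COMPAT_OFF _Pragma("clang diagnostic push") \
                            _Pragma("clang diagnostic ignored \"-Wgcc-compat\"")
    # define GCC_COMPAT_ON  _Pragma("clang diagnostic pop")
    #else
    # define GCC_COMPAT_OFF
    # define GCC_COMPAT_ON
    #endif

    /* The statement expression below is the kind of construct that can
     * trigger the bogus gcc-compat warning when used inside a switch, so
     * the check is disabled just around it. */
    #define read_atomic_(p) ({                            \
        GCC_COMPAT_OFF                                    \
        typeof(*(p)) x_ = *(volatile typeof(*(p)) *)(p);  \
        GCC_COMPAT_ON                                     \
        x_;                                               \
    })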

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Mon, 10 Apr 2017 15:32:01 +0000 (17:32 +0200)]
x86/atomic: fix cmpxchg16b inline assembly to work with clang

clang doesn't understand the "=A" register constraint when used with 64-bit
assembly and spits out an internal error:

fatal error: error in backend: Cannot select: 0x7f9fb89c9390: i64 = build_pair 0x7f9fb89c92b0,
      0x7f9fb89c9320
  0x7f9fb89c92b0: i32,ch,glue = CopyFromReg 0x7f9fb89c9240, Register:i32 %EAX, 0x7f9fb89c9240:1
    0x7f9fb89c8c20: i32 = Register %EAX
    0x7f9fb89c9240: ch,glue = inlineasm 0x7f9fb89c90f0,
TargetExternalSymbol:i64'lock; cmpxchg16b $1', MDNode:ch<0x7f9fb8476c38>,
TargetConstant:i64<25>, TargetConstant:i32<18>, Register:i32 %EAX, Register:i32
%EDX, TargetConstant:i32<196622>, 0x7f9fb89c87c0, TargetConstant:i32<9>,
Register:i64 %RCX, TargetConstant:i32<9>, Register:i64 %RBX,
TargetConstant:i32<9>, Register:i64 %RDX, TargetConstant:i32<9>, Register:i64
%RAX, TargetConstant:i32<196622>, 0x7f9fb89c87c0, TargetConstant:i32<12>,
Register:i32 %EFLAGS, 0x7f9fb89c90f0:1
      0x7f9fb89c8a60: i64 = TargetExternalSymbol'lock; cmpxchg16b $1'
      0x7f9fb89c8b40: i64 = TargetConstant<25>
      0x7f9fb89c8bb0: i32 = TargetConstant<18>
      0x7f9fb89c8c20: i32 = Register %EAX
      0x7f9fb89c8c90: i32 = Register %EDX
      0x7f9fb89c8d00: i32 = TargetConstant<196622>
      0x7f9fb89c87c0: i64,ch = load<LD8[%4]> 0x7f9fb9053da0, FrameIndex:i64<1>, undef:i64
        0x7f9fb9053a90: i64 = FrameIndex<1>
        0x7f9fb9053e80: i64 = undef
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8d70: i64 = Register %RCX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8ec0: i64 = Register %RBX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8fa0: i64 = Register %RDX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c9080: i64 = Register %RAX
[...]

Fix this by specifying "rdx:rax" manually using the "d" and "a" constraints.
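
A sketch of the resulting inline assembly (simplified; the struct layout and
operand ordering here are illustrative, not the exact Xen code):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128_t;

    /* cmpxchg16b compares rdx:rax with the (16-byte aligned) memory
     * operand and, if equal, stores rcx:rbx into it. "=A" would name
     * rdx:rax as one 128-bit pair, which clang rejects in 64-bit mode,
     * so bind the two halves explicitly with "d" and "a". */
    static inline u128_t cmpxchg16b_(volatile void *ptr, u128_t old, u128_t new)
    {
        u128_t prev;

        asm volatile ( "lock; cmpxchg16b %2"
                       : "=d" (prev.hi), "=a" (prev.lo),
                         "+m" (*(volatile u128_t *)ptr)
                       : "c" (new.hi), "b" (new.lo),
                         "0" (old.hi), "1" (old.lo)
                       : "memory" );

        return prev;
    }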

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Mon, 10 Apr 2017 15:31:42 +0000 (17:31 +0200)]
xsm: fix clang 3.5 build after c47d1d

The changes introduced in c47d1d broke the clang build due to undefined
references to __xsm_action_mismatch_detected, because clang doesn't
optimize the code enough to elide those calls. The following patch allows
the clang build to work again, while keeping the same functionality.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Jonathan Davies [Fri, 7 Apr 2017 13:27:22 +0000 (14:27 +0100)]
oxenstored: make --restart option best-effort

Only attempt to restore from saved state if it exists.

Without this, oxenstored immediately exits with an exception if the
--restart option is provided but the state file is not present.

(The time-of-check to time-of-use race isn't a concern as oxenstored is
the only thing that should write the state file.)

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Jonathan Davies [Fri, 7 Apr 2017 13:27:21 +0000 (14:27 +0100)]
oxenstored: improve event-channel binding logging

It's useful to see a bit more detail when an inter-domain event-channel
is bound, especially over an oxenstored restart.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Jonathan Davies [Fri, 7 Apr 2017 13:27:20 +0000 (14:27 +0100)]
oxenstored: save remote evtchn port, not local port

Previously, Domain.dump output the number of the local port
corresponding to each domain's event-channel. However, when oxenstored
exits, it closes /dev/xen/evtchn which causes the kernel to close the
local port (evtchn_release), so this port is no longer useful.

Instead, store the remote port. This can be used to reconnect the
event-channel by binding the original remote port to a fresh local port.

Indeed, the logic for parsing the stored state already expects a remote
port as it passes the parsed port number to Domain.make (via
Domains.create), which takes a remote port.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Jonathan Davies [Fri, 7 Apr 2017 13:27:19 +0000 (14:27 +0100)]
oxenstored: avoid leading slash in paths in saved store state

Internally, paths are represented as lists of strings, where
  * path "/" is represented by []
  * path "/local/domain/0" is represented by ["local"; "domain"; "0"]
(see comment for Store.Path.t).

However, the traversal function generated paths like
    [""; "local"; "domain"; "0"]
because the name of the root node is "". Change it to generate paths
correctly.

Furthermore, the function passed to Store.dump_fct would render the node
"foo" under the path [] as "//foo". Change this to return "/foo".

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Jonathan Davies [Fri, 7 Apr 2017 13:27:18 +0000 (14:27 +0100)]
oxenstored: initialise logging earlier

Otherwise we miss out on messages from things that try to log earlier in
the start-up procedure.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
Stefano Stabellini [Fri, 7 Apr 2017 22:38:58 +0000 (15:38 -0700)]
Revert "setup vwfi correctly on cpu0"

This reverts commit b32d442abd92cdd4d8f2a2e7794cfee9dba7fe22. There is
no need for this patch after "xen/arm: Set and restore HCR_EL2 register
for each vCPU separately".

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 22:08:01 +0000 (23:08 +0100)]
ARM: GICv3 ITS: introduce device mapping

The ITS uses device IDs to map LPIs to a device. Dom0 will later use
those IDs, which we directly pass on to the host.
For this we have to map each device that Dom0 may request to a host
ITS device with the same identifier.
Allocate the respective memory and enter each device into an rbtree, to
later be able to iterate over it or to easily tear down guests.
Because device IDs are per ITS, we need to identify a virtual ITS. We
use the doorbell address for that purpose, as it is a nice architectural
MSI property and spares us handling opaque pointers or breaking
the VGIC abstraction.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 22:08:00 +0000 (23:08 +0100)]
ARM: vGICv3: introduce ITS emulation stub

Create a new file to hold the emulation code for the ITS widget.
This just holds the data structure and a init and free function for now.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Fri, 7 Apr 2017 22:07:59 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce host LPI array

The number of LPIs on a host can be potentially huge (millions),
although in practice it will mostly be reasonable. So prematurely
allocating an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the vCPU into
which the LPI needs to be injected.
Also to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, lookup and destroy entries for a given LPI.
By using the naturally atomic access guarantee the native uint64_t data
type gives us, we allocate and access LPI information in a way that does
not require a lock.
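
A sketch of such a dense entry (the exact field layout here is an assumption
for illustration):

    #include <stdint.h>

    /* One entry per LPI: it fits in a naturally aligned uint64_t, so it
     * can be read and updated with single atomic accesses, without a lock. */
    typedef union {
        uint64_t data;
        struct {
            uint32_t virt_lpi;   /* virtual IRQ number to inject */
            uint16_t dom_id;     /* owning guest (Dom0 for now) */
            uint16_t vcpu_id;    /* target vCPU inside that guest */
        };
    } host_lpi_t;

    /* Publish a new mapping with one atomic 64-bit store. */
    static inline void host_lpi_set(host_lpi_t *e, host_lpi_t val)
    {
        __atomic_store_n(&e->data, val.data, __ATOMIC_RELAXED);
    }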

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 22:07:58 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce ITS command handling

To be able to easily send commands to the ITS, create the respective
wrapper functions, which take care of the ring buffer.
The first two commands we implement provide methods to map a collection
to a redistributor (aka host core) and to flush the command queue (SYNC).
Start using these commands for mapping one collection to each host CPU.
As an ITS might choose between *two* ways of addressing a redistributor,
we store both the MMIO base address as well as the processor number in
a per-CPU variable to give each ITS what it wants.
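
For illustration, a sketch of how the collection-mapping command could be
encoded (field offsets per the GICv3 ITS specification; the helper name is
made up):

    #include <stdbool.h>
    #include <stdint.h>

    /* One ITS command occupies four 64-bit doublewords in the ring. */
    struct its_cmd { uint64_t dw[4]; };

    /* MAPC: map collection `icid` to redistributor `rdbase`. Depending
     * on GITS_TYPER.PTA, `rdbase` is either the redistributor's base
     * address or its processor number -- the two addressing modes
     * mentioned above. */
    static void its_encode_mapc(struct its_cmd *cmd, uint16_t icid,
                                uint64_t rdbase, bool valid)
    {
        cmd->dw[0] = 0x09;                         /* MAPC command ID */
        cmd->dw[1] = 0;
        cmd->dw[2] = icid |
                     (rdbase & (((1ULL << 35) - 1) << 16)) | /* RDbase */
                     ((uint64_t)valid << 63);                /* V bit  */
        cmd->dw[3] = 0;
    }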

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Fri, 7 Apr 2017 22:07:57 +0000 (23:07 +0100)]
ARM: GICv3 ITS: map ITS command buffer

Instead of directly manipulating the tables in memory, an ITS driver
sends commands via a ring buffer in normal system memory to the ITS h/w
to create or alter the LPI mappings.
Allocate memory for that buffer and tell the ITS about it to be able
to send ITS commands.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 22:07:56 +0000 (23:07 +0100)]
ARM: GICv3 ITS: allocate device and collection table

Each ITS maps a pair of a DeviceID (for instance derived from a PCI
b/d/f triplet) and an EventID (the MSI payload or interrupt ID) to a
pair of LPI number and collection ID, which points to the target CPU.
This mapping is stored in the device and collection tables, which software
has to provide for the ITS to use.
Allocate the required memory and hand it over to the ITS.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 22:07:55 +0000 (23:07 +0100)]
ARM: GICv3: allocate LPI pending and property table

The ARM GICv3 provides a new kind of interrupt called LPIs.
The pending bits and the configuration data (priority, enable bits) for
those LPIs are stored in tables in normal memory, which software has to
provide to the hardware.
Allocate the required memory, initialize it and hand it over to each
redistributor. The maximum number of LPIs to be used can be adjusted with
the command line option "max_lpi_bits", which defaults to 20 bits,
covering about one million LPIs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Fri, 7 Apr 2017 22:07:54 +0000 (23:07 +0100)]
ARM: GICv3 ITS: initialize host ITS

Map the registers frame for each host ITS and populate the host ITS
structure with some parameters describing the size of certain properties
like the number of bits for device IDs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Fri, 7 Apr 2017 22:07:53 +0000 (23:07 +0100)]
ARM: GICv3 ITS: parse and store ITS subnodes from hardware DT

Parse the GIC subnodes in the device tree to find every ITS MSI controller
the hardware offers. Store that information in a list, both to propagate
all of them to Dom0 later and to be able to iterate over all ITSes.
This introduces an ITS Kconfig option (as an EXPERT option); use
XEN_CONFIG_EXPERT=y on the make command line to see and use the option.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Dario Faggioli [Fri, 7 Apr 2017 16:57:14 +0000 (18:57 +0200)]
xen: credit1: treat pCPUs more evenly during balancing.

Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower-ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, from which pCPU we stole last,
and start from there the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 16:57:07 +0000 (18:57 +0200)]
xen: credit1: increase efficiency and scalability of load balancing.

During load balancing, we check the non-idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen and set to run on currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
are at least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.
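
A sketch of the resulting pre-filter (names assumed for illustration):

    #include <stdbool.h>

    struct csched_pcpu_info {
        unsigned int nr_runnable;   /* runnable vCPUs assigned here */
    };

    /* Only pCPUs with at least 2 runnable vCPUs (one running, plus
     * something sitting in the runqueue) can have anything worth
     * stealing, so skip the rest before touching their runqueue lock. */
    static bool worth_stealing_from(const struct csched_pcpu_info *peer)
    {
        return peer->nr_runnable >= 2;
    }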

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 16:57:00 +0000 (18:57 +0200)]
xen: credit2: avoid cpumask_any() in pick_cpu().

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.

As the picked CPU would, however, be used as a hint
within runq_tickle(), don't give up on it entirely,
and let's make sure we don't always return the same
CPU, or favour lower- or higher-ID CPUs.

To achieve that, let's record and remember, for each
runqueue, which CPU we picked last, and start from
that one the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 16:56:52 +0000 (18:56 +0200)]
xen/tools: tracing: add record for credit1 runqueue stealing.

Including whether we actually tried stealing a vCPU from
a given pCPU, or skipped that one because of lock
contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 16:56:45 +0000 (18:56 +0200)]
xen: credit: consider tickled pCPUs as busy.

Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickel() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.

When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.

Alleviate this by marking a pCPU as busy right
away when tickling it (like, e.g., it happens
in Credit2).

Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.

Introducing proper locking for the cpumasks can, in
theory, be done, and may be investigated in future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, and so it may be a path that is just
not worth following.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 16:56:38 +0000 (18:56 +0200)]
xen: credit: (micro) optimize csched_runq_steal().

Checking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that as early as possible,
avoiding potentially useless complex checks and
cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 16:56:31 +0000 (18:56 +0200)]
xen: credit1: simplify csched_runq_steal() a little bit.

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such pCPU to be idle, so we can:
 - tell the compiler this is actually unlikely,
 - bail early if the pCPU, unfortunately, turns out
   to really be idle.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Andrew Cooper [Fri, 7 Apr 2017 15:38:53 +0000 (16:38 +0100)]
x86/svm: Correct event injection check in svm_vmcb_restore()

SVM's maximum valid event type is 4.  This appears to be a straight copy and
paste error in c/s e94e3f210a62, as VT-x's maximum is 6.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Fri, 7 Apr 2017 15:38:12 +0000 (16:38 +0100)]
x86/svm: Fix indentation in svm_vmcb_restore()

Introduced by c/s b706e1c6af274, spotted by Coverity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 8 Mar 2017 15:38:55 +0000 (15:38 +0000)]
x86/emul: Poison the stubs with debug traps

...rather than leaving fragments of old instructions in place.  This reduces
the chances of something going further wrong (as the debug trap will be caught
and terminate the guest) in a cascade failure where we end up executing the
instruction fragments.
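
A sketch of the idea (the buffer size and helper name are made up):

    #include <string.h>

    #define STUB_BUF_SIZE 64   /* assumed per-stub slot size */

    /* Fill the unused tail of an emulation stub with int3 (0xcc). Stale
     * bytes from earlier stubs can then never be executed: a stray jump
     * into the tail raises #BP instead of running old fragments. */
    static void poison_stub_tail(unsigned char *stub, size_t used)
    {
        memset(stub + used, 0xcc, STUB_BUF_SIZE - used);
    }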

Before:
    (XEN) d2v0 exception 6 (ec=0000) in emulation stub (line 6239)
    (XEN) d2v0 stub: c4 e1 44 77 c3 80 d0 82 ff ff ff d1 90 ec 90

After:
    (XEN) d3v0 exception 6 (ec=0000) in emulation stub (line 6239)
    (XEN) d3v0 stub: c4 e1 44 77 c3 cc cc cc cc cc cc cc cc cc cc

To make this work, the int3 handler needs to be extended to attempt recovery
rather than simply returning to Xen context.  While altering do_int3(),
leave an obvious sign if an embedded breakpoint has been hit and not dealt
with by debugging facilities.

    (XEN) Hit embedded breakpoint at ffff82d0803d01f6 [extable.c#stub_selftest+0xda/0xee]

Extend the selftests to include int3, and add an extra printk indicating the
start of the recovery selftests, to avoid leaving otherwise-spurious faults
visible in the log.

    (XEN) build-id: 55d7e6f420b4f0ce277f776be620f43d7cb8646c
    (XEN) Running stub recovery selftests...
    (XEN) traps.c:3466: GPF (0000): ffff82d0bffff041 [ffff82d0bffff041] -> ffff82d08035937a
    (XEN) traps.c:813: Trap 12: ffff82d0bffff040 [ffff82d0bffff040] -> ffff82d08035937a
    (XEN) traps.c:1215: Trap 3: ffff82d0bffff041 [ffff82d0bffff041] -> ffff82d08035937a
    (XEN) ACPI sleep modes: S3

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yu Zhang [Fri, 7 Apr 2017 15:40:04 +0000 (17:40 +0200)]
x86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an ioreq server unmaps

After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
synchronously by iterating the p2m table.

The synchronous resetting is necessary because we need to guarantee
the p2m table is clean before another ioreq server is mapped. And
since the sweeping of the p2m table could be time-consuming, it is done
with hypercall continuation.
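
The continuation pattern looks roughly like this (the helper names are
hypothetical stand-ins for the Xen internals):

    #include <stdbool.h>

    /* Illustrative declarations; Xen provides the real equivalents. */
    extern bool hypercall_preempt_check(void);
    extern void reset_ioreq_server_entry(unsigned long gfn);
    #define ERESTART 85

    /* Sweep the p2m range; bail out with -ERESTART when preemption is
     * due, so the hypercall is re-issued and resumes from *gfn. */
    static int sweep_p2m(unsigned long *gfn, unsigned long gfn_end)
    {
        for ( ; *gfn < gfn_end; ++*gfn )
        {
            reset_ioreq_server_entry(*gfn);

            /* Only check for preemption every 256 entries. */
            if ( !(*gfn & 0xff) && hypercall_preempt_check() )
                return -ERESTART;
        }

        return 0;
    }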

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Yu Zhang [Fri, 7 Apr 2017 15:39:16 +0000 (17:39 +0200)]
x86/ioreq server: asynchronously reset outstanding p2m_ioreq_server entries

After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously with the current p2m_change_entry_type_global()
interface.

A new field, entry_count, is introduced in struct p2m_domain, to record
the number of p2m_ioreq_server p2m page table entries. One nature of
these entries is that they only point to 4K sized page frames, because
all p2m_ioreq_server entries originate from p2m_ram_rw ones in
p2m_change_type_one(). We do not need to worry about the counting for
2M/1G sized pages.

This patch disallows mapping of an ioreq server while there are still
p2m_ioreq_server entries left, in case another mapping occurs right after
the current one is unmapped and releases its lock, with the p2m table
not yet synced.

This patch also disallows live migration when there are remaining
p2m_ioreq_server entries in the p2m table. The core reason is our current
implementation of p2m_change_entry_type_global() lacks information
to resync p2m_ioreq_server entries correctly if global_logdirty is
on.

We still need to handle other recalculations, however, which means
that when doing a recalculation, if the current type is
p2m_ioreq_server, we check to see if p2m->ioreq.server is valid or
not.  If it is, we leave it as type p2m_ioreq_server; if not, we reset
it to p2m_ram as appropriate.

To avoid code duplication, lift recalc_type() out of p2m-pt.c and use
it for all type recalculations (both in p2m-pt.c and p2m-ept.c).

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Fri, 7 Apr 2017 15:38:48 +0000 (17:38 +0200)]
x86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages

In ept_handle_violation(), write violations are also treated as
read violations. And when a VM is accessing a write-protected
address with read-modify-write instructions, the read emulation
process is triggered first.

For p2m_ioreq_server pages, the current ioreq server only forwards
write operations to the device model. Therefore when such a page
is being accessed by a read-modify-write instruction, the read
operations should be emulated here in the hypervisor. This patch provides
such a handler to copy the data to the buffer.

Note: MMIOs with p2m_mmio_dm type do not need such special treatment
because both reads and writes will go to the device model.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Yu Zhang [Fri, 7 Apr 2017 15:38:40 +0000 (17:38 +0200)]
x86/ioreq server: add device model wrappers for new DMOP

A new device model wrapper is added for the newly introduced
DMOP - XEN_DMOP_map_mem_type_to_ioreq_server.

Since currently this DMOP only supports the emulation of write
operations, attempts to trigger the DMOP with values other than
XEN_DMOP_IOREQ_MEM_ACCESS_WRITE or 0 (to unmap the ioreq server)
shall fail. The wrapper shall be updated once read operations
are also to be emulated in the future.

Also note currently this DMOP only supports one memory type,
and can be extended in the future to map multiple memory types
to multiple ioreq servers, e.g. mapping HVMMEM_ioreq_serverX to
ioreq server X. This wrapper shall be updated when such a change
is made.
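
For illustration, a sketch of how such a wrapper might be driven (the
wrapper name and signature are assumed here, modelled on the
libxendevicemodel style; treat this as a sketch rather than the exact API):

    #include <xendevicemodel.h>   /* plus xen/hvm headers for the constants */

    /* Claim write emulation for p2m_ioreq_server pages, then disclaim it. */
    static int claim_and_release(domid_t domid, ioservid_t id)
    {
        xendevicemodel_handle *dmod = xendevicemodel_open(NULL, 0);
        int rc;

        if ( !dmod )
            return -1;

        /* Claim: write accesses to such pages go to this ioreq server. */
        rc = xendevicemodel_map_mem_type_to_ioreq_server(
            dmod, domid, id, HVMMEM_ioreq_server,
            XEN_DMOP_IOREQ_MEM_ACCESS_WRITE);

        if ( !rc )
            /* Disclaim: flags == 0 unmaps the ioreq server again. */
            rc = xendevicemodel_map_mem_type_to_ioreq_server(
                dmod, domid, id, HVMMEM_ioreq_server, 0);

        xendevicemodel_close(dmod);
        return rc;
    }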

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Paul Durrant [Fri, 7 Apr 2017 15:38:11 +0000 (17:38 +0200)]
x86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server

Previously, p2m_ioreq_server was used to write-protect guest ram
pages, which are tracked with the ioreq server's rangeset. However,
the number of ram pages to be tracked may exceed the upper limit of
the rangeset.

Now, a new DMOP - XEN_DMOP_map_mem_type_to_ioreq_server, is added
to let one ioreq server claim/disclaim its responsibility for the
handling of guest pages with p2m type p2m_ioreq_server. Users of
this DMOP can specify which kind of operation is supposed to be
emulated in a parameter named flags. Currently, this DMOP only
supports the emulation of write operations. And it can be further
extended to support the emulation of read ones if an ioreq server
has such a requirement in the future.

For now, we only support one ioreq server for this p2m type, so
once an ioreq server has claimed its ownership, subsequent calls
of the XEN_DMOP_map_mem_type_to_ioreq_server will fail. Users can
also disclaim the ownership of guest ram pages with p2m_ioreq_server,
by triggering this new DMOP, with ioreq server id set to the current
owner's and flags parameter set to 0.

Note:
a> both XEN_DMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
are only supported for HVMs with HAP enabled.

b> only after one ioreq server claims its ownership of p2m_ioreq_server,
will the p2m type change to p2m_ioreq_server be allowed.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Yu Zhang [Fri, 7 Apr 2017 15:35:44 +0000 (17:35 +0200)]
x86/ioreq server: release the p2m lock after mmio is handled

Routine hvmemul_do_io() may need to peek the p2m type of a gfn to
select the ioreq server. For example, operations on gfns with
p2m_ioreq_server type will be delivered to a corresponding ioreq
server, and this requires that the p2m type not be switched back
to p2m_ram_rw during the emulation process. To avoid this race
condition, we delay the release of p2m lock in hvm_hap_nested_page_fault()
until mmio is handled.

Note: previously in hvm_hap_nested_page_fault(), put_gfn() was moved
before the handling of mmio, due to a deadlock risk between the p2m
lock and the event lock (in commit 77b8dfe). Later, a per-event channel
lock was introduced in commit de6acb7, to send events. So we do not
need to worry about the deadlock issue.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Dario Faggioli [Fri, 7 Apr 2017 12:28:31 +0000 (14:28 +0200)]
tools: sched: add support for 'null' scheduler

Being very, very basic also means this scheduler does
not need much support at the tools level (for now).

Basically, just the definition of the symbol of the
scheduler itself and a couple of stubs.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 12:28:23 +0000 (14:28 +0200)]
xen: sched_null: support for hard affinity

As a (rudimentary) way of directing and affecting the
placement logic implemented by the scheduler, support
vCPU hard affinity.

Basically, a vCPU will now be assigned only to a pCPU
that is part of its own hard affinity. If such pCPU(s)
is (are) busy, the vCPU will wait, like it happens
when there are no free pCPUs.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 12:28:15 +0000 (14:28 +0200)]
xen: sched: introduce the 'null' semi-static scheduler

In cases where one is absolutely sure that there will be
fewer vCPUs than pCPUs, having to pay the cost, mostly in
terms of overhead, of an advanced scheduler may not be
desirable.

The simple scheduler implemented here could be a solution.
Here is how it works (see the sketch after the list):
 - each vCPU is statically assigned to a pCPU;
 - if there are pCPUs without any vCPU assigned, they
   stay idle (as in, they run their idle vCPU);
 - if there are vCPUs which are not assigned to any
   pCPU (e.g., because there are more vCPUs than pCPUs)
   they *don't* run, until they get assigned;
 - if a vCPU assigned to a pCPU goes away, one of the
   vCPUs waiting to be assigned, if any, gets assigned
   to the pCPU and can run there.
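
A sketch of the core placement idea (data layout and names made up for
illustration):

    /* Each pCPU holds at most one assigned vCPU; NULL means it is free. */
    #define NR_PCPUS 8

    struct vcpu;                          /* opaque here */
    static struct vcpu *assignment[NR_PCPUS];

    /* Return the pCPU a new vCPU gets assigned to, or -1 to make it wait. */
    static int null_assign(struct vcpu *v)
    {
        for ( int cpu = 0; cpu < NR_PCPUS; cpu++ )
            if ( !assignment[cpu] )
            {
                assignment[cpu] = v;      /* static 1:1 assignment */
                return cpu;
            }

        return -1;                        /* more vCPUs than pCPUs: wait */
    }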

This scheduler, therefore, if used in configurations
where every vCPU can be assigned to a pCPU, guarantees
low overhead, low latency, and consistent performance.

If used as default scheduler, at Xen boot, it is
recommended to limit the number of Dom0 vCPUs (e.g., with
'dom0_max_vcpus=x'). Otherwise, all the pCPUs will have
one of Dom0's vCPUs assigned, and there won't be room for
running any guest efficiently (if at all).

Target use cases are embedded and HPC, but it may well
be interesting also in other circumstances.

Kconfig and documentation are updated accordingly.

While there, also document the availability of sched=rtds
as boot parameter, which apparently had been forgotten.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 12:28:08 +0000 (14:28 +0200)]
xen: sched: make sure a pCPU added to a pool runs the scheduler ASAP

When a pCPU is added to a cpupool, the pool's scheduler
should immediately run on it so, for instance, any runnable
but not running vCPU can start executing there.

This currently does not happen. Make it happen by raising
the scheduler softirq directly from the function that
sets up the new scheduler for the pCPU.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Dario Faggioli [Fri, 7 Apr 2017 12:28:01 +0000 (14:28 +0200)]
xen: sched: improve robustness (and rename) DOM2OP()

Clarify and enforce (with ASSERTs) when the function
is called on the idle domain, and explain in comments
what it means and when it is ok to do so.

While there, change the name of the function to a more
self-explanatory one, and do the same to VCPU2OP.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Haozhong Zhang [Fri, 7 Apr 2017 13:56:09 +0000 (15:56 +0200)]
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()

mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere with the current round of MC handling. Always re-initialize it
to clear the historical value.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Haozhong Zhang [Fri, 7 Apr 2017 13:55:34 +0000 (15:55 +0200)]
x86/mce: make 'severity_cpu' private to its users

The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use private 'severity_cpu' for each function to fix this issue.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Adrian Pop [Fri, 7 Apr 2017 13:39:32 +0000 (15:39 +0200)]
x86/monitor: add support for descriptor access events

Adds monitor support for descriptor access events (reads & writes of
IDTR/GDTR/LDTR/TR) for the x86 architecture (VMX and SVM).

Signed-off-by: Adrian Pop <apop@bitdefender.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: minor cosmetic (hopefully!) cleanup]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Fri, 7 Apr 2017 13:38:40 +0000 (15:38 +0200)]
passthrough/io: fall back to remapping interrupt when we can't use VT-d PI

The current logic of using VT-d PI is: when the guest configures the
pirq's destination vCPU to a single vCPU, the corresponding IRTE is
updated to posted format. If the destination of the pirq then becomes
multiple vCPUs, we would wrongly stay in posted format. Obviously, we
should fall back to remapped interrupt when the guest wrongly configures
the destination of the pirq or makes it have multiple destination vCPUs.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: guard against vcpu being NULL]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Fri, 7 Apr 2017 13:38:17 +0000 (15:38 +0200)]
VT-d: introduce update_irte to update irte safely

We used structure assignment to update the IRTE, which is non-atomic when
the whole IRTE is to be updated. This is unsafe when an interrupt happens
during the update. Furthermore, no bug or warning would be reported when
this happened.

This patch introduces two variants, atomic and non-atomic, to update the
IRTE. For the initialization and release cases, the non-atomic variant will
be used. For other cases (such as reprogramming to set irq affinity), the
atomic variant will be used. If the caller requests an atomic update but we
can't meet it, we raise a bug.
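
A sketch of the two variants (assuming cmpxchg16b is available, i.e.
building with -mcx16; the structure layout is simplified):

    #include <stdbool.h>
    #include <stdint.h>

    struct iremap_entry { uint64_t lo, hi; };   /* one 128-bit IRTE */

    /* The atomic path replaces the whole 128-bit entry in one shot, so a
     * concurrently delivered interrupt can never see a half-written IRTE. */
    static void update_irte_(struct iremap_entry *entry,
                             const struct iremap_entry *new_ire, bool atomic)
    {
        if ( atomic )
        {
            __uint128_t old = *(__uint128_t *)entry;
            __uint128_t new = *(const __uint128_t *)new_ire;

            /* Retry until the 16-byte compare-and-swap succeeds. */
            while ( !__atomic_compare_exchange_n((__uint128_t *)entry, &old,
                                                 new, false, __ATOMIC_SEQ_CST,
                                                 __ATOMIC_SEQ_CST) )
                ;
        }
        else
        {
            entry->lo = new_ire->lo;
            entry->hi = new_ire->hi;
        }
    }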

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com> [x86]
Feng Wu [Fri, 7 Apr 2017 13:37:55 +0000 (15:37 +0200)]
VMX: fixup PI descriptor when cpu is offline

When a CPU goes offline, we need to move all the vCPUs in its blocking
list to another online CPU; this patch handles that.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Feng Wu [Fri, 7 Apr 2017 13:37:33 +0000 (15:37 +0200)]
VT-d: some cleanups

Use type-safe structure assignment instead of memcpy().
Use sizeof(*iremap_entry).

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Feng Wu [Fri, 7 Apr 2017 13:37:07 +0000 (15:37 +0200)]
VT-d: introduce new fields in msi_desc to track binding with guest interrupt

msi_msg_to_remap_entry() is buggy when the live IRTE is in posted format. It
wrongly inherits the 'im' field, meaning the IRTE is in posted format, but
updates all the other fields to remapping format.

There are also two situations that lead to the above issue. One is that some
callers really want to change the IRTE to remapped format. The other is that
some callers only want to update the MSI message (e.g. to set MSI affinity),
as they aren't aware that this MSI is bound to a guest interrupt. We should
suppress the update in the second situation. To distinguish them,
straightforwardly, we can let the caller specify which format of IRTE they
want to update to. Making all callers aware of the binding with the guest
interrupt isn't feasible, as it would cause a far more complicated change
(including the interfaces exposed to IOAPIC and MSI). Also, some calls happen
in interrupt context, where we can't acquire d->event_lock to read struct
hvm_pirq_dpci.

This patch introduces two new fields in msi_desc to track the binding with
a guest interrupt, such that msi_msg_to_remap_entry() can get the binding
and update the IRTE accordingly. After that change, pi_update_irte() can
utilize msi_msg_to_remap_entry() to update the IRTE to posted format.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Chao Gao [Fri, 7 Apr 2017 13:36:20 +0000 (15:36 +0200)]
passthrough: don't migrate pirq when it is delivered through VT-d PI

When a vCPU is migrated to another pCPU, pt irqs bound to this vCPU might
also need migration, as an optimization to reduce IPIs between pCPUs. When
VT-d PI is enabled, the interrupt vector will be recorded in a main memory
resident data structure and a notification whose destination is decided by
NDST is generated. NDST is properly adjusted during vCPU migration, so a
pirq directly injected to the guest needn't be migrated.

This patch adds an indicator, @posted, to show whether the pt irq is
delivered through VT-d PI. Also this patch fixes a bug where
hvm_migrate_pirq() accesses pirq_dpci->gmsi.dest_vcpu_id without checking
the pirq_dpci's type.

Signed-off-by: Chao Gao <chao.gao@intel.com>
[jb: remove an extranious check from hvm_migrate_pirq()]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:24 +0000 (13:37 +0200)]
x86: add multiboot2 protocol support for relocatable images

Add multiboot2 protocol support for relocatable images. Only GRUB2 with
the "multiboot2: Add support for relocatable images" patch understands
that feature. Older multiboot protocol (regardless of version)
compatible loaders ignore it and everything works as usual.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:02 +0000 (13:37 +0200)]
x86/boot: rename sym_phys() to sym_offs()

This way the macro name better describes its function.
Currently it is used to calculate a symbol's offset in
relation to the beginning of the Xen image mapping.
However, the value returned by sym_offs() for a given
symbol is not always equal to its physical address.
There is no functional change.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:36:32 +0000 (13:36 +0200)]
x86: make Xen early boot code relocatable

Every multiboot protocol (regardless of version) compatible image must
specify its load address (in the ELF or multiboot header), and multiboot
protocol compatible loaders have to load the image at the specified address.
However, there is no guarantee that the requested memory region (in the case
of Xen it starts at 2 MiB and ends at ~5 MiB) where the image should
initially be loaded is RAM and free (legacy BIOS platforms are merciful to
Xen, but I found at least one EFI platform on which the Xen load address
conflicts with EFI boot services; it is a Dell PowerEdge R820 with the
latest firmware). To cope with that problem we must make the Xen early boot
code relocatable and help the boot loader relocate the image properly, by
suggesting allowed address ranges instead of requesting specific load
addresses as is done right now. This patch does the former. It does not add
the multiboot2 protocol interface, which is done in the "x86: add
multiboot2 protocol support for relocatable images" patch.

This patch changes following things:
  - %esi register is used as storage for the Xen image load base address;
    it is mostly unused in early boot code and preserved during C function
    calls in 32-bit mode,
  - %fs is used as base for Xen data relative addressing in 32-bit code
    if it is possible; %esi is used for that purpose during error printing
    because it is not always possible to properly and efficiently
    initialize %fs.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:36:01 +0000 (13:36 +0200)]
x86/setup: use XEN_IMG_OFFSET instead of...

...calculating its value at runtime.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:35:32 +0000 (13:35 +0200)]
x86: change default load address from 1 MiB to 2 MiB

Subsequent patches introducing relocatable early boot code play with
page tables using 2 MiB huge pages. If the load address is not aligned to
2 MiB, then code touching such page tables must have special cases for the
start and end of the Xen image memory region. So, let's make life easier
and move the default load address from 1 MiB to 2 MiB. This way the page
table code will be nice and easy. Hence, there is a chance that it will be
less error prone too... :-)))

Additionally, drop the first 2 MiB mapping from the Xen image mapping.
It is no longer needed.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Jan Beulich [Fri, 7 Apr 2017 09:04:02 +0000 (09:04 +0000)]
x86emul: correct compat mode system descriptor handling

There are some oddities to take care of here - see the code comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 7 Apr 2017 10:08:34 +0000 (12:08 +0200)]
x86/HVM: don't leak PFEC_implicit to guests

Doing so may not only confuse them, but will - on VMX - lead to
VMRESUME failures. Add respective ASSERT()s where the fields get set
to guard against future similar issues (or - in the restore case -
fail the operation). In that latter code at once convert the mis-used
gdprintk() to dprintk(), as the vCPU of interest is not "current".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Chao Gao [Fri, 7 Apr 2017 10:06:18 +0000 (12:06 +0200)]
x86/hvm: make io.h self-contained

io.h uses struct npfec without including xen/mm.h, where the
structure is defined.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Vijaya Kumar K [Fri, 7 Apr 2017 10:04:14 +0000 (12:04 +0200)]
boot allocator: use arch helper for virt_to_mfn on DIRECTMAP_VIRT region

On ARM platforms with NUMA, while initializing the second memory node,
a panic is triggered from init_node_heap() when virt_to_mfn()
is called for a DIRECTMAP_VIRT region address, because the DIRECTMAP_VIRT
region is not mapped.

The check virt_to_mfn() here is used to know whether the max MFN is
part of the direct mapping. The max MFN is found by calling virt_to_mfn
on end address of DIRECTMAP_VIRT region, which is DIRECTMAP_VIRT_END.

On ARM64, all RAM is currently direct mapped in Xen, and virt_to_mfn
uses the hardware for address translation. So if the virtual address
is not mapped, a translation fault is raised.

In this patch, instead of calling virt_to_mfn(), an arch helper,
arch_mfn_in_directmap(), is introduced.

On ARM64 this arch helper will return true, because currently all RAM
is direct mapped in Xen.
On ARM32, only a limited amount of RAM, called xenheap, is always mapped
and DIRECTMAP_VIRT region is not mapped. Hence return false.
For x86 this helper does virt_to_mfn.
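
A sketch of the helper's per-arch behaviour as described above (the x86
branch mirrors the old check; the exact expressions are illustrative):

    /* Can `mfn` be reached through the direct mapping? */
    static inline bool arch_mfn_in_directmap(unsigned long mfn)
    {
    #if defined(CONFIG_ARM_64)
        return true;                /* all RAM is direct mapped */
    #elif defined(CONFIG_ARM_32)
        return false;               /* only xenheap is always mapped */
    #else /* x86 */
        unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);

        return mfn <= (virt_to_mfn(eva - 1) + 1);
    #endif
    }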

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Mohit Gambhir [Fri, 7 Apr 2017 10:03:46 +0000 (12:03 +0200)]
x86/vpmu_intel: handle SMT consistently for programmable and fixed counters

The patch introduces a macro FIXED_CTR_CTRL_ANYTHREAD_MASK and uses it
to mask the .AnyThread bit for all counters in the IA32_FIXED_CTR_CTRL MSR,
in all versions of Intel Architectural Performance Monitoring. Masking the
.AnyThread bit is necessary for two reasons:

1. We need to be consistent in the implementation. We disable the .AnyThread
bit in programmable counters (regardless of the version) by masking bit 21 in
IA32_PERFEVTSELx.  (See the code snippet below from vpmu_intel.c.)

 /* Masks used for testing whether an MSR is valid */
 #define ARCH_CTRL_MASK  (~((1ull << 32) - 1) | (1ull << 21))

But we leave it enabled in fixed function counters for version 3. Removing the
condition disables the bit in fixed function counters regardless of the version,
which is consistent with what is done for programmable counters.

2. We don't want to expose event counts from another guest (or hypervisor)
which can happen if .AnyThread bit is not masked and a VCPU is only scheduled
to run on one of the hardware threads in a hyper-threaded CPU.

Also, note that the Intel SDM discourages the use of the .AnyThread bit in
virtualized environments (per section 18.2.3.1, AnyThread Counting and
Software Evolution).
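
A sketch of the mask: each fixed counter owns a 4-bit control field in
IA32_FIXED_CTR_CTRL, with AnyThread at bit 2 of each field, so for three
fixed counters the AnyThread bits are bits 2, 6 and 10:

    #include <stdint.h>

    /* AnyThread bits of the (up to three) fixed-counter control fields. */
    #define FIXED_CTR_CTRL_ANYTHREAD_MASK 0x444ull

    /* Force .AnyThread off in a guest write to IA32_FIXED_CTR_CTRL. */
    static inline uint64_t sanitize_fixed_ctr_ctrl(uint64_t msr_content)
    {
        return msr_content & ~FIXED_CTR_CTRL_ANYTHREAD_MASK;
    }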

Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Roger Pau Monné [Fri, 7 Apr 2017 10:03:15 +0000 (12:03 +0200)]
x86/io: move the list of guest to machine IO ports out of domain_iommu

There's no reason to store that list inside of the domain_iommu struct, the
forwarding of guest IO ports into machine IO ports is not tied to the presence
of an IOMMU.

Move it inside of the hvm_domain struct instead.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Fri, 7 Apr 2017 10:02:22 +0000 (12:02 +0200)]
x86/io: rename misleading dpci_ prefixed functions to g2m_

The dpci_ prefix used on those IO handlers is misleading, there's nothing PCI
specific in them, they simply map a guest IO port into a machine (physical) IO
port. They don't specifically trap the PCI IO port range in any way
(0xcf8/0xcfc).

Rename them to use the g2m_ prefix in order to avoid this confusion.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Tamas K Lengyel [Fri, 7 Apr 2017 10:01:10 +0000 (12:01 +0200)]
altp2m: introduce external-only and limited use-cases

Currently setting altp2mhvm=1 in the domain configuration allows access to the
altp2m interface for both in-guest and external privileged tools. This poses
a problem for use-cases where only external access should be allowed, requiring
the user to compile Xen with XSM enabled to be able to appropriately restrict
access.

In this patch we deprecate the altp2mhvm domain configuration option and
introduce the altp2m option, which allows specifying if by default the altp2m
interface should be external-only or limited. The information is stored in
HVM_PARAM_ALTP2M which we now define with specific XEN_ALTP2M_* modes.
If external mode is selected, the XSM check is shifted to use XSM_DM_PRIV
type check, thus restricting access to the interface by the guest itself. Note
that we keep the default XSM policy untouched. Users of XSM who wish to enforce
external mode for altp2m can do so by adjusting their XSM policy directly,
as this domain config option does not override an active XSM policy.

Also, as part of this patch we adjust the hvmop handler to require
HVM_PARAM_ALTP2M to be of a type other than disabled for all ops. This was
previously only required for get/set altp2m domain state; all other ops
were gated on altp2m_enabled. Since altp2m_enabled only gets set during set
altp2m domain state, this change introduces no new requirements for the
other ops but makes it clearer that it is required for all ops.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com>
Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Thu, 6 Apr 2017 18:33:36 +0000 (19:33 +0100)]
xen: use a dummy file in C99 header check

The check builds each header file as if it were a C file. Clang doesn't
like the idea of having dead code in a C file; the check as-is fails on
Clang with unused-function warnings.

Use a dummy file like the C++ header check to fix this.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Stefano Stabellini [Wed, 5 Apr 2017 20:28:43 +0000 (13:28 -0700)]
vgic: refuse irq migration when one is already in progress

When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.

This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.

It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Stefano Stabellini [Wed, 5 Apr 2017 20:28:42 +0000 (13:28 -0700)]
arm: remove irq from inflight, then change physical affinity

This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.

When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Tue, 7 Mar 2017 16:20:51 +0000 (16:20 +0000)]
tools/insn-fuzz: Fix assertion failures in x86_emulate_wrapper()

c/s 92cf67888 "x86/emul: Hold x86_emulate() to strict X86EMUL_EXCEPTION
requirements" was appropriate for the hypervisor, but the fuzzer stubs didn't
conform to the stricter requirements.  AFL is very quick to discover this.

Extend the fuzzing harness exception logic to raise exceptions appropriately.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 27 Mar 2017 09:37:35 +0000 (10:37 +0100)]
tools/insn-fuzz: Provide IA32_DEBUGCTL consistently to the emulator

x86_emulate()'s is_branch_step() performs a speculative read of
IA32_DEBUGCTL, but doesn't squash exceptions should they arise.  In reality,
this MSR is always available.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 21 Mar 2017 16:49:36 +0000 (16:49 +0000)]
tools/insn-fuzz: Correct hook prototypes, and assert() appropriate segments

The correct prototypes for the hooks are to use enum x86_segment rather than
unsigned int.  It is implementation specific as to whether this compiles.

assert() that the emulator never passes an inappropriate segment.  The only
hook which may legitimately be passed x86_seg_none is invlpg().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 20 Mar 2017 19:17:33 +0000 (19:17 +0000)]
tools/insn-fuzz: Fix a stability bug in afl-clang-fast mode

The fuzzing harness conditionally disables hooks to test error paths in the
emulator.  However, fuzz_emulops is a static structure.

c/s 69f4633 "tools/insn-fuzz: Support AFL's afl-clang-fast mode" introduced
persistent mode, but because fuzz_emulops is static, the clobbering of hooks
accumulates over repeated inputs, meaning that previous corpora influence
the execution of the current corpus.

Move the partially clobbered struct x86_emulate_ops into struct fuzz_state,
which is re-initialised in full on each call to LLVMFuzzerTestOneInput().

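A sketch of the arrangement (the structure layout and initialiser
names are assumptions):

    struct fuzz_state
    {
        /* Private copy of the hooks, re-initialised for every input,
         * so clobbering cannot accumulate across corpora. */
        struct x86_emulate_ops ops;
        /* ... other per-run fuzzer state ... */
    };

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        struct fuzz_state state = { .ops = all_fuzzer_ops };

        /* ... run one emulation pass against 'data' ... */
        return 0;
    }
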
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agotools/insn-fuzz: Avoid making use of static data
Andrew Cooper [Mon, 20 Mar 2017 18:33:59 +0000 (18:33 +0000)]
tools/insn-fuzz: Avoid making use of static data

AFL has a measure of stability, where it passes the same corpus into the
fuzzing harness and observes whether the execution path changes from before.
Any instability in the fuzzing harness reduces its effectiveness, as an
observed crash may not reliably be caused by the original corpus.

In preparation to fix a stability bug, introduce struct fuzz_state, allocated
on the stack and passed around via struct x86_emulate_ctxt's data parameter.
Propagate ctxt into the helpers such as maybe_fail(), so the state can be
retrieved.

Move the previously-static data_{index,num} into struct fuzz_state.

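For example, a helper recovers the per-run state from the context
along these lines (a sketch; the helper and field names are
assumptions):

    static bool maybe_fail(struct x86_emulate_ctxt *ctxt,
                           const char *why, bool exception)
    {
        /* Recover the per-run state stashed in the context... */
        struct fuzz_state *s = ctxt->data;

        /* ...and consume input via s->data_index, rather than via
         * file-scope static variables. */
        return consume_fail_byte(s, why, exception);
    }
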
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agotools/insn-fuzz: Don't hit memcpy() for zero-length reads
Andrew Cooper [Thu, 2 Mar 2017 18:36:54 +0000 (18:36 +0000)]
tools/insn-fuzz: Don't hit memcpy() for zero-length reads

For control-flow changes, the emulator needs to perform a zero-length
instruction fetch at the target offset.  It also passes NULL for the
destination buffer, as there is no instruction stream to collect.

This trips up UBSAN when passed to memcpy(), as passing NULL is undefined
behaviour per the C spec (irrespective of passing a size of 0).

Special case these fetches in fuzz_insn_fetch() before reaching data_read().

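A sketch of the special case (helper names are assumptions):

    static int fuzz_insn_fetch(enum x86_segment seg, unsigned long offset,
                               void *p_data, unsigned int bytes,
                               struct x86_emulate_ctxt *ctxt)
    {
        /* Zero-length fetches with a NULL buffer probe branch targets;
         * handle them before memcpy(), for which NULL is undefined. */
        if ( !bytes && !p_data )
            return maybe_fail(ctxt, "insn_fetch", true);

        return data_read(ctxt, "insn_fetch", p_data, bytes);
    }
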
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agoMAINTAINERS: Move the x86 instruction emulator under x86 maintainership
Andrew Cooper [Wed, 29 Mar 2017 16:12:37 +0000 (17:12 +0100)]
MAINTAINERS: Move the x86 instruction emulator under x86 maintainership

Requested-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
8 years agox86/emul: Require callers to provide LMA in the emulation context
Andrew Cooper [Fri, 31 Mar 2017 13:49:45 +0000 (14:49 +0100)]
x86/emul: Require callers to provide LMA in the emulation context

Long mode (or not) influences emulation behaviour in a number of cases.
Instead of reusing the ->read_msr() hook to obtain EFER.LMA, require callers
to provide it directly.

This simplifies all long mode checks during emulation to a simple boolean
read, removing embedded MSR reads.  It also allows for the removal of a local
variable in the sysenter emulation block, and removes a latent bug in the
syscall emulation block where rc contains a non-X86EMUL_* constant for a
period of time.

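Sketch of the two sides of the new contract (the predicate name and
the exact field are assumptions inferred from the description):

    /* Caller side, e.g. during HVM emulation setup: */
    ctxt->lma = hvm_long_mode_active(curr);

    /* Inside x86_emulate(), long mode checks collapse to: */
    if ( ctxt->lma )
    {
        /* 64-bit specific handling ... */
    }
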
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
8 years agox86/emul: Drop swint_emulate infrastructure
Andrew Cooper [Fri, 31 Mar 2017 17:13:38 +0000 (18:13 +0100)]
x86/emul: Drop swint_emulate infrastructure

With the SVM injection logic capable of doing its own emulation, there is no
need for this hardware-specific assistance in the common emulator.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
8 years agox86/svm: Introduce svm_emul_swint_injection()
Andrew Cooper [Thu, 30 Mar 2017 17:27:07 +0000 (17:27 +0000)]
x86/svm: Introduce svm_emul_swint_injection()

Software events require emulation in some cases on AMD hardware.  Introduce
svm_emul_swint_injection() to perform this emulation if necessary in
svm_inject_event(), which copes with events from any source, rather than
just those coming from x86_emulate().

This logic mirrors inject_swint() in the x86 instruction emulator.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/hvm: Fix segmentation logic for system segments
Andrew Cooper [Fri, 31 Mar 2017 16:03:26 +0000 (17:03 +0100)]
x86/hvm: Fix segmentation logic for system segments

c/s c785f759718 "x86/emul: Prepare to allow use of system segments for memory
references" made alterations to hvm_virtual_to_linear_addr() to allow for the
use of system segments.

However, the determination of which segmentation mode to use was based on the
current address size from emulation.

In particular, it is wrong for system segment accesses while executing in a
compatibility mode code segment.  When long mode is active, all system
segments have a 64-bit base, and this must not be truncated during the
calculation of the linear address.  (Note that the presence and limit checks
for system segments behave the same, and are already uniformly applied in both
cases.)

Replace the existing addr_size parameter with active_cs, which gets used in
combination with current to work out which segmentation logic to use.

While here, also fix the determination of segmentation to use for vm86 mode,
which is a protected mode facility but which uses real mode segmentation.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/hvm: Correct long mode predicate
Andrew Cooper [Fri, 31 Mar 2017 17:14:07 +0000 (17:14 +0000)]
x86/hvm: Correct long mode predicate

hvm_long_mode_enabled() tests for EFER.LMA, which is specifically different to
EFER.LME.

Rename it to match its behaviour, and have it strictly return a boolean value
(although all its callers already use it in implicitly-boolean contexts, so no
functional change).

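A sketch of the renamed predicate (the new name and field names here
are assumptions based on the description):

    static inline bool hvm_long_mode_active(const struct vcpu *v)
    {
        /* LMA: long mode actually active, not merely enabled (LME). */
        return !!(v->arch.hvm_vcpu.guest_efer & EFER_LMA);
    }
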
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Tim Deegan <tim@xen.org>
8 years agox86/hvm: Correct some address space terminology
Andrew Cooper [Fri, 31 Mar 2017 15:06:07 +0000 (16:06 +0100)]
x86/hvm: Correct some address space terminology

The function hvm_translate_linear_addr() translates a virtual address to a
linear address, not a linear address to a physical address.  Correct its name.

Both hvm_translate_virtual_addr() and hvmemul_virtual_to_linear() return a
linear address, but a parameter name of paddr is easily confused with paddr_t.
Rename it to linear, to clearly identify the address space, and for
consistency with hvm_virtual_to_linear_addr().

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agogolang/xenlight: Implement cpupool operations
Ronald Rojas [Wed, 5 Apr 2017 16:05:54 +0000 (17:05 +0100)]
golang/xenlight: Implement cpupool operations

Include some useful "Utility" functions:
- CpupoolFindByName
- CpupoolMakeFree

Still need to implement the following functions:
- libxl_cpupool_rename
- libxl_cpupool_cpuadd_node
- libxl_cpupool_cpuremove_node
- libxl_cpupool_movedomain

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Implement get console path operations
Ronald Rojas [Wed, 5 Apr 2017 16:05:53 +0000 (17:05 +0100)]
golang/xenlight: Implement get console path operations

Implement a Golang enumeration of libxl_console_type as ConsoleType.

Implement the following libxl functions:
- libxl_console_get_tty
- libxl_primary_console_get_tty

Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Implement Vcpuinfo and ListVcpu
Ronald Rojas [Wed, 5 Apr 2017 16:05:52 +0000 (17:05 +0100)]
golang/xenlight: Implement Vcpuinfo and ListVcpu

Include a Golang version of libxl_vcpu_info as VcpuInfo.

Add a Golang call for libxl_list_vcpu as ListVcpu.

Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Implement Domain operations
Ronald Rojas [Wed, 5 Apr 2017 16:05:51 +0000 (17:05 +0100)]
golang/xenlight: Implement Domain operations

Add calls for the following Domain related functionality
- libxl_domain_pause
- libxl_domain_shutdown
- libxl_domain_reboot
- libxl_list_domain

Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Implement libxl_scheduler enumeration
Ronald Rojas [Wed, 5 Apr 2017 16:05:50 +0000 (17:05 +0100)]
golang/xenlight: Implement libxl_scheduler enumeration

Include both constants and a Stringification for libxl_scheduler.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Implement libxl_bitmap and helper operations
Ronald Rojas [Wed, 5 Apr 2017 16:05:49 +0000 (17:05 +0100)]
golang/xenlight: Implement libxl_bitmap and helper operations

Implement Bitmap type, along with helper functions.

The Bitmap type is implemented internally in a way which makes it
easy to copy into and out of the C libxl_bitmap type.

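For reference, the C type being mirrored looks roughly like this,
which is why a flat byte slice on the Go side copies in and out so
easily:

    /* Approximate shape of libxl_bitmap, the C type Bitmap mirrors: */
    typedef struct {
        uint32_t size;   /* number of bytes in map */
        uint8_t *map;
    } libxl_bitmap;
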
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Implement libxl_domain_info and libxl_domain_unpause
Ronald Rojas [Wed, 5 Apr 2017 16:05:48 +0000 (17:05 +0100)]
golang/xenlight: Implement libxl_domain_info and libxl_domain_unpause

Add calls for the following host-related functionality:
- libxl_domain_info
- libxl_domain_unpause

Include a Golang version of libxl_domain_info as DomainInfo.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Add host-related functionality
Ronald Rojas [Wed, 5 Apr 2017 16:05:47 +0000 (17:05 +0100)]
golang/xenlight: Add host-related functionality

Add calls for the following host-related functionality:
- libxl_get_max_cpus
- libxl_get_online_cpus
- libxl_get_max_nodes
- libxl_get_free_memory
- libxl_get_physinfo
- libxl_get_version_info

Include Golang versions of the following structs:
- libxl_physinfo as Physinfo
- libxl_version_info as VersionInfo
- libxl_hwcap as Hwcap

Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Add error constants and standard handling
Ronald Rojas [Wed, 5 Apr 2017 16:05:46 +0000 (17:05 +0100)]
golang/xenlight: Add error constants and standard handling

Create an error type, Errorxl, for returning proper xenlight
errors.

Update the Ctx functions to return Errorxl errors.

Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agogolang/xenlight: Create stub package
Ronald Rojas [Wed, 5 Apr 2017 16:05:45 +0000 (17:05 +0100)]
golang/xenlight: Create stub package

Create a basic Makefile to build and install libxenlight Golang
bindings. Also add a stub package which only opens libxl context.

Include a global xenlight.Ctx variable which can be used as the
default context by the entire program if desired.

For now, return simple errors. Proper error handling will be
added in the next patch.

Until we get configure support, disable it by default.  It can be
enabled either by adding "CONFIG_GOLANG=y" to .config, or adding it to
the 'make' line.

Signed-off-by: Ronald Rojas <ronladred@gmail.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@citrix.com>
8 years agodocs: Clarify the expected behaviour of zero-content records
Andrew Cooper [Thu, 30 Mar 2017 16:32:34 +0000 (17:32 +0100)]
docs: Clarify the expected behaviour of zero-content records

The sending side shouldn't send any data records which end up having
zero-length content, but the receiving side will need to tolerate such
records for compatibility purposes.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
[ wei: fix typos etc ]
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agotools/python: Adjust migration v2 library to warn about zero-content records
Andrew Cooper [Thu, 30 Mar 2017 16:32:33 +0000 (17:32 +0100)]
tools/python: Adjust migration v2 library to warn about zero-content records

These records shouldn't be in a stream, but accidentally are.  Warn about
them, but don't abort the verification.

While here, add a missing length check to the X86_PV_P2M_FRAMES record
checker.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agotools/libxc: Avoid generating inappropriate zero-content records
Andrew Cooper [Thu, 30 Mar 2017 16:32:32 +0000 (17:32 +0100)]
tools/libxc: Avoid generating inappropriate zero-content records

The code as written attempted to elide zero-content records, as such records
serve no purpose but come with a performance hit.  Unfortunately, in the case
where the hypervisor-reported max size is non-zero but the actual size is
zero, the record is not elided.

This previously tripped up the sanity checks in the restore side of migration,
but as the underlying reasons for eliding the records in the first place are
still valid, fix the elision logic.

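A sketch of the fixed elision test (names are assumptions; the point
is to key off the actual content length, never the hypervisor-reported
maximum):

    /* 'len' is the actual content size, which may be zero even when
     * the hypervisor-reported maximum is not. */
    if ( len == 0 )
        return 0;  /* Nothing to send: elide the record entirely. */

    rc = write_record(ctx, &rec);
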
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agotools/libxc: Tolerate specific zero-content records in migration v2 streams
Andrew Cooper [Thu, 30 Mar 2017 16:32:31 +0000 (17:32 +0100)]
tools/libxc: Tolerate specific zero-content records in migration v2 streams

The migration v2 save code was written to avoid sending data records with no
content, as such records serve no purpose but come with a performance hit.
The restore code sanity checks this expectation.

Under some circumstances (most notably, on AMD hardware with Debug Extensions,
and a PV guest kernel which is not using the feature), the save code would
generate a record with no content, which trips the sanity check in the restore
code.

As the stream is otherwise fine, tolerate these records and avoid failing the
migration.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxc: fix xc_translate_foreign_address()
Cristian-Bogdan Sirb [Wed, 5 Apr 2017 14:53:53 +0000 (17:53 +0300)]
libxc: fix xc_translate_foreign_address()

Currently xc_translate_foreign_address() only checks for the PSE bit on
level 2 entries (that's 2 MB pages on x64 and on 32-bit with PAE, and
4 MB pages on 32-bit). But the Linux kernel sometimes uses 1 GB pages.
This patch fixes that by checking the PSE bit on level 3 entries when
the guest has 4 translation levels (that is, 64-bit guests only).

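An illustrative sketch of the added check (macro and variable names
are assumptions):

    /* With 4 translation levels (64-bit guests), a set PSE bit in a
     * level-3 entry denotes a 1 GB superpage. */
    if ( levels == 4 && (l3e & PTE_PSE) )
    {
        /* The low 30 bits of the virtual address are the offset
         * into the 1 GB page. */
        paddr = (l3e & L3_FRAME_MASK) | (virt & ((1UL << 30) - 1));
        goto done;
    }
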
Signed-off-by: Cristian-Bogdan Sirb <csirb@bitdefender.com>
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: Handle guest external abort as guest SError
Wei Chen [Wed, 5 Apr 2017 09:09:21 +0000 (17:09 +0800)]
xen/arm: Handle guest external abort as guest SError

Guest-generated external data/instruction aborts can be treated as
guest SErrors. We already have a handler for SErrors, so we can reuse
it to handle guest external aborts.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Prevent slipping hypervisor SError to guest
Wei Chen [Wed, 5 Apr 2017 09:09:20 +0000 (17:09 +0800)]
xen/arm: Prevent slipping hypervisor SError to guest

If there is a pending SError while we're returning from a trap and the
SError handling option is "DIVERSE", we have to prevent this
hypervisor SError from slipping to the guest. So we use dsb/isb to
guarantee that any pending hypervisor SError is caught in the
hypervisor before returning to the guest.

A previous patch sets SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT in cpu_hwcaps
when the option is NOT "DIVERSE". This means we can use the
alternative to skip synchronizing SErrors for the other SError
handling options.

Because the Abort/SError bit was unmasked in a previous patch, we have
to disable Abort/SError before returning to the guest, as we already
do for IRQs.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: Isolate the SError between the context switch of 2 vCPUs
Wei Chen [Wed, 5 Apr 2017 09:09:19 +0000 (17:09 +0800)]
xen/arm: Isolate the SError between the context switch of 2 vCPUs

If there is a pending SError during a context switch and the SError
handling option is "FORWARD", we have to guarantee that this SError is
caught by the current vCPU; otherwise it will be caught by the next
vCPU and wrongly forwarded to it.

So we have to synchronize SErrors before switching to the next vCPU.
But this is only required by the "FORWARD" option, so we add a new
flag, SKIP_CTXT_SWITCH_SERROR_SYNC, to cpu_hwcaps to skip
synchronizing SErrors during context switch for the other options.
This also avoids having to export serror_op access to other source
files.

Because the Abort/SError bit was unmasked in a previous patch, we have
to disable Abort/SError before doing a context switch, as we already
do for IRQs.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Introduce a macro to synchronize SError
Wei Chen [Wed, 5 Apr 2017 09:09:18 +0000 (17:09 +0800)]
xen/arm: Introduce a macro to synchronize SError

In previous patches, we have provided the ability to synchronize
SErrors in the exception entries. But we haven't synchronized SErrors
while returning to the guest or doing a context switch.

So we still have two risks:
1. Slipping hypervisor SErrors to the guest. For example, the
   hypervisor triggers an SError while returning to the guest, but the
   SError may only be delivered after entering the guest. With the
   "DIVERSE" option, this SError would be routed back to the guest and
   panic the guest, when actually we should crash the whole system
   because of this hypervisor SError.
2. Slipping a previous guest's SErrors to the next guest. With the
   "FORWARD" option, if the hypervisor triggers an SError while
   context switching, the SError may only be delivered after switching
   to the next vCPU. In this case, the SError will be forwarded to the
   next vCPU and may panic an incorrect guest.

So we have to introduce this macro to synchronize SErrors while
returning to the guest and doing a context switch. In this macro, we
use an ASSERT to make sure aborts are unmasked: we unmasked aborts in
the exception entries, but we don't know whether someone will mask
them again in the future.

We also add a barrier to this macro to prevent the compiler from
reordering our asm volatile code.

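A minimal sketch of the macro's shape, ignoring the alternatives-based
patching keyed on cpu_hwcaps:

    #define SYNCHRONIZE_SERROR()                                   \
        do {                                                       \
            /* Pointless unless aborts are currently unmasked. */  \
            ASSERT(local_abort_is_enabled());                      \
            /* Take any pending SError here; the "memory" clobber  \
             * doubles as the compiler barrier mentioned above. */ \
            asm volatile ( "dsb sy; isb" : : : "memory" );         \
        } while ( 0 )
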
Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Introduce a helper to check local abort is enabled
Wei Chen [Wed, 5 Apr 2017 09:09:17 +0000 (17:09 +0800)]
xen/arm: Introduce a helper to check local abort is enabled

A previous patch unmasked the Abort/SError bit for Xen during most of
its running time. So in some use cases we have to check whether aborts
are enabled in the current context. For example, when we want to
synchronize SErrors, we have to confirm that aborts are enabled;
otherwise synchronizing SErrors is pointless.

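A sketch of such a helper for arm64, where PSTATE.A is bit 8 of the
DAIF register:

    static inline bool local_abort_is_enabled(void)
    {
        uint64_t daif;

        asm volatile ( "mrs %0, daif" : "=r" (daif) );

        /* Aborts are enabled when the A (SError mask) bit is clear. */
        return !(daif & (1UL << 8));
    }
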
Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: Unmask the Abort/SError bit in the exception entries
Wei Chen [Wed, 5 Apr 2017 09:09:16 +0000 (17:09 +0800)]
xen/arm: Unmask the Abort/SError bit in the exception entries

Currently, we mask the Abort/SError bit in Xen's exception entries, so
Xen cannot capture any Abort/SError while it is running. Now that Xen
has the ability to handle Aborts/SErrors, we should unmask the
Abort/SError bit by default to let Xen capture them while it is
running.

But in order to avoid receiving nested asynchronous aborts, we don't
unmask the Abort/SError bit in hyp_error and trap_data_abort.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Replace do_trap_guest_serror with new helpers
Wei Chen [Wed, 5 Apr 2017 09:09:15 +0000 (17:09 +0800)]
xen/arm: Replace do_trap_guest_serror with new helpers

We have introduced two helpers to handle guest/hyp SErrors:
do_trap_guest_serror and do_trap_hyp_serror. These helpers can take
over the role of the old do_trap_guest_serror handler while reducing
the amount of assembly code at the same time. So we use the two
helpers to replace the old handler and drop it now.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Introduce new helpers to handle guest/hyp SErrors
Wei Chen [Wed, 5 Apr 2017 09:09:14 +0000 (17:09 +0800)]
xen/arm: Introduce new helpers to handle guest/hyp SErrors

Currently, ARM32 and ARM64 have different SError exception handlers.
These handlers include lots of code to check the SError handling
options and to distinguish guest-generated SErrors from hypervisor
SErrors.

The new helpers, do_trap_guest_serror and do_trap_hyp_serror, are
wrappers around __do_trap_serror with constant guest/hyp parameters.
__do_trap_serror moves the option-checking and SError-classification
code from assembly to C. This makes the code more readable and avoids
placing checking code in too many places.

These two helpers only handle the following 3 types of SErrors:
1) Guest-generated SErrors that had been delivered in EL1 and then
   forwarded to EL2.
2) Guest-generated SErrors that hadn't been delivered in EL1 before
   trapping to EL2. Such an SError will be caught in EL2 as soon as
   we unmask the PSTATE.A bit.
3) Hypervisor-generated native SErrors, which would be a bug.

In the new helpers, we use the function inject_vabt_exception, which
was previously disabled by "#if 0". Now we can remove the "#if 0" to
make this function available.

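Sketch of the wrapper shape (parameter handling simplified):

    void do_trap_guest_serror(struct cpu_user_regs *regs)
    {
        __do_trap_serror(regs, true /* guest-generated */);
    }

    void do_trap_hyp_serror(struct cpu_user_regs *regs)
    {
        __do_trap_serror(regs, false /* hypervisor-generated */);
    }
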
Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>