xenbits.xensource.com Git - people/royger/xen.git/log
people/royger/xen.git
8 years ago (no commit message) vpci_v1.1 gitlab/vpci_v1.1
Roger Pau Monne [Tue, 18 Apr 2017 11:10:16 +0000 (12:10 +0100)]

8 years ago vpci/msi: add MSI handlers
Roger Pau Monne [Tue, 18 Apr 2017 11:10:16 +0000 (12:10 +0100)]
vpci/msi: add MSI handlers

8 years ago vpci: add a priority field to the vPCI register initializer
Roger Pau Monne [Tue, 18 Apr 2017 11:10:16 +0000 (12:10 +0100)]
vpci: add a priority field to the vPCI register initializer

And mark the capability and header vPCI register initializer as high priority,
so that they are initialized first.

This is needed for MSI-X, since MSI-X needs to know the position of the BARs in
order to perform its initialization, and in order to mask or enable the
MSI/MSI-X functionality on demand.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
8 years ago xen/vpci: trap access to the list of PCI capabilities
Roger Pau Monne [Tue, 18 Apr 2017 11:10:16 +0000 (12:10 +0100)]
xen/vpci: trap access to the list of PCI capabilities

Add traps to each capability PCI_CAP_LIST_NEXT field in order to mask them on
request.

All capabilities from the device are fetched and stored in an internal list
that's later used in order to return the next capability to the guest. Note
that this only removes the capability from the linked list as seen by the
guest, but the actual capability structure could still be accessed by the
guest, provided that its position can be found using another mechanism.
Finally the MSI and MSI-X capabilities are masked until Xen knows how to
properly handle accesses to them.
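
As an illustration of the masking described above, here is a minimal sketch in
C. It is not the code from this patch: it walks a flat copy of the config space
instead of the internal list the patch uses, and the names cap_is_masked,
next_visible_cap and guest_read_cap_next are hypothetical. Only the register
offsets and capability IDs come from the PCI specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_CAPABILITY_LIST 0x34 /* pointer to the first capability */
#define PCI_CAP_ID_MSI      0x05
#define PCI_CAP_ID_MSIX     0x11

/* Hypothetical policy: hide MSI and MSI-X, expose everything else. */
static bool cap_is_masked(uint8_t id)
{
    return id == PCI_CAP_ID_MSI || id == PCI_CAP_ID_MSIX;
}

/*
 * Return the offset of the first visible capability at or after 'pos',
 * following the hardware next pointers and skipping masked entries.
 * Returns 0 (end of list) when nothing visible is left.
 */
static uint8_t next_visible_cap(const uint8_t cfg[256], uint8_t pos)
{
    unsigned int ttl = 48; /* guard against malformed, looping lists */

    while ( pos >= 0x40 && ttl-- )
    {
        pos &= ~3;
        if ( !cap_is_masked(cfg[pos]) )
            return pos;
        pos = cfg[pos + 1]; /* PCI_CAP_LIST_NEXT of the masked capability */
    }

    return 0;
}

/* Value the guest sees when reading PCI_CAP_LIST_NEXT of the cap at 'pos'. */
static uint8_t guest_read_cap_next(const uint8_t cfg[256], uint8_t pos)
{
    assert(pos >= 0x40 && pos <= 0xfc);
    return next_visible_cap(cfg, cfg[pos + 1]);
}
```

Note that, as the commit message says, this only hides the capability from the
list: a read at the masked capability's own offset would still reach it.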

This should allow a PVH Dom0 to boot on some hardware, provided that the
hardware doesn't require MSI/MSI-X and that there are no SR-IOV devices in the
system, so the panic at the end of the PVH Dom0 build is replaced by a
warning.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
 - Add missing newline between cmd handlers.
 - Switch the handler to use list_for_each_entry_continue instead of a wrong
   open-coded version of it.

8 years ago xen/vpci: add handlers to map the BARs
Roger Pau Monne [Tue, 18 Apr 2017 11:10:16 +0000 (12:10 +0100)]
xen/vpci: add handlers to map the BARs

Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to emulate BAR sizing and BAR relocation.

The command handler is used to detect changes to bit 1 (the memory space
enable bit, PCI_COMMAND_MEMORY), and maps/unmaps the BARs of the device into
the guest p2m.

The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.
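
For readers unfamiliar with BAR sizing, the following is a minimal sketch in C
of what the guest does and what an emulated 32-bit memory BAR has to return.
The struct and function names are hypothetical and are not the handlers added
by this patch; only the sizing protocol itself (write all ones, read back,
invert) comes from the PCI specification.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_BASE_ADDRESS_MEM_MASK (~0xfU) /* low 4 bits are type flags */

/* Hypothetical emulated 32-bit memory BAR. */
struct vbar {
    uint32_t addr;  /* current guest-visible base address */
    uint32_t size;  /* size in bytes, a power of two >= 16 */
    uint32_t flags; /* read-only low bits (type, prefetchable) */
};

/* Address bits below the BAR size are hardwired to zero. */
static void vbar_write(struct vbar *bar, uint32_t val)
{
    assert(bar->size >= 16 && !(bar->size & (bar->size - 1)));
    bar->addr = val & ~(bar->size - 1);
}

static uint32_t vbar_read(const struct vbar *bar)
{
    return bar->addr | bar->flags;
}

/* What a guest computes from the value read back after writing all ones. */
static uint32_t bar_size_from_probe(uint32_t probe)
{
    return ~(probe & PCI_BASE_ADDRESS_MEM_MASK) + 1;
}
```

A guest sizing the BAR saves the current value, writes 0xffffffff, reads the
register back, passes the result to bar_size_from_probe() and then restores
the saved value. Writing any other address relocates the BAR.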

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
8 years ago xen/pci: split code to size BARs from pci_add_device
Roger Pau Monne [Tue, 18 Apr 2017 11:10:15 +0000 (12:10 +0100)]
xen/pci: split code to size BARs from pci_add_device

So that it can be called from outside in order to get the size of regular PCI
BARs. This will be required in order to map the BARs from PCI devices into PVH
Dom0 p2m.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
8 years ago xen/mm: move modify_identity_mmio to global file
Roger Pau Monne [Tue, 18 Apr 2017 11:10:15 +0000 (12:10 +0100)]
xen/mm: move modify_identity_mmio to global file

Mostly code motion, this function will be needed in other parts apart from PVH
Dom0 build. Also note that the __init attribute is dropped.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
8 years ago x86/ecam: add handlers for the PVH Dom0 MMCFG areas
Roger Pau Monne [Tue, 18 Apr 2017 11:10:15 +0000 (12:10 +0100)]
x86/ecam: add handlers for the PVH Dom0 MMCFG areas

Introduce a set of handlers for the accesses to the ECAM areas. Those areas are
set up based on the contents of the hardware MMCFG tables, and the list of
handled ECAM areas is stored inside the hvm_domain struct.

The read/writes are forwarded to the generic vpci handlers once the address is
decoded in order to obtain the device and register the guest is trying to
access.
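
The address decoding mentioned above follows the standard ECAM layout (bus in
bits 27:20, device in 19:15, function in 14:12, register in 11:0 of the offset
into the area). A minimal sketch in C, where the struct and function names are
hypothetical and not taken from this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result of decoding an ECAM access. */
struct ecam_target {
    uint8_t bus, dev, fn;
    uint16_t reg;
};

/*
 * Decode a guest physical address inside an ECAM area that starts at 'base'
 * and whose first decoded bus is 'start_bus' (both taken from MMCFG).
 */
static struct ecam_target ecam_decode(uint64_t base, uint8_t start_bus,
                                      uint64_t addr)
{
    uint64_t off;
    struct ecam_target t;

    assert(addr >= base);
    off = addr - base;

    t.bus = (uint8_t)(start_bus + ((off >> 20) & 0xff));
    t.dev = (off >> 15) & 0x1f;
    t.fn  = (off >> 12) & 0x7;
    t.reg = off & 0xfff;

    return t;
}
```

The decoded bus/device/function and register offset are what would then be
handed to the generic vpci read and write handlers.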

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v1:
 - Added locking.

8 years ago xen/vpci: introduce basic handlers to trap accesses to the PCI config space
Roger Pau Monne [Tue, 18 Apr 2017 11:10:15 +0000 (12:10 +0100)]
xen/vpci: introduce basic handlers to trap accesses to the PCI config space

This functionality is going to reside in vpci.c (and the corresponding vpci.h
header), and should be arch-agnostic. The handlers introduced in this patch
set up the basic functionality required in order to trap accesses to the PCI
config space, and allow decoding the address and finding the corresponding
handler that should handle the access (although no handlers are implemented).

Note that the traps to the PCI IO port registers (0xcf8/0xcfc) are set up
inside an x86 HVM file, since that's not shared with other arches.
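
For reference, decoding an access made through the 0xcf8/0xcfc IO ports can be
sketched as follows. This is an illustration in C of the standard x86
configuration mechanism, not the code from this patch; cf8_target and
cf8_decode are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CF8_ENABLE (1u << 31)

/* Hypothetical result of decoding a 0xcf8/0xcfc access. */
struct cf8_target {
    unsigned int bus, dev, fn, reg;
};

/*
 * 'cf8' is the last value the guest wrote to port 0xcf8 and 'port' is the
 * data port being accessed (0xcfc-0xcff). Returns false if the access does
 * not target the PCI config space.
 */
static bool cf8_decode(uint32_t cf8, unsigned int port, struct cf8_target *t)
{
    assert(t);

    if ( !(cf8 & CF8_ENABLE) || port < 0xcfc || port > 0xcff )
        return false;

    t->bus = (cf8 >> 16) & 0xff;
    t->dev = (cf8 >> 11) & 0x1f;
    t->fn  = (cf8 >> 8) & 0x7;
    /* Bits 7:2 select the dword, the port offset selects the byte in it. */
    t->reg = (cf8 & 0xfc) + (port & 3);

    return true;
}
```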

A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal to Xen
whether a domain should use the newly introduced vPCI handlers. This is only
enabled for PVH Dom0 at the moment.

A very simple user-space test is also provided, so that the basic functionality
of the vPCI traps can be verified. This has proven quite helpful during
development, since the logic to handle partial accesses or accesses that span
multiple registers is not trivial.

The handlers for the registers are added to a red-black tree that indexes them
by offset. Since Xen needs to handle partial accesses to the registers and
accesses that span multiple registers, the logic in xen_vpci_{read/write} is
somewhat convoluted. I've tried to comment it properly in order to make it
easier to understand.
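
To give an idea of why partial and register-spanning accesses need care, here
is a deliberately simplified sketch in C. It is not the xen_vpci_{read/write}
logic: it uses a small array instead of a red-black tree, looks up the register
for every byte instead of calling each handler once, and assumes that bytes
without a handler read as all ones. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical emulated register: 'size' bytes starting at 'offset'. */
struct vreg {
    unsigned int offset;
    unsigned int size; /* 1, 2 or 4 */
    uint32_t value;
};

/* Find the register covering config space byte 'pos', or NULL. */
static struct vreg *vreg_find(struct vreg *regs, size_t n, unsigned int pos)
{
    for ( size_t i = 0; i < n; i++ )
        if ( pos >= regs[i].offset && pos < regs[i].offset + regs[i].size )
            return &regs[i];

    return NULL;
}

/* Assemble a little-endian read of 'size' bytes, one byte at a time. */
static uint32_t vcfg_read(struct vreg *regs, size_t n, unsigned int offset,
                          unsigned int size)
{
    uint32_t data = 0;

    assert(size == 1 || size == 2 || size == 4);
    for ( unsigned int i = 0; i < size; i++ )
    {
        const struct vreg *r = vreg_find(regs, n, offset + i);
        /* Assumption: bytes without a handler read as all ones. */
        uint32_t byte = r ? (r->value >> (8 * (offset + i - r->offset))) & 0xff
                          : 0xff;

        data |= byte << (8 * i);
    }

    return data;
}

/* Merge each written byte into whichever register covers it. */
static void vcfg_write(struct vreg *regs, size_t n, unsigned int offset,
                       unsigned int size, uint32_t data)
{
    assert(size == 1 || size == 2 || size == 4);
    for ( unsigned int i = 0; i < size; i++ )
    {
        struct vreg *r = vreg_find(regs, n, offset + i);
        unsigned int shift;

        if ( !r )
            continue; /* assumption: unhandled bytes are dropped */

        shift = 8 * (offset + i - r->offset);
        r->value = (r->value & ~((uint32_t)0xff << shift)) |
                   (((data >> (8 * i)) & 0xff) << shift);
    }
}
```

With two 2-byte registers at offsets 0x04 and 0x06, a 2-byte access at offset
0x05 touches the high byte of the first register and the low byte of the
second, which is the kind of case the user-space test is meant to cover.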

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v1:
 - Allow access to cross a word-boundary.
 - Add locking.
 - Add cleanup to xen_vpci_add_handlers in case of failure.

8 years ago x86/vioapic: bind interrupts to PVH Dom0 dom0_gsi_v2 gitlab/dom0_gsi_v2
Roger Pau Monne [Wed, 19 Apr 2017 15:03:05 +0000 (16:03 +0100)]
x86/vioapic: bind interrupts to PVH Dom0

Add the glue in order to bind the PVH Dom0 GSI from bare metal. This is done
when Dom0 unmasks the vIO APIC pins, by fetching the current pin settings and
setting up the PIRQ, which will then be bound to Dom0.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
 - Mask the pin on error (instead of panicking).
 - Factor out the Dom0 specific code into a function.
 - Use the newly introduced allocate_and_map_gsi_pirq instead of
   physdev_map_pirq.

8 years ago x86/pt: enable binding of GSIs to a PVH Dom0
Roger Pau Monne [Wed, 19 Apr 2017 15:03:05 +0000 (16:03 +0100)]
x86/pt: enable binding of GSIs to a PVH Dom0

Achieve this by expanding pt_irq_create_bind in order to support mapping
interrupts of type PT_IRQ_TYPE_PCI to a PVH Dom0. GSIs bound to Dom0 are always
identity bound, which means that all the fields inside of the u.pci sub-struct
are ignored, and only the machine_irq is actually used in order to determine
which GSI the caller wants to bind.

Also, the hvm_irq_dpci struct is not used by a PVH Dom0, since that's used to
route interrupts and allow different host to guest GSI mappings, which is not
used by a PVH Dom0.

This requires adding some specific handlers for such directly mapped GSIs,
which bypass the PCI interrupt routing done by Xen for HVM guests.

Note that currently there's no support for unbinding these interrupts.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
 - Remove the PT_IRQ_TYPE_GSI and instead just use PT_IRQ_TYPE_PCI with a
   hardware domain special casing.
 - Check the trigger mode of the Dom0 vIO APIC in order to set the shareable
   flags in pt_irq_create_bind.

8 years ago x86/physdev: factor out the code to allocate and map a pirq
Roger Pau Monne [Wed, 19 Apr 2017 15:03:05 +0000 (16:03 +0100)]
x86/physdev: factor out the code to allocate and map a pirq

Move the code to allocate and map a domain pirq (either GSI or MSI) into the
x86 irq code base, so that it can be used outside of the physdev ops.

This change shouldn't affect the functionality of the already existing physdev
ops.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
 - New in this version.

8 years ago x86emul: force CLZERO feature flag in test harness
Jan Beulich [Wed, 19 Apr 2017 11:30:27 +0000 (13:30 +0200)]
x86emul: force CLZERO feature flag in test harness

Commit b988e88cc0 ("x86/emul: Add feature check for clzero") added a
feature check to the emulator, which breaks the harness without this
flag being forced to true.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86/vioapic: allow holes in the GSI range for PVH Dom0
Roger Pau Monné [Wed, 19 Apr 2017 11:29:51 +0000 (13:29 +0200)]
x86/vioapic: allow holes in the GSI range for PVH Dom0

The current vIO APIC for PVH Dom0 doesn't allow non-contiguous GSIs, which
means that all GSIs must belong to an IO APIC. This doesn't match reality,
where there are systems with non-contiguous GSIs.

In order to fix this add a base_gsi field to each hvm_vioapic struct, in order
to store the base GSI for each emulated IO APIC. For PVH Dom0 those values are
populated based on the hardware ones.
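
A minimal sketch in C of how a per-IO-APIC base GSI allows holes in the GSI
range. The struct below is a hypothetical stand-in and not the real
hvm_vioapic; only the base_gsi idea is taken from this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an emulated IO APIC. */
struct vioapic_sketch {
    unsigned int base_gsi; /* first GSI handled by this IO APIC */
    unsigned int nr_pins;  /* number of pins (GSIs) it provides */
};

/*
 * Map a GSI to the IO APIC that owns it and to the pin on that IO APIC.
 * Returns NULL when the GSI falls into a hole between IO APICs.
 */
static const struct vioapic_sketch *
gsi_to_vioapic(const struct vioapic_sketch *v, size_t nr, unsigned int gsi,
               unsigned int *pin)
{
    assert(pin);

    for ( size_t i = 0; i < nr; i++ )
    {
        if ( gsi >= v[i].base_gsi && gsi < v[i].base_gsi + v[i].nr_pins )
        {
            *pin = gsi - v[i].base_gsi;
            return &v[i];
        }
    }

    return NULL;
}
```

With one IO APIC covering GSIs 0-23 and another starting at GSI 32, GSIs 24-31
simply do not resolve to any IO APIC, instead of being assumed contiguous.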

Reported-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Chao Gao <chao.gao@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86/HVM: don't uniformly report "MMIO" for various forms of failed emulation
Jan Beulich [Wed, 19 Apr 2017 11:29:14 +0000 (13:29 +0200)]
x86/HVM: don't uniformly report "MMIO" for various forms of failed emulation

This helps distinguish the call paths leading there.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago VMX: don't blindly enable descriptor table exiting control
Jan Beulich [Wed, 19 Apr 2017 11:26:55 +0000 (13:26 +0200)]
VMX: don't blindly enable descriptor table exiting control

This is an optional feature and hence we should check for it before
use.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86/HVM: restrict emulation in hvm_descriptor_access_intercept()
Jan Beulich [Wed, 19 Apr 2017 11:26:18 +0000 (13:26 +0200)]
x86/HVM: restrict emulation in hvm_descriptor_access_intercept()

While I did review d0a699a389 ("x86/monitor: add support for descriptor
access events") it didn't really occur to me that someone could be this
blunt and add unguarded emulation again just a few weeks after we
guarded all special purpose emulator invocations. Fix this.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86emul: always fill x86_insn_modrm()'s outputs
Jan Beulich [Wed, 19 Apr 2017 11:25:44 +0000 (13:25 +0200)]
x86emul: always fill x86_insn_modrm()'s outputs

The function is rather unlikely to be called for insns which don't have
ModRM bytes, and hence Coverity's recurring complaint of callers
potentially consuming uninitialized data when they know that certain
opcodes have ModRM bytes can be addressed this way without unduly
adding overhead to fast paths.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86emul: add "unblock NMI" retire flag
Jan Beulich [Wed, 19 Apr 2017 11:24:18 +0000 (13:24 +0200)]
x86emul: add "unblock NMI" retire flag

No matter that we emulate IRET for (guest) real mode only right now, we
should get its effect on (virtual) NMI delivery right. Note that we can
simply check the other retire flags also in the !OKAY case, as the
insn emulator now guarantees them to only be set on OKAY.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago docs: fix configuration syntax in xl.cfg manpage
Wei Liu [Tue, 11 Apr 2017 11:03:00 +0000 (12:03 +0100)]
docs: fix configuration syntax in xl.cfg manpage

No quote is required when a string is provided as part of a spec string.

Reported-by: Doug Freed <dwfreed@mtu.edu>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
8 years ago tools: Use POSIX signal.h instead of sys/signal.h
Alistair Francis [Mon, 17 Apr 2017 21:33:11 +0000 (14:33 -0700)]
tools: Use POSIX signal.h instead of sys/signal.h

The POSIX spec specifies to use:
    #include <signal.h>
instead of:
    #include <sys/signal.h>
as seen here:
   http://pubs.opengroup.org/onlinepubs/009695399/functions/signal.html

This removes the warning:
    #warning redirecting incorrect #include <sys/signal.h> to <signal.h>
when building with the musl C-library.

Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago tools: Use POSIX poll.h instead of sys/poll.h
Alistair Francis [Mon, 17 Apr 2017 21:33:10 +0000 (14:33 -0700)]
tools: Use POSIX poll.h instead of sys/poll.h

The POSIX spec specifies to use:
    #include <poll.h>
instead of:
    #include <sys/poll.h>
as seen here:
    http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html

This removes the warning:
    #warning redirecting incorrect #include <sys/poll.h> to <poll.h>
when building with the musl C-library.

Signed-off-by: Alistair Francis <alistair.francis@xilinx.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86/emul: Drop more redundant ctxt.event_pending checks
Andrew Cooper [Mon, 10 Apr 2017 12:11:06 +0000 (13:11 +0100)]
x86/emul: Drop more redundant ctxt.event_pending checks

Since c/s 92cf67888a, x86_emulate_wrapper() asserts stricter behaviour about
the relationship between X86EMUL_EXCEPTION and ctxt.event_pending.

These removals should have been included in the aforementioned changeset, and
were only omitted due to an oversight.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Tim Deegan <tim@xen.org>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago x86/vIO-APIC: fix uninitialized variable warning
Jan Beulich [Thu, 13 Apr 2017 15:35:02 +0000 (17:35 +0200)]
x86/vIO-APIC: fix uninitialized variable warning

In a release build modern gcc validly complains about "pin" possibly
being uninitialized in vioapic_irq_positive_edge().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago VT-d: correct a comment and remove an useless if() statement
Chao Gao [Thu, 13 Apr 2017 15:34:29 +0000 (17:34 +0200)]
VT-d: correct a comment and remove an useless if() statement

Fix two flaws in the patch (93358e8e VT-d: introduce update_irte to update
irte safely):
1. Expand a comment in update_irte() to make it clear that VT-d hardware
doesn't update the IRTE and software can't update the IRTE behind us since we
hold iremap_lock.
2. Remove a useless if() statement.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago clang: disable the gcc-compat warnings for read_atomic
Roger Pau Monné [Thu, 13 Apr 2017 15:33:21 +0000 (17:33 +0200)]
clang: disable the gcc-compat warnings for read_atomic

clang gcc-compat warnings can wrongly fire when certain constructions are used,
at least the following flow:

switch ( ... )
{
case ...:
    while ( ({ int x; switch ( foo ) { case 1: x = 1; break; } x; }) )
    {
        ...

This causes clang to emit the warning "'break' is bound to loop, GCC binds it
to switch", which is a false positive: both gcc and clang bind the break to
the inner switch. In order to work around this issue, disable the gcc-compat
checks for the usage of the read_atomic macro.
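
A complete version of the construct above, together with one possible way of
silencing the warning, might look like the sketch below. It is only an
illustration: the function drain() is hypothetical, the pragma is placed around
a whole function rather than inside the read_atomic macro, and statement
expressions are a GNU C extension.

```c
#include <assert.h>

#ifdef __clang__
#pragma clang diagnostic push
#pragma clang diagnostic ignored "-Wgcc-compat"
#endif

/*
 * In mode 1, decrement *counter until it reaches zero and return the number
 * of loop iterations. The 'break' statements inside the statement expression
 * bind to the inner switch, not to the while loop or the outer switch.
 */
static int drain(int mode, int *counter)
{
    int loops = 0;

    assert(counter && *counter >= 0);
    switch ( mode )
    {
    case 1:
        while ( ({ int x = 0;
                   switch ( *counter ) { case 0: break; default: x = 1; break; }
                   x; }) )
        {
            (*counter)--;
            loops++;
        }
        break;
    }

    return loops;
}

#ifdef __clang__
#pragma clang diagnostic pop
#endif
```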

This has been reported upstream as http://bugs.llvm.org/show_bug.cgi?id=32595.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago tools:misc:xenpm: set max freq to all cpu with default cpuid
Luwei Kang [Thu, 13 Apr 2017 10:44:28 +0000 (18:44 +0800)]
tools:misc:xenpm: set max freq to all cpu with default cpuid

Users can set the max freq of a specific cpu with
"xenpm set-scaling-maxfreq [cpuid] <HZ>"
or set the max freq of all cpus, using the default cpuid, with
"xenpm set-scaling-maxfreq <HZ>".

Setting the max freq with the default cpuid causes a
segmentation fault after commit id d4906b5d05.
This patch fixes that issue and adds the ability
to set the max freq with the default cpuid.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Compile-tested-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago xen: credit: change an ASSERT on nr_runnable so that it makes sense.
Dario Faggioli [Thu, 13 Apr 2017 07:49:54 +0000 (09:49 +0200)]
xen: credit: change an ASSERT on nr_runnable so that it makes sense.

Since the counter is unsigned, it's pointless/bogus to check
for it to be above zero.

Check that it is at least one before it's decremented, instead.
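
The difference between the two checks can be sketched as follows. This is an
illustration in C with hypothetical names (struct runq_sketch, runq_dec), using
the standard assert() rather than Xen's ASSERT().

```c
#include <assert.h>

/* Hypothetical stand-in for the per-runqueue counter. */
struct runq_sketch {
    unsigned int nr_runnable;
};

static void runq_inc(struct runq_sketch *q)
{
    q->nr_runnable++;
}

static void runq_dec(struct runq_sketch *q)
{
    /*
     * An unsigned value can never be below zero, so a check done after the
     * decrement would always pass, even after wrapping around. Checking
     * that the counter is at least one before decrementing catches the
     * underflow.
     */
    assert(q->nr_runnable >= 1);
    q->nr_runnable--;
}
```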

Spotted by Coverity.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago xen: credit2: cleanup patch for type betterness
Praveen Kumar [Tue, 11 Apr 2017 18:08:42 +0000 (23:38 +0530)]
xen: credit2: cleanup patch for type betterness

The patch actually doesn't impact the functionality as such. This only replaces
bool_t with bool in credit2.

Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago misc/release-checklist.txt: Try to avoid wrong-tag mistakes
Ian Jackson [Wed, 12 Apr 2017 15:36:30 +0000 (16:36 +0100)]
misc/release-checklist.txt: Try to avoid wrong-tag mistakes

Add some better checking and make the runes a bit more robust.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years ago misc/release-checklist.txt: Preemptive updates
Ian Jackson [Wed, 12 Apr 2017 15:31:40 +0000 (16:31 +0100)]
misc/release-checklist.txt: Preemptive updates

These are things I noticed should be fixed, while trying to make
4.9.0-rc1.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years ago Config.mk: Update for 4.9.0-rc1.2
Ian Jackson [Wed, 12 Apr 2017 15:18:57 +0000 (16:18 +0100)]
Config.mk: Update for 4.9.0-rc1.2

Contrary to what I wrote in d0db50ced1f7 "Config.mk: Update for
4.9.0-rc1.1", the build failure with 4.9.0-rc1 was not due to a wrong
qemu tag, but a wrong mini-os tag.  So burn 4.9.0-rc1.1 too :-(.  (We
can rewind the qemu-trad tag to 4.9.0-rc1; the -rc1 and -rc1.1 tags
are identical.)

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years ago Config.mk: Update for 4.9.0-rc1.1
Ian Jackson [Wed, 12 Apr 2017 15:03:35 +0000 (16:03 +0100)]
Config.mk: Update for 4.9.0-rc1.1

In qemu-trad, I made xen-4.9.0-rc1 refer erroneously to the 4.8
branch.  That doesn't build.  So we are burning the version number
4.9.0-rc1 in xen.git and qemu-trad.  (The other trees can remain as
they are.)

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years ago Config.mk, etc.: Prepare 4.9.0-rc1
Ian Jackson [Wed, 12 Apr 2017 14:45:39 +0000 (15:45 +0100)]
Config.mk, etc.: Prepare 4.9.0-rc1

* Update Config.mk REVISION values to refer to relevant tags.
* Update README version number
* Update xen/Makefile version number

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years ago x86/atomic: fix cmpxchg16b inline assembly to work with clang
Roger Pau Monné [Mon, 10 Apr 2017 15:32:01 +0000 (17:32 +0200)]
x86/atomic: fix cmpxchg16b inline assembly to work with clang

clang doesn't understand the "=A" register constraint when used with 64-bit
assembly and spits out an internal error:

fatal error: error in backend: Cannot select: 0x7f9fb89c9390: i64 = build_pair 0x7f9fb89c92b0,
      0x7f9fb89c9320
  0x7f9fb89c92b0: i32,ch,glue = CopyFromReg 0x7f9fb89c9240, Register:i32 %EAX, 0x7f9fb89c9240:1
    0x7f9fb89c8c20: i32 = Register %EAX
    0x7f9fb89c9240: ch,glue = inlineasm 0x7f9fb89c90f0,
TargetExternalSymbol:i64'lock; cmpxchg16b $1', MDNode:ch<0x7f9fb8476c38>,
TargetConstant:i64<25>, TargetConstant:i32<18>, Register:i32 %EAX, Register:i32
%EDX, TargetConstant:i32<196622>, 0x7f9fb89c87c0, TargetConstant:i32<9>,
Register:i64 %RCX, TargetConstant:i32<9>, Register:i64 %RBX,
TargetConstant:i32<9>, Register:i64 %RDX, TargetConstant:i32<9>, Register:i64
%RAX, TargetConstant:i32<196622>, 0x7f9fb89c87c0, TargetConstant:i32<12>,
Register:i32 %EFLAGS, 0x7f9fb89c90f0:1
      0x7f9fb89c8a60: i64 = TargetExternalSymbol'lock; cmpxchg16b $1'
      0x7f9fb89c8b40: i64 = TargetConstant<25>
      0x7f9fb89c8bb0: i32 = TargetConstant<18>
      0x7f9fb89c8c20: i32 = Register %EAX
      0x7f9fb89c8c90: i32 = Register %EDX
      0x7f9fb89c8d00: i32 = TargetConstant<196622>
      0x7f9fb89c87c0: i64,ch = load<LD8[%4]> 0x7f9fb9053da0, FrameIndex:i64<1>, undef:i64
        0x7f9fb9053a90: i64 = FrameIndex<1>
        0x7f9fb9053e80: i64 = undef
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8d70: i64 = Register %RCX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8ec0: i64 = Register %RBX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c8fa0: i64 = Register %RDX
      0x7f9fb89c8e50: i32 = TargetConstant<9>
      0x7f9fb89c9080: i64 = Register %RAX
[...]

Fix this by specifying "rdx:rax" manually using the "d" and "a" constraints.
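
The following is a minimal sketch in C of a cmpxchg16b wrapper that names rax
and rdx explicitly through the "a" and "d" constraints. It is not the code from
this patch: the struct, the function name and the non-atomic fallback for other
architectures are assumptions made so the example is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* cmpxchg16b requires a 16-byte aligned memory operand. */
struct u128_sketch {
    uint64_t lo, hi;
} __attribute__((aligned(16)));

/*
 * Compare *ptr with *old and, if equal, store new_val. On failure *old is
 * updated with the current contents of *ptr. Returns true on success.
 */
static bool cmpxchg16b_sketch(volatile struct u128_sketch *ptr,
                              struct u128_sketch *old,
                              struct u128_sketch new_val)
{
#if defined(__x86_64__)
    bool ok;

    /* rdx:rax hold the expected value, rcx:rbx the replacement. */
    __asm__ __volatile__ ( "lock; cmpxchg16b %1\n\tsete %0"
                           : "=q" (ok), "+m" (*ptr),
                             "+a" (old->lo), "+d" (old->hi)
                           : "b" (new_val.lo), "c" (new_val.hi)
                           : "memory", "cc" );

    return ok;
#else
    /* Non-atomic fallback, only so the sketch builds on other targets. */
    if ( ptr->lo == old->lo && ptr->hi == old->hi )
    {
        ptr->lo = new_val.lo;
        ptr->hi = new_val.hi;
        return true;
    }
    old->lo = ptr->lo;
    old->hi = ptr->hi;
    return false;
#endif
}
```

Splitting the 128-bit value into two 64-bit operands avoids relying on how a
given compiler interprets "A" for operands wider than a register.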

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years ago xsm: fix clang 3.5 build after c47d1d
Roger Pau Monné [Mon, 10 Apr 2017 15:31:42 +0000 (17:31 +0200)]
xsm: fix clang 3.5 build after c47d1d

The changes introduced in c47d1d broke the clang build due to undefined
references to __xsm_action_mismatch_detected, because clang hasn't optimized
the code properly. The following patch allows the clang build to work again,
while keeping the same functionality.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
8 years ago oxenstored: make --restart option best-effort
Jonathan Davies [Fri, 7 Apr 2017 13:27:22 +0000 (14:27 +0100)]
oxenstored: make --restart option best-effort

Only attempt to restore from saved state if it exists.

Without this, oxenstored immediately exits with an exception if the
--restart option is provided but the state file is not present.

(The time-of-check to time-of-use race isn't a concern as oxenstored is
the only thing that should write the state file.)

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago oxenstored: improve event-channel binding logging
Jonathan Davies [Fri, 7 Apr 2017 13:27:21 +0000 (14:27 +0100)]
oxenstored: improve event-channel binding logging

It's useful to see a bit more detail when an inter-domain event-channel
is bound, especially over an oxenstored restart.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago oxenstored: save remote evtchn port, not local port
Jonathan Davies [Fri, 7 Apr 2017 13:27:20 +0000 (14:27 +0100)]
oxenstored: save remote evtchn port, not local port

Previously, Domain.dump output the number of the local port
corresponding to each domain's event-channel. However, when oxenstored
exits, it closes /dev/xen/evtchn which causes the kernel to close the
local port (evtchn_release), so this port is no longer useful.

Instead, store the remote port. This can be used to reconnect the
event-channel by binding the original remote port to a fresh local port.

Indeed, the logic for parsing the stored state already expects a remote
port as it passes the parsed port number to Domain.make (via
Domains.create), which takes a remote port.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago oxenstored: avoid leading slash in paths in saved store state
Jonathan Davies [Fri, 7 Apr 2017 13:27:19 +0000 (14:27 +0100)]
oxenstored: avoid leading slash in paths in saved store state

Internally, paths are represented as lists of strings, where
  * path "/" is represented by []
  * path "/local/domain/0" is represented by ["local"; "domain"; "0"]
(see comment for Store.Path.t).

However, the traversal function generated paths like
    [""; "local"; "domain"; "0"]
because the name of the root node is "". Change it to generate paths
correctly.

Furthermore, the function passed to Store.dump_fct would render the node
"foo" under the path [] as "//foo". Change this to return "/foo".

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago oxenstored: initialise logging earlier
Jonathan Davies [Fri, 7 Apr 2017 13:27:18 +0000 (14:27 +0100)]
oxenstored: initialise logging earlier

Otherwise we miss out on messages from things that try to log earlier in
the start-up procedure.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
8 years ago Revert "setup vwfi correctly on cpu0"
Stefano Stabellini [Fri, 7 Apr 2017 22:38:58 +0000 (15:38 -0700)]
Revert "setup vwfi correctly on cpu0"

This reverts commit b32d442abd92cdd4d8f2a2e7794cfee9dba7fe22. There is
no need for this patch after "xen/arm: Set and restore HCR_EL2 register
for each vCPU separately".

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years ago ARM: GICv3 ITS: introduce device mapping
Andre Przywara [Fri, 7 Apr 2017 22:08:01 +0000 (23:08 +0100)]
ARM: GICv3 ITS: introduce device mapping

The ITS uses device IDs to map LPIs to a device. Dom0 will later use
those IDs, which we directly pass on to the host.
For this we have to map each device that Dom0 may request to a host
ITS device with the same identifier.
Allocate the respective memory and enter each device into an rbtree to
later be able to iterate over it or to easily teardown guests.
Because device IDs are per ITS, we need to identify a virtual ITS. We
use the doorbell address for that purpose, as it is a nice architectural
MSI property and spares us from handling opaque pointers or breaking
the VGIC abstraction.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years ago ARM: vGICv3: introduce ITS emulation stub
Andre Przywara [Fri, 7 Apr 2017 22:08:00 +0000 (23:08 +0100)]
ARM: vGICv3: introduce ITS emulation stub

Create a new file to hold the emulation code for the ITS widget.
This just holds the data structure and an init and free function for now.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 years ago ARM: GICv3 ITS: introduce host LPI array
Andre Przywara [Fri, 7 Apr 2017 22:07:59 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce host LPI array

The number of LPIs on a host can be potentially huge (millions),
although in practice it will mostly be reasonable. So prematurely allocating
an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 Bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the VCPU into
which the LPI needs to be injected.
Also to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, lookup and destroy entries for a given LPI.
By using the naturally atomic access guarantee the native uint64_t data
type gives us, we allocate and access LPI information in a way that does
not require a lock.
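
The idea of an 8-byte per-LPI entry in a 2-level table can be sketched in C as
below. This is an illustration under stated assumptions and not the data
structure from this patch: the field layout inside the 64-bit word, the table
dimensions and all function names are hypothetical, virt_lpi 0 is used to mark
an unused entry, and allocating a second-level page is not itself made
lock-free here.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

#define LPI_BASE         8192u /* LPI numbers start at 8192 */
#define ENTRIES_PER_PAGE 512u  /* 4096-byte page / 8-byte entry */
#define MAX_PAGES        64u   /* hypothetical first-level size */

/* First level: pointers to lazily allocated pages of 64-bit entries. */
static _Atomic uint64_t *lpi_pages[MAX_PAGES];

/* Hypothetical layout: virtual LPI, domain ID and vCPU ID in one word. */
static uint64_t lpi_pack(uint32_t virt_lpi, uint16_t dom_id, uint16_t vcpu_id)
{
    return (uint64_t)virt_lpi | ((uint64_t)dom_id << 32) |
           ((uint64_t)vcpu_id << 48);
}

static _Atomic uint64_t *lpi_slot(uint32_t host_lpi, int alloc)
{
    uint32_t idx, page;

    if ( host_lpi < LPI_BASE )
        return NULL;
    idx = host_lpi - LPI_BASE;
    page = idx / ENTRIES_PER_PAGE;
    if ( page >= MAX_PAGES )
        return NULL;

    if ( !lpi_pages[page] )
    {
        if ( !alloc )
            return NULL;
        lpi_pages[page] = calloc(ENTRIES_PER_PAGE, sizeof(*lpi_pages[page]));
        if ( !lpi_pages[page] )
            return NULL;
    }

    return &lpi_pages[page][idx % ENTRIES_PER_PAGE];
}

/* Publish the whole entry with a single 64-bit atomic store. */
static int lpi_set(uint32_t host_lpi, uint32_t virt_lpi, uint16_t dom_id,
                   uint16_t vcpu_id)
{
    _Atomic uint64_t *slot = lpi_slot(host_lpi, 1);

    if ( !slot )
        return -1;
    atomic_store(slot, lpi_pack(virt_lpi, dom_id, vcpu_id));

    return 0;
}

/* Read the whole entry with a single 64-bit atomic load. */
static int lpi_lookup(uint32_t host_lpi, uint32_t *virt_lpi,
                      uint16_t *vcpu_id)
{
    _Atomic uint64_t *slot = lpi_slot(host_lpi, 0);
    uint64_t entry;

    if ( !slot )
        return -1;
    entry = atomic_load(slot);
    if ( !(uint32_t)entry )
        return -1; /* unused entry in this sketch */
    *virt_lpi = (uint32_t)entry;
    *vcpu_id = (uint16_t)(entry >> 48);

    return 0;
}
```

Because a reader always sees either the old or the new 64-bit word, the fields
of one entry are never observed half-updated, which is what allows lookups
without a lock.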

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years ago ARM: GICv3 ITS: introduce ITS command handling
Andre Przywara [Fri, 7 Apr 2017 22:07:58 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce ITS command handling

To be able to easily send commands to the ITS, create the respective
wrapper functions, which take care of the ring buffer.
The first two commands we implement provide methods to map a collection
to a redistributor (aka host core) and to flush the command queue (SYNC).
Start using these commands for mapping one collection to each host CPU.
As an ITS might choose between *two* ways of addressing a redistributor,
we store both the MMIO base address as well as the processor number in
a per-CPU variable to give each ITS what it wants.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoARM: GICv3 ITS: map ITS command buffer
Andre Przywara [Fri, 7 Apr 2017 22:07:57 +0000 (23:07 +0100)]
ARM: GICv3 ITS: map ITS command buffer

Instead of directly manipulating the tables in memory, an ITS driver
sends commands via a ring buffer in normal system memory to the ITS h/w
to create or alter the LPI mappings.
Allocate memory for that buffer and tell the ITS about it to be able
to send ITS commands.
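
The ring buffer mechanism can be modelled roughly as follows. This is a software-only sketch under stated assumptions: the struct and function names are invented, the queue is one 4 KiB page, and the "reader" field stands in for the pointer the ITS hardware advances. The 32-byte command size is taken from the GICv3 architecture, not from this commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CMD_SIZE    32u                     /* one ITS command: four 64-bit words */
#define QUEUE_SIZE  4096u                   /* illustrative: a single 4 KiB page */
#define NR_SLOTS    (QUEUE_SIZE / CMD_SIZE)
static_assert(NR_SLOTS == 128, "4 KiB holds 128 commands");

struct cmd_queue {
    uint64_t slots[NR_SLOTS][4];
    uint32_t writer;    /* next slot software fills (models the write pointer) */
    uint32_t reader;    /* next slot the consumer handles (models the read pointer) */
};

/* Queue one command; fails if that would make writer catch up with reader. */
static bool queue_send(struct cmd_queue *q, const uint64_t cmd[4])
{
    uint32_t next = (q->writer + 1) % NR_SLOTS;

    if (next == q->reader)
        return false;                       /* ring is full */
    memcpy(q->slots[q->writer], cmd, CMD_SIZE);
    q->writer = next;                       /* "tell the ITS" about the new command */
    return true;
}

/* Stand-in for the hardware side: take the oldest pending command. */
static bool queue_consume(struct cmd_queue *q, uint64_t cmd[4])
{
    if (q->reader == q->writer)
        return false;                       /* nothing pending */
    memcpy(cmd, q->slots[q->reader], CMD_SIZE);
    q->reader = (q->reader + 1) % NR_SLOTS;
    return true;
}
```

One slot is deliberately left unused so that "full" and "empty" can be told apart by comparing the two pointers.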

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3 ITS: allocate device and collection table
Andre Przywara [Fri, 7 Apr 2017 22:07:56 +0000 (23:07 +0100)]
ARM: GICv3 ITS: allocate device and collection table

Each ITS maps a pair of a DeviceID (for instance derived from a PCI
b/d/f triplet) and an EventID (the MSI payload or interrupt ID) to a
pair of LPI number and collection ID, which points to the target CPU.
This mapping is stored in the device and collection tables, which software
has to provide for the ITS to use.
Allocate the required memory and hand it over to the ITS.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3: allocate LPI pending and property table
Andre Przywara [Fri, 7 Apr 2017 22:07:55 +0000 (23:07 +0100)]
ARM: GICv3: allocate LPI pending and property table

The ARM GICv3 provides a new kind of interrupt called LPIs.
The pending bits and the configuration data (priority, enable bits) for
those LPIs are stored in tables in normal memory, which software has to
provide to the hardware.
Allocate the required memory, initialize it and hand it over to each
redistributor. The maximum number of LPIs to be used can be adjusted with
the command line option "max_lpi_bits", which defaults to 20 bits,
covering about one million LPIs.
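
The arithmetic behind "20 bits, about one million LPIs" can be sketched as below. The sizing rules used here (LPI numbers start at 8192, one property byte per LPI, one pending bit per interrupt ID) are my reading of the GICv3 architecture rather than something stated in this commit message, and the helper names are made up.

```c
#include <assert.h>
#include <stdint.h>

#define LPI_BASE 8192u      /* assumption from the architecture: first LPI number */

/* Number of usable LPIs for a given ID width. */
static uint64_t nr_lpis(unsigned int max_lpi_bits)
{
    assert(max_lpi_bits >= 14 && max_lpi_bits <= 32);
    return (UINT64_C(1) << max_lpi_bits) - LPI_BASE;
}

/* Property table: one configuration byte per LPI. */
static uint64_t prop_table_bytes(unsigned int max_lpi_bits)
{
    return nr_lpis(max_lpi_bits);
}

/* Pending table: one bit per interrupt ID, including the first 8192. */
static uint64_t pending_table_bytes(unsigned int max_lpi_bits)
{
    return (UINT64_C(1) << max_lpi_bits) / 8;
}
```

With the default of 20 bits this gives 1040384 LPIs, a property table of just under 1 MiB, and a pending table of 128 KiB per redistributor.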

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoARM: GICv3 ITS: initialize host ITS
Andre Przywara [Fri, 7 Apr 2017 22:07:54 +0000 (23:07 +0100)]
ARM: GICv3 ITS: initialize host ITS

Map the registers frame for each host ITS and populate the host ITS
structure with some parameters describing the size of certain properties
like the number of bits for device IDs.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoARM: GICv3 ITS: parse and store ITS subnodes from hardware DT
Andre Przywara [Fri, 7 Apr 2017 22:07:53 +0000 (23:07 +0100)]
ARM: GICv3 ITS: parse and store ITS subnodes from hardware DT

Parse the GIC subnodes in the device tree to find every ITS MSI controller
the hardware offers. Store that information in a list, both to propagate
all of them later to Dom0 and to be able to iterate over all ITSes.
This introduces an ITS Kconfig option (as an EXPERT option); use
XEN_CONFIG_EXPERT=y on the make command line to see and use the option.

Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen: credit1: treat pCPUs more evenly during balancing.
Dario Faggioli [Fri, 7 Apr 2017 16:57:14 +0000 (18:57 +0200)]
xen: credit1: treat pCPUs more evenly during balancing.

Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower ID pCPUs.

To avoid all this, let's record and remember, for
each NUMA node, which pCPU we stole from last,
and start from there the following time.
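
The difference between "always start at the first busy pCPU" and "start after the last victim" can be sketched like this. The 64-bit mask, NR_NODES, mask_cycle() and next_victim() are stand-ins invented for the example; they are not the Xen cpumask API.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS  64
#define NR_NODES 2

/* First CPU set in mask strictly after 'last', wrapping around; -1 if none. */
static int mask_cycle(int last, uint64_t mask)
{
    assert(last >= -1 && last < NR_CPUS);
    for (int i = 1; i <= NR_CPUS; i++) {
        int cpu = (last + i) % NR_CPUS;

        if (mask & (UINT64_C(1) << cpu))
            return cpu;
    }
    return -1;
}

/* Per-node memory of the last pCPU we tried to steal from. */
static int last_stolen[NR_NODES];   /* zero-initialised: first scan starts after CPU 0 */

/* Pick the next busy pCPU to steal from on this node, rotating fairly. */
static int next_victim(int node, uint64_t busy)
{
    int cpu = mask_cycle(last_stolen[node], busy);

    if (cpu >= 0)
        last_stolen[node] = cpu;
    return cpu;
}
```

With busy pCPUs 1, 3 and 6, repeated calls visit 1, 3, 6, 1, ... whereas a "first set bit" search would hit pCPU 1 every time.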

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit1: increase efficiency and scalability of load balancing.
Dario Faggioli [Fri, 7 Apr 2017 16:57:07 +0000 (18:57 +0200)]
xen: credit1: increase efficiency and scalability of load balancing.

During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.

If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).

On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
 - we go through all the pCPUs,
 - for each one, we (try to) take its runqueue locks,
 - we figure out there's actually nothing to be stolen!

To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
are at least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.
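
A minimal sketch of the filtering step, assuming a hypothetical per-pCPU nr_runnable counter and a lock_attempts counter that merely stands in for "we tried to take the runqueue lock":

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-pCPU state: how many vCPUs are runnable there. */
struct pcpu {
    unsigned int nr_runnable;
};

/*
 * Return the index of the first pCPU worth stealing from, or -1.
 * pCPUs with fewer than 2 runnable vCPUs are skipped without
 * touching their (imaginary) runqueue lock.
 */
static int pick_victim(const struct pcpu *cpus, size_t n,
                       unsigned int *lock_attempts)
{
    assert(cpus && lock_attempts);
    for (size_t i = 0; i < n; i++) {
        if (cpus[i].nr_runnable < 2)
            continue;           /* only the running vCPU (or nothing): skip */
        (*lock_attempts)++;     /* here the real code would try the lock */
        return (int)i;
    }
    return -1;
}
```

On a system where almost every pCPU has exactly one runnable vCPU, the loop costs a counter read per pCPU instead of a lock attempt per pCPU.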

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit2: avoid cpumask_any() in pick_cpu().
Dario Faggioli [Fri, 7 Apr 2017 16:57:00 +0000 (18:57 +0200)]
xen: credit2: avoid cpumask_any() in pick_cpu().

cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.

As the picked CPU, however, would be used as a hint,
within runq_tickle(), don't give up on it entirely,
and let's make sure we don't always return the same
CPU, or favour lower or higher ID CPUs.

To achieve that, let's record and remember, for each
runqueue, which CPU we picked last, and start from
there the following time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen/tools: tracing: add record for credit1 runqueue stealing.
Dario Faggioli [Fri, 7 Apr 2017 16:56:52 +0000 (18:56 +0200)]
xen/tools: tracing: add record for credit1 runqueue stealing.

Including whether we actually tried stealing a vCPU from
a given pCPU, or we skipped that one, because of lock
contention.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit: consider tickled pCPUs as busy.
Dario Faggioli [Fri, 7 Apr 2017 16:56:45 +0000 (18:56 +0200)]
xen: credit: consider tickled pCPUs as busy.

Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickle() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.

When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.

Alleviate this by marking a pCPU as busy (i.e., no
longer idle) right away when tickling it (like, e.g.,
it happens in Credit2).

Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.

Introducing proper locking for the cpumasks can, in
theory, be done, and may be investigated in future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, and so it may be a path it is just
not worth following.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit: (micro) optimize csched_runq_steal().
Dario Faggioli [Fri, 7 Apr 2017 16:56:38 +0000 (18:56 +0200)]
xen: credit: (micro) optimize csched_runq_steal().

Checking whether or not a vCPU can be 'stolen'
from a peer pCPU's runqueue is relatively cheap.

Therefore, let's do that as early as possible,
avoiding potentially useless complex checks, and
cpumask manipulations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: credit1: simplify csched_runq_steal() a little bit.
Dario Faggioli [Fri, 7 Apr 2017 16:56:31 +0000 (18:56 +0200)]
xen: credit1: simplify csched_runq_steal() a little bit.

Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).

And since we try to steal only from busy pCPUs, it's
unlikely for such pCPU to be idle, so we can:
 - tell the compiler this is actually unlikely,
 - bail early if the pCPU, unfortunately, turns out
   to really be idle.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
8 years agox86/svm: Correct event injection check in svm_vmcb_restore()
Andrew Cooper [Fri, 7 Apr 2017 15:38:53 +0000 (16:38 +0100)]
x86/svm: Correct event injection check in svm_vmcb_restore()

SVM's maximum valid event type is 4.  This appears to be a straight copy and
paste error in c/s e94e3f210a62, as VT-x's maximum is 6.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
8 years agox86/svm: Fix indentation in svm_vmcb_restore()
Andrew Cooper [Fri, 7 Apr 2017 15:38:12 +0000 (16:38 +0100)]
x86/svm: Fix indentation in svm_vmcb_restore()

Introduced by c/s b706e1c6af274, spotted by Coverity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 years agox86/emul: Poison the stubs with debug traps
Andrew Cooper [Wed, 8 Mar 2017 15:38:55 +0000 (15:38 +0000)]
x86/emul: Poison the stubs with debug traps

...rather than leaving fragments of old instructions in place.  This reduces
the chances of something going further-wrong (as the debug trap will be caught
and terminate the guest) in a cascade-failure where we end up executing the
instruction fragments.

Before:
    (XEN) d2v0 exception 6 (ec=0000) in emulation stub (line 6239)
    (XEN) d2v0 stub: c4 e1 44 77 c3 80 d0 82 ff ff ff d1 90 ec 90

After:
    (XEN) d3v0 exception 6 (ec=0000) in emulation stub (line 6239)
    (XEN) d3v0 stub: c4 e1 44 77 c3 cc cc cc cc cc cc cc cc cc cc
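
The effect visible in the two dumps above can be reproduced with a sketch like the following, which fills the stub with int3 (0xcc) before copying the new instruction in. The 15-byte size simply matches the dumps, and stub_prepare() is a hypothetical helper, not the function the patch touches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STUB_SIZE 15u       /* matches the 15 bytes shown in the dumps above */
#define INT3      0xcc      /* x86 one-byte breakpoint opcode */

/* Poison the whole stub, then place the instruction bytes at its start. */
static void stub_prepare(uint8_t stub[STUB_SIZE], const uint8_t *insn, size_t len)
{
    assert(len <= STUB_SIZE);
    memset(stub, INT3, STUB_SIZE);  /* no fragment of an older instruction survives */
    memcpy(stub, insn, len);
}
```

If execution ever runs past the intended instruction, it lands on a breakpoint rather than on leftover bytes from a previous emulation.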

To make this work, the int3 handler needs to be extended to attempt recovery
rather than simply returning back to Xen context.  While altering do_int3(),
leave an obvious sign if an embedded breakpoint has been hit and not dealt
with by debugging facilities.

    (XEN) Hit embedded breakpoint at ffff82d0803d01f6 [extable.c#stub_selftest+0xda/0xee]

Extend the selftests to include int3, and add an extra printk indicating the
start of the recovery selftests, to avoid leaving otherwise-spurious faults
visible in the log.

    (XEN) build-id: 55d7e6f420b4f0ce277f776be620f43d7cb8646c
    (XEN) Running stub recovery selftests...
    (XEN) traps.c:3466: GPF (0000): ffff82d0bffff041 [ffff82d0bffff041] -> ffff82d08035937a
    (XEN) traps.c:813: Trap 12: ffff82d0bffff040 [ffff82d0bffff040] -> ffff82d08035937a
    (XEN) traps.c:1215: Trap 3: ffff82d0bffff041 [ffff82d0bffff041] -> ffff82d08035937a
    (XEN) ACPI sleep modes: S3

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an...
Yu Zhang [Fri, 7 Apr 2017 15:40:04 +0000 (17:40 +0200)]
x86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an ioreq server unmaps

After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
synchronously by iterating the p2m table.

The synchronous resetting is necessary because we need to guarantee
the p2m table is clean before another ioreq server is mapped. And
since the sweeping of p2m table could be time consuming, it is done
with hypercall continuation.
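
The "sweep in bounded chunks, then resume" pattern can be sketched as below. The enum, the budget parameter and sweep_reset() are assumptions for illustration; the real code walks the p2m and uses Xen's hypercall continuation machinery rather than an explicit cursor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum p2m_type { RAM_RW, IOREQ_SERVER };     /* hypothetical stand-ins */

/*
 * Reset IOREQ_SERVER entries to RAM_RW, starting at *next and handling
 * at most 'budget' entries per call. Returns true once the whole table
 * has been swept; false means "call again", mimicking a continuation.
 */
static bool sweep_reset(enum p2m_type *table, size_t n, size_t *next,
                        size_t budget)
{
    assert(*next <= n);
    while (*next < n) {
        if (budget-- == 0)
            return false;       /* out of budget: resume from *next later */
        if (table[*next] == IOREQ_SERVER)
            table[*next] = RAM_RW;
        (*next)++;
    }
    return true;
}
```

The caller keeps the cursor between calls, so a long sweep never monopolises the CPU, yet it is complete before another ioreq server may be mapped.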

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agox86/ioreq server: asynchronously reset outstanding p2m_ioreq_server entries
Yu Zhang [Fri, 7 Apr 2017 15:39:16 +0000 (17:39 +0200)]
x86/ioreq server: asynchronously reset outstanding p2m_ioreq_server entries

After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously with the current p2m_change_entry_type_global()
interface.

New field entry_count is introduced in struct p2m_domain, to record
the number of p2m_ioreq_server p2m page table entries. One property of
these entries is that they only point to 4K sized page frames, because
all p2m_ioreq_server entries originate from p2m_ram_rw ones in
p2m_change_type_one(). We do not need to worry about the counting for
2M/1G sized pages.

This patch disallows mapping of an ioreq server while there are still
p2m_ioreq_server entries left, in case another mapping occurs right after
the current one is unmapped and releases its lock, with the p2m table not
synced yet.

This patch also disallows live migration when there are remaining
p2m_ioreq_server entries in the p2m table. The core reason is that our
current implementation of p2m_change_entry_type_global() lacks information
to resync p2m_ioreq_server entries correctly if global_logdirty is
on.

We still need to handle other recalculations, however, which means
that when doing a recalculation, if the current type is
p2m_ioreq_server, we check to see if p2m->ioreq.server is valid or
not.  If it is, we leave it as type p2m_ioreq_server; if not, we reset
it to p2m_ram as appropriate.

To avoid code duplication, lift recalc_type() out of p2m-pt.c and use
it for all type recalculations (both in p2m-pt.c and p2m-ept.c).

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages
Paul Durrant [Fri, 7 Apr 2017 15:38:48 +0000 (17:38 +0200)]
x86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages

In ept_handle_violation(), write violations are also treated as
read violations. And when a VM is accessing a write-protected
address with read-modify-write instructions, the read emulation
process is triggered first.

For p2m_ioreq_server pages, current ioreq server only forwards
the write operations to the device model. Therefore when such page
is being accessed by a read-modify-write instruction, the read
operations should be emulated here in hypervisor. This patch provides
such a handler to copy the data to the buffer.

Note: MMIOs with p2m_mmio_dm type do not need such special treatment
because both reads and writes will go to the device model.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/ioreq server: add device model wrappers for new DMOP
Yu Zhang [Fri, 7 Apr 2017 15:38:40 +0000 (17:38 +0200)]
x86/ioreq server: add device model wrappers for new DMOP

A new device model wrapper is added for the newly introduced
DMOP - XEN_DMOP_map_mem_type_to_ioreq_server.

Since currently this DMOP only supports the emulation of write
operations, attempts to trigger the DMOP with values other than
XEN_DMOP_IOREQ_MEM_ACCESS_WRITE or 0 (to unmap the ioreq server)
shall fail. The wrapper shall be updated once read operations
are also to be emulated in the future.

Also note currently this DMOP only supports one memory type,
and can be extended in the future to map multiple memory types
to multiple ioreq servers, e.g. mapping HVMMEM_ioreq_serverX to
ioreq server X. This wrapper shall be updated when such a change
is made.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
8 years agox86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server
Paul Durrant [Fri, 7 Apr 2017 15:38:11 +0000 (17:38 +0200)]
x86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server

Previously, p2m_ioreq_server was used to write-protect guest ram
pages, which are tracked with the ioreq server's rangeset. However,
the number of ram pages to be tracked may exceed the upper limit of
a rangeset.

Now, a new DMOP - XEN_DMOP_map_mem_type_to_ioreq_server, is added
to let one ioreq server claim/disclaim its responsibility for the
handling of guest pages with p2m type p2m_ioreq_server. Users of
this DMOP can specify which kind of operation is supposed to be
emulated in a parameter named flags. Currently, this DMOP only
supports the emulation of write operations. It can be further
extended to support the emulation of read ones if an ioreq server
has such a requirement in the future.

For now, we only support one ioreq server for this p2m type, so
once an ioreq server has claimed its ownership, subsequent calls
of the XEN_DMOP_map_mem_type_to_ioreq_server will fail. Users can
also disclaim the ownership of guest ram pages with p2m_ioreq_server,
by triggering this new DMOP, with ioreq server id set to the current
owner's and flags parameter set to 0.

Note:
a> both XEN_DMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
are only supported for HVMs with HAP enabled.

b> only after one ioreq server claims its ownership of p2m_ioreq_server,
will the p2m type change to p2m_ioreq_server be allowed.
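
The claim/disclaim rules described above can be condensed into a small state machine. Everything named here (struct ioreq_map, map_mem_type(), the MEM_ACCESS_* values and the specific error codes) is hypothetical; the commit message only says that invalid requests fail, not how.

```c
#include <errno.h>
#include <stdint.h>

#define MEM_ACCESS_READ   (1u << 0)     /* hypothetical flag values */
#define MEM_ACCESS_WRITE  (1u << 1)
#define NO_OWNER          (-1)

/* Which ioreq server, if any, currently owns the p2m_ioreq_server type. */
struct ioreq_map {
    int owner;
    uint32_t flags;
};

static int map_mem_type(struct ioreq_map *m, int server_id, uint32_t flags)
{
    /* Only write emulation, or 0 to unmap, is accepted for now. */
    if (flags != 0 && flags != MEM_ACCESS_WRITE)
        return -EINVAL;

    if (flags == 0) {                   /* disclaim: must come from the owner */
        if (m->owner != server_id)
            return -EINVAL;
        m->owner = NO_OWNER;
        m->flags = 0;
        return 0;
    }

    if (m->owner != NO_OWNER)           /* claim: only one owner at a time */
        return -EBUSY;
    m->owner = server_id;
    m->flags = flags;
    return 0;
}
```

A second server's claim fails until the first one disclaims with flags set to 0, which mirrors the single-owner restriction in the text.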

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agox86/ioreq server: release the p2m lock after mmio is handled
Yu Zhang [Fri, 7 Apr 2017 15:35:44 +0000 (17:35 +0200)]
x86/ioreq server: release the p2m lock after mmio is handled

Routine hvmemul_do_io() may need to peek the p2m type of a gfn to
select the ioreq server. For example, operations on gfns with
p2m_ioreq_server type will be delivered to a corresponding ioreq
server, and this requires that the p2m type not be switched back
to p2m_ram_rw during the emulation process. To avoid this race
condition, we delay the release of p2m lock in hvm_hap_nested_page_fault()
until mmio is handled.

Note: previously in hvm_hap_nested_page_fault(), put_gfn() was moved
before the handling of mmio, due to a deadlock risk between the p2m
lock and the event lock (in commit 77b8dfe). Later, a per-event channel
lock was introduced in commit de6acb7, to send events. So we do not
need to worry about the deadlock issue.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agotools: sched: add support for 'null' scheduler
Dario Faggioli [Fri, 7 Apr 2017 12:28:31 +0000 (14:28 +0200)]
tools: sched: add support for 'null' scheduler

Being very basic, this scheduler does not need
much support at the tools level (for now).

Basically, just the definition of the symbol of the
scheduler itself and a couple of stubs.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched_null: support for hard affinity
Dario Faggioli [Fri, 7 Apr 2017 12:28:23 +0000 (14:28 +0200)]
xen: sched_null: support for hard affinity

As a (rudimentary) way of directing and affecting the
placement logic implemented by the scheduler, support
vCPU hard affinity.

Basically, a vCPU will now be assigned only to a pCPU
that is part of its own hard affinity. If such pCPU(s)
is (are) busy, the vCPU will wait, like it happens
when there are no free pCPUs.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched: introduce the 'null' semi-static scheduler
Dario Faggioli [Fri, 7 Apr 2017 12:28:15 +0000 (14:28 +0200)]
xen: sched: introduce the 'null' semi-static scheduler

In cases where one is absolutely sure that there will be
fewer vCPUs than pCPUs, having to pay the cost, mostly in
terms of overhead, of an advanced scheduler may not be
desirable.

The simple scheduler implemented here could be a solution.
Here is how it works:
 - each vCPU is statically assigned to a pCPU;
 - if there are pCPUs without any vCPU assigned, they
   stay idle (as in, they run their idle vCPU);
 - if there are vCPUs which are not assigned to any
   pCPU (e.g., because there are more vCPUs than pCPUs)
   they *don't* run, until they get assigned;
 - if a vCPU assigned to a pCPU goes away, one of the
   vCPUs waiting to be assigned, if any, gets assigned
   to the pCPU and can run there.
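
The rules in the list above can be sketched as a tiny model. The fixed array sizes, the FIFO order of the wait queue and all names (struct null_sched, vcpu_insert(), vcpu_remove()) are assumptions for the example and say nothing about how sched_null.c is actually organised.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_PCPUS 2
#define MAX_WAIT 8
#define NONE     (-1)

struct null_sched {
    int assigned[NR_PCPUS];     /* vCPU id per pCPU; NONE means it runs idle */
    int waitq[MAX_WAIT];        /* vCPUs that have no pCPU yet, oldest first */
    size_t nr_waiting;
};

static void sched_init(struct null_sched *s)
{
    for (int c = 0; c < NR_PCPUS; c++)
        s->assigned[c] = NONE;
    s->nr_waiting = 0;
}

/* Assign the vCPU to a free pCPU, or queue it. Returns the pCPU or NONE. */
static int vcpu_insert(struct null_sched *s, int vcpu)
{
    for (int c = 0; c < NR_PCPUS; c++) {
        if (s->assigned[c] == NONE) {
            s->assigned[c] = vcpu;
            return c;
        }
    }
    assert(s->nr_waiting < MAX_WAIT);
    s->waitq[s->nr_waiting++] = vcpu;   /* does not run until assigned */
    return NONE;
}

/* The vCPU on 'pcpu' goes away; hand the pCPU to the oldest waiter, if any. */
static int vcpu_remove(struct null_sched *s, int pcpu)
{
    int next;

    assert(pcpu >= 0 && pcpu < NR_PCPUS);
    s->assigned[pcpu] = NONE;
    if (s->nr_waiting == 0)
        return NONE;                    /* pCPU goes idle */
    next = s->waitq[0];
    s->nr_waiting--;
    memmove(s->waitq, s->waitq + 1, s->nr_waiting * sizeof(s->waitq[0]));
    s->assigned[pcpu] = next;
    return next;
}
```

There is no runqueue, no credit accounting and no load balancing in this model, which is where the low overhead comes from.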

This scheduler, therefore, if used in configurations
where every vCPU can be assigned to a pCPU, guarantees
low overhead, low latency, and consistent performance.

If used as default scheduler, at Xen boot, it is
recommended to limit the number of Dom0 vCPUs (e.g., with
'dom0_max_vcpus=x'). Otherwise, all the pCPUs will have
one Dom0's vCPU assigned, and there won't be room for
running efficiently (if at all) any guest.

Target use cases are embedded and HPC, but it may well
be interesting also in other circumstances.

Kconfig and documentation are updated accordingly.

While there, also document the availability of sched=rtds
as boot parameter, which apparently had been forgotten.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched: make sure a pCPU added to a pool runs the scheduler ASAP
Dario Faggioli [Fri, 7 Apr 2017 12:28:08 +0000 (14:28 +0200)]
xen: sched: make sure a pCPU added to a pool runs the scheduler ASAP

When a pCPU is added to a cpupool, the pool's scheduler
should immediately run on it so, for instance, any runnable
but not running vCPU can start executing there.

This currently does not happen. Make it happen by raising
the scheduler softirq directly from the function that
sets up the new scheduler for the pCPU.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
8 years agoxen: sched: improve robustness (and rename) DOM2OP()
Dario Faggioli [Fri, 7 Apr 2017 12:28:01 +0000 (14:28 +0200)]
xen: sched: improve robustness (and rename) DOM2OP()

Clarify and enforce (with ASSERTs) when the function
is called on the idle domain, and explain in comments
what it means and when it is ok to do so.

While there, change the name of the function to a more
self-explanatory one, and do the same to VCPU2OP.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
8 years agox86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()
Haozhong Zhang [Fri, 7 Apr 2017 13:56:09 +0000 (15:56 +0200)]
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()

mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere with the current round of MC handling. Always re-initialize it to
clear the historical value.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/mce: make 'severity_cpu' private to its users
Haozhong Zhang [Fri, 7 Apr 2017 13:55:34 +0000 (15:55 +0200)]
x86/mce: make 'severity_cpu' private to its users

The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use private 'severity_cpu' for each function to fix this issue.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/monitor: add support for descriptor access events
Adrian Pop [Fri, 7 Apr 2017 13:39:32 +0000 (15:39 +0200)]
x86/monitor: add support for descriptor access events

Adds monitor support for descriptor access events (reads & writes of
IDTR/GDTR/LDTR/TR) for the x86 architecture (VMX and SVM).

Signed-off-by: Adrian Pop <apop@bitdefender.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: minor cosmetic (hopefully!) cleanup]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agopassthrough/io: fall back to remapping interrupt when we can't use VT-d PI
Chao Gao [Fri, 7 Apr 2017 13:38:40 +0000 (15:38 +0200)]
passthrough/io: fall back to remapping interrupt when we can't use VT-d PI

The current logic for using VT-d PI is: when the guest configures the pirq's
destination to be a single vCPU, the corresponding IRTE is updated to
posted format. If the destination of the pirq is then changed to multiple
vCPUs, the IRTE stays in posted format. Instead, we should fall back to
remapped format when the guest misconfigures the destination of the pirq or
gives it multiple destination vCPUs.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: guard against vcpu being NULL]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agoVT-d: introduce update_irte to update irte safely
Chao Gao [Fri, 7 Apr 2017 13:38:17 +0000 (15:38 +0200)]
VT-d: introduce update_irte to update irte safely

We used structure assignment to update the IRTE, which is non-atomic when the
whole IRTE is to be updated. This is unsafe if an interrupt happens during the
update. Furthermore, no bug or warning is reported when this happens.

This patch introduces two variants, atomic and non-atomic, to update irte.
For the initialization and release cases, the non-atomic variant will be used. For
other cases (such as reprogramming to set irq affinity), the atomic variant
will be used. If the caller requests an atomic update but we can't meet it, we
raise a bug.
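
One way to read the atomic/non-atomic split is as a decision over which halves of the 128-bit entry change. The sketch below only models that decision; the enum values, plan_update() and the have_cx16 parameter are invented for the example, and the exact order of checks in the real update_irte() may differ.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 128-bit remapping entry viewed as two 64-bit halves. */
struct irte {
    uint64_t lo, hi;
};

enum update_kind {
    UPDATE_NONE,        /* nothing changes */
    UPDATE_PLAIN,       /* plain structure assignment, not atomic */
    UPDATE_ONE_HALF,    /* a single aligned 64-bit store */
    UPDATE_CMPXCHG16B,  /* needs a 128-bit atomic operation */
    UPDATE_IMPOSSIBLE   /* atomic requested but not achievable: a bug */
};

static enum update_kind plan_update(const struct irte *cur,
                                    const struct irte *want,
                                    bool atomic, bool have_cx16)
{
    bool lo_differs = cur->lo != want->lo;
    bool hi_differs = cur->hi != want->hi;

    if (!lo_differs && !hi_differs)
        return UPDATE_NONE;
    if (!atomic)
        return UPDATE_PLAIN;            /* e.g. initialization or release */
    if (lo_differs != hi_differs)
        return UPDATE_ONE_HALF;         /* only one half changes */
    if (have_cx16)
        return UPDATE_CMPXCHG16B;
    return UPDATE_IMPOSSIBLE;           /* caller would raise a bug here */
}
```

The point of the split is that hardware may read the entry at any moment, so a live entry must never be observable with one old half and one new half.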

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Jan Beulich <jbeulich@suse.com> [x86]
8 years agoVMX: fixup PI descriptor when cpu is offline
Feng Wu [Fri, 7 Apr 2017 13:37:55 +0000 (15:37 +0200)]
VMX: fixup PI descriptor when cpu is offline

When a CPU goes offline, we need to move all the vCPUs in its blocking
list to another online CPU. This patch handles that.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
8 years agoVT-d: some cleanups
Feng Wu [Fri, 7 Apr 2017 13:37:33 +0000 (15:37 +0200)]
VT-d: some cleanups

Use type-safe structure assignment instead of memcpy().
Use sizeof(*iremap_entry).

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
8 years agoVT-d: introduce new fields in msi_desc to track binding with guest interrupt
Feng Wu [Fri, 7 Apr 2017 13:37:07 +0000 (15:37 +0200)]
VT-d: introduce new fields in msi_desc to track binding with guest interrupt

msi_msg_to_remap_entry() is buggy when the live IRTE is in posted format. It
wrongly inherits the 'im' field meaning the IRTE is in posted format but
updates all the other fields to remapping format.

Two situations lead to the above issue. One is that some callers
really want to change the IRTE to remapped format. The other is that some
callers only want to update the MSI message (e.g. set MSI affinity) because
they are not aware that this MSI is bound to a guest interrupt. We should
suppress the update in the second situation. To distinguish them, the
straightforward approach would be to let the caller specify which IRTE format
it wants to update to. That is not feasible, because making all callers aware
of the binding with the guest interrupt would cause a far more complicated
change (including the interfaces exposed to IOAPIC and MSI). Also, some calls
happen in interrupt context, where we can't acquire d->event_lock to read
struct hvm_pirq_dpci.

This patch introduces two new fields in msi_desc to track binding with a guest
interrupt such that msi_msg_to_remap_entry() can get the binding and update
IRTE accordingly. After that change, pi_update_irte() can utilize
msi_msg_to_remap_entry() to update IRTE to posted format.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agopassthrough: don't migrate pirq when it is delivered through VT-d PI
Chao Gao [Fri, 7 Apr 2017 13:36:20 +0000 (15:36 +0200)]
passthrough: don't migrate pirq when it is delivered through VT-d PI

When a vCPU is migrated to another pCPU, pt irqs bound to this vCPU might
also need migration, as an optimization to reduce IPIs between pCPUs. When VT-d
PI is enabled, the interrupt vector is recorded in a main-memory-resident
data structure and a notification whose destination is decided by NDST is
generated. NDST is properly adjusted during vCPU migration, so a pirq directly
injected into the guest needn't be migrated.

This patch adds an indicator, @posted, to show whether the pt irq is delivered
through VT-d PI. Also this patch fixes a bug that hvm_migrate_pirq() accesses
pirq_dpci->gmsi.dest_vcpu_id without checking the pirq_dpci's type.

Signed-off-by: Chao Gao <chao.gao@intel.com>
[jb: remove an extraneous check from hvm_migrate_pirq()]
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86: add multiboot2 protocol support for relocatable images
Daniel Kiper [Fri, 7 Apr 2017 11:37:24 +0000 (13:37 +0200)]
x86: add multiboot2 protocol support for relocatable images

Add multiboot2 protocol support for relocatable images. Only GRUB2 with
"multiboot2: Add support for relocatable images" patch understands
that feature. Older multiboot protocol (regardless of version)
compatible loaders ignore it and everything works as usual.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
8 years agox86/boot: rename sym_phys() to sym_offs()
Daniel Kiper [Fri, 7 Apr 2017 11:37:02 +0000 (13:37 +0200)]
x86/boot: rename sym_phys() to sym_offs()

This way macro name better describes its function.
Currently it is used to calculate symbol offset in
relation to the beginning of Xen image mapping.
However, value returned by sym_offs() for a given
symbol is not always equal to its physical address.

There is no functional change.

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
8 years agox86: make Xen early boot code relocatable
Daniel Kiper [Fri, 7 Apr 2017 11:36:32 +0000 (13:36 +0200)]
x86: make Xen early boot code relocatable

Every multiboot protocol (regardless of version) compatible image must
specify its load address (in ELF or multiboot header). A multiboot protocol
compatible loader has to load the image at the specified address. However,
there is no guarantee that the requested memory region (in the case of Xen it
starts at 2 MiB and ends at ~5 MiB) where the image should be loaded initially
is RAM and is free (legacy BIOS platforms are merciful to Xen, but I found at
least one EFI platform on which the Xen load address conflicts with EFI boot
services; it is a Dell PowerEdge R820 with the latest firmware). To cope with
that problem we must make the Xen early boot code relocatable and help the
boot loader relocate the image properly by suggesting allowed address ranges,
rather than requesting specific load addresses as is done right now. This
patch does the former. It does not add the multiboot2 protocol interface,
which is done in the "x86: add multiboot2 protocol support for relocatable
images" patch.

This patch changes the following things:
  - the %esi register is used as storage for the Xen image load base address;
    it is mostly unused in early boot code and preserved across C function
    calls in 32-bit mode,
  - %fs is used as the base for Xen data relative addressing in 32-bit code
    where possible; %esi is used for that purpose during error printing
    because it is not always possible to properly and efficiently
    initialize %fs.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/setup: use XEN_IMG_OFFSET instead of...
Daniel Kiper [Fri, 7 Apr 2017 11:36:01 +0000 (13:36 +0200)]
x86/setup: use XEN_IMG_OFFSET instead of...

..calculating its value during runtime.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
8 years agox86: change default load address from 1 MiB to 2 MiB
Daniel Kiper [Fri, 7 Apr 2017 11:35:32 +0000 (13:35 +0200)]
x86: change default load address from 1 MiB to 2 MiB

Subsequent patches introducing relocatable early boot code play with
page tables using 2 MiB huge pages. If the load address is not aligned to
2 MiB then code touching such page tables must have special cases for
the start and end of the Xen image memory region. So, let's make life easier
and move the default load address from 1 MiB to 2 MiB. This way the page table
code will be nice and easy. Hence, there is a chance that it will be
less error prone too... :-)))

Additionally, drop the first 2 MiB mapping from the Xen image mapping.
It is no longer needed.
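
A small C sketch of why alignment matters, assuming 2 MiB superpages. The
helper names are made up for illustration and are not taken from the Xen
sources.

```c
#include <stdbool.h>
#include <stdint.h>

#define SUPERPAGE_SIZE_SKETCH (UINT64_C(2) << 20)   /* 2 MiB */

/* True if addr sits exactly on a 2 MiB boundary. */
static inline bool is_superpage_aligned_sketch(uint64_t addr)
{
    return (addr & (SUPERPAGE_SIZE_SKETCH - 1)) == 0;
}

/*
 * Bytes between the start of the superpage containing addr and addr itself.
 * A non-zero value means the first page-table entry also covers memory
 * that is not part of the image, which is the special case to avoid.
 */
static inline uint64_t head_slack_sketch(uint64_t addr)
{
    return addr & (SUPERPAGE_SIZE_SKETCH - 1);
}
```

With a 1 MiB load address the first superpage would cover 1 MiB of foreign
memory; with 2 MiB the slack is zero and no special case is needed.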

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
8 years agox86emul: correct compat mode system descriptor handling
Jan Beulich [Fri, 7 Apr 2017 09:04:02 +0000 (09:04 +0000)]
x86emul: correct compat mode system descriptor handling

There are some oddities to take care of here - see the code comment.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 years agox86/HVM: don't leak PFEC_implicit to guests
Jan Beulich [Fri, 7 Apr 2017 10:08:34 +0000 (12:08 +0200)]
x86/HVM: don't leak PFEC_implicit to guests

Doing so may not only confuse them, but will - on VMX - lead to
VMRESUME failures. Add respective ASSERT()s where the fields get set
to guard against future similar issues (or - in the restore case -
fail the operation). In that latter code, also convert the misused
gdprintk() to dprintk(), as the vCPU of interest is not "current".

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
8 years agox86/hvm: make io.h self-contained
Chao Gao [Fri, 7 Apr 2017 10:06:18 +0000 (12:06 +0200)]
x86/hvm: make io.h self-contained

io.h uses structure npfec without including the file xen/mm.h where the
structure is defined.

Signed-off-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
8 years agoboot allocator: use arch helper for virt_to_mfn on DIRECTMAP_VIRT region
Vijaya Kumar K [Fri, 7 Apr 2017 10:04:14 +0000 (12:04 +0200)]
boot allocator: use arch helper for virt_to_mfn on DIRECTMAP_VIRT region

On ARM platforms with NUMA, while initializing the second memory node,
a panic is triggered from init_node_heap() when virt_to_mfn()
is called for a DIRECTMAP_VIRT region address, because the DIRECTMAP_VIRT
region is not mapped.

The virt_to_mfn() check here is used to determine whether the max MFN is
part of the direct mapping. The max MFN is found by calling virt_to_mfn()
on the end address of the DIRECTMAP_VIRT region, which is DIRECTMAP_VIRT_END.

On ARM64, all RAM is currently direct mapped in Xen and virt_to_mfn()
uses the hardware for address translation. So if the virtual address
is not mapped, a translation fault is raised.

In this patch, instead of calling virt_to_mfn(), an arch helper,
arch_mfn_in_directmap(), is introduced.

On ARM64 this arch helper returns true, because currently all RAM
is direct mapped in Xen.
On ARM32, only a limited amount of RAM, called xenheap, is always mapped
and the DIRECTMAP_VIRT region is not mapped. Hence it returns false.
On x86 this helper uses virt_to_mfn().
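
The per-architecture behaviour described above could be sketched roughly as
follows. This is a single-function model for illustration only: the enum,
the function name and the precomputed directmap_end_mfn parameter are
assumptions, whereas the real helper is defined separately per architecture.

```c
#include <stdbool.h>

/* Illustrative architecture selector; the real code uses per-arch headers. */
enum arch_sketch { ARCH_SKETCH_X86, ARCH_SKETCH_ARM32, ARCH_SKETCH_ARM64 };

/*
 * Model of the decision "is this MFN covered by the direct map?".
 * directmap_end_mfn stands in for the result of translating the end of the
 * direct map region, which only x86 can safely compute with virt_to_mfn().
 */
static inline bool mfn_in_directmap_sketch(enum arch_sketch arch,
                                           unsigned long mfn,
                                           unsigned long directmap_end_mfn)
{
    switch ( arch )
    {
    case ARCH_SKETCH_ARM64:
        return true;                      /* all RAM is direct mapped */
    case ARCH_SKETCH_ARM32:
        return false;                     /* only the xenheap is mapped */
    case ARCH_SKETCH_X86:
        return mfn <= directmap_end_mfn;  /* compare against the map's end */
    }
    return false;
}
```

The point of the helper is that the ARM variants never touch an unmapped
virtual address, so no translation fault can be raised.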

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agox86/vpmu_intel: handle SMT consistently for programmable and fixed counters
Mohit Gambhir [Fri, 7 Apr 2017 10:03:46 +0000 (12:03 +0200)]
x86/vpmu_intel: handle SMT consistently for programmable and fixed counters

The patch introduces a macro FIXED_CTR_CTRL_ANYTHREAD_MASK and uses it
to mask the .AnyThread bit for all counters in the IA32_FIXED_CTR_CTRL MSR in all
versions of Intel Architectural Performance Monitoring.  Masking the .AnyThread
bit is necessary for two reasons:

1. We need to be consistent in the implementation. We disable the .AnyThread bit in
programmable counters (regardless of the version) by masking bit 21 in
IA32_PERFEVTSELx.  (See code snippet below from vpmu_intel.c)

 /* Masks used for testing whether and MSR is valid */
 #define ARCH_CTRL_MASK  (~((1ull << 32) - 1) | (1ull << 21))

But we leave it enabled in fixed function counters for version 3. Removing the
condition disables the bit in fixed function counters regardless of the version,
which is consistent with what is done for programmable counters.

2. We don't want to expose event counts from another guest (or hypervisor)
which can happen if .AnyThread bit is not masked and a VCPU is only scheduled
to run on one of the hardware threads in a hyper-threaded CPU.

Also, note that the Intel SDM discourages the use of the .AnyThread bit in
virtualized environments (per section 18.2.3.1 AnyThread Counting and Software
Evolution).
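
As a rough illustration of the two checks, the sketch below rejects any
write that sets .AnyThread. ARCH_CTRL_MASK is quoted from the snippet above;
the value 0x4 for FIXED_CTR_CTRL_ANYTHREAD_MASK, the 4-bit field width and
the function names are assumptions based on the SDM layout of
IA32_FIXED_CTR_CTRL, not a copy of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Quoted from the commit message: reserved upper bits plus bit 21. */
#define ARCH_CTRL_MASK  (~((1ull << 32) - 1) | (1ull << 21))

/* Assumed layout: 4 control bits per fixed counter, AnyThread is bit 2. */
#define FIXED_CTR_CTRL_BITS_SKETCH     4
#define FIXED_CTR_CTRL_ANYTHREAD_MASK  0x4

/* A programmable counter write is acceptable only if no masked bit is set. */
static inline bool evtsel_write_ok_sketch(uint64_t val)
{
    return (val & ARCH_CTRL_MASK) == 0;
}

/* Walk each fixed counter's field and refuse the write if AnyThread is set. */
static inline bool fixed_ctrl_write_ok_sketch(uint64_t val, unsigned int nr_fixed)
{
    for ( unsigned int i = 0; i < nr_fixed; i++ )
    {
        if ( val & FIXED_CTR_CTRL_ANYTHREAD_MASK )
            return false;
        val >>= FIXED_CTR_CTRL_BITS_SKETCH;
    }
    return true;
}
```

Applying the fixed counter check regardless of the architectural version is
what makes the two kinds of counters behave consistently.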

Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
8 years agox86/io: move the list of guest to machine IO ports out of domain_iommu
Roger Pau Monné [Fri, 7 Apr 2017 10:03:15 +0000 (12:03 +0200)]
x86/io: move the list of guest to machine IO ports out of domain_iommu

There's no reason to store that list inside of the domain_iommu struct: the
forwarding of guest IO ports into machine IO ports is not tied to the presence
of an IOMMU.

Move it inside of the hvm_domain struct instead.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86/io: rename misleading dpci_ prefixed functions to g2m_
Roger Pau Monné [Fri, 7 Apr 2017 10:02:22 +0000 (12:02 +0200)]
x86/io: rename misleading dpci_ prefixed functions to g2m_

The dpci_ prefix used on those IO handlers is misleading: there's nothing PCI
specific in them, they simply map a guest IO port into a machine (physical) IO
port. They don't specifically trap the PCI IO port range in any way
(0xcf8/0xcfc).

Rename them to use the g2m_ prefix in order to avoid this confusion.
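
What such a handler does can be sketched as a plain range lookup with no PCI
knowledge. The struct layout and function name below are hypothetical and
only illustrate the guest to machine translation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: np consecutive ports starting at gport. */
struct g2m_ioport_sketch {
    uint16_t gport;   /* first guest port */
    uint16_t mport;   /* first machine (physical) port */
    uint16_t np;      /* number of ports in the range */
};

/* Translate a guest port to a machine port; false if no range covers it. */
static inline bool g2m_translate_sketch(const struct g2m_ioport_sketch *map,
                                        size_t nr, uint16_t gport,
                                        uint16_t *mport)
{
    for ( size_t i = 0; i < nr; i++ )
    {
        if ( gport >= map[i].gport &&
             (uint32_t)(gport - map[i].gport) < map[i].np )
        {
            *mport = (uint16_t)(map[i].mport + (gport - map[i].gport));
            return true;
        }
    }
    return false;
}
```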

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
8 years agoaltp2m: introduce external-only and limited use-cases
Tamas K Lengyel [Fri, 7 Apr 2017 10:01:10 +0000 (12:01 +0200)]
altp2m: introduce external-only and limited use-cases

Currently setting altp2mhvm=1 in the domain configuration allows access to the
altp2m interface for both in-guest and external privileged tools. This poses
a problem for use-cases where only external access should be allowed, requiring
the user to compile Xen with XSM enabled to be able to appropriately restrict
access.

In this patch we deprecate the altp2mhvm domain configuration option and
introduce the altp2m option, which allows specifying if by default the altp2m
interface should be external-only or limited. The information is stored in
HVM_PARAM_ALTP2M which we now define with specific XEN_ALTP2M_* modes.
If external mode is selected, the XSM check is shifted to use XSM_DM_PRIV
type check, thus restricting access to the interface by the guest itself. Note
that we keep the default XSM policy untouched. Users of XSM who wish to enforce
external mode for altp2m can do so by adjusting their XSM policy directly,
as this domain config option does not override an active XSM policy.

Also, as part of this patch we adjust the hvmop handler to require
HVM_PARAM_ALTP2M to be of a type other than disabled for all ops. This was
previously only required for get/set altp2m domain state; all other ops
were gated on altp2m_enabled. Since altp2m_enabled only gets set during set
altp2m domain state, this change introduces no new requirements to the other
ops but makes it clearer that it is required for all ops.
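
The gating described above can be modelled with a short sketch. The enum and
function are hypothetical stand-ins: the actual XEN_ALTP2M_* constants and
the XSM hooks are not reproduced, and the limited mode is left out because
its exact semantics are not spelled out in this message.

```c
#include <stdbool.h>

/* Illustrative modes standing in for the values of HVM_PARAM_ALTP2M. */
enum altp2m_mode_sketch {
    ALTP2M_SKETCH_DISABLED,
    ALTP2M_SKETCH_BOTH,       /* in-guest and external privileged tools */
    ALTP2M_SKETCH_EXTERNAL,   /* external privileged tools only */
};

/*
 * Every op requires a mode other than disabled; in external mode the guest
 * itself is additionally refused. An active XSM policy may restrict access
 * further, which this sketch does not model.
 */
static inline bool altp2m_op_allowed_sketch(enum altp2m_mode_sketch mode,
                                            bool caller_is_guest)
{
    if ( mode == ALTP2M_SKETCH_DISABLED )
        return false;
    if ( mode == ALTP2M_SKETCH_EXTERNAL && caller_is_guest )
        return false;
    return true;
}
```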

Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com>
Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: Jan Beulich <jbeulich@suse.com>
8 years agoxen: use a dummy file in C99 header check
Wei Liu [Thu, 6 Apr 2017 18:33:36 +0000 (19:33 +0100)]
xen: use a dummy file in C99 header check

The check builds each header file as if it were a C file. Clang doesn't like
the idea of having dead code in a C file. The check as-is fails on Clang
with unused function warnings.

Use a dummy file like the C++ header check to fix this.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 years agovgic: refuse irq migration when one is already in progress
Stefano Stabellini [Wed, 5 Apr 2017 20:28:43 +0000 (13:28 -0700)]
vgic: refuse irq migration when one is already in progress

When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.

This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.

It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.
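
A simplified model of the resulting flow, using C11 atomics in place of the
vgic locking and bit operations. All names are hypothetical, and the sketch
assumes the irq is always in flight when a migration is requested.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-irq state; migrating models GIC_IRQ_GUEST_MIGRATING. */
struct pending_irq_sketch {
    atomic_bool migrating;
    unsigned int target_vcpu;        /* virtual target seen by the guest */
    unsigned int physical_affinity;  /* where the physical irq is routed */
};

/* Refuse a new migration while a previous one has not completed. */
static inline bool migrate_irq_sketch(struct pending_irq_sketch *p)
{
    /* atomic_exchange returns the old value: true means already migrating. */
    return !atomic_exchange(&p->migrating, true);
}

/* Caller side: only change the target if the migration was accepted. */
static inline bool set_target_sketch(struct pending_irq_sketch *p,
                                     unsigned int new_vcpu)
{
    if ( !migrate_irq_sketch(p) )
        return false;                /* the real code prints a warning here */
    p->target_vcpu = new_vcpu;
    return true;
}

/* Completion: change physical affinity first, then clear the flag. */
static inline void complete_migration_sketch(struct pending_irq_sketch *p)
{
    p->physical_affinity = p->target_vcpu;
    atomic_store(&p->migrating, false);
}
```

Clearing the flag last means a second request cannot slip in between the
affinity change and the end of the first migration.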

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoarm: remove irq from inflight, then change physical affinity
Stefano Stabellini [Wed, 5 Apr 2017 20:28:42 +0000 (13:28 -0700)]
arm: remove irq from inflight, then change physical affinity

This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.

When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agotools/insn-fuzz: Fix assertion failures in x86_emulate_wrapper()
Andrew Cooper [Tue, 7 Mar 2017 16:20:51 +0000 (16:20 +0000)]
tools/insn-fuzz: Fix assertion failures in x86_emulate_wrapper()

c/s 92cf67888 "x86/emul: Hold x86_emulate() to strict X86EMUL_EXCEPTION
requirements" was appropriate for the hypervisor, but the fuzzer stubs didn't
conform to the stricter requirements.  AFL is very quick to discover this.

Extend the fuzzing harness exception logic to raise exceptions appropriately.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agotools/insn-fuzz: Provide IA32_DEBUGCTL consistently to the emulator
Andrew Cooper [Mon, 27 Mar 2017 09:37:35 +0000 (10:37 +0100)]
tools/insn-fuzz: Provide IA32_DEBUGCTL consistently to the emulator

x86_emulate()'s is_branch_step() performs a speculative read of
IA32_DEBUGCTL, but doesn't squash exceptions should they arise.  In reality,
this MSR is always available.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agotools/insn-fuzz: Correct hook prototypes, and assert() appropriate segments
Andrew Cooper [Tue, 21 Mar 2017 16:49:36 +0000 (16:49 +0000)]
tools/insn-fuzz: Correct hook prototypes, and assert() appropriate segments

The correct prototypes for the hooks are to use enum x86_segment rather than
unsigned int.  It is implementation specific as to whether this compiles.

assert() that the emulator never passes an inappropriate segment.  The only
hook which may legitimately be passed x86_seg_none is invlpg().
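
The shape of such hooks might look like the sketch below. The enum is a
reduced stand-in for enum x86_segment and the hook names and signatures are
hypothetical; only the assertion pattern is the point.

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced stand-in for enum x86_segment (illustrative subset only). */
enum x86_segment_sketch {
    SEG_SKETCH_ES, SEG_SKETCH_CS, SEG_SKETCH_SS,
    SEG_SKETCH_DS, SEG_SKETCH_FS, SEG_SKETCH_GS,
    SEG_SKETCH_NONE,
};

/* True for the ordinary user segments, false for the "none" marker. */
static inline bool seg_is_user_sketch(enum x86_segment_sketch seg)
{
    return seg <= SEG_SKETCH_GS;
}

/* A memory read hook must always be given a real segment. */
static inline int fuzz_read_sketch(enum x86_segment_sketch seg,
                                   unsigned long offset)
{
    assert(seg_is_user_sketch(seg));
    (void)offset;
    return 0;
}

/* invlpg() is the one hook that may also be passed the "none" segment. */
static inline int fuzz_invlpg_sketch(enum x86_segment_sketch seg,
                                     unsigned long offset)
{
    assert(seg_is_user_sketch(seg) || seg == SEG_SKETCH_NONE);
    (void)offset;
    return 0;
}
```

Taking the enum type in the prototype lets the compiler check callers, while
the assert() catches an emulator bug at fuzzing time.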

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>