Add handlers for accesses to the MSI-X message control field in the PCI
configuration space, and traps for accesses to the memory region that contains
the MSI-X table. These traps detect attempts by the guest to configure MSI-X
interrupts and set them up accordingly.
Note that accesses to the Table Offset, Table BIR, PBA Offset, PBA BIR and the
PBA memory region itself are not trapped by Xen at the moment.
Whether Xen is going to provide this functionality (MSI-X emulation) to Dom0 is
controlled by the "msix" option in the dom0 field. When this option is disabled,
Xen hides the MSI-X capability structure from Dom0.
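For reference, a minimal sketch of what a trap on the message control word has
to decode (illustrative C, not Xen's actual handler; all names are made up):

    #include <stdint.h>
    #include <stdbool.h>

    /* MSI-X message control layout (PCI spec): bits 10:0 hold the table
     * size minus one, bit 14 is Function Mask, bit 15 is MSI-X Enable. */
    #define MSIX_CTRL_TABLE_SIZE_MASK  0x07ffu
    #define MSIX_CTRL_FUNC_MASK        (1u << 14)
    #define MSIX_CTRL_ENABLE           (1u << 15)

    struct msix_state {
        unsigned int nr_entries;     /* read-only, latched at init */
        bool masked, enabled;
    };

    static void msix_init(struct msix_state *msix, uint16_t ctrl)
    {
        msix->nr_entries = (ctrl & MSIX_CTRL_TABLE_SIZE_MASK) + 1;
    }

    /* Hypothetical trap for guest writes to the message control word. */
    static void msix_control_write(struct msix_state *msix, uint16_t val)
    {
        msix->masked  = !!(val & MSIX_CTRL_FUNC_MASK);
        msix->enabled = !!(val & MSIX_CTRL_ENABLE);
        /* Once enabled and unmasked, the entries the guest wrote to the
         * (also trapped) MSI-X table would be set up at this point. */
    }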
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
This patch has been tested with devices using both a single MSI-X entry and
multiple ones.
Add handlers for the MSI control, address, data and mask fields in order to
detect accesses to them and set up the interrupts as requested by the guest.
Note that the pending register is not trapped, and the guest can freely
read and write it.
Whether Xen is going to provide this functionality (MSI emulation) to Dom0 is
controlled by the "msi" option in the dom0 field. When this option is disabled,
Xen hides the MSI capability structure from Dom0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
NB: I've only been able to test this with devices using a single MSI interrupt
and no mask register. I will try to find hardware that supports the mask
register and more than one vector, but I cannot make any promises.
If there are doubts about the untested parts we could always force Xen to
report no per-vector masking support and only 1 available vector, but I would
rather avoid doing it.
vpci: add a priority field to the vPCI register initializer
And mark the capability and header vPCI register initializers as high priority,
so that they are initialized first.
This is needed for MSI-X, since MSI-X needs to know the position of the BARs in
order to perform its initialization, and in order to mask or enable the
MSI/MSI-X functionality on demand.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
xen/vpci: trap access to the list of PCI capabilities
Add traps to each capability PCI_CAP_LIST_NEXT field in order to mask them on
request.
All capabilities from the device are fetched and stored in an internal list,
that's later used in order to return the next capability to the guest. Note
that this only removes the capability from the linked list as seen by the
guest, but the actual capability structure could still be accessed by the
guest, provided that its position can be found using another mechanism.
Finally the MSI and MSI-X capabilities are masked until Xen knows how to
properly handle accesses to them.
This should allow a PVH Dom0 to boot on some hardware, provided that the
hardware doesn't require MSI/MSI-X and that there are no SR-IOV devices in the
system, so the panic at the end of the PVH Dom0 build is replaced by a
warning.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
- Add missing newline between cmd handlers.
- Switch the handler to use list_for_each_entry_continue instead of a wrong
open-coded version of it.
Introduce a set of handlers that trap accesses to the PCI BARs and the command
register, in order to emulate BAR sizing and BAR relocation.
The command handler is used to detect changes to bit 2 (response to memory
space accesses), and maps/unmaps the BARs of the device into the guest p2m.
The BAR register handlers are used to detect attempts by the guest to size or
relocate the BARs.
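For context, the sizing protocol being emulated follows the PCI spec: the guest
writes all-ones to a BAR and reads back a mask encoding its size, while any
other write is a relocation. A rough sketch (made-up names, not the actual
vPCI code):

    #include <stdint.h>

    /* Hypothetical per-BAR emulation state. */
    struct vbar {
        uint64_t addr;   /* base address programmed by the guest */
        uint64_t size;   /* BAR size, a power of two */
        int sizing;      /* guest wrote all-ones and is sizing the BAR */
    };

    /* Read handler: while sizing, return the size mask instead of the base.
     * (Real BARs also expose type/prefetch flags in the low bits; they are
     * left as zero here for brevity.) */
    static uint32_t vbar_read(const struct vbar *bar)
    {
        uint64_t val = bar->sizing ? ~(bar->size - 1) : bar->addr;

        return (uint32_t)(val & 0xfffffff0u);
    }

    /* Write handler: all-ones starts a sizing cycle; anything else is an
     * attempt to relocate the BAR (which would trigger a p2m remap). */
    static void vbar_write(struct vbar *bar, uint32_t val)
    {
        bar->sizing = (val == 0xffffffffu);
        if ( !bar->sizing )
            bar->addr = val & ~(bar->size - 1); /* base is size-aligned */
    }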
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Tim Deegan <tim@xen.org>
Cc: Wei Liu <wei.liu2@citrix.com>
xen/pci: split code to size BARs from pci_add_device
So that it can be called from outside in order to get the size of regular PCI
BARs. This will be required in order to map the BARs from PCI devices into PVH
Dom0 p2m.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
x86/ecam: add handlers for the PVH Dom0 MMCFG areas
Introduce a set of handlers for the accesses to the ECAM areas. Those areas are
set up based on the contents of the hardware MMCFG tables, and the list of
handled ECAM areas is stored inside of the hvm_domain struct.
The read/writes are forwarded to the generic vpci handlers once the address is
decoded in order to obtain the device and register the guest is trying to
access.
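The decoding itself is mechanical; a self-contained sketch of the ECAM layout
(bus in bits 27:20 of the offset, device in 19:15, function in 14:12, register
in 11:0), with hypothetical names:

    #include <stdint.h>

    struct sbdf {
        uint16_t segment;
        uint8_t bus, dev, fn;
        uint16_t reg;
    };

    /* Decode an access into an MMCFG/ECAM window into SBDF plus the
     * config space register, given the window's base, segment and
     * starting bus (all taken from the hardware MMCFG tables). */
    static struct sbdf ecam_decode(uint64_t mmcfg_base, uint16_t segment,
                                   uint8_t start_bus, uint64_t addr)
    {
        uint64_t off = addr - mmcfg_base;
        struct sbdf s = {
            .segment = segment,
            .bus = start_bus + ((off >> 20) & 0xff),
            .dev = (off >> 15) & 0x1f,
            .fn  = (off >> 12) & 0x7,
            .reg = off & 0xfff,
        };

        return s;
    }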
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v1:
- Added locking.
xen/vpci: introduce basic handlers to trap accesses to the PCI config space
This functionality is going to reside in vpci.c (and the corresponding vpci.h
header), and should be arch-agnostic. The handlers introduced in this patch
setup the basic functionality required in order to trap accesses to the PCI
config space, and allow decoding the address and finding the corresponding
handler that should handle the access (although no handlers are implemented).
Note that the traps for the PCI IO port registers (0xcf8/0xcfc) are set up
inside an x86 HVM file, since that code is not shared with other arches.
A new XEN_X86_EMU_VPCI x86 domain flag is added in order to signal Xen whether
a domain should use the newly introduced vPCI handlers; this is only enabled
for PVH Dom0 at the moment.
A very simple user-space test is also provided, so that the basic functionality
of the vPCI traps can be asserted. This has been proven quite helpful during
development, since the logic to handle partial accesses or accesses that span
multiple registers is not trivial.
The handlers for the registers are added to a red-black tree, that indexes them
based on their offset. Since Xen needs to handle partial accesses to the
registers and accesses that span multiple registers, the logic in
xen_vpci_{read/write} is somewhat convoluted; I've tried to properly comment it
in order to make it easier to understand.
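The tricky part is exactly that splitting; a small illustrative helper (made-up
names) that computes how an access [addr, addr+size) overlaps a register
[offset, offset+len):

    /* For an access of 'size' bytes at 'addr' and a register of 'len'
     * bytes at 'offset', compute which part of the register is touched
     * and where that part sits within the access' data. */
    static void overlap(unsigned int addr, unsigned int size,
                        unsigned int offset, unsigned int len,
                        unsigned int *reg_off,  /* first reg byte touched */
                        unsigned int *data_off, /* its place in the data */
                        unsigned int *bytes)    /* bytes of overlap */
    {
        unsigned int start = addr > offset ? addr : offset;
        unsigned int end = addr + size < offset + len ? addr + size
                                                      : offset + len;

        *reg_off = start - offset;
        *data_off = start - addr;
        *bytes = end > start ? end - start : 0;
    }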
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
---
Changes since v1:
- Allow access to cross a word-boundary.
- Add locking.
- Add cleanup to xen_vpci_add_handlers in case of failure.
Add the glue in order to bind the PVH Dom0 GSI from bare metal. This is done
when Dom0 unmasks the vIO APIC pins, by fetching the current pin settings and
setting up the PIRQ, which will then be bound to Dom0.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
- Mask the pin on error (instead of panicking).
- Factor out the Dom0 specific code into a function.
- Use the newly introduced allocate_and_map_gsi_pirq instead of
physdev_map_pirq.
Achieve this by expanding pt_irq_create_bind in order to support mapping
interrupts of type PT_IRQ_TYPE_PCI to a PVH Dom0. GSIs bound to Dom0 are always
identity bound, which means that all the fields inside the u.pci sub-struct
are ignored, and only the machine_irq is actually used in order to determine
which GSI the caller wants to bind.
Also, the hvm_irq_dpci struct is not used by a PVH Dom0, since its purpose is
to route interrupts and allow different host-to-guest GSI mappings, neither of
which applies to a PVH Dom0.
This requires adding some specific handlers for such directly mapped GSIs,
which bypass the PCI interrupt routing done by Xen for HVM guests.
Note that currently there's no support for unbinding these interrupts.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
- Remove the PT_IRQ_TYPE_GSI and instead just use PT_IRQ_TYPE_PCI with a
hardware domain special casing.
- Check the trigger mode of the Dom0 vIO APIC in order to set the shareable
flags in pt_irq_create_bind.
x86/physdev: factor out the code to allocate and map a pirq
Move the code to allocate and map a domain pirq (either GSI or MSI) into the
x86 irq code base, so that it can be used outside of the physdev ops.
This change shouldn't affect the functionality of the already existing physdev
ops.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes since v1:
- New in this version.
Ross Lagerwall [Tue, 18 Apr 2017 15:47:24 +0000 (16:47 +0100)]
x86: Move microcode loading earlier
Move microcode loading earlier for the boot CPU and secondary CPUs so
that it takes place before identify_cpu() is called for each CPU.
Without this, the detected features may be wrong if the new microcode
loading adjusts the feature bits. That could mean that some fixes (e.g. d6e9f8d4f35d ("x86/vmx: fix vmentry failure with TSX bits in LBR"))
don't work as expected.
Previously during boot, the microcode loader was invoked for each
secondary CPU started and then again for each CPU as part of an
initcall. Simplify the code so that it is invoked exactly once for each
CPU during boot.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:30:27 +0000 (13:30 +0200)]
x86emul: force CLZERO feature flag in test harness
Commit b988e88cc0 ("x86/emul: Add feature check for clzero") added a
feature check to the emulator, which breaks the harness without this
flag being forced to true.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
x86/vioapic: allow holes in the GSI range for PVH Dom0
The current vIO APIC for PVH Dom0 doesn't allow non-contiguous GSIs, which
means that all GSIs must belong to an IO APIC. This doesn't match reality,
where there are systems with non-contiguous GSIs.
In order to fix this, add a base_gsi field to each hvm_vioapic struct, in order
to store the base GSI for each emulated IO APIC. For PVH Dom0 those values are
populated based on the hardware ones.
Jan Beulich [Wed, 19 Apr 2017 11:29:14 +0000 (13:29 +0200)]
x86/HVM: don't uniformly report "MMIO" for various forms of failed emulation
This helps distinguish the call paths leading there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:26:55 +0000 (13:26 +0200)]
VMX: don't blindly enable descriptor table exiting control
This is an optional feature and hence we should check for it before
use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:26:18 +0000 (13:26 +0200)]
x86/HVM: restrict emulation in hvm_descriptor_access_intercept()
While I did review d0a699a389 ("x86/monitor: add support for descriptor
access events") it didn't really occur to me that someone could be this
blunt and add unguarded emulation again just a few weeks after we
guarded all special purpose emulator invocations. Fix this.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:25:44 +0000 (13:25 +0200)]
x86emul: always fill x86_insn_modrm()'s outputs
The function is rather unlikely to be called for insns which don't have
ModRM bytes, and hence Coverity's recurring complaint (of callers
potentially consuming uninitialized data when they know that certain
opcodes have ModRM bytes) can be suppressed this way without unduly
adding overhead to fast paths.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:24:18 +0000 (13:24 +0200)]
x86emul: add "unblock NMI" retire flag
No matter that we emulate IRET for (guest) real mode only right now, we
should get its effect on (virtual) NMI delivery right. Note that we can
simply check the other retire flags also in the !OKAY case, as the
insn emulator now guarantees them to only be set on OKAY.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Alistair Francis [Mon, 17 Apr 2017 21:33:11 +0000 (14:33 -0700)]
tools: Use POSIX signal.h instead of sys/signal.h
The POSIX spec specifies to use:
#include <signal.h>
instead of:
#include <sys/signal.h>
as seen here:
http://pubs.opengroup.org/onlinepubs/009695399/functions/signal.html
This removes the warning:
#warning redirecting incorrect #include <sys/signal.h> to <signal.h>
when building with the musl C-library.
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Alistair Francis [Mon, 17 Apr 2017 21:33:10 +0000 (14:33 -0700)]
tools: Use POSIX poll.h instead of sys/poll.h
The POSIX spec specifies to use:
#include <poll.h>
instead of:
#include <sys/poll.h>
as seen here:
http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html
This removes the warning:
#warning redirecting incorrect #include <sys/poll.h> to <poll.h>
when building with the musl C-library.
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Mon, 10 Apr 2017 12:11:06 +0000 (13:11 +0100)]
x86/emul: Drop more redundant ctxt.event_pending checks
Since c/s 92cf67888a, x86_emulate_wrapper() asserts stricter behaviour about
the relationship between X86EMUL_EXCEPTION and ctxt.event_pending.
These removals should have been included in the aforementioned changeset, and
were only omitted due to an oversight.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Thu, 13 Apr 2017 15:35:02 +0000 (17:35 +0200)]
x86/vIO-APIC: fix uninitialized variable warning
In a release build modern gcc validly complains about "pin" possibly
being uninitialized in vioapic_irq_positive_edge().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
VT-d: correct a comment and remove a useless if() statement
Fix two flaws in the patch (93358e8e VT-d: introduce update_irte to update
irte safely):
1. Expand a comment in update_irte() to make it clear that VT-d hardware
doesn't update IRTE and software can't update IRTE behind us since we hold
iremap_lock.
2. Remove a useless if() statement.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
clang: disable the gcc-compat warnings for read_atomic
clang's gcc-compat warnings can wrongly fire when certain constructs are used,
at least the following flow:
switch ( ... )
{
case ...:
while ( ({ int x; switch ( foo ) { case 1: x = 1; break; } x; }) )
{
...
will cause clang to emit the following warning: "'break' is bound to loop, GCC
binds it to switch". This is a false positive; both gcc and clang bind
the break to the inner switch. In order to work around this issue, disable the
gcc-compat checks for the usage of the read_atomic macro.
This has been reported upstream as http://bugs.llvm.org/show_bug.cgi?id=32595.
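The workaround boils down to a pair of macros along these lines (a sketch with
illustrative names), bracketing the offending statement so the diagnostic is
only suppressed there, and only when building with clang:

    #if defined(__clang__)
    # define GCC_COMPAT_DISABLE                                  \
        _Pragma("clang diagnostic push")                         \
        _Pragma("clang diagnostic ignored \"-Wgcc-compat\"")
    # define GCC_COMPAT_ENABLE                                   \
        _Pragma("clang diagnostic pop")
    #else
    # define GCC_COMPAT_DISABLE
    # define GCC_COMPAT_ENABLE
    #endif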
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Luwei Kang [Thu, 13 Apr 2017 10:44:28 +0000 (18:44 +0800)]
tools:misc:xenpm: set max freq to all cpu with default cpuid
A user can set the max frequency for a specific CPU with
"xenpm set-scaling-maxfreq [cpuid] <HZ>"
or for all CPUs (the default cpuid) with
"xenpm set-scaling-maxfreq <HZ>".
Setting the max frequency with the default cpuid caused a
segmentation fault after commit d4906b5d05.
This patch fixes that issue and restores the ability
to set the max frequency with the default cpuid.
Signed-off-by: Luwei Kang <luwei.kang@intel.com> Compile-tested-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Ian Jackson [Wed, 12 Apr 2017 15:18:57 +0000 (16:18 +0100)]
Config.mk: Update for 4.9.0-rc1.2
Contrary to what I wrote in d0db50ced1f7 "Config.mk: Update for
4.9.0-rc1.1", the build failure with 4.9.0-rc1 was not due to a wrong
qemu tag, but a wrong mini-os tag. So burn 4.9.0-rc1.1 too :-(. (We
can rewind the qemu-trad tag to 4.9.0-rc1; the -rc1 and -rc1.1 tags
are identical.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Wed, 12 Apr 2017 15:03:35 +0000 (16:03 +0100)]
Config.mk: Update for 4.9.0-rc1.1
In qemu-trad, I made xen-4.9.0-rc1 refer erroneously to the 4.8
branch. That doesn't build. So we are burning the version number
4.9.0-rc1 in xen.git and qemu-trad. (The other trees can remain as
they are.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
The changes introduced in c47d1d broke the clang build due to undefined
references to __xsm_action_mismatch_detected, because clang doesn't optimize
the offending code away. The following patch allows the clang build to work again,
while keeping the same functionality.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Jonathan Davies [Fri, 7 Apr 2017 13:27:20 +0000 (14:27 +0100)]
oxenstored: save remote evtchn port, not local port
Previously, Domain.dump output the number of the local port
corresponding to each domain's event-channel. However, when oxenstored
exits, it closes /dev/xen/evtchn which causes the kernel to close the
local port (evtchn_release), so this port is no longer useful.
Instead, store the remote port. This can be used to reconnect the
event-channel by binding the original remote port to a fresh local port.
Indeed, the logic for parsing the stored state already expects a remote
port as it passes the parsed port number to Domain.make (via
Domains.create), which takes a remote port.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jonathan Davies [Fri, 7 Apr 2017 13:27:19 +0000 (14:27 +0100)]
oxenstored: avoid leading slash in paths in saved store state
Internally, paths are represented as lists of strings, where
* path "/" is represented by []
* path "/local/domain/0" is represented by ["local"; "domain"; "0"]
(see comment for Store.Path.t).
However, the traversal function generated paths like
[""; "local"; "domain"; "0"]
because the name of the root node is "". Change it to generate paths
correctly.
Furthermore, the function passed to Store.dump_fct would render the node
"foo" under the path [] as "//foo". Change this to return "/foo".
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
This reverts commit b32d442abd92cdd4d8f2a2e7794cfee9dba7fe22. There is
no need for this patch after "xen/arm: Set and restore HCR_EL2 register
for each vCPU separately".
Andre Przywara [Fri, 7 Apr 2017 22:08:01 +0000 (23:08 +0100)]
ARM: GICv3 ITS: introduce device mapping
The ITS uses device IDs to map LPIs to a device. Dom0 will later use
those IDs, which we directly pass on to the host.
For this we have to map each device that Dom0 may request to a host
ITS device with the same identifier.
Allocate the respective memory and enter each device into an rbtree to
later be able to iterate over it or to easily tear down guests.
Because device IDs are per ITS, we need to identify a virtual ITS. We
use the doorbell address for that purpose, as it is a nice architectural
MSI property and spares us from handling opaque pointers or breaking
the VGIC abstraction.
Andre Przywara [Fri, 7 Apr 2017 22:07:59 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce host LPI array
The number of LPIs on a host can be potentially huge (millions),
although in practice it will be mostly reasonable. So prematurely allocating
an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 Bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the VCPU into
which the LPI needs to be injected.
Also to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, lookup and destroy entries for a given LPI.
By using the naturally atomic access guarantee the native uint64_t data
type gives us, we allocate and access LPI information in a way that does
not require a lock.
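A sketch of that layout (illustrative field names; locking and error handling
of the allocation path are elided): the 8-byte entry can be read or written
with a single naturally atomic 64-bit access, and the first level is an array
of pointers to demand-allocated second-level pages:

    #include <stdint.h>
    #include <stdlib.h>

    union host_lpi {
        uint64_t raw;               /* accessed as one 64-bit quantity */
        struct {
            uint32_t virq;
            uint16_t dom_id;
            uint16_t vcpu_id;
        };
    };

    #define HOST_LPIS_PER_PAGE (4096 / sizeof(union host_lpi))

    static union host_lpi *lpi_pages[1024]; /* first level */

    static union host_lpi *host_lpi_entry(uint32_t lpi)
    {
        uint32_t idx = lpi / HOST_LPIS_PER_PAGE;

        if ( idx >= 1024 )
            return NULL;
        if ( !lpi_pages[idx] ) /* real code would serialize this path */
            lpi_pages[idx] = calloc(HOST_LPIS_PER_PAGE,
                                    sizeof(union host_lpi));
        return lpi_pages[idx] ? &lpi_pages[idx][lpi % HOST_LPIS_PER_PAGE]
                              : NULL;
    }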
Andre Przywara [Fri, 7 Apr 2017 22:07:58 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce ITS command handling
To be able to easily send commands to the ITS, create the respective
wrapper functions, which take care of the ring buffer.
The first two commands we implement provide methods to map a collection
to a redistributor (aka host core) and to flush the command queue (SYNC).
Start using these commands for mapping one collection to each host CPU.
As an ITS might choose between *two* ways of addressing a redistributor,
we store both the MMIO base address as well as the processor number in
a per-CPU variable to give each ITS what it wants.
Andre Przywara [Fri, 7 Apr 2017 22:07:57 +0000 (23:07 +0100)]
ARM: GICv3 ITS: map ITS command buffer
Instead of directly manipulating the tables in memory, an ITS driver
sends commands via a ring buffer in normal system memory to the ITS h/w
to create or alter the LPI mappings.
Allocate memory for that buffer and tell the ITS about it to be able
to send ITS commands.
Andre Przywara [Fri, 7 Apr 2017 22:07:56 +0000 (23:07 +0100)]
ARM: GICv3 ITS: allocate device and collection table
Each ITS maps a pair of a DeviceID (for instance derived from a PCI
b/d/f triplet) and an EventID (the MSI payload or interrupt ID) to a
pair of LPI number and collection ID, which points to the target CPU.
This mapping is stored in the device and collection tables, which software
has to provide for the ITS to use.
Allocate the required memory and hand it over to the ITS.
Andre Przywara [Fri, 7 Apr 2017 22:07:55 +0000 (23:07 +0100)]
ARM: GICv3: allocate LPI pending and property table
The ARM GICv3 provides a new kind of interrupt called LPIs.
The pending bits and the configuration data (priority, enable bits) for
those LPIs are stored in tables in normal memory, which software has to
provide to the hardware.
Allocate the required memory, initialize it and hand it over to each
redistributor. The maximum number of LPIs to be used can be adjusted with
the command line option "max_lpi_bits", which defaults to 20 bits,
covering about one million LPIs.
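For illustration, the sizing follows directly from the ID space: LPIs start at
interrupt ID 8192, and the property table holds one configuration byte
(priority in the upper bits, enable in bit 0) per LPI. A hedged sketch, not
the actual driver code:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define LPI_OFFSET 8192u /* LPI interrupt IDs start here */

    static uint8_t *alloc_lpi_property_table(unsigned int max_lpi_bits)
    {
        size_t nr_lpis = (size_t)((1ull << max_lpi_bits) - LPI_OFFSET);
        uint8_t *table = malloc(nr_lpis);

        if ( table )
            /* Default: some baseline priority, enable bit clear. */
            memset(table, 0xa0, nr_lpis);
        return table; /* handed over to each redistributor */
    }

With max_lpi_bits = 20 this is roughly one MiB, matching the "about one
million LPIs" above.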
Andre Przywara [Fri, 7 Apr 2017 22:07:54 +0000 (23:07 +0100)]
ARM: GICv3 ITS: initialize host ITS
Map the registers frame for each host ITS and populate the host ITS
structure with some parameters describing the size of certain properties
like the number of bits for device IDs.
Andre Przywara [Fri, 7 Apr 2017 22:07:53 +0000 (23:07 +0100)]
ARM: GICv3 ITS: parse and store ITS subnodes from hardware DT
Parse the GIC subnodes in the device tree to find every ITS MSI controller
the hardware offers. Store that information in a list to both propagate
all of them later to Dom0, but also to be able to iterate over all ITSes.
This introduces an ITS Kconfig option (as an EXPERT option), use
XEN_CONFIG_EXPERT=y on the make command line to see and use the option.
xen: credit1: treat pCPUs more evenly during balancing.
Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower ID pCPUs.
To avoid all this, let's record and remember, for
each NUMA node, from which pCPU we stole last,
and start from there the following time.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit1: increase efficiency and scalability of load balancing.
During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.
If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).
On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
- we go through all the pCPUs,
- for each one, we (try to) take its runqueue locks,
- we figure out there's actually nothing to be stolen!
To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
are at least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.
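In outline (illustrative C, not the scheduler's actual code), the counter
turns the expensive part into a cheap pre-filter:

    /* Hypothetical per-pCPU bookkeeping: nr_runnable is updated whenever
     * a vCPU is inserted into or removed from the pCPU's runqueue. */
    struct pcpu_load {
        unsigned int nr_runnable;
        /* ... runqueue and its lock ... */
    };

    static void balance_scan(struct pcpu_load *peers, unsigned int nr)
    {
        unsigned int cpu;

        for ( cpu = 0; cpu < nr; cpu++ )
        {
            /* With fewer than 2 runnable vCPUs there is nothing
             * stealable (the one vCPU is, or is about to be, running):
             * skip without even trying the remote runqueue lock. */
            if ( peers[cpu].nr_runnable < 2 )
                continue;
            /* ... trylock peers[cpu]'s runqueue and look for work ... */
        }
    }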
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.
As the picked CPU, however, would be used as a hint,
within runq_tickle(), don't give up on it entirely,
and let's make sure we don't always return the same
CPU, or favour lower or higher ID CPUs.
To achieve that, let's record and remember, for each
runqueue, what CPU we picked for last, and start from
that the following time.
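The idea, in sketch form (made-up names; the real code uses Xen's cpumask
helpers), is a circular scan of the idle CPUs starting just after the
previous pick:

    #include <limits.h>

    #define BITS_PER_LONG (CHAR_BIT * (unsigned int)sizeof(unsigned long))

    /* Scan 'idle_mask' starting right after '*last', wrapping around, so
     * successive picks neither always return the same CPU nor favour
     * lower or higher IDs. */
    static unsigned int pick_idle_cpu(const unsigned long *idle_mask,
                                      unsigned int nr_cpus,
                                      unsigned int *last)
    {
        unsigned int i;

        for ( i = 1; i <= nr_cpus; i++ )
        {
            unsigned int cpu = (*last + i) % nr_cpus;

            if ( idle_mask[cpu / BITS_PER_LONG] &
                 (1UL << (cpu % BITS_PER_LONG)) )
                return *last = cpu;
        }
        return *last; /* no idle CPU: the hint simply stays as it was */
    }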
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickel() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.
When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.
Alleviate this by marking a pCPU to be idle
right away when tickling it (like, e.g., it happens
in Credit2).
Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.
Introducing proper locking for the cpumasks can, in
theory, be done, and may be investigated in the future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, and so it may be a path it is just
not worth following.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
xen: credit1: simplify csched_runq_steal() a little bit.
Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it in an
ASSERT()).
And since we try to steal only from busy pCPUs, it's
unlikely for such a pCPU to be idle, so we can:
- tell the compiler this is actually unlikely,
- bail early if the pCPU, unfortunately, turns out
to really be idle.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Andrew Cooper [Wed, 8 Mar 2017 15:38:55 +0000 (15:38 +0000)]
x86/emul: Poison the stubs with debug traps
...rather than leaving fragments of old instructions in place. This reduces
the chances of something going further wrong (as the debug trap will be caught
and terminate the guest) in a cascade-failure where we end up executing the
instruction fragments.
After:
(XEN) d3v0 exception 6 (ec=0000) in emulation stub (line 6239)
(XEN) d3v0 stub: c4 e1 44 77 c3 cc cc cc cc cc cc cc cc cc cc
To make this work, the int3 handler needs to be extended to attempt recovery
rather than simply returning back to Xen context. While altering do_int3(),
leave an obvious sign if an embedded breakpoint has been hit and not dealt
with by debugging facilities.
(XEN) Hit embedded breakpoint at ffff82d0803d01f6 [extable.c#stub_selftest+0xda/0xee]
Extend the selftests to include int3, and add an extra printk indicating the
start of the recovery selftests, to avoid leaving otherwise-spurious faults
visible in the log.
x86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an ioreq server unmaps
After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
synchronously by iterating the p2m table.
The synchronous resetting is necessary because we need to guarantee
the p2m table is clean before another ioreq server is mapped. And
since the sweeping of p2m table could be time consuming, it is done
with hypercall continuation.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously with the current p2m_change_entry_type_global()
interface.
A new field, entry_count, is introduced in struct p2m_domain, to record
the number of p2m_ioreq_server p2m page table entries. One property of
these entries is that they only point to 4K sized page frames, because
all p2m_ioreq_server entries are originated from p2m_ram_rw ones in
p2m_change_type_one(). We do not need to worry about the counting for
2M/1G sized pages.
This patch disallows mapping of an ioreq server while there are still
p2m_ioreq_server entries left, in case another mapping occurs right after
the current one is unmapped and releases its lock, with the p2m table not
yet synced.
This patch also disallows live migration when there are remaining
p2m_ioreq_server entries in the p2m table. The core reason is that our current
implementation of p2m_change_entry_type_global() lacks information
to resync p2m_ioreq_server entries correctly if global_logdirty is
on.
We still need to handle other recalculations, however; which means
that when doing a recalculation, if the current type is
p2m_ioreq_server, we check to see if p2m->ioreq.server is valid or
not. If it is, we leave it as type p2m_ioreq_server; if not, we reset
it to p2m_ram as appropriate.
To avoid code duplication, lift recalc_type() out of p2m-pt.c and use
it for all type recalculations (both in p2m-pt.c and p2m-ept.c).
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Fri, 7 Apr 2017 15:38:48 +0000 (17:38 +0200)]
x86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages
In ept_handle_violation(), write violations are also treated as
read violations. And when a VM is accessing a write-protected
address with read-modify-write instructions, the read emulation
process is triggered first.
For p2m_ioreq_server pages, current ioreq server only forwards
the write operations to the device model. Therefore when such page
is being accessed by a read-modify-write instruction, the read
operations should be emulated here in hypervisor. This patch provides
such a handler to copy the data to the buffer.
Note: MMIOs with p2m_mmio_dm type do not need such special treatment
because both reads and writes will go to the device model.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/ioreq server: add device model wrappers for new DMOP
A new device model wrapper is added for the newly introduced
DMOP - XEN_DMOP_map_mem_type_to_ioreq_server.
Since currently this DMOP only supports the emulation of write
operations, attempts to trigger the DMOP with values other than
XEN_DMOP_IOREQ_MEM_ACCESS_WRITE or 0 (to unmap the ioreq server)
shall fail. The wrapper shall be updated once read operations
are also to be emulated in the future.
Also note currently this DMOP only supports one memory type,
and can be extended in the future to map multiple memory types
to multiple ioreq servers, e.g. mapping HVMMEM_ioreq_serverX to
ioreq server X, This wrapper shall be updated when such change
is made.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Paul Durrant [Fri, 7 Apr 2017 15:38:11 +0000 (17:38 +0200)]
x86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server
Previously, p2m_ioreq_server was used to write-protect guest RAM
pages, which are tracked with the ioreq server's rangeset. However,
the number of RAM pages to be tracked may exceed the upper limit of
the rangeset.
Now, a new DMOP - XEN_DMOP_map_mem_type_to_ioreq_server, is added
to let one ioreq server claim/disclaim its responsibility for the
handling of guest pages with p2m type p2m_ioreq_server. Users of
this DMOP can specify which kind of operation is supposed to be
emulated in a parameter named flags. Currently, this DMOP only
supports the emulation of write operations. And it can be further
extended to support the emulation of read ones if an ioreq server
has such requirement in the future.
For now, we only support one ioreq server for this p2m type, so
once an ioreq server has claimed its ownership, subsequent calls
of the XEN_DMOP_map_mem_type_to_ioreq_server will fail. Users can
also disclaim the ownership of guest ram pages with p2m_ioreq_server,
by triggering this new DMOP, with ioreq server id set to the current
owner's and flags parameter set to 0.
Note:
a> both XEN_DMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
are only supported for HVMs with HAP enabled.
b> only after one ioreq server claims its ownership of p2m_ioreq_server,
will the p2m type change to p2m_ioreq_server be allowed.
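The claim/disclaim semantics can be sketched as follows (illustrative names
and error codes, not the hypervisor's actual code):

    #include <errno.h>
    #include <stdint.h>

    #define MEM_ACCESS_WRITE 1u /* only write emulation is supported */

    static struct { uint16_t id; int valid; } owner; /* at most one server */

    static int map_mem_type_to_ioreq_server(uint16_t server_id,
                                            uint32_t flags)
    {
        if ( flags & ~MEM_ACCESS_WRITE )
            return -EINVAL;     /* read emulation not implemented yet */
        if ( flags == 0 )       /* disclaim */
        {
            if ( !owner.valid || owner.id != server_id )
                return -EINVAL;
            owner.valid = 0;
            return 0;
        }
        if ( owner.valid )      /* the type has already been claimed */
            return -EBUSY;
        owner.id = server_id;
        owner.valid = 1;
        return 0;
    }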
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/ioreq server: release the p2m lock after mmio is handled
Routine hvmemul_do_io() may need to peek the p2m type of a gfn to
select the ioreq server. For example, operations on gfns with
p2m_ioreq_server type will be delivered to a corresponding ioreq
server, and this requires that the p2m type not be switched back
to p2m_ram_rw during the emulation process. To avoid this race
condition, we delay the release of p2m lock in hvm_hap_nested_page_fault()
until mmio is handled.
Note: previously in hvm_hap_nested_page_fault(), put_gfn() was moved
before the handling of mmio, due to a deadlock risk between the p2m
lock and the event lock (in commit 77b8dfe). Later, a per-event channel
lock was introduced in commit de6acb7, to send events. So we do not
need to worry about the deadlock issue.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
As a (rudimentary) way of directing and affecting the
placement logic implemented by the scheduler, support
vCPU hard affinity.
Basically, a vCPU will now be assigned only to a pCPU
that is part of its own hard affinity. If such pCPU(s)
is (are) busy, the vCPU will wait, as happens
when there are no free pCPUs.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
xen: sched: introduce the 'null' semi-static scheduler
In cases where one is absolutely sure that there will be
fewer vCPUs than pCPUs, having to pay the cost, mostly in
terms of overhead, of an advanced scheduler may not be
desirable.
The simple scheduler implemented here could be a solution.
Here is how it works:
- each vCPU is statically assigned to a pCPU;
- if there are pCPUs without any vCPU assigned, they
stay idle (as in, they run their idle vCPU);
- if there are vCPUs which are not assigned to any
pCPU (e.g., because there are more vCPUs than pCPUs)
they *don't* run, until they get assigned;
- if a vCPU assigned to a pCPU goes away, one of the
vCPUs waiting to be assigned, if any, gets assigned
to the pCPU and can run there (see the sketch below).
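A toy model of those rules (purely illustrative C, not the scheduler's
actual code):

    struct vcpu_t; /* opaque here */

    struct pcpu_slot {
        struct vcpu_t *assigned; /* NULL = the pCPU stays idle */
    };

    static int assign_vcpu(struct pcpu_slot *pcpus, unsigned int nr_pcpus,
                           struct vcpu_t *v)
    {
        unsigned int i;

        for ( i = 0; i < nr_pcpus; i++ )
            if ( !pcpus[i].assigned )
            {
                pcpus[i].assigned = v; /* v now always runs on pCPU i */
                return (int)i;
            }
        return -1; /* caller puts v on the wait list */
    }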
This scheduler, therefore, if used in configurations
where every vCPUs can be assigned to a pCPU, guarantees
low overhead, low latency, and consistent performance.
If used as default scheduler, at Xen boot, it is
recommended to limit the number of Dom0 vCPUs (e.g., with
'dom0_max_vcpus=x'). Otherwise, all the pCPUs will have
one of Dom0's vCPUs assigned, and there won't be room for
running any guest efficiently (if at all).
Target use cases are embedded and HPC, but it may well
be interesting in other circumstances too.
Kconfig and documentation are updated accordingly.
While there, also document the availability of sched=rtds
as boot parameter, which apparently had been forgotten.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
xen: sched: make sure a pCPU added to a pool runs the scheduler ASAP
When a pCPU is added to a cpupool, the pool's scheduler
should immediately run on it so, for instance, any runnable
but not running vCPU can start executing there.
This currently does not happen. Make it happen by raising
the scheduler softirq directly from the function that
sets up the new scheduler for the pCPU.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()
mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere the current round of MC handling. Always re-initialize it to
clear the historical value.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use private 'severity_cpu' for each function to fix this issue.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Adrian Pop [Fri, 7 Apr 2017 13:39:32 +0000 (15:39 +0200)]
x86/monitor: add support for descriptor access events
Adds monitor support for descriptor access events (reads & writes of
IDTR/GDTR/LDTR/TR) for the x86 architecture (VMX and SVM).
Signed-off-by: Adrian Pop <apop@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: minor cosmetic (hopefully!) cleanup] Reviewed-by: Jan Beulich <jbeulich@suse.com>
passthrough/io: fall back to remapping interrupt when we can't use VT-d PI
The current logic of using VT-d PI is: when the guest configures the pirq's
destination to a single vCPU, the corresponding IRTE is updated to
posted format. If the destination of the pirq is then changed to multiple
vCPUs, we would wrongly stay in posted format. Obviously, we should fall back
to remapped interrupt when the guest wrongly configures the destination of
the pirq or makes it have multiple destination vCPUs.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: guard against vcpu being NULL] Reviewed-by: Jan Beulich <jbeulich@suse.com>
We used structure assignment to update the IRTE, which is non-atomic when the
whole IRTE is to be updated. This is unsafe when an interrupt happens during
the update. Furthermore, no bug or warning would be reported when this happened.
This patch introduces two variants, atomic and non-atomic, to update irte.
For the initialization and release cases, the non-atomic variant will be used. For
other cases (such as reprogramming to set irq affinity), the atomic variant
will be used. If the caller requests an atomic update but we can't meet it, we
raise a bug.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> [x86]
VT-d: introduce new fields in msi_desc to track binding with guest interrupt
msi_msg_to_remap_entry() is buggy when the live IRTE is in posted format. It
wrongly inherits the 'im' field meaning the IRTE is in posted format but
updates all the other fields to remapping format.
There are two situations that lead to the above issue. One is that some callers
really want to change the IRTE to remapped format. The other is that some
callers only want to update the MSI message (e.g. to set MSI affinity) and
aren't aware that this MSI is bound to a guest interrupt; we should suppress
the update in that second situation. To distinguish them, we could
straightforwardly let callers specify which IRTE format they want to update
to, but making all callers aware of the binding with the guest interrupt
isn't feasible, as it would require a far more complicated change (including
the interfaces exposed to the IOAPIC and MSI code). Also, some calls happen in
interrupt context, where we can't acquire d->event_lock to read struct
hvm_pirq_dpci.
This patch introduces two new fields in msi_desc to track binding with a guest
interrupt such that msi_msg_to_remap_entry() can get the binding and update
IRTE accordingly. After that change, pi_update_irte() can utilize
msi_msg_to_remap_entry() to update IRTE to posted format.
Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
passthrough: don't migrate pirq when it is delivered through VT-d PI
When a vCPU is migrated to another pCPU, pt irqs bound to this vCPU might
also need migrating, as an optimization to reduce IPIs between pCPUs. When VT-d
PI is enabled, interrupt vector will be recorded to a main memory resident
data-structure and a notification whose destination is decided by NDST is
generated. NDST is properly adjusted during vCPU migration, so a pirq directly
injected to the guest needn't be migrated.
This patch adds an indicator, @posted, to show whether the pt irq is delivered
through VT-d PI. Also this patch fixes a bug where hvm_migrate_pirq() accesses
pirq_dpci->gmsi.dest_vcpu_id without checking the pirq_dpci's type.
Signed-off-by: Chao Gao <chao.gao@intel.com>
[jb: remove an extraneous check from hvm_migrate_pirq()] Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:24 +0000 (13:37 +0200)]
x86: add multiboot2 protocol support for relocatable images
Add multiboot2 protocol support for relocatable images. Only GRUB2 with
"multiboot2: Add support for relocatable images" patch understands
that feature. Older multiboot protocol (regardless of version)
compatible loaders ignore it and everything works as usual.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:02 +0000 (13:37 +0200)]
x86/boot: rename sym_phys() to sym_offs()
This way the macro name better describes its function.
Currently it is used to calculate symbol offset in
relation to the beginning of Xen image mapping.
However, the value returned by sym_offs() for a given
symbol is not always equal to its physical address.
There is no functional change.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:36:32 +0000 (13:36 +0200)]
x86: make Xen early boot code relocatable
Every multiboot protocol (regardless of version) compatible image must
specify its load address (in ELF or multiboot header). Multiboot protocol
compatible loaders have to load the image at the specified address. However, there
is no guarantee that the requested memory region (in case of Xen it starts
at 2 MiB and ends at ~5 MiB) where image should be loaded initially is a RAM
and it is free (legacy BIOS platforms are merciful for Xen but I found at
least one EFI platform on which Xen load address conflicts with EFI boot
services; it is Dell PowerEdge R820 with latest firmware). To cope with that
problem we must make Xen early boot code relocatable and help the boot loader
relocate the image properly, by suggesting allowed address ranges rather than
requesting specific load addresses as is done right now. This patch does the former.
It does not add multiboot2 protocol interface which is done in "x86: add
multiboot2 protocol support for relocatable images" patch.
This patch changes following things:
- %esi register is used as a storage for Xen image load base address;
it is mostly unused in early boot code and preserved across C function
calls in 32-bit mode,
- %fs is used as base for Xen data relative addressing in 32-bit code
if possible; %esi is used for that purpose during error printing
because it is not always possible to properly and efficiently
initialize %fs.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:35:32 +0000 (13:35 +0200)]
x86: change default load address from 1 MiB to 2 MiB
Subsequent patches introducing relocatable early boot code play with
page tables using 2 MiB huge pages. If load address is not aligned at
2 MiB then code touching such page tables must have special cases for
start and end of Xen image memory region. So, let's make life easier
and move default load address from 1 MiB to 2 MiB. This way page table
code will be nice and easy. Hence, there is a chance that it will be
less error prone too... :-)))
Additionally, drop first 2 MiB mapping from Xen image mapping.
It is no longer needed.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Jan Beulich [Fri, 7 Apr 2017 09:04:02 +0000 (09:04 +0000)]
x86emul: correct compat mode system descriptor handling
There are some oddities to take care of here - see the code comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 7 Apr 2017 10:08:34 +0000 (12:08 +0200)]
x86/HVM: don't leak PFEC_implict to guests
Doing so may not only confuse them, but will - on VMX - lead to
VMRESUME failures. Add respective ASSERT()s where the fields get set
to guard against future similar issues (or - in the restore case -
fail the operation). In that latter code at once convert the mis-used
gdprintk() to dprintk(), as the vCPU of interest is not "current".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
boot allocator: use arch helper for virt_to_mfn on DIRECTMAP_VIRT region
On ARM platforms with NUMA, while initializing the second memory node,
a panic is triggered from init_node_heap() when virt_to_mfn()
is called for a DIRECTMAP_VIRT region address, because the DIRECTMAP_VIRT
region is not mapped.
The virt_to_mfn() check here is used to know whether the max MFN is
part of the direct mapping. The max MFN is found by calling virt_to_mfn
on the end address of the DIRECTMAP_VIRT region, which is DIRECTMAP_VIRT_END.
On ARM64, all RAM is currently direct mapped in Xen and virt_to_mfn
uses the hardware for address translation. So if the virtual address
is not mapped, a translation fault is raised.
In this patch, instead of calling virt_to_mfn(), an arch helper
arch_mfn_in_directmap() is introduced.
On ARM64 this arch helper will return true, because currently all RAM
is direct mapped in Xen.
On ARM32, only a limited amount of RAM, called xenheap, is always mapped
and DIRECTMAP_VIRT region is not mapped. Hence return false.
For x86 this helper keeps using virt_to_mfn.
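A sketch of the shape of such a helper (the bodies and the exact signature
are illustrative, not the actual Xen code):

    #include <stdbool.h>

    static inline bool arch_mfn_in_directmap(unsigned long mfn)
    {
    #if defined(CONFIG_ARM_64)
        (void)mfn;
        return true;   /* all RAM is currently direct mapped on arm64 */
    #elif defined(CONFIG_ARM_32)
        (void)mfn;
        return false;  /* only the xenheap is always mapped on arm32 */
    #else
        /* x86: compare against the end of the direct map, obtained e.g.
         * via virt_to_mfn(DIRECTMAP_VIRT_END - 1); the bound below is a
         * made-up placeholder. */
        extern unsigned long max_directmap_mfn;
        return mfn <= max_directmap_mfn;
    #endif
    }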
Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
x86/vpmu_intel: handle SMT consistently for programmable and fixed counters
The patch introduces a macro FIXED_CTR_CTRL_ANYTHREAD_MASK and uses it
to mask the .AnyThread bit for all counters in the IA32_FIXED_CTR_CTRL MSR in
all versions of Intel Architectural Performance Monitoring. Masking the
.AnyThread bit is necessary for two reasons:
1. We need to be consistent in the implementation. We disable the .AnyThread bit in
programmable counters (regardless of the version) by masking bit 21 in
IA32_PERFEVTSELx. (See code snippet below from vpmu_intel.c)
/* Masks used for testing whether an MSR is valid */
#define ARCH_CTRL_MASK (~((1ull << 32) - 1) | (1ull << 21))
But we leave it enabled in fixed function counters for version 3. Removing the
condition disables the bit in fixed function counters regardless of the version,
which is consistent with what is done for programmable counters.
2. We don't want to expose event counts from another guest (or hypervisor)
which can happen if the .AnyThread bit is not masked and a VCPU is only scheduled
to run on one of the hardware threads in a hyper-threaded CPU.
Also, note that the Intel SDM discourages the use of the .AnyThread bit in virtualized
environments (per section 18.2.3.1 AnyThread Counting and Software Evolution).
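In sketch form (illustrative names), the masking clears bit 2 of each fixed
counter's 4-bit control field in IA32_FIXED_CTR_CTRL:

    #include <stdint.h>

    #define FIXED_CTR_CTRL_BITS           4
    #define FIXED_CTR_CTRL_ANYTHREAD_MASK 0x4ull /* bit 2 of each field */

    static uint64_t mask_anythread(uint64_t fixed_ctr_ctrl,
                                   unsigned int nr_fixed_counters)
    {
        unsigned int i;

        for ( i = 0; i < nr_fixed_counters; i++ )
            fixed_ctr_ctrl &= ~(FIXED_CTR_CTRL_ANYTHREAD_MASK
                                << (FIXED_CTR_CTRL_BITS * i));
        return fixed_ctr_ctrl;
    }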
Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
x86/io: move the list of guest to machine IO ports out of domain_iommu
There's no reason to store that list inside of the domain_iommu struct; the
forwarding of guest IO ports into machine IO ports is not tied to the presence
of an IOMMU.
Move it inside of the hvm_domain struct instead.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/io: rename misleading dpci_ prefixed functions to g2m_
The dpci_ prefix used on those IO handlers is misleading: there's nothing PCI
specific in them; they simply map a guest IO port into a machine (physical) IO
port. They don't specifically trap the PCI IO port range in any way
(0xcf8/0xcfc).
Rename them to use the g2m_ prefix in order to avoid this confusion.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Tamas K Lengyel [Fri, 7 Apr 2017 10:01:10 +0000 (12:01 +0200)]
altp2m: introduce external-only and limited use-cases
Currently setting altp2mhvm=1 in the domain configuration allows access to the
altp2m interface for both in-guest and external privileged tools. This poses
a problem for use-cases where only external access should be allowed, requiring
the user to compile Xen with XSM enabled to be able to appropriately restrict
access.
In this patch we deprecate the altp2mhvm domain configuration option and
introduce the altp2m option, which allows specifying if by default the altp2m
interface should be external-only or limited. The information is stored in
HVM_PARAM_ALTP2M which we now define with specific XEN_ALTP2M_* modes.
If external mode is selected, the XSM check is shifted to use XSM_DM_PRIV
type check, thus restricting access to the interface by the guest itself. Note
that we keep the default XSM policy untouched. Users of XSM who wish to enforce
external mode for altp2m can do so by adjusting their XSM policy directly,
as this domain config option does not override an active XSM policy.
Also, as part of this patch we adjust the hvmop handler to require
HVM_PARAM_ALTP2M to be of a type other than disabled for all ops. This has been
previously only required for get/set altp2m domain state, all other options
were gated on altp2m_enabled. Since altp2m_enabled only gets set during set
altp2m domain state, this change introduces no new requirements to the other
ops but makes it more clear that it is required for all ops.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Thu, 6 Apr 2017 18:33:36 +0000 (19:33 +0100)]
xen: use a dummy file in C99 header check
The check builds header file as if it is a C file. Clang doesn't like
the idea of having dead code in C file. The check as-is fails on Clang
with unused function warnings.
Use a dummy file like the C++ header check to fix this.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
vgic: refuse irq migration when one is already in progress
When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.
This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.
It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.
arm: remove irq from inflight, then change physical affinity
This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.
When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.
Andrew Cooper [Tue, 7 Mar 2017 16:20:51 +0000 (16:20 +0000)]
tools/insn-fuzz: Fix assertion failures in x86_emulate_wrapper()
c/s 92cf67888 "x86/emul: Hold x86_emulate() to strict X86EMUL_EXCEPTION
requirements" was appropriate for the hypervisor, but the fuzzer stubs didn't
conform to the stricter requirements. AFL is very quick to discover this.
Extend the fuzzing harness exception logic to raise exceptions appropriately.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 27 Mar 2017 09:37:35 +0000 (10:37 +0100)]
tools/insn-fuzz: Provide IA32_DEBUGCTL consistently to the emulator
x86_emulate()'s is_branch_step() performs a speculative read of
IA32_DEBUGCTL, but doesn't squash exceptions should they arise. In reality,
this MSR is always available.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>