Jan Beulich [Wed, 26 Apr 2017 07:49:24 +0000 (09:49 +0200)]
x86emul: correct stub invocation constraints
Stub invocations need to have the space the stub occupies as an input,
to prevent the compiler from re-ordering (or omitting) writes to it.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Commit 407a3c00ff ("compat/memory: fix build with old gcc") "fixed" a
build issue by switching to the use of uninitialized data. Due to
- the bounding of the uninitialized data item
- the accessed area being outside of Xen space
- arguments being properly verified by the native hypercall function
this is not a security issue.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jennifer Herbert [Wed, 26 Apr 2017 07:47:30 +0000 (09:47 +0200)]
dmop: add xendevicemodel_modified_memory_bulk()
This new lib devicemodel call allows multiple extents of pages to be
marked as modified in a single call. This is something needed for a
usecase I'm working on.
The xen side of the modified_memory call has been modified to accept
an array of extents. The devicemodel library either provides an array
of length 1, to support the original library function, or a second
function allows an array to be provided.
Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
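The array-of-extents arrangement described above can be sketched roughly as follows. This is only an illustration: struct extent, mark_modified_bulk() and mark_modified() are hypothetical names, not the real libxendevicemodel or hypervisor interfaces, and the "hypervisor side" merely counts pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: a run of nr pages starting at first_pfn. */
struct extent {
    uint64_t first_pfn;
    uint32_t nr;
};

/*
 * Stand-in for the hypervisor side: walk the array of extents and return
 * the total number of pages that would be marked as modified.
 */
static uint64_t mark_modified_bulk(const struct extent *extents, size_t count)
{
    uint64_t pages = 0;

    assert(extents != NULL || count == 0);

    for ( size_t i = 0; i < count; i++ )
        pages += extents[i].nr;

    return pages;
}

/* The original single-extent call becomes a wrapper passing an array of length 1. */
static uint64_t mark_modified(uint64_t first_pfn, uint32_t nr)
{
    struct extent e = { .first_pfn = first_pfn, .nr = nr };

    return mark_modified_bulk(&e, 1);
}
```

Keeping a single array-based path means the old entry point and the bulk one cannot drift apart.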
copy_{to,from}_guest_buf() are now implemented using an offset of 0.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
This does only extend to the functionality here, specifically not to
the use of all-upper-case names for the macros:
Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Wed, 26 Apr 2017 07:40:22 +0000 (09:40 +0200)]
hvm/dmop: implement COPY_{TO,FROM}_GUEST_BUF() in terms of raw accessors
This also allows the usual cases to be simplified, by omitting an unnecessary
buf parameter, and because the macros can appropriately size the object.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
This does only extend to the functionality here, specifically not to
the use of all-upper-case names for the macros:
Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jennifer Herbert [Wed, 26 Apr 2017 07:40:00 +0000 (09:40 +0200)]
hvm/dmop: make copy_buf_{from, to}_guest for a buffer not big enough an error
This makes copying to or from a buf that isn't big enough an error.
If the buffer isn't big enough, trying to carry on regardless
can only cause trouble later on.
Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
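A minimal sketch of the "too small is an error" behaviour, assuming a simplified buffer descriptor. struct buf, copy_from_buf() and the choice of -EINVAL are illustrative only and are not taken from the dmop code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor of a caller-supplied buffer. */
struct buf {
    void *data;
    size_t size;
};

/*
 * Copy dst_bytes out of src. A source buffer that is smaller than the
 * object being filled is reported as an error instead of being copied
 * partially.
 */
static int copy_from_buf(void *dst, size_t dst_bytes, const struct buf *src)
{
    assert(dst != NULL && src != NULL);

    if ( src->size < dst_bytes )
        return -EINVAL; /* error code chosen for illustration */

    memcpy(dst, src->data, dst_bytes);

    return 0;
}
```

Failing early keeps a short buffer from silently leaving part of the destination uninitialised.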
Jennifer Herbert [Wed, 26 Apr 2017 07:39:14 +0000 (09:39 +0200)]
hvm/dmop: box dmop_args rather than passing multiple parameters around
No functional change.
Signed-off-by: Jennifer Herbert <Jennifer.Herbert@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Mon, 24 Apr 2017 17:07:20 +0000 (18:07 +0100)]
x86/mm: Add missing newline to a printk() in get_page_from_l1e()
This avoids the log message being followed by
<G><1>mm.c:5374:d0v0 could not get_page_from_l1e()
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Wed, 19 Apr 2017 15:56:32 +0000 (16:56 +0100)]
x86/hvm: Corrections and improvements to unhandled vmexit logging
* Use gprintk rather than gdprintk. These logging messages shouldn't
disappear in release builds, as they usually happen immediately before a
domain crash. Raise them from WARNING to ERR.
* Format the vmexit reason in the same base as is used in the vendor
manuals (decimal for Intel, hex for AMD), and consistently use 0x for hex
numbers.
* Consistently use "Unexpected vmexit" terminology.
In particular, this corrects the information printed for nested VT-x, and
actually prints information for nested SVM.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Fri, 21 Apr 2017 10:10:51 +0000 (12:10 +0200)]
correct rcu_unlock_domain()
Match rcu_lock_domain(), and remove the slightly misleading comment:
This isn't just the companion to rcu_lock_domain_by_id() (and that
latter function indeed also keeps the domain locked, not the domain
list).
No functional change, as rcu_read_{,un}lock() ignore their arguments
anyway.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
x86/vlapic: Don't reset APIC ID when handling INIT signal
According to SDM "ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (APIC)" ->
"EXTENDED XAPIC (X2APIC)" -> "x2APIC State Transitions", the APIC mode
and APIC ID are preserved when handling an INIT signal, whereas a reset places
the APIC in xAPIC mode and sets the APIC base address to 0xFEE00000 (this part
is in "Local APIC" -> "Local APIC Status and Location"). So there are
two problems in the current code:
1. Using the reset logic (aka vlapic_reset()) to handle the INIT signal.
2. Forgetting to reset the APIC mode and base address in vlapic_reset().
This patch introduces a new function vlapic_do_init() and replaces the
wrongly used vlapic_reset(). Also reset APIC mode and APIC base address
in vlapic_reset().
Note that LDR is read-only in x2APIC mode, so resetting it to zero in x2APIC
mode is unreasonable. This patch therefore also doesn't reset LDR when handling
an INIT signal in x2APIC mode.
xen/arm: Properly map the FDT in the boot page table
Currently, Xen is assuming the FDT will always fit in a 2MB section.
Recently, I noticed an early crash on Xen when using GRUB with the
following call trace:
Indeed, the booting documentation for AArch32 and AArch64 only requires
the FDT to be placed on an 8-byte boundary. This means the Device-Tree can
cross a 2MB boundary.
Given that Xen limits the size of the FDT to 2MB, it will always fit in
a 4MB slot. So extend the fixmap slot for FDT from 2MB to 4MB.
The second 2MB superpage will only be mapped if the FDT crosses the 2MB
boundary.
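As a rough illustration of when the second superpage is needed, the check boils down to the arithmetic below. The helper name is hypothetical and this is not the actual mapping code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M ((uint64_t)2 << 20)

/*
 * Return true if an FDT of the given size placed at paddr extends past the
 * end of the 2MB-aligned region it starts in, i.e. if the second 2MB
 * superpage of the 4MB slot has to be mapped as well.
 */
static bool fdt_needs_second_superpage(uint64_t paddr, uint64_t size)
{
    assert(size > 0 && size <= SZ_2M);

    uint64_t offset = paddr & (SZ_2M - 1); /* offset within the first superpage */

    return offset + size > SZ_2M;
}
```

Since the offset is below 2MB and the size is at most 2MB, the FDT always ends within the 4MB slot.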
xen/arm: Check if the FDT passed by the bootloader is valid
There is currently no sanity check on the FDT passed by the bootloader.
Whilst such checks are not strictly necessary, they will save us from spending
hours trying to find out why booting does not work.
From the booting documentation for AArch64 [1] and AArch32 [2], the FDT must:
- be placed on an 8-byte boundary
- not exceed 2MB (only on AArch64)
Even if AArch32 does not seem to limit the size, Xen is not currently
able to support an FDT larger than 2MB. It is better to crash with a nice
error message than to claim we support any size of FDT.
The checks are mostly borrowed from the Linux code (see fixmap_remap_fdt
in arch/arm64/mm/mmu.c).
[1] Section 2 in linux/Documentation/arm64/booting.txt
[2] Section 4b in linux/Documentation/arm/Booting
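The two checks listed above amount to something like the following sketch. check_fdt() and its messages are hypothetical, and the real code may perform further validation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FDT_SIZE ((uint64_t)2 << 20) /* 2MB */

/*
 * Validate the placement and size of the FDT. Returns 0 if it is usable,
 * or -1 with *reason pointing at a message suitable for an early panic.
 */
static int check_fdt(uint64_t paddr, uint64_t size, const char **reason)
{
    assert(reason != NULL);

    *reason = NULL;

    if ( paddr & 7 )
    {
        *reason = "FDT is not placed on an 8-byte boundary";
        return -1;
    }

    if ( size > MAX_FDT_SIZE )
    {
        *reason = "FDT is larger than 2MB";
        return -1;
    }

    return 0;
}
```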
xen/arm: Move the code to map FDT in the boot tables from assembly to C
The FDT will not be accessed before start_xen (beginning of C code) is
called and it will be easier to maintain as the code could be common
between AArch32 and AArch64.
A new function early_fdt_map is introduced to map the FDT in the boot
page table.
Eric DeVolder [Wed, 19 Apr 2017 21:01:49 +0000 (16:01 -0500)]
xen/kexec: remove spinlock now that all KEXEC hypercall ops are protected at the top-level
The spinlock in kexec_swap_images() was removed as
this function is only reachable on the kexec hypercall, which is
now protected at the top-level in do_kexec_op_internal(),
thus the local spinlock is no longer necessary.
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
The interesting thing is that the page bits (063) look legit.
The operation on which we blow up is us trying to write
in the L1 and finding that the L2 entry points to some
bizarre MFN. It stinks of a race, and it looks like
the issue is due to no concurrency locks when dealing
with the crash kernel space.
Specifically we concurrently call kimage_alloc_crash_control_page
which iterates over the kexec_crash_area.start -> kexec_crash_area.size
and once found:
if ( page )
{
image->next_crash_page = hole_end;
clear_domain_page(_mfn(page_to_mfn(page)));
}
clears. Since the parameters of what MFN to use are provided
by the callers (and the area to search is bounded) the 'page'
is probably the same. So #1 we concurrently clear the
'control_code_page'.
The next step is us passing this 'control_code_page' to
machine_kexec_add_page. This function requires the MFNs:
page_to_maddr(image->control_code_page).
And this would always return the same virtual address, as
the MFN of the control_code_page is inside of the
kexec_crash_area.start -> kexec_crash_area.size area.
Then machine_kexec_add_page updates the L1 .. which can be done
concurrently and on subsequent calls we mangle it up.
This is all a theory at this time, but testing reveals
that adding the hypercall_create_continuation() at the
kexec hypercall fixes the crash.
NOTE: This patch follows 5c5216 (kexec: clear kexec_image slot
when unloading kexec image) to prevent crashes during
simultaneous load/unloads.
NOTE: Consideration was given to using the existing flag
KEXEC_FLAG_IN_PROGRESS to denote a kexec hypercall in
progress. This, however, overloads the original intent of
the flag which is to denote that we are about-to/have made
the jump to the crash path. The overloading would lead to
failures in existing checks on this flag as the flag would
always be set at the top level in do_kexec_op_internal().
For this reason, the new flag KEXEC_FLAG_HC_IN_PROGRESS
was introduced.
While at it, fixed the #define mismatched spacing
Signed-off-by: Eric DeVolder <eric.devolder@oracle.com> Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
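The idea of guarding every kexec operation with one "hypercall in progress" flag at the top level can be sketched with C11 atomics as below. This is only a model: the names are hypothetical, and returning -EAGAIN stands in for the continuation the real code creates so that the caller retries.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the "hypercall in progress" flag. */
static atomic_flag hc_in_progress = ATOMIC_FLAG_INIT;

typedef int (*kexec_op_fn)(void *arg);

/* Run op only if no other kexec operation is currently in flight. */
static int do_op_serialised(kexec_op_fn op, void *arg)
{
    int ret;

    assert(op != NULL);

    /* Atomically claim the flag; a second entrant is turned away. */
    if ( atomic_flag_test_and_set(&hc_in_progress) )
        return -EAGAIN; /* the real code would arrange a retry instead */

    ret = op(arg);

    atomic_flag_clear(&hc_in_progress);

    return ret;
}

/* Demo operation: tries to re-enter while the outer call holds the flag. */
static int reentrant_op(void *arg)
{
    int *inner_result = arg;

    *inner_result = do_op_serialised(reentrant_op, arg);

    return 0;
}
```

With the whole operation serialised at this level, per-function locks further down become redundant.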
Ross Lagerwall [Thu, 20 Apr 2017 13:18:00 +0000 (14:18 +0100)]
x86/microcode: Use the return value from early_microcode_update_cpu
Use the return value from early_microcode_update_cpu rather than
ignoring it.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-Acked-by: Julien Grall <julien.grall@arm.com>
Wei Liu [Tue, 18 Apr 2017 14:48:59 +0000 (15:48 +0100)]
hotplug/FreeBSD: configure xenstored
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Wei Liu [Tue, 18 Apr 2017 14:42:43 +0000 (15:42 +0100)]
oxenstored: provide options to define xenstored devices
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Wei Liu [Tue, 18 Apr 2017 14:20:03 +0000 (15:20 +0100)]
paths.m4: provide XENSTORED_{KVA,PORT}
The default values are Linux device names. No users yet.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Ross Lagerwall [Tue, 18 Apr 2017 15:47:24 +0000 (16:47 +0100)]
x86: Move microcode loading earlier
Move microcode loading earlier for the boot CPU and secondary CPUs so
that it takes place before identify_cpu() is called for each CPU.
Without this, the detected features may be wrong if the new microcode
loading adjusts the feature bits. That could mean that some fixes (e.g. d6e9f8d4f35d ("x86/vmx: fix vmentry failure with TSX bits in LBR"))
don't work as expected.
Previously during boot, the microcode loader was invoked for each
secondary CPU started and then again for each CPU as part of an
initcall. Simplify the code so that it is invoked exactly once for each
CPU during boot.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:30:27 +0000 (13:30 +0200)]
x86emul: force CLZERO feature flag in test harness
Commit b988e88cc0 ("x86/emul: Add feature check for clzero") added a
feature check to the emulator, which breaks the harness without this
flag being forced to true.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
x86/vioapic: allow holes in the GSI range for PVH Dom0
The current vIO APIC for PVH Dom0 doesn't allow non-contiguous GSIs, which
means that all GSIs must belong to an IO APIC. This doesn't match reality,
where there are systems with non-contiguous GSIs.
In order to fix this add a base_gsi field to each hvm_vioapic struct, in order
to store the base GSI for each emulated IO APIC. For PVH Dom0 those values are
populated based on the hardware ones.
Jan Beulich [Wed, 19 Apr 2017 11:29:14 +0000 (13:29 +0200)]
x86/HVM: don't uniformly report "MMIO" for various forms of failed emulation
This helps distinguish the call paths leading there.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:26:55 +0000 (13:26 +0200)]
VMX: don't blindly enable descriptor table exiting control
This is an optional feature and hence we should check for it before
use.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:26:18 +0000 (13:26 +0200)]
x86/HVM: restrict emulation in hvm_descriptor_access_intercept()
While I did review d0a699a389 ("x86/monitor: add support for descriptor
access events") it didn't really occur to me that someone could be this
blunt and add unguarded emulation again just a few weeks after we
guarded all special purpose emulator invocations. Fix this.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:25:44 +0000 (13:25 +0200)]
x86emul: always fill x86_insn_modrm()'s outputs
The function is rather unlikely to be called for insns which don't have
ModRM bytes, and hence addressing Coverity's recurring complaint of
callers potentially consuming uninitialized data when they know that
certain opcodes have ModRM bytes can be suppressed this way without
unduly adding overhead to fast paths.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 19 Apr 2017 11:24:18 +0000 (13:24 +0200)]
x86emul: add "unblock NMI" retire flag
No matter that we emulate IRET for (guest) real mode only right now, we
should get its effect on (virtual) NMI delivery right. Note that we can
simply check the other retire flags also in the !OKAY case, as the
insn emulator now guarantees them to only be set on OKAY.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Alistair Francis [Mon, 17 Apr 2017 21:33:11 +0000 (14:33 -0700)]
tools: Use POSIX signal.h instead of sys/signal.h
The POSIX spec specifies to use:
#include <signal.h>
instead of:
#include <sys/signal.h>
as seen here:
http://pubs.opengroup.org/onlinepubs/009695399/functions/signal.html
This removes the warning:
#warning redirecting incorrect #include <sys/signal.h> to <signal.h>
when building with the musl C-library.
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Alistair Francis [Mon, 17 Apr 2017 21:33:10 +0000 (14:33 -0700)]
tools: Use POSIX poll.h instead of sys/poll.h
The POSIX spec specifies to use:
#include <poll.h>
instead of:
#include <sys/poll.h>
as seen here:
http://pubs.opengroup.org/onlinepubs/009695399/functions/poll.html
This removes the warning:
#warning redirecting incorrect #include <sys/poll.h> to <poll.h>
when building with the musl C-library.
Signed-off-by: Alistair Francis <alistair.francis@xilinx.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Andrew Cooper [Mon, 10 Apr 2017 12:11:06 +0000 (13:11 +0100)]
x86/emul: Drop more redundant ctxt.event_pending checks
Since c/s 92cf67888a, x86_emulate_wrapper() asserts stricter behaviour about
the relationship between X86EMUL_EXCEPTION and ctxt.event_pending.
These removals should have been included in the aforementioned changeset, and
were only omitted due to an oversight.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Tim Deegan <tim@xen.org> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Thu, 13 Apr 2017 15:35:02 +0000 (17:35 +0200)]
x86/vIO-APIC: fix uninitialized variable warning
In a release build modern gcc validly complains about "pin" possibly
being uninitialized in vioapic_irq_positive_edge().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
VT-d: correct a comment and remove an useless if() statement
Fix two flaws in the patch (93358e8e VT-d: introduce update_irte to update
irte safely):
1. Expand a comment in update_irte() to make it clear that VT-d hardware
doesn't update IRTE and software can't update IRTE behind us since we hold
iremap_lock.
2. Remove a useless if() statement.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
clang: disable the gcc-compat warnings for read_atomic
clang gcc-compat warnings can wrongly fire when certain constructions are used,
at least the following flow:
switch ( ... )
{
case ...:
while ( ({ int x; switch ( foo ) { case 1: x = 1; break; } x; }) )
{
...
will cause clang to emit the warning "'break' is bound to loop, GCC
binds it to switch". This is a false positive, as both gcc and clang bind
the break to the inner switch. In order to work around this issue, disable the
gcc-compat checks for the usage of the read_atomic macro.
This has been reported upstream as http://bugs.llvm.org/show_bug.cgi?id=32595.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Luwei Kang [Thu, 13 Apr 2017 10:44:28 +0000 (18:44 +0800)]
tools:misc:xenpm: set max freq to all cpu with default cpuid
Users can set the max freq of a specific cpu with
"xenpm set-scaling-maxfreq [cpuid] <HZ>"
or of all cpus, by omitting the cpuid, with
"xenpm set-scaling-maxfreq <HZ>".
Setting the max freq without a cpuid causes a
segmentation fault after commit id d4906b5d05.
This patch fixes this issue and adds the ability
to set the max freq with the default cpuid.
Signed-off-by: Luwei Kang <luwei.kang@intel.com> Compile-tested-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
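The optional-cpuid handling can be illustrated as follows. This is a hypothetical parser, not the xenpm code: the function name, the use of -1 for "all cpus", and the convention that argv[0] is the first argument after the subcommand are all assumptions.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the arguments following "set-scaling-maxfreq".
 *   <HZ>          -> *cpuid = -1 (meaning all cpus), *freq = HZ
 *   <cpuid> <HZ>  -> *cpuid = cpuid, *freq = HZ
 * Returns 0 on success, -1 on a malformed command line. argv[] is never
 * indexed beyond argc, which is the kind of mistake that leads to a
 * segmentation fault when the cpuid is omitted.
 */
static int parse_maxfreq_args(int argc, char *const argv[], int *cpuid, int *freq)
{
    assert(cpuid != NULL && freq != NULL);

    *cpuid = -1;

    if ( argc == 1 )
        return sscanf(argv[0], "%d", freq) == 1 ? 0 : -1;

    if ( argc == 2 )
        return (sscanf(argv[0], "%d", cpuid) == 1 &&
                sscanf(argv[1], "%d", freq) == 1) ? 0 : -1;

    return -1;
}
```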
Ian Jackson [Wed, 12 Apr 2017 15:18:57 +0000 (16:18 +0100)]
Config.mk: Update for 4.9.0-rc1.2
Contrary to what I wrote in d0db50ced1f7 "Config.mk: Update for
4.9.0-rc1.1", the build failure with 4.9.0-rc1 was not due to a wrong
qemu tag, but a wrong mini-os tag. So burn 4.9.0-rc1.1 too :-(. (We
can rewind the qemu-trad tag to 4.9.0-rc1; the -rc1 and -rc1.1 tags
are identical.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Ian Jackson [Wed, 12 Apr 2017 15:03:35 +0000 (16:03 +0100)]
Config.mk: Update for 4.9.0-rc1.1
In qemu-trad, I made xen-4.9.0-rc1 refer erroneously to the 4.8
branch. That doesn't build. So we are burning the version number
4.9.0-rc1 in xen.git and qemu-trad. (The other trees can remain as
they are.)
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
The changes introduced on c47d1d broke the clang build due to undefined
references to __xsm_action_mismatch_detected, because clang hasn't optimized
the code properly. The following patch allows the clang build to work again,
while keeping the same functionality.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Jonathan Davies [Fri, 7 Apr 2017 13:27:20 +0000 (14:27 +0100)]
oxenstored: save remote evtchn port, not local port
Previously, Domain.dump output the number of the local port
corresponding to each domain's event-channel. However, when oxenstored
exits, it closes /dev/xen/evtchn which causes the kernel to close the
local port (evtchn_release), so this port is no longer useful.
Instead, store the remote port. This can be used to reconnect the
event-channel by binding the original remote port to a fresh local port.
Indeed, the logic for parsing the stored state already expects a remote
port as it passes the parsed port number to Domain.make (via
Domains.create), which takes a remote port.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jonathan Davies [Fri, 7 Apr 2017 13:27:19 +0000 (14:27 +0100)]
oxenstored: avoid leading slash in paths in saved store state
Internally, paths are represented as lists of strings, where
* path "/" is represented by []
* path "/local/domain/0" is represented by ["local"; "domain"; "0"]
(see comment for Store.Path.t).
However, the traversal function generated paths like
[""; "local"; "domain"; "0"]
because the name of the root node is "". Change it to generate paths
correctly.
Furthermore, the function passed to Store.dump_fct would render the node
"foo" under the path [] as "//foo". Change this to return "/foo".
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
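The intended rendering of a path list into a string can be sketched as below, in C rather than oxenstored's OCaml. render_path() is a hypothetical helper, not a translation of the actual function passed to Store.dump_fct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Render a path given as a list of components:
 *   []                        -> "/"
 *   ["local"; "domain"; "0"]  -> "/local/domain/0"
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int render_path(char *out, size_t len, const char *const *parts, size_t n)
{
    size_t used = 0;

    assert(out != NULL && len >= 2);

    if ( n == 0 )
    {
        /* The root is the only path consisting solely of a slash. */
        out[0] = '/';
        out[1] = '\0';
        return 0;
    }

    for ( size_t i = 0; i < n; i++ )
    {
        size_t need = strlen(parts[i]) + 1; /* '/' plus the component */

        if ( used + need >= len )
            return -1;

        out[used] = '/';
        memcpy(out + used + 1, parts[i], need - 1);
        used += need;
        out[used] = '\0';
    }

    return 0;
}
```

Because every component is prefixed by exactly one slash and the root contributes no component of its own, no "//" can appear.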
This reverts commit b32d442abd92cdd4d8f2a2e7794cfee9dba7fe22. There is
no need for this patch after "xen/arm: Set and restore HCR_EL2 register
for each vCPU separately".
Andre Przywara [Fri, 7 Apr 2017 22:08:01 +0000 (23:08 +0100)]
ARM: GICv3 ITS: introduce device mapping
The ITS uses device IDs to map LPIs to a device. Dom0 will later use
those IDs, which we directly pass on to the host.
For this we have to map each device that Dom0 may request to a host
ITS device with the same identifier.
Allocate the respective memory and enter each device into an rbtree to
later be able to iterate over it or to easily tear down guests.
Because device IDs are per ITS, we need to identify a virtual ITS. We
use the doorbell address for that purpose, as it is a nice architectural
MSI property and spares us from handling opaque pointers or breaking
the VGIC abstraction.
Andre Przywara [Fri, 7 Apr 2017 22:07:59 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce host LPI array
The number of LPIs on a host can be potentially huge (millions),
although in practice it will mostly be reasonable. So prematurely allocating
an array of struct irq_desc's for each LPI is not an option.
However Xen itself does not care about LPIs, as every LPI will be injected
into a guest (Dom0 for now).
Create a dense data structure (8 Bytes) for each LPI which holds just
enough information to determine the virtual IRQ number and the VCPU into
which the LPI needs to be injected.
Also to not artificially limit the number of LPIs, we create a 2-level
table for holding those structures.
This patch introduces functions to initialize these tables and to
create, lookup and destroy entries for a given LPI.
By using the naturally atomic access guarantee the native uint64_t data
type gives us, we allocate and access LPI information in a way that does
not require a lock.
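A rough model of the 8-byte entry and the 2-level table is given below. Everything here is an assumption made for illustration: the field layout, the table dimensions and the function names are hypothetical, C11 atomics stand in for the naturally atomic 64-bit accesses, and second-level chunks are allocated lazily without any locking (so allocation is assumed to happen from a single thread).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical 8-byte entry, readable and writable as one 64-bit word. */
union host_lpi {
    uint64_t data;
    struct {
        uint32_t virt_lpi;
        uint16_t dom_id;
        uint16_t vcpu_id;
    } f;
};

_Static_assert(sizeof(union host_lpi) == 8, "entry must fit in 64 bits");

#define LPIS_PER_CHUNK 512u
#define NR_CHUNKS      64u

/* First level: pointers to second-level chunks, allocated on demand. */
static _Atomic uint64_t *chunks[NR_CHUNKS];

/* Publish an entry with a single atomic 64-bit store. */
static int lpi_store(uint32_t lpi, union host_lpi e)
{
    uint32_t c = lpi / LPIS_PER_CHUNK;

    if ( c >= NR_CHUNKS )
        return -1;

    if ( chunks[c] == NULL )
    {
        chunks[c] = calloc(LPIS_PER_CHUNK, sizeof(*chunks[c]));
        if ( chunks[c] == NULL )
            return -1;
    }

    atomic_store(&chunks[c][lpi % LPIS_PER_CHUNK], e.data);

    return 0;
}

/* Read an entry with a single atomic 64-bit load, so it is never torn. */
static int lpi_load(uint32_t lpi, union host_lpi *out)
{
    uint32_t c = lpi / LPIS_PER_CHUNK;

    assert(out != NULL);

    if ( c >= NR_CHUNKS || chunks[c] == NULL )
        return -1;

    out->data = atomic_load(&chunks[c][lpi % LPIS_PER_CHUNK]);

    return 0;
}
```

Packing the whole entry into one word is what lets readers see either the old or the new entry, never a mix of the two.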
Andre Przywara [Fri, 7 Apr 2017 22:07:58 +0000 (23:07 +0100)]
ARM: GICv3 ITS: introduce ITS command handling
To be able to easily send commands to the ITS, create the respective
wrapper functions, which take care of the ring buffer.
The first two commands we implement provide methods to map a collection
to a redistributor (aka host core) and to flush the command queue (SYNC).
Start using these commands for mapping one collection to each host CPU.
As an ITS might choose between *two* ways of addressing a redistributor,
we store both the MMIO base address as well as the processor number in
a per-CPU variable to give each ITS what it wants.
Andre Przywara [Fri, 7 Apr 2017 22:07:57 +0000 (23:07 +0100)]
ARM: GICv3 ITS: map ITS command buffer
Instead of directly manipulating the tables in memory, an ITS driver
sends commands via a ring buffer in normal system memory to the ITS h/w
to create or alter the LPI mappings.
Allocate memory for that buffer and tell the ITS about it to be able
to send ITS commands.
Andre Przywara [Fri, 7 Apr 2017 22:07:56 +0000 (23:07 +0100)]
ARM: GICv3 ITS: allocate device and collection table
Each ITS maps a pair of a DeviceID (for instance derived from a PCI
b/d/f triplet) and an EventID (the MSI payload or interrupt ID) to a
pair of LPI number and collection ID, which points to the target CPU.
This mapping is stored in the device and collection tables, which software
has to provide for the ITS to use.
Allocate the required memory and hand it over to the ITS.
Andre Przywara [Fri, 7 Apr 2017 22:07:55 +0000 (23:07 +0100)]
ARM: GICv3: allocate LPI pending and property table
The ARM GICv3 provides a new kind of interrupt called LPIs.
The pending bits and the configuration data (priority, enable bits) for
those LPIs are stored in tables in normal memory, which software has to
provide to the hardware.
Allocate the required memory, initialize it and hand it over to each
redistributor. The maximum number of LPIs to be used can be adjusted with
the command line option "max_lpi_bits", which defaults to 20 bits,
covering about one million LPIs.
Andre Przywara [Fri, 7 Apr 2017 22:07:54 +0000 (23:07 +0100)]
ARM: GICv3 ITS: initialize host ITS
Map the registers frame for each host ITS and populate the host ITS
structure with some parameters describing the size of certain properties
like the number of bits for device IDs.
Andre Przywara [Fri, 7 Apr 2017 22:07:53 +0000 (23:07 +0100)]
ARM: GICv3 ITS: parse and store ITS subnodes from hardware DT
Parse the GIC subnodes in the device tree to find every ITS MSI controller
the hardware offers. Store that information in a list to both propagate
all of them later to Dom0, but also to be able to iterate over all ITSes.
This introduces an ITS Kconfig option (as an EXPERT option), use
XEN_CONFIG_EXPERT=y on the make command line to see and use the option.
xen: credit1: treat pCPUs more evenly during balancing.
Right now, we use cpumask_first() for going through
the busy pCPUs in csched_load_balance(). This means
not all pCPUs have equal chances of seeing their
pending work stolen. It also means there is more
runqueue lock pressure on lower ID pCPUs.
To avoid all this, let's record and remember, for
each NUMA node, which pCPU we last stole from,
and start from that one the following time.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
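The "start after the last pick and wrap around" idea can be sketched on a plain 64-bit mask as below. cycle_next() is a hypothetical helper written for this illustration and not the scheduler code.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 64u

/*
 * Return the first CPU set in mask that follows 'last', wrapping around at
 * the end. 'last' itself is considered only after all the others. Returns
 * NR_CPUS if the mask is empty.
 */
static unsigned int cycle_next(unsigned int last, uint64_t mask)
{
    assert(last < NR_CPUS);

    for ( unsigned int i = 1; i <= NR_CPUS; i++ )
    {
        unsigned int cpu = (last + i) % NR_CPUS;

        if ( mask & ((uint64_t)1 << cpu) )
            return cpu;
    }

    return NR_CPUS;
}
```

Feeding each result back in as the next 'last' visits all set bits in turn, rather than always beginning at the lowest ID.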
xen: credit1: increase efficiency and scalability of load balancing.
During load balancing, we check the non idle pCPUs to
see if they have runnable but not running vCPUs that
can be stolen by and set to run on currently idle pCPUs.
If a pCPU has only one running (or runnable) vCPU,
though, we don't want to steal it from there, and
it's therefore pointless bothering with it
(especially considering that bothering means trying
to take its runqueue lock!).
On large systems, when load is only slightly higher
than the number of pCPUs (i.e., there are just a few
more active vCPUs than the number of the pCPUs), this
may mean that:
- we go through all the pCPUs,
- for each one, we (try to) take its runqueue locks,
- we figure out there's actually nothing to be stolen!
To mitigate this, we introduce a counter for the number
of runnable vCPUs on each pCPU. In fact, unless there
are at least 2 runnable vCPUs --typically, one running,
and the others in the runqueue-- it does not make sense
to try stealing anything.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
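A minimal sketch of the counter-based filter, with hypothetical names. In this model the counter is simply read without taking any lock, which is the point of having it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-pCPU state: running plus queued vCPUs. */
struct pcpu_model {
    unsigned int nr_runnable;
};

/* With fewer than 2 runnable vCPUs there is nothing worth stealing. */
static bool worth_stealing_from(const struct pcpu_model *p)
{
    assert(p != NULL);

    return p->nr_runnable >= 2;
}

/* Count the pCPUs whose runqueue lock is worth trying to take at all. */
static size_t count_steal_candidates(const struct pcpu_model *pcpus, size_t n)
{
    size_t candidates = 0;

    for ( size_t i = 0; i < n; i++ )
        if ( worth_stealing_from(&pcpus[i]) )
            candidates++;

    return candidates;
}
```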
cpumask_any() is costly (because of the randomization).
And since it does not really matter which exact CPU is
selected within a runqueue, as that will be overridden
shortly after, in runq_tickle(), spending too much time
and achieving true randomization is pretty pointless.
As the picked CPU, however, would be used as a hint,
within runq_tickle(), don't give up on it entirely,
and let's make sure we don't always return the same
CPU, or favour lower or higher ID CPUs.
To achieve that, let's record and remember, for each
runqueue, which CPU we picked last, and start from
that one the following time.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Currently, it can happen that __runq_tickle(),
running on pCPU 2 because vCPU x woke up, decides
to tickle pCPU 3, because it's idle. Just after
that, but before pCPU 3 manages to schedule and
pick up x, either __runq_tickle() or
__csched_cpu_pick(), running on pCPU 6, sees that
idle pCPUs are 0, 1 and also 3, and for whatever
reason it also chooses 3 for waking up (or
migrating) vCPU y.
When pCPU 3 goes through the scheduler, it will
pick up, say, vCPU x, and y will sit in its
runqueue, even if there are idle pCPUs.
Alleviate this by marking a pCPU as no longer idle
right away when tickling it (like, e.g., it happens
in Credit2).
Note that this does not eliminate the race. That
is not possible without introducing proper locking
for the cpumasks the scheduler uses. It significantly
reduces the window during which it can happen, though.
Introducing proper locking for the cpumasks can, in
theory, be done, and may be investigated in future.
It is a significant amount of work to do it properly
(e.g., avoiding deadlock), and it is likely to adversely
affect scalability, and so it may be a path it is just
not worth following.
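A simplified, single-threaded sketch of the approach (the idle mask
representation and the function name are assumptions, not the Credit1
code): the tickled pCPU's bit is cleared from the idle mask at tickle
time, instead of when that pCPU eventually reschedules.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick an idle pCPU for a waking vCPU and tickle it.
 * 'idle_mask' is a hypothetical bitmap: bit n set => pCPU n looks idle.
 * Returns the chosen pCPU, or -1 if none looks idle.
 */
static int tickle_idle_pcpu(uint64_t *idle_mask)
{
    for (int cpu = 0; cpu < 64; cpu++) {
        uint64_t bit = UINT64_C(1) << cpu;

        if (*idle_mask & bit) {
            /* Stop advertising 'cpu' as idle right away. */
            *idle_mask &= ~bit;
            return cpu;
        }
    }
    return -1;
}
```

Since the test and the clear are two separate steps here, a small
window remains in which two pCPUs can still choose the same target.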
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
xen: credit1: simplify csched_runq_steal() a little bit.
Since we're holding the lock on the pCPU from which we
are trying to steal, it can't have disappeared, so we
can drop the check for that (and convert it into an
ASSERT()).
And since we try to steal only from busy pCPUs, it's
unlikely for such pCPU to be idle, so we can:
- tell the compiler this is actually unlikely,
- bail early if the pCPU, unfortunately, turns out
to really be idle.
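Roughly, and with made-up types (struct pcpu and struct vcpu below are
placeholders, and assert() stands in for ASSERT()), the resulting shape
of the function is:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch hint; relies on the GCC/Clang __builtin_expect builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

struct vcpu { int id; };

/* Placeholder peer pCPU state; 'queued' is its single stealable vCPU. */
struct pcpu {
    bool online;
    bool idle;
    struct vcpu *queued;
};

/* Caller is assumed to hold the peer's runqueue lock. */
static struct vcpu *runq_steal(struct pcpu *peer)
{
    struct vcpu *v;

    /* The peer cannot disappear while we hold its lock. */
    assert(peer->online);

    /* We only target busy pCPUs, so this should rarely trigger. */
    if (unlikely(peer->idle))
        return NULL;

    v = peer->queued;
    peer->queued = NULL;
    return v;
}
```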
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Andrew Cooper [Wed, 8 Mar 2017 15:38:55 +0000 (15:38 +0000)]
x86/emul: Poison the stubs with debug traps
...rather than leaving fragments of old instructions in place. This reduces
the chances of something going further-wrong (as the debug trap will be caught
and terminate the guest) in a cascade-failure where we end up executing the
instruction fragments.
After:
(XEN) d3v0 exception 6 (ec=0000) in emulation stub (line 6239)
(XEN) d3v0 stub: c4 e1 44 77 c3 cc cc cc cc cc cc cc cc cc cc
To make this work, the int3 handler needs to be extended to attempt recovery
rather than simply returning back to Xen context. While altering do_int3(),
leave an obvious sign if an embedded breakpoint has been hit and not dealt
with by debugging facilities.
(XEN) Hit embedded breakpoint at ffff82d0803d01f6 [extable.c#stub_selftest+0xda/0xee]
Extend the selftests to include int3, and add an extra printk indicating the
start of the recovery selftests, to avoid leaving otherwise-spurious faults
visible in the log.
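For illustration, poisoning amounts to filling the stub buffer with the
int3 opcode (0xcc) before the instruction bytes are copied in. The
buffer size and helper names here are assumptions, chosen to match the
15 bytes shown in the stub dump above:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STUB_BUF_SIZE 15   /* assumed; matches the dump above */
#define INT3_OPCODE   0xcc /* x86 one-byte breakpoint instruction */

/* Overwrite the whole stub so no stale instruction bytes survive. */
static void poison_stub(uint8_t *stub)
{
    memset(stub, INT3_OPCODE, STUB_BUF_SIZE);
}

/* Poison first, then place the instruction to be executed. */
static void fill_stub(uint8_t *stub, const uint8_t *insn, size_t len)
{
    assert(len <= STUB_BUF_SIZE);
    poison_stub(stub);
    memcpy(stub, insn, len);
}
```

Anything executed past the intended bytes then hits a breakpoint
instead of a leftover instruction fragment.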
x86/ioreq server: synchronously reset outstanding p2m_ioreq_server entries when an ioreq server unmaps
After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
synchronously by iterating the p2m table.
The synchronous resetting is necessary because we need to guarantee
the p2m table is clean before another ioreq server is mapped. And
since the sweeping of p2m table could be time consuming, it is done
with hypercall continuation.
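The continuation pattern can be sketched as a sweep that does a bounded
amount of work and reports whether it must be re-invoked. The flat
array, the type names and the batch size below are all stand-ins, not
the real p2m interfaces:

```c
#include <assert.h>
#include <stddef.h>

enum p2m_type { RAM_RW, IOREQ_SERVER }; /* stand-ins for the p2m types */

#define BATCH 256 /* assumed amount of work between preemption points */

/*
 * Reset entries starting at *next. Returns 0 once the whole table has
 * been swept, or 1 if the caller should invoke it again (the
 * "continuation"), with *next recording where to resume.
 */
static int reset_ioreq_entries(enum p2m_type *table, size_t n, size_t *next)
{
    size_t done = 0;

    while (*next < n) {
        if (table[*next] == IOREQ_SERVER)
            table[*next] = RAM_RW;
        (*next)++;

        if (++done == BATCH && *next < n)
            return 1; /* preempt; resume from *next later */
    }
    return 0;
}
```

A caller simply loops until 0 is returned, which is what re-issuing the
hypercall achieves in the real case.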
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
After an ioreq server has unmapped, the remaining p2m_ioreq_server
entries need to be reset back to p2m_ram_rw. This patch does this
asynchronously with the current p2m_change_entry_type_global()
interface.
New field entry_count is introduced in struct p2m_domain, to record
the number of p2m_ioreq_server p2m page table entries. One property of
these entries is that they only point to 4K sized page frames, because
all p2m_ioreq_server entries originate from p2m_ram_rw ones in
p2m_change_type_one(). We do not need to worry about the counting for
2M/1G sized pages.
This patch disallows mapping of an ioreq server while there are still
p2m_ioreq_server entries left, in case another mapping occurs right after
the current one is unmapped and releases its lock, with the p2m table not
synced yet.
This patch also disallows live migration when there are remaining
p2m_ioreq_server entries in the p2m table. The core reason is that our current
implementation of p2m_change_entry_type_global() lacks information
to resync p2m_ioreq_server entries correctly if global_logdirty is
on.
We still need to handle other recalculations, however; which means
that when doing a recalculation, if the current type is
p2m_ioreq_server, we check to see if p2m->ioreq.server is valid or
not. If it is, we leave it as type p2m_ioreq_server; if not, we reset
it to p2m_ram as appropriate.
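A sketch of that decision, under the assumption that "p2m_ram as
appropriate" means log-dirty RAM when log-dirty tracking is on and
read-write RAM otherwise. The types and field names are simplified
stand-ins for the real p2m structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the p2m types involved. */
enum p2m_type { P2M_RAM_RW, P2M_RAM_LOGDIRTY, P2M_IOREQ_SERVER };

/* Hypothetical, reduced view of the per-domain p2m state. */
struct p2m {
    const void *ioreq_server; /* non-NULL while an ioreq server is mapped */
    bool logdirty;            /* assumed: log-dirty tracking is enabled */
};

/* Recalculate the type of a changeable entry. */
static enum p2m_type recalc_type(const struct p2m *p2m, enum p2m_type t)
{
    /* Keep the entry while an ioreq server is still mapped. */
    if (t == P2M_IOREQ_SERVER && p2m->ioreq_server != NULL)
        return t;

    return p2m->logdirty ? P2M_RAM_LOGDIRTY : P2M_RAM_RW;
}
```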
To avoid code duplication, lift recalc_type() out of p2m-pt.c and use
it for all type recalculations (both in p2m-pt.c and p2m-ept.c).
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Fri, 7 Apr 2017 15:38:48 +0000 (17:38 +0200)]
x86/ioreq server: handle read-modify-write cases for p2m_ioreq_server pages
In ept_handle_violation(), write violations are also treated as
read violations. And when a VM is accessing a write-protected
address with read-modify-write instructions, the read emulation
process is triggered first.
For p2m_ioreq_server pages, current ioreq server only forwards
the write operations to the device model. Therefore when such page
is being accessed by a read-modify-write instruction, the read
operations should be emulated here in hypervisor. This patch provides
such a handler to copy the data to the buffer.
Note: MMIOs with p2m_mmio_dm type do not need such special treatment
because both reads and writes will go to the device model.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/ioreq server: add device model wrappers for new DMOP
A new device model wrapper is added for the newly introduced
DMOP - XEN_DMOP_map_mem_type_to_ioreq_server.
Since currently this DMOP only supports the emulation of write
operations, attempts to trigger the DMOP with values other than
XEN_DMOP_IOREQ_MEM_ACCESS_WRITE or 0 (to unmap the ioreq server)
shall fail. The wrapper shall be updated once read operations
are also to be emulated in the future.
Also note currently this DMOP only supports one memory type,
and can be extended in the future to map multiple memory types
to multiple ioreq servers, e.g. mapping HVMMEM_ioreq_serverX to
ioreq server X. This wrapper shall be updated when such a change
is made.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Paul Durrant [Fri, 7 Apr 2017 15:38:11 +0000 (17:38 +0200)]
x86/ioreq server: add DMOP to map guest ram with p2m_ioreq_server to an ioreq server
Previously, p2m_ioreq_server was used to write-protect guest ram
pages, which are tracked with ioreq server's rangeset. However,
the number of ram pages to be tracked may exceed the upper limit of
rangeset.
Now, a new DMOP - XEN_DMOP_map_mem_type_to_ioreq_server, is added
to let one ioreq server claim/disclaim its responsibility for the
handling of guest pages with p2m type p2m_ioreq_server. Users of
this DMOP can specify which kind of operation is supposed to be
emulated in a parameter named flags. Currently, this DMOP only
supports the emulation of write operations. It can be further
extended to support the emulation of read ones if an ioreq server
has such a requirement in the future.
For now, we only support one ioreq server for this p2m type, so
once an ioreq server has claimed its ownership, subsequent calls
of the XEN_DMOP_map_mem_type_to_ioreq_server will fail. Users can
also disclaim the ownership of guest ram pages with p2m_ioreq_server,
by triggering this new DMOP, with ioreq server id set to the current
owner's and flags parameter set to 0.
Note:
a> both XEN_DMOP_map_mem_type_to_ioreq_server and p2m_ioreq_server
are only supported for HVMs with HAP enabled.
b> only after one ioreq server claims its ownership of p2m_ioreq_server,
will the p2m type change to p2m_ioreq_server be allowed.
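The claim/disclaim rules described above can be modelled as follows.
This is only a sketch: the state structure, the ACCESS_WRITE value and
the specific error codes are assumptions for illustration, not the
hypervisor's implementation.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for XEN_DMOP_IOREQ_MEM_ACCESS_WRITE; the value is assumed. */
#define ACCESS_WRITE 0x1u

#define NO_OWNER (-1)

/* Hypothetical per-domain record of who owns p2m_ioreq_server. */
struct domain_state {
    int owner; /* ioreq server id, or NO_OWNER */
};

static int map_mem_type_to_ioreq_server(struct domain_state *d,
                                        int id, unsigned int flags)
{
    /* Only write emulation, or 0 (unmap), is accepted. */
    if (flags & ~ACCESS_WRITE)
        return -EINVAL;

    if (flags == 0) {
        /* Disclaim: only the current owner may do so. */
        if (d->owner != id)
            return -EINVAL;
        d->owner = NO_OWNER;
        return 0;
    }

    /* Claim: a single owner at a time. */
    if (d->owner != NO_OWNER)
        return -EBUSY;
    d->owner = id;
    return 0;
}
```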
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/ioreq server: release the p2m lock after mmio is handled
Routine hvmemul_do_io() may need to peek the p2m type of a gfn to
select the ioreq server. For example, operations on gfns with
p2m_ioreq_server type will be delivered to a corresponding ioreq
server, and this requires that the p2m type not be switched back
to p2m_ram_rw during the emulation process. To avoid this race
condition, we delay the release of p2m lock in hvm_hap_nested_page_fault()
until mmio is handled.
Note: previously in hvm_hap_nested_page_fault(), put_gfn() was moved
before the handling of mmio, due to a deadlock risk between the p2m
lock and the event lock (in commit 77b8dfe). Later, a per-event channel
lock was introduced in commit de6acb7, to send events. So we do not
need to worry about the deadlock issue.
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
As a (rudimentary) way of directing and affecting the
placement logic implemented by the scheduler, support
vCPU hard affinity.
Basically, a vCPU will now be assigned only to a pCPU
that is part of its own hard affinity. If such pCPU(s)
is (are) busy, the vCPU will wait, like it happens
when there are no free pCPUs.
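In bitmask terms (64-bit words stand in for cpumasks, and the function
name is hypothetical), the placement rule is just an intersection of
the free pCPUs with the vCPU's hard affinity:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return a free pCPU that is also in the vCPU's hard affinity,
 * or -1 if there is none and the vCPU therefore has to wait.
 */
static int pick_pcpu(uint64_t free_mask, uint64_t hard_affinity)
{
    uint64_t ok = free_mask & hard_affinity;

    for (int cpu = 0; cpu < 64; cpu++)
        if (ok & (UINT64_C(1) << cpu))
            return cpu;

    return -1;
}
```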
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
xen: sched: introduce the 'null' semi-static scheduler
In cases where one is absolutely sure that there will be
fewer vCPUs than pCPUs, having to pay the cost, mostly in
terms of overhead, of an advanced scheduler may not be
desirable.
The simple scheduler implemented here could be a solution.
Here is how it works:
- each vCPU is statically assigned to a pCPU;
- if there are pCPUs without any vCPU assigned, they
stay idle (as in, they run their idle vCPU);
- if there are vCPUs which are not assigned to any
pCPU (e.g., because there are more vCPUs than pCPUs)
they *don't* run, until they get assigned;
- if a vCPU assigned to a pCPU goes away, one of the
vCPUs waiting to be assigned, if any, gets assigned
to the pCPU and can run there.
This scheduler, therefore, if used in configurations
where every vCPU can be assigned to a pCPU, guarantees
low overhead, low latency, and consistent performance.
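The assignment rules listed above can be modelled in a few lines. This
is a toy model with invented names and fixed sizes; in particular,
promoting waiters in FIFO order is an assumption, as the description
only says "one of" the waiting vCPUs gets assigned.

```c
#include <assert.h>

#define NR_PCPUS  2
#define MAX_VCPUS 8
#define NONE      (-1)

struct null_sched {
    int pcpu_vcpu[NR_PCPUS]; /* vCPU assigned to each pCPU, or NONE */
    int waitq[MAX_VCPUS];    /* vCPUs waiting for a pCPU, oldest first */
    int nr_waiting;
};

static void null_init(struct null_sched *s)
{
    for (int c = 0; c < NR_PCPUS; c++)
        s->pcpu_vcpu[c] = NONE;
    s->nr_waiting = 0;
}

/* Assign to a free pCPU if any; otherwise park the vCPU. */
static int null_vcpu_insert(struct null_sched *s, int vcpu)
{
    for (int c = 0; c < NR_PCPUS; c++) {
        if (s->pcpu_vcpu[c] == NONE) {
            s->pcpu_vcpu[c] = vcpu;
            return c;
        }
    }
    assert(s->nr_waiting < MAX_VCPUS);
    s->waitq[s->nr_waiting++] = vcpu;
    return NONE;
}

/* The vCPU on 'pcpu' goes away; promote the oldest waiter, if any. */
static void null_vcpu_remove(struct null_sched *s, int pcpu)
{
    s->pcpu_vcpu[pcpu] = NONE;
    if (s->nr_waiting > 0) {
        s->pcpu_vcpu[pcpu] = s->waitq[0];
        for (int i = 1; i < s->nr_waiting; i++)
            s->waitq[i - 1] = s->waitq[i];
        s->nr_waiting--;
    }
}
```

Note that no scheduling decision is ever made at run time: each pCPU
runs its assigned vCPU or stays idle.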
If used as default scheduler, at Xen boot, it is
recommended to limit the number of Dom0 vCPUs (e.g., with
'dom0_max_vcpus=x'). Otherwise, all the pCPUs will have
one Dom0's vCPU assigned, and there won't be room for
running efficiently (if at all) any guest.
Target use cases are embedded and HPC, but it may well
be interesting also in other circumstances.
Kconfig and documentation are updated accordingly.
While there, also document the availability of sched=rtds
as boot parameter, which apparently had been forgotten.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
xen: sched: make sure a pCPU added to a pool runs the scheduler ASAP
When a pCPU is added to a cpupool, the pool's scheduler
should immediately run on it so, for instance, any runnable
but not running vCPU can start executing there.
This currently does not happen. Make it happen by raising
the scheduler softirq directly from the function that
sets up the new scheduler for the pCPU.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()
mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere with the current round of MC handling. Always re-initialize it to
clear the historical value.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use private 'severity_cpu' for each function to fix this issue.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Adrian Pop [Fri, 7 Apr 2017 13:39:32 +0000 (15:39 +0200)]
x86/monitor: add support for descriptor access events
Adds monitor support for descriptor access events (reads & writes of
IDTR/GDTR/LDTR/TR) for the x86 architecture (VMX and SVM).
Signed-off-by: Adrian Pop <apop@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: minor cosmetic (hopefully!) cleanup] Reviewed-by: Jan Beulich <jbeulich@suse.com>
passthrough/io: fall back to remapping interrupt when we can't use VT-d PI
The current logic for using VT-d PI is that when the guest configures the
pirq's destination to a single vcpu, the corresponding IRTE is updated to
posted format. If the destination of the pirq is then set to multiple vcpus,
the IRTE stays in posted format. Instead, we should fall back to remapped
interrupts when the guest misconfigures the destination of the pirq or gives
it multiple destination vcpus.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
[jb: guard against vcpu being NULL] Reviewed-by: Jan Beulich <jbeulich@suse.com>
We used structure assignment to update irte which was non-atomic when the
whole IRTE was to be updated. It is unsafe when an interrupt happened during
update. Furthermore, no bug or warning would be reported when this happened.
This patch introduces two variants, atomic and non-atomic, to update irte.
For initialization and release case, the non-atomic variant will be used. For
other cases (such as reprogramming to set irq affinity), the atomic variant
will be used. If the caller requests an atomic update but we can't meet it, we
raise a bug.
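One way to picture the choice between the variants is as a decision
over the two 64-bit halves of the 128-bit entry. This sketch only
models the decision; the enum, the helper name and the 'have_cx16'
flag (standing for availability of a 16-byte compare-exchange) are
assumptions, and no hardware table is touched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A 128-bit entry viewed as two 64-bit halves. */
struct irte {
    uint64_t lo, hi;
};

enum irte_update {
    UPD_PLAIN,      /* ordinary structure assignment */
    UPD_CMPXCHG16B, /* single 16-byte atomic exchange */
    UPD_LOW_HALF,   /* one atomic 64-bit store to 'lo' */
    UPD_HIGH_HALF,  /* one atomic 64-bit store to 'hi' */
    UPD_IMPOSSIBLE  /* atomic update requested but not achievable */
};

static enum irte_update plan_irte_update(const struct irte *cur,
                                         const struct irte *new_val,
                                         bool atomic, bool have_cx16)
{
    if (!atomic)
        return UPD_PLAIN;
    if (have_cx16)
        return UPD_CMPXCHG16B;
    /* Without a 16-byte primitive, at most one half may change. */
    if (cur->hi == new_val->hi)
        return UPD_LOW_HALF;
    if (cur->lo == new_val->lo)
        return UPD_HIGH_HALF;
    return UPD_IMPOSSIBLE;
}
```

UPD_IMPOSSIBLE corresponds to the case where the caller asked for an
atomic update that cannot be provided.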
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> [x86]
VT-d: introduce new fields in msi_desc to track binding with guest interrupt
msi_msg_to_remap_entry() is buggy when the live IRTE is in posted format. It
wrongly inherits the 'im' field meaning the IRTE is in posted format but
updates all the other fields to remapping format.
There are two situations that lead to the above issue. One is that some callers
really want to change the IRTE to remapped format. The other is that some callers
only want to update the msi message (e.g. set msi affinity) because they aren't
aware that this msi is bound to a guest interrupt. We should suppress the update
in the second situation. To distinguish them, the straightforward approach is to
let the caller specify which format of IRTE they want to update to. That isn't
feasible, because making all callers aware of the binding with a guest interrupt
would cause a far more complicated change (including the interfaces exposed to
IOAPIC and MSI). Also, some calls happen in interrupt context where we can't
acquire d->event_lock to read struct hvm_pirq_dpci.
This patch introduces two new fields in msi_desc to track binding with a guest
interrupt such that msi_msg_to_remap_entry() can get the binding and update
IRTE accordingly. After that change, pi_update_irte() can utilize
msi_msg_to_remap_entry() to update IRTE to posted format.
Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
passthrough: don't migrate pirq when it is delivered through VT-d PI
When a vCPU was migrated to another pCPU, pt irqs bound to this vCPU might
also need migration as an optimization to reduce IPIs between pCPUs. When VT-d
PI is enabled, interrupt vector will be recorded to a main memory resident
data-structure and a notification whose destination is decided by NDST is
generated. NDST is properly adjusted during vCPU migration so pirq directly
injected to guest needn't be migrated.
This patch adds an indicator, @posted, to show whether the pt irq is delivered
through VT-d PI. Also this patch fixes a bug where hvm_migrate_pirq() accesses
pirq_dpci->gmsi.dest_vcpu_id without checking the pirq_dpci's type.
Signed-off-by: Chao Gao <chao.gao@intel.com>
[jb: remove an extraneous check from hvm_migrate_pirq()] Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:24 +0000 (13:37 +0200)]
x86: add multiboot2 protocol support for relocatable images
Add multiboot2 protocol support for relocatable images. Only GRUB2 with
"multiboot2: Add support for relocatable images" patch understands
that feature. Older multiboot protocol (regardless of version)
compatible loaders ignore it and everything works as usual.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:02 +0000 (13:37 +0200)]
x86/boot: rename sym_phys() to sym_offs()
This way macro name better describes its function.
Currently it is used to calculate symbol offset in
relation to the beginning of Xen image mapping.
However, value returned by sym_offs() for a given
symbol is not always equal to its physical address.
There is no functional change.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:36:32 +0000 (13:36 +0200)]
x86: make Xen early boot code relocatable
Every multiboot protocol (regardless of version) compatible image must
specify its load address (in ELF or multiboot header). Multiboot protocol
compatible loaders have to load the image at the specified address. However, there
is no guarantee that the requested memory region (in case of Xen it starts
at 2 MiB and ends at ~5 MiB) where the image should be loaded initially is
RAM and is free (legacy BIOS platforms are merciful for Xen but I found at
least one EFI platform on which Xen load address conflicts with EFI boot
services; it is Dell PowerEdge R820 with latest firmware). To cope with that
problem we must make Xen early boot code relocatable and help the boot loader to
relocate the image in a proper way by suggesting allowed address ranges, rather
than requesting specific load addresses as is done right now. This patch does the former.
It does not add multiboot2 protocol interface which is done in "x86: add
multiboot2 protocol support for relocatable images" patch.
This patch changes the following things:
- %esi register is used as a storage for Xen image load base address;
it is mostly unused in early boot code and preserved during C functions
calls in 32-bit mode,
- %fs is used as base for Xen data relative addressing in 32-bit code
if it is possible; %esi is used for that thing during error printing
because it is not always possible to properly and efficiently
initialize %fs.
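The underlying arithmetic is base-plus-offset addressing. As a rough C
model only (the early boot code itself is assembly; the constant and
helper names below are illustrative, with 2 MiB taken from the default
load address mentioned above):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative link-time start of the image (2 MiB). */
#define LINK_BASE UINT32_C(0x00200000)

/* Offset of a symbol from the start of the image, fixed at link time. */
static uint32_t sym_offset(uint32_t link_addr)
{
    return link_addr - LINK_BASE;
}

/*
 * Address where the symbol really lives once the loader has placed
 * the image at 'load_base' (the value the patch keeps in %esi).
 */
static uint32_t sym_runtime_addr(uint32_t load_base, uint32_t link_addr)
{
    return load_base + sym_offset(link_addr);
}
```

When the loader honours the default address the two values coincide;
after relocation only the base-relative form remains valid.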
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>