Ian Campbell [Tue, 19 Nov 2013 13:00:18 +0000 (13:00 +0000)]
xen: arm: define guest virtual platform in API headers
The tools and the hypervisor need to agree on various aspects of the guest
environment, such as interrupt numbers, memory layout, initial register values
for registers which are implementation defined etc. Therefore move the
associated defines into the public interface headers, or create them as
necessary.
This just exposes the current de-facto standard guest layout, which may be
subject to change in the future. This deliberately does not make the guest
layout dynamic since there is currently no need.
These values should not be exposed to guests; guests should find these things
out via device tree or should not be relying on implementation defined
defaults.
Various bits of the hypervisor needed to change to configure dom0 with the real
platform values while using the virtual platform configuration for guests.
Arrange for this where appropriate and plumb through as needed.
We also need to expose some 64-bit values (e.g. PSR_GUEST64_INIT) for the
benefit of 32 bit toolstacks building 64 bit guests.
Ian Campbell [Tue, 19 Nov 2013 13:00:12 +0000 (13:00 +0000)]
xen: arm: allocate dom0 memory separately from preparing the dtb
Mixing these two together is a pain: it forces us to prepare the dtb before
processing the kernel, which means we don't know whether the guest is 32- or
64-bit while we construct its DTB.
Instead split out the memory allocation (including 1:1 workaround handling)
and p2m setup into a separate phase and then create a memory node in the DTB
based on the result.
This allows us to move kernel parsing before DTB setup.
As part of this it was also necessary to rework where the decision regarding
the placement of the DTB and initrd in RAM was made. It is now made when
loading the kernel, which allows it to make use of the zImage/ELF specific
information and therefore to make decisions based on complete knowledge and do
it right rather than guessing in prepare_dtb and relying on a later check to
see if things worked.
Ian Campbell [Tue, 19 Nov 2013 13:00:11 +0000 (13:00 +0000)]
xen: arm: move dom0 gic and timer device tree nodes under /xen-core-devices/
Julien observed that we were relying on the provided host DTB supplying
suitable #address-cells and #size-cells values to allow us to represent these
addresses, which may not reliably be the case. Moving these under our own
known node (somewhat analogous to the use of /soc/ or /motherboard/ on some
platforms) allows us to control these sizes.
Since the new node is created out of thin air it does not have a corresponding
struct dt_device_node and therefore we cannot use dt_n_addr_cells or
dt_n_size_cells; we use hardcoded constants instead. For the same reason
we define and use set_xen_range instead of dt_set_range.
The hypervisor, cpus and psci nodes all either define #foo-cells for their
children or do not contain reg properties and therefore can remain at the top
level.
The logging in make_gic_node was inconsistent. Fix it.
Julien Grall [Fri, 15 Nov 2013 15:27:36 +0000 (15:27 +0000)]
xen/arm: Panic if platform initialization failed
Currently, if an error occurs, Xen will silently ignore it and continue.
Convert platform_init to a void function and panic if we fail to
correctly initialize the platform.
Julien Grall [Mon, 18 Nov 2013 13:08:23 +0000 (13:08 +0000)]
xen/arm: ioremap_attr: return NULL is __vmap failed
Most ioremap_* callers check whether ioremap returns NULL. However, if the
physical address is not page-aligned, Xen returns the pointer given by
__vmap plus the offset within the page. So if the underlying __vmap fails,
the caller receives a non-NULL address and continues as if there was no error.
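The failure mode described above can be sketched as follows. This is a minimal illustration and not the Xen code: demo_vmap and demo_ioremap are hypothetical stand-ins for __vmap and ioremap_attr, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u   /* assumed page size */

/* Backing store for the hypothetical mapping (two pages so that a
 * page-aligned address plus any in-page offset stays inside it). */
static unsigned char demo_backing[2 * DEMO_PAGE_SIZE];

/* Hypothetical stand-in for __vmap(): returns a page-aligned pointer,
 * or NULL when the caller asks for a simulated failure. */
static void *demo_vmap(int should_fail)
{
    uintptr_t p;

    if (should_fail)
        return NULL;

    p = ((uintptr_t)demo_backing + DEMO_PAGE_SIZE - 1) &
        ~(uintptr_t)(DEMO_PAGE_SIZE - 1);
    assert((p & (DEMO_PAGE_SIZE - 1)) == 0);
    return (void *)p;
}

/* Hypothetical stand-in for ioremap_attr(). */
static void *demo_ioremap(uintptr_t paddr, int should_fail)
{
    size_t offs = paddr & (DEMO_PAGE_SIZE - 1);
    void *ptr = demo_vmap(should_fail);

    /* Check before adding the offset: NULL plus a non-zero offset would
     * look like a valid pointer to callers that only test for NULL. */
    if (ptr == NULL)
        return NULL;

    return (unsigned char *)ptr + offs;
}
```

With the check in place, a failed mapping of an unaligned address is reported as NULL rather than as a small non-NULL value.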
Julien Grall [Thu, 14 Nov 2013 17:00:34 +0000 (17:00 +0000)]
xen/arm: p2m: flush TLB by VMID when a new domain is creating
Once the VMID is marked unused, a new domain can reuse the VMID for its
own. If the TLB is not flushed, it can still contain stale translations.
When a new p2m is allocated, switch to the new VMID and flush the TLB on
every physical CPU.
Kelley Nielsen [Mon, 11 Nov 2013 23:24:00 +0000 (15:24 -0800)]
opw: libxl: use CTX macro in libxl_utils.c
The new coding style uses the convenience macro CTX as declared in
libxl_internal.h. Substitute an invocation of this macro for its
body at the two places it occurs in libxl_utils.c.
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Kelley Nielsen [Mon, 11 Nov 2013 23:23:56 +0000 (15:23 -0800)]
libxl: use LOG instead of LIBXL__LOG in libxl_utils.c
To conform to the new coding style, replace the invocation of
LIBXL__LOG in the function libxl_pipe() in the file libxl_utils.c
with an invocation of LOG. Create a local libxl__gc *gc for LOG
to use by invoking GC_INIT(ctx) at the top of the function, and
clean it up by invoking GC_FREE at the exit. Create a variable,
ret, to consolidate exits in one place and avoid invoking GC_FREE
twice.
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Kelley Nielsen [Mon, 11 Nov 2013 23:23:55 +0000 (15:23 -0800)]
libxl: use LOG and LOGE instead of LIBXL__LOG* in libxl_utils.c
Code cleanup - no functional changes
The convenience macros LOG and LOGE have been written to take the
place of the old macros in the LIBXL__LOG* family. Replace the
invocations of the old macros in the function libxl_read_file_contents()
with invocations of the corresponding new ones. Create a local
libxl__gc *gc for the new macros to use by invoking GC_INIT(ctx) at the
top of the function, and clean it up by invoking GC_FREE at the two
exit points.
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Kelley Nielsen [Fri, 15 Nov 2013 01:50:43 +0000 (17:50 -0800)]
libxl: use LOGE instead of LIBXL__LOG_ERRNO in libxl_utils.c
Code cleanup - no functional changes
The convenience macro LOGE has been written to take the place of
LIBXL__LOG_ERRNO. LOGE depends on the existence of a local libxl__gc
*gc. Replace two invocations of LIBXL__LOG_ERRNO, which are in
functions that already have a libxl__gc *gc present, with invocations
of the new macro.
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Kelley Nielsen [Mon, 11 Nov 2013 23:23:53 +0000 (15:23 -0800)]
libxl: use GCSPRINTF instead of libxl__sprintf
Code cleanup - no functional changes
The convenience macro GCSPRINTF has been written to be used in place
of libxl__sprintf(). Replace all calls to libxl__sprintf() in
libxl_utils.c with invocations of the new macro.
Suggested-by: Ian Campbell <Ian.Campbell@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Kelley Nielsen [Mon, 11 Nov 2013 23:23:52 +0000 (15:23 -0800)]
libxl: use GCSPRINTF in place of libxl_sprintf() in libxl_qmp.c
Code cleanup -- no functional changes
The convenience macro GCSPRINTF has been written to be used in place of
libxl__sprintf. Change all calls to libxl__sprintf() in libxl_qmp.c to
invocations of the new macro.
Suggested-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Kelley Nielsen [Mon, 11 Nov 2013 23:23:51 +0000 (15:23 -0800)]
libxl: Use new macro LOGE() in libxl_qmp.c
Code cleanup -- no functional changes
Coding style has recently been changed for libxl. The convenience
macro LOGE() has been introduced, and invocations of the old macro
LIBXL__LOG_ERROR() are to be replaced with it. Change all occurrences
of the old macro (in functions that have a local libxl__gc *gc) except
the one in register_serials_chardev_callback() to the new one. (This
function lacks a local libxl__gc *gc, which LOGE() requires.)
Suggested-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Kelley Nielsen [Fri, 15 Nov 2013 01:41:07 +0000 (17:41 -0800)]
libxl: change most remaining LIBXL_LOG to LOG in libxl_qmp.c
Coding style has recently been changed for libxl. The convenience
macro LOG() has been introduced, and invocations of the old macro
LIBXL__LOG() are to be replaced with it. Change occurrences of the
old macro to the new one in the functions qmp_handle_response()
and qmp_handle_error_response(). The new macros need access to a
local libxl__gc *gc, so add it as a parameter to both these functions,
and pass the instance in qmp_next() down the call chain to
qmp_handle_response() and in turn to qmp_handle_error_response().
Suggested-by: Anthony PERARD <anthony.perard@citrix.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
[ijc -- reverted one unintentional w/s change]
Don Slutz [Tue, 5 Nov 2013 14:11:51 +0000 (09:11 -0500)]
get_maintainer.pl: Adjust to Xen workflow
Based on feedback from reviewers:
* Disable git fallback by default: it has a tendency to mail
anyone who did a single oneline change and should not be
necessary for a project of Xen's size.
* Disable rolestats: Makes cut-and-paste from the output into the
commit message easy.
* Drop "THE REST" fallback: Don't spam Keir *too* much.
Signed-off-by: Don Slutz <dslutz@verizon.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ijc -- expanded the changelog]
libxl: add device backend listener in order to launch backends
Add the necessary logic in libxl to allow it to act as a listener for
launching backends in a driver domain, replacing udev (like we already
do on Dom0). This new functionality is accomplished by watching the
domain backend path (/local/domain/<domid>/backend) and reacting to
device creation/destruction.
The way to launch this listener daemon is from xl, using the newly
introduced "devd" command. The command will daemonize by default,
using "xldevd.log" as its logfile. Optionally the user can force the
execution of the listener in the foreground by passing the "-F"
option to the devd command.
Current backends handled by this daemon include Qdisk, vbd and vif
device types.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
When running libxl from a driver domain there's no xenstore pid file
(because xenstore is not running on the driver domain). Also, at that
point in libxl initialization there's no way to know whether libxl is
running on a domain different than Dom0, so just revert the change in
order to allow libxl to work on driver domains.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Move the daemonizer code from create_domain into its own function
that can be called from places other than create_domain.
This will be used to daemonize the driver domain backend handler.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Current Qemu launch functions in libxl require the usage of data
structures only available on domain creation. All this information is
not needed in order to launch a Qemu instance to serve Qdisk backends,
so introduce a new simplified helper that can be used to launch
Qemu/Qdisk, that will be used to launch Qdisk in driver domains.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Anthony PERARD <anthony.perard@citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
libxl: remove the Qemu bodge for driver domain devices
When Qemu is launched from a driver domain to act as a PV disk
backend we can make sure that Qemu is running before detaching
devices, so there's no need for the bodge there.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
libxl: synchronize device removal when using driver domains
Synchronize the clean up of the backend from the toolstack domain when
the driver domain has actually finished closing the backend for the
device.
This is accomplished by waiting for the driver domain to remove the
directory containing the backend keys, then the toolstack domain will
finish the cleanup by removing the empty folders on the backend path.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
libxl: create a local xenstore libxl and device-model dir for guests
If libxl is executed inside a guest domain it needs write access to
the local libxl xenstore dir (/local/<domid>/libxl) to store internal
data. This also applies to Qemu which needs a
/local/<domid>/device-model xenstore directory.
This patch creates the mentioned directories for each guest launched
from libxl.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Mon, 18 Nov 2013 12:57:20 +0000 (13:57 +0100)]
x86: consider modules when cutting off memory
The code in question runs after module ranges have already been removed from
the E820 table, so when determining the new maximum page/PDX we need to
explicitly take them into account.
Furthermore we need to round up the ending addresses here, in order to
fully cover any partial trailing pages.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
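The rounding described in the commit above can be illustrated with a small sketch. It is not the Xen code: struct demo_module, demo_max_pfn and the 4 KiB page shift are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Hypothetical module descriptor: the byte range [start, start + size). */
struct demo_module {
    uint64_t start;
    uint64_t size;
};

/* Return the first page frame number above both the E820-derived limit
 * and every module, rounding each module end up to a whole page. */
static uint64_t demo_max_pfn(uint64_t e820_max_pfn,
                             const struct demo_module *mods, unsigned int n)
{
    uint64_t max_pfn = e820_max_pfn;
    unsigned int i;

    for (i = 0; i < n; i++) {
        uint64_t end = mods[i].start + mods[i].size;
        /* Round up so a partially used trailing page is still covered. */
        uint64_t end_pfn =
            (end + (1ull << DEMO_PAGE_SHIFT) - 1) >> DEMO_PAGE_SHIFT;

        if (end_pfn > max_pfn)
            max_pfn = end_pfn;
    }

    return max_pfn;
}
```

A module ending one byte into a page extends the limit to include that whole page, whereas rounding down would have cut the last byte off.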
Jan Beulich [Mon, 18 Nov 2013 12:55:55 +0000 (13:55 +0100)]
VT-d: fix TLB flushing in dma_pte_clear_one()
The third parameter of __intel_iommu_iotlb_flush() is to indicate
whether the to be flushed entry was a present one. A few lines before,
we bailed if !dma_pte_present(*pte), so there's no need to check the
flag here again - we can simply always pass TRUE here.
This is XSA-78.
Suggested-by: Cheng Yueqiang <yqcheng.2008@phdis.smu.edu.sg> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Mon, 18 Nov 2013 08:39:01 +0000 (09:39 +0100)]
nested VMX: don't ignore mapping errors
Rather than ignoring failures to map the virtual VMCS as well as MSR or
I/O port bitmaps, convert those into failures of the respective
instructions (avoiding NULL pointer dereferences). Ultimately such
failures should be handled transparently (by using transient mappings
when they actually need to be accessed, just like nested SVM does).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Eddie Dong <eddie.dong@intel.com>
Dario Faggioli [Fri, 15 Nov 2013 16:43:28 +0000 (17:43 +0100)]
fix leaking of v->cpu_affinity_saved on domain destruction
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Nate Studer [Fri, 15 Nov 2013 16:38:10 +0000 (17:38 +0100)]
credit: Update other parameters when setting tslice_ms
Add a utility function to update the rest of the timeslice
accounting fields when updating the timeslice of the
credit scheduler, so that capped CPUs behave correctly.
Before this patch changing the timeslice to a value higher
than the default would result in a domain not utilizing
its full capacity and changing the timeslice to a value
lower than the default would result in a domain exceeding
its capacity.
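A utility function of the kind described above might look like the sketch below. The struct, the field names and the two constants are assumptions chosen for illustration; they are not taken from the scheduler source.

```c
#include <assert.h>

/* Assumed constants, for illustration only. */
#define DEMO_TICKS_PER_TSLICE  3
#define DEMO_CREDITS_PER_MSEC  10

/* Hypothetical subset of the scheduler's private state. */
struct demo_sched {
    unsigned int tslice_ms;
    unsigned int ticks_per_tslice;
    unsigned int tick_period_us;
    unsigned int credits_per_tslice;
};

/* Set the timeslice and recompute every field derived from it, so the
 * accounting never runs with a mix of old and new values. */
static void demo_set_tslice(struct demo_sched *s, unsigned int tslice_ms)
{
    assert(tslice_ms > 0);

    s->tslice_ms = tslice_ms;
    s->ticks_per_tslice = DEMO_TICKS_PER_TSLICE;
    if (s->tslice_ms < s->ticks_per_tslice)
        s->ticks_per_tslice = 1;   /* very short slices get a single tick */
    s->tick_period_us = s->tslice_ms * 1000 / s->ticks_per_tslice;
    s->credits_per_tslice = DEMO_CREDITS_PER_MSEC * s->tslice_ms;
}
```

If only tslice_ms were updated, credits_per_tslice would still reflect the default slice length, which is the kind of mismatch that lets a capped domain under- or over-run its cap.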
Paul Durrant [Fri, 15 Nov 2013 10:02:17 +0000 (11:02 +0100)]
x86/VT-x: Disable MSR intercept for SHADOW_GS_BASE
Intercepting this MSR is pointless - The swapgs instruction does not cause a
vmexit, so the cached result of this is potentially stale after the next guest
instruction. It is correctly saved and restored on vcpu context switch.
Furthermore, 64bit Windows writes to this MSR on every thread context switch,
so interception causes a substantial performance hit.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Jun Nakajima <jun.nakajima@intel.com>
If an image source page is allocated in kimage_alloc_page() but the
machine_kexec_add_page() fails, the image may appear to load
successfully but it will not execute. The relocation will fault
(rebooting the host) when trying to copy the source page, as it is not
mapped.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
If a bad image type is supplied in a KEXECOP_unload hypercall, the
kexec_lock in kexec_swap_images() was left locked, causing a deadlock
on a subsequent image load or unload.
The kexec_lock is only required to serialize the swap operation
itself.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
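The locking fix above amounts to validating the argument before taking the lock and holding the lock only around the swap. A minimal sketch, with a boolean flag standing in for kexec_lock and hypothetical names throughout:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define DEMO_NR_IMAGE_TYPES 2   /* assumed number of image slots */

static bool demo_lock_held;     /* stand-in for kexec_lock */
static int demo_images[DEMO_NR_IMAGE_TYPES];

/* Swap in a new image for the given slot and hand back the old one. */
static int demo_swap_image(int type, int new_image, int *old_image)
{
    /* Validate first: an early return here cannot leave the lock held. */
    if (type < 0 || type >= DEMO_NR_IMAGE_TYPES)
        return -EINVAL;

    assert(!demo_lock_held);
    demo_lock_held = true;      /* lock covers only the swap itself */
    *old_image = demo_images[type];
    demo_images[type] = new_image;
    demo_lock_held = false;

    return 0;
}
```

Because the error path is reached before the lock is taken, a bad image type can no longer block later load or unload requests.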
George Dunlap [Wed, 13 Nov 2013 08:42:51 +0000 (09:42 +0100)]
pvh tools: libxl changes to create a PVH guest
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:42:14 +0000 (09:42 +0100)]
pvh tools: libxc changes to build a PVH guest
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:41:12 +0000 (09:41 +0100)]
pvh: restrict tsc_mode to NEVER_EMULATE for now
The reason given for this restriction in the first place, given in one
of the comments checking for PVH requirements, had to do with
additional infrastructure required to allow PV RDTSC emulation for PVH
guests.
Since we don't use the PV emulation path at all anymore, we may be
able to remove this restriction.
Experiments show that pvh will boot without apparent issues in
"default", "native", and "native_paravirt" mode, but not in
"always_emulate" mode. We'll leave this restriction in until
we can sort out what's going on.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:40:41 +0000 (09:40 +0100)]
pvh: disable 32-bit guest support for now
Removing the assert allows the PVH code to call this during vmcs
construction in a later patch, making the code more robust by removing
duplicate code.
To be implemented.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
George Dunlap [Wed, 13 Nov 2013 08:40:03 +0000 (09:40 +0100)]
pvh: use PV handlers for PIO
Register an IO handler for the entire PIO range, and have it call the
PV PIO handlers.
NB at this point this won't do the full "copy and execute on the stack
with full GPRs" work-around; this may need to be sorted out for dom0 to allow
these instructions to happen in guest context.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:37:01 +0000 (09:37 +0100)]
pvh: use PV e820
Allow PV e820 map to be set and read from a PVH domain. This requires
moving the pv e820 struct out from the pv-specific domain struct and
into the arch domain struct.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:35:20 +0000 (09:35 +0100)]
pvh: vmx-specific changes
Changes:
* Enforce HAP mode for now
* Disable exits related to virtual interrupts or emulated APICs
* Disable changing paging mode
- "unrestricted guest" (i.e., real mode for EPT) disabled
- write guest EFER disabled
* Start in 64-bit mode
* Paging mode update to happen in arch_set_info_guest
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:30:09 +0000 (09:30 +0100)]
pvh prep: introduce pv guest type and has_hvm_container macros
The goal of this patch is to classify conditionals more clearly, as to
whether they relate to pv guests, hvm-only guests, or guests with an
"hvm container" (which will eventually include PVH).
This patch introduces an enum for guest type, as well as two new macros
for switching behavior on and off: is_pv_* and has_hvm_container_*. At the
moment is_pv_* <=> !has_hvm_container_*. The purpose of having two is that
it seems to me different to take a path because something does *not* have PV
structures than to take a path because it *does* have HVM structures, even if the
two happen to coincide 100% at the moment. The exact usage is occasionally a bit
fuzzy though, and a judgement call just needs to be made on which is clearer.
In general, a switch should use is_pv_* (or !is_pv_*) if the code in question
relates directly to a PV guest. Examples include use of pv_vcpu structs or
other behavior directly related to PV domains.
hvm_container is more of a fuzzy concept, but in general:
* Most core HVM behavior will be included in this. Behavior not
appropriate for PVH mode will be disabled in later patches
* Hypercalls related to HVM guests will *not* be included by default;
functionality needed by PVH guests will be enabled in future patches
* The following functionality is not considered part of the HVM
container, and PVH will end up behaving like PV by default: Event
channel, vtsc offset, code related to emulated timers, nested HVM,
emuirq, PoD
* Some features are left to implement for PVH later: vpmu, shadow mode
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
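The two predicate families introduced by the commit above can be sketched like this. The enum, struct and macro names are prefixed with demo_ to mark them as illustrative; the real definitions may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical guest type enum; PVH would later be a third value. */
enum demo_guest_type { demo_guest_pv, demo_guest_hvm };

/* Hypothetical, heavily reduced domain structure. */
struct demo_domain {
    enum demo_guest_type guest_type;
};

/* "Has PV structures" and "has HVM structures" are expressed as separate
 * predicates even though they are exact opposites for now. */
#define demo_is_pv_domain(d)             ((d)->guest_type == demo_guest_pv)
#define demo_has_hvm_container_domain(d) ((d)->guest_type != demo_guest_pv)

/* Example switch: PV-only state may be touched only for PV guests. */
static bool demo_may_touch_pv_state(const struct demo_domain *d)
{
    assert(d != NULL);
    return demo_is_pv_domain(d);
}
```

Keeping both predicates lets each call site state which property it actually depends on, so adding a third guest type later only requires changing the macro definitions.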
George Dunlap [Wed, 13 Nov 2013 08:29:02 +0000 (09:29 +0100)]
pvh: tolerate HVM guests having no ioreq page
PVH guests don't have a backing device model emulator (qemu); just
tolerate this situation explicitly, rather than special-casing PVH.
For unhandled IO, hvmemul_do_io() will now return X86EMUL_OKAY, which
is I believe what would be the effect if qemu didn't have a handler
for the IO.
This also fixes a potential DoS in the host from the reworked series:
If the guest makes a hypercall which sends an invalidate request, it
would have crashed the host.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Mukesh Rathor [Wed, 13 Nov 2013 08:26:38 +0000 (09:26 +0100)]
pvh prep: code motion
There are many functions where PVH requires some code in common with
HVM. Rearrange some of these functions so that the code is together.
In general, the HVM code that PVH also uses includes:
- cacheattr functionality
- paging
- hvm_funcs
- hvm_assert_evtchn_irq tasklet
- tm_list
- hvm_params
And code that PVH shares with PV but not with HVM:
- updating the domain wallclock
- setting v->is_initialized
There should be no end-to-end changes in behavior.
Signed-off-by: Mukesh Rathor <mukesh.rathor@oracle.com> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Roger Pau Monné [Wed, 13 Nov 2013 08:26:13 +0000 (09:26 +0100)]
libxc: move temporary grant table mapping to end of memory
In order to set up the grant table for HVM guests, libxc needs to map
the grant table temporarily. At the moment, it does this by adding the
grant page to the HVM guest's p2m table in the MMIO hole (at gfn 0xFFFFE),
then mapping that gfn, setting up the table, then unmapping the gfn and
removing it from the p2m table.
This breaks with PVH guests with 4G or more of ram, because there is
no MMIO hole; so it ends up clobbering a valid RAM p2m entry, then
leaving a "hole" when it removes the grant map from the p2m table.
Since the guest thinks this is normal ram, when it maps it and tries
to access the page, it crashes.
This patch maps the page at max_gfn+1 instead.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
George Dunlap [Wed, 13 Nov 2013 08:25:36 +0000 (09:25 +0100)]
VMX: allow vmx_update_debug_state to be called when v!=current
Removing the assert allows the PVH code to call this during vmcs
construction in a later patch, making the code more robust by removing
duplicate code.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org> Acked-by: Eddie Dong <eddie.dong@intel.com>
Ian Jackson [Thu, 18 Apr 2013 15:27:46 +0000 (16:27 +0100)]
libxl: Avoid realloc(,0) when libxl__xs_directory returns empty list
If the named path is a leaf node, libxl__xs_directory can succeed,
returning non-null, but set *nb to 0.
In three places in libxl this may result in a zero size argument being
passed to malloc() or realloc(), which is not advisable.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
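One way to avoid the zero-size allocation described above is to treat an empty list as a separate case. The helper below is a hypothetical sketch, not the libxl change itself; it relies only on the C standard leaving the result of a zero-size malloc() or realloc() implementation-defined.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: allocate room for nb entries.
 * Returns 0 on success and -1 on allocation failure. For nb == 0 it
 * succeeds with *out == NULL and never calls the allocator. */
static int demo_alloc_entries(unsigned int nb, int **out)
{
    assert(out != NULL);

    *out = NULL;
    if (nb == 0)
        return 0;               /* empty list: nothing to allocate */

    *out = calloc(nb, sizeof(**out));
    return *out ? 0 : -1;
}
```

Callers then distinguish "empty" from "out of memory" by the return value instead of by inspecting the pointer.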
Ian Jackson [Mon, 14 Oct 2013 16:26:01 +0000 (17:26 +0100)]
libxl: Deprecate synchronous waiting for the device model
libxl__wait_for_device_model blocks, with the ctx lock held, waiting
for a response from the device model. If the dm doesn't respond
quickly (for example, because it has crashed), this may block the
whole process. Explain this in a comment, rename the function to
libxl__wait_for_device_model_deprecated, and explain what to use
instead.
libxl__wait_for_offspring is the core implementation for the above.
Its name leads people to think it might be generally useful for
waiting for children, which is far from the case. It only waits for
xenstore. Also it has the problems described above. Explain this,
rename it to libxl__xenstore_child_wait_deprecated, and explain what
to use instead.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 3 Sep 2013 12:41:46 +0000 (13:41 +0100)]
libxl: Do not generate short block in libxl__datacopier_prefixdata
libxl__datacopier_prefixdata would prepend a deliberately short block
(not just a half-full one, but one with a short buffer) to the
dc->bufs queue. However, this is wrong because datacopier_readable
will find it and try to continue to fill it up.
Instead, allocate a full-sized buffer.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Tested-by: Chunyan Liu <cyliu@suse.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
This allows a long-running ao to avoid accumulating memory. Each
nested ao has its own gc.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
specifically used signed integers, identical to the code copied out of vsprintf.
When committed, these had changed to unsigned integers, which causes a
functional change. This causes glacial boot performance and an excessive
quantity of spaces printed to the serial console, as we loop to the upper
bound of a 32bit integer.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
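The effect of the signed to unsigned change described above can be seen in a padding loop of the usual vsprintf shape. The function below is a hypothetical illustration, not the code from the patch.

```c
#include <assert.h>

/* Count the padding characters a vsprintf-style loop would emit for a
 * field of the given width once len characters are already accounted for. */
static unsigned long demo_pad_count(int width, int len)
{
    unsigned long spaces = 0;

    assert(len >= 0);
    width -= len;

    /* With a signed width the loop is skipped when len exceeds width.
     * If width were unsigned, the subtraction would wrap around and the
     * loop would run for roughly 2^32 iterations. */
    while (width-- > 0)
        spaces++;

    return spaces;
}
```

The signed version emits no padding when the content is wider than the field, which is the behaviour the original vsprintf code relied on.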
Jan Beulich [Tue, 12 Nov 2013 15:28:47 +0000 (16:28 +0100)]
x86: eliminate has_arch_mmios()
... as being generally insufficient: Either has_arch_pdevs() or
cache_flush_permitted() should be used (in particular, it is
insufficient to consider MMIO ranges alone - I/O port ranges have the
same requirements if available to a guest).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 12:19:25 +0000 (13:19 +0100)]
evtchn/fifo: don't spin indefinitely when setting LINK
A malicious or buggy guest can cause another domain to spin
indefinitely by repeatedly writing to an event word when the other
guest is trying to link a new event. The cmpxchg() in
evtchn_fifo_set_link() will repeatedly fail and the loop may never
terminate.
Fixing this requires a change to the ABI which is documented in draft
H of the design.
Since a well-behaved guest only makes a limited set of state changes,
the loop can terminate early if the guest makes an invalid state
transition.
The guest may:
- clear LINKED and LINK.
- clear PENDING
- set MASKED
- clear MASKED
It is valid for the guest to mask and unmask an event at any time so
specify that it is not valid for a guest to clear MASKED if Xen is
trying to update LINK. Indicate this to the guest with an additional
BUSY bit in the event word. The guest must not clear MASKED if BUSY
is set and it should spin until BUSY is cleared.
The remaining valid writes (clear LINKED, clear PENDING, set MASKED,
clear MASKED by Xen) will limit the number of failures of the
cmpxchg() to at most 4. A clear of LINKED will also terminate the
loop early. Therefore, the loop can then be limited to at most 4
iterations.
If the buggy or malicious guest does cause the loop to exit with
LINKED set and LINK unset then that buggy guest will lose events.
Reported-by: Anthony Liguori <aliguori@amazon.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
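The bounded retry loop described above can be sketched with C11 atomics. This is a simplified illustration under stated assumptions: the bit positions, the 17-bit LINK field and all demo_ names are chosen for the example, and the return convention (1 linked, 0 LINKED was clear, -1 gave up) is this sketch's own.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed event word layout, for illustration only. */
#define DEMO_LINKED     (1u << 29)
#define DEMO_BUSY       (1u << 28)
#define DEMO_LINK_MASK  ((1u << 17) - 1)

/* One attempt. Returns 1 if the link was written, 0 if LINKED is clear
 * (nothing to do), -1 if the word changed underneath us. On failure *w
 * is refreshed with the value currently in *word. */
static int demo_try_set_link(_Atomic uint32_t *word, uint32_t *w,
                             uint32_t link)
{
    uint32_t new_val;

    if (!(*w & DEMO_LINKED))
        return 0;

    new_val = (*w & ~(DEMO_BUSY | DEMO_LINK_MASK)) | link;
    if (atomic_compare_exchange_strong(word, w, new_val))
        return 1;

    return -1;
}

static int demo_set_link(_Atomic uint32_t *word, uint32_t link)
{
    uint32_t w = atomic_load(word);
    unsigned int attempt;
    int ret;

    assert((link & ~DEMO_LINK_MASK) == 0);

    /* Fast path: no concurrent writer. */
    ret = demo_try_set_link(word, &w, link);
    if (ret >= 0)
        return ret;

    /* Contended: set BUSY so a well-behaved guest stops clearing MASKED. */
    w = atomic_fetch_or(word, DEMO_BUSY) | DEMO_BUSY;

    /* The remaining valid guest writes bound the failures to four. */
    for (attempt = 0; attempt < 4; attempt++) {
        ret = demo_try_set_link(word, &w, link);
        if (ret >= 0) {
            if (ret == 0)
                atomic_fetch_and(word, ~DEMO_BUSY);
            return ret;
        }
    }

    /* Only a misbehaving guest gets here; it may lose events. */
    atomic_fetch_and(word, ~DEMO_BUSY);
    return -1;
}
```

A successful compare-and-exchange clears BUSY as part of the same write, so the flag needs explicit clearing only on the two paths where the link was not written.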
Jan Beulich [Tue, 12 Nov 2013 10:52:19 +0000 (11:52 +0100)]
VMX: don't crash processing 'd' debug key
There's a window during scheduling where "current" and the active VMCS
may disagree: The former gets set much earlier than the latter. Since
both vmx_vmcs_enter() and vmx_vmcs_exit() immediately return when the
subject vCPU is "current", accessing VMCS fields would, depending on
whether there is any currently active VMCS, either read wrong data, or
cause a crash.
Going forward we might want to consider reducing the window during
which vmx_vmcs_enter() might fail (e.g. doing a plain __vmptrld() when
v->arch.hvm_vmx.vmcs != this_cpu(current_vmcs) but arch_vmx->active_cpu
== -1), but that would add complexities (acquiring and - more
importantly - properly dropping v->arch.hvm_vmx.vmcs_lock) that don't
look worthwhile adding right now.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Keir Fraser <keir@xen.org>
Jan Beulich [Tue, 12 Nov 2013 10:51:15 +0000 (11:51 +0100)]
nested SVM: adjust guest handling of structure mappings
For one, nestedsvm_vmcb_map() error checking must not consist of using
assertions: Global (permanent) mappings can fail, and hence failure
needs to be dealt with properly. And non-global (transient) mappings
can't fail anyway.
And then the I/O port access bitmap handling was broken: It checked
only the first of the accessed ports rather than each of them.
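To illustrate the second point: a multi-byte access touches several consecutive ports, and each has its own bit in the permission bitmap. The following is a sketch only; ioport_range_intercepted() and its parameters are hypothetical and do not reflect the actual nested SVM code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if any port in [port, port + size) has its bit set in
 * 'bitmap' (one bit per port, set bit = intercept).
 */
static bool ioport_range_intercepted(const uint8_t *bitmap,
                                     size_t bitmap_bits,
                                     uint32_t port, unsigned int size)
{
    for (unsigned int i = 0; i < size; i++) {
        uint32_t p = port + i;

        /* Assumption: ports beyond the bitmap count as intercepted. */
        if (p >= bitmap_bits)
            return true;

        if (bitmap[p / 8] & (1u << (p % 8)))
            return true;
    }

    return false;
}
```

Checking only the bit for the first port would let a 2-byte access at 0x60 through even when port 0x61 is marked for interception.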
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Christoph Egger <chegger@amazon.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
David Vrabel [Tue, 12 Nov 2013 10:47:26 +0000 (11:47 +0100)]
x86: check kexec relocation code fits in a page
The kexec relocation (control) code must fit in a single page so add a
link time check for this.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:47:07 +0000 (11:47 +0100)]
libxc: add API for kexec hypercall
Add xc_kexec_exec(), xc_kexec_get_ranges(), xc_kexec_load(), and
xc_kexec_unload(). The load and unload calls require the v2 load and
unload ops.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:46:39 +0000 (11:46 +0100)]
libxc: add hypercall buffer arrays
Hypercall buffer arrays are used when a hypercall takes a variable
length array of buffers.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:46:06 +0000 (11:46 +0100)]
kexec crash image when dom0 crashes
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:44:41 +0000 (11:44 +0100)]
kexec: extend hypercall with improved load/unload ops
In the existing kexec hypercall, the load and unload ops depend on
internals of the Linux kernel (the page list and code page provided by
the kernel). The code page is used to transition between Xen context
and the image so using kernel code doesn't make sense and will not
work for PVH guests.
Add replacement KEXEC_CMD_kexec_load and KEXEC_CMD_kexec_unload ops
that no longer require a code page to be provided by the guest -- Xen
now provides the code for calling the image directly.
The new load op looks similar to the Linux kexec_load system call and
allows the guest to provide the image data to be loaded. The guest
specifies the architecture of the image which may be a 32-bit subarch
of the hypervisor's architecture (e.g., an EM_386 image on an
EM_X86_64 hypervisor).
The toolstack can now load images without kernel involvement. This is
required for supporting kexec when using a dom0 with an upstream
kernel.
Crash images are copied directly into the crash region on load.
Default images are copied into domheap pages and a list of source and
destination machine addresses is created. This list is used in
kexec_reloc() to relocate the image to its destination.
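A user-space sketch of how such a source/destination list might be walked. The real kexec_reloc() runs on machine addresses outside normal Xen context; here "memory" is a plain buffer, addresses are offsets into it, and the flag encoding (borrowed from the style of Linux's kimage entries) is an assumption:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE       4096u

/* Assumed entry flags, kept in the low bits of a page-aligned address. */
#define IND_DESTINATION 0x1u  /* following sources are copied here */
#define IND_DONE        0x4u  /* end of list */
#define IND_SOURCE      0x8u  /* one page to copy to the current dest */

/*
 * Walk the list and copy each source page to the current destination,
 * advancing the destination by one page per copy.
 * Returns the number of pages copied.
 */
static unsigned int relocate_image(uint8_t *mem, const uint64_t *list)
{
    uint64_t dest = 0;
    unsigned int copied = 0;

    for (; !(*list & IND_DONE); list++) {
        uint64_t addr = *list & ~(uint64_t)(PAGE_SIZE - 1);

        if (*list & IND_DESTINATION) {
            dest = addr;
        } else if (*list & IND_SOURCE) {
            memmove(mem + dest, mem + addr, PAGE_SIZE);
            dest += PAGE_SIZE;
            copied++;
        }
    }

    return copied;
}
```

Indirection pages, which let the list itself span more than one page, are left out of this sketch.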
The old load and unload sub-ops are still available (as
KEXEC_CMD_load_v1 and KEXEC_CMD_unload_v1) and are implemented on top
of the new infrastructure.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:41:02 +0000 (11:41 +0100)]
kexec: add infrastructure for handling kexec images
Add the code needed to handle and load kexec images into Xen memory or
into the crash region. This is needed for the new KEXEC_CMD_load and
KEXEC_CMD_unload hypercall sub-ops.
Much of this code is derived from the Linux kernel.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:39:29 +0000 (11:39 +0100)]
kexec: add public interface for improved load/unload sub-ops
Add replacement KEXEC_CMD_load and KEXEC_CMD_unload sub-ops to the
kexec hypercall. These new sub-ops allow a privileged guest to
provide the image data to be loaded into Xen memory or the crash
region instead of guests loading the image data themselves and
providing the relocation code and metadata.
The old interface is provided to guests requesting an interface
version prior to 4.4.
Bump __XEN_LATEST_INTERFACE_VERSION__ to 0x00040400.
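One way a header can keep old callers working is to select the sub-op behind an unversioned name at compile time. This is a sketch only: every SKETCH_* name and op number below is a placeholder and not the contents of the public kexec header:

```c
#include <stdint.h>

/* Interface version the guest was built against (assumed: pre-4.4). */
#ifndef SKETCH_INTERFACE_VERSION
#define SKETCH_INTERFACE_VERSION 0x00040300
#endif

/* Placeholder op numbers for the old and new sub-ops. */
#define SKETCH_CMD_LOAD_V1   1
#define SKETCH_CMD_UNLOAD_V1 2
#define SKETCH_CMD_LOAD_V2   4
#define SKETCH_CMD_UNLOAD_V2 5

/* Unversioned names resolve to the old ops for older guests. */
#if SKETCH_INTERFACE_VERSION < 0x00040400
#define SKETCH_CMD_LOAD   SKETCH_CMD_LOAD_V1
#define SKETCH_CMD_UNLOAD SKETCH_CMD_UNLOAD_V1
#else
#define SKETCH_CMD_LOAD   SKETCH_CMD_LOAD_V2
#define SKETCH_CMD_UNLOAD SKETCH_CMD_UNLOAD_V2
#endif

/* Helpers so the selection can be inspected at run time. */
static inline int sketch_load_op(void)   { return SKETCH_CMD_LOAD; }
static inline int sketch_unload_op(void) { return SKETCH_CMD_UNLOAD; }
```

Building with -DSKETCH_INTERFACE_VERSION=0x00040400 would switch both helpers to the new op numbers without any source change in the caller.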
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Keir Fraser <keir@xen.org>
David Vrabel [Tue, 12 Nov 2013 10:37:19 +0000 (11:37 +0100)]
x86: give FIX_EFI_MPF its own fixmap entry
FIX_EFI_MPF was the same as FIX_KEXEC_BASE_0 which is going away. So
add its own entry.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Tested-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Tested-by: Don Slutz <dslutz@verizon.com> Acked-by: Keir Fraser <keir@xen.org>
Andrew Cooper [Tue, 12 Nov 2013 10:11:30 +0000 (11:11 +0100)]
common/symbols: Remove print_symbol() and associated infrastructure
Also adjust the one common user of print_symbol() to use the new printk()
format. While adjusting the format string, increase the width so a
long-to-expire plt_overflow() timer doesn't break the column alignment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Dario Faggioli [Tue, 12 Nov 2013 09:54:28 +0000 (10:54 +0100)]
numa-sched: leave node-affinity alone if not in "auto" mode
If the domain's NUMA node-affinity is being specified by the
user/toolstack (instead of being automatically computed by Xen),
we really should stick to that. This means domain_update_node_affinity()
is wrong when it filters out some stuff from there even in "!auto"
mode.
This commit fixes that. Of course, this does not mean node-affinity
is always honoured (e.g., a vcpu won't run on a pcpu of a different
cpupool) but the necessary logic for taking into account all the
possible situations lives in the scheduler code, where it belongs.
What could happen without this change is that, under certain
circumstances, the node-affinity of a domain may change when the
user modifies the vcpu-affinity of the domain's vcpus. This, even
if probably not a real bug, is at least something the user does
not expect, so let's avoid it.
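The intended behaviour can be sketched as follows. The struct, field and function names are hypothetical, and node sets are modelled as plain bitmasks rather than Xen's nodemask type:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a domain's NUMA state. */
struct sketch_domain {
    bool     auto_node_affinity; /* true: Xen computes node_affinity */
    uint64_t node_affinity;      /* one bit per NUMA node */
};

/*
 * 'vcpu_nodes' is the set of nodes spanned by the vcpu-affinity of the
 * domain's vcpus.  It is applied only in "auto" mode; a node-affinity
 * set by the user/toolstack is left untouched.
 */
static void sketch_update_node_affinity(struct sketch_domain *d,
                                        uint64_t vcpu_nodes)
{
    if (d->auto_node_affinity)
        d->node_affinity = vcpu_nodes;
    /* else: keep the explicitly requested node-affinity as is */
}
```

With auto mode off, changing vcpu-affinity therefore no longer alters the node-affinity behind the user's back.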