Yang Hongyang [Fri, 18 Jul 2014 07:08:36 +0000 (15:08 +0800)]
libxl/remus: setup and control network output buffering
This patch adds the machinery required for protecting a guest's
network device state. This patch comprises two parts:
1. Hotplug scripts: The remus-netbuf-setup script is responsible for
setting up and tearing down the necessary infrastructure required for
network output buffering. This script should be invoked by libxl for
each of the guest's network interfaces, when starting or stopping Remus.
Apart from returning success/failure indication via the usual hotplug
entries in xenstore, this script also writes to xenstore the name of
the REMUS_IFB device to be used to control the vif's network output.
The script relies on libnl3 command line utilities to perform various
setup/teardown functions. The script is confined to Linux platforms only
since NetBSD does not seem to have libnl3.
2. Remus network device: Implements the interfaces required by the
remus abstract device layer. A note about the implementation:
a) init_subkind_nic() & cleanup_subkind_nic() are called once per Remus
invocation. They establish and free netlink related state respectively.
b) setup() and teardown() are called for each vif attached to the
guest.
During setup():
i) The hotplug script is called to set up a network buffer on a
given vif. The script chooses an available IFB device from
the system, redirects vif egress traffic to the IFB device
and sets up the plug qdisc (output buffer) on the IFB device.
The name of the IFB device is communicated via xenstore to
libxl.
ii) Libxl obtains a handle to the plug qdisc using the libnl3 API
and subsequently controls output buffering using this handle
in the checkpoint callbacks.
During teardown(), the hotplug scripts are called again to remove
the vif->ifb traffic redirection, release the ifb and the plug
qdisc associated with it.
c) The checkpoint callbacks [postsuspend(), preresume() and commit()]
are implemented as synchronous ops as the netlink calls associated
with the qdisc subsystem are very fast.
Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Yang Hongyang [Fri, 18 Jul 2014 07:02:34 +0000 (15:02 +0800)]
libxl/remus: introduce an abstract Remus device layer
Introduce an abstract device layer that allows the Remus
logic in libxl to control a guest's devices in a device-agnostic
manner. The device layer also exposes a set of internal interfaces
that a device type must implement, if it wishes to support Remus.
The following APIs are exposed to libxl:
One-time configuration operations:
*libxl__remus_devices_setup
> Enable output buffering for NICs, setup disk replication, etc.
*libxl__remus_devices_teardown
> Disable network output buffering and disk replication;
teardown any associated external setups like qdiscs for NICs.
Operations executed every checkpoint (in order of invocation):
*libxl__remus_devices_postsuspend
*libxl__remus_devices_preresume
*libxl__remus_devices_commit
Each device type needs to implement the interfaces specified in
the libxl__remus_device_instance_ops if it wishes to support Remus.
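The ops-table idea above can be illustrated with a minimal, synchronous C sketch. It assumes hypothetical names (`struct remus_dev_ops`, `remus_checkpoint()`, a demo NIC kind) that are not libxl's real types; the real layer drives devices asynchronously through libxl's ao/multidev machinery. It only shows how a device-agnostic caller can run the three per-checkpoint ops, in order, over every device without knowing the device kind.

```c
#include <stddef.h>

/* Hypothetical per-device-kind operations table; each op returns 0 on success. */
struct remus_dev;

struct remus_dev_ops {
    int (*setup)(struct remus_dev *dev);
    int (*teardown)(struct remus_dev *dev);
    int (*postsuspend)(struct remus_dev *dev);
    int (*preresume)(struct remus_dev *dev);
    int (*commit)(struct remus_dev *dev);
};

struct remus_dev {
    const struct remus_dev_ops *ops;
    int calls;                      /* how many ops were invoked on this device */
};

enum remus_step { STEP_POSTSUSPEND, STEP_PRERESUME, STEP_COMMIT };

/* Run one step on every device, stopping at the first error. */
static int run_step(struct remus_dev *devs, size_t n, enum remus_step step)
{
    for (size_t i = 0; i < n; i++) {
        const struct remus_dev_ops *ops = devs[i].ops;
        int (*fn)(struct remus_dev *) =
            step == STEP_POSTSUSPEND ? ops->postsuspend :
            step == STEP_PRERESUME   ? ops->preresume   : ops->commit;
        int rc = fn ? fn(&devs[i]) : 0;   /* a kind may leave an op unset */
        if (rc)
            return rc;
    }
    return 0;
}

/* One checkpoint: postsuspend, preresume, commit, in that order. */
static int remus_checkpoint(struct remus_dev *devs, size_t n)
{
    int rc = run_step(devs, n, STEP_POSTSUSPEND);
    if (!rc)
        rc = run_step(devs, n, STEP_PRERESUME);
    if (!rc)
        rc = run_step(devs, n, STEP_COMMIT);
    return rc;
}

/* Trivial demo device kind that just records each invocation. */
static int count_op(struct remus_dev *dev) { dev->calls++; return 0; }

static const struct remus_dev_ops demo_nic_ops = {
    .setup = count_op, .teardown = count_op,
    .postsuspend = count_op, .preresume = count_op, .commit = count_op,
};
```

A new device kind would only need to supply its own ops table; the checkpoint loop stays unchanged.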
The high-level control flow through the Remus device layer is shown below:
callback processing
* Only call the per-device libxl__multidev_one_callback
when the iteration has succeeded or failed.
* The final callback (called by multidev) is a trivial
shim to shuffle the pointers and notify our own caller.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Yang Hongyang [Fri, 27 Jun 2014 01:43:51 +0000 (09:43 +0800)]
autoconf: add libnl3 dependency for Remus network buffering support
Libnl3 is required for controlling Remus network buffering.
This patch adds dependency on libnl3 (>= 3.2.8) to autoconf scripts.
It also provides the ability to configure tools without libnl3 support
i.e., without network buffering support.
When there is no network buffering support, libxl__netbuffer_enabled()
returns 0; otherwise it returns 1. The callers of this API will be
introduced in the rest of the series.
NOTE: This patch changes tools/configure.ac, please rerun
autogen.sh while applying the patch.
Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Reviewed-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Yang Hongyang [Fri, 18 Jul 2014 08:40:54 +0000 (16:40 +0800)]
libxl: Extend libxl__ao_device with a libxl__ev_child member
This can be used to fork children to allow the asynchronous execution
of system calls which only come in a synchronous variant. This will
be useful for Remus, in the following patches.
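The general technique can be shown with a plain-POSIX sketch: run a call that only has a synchronous form in a forked child, and collect its result later. `start_in_child()`, `finish_child()` and `slow_sync_op()` are hypothetical helpers. libxl's own libxl__ev_child ties child reaping into its event loop, whereas this sketch simply calls waitpid(), which blocks.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that performs the blocking work; its result becomes the
 * child's exit status. Returns the child's pid to the parent (-1 on error). */
static pid_t start_in_child(int (*sync_op)(void))
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(sync_op() & 0xff);   /* child: do the work, then exit */
    return pid;                    /* parent: free to do other work */
}

/* Later (e.g. once SIGCHLD is noticed), reap the child and fetch the result. */
static int finish_child(pid_t pid)
{
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Hypothetical stand-in for a system call with no asynchronous variant. */
static int slow_sync_op(void) { return 7; }
```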
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
libxl__multidev_prepare_with_aodev is similar to libxl__multidev_prepare,
but takes a libxl__ao_device as an extra argument.
libxl__multidev_prepare is now a wrapper around
libxl__multidev_prepare_with_aodev.
This new internal API will be used by the Remus device abstract layer
for handling various Remus devices.
Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Now a caller who wants to be able to do other work when the aodev
completes can put their own callback into the aodev, and make the
multidev machinery aware that the particular aodev is complete (from
the point of view that multidev should have) whenever it likes.
No functional change in this patch.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Ian Jackson [Thu, 25 Sep 2014 14:59:06 +0000 (15:59 +0100)]
libxl: multidev: Clarify comments about which callbacks are meant
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Roy Franz [Fri, 26 Sep 2014 10:00:55 +0000 (12:00 +0200)]
EFI: add arch specific function to control use of config file
The x86 EFI build of Xen always uses a configuration file to load modules, but
the ARM version can either use a config file to specify the modules, or be
loaded by GRUB in which case GRUB loads the modules and adds them to the DTB
that is passed to Xen. Add efi_arch_use_config_file() to indicate whether a
configuration file is required. For x86, this will always be true. ARM will
examine the DTB passed via the EFI configuration table (if any), and if it contains
module information, will use that and not use the configuration file at all.
Add Emacs footer to efi-boot.h and boot.c
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Fri, 26 Sep 2014 10:00:27 +0000 (12:00 +0200)]
EFI: add several misc. arch functions for boot code
Add efi_arch_blexit() for arch specific cleanup on error exit,
efi_arch_load_addr_check() to do the arch specific verifications
of where the UEFI firmware loaded Xen, and efi_arch_cpu() for
probing CPU features.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Fri, 26 Sep 2014 09:59:56 +0000 (11:59 +0200)]
EFI: add arch specific module handling to read_file()
Each architecture tracks modules differently internally, so add
efi_arch_handle_module() routine to enable the common code to invoke the proper
handling of modules as they are loaded. Module handling for ucode, ramdisk, and
xsm is changed to not process the remainder of the string after the filename as options,
since these modules don't take options.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Thu, 25 Sep 2014 12:30:16 +0000 (14:30 +0200)]
EFI: arch specific memory setup
This patch adds efi_arch_memory() to give each architecture a hook
for memory setup. x86 uses this for trampoline memory setup
and some pagetable setup.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Thu, 25 Sep 2014 12:28:27 +0000 (14:28 +0200)]
EFI: add efi_arch_cfg_file_early/late() to handle arch specific cfg file fields
Different architectures have some different configuration file
fields that need to be handled. In particular, x86 has ucode
and ARM has device tree files to be loaded. These arch specific
functions are used to allow each architecture to implement these
features in arch specific code. Early/late versions are provided,
as ARM needs to process the DTB entry first, and x86 wants to process
the ucode entry last as it is the smallest allocation.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Thu, 25 Sep 2014 12:27:55 +0000 (14:27 +0200)]
EFI: add architecture functions for pre/post ExitBootServices
The UEFI ExitBootServices function is invoked to transition the
system to the 'runtime' mode of operation, and is done right before
transitioning from the EFI loader code into Xen proper. x86 does some
arch specific memory management (trampoline) before exit boot services,
and the code that transitions from the EFI application state to Xen
is architecture specific. This patch adds two functions, one pre
and one post ExitBootServices, to allow each architecture
to handle these cases in a customized manner.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Thu, 25 Sep 2014 12:26:34 +0000 (14:26 +0200)]
EFI: create arch functions to allocate memory for and process memory map
The memory used to store the EFI memory map is allocated in an architecture
specific way, and the processing of the memory map itself uses x86 specific
data structures. This patch adds architecture specific functions so each
architecture can provide its own implementation.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Roy Franz [Thu, 25 Sep 2014 12:22:12 +0000 (14:22 +0200)]
EFI: move x86 boot/runtime code to common/efi
This moves the EFI boot and runtime services code to the common/efi directory.
This code is symbolically linked back into the arch/x86/efi directory where it is
built if a build-time check for PE/COFF support in the toolchain passes. In
the PE/COFF supporting case, both the EFI executable and the normal Xen image
(with stubbed EFI functions) are built. We can't use the normal common build
infrastructure since we are building two versions at the same time, with
different EFI related code in each. No code changes, just file movement and
make updates. The files are symbolically linked at build time back to the
original arch/x86/efi directory. This is in preparation for adding ARM EFI
support, where many of these files can be shared.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 25 Sep 2014 12:10:01 +0000 (14:10 +0200)]
x86/vlapic: don't silently accept bad vectors
Vectors 0-15 are reserved, and a physical LAPIC - upon sending or
receiving one - would generate an APIC error instead of doing the
requested action. Make our emulation behave similarly.
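A minimal sketch of the rule: vectors below 16 are rejected and recorded as an APIC error rather than delivered. The struct and function names are hypothetical; the ESR bit positions (5 = Send Illegal Vector, 6 = Receive Illegal Vector) follow the Intel SDM's description of the local APIC error status register. The actual Xen change may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESR_SEND_ILLEGAL_VECTOR (1u << 5)
#define ESR_RECV_ILLEGAL_VECTOR (1u << 6)

/* Hypothetical, stripped-down virtual LAPIC state. */
struct fake_vlapic {
    uint32_t esr;        /* accumulated error status bits */
    unsigned delivered;  /* interrupts actually accepted */
};

/* Vectors 0-15 are reserved. */
static bool vector_is_legal(uint8_t vec) { return vec >= 16; }

/* Receive side: flag an illegal vector instead of silently accepting it. */
static bool vlapic_accept(struct fake_vlapic *dst, uint8_t vec)
{
    if (!vector_is_legal(vec)) {
        dst->esr |= ESR_RECV_ILLEGAL_VECTOR;
        return false;
    }
    dst->delivered++;
    return true;
}

/* Send side: the sender records the error and nothing is delivered. */
static bool vlapic_send(struct fake_vlapic *src, struct fake_vlapic *dst,
                        uint8_t vec)
{
    if (!vector_is_legal(vec)) {
        src->esr |= ESR_SEND_ILLEGAL_VECTOR;
        return false;
    }
    return vlapic_accept(dst, vec);
}
```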
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Thu, 25 Sep 2014 12:08:20 +0000 (14:08 +0200)]
x86/HVM: fix ID handling of x2APIC emulation
- properly change ID when switching into x2APIC mode (instead of
mimicking necessary behavior in hvm_x2apic_msr_read())
- correctly (meaningfully) set LDR (so far it ended up being 1 on all
vCPU-s)
- even if we don't support more than 128 vCPU-s in a HVM guest for now,
we should properly handle IDs as 32-bit values (i.e. not ignore the
top 24 bits)
- with that, properly do cluster ID and bit mask check in
vlapic_match_logical_addr()
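For reference, the Intel SDM defines the x2APIC logical ID as derived from the 32-bit x2APIC ID: the cluster ID is ID[19:4], placed in LDR[31:16], and LDR[15:0] holds a one-hot bit selected by ID[3:0]. A logical destination then matches when the cluster IDs are equal and the bit masks overlap. A minimal sketch with hypothetical function names, not Xen's code:

```c
#include <stdbool.h>
#include <stdint.h>

/* LDR = (ID[19:4] << 16) | (1 << ID[3:0]); unsigned shifts drop bits above 31. */
static uint32_t x2apic_ldr_from_id(uint32_t id)
{
    return ((id >> 4) << 16) | (1u << (id & 0xf));
}

/* Cluster-mode logical match: same cluster, at least one common member bit. */
static bool x2apic_logical_match(uint32_t ldr, uint32_t dest)
{
    return (ldr >> 16) == (dest >> 16) && (ldr & dest & 0xffff) != 0;
}
```

With this derivation, vCPUs 0-15 share cluster 0 but each gets a distinct bit, instead of all ending up with an LDR of 1.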
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Thu, 25 Sep 2014 12:07:27 +0000 (14:07 +0200)]
x86/HVM: fix miscellaneous aspects of x2APIC emulation
- generate #GP on invalid APIC base MSR transitions
- fail reads from the EOI and self-IPI registers (which are write-only)
- handle self-IPI writes and the ICR2 half of ICR writes largely in
hvm_x2apic_msr_write() and (for self-IPI only) vlapic_apicv_write()
- don't permit MMIO-based access in x2APIC mode
- filter writes to read-only registers in hvm_x2apic_msr_write(),
allowing conditionals to be dropped from vlapic_reg_write()
- don't ignore upper half of MSR-based write to ESR being non-zero
- don't ignore other writes to reserved bits
- VMX's EXIT_REASON_APIC_WRITE must not result in #GP (this exit being
trap-like, this exception would get raised on the wrong RIP)
- make hvm_x2apic_msr_read() produce X86EMUL_* return codes just like
hvm_x2apic_msr_write() does (benign to the only caller)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
Ian Campbell [Wed, 24 Sep 2014 14:13:28 +0000 (15:13 +0100)]
xen: arm: correct VTCR setting on arm32.
1c92a2aaf8c6 "xen: arm: support for up to 48-bit IPA addressing on
arm64" inadvertently changes the VTCR setting for 32-bit from
0x80003558 to 0x80003518, changing the SL0 setting from 0x1 (p2m
starts at L1) to 0x0 (p2m starts at L2).
For some (inexplicable) reason this doesn't cause any issue on
Arndale but it does on the OdroidXU.
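The two VTCR values differ only in bit 6, which is the low bit of the SL0 field (bits [7:6]). A tiny helper (hypothetical name) makes the before/after values above easy to check:

```c
#include <stdint.h>

/* Extract VTCR.SL0 (bits [7:6]): per the message above, 1 means the p2m
 * starts at L1 and 0 means it starts at L2 on arm32. */
static unsigned vtcr_sl0(uint32_t vtcr)
{
    return (vtcr >> 6) & 0x3;
}
```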
Andrew Cooper [Wed, 24 Sep 2014 16:28:15 +0000 (17:28 +0100)]
tools/libxc: Avoid cacheflush toolstack hypercalls on x86
XEN_DOMCTL_cacheflush hypercalls are (and will always be) -ENOSYS on x86, but
xc_domain_cacheflush() is called often during domain build and migrate for
correct behaviour on ARM.
Stub xc_domain_cacheflush() out on x86 to remove its pressure on the global
domctl lock, and the hypercall overhead (which applies further pressure to the
already heavily-contended TLB flush lock).
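The shape of such an architecture-conditional stub can be sketched as follows. `domain_cacheflush()` and `do_cacheflush_domctl()` are hypothetical stand-ins for the libxc wrapper and the real hypercall, and the actual patch may differ in detail. On x86 the function returns success without issuing anything, so the domctl lock is never taken.

```c
/* Hypothetical stand-in for issuing the real cache-flush domctl. */
static int hypercall_count;

static int do_cacheflush_domctl(void)
{
    hypercall_count++;
    return 0;
}

static int domain_cacheflush(void)
{
#if defined(__i386__) || defined(__x86_64__)
    /* x86: the domctl could only ever fail with -ENOSYS, so skip it
     * entirely and avoid the global domctl lock and hypercall cost. */
    return 0;
#else
    /* Other architectures (e.g. ARM) genuinely need the flush. */
    return do_cacheflush_domctl();
#endif
}
```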
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> CC: Keir Fraser <keir@xen.org> CC: Jan Beulich <JBeulich@suse.com> CC: Tim Deegan <tim@xen.org> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Jan Beulich [Thu, 25 Sep 2014 09:55:49 +0000 (11:55 +0200)]
x86: make dump_pageframe_info() slightly more verbose for dying domains
Allowing more than just 10 pages to be printed in this case gives a
better chance to fully understand eventual page reference leaks: Report
up to 16 "normal" (writable or untyped) pages, and an unlimited number
of special type (page or descriptor table) ones.
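The reporting policy can be expressed as a small counting sketch. `enum page_kind` and `pages_to_report()` are hypothetical. Ordinary pages are capped at 16, while special-type pages are always reported so that a leak involving them is never hidden by the cap.

```c
/* Hypothetical classification of a dying domain's remaining pages. */
enum page_kind { PAGE_NORMAL, PAGE_SPECIAL };

/* Count how many pages would be printed under the policy above. */
static unsigned pages_to_report(const enum page_kind *pages, unsigned n)
{
    unsigned normal_seen = 0, reported = 0;

    for (unsigned i = 0; i < n; i++) {
        /* Skip normal pages once 16 of them have been reported. */
        if (pages[i] == PAGE_NORMAL && normal_seen++ >= 16)
            continue;
        reported++;
    }
    return reported;
}
```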
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 25 Sep 2014 09:53:32 +0000 (11:53 +0200)]
x86emul: fix SYSCALL/SYSENTER/SYSEXIT emulation
SYSCALL:
- make sure SS selector has RPL 0
- only use 32 bits of RIP to fill RCX when target execution mode is 32-bit
- don't shadow function wide variable 'rc'
- consolidate CS attribute setting into single statements
- drop pointless initializers and casts
- drop redundant MSR_STAR read (as suggested by Andrew Cooper)
SYSENTER/SYSEXIT:
- #GP condition doesn't depend on guest mode
- only use 32 bits for setting RIP/RSP when target execution mode is 32-bit
- don't shadow function wide variable 'rc'
- consolidate CS attribute setting into single statements
- drop pointless (and inconsistently used) casts
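Two of the SYSCALL rules listed above can be illustrated with a sketch: both target selectors end up with RPL 0, and RCX receives only 32 bits of the return address when the target mode is 32-bit. The names are hypothetical. Deriving SS as the RPL-cleared CS selector plus 8 is one way to satisfy the "SS selector has RPL 0" rule and is not necessarily how the patch implements it.

```c
#include <stdbool.h>
#include <stdint.h>

struct syscall_result {
    uint16_t cs, ss;   /* target selectors */
    uint64_t rcx;      /* saved return address */
};

static struct syscall_result syscall_targets(uint64_t star, uint64_t next_rip,
                                             bool target_is_32bit)
{
    struct syscall_result r;

    r.cs = (uint16_t)((star >> 32) & 0xfffc);   /* STAR[47:32], RPL cleared */
    r.ss = (uint16_t)(r.cs + 8);                /* next descriptor, RPL 0 too */
    r.rcx = target_is_32bit ? (uint32_t)next_rip : next_rip;
    return r;
}
```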
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roy Franz [Wed, 24 Sep 2014 09:09:11 +0000 (11:09 +0200)]
x86/EFI: fix freeing of uninitialized pointer
The only valid response from the LocateHandle() call is EFI_BUFFER_TOO_SMALL,
so exit if we get anything else. We pass a 0 size/NULL pointer buffer, so the
only other return values we can get are errors. Return right away, as there is
nothing to do. Also return if there is an error allocating the buffer, as the
previous code path also allowed an undefined pointer to be freed.
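The corrected control flow follows the usual "query size, allocate, fill" pattern. This sketch uses a mock in place of the firmware call. `mock_locate()`, its status codes and `get_handles()` are hypothetical and are not UEFI definitions. The key property is that free() is only ever reached for a buffer that was actually allocated.

```c
#include <stddef.h>
#include <stdlib.h>

enum status { ST_SUCCESS, ST_BUFFER_TOO_SMALL, ST_NOT_FOUND };

static size_t mock_handle_count = 3;   /* hypothetical number of handles */

/* Mock of a size-query-then-fill call in the style of LocateHandle(). */
static enum status mock_locate(size_t *size, int *buf)
{
    size_t needed = mock_handle_count * sizeof(int);

    if (mock_handle_count == 0)
        return ST_NOT_FOUND;
    if (*size < needed) {
        *size = needed;
        return ST_BUFFER_TOO_SMALL;
    }
    for (size_t i = 0; i < mock_handle_count; i++)
        buf[i] = 100 + (int)i;
    *size = needed;
    return ST_SUCCESS;
}

/* Returns the handle count and stores the array in *out, or -1 with *out
 * untouched. */
static int get_handles(int **out)
{
    size_t size = 0;

    if (mock_locate(&size, NULL) != ST_BUFFER_TOO_SMALL)
        return -1;            /* any other answer: nothing allocated yet */
    int *buf = malloc(size);
    if (!buf)
        return -1;            /* allocation failed: nothing to free */
    if (mock_locate(&size, buf) != ST_SUCCESS) {
        free(buf);            /* safe: buf came from malloc() above */
        return -1;
    }
    *out = buf;
    return (int)(size / sizeof(int));
}
```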
Signed-off-by: Roy Franz <roy.franz@linaro.org>
Re-structure the change.
Wei Liu [Mon, 15 Sep 2014 19:29:15 +0000 (20:29 +0100)]
flask/policy: use naming convention xenpolicy-$(XEN_FULLVERSION)
Daniel suggested we use xenpolicy-$(XEN_FULLVERSION) as flask policy
naming convention.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov> Cc: Ian Campbell <ian.campbell@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Ian Campbell [Mon, 22 Sep 2014 16:10:39 +0000 (17:10 +0100)]
MAINTAINERS: Add Wei Liu as toolstack co-maintainer.
The three existing maintainers are not really able to keep up with
the flow and Wei is one of the top tools contributors (according to
"git shortlog -s -n -p RELEASE-4.4.0..origin/staging tools" and my
own impressions).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Introduce a document that describes the interfaces used on PVH. This
document has been designed from a guest OS point of view (i.e.: what a guest
needs to do in order to support PVH).
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Mukesh Rathor <mukesh.rathor@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jan Beulich <JBeulich@suse.com> Cc: Mukesh Rathor <mukesh.rathor@oracle.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: David Vrabel <david.vrabel@citrix.com>
xen/arm: remove check for generic timer support for arm64
Information about generic timer support is available in the
IDR_PFR1 register on ARMv7, whereas this information is not
available on ARMv8 implementations that support only aarch64 mode.
Since ARMv8 always supports the generic timer, this check is not
required.
For platforms that support only aarch64 mode, IDR_PFR1 is
not implemented.
Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
xen/arm: Restricted access to IFSR32_EL2 and FPEXC32_EL2
The IFSR32_EL2 and FPEXC32_EL2 registers are accessible in
aarch64 mode only if aarch32 mode is supported in EL1.
So allow access to these registers only for 32-bit domains.
Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Ian Jackson [Tue, 23 Sep 2014 16:46:21 +0000 (17:46 +0100)]
libxl: Remove a duplicate calculation of be_path
Coverity-ID: 1238177 CC: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Reviewed-by: Don Slutz <dslutz@verizon.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
George Dunlap [Mon, 15 Sep 2014 16:25:04 +0000 (17:25 +0100)]
make: Make "src-tarball" target actually make a source tarball
At the moment, making a release tarball is an annoyingly manual
process that involves running "git archive" into a temporary directory.
Script this process up and make a target, so that the release manager
can simply type "make src-tarball-release" and have everything show up
nice and neat in dist/xen-$version.tar.gz. "make src-tarball" will
make a version number based on git describe, which will typically have
the most recent tag, number of commits since that tag, and the git
commit id of the current HEAD.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
xen: arm: Add support for the Exynos secure firmware
The existence of secure firmware is dictated by the presence of
"samsung,secure-firmware" in the DT.
The Arndale board does not have that entry, and uses the address as defined
in "samsung,exynos4210-sysram", offset 0 as the smp init address. This is
possibly true for all SoCs without secure firmware.
For other boards which do have a "secure-firmware" node, use sysram-ns
at offset +0x1c as the smp init address.
The "secure-firmware" MMIO range contains ways to idle the CPU. As this gets
mapped to DOM0 because of its presence in the DT, we blacklist it.
I have tested this on the Odroid XU. I have also tested the other code path
on the Odroid XU by removing "secure-firmware" from its DT. I could see
that the other code path was exercised with correct smp init address
values.
Signed-off-by: Suriyan Ramasami <suriyan.r@gmail.com> Tested-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
This implementation uses the tsc of vcpu0, which is preserved across
save/restore as part of the architectural state, and then converts that
to a 100ns tick using the domain's tsc_khz.
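The conversion described above works out to ticks = tsc * 10^7 / (tsc_khz * 10^3) = tsc * 10000 / tsc_khz. A minimal sketch (hypothetical function name; the patch's exact arithmetic may differ) divides first and handles the remainder separately. This gives exactly the same floor result as the direct formula but cannot overflow 64 bits for large TSC values, where tsc * 10000 would.

```c
#include <stdint.h>

/* Convert a TSC count to 100ns ticks given the TSC frequency in kHz. */
static uint64_t tsc_to_100ns(uint64_t tsc, uint32_t tsc_khz)
{
    uint64_t q = tsc / tsc_khz;   /* whole milliseconds */
    uint64_t r = tsc % tsc_khz;   /* leftover cycles, < tsc_khz */

    /* 1 ms = 10000 ticks; r * 10000 cannot overflow since r < 2^32. */
    return q * 10000 + r * 10000 / tsc_khz;
}
```

For example, at 2 GHz (tsc_khz = 2000000), 200 cycles is 100 ns, i.e. one tick.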
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Keir Fraser <keir@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Christoph Egger <chegger@amazon.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
because a flaw in the implementation meant the counter was reset on
migration.
All of these changes were made without any additional options being
added to the VM configuration, or any compatibility checks being made
in the domain save/restore code. Hence setting the single boolean
'viridian' option in the VM configuration yields a different set of
features depending on which version of Xen the VM is started on, and the
feature set can change across migration (so new MSRs can magically appear).
This patch grandfathers in the current viridian features set and calls them
the 'base' and 'freq' feature sets. HVM_PARAM_VIRIDIAN is re-purposed as
a feature mask. The hypervisor has only ever allowed it to be set to 0
or 1, so the presence of the base and freq sets is indicated by setting
bit 0. The freq set can then be turned off by setting bit 1, thus
restoring the pre-Xen-4.4 base set. Newly implemented viridian features
can be optionally enabled in future by setting further bits.
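The bit semantics above can be decoded as follows. The macro and function names are hypothetical, not Xen's. A legacy value of 1 keeps both sets, 0 keeps none, and 3 (bit 0 plus bit 1) gives the pre-Xen-4.4 base set only.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRIDIAN_BASE_AND_FREQ (UINT64_C(1) << 0)  /* base + freq sets */
#define VIRIDIAN_NO_FREQ       (UINT64_C(1) << 1)  /* turn freq back off */

static bool viridian_base_enabled(uint64_t param)
{
    return param & VIRIDIAN_BASE_AND_FREQ;
}

static bool viridian_freq_enabled(uint64_t param)
{
    return (param & VIRIDIAN_BASE_AND_FREQ) && !(param & VIRIDIAN_NO_FREQ);
}
```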
The viridian option in xl.cfg(5) has also been changed to a list so
that the sets can be individually enabled or disabled. For compatibility,
if the option is specified as a boolean, then a true (1) value will enable
the base and freq sets and a false (0) value will not enable any
enlightenments.
This patch also alters the allowed write accesses to HVM_PARAM_VIRIDIAN.
Currently there is nothing to stop the guest writing this value (which,
while harmless to anything else, should not happen) and nothing to
stop a toolstack from setting the value back to zero whilst the guest is
running, causing CPUID leaves to disappear and MSR accesses to start
causing GPFs in the guest. Both of these possibilities are now disallowed:
Once the parameter is set to a non-zero value it may not be modified (only
re-written with the same value), and guests no longer have any write
access.
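The new write policy reduces to a small predicate. This is a sketch with a hypothetical name. Guest writes are always refused. The toolstack may set the value while it is still zero, after which only a write of the identical value is accepted, so CPUID leaves and MSRs cannot vanish under a running guest.

```c
#include <stdbool.h>
#include <stdint.h>

static bool viridian_param_write_ok(bool from_guest, uint64_t cur,
                                    uint64_t new_val)
{
    if (from_guest)
        return false;                  /* guests have no write access */
    return cur == 0 || new_val == cur; /* set once, then only re-write same */
}
```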
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Cc: Keir Fraser <keir@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: David Scott <dave.scott@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Add xl command for rtds scheduler
Note: VCPU parameters (period, budget) are in microseconds (us).
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu> Signed-off-by: Sisu Xi <xisisu@gmail.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Add libxl functions to set/get domain's parameters for rtds scheduler
Note: VCPU information (period, budget) is in microseconds (us).
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu> Signed-off-by: Sisu Xi <xisisu@gmail.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Add xc_sched_rtds_* functions to interact with Xen to set/get domain's
parameters for rtds scheduler.
Note: VCPU information (period, budget) is in microseconds (us).
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu> Signed-off-by: Sisu Xi <xisisu@gmail.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
[ ijc -- xenctrl.h has moved to tools/libxc/include, adjust patch to match ]
This scheduler follows the Preemptive Global Earliest Deadline First
(EDF) theory from the real-time field.
At any scheduling point, the VCPU with earlier deadline has higher
priority. The scheduler always picks the highest priority VCPU to run on a
feasible PCPU.
A PCPU is feasible if the VCPU can run on this PCPU and the PCPU is
either idle or has a lower-priority VCPU running on it.
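The feasibility rule can be turned into a PCPU-selection sketch. The types and names are hypothetical, deadlines are absolute times, and an earlier deadline means higher priority. An allowed idle PCPU is taken immediately. Otherwise the sketch preempts the allowed PCPU running the latest-deadline (lowest-priority) VCPU, provided that VCPU's deadline is later than the candidate's.

```c
#include <stdbool.h>
#include <stdint.h>

struct pcpu_state {
    bool allowed;           /* may the VCPU run here (affinity/cpupool)? */
    bool idle;
    uint64_t cur_deadline;  /* deadline of the running VCPU if not idle */
};

/* Returns a feasible PCPU index, or -1 if none is feasible. */
static int pick_pcpu(const struct pcpu_state *p, int n, uint64_t vcpu_deadline)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (!p[i].allowed)
            continue;
        if (p[i].idle)
            return i;   /* an idle feasible PCPU is always acceptable */
        if (p[i].cur_deadline > vcpu_deadline &&
            (best < 0 || p[i].cur_deadline > p[best].cur_deadline))
            best = i;   /* lowest-priority running VCPU seen so far */
    }
    return best;
}
```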
Each VCPU has a dedicated period and budget.
The deadline of a VCPU is at the end of each period;
A VCPU has its budget replenished at the beginning of each period;
While scheduled, a VCPU burns its budget.
The VCPU needs to finish its budget before its deadline in each period;
The VCPU discards its unused budget at the end of each period.
If a VCPU runs out of budget in a period, it has to wait until next period.
Each VCPU is implemented as a deferrable server.
When a VCPU has a task running on it, its budget is continuously burned;
When a VCPU has no task but still has budget left, its budget is preserved.
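The period/budget rules above can be sketched as two bookkeeping helpers (hypothetical struct and names, times in microseconds). Budget is charged only for time actually run, which is what makes the server deferrable. When the deadline passes, the VCPU jumps to the period containing the current time and its budget is refilled, discarding anything left over.

```c
#include <stdint.h>

struct rt_vcpu {
    uint64_t period, budget;  /* configured parameters (period > 0) */
    uint64_t deadline;        /* absolute end of the current period */
    uint64_t cur_budget;      /* budget left in the current period */
};

/* At or after the deadline, move to the period containing 'now' and
 * replenish; unused budget from the old period is not carried over. */
static void rt_update_deadline(struct rt_vcpu *v, uint64_t now)
{
    if (now < v->deadline)
        return;
    uint64_t periods = (now - v->deadline) / v->period + 1;
    v->deadline += periods * v->period;
    v->cur_budget = v->budget;
}

/* Charge 'ran' microseconds of execution, never going below zero. */
static void rt_burn_budget(struct rt_vcpu *v, uint64_t ran)
{
    v->cur_budget = ran >= v->cur_budget ? 0 : v->cur_budget - ran;
}
```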
Queue scheme:
A global runqueue and a global depletedq for each CPU pool.
The runqueue holds all runnable VCPUs with budget, sorted by deadline;
The depletedq holds all VCPUs without budget, unsorted.
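The two-queue scheme can be sketched with minimal linked lists. The node type and helpers are hypothetical, not the scheduler's real data structures. The runqueue is kept sorted by deadline so that its head is always the earliest-deadline runnable VCPU, while the depleted queue needs no ordering.

```c
#include <stddef.h>
#include <stdint.h>

struct qnode {
    uint64_t deadline;
    uint64_t cur_budget;
    struct qnode *next;
};

/* Insert keeping earliest deadline first (ties go after existing entries). */
static void runq_insert_sorted(struct qnode **head, struct qnode *n)
{
    while (*head && (*head)->deadline <= n->deadline)
        head = &(*head)->next;
    n->next = *head;
    *head = n;
}

/* Order does not matter for VCPUs that have exhausted their budget. */
static void depletedq_push(struct qnode **head, struct qnode *n)
{
    n->next = *head;
    *head = n;
}

/* Route a VCPU to the right queue depending on whether budget remains. */
static void queue_vcpu(struct qnode **runq, struct qnode **depq, struct qnode *n)
{
    if (n->cur_budget > 0)
        runq_insert_sorted(runq, n);
    else
        depletedq_push(depq, n);
}
```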
Note: cpumask and cpupool are supported.
This is an experimental scheduler.
Signed-off-by: Meng Xu <mengxu@cis.upenn.edu> Signed-off-by: Sisu Xi <xisisu@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Tested-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
[ ijc -- use PRI_stime to print delta in burn_budget, to fix build on
32-bit (i.e. arm32) ]
Build qemu-xen on ARM and ARM64: it is used to provide the PV backends,
disk and framebuffer in particular.
Ideally we would also modify the configure options to only build what is
necessary: a machine just for PV backends. However that is a work in
progress and not yet available in QEMU (see
http://marc.info/?l=qemu-devel&m=139082425718379&w=2). So we just build
the usual i386 target, even though no i386 emulation is going to be done
by qemu-xen on ARM.
Move xenstore and libxc public headers to include subdir
Also moves xc_dom.h to include as it is used often by other xen tools.
Use the new include subdirectories to build Xen tools, qemu-xen and
stubdoms.
Add the old libxc include path to the programs that need it to build,
on a case-by-case basis, commenting that they shouldn't require
internal libxc headers to build.
[ And: update QEMU_TRADITIONAL_REVISION to corresponding qemu patch
- Ian Jackson ]
Jan Beulich [Tue, 23 Sep 2014 12:33:50 +0000 (14:33 +0200)]
x86emul: only emulate software interrupt injection for real mode
Protected mode emulation currently lacks proper privilege checking of
the referenced IDT entry, and there's currently no legitimate way for
any of the respective instructions to reach the emulator when the guest
is in protected mode.
This is XSA-106.
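The guard amounts to an early bail-out on guest mode. The enums and function are hypothetical stand-ins for the emulator's own types. Outside real mode, the sketch reports the instruction as unhandleable instead of delivering it, because protected-mode delivery would require privilege checks on the IDT entry.

```c
enum emul_rc { EMUL_OKAY, EMUL_UNHANDLEABLE };
enum cpu_mode { MODE_REAL, MODE_PROTECTED, MODE_LONG };

/* Emulate a software interrupt (e.g. INT n) only for real-mode guests. */
static enum emul_rc emulate_soft_int(enum cpu_mode mode, unsigned vector,
                                     unsigned *injected)
{
    if (mode != MODE_REAL)
        return EMUL_UNHANDLEABLE;  /* no IDT privilege checking implemented */
    *injected = vector;            /* real mode: IVT delivery, no DPL checks */
    return EMUL_OKAY;
}
```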
Reported-by: Andrei LUTAS <vlutas@bitdefender.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Keir Fraser <keir@xen.org>
Andrew Cooper [Tue, 23 Sep 2014 12:33:06 +0000 (14:33 +0200)]
x86/emulate: check cpl for all privileged instructions
Without this, it is possible for userspace to load its own IDT or GDT.
This is XSA-105.
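The missing check is the standard one for privileged instructions such as LIDT and LGDT: at CPL > 0 they must raise #GP(0) instead of taking effect. A sketch with hypothetical names:

```c
enum emul_result { EMUL_DONE, EMUL_RAISE_GP };

struct cpu_state {
    unsigned cpl;
    unsigned long idt_base;
};

/* Emulate LIDT: only CPL 0 may load a new IDT base. */
static enum emul_result emulate_lidt(struct cpu_state *s, unsigned long new_base)
{
    if (s->cpl != 0)
        return EMUL_RAISE_GP;   /* caller injects #GP(0); state unchanged */
    s->idt_base = new_base;
    return EMUL_DONE;
}
```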
Reported-by: Andrei LUTAS <vlutas@bitdefender.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrei LUTAS <vlutas@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 23 Sep 2014 12:31:47 +0000 (14:31 +0200)]
x86/shadow: fix race condition sampling the dirty vram state
d->arch.hvm_domain.dirty_vram must be read with the domain's paging lock held.
If not, two concurrent hypercalls could both end up attempting to free
dirty_vram (the second of which will free a wild pointer), or both end up
allocating a new dirty_vram structure (the first of which will be leaked).
This is XSA-104.
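The fix follows the usual pattern of reading and updating a shared pointer under one lock, so that exactly one caller ends up owning the old value. In this sketch a pthread mutex stands in for the domain's paging lock, and the structs and function are hypothetical. Because the swap happens under the lock, two concurrent callers can neither free the same structure twice nor leak one.

```c
#include <pthread.h>
#include <stdlib.h>

struct dirty_vram { unsigned long begin_pfn, end_pfn; };

struct domain_like {
    pthread_mutex_t paging_lock;   /* stand-in for the paging lock */
    struct dirty_vram *dirty_vram;
};

/* Replace (or clear, when new_range is NULL) the tracked range. */
static void set_dirty_vram(struct domain_like *d, struct dirty_vram *new_range)
{
    pthread_mutex_lock(&d->paging_lock);
    struct dirty_vram *old = d->dirty_vram;   /* sampled under the lock */
    d->dirty_vram = new_range;
    pthread_mutex_unlock(&d->paging_lock);
    free(old);   /* only this caller holds a reference to 'old' now */
}
```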
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
Olaf Hering [Mon, 22 Sep 2014 13:00:05 +0000 (15:00 +0200)]
tools/libxc: provide variable paths to libxc
In preparation for removing hardcoded /var/run/xen paths, provide
XEN_RUN_DIR and related directories to xc_private.h. Similar code exists
already for libxl, stubdom and other parts.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 13:00:04 +0000 (15:00 +0200)]
tools/libxl: use buildmakevars2header to create _paths.h
Replace usage of buildmakevars2file with buildmakevars2header. The macro
generates a C header file, so remove code which converts shell variables
into C defines. Also update the dependency; the macro itself creates a
dependency for _paths.h. A temporary file is not needed anymore.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 13:00:03 +0000 (15:00 +0200)]
Config.mk: add new macro buildmakevars2header
This macro is similar to buildmakevars2file; it just creates a C header
file instead of shell style syntax. Upcoming changes will use this macro
in libxl and libxc.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 13:00:02 +0000 (15:00 +0200)]
Config.mk: replace dependency to genpath with actual target
genpath is a detail of buildmakevars2file. Replace the dependency to
genpath with the actual buildmakevars2file target. This change by
itself does not fix any bug. Upcoming changes will add dependencies to
$(target), but no rule exists to create $(target).
To force a rebuild of the $(1) rule the target now depends on the
existing .phony target. This dummy target is already used elsewhere in
the code.
No change in behaviour is expected by this patch.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 13:00:01 +0000 (15:00 +0200)]
Config.mk: move directory list into BUILD_MAKE_VARS
To maintain the list of directories in a single place, move the existing
list into its own variable and use it in buildmakevars2file.
Required for upcoming changes.
Also trim whitespace.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 12:59:56 +0000 (14:59 +0200)]
tools/hotplug: create XEN_RUN_DIR at runtime
Create XEN_RUN_DIR instead of a hardcoded path because it is a compile-time
setting. Also /var/run might be empty on startup because it is a tmpfs
mount point.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 12:59:55 +0000 (14:59 +0200)]
tools/pygrub: store kernels in /var/run/xen/pygrub
Move the location of temporary boot files from /var/run/xend/boot to
/var/run/xen/pygrub. Create the subdirectory if it does not exist.
The <dir> argument to --output-directory must be an existing directory.
The reason for this change is that all entries below /var/run have to be
created at runtime in case /var/run is cleared on every boot.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 12:59:54 +0000 (14:59 +0200)]
tools: remove obsolete path.py from tools/python
The directory tools/python/xen/util does not exist.
Upcoming changes to genpath will fail if the rule persists.
Nothing uses path.py (anymore?), so get rid of it.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- removed from .gitignore too ]
Olaf Hering [Mon, 22 Sep 2014 12:59:53 +0000 (14:59 +0200)]
tools/mkrpm: allow custom rpm package name
Even if xen is configured and compiled with different --prefix= so that
it operates entirely below $prefix, the resulting package from 'make
rpmball' is always called "xen.rpm".
Use a variable, PKG_SUFFIX, to give the package a different name.
This can be used like this:
suffix=-bugN
prefix=/opx/xen/staging${suffix}
./configure --prefix=${prefix}
make rpmball PKG_SUFFIX=${suffix}
The result will be "xen-bugN.rpm" instead of "xen.rpm". The benefit is that
many xen${suffix}.rpm packages can be installed at the same time.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Olaf Hering [Mon, 22 Sep 2014 12:59:50 +0000 (14:59 +0200)]
stubdom: fix lwip compile
stubdom/lwip-x86_64/src/core/dhcp.c: In function 'dhcp_create_request':
stubdom/lwip-x86_64/src/core/dhcp.c:1359:71: error: array subscript is above array bounds [-Werror=array-bounds]
dhcp->msg_out->chaddr[i] = (i < netif->hwaddr_len) ? netif->hwaddr[i] : 0/* pad byte*/;
gcc cannot know whether hwaddr_len exceeds the size of the hwaddr array,
so force an upper limit to assist gcc.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Samuel Thibault <samuel.thibault@ens-lyon.org>
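The clamp can be illustrated with a minimal standalone sketch. The structure, `fill_chaddr()` and `HWADDR_MAX` below are hypothetical and are not lwIP's code; only the 16-octet size of the DHCP `chaddr` field comes from the DHCP specification. Bounding the copy by `sizeof(netif->hwaddr)` makes the upper limit explicit, which is the kind of hint that should let gcc see that the index stays in range.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Illustrative sizes only; lwIP's real structures and constants differ. */
#define HWADDR_MAX 6   /* size of the hypothetical hwaddr array */
#define CHADDR_LEN 16  /* the DHCP chaddr field is 16 octets */

struct example_netif {
    uint8_t hwaddr_len;          /* may claim more bytes than hwaddr holds */
    uint8_t hwaddr[HWADDR_MAX];
};

/* Copy the hardware address into chaddr, padding with zero bytes.
 * The explicit clamp guarantees hwaddr[] is never indexed past its end,
 * whatever value hwaddr_len holds. */
static void fill_chaddr(uint8_t chaddr[CHADDR_LEN],
                        const struct example_netif *netif)
{
    size_t len = netif->hwaddr_len;

    if (len > sizeof(netif->hwaddr))
        len = sizeof(netif->hwaddr);

    for (size_t i = 0; i < CHADDR_LEN; i++)
        chaddr[i] = (i < len) ? netif->hwaddr[i] : 0; /* pad byte */
}
```

With a bogus `hwaddr_len` of 200, `fill_chaddr()` still copies at most six bytes and zero-pads the rest.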
Ian Campbell [Wed, 17 Sep 2014 21:21:03 +0000 (22:21 +0100)]
xen: arm: Enable physical address space compression (PDX) on arm
This allows us to support sparse physical address maps which we previously
could not because the frametable would end up taking up an enormous fraction
of RAM.
On a fast model which has RAM at 0x80000000-0x100000000 and
0x880000000-0x900000000 this reduces the size of the frametable from
478M to 84M.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
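The compression idea can be sketched as follows. The hole position and width are worked out by hand from the fast-model memory map above, assuming 4K pages: in page frame numbers the banks are 0x80000-0xfffff and 0x880000-0x8fffff, so PFN bits 20..22 are zero in every RAM page and can be squeezed out. The constants and function names are illustrative, not taken from Xen's sources.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical hole parameters for the example memory map (PFN units). */
#define HOLE_START 20u  /* first PFN bit of the always-zero run */
#define HOLE_SHIFT 3u   /* width of the run in bits */

static const uint64_t bottom_mask = (1ULL << HOLE_START) - 1;
static const uint64_t top_mask = ~((1ULL << (HOLE_START + HOLE_SHIFT)) - 1);

/* Compress: keep the bits below the hole, shift the bits above it down. */
static uint64_t pfn_to_pdx(uint64_t pfn)
{
    return (pfn & bottom_mask) | ((pfn & top_mask) >> HOLE_SHIFT);
}

/* Decompress: the inverse, re-inserting the zero hole bits. */
static uint64_t pdx_to_pfn(uint64_t pdx)
{
    return (pdx & bottom_mask) | ((pdx << HOLE_SHIFT) & top_mask);
}
```

For this map the compressed index space ends at PDX 0x1fffff instead of PFN 0x8fffff, so a frametable indexed by PDX needs far fewer entries.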
Ian Campbell [Tue, 16 Sep 2014 20:01:41 +0000 (21:01 +0100)]
xen: add helpers for PDX mask initialisation calculations
I wanted to make fill_mask a public function so I could use it on ARM, but it
was actually easier to think of a (semi) reasonable public name for the users
of it, so that is what I have done.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
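The kind of mask arithmetic such helpers wrap can be sketched as below. The body of the `fill_mask()`-style helper is an assumption (it sets every bit below the most significant set bit), and `region_mask()` and `find_hole()` are hypothetical names for illustration, not the public helpers this commit adds. The idea: OR together, for every RAM bank, its base and the bits that vary within it; any run of bits that stays zero is a hole that compression can remove.

```c
#include <stdint.h>
#include <assert.h>

/* Set every bit below the most significant set bit of mask. */
static uint64_t fill_mask(uint64_t mask)
{
    while (mask & (mask + 1))
        mask |= mask + 1;
    return mask;
}

/* Hypothetical helper: the bits that vary within [base, base + len). */
static uint64_t region_mask(uint64_t base, uint64_t len)
{
    return fill_mask(base ^ (base + len - 1));
}

/* Hypothetical helper: find the widest run of zero bits in mask, at or
 * above min_bit and below its highest set bit. Returns the run width
 * (0 if there is none) and stores the first bit of the run in *start. */
static unsigned int find_hole(uint64_t mask, unsigned int min_bit,
                              unsigned int *start)
{
    unsigned int best = 0, run = 0;

    *start = 0;
    for (unsigned int bit = min_bit; bit < 64; bit++) {
        if (!(mask >> bit))       /* no set bits remain at or above here */
            break;
        if (mask & (1ULL << bit)) {
            run = 0;
        } else if (++run > best) {
            best = run;
            *start = bit + 1 - run;
        }
    }
    return best;
}
```

Fed the fast-model banks quoted in the PDX entry above (in 4K page frame numbers), the combined mask is 0x8fffff and the hole found is three bits wide starting at bit 20.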
Ian Campbell [Wed, 17 Sep 2014 21:21:01 +0000 (22:21 +0100)]
xen: refactor physical address space compression support into common code
The "pdx compression" functionality will be useful on ARM as well.
Move the code to common code+header and introduce HAS_PDX to control when it is
built. L2_PAGETABLE_SHIFT is x86 specific, so introduce PDX_GROUP_SHIFT to
abstract it out.
ARM has no need for superpage compression (yet?) and lacks SUPERPAGE_SHIFT so
those functions (spage_to_mfn et al) are not moved.
No effect on x86 and no change for ARM (yet).
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Ian Campbell [Thu, 18 Sep 2014 00:09:55 +0000 (01:09 +0100)]
xen: arm: support for up to 48-bit IPA addressing on arm64
Currently we support only 40 bits. This is insufficient on systems where
peripherals which need to be 1:1 mapped to dom0 are above the 40-bit limit.
Unfortunately the hardware requirements are such that the number of levels
in the P2M is not static and must vary with the number of implemented
physical address bits. This is described in ARM DDI 0487A.b Table D4-5.
In short, there is no single p2m configuration which supports everything
from 40 to 48 bits.
For example, a system which supports up to 40-bit addressing will only support a
3-level p2m (maximum SL0 is 1, i.e. 3 levels), requiring a concatenated page
table root consisting of two pages to make the full 40 bits of addressing.
A maximum of 16 pages can be concatenated, meaning that a 3-level p2m can only
support up to 43-bit addresses. Therefore support for 48-bit addressing
requires SL0==2 (4 levels of paging).
After the previous patches our various p2m lookup and manipulation functions
already support starting at arbitrary level and with arbitrary root
concatenation. All that remains is to determine the correct settings from
ID_AA64MMFR0_EL1.PARange for which we use a lookup table.
As well as supporting 44- and 48-bit addressing, we can also reduce the order of
the first level for systems which support only 32 or 36 physical address bits,
saving a page.
Systems with 42 bits are an interesting case, since they only support 3 levels
of paging, implying that 8 pages are required at the root level. So far I am
not aware of any systems with peripherals located so high up (the only 42-bit
system I've seen has nothing above 40 bits), so such systems remain configured
for 40-bit IPA with a pair of pages at the root of the p2m.
Switching to symbolic names for the VTCR_EL2 bits as we go improves the clarity
of the result.
Parts of this are derived from "xen/arm: Add 4-level page table for
stage 2 translation" by Vijaya Kumar K.
arm32 remains with the static 3-level, 2 page root configuration.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
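The level and concatenation arithmetic described above can be sketched as a small function, assuming a 4K granule (12-bit page offset, 9 bits resolved per level). The names are hypothetical; per the text, the real code uses a precomputed lookup table indexed by ID_AA64MMFR0_EL1.PARange rather than computing the geometry at runtime. The PARange-to-bits mapping below is the ARMv8.0 encoding.

```c
#include <assert.h>

/* ID_AA64MMFR0_EL1.PARange encodings 0..5 (ARMv8.0); other values are
 * treated as invalid in this sketch. */
static const unsigned int parange_bits[] = { 32, 36, 40, 42, 44, 48 };

#define PAGE_SHIFT  12u  /* 4K granule */
#define LEVEL_BITS  9u   /* 512 entries per table */
#define MAX_CONCAT  16u  /* at most 16 concatenated root pages */

struct p2m_geometry {
    unsigned int levels;      /* 3 or 4 levels of stage-2 tables */
    unsigned int root_pages;  /* number of concatenated root pages */
};

/* Prefer 3 levels, concatenating root pages as needed; fall back to
 * 4 levels once more than MAX_CONCAT pages would be required.
 * Returns 0 on success, -1 if ipa_bits is out of range. */
static int p2m_geometry_for(unsigned int ipa_bits, struct p2m_geometry *g)
{
    unsigned int covered3 = PAGE_SHIFT + 3 * LEVEL_BITS; /* 39 bits */
    unsigned int covered4 = PAGE_SHIFT + 4 * LEVEL_BITS; /* 48 bits */

    if (ipa_bits > covered4)
        return -1;
    if (ipa_bits <= covered3) {
        g->levels = 3;
        g->root_pages = 1;
    } else if ((1u << (ipa_bits - covered3)) <= MAX_CONCAT) {
        g->levels = 3;
        g->root_pages = 1u << (ipa_bits - covered3);
    } else {
        g->levels = 4;
        g->root_pages = 1;
    }
    return 0;
}

/* Decode a PARange field value into a number of address bits. */
static int parange_to_bits(unsigned int parange)
{
    if (parange >= sizeof(parange_bits) / sizeof(parange_bits[0]))
        return -1;
    return (int)parange_bits[parange];
}
```

This arithmetic alone would give a 42-bit system a root of 8 pages; as described above, Xen instead keeps such systems at the 40-bit configuration with a two-page root.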
Ian Campbell [Thu, 18 Sep 2014 00:09:54 +0000 (01:09 +0100)]
xen: arm: support for up to 48-bit physical addressing on arm64
This only affects Xen's own stage one paging.
- Use symbolic names for TCR bits for clarity.
- Update PADDR_BITS
- Base field of LPAE PT structs is now 36 bits (and therefore
unsigned long long for arm32 compatibility)
- TCR_EL2.PS is set from ID_AA64MMFR0_EL1.PARange.
- Provide decode of ID_AA64MMFR0_EL1 in CPU info
Parts of this are derived from "xen/arm: Add 4-level page table for
stage 2 translation" by Vijaya Kumar K.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
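Why the base field needs `unsigned long long` can be shown with a simplified descriptor. The layout below is hypothetical and is not Xen's real LPAE structure: with 4K pages, a 48-bit output address needs bits 12..47 stored, i.e. 36 bits, which no longer fits a 32-bit `unsigned int` bitfield on arm32. `unsigned long long` bitfields are accepted by GCC and Clang.

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT 12

/* Simplified, hypothetical 64-bit descriptor layout. */
typedef struct {
    unsigned long long valid:1;
    unsigned long long table:1;
    unsigned long long attrs:10;  /* lower attributes, unused here */
    unsigned long long base:36;   /* output address bits [47:12] */
    unsigned long long upper:16;  /* upper attributes/reserved */
} example_pt_t;

static example_pt_t make_entry(uint64_t paddr)
{
    example_pt_t e = { 0 };

    e.valid = 1;
    e.base = paddr >> PAGE_SHIFT;
    return e;
}

/* Widen to uint64_t before shifting: GCC gives the 36-bit field its own
 * narrow type, so shifting the field directly could truncate. */
static uint64_t entry_paddr(example_pt_t e)
{
    return (uint64_t)e.base << PAGE_SHIFT;
}
```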
Ian Campbell [Thu, 18 Sep 2014 00:09:52 +0000 (01:09 +0100)]
xen: arm: handle variable p2m levels in p2m_lookup
This paves the way for boot-time selection of the number of levels to
use in the p2m, which is required to support both 40-bit and 48-bit
systems. For now the starting level remains a compile time constant.
Implemented by turning the linear sequence of lookups into a loop.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
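Turning the per-level lookups into a loop can be sketched as below, assuming a 4K granule and a simplified in-memory descriptor (not Xen's LPAE type or its `p2m_lookup()`). Because every level runs the same loop body, the starting level becomes a plain parameter. Root-page concatenation is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

#define PAGE_SHIFT    12u
#define LPAE_ENTRIES  512u  /* 9 address bits per level (4K granule) */

/* Simplified, hypothetical descriptor. */
struct entry {
    bool valid;
    bool table;          /* true: points at a next-level table */
    struct entry *next;  /* next-level table when table is true */
    uint64_t mfn;        /* frame of the block/page for a leaf */
};

/* Lowest address bit indexed at a given level (0..3). */
static unsigned int level_shift(unsigned int level)
{
    return PAGE_SHIFT + 9u * (3u - level);
}

/* Translate ipa, starting the walk at root_level. */
static bool walk(const struct entry *root, unsigned int root_level,
                 uint64_t ipa, uint64_t *maddr)
{
    const struct entry *table = root;

    for (unsigned int level = root_level; level <= 3; level++) {
        const struct entry *e =
            &table[(ipa >> level_shift(level)) & (LPAE_ENTRIES - 1)];

        if (!e->valid)
            return false;
        if (level < 3 && e->table) {
            table = e->next;
            continue;
        }
        /* Leaf: a block mapping (levels 1-2) or a page mapping (level 3). */
        *maddr = (e->mfn << PAGE_SHIFT) |
                 (ipa & ((1ULL << level_shift(level)) - 1));
        return true;
    }
    return false;
}
```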
Ian Campbell [Thu, 18 Sep 2014 00:09:51 +0000 (01:09 +0100)]
xen: arm: Defer setting of VTCR_EL2 until after CPUs are up
Currently we retain the hardcoded values, but soon we will want to calculate the
correct values based upon the CPU properties common to all processors, which
are only available once they are all up.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
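A value "common to all processors" amounts to the most restrictive setting any CPU reports. A hypothetical sketch for the physical address range, assuming each CPU's ID_AA64MMFR0_EL1.PARange field has already been read (this is not Xen's code):

```c
#include <stddef.h>
#include <assert.h>

/* Return the largest PARange every CPU supports, i.e. the minimum.
 * This works because larger PARange encodings denote larger address
 * sizes. The caller must pass at least one CPU (ncpus >= 1). */
static unsigned int common_parange(const unsigned int *parange, size_t ncpus)
{
    unsigned int common = parange[0];

    for (size_t i = 1; i < ncpus; i++)
        if (parange[i] < common)
            common = parange[i];
    return common;
}
```

The result is only meaningful once every CPU has been brought up and has reported its value, which is why the VTCR_EL2 setup has to wait.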
Ian Campbell [Thu, 18 Sep 2014 00:09:49 +0000 (01:09 +0100)]
xen: arm: handle concatenated root tables in dump_pt_walk
ARM allows for the concatenation of pages at the root of a p2m (but not a
regular page table) in order to support a larger IPA space than the number of
levels in the P2M would normally support. We use this to support 40-bit guest
addresses.
Previously we were unable to dump IPAs which were outside the first page of the
root. To fix this we adjust dump_pt_walk to take the machine address of the
page table root instead of expecting the caller to have mapped it. This allows
the walker code to select the correct page to map.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org>
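Selecting the correct root page can be sketched as follows, assuming a 4K granule. `root_slot()` is a hypothetical helper, not part of Xen: with a concatenated root, the root-level index can exceed one page's 512 entries, so the walker splits it into a page number and an entry offset within that page.

```c
#include <stdint.h>
#include <assert.h>

#define PAGE_SHIFT    12u
#define LPAE_ENTRIES  512u

/* Split the root-level index of ipa into (root page, entry offset). */
static void root_slot(uint64_t ipa, unsigned int root_level,
                      unsigned int *page, unsigned int *offset)
{
    uint64_t idx = ipa >> (PAGE_SHIFT + 9u * (3u - root_level));

    *page = (unsigned int)(idx / LPAE_ENTRIES);
    *offset = (unsigned int)(idx % LPAE_ENTRIES);
}
```

For a 40-bit IPA space with a two-page root at level 1, any IPA at or above 2^39 lands in the second root page, whose machine address is the root's machine address plus one page. This is why passing the root's machine address, rather than a single mapped page, lets the walker map the right one.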
Ian Campbell [Thu, 18 Sep 2014 00:09:47 +0000 (01:09 +0100)]
xen: arm: rename p2m->first_level to p2m->root.
This was previously part of Vijaya's "xen/arm: Add 4-level page table
for stage 2 translation" but is split out here to make that patch
easier to read.
I went with ->root rather than ->root_level, which the original used.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Julien Grall <julien.grall@linaro.org> Cc: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
In verify mode, we map the guest memory, and the guest page is
region_base + i * PAGE_SIZE. So we should csum page (region_base
+ i * PAGE_SIZE), not (region_base + (i+curbatch) * PAGE_SIZE)
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Hong Tao [Mon, 22 Sep 2014 05:59:13 +0000 (13:59 +0800)]
tools: libxc: restore: copy the correct page to memory
apply_batch() only handles MAX_BATCH_SIZE pages at a time. Any
bogus/unmapped/allocate-only/broken page is skipped. So when we call
apply_batch() again, the first page's index is curbatch - invalid_pages,
where invalid_pages stores the number of bogus/unmapped/allocate-only/broken
pages we have found.
In many cases invalid_pages is 0, which is why this error went unnoticed.
Signed-off-by: Hong Tao <bobby.hong@huawei.com> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Acked-by: Ian Campbell <ian.campbell@citrix.com>
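The index correction can be illustrated with a simplified model. It is an assumption, not libxc's `apply_batch()`: only valid pages are present in the mapped region, packed back to back, and the stream is processed in fixed-size batches. Under that model, each batch's first slot in the region is `curbatch - invalid_pages`, because every page skipped in an earlier batch occupies no slot. `copy_in_batches()` and `EX_PAGE_SIZE` are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <stdbool.h>
#include <assert.h>

#define EX_PAGE_SIZE 4u  /* tiny "pages" keep the example small */

/* Copy valid pages from the packed region into dst, batch by batch.
 * Returns the number of invalid (skipped) pages. */
static size_t copy_in_batches(unsigned char *dst, const unsigned char *region,
                              const bool *valid, size_t npages, size_t batch)
{
    size_t invalid_pages = 0;

    for (size_t curbatch = 0; curbatch < npages; curbatch += batch) {
        size_t end = (curbatch + batch < npages) ? curbatch + batch : npages;
        /* Pages skipped in earlier batches took no slot in region. */
        size_t slot = curbatch - invalid_pages;

        for (size_t i = curbatch; i < end; i++) {
            if (!valid[i]) {
                invalid_pages++;
                continue;
            }
            memcpy(dst + i * EX_PAGE_SIZE, region + slot * EX_PAGE_SIZE,
                   EX_PAGE_SIZE);
            slot++;
        }
    }
    return invalid_pages;
}
```

Using `curbatch` alone as the starting slot would give the same result only while no page had been skipped, matching the note above that the bug stays hidden when invalid_pages is 0.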
Roy Franz [Thu, 18 Sep 2014 22:50:05 +0000 (15:50 -0700)]
Update libfdt to v1.4.0
Update libfdt to v1.4.0, taken from git://git.jdl.com/software/dtc.git
Xen changes to libfdt_env.h are carried over from the existing libfdt (v1.3.0).
This update provides the fdt_create_empty_tree() function used by the ARM
EFI boot code.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
Roy Franz [Thu, 18 Sep 2014 22:50:04 +0000 (15:50 -0700)]
add arm64 cache flushing code from linux v3.16
__flush_dcache_all added from arch/arm64/mm/cache.S, with helper macros from
arch/arm64/include/asm/assembler.h, from v3.16. The cache flushing is required
when transitioning from EFI code that runs with the cache enabled to Xen startup
code, which expects the cache to be disabled.
Signed-off-by: Roy Franz <roy.franz@linaro.org> Acked-by: Ian Campbell <ian.campbell@citrix.com>
[ ijc -- removed indent on ENTRY() and dropped the entry point label which
duplicates the one from the macro. ]
x86: handle resumed instruction based on previous mem_event reply
In a scenario where a page fault that triggered a mem_event occurred,
p2m_mem_access_check() will now be able to either 1) emulate the
current instruction, or 2) emulate it, but not allow it to perform
any writes.
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
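The decision made on resume can be sketched as a small dispatch on the flags of the previous reply. The flag and enum names are hypothetical, not Xen's mem_event interface. As a design assumption of this sketch, the "no writes" flag only has an effect when emulation is also requested.

```c
#include <assert.h>

/* Hypothetical reply flags. */
enum reply_flags {
    REPLY_EMULATE         = 1u << 0,  /* emulate the faulting instruction */
    REPLY_EMULATE_NOWRITE = 1u << 1,  /* ...but suppress its writes */
};

enum resume_action {
    RESUME_NORMAL,           /* re-execute the instruction as usual */
    RESUME_EMULATE,          /* emulate the current instruction */
    RESUME_EMULATE_NOWRITE,  /* emulate it without allowing writes */
};

/* Choose how to resume based on the previous reply's flags. */
static enum resume_action resume_action_for(unsigned int flags)
{
    if (!(flags & REPLY_EMULATE))
        return RESUME_NORMAL;   /* NOWRITE alone is ignored here */
    if (flags & REPLY_EMULATE_NOWRITE)
        return RESUME_EMULATE_NOWRITE;
    return RESUME_EMULATE;
}
```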