Setup/teardown/preresume/postresume/checkpoint of the COLO proxy module.
We use netlink to communicate with the proxy module.
About colo-proxy module:
http://www.spinics.net/lists/netdev/msg333520.html
https://github.com/wencongyang/colo-proxy
How to use:
http://wiki.xen.org/wiki/COLO_-_Coarse_Grain_Lock_Stepping
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Wen Congyang [Wed, 17 Feb 2016 07:10:27 +0000 (15:10 +0800)]
COLO: use qemu block replication
Use qemu block replication as our block replication solution.
Note that guest must be paused before starting COLO, otherwise,
the disk won't be consistent between primary and secondary.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Wen Congyang [Mon, 21 Mar 2016 07:38:30 +0000 (15:38 +0800)]
Support colo mode for qemu disk
Usage: disk = ['...,colo,colo-host=xxx,colo-port=xxx,colo-export=xxx,active-disk=xxx,hidden-disk=xxx...']
For QEMU block replication details:
http://wiki.qemu.org/Features/BlockReplication
Note: we just introduce COLO framework, but don't implement COLO
operations in this patch.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Wen Congyang [Wed, 17 Feb 2016 06:28:32 +0000 (14:28 +0800)]
COLO: introduce new API to prepare/start/do/get_error/stop replication
We will use qemu block replication, and qemu provides some qmp commands
to prepare replication, start replication, get replication error, and
stop replication. Introduce new API to execute these qmp commands.
After suspending the primary vm, get the dirty bitmap of the secondary vm,
and send the pages that are dirty on either the primary or the secondary to the secondary.
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
a. call callbacks resume/checkpoint/suspend while secondary vm
status is consistent with primary
b. send dirty pfn list to primary when checkpoint under colo
c. send store gfn and console gfn to xl before resuming secondary vm
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Wen Congyang [Tue, 15 Dec 2015 08:05:41 +0000 (16:05 +0800)]
primary vm suspend/resume/checkpoint code
We will do the following things again and again:
1. Suspend primary vm
a. Suspend primary vm
b. do postsuspend
c. Read CHECKPOINT_SVM_SUSPENDED sent by secondary
2. Checkpoint
a. Write emulator xenstore data and emulator context
b. Write checkpoint end record
3. Resume primary vm
a. Read CHECKPOINT_SVM_READY from slave
b. Do preresume
c. Resume primary vm
d. Read CHECKPOINT_SVM_RESUMED from slave
4. Wait for a new checkpoint
a. Wait for a new checkpoint (not implemented)
b. Send CHECKPOINT_NEW to slave
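The primary-side loop above can be modelled as a small Python sketch. This is only an illustration of the message ordering, not the libxl implementation: the function names, the queue-based back channel, and the record strings written in step 2 are placeholders chosen for this example.

```python
from collections import deque

# Control messages named in the commit message above.
SVM_SUSPENDED = "CHECKPOINT_SVM_SUSPENDED"
SVM_READY = "CHECKPOINT_SVM_READY"
SVM_RESUMED = "CHECKPOINT_SVM_RESUMED"
NEW = "CHECKPOINT_NEW"


def expect(back_channel, wanted):
    """Pop the next message sent by the secondary and check its type."""
    got = back_channel.popleft()
    if got != wanted:
        raise RuntimeError(f"expected {wanted}, got {got}")


def primary_checkpoint_round(back_channel, sent, log):
    """Run one iteration of the primary-side loop (hypothetical model)."""
    # 1. Suspend primary vm
    log.append("suspend")
    log.append("postsuspend")
    expect(back_channel, SVM_SUSPENDED)
    # 2. Checkpoint (record names are placeholders)
    sent.append("EMULATOR_XENSTORE_DATA")
    sent.append("EMULATOR_CONTEXT")
    sent.append("CHECKPOINT_END")
    # 3. Resume primary vm
    expect(back_channel, SVM_READY)
    log.append("preresume")
    log.append("resume")
    expect(back_channel, SVM_RESUMED)
    # 4. Ask for a new checkpoint (the wait itself is not modelled)
    sent.append(NEW)
```

Any message arriving out of order raises an error in this model, which mirrors the strict lock-step nature of the handshake.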
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Wen Congyang [Tue, 15 Dec 2015 08:45:17 +0000 (16:45 +0800)]
secondary vm suspend/resume/checkpoint code
Secondary vm is running in colo mode. So we will do
the following things again and again:
1. Resume secondary vm
a. Send CHECKPOINT_SVM_READY to master.
b. If it is not the first resume, call libxl__checkpoint_devices_preresume().
c. If it is the first resume(resume right after live migration),
- call libxl__xc_domain_restore_done() to build the secondary vm.
- enable secondary vm's logdirty.
- call libxl__domain_resume() to resume secondary vm.
- call libxl__checkpoint_devices_setup() to setup checkpoint devices.
d. Send CHECKPOINT_SVM_RESUMED to master.
2. Wait for a new checkpoint
a. Call libxl__checkpoint_devices_commit().
b. Read CHECKPOINT_NEW from master.
3. Suspend secondary vm
a. Suspend secondary vm.
b. Call libxl__checkpoint_devices_postsuspend().
c. Send CHECKPOINT_SVM_SUSPENDED to master.
4. Checkpoint
a. Read emulator xenstore data and emulator context
b. REC_TYPE_CHECKPOINT_END
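The secondary-side steps above can be sketched in Python as follows. This is a hypothetical model of the ordering only: the list-based channels, the log labels, and the emulator record strings are assumptions made for the example, and the real work is done by the libxl functions named in the list.

```python
def secondary_round(first_resume, from_primary, to_primary, log):
    """Run one iteration of the secondary-side loop (hypothetical model)."""
    # 1. Resume secondary vm
    to_primary.append("CHECKPOINT_SVM_READY")
    if first_resume:
        # Resume right after live migration: build and start the vm.
        log.extend(["restore_done", "enable_logdirty",
                    "domain_resume", "devices_setup"])
    else:
        log.append("devices_preresume")
    to_primary.append("CHECKPOINT_SVM_RESUMED")
    # 2. Wait for a new checkpoint
    log.append("devices_commit")
    if from_primary.pop(0) != "CHECKPOINT_NEW":
        raise RuntimeError("expected CHECKPOINT_NEW")
    # 3. Suspend secondary vm
    log.extend(["suspend", "devices_postsuspend"])
    to_primary.append("CHECKPOINT_SVM_SUSPENDED")
    # 4. Checkpoint: consume records up to and including the end marker
    while from_primary.pop(0) != "CHECKPOINT_END":
        pass
```

The first call would use `first_resume=True`; every later call passes `False`, so only the preresume hook runs.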
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools/libxl: add back channel support to read stream
This is used by primary to read records sent by secondary.
Note: The function libxl__stream_read_checkpoint_state() will be used
in later patches called "secondary vm suspend/resume/checkpoint code" and
"primary vm suspend/resume/checkpoint code".
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools/libxl: add back channel support to write stream
Add back channel support to write stream. If the write stream is
a back channel stream, this means the write stream is used by
Secondary to send some records back.
Note: The function libxl__stream_write_checkpoint_state() will be used
in later patches called "secondary vm suspend/resume/checkpoint code" and
"primary vm suspend/resume/checkpoint code".
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
libxc/migration: export read_record for common use
read_record() could be used by primary to read dirty bitmap
record sent by secondary under COLO.
When used by xc save side, we need to pass the backchannel fd
instead of ctx->fd to read_record(), so we added a fd param to
it.
No functional changes.
CC: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Wen Congyang [Mon, 14 Dec 2015 07:24:17 +0000 (15:24 +0800)]
tools/libx{l,c}: add back channel to libxc
In COLO mode, both VMs are running, and are considered in sync if the
visible network traffic is identical. After some time, they fall out of
sync.
At this point, the two VMs have definitely diverged. Let's call the
primary dirty bitmap set A, while the secondary dirty bitmap set B.
Sets A and B are different.
Under normal migration, the page data for set A will be sent from the
primary to the secondary.
However, the set difference B - A (the one in B but not in A, let's
call this C) is out-of-date on the secondary (with respect to the
primary) and will not be sent by the primary (to secondary), as it
was not memory dirtied by the primary. The secondary needs C page data
to reconstruct an exact copy of the primary at the checkpoint.
The secondary cannot calculate C as it doesn't know A. Instead, the
secondary must send B to the primary, at which point the primary
calculates the union of A and B (let's call this D), which is all the
pages dirtied by either the primary or the secondary, and sends all page
data covered by D.
In the general case, D is a superset of both A and B. Without the
backchannel dirty bitmap, a COLO checkpoint can't reconstruct a valid
copy of the primary.
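The set reasoning above can be checked with a tiny Python sketch. The page numbers are made up for illustration, and `pages_to_send` is a hypothetical helper name, not a libxc function.

```python
def pages_to_send(primary_dirty, secondary_dirty):
    """Return D, the union of the primary (A) and secondary (B) dirty sets."""
    return primary_dirty | secondary_dirty


# Example dirty page frame numbers (arbitrary values).
A = {1, 2, 3}          # dirtied on the primary
B = {3, 4}             # dirtied on the secondary
C = B - A              # stale on the secondary, but unknown to the primary
D = pages_to_send(A, B)
```

Here `C` is `{4}`: a page the primary would never send under normal migration, yet one that `D` covers once the secondary has reported `B` over the back channel.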
We transfer the dirty bitmap on libxc side, so we need to introduce back
channel to libxc.
Note: it is different from the paper. We change the original design to
the current one, according to our following concerns:
1. The original design needs extra memory on the Secondary host. When there
are multiple backups on one host, the memory cost is high.
2. The memory cache code would be another 1k+ lines, which would make the
review more time consuming.
Note: this patch merely adds new parameters to various prototypes and
functions. The new parameters are used in later patch called
"libxc/restore: send dirty pfn list to primary when checkpoint under
COLO".
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> CC: Ian Campbell <Ian.Campbell@citrix.com> CC: Ian Jackson <Ian.Jackson@eu.citrix.com> CC: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
tools/libxl: Add back channel to allow migration target send data back
In COLO mode, secondary needs to send the following data to primary:
1. In libxl
Secondary sends the following CHECKPOINT_CONTEXT to primary:
CHECKPOINT_SVM_SUSPENDED, CHECKPOINT_SVM_READY and CHECKPOINT_SVM_RESUMED
2. In libxc
Secondary sends the dirty pfn list to primary
But the io_fd can only be written on the primary, and can only be read on the
secondary. Save recv_fd in domain_suspend_state, and send_fd in
domain_create_state. Extend libxl_domain_create_restore API, add a
send_fd param to it. Add LIBXL_HAVE_CREATE_RESTORE_SEND_FD to indicate
the API change.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Secondary vm is running in COLO mode, we need to send secondary
vm's dirty page information to primary host at checkpoint, so we
have to enable qemu logdirty on secondary.
libxl__domain_suspend_common_switch_qemu_logdirty() enables
qemu logdirty. But it uses libxl__domain_save_state, and calls
libxl__xc_domain_saverestore_async_callback_done() before exiting,
so it cannot be used for the secondary vm.
Update libxl__domain_suspend_common_switch_qemu_logdirty() to
introduce a new API libxl__domain_common_switch_qemu_logdirty().
This API only uses libxl__logdirty_switch, and calls
lds->callback before exiting. This new API will be used by the patch
"secondary vm suspend/resume/checkpoint code".
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> CC: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Wen Congyang [Mon, 14 Dec 2015 07:08:13 +0000 (15:08 +0800)]
tools/libxl: introduction of libxl__qmp_restore to load qemu state
In normal migration, the qemu state is passed to qemu as a parameter.
With COLO, secondary vm is running. So we will do the following steps
at every checkpoint:
1. suspend both primary vm and secondary vm
2. sync the state
3. resume both primary vm and secondary vm
Primary will send qemu's state in step2, and secondary's qemu should
read it and restore the state before it is resumed. We can not pass the
state to qemu as a parameter because secondary QEMU is already started
at this point, so we introduce libxl__qmp_restore() to do it.
Signed-off-by: Yang Hongyang <hongyang.yang@easystack.cn> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com> Signed-off-by: Changlong Xie <xiecl.fnst@cn.fujitsu.com> Cc: Anthony Perard <anthony.perard@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Jan Beulich [Thu, 31 Mar 2016 12:52:04 +0000 (14:52 +0200)]
x86/HVM: fix forwarding of internally cached requests
Forwarding entire batches to the device model when an individual
iteration of them got rejected by internal device emulation handlers
with X86EMUL_UNHANDLEABLE is wrong: The device model would then handle
all iterations, without the internal handler getting to see any past
the one it returned failure for. This causes misbehavior in at least
the MSI-X and VGA code, which want to see all such requests for
internal tracking/caching purposes. But note that this does not apply
to buffered I/O requests.
This in turn means that the condition in hvm_process_io_intercept() of
when to crash the domain was wrong: Since X86EMUL_UNHANDLEABLE can
validly be returned by the individual device handlers, we mustn't
blindly crash the domain if such occurs on other than the initial
iteration. Instead we need to distinguish hvm_copy_*_guest_phys()
failures from device specific ones, and then the former need to always
be fatal to the domain (i.e. also on the first iteration), since
otherwise we again would end up forwarding a request to qemu which the
internal handler didn't get to see.
The adjustment should be okay even for stdvga's MMIO handling:
- if it is not caching then the accept function would have failed so we
won't get into hvm_process_io_intercept(),
- if it issued the buffered ioreq then we only get to the p->count
reduction if hvm_send_ioreq() actually encountered an error (in which
we don't care about the request getting split up).
Also commit 4faffc41d ("x86/hvm: limit reps to avoid the need to handle
retry") went too far in removing code from hvm_process_io_intercept():
When there were successfully handled iterations, the function should
continue to return success with a clipped repeat count.
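The repeat-count handling described above can be illustrated with a short Python model. It is a sketch under simplifying assumptions, not the hypervisor code: `process_io_intercept` and the handler callback are hypothetical, the constant values are illustrative, and the fatal hvm_copy_*_guest_phys() failure path is deliberately left out.

```python
X86EMUL_OKAY = 0           # illustrative values for this sketch
X86EMUL_UNHANDLEABLE = 1


def process_io_intercept(handler, count):
    """Run up to `count` iterations; return (rc, iterations_handled)."""
    for i in range(count):
        rc = handler(i)
        if rc != X86EMUL_OKAY:
            if i == 0:
                # Nothing was handled internally: report the failure so
                # the whole request can be forwarded.
                return rc, 0
            # Some iterations succeeded: report success with a clipped
            # repeat count, so the remainder is retried later and the
            # internal handler still gets to see it.
            return X86EMUL_OKAY, i
    return X86EMUL_OKAY, count
```

With a handler that rejects iteration 3 of 8, the model returns success with a count of 3 instead of handing all 8 iterations to the device model.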
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Paul Durrant [Thu, 31 Mar 2016 12:49:50 +0000 (14:49 +0200)]
x86/hvm/viridian: zero and check vcpu context _pad field
Commit 57844631 "save APIC assist vector" added an extra field to the
viridian vcpu context save record. This field was only a uint8_t and
so an extra _pad field was also added to pad up to the next 64-bit
boundary.
This patch makes sure that _pad field is zeroed on save and checked
for zero on restore. This prevents a potential leak of information
from the stack and a compatibility check against future use of the
space occupied by the _pad field.
The _pad field is zeroed as a side effect of making use of a C99 struct
initializer for the other fields. This patch also modifies the domain
context save code to use the same mechanism.
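As a language-neutral illustration of "zero on save, check on restore", here is a Python sketch using the standard `struct` module. The record layout (one 64-bit field, one byte for the vector, seven padding bytes) and the function names are assumptions made for this example; the real record is a C struct in the hypervisor.

```python
import struct

# Assumed layout: 64-bit value, 8-bit vector, 7 bytes of padding.
RECORD = struct.Struct("<QB7s")


def save_record(apic_assist, vector):
    """Serialise the record, always writing the padding as zeroes."""
    return RECORD.pack(apic_assist, vector, bytes(7))


def load_record(blob):
    """Deserialise the record, rejecting any non-zero padding."""
    apic_assist, vector, pad = RECORD.unpack(blob)
    if any(pad):
        raise ValueError("non-zero _pad field in save record")
    return apic_assist, vector
```

Rejecting non-zero padding on load is what keeps the space usable later: a future format that stores data there will be refused by code that does not understand it, rather than silently misread.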
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
libxc/libxl/python/xenstat/ocaml: Use new XEN_VERSION hypercall
We change the xen_version libxc code to use the new hypercall,
which of course means every user in the code base has to
be changed over.
It is important to note that the xc_version_op has a different
return semantic than the previous one. It returns negative
values on error (like the old one), but it also returns
a positive value on success (unlike the old one). The positive
value is the number of bytes copied in.
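A caller-side check for this return convention might look like the following Python sketch. `check_version_op` is a hypothetical helper written for this illustration; it only encodes the rule stated above (negative means error, otherwise the value is a byte count).

```python
import os


def check_version_op(ret):
    """Interpret a return value: raise on error, else return bytes copied."""
    if ret < 0:
        # Negative values are errors; this sketch assumes they are
        # negated errno codes.
        raise OSError(-ret, os.strerror(-ret))
    return ret
```

The practical consequence is that callers must test for `ret < 0` rather than `ret != 0`, since a successful call no longer returns zero.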
Note that both Ocaml and xenstat use tabs instead of four
spaces so they look quite odd.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> [for the Ocaml stubs] Acked-by: George Dunlap <george.dunlap@eu.citrix.com> [xenctx bits] Acked-by: Wei Liu <wei.liu2@citrix.com>
HYPERCALL_version_op. New hypercall mirroring XENVER_ but sane.
This hypercall mirrors the XENVER_ in that it has similar functionality.
However it is designed differently:
- No compat layer. The data structures are the same size on 32
as on 64-bit.
- The hypercall accepts three arguments - the command, pointer to
a buffer, and the length of the buffer.
- Each sub-op can be "probed" for size: if the buffer is NULL, it
returns the size of the buffer that will be needed.
- Subops can complete even if the buffer is too small - truncated
data will be filled and hypercall will return -ENOBUFS.
- VERSION_commandline, VERSION_changeset are privileged.
- There is no XENVER_compile_info equivalent.
- The hypercall can return -EPERM and toolstack/OSes are expected
to deal with it. However there are three subops: XEN_VERSION_version,
XEN_VERSION_platform_parameters and XEN_VERSION_get_features
that will always return a value as guests cannot survive without them.
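The probing and truncation rules in the list above can be modelled in a few lines of Python. This is a behavioural sketch only: `version_op` here is a stand-in function, the payload is passed in directly instead of being selected by a command, and `None` plays the role of a NULL buffer.

```python
import errno


def version_op(payload, buf, length):
    """Model one sub-op: `payload` is the data the sub-op would return."""
    if buf is None:
        # Probe: report how large a buffer the caller needs.
        return len(payload)
    n = min(len(payload), length)
    buf[:n] = payload[:n]            # copy as much as fits
    if n < len(payload):
        return -errno.ENOBUFS        # truncated data was still filled in
    return n                         # success: number of bytes copied


# Typical two-step use: probe first, then fetch with a right-sized buffer.
data = b"4.7-unstable"
needed = version_op(data, None, 0)
out = bytearray(needed)
copied = version_op(data, out, len(out))
```

The two-call pattern avoids guessing buffer sizes, while the -ENOBUFS case still leaves the caller with usable truncated data.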
While combining some of the common code between XENVER_ and VERSION_,
take the liberty of moving pae_extended_cr3 into the x86 area.
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> [XSM bits] Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jonathan Davies [Wed, 30 Mar 2016 16:06:39 +0000 (16:06 +0000)]
oxenstored: allow compilation prior to OCaml 3.12.0
Commit 363ae55c8 used an OCaml feature called record field punning. This broke
the build on compilers prior to OCaml 3.12.0.
This patch makes no semantic change but now uses backwards-compatible syntax.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Shannon Zhao [Wed, 30 Mar 2016 10:09:00 +0000 (12:09 +0200)]
arm: Add a hypercall for device mmio mapping
Platform or AMBA device MMIO needs to be mapped to Dom0 on ARM. But when
booting with ACPI, Xen can't get the MMIO region due to the lack of an
AML interpreter to parse the DSDT table. Therefore, let Dom0 call a
hypercall to map the MMIO region when it adds the devices.
Here we add a new map space like the XEN_DOMCTL_memory_mapping to map
mmio region for Dom0. Also add a helper to combine the
xsm_add_to_physmap and XENMAPSPACE_dev_mmio space check together.
Shannon Zhao [Wed, 30 Mar 2016 10:10:00 +0000 (12:10 +0200)]
arm/acpi: Permit MMIO access of Xen unused devices for Dom0
First, permit full MMIO capabilities for Dom0. Then deny MMIO
access to devices used by Xen, such as UART, GIC, SMMU. Currently, only
the MMIO access to the UART and GIC regions is denied. Other devices
used by Xen can be added later when they are supported.
Shannon Zhao [Wed, 30 Mar 2016 10:14:00 +0000 (12:14 +0200)]
arm/acpi: Configure SPI interrupt type and route to Dom0 dynamically
Interrupt information is described in DSDT and is not available at the
time of booting. Check whether access to the interrupt is permitted, set
the interrupt type, and route it to the guest dynamically. This is done
only for SPIs and Dom0.
Shannon Zhao [Wed, 30 Mar 2016 10:12:00 +0000 (12:12 +0200)]
arm/acpi: Create min DT stub for Dom0
Create a DT for Dom0 for ACPI-case only. DT contains minimal required
information such as Dom0 bootargs, initrd, efi description table and
address of uefi memory table.
Also document the device tree bindings of the "hypervisor" and
"hypervisor/uefi" nodes.
Shannon Zhao [Wed, 30 Mar 2016 10:11:00 +0000 (12:11 +0200)]
arm/acpi: Prepare XSDT table for Dom0
Copy and modify XSDT table before passing it to Dom0. Replace the entry
value of the copied table. Add a new entry for STAO table as well. And
keep entry value of other reused tables unchanged.
Shannon Zhao [Wed, 30 Mar 2016 10:10:00 +0000 (12:10 +0200)]
arm/acpi: Estimate memory required for acpi/efi tables
Estimate the memory required for loading acpi/efi tables in Dom0. Align
the length of each table to 64 bits. Allocate the pages to store
the newly created EFI and ACPI tables and free these pages when
destroying the domain.
Merge branch 'pin' of https://github.com/jgross1/xen into staging
* 'pin' of https://github.com/jgross1/xen:
libxl: add force option for xl vcpu-pin
libxl: print message how to recover from xl cpupool-cpu-remove errors
libxc: do some retries in xc_cpupool_removecpu() for EBUSY case
All patches have Acked-by and Reviewed-by tags.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Paul Durrant [Tue, 29 Mar 2016 15:55:23 +0000 (16:55 +0100)]
tools/misc/xen-hvmctx: fix the build
Commit 78c5f59e "x86/hvm/viridian: save APIC assist vector" changed
the name of a field in the viridian vcpu save record. Unfortunately this
record has a decode function in xen-hvmctx and so it no longer builds.
This patch fixes the field name in xen-hvmctx and also adds a decode of
the additional field that was added to the save record.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
There should be 6 instead of 7 arguments now for tmem_control(),
which was done in commit 54a51b1766fd433b95e63834eb15d4b1f70271de
"tmem: Remove xc_tmem_control mystical arg3" which missed
this change.
Signed-off-by: Zhigang Wang <zhigang.x.wang@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Tue, 29 Mar 2016 15:16:23 +0000 (17:16 +0200)]
spinlock: improve spin_is_locked() for recursive locks
Recursive locks know their current owner, and since we use the function
solely to determine whether a particular lock is being held by the
current CPU (which so far has been an imprecise check), make it actually
check the owner for recursively acquired locks.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Quan Xu <quan.xu@intel.com> Acked-by: Tim Deegan <tim@xen.org>
Shuai Ruan [Tue, 29 Mar 2016 15:15:57 +0000 (17:15 +0200)]
x86/xsaves: calculate comp_offsets[] based on xcomp_bv
The previous patch used all available features to calculate comp_offsets.
This is wrong. This patch fixes the bug by calculating comp_offsets
based on the xcomp_bv of the current guest.
Also, comp_offsets should take alignment into consideration.
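A minimal Python sketch of the calculation, assuming the compacted XSAVE layout where extended components follow the 512-byte legacy area and the 64-byte header, and where a component flagged as aligned starts on a 64-byte boundary. The component sizes and alignment flags below are made-up inputs (on real hardware they come from CPUID leaf 0xD), and `comp_offsets` is a hypothetical function name.

```python
FXSAVE_SIZE = 512      # legacy x87/SSE area
XSAVE_HDR_SIZE = 64    # XSAVE header


def comp_offsets(xcomp_bv, sizes, aligned):
    """Map component index -> offset for components enabled in xcomp_bv."""
    offsets = {}
    offset = FXSAVE_SIZE + XSAVE_HDR_SIZE
    for i in sorted(sizes):
        if i < 2:
            continue                 # components 0 and 1 live in the legacy area
        if not (xcomp_bv >> i) & 1:
            continue                 # not present in this guest's image
        if aligned.get(i):
            offset = (offset + 63) & ~63
        offsets[i] = offset
        offset += sizes[i]
    return offsets


# Illustrative (not real) component sizes and alignment flags.
sizes = {2: 256, 3: 40, 9: 8}
aligned = {9: True}
all_enabled = comp_offsets(0b10_0000_1100, sizes, aligned)
subset = comp_offsets(0b10_0000_0100, sizes, aligned)
```

The two results differ for component 9, which is the point of the fix: the offset of a component depends on which earlier components the guest's xcomp_bv actually includes, not on everything the host supports.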
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Shuai Ruan <shuai.ruan@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 29 Mar 2016 15:15:15 +0000 (17:15 +0200)]
ns16550: enable Pericom controller support
Other than the controllers supported so far, multiple port Pericom
boards map all of their ports via BAR0, which requires a number of
adjustments: Instead of tracking "max_bars" we now flag whether all
ports use BAR0, and whether to expect a port-I/O or MMIO resource. As
a result pci_uart_config() now gets handed a port index, which it then
maps into a BAR index or an offset into BAR0 depending on the bar0
flag.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Tue, 29 Mar 2016 15:14:43 +0000 (17:14 +0200)]
ns16550: store pointer to config parameters for PCI
Subsequent changes will want to use this pointer.
This makes the enable_ro structure member redundant, so it gets dropped
at once.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Tim Deegan <tim@xen.org>
Shannon Zhao [Tue, 29 Mar 2016 12:26:57 +0000 (14:26 +0200)]
hvm/params: add a new delivery type for event-channel in HVM_PARAM_CALLBACK_IRQ
This new delivery type, which is for ARM, shares the same value with
HVM_PARAM_CALLBACK_TYPE_VECTOR, which is for x86.
val[15:8] is the flag; val[7:0] is a PPI.
In the flag, bit 8 indicates whether the interrupt mode is edge(1) or level(0), and
bit 9 indicates whether the interrupt polarity is active low(1) or high(0).
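The bit layout described above can be decoded as in this Python sketch. Only the low 16 bits mentioned in the commit message are handled; `decode_callback_ppi` and the returned dictionary keys are hypothetical names for illustration.

```python
def decode_callback_ppi(val):
    """Split the low 16 bits of the parameter into PPI number and flags."""
    ppi = val & 0xFF                 # val[7:0]
    flag = (val >> 8) & 0xFF         # val[15:8]
    return {
        "ppi": ppi,
        "edge": bool(flag & 0x1),        # bit 8 of val: edge(1) / level(0)
        "active_low": bool(flag & 0x2),  # bit 9 of val: low(1) / high(0)
    }
```

For example, a value whose low 16 bits are `0x031f` describes PPI 31, edge triggered and active low.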
Signed-off-by: Shannon Zhao <shannon.zhao@linaro.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 29 Mar 2016 12:26:33 +0000 (14:26 +0200)]
x86/hvm/viridian: fix APIC assist page leak
Commit a6f2cdb6 "keep APIC assist page mapped..." introduced a page
leak because it relied on viridian_vcpu_deinit() always being called
to release the page mapping. This does not happen in the case of a normal
domain shutdown.
This patch fixes the problem by introducing a new function,
viridian_domain_deinit(), which will iterate through the vCPUs and
release any page mappings still present.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Paul Durrant [Tue, 29 Mar 2016 12:26:03 +0000 (14:26 +0200)]
x86/hvm/viridian: save APIC assist vector
If any vcpu has a pending APIC assist when the domain is suspended
then the vector needs to be saved. If this is not done then it's
possible for the vector to remain pending in the vlapic ISR
indefinitely after resume.
This patch adds code to save the APIC assist vector value in the
viridian vcpu save record. This means that the record is now zero-
extended on load and, because this implies a loaded value of
zero means nothing is pending (for backwards compatibility with
hosts not implementing APIC assist), the rest of the viridian APIC
assist code is adjusted to treat a zero value in this way. A
check has therefore been added to viridian_start_apic_assist() to
prevent the enlightenment being used for vectors < 0x10 (which
are illegal for an APIC).
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
I nominate Anthony Perard as qemu-xen co-maintainer. He has been doing a
lot of QEMU work over the years and in fact he is the original author of
the Xen enablement code in upstream QEMU.
As qemu-xen co-maintainer, he could help me manage the qemu-xen trees
and promptly backport all the relevant commits from upstream QEMU.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Tue, 29 Mar 2016 12:24:26 +0000 (14:24 +0200)]
x86: fix information leak on AMD CPUs
The fix for XSA-52 was wrong, and so was the change synchronizing that
new behavior to the FXRSTOR logic: AMD's manuals explicitly state that
writes to the ES bit are ignored, and it instead gets calculated from
the exception and mask bits (it gets set whenever there is an unmasked
exception, and cleared otherwise). Hence we need to follow that model
in our workaround.
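The rule quoted above (ES is derived from the exception and mask bits) can be written down as a one-line predicate. This Python sketch assumes the standard x87 layout, with the six exception flags in bits 0 to 5 of the status word and the matching mask bits in bits 0 to 5 of the control word; `effective_es` is a hypothetical name and the sketch is not the workaround code itself.

```python
X87_EXCEPTION_BITS = 0x3F   # IE, DE, ZE, OE, UE, PE


def effective_es(fsw, fcw):
    """Return True if any pending x87 exception is unmasked."""
    return bool(fsw & ~fcw & X87_EXCEPTION_BITS)
```

Note that the ES bit already stored in the status word plays no part in the result, matching the statement that writes to it are ignored.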
This is CVE-2016-3158 / CVE-2016-3159 / XSA-172.
[xen/arch/x86/xstate.c:xrstor: CVE-2016-3158]
[xen/arch/x86/i387.c:fpu_fxrstor: CVE-2016-3159]
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Thu, 24 Mar 2016 17:17:24 +0000 (17:17 +0000)]
xl: Return an error on failed cd-insert
This makes xl more useful in scripts.
The strange thing about this is that the internal cd_insert function
*already* returned something appropriate, and cd-eject was using it,
but cd-insert wasn't.
Also:
* Rework cd_insert to return EXIT_FAILURE and EXIT_SUCCESS rather than
magic constants
* Use 'r' for non-libxl return code, as specified in CODING_STYLE
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Thu, 24 Mar 2016 17:17:23 +0000 (17:17 +0000)]
xl: Make set_memory_target return an error code on failure
Also move the rc -> shell code translation into set_memory_max() to
make the two functions consistent with each other, and with other
similar examples in xl_cmdimpl.c
Change a 'long long' to "int64_t" while we're at it.
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
George Dunlap [Thu, 24 Mar 2016 17:17:22 +0000 (17:17 +0000)]
libxl: Remove pointless hypercall from libxl_set_memory_target
There's no obvious reason for the call to xc_domain_getinfolist -- all
it seems to be doing is checking that the domain exists; but if it
doesn't exist, it will have already failed by this point.
NB that this will change the return value for libxl_set_memory_target:
now it will return 0 on success, rather than returning 1 (which was
the previous behavior). This is more in line with expected behavior,
and also allows the caller to distinguish between success and other
failure modes (some of which also return 1).
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Doug Goldstein [Wed, 16 Mar 2016 19:18:43 +0000 (14:18 -0500)]
xsm: move FLASK_AVC_STATS to Kconfig
Have Kconfig set CONFIG_FLASK_AVC_STATS and prefix all uses with CONFIG_
to use the Kconfig variable.
Note that this will preserve the original behavior - which is that you
cannot disable FLASK_AVC_STATS. Enterprising users can disable
it without any compilation issues.
Signed-off-by: Doug Goldstein <cardoe@cardoe.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Doug Goldstein [Wed, 16 Mar 2016 19:18:42 +0000 (14:18 -0500)]
xsm: only define XSM_MAGIC in xsm.h
Rather than have XSM_MAGIC set in the global xen/config.h and set in
xsm.h if it's unset, just set it once in xsm.h since it's only used in
files that already include xsm.h
Signed-off-by: Doug Goldstein <cardoe@cardoe.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Juergen Gross [Thu, 24 Mar 2016 17:44:50 +0000 (18:44 +0100)]
libxl: add force option for xl vcpu-pin
In order to be able to undo a vcpu pin override in case of a kernel
driver error add a flag "-f" to the "xl vcpu-pin" command forcing the
hypervisor to undo the override.
Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 24 Mar 2016 17:44:50 +0000 (18:44 +0100)]
libxl: print message how to recover from xl cpupool-cpu-remove errors
An error occurring when calling "xl cpupool-cpu-remove" might leave
the system in a state where a cpu is neither completely free nor in
a cpupool. This can easily be repaired by adding the cpu via
"xl cpupool-cpu-add" to the cpupool where it was removed from before.
Print a message telling the user this in case of an error.
Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Thu, 24 Mar 2016 17:44:50 +0000 (18:44 +0100)]
libxc: do some retries in xc_cpupool_removecpu() for EBUSY case
The hypervisor might return EBUSY when trying to remove a cpu from a
cpupool when a domain running in this cpupool has pinned a vcpu
temporarily. Do some retries in this case; perhaps the situation
clears up.
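The retry idea can be sketched in Python as below. The retry count, the delay, and the `remove_cpu` callback are all assumptions for illustration; the commit message does not state how many retries libxc performs or how long it waits between them.

```python
import errno
import time


def remove_cpu_with_retry(remove_cpu, retries=5, delay=0.0):
    """Call remove_cpu() until it stops returning -EBUSY or retries run out."""
    rc = -errno.EBUSY
    for attempt in range(retries):
        rc = remove_cpu()
        if rc != -errno.EBUSY:
            break                    # success or a different error
        if attempt + 1 < retries:
            time.sleep(delay)        # give the temporary pinning time to end
    return rc
```

Only EBUSY is retried; any other error is returned immediately, since retrying would not help.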
Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Thu, 24 Mar 2016 15:07:30 +0000 (16:07 +0100)]
x86/vLAPIC: vlapic_reg_write() can't fail
It only ever returns X86EMUL_OKAY, so to make this more obvious change
the function return type to void. Re-structure vlapic_apicv_write() at
once to have only a single path leading to vlapic_reg_write().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 24 Mar 2016 15:06:22 +0000 (16:06 +0100)]
x86: annotate special features
Some bits in a featureset are not simply an indication of new functionality,
and require special handling.
APIC, OSXSAVE and OSPKE are fast-forwards of other pieces of state;
IA32_APIC_BASE.EN, CR4.OSXSAVE and CR4.OSPKE. Xen will take care of filling
these appropriately at runtime.
FDP_EXCP_ONLY and NO_FPU_SEL are bits indicating reduced functionality in the
x87 pipeline. The effects of these cannot be hidden from the guest, so the
host values will always be provided.
HTT, X2APIC and CMP_LEGACY indicate how to interpret other cpuid leaves. In
most cases, the toolstack value will be used (with the expectation that these
flags will match the other provided topology information). However with cpuid
masking, the host values are presented as masking cannot influence what the
guest sees in the dependent leaves.
HYPERVISOR is unconditionally set in the PV ABI, but follows the toolstack
setting for HVM guests.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andrew Cooper [Thu, 24 Mar 2016 15:05:37 +0000 (16:05 +0100)]
x86: mask out unknown features from Xen's capabilities
If Xen doesn't know about a feature, it is unsafe for use and should be
deliberately hidden from Xen's capabilities.
This doesn't make a practical difference yet, but will make a difference
later when the guest featuresets are seeded from the host featureset.
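As a rough sketch of the masking, assuming a featureset is a list of 32-bit words and that a parallel list holds the bits Xen knows about (both names below are hypothetical):

```python
def mask_unknown(host_featureset, known_features):
    """Keep only the host feature bits that also appear in known_features.

    Both arguments are lists of 32-bit words. Words without a known mask
    are cleared entirely, since nothing in them is known to be safe.
    """
    return [
        word & (known_features[i] if i < len(known_features) else 0)
        for i, word in enumerate(host_featureset)
    ]
```

For example, `mask_unknown([0xFF, 0x3], [0x0F])` gives `[0x0F, 0]`.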
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andrew Cooper [Thu, 24 Mar 2016 15:03:44 +0000 (16:03 +0100)]
x86: collect more cpuid feature leaves
New words are:
* 0x80000007.edx - Contains Invariant TSC
* 0x80000008.ebx - Newly used for AMD Zen processors
In addition, replace some open-coded ITSC and EFRO manipulation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andrew Cooper [Thu, 24 Mar 2016 15:02:37 +0000 (16:02 +0100)]
x86: script to automatically process featureset information
This script consumes include/public/arch-x86/cpufeatureset.h and generates a
single include/asm-x86/cpuid-autogen.h containing all the processed
information.
It currently generates just FEATURESET_NR_ENTRIES. Future changes will
generate more information.
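A minimal sketch of such a generator is shown below. It assumes that features are declared in the header as `XEN_CPUFEATURE(NAME, word*32+bit)`; that declaration shape, the regular expression and the function names are assumptions made for illustration, not the actual script.

```python
import re

# Assumed declaration shape: XEN_CPUFEATURE(NAME, <word>*32+<bit>)
FEATURE_RE = re.compile(
    r"XEN_CPUFEATURE\(\s*(\w+)\s*,\s*(\d+)\s*\*\s*32\s*\+\s*(\d+)\s*\)"
)


def featureset_nr_entries(header_text):
    """Number of 32-bit words needed to hold every declared feature."""
    words = [int(m.group(2)) for m in FEATURE_RE.finditer(header_text)]
    return max(words) + 1 if words else 0


def render_autogen(header_text):
    """Text of the generated header, holding only FEATURESET_NR_ENTRIES."""
    return "#define FEATURESET_NR_ENTRIES %d\n" % featureset_nr_entries(header_text)
```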
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Dario Faggioli [Thu, 24 Mar 2016 14:57:30 +0000 (15:57 +0100)]
sched: add .init_pdata hook to the scheduler interface
with the purpose of decoupling the allocation phase and
the initialization one, for per-pCPU data of the schedulers.
This makes it possible to perform the initialization later
in the pCPU bringup/assignment process, when more information
(for instance, the host CPU topology) is available. This,
for now, is important only for Credit2, but it can well be
useful to other schedulers.
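The split between allocation and initialization can be pictured like this. The class, the dictionary layout and the use of a topology map to pick a runqueue are all hypothetical; only the two-phase shape follows the description above.

```python
class SketchScheduler:
    """Illustrative scheduler with separate alloc and init phases."""

    def alloc_pdata(self, cpu):
        # Allocation only: nothing here depends on the host topology.
        return {"cpu": cpu, "runqueue": None}

    def init_pdata(self, pdata, topology):
        # Initialization runs later, once the topology is known.
        # topology maps a pCPU number to a runqueue id (an assumption).
        pdata["runqueue"] = topology[pdata["cpu"]]
```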
Dario Faggioli [Thu, 24 Mar 2016 14:56:56 +0000 (15:56 +0100)]
sched: fix locking when allocating an RTDS pCPU
as doing that includes changing the scheduler lock
mapping for the pCPU itself, and the correct way
of doing that is:
- take the lock that the pCPU is using right now
(which may be the lock of another scheduler);
- change the mapping of the lock to the RTDS one;
- release the lock (the one that has actually been
taken!)
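The three steps can be sketched with ordinary locks. The `PCPU` class and function name are hypothetical, and the re-check loop is an added assumption to cope with the mapping changing between reading it and taking the lock.

```python
import threading


class PCPU:
    def __init__(self, lock):
        # The lock this pCPU is currently mapped to.
        self.schedule_lock = lock


def switch_to_rtds_lock(pcpu, rtds_lock):
    # Step 1: take the lock the pCPU is using right now.
    while True:
        taken = pcpu.schedule_lock
        taken.acquire()
        if pcpu.schedule_lock is taken:
            break
        taken.release()  # the mapping changed under us, try again
    # Step 2: remap the pCPU to the RTDS lock.
    pcpu.schedule_lock = rtds_lock
    # Step 3: release the lock that was actually taken, not the new mapping.
    taken.release()
```

Releasing `pcpu.schedule_lock` in the last step would be wrong, because by then it refers to the RTDS lock, which was never acquired.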
Jim Fehlig [Tue, 15 Mar 2016 01:14:15 +0000 (01:14 +0000)]
tools: Restrict configuration of qemu processes
Commit 6ef823fd added '-nodefaults' to the qemu args created by
libxl, which is a good step in restricting qemu's default
configuration. This change takes another step by adding
-no-user-config, which ignores any user-provided config files in
sysconfdir. Together, -nodefaults and -no-user-config allow Xen
to avoid unknown and uncontrolled qemu configuration.
Both options are also added to the qemu invocation in the
xen-qemu-dom0-disk-backend systemd service file.
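As an illustration only, a helper that makes sure both options are present in an argument list might look like this. The function name is hypothetical, and libxl builds its argument list in C rather than like this.

```python
def restricted_qemu_args(base_args):
    """Return a copy of base_args with both restricting options present."""
    args = list(base_args)
    for flag in ("-nodefaults", "-no-user-config"):
        if flag not in args:
            args.append(flag)
    return args
```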
Signed-off-by: Jim Fehlig <jfehlig@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jonathan Davies [Thu, 17 Mar 2016 17:51:15 +0000 (17:51 +0000)]
oxenstored: log request and response during transaction replay
During a transaction replay, the replayed requests and the new responses are
logged in the same way as the original requests and the original responses.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 17 Mar 2016 17:51:14 +0000 (17:51 +0000)]
oxenstored: replay transaction upon conflict
The existing transaction merge algorithm keeps track of the least upper bound
(longest common prefix) of all the nodes which have been read and written, and
will re-combine two stores which have disjoint upper bounds. This works well for
small transactions but causes unnecessary conflicts for ones that span a large
subtree, such as the following ones used by the xapi toolstack:
* VM start: creates /vm/... /vss/... /local/domain/...
The least upper bound of this transaction is / and so all
these transactions conflict with everything.
* Device hotplug: creates /local/domain/0/... /local/domain/n/...
The least upper bound of this transaction is /local/domain so
all these transactions conflict with each other.
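The two examples can be reproduced with a small sketch of the least upper bound. The function names are hypothetical, and reading "disjoint" as "neither bound is an ancestor of the other" is an assumption.

```python
def least_upper_bound(paths):
    """Longest common prefix of the given paths, compared per component."""
    split = [p.strip("/").split("/") for p in paths]
    common = []
    for parts in zip(*split):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return "/" + "/".join(common)


def _is_ancestor(a, b):
    # True when path a equals b or contains b in its subtree.
    return a == "/" or a == b or b.startswith(a + "/")


def may_conflict(paths_a, paths_b):
    """Two transactions may conflict unless their bounds are disjoint."""
    a, b = least_upper_bound(paths_a), least_upper_bound(paths_b)
    return _is_ancestor(a, b) or _is_ancestor(b, a)
```

With this reading, a VM start touching /vm, /vss and /local/domain has the bound / and so may conflict with any other transaction.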
If the existing merge algorithm cannot merge and commit, we attempt
a /replay/ of the failed transaction against the new store.
When we replay the requests we check whether the response sent to the client is
the same as during the first attempt at the transaction. If the responses are
all the same then the transaction replay can be committed. If any differ then
the transaction replay must be aborted and the client must retry.
This algorithm uses the intuition that the transactions made by the toolstack
are designed to be for separate domains, and should fundamentally not conflict
in the sense that they don't read or write any shared keys. By replaying the
transaction on the server side we do what the client would have to do anyway,
only we can do it quickly without allowing any other requests to interfere.
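The replay check can be sketched as follows, using a dictionary as a stand-in for the store and a toy request format. Both are assumptions made for illustration; oxenstored itself is written in OCaml and works on its own store type.

```python
def apply_request(store, request):
    """Toy request handler: ("read", key) or ("write", key, value)."""
    op, key, *value = request
    if op == "read":
        return store.get(key, "ENOENT")
    if op == "write":
        store[key] = value[0]
        return "OK"
    raise ValueError(op)


def replay(store, log, apply_request):
    """Re-run logged (request, response) pairs against a copy of store.

    Returns the resulting store if every response matches the one seen
    in the first attempt, or None if the replay must be aborted.
    """
    candidate = dict(store)
    for request, original_response in log:
        if apply_request(candidate, request) != original_response:
            return None
    return candidate
```

Working on a copy means an aborted replay leaves the new store untouched, and the client then retries as it would have done anyway.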
Performing 300 parallel simulated VM start and shutdowns without this code:
300 parallel starts and shutdowns: 268.92
Performing 300 parallel simulated VM start and shutdowns with this code:
300 parallel starts and shutdowns: 3.80
Signed-off-by: Dave Scott <dave@recoil.org> Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 17 Mar 2016 17:51:13 +0000 (17:51 +0000)]
oxenstored: move functions that process simple operations
Separate the functions which process operations that can be done as part of a
transaction. Specifically, these operations are: read, write, rm, getperms,
setperms, getdomainpath, directory, mkdir.
Also split function_of_type into two functions: one for processing the simple
operations and one for processing the rest.
This will help allow replay of transactions, allowing us to invoke the functions
that process the simple operations as part of the processing of transaction_end.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 17 Mar 2016 17:51:12 +0000 (17:51 +0000)]
oxenstored: keep track of each transaction's operations
A list of (request, response) pairs from the operations performed within the
transaction will be useful to support transaction replay.
Since this consumes memory, the number of requests per transaction must not be
left unbounded. Hence a new quota for this is introduced. This quota, configured
via the configuration key 'quota-maxrequests', limits the size of transactions
initiated by domUs.
After the maximum number of requests has been exhausted, any further requests
will result in EQUOTA errors. The client may then choose to end the transaction;
a successful commit will result in the retention of only the prior requests.
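A bounded request log could look like the sketch below. The class and exception names are hypothetical; the exception stands in for the EQUOTA error returned to the client.

```python
class QuotaExceeded(Exception):
    """Stand-in for the EQUOTA error returned to the client."""


class Transaction:
    def __init__(self, max_requests):
        # max_requests plays the role of the 'quota-maxrequests' setting.
        self.max_requests = max_requests
        self.operations = []

    def record(self, request, response):
        """Remember a (request, response) pair, up to the quota."""
        if len(self.operations) >= self.max_requests:
            raise QuotaExceeded("EQUOTA")
        self.operations.append((request, response))
```

A request rejected this way is not recorded, so a later commit keeps only the requests made before the quota was reached.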
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 17 Mar 2016 17:51:11 +0000 (17:51 +0000)]
oxenstored: refactor request processing
Encapsulate the request in a record that is passed from do_input to
process_packet and input_handle_error.
This will be helpful when keeping track of the requests made as part of a
transaction.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 17 Mar 2016 17:51:10 +0000 (17:51 +0000)]
oxenstored: remove some unused parameters
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 17 Mar 2016 17:51:09 +0000 (17:51 +0000)]
oxenstored: refactor putting response on wire
Previously, the functions reply_{ack,data,data_or_ack} and input_handle_error
put the response on the wire by invoking Connection.send_{ack,reply,error}.
Instead, these functions now return a value indicating what needs to be put on
the wire, and that action is done by a send_response function called
afterwards.
This refactoring gives us a chance to store the value of the response, useful
for replaying transactions.
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com> Reviewed-by: Euan Harris <euan.harris@citrix.com> Acked-by: David Scott <dave@recoil.org>
Jan Beulich [Wed, 23 Mar 2016 10:04:52 +0000 (11:04 +0100)]
x86: drop raw_write_cr4() again
The bypassing of the memory cache is, namely in the context of the
32-bit PV SMEP/SMAP workaround series (as Andrew validly points out),
making the overall correctness more difficult to verify. Hence go
back to uniform writes.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 23 Mar 2016 10:02:07 +0000 (11:02 +0100)]
x86: remap text/data/bss with appropriate permissions
c/s cf39362 "x86: use 2M superpages for text/data/bss mappings" served two
purposes; to map the primary code and data with appropriate pagetable
permissions (rather than unilaterally RWX), and to reduce the TLB pressure.
The extra alignment exposed a SYSLinux issue, and was partly reverted by c/s 0b8a172 "x86: partially revert use of 2M mappings for hypervisor image".
This change reinstates the pagetable permission improvements while avoiding
the 2M alignment issue.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
So that we have a nice mechanism to figure out the upper
bounds of bug.frames and also catch compiler errors in case
one tries to use a higher frame number.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: Jan Beulich <jbeulich@suse.com>
--- Cc: Stefano Stabellini <stefano.stabellini@citrix.com> Cc: Julien Grall <julien.grall@arm.com> Cc: Keir Fraser <keir@xen.org> Cc: Jan Beulich <jbeulich@suse.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com>
v3: First time included.
v4: Add BUG_FRAME check also in the assembler version of the macro.
v5: Add Acks, make BUILD_BUG_ON checks look correct. Position the
BUGFRAME_NR properly. Reposition the BUGFRAME_NR again.
---
xsm/xen_version: Add XSM for most of xen_version hypercall
Most XENVER_* sub-ops now have an XSM check.
The subop for XENVER_commandline is now a privileged operation.
To not break guests we still return a string - but it is
just '<denied>\0'.
The XENVER_[version|platform_parameters|get_features] sub-ops will
always return a value to the guest.
The rest: XENVER_[extraversion|capabilities|page_size|
guest_handle|changeset| compile_info] behave as before -
allowed by default for all guests if using the XSM default
policy or with the dummy one. And if the system admin
wants to curtail access to some of them - they can do
that now with a non-default XSM policy.
Also we add a local variable block.
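The resulting behaviour can be summarised in a sketch. The function and set names are hypothetical, the policy callable stands in for the XSM hook, and returning '<denied>' for sub-ops other than commandline when a custom policy refuses them is an assumption.

```python
ALWAYS_ALLOWED = {"version", "platform_parameters", "get_features"}
PRIVILEGED = {"commandline"}
DENIED = "<denied>"


def default_policy(op, is_privileged_caller):
    # Default or dummy policy: only commandline needs privilege.
    return op not in PRIVILEGED or is_privileged_caller


def xen_version(op, real_value, is_privileged_caller, policy=default_policy):
    """Return what the guest sees for a given XENVER_* sub-op."""
    if op in ALWAYS_ALLOWED:
        return real_value  # excluded from the XSM check
    if policy(op, is_privileged_caller):
        return real_value
    # Stated for commandline; assumed here for any other refused sub-op.
    return DENIED
```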
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
--- Cc: Daniel De Graaf <dgdegra@tycho.nsa.gov> Cc: Ian Jackson <ian.jackson@eu.citrix.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Wei Liu <wei.liu2@citrix.com>
v2: Do XSM check for all the XENVER_ ops.
- Add empty data conditions.
- Return <denied> for priv subops.
- Move extraversion from priv to normal. Drop the XSM check
for the non-priv subops.
v3:
- Add +1 for strlen(xen_deny()) to include NULL. Move changeset,
compile_info to non-priv subops.
- Remove the \0 on xen_deny()
- Add new XSM domain for xenver hypercall. Add all subops to it.
- Remove the extra line, Add Ack from Daniel
v4:
- Rename the XSM from xen_version_op to xsm_xen_version.
Prefix the types with 'xen' to distinguish it from another
hypercall performing similar operation. Removed Ack from Daniel
as it was so large. Add local variable block.
v5:
- Make XENVER_platform_parameters,get_features,version be excluded
from the XSM check per Jan's review. Add BUILD_BUG_CHECK and fix
odd line removals. Remove stray changes and fix spelling.