Structure assignment was used to update an IRTE, which is not atomic when
the whole IRTE needs to be updated. This is unsafe if an interrupt arrives
during the update. Furthermore, no bug or warning would be reported when
this happened.
This patch introduces two variants, atomic and non-atomic, to update an
IRTE. For the initialization and release cases the non-atomic variant is
used; for other cases (such as reprogramming to set irq affinity) the atomic
variant is used. If the caller requests an atomic update but we can't
satisfy it, we raise a bug.
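A minimal sketch of the intended split (illustrative only, not the exact
patch: it assumes a 128-bit 'val' view of struct iremap_entry and a
cmpxchg16b(ptr, &old, &new) helper on CX16-capable CPUs):

static void update_irte(struct iremap_entry *entry,
                        const struct iremap_entry *new_ire, bool atomic)
{
    if ( cpu_has_cx16 )
    {
        __uint128_t old = entry->val;

        /* Hardware never changes an IRTE behind our back, so this succeeds. */
        cmpxchg16b(&entry->val, &old, &new_ire->val);
    }
    else
    {
        /* Raise a bug if an atomic update was requested but can't be met. */
        BUG_ON(atomic);
        *entry = *new_ire;
    }
}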
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> [x86]
VT-d: introduce new fields in msi_desc to track binding with guest interrupt
msi_msg_to_remap_entry() is buggy when the live IRTE is in posted format. It
wrongly inherits the 'im' field, meaning the IRTE remains in posted format,
but updates all the other fields to remapping format.
Two situations lead to the above issue. One is that some callers really want
to change the IRTE to remapped format. The other is that some callers only
want to update the MSI message (e.g. to set MSI affinity), because they are
not aware that the MSI is bound to a guest interrupt. We should suppress the
update in the second situation. To distinguish them, the straightforward
approach would be to let callers specify which IRTE format they want to
update to. That isn't feasible, as making all callers aware of the binding
with a guest interrupt would cause a far more complicated change (including
to the interfaces exposed to IOAPIC and MSI). Also, some calls happen in
interrupt context, where we can't acquire d->event_lock to read struct
hvm_pirq_dpci.
This patch introduces two new fields in msi_desc to track binding with a guest
interrupt such that msi_msg_to_remap_entry() can get the binding and update
IRTE accordingly. After that change, pi_update_irte() can utilize
msi_msg_to_remap_entry() to update IRTE to posted format.
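Roughly, the tracking amounts to two extra fields in struct msi_desc (field
names as used by this series; placement within the struct is illustrative):

struct msi_desc {
    /* ... existing fields ... */
    uint8_t gvec;                   /* guest vector; valid iff pi_desc != NULL */
    const struct pi_desc *pi_desc;  /* posted-interrupt descriptor of the vCPU
                                     * the guest interrupt is bound to */
};

With these in place, msi_msg_to_remap_entry() can build a posted-format IRTE
whenever pi_desc is set, instead of blindly producing a remapped-format one.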
Signed-off-by: Feng Wu <feng.wu@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
passthrough: don't migrate pirq when it is delivered through VT-d PI
When a vCPU is migrated to another pCPU, pt irqs bound to this vCPU might
also need migration, as an optimization to reduce IPIs between pCPUs. When
VT-d PI is enabled, the interrupt vector is recorded in a main-memory
resident data structure and a notification whose destination is decided by
NDST is generated. NDST is properly adjusted during vCPU migration, so pirqs
directly injected to the guest need not be migrated.
This patch adds an indicator, @posted, to show whether the pt irq is
delivered through VT-d PI. It also fixes a bug where hvm_migrate_pirq()
accessed pirq_dpci->gmsi.dest_vcpu_id without checking the pirq_dpci's type.
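A sketch of the shape of the corrected check (the helper name here is
hypothetical; field names follow the description above):

/* Should this pirq be migrated along with vCPU v? */
static bool pirq_needs_migration(const struct hvm_pirq_dpci *pirq_dpci,
                                 const struct vcpu *v)
{
    /* gmsi is only valid for MSI-type pirqs -- this is the missing check. */
    if ( !(pirq_dpci->flags & HVM_IRQ_DPCI_MACH_MSI) )
        return false;

    /* A posted pirq follows the vCPU via NDST; nothing to migrate. */
    if ( pirq_dpci->gmsi.posted )
        return false;

    return pirq_dpci->gmsi.dest_vcpu_id == v->vcpu_id;
}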
Signed-off-by: Chao Gao <chao.gao@intel.com>
[jb: remove an extraneous check from hvm_migrate_pirq()] Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:24 +0000 (13:37 +0200)]
x86: add multiboot2 protocol support for relocatable images
Add multiboot2 protocol support for relocatable images. Only GRUB2 with
"multiboot2: Add support for relocatable images" patch understands
that feature. Older multiboot protocol (regardless of version)
compatible loaders ignore it and everything works as usual.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:37:02 +0000 (13:37 +0200)]
x86/boot: rename sym_phys() to sym_offs()
This way the macro name better describes its function.
Currently it is used to calculate a symbol's offset
relative to the beginning of the Xen image mapping.
However, the value returned by sym_offs() for a given
symbol is not always equal to its physical address.
There is no functional change.
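For reference, the macro is essentially (sketch):

/* Offset of a symbol from the beginning of the Xen image mapping. */
#define sym_offs(sym) ((sym) - __XEN_VIRT_START)

which only equals the symbol's physical address when the image is loaded at
its default, non-relocated position.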
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Daniel Kiper [Fri, 7 Apr 2017 11:36:32 +0000 (13:36 +0200)]
x86: make Xen early boot code relocatable
Every multiboot protocol (regardless of version) compatible image must
specify its load address (in the ELF or multiboot header), and multiboot
protocol compatible loaders have to load the image at that address. However,
there is no guarantee that the requested memory region (in case of Xen it
starts at 2 MiB and ends at ~5 MiB), where the image should initially be
loaded, is RAM and is free (legacy BIOS platforms are merciful to Xen, but I
found at least one EFI platform on which the Xen load address conflicts with
EFI boot services; it is a Dell PowerEdge R820 with the latest firmware). To
cope with that problem we must make the Xen early boot code relocatable and
help the boot loader relocate the image properly by suggesting allowed
address ranges instead of requesting a specific load address as is done
right now. This patch does the former. It does not add the multiboot2
protocol interface, which is done in the "x86: add multiboot2 protocol
support for relocatable images" patch.
This patch changes the following things:
- the %esi register is used as storage for the Xen image load base address;
it is mostly unused in early boot code and is preserved across C function
calls in 32-bit mode,
- %fs is used as the base for Xen data relative addressing in 32-bit code
where possible; during error printing %esi is used for that purpose instead,
because it is not always possible to properly and efficiently
initialize %fs.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Daniel Kiper [Fri, 7 Apr 2017 11:35:32 +0000 (13:35 +0200)]
x86: change default load address from 1 MiB to 2 MiB
Subsequent patches introducing relocatable early boot code play with
page tables using 2 MiB huge pages. If the load address is not aligned to
2 MiB then code touching such page tables must have special cases for the
start and end of the Xen image memory region. So, let's make life easier
and move the default load address from 1 MiB to 2 MiB. This way the page
table code will be nice and easy. Hence, there is a chance that it will be
less error prone too... :-)))
Additionally, drop first 2 MiB mapping from Xen image mapping.
It is no longer needed.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Doug Goldstein <cardoe@cardoe.com>
Jan Beulich [Fri, 7 Apr 2017 09:04:02 +0000 (09:04 +0000)]
x86emul: correct compat mode system descriptor handling
There are some oddities to take care of here - see the code comment.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 7 Apr 2017 10:08:34 +0000 (12:08 +0200)]
x86/HVM: don't leak PFEC_implicit to guests
Doing so may not only confuse them, but will - on VMX - lead to
VMRESUME failures. Add respective ASSERT()s where the fields get set
to guard against future similar issues (or - in the restore case -
fail the operation). In that latter code at once convert the mis-used
gdprintk() to dprintk(), as the vCPU of interest is not "current".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
boot allocator: use arch helper for virt_to_mfn on DIRECTMAP_VIRT region
On ARM platforms with NUMA, while initializing the second memory node, a
panic is triggered from init_node_heap() when virt_to_mfn() is called for a
DIRECTMAP_VIRT region address, because the DIRECTMAP_VIRT region address is
not mapped.
The virt_to_mfn() check here is used to know whether the max MFN is part of
the direct mapping. The max MFN is found by calling virt_to_mfn() on the end
address of the DIRECTMAP_VIRT region, which is DIRECTMAP_VIRT_END.
On ARM64, all RAM is currently direct mapped in Xen and virt_to_mfn() uses
the hardware for address translation, so if the virtual address is not
mapped a translation fault is raised.
In this patch, instead of calling virt_to_mfn(), arch helper
arch_mfn_in_directmap() is introduced.
On ARM64 this arch helper will return true, because currently all RAM
is direct mapped in Xen.
On ARM32, only a limited amount of RAM, called xenheap, is always mapped
and DIRECTMAP_VIRT region is not mapped. Hence return false.
For x86 this helper does virt_to_mfn.
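Hedged sketches of the per-arch behaviour described above (exact bounds
handling is illustrative):

/* arm64: all RAM is direct mapped; arm32: the region isn't mapped at all. */
static inline bool arch_mfn_in_directmap(unsigned long mfn)
{
#ifdef CONFIG_ARM_64
    return true;
#else
    return false;
#endif
}

/* x86: compare against the MFN at the end of the direct map. */
static inline bool arch_mfn_in_directmap(unsigned long mfn)
{
    unsigned long eva = min(DIRECTMAP_VIRT_END, HYPERVISOR_VIRT_END);

    return mfn <= (virt_to_mfn(eva - 1) + 1);
}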
Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien.grall@arm.com>
x86/vpmu_intel: handle SMT consistently for programmable and fixed counters
The patch introduces a macro FIXED_CTR_CTRL_ANYTHREAD_MASK and uses it to
mask the .AnyThread bit for all counters in the IA32_FIXED_CTR_CTRL MSR, in
all versions of Intel Architectural Performance Monitoring. Masking the
.AnyThread bit is necessary for two reasons:
1. We need to be consistent in the implementation. We disable the .AnyThread
bit in programmable counters (regardless of the version) by masking bit 21
in IA32_PERFEVTSELx. (See code snippet below from vpmu_intel.c)
/* Masks used for testing whether and MSR is valid */
#define ARCH_CTRL_MASK (~((1ull << 32) - 1) | (1ull << 21))
But we leave it enabled in fixed function counters for version 3. Removing the
condition disables the bit in fixed function counters regardless of the version,
which is consistent with what is done for programmable counters.
2. We don't want to expose event counts from another guest (or hypervisor)
which can happen if .AnyThread bit is not masked and a VCPU is only scheduled
to run on one of the hardware threads in a hyper-threaded CPU.
Also, note that Intel SDM discourages the use of .AnyThread bit in virtualized
environments (per section 18.2.3.1 AnyThread Counting and Software Evolution).
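For illustration, the masking arithmetic: each fixed counter owns a 4-bit
control field in IA32_FIXED_CTR_CTRL and bit 2 of each field is AnyThread
(the helper name below is hypothetical):

#define FIXED_CTR_CTRL_BITS            4
#define FIXED_CTR_CTRL_ANYTHREAD_MASK  0x4ULL

static uint64_t mask_anythread(uint64_t fixed_ctr_ctrl, unsigned int nr_fixed)
{
    unsigned int i;

    for ( i = 0; i < nr_fixed; i++ )
        fixed_ctr_ctrl &=
            ~(FIXED_CTR_CTRL_ANYTHREAD_MASK << (FIXED_CTR_CTRL_BITS * i));

    return fixed_ctr_ctrl;
}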
Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
x86/io: move the list of guest to machine IO ports out of domain_iommu
There's no reason to store that list inside of the domain_iommu struct; the
forwarding of guest IO ports into machine IO ports is not tied to the
presence of an IOMMU.
Move it inside of the hvm_domain struct instead.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/io: rename misleading dpci_ prefixed functions to g2m_
The dpci_ prefix used on those IO handlers is misleading: there's nothing
PCI specific in them, they simply map a guest IO port into a machine
(physical) IO port. They don't specifically trap the PCI IO port range
(0xcf8/0xcfc) in any way.
Rename them to use the g2m_ prefix in order to avoid this confusion.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Tamas K Lengyel [Fri, 7 Apr 2017 10:01:10 +0000 (12:01 +0200)]
altp2m: introduce external-only and limited use-cases
Currently setting altp2mhvm=1 in the domain configuration allows access to the
altp2m interface for both in-guest and external privileged tools. This poses
a problem for use-cases where only external access should be allowed, requiring
the user to compile Xen with XSM enabled to be able to appropriately restrict
access.
In this patch we deprecate the altp2mhvm domain configuration option and
introduce the altp2m option, which allows specifying if by default the altp2m
interface should be external-only or limited. The information is stored in
HVM_PARAM_ALTP2M which we now define with specific XEN_ALTP2M_* modes.
If external mode is selected, the XSM check is shifted to use XSM_DM_PRIV
type check, thus restricting access to the interface by the guest itself. Note
that we keep the default XSM policy untouched. Users of XSM who wish to enforce
external mode for altp2m can do so by adjusting their XSM policy directly,
as this domain config option does not override an active XSM policy.
Also, as part of this patch we adjust the hvmop handler to require
HVM_PARAM_ALTP2M to be of a type other than disabled for all ops. This has been
previously only required for get/set altp2m domain state, all other options
were gated on altp2m_enabled. Since altp2m_enabled only gets set during set
altp2m domain state, this change introduces no new requirements to the other
ops but makes it more clear that it is required for all ops.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Signed-off-by: Sergej Proskurin <proskurin@sec.in.tum.de> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Jan Beulich <jbeulich@suse.com>
Wei Liu [Thu, 6 Apr 2017 18:33:36 +0000 (19:33 +0100)]
xen: use a dummy file in C99 header check
The check builds a header file as if it were a C file. Clang doesn't like
the idea of having dead code in a C file. The check as-is fails on Clang
with unused function warnings.
Use a dummy file like the C++ header check to fix this.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
vgic: refuse irq migration when one is already in progress
When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.
This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.
It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.
arm: remove irq from inflight, then change physical affinity
This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.
When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.
Andrew Cooper [Tue, 7 Mar 2017 16:20:51 +0000 (16:20 +0000)]
tools/insn-fuzz: Fix assertion failures in x86_emulate_wrapper()
c/s 92cf67888 "x86/emul: Hold x86_emulate() to strict X86EMUL_EXCEPTION
requirements" was appropriate for the hypervisor, but the fuzzer stubs didn't
conform to the stricter requirements. AFL is very quick to discover this.
Extend the fuzzing harness exception logic to raise exceptions appropriately.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 27 Mar 2017 09:37:35 +0000 (10:37 +0100)]
tools/insn-fuzz: Provide IA32_DEBUGCTL consistently to the emulator
x86_emulate()'s is_branch_step() performs a speculative read of
IA32_DEBUGCTL, but doesn't squash exceptions should they arise. In reality,
this MSR is always available.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 20 Mar 2017 19:17:33 +0000 (19:17 +0000)]
tools/insn-fuzz: Fix a stability bug in afl-clang-fast mode
The fuzzing harness conditionally disables hooks to test error paths in the
emulator. However, fuzz_emulops is a static structure.
c/s 69f4633 "tools/insn-fuzz: Support AFL's afl-clang-fast mode" introduced
persistent mode, but because fuzz_emulops is static, the clobbering of hooks
accumulates over repeated input, meaning that previous corpora influence the
execution over the current corpus.
Move the partially clobbered struct x86_emulate_ops into struct fuzz_state,
which is re-initialised in full on each call to LLVMFuzzerTestOneInput().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 20 Mar 2017 18:33:59 +0000 (18:33 +0000)]
tools/insn-fuzz: Avoid making use of static data
AFL has a measure of stability, where it passes the same corpus into the
fuzzing harness and observes whether the execution path changes from before.
Any instability in the fuzzing harness reduces its effectiveness, as an
observed crash may not reliably be caused by the original corpus.
In preparation to fix a stability bug, introduce struct fuzz_state, allocated
on the stack and passed around via struct x86_emulate_ctxt's data parameter.
Propagate ctxt into the helpers such as maybe_fail(), so the state can be
retrieved.
Move the previously-static data_{index,num} into struct fuzz_state.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 2 Mar 2017 18:36:54 +0000 (18:36 +0000)]
tools/insn-fuzz: Don't hit memcpy() for zero-length reads
For control-flow changes, the emulator needs to perform a zero-length
instruction fetch at the target offset. It also passes NULL for the
destination buffer, as there is no instruction stream to collect.
This trips up UBSAN when passed to memcpy(), as passing NULL is undefined
behaviour per the C spec (irrespective of passing a size of 0).
Special case these fetches in fuzz_insn_fetch() before reaching data_read().
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 29 Mar 2017 16:12:37 +0000 (17:12 +0100)]
MAINTAINERS: Move the x86 instruction emulator under x86 maintainership
Requested-by: Jan Beulich <JBeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Fri, 31 Mar 2017 13:49:45 +0000 (14:49 +0100)]
x86/emul: Require callers to provide LMA in the emulation context
Long mode (or not) influences emulation behaviour in a number of cases.
Instead of reusing the ->read_msr() hook to obtain EFER.LMA, require callers
to provide it directly.
This simplifies all long mode checks during emulation to a simple boolean
read, removing embedded msr reads. It also allows for the removal of a local
variable in the sysenter emulation block, and removes a latent bug in the
syscall emulation block where rc contains a non X86EMUL_* constant for a
period of time.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Fri, 31 Mar 2017 17:13:38 +0000 (18:13 +0100)]
x86/emul: Drop swint_emulate infrastructure
With the SVM injection logic capable of doing its own emulation, there is no
need for this hardware-specific assistance in the common emulator.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Thu, 30 Mar 2017 17:27:07 +0000 (17:27 +0000)]
x86/svm: Introduce svm_emul_swint_injection()
Software events require emulation in some cases on AMD hardware. Introduce
svm_emul_swint_injection() to perform this emulation if necessary in
svm_inject_event(), which will cope with any sources of event, rather than
just those coming from x86_emulate().
This logic mirrors inject_swint() in the x86 instruction emulator.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 31 Mar 2017 16:03:26 +0000 (17:03 +0100)]
x86/hvm: Fix segmentation logic for system segments
c/s c785f759718 "x86/emul: Prepare to allow use of system segments for memory
references" made alterations to hvm_virtual_to_linear_addr() to allow for the
use of system segments.
However, the determination of which segmentation mode to use was based on the
current address size from emulation.
In particular, it is wrong for system segment accesses while executing in a
compatibility mode code segment. When long mode is active, all system
segments have a 64-bit base, and this must not be truncated during the
calculation of the linear address. (Note that the presence and limit checks
for system segments behave the same, and are already uniformly applied in both
cases.)
Replace the existing addr_size parameter with active_cs, which gets used in
combination with current to work out which segmentation logic to use.
While here, also fix the determination of segmentation to use for vm86 mode,
which is a protected mode facility but which uses real mode segmentation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 31 Mar 2017 17:14:07 +0000 (17:14 +0000)]
x86/hvm: Correct long mode predicate
hvm_long_mode_enabled() tests for EFER.LMA, which is specifically different to
EFER.LME.
Rename it to match its behaviour, and have it strictly return a boolean value
(although all its callers already use it in implicitly-boolean contexts, so no
functional change).
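Assuming the new name is hvm_long_mode_active(), the predicate boils down to
(sketch; field layout as of this series):

/* Is long mode *active* (EFER.LMA), as opposed to merely enabled (LME)? */
#define hvm_long_mode_active(v) \
    (!!((v)->arch.hvm_vcpu.guest_efer & EFER_LMA))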
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Fri, 31 Mar 2017 15:06:07 +0000 (16:06 +0100)]
x86/hvm: Correct some address space terminology
The function hvm_translate_linear_addr() translates a virtual address to a
linear address, not a linear address to a physical address. Correct its name.
Both hvm_translate_virtual_addr() and hvmemul_virtual_to_linear() return a
linear address, but a parameter name of paddr is easily confused with paddr_t.
Rename it to linear, to clearly identify the address space, and for
consistency with hvm_virtual_to_linear_addr().
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Ronald Rojas [Wed, 5 Apr 2017 16:05:54 +0000 (17:05 +0100)]
golang/xenlight: Implement cpupool operations
Include some useful "Utility" functions:
- CpupoolFindByName
- CpupoolMakeFree
Still need to implement the following functions:
- libxl_cpupool_rename
- libxl_cpupool_cpuadd_node
- libxl_cpupool_cpuremove_node
- libxl_cpupool_movedomain
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Ronald Rojas <ronladred@gmail.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Include both constants and a Stringification for libxl_scheduler.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Ronald Rojas <ronladred@gmail.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Ronald Rojas [Wed, 5 Apr 2017 16:05:49 +0000 (17:05 +0100)]
golang/xenlight: Implement libxl_bitmap and helper operations
Implement Bitmap type, along with helper functions.
The Bitmap type is implemented internally in a way which makes it
easy to copy into and out of the C libxl_bitmap type.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Ronald Rojas <ronladred@gmail.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Ronald Rojas [Wed, 5 Apr 2017 16:05:48 +0000 (17:05 +0100)]
golang/xenlight: Implement libxl_domain_info and libxl_domain_unpause
Add calls for the following host-related functionality:
- libxl_domain_info
- libxl_domain_unpause
Include Golang version for the libxl_domain_info as
DomainInfo.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Ronald Rojas <ronladred@gmail.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Ronald Rojas [Wed, 5 Apr 2017 16:05:45 +0000 (17:05 +0100)]
golang/xenlight: Create stub package
Create a basic Makefile to build and install libxenlight Golang
bindings. Also add a stub package which only opens libxl context.
Include a global xenlight.Ctx variable which can be used as the
default context by the entire program if desired.
For now, return simple errors. Proper error handling will be
added in the next patch.
Until we get configure support, disable it by default. It can be
enabled either by adding "CONFIG_GOLANG=y" to .config, or adding it to
the 'make' line.
Signed-off-by: Ronald Rojas <ronladred@gmail.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@citrix.com>
Andrew Cooper [Thu, 30 Mar 2017 16:32:34 +0000 (17:32 +0100)]
docs: Clarify the expected behaviour of zero-content records
The sending side shouldn't send any data records which end up having
zero-length content, but the receiving side will need to tolerate such
records for compatibility purposes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
[ wei: fix typos etc ] Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 30 Mar 2017 16:32:32 +0000 (17:32 +0100)]
tools/libxc: Avoid generating inappropriate zero-content records
The code as written attempted to elide zero-content records, as such records
serve no purpose but come with a performance hit. Unfortunately, in the case
where the hypervisor reported max size is non-zero, but the actual size is
zero, the record is not elided.
This previously tripped up the sanity checks in the restore side of migration,
but as the underlying reasons for eliding the records in the first place are
still valid, fix the elision logic.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Thu, 30 Mar 2017 16:32:31 +0000 (17:32 +0100)]
tools/libxc: Tolerate specific zero-content records in migration v2 streams
The migration v2 save code was written to avoid sending data records with no
content, as such records serve no purpose but come with a performance hit.
The restore code sanity checks this expectation.
Under some circumstances (most notably, on AMD hardware with Debug Extensions,
and a PV guest kernel which is not using the feature), the save code would
generate a record with no content, which trips the sanity check in the restore
code.
As the stream is otherwise fine, tolerate these records and avoid failing the
migration.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Currently xc_translate_foreign_address() only checks for the PSE bit on
level 2 entries (that's 2 MB pages on x64 and 32-bit with PAE, and 4 MB
pages on 32-bit). But the Linux kernel sometimes uses 1 GB pages. This
patch fixes that, by checking the PSE bit on level 3 entries if the guest
has 4 translation levels (that means 64-bit guests only).
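For illustration, the extra case amounts to treating a level-3 entry with
PSE (bit 7) set as a 1 GiB mapping (helper and mask names are made up for
the example):

#define GB_OFFSET    ((1ULL << 30) - 1)             /* offset within a 1 GiB page */
#define ADDR_MASK_1G (((1ULL << 52) - 1) & ~GB_OFFSET)

/* Physical address of vaddr when the level-3 entry l3e has PSE set. */
static uint64_t l3_1g_paddr(uint64_t l3e, uint64_t vaddr)
{
    return (l3e & ADDR_MASK_1G) | (vaddr & GB_OFFSET);
}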
xen/arm: Handle guest external abort as guest SError
The guest generated external data/instruction aborts can be treated
as guest SErrors. We already have a handler to handle the SErrors,
so we can reuse this handler to handle guest external aborts.
xen/arm: Prevent slipping hypervisor SError to guest
If there is a pending SError while we're returning from a trap and the
SError handling option is "DIVERSE", we have to prevent slipping this
hypervisor SError to the guest. So we have to use dsb/isb to guarantee that
a pending hypervisor SError will be caught in the hypervisor before
returning to the guest.
In a previous patch, we set SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT in cpu_hwcaps
when the option is NOT "DIVERSE". This means we can use the alternative to
skip synchronizing SErrors for the other SError handling options.
Because we have unmasked the Abort/SError bit in a previous patch, we have
to disable Abort/SError before returning to the guest, as we have done
for IRQ.
xen/arm: Isolate the SError between the context switch of 2 vCPUs
If there is a pending SError while we are doing a context switch and the
SError handling option is "FORWARD", we have to guarantee that this SError
will be caught by the current vCPU, otherwise it will be caught by the next
vCPU and be forwarded to this wrong vCPU.
So we have to synchronize SErrors before switching to the next vCPU. But
this is only required by the "FORWARD" option. In this case we add a new
flag, SKIP_CTXT_SWITCH_SERROR_SYNC, to cpu_hwcaps to skip synchronizing
SErrors in the context switch for the other options. In the meantime, we
don't need to export serror_op access to other source files.
Because we have unmasked the Abort/SError bit in a previous patch, we have
to disable Abort/SError before doing the context switch, as we have done
for IRQ.
In previous patches, we have provided the ability to synchronize SErrors in
exception entries. But we haven't synchronized SErrors while returning to
the guest and while doing a context switch.
So we still have two risks:
1. Slipping hypervisor SErrors to the guest. For example, the hypervisor
triggers an SError while returning to the guest, but this SError may be
delivered after entering the guest. With the "DIVERSE" option, this SError
would be routed back to the guest and panic the guest. But actually, we
should crash the whole system due to this hypervisor SError.
2. Slipping previous guest SErrors to the next guest. With the "FORWARD"
option, if the hypervisor triggers an SError while context switching, this
SError may be delivered after switching to the next vCPU. In this case, the
SError will be forwarded to the next vCPU and may panic an incorrect guest.
So we have to introduce this macro to synchronize SErrors while returning
to the guest and while doing a context switch. In this macro, we use an
ASSERT to make sure the abort is unmasked, because we unmasked abort in the
entries, but we don't know whether someone will mask it in the future.
We also added a barrier to this macro to prevent the compiler from
reordering our asm volatile code.
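Roughly, the macro looks like this (a sketch; the ALTERNATIVE plumbing and
feature argument follow the arm alternative framework used elsewhere in this
series):

#define SYNCHRONIZE_SERROR(feat)                                 \
    do {                                                         \
        /* Synchronizing is pointless with aborts masked. */     \
        ASSERT(local_abort_is_enabled());                        \
        asm volatile(ALTERNATIVE("dsb sy; isb",                  \
                                 "nop; nop", feat)               \
                     : : : "memory");                            \
    } while ( 0 )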
xen/arm: Introduce a helper to check local abort is enabled
In a previous patch, we have unmasked the Abort/SError bit for Xen during
most of its running time. So in some use-cases, we have to check whether the
abort is enabled in the current context. For example, when we want to
synchronize SErrors, we have to confirm that the abort is enabled; otherwise
synchronizing SErrors is pointless.
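On arm64 the helper is essentially (sketch; arm32 tests PSR_ABT_MASK in the
CPSR instead):

static inline bool local_abort_is_enabled(void)
{
    uint64_t flags = READ_SYSREG(DAIF);

    return !(flags & PSR_ABT_MASK);
}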
xen/arm: Unmask the Abort/SError bit in the exception entries
Currently, the Abort/SError bit is masked in the Xen exception entries, so
Xen cannot capture any Abort/SError while it's running.
Now that Xen has the ability to handle Abort/SError, we should unmask the
Abort/SError bit by default to let Xen capture Abort/SError while it's
running.
But in order to avoid receiving nested asynchronous aborts, we don't unmask
the Abort/SError bit in hyp_error and trap_data_abort.
xen/arm: Replace do_trap_guest_serror with new helpers
We have introduced two helpers to handle guest/hyp SErrors:
do_trap_guest_serror and do_trap_hyp_serror. These handlers can take over
the role of the old handler and reduce the assembly code at the same time.
So we use these two helpers to replace it and drop it now.
xen/arm: Introduce new helpers to handle guest/hyp SErrors
Currently, ARM32 and ARM64 have different SError exception handlers. These
handlers include lots of code to check the SError handling options and code
to distinguish guest-generated SErrors from hypervisor SErrors.
The new helpers, do_trap_guest_serror and do_trap_hyp_serror, are wrappers
around __do_trap_serror with constant guest/hyp parameters. __do_trap_serror
moves the option checking code and SError checking code from assembly to C
source. This makes the code more readable and avoids placing check code in
too many places.
These two helpers only handle the following 3 types of SErrors:
1) Guest-generated SErrors that had been delivered in EL1 and then
forwarded to EL2.
2) Guest-generated SErrors that hadn't been delivered in EL1 before
trapping to EL2. Such an SError will be caught in EL2 as soon as
we unmask the PSTATE.A bit.
3) Hypervisor-generated native SErrors, which would be a bug.
In the new helpers, we have used the function "inject_vabt_exception",
which was disabled by "#if 0" before. Now we can remove the "#if 0" to make
this function available.
xen/arm: Move macro VABORT_GEN_BY_GUEST to common header
We want to move part of SErrors checking code from hyp_error assembly code
to a function. This new function will use this macro to distinguish the
guest SErrors from hypervisor SErrors. So we have to move this macro to
common header.
The VABORT_GEN_BY_GUEST macro uses the symbols abort_guest_exit_start
and abort_guest_exit_end. After we move this macro to a common header,
we need to make sure that the two symbols are visible to other source
files. Currently, they are declared .global in arm32/entry.S, but not
arm64/entry.S. Fix that.
xen/arm32: Use alternative to skip the check of pending serrors
We have provided an option to the administrator to determine how to handle
SErrors. To skip the check of pending SErrors in the conventional way, we
would have to read the option every time before we try to check for a
pending SError. This would add the overhead of checking the option at every
trap.
ARM32 supports the alternative patching feature. We can use an ALTERNATIVE
to avoid checking the option at every trap. We add a new cpufeature named
"SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT". This feature will be enabled when the
option is not diverse.
xen/arm64: Use alternative to skip the check of pending serrors
We have provided an option to the administrator to determine how to handle
SErrors. To skip the check of pending SErrors in the conventional way, we
would have to read the option every time before we try to check for a
pending SError. This would add the overhead of checking the option at every
trap.
ARM64 supports the alternative patching feature. We can use an ALTERNATIVE
to avoid checking the option at every trap. We add a new cpufeature named
"SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT". This feature will be enabled when the
option is not diverse.
xen/arm: Introduce a initcall to update cpu_hwcaps by serror_op
In the later patches of this series, we want to use the alternative
patching framework to avoid synchronizing serror_op in every entry and exit.
So we define a new cpu feature, "SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT", for
serror_op. When serror_op is not equal to SERROR_DIVERSE, this feature will
be set in cpu_hwcaps.
Currently, the default serror_op is SERROR_DIVERSE; if we want to change the
serror_op value we have to place the serror parameter on the command line.
It seems to be no problem to update cpu_hwcaps directly in the serror
parameter parsing function.
While the default option is diverse today, this may change in the future. So
we introduce this initcall to guarantee that cpu_hwcaps can be updated no
matter whether the serror parameter is placed on the command line or not.
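A sketch of such an initcall (names follow the option/feature names used in
this series):

static int __init update_serror_cpu_caps(void)
{
    if ( serror_op != SERROR_DIVERSE )
        cpus_set_cap(SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT);

    return 0;
}
__initcall(update_serror_cpu_caps);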
xen/arm: Introduce a command line parameter for SErrors/Aborts
In order to distinguish guest-generated SErrors from hypervisor-generated
SErrors we have to place SError checking code on every EL1 <-> EL2 path.
That will cause overhead on entries and exits due to dsb/isb.
However, not all platforms want to categorize SErrors. For example, a host
that is running with trusted guests: the administrator can confirm that all
guests running on the host will not trigger such SErrors. In this use-case,
we should provide some options to administrators to avoid categorizing
SErrors and thereby reduce the overhead of dsb/isb.
We provide the following 3 options to administrators to determine how the
hypervisor handles SErrors:
* `diverse`:
The hypervisor will distinguish guest SErrors from hypervisor SErrors.
The guest generated SErrors will be forwarded to guests, the hypervisor
generated SErrors will cause the whole system to crash.
It requires:
1. dsb/isb on all EL1 -> EL2 trap entries to categorize SErrors
correctly.
2. dsb/isb on EL2 -> EL1 return paths to prevent slipping hypervisor
SErrors to guests.
3. dsb/isb in context switch to isolate SErrors between 2 vCPUs.
* `forward`:
The hypervisor will not distinguish guest SErrors from hypervisor
SErrors. All SErrors will be forwarded to guests, except the SErrors
generated when the idle vCPU is running. The idle domain doesn't have
the ability to handle SErrors, so we have to crash the whole system when
we get SErrors with the idle vCPU. This option will avoid most overhead
of the dsb/isb, except the dsb/isb in context switch which is used to
isolate the SErrors between 2 vCPUs.
* `panic`:
The hypervisor will not distinguish guest SErrors from hypervisor SErrors.
All SErrors will crash the whole system. This option will avoid all
overhead of the dsb/isb pairs.
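A sketch of the boot parameter and its parsing (the parameter name "serrors"
and the registration via custom_param() are assumptions in this sketch; the
option/enum names follow the text above):

static enum {
    SERROR_DIVERSE,
    SERROR_FORWARD,
    SERROR_PANIC,
} serror_op;

static int __init parse_serrors_behavior(const char *str)
{
    if ( !strcmp(str, "forward") )
        serror_op = SERROR_FORWARD;
    else if ( !strcmp(str, "panic") )
        serror_op = SERROR_PANIC;
    else
        serror_op = SERROR_DIVERSE;    /* default: categorize SErrors */

    return 0;
}
custom_param("serrors", parse_serrors_behavior);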
xen/arm: Introduce a virtual abort injection helper
When a guest triggers async aborts, on most platforms such aborts will be
routed to the hypervisor. But we don't want the hypervisor to handle such
aborts, so we have to route them back to the guest.
This helper uses the HCR_EL2.VSE (HCR.VA for aarch32) bit to route such
aborts back to the guest. After updating the HCR_EL2.VSE bit in the vCPU
context, we write the value to HCR_EL2 immediately. In this case we don't
need to move the restoration of HCR_EL2 elsewhere, and it works regardless
of whether we get preempted.
If the guest PC had been advanced past SVC/HVC/SMC instructions before we
caught the SError in the hypervisor, we have to adjust the guest PC back to
the exact address at which the SError was generated.
Regarding HSR_EC_SVC32/64: even though we don't trap SVC32/64 today, we
would like them to be handled here. This will be useful when VM
introspection gains support for SVC32/64 trapping.
After updating the HCR_EL2.VSE bit of the vCPU's HCR_EL2, we write the value
to HCR_EL2 immediately. In this case we don't need to move the restoration
of HCR_EL2 to leave_hypervisor_tail, and it works regardless of whether we
get preempted.
This helper will be used by later patches in this series; we use #if 0 to
disable it in this patch temporarily, to avoid the compiler's warning about
an unused function.
xen/arm: Save HCR_EL2 when a guest took the SError
The HCR_EL2.VSE (HCR.VA for aarch32) bit can be used to generate a
virtual abort to guest. The HCR_EL2.VSE bit has a peculiar feature
of getting cleared when the guest has taken the abort (this is the
only bit that behaves as such in HCR_EL2 register).
This means that if we set the HCR_EL2.VSE bit to signal such an abort, we
must preserve it in the guest context until it disappears from HCR_EL2, at
which point it must be cleared from the context. This is achieved by reading
back from HCR_EL2 until the guest takes the fault.
If we preserved a pending VSE in the guest context, we have to restore it to
HCR_EL2 when context switching to this guest. This is achieved by writing
the saved HCR_EL2 value from the guest context back to the HCR_EL2 register
before returning to the guest. This had been done by the "Restore HCR_EL2
register" patch.
xen/arm: Avoid setting/clearing HCR_RW at every context switch
The HCR_EL2 flags for 64-bit and 32-bit domains are different. But when we
initialized HCR_EL2 for vcpu0 of Dom0 and for all vcpus of DomUs in
vcpu_initialise, we didn't know the domain's address size information. We
had to use compatible flags to initialize HCR_EL2, and then set HCR_RW for a
64-bit domain or clear HCR_RW for a 32-bit domain at every context switch.
But, after we added HCR_EL2 to the vcpu's context, this behaviour seems a
little fussy. We can update the HCR_RW bit in the vcpu's context as soon as
we know the domain's address size, to avoid setting/clearing HCR_RW at every
context switch.
xen/arm: Set and restore HCR_EL2 register for each vCPU separately
Different domains may have different HCR_EL2 flags. For example, a 64-bit
domain needs the HCR_RW flag but a 32-bit one does not. So we give each
domain a default HCR_EL2 value and save it in the vCPU's context.
The HCR_EL2 register has only one bit that can be updated automatically
without an explicit write (HCR_VSE). But we aren't using this bit currently,
so we can consider that the HCR_EL2 register will not be modified while the
guest is running. Therefore saving HCR_EL2 while the guest exits to the
hypervisor is not necessary. We just have to restore this register for each
vCPU while context switching.
p2m_restore_state, which is invoked during the context switch, already
includes a write of HCR_EL2. It updates the HCR_EL2.RW bit to tell the
hardware how to interpret the stage-1 page table, as the encodings are
different between AArch64 and AArch32. We can reuse this write to restore
HCR_EL2 for each vCPU. Of course, the value of each vCPU's HCR_EL2 should be
adjusted to have the proper HCR_EL2.RW bit in this function. In a later
patch of this series, we will set HCR_EL2.RW for each vCPU while the domain
is being created.
xen/arm: Introduce a helper to get default HCR_EL2 flags
We want to add HCR_EL2 register to Xen context switch. And each copy
of HCR_EL2 in vcpu structure will be initialized with the same set
of trap flags as the HCR_EL2 register. We introduce a helper here to
represent these flags to be reused easily.
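Sketch of such a helper (the precise trap-flag set here is illustrative;
notably HCR_RW is absent and gets set per domain later in the series):

static inline register_t get_default_hcr_flags(void)
{
    return (HCR_PTW | HCR_BSU_INNER | HCR_AMO | HCR_IMO | HCR_FMO | HCR_VM |
            HCR_TWE | HCR_TWI | HCR_TSC | HCR_TAC | HCR_SWIO | HCR_TIDCP |
            HCR_FB);
}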
xen/arm: Save ESR_EL2 to avoid using mismatched value in syndrome check
Xen does an exception syndrome check when some types of exception take
place in EL2. The syndrome check code reads the ESR_EL2 register directly,
but in some situations this register may be overridden by a nested
exception.
For example, if we re-enable IRQs before reading ESR_EL2, Xen may enter IRQ
exception mode and return to the processor with a clobbered ESR_EL2 (see ARM
ARM DDI 0487A.j D7.2.25).
In this case the guest exception syndrome has been overridden, and we would
check the syndrome for a guest sync exception with an incorrect ESR_EL2
value. So we want to save ESR_EL2 to cpu_user_regs as soon as the exception
takes place in EL2, to avoid using an incorrect syndrome value.
In order to save ESR_EL2, we add a 32-bit member hsr to cpu_user_regs. But
while saving registers in the trap entry, we use stp to save ELR and CPSR at
the same time through 64-bit general registers. If we kept this code, hsr
would be overridden by the upper 32 bits of CPSR. So adjust the code to use
str to save ELR in a separate instruction, and use stp to save CPSR and HSR
at the same time through 32-bit general registers.
This change affects the register restore on trap exit: we can't use ldp to
restore ELR and CPSR from the stack at the same time, so we have to use ldr
to restore them separately.
Merge branch 'staging' of git://xenbits.xen.org/people/konradwilk/xen into staging
* 'staging' of git://xenbits.xen.org/people/konradwilk/xen:
Introduce the pvcalls header
Introduce the Xen 9pfs transport header
xen: introduce a C99 headers check
ring.h: introduce macros to handle monodirectional rings with multiple req sizes
tmem: Parse UUIDs correctly.
tmem: Fix tmem-shared-auth 'auth' values
tmem: By default to join an shared pool it must be authorized.
xen/libxc: Move TMEM_AUTH to XEN_SYSCTL_TMEM_OP_SET_AUTH
xen/libcx/tmem: Replace TMEM_RESTORE_NEW with XEN_SYSCTL_TMEM_OP_SET_POOLS
displif: add ABI for para-virtual display
tools:misc:xenlockprof: fix possible format string overflow
GCC7 complains about a possible overflow/truncation in xenlockprof.
xenlockprof.c: In function ‘main’:
xenlockprof.c:100:53: error: ‘%s’ directive writing up to 39 bytes into a
region of size between 17 and 37 [-Werror=format-overflow=]
sprintf(name, "unknown type(%d) %d lock %s", data[j].type,
^~
xenlockprof.c:100:13: note: ‘sprintf’ output between 24 and 83 bytes
into a destination of size 60
sprintf(name, "unknown type(%d) %d lock %s", data[j].type,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
data[j].idx, data[j].name);
~~~~~~~~~~~~~~~~~~~~~~~~~~
This increases the size of name to 100. Not the most scalable solution,
but certainly the "cheapest", as it doesn't add dependencies for
asprintf.
Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Wed, 5 Apr 2017 14:39:53 +0000 (16:39 +0200)]
memory: don't hand MFN info to translated guests
We shouldn't hand MFN info back from increase-reservation for
translated domains, just like we don't for populate-physmap and
memory-exchange. For full symmetry also check for a NULL guest handle
in populate_physmap() (but note this makes no sense in
memory_exchange(), as there the array is also an input).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Released-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 5 Apr 2017 14:39:16 +0000 (16:39 +0200)]
memory: exit early from memory_exchange() upon write-back error
There's no point in continuing if in the end we'll return -EFAULT
anyway. It also seems wrong to report a chunk for which at least one
write-back failed as successfully exchanged (albeit the indication of
an error is also not fully correct, as the exchange happened in that
case at least partially - retrieving the GFN to assign the memory to
and/or handing back the information on the replacement memory didn't
work). In any case limiting the amount of damage done to the guest
can't be all that bad an idea.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Released-acked-by: Julien Grall <julien.grall@arm.com>
ring.h: introduce macros to handle monodirectional rings with multiple req sizes
This patch introduces macros, structs and functions to handle rings in
the format described by docs/misc/pvcalls.markdown and
docs/misc/9pfs.markdown. The index page (struct __name##_data_intf)
contains the indexes and the grant refs to setup two rings.
$NAME_read_packet and $NAME_write_packet are provided to read or write
any data struct from/to the ring. In pvcalls, they are unused. In xen
9pfs, they are used to read or write the 9pfs header. In other protocols
they could be used to read/write the whole request structure. See
docs/misc/9pfs.markdown:Ring Usage to learn how to check how much data
is on the ring, and how to handle notifications.
There is a ring_size parameter to most functions so that protocols using
these macros don't have to have a statically defined ring order at build
time. In pvcalls for example, each new ring could have a different
order.
These macros don't help you share the indexes page or the event channels
needed for notifications. You can do that with other out of band
mechanisms, such as xenstore or another ring.
It is not possible to use a macro to define another macro with a
variable name. For this reason, this patch introduces static inline
functions instead, that are not C89 compliant. Additionally, the macro
defines a struct with a variable sized array, which is also not C89
compliant.
The hypervisor code (tmemc_shared_pool_auth) has since its inception
considered auth values of:
0 - to disable authentication!
1 - to enable authentication for the given UUID.
The docs have it the other way around, so let's fix it.
Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/libxc: Move TMEM_AUTH to XEN_SYSCTL_TMEM_OP_SET_AUTH
which surprisingly (or maybe not) looks like
XEN_SYSCTL_TMEM_OP_SET_POOLS.
This hypercall came about, as explained in docs/misc/tmem-internals.html:
When tmem was first proposed to the linux kernel mailing list
(LKML), there was concern expressed about security of shared ephemeral
pools. The initial tmem implementation only
required a client to provide a 128-bit UUID to identify a shared pool, and the
linux-side tmem implementation obtained this UUID from the superblock of the
shared filesystem (in ocfs2). It was
pointed out on LKML that the UUID was essentially a security key and any
malicious domain that guessed it would have access to any data from the shared
filesystem that found its way into tmem.
..
As a result, a Xen boot option -- tmem_shared_auth -- was
added. The option defaults to disabled,
but when it is enabled, management tools must explicitly authenticate (or may
explicitly deny) shared pool access to any client.
On Xen, this is done with the xm tmem-shared-auth command.
"
However the implementation has some rather large holes:
a) The hypercall was accessed from any guest.
b) If the ->domain id value is 0xFFFF then one can toggle the
tmem_global.shared_auth knob on/off. That with a)
made it pretty bad.
c) If one toggles the tmem_global.shared_auth off, then the
'tmem_shared_auth=1' bootup parameter is ignored and
one can join any shared pool (if UUID is known)!
d) If the 'tmem_shared_auth=1' and tmem_global.shared_auth is
set to 1, then one can only join a shared pool if the
UUID has been set by 'xl tmem-shared-auth'. Otherwise
the joining of a pool fails and a non-shared pool is
created (without errors to guest). Not exactly sure if
the shared pool creation at that point should error out
or not.
e) If a guest is migrated, the policy values (which UUID
can be shared, whether tmem_global.shared_auth is set, etc)
are completely ignored.
This patch only fixes a) and only allows the hypercall to
be called by the control domain. Subsequent patches will
fix the remaining issues.
We also have to call client_create as the guest at this
point may not have done any tmem hypercalls - and hence
the '->tmem' from 'struct domain' is still NULL. Us calling
client_create fixes this.
Acked-by: Wei Liu <wei.liu2@citrix.com> [libxc changes] Reviewed-by: Jan Beulich <jbeulich@suse.com> [hypervisor changes] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/libcx/tmem: Replace TMEM_RESTORE_NEW with XEN_SYSCTL_TMEM_OP_SET_POOLS
This used to be done under TMEM_RESTORE_NEW, which was a hypercall
accessible by the guest. However there are a couple of reasons not to do it:
- No checking of domid on TMEM_RESTORE_NEW, which meant that any guest could
create TMEM pools for other guests.
- The guest can already create pools using TMEM_NEW_POOL (which is limited
to the guest doing the hypercall).
- This functionality is only needed during migration - there is no need for
the guest to have this functionality.
However to move this we also have to allocate the 'struct domain' ->tmem
pointer. It is by default set to NULL and would be initialized via the guest
do_tmem() hypercalls. Presumably that was the initial reason that
TMEM_RESTORE_NEW was in the guest accessible hypercalls.
Acked-by: Wei Liu <wei.liu2@citrix.com> [libxc change] Reviewed-by: Jan Beulich <jbeulich@suse.com> [hypervisor changes] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This is the ABI for the two halves of a para-virtualized
display driver.
This protocol aims to provide a unified protocol which fits more
sophisticated use-cases than a framebuffer device can handle. At the
moment basic functionality is supported with the intention to extend:
o multiple dynamically allocated/destroyed framebuffers
o buffers of arbitrary sizes
o better configuration options including multiple display support
Note: the existing fbif can be used together with displif running at the
same time, e.g. on Linux one can provide the framebuffer and the other
DRM/KMS.
Future extensions to the existing protocol may include:
o allow display/connector cloning
o allow allocating objects other than display buffers
o add planes/overlays support
o support scaling
o support rotation
Note, that this protocol doesn't use ring macros for
bi-directional exchange (PV calls/9pfs) because:
o it statically defines the use of a single page
for the ring buffer
o it uses direct memory access to ring's contents
w/o memory copying
o re-uses the same idea that kbdif/fbif use
which for this use-case seems to be appropriate
==================================================
Rationale for introducing this protocol instead of
using the existing fbif:
==================================================
1. In/out event sizes
o fbif - 40 octets
o displif - 40 octets
This is only the initial version of the displif protocol, which means that
there could be requests which will not fit (WRT introducing some GPU related
functionality later on). In that case we cannot alter fbif sizes, as we need
to be backward compatible, and we will be forced to handle those apart from
fbif.
2. Shared page
Displif doesn't use anything like struct xenfb_page, but
DEFINE_RING_TYPES(xen_displif, struct xendispl_req, struct
xendispl_resp) which is a better and more common way.
Output events use a shared page which only has in_cons and in_prod
and all the rest is used for incoming events. Here struct xenfb_page
could probably be used as is, despite the fact that it only has half
of a page for incoming events, which is only 50 events (consider
something like a 60Hz display).
3. Amount of changes.
fbif only provides XENFB_TYPE_UPDATE and XENFB_TYPE_RESIZE
events, so it looks like it is easier to get fb support into displif
than vice versa. displif at the moment has 6 requests and 1 event,
multiple connector support, etc.
Add ABI for the two halves of a para-virtualized
sound driver to communicate with each other.
The ABI allows implementing audio playback and capture as
well as volume control and possibility to mute/unmute
audio sources.
Note: depending on the use-case the backend can expose more sound cards and
PCM devices/streams than the underlying HW physically has, by employing SW
mixers, configuring virtual sound streams, channels etc., thus allowing
fine-tuned configurations per frontend.
Jan Beulich [Tue, 4 Apr 2017 12:47:46 +0000 (14:47 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling
The use of guest_handle_okay() here (as introduced by the XSA-29 fix) is
insufficient; guest_handle_subrange_okay() needs to be used instead.
Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
number-of-entries input (leaving aside the fact that this can be
called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
index zero.
This is CVE-2017-7228 / XSA-212.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/vioapic: allow the vIO APIC to have a variable number of pins
Although it's still always set to VIOAPIC_NUM_PINS (48).
Add a new field to the hvm_ioapic struct to contain the number of pins (number
of IO redirection table entries) and turn the redirection table into a variable
sized array.
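Illustrative shape of the change (the struct is simplified; the real one
carries more state):

struct hvm_vioapic {
    struct domain *domain;
    uint32_t nr_pins;                      /* number of redirection entries */
    union vioapic_redir_entry redirtbl[];  /* variable-size redirection table */
};

#define hvm_vioapic_size(cnt) offsetof(struct hvm_vioapic, redirtbl[cnt])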
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/hvm: convert gsi_assert_count into a variable size array
Rearrange the fields of hvm_irq so that gsi_assert_count can be converted into
a variable size array and add a new field to account the number of GSIs.
Due to these changes the irq member in the hvm_domain struct also needs to
become a pointer set at runtime.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
kexec: clear kexec_image slot when unloading kexec image
When kexec_do_unload calls kexec_swap_images to get the old kexec_image to
free, it passes NULL for the new kexec_image pointer. The new slot wasn't being
cleared in such a case, leading to a stale pointer being left behind in the
kexec_image array and Xen panics in subsequent load/unload operations.
Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/vioapic: expand hvm_vioapic to contain vIO APIC internal state
This is required in order to have a variable number of vIO APIC pins, instead
of the current fixed value (48). Note that this patch only expands the fields
of the hvm_vioapic struct, without actually introducing any new fields or
functionality.
The reason to expand the hvm_vioapic structure instead of the hvm_hw_vioapic
one is that the variable number of pins functionality is only going to be used
by the hardware domain, so no modifications are needed to the save format.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
parse_vwfi runs after init_traps on cpu0, potentially resulting in the
wrong HCR_EL2 value for it. Secondary cpus boot after parse_vwfi, so in
their case init_traps will write the correct set of flags to HCR_EL2.
For cpu0, fix the issue by setting HCR_EL2 from a new presmp_initcall.
xen/arm: acpi: Map MMIO on fault in stage-2 page table for the hardware domain
When booting using ACPI, not all MMIOs can be discovered by parsing the
static tables or the UEFI memory map. A lot of them will be described in
the DSDT. However, Xen does not have an AML parser which requires us to
find a different approach.
During the first discussions on supporting ACPI (see design doc [1]), it
was decided to rely on the hardware domain to make a request to the
hypervisor to map the MMIO region in stage-2 page table before accessing
it. This approach works fine if the OS has limited hooks to modify the
page tables.
In the case of the Linux kernel, notifiers have been added to map the MMIO
regions when adding a new AMBA/platform device. Whilst this covers most of
the MMIOs, some of them (e.g. OpRegion, ECAM...) are not related to a
specific device, or the driver is not using the AMBA/platform API. So more
hooks would need to be added in the code.
Various approaches have been discussed (see [2]); one of them was to create
stage-2 mappings seamlessly in Xen upon hardware memory faults. This
approach was first ruled out because it relies on the hardware domain to
probe the region before any use, so it would not work when DMA'ing to
another device's MMIO region when the device is protected by an SMMU. It has
been pointed out that this is a limited use case compared to DMA'ing between
MMIO and RAM.
This patch implements this approach. All MMIO regions will be mapped in
stage-2 using p2m_mmio_direct_c (i.e. normal memory, outer and inner
write-back cacheable). The stage-1 page table will be in control of the
memory attributes. This is fine because the hardware domain is a trusted
domain.
Note that MMIO will only be mapped on a data abort fault. It is assumed
that it will not be possible to execute code from MMIO
(p2m_mmio_direct_c will forbid that).
As mentioned above, this solution will cover most of the cases. If a
platform requires DMA'ing to another device's MMIO region without any access
performed by the OS, then it will be expected to have specific platform code
in the hypervisor to map the MMIO at boot time, or the OS will have to use
the existing hypercalls (i.e. XENMEM_add_to_physmap{,_batch}) before any
access.
Ian Jackson [Mon, 3 Apr 2017 11:34:13 +0000 (12:34 +0100)]
tools: ocaml: In configure, check for ocamlopt
If ocaml.m4 didn't find ocamlopt, disable all the ocaml builds.
Currently our Makefiles do not work properly when the native code
compiler (`ocamlopt') is not available. In principle this should be
fixed to fall back to bytecode, but this is not a task for this stage
of the Xen 4.9 release.
Without this change, we cannot build on systems with only ocamlc.
That includes Debian jessie ARM64, as used on the new ARM64 hardware
in the Xen Project CI test lab.
When the Makefiles are fixed, this commit should be reverted.
Committers: Please rerun autogen.sh.
CC: Julien Grall <julien.grall@arm.com> CC: Christian Lindig <christian.lindig@citrix.com> CC: Jonathan Ludlam <Jonathan.Ludlam@citrix.com> CC: David Scott <dave@recoil.org> CC: Wei Liu <wei.liu2@citrix.com> Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Tue, 28 Mar 2017 16:26:15 +0000 (18:26 +0200)]
xenstore: cleanup tdb.c
Remove all unused functions from tdb.c. This will reduce code size of
xenstored and - more important - of xenstore stubdom.
tdb.c hasn't been updated to a newer version since its introduction in
2005. Any backport of bug fixes or update to a new version will need
major work, so there is no real downside to remove not needed code.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Juergen Gross [Fri, 31 Mar 2017 11:29:19 +0000 (13:29 +0200)]
xenstore: rework of transaction handling
The handling of transactions in xenstored is rather clumsy today:
- Each transaction in progress keeps a local copy of the complete
xenstore data base
- A transaction will fail as soon as any node is modified outside
the transaction
This leads to very bad behavior in case of a large xenstore. Memory
consumption of xenstored is much higher than necessary, and with many
domains up transaction failures become more and more common.
Instead of keeping a complete copy of the data base for each transaction,
store the transaction data in the same data base as the normal xenstore
entries, with the transaction identifier prepended for the individual nodes
either read or modified in the transaction. At the end of the transaction
walk through all nodes accessed and check for conflicting modifications. In
case no conflicts are found, write all modified nodes to the data base
without the transaction identifier.
The following tests have been performed:
- create/destroy of various domains, including HVM with ioemu-stubdom
(xenstored and xenstore-stubdom)
- multiple concurrent runs of xs-test over several minutes
(xenstored and xenstore-stubdom)
- test for memory leaks of xenstored by dumping talloc reports before
and after the tests
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>