This patch proposes relying on host TSC synchronization and
passthrough to the guest, when running on a TSC-safe platform. On
time_calibration we retrieve the platform time in ns and the counter
read by the clocksource that was used to compute system time. We can
guarantee that, on a platform with a constant and reliable TSC, the
time read on vCPU B right after vCPU A is greater, regardless of the
per-vCPU calibration error. Since pvclock time infos are therefore
monotonic as seen by any vCPU, set PVCLOCK_TSC_STABLE_BIT, which in
turn enables use of the vDSO fast path on Linux. IIUC, this is similar
to how it's implemented on KVM. Also add a comment noting that this
bit can change and that guests are expected to check it on every read.
I should note that I've yet to see time going backwards in a
long-running test of two weeks (on a dual-socket machine), plus a few
other tests I did on older platforms.
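For illustration, the guest-side pvclock read loop looks roughly like
this (a minimal sketch; the struct layout is abbreviated, and rdtsc()
and smp_rmb() stand in for the guest's own primitives):

    struct vcpu_time_info {
        uint32_t version;           /* odd while an update is in progress */
        uint32_t pad0;
        uint64_t tsc_timestamp;
        uint64_t system_time;       /* ns as of tsc_timestamp */
        uint32_t tsc_to_system_mul;
        int8_t tsc_shift;
        uint8_t flags;              /* PVCLOCK_TSC_STABLE_BIT lives here */
    };

    static uint64_t pvclock_read_ns(const volatile struct vcpu_time_info *t)
    {
        uint32_t ver;
        uint64_t delta, ns;

        do {
            ver = t->version;
            smp_rmb();
            /* Guests must check t->flags for PVCLOCK_TSC_STABLE_BIT on
             * every read; if clear, fall back to monotonicity fixups. */
            delta = rdtsc() - t->tsc_timestamp;
            if ( t->tsc_shift >= 0 )
                delta <<= t->tsc_shift;
            else
                delta >>= -t->tsc_shift;
            ns = t->system_time +
                 (uint64_t)(((unsigned __int128)delta *
                             t->tsc_to_system_mul) >> 32);
            smp_rmb();
        } while ( (ver & 1) || ver != t->version ); /* retry on torn read */

        return ns;
    }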
Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Recent x86/time changes considerably improved monotonicity in Xen
timekeeping, making it much harder to observe time going backwards.
However, the platform timer can't be expected to be perfectly in sync
with the TSC, so get_s_time isn't guaranteed to always return
monotonically increasing values across CPUs. This is the case on some
of the boxes I am testing with, where I sometimes observe ~100 warps
(of very few nanoseconds each) after a few hours.
This patch introduces support for using the TSC as platform time
source, which is the highest-resolution and cheapest time source to
read. There are, however, several problems associated with its usage,
and there isn't a complete (and architecturally defined) guarantee
that all machines will provide a reliable and monotonic TSC in all
cases (I believe Intel to be the only vendor that can guarantee
that?). For this reason it's not used unless the administrator changes
the "clocksource" boot option to "tsc". Initializing the TSC
clocksource requires all CPUs to be up so that the TSC reliability
checks can be performed. init_xen_time is called before all CPUs are
up, so we would start with, for example, HPET (or ACPI, PIT) at boot
time, and switch to TSC later. The switch happens in the
verify_tsc_reliability initcall, which is invoked once all CPUs are
up. When attempting to initialize the TSC we also check for time warps
and for invariant TSC. Note that while we deem a CONSTANT_TSC with no
deep C-states reliable, that might not always be the case, so we're
conservative and allow the TSC to be used as platform timer only with
invariant TSC. Additionally we check that CPU hotplug isn't meant to
be performed on the host, which is the case when max CPUs and
num_present_cpus() are the same. This is because a newly hotplugged
CPU may not satisfy the condition of having all TSCs synchronized - so
while the tsc clocksource is in use we allow offlining CPUs but not
onlining any back. Finally, we prevent the TSC from being used as
clocksource across multiple sockets, because there it isn't guaranteed
to be invariant. A separate patch further relaxes this last
requirement, so that vendors providing such a guarantee can use the
TSC as clocksource. In case any of these conditions is not met, we
keep the clocksource that was previously initialized in init_xen_time.
Since b64438c7c ("x86/time: use correct (local) time stamp in
constant-TSC calibration fast path") updates to CPU time use local
stamps, which means the platform timer is only used to seed the
initial CPU time. We further introduce a new rendezvous function
(nop_rendezvous) which doesn't require synchronization between master
and slave CPUs: it just reads the calibration_rendezvous struct and
writes its stime and TSC stamp down to the cpu_calibration struct, to
be used later on. With clocksource=tsc there is no need to be in sync
with another clocksource, so we reseed the local/master stamps with
TSC values and update the platform time stamps accordingly. Time
calibration is set to fire 1 second after the switch to TSC, and these
stamps are reseeded to also ensure monotonically increasing values
right after the point we switch to TSC. This removes the possibility
of inconsistent readings in this short period (i.e. until calibration
fires).
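The new rendezvous is essentially a no-op; a rough sketch of its shape
(struct and field names follow the description above and are
illustrative):

    static void nop_rendezvous(void *rv)
    {
        const struct calibration_rendezvous *r = rv;
        struct cpu_time_stamp *c = &this_cpu(cpu_calibration);

        /* No master/slave synchronization: each CPU just records the
         * master stime/TSC pair for the calibration softirq to use. */
        c->local_tsc    = r->master_tsc_stamp;
        c->local_stime  = r->master_stime;
        c->master_stime = r->master_stime;

        raise_softirq(TIME_CALIBRATE_SOFTIRQ);
    }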
Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
To allow the caller to fetch the last read from the clocksource which
was used to calculate system_time. This is a prerequisite for a
subsequent patch that will use this last read.
Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
And accommodate platform time source initialization in
try_platform_time(). This is a preparatory patch for deferring
TSC clocksource initialization to the stage where all CPUs are
up (the verify_tsc_reliability initcall).
Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/arm64: Add missing synchronization barrier in invalidate_cache
The invalidation of the instructions cache requires barriers to ensure
the completion of the invalidation before continuing (see B2.3.4 in ARM
DDI 0487A.j).
This was overlooked in commit fb9d877 "xen/arm64: Add an helper to
invalidate all instruction caches".
Right now the contents of 'name' are all located in
the .data section. We want them in the .rodata section,
so change the type to be const.
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
livepatch/tests: Make .livepatch.depends be read-only
Currently, during the injection of the build-id, the section ends up
being marked as AW. We want it to be read-only.
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/tools: tracing: improve tracing of context switches.
Right now, two out of the three events related to
context switch (that is TRC_SCHED_SWITCH_INFPREV and
TRC_SCHED_SWITCH_INFNEXT) only report the domain id,
and not the vcpu id.
That omits a useful piece of information, and
even if it could be figured out by looking at other
records, doing so is unnecessarily complicated (especially
if working on a trace from a script).
This changes both the tracing code in Xen and the parsing
code in tools at once, to avoid introducing transitional
regressions.
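Conceptually, the records for the outgoing vcpu go from carrying just
the domain id to carrying both ids; a sketch (the exact field packing
in the patch may differ, and "prev" is an assumed variable name):

    struct {
        uint16_t vcpu, dom;
    } d = {
        .vcpu = prev->vcpu_id,
        .dom  = prev->domain->domain_id,
    };

    __trace_var(TRC_SCHED_SWITCH_INFPREV, 1, sizeof(d), &d);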
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Add HVM usb passthrough support to libxl by using qemu's capability
to emulate standard USB controllers.
A USB controller is added via qmp command to the emulated hardware
when a usbctrl device of type DEVICEMODEL is requested. Depending on
the requested speed the appropriate hardware type is selected. A host
USB device can then be added to the emulated USB controller via qmp
command.
Removal of the devices is done via qmp commands, too.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
libxl: add basic support for devices without backend
With the planned support of HVM USB passthrough via the USB emulation
capabilities of qemu, libxl has to support guest devices which have
neither a backend nor a frontend. Information about those devices will
live in the libxl part of Xenstore only.
Add some basic support to libxl to be able to cope with this scenario.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
libxl: add function to remove usb controller xenstore entries
In case of failure when trying to add a new USB controller to a
domain, libxl might leak xenstore entries. Add a function to remove
them and call this function in case of failure.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
move TLB-flush filtering out into populate_physmap during vm creation
This patch implements part of the TODO left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58 ("More efficient TLB-flush
filtering in alloc_heap_pages()"): it moves TLB-flush filtering out
into populate_physmap. Because of the TLB flush in alloc_heap_pages,
it is very slow to create a guest with more than 100GB of memory on a
host with 100+ CPUs.
This patch introduces a "MEMF_no_tlbflush" bit in memflags to indicate
whether the TLB flush should be done in alloc_heap_pages or by its
caller, populate_physmap. Once this bit is set in memflags,
alloc_heap_pages will skip the TLB flush. Using this bit after the VM
is created might lead to a security issue: it would make pages
accessible to guest B while guest A may still have a cached mapping to
them.
Therefore, this patch also introduces a "creation_finished" field in
struct domain to indicate whether the domain has ever been unpaused by
the hypervisor. MEMF_no_tlbflush can be set only during the VM
creation phase, while creation_finished is still false, i.e. before
the domain gets unpaused for the first time.
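A rough sketch of the resulting populate_physmap flow (helper names
follow the companion cleanup patch; the exact code is illustrative):

    bool need_tlbflush = false;
    uint32_t tlbflush_timestamp = 0;
    unsigned long i;

    /* Only safe before the domain has ever been unpaused. */
    if ( !d->creation_finished )
        a->memflags |= MEMF_no_tlbflush;

    for ( i = a->nr_done; i < a->nr_extents; i++ )
    {
        struct page_info *page = alloc_domheap_pages(d, a->extent_order,
                                                     a->memflags);
        unsigned int j;

        if ( page == NULL )
            break;

        if ( a->memflags & MEMF_no_tlbflush )
            for ( j = 0; j < (1U << a->extent_order); j++ )
                accumulate_tlbflush(&need_tlbflush, &page[j],
                                    &tlbflush_timestamp);
        /* guest_physmap_add_page() etc. as before */
    }

    if ( need_tlbflush )
        filtered_flush_tlb_mask(tlbflush_timestamp);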
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
replace tlbflush check and operation with inline functions
This patch cleans up the code by replacing the complicated tlbflush
checks and operations with inline functions. These inline functions
should be used to avoid the complicated checks and operations when
implementing the TODOs left in commit
a902c12ee45fc9389eb8fe54eeddaf267a555c58 (More efficient TLB-flush
filtering in alloc_heap_pages()).
"#include <asm/flushtlb.h>" is removed from xen/arch/x86/acpi/suspend.c
to avoid a compile error after "<asm/flushtlb.h>" is included in
xen/include/xen/mm.h.
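The helpers' shape, roughly (a sketch based on the description above;
the committed versions may differ in detail):

    static inline void accumulate_tlbflush(bool *need_tlbflush,
                                           const struct page_info *page,
                                           uint32_t *tlbflush_timestamp)
    {
        if ( page->u.free.need_tlbflush &&
             page->tlbflush_timestamp <= tlbflush_current_time() &&
             (!*need_tlbflush ||
              page->tlbflush_timestamp > *tlbflush_timestamp) )
        {
            *need_tlbflush = true;
            *tlbflush_timestamp = page->tlbflush_timestamp;
        }
    }

    static inline void filtered_flush_tlb_mask(uint32_t tlbflush_timestamp)
    {
        cpumask_t mask;

        cpumask_copy(&mask, &cpu_online_map);
        tlbflush_filter(mask, tlbflush_timestamp);
        if ( !cpumask_empty(&mask) )
            flush_tlb_mask(&mask);
    }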
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
3a7f872a ("tools: lift BUILD_BUG_ON to a tools header file") was taken
out of a rather old, half-finished branch by dropping unrelated
changes. Unfortunately two issues sneaked in.
1. Hvmloader should be standalone. Revert the changes to hvmloader.
2. The include guard in libs.h was erroneously deleted. Add it back.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
On Linux, an ioctl(gntdev, IOCTL_GNTDEV_GRANT_COPY, ..) system call
is invoked. In mini-os the operation is not yet implemented. For OSs
that do not implement gnttab, a call to the grant copy operation
causes an abort.
Signed-off-by: Paulina Szubarczyk <paulinaszubarczyk@gmail.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
[ wei: modify this patch to use BUILD_BUG_ON in xen-tools/libs.h ] Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Currently it is only possible to set mem_access restrictions for a
contiguous range of GFNs (or, as a particular case, for a single GFN).
This patch introduces a new libxc function taking an array of GFNs.
The alternative would be to set each page in turn, using a userspace-HV
roundtrip for each call, and triggering a TLB flush per page set.
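A caller might then look like this (the function name and signature
are inferred from the description above and should be treated as
illustrative):

    uint32_t domid = 1;
    uint64_t pages[]  = { 0x1000, 0x1005, 0x2342 };
    uint8_t  access[] = { XENMEM_access_r, XENMEM_access_rw,
                          XENMEM_access_n };
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);

    /* One hypercall restricting all three GFNs, instead of three
     * userspace-HV round trips and three TLB flushes. */
    int rc = xc_set_mem_access_multi(xch, domid, access, pages, 3);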
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Julien Grall <julien.grall@arm.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Daniel Kiper [Mon, 19 Sep 2016 15:24:20 +0000 (17:24 +0200)]
x86/boot/reloc: rename some variables and rearrange code a bit
Replace mbi with mbi_out and mbi_old with mbi_in and rearrange code
a bit to make it more readable. Additionally, this way multiboot (v1)
protocol implementation and future multiboot2 protocol implementation
will use the same variable naming convention.
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Olaf Hering [Mon, 19 Sep 2016 09:29:46 +0000 (09:29 +0000)]
docs: correct values for old VMDP unplug
Fix commit f6d4cf5 ("docs: document old SUSE/Novell unplug for HVM").
The values which VMDP used to control either NIC or disk are flipped.
What the code does is:
case 8:
    if (val == 1) {
        ide_unplug_harddisks();
    } else if (val == 2) {
        pci_unplug_netifs();
        net_tap_shutdown_all();
    }
    break;
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wei.liu2@citrix.com>
Jan Beulich [Mon, 19 Sep 2016 09:42:23 +0000 (11:42 +0200)]
x86/Intel: Broadwell doesn't have PKG_C{8,9,10}_RESIDENCY MSRs
According to
https://lists.xenproject.org/archives/html/xen-devel/2016-09/msg01797.html
this partially reverts commit 350bc1a9d4 ("x86: support newer Intel CPU
models") to account for the apparent earlier mis-documentation.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Tamas K Lengyel [Mon, 19 Sep 2016 09:38:08 +0000 (11:38 +0200)]
vm_event: sanitize vm_event response handling
Setting response flags in vm_event is only ever safe if the vCPUs are
paused. To reflect this, we move all checks inside the if block that
already checks whether this is the case. For checks that are only
supported on one architecture, we relocate the bitmask operations to
the arch-specific handlers, to avoid the overhead on architectures
that don't support them.
Furthermore, we clean up the emulation checks so that they more
clearly represent the decision logic for when emulation should take
place. As part of this we also set the stage to allow emulation in
response to other types of events, not just mem_access violations.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Jan Beulich [Mon, 19 Sep 2016 09:37:09 +0000 (11:37 +0200)]
x86/Intel: hide CPUID faulting capability from guests
We don't currently emulate it, so guests should not be misguided to
believe they can (try to) use it.
For now, simply return zero to guests for platform MSR reads, and only
accept (by discarding) writes of zero. If ever there are bits we can
safely expose to guests, let's handle them by whitelisting.
(As a side note - according to SDM version 059 bit 31 is reserved on
all known families.)
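In the MSR intercept handlers this amounts to something like the
following sketch (the surrounding switch/dispatch context is omitted):

    /* Read side: reveal nothing, in particular no CPUID faulting bit. */
    case MSR_INTEL_PLATFORM_INFO:
        *msr_content = 0;
        break;

    /* Write side: accept, by discarding, only writes of zero. */
    case MSR_INTEL_PLATFORM_INFO:
        if ( msr_content != 0 )
            goto gp_fault;
        break;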
Reported-by: Kyle Huey <me@kylehuey.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citix.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
Which is only used by Livepatch code. The purpose behind this call is
to modify the page table entries' flags, specifically the .ro and .nx
flags. The current mechanism puts cache attributes in the flags, and
the .ro and .nx bits are locked down and assumed to be .ro=0, .nx=1.
Livepatch needs .nx=0 and also .ro to be set to 1.
We introduce a new 'flags' parameter in which various bits determine
whether the .ro and .nx bits are set or cleared. We can't use an enum,
as the function prototype would diverge from x86.
Reviewed-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen: credit2: properly schedule migration of a running vcpu.
If wanting to migrate a vcpu that is actually running,
we need to ask the scheduler to chime in as soon as
possible, to have the vcpu itself stopped and actually
moved.
Make sure this happens by, after setting all the relevant
flags, raising the scheduler softirq.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit1: fix mask to be used for tickling in Credit1
If there are idle pcpus inside the waking vcpu's
soft-affinity mask, we should really tickle one
of them (this is one of the purposes of the
__runq_tickle() function itself!), not just
any idle pcpu.
The issue was introduced by 02ea5031825d
("credit1: properly deal with pCPUs not in any cpupool"),
where the usage of idle_mask changed without
updating the bottom of the function, where it
is also referenced.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit1: small optimization in Credit1's tickling logic.
If, when vcpu x wakes up, there are no idle pcpus in x's
soft-affinity, we just go ahead and look at its hard
affinity. This basically means that, if, in __runq_tickle(),
new_idlers_empty is true, balance_step is equal to
CSCHED_BALANCE_HARD_AFFINITY, and that calling
csched_balance_cpumask() for whatever vcpu would just
return the vcpu's cpu_hard_affinity.
Therefore, don't bother calling it (it's just pure
overhead) and use cpu_hard_affinity directly.
For this very reason, this patch should only be
a (slight) optimization, and entail no functional
change.
As a side note, it would make sense to do what the
patch does, even if we could be inside the
[[ new_idlers_empty && new->pri > cur->pri ]] if
with balance_step equal to CSCHED_BALANCE_SOFT_AFFINITY.
In fact, what is actually happening is:
- vcpu x is waking up, and (since there aren't suitable
idlers, and it's entitled for it) it is preempting
vcpu y;
- vcpu y's hard-affinity is a superset of its
soft-affinity mask.
Therefore, it makes sense to use the widest possible
mask, as by doing that, we maximize the probability of
finding an idle pcpu in there, to which we can send
vcpu y, which then will be able to run.
While there, also fix the comment, which included
an awkward parenthesis nesting.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 15 Sep 2016 08:07:48 +0000 (10:07 +0200)]
x86: fold code in load_segments()
No need to have the same logic twice. (Note that the type change does
not affect the put_user() instances, as they derive their access size
from the second [pointer] argument.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 15 Sep 2016 08:06:56 +0000 (10:06 +0200)]
x86/EFI: don't accept 64-bit base relocations on page tables
Page tables get pre-populated with physical addresses which, due to
living inside the Xen image, will never exceed 32 bits in width. That
in turn results in the tool generating the relocations to produce
32-bit relocations for them instead of the 64-bit ones needed for
relocating virtual addresses. Hence instead of special casing page
tables in the processing of 64-bit relocations, let's be more rigid
and refuse them (as being indicative of something else having gone
wrong in the build process).
Peng Fan [Fri, 2 Sep 2016 09:41:41 +0000 (17:41 +0800)]
xen/arm: smpboot: drop unneeded code when identifying cpuinfo
The current_cpu_data indicates the cpuinfo for the current cpu.
There is no need to fill the current_cpu_data from boot_cpu_data,
because the following call to identify_cpu will override it.
Andrew Cooper [Mon, 12 Sep 2016 09:30:00 +0000 (10:30 +0100)]
x86/xstate: Fix latent bugs in compress_xsave_states()
compress_xsave_states() mustn't read xstate_bv or xcomp_bv before first
confirming that the input buffer is large enough. It also doesn't cope
with compressed input. Make avoiding all of these problems the callers'
responsibility.
Simplify the decompression logic by inlining get_xsave_addr(). As
xstate_bv is previously validated, dest won't ever be NULL.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 12 Sep 2016 09:30:00 +0000 (10:30 +0100)]
x86/domctl: Fix migration of guests which are not using xsave
c/s da62246e "x86/xsaves: enable xsaves/xrstors/xsavec in xen" broke migration
of PV guests which were not using xsave.
In such a case, compress_xsave_states() gets passed a zero length buffer. The
first thing it tries to do is ASSERT() on user-provided data, if it hadn't
already wandered off the end of the buffer to do so.
Perform more verification of the input buffer before passing it to
compress_xsave_states(). This involves making xsave_area_compressed() public.
Similar problems exist on the HVM side, so make equivalent adjustments there.
This doesn't manifest in general, as hvm_save_cpu_xsave_states() elides the
entire record if xsave isn't used, but is a problem if a caller were to
construct an xsave record manually.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Mon, 12 Sep 2016 09:30:00 +0000 (10:30 +0100)]
x86/xstate: Fix latent bugs in expand_xsave_states()
Without checking the size input, the memcpy() for the uncompressed path
might read off the end of the vcpu's xsave_area. Both callers pass the
appropriate size, so hold them to it with a BUG_ON().
The compressed path is currently dead code, but its attempt to avoid
leaking uninitialised data was incomplete. Work around this by zeroing
the whole rest of the buffer before decompression.
The loop skips all bits which aren't set in xstate_bv, meaning that the
memset() was dead code. The logic is more obvious with get_xsave_addr()
expanded inline, allowing for quite a lot of simplification, including all the
NULL pointer logic.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <JBeulich@suse.com>
Andrew Cooper [Mon, 12 Sep 2016 09:30:00 +0000 (10:30 +0100)]
x86/domctl: Fix TOCTOU race with the use of XEN_DOMCTL_getvcpuextstate
A toolstack must call XEN_DOMCTL_getvcpuextstate twice; first to find the size
of the buffer to use, and a second time to get the actual content.
The reported size was based on v->arch.xcr0_accum, but a guest which extends
its xcr0_accum between the two hypercalls will cause the toolstack to fail the
evc->size != size check, as the provided buffer is now too small. This causes
a hard error during the final phase of migration.
Instead, return a size based on xfeature_mask, which is the maximum size Xen
will ever permit. The hypercall must now tolerate a toolstack-provided buffer
which is overly large (for the case where a guest isn't using all available
xsave states), and should write back how much data was actually written into
the buffer.
As the query for size now has no dependence on vcpu state, the vcpu_pause()
can be omitted for a small performance improvement.
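The size query then becomes vcpu-state-independent; roughly (a sketch
using a helper along the lines of the existing PV_XSAVE_SIZE() macro;
the guard condition is illustrative):

    if ( evc->size == 0 )   /* first call: report the size to allocate */
    {
        /* Maximum size Xen will ever permit; no vcpu_pause() needed. */
        evc->size = PV_XSAVE_SIZE(xfeature_mask);
        ret = 0;
        break;
    }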
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
libxl: add libxl__qmp_run_command_flexarray() function
Add a function libxl__qmp_run_command_flexarray() to run a qmp command
with an array of arguments. The arguments are name-value pairs stored
in a flexarray.
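A caller might use it along these lines (argument names and values are
hypothetical, and the signature is inferred from the description):

    flexarray_t *args = flexarray_make(gc, 4, 1);

    flexarray_append_pair(args, "driver", "usb-ehci");
    flexarray_append_pair(args, "id", "xenusb-0");
    rc = libxl__qmp_run_command_flexarray(gc, domid, "device_add", args);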
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Tamas K Lengyel [Mon, 1 Aug 2016 17:59:14 +0000 (11:59 -0600)]
arm/vm_event: get/set registers
Add support for getting/setting registers through vm_event on ARM. Only
TTB/CR/R0/R1, PC and CPSR are sent as part of a request and only PC is set
as part of a response. The set of registers can be expanded in the future to
include other registers as well if necessary.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
On ARM we need an alternative VA region in which to poke the
hypervisor .text data. And since this is set up during runtime
we may fail (it uses vmap, so the most likely error is -ENOMEM).
As such this error needs to be bubbled up and also abort
the livepatching if it occurs.
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This is copied from Linux 4.7, and the initial commit
that put this in is 5c5bf25d4f7a950382f94fc120a5818197b48fe9
"arm64: introduce aarch64_insn_gen_{nop|branch_imm}() helper functions"
This lays the groundwork for Livepatch to generate the
trampoline to jump to the new replacement function.
Also allows us to NOP the callsites.
Acked-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
alternatives: x86 rename and change parameters on ARM
On x86 we squash 'apply_alternatives' into
'alternative_instructions' (which was its sole user)
and 'apply_alternatives_nocheck' into 'apply_alternatives'.
On ARM we change the parameters of 'apply_alternatives'
to be 'const struct alt_instr *' instead of a void pointer and
a size.
We also add 'const' and align the arguments at the
proper offset.
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> [x86 bits] Reviewed-by: Julien Grall <julien.grall@arm.com> [ARM bits] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
xen/arm: alternative: Make it possible to patch outside of the hypervisor
With livepatch the alternatives that should be patched are outside of
the Xen hypervisor _start -> _end. The current code is assuming that
only Xen could be patched and therefore will explode when a payload
contains alternatives.
Given that alt_instr contains a relative offset, the function
__apply_alternatives could directly take as a parameter the virtual
address of the alt_instr set of the re-mapped region. So we can mandate
that the callers of __apply_alternatives provide us with a region that
has read-write access.
The only caller that will patch directly the Xen binary is the function
__apply_alternatives_multi_stop. The other caller apply_alternatives
will work on the payload which will still have read-write access at that
time.
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Julien Grall <julien.grall@arm.com>
This patch contains only renaming and comment updates. There are no
functional changes:
- Don't mix _start and _stext, they both point to the same address
but the former makes more sense (we are mapping the Xen binary, not
only the text section).
- s/text_mfn/xen_mfn/ and s/text_order/xen_order/ to make clear that
we map the Xen binary.
- Mention inittext, as alternatives may patch that section.
- Use 1U instead of 1 in the shift.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Julien Grall <julien.grall@arm.com>
emulate.c:2016:14: error: comparison of unsigned enum expression < 0
is always false [-Werror,-Wtautological-compare]
    if ( seg < 0 || seg >= ARRAY_SIZE(hvmemul_ctxt->seg_reg) )
         ~~~ ^ ~
Clang is wrong to raise a warning like this. The signed-ness of an enum is
implementation defined in C, and robust code must not assume the choices made
by the compiler.
In this case, dropping the < 0 check creates a latent bug which would result
in an array underflow when compiled with a compiler which chooses a signed
enum.
Work around the bug by explicitly pulling seg into an unsigned integer, and
only perform the upper bounds check.
No functional change.
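The workaround, in short (a sketch; the real function's error handling
may differ):

    static struct segment_register *hvmemul_get_seg_reg(
        enum x86_segment seg,
        struct hvm_emulate_ctxt *hvmemul_ctxt)
    {
        unsigned int idx = seg;  /* enum signedness no longer matters */

        if ( idx >= ARRAY_SIZE(hvmemul_ctxt->seg_reg) )
            return NULL;         /* one upper bound covers both ends */

        return &hvmemul_ctxt->seg_reg[idx];
    }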
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Lars Kurth [Fri, 12 Aug 2016 09:37:28 +0000 (10:37 +0100)]
Remove ambiguities in the COPYING file; add CONTRIBUTING file
COPYING file:
The motivation of this change is to make it easier for new
contributors to conduct a license and patent review, WITHOUT
changing any licenses.
- Remove references to BSD-style licenses as we have more
common license exceptions and replace with "other license
stanzas"
- List the most common situations under which code is licensed
under licenses other than GPLv2 (section "Licensing Exceptions")
- List the most common non-GPLv2 licenses that are in use in
this repository based on a recent FOSSology scan (section
"Licensing Exceptions")
- List other license related conventions within the project
to make it easier to conduct a license review.
- Clarify the incoming license as its omission has confused
past contributors (section "Contributions")
CONTRIBUTING file:
The motivation of this file is to make it easier for contributors
to find contribution related resources. Add information on existing
license related conventions to avoid unintentional future licensing
issues. Provide templates for copyright headers for the most commonly
used licenses in this repository.
Signed-off-by: Lars Kurth <lars.kurth@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Sat, 2 Jul 2016 15:29:49 +0000 (16:29 +0100)]
x86/hvm: Optimise segment accesses in hvmemul_write_segment()
There is no need to read the segment information from VMCS/VMCB and cache it,
just to clobber the cached content immediately afterwards.
Write straight into the cache and set the accessed/dirty bits.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 1 Jul 2016 00:02:04 +0000 (01:02 +0100)]
x86/segment: Bounds check accesses to emulation ctxt->seg_reg[]
HVM HAP codepaths have space for all segment registers in the seg_reg[]
cache (with x86_seg_none still risking an array overrun), while the shadow
codepaths only have space for the user segments.
Range check the input segment of *_get_seg_reg() against the size of the array
used to cache the results, to avoid overruns in the case that the callers
don't filter their input suitably.
Subsume the is_x86_user_segment(seg) checks from the shadow code, which
were an incomplete attempt at range checking and are now superseded.
Make hvm_get_seg_reg() static, as it is not used outside of
shadow/common.c.
No functional change, but far easier to reason that no overflow is possible.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 12 Aug 2016 13:35:28 +0000 (14:35 +0100)]
hvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary
The Force Emulation Prefix is named to follow its PV counterpart for cpuid or
rdtsc, but isn't really an instruction prefix. It behaves as a break-out into
Xen, with the purpose of emulating the next instruction in the current state.
It is important to be able to test legal situations which occur on real
hardware, including instructions which cross certain boundaries, and
instructions starting at 0.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 8 Sep 2016 12:17:05 +0000 (14:17 +0200)]
libelf: drop pointless uses of __FUNCTION__
Non-debugging message text should be (and is in the cases here, albeit
often only with the addition of an ELF: prefix) distinguishable without
also logging function names.
In the messages touched, at once use %#x (or variants thereof) in
favor of 0x%x.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Andrew Cooper [Fri, 1 Jul 2016 00:02:04 +0000 (01:02 +0100)]
x86/shadow: Avoid overflowing sh_ctxt->seg_reg[]
hvm_get_seg_reg() does not perform a range check on its input segment, calls
hvm_get_segment_register() and writes straight into sh_ctxt->seg_reg[].
x86_seg_none is outside the bounds of sh_ctxt->seg_reg[], and will hit a BUG()
in {vmx,svm}_get_segment_register().
HVM guests running with shadow paging can end up performing a virtual to
linear translation with x86_seg_none. This is used for addresses which are
already linear. However, none of this is a legitimate pagetable update, so
fail the emulation in such a case.
This is XSA-187 / CVE-2016-7094.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
Andrew Cooper [Fri, 22 Jul 2016 16:02:54 +0000 (16:02 +0000)]
x86/emulate: Correct boundary interactions of emulated instructions
This reverts most of c/s 0640ffb6 "x86emul: fix rIP handling".
Experimentally, in long mode processors will execute an instruction stream
which crosses the 64bit -1 -> 0 virtual boundary, whether the instruction
boundary is aligned on the virtual boundary, or is misaligned.
In compatibility mode, Intel processors will execute an instruction stream
which crosses the 32bit -1 -> 0 virtual boundary, while AMD processors raise a
segmentation fault. Xen's segmentation behaviour matches AMD.
For 16bit code, hardware does not ever truncate %ip. %eip is always
used and behaves normally as a 32bit register, including in 16bit
protected mode segments, as well as in Real and Unreal mode.
This is XSA-186 / CVE-2016-7093.
Reported-by: Brian Marcotte <marcotte@panix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 8 Sep 2016 12:14:53 +0000 (14:14 +0200)]
x86/32on64: don't allow recursive page tables from L3
L3 entries are special in PAE mode, and hence can't reasonably be used
for setting up recursive (and hence linear) page table mappings. Since
abuse is possible when the guest in fact gets run on 4-level page
tables, this needs to be excluded explicitly.
This is XSA-185 / CVE-2016-7092.
Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com> Reported-by: "栾尚聪(好风)" <shangcong.lsc@alibaba-inc.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
You could construct _most_ of the names of the functions
by doing 'nm --defined', but unfortunately you do not get the
<file> prefix that is added on in Xen. For example:
$ cat xen-syms.symbols | grep do_domain_pause
0xffff82d080104920 t domain.c#do_domain_pause
$ nm --defined xen-syms | grep do_domain_pause
ffff82d080104920 t do_domain_pause
This is normally not an issue, but if one is doing livepatching and
wants to verify at build time that the symbols the livepatch payloads
will patch correspond to the ones the hypervisor has built - this
helps a lot.
Note that during runtime one can do:
[root@localhost xen]# cat /proc/xen/xensyms | grep do_domain_pause
ffff82d080104920 t domain.c#do_domain_pause
But one may not want to build and verify a livepatch on the same host.
Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Livepatch was expected at some point to be able to print the
build-id during bootup, which it did not. The reason is
that xen_build_init and livepatch_init are both __initcall
type routines. This meant that when livepatch_init called
xen_build_id, it would return -ENODATA, as build_id_len was
not set up yet (because xen_build_init would be called later).
The original patch fixed this by calling xen_build_init in
livepatch_init, which allows us to print the build-id of
the hypervisor.
However the x86 maintainers pointed out that the build-id
is independent of Livepatch and in fact should be printed
regardless of whether Livepatch is enabled or not.
Therefore this patch moves the logic of printing the build-id
to version.c.
Reviewed-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
version/livepatch: Move xen_build_id_check to version.h
It makes more sense for it to be there. However, that
means version.h now has a dependency on <xen/elfstructs.h>,
as Elf_Note is a macro.
elfstructs.h has a dependency on types.h as well, so
we need that too. We cannot put that #include <xen/types.h>
in elfstructs.h, as that file is used by the tools and they
do not have such a file.
Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
It is possible, especially if the only thing they do is
NOP functions - in which case there is only a .livepatch.funcs
section.
Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Jan Beulich [Wed, 7 Sep 2016 10:35:40 +0000 (12:35 +0200)]
x86/HVM: adjust feature checking in MSR intercept handling
Consistently consult hvm_cpuid(). With that, BNDCFGS gets better
handled outside of VMX specific code, just like XSS. Don't needlessly
check for MTRR support when the MSR being accessed clearly is not an
MTRR one.
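For BNDCFGS this means gating the intercept on the guest-visible MPX
bit, along the lines of this sketch (the committed checks are more
involved):

    unsigned int eax, ebx, ecx = 0, edx;

    hvm_cpuid(7, &eax, &ebx, &ecx, &edx);  /* leaf 7, subleaf 0 */
    if ( !(ebx & cpufeat_mask(X86_FEATURE_MPX)) )
        goto gp_fault;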
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 7 Sep 2016 10:34:43 +0000 (12:34 +0200)]
VMX: correct feature checks for MPX and XSAVES
Their VMCS fields aren't tied to the respective base CPU feature flags
but instead to VMX specific ones.
Note that while the VMCS GUEST_BNDCFGS field exists if either of the
two respective features is available, MPX continues to get exposed to
guests only with both features present.
Also add the so far missing handling of
- GUEST_BNDCFGS in construct_vmcs()
- MSR_IA32_BNDCFGS in vmx_msr_{read,write}_intercept()
and mirror the extra correctness checks during MSR write to
vmx_load_msr().
Reported-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tamas K Lengyel [Wed, 7 Sep 2016 10:33:57 +0000 (12:33 +0200)]
x86/altp2m: use __get_gfn_type_access to avoid lock conflicts
Use __get_gfn_type_access instead of get_gfn_type_access when checking
the hostp2m entries during altp2m mem_access setting and gfn remapping
to avoid a lock conflict which can make dom0 freeze. During mem_access
setting the hp2m is already locked. For gfn remapping we change the flow
to lock the hp2m before locking the ap2m.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com> Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Wed, 7 Sep 2016 10:32:31 +0000 (12:32 +0200)]
replace bogus -ENOSYS uses
This doesn't cover all of them, just the ones that I think would most
obviously better be -EINVAL or -EOPNOTSUPP.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wei Liu [Fri, 2 Sep 2016 13:43:25 +0000 (14:43 +0100)]
xen: indicate gcov in log messages
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Andrew Cooper [Mon, 26 Jan 2015 15:21:30 +0000 (15:21 +0000)]
x86/hypercall: Reduce the size of the hypercall tables
The highest populated entry in each hypercall table is currently at index 49.
There is no need to extend both tables to 64 entries.
Range check eax against the hypercall table array size, and use a
BUILD_BUG_ON() to ensure that the hypercall tables don't grow larger than the
args table.
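The guard boils down to (a sketch; table names are illustrative):

    /* Build time: the args table must cover every hypercall slot. */
    BUILD_BUG_ON(ARRAY_SIZE(hypercall_table) >
                 ARRAY_SIZE(hypercall_args_table));

    /* Run time: a range check replaces padding the tables out to 64. */
    if ( eax >= ARRAY_SIZE(hypercall_table) )
    {
        regs->rax = -ENOSYS;
        return;
    }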
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Jan 2015 15:11:59 +0000 (15:11 +0000)]
x86/hypercall: Merge the hypercall arg tables
For the same reason as c/s 33a231e3f "x86/HVM: fold hypercall tables" and
c/s d6d67b047 "x86/pv: Merge the pv hypercall tables", this removes the
risk of accidentally updating only one of the tables.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Jan 2015 14:30:43 +0000 (14:30 +0000)]
xen/multicall: Rework arch multicall handling
The x86 multicall handling was previously some very hairy inline
assembly, hard to follow and maintain.
Replace the existing do_multicall_call() with arch_do_multicall_call(). The
x86 side needs to handle both compat and non-compat calls, so pass the full
multicall state, rather than just the multicall_entry sub-structure.
On the ARM side, alter the prototype to match, but there is no resulting
functional change. On the x86 side, the implementation is now in plain C.
This allows the removal of both asm/multicall.h header files.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Jan 2015 14:15:23 +0000 (14:15 +0000)]
x86/hypercall: Move the hypercall tables into C
Editing (and indeed, finding) the hypercall tables can be tricky, especially
towards the end where .rept's are used to maintain the correct layout.
Move this all into C, and let the compiler do the hard work.
To do this, xen/hypercall.h and asm-x86/hypercall.h need to contain prototypes
for all the hypercalls; some were previously missing. This in turn requires
some shuffling of definitions and includes.
One difference is that NULL function pointers are used instead of
{,compat_}do_ni_hypercall(), which pv_hypercall() handles correctly. All
ni_hypercall() infrastructure is therefore dropped.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Jan 2015 12:01:00 +0000 (12:01 +0000)]
x86/pv: Implement pv_hypercall() in C
In a similar style to hvm_do_hypercall(). The C version is far easier to
understand and edit than the assembly versions.
There are a few small differences however. The register clobbering values
have changed (to match the HVM side), and in particular clobber the upper
32bits of 64bit arguments. The hypercall and performance counter record are
reordered to increase code sharing between the 32bit and 64bit cases.
The sole callers of __trace_hypercall_entry() were the assembly code. Given
the new C layout, it is more convenient to fold __trace_hypercall_entry() into
pv_hypercall(), and call __trace_hypercall() directly.
Finally, pv_hypercall() will treat a NULL hypercall function pointer as
-ENOSYS, allowing further cleanup.
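The NULL handling mentioned above amounts to (a sketch; the table
layout is illustrative):

    if ( eax >= ARRAY_SIZE(pv_hypercall_table) ||
         !pv_hypercall_table[eax].native )
    {
        regs->rax = -ENOSYS;
        return;
    }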
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 26 Jan 2015 11:25:43 +0000 (11:25 +0000)]
x86/hypercall: Move the hypercall arg tables into C
Editing (and indeed, finding) the hypercall args tables can be tricky,
especially towards the end where .rept's are used to maintain the correct
layout.
Move this all into C, and let the compiler do the hard work. As 0 is the
default value, drop all explicit 0's.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 15 Jul 2016 13:12:01 +0000 (13:12 +0000)]
x86/pv: Support do_set_segment_base() for compat guests
set_segment_base is the only hypercall which exists in only one of the
two modes guests might run in; all other hypercalls are either
implemented, or unimplemented, in both modes.
Remove this split, by allowing do_set_segment_base() to be called in the
compat hypercall path. This change will simplify the verification logic in a
later change.
No behavioural change from a guest's point of view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <JBeulich@suse.com>