]> xenbits.xensource.com Git - xen.git/log
xen.git
7 years agoxen/test/livepatch/Makefile: Install in DESTDIR/usr/lib/debug/xen-livepatch
Ian Jackson [Wed, 7 Jun 2017 14:00:17 +0000 (15:00 +0100)]
xen/test/livepatch/Makefile: Install in DESTDIR/usr/lib/debug/xen-livepatch

Dumping these patch files in /usr/lib/debug/xen-*.livepatch is a bit
ugly.

Also, refactor the Makefile to have a LIVEPATCHES variable, to reduce
repetition.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit a38d1af5fb02bee68c9a30e38b97c6129815f943)
(cherry picked from commit d32607042bdc944aa23b7eaf17cd68c8536d3a11)

7 years agoxen/arm: p2m: Fix incorrect mapping of superpages
Julien Grall [Fri, 19 May 2017 16:08:39 +0000 (17:08 +0100)]
xen/arm: p2m: Fix incorrect mapping of superpages

The same set of functions is used to set as well as to clean P2M
entries, except for clean operations (INVALID_MFN ~0UL) is passed as a
parameter. Unfortunately, when calculating an appropriate target order
for a particular mapping INVALID_MFN is taken into account which leads
to 4K page target order being set each time even for 2MB and 1GB
mappings.

This will result to break down the superpage into 4K mappings and leave
empty tables allocated.

This was introduced by commit 2ef3e36ec7 "xen/arm: p2m: Introduce
p2m_set_entry and __p2m_set_entry".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master-commit-id: 3fafdc28eb98dc1cb686379d83270516fc38049d

7 years agovgic: refuse irq migration when one is already in progress
Stefano Stabellini [Wed, 5 Apr 2017 20:28:43 +0000 (13:28 -0700)]
vgic: refuse irq migration when one is already in progress

When an irq migration is already in progress, but not yet completed
(GIC_IRQ_GUEST_MIGRATING is set), refuse any other irq migration
requests for the same irq.

This patch implements this approach by returning success or failure from
vgic_migrate_irq, and avoiding irq target changes on failure. It prints
a warning in case the irq migration fails.

It also moves the clear_bit of GIC_IRQ_GUEST_MIGRATING to after the
physical irq affinity has been changed so that all operations regarding
irq migration are completed.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master-commit-id: 91ef7364933037b6a8d825405a1c809a72e6152f

7 years agoarm: remove irq from inflight, then change physical affinity
Stefano Stabellini [Wed, 5 Apr 2017 20:28:42 +0000 (13:28 -0700)]
arm: remove irq from inflight, then change physical affinity

This patch fixes a potential race that could happen when
gic_update_one_lr and vgic_vcpu_inject_irq run simultaneously.

When GIC_IRQ_GUEST_MIGRATING is set, we must make sure that the irq has
been removed from inflight before changing physical affinity, to avoid
concurrent accesses to p->inflight, as vgic_vcpu_inject_irq will take a
different vcpu lock.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master-commit-id: 31bc6a93a096bab21211e0e2c7c284ee4aec5402

7 years agoxen/arm: Survive unknown traps from guests
Julien Grall [Fri, 5 May 2017 14:30:36 +0000 (15:30 +0100)]
xen/arm: Survive unknown traps from guests

Currently we crash Xen if we see an ESR_EL2.EC value we don't recognise.
As configurable disables/enables are added to the architecture
(controlled by RES1/RESO bits respectively), with associated synchronous
exceptions, it may be possible for a guest to trigger exceptions with
classes that we don't recognise.

While we can't service these exceptions in a manner useful to the guest,
we can avoid bringing down the host. Per ARM DDI 0487A.k_iss10775, page
D7-1937, EC values within the range 0x00 - 0x2c are reserved for future
use with synchronous exceptions, and EC within the range 0x2d - 0x3f may
be used for either synchronous or asynchronous exceptions.

The patch makes Xen handle any unknown EC by injecting an UNDEFINED
exception into the guest, with a corresponding (ratelimited) warning in
the log.

This patch is based on Linux commit f050fe7a9164 "arm: KVM: Survive unknown
traps from the guest".

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master-commit-id: baf2950213e6a50801940643e2549a5baa21ad71

7 years agoxen/arm: do_trap_hypervisor: Separate hypervisor and guest traps
Julien Grall [Fri, 5 May 2017 14:30:35 +0000 (15:30 +0100)]
xen/arm: do_trap_hypervisor: Separate hypervisor and guest traps

The function do_trap_hypervisor is currently handling both trap coming
from the hypervisor and the guest. This makes difficult to get specific
behavior when a trap is coming from either the guest or the hypervisor.

Split the function into two parts:
    - do_trap_guest_sync to handle guest traps
    - do_trap_hyp_sync to handle hypervisor traps

On AArch32, the Hyp Trap Exception provides the standard mechanism for
trapping Guest OS functions to the hypervisor (see B1.14.1 in ARM DDI
0406C.c). It cannot be generated when generated when the processor is in
Hyp Mode, instead other exception will be used. So it is fine to replace
the call to do_trap_hypervisor by do_trap_guest_sync.

For AArch64, there are two distincts exception depending whether the
exception was taken from the current level (hypervisor) or lower level
(guest).

Note that the unknown traps from guests will lead to panic Xen. This is
already behavior and is left unchanged for simplicy. A follow-up patch
will address that.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master-commit-id: 5a0ed9a09ebb32b620d9217875bb5206d5ccf4d7

7 years agoxen/arm: Save ESR_EL2 to avoid using mismatched value in syndrome check
Wei Chen [Wed, 5 Apr 2017 09:09:03 +0000 (17:09 +0800)]
xen/arm: Save ESR_EL2 to avoid using mismatched value in syndrome check

Xen will do exception syndrome check while some types of exception
take place in EL2. The syndrome check code read the ESR_EL2 register
directly, but in some situation this register maybe overridden by
nested exception.

For example, if we re-enable IRQ before reading ESR_EL2 which means
Xen may enter in IRQ exception mode and return the processor with
clobbered ESR_EL2 (See ARM ARM DDI 0487A.j D7.2.25)

In this case the guest exception syndrome has been overridden, we will
check the syndrome for guest sync exception with an incorrect ESR_EL2
value. So we want to save ESR_EL2 to cpu_user_regs as soon as the
exception takes place in EL2 to avoid using an incorrect syndrome value.

In order to save ESR_EL2, we added a 32-bit member hsr to cpu_user_regs.
But while saving registers in trap entry, we use stp to save ELR and
CPSR at the same time through 64-bit general registers. If we keep this
code, the hsr will be overridden by upper 32-bit of CPSR. So adjust the
code to use str to save ELR in a separate instruction and use stp to
save CPSR and HSR at the same time through 32-bit general registers.
This change affects the registers restore in trap exit, we can't use the
ldp to restore ELR and CPSR from stack at the same time. We have to use
ldr to restore them separately.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master-commit-id: 90dbcd749103c35609370e7b11d26690d4ca4f40

7 years agostop_machine: fill fn_result only in case of error
Gregory Herrero [Fri, 9 Jun 2017 11:42:07 +0000 (13:42 +0200)]
stop_machine: fill fn_result only in case of error

When stop_machine_run() is called with NR_CPUS as last argument,
fn_result member must be filled only if an error happens since it is
shared across all cpus.

Assume CPU1 detects an error and set fn_result to -1, then CPU2 doesn't
detect an error and set fn_result to 0. The error detected by CPU1 will
be ignored.

Note that in case multiple failures occur on different CPUs, only the
last error will be reported.

Signed-off-by: Gregory Herrero <gregory.herrero@oracle.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: d8b833d78f6bfde9855a949b5e6d3790d78c0fb7
master date: 2017-06-01 10:53:04 +0200

7 years agohvmloader: avoid tests when they would clobber used memory
Jan Beulich [Fri, 9 Jun 2017 11:41:44 +0000 (13:41 +0200)]
hvmloader: avoid tests when they would clobber used memory

First of all limit the memory range used for testing to 4Mb: There's no
point placing page tables right above 8Mb when they can equally well
live at the bottom of the chunk at 4Mb - rep_io_test() cares about the
5Mb...7Mb range only anyway. In a subsequent patch this will then also
allow simply looking for an unused 4Mb range (instead of using a build
time determined one).

Extend the "skip tests" condition beyond the "is there enough memory"
question.

Reported-by: Charles Arnold <carnold@suse.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Gary Lin <glin@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0d6968635ce51a8ed7508ddcf17b3d13a462cb27
master date: 2017-05-19 16:04:38 +0200

7 years agoarm: fix build with gcc 7
Jan Beulich [Fri, 9 Jun 2017 11:41:08 +0000 (13:41 +0200)]
arm: fix build with gcc 7

The compiler dislikes duplicate "const", and the ones it complains
about look like they we in fact meant to be placed differently.

Also fix array_access_okay() (just like on x86), despite the construct
being unused on ARM: -Wint-in-bool-context, enabled by default in
gcc 7, doesn't like multiplication in conditional operators. "Hide" it,
at the risk of the next compiler version becoming smarter and
recognizing even that. (The hope is that added smartness then would
also better deal with legitimate cases like the one here.) The change
could have been done in access_ok(), but I think we better keep it at
the place the compiler is actually unhappy about.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master commit: 9d3011bd1cd29f8f3841bf1b64d5ead9ed1434e8
master date: 2017-05-19 10:12:08 +0200

7 years agox86: fix build with gcc 7
Jan Beulich [Fri, 9 Jun 2017 11:40:28 +0000 (13:40 +0200)]
x86: fix build with gcc 7

-Wint-in-bool-context, enabled by default in gcc 7, doesn't like
multiplication in conditional operators. "Hide" them, at the risk of
the next compiler version becoming smarter and recognizing even those.
(The hope is that added smartness then would also better deal with
legitimate cases like the ones here.)

The change could have been done in access_ok(), but I think we better
keep it at the places the compiler is actually unhappy about.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f32400e90c046a9fd76c8917a60d34ade9c02ea2
master date: 2017-05-19 10:11:36 +0200

7 years agox86/mm: fix incorrect unmapping of 2MB and 1GB pages
Igor Druzhinin [Fri, 9 Jun 2017 11:36:48 +0000 (13:36 +0200)]
x86/mm: fix incorrect unmapping of 2MB and 1GB pages

The same set of functions is used to set as well as to clean
P2M entries, except that for clean operations INVALID_MFN (~0UL)
is passed as a parameter. Unfortunately, when calculating an
appropriate target order for a particular mapping INVALID_MFN
is not taken into account which leads to 4K page target order
being set each time even for 2MB and 1GB mappings. This eventually
breaks down an EPT structure irreversibly into 4K mappings which
prevents consecutive high order mappings to this area.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
x86/NPT: deal with fallout from 2Mb/1Gb unmapping change

Commit efa9596e9d ("x86/mm: fix incorrect unmapping of 2MB and 1GB
pages") left the NPT code untouched, as there is no explicit alignment
check matching the one in EPT code. However, the now more widespread
storing of INVALID_MFN into PTEs requires adjustments:
- calculations when shattering large pages may spill into the p2m type
  field (converting p2m_populate_on_demand to p2m_grant_map_rw) - use
  OR instead of PLUS,
- the use of plain l{2,3}e_from_pfn() in p2m_pt_set_entry() results in
  all upper (flag) bits being clobbered - introduce and use
  p2m_l{2,3}e_from_pfn(), paralleling the existing L1 variant.

Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: efa9596e9d167c8fb7d1c4446c10f7ca30453646
master date: 2017-05-17 17:23:15 +0200
master commit: 83520cb4aa39ebeb4eb1a7cac2e85b413e75a336
master date: 2017-06-06 14:32:54 +0200

7 years agox86/pv: Align %rsp before pushing the failsafe stack frame
Andrew Cooper [Fri, 9 Jun 2017 11:36:08 +0000 (13:36 +0200)]
x86/pv: Align %rsp before pushing the failsafe stack frame

Architecturally, all 64bit stacks are aligned on a 16 byte boundary before an
exception frame is pushed.  The failsafe frame should not special in this
regard.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cbcaccb5e991155a4ae85a032e990614c3dc6960
master date: 2017-05-09 19:00:20 +0100

7 years agox86/pv: Fix bugs with the handling of int80_bounce
Andrew Cooper [Fri, 9 Jun 2017 11:35:27 +0000 (13:35 +0200)]
x86/pv: Fix bugs with the handling of int80_bounce

Testing has revealed two issues:

 1) Passing a NULL handle to set_trap_table() is intended to flush the entire
    table.  The 64bit guest case (and 32bit guest on 32bit Xen, when it
    existed) called init_int80_direct_trap() to reset int80_bounce, but c/s
    cda335c279 which introduced the 32bit guest on 64bit Xen support omitted
    this step.  Previously therefore, it was impossible for a 32bit guest to
    reset its registered int80_bounce details.

 2) init_int80_direct_trap() doesn't honour the guests request to have
    interrupts disabled on entry.  PVops Linux requests that interrupts are
    disabled, but Xen currently leaves them enabled when following the int80
    fastpath.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 55ab172a1f286742d918947ecb9b257ce31cc253
master date: 2017-05-09 19:00:04 +0100

7 years agox86/vpmu_intel: fix hypervisor crash by masking PC bit in MSR_P6_EVNTSEL
Mohit Gambhir [Fri, 9 Jun 2017 11:35:00 +0000 (13:35 +0200)]
x86/vpmu_intel: fix hypervisor crash by masking PC bit in MSR_P6_EVNTSEL

Setting Pin Control (PC) bit (19) in MSR_P6_EVNTSEL results in a General
Protection Fault and thus results in a hypervisor crash. This behavior has
been observed on two generations of Intel processors namely, Haswell and
Broadwell. Other Intel processor generations were not tested. However, it
does seem to be a possible erratum that hasn't yet been confirmed by Intel.

To fix the problem this patch masks PC bit and returns an error in
case any guest tries to write to it on any Intel processor. In addition
to the fact that setting this bit crashes the hypervisor on Haswell and
Broadwell, the PC flag bit toggles a hardware pin on the physical CPU
every time the programmed event occurs and the hardware behavior in
response to the toggle is undefined in the SDM, which makes this bit
unsafe to be used by guests and hence should be masked on all machines.

Signed-off-by: Mohit Gambhir <mohit.gambhir@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 8bf68dca65e2d61f4dfc6715cca51ad3dd5aadf1
master date: 2017-05-08 13:37:17 +0200

7 years agohvm: fix hypervisor crash in hvm_save_one()
Jan Beulich [Fri, 9 Jun 2017 11:34:29 +0000 (13:34 +0200)]
hvm: fix hypervisor crash in hvm_save_one()

hvm_save_cpu_ctxt() returns success without writing any data into
hvm_domain_context_t when all VCPUs are offline. This can then crash
the hypervisor (with FATAL PAGE FAULT) in hvm_save_one() via the
"off < (ctxt.cur - sizeof(*desc))" for() test, where ctxt.cur remains 0,
causing an underflow which leads the hypervisor to go off the end of the
ctxt buffer.

This has been broken since Xen 4.4 (c/s e019c606f59).
It has happened in practice with an HVM Linux VM (Debian 8) queried around
shutdown:

(XEN) hvm.c:1595:d3v0 All CPUs offline -- powering off.
(XEN) ----[ Xen-4.9-rc  x86_64  debug=y   Not tainted ]----
(XEN) CPU:    5
(XEN) RIP:    e008:[<ffff82d0802496d2>] hvm_save_one+0x145/0x1fd
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor (d0v2)
(XEN) rax: ffff830492cbb445   rbx: 0000000000000000   rcx: ffff83039343b400
(XEN) rdx: 00000000ff88004d   rsi: fffffffffffffff8   rdi: 0000000000000000
(XEN) rbp: ffff8304103e7c88   rsp: ffff8304103e7c48   r8:  0000000000000001
(XEN) r9:  deadbeefdeadf00d   r10: 0000000000000000   r11: 0000000000000282
(XEN) r12: 00007f43a3b14004   r13: 00000000fffffffe   r14: 0000000000000000
(XEN) r15: ffff830400c41000   cr0: 0000000080050033   cr4: 00000000001526e0
(XEN) cr3: 0000000402e13000   cr2: ffff830492cbb447
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen code around <ffff82d0802496d2> (hvm_save_one+0x145/0x1fd):
(XEN)  00 00 48 01 c8 83 c2 08 <66> 39 58 02 75 64 eb 08 48 89 c8 ba 08 00 00 00
(XEN) Xen stack trace from rsp=ffff8304103e7c48:
(XEN)    0000041000000000 ffff83039343b400 ffff8304103e7c70 ffff8304103e7da8
(XEN)    ffff830400c41000 00007f43a3b13004 ffff8304103b7000 ffffffffffffffea
(XEN)    ffff8304103e7d48 ffff82d0802683d4 ffff8300d19fd000 ffff82d0802320d8
(XEN)    ffff830400c41000 0000000000000000 ffff8304103e7cd8 ffff82d08026ff3d
(XEN)    0000000000000000 ffff8300d19fd000 ffff8304103e7cf8 ffff82d080232142
(XEN)    0000000000000000 ffff8300d19fd000 ffff8304103e7d28 ffff82d080207051
(XEN)    ffff8304103e7d18 ffff830400c41000 0000000000000202 ffff830400c41000
(XEN)    0000000000000000 00007f43a3b13004 0000000000000000 deadbeefdeadf00d
(XEN)    ffff8304103e7e68 ffff82d080206c47 0700000000000000 ffff830410375bd0
(XEN)    0000000000000296 ffff830410375c78 ffff830410375c80 0000000000000003
(XEN)    ffff8304103e7e68 ffff8304103b67c0 ffff8304103b7000 ffff8304103b67c0
(XEN)    0000000d00000037 0000000000000003 0000000000000002 00007f43a3b14004
(XEN)    00007ffd5d925590 0000000000000000 0000000100000000 0000000000000000
(XEN)    00000000ea8f8000 0000000000000000 00007ffd00000000 0000000000000000
(XEN)    00007f43a276f557 0000000000000000 00000000ea8f8000 0000000000000000
(XEN)    00007ffd5d9255e0 00007f43a23280b2 00007ffd5d926058 ffff8304103e7f18
(XEN)    ffff8300d19fe000 0000000000000024 ffff82d0802053e5 deadbeefdeadf00d
(XEN)    ffff8304103e7f08 ffff82d080351565 010000003fffffff 00007f43a3b13004
(XEN)    deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d
(XEN)    ffff8800781425c0 ffff88007ce94300 ffff8304103e7ed8 ffff82d0802719ec
(XEN) Xen call trace:
(XEN)    [<ffff82d0802496d2>] hvm_save_one+0x145/0x1fd
(XEN)    [<ffff82d0802683d4>] arch_do_domctl+0xa7a/0x259f
(XEN)    [<ffff82d080206c47>] do_domctl+0x1862/0x1b7b
(XEN)    [<ffff82d080351565>] pv_hypercall+0x1ef/0x42c
(XEN)    [<ffff82d080355106>] entry.o#test_all_events+0/0x30
(XEN)
(XEN) Pagetable walk from ffff830492cbb447:
(XEN)  L4[0x106] = 00000000dbc36063 ffffffffffffffff
(XEN)  L3[0x012] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 5:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff830492cbb447
(XEN) ****************************************

At the same time pave the way for having zero-length records.

Inspired by an earlier patch from Andrew and Razvan.

Reported-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Tim Deegan <tim@xen.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
master commit: ed719d7ca6e8df6384a2ecbe9a78977e32586478
master date: 2017-05-04 15:05:26 +0200

7 years agox86/32on64: properly honor add-to-physmap-batch's size
Jan Beulich [Fri, 9 Jun 2017 11:33:32 +0000 (13:33 +0200)]
x86/32on64: properly honor add-to-physmap-batch's size

Commit 407a3c00ff ("compat/memory: fix build with old gcc") "fixed" a
build issue by switching to the use of uninitialized data. Due to
- the bounding of the uninitialized data item
- the accessed area being outside of Xen space
- arguments being properly verified by the native hypercall function
this is not a security issue.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 144aec4140515c53bb1676df71a469f3e285c557
master date: 2017-04-26 09:48:45 +0200

7 years agotools: ocaml: In configure, check for ocamlopt
Ian Jackson [Mon, 3 Apr 2017 11:34:13 +0000 (12:34 +0100)]
tools: ocaml: In configure, check for ocamlopt

If ocaml.m4 didn't find ocamlopt, disable all the ocaml builds.

Currently our Makefiles do not work properly when the native code
compiler (`ocamlopt') is not available.  In principle this should be
fixed to fall back to bytecode, but this is not a task for this stage
of the Xen 4.9 release.

Without this change, we cannot build on systems with only ocamlc.
That includes Debian jessie ARM64, as used on the new ARM64 hardware
in the Xen Project CI test lab.

When the Makefiles are fixed, this commit should be reverted.

Committers: Please rerun autogen.sh.

CC: Julien Grall <julien.grall@arm.com>
CC: Christian Lindig <christian.lindig@citrix.com>
CC: Jonathan Ludlam <Jonathan.Ludlam@citrix.com>
CC: David Scott <dave@recoil.org>
CC: Wei Liu <wei.liu2@citrix.com>
Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 4d0240e03349fd0715332eae65372e0a47b5a43b)

7 years agotools/libxc: Tolerate specific zero-content records in migration v2 streams
Andrew Cooper [Thu, 30 Mar 2017 16:32:31 +0000 (17:32 +0100)]
tools/libxc: Tolerate specific zero-content records in migration v2 streams

The migration v2 save code was written to avoid sending data records with no
content, as such records serve no purpose but come with a performance hit.
The restore code sanity checks this expectation.

Under some circumstances (most notably, on AMD hardware with Debug Extensions,
and a PV guest kernel which is not using the feature), the save code would
generate a record with no content, which trips the sanity check in the restore
code.

As the stream is otherwise fine, tolerate these records and avoid failing the
migration.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 119ee4d77377aa1fc62efdadc1cc87df4f1270bf)

7 years agolibxc: fix segfault on uninitialized xch->fmem
Seraphime Kirkovski [Tue, 4 Apr 2017 12:40:48 +0000 (14:40 +0200)]
libxc: fix segfault on uninitialized xch->fmem

Currently in xc_interface_open, xch->fmem is not initialized
and in some rare case the code fails before ever assigning a value
to it.

I got this in master:

   $ sudo ./xl/xl run
   xencall: error: Could not obtain handle on privileged command interface: No such file or directory
   Segmentation fault

This initializes the whole xch_buff to 0.

Signed-off-by: Seraphime Kirkovski <kirkseraph@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c4bdbec00c9063736361124a3492ebceabfaed06)

8 years agox86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()
Haozhong Zhang [Wed, 3 May 2017 15:06:33 +0000 (17:06 +0200)]
x86/mce: always re-initialize 'severity_cpu' in mcheck_cmn_handler()

mcheck_cmn_handler() does not always set 'severity_cpu' to override
its value taken from previous rounds of MC handling, which will
interfere the current round of MC handling. Always re-initialize it to
clear the historical value.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6a2c6a68423475cd89a8cc9978554880e5a21b7d
master date: 2017-04-07 15:56:09 +0200

8 years agox86/mce: make 'severity_cpu' private to its users
Haozhong Zhang [Wed, 3 May 2017 15:06:07 +0000 (17:06 +0200)]
x86/mce: make 'severity_cpu' private to its users

The current 'severity_cpu' is used by both mcheck_cmn_handler() and
mce_softirq(). If MC# happens during mce_softirq(), the values set in
mcheck_cmn_handler() and mce_softirq() may interfere with each
other. Use private 'severity_cpu' for each function to fix this issue.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 011899a2e00b9f748c04d501c205da04bbff4359
master date: 2017-04-07 15:55:34 +0200

8 years agomemory: don't hand MFN info to translated guests
Jan Beulich [Wed, 3 May 2017 15:05:39 +0000 (17:05 +0200)]
memory: don't hand MFN info to translated guests

We shouldn't hand MFN info back from increase-reservation for
translated domains, just like we don't for populate-physmap and
memory-exchange. For full symmetry also check for a NULL guest handle
in populate_physmap() (but note this makes no sense in
memory_exchange(), as there the array is also an input).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d18627583df28facd9af473ea1ac4a56e93e6ea9
master date: 2017-04-05 16:39:53 +0200

8 years agomemory: exit early from memory_exchange() upon write-back error
Jan Beulich [Wed, 3 May 2017 15:05:11 +0000 (17:05 +0200)]
memory: exit early from memory_exchange() upon write-back error

There's no point in continuing if in the end we'll return -EFAULT
anyway. It also seems wrong to report a chunk for which at least one
write-back failed as successfully exchanged (albeit the indication of
an error is also not fully correct, as the exchange happened in that
case at least partially - retrieving the GFN to assign the memory to
and/or handing back the information on the replacement memory didn't
work). In any case limiting the amount of damage done to the guest
can't be all that bad an idea.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1cf4d2ec0d7c0cb53729ca810e416793030f6f07
master date: 2017-04-05 16:39:16 +0200

8 years agokexec: clear kexec_image slot when unloading kexec image
Bhavesh Davda [Wed, 3 May 2017 15:03:34 +0000 (17:03 +0200)]
kexec: clear kexec_image slot when unloading kexec image

When kexec_do_unload calls kexec_swap_images to get the old kexec_image to
free, it passes NULL for the new kexec_image pointer. The new slot wasn't being
cleared in such a case, leading to a stale pointer being left behind in the
kexec_image array and Xen panics in subsequent load/unload operations.

Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5c5216e825332c83b1965b5a39a6100f9dde34da
master date: 2017-04-04 11:34:57 +0200

8 years agoupdate Xen version to 4.8.2-pre
Jan Beulich [Tue, 2 May 2017 12:55:09 +0000 (14:55 +0200)]
update Xen version to 4.8.2-pre

8 years agox86: discard type information when stealing pages
Jan Beulich [Tue, 2 May 2017 12:54:26 +0000 (14:54 +0200)]
x86: discard type information when stealing pages

While a page having just a single general reference left necessarily
has a zero type reference count too, its type may still be valid (and
in validated state; at present this is only possible and relevant for
PGT_seg_desc_page, as page tables have their type forcibly zapped when
their type reference count drops to zero, and
PGT_{writable,shared}_page pages don't require any validation). In
such a case when the page is being re-used with the same type again,
validation is being skipped. As validation criteria differ between
32- and 64-bit guests, pages to be transferred between guests need to
have their validation indicator zapped (and with it we zap all other
type information at once).

This is XSA-214.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eaf537342c909875c10f49b06e17493655410681
master date: 2017-05-02 14:46:58 +0200

8 years agomulticall: deal with early exit conditions
Jan Beulich [Tue, 2 May 2017 12:52:54 +0000 (14:52 +0200)]
multicall: deal with early exit conditions

In particular changes to guest privilege level require the multicall
sequence to be aborted, as hypercalls are permitted from kernel mode
only. While likely not very useful in a multicall, also properly handle
the return value in the HYPERVISOR_iret case (which should be the guest
specified value).

This is XSA-213.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 22c096c99d8c05833c3c19870e36efb2dd4e8013
master date: 2017-05-02 14:45:02 +0200

8 years agoMerge branch 'staging-4.8' of xenbits.xen.org:/home/xen/git/xen into staging-4.8
Jan Beulich [Mon, 10 Apr 2017 13:24:47 +0000 (15:24 +0200)]
Merge branch 'staging-4.8' of xenbits.xen.org:/home/xen/git/xen into staging-4.8

8 years agoupdate Xen version to 4.8.1 RELEASE-4.8.1
Jan Beulich [Mon, 10 Apr 2017 13:21:48 +0000 (15:21 +0200)]
update Xen version to 4.8.1

8 years agosetup vwfi correctly on cpu0
Stefano Stabellini [Fri, 31 Mar 2017 22:37:07 +0000 (15:37 -0700)]
setup vwfi correctly on cpu0

parse_vwfi runs after init_traps on cpu0, potentially resulting in the
wrong HCR_EL2 for it. Secondary cpus boot after parse_vwfi, so in their
case init_traps will write the correct set of flags to HCR_EL2.

For cpu0, fix the issue by changing HCR_EL2 setting from a new
presmp_initcall.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: trim history in the frequent_ops function
Thomas Sanders [Tue, 28 Mar 2017 17:57:52 +0000 (18:57 +0100)]
oxenstored: trim history in the frequent_ops function

We were trimming the history of commits only at the end of each
transaction (regardless of how it ended).

Therefore if non-transactional writes were being made but no
transactions were being ended, the history would grow
indefinitely. Now we trim the history at regular intervals.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored transaction conflicts: improve logging
Thomas Sanders [Mon, 27 Mar 2017 13:36:34 +0000 (14:36 +0100)]
oxenstored transaction conflicts: improve logging

For information related to transaction conflicts, potentially frequent
logging at "info" priority has been changed to "debug" priority, and
once per two minutes there is an "info" priority summary.

Additional detailed logging has been added at "debug" priority.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: don't wake to issue no conflict-credit
Thomas Sanders [Fri, 24 Mar 2017 19:55:03 +0000 (19:55 +0000)]
oxenstored: don't wake to issue no conflict-credit

In the main loop, when choosing the timeout for the select function
call, we were setting it so as to wake up to issue conflict-credit to
any domains that could accept it. When xenstore is idle, this would
mean waking up every 50ms (by default) to do no work. With this
commit, we check whether any domain is below its cap, and if not then
we set the timeout for longer (the same timeout as before the
conflict-protection feature was added).

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
8 years agooxenstored: do not commit read-only transactions
Thomas Sanders [Fri, 24 Mar 2017 16:16:10 +0000 (16:16 +0000)]
oxenstored: do not commit read-only transactions

The packet telling us to end the transaction has always carried an
argument telling us whether to commit.

If the transaction made no modifications to the tree, now we ignore
that argument and do not commit: it is just a waste of effort.

This makes read-only transactions immune to conflicts, and means that
we do not need to store any of their details in the history that is
used for assigning blame for conflicts.

We count a transaction as a read-only transaction only if it contains
no operations that modified the tree.

This means that (for example) a transaction that creates a new node
then deletes it would NOT count as read-only, even though it makes no
change overall. A more sophisticated algorithm could judge the
transaction based on comparison of its initial and final states, but
this would add complexity and computational cost.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
8 years agooxenstored: allow self-conflicts
Thomas Sanders [Thu, 23 Mar 2017 19:06:54 +0000 (19:06 +0000)]
oxenstored: allow self-conflicts

We already avoid inter-domain conflicts but now allow intra-domain
conflicts.  Although there are no known practical examples of a domain
that might perform operations that conflict with its own transactions,
this is conceivable, so here we avoid changing those semantics
unnecessarily.

When a transaction commit fails with a conflict and we look through
the history of commits to see which connection(s) to blame, ignore
historical commits that were made by the same connection as the
failing commit.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
8 years agooxenstored: blame the connection that caused a transaction conflict
Jonathan Davies [Thu, 23 Mar 2017 14:28:16 +0000 (14:28 +0000)]
oxenstored: blame the connection that caused a transaction conflict

Blame each connection found to have made a commit that would cause this
transaction to fail. Each blamed connection is penalised by having its
conflict-credit decremented.

Note the change in semantics for the replay function: we no longer stop after
finding the first operation that can't be replayed. This allows us to identify
all operations that conflicted with this transaction, not just the one that
conflicted first.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
v1 Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

Changes since v1:
 * use correct log levels for informational messages
Changes since v2:
 * fix the blame algorithm and improve logging
   (fix was reviewed by Jonathan Davies)

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: track commit history
Jonathan Davies [Mon, 27 Mar 2017 08:58:29 +0000 (08:58 +0000)]
oxenstored: track commit history

Since the list of historic activity cannot grow without bound, it is safe to use
this to track commits.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: discard old commit-history on txn end
Thomas Sanders [Thu, 23 Mar 2017 14:25:16 +0000 (14:25 +0000)]
oxenstored: discard old commit-history on txn end

The history of commits is to be used for working out which historical
commit(s) (including atomic writes) caused conflicts with a
currently-failing commit of a transaction. Any commit that was made
before the current transaction started cannot be relevant. Therefore
we never need to keep history from before the start of the
longest-running transaction that is open at any given time: whenever a
transaction ends (with or without a commit) then if it was the
longest-running open transaction we can delete history up until start
of the the next-longest-running open transaction.

Some transactions might stay open for a very long time, so if any
transaction exceeds conflict_max_history_seconds then we remove it
from consideration in this context, and will not guarantee to keep
remembering about historical commits made during such a transaction.

We implement this by keeping a list of all open transactions that have
not been open too long. When a transaction ends, we remove it from the
list, along with any that have been open longer than the maximum; then
we delete any history from before the start of the longest-running
transaction remaining in the list.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: only record operations with side-effects in history
Jonathan Davies [Thu, 23 Mar 2017 14:20:33 +0000 (14:20 +0000)]
oxenstored: only record operations with side-effects in history

There is no need to record "read" operations as they will never cause another
transaction to fail.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: support commit history tracking
Jonathan Davies [Tue, 14 Mar 2017 13:20:07 +0000 (13:20 +0000)]
oxenstored: support commit history tracking

Add ability to track xenstore tree operations -- either non-transactional
operations or committed transactions.

For now, the call to actually retain commits is commented out because history
can grow without bound.

For now, we call record_commit for all non-transactional operations. A
subsequent patch will make it retain only the ones with side-effects.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: add transaction info relevant to history-tracking
Jonathan Davies [Tue, 14 Mar 2017 12:17:38 +0000 (12:17 +0000)]
oxenstored: add transaction info relevant to history-tracking

Specifically:
 * retain the original store (not just the root) in full transactions
 * store commit count at the time of the start of the transaction

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: ignore domains with no conflict-credit
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: ignore domains with no conflict-credit

When processing connections, skip those from domains with no remaining
conflict-credit.

Also, issue a point of conflict-credit at regular intervals, the
period being set by the configuration option "conflict-max-history-
seconds".  When issuing conflict-credit, we give a point either to
every domain at once (one each) or only to the single domain at the
front of the queue, depending on the configuration option
"conflict-rate-limit-is-aggregate".

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: handling of domain conflict-credit
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: handling of domain conflict-credit

This commit gives each domain a conflict-credit variable, which will
later be used for limiting how often a domain can cause other domain's
transaction-commits to fail.

This commit also provides functions and data for manipulating domains
and their conflict-credit, and checking whether they have credit.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: comments explaining some variables
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: comments explaining some variables

It took a while of reading and reasoning to work out what these are
for, so here are comments to make life easier for everyone reading
this code in future.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agoxenstored: Log when the write transaction rate limit bites
Ian Jackson [Tue, 7 Mar 2017 16:09:13 +0000 (16:09 +0000)]
xenstored: Log when the write transaction rate limit bites

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
plus:

xenstore: dont increment bool variable
Instead of incrementing a bool variable just set it to true.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
8 years agoxenstored: apply a write transaction rate limit
Ian Jackson [Tue, 7 Mar 2017 16:09:12 +0000 (16:09 +0000)]
xenstored: apply a write transaction rate limit

This avoids a rogue client being about to stall another client (eg the
toolstack) indefinitely.

This is XSA-206.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Backported to 4.8 (not entirely trivial).

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agotools/libxenctrl: fix error check after opening libxenforeignmemory
Paul Durrant [Wed, 22 Feb 2017 13:27:34 +0000 (13:27 +0000)]
tools/libxenctrl: fix error check after opening libxenforeignmemory

Checking the value of xch->xcall is clearly incorrect. The code should be
checking xch->fmem (i.e. the return of the previously called function).

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 80a7d04f532ddc3500acd7988917708a536ae15f)

8 years agolibxl: correct xenstore entry for empty cdrom
Juergen Gross [Wed, 15 Feb 2017 11:11:12 +0000 (12:11 +0100)]
libxl: correct xenstore entry for empty cdrom

Specifying an empty cdrom device will result in a Xenstore entry

params = aio:(null)

as the physical device path isn't existing. This lets a domain booted
via OVMF hang as OVMF is checking for "aio:" only in order to detect
the empty cdrom case.

Use an empty string for the physical device path in this case. As a
cdrom device for HVM is always backed by qdisk we only need to cover this
backend.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
8 years agox86: use 64 bit mask when masking away mfn bits
Juergen Gross [Tue, 4 Apr 2017 12:55:55 +0000 (14:55 +0200)]
x86: use 64 bit mask when masking away mfn bits

When using _PAGE_PSE_PAT as base for a negated bit mask make sure it is
propagated to 64 bits when applied to a 64 bit value.

There seems to be only one place where this is a problem, so fix this
by casting _PAGE_PSE_PAT to 64 bits there.

Not doing so will probably lead to problems on hosts with more than
16 TB of memory.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4edb1a42e3320757e3559f17edf6903bc1777de3
master date: 2017-03-30 15:11:24 +0200

8 years agomemory: properly check guest memory ranges in XENMEM_exchange handling
Jan Beulich [Tue, 4 Apr 2017 12:55:00 +0000 (14:55 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling

The use of guest_handle_okay() here (as introduced by the XSA-29 fix)
is insufficient here, guest_handle_subrange_okay() needs to be used
instead.

Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
  16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
  number-of-entries input (leaving aside the fact that this can be
  called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
  index zero.

This is CVE-2017-7228 / XSA-212.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 938fd2586eb081bcbd694f4c1f09ae6a263b0d90
master date: 2017-04-04 14:47:46 +0200

8 years agoxen: sched: don't call hooks of the wrong scheduler via VCPU2OP
Dario Faggioli [Fri, 31 Mar 2017 06:33:20 +0000 (08:33 +0200)]
xen: sched: don't call hooks of the wrong scheduler via VCPU2OP

Within context_saved(), we call the context_saved hook,
and we use VCPU2OP() to determine from what scheduler.
VCPU2OP uses DOM2OP, which uses d->cpupool, which is
NULL when d is the idle domain. And in that case,
DOM2OP just returns ops, the scheduler of cpupool0.

Therefore, if:
- cpupool0's scheduler defines context_saved (like
  Credit2 and RTDS do),
- we are not in cpupool0 (i.e., our scheduler is
  not ops),
- we are context switching from idle,

we call VCPU2OP(idle_vcpu), which means
DOM2OP(idle->cpupool), which is ops.

Therefore, we both:
- check if context_saved is defined in the wrong
  scheduler;
- if yes, call the wrong one.

When using Credit2 at boot, and also Credit2 in
the other cpupool, this is wrong but innocuous,
because it only involves the idle vcpus.

When using Credit2 at boot, and Credit1 in the
other cpupool, this is *totally* wrong, and
it's by chance it does not explode!

When using Credit2 and other schedulers I'm
developping, I hit the following assert (in
sched_credit2.c, on a CPU inside a cpupool that
does not use Credit2):

csched2_context_saved()
{
 ...
 ASSERT(!vcpu_on_runq(svc));
 ...
}

Fix this by dealing explicitly, in VCPU2OP, with
idle vcpus, returning the scheduler of the pCPU
they (always) run on.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a3653e6a279213ba4e883b2252415dc98633106a
master date: 2017-03-27 14:28:05 +0100

8 years agox86/EFI: avoid Xen image when looking for module/kexec position
Jan Beulich [Fri, 31 Mar 2017 06:32:51 +0000 (08:32 +0200)]
x86/EFI: avoid Xen image when looking for module/kexec position

When booting straight from EFI, we don't further try to relocate Xen.
As a result, so far we also didn't avoid the area Xen uses when looking
for a location to put modules or the kexec area. Introduce a fake
module slot to deal with that without having to fiddle with a lot of
code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e22e1c47958a4778cd7baa3980f74e52f525ba28
master date: 2017-03-20 09:27:12 +0100

8 years agox86/EFI: avoid IOMMU faults on [_end,__2M_rwdata_end)
Jan Beulich [Fri, 31 Mar 2017 06:32:22 +0000 (08:32 +0200)]
x86/EFI: avoid IOMMU faults on [_end,__2M_rwdata_end)

Commit c9a4a1c419 ("x86/layout: Correct Xen's idea of its own memory
layout") didn't go far enough with the conversion, causing IOMMU faults
when memory from that range was handed to a domain. We must not make
this memory available for allocation (the change is benign to xen.gz at
this point in time).

Note that the change to tboot_shutdown() is fixing another issue at
once: As it looks, the function so far skipped all memory below the Xen
image.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d522571a408a7dd21a06705f6dd51cdafd2db4fc
master date: 2017-03-20 09:25:36 +0100

8 years agox86/EFI: avoid overrunning mb_modules[]
Jan Beulich [Fri, 31 Mar 2017 06:31:53 +0000 (08:31 +0200)]
x86/EFI: avoid overrunning mb_modules[]

Commit 436fb462ab ("x86/microcode: enable boot time (pre-Dom0)
loading") added a 4th module without providing an array slot for it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 02b37b7eff39e40828041b2fe480725ab8443258
master date: 2017-03-17 15:45:22 +0100

8 years agobuild/clang: fix XSM dummy policy when using clang 4.0
Roger Pau Monné [Fri, 31 Mar 2017 06:31:14 +0000 (08:31 +0200)]
build/clang: fix XSM dummy policy when using clang 4.0

There seems to be some weird bug in clang 4.0 that prevents xsm_pmu_op from
working as expected, and vpmu.o ends up with a reference to
__xsm_action_mismatch_detected which makes the build fail:

[...]
ld    -melf_x86_64_fbsd  -T xen.lds -N prelink.o  \
    xen/common/symbols-dummy.o -o xen/.xen-syms.0
prelink.o: In function `xsm_default_action':
xen/include/xsm/dummy.h:80: undefined reference to `__xsm_action_mismatch_detected'
xen/xen/include/xsm/dummy.h:80: relocation truncated to fit: R_X86_64_PC32 against undefined symbol `__xsm_action_mismatch_detected'
ld: xen/xen/.xen-syms.0: hidden symbol `__xsm_action_mismatch_detected' isn't defined

Then doing a search in the objects files:

# find xen/ -type f -name '*.o' -print0 | xargs -0 bash -c \
  'for filename; do nm "$filename" | \
  grep -q __xsm_action_mismatch_detected && echo "$filename"; done' bash
xen/arch/x86/prelink.o
xen/arch/x86/cpu/vpmu.o
xen/arch/x86/cpu/built_in.o
xen/arch/x86/built_in.o

The current patch is the only way I've found to fix this so far, by simply
moving the XSM_PRIV check into the default case in xsm_pmu_op. This also fixes
the behavior of do_xenpmu_op, which will now return -EINVAL for unknown
XENPMU_* operations, instead of -EPERM when called by a privileged domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 9e4d116faff4545a7f21c2b01008e94d68e6db58
master date: 2017-03-14 18:19:29 +0100

8 years agox86: drop unneeded __packed attributes
Roger Pau Monné [Fri, 31 Mar 2017 06:28:49 +0000 (08:28 +0200)]
x86: drop unneeded __packed attributes

There where a couple of unneeded packed attributes in several x86-specific
structures, that are obviously aligned. The only non-trivial one is
vmcb_struct, which has been checked to have the same layout with and without
the packed attribute using pahole. In that case add a build-time size check to
be on the safe side.

No functional change is expected as a result of this commit.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 4036e7c592905c2292cdeba8269e969959427237
master date: 2017-03-07 17:11:06 +0100

8 years agoarm: xen_size should be paddr_t for consistency
Stefano Stabellini [Wed, 29 Mar 2017 18:32:34 +0000 (11:32 -0700)]
arm: xen_size should be paddr_t for consistency

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: alternative: Register re-mapped Xen area as a temporary virtual region
Wei Chen [Mon, 27 Mar 2017 08:40:50 +0000 (16:40 +0800)]
xen/arm: alternative: Register re-mapped Xen area as a temporary virtual region

While I was using the alternative patching in the SErrors patch series [1].
I used a branch instruction as alternative instruction.

        ALTERNATIVE("nop",
                    "b skip_check",
                    SKIP_CHECK_PENDING_VSERROR)

Unfortunately, I got a system panic message with this code:

(XEN) build-id: f64081d86e7e88504b7d00e1486f25751c004e39
(XEN) alternatives: Patching with alt table 100b9480 -> 100b9498
(XEN) Xen BUG at alternative.c:61
(XEN) ----[ Xen-4.9-unstable  arm32  debug=y   Tainted:  C   ]----
(XEN) CPU:    0
(XEN) PC:     00252b68 alternative.c#__apply_alternatives+0x128/0x1d4
(XEN) CPSR:   800000da MODE:Hypervisor
(XEN)      R0: 00000000 R1: 00000000 R2: 100b9490 R3: 100b949c
(XEN)      R4: eafeff84 R5: 00000000 R6: 100b949c R7: 10079290
(XEN)      R8: 100792ac R9: 00000001 R10:100b948c R11:002cfe04 R12:002932c0
(XEN) HYP: SP: 002cfdc4 LR: 00239128
(XEN)
(XEN)   VTCR_EL2: 80003558
(XEN)  VTTBR_EL2: 0000000000000000
(XEN)
(XEN)  SCTLR_EL2: 30cd187f
(XEN)    HCR_EL2: 000000000038663f
(XEN)  TTBR0_EL2: 00000000bff09000
(XEN)
(XEN)    ESR_EL2: 00000000
(XEN)  HPFAR_EL2: 0000000000000000
(XEN)      HDFAR: 00000000
(XEN)      HIFAR: 00000000
(XEN)
(XEN) Xen stack trace from sp=002cfdc4:
(XEN)    00000000 00294328 002e0004 00000001 10079290 002cfe14 100b9490 00000000
(XEN)    10010000 10122700 00200000 002cfe1c 00000080 00252c14 00000000 002cfe64
(XEN)    00252dd8 00000007 00000000 000bfe00 100b9480 100b9498 002cfe1c 002cfe1c
(XEN)    10010000 10122700 00000000 00000000 00000000 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 002ddf30 00000000 003113e8 0030f018 002cfe9c
(XEN)    00238914 00000002 00000000 00000000 00000000 0028b000 00000002 00293800
(XEN)    00000002 0030f238 00000002 00290640 00000001 002cfea4 002a2840 002cff54
(XEN)    002a65fc 11112131 10011142 00000000 0028d194 00000000 00000000 00000000
(XEN)    bdffb000 80000000 00000000 c0000000 00000000 00000002 00000000 c0000000
(XEN)    002b8060 00002000 002b8040 00000000 c0000000 bc000000 00000000 c0000000
(XEN)    00000000 be000000 00000000 00112701 00000000 bff12701 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000018 00000000 00000001 00000000
(XEN)    9fece000 80200000 80000000 00400000 00200550 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
(XEN)    00000000 00000000 00000000 00000000 00000000 00000000 00000000
(XEN) Xen call trace:
(XEN)    [<00252b68>] alternative.c#__apply_alternatives+0x128/0x1d4 (PC)
(XEN)    [<00239128>] is_active_kernel_text+0x10/0x28 (LR)
(XEN)    [<00252dd8>] alternative.c#__apply_alternatives_multi_stop+0x1c4/0x204
(XEN)    [<00238914>] stop_machine_run+0x1e8/0x254
(XEN)    [<002a2840>] apply_alternatives_all+0x38/0x54
(XEN)    [<002a65fc>] start_xen+0xcf4/0xf88
(XEN)    [<00200550>] arm32/head.o#paging+0x94/0xd8
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) Xen BUG at alternative.c:61
(XEN) ****************************************

This panic was triggered by the BUG(); in branch_insn_requires_update.
That's because in this case the alternative patching needs to update the
offset of the branch instruction. But the new target address of the branch
instruction could not pass the check of is_active_kernel_text();

The reason is that: When Xen is booting, it will call apply_alternatives_all
to do patching with alternative tables. In this progress, we should update
the offset of branch instructions if required. This means we should modify
the Xen text section. But Xen text section is marked as read-only and we
configure the hardware to not allow a region to be writable and executable at
the same time. So we re-map Xen in a temporary area for writing. In this case,
the calculation of the new target address of the branch instruction is based
on this re-mapped area. The new target address will point to a value in the
re-mapped area. But we haven't registered this area as an active kernel text.
So the check of is_active_kernel_text will always return false.

We have to register the re-mapped Xen area as a virtual region temporarily to
solve this problem.

1. https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg01939.html

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoQEMU_TAG update
Ian Jackson [Tue, 21 Mar 2017 18:44:24 +0000 (18:44 +0000)]
QEMU_TAG update

8 years agoarm: read/write rank->vcpu atomically
Stefano Stabellini [Sat, 11 Feb 2017 02:05:22 +0000 (18:05 -0800)]
arm: read/write rank->vcpu atomically

We don't need a lock in vgic_get_target_vcpu anymore, solving the
following lock inversion bug: the rank lock should be taken first, then
the vgic lock. However, gic_update_one_lr is called with the vgic lock
held, and it calls vgic_get_target_vcpu, which tries to obtain the rank
lock.

Coverity-ID: 1381855
Coverity-ID: 1381853

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: p2m: Perform local TLB invalidation on vCPU migration
Julien Grall [Wed, 8 Mar 2017 18:06:02 +0000 (18:06 +0000)]
xen/arm: p2m: Perform local TLB invalidation on vCPU migration

The ARM architecture allows an OS to have per-CPU page tables, as it
guarantees that TLBs never migrate from one CPU to another.

This works fine until this is done in a guest. Consider the following
scenario:
    - vcpu-0 maps P to V
    - vpcu-1 maps P' to V

If run on the same physical CPU, vcpu-1 can hit in TLBs generated by
vcpu-0 accesses, and access the wrong physical page.

The solution to this is to keep a per-p2m map of which vCPU ran the last
on each given pCPU and invalidate local TLBs if two vPCUs from the same
VM run on the same CPU.

Unfortunately it is not possible to allocate per-cpu variable on the
fly. So for now the size of the array is NR_CPUS, this is fine because
we still have space in the structure domain. We may want to add an
helper to allocate per-cpu variable in the future.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Introduce INVALID_VCPU_ID
Julien Grall [Wed, 8 Mar 2017 18:06:01 +0000 (18:06 +0000)]
xen/arm: Introduce INVALID_VCPU_ID

Define INVALID_VCPU_ID as MAX_VIRT_CPUS to avoid casting problem later
on. At the moment it can always fit in uint8_t.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Set nr_cpu_ids to available number of cpus
Vijaya Kumar K [Mon, 1 Feb 2016 09:26:13 +0000 (14:56 +0530)]
xen/arm: Set nr_cpu_ids to available number of cpus

nr_cpu_ids for arm platforms is incorrectly set to NR_CPUS
irrespective of the number of cpus supported by platform.

Signed-off-by: Vijaya Kumar K <Vijaya.Kumar@caviumnetworks.com>
Reviewed-by: Julien Grall <julien.grall@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: acpi: Relax hw domain mapping attributes to p2m_mmio_direct_c
Edgar E. Iglesias [Thu, 26 Jan 2017 13:16:02 +0000 (14:16 +0100)]
xen/arm: acpi: Relax hw domain mapping attributes to p2m_mmio_direct_c

Since the hardware domain is a trusted domain, we extend the
trust to include making final decisions on what attributes to
use when mapping memory regions.

For ACPI configured hardware domains, this patch relaxes the hardware
domains mapping attributes to p2m_mmio_direct_c. This will allow the
hardware domain to control the attributes via its S1 mappings.

Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoRevert "xen/arm: Map mmio-sram nodes as un-cached memory"
Edgar E. Iglesias [Thu, 26 Jan 2017 13:16:01 +0000 (14:16 +0100)]
Revert "xen/arm: Map mmio-sram nodes as un-cached memory"

This reverts commit 1e75ed8b64bc1a9b47e540e6f100f17ec6d97f1b.

The default attribute mapping for MMIO as been relaxed and now rely on
the hardware domain to set the correct memory attribute

Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: dt: Relax hw domain mapping attributes to p2m_mmio_direct_c
Edgar E. Iglesias [Thu, 26 Jan 2017 13:16:00 +0000 (14:16 +0100)]
xen/arm: dt: Relax hw domain mapping attributes to p2m_mmio_direct_c

Since the hardware domain is a trusted domain, we extend the
trust to include making final decisions on what attributes to
use when mapping memory regions.

For device-tree configured hardware domains, this patch relaxes
the hardware domains mapping attributes to p2m_mmio_direct_c.
This will allow the hardware domain to control the attributes
via its S1 mappings.

Signed-off-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: flush icache as well when XEN_DOMCTL_cacheflush is issued
Tamas K Lengyel [Fri, 27 Jan 2017 18:25:23 +0000 (11:25 -0700)]
xen/arm: flush icache as well when XEN_DOMCTL_cacheflush is issued

When the toolstack modifies memory of a running ARM VM it may happen
that the underlying memory of a current vCPU PC is changed. Without
flushing the icache the vCPU may continue executing stale instructions.

Also expose the xc_domain_cacheflush through xenctrl.h.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@zentific.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: fix GIC_INVALID_LR
Stefano Stabellini [Thu, 22 Dec 2016 02:15:10 +0000 (18:15 -0800)]
xen/arm: fix GIC_INVALID_LR

GIC_INVALID_LR should be 0xff, but actually, defined as ~(uint8_t)0, is
0xffffffff. Fix the problem by placing the ~ operator before the cast.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agofix out of bound access to mode_strings
Stefano Stabellini [Fri, 9 Dec 2016 01:17:04 +0000 (17:17 -0800)]
fix out of bound access to mode_strings

mode == ARRAY_SIZE(mode_strings) causes an out of bound access to
the mode_strings array.

Coverity-ID: 1381859

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agomissing vgic_unlock_rank in gic_remove_irq_from_guest
Stefano Stabellini [Fri, 9 Dec 2016 00:59:28 +0000 (16:59 -0800)]
missing vgic_unlock_rank in gic_remove_irq_from_guest

Add missing vgic_unlock_rank on the error path in
gic_remove_irq_from_guest.

Coverity-ID: 1381843

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: Fix macro for ARM Jazelle CPU feature identification
Artem Mygaiev [Tue, 6 Dec 2016 14:16:45 +0000 (16:16 +0200)]
xen/arm: Fix macro for ARM Jazelle CPU feature identification

Fix macro for ARM Jazelle CPU feature identification: value of 0 indicates
that CPU does not support ARM Jazelle (ID_PFR0[11:8])

Coverity-ID: 1381849

Signed-off-by: Artem Mygaiev <artem_mygaiev@epam.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoxen/arm: traps: Emulate ICC_SRE_EL1 as RAZ/WI
Julien Grall [Mon, 5 Dec 2016 17:43:23 +0000 (17:43 +0000)]
xen/arm: traps: Emulate ICC_SRE_EL1 as RAZ/WI

Recent Linux kernel (4.4 and onwards [1]) is checking whether it is possible
to enable sysreg access (ICC_SRE_EL1.SRE) when the ID register
(ID_AA64PRF0_EL1.GIC) is reporting the presence of the sysreg interface.

When the guest has been configured to use GICv2, the hypervisor will
disable sysreg access for this vm (via ICC_SRE_EL2.Enable) and therefore
access to system register such as ICC_SRE_EL1 are trapped in EL2.

However, ICC_SRE_EL1 is not emulated by the hypervisor. This means that
Linux will crash as soon as it is trying to access ICC_SRE_EL1.

To solve this problem, Xen can implement ICC_SRE_EL1 as read-as-zero
write-ignore. The emulation will only be used when sysreg are disabled
for EL1.

[1]  963fcd409 "arm64: cpufeatures: Check ICC_EL1_SRE.SRE before
enabling ARM64_HAS_SYSREG_GIC_CPUIF"

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoxen/arm: Fix misplaced parentheses for PSCI version check
Artem Mygaiev [Wed, 30 Nov 2016 13:53:11 +0000 (15:53 +0200)]
xen/arm: Fix misplaced parentheses for PSCI version check

Fix misplaced parentheses for PSCI version check

Signed-off-by: Artem Mygaiev <artem_mygaiev@epam.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoarm/irq: Reorder check when the IRQ is already used by someone
Oleksandr Tyshchenko [Fri, 2 Dec 2016 16:38:16 +0000 (18:38 +0200)]
arm/irq: Reorder check when the IRQ is already used by someone

Call irq_get_domain for the IRQ we are interested in
only after making sure that it is the guest IRQ to avoid
ASSERT(test_bit(_IRQ_GUEST, &desc->status)) triggering.

Signed-off-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agoDon't clear HCR_VM bit when updating VTTBR.
Jun Sun [Mon, 10 Oct 2016 19:27:56 +0000 (12:27 -0700)]
Don't clear HCR_VM bit when updating VTTBR.

Currently function p2m_restore_state() would clear HCR_VM bit, i.e.,
disabling stage2 translation, before updating VTTBR register. After
some research and talking to ARM support, I got confirmed that this is not
necessary. We are currently working on a new platform that would need this
to be removed.

The patch is tested on FVP foundation model.

Signed-off-by: Jun Sun <jsun@junsun.net>
Acked-by: Steve Capper <steve.capper@linaro.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agox86/emul: Correct the decoding of mov to/from cr/dr
Andrew Cooper [Tue, 14 Mar 2017 11:43:25 +0000 (12:43 +0100)]
x86/emul: Correct the decoding of mov to/from cr/dr

The mov to/from cr/dr behave as if they were encoded with Mod = 3.  When
encoded with Mod != 3, no displacement or SIB bytes are fetched.

Add a test with a deliberately malformed ModRM byte.  (Also add the
automatically-generated simd.h to .gitignore.)

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c2e316b2f220af06dab76b1219e61441c31f6ff9
master date: 2017-03-07 17:29:16 +0000

8 years agox86emul: correct decoding of vzero{all,upper}
Jan Beulich [Tue, 14 Mar 2017 11:42:58 +0000 (12:42 +0100)]
x86emul: correct decoding of vzero{all,upper}

These VEX encoded insns aren't followed by a ModR/M byte.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 26735f30dffe1091686bbe921aacbea8ba371cc8
master date: 2017-03-02 16:08:27 +0100

8 years agoxen: credit2: don't miss accounting while doing a credit reset.
Dario Faggioli [Tue, 14 Mar 2017 11:42:19 +0000 (12:42 +0100)]
xen: credit2: don't miss accounting while doing a credit reset.

A credit reset basically means going through all the
vCPUs of a runqueue and altering their credits, as a
consequence of a 'scheduling epoch' having come to an
end.

Blocked or runnable vCPUs are fine, all the credits
they've spent running so far have been accounted to
them when they were scheduled out.

But if a vCPU is running on a pCPU, when a reset event
occurs (on another pCPU), that does not get properly
accounted. Let's therefore begin to do so, for better
accuracy and fairness.

In fact, after this patch, we see this in a trace:

 csched2:schedule cpu 10, rq# 1, busy, not tickled
 csched2:burn_credits d1v5, credit = 9998353, delta = 202996
 runstate_continue d1v5 running->running
 ...
 csched2:schedule cpu 12, rq# 1, busy, not tickled
 csched2:burn_credits d1v6, credit = -1327, delta = 9999544
 csched2:reset_credits d0v13, credit_start = 10500000, credit_end = 10500000, mult = 1
 csched2:reset_credits d0v14, credit_start = 10500000, credit_end = 10500000, mult = 1
 csched2:reset_credits d0v7, credit_start = 10500000, credit_end = 10500000, mult = 1
 csched2:burn_credits d1v5, credit = 201805, delta = 9796548
 csched2:reset_credits d1v5, credit_start = 201805, credit_end = 10201805, mult = 1
 csched2:burn_credits d1v6, credit = -1327, delta = 0
 csched2:reset_credits d1v6, credit_start = -1327, credit_end = 9998673, mult = 1

Which shows how d1v5 actually executed for ~9.796 ms,
on pCPU 10, when reset_credit() is executed, on pCPU
12, because of d1v6's credits going below 0.

Without this patch, this 9.796ms are not accounted
to anyone. With this patch, d1v5 is charged for that,
and its credits drop down from 9796548 to 201805.

And this is important, as it means that it will
begin the new epoch with 10201805 credits, instead
of 10500000 (which he would have, before this patch).

Basically, we were forgetting one round of accounting
in epoch x, for the vCPUs that are running at the time
the epoch ends. And this meant favouring a little bit
these same vCPUs, in epoch x+1, providing them with
the chance of execute longer than their fair share.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4fa4f8a3cd5afd4980ad9517755d002dc316abdc
master date: 2017-03-01 16:56:34 +0000

8 years agoxen: credit2: always mark a tickled pCPU as... tickled!
Dario Faggioli [Tue, 14 Mar 2017 11:41:54 +0000 (12:41 +0100)]
xen: credit2: always mark a tickled pCPU as... tickled!

In fact, whether or not a pCPU has been tickled, and is
therefore about to re-schedule, is something we look at
and base decisions on in various places.

So, let's make sure that we do that basing on accurate
information.

While there, also tweak a little bit smt_idle_mask_clear()
(used for implementing SMT support), so that it only alter
the relevant cpumask when there is the actual need for this.
(This is only for reduced overhead, behavior remains the
same).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: a76645240bd14e964e85dbc975a8989edea6aa27
master date: 2017-03-01 16:56:34 +0000

8 years agox86/layout: Correct Xen's idea of its own memory layout
Andrew Cooper [Tue, 14 Mar 2017 11:41:21 +0000 (12:41 +0100)]
x86/layout: Correct Xen's idea of its own memory layout

c/s b4cd59fe "x86: reorder .data and .init when linking" had an unintended
side effect, where xen_in_range() and the tboot S3 MAC were no longer correct.

In practice, it means that Xen's .data section is excluded from consideration,
which means:
 1) Default IOMMU construction for the hardware domain could create mappings.
 2) .data isn't included in the tboot MAC checked on resume from S3.

Adjust the comments and virtual address anchors used to define the regions.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c9a4a1c419cebac83a8fb60c4532ad8ccc973dc4
master date: 2017-02-28 16:18:38 +0000

8 years agox86/vmx: Don't leak host syscall MSR state into HVM guests
Andrew Cooper [Tue, 14 Mar 2017 11:40:36 +0000 (12:40 +0100)]
x86/vmx: Don't leak host syscall MSR state into HVM guests

hvm_hw_cpu->msr_flags is in fact the VMX dirty bitmap of MSRs needing to be
restored when switching into guest context.  It should never have been part of
the migration state to start with, and Xen must not make any decisions based
on the value seen during restore.

Identify it as obsolete in the header files, consistently save it as zero and
ignore it on restore.

The MSRs must be considered dirty during VMCS creation to cause the proper
defaults of 0 to be visible to the guest.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 2f1add6e1c8789d979daaafa3d80ddc1bc375783
master date: 2017-02-21 11:06:39 +0000

8 years agoxen/arm: fix affected memory range by dcache clean functions
Stefano Stabellini [Fri, 3 Mar 2017 01:15:26 +0000 (17:15 -0800)]
xen/arm: fix affected memory range by dcache clean functions

clean_dcache_va_range and clean_and_invalidate_dcache_va_range don't
calculate the range correctly when "end" is not cacheline aligned. As a
result, the last cacheline is not skipped. Fix the issue by aligning the
start address to the cacheline size.

In addition, make the code simpler and faster in
invalidate_dcache_va_range, by removing the module operation and using
bitmasks instead. Also remove the size adjustments in
invalidate_dcache_va_range, because the size variable is not used later
on.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
8 years agoxen/arm: introduce vwfi parameter
Stefano Stabellini [Wed, 1 Mar 2017 19:43:15 +0000 (11:43 -0800)]
xen/arm: introduce vwfi parameter

Introduce new Xen command line parameter called "vwfi", which stands for
virtual wfi. The default is "trap": Xen traps guest wfi and wfe
instructions. In the case of wfi, Xen calls vcpu_block on the guest
vcpu; in the case of guest wfe, Xen calls vcpu_yield on the guest vcpu.
The behavior can be changed by setting vwfi to "native", in that case
Xen doesn't trap neither wfi nor wfe, running them in guest context.

The result is strong reduction in irq latency (from 5000ns to 2000ns,
measured using https://github.com/edgarigl/tbm, the physical timer, and
1 pcpu dedicated to 1 vcpu). The downside is that the scheduler thinks
that the guest is busy when actually is sleeping, leading to suboptimal
scheduling decisions.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoarm/p2m: remove the page from p2m->pages list before freeing it
Julien Grall [Fri, 24 Feb 2017 09:01:59 +0000 (10:01 +0100)]
arm/p2m: remove the page from p2m->pages list before freeing it

The p2m code is using the page list field to link all the pages used
for the stage-2 page tables. The page is added into the p2m->pages
list just after the allocation but never removed from the list.

The page list field is also used by the allocator, not removing may
result a later Xen crash due to inconsistency (see [1]).

This bug was introduced by the reworking of p2m code in commit 2ef3e36ec7
"xen/arm: p2m: Introduce p2m_set_entry and __p2m_set_entry".

[1] https://lists.xenproject.org/archives/html/xen-devel/2017-02/msg00524.html

Reported-by: Vijaya Kumar K <Vijaya.Kumar@cavium.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: cf5e1a74b9687be3d146e59ab10c26be6da9d0d4
master date: 2017-02-24 09:58:50 +0100

8 years agoQEMU_TAG update
Ian Jackson [Wed, 22 Feb 2017 16:26:41 +0000 (16:26 +0000)]
QEMU_TAG update

8 years agoVMX: fix VMCS race on context-switch paths
Jan Beulich [Mon, 20 Feb 2017 14:58:02 +0000 (15:58 +0100)]
VMX: fix VMCS race on context-switch paths

When __context_switch() is being bypassed during original context
switch handling, the vCPU "owning" the VMCS partially loses control of
it: It will appear non-running to remote CPUs, and hence their attempt
to pause the owning vCPU will have no effect on it (as it already
looks to be paused). At the same time the "owning" CPU will re-enable
interrupts eventually (the lastest when entering the idle loop) and
hence becomes subject to IPIs from other CPUs requesting access to the
VMCS. As a result, when __context_switch() finally gets run, the CPU
may no longer have the VMCS loaded, and hence any accesses to it would
fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().

For consistency use the new function also in vmx_do_resume(), to
avoid leaving an open-coded incarnation of it around.

Reported-by: Kevin Mayer <Kevin.Mayer@gdata.de>
Reported-by: Anshul Makkar <anshul.makkar@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
master commit: 2f4d2198a9b3ba94c959330b5c94fe95917c364c
master date: 2017-02-17 15:49:56 +0100

8 years agoxen/p2m: Fix p2m_flush_table for non-nested cases
George Dunlap [Mon, 20 Feb 2017 14:57:37 +0000 (15:57 +0100)]
xen/p2m: Fix p2m_flush_table for non-nested cases

Commit 71bb7304e7a7a35ea6df4b0cedebc35028e4c159 added flushing of
nested p2m tables whenever the host p2m table changed.  Unfortunately
in the process, it added a filter to p2m_flush_table() function so
that the p2m would only be flushed if it was being used as a nested
p2m.  This meant that the p2m was not being flushed at all for altp2m
callers.

Only check np2m_base if p2m_class for nested p2m's.

NB that this is not a security issue: The only time this codepath is
called is in cases where either nestedp2m or altp2m is enabled, and
neither of them are in security support.

Reported-by: Matt Leinhos <matt@starlab.io>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 6192e6378e094094906950120470a621d5b2977c
master date: 2017-02-15 17:15:56 +0000

8 years agox86/ept: allow write-combining on !mfn_valid() MMIO mappings again
David Woodhouse [Mon, 20 Feb 2017 14:56:48 +0000 (15:56 +0100)]
x86/ept: allow write-combining on !mfn_valid() MMIO mappings again

For some MMIO regions, such as those high above RAM, mfn_valid() will
return false.

Since the fix for XSA-154 in commit c61a6f74f80e ("x86: enforce
consistent cachability of MMIO mappings"), guests have no longer been
able to use PAT to obtain write-combining on such regions because the
'ignore PAT' bit is set in EPT.

We probably want to err on the side of caution and preserve that
behaviour for addresses in mmio_ro_ranges, but not for normal MMIO
mappings. That necessitates a slight refactoring to check mfn_valid()
later, and let the MMIO case get through to the right code path.

Since we're not bailing out for !mfn_valid() immediately, the range
checks need to be adjusted to cope \97 simply by masking in the low bits
to account for 'order' instead of adding, to avoid overflow when the mfn
is INVALID_MFN (which happens on unmap, since we carefully call this
function to fill in the EMT even though the PTE won't be valid).

The range checks are also slightly refactored to put only one of them in
the fast path in the common case. If it doesn't overlap, then it
*definitely* isn't contained, so we don't need both checks. And if it
overlaps and is only one page, then it definitely *is* contained.

Finally, add a comment clarifying how that 'return -1' works \97 it isn't
returning an error and causing the mapping to fail; it relies on
resolve_misconfig() being able to split the mapping later. So it's
*only* sane to do it where order>0 and the 'problem' will be solved by
splitting the large page. Not for blindly returning 'error', which I was
tempted to do in my first attempt.

Signed-off-by: David Woodhouse <dwmw@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 30921dc2df3665ca1b2593595aa6725ff013d386
master date: 2017-02-07 14:30:01 +0100

8 years agoIOMMU: always call teardown callback
Oleksandr Tyshchenko [Wed, 15 Feb 2017 12:20:48 +0000 (12:20 +0000)]
IOMMU: always call teardown callback

There is a possible scenario when (d)->need_iommu remains unset
during guest domain execution. For example, when no devices
were assigned to it. Taking into account that teardown callback
is not called when (d)->need_iommu is unset we might have unreleased
resourses after destroying domain.

So, always call teardown callback to roll back actions
that were performed in init callback.

This is XSA-207.

Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <julien.grall@arm.com>
8 years agox86/emulate: don't assume that addr_size == 32 implies protected mode
George Dunlap [Thu, 9 Feb 2017 09:25:58 +0000 (10:25 +0100)]
x86/emulate: don't assume that addr_size == 32 implies protected mode

Callers of x86_emulate() generally define addr_size based on the code
segment.  In vm86 mode, the code segment is set by the hardware to be
16-bits; but it is entirely possible to enable protected mode, set the
CS to 32-bits, and then disable protected mode.  (This is commonly
called "unreal mode".)

But the instruction decoder only checks for protected mode when
addr_size == 16.  So in unreal mode, hardware will throw a #UD for VEX
prefixes, but our instruction decoder will decode them, triggering an
ASSERT() further on in _get_fpu().  (With debug=n the emulator will
incorrectly emulate the instruction rather than throwing a #UD, but
this is only a bug, not a crash, so it's not a security issue.)

Teach the instruction decoder to check that we're in protected mode,
even if addr_size is 32.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Split real mode and VM86 mode handling, as VM86 mode is strictly 16-bit
at all times. Re-base.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 05118b1596ffe4559549edbb28bd0124a7316123
master date: 2017-01-25 15:09:55 +0100

8 years agoxen: credit2: fix shutdown/suspend when playing with cpupools.
Dario Faggioli [Thu, 9 Feb 2017 09:25:33 +0000 (10:25 +0100)]
xen: credit2: fix shutdown/suspend when playing with cpupools.

In fact, during shutdown/suspend, we temporarily move all
the vCPUs to the BSP (i.e., pCPU 0, as of now). For Credit2
domains, we call csched2_vcpu_migrate(), expects to find the
target pCPU in the domain's pool

Therefore, if Credit2 is the default scheduler and we have
removed pCPU 0 from cpupool0, shutdown/suspend fails like
this:

 RIP:    e008:[<ffff82d08012906d>] sched_credit2.c#migrate+0x274/0x2d1
 Xen call trace:
    [<ffff82d08012906d>] sched_credit2.c#migrate+0x274/0x2d1
    [<ffff82d080129138>] sched_credit2.c#csched2_vcpu_migrate+0x6e/0x86
    [<ffff82d08012c468>] schedule.c#vcpu_move_locked+0x69/0x6f
    [<ffff82d08012ec14>] cpu_disable_scheduler+0x3d7/0x430
    [<ffff82d08019669b>] __cpu_disable+0x299/0x2b0
    [<ffff82d0801012f8>] cpu.c#take_cpu_down+0x2f/0x38
    [<ffff82d0801312d8>] stop_machine.c#stopmachine_action+0x7f/0x8d
    [<ffff82d0801330b8>] tasklet.c#do_tasklet_work+0x74/0xab
    [<ffff82d0801333ed>] do_tasklet+0x66/0x8b
    [<ffff82d080166a73>] domain.c#idle_loop+0x3b/0x5e

 ****************************************
 Panic on CPU 8:
 Assertion 'svc->vcpu->processor < nr_cpu_ids' failed at sched_credit2.c:1729
 ****************************************

On the other hand, if Credit2 is the scheduler of another
pool, when trying (still during shutdown/suspend) to move
the vCPUs of the Credit2 domains to pCPU 0, it figures
out that pCPU 0 is not a Credit2 pCPU, and fails like this:

 RIP:    e008:[<ffff82d08012916b>] sched_credit2.c#csched2_vcpu_migrate+0xa1/0x107
 Xen call trace:
    [<ffff82d08012916b>] sched_credit2.c#csched2_vcpu_migrate+0xa1/0x107
    [<ffff82d08012c4e9>] schedule.c#vcpu_move_locked+0x69/0x6f
    [<ffff82d08012edfc>] cpu_disable_scheduler+0x3d7/0x430
    [<ffff82d08019687b>] __cpu_disable+0x299/0x2b0
    [<ffff82d0801012f8>] cpu.c#take_cpu_down+0x2f/0x38
    [<ffff82d0801314c0>] stop_machine.c#stopmachine_action+0x7f/0x8d
    [<ffff82d0801332a0>] tasklet.c#do_tasklet_work+0x74/0xab
    [<ffff82d0801335d5>] do_tasklet+0x66/0x8b
    [<ffff82d080166c53>] domain.c#idle_loop+0x3b/0x5e

The solution is to recognise the specific situation, inside
csched2_vcpu_migrate() and, considering it is something temporary,
which only happens during shutdown/suspend, quickly deal with it.

Then, in the resume path, in restore_vcpu_affinity(), things
are set back to normal, and a new v->processor is chosen, for
each vCPU, from the proper set of pCPUs (i.e., the ones of
the proper cpupool).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
xen: credit2: non Credit2 pCPUs are ok during shutdown/suspend.

Commit 7478ebe1602e6 ("xen: credit2: fix shutdown/suspend
when playing with cpupools"), while doing the right thing
for actual code, forgot to update the ASSERT()s accordingly,
in csched2_vcpu_migrate().

In fact, as stated there already, during shutdown/suspend,
we must allow a Credit2 vCPU to temporarily migrate to a
non Credit2 BSP, without any ASSERT() triggering.

Move them down, after the check for whether or not we are
shutting down, where the assumption that the pCPU must be
valid Credit2 ones, is valid.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 7478ebe1602e6bb8242a18840b15757a1d5ad18a
master date: 2017-01-24 17:02:07 +0000
master commit: ad5808d9057248e7879cf375662f0a449fff7005
master date: 2017-02-01 14:44:51 +0000

8 years agoxen: credit2: never consider CPUs outside of our cpupool.
Dario Faggioli [Thu, 9 Feb 2017 09:24:56 +0000 (10:24 +0100)]
xen: credit2: never consider CPUs outside of our cpupool.

In fact, relying on the mask of what pCPUs belong to
which Credit2 runqueue is not enough. If we only do that,
when Credit2 is the boot scheduler, we may ASSERT() or
panic when moving a pCPU from Pool-0 to another cpupool.

This is because pCPUs outside of any pool are considered
part of cpupool0. This puts us at risk of crash when those
same pCPUs are added to another pool and something
different than the idle domain is found to be running
on them.

Note that, even if we prevent the above to happen (which
is the purpose of this patch), this is still pretty bad,
in fact, when we remove a pCPU from Pool-0:
- in Credit1, as we do *not* update prv->ncpus and
  prv->credit, which means we're considering the wrong
  total credits when doing accounting;
- in Credit2, the pCPU remains part of one runqueue,
  and is hence at least considered during load balancing,
  even if no vCPU should really run there.

In Credit1, this "only" causes skewed accounting and
no crashes because there is a lot of `cpumask_and`ing
going on with the cpumask of the domains' cpupool
(which, BTW, comes at a price).

A quick and not to involved (and easily backportable)
solution for Credit2, is to do exactly the same.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: e7191920261d20e52ca4c06a03589a1155981b04
master date: 2017-01-24 17:02:07 +0000

8 years agoxen: credit2: use the correct scratch cpumask.
Dario Faggioli [Thu, 9 Feb 2017 09:24:32 +0000 (10:24 +0100)]
xen: credit2: use the correct scratch cpumask.

In fact, there is one scratch mask per each CPU. When
you use the one of a CPU, it must be true that:
 - the CPU belongs to your cpupool and scheduler,
 - you own the runqueue lock (the one you take via
   {v,p}cpu_schedule_lock()) for that CPU.

This was not the case within the following functions:

get_fallback_cpu(), csched2_cpu_pick(): as we can't be
sure we either are on, or hold the lock for, the CPU
that is in the vCPU's 'v->processor'.

migrate(): it's ok, when called from balance_load(),
because that comes from csched2_schedule(), which takes
the runqueue lock of the CPU where it executes. But it is
not ok when we come from csched2_vcpu_migrate(), which
can be called from other places.

The fix is to explicitly use the scratch space of the
CPUs for which we know we hold the runqueue lock.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reported-by: Jan Beulich <JBeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 548db8742872399936a2090cbcdfd5e1b34fcbcc
master date: 2017-01-24 17:02:07 +0000

8 years agox86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed
Joao Martins [Thu, 9 Feb 2017 09:23:52 +0000 (10:23 +0100)]
x86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed

Commit 6e03363 ("x86: Implement TSC adjust feature for HVM guest")
implemented TSC_ADJUST MSR for hvm guests. Though while booting
an HVM guest the boot CPU would have a value set with delta_tsc -
guest tsc while secondary CPUS would have 0. For example one can
observe:
 $ xen-hvmctx 17 | grep tsc_adjust
 TSC_ADJUST: tsc_adjust ff9377dfef47fe66
 TSC_ADJUST: tsc_adjust 0
 TSC_ADJUST: tsc_adjust 0
 TSC_ADJUST: tsc_adjust 0

Upcoming Linux 4.10 now validates whether this MSR is correct and
adjusts them accordingly under the following conditions: values of < 0
(our case for CPU 0) or != 0 or values > 7FFFFFFF. In this conditions it
will force set to 0 and for the CPUs that the value doesn't match all
together. If this msr is not correct we would see messages such as:

[Firmware Bug]: TSC ADJUST: CPU0: -30517044286984129 force to 0

And on HVM guests supporting TSC_ADJUST (requiring at least Haswell
Intel) it won't boot.

Our current vCPU 0 values are incorrect and according to Intel SDM which on
section "Time-Stamp Counter Adjustment" states that "On RESET, the value
of the IA32_TSC_ADJUST MSR is 0." hence we should set it 0 and be
consistent across multiple vCPUs. Perhaps this MSR should be only
changed by the guest which already happens through
hvm_set_guest_tsc_adjust(..) routines (see below). After this patch
guests running Linux 4.10 will see a valid IA32_TSC_ADJUST msr of value
 0 for all CPUs and are able to boot.

On the same section of the spec ("Time-Stamp Counter Adjustment") it is
also stated:
"If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR
 adds (or subtracts) value X from the TSC, the logical processor also
 adds (or subtracts) value X from the IA32_TSC_ADJUST MSR.

 Unlike the TSC, the value of the IA32_TSC_ADJUST MSR changes only in
 response to WRMSR (either to the MSR itself, or to the
 IA32_TIME_STAMP_COUNTER MSR). Its value does not otherwise change as
 time elapses. Software seeking to adjust the TSC can do so by using
 WRMSR to write the same value to the IA32_TSC_ADJUST MSR on each logical
 processor."

This suggests these MSRs values should only be changed through guest i.e.
throught write intercept msrs. We keep IA32_TSC MSR logic such that writes
accomodate adjustments to TSC_ADJUST, hence no functional change in the
msr_tsc_adjust for IA32_TSC msr. Though, we do that in a separate routine
namely hvm_set_guest_tsc_msr instead of through hvm_set_guest_tsc(...).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 98297f09bd07bb63407909aae1d309d8adeb572e
master date: 2017-01-24 12:37:36 +0100

8 years agox86emul: correct FPU stub asm() constraints
Jan Beulich [Thu, 9 Feb 2017 09:23:22 +0000 (10:23 +0100)]
x86emul: correct FPU stub asm() constraints

Properly inform the compiler about fic's role as both an input (its
insn_bytes field) and output (its exn_raised field).

Take the opportunity and bring emulate_fpu_insn_stub() more in line
with emulate_fpu_insn_stub_eflags().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3dfbb8df335f12297cfc7db9d3df2b74c474921b
master date: 2017-01-24 12:35:59 +0100

8 years agox86: segment attribute handling adjustments
Jan Beulich [Thu, 9 Feb 2017 09:22:55 +0000 (10:22 +0100)]
x86: segment attribute handling adjustments

Null selector loads into SS (possible in 64-bit mode only, and only in
rings other than ring 3) must not alter SS.DPL. (This was found to be
an issue on KVM, and fixed in Linux commit 33ab91103b.)

Further arch_set_info_hvm_guest() didn't make sure that the ASSERT()s
in hvm_set_segment_register() wouldn't trigger: Add further checks, but
tolerate (adjust) clear accessed (CS, SS, DS, ES) and busy (TR) bits.

Finally the setting of the accessed bits for user segments was lost by
commit dd5c85e312 ("x86/hvm: Reposition the modification of raw segment
data from the VMCB/VMCS"), yet VMX requires them to be set for usable
segments. Add respective ASSERT()s (the only path not properly setting
them was arch_set_info_hvm_guest()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 366ff5f1b3252f9069d5aedb2ffc2567bb0a37c9
master date: 2017-01-20 14:39:12 +0100

8 years agox86emul: LOCK check adjustments
Jan Beulich [Thu, 9 Feb 2017 09:22:28 +0000 (10:22 +0100)]
x86emul: LOCK check adjustments

BT, being encoded as DstBitBase just like BT{C,R,S}, nevertheless does
not write its (register or memory) operand and hence also doesn't allow
a LOCK prefix to be used.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f2d4f4ba80de8a03a1b0f300d271715a88a8433d
master date: 2017-01-20 14:37:33 +0100

8 years agox86emul: VEX.B is ignored in compatibility mode
Jan Beulich [Thu, 9 Feb 2017 09:21:50 +0000 (10:21 +0100)]
x86emul: VEX.B is ignored in compatibility mode

While VEX.R and VEX.X are guaranteed to be 1 in compatibility mode
(and hence a respective mode_64bit() check can be dropped), VEX.B can
be encoded as zero, but would be ignored by the processor. Since we
emulate instructions in 64-bit mode (except possibly in the test
harness), we need to force the bit to 1 in order to not act on the
wrong {X,Y,Z}MM register (which has no bad effect on 32-bit test
harness builds, as there the bit would again be ignored by the
hardware, and would by default be expected to be 1 anyway).

We must not, however, fiddle with the high bit of VEX.VVVV in the
decode phase, as that would undermine the checking of instructions
requiring the field to be all ones independent of mode. This is
being enforced in copy_REX_VEX() instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86emul: correct VEX/XOP/EVEX operand size handling for 16-bit code

Operand size defaults to 32 bits in that case, but would not have been
set that way in the absence of an operand size override.

Reported-by: Wei Liu <wei.liu2@citrix.com> (by AFL fuzzing)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 89c76ee7f60777b81c8fd0475a6af7c84e72a791
master date: 2017-01-17 10:32:25 +0100
master commit: beb82042447c5d6e7073d816d6afc25c5a423cde
master date: 2017-01-25 15:08:59 +0100

8 years agox86/xstate: Fix array overrun on hardware with LWP
Andrew Cooper [Thu, 9 Feb 2017 09:20:45 +0000 (10:20 +0100)]
x86/xstate: Fix array overrun on hardware with LWP

c/s da62246e4c "x86/xsaves: enable xsaves/xrstors/xsavec in xen" introduced
setup_xstate_features() to allocate and fill xstate_offsets[] and
xstate_sizes[].

However, fls() casts xfeature_mask to 32bits which truncates LWP out of the
calculation.  As a result, the arrays are allocated too short, and the cpuid
infrastructure reads off the end of them when calculating xstate_size for the
guest.

On one test system, this results in 0x3fec83c0 being returned as the maximum
size of an xsave area, which surprisingly appears not to bother Windows or
Linux too much.  I suspect they both use current size based on xcr0, which Xen
forwards from real hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe0d67576e335c02becf1cea8e67005509fa90b6
master date: 2017-01-16 17:37:26 +0000