xenbits.xensource.com Git - xen.git/log
8 years ago x86emul: correct PUSHF/POPF
Jan Beulich [Wed, 18 Jan 2017 09:23:10 +0000 (10:23 +0100)]
x86emul: correct PUSHF/POPF

Both need to raise #GP(0) when in VM86 mode with IOPL < 3.

Additionally PUSHF is documented to clear VM and RF from the value
placed onto the stack.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e5c1b8145bccb7fc587ee5b0c95ace6c5e0c7ffd
master date: 2016-12-07 13:55:42 +0100
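
The two rules in this commit message can be modelled in a few lines. This is a minimal Python sketch, not Xen's C code: the function and exception names are hypothetical, and only the EFLAGS bit positions are architectural.

```python
# Architectural EFLAGS bit positions.
EFLAGS_IOPL_SHIFT = 12
EFLAGS_IOPL_MASK = 0x3 << EFLAGS_IOPL_SHIFT
EFLAGS_RF = 1 << 16
EFLAGS_VM = 1 << 17


class GeneralProtectionFault(Exception):
    """Hypothetical stand-in for raising #GP with an error code."""

    def __init__(self, error_code):
        super().__init__(f"#GP({error_code})")
        self.error_code = error_code


def check_vm86_iopl(eflags):
    """PUSHF and POPF both fault in VM86 mode when IOPL < 3."""
    iopl = (eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT
    if (eflags & EFLAGS_VM) and iopl < 3:
        raise GeneralProtectionFault(0)


def pushf_value(eflags):
    """Value PUSHF places on the stack: VM and RF are cleared."""
    check_vm86_iopl(eflags)
    return eflags & ~(EFLAGS_VM | EFLAGS_RF)
```

For example, `pushf_value(0x10202)` drops RF and yields `0x202`, while any value with VM set and IOPL below 3 raises the modelled #GP(0).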

8 years ago libelf: section index 0 is special
Jan Beulich [Wed, 18 Jan 2017 09:22:43 +0000 (10:22 +0100)]
libelf: section index 0 is special

When iterating over sections, table entry zero needs to be ignored.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 41fe9cabf29ea15c1f8edee49326dfde705013d3
master date: 2016-12-07 13:52:35 +0100

8 years ago x86emul: CMOVcc always writes its destination
Jan Beulich [Wed, 18 Jan 2017 09:22:14 +0000 (10:22 +0100)]
x86emul: CMOVcc always writes its destination

This would be benign if there wasn't the zero-extending side effect of
32-bit operations in 64-bit mode.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cc53a74291ea5dd5b2c9a327dc386c0e5f859237
master date: 2016-11-25 14:31:50 +0100
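
Why the unconditional write matters can be seen in a small model of a 64-bit register. This is an illustrative Python sketch with a hypothetical `cmov()` helper, not the emulator's code: a 32-bit CMOVcc whose condition is false still writes the destination, so the upper 32 bits are zeroed.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def cmov(dst, src, cond, op_bytes):
    """Model CMOVcc on a 64-bit register in 64-bit mode.

    The destination is always written; the condition only selects
    which value (old destination or source) gets written.
    """
    value = src if cond else dst
    if op_bytes == 8:
        return value & MASK64
    if op_bytes == 4:
        # 32-bit writes zero-extend into bits 63:32, even if cond is false.
        return value & MASK32
    if op_bytes == 2:
        # 16-bit writes leave bits 63:16 untouched.
        return (dst & ~0xFFFF) | (value & 0xFFFF)
    raise ValueError("unsupported operand size")
```

With `dst = 0x1122334455667788`, a 32-bit CMOVcc with a false condition leaves `0x55667788` in the register rather than the original 64-bit value.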

8 years ago x86/vmx: Don't deliver #MC with an error code
Andrew Cooper [Wed, 18 Jan 2017 09:21:34 +0000 (10:21 +0100)]
x86/vmx: Don't deliver #MC with an error code

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 892d191df60806ea63f9d3c4fc37615bee028812
master date: 2016-11-25 10:48:20 +0000

8 years ago x86/emul: Don't deliver #UD with an error code
Andrew Cooper [Wed, 18 Jan 2017 09:20:58 +0000 (10:20 +0100)]
x86/emul: Don't deliver #UD with an error code

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9586cba3383021bb4bd57f3fa33e87cc64b4c74a
master date: 2016-11-25 10:48:10 +0000

8 years ago x86/SVM: don't deliver #GP without error code
Jan Beulich [Wed, 18 Jan 2017 09:19:32 +0000 (10:19 +0100)]
x86/SVM: don't deliver #GP without error code

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 647c7bd4453c224d9ccdfdb37491544f797fdc48
master date: 2016-11-25 09:46:32 +0100
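
The three error-code fixes above (#MC, #UD, #GP) share one invariant: an exception is injected with an error code exactly when its vector architecturally carries one. A minimal Python sketch of that check follows; the helper name is hypothetical, and the vector set covers only the long-standing exceptions (#DF, #TS, #NP, #SS, #GP, #PF, #AC), with newer vectors omitted.

```python
TRAP_UD, TRAP_GP, TRAP_MC = 6, 13, 18

# Vectors that push an error code: #DF, #TS, #NP, #SS, #GP, #PF, #AC.
VECTORS_WITH_ERROR_CODE = frozenset({8, 10, 11, 12, 13, 14, 17})


def injection_is_consistent(vector, error_code=None):
    """True if an error code is supplied exactly when the vector needs one."""
    return (vector in VECTORS_WITH_ERROR_CODE) == (error_code is not None)
```

Under this model, `injection_is_consistent(TRAP_UD, 0)` and `injection_is_consistent(TRAP_GP)` are both false, which are the mismatches these commits remove.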

8 years ago x86/EFI: meet further spec requirements for runtime calls
Jan Beulich [Wed, 18 Jan 2017 09:19:03 +0000 (10:19 +0100)]
x86/EFI: meet further spec requirements for runtime calls

So far we didn't guarantee 16-byte alignment of the stack: While (so
far) we don't tell the compiler to use smaller alignment, we also don't
guarantee 16-byte alignment when establishing stack pointers for new
vCPU-s. Runtime service functions using SSE instructions may end with
#GP(0) without that.

Note that making use of -mpreferred-stack-boundary=3, as mentioned in
the comment, wouldn't help to reduce the needed alignment: The compiler
would then be free to align the stack of the function with the aligned
object, but would be permitted to place an odd number of 8-byte objects
there, so the callee would still run on an unaligned stack.

(The only working alternative to the approach chosen here would be to
use -mincoming-stack-boundary=3, but that would affect all functions in
runtime.c, not just the ones actually making runtime services calls.
And it would still require the manual alignment logic here to be used
with gcc 5.2 and earlier - not permitting that command line option -,
just that then the alignment amount would become conditional.)

Hence enforce the needed alignment by making efi_rs_enter() return a
suitably aligned structure, which the caller then necessarily has to
store in a suitably aligned local variable, the address of which then
gets passed to efi_rs_leave(). Also (to limit exposure) move the
function declarations to where they belong: They're local to runtime.c,
and shared only with compat.c (by the latter including the former).

Furthermore, we should avoid #MF being raised by the FLDCW we execute.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f6b7fedc896250028cb81dafe9a3f6773aaf1da2
master date: 2016-11-22 13:52:53 +0100

8 years ago x86/svm: Fix svm_nextrip_insn_length() when crossing the virtual boundary to 0
Andrew Cooper [Wed, 18 Jan 2017 09:18:10 +0000 (10:18 +0100)]
x86/svm: Fix svm_nextrip_insn_length() when crossing the virtual boundary to 0

vmcb->nextrip can legitimately be less than vmcb->rip when execution wraps
back around to 0.  Instead, complain if the reported length is greater than 15
and use x86_decode_insn() as a fallback.

While making changes here, fix two whitespace issues with the case labels.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

x86/hvm: Fix non-debug build following c/s 0745f665a5

The variable is named inst_len, not insn_len.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 0745f665a575bdb6724f6ec1ab767cd71ba8c253
master date: 2016-11-21 14:01:45 +0000
master commit: f678e2c78110e73431217306bbd33c736802d700
master date: 2016-11-21 17:17:51 +0000
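
The wrap-around case can be illustrated with modular arithmetic. This is a hedged Python sketch of the idea only: the function name is hypothetical, and returning `None` stands in for the complaint plus the fallback to instruction decoding mentioned in the message.

```python
MASK64 = (1 << 64) - 1
MAX_INSN_LEN = 15  # architectural upper bound on x86 instruction length


def nextrip_insn_length(rip, nextrip):
    """Instruction length derived from NextRIP, or None if implausible.

    The subtraction is done modulo 2**64, so nextrip < rip is fine
    when execution wraps from the top of the address space back to 0.
    """
    length = (nextrip - rip) & MASK64
    if length > MAX_INSN_LEN:
        return None  # caller would fall back to decoding the instruction
    return length
```

A 3-byte instruction starting at `0xFFFFFFFFFFFFFFFE` ends at `0x1`, and the model reports a length of 3 instead of rejecting it because `nextrip < rip`.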

8 years ago x86/traps: Don't call hvm_hypervisor_cpuid_leaf() for PV guests
Andrew Cooper [Wed, 18 Jan 2017 09:17:43 +0000 (10:17 +0100)]
x86/traps: Don't call hvm_hypervisor_cpuid_leaf() for PV guests

Luckily, hvm_hypervisor_cpuid_leaf() and vmx_hypervisor_cpuid_leaf() are safe
to execute in the context of a PV guest, but HVM-specific feature flags
shouldn't be visible to PV guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0f43883193da76fc928e836e319c3172f394e0f3
master date: 2016-11-16 10:33:18 +0000

8 years ago x86/vmx: Correct the long mode check in vmx_cpuid_intercept()
Andrew Cooper [Wed, 18 Jan 2017 09:17:21 +0000 (10:17 +0100)]
x86/vmx: Correct the long mode check in vmx_cpuid_intercept()

%cs.L may be set in a legacy mode segment, or clear in a compatibility mode
segment; it is not the correct way to check for long mode being active.

Both of these situations result in incorrect visibility of the SYSCALL feature
in CPUID, and by extension, incorrect behaviour in hvm_efer_valid().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fcb618c025f9251d7e22138f6528595037252c21
master date: 2016-11-16 10:32:54 +0000

8 years ago x86/svm: Don't clobber eax and edx if an RDMSR intercept fails
Andrew Cooper [Wed, 18 Jan 2017 09:16:52 +0000 (10:16 +0100)]
x86/svm: Don't clobber eax and edx if an RDMSR intercept fails

The original code has a bug; eax and edx get unconditionally updated even when
hvm_msr_read_intercept() doesn't return X86EMUL_OKAY.

It is only by blind luck (vmce_rdmsr() eagerly initialising its msr_content
pointer) that this isn't an information leak into guests.

While fixing this bug, reduce the scope of msr_content and initialise it to 0.
This makes it obvious that a stack leak won't occur, even if there were to be
a buggy codepath in hvm_msr_read_intercept().

Also make some non-functional improvements.  Make the insn_len calculation
common, and reduce the quantity of explicit casting by making better use of
the existing register names.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: a0b4e3c0681a11b765fe218fba0ba4ebb9fa56c5
master date: 2016-11-10 15:34:42 +0000

8 years ago x86emul: {L,S}{G,I}DT ignore operand size overrides in 64-bit mode
Jan Beulich [Wed, 18 Jan 2017 09:16:20 +0000 (10:16 +0100)]
x86emul: {L,S}{G,I}DT ignore operand size overrides in 64-bit mode

This affects not only the layout of the data (always 2+8 bytes), but
also the contents (no truncation to 24 bits occurs).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4ccb2adb96042e0d1e334c01fe260b32e6001db9
master date: 2016-11-03 17:23:22 +0100
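
The layout difference can be sketched for the load side (LGDT/LIDT). This Python model is an assumption-laden illustration rather than the emulator's code: `parse_dt_operand()` is a hypothetical helper, and the non-64-bit branch assumes the usual 2+4 byte layout with the base truncated to 24 bits for a 16-bit operand size.

```python
import struct


def parse_dt_operand(mem, mode_64bit, op_bytes):
    """Decode the memory operand of LGDT/LIDT into (limit, base)."""
    if mode_64bit:
        # Always 2 + 8 bytes; the operand size is ignored, no truncation.
        limit, base = struct.unpack_from("<HQ", mem)
        return limit, base
    # Outside 64-bit mode: 2 + 4 bytes.
    limit, base = struct.unpack_from("<HI", mem)
    if op_bytes == 2:
        base &= 0x00FF_FFFF  # 16-bit operand size keeps only 24 base bits
    return limit, base
```

Feeding the same 10-byte buffer through both branches shows that a 16-bit operand size changes the result outside 64-bit mode but has no effect inside it.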

8 years ago x86/emul: Reject LGDT/LIDT attempts with non-canonical base addresses
Andrew Cooper [Wed, 18 Jan 2017 09:15:56 +0000 (10:15 +0100)]
x86/emul: Reject LGDT/LIDT attempts with non-canonical base addresses

No sane OS would deliberately try this, but make Xen's emulation match real
hardware by delivering #GP(0), rather than suffering a VMEntry failure.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
master commit: 12bc22f79117dfae5e59382cdda6b8b6b70a7554
master date: 2016-11-03 12:23:23 +0000
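
A canonical-address test is short enough to show directly. The following Python sketch assumes 48 implemented virtual address bits (the `va_bits` parameter and the function name are illustrative, not taken from the Xen source): an address is canonical when bits 63 down to 47 are all equal.

```python
def is_canonical(addr, va_bits=48):
    """True if bits 63..(va_bits - 1) of addr are all copies of one bit."""
    upper = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones
```

In this model the emulator would deliver #GP(0) for an LGDT/LIDT base such as `0x0000800000000000`, for which `is_canonical()` returns false.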

8 years ago x86/emul: Correct the decoding of SReg3 operands
Andrew Cooper [Wed, 18 Jan 2017 09:15:22 +0000 (10:15 +0100)]
x86/emul: Correct the decoding of SReg3 operands

REX.R is ignored when considering segment register operands, and needs masking
out first.

While fixing this, reorder the user segments in x86_segment to match SReg3
encoding.  This avoids needing a translation table between hardware ordering
and Xen's ordering.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

VMX: fix realmode emulation SReg handling

Commit 0888d36bb2 ("x86/emul: Correct the decoding of SReg3 operands")
overlooked three places where x86_seg_cs was assumed to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0888d36bb23f7365ce12b03127fd0fb2661ec90e
master date: 2016-10-26 14:04:12 +0100
master commit: a62511bf14971ff581212decbbf57fc11b967840
master date: 2016-10-31 08:57:47 +0100
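
The decoding rule can be sketched as follows. This is a simplified Python model with a hypothetical `decode_sreg()` helper: a general decoder extends ModRM.reg with REX.R, and for segment register operands that extra bit has to be masked off again. Listing the names in SReg3 order is what makes a translation table unnecessary.

```python
# Segment registers in SReg3 encoding order (values 6 and 7 are reserved).
SREG3 = ("es", "cs", "ss", "ds", "fs", "gs")


def decode_sreg(modrm, rex_r=False):
    """Decode the segment register named by ModRM.reg, ignoring REX.R."""
    reg = ((modrm >> 3) & 7) | (8 if rex_r else 0)  # generic 4-bit reg field
    reg &= 7  # REX.R is ignored for segment register operands
    if reg >= len(SREG3):
        raise ValueError("reserved SReg3 encoding")
    return SREG3[reg]
```

With ModRM byte `0xD8` the result is `ds` whether or not REX.R is set; without the masking step, REX.R would turn index 3 into index 11.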

8 years ago x86/HVM: add missing NULL check before using VMFUNC hook
Jan Beulich [Wed, 21 Dec 2016 16:45:42 +0000 (17:45 +0100)]
x86/HVM: add missing NULL check before using VMFUNC hook

This is CVE-2016-10025 / XSA-203.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 74dcd0ce6f4fadce8093e54f0fc1a45426577e13
master date: 2016-12-21 16:47:19 +0100

8 years ago x86: force EFLAGS.IF on when exiting to PV guests
Jan Beulich [Wed, 21 Dec 2016 16:45:05 +0000 (17:45 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests

Guest kernels modifying instructions that are in the process of being
emulated for another of their vCPU-s may cause EFLAGS.IF to be cleared
upon the next exit to guest context, by converting the instruction being
emulated to CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.

This is CVE-2016-10024 / XSA-202.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100
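
The flag adjustment described here amounts to one OR and one AND. A minimal Python sketch, using the architectural EFLAGS bit positions and a hypothetical function name (the real change lives in Xen's exit-to-guest path, which is not reproduced here):

```python
X86_EFLAGS_IF = 1 << 9
X86_EFLAGS_IOPL = 3 << 12
X86_EFLAGS_NT = 1 << 14
X86_EFLAGS_VM = 1 << 17


def sanitise_pv_exit_eflags(eflags):
    """Force IF on and IOPL, NT and VM off before returning to a PV guest."""
    eflags |= X86_EFLAGS_IF
    eflags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_NT | X86_EFLAGS_VM)
    return eflags
```

Whatever the emulated instruction did to the flags, the value handed back to the guest has IF set, so a racing CLI can no longer leave the guest running with interrupts disabled.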

8 years ago x86/emul: Correct the handling of eflags with SYSCALL
Andrew Cooper [Sun, 18 Dec 2016 15:42:59 +0000 (15:42 +0000)]
x86/emul: Correct the handling of eflags with SYSCALL

A singlestep #DB is determined by the resulting eflags value from the
execution of SYSCALL, not the original eflags value.

By using the original eflags value, we negate the guest kernel's attempt to
protect itself from a privilege escalation by masking TF.

Introduce a tf boolean and have the SYSCALL emulation recalculate it
after the instruction is complete.

This is XSA-204.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
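
The ordering issue can be shown with a small model of long-mode SYSCALL flag handling. This Python sketch is illustrative only: the function name is hypothetical and `sfmask` stands for the guest's SYSCALL flag mask MSR. The point is that the single-step decision is taken from the flags after masking.

```python
X86_EFLAGS_TF = 1 << 8


def emulate_syscall_flags(eflags, sfmask):
    """Return (new_eflags, saved_r11, singlestep) for a long-mode SYSCALL."""
    saved_r11 = eflags            # original flags are saved in %r11
    new_eflags = eflags & ~sfmask  # the kernel's mask is applied
    # Recalculate TF from the resulting flags, not the original ones.
    singlestep = bool(new_eflags & X86_EFLAGS_TF)
    return new_eflags, saved_r11, singlestep
```

If the guest kernel masks TF, `emulate_syscall_flags(0x302, 0x100)` reports no single-step trap; deriving it from the original `0x302` would have injected #DB at the kernel entry point.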

8 years ago x86emul: CMPXCHG8B ignores operand size prefix
Jan Beulich [Tue, 13 Dec 2016 13:27:24 +0000 (14:27 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix

Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).

This is CVE-2016-9932 / XSA-200.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

8 years ago missing vgic_unlock_rank in gic_remove_irq_from_guest
Stefano Stabellini [Fri, 9 Dec 2016 00:59:28 +0000 (16:59 -0800)]
missing vgic_unlock_rank in gic_remove_irq_from_guest

Add missing vgic_unlock_rank on the error path in
gic_remove_irq_from_guest.

Coverity-ID: 1381843

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>

8 years ago QEMU_TAG update
Ian Jackson [Wed, 7 Dec 2016 16:54:20 +0000 (16:54 +0000)]
QEMU_TAG update

8 years ago arm64: fix incorrect memory region size in TCR_EL2
Shanker Donthineni [Thu, 17 Mar 2016 12:46:58 +0000 (13:46 +0100)]
arm64: fix incorrect memory region size in TCR_EL2

The maximum and minimum values for TxSZ depend on level of
translation as per AArch64 Virtual Memory System Architecture.
According to ARM specification DDI0487A_h (sec D4.2.2, page 1752),
the minimum TxSZ value is 16. If TxSZ is programmed to a value
smaller than 16 then it is IMPLEMENTATION DEFINED.

This patch sets T0SZ to (64-48) = 16, rather than zero, since Xen uses
all 4 levels to cover a 48-bit (256TB) virtual address space.

Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
Acked-by: Julien Grall <julien.grall@arm.com>
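
The arithmetic behind the new value is easy to check. A small Python sketch with hypothetical helper names; the only facts used are those stated in the message (the region covers 2^(64 - T0SZ) bytes, and values below 16 are not architecturally defined):

```python
MIN_TXSZ = 16  # smaller values are IMPLEMENTATION DEFINED per the message


def t0sz_for(va_bits):
    """T0SZ value describing a virtual address space of va_bits bits."""
    t0sz = 64 - va_bits
    if t0sz < MIN_TXSZ:
        raise ValueError("T0SZ below the architectural minimum")
    return t0sz


def region_size(t0sz):
    """Size in bytes of the region described by a T0SZ value."""
    return 1 << (64 - t0sz)
```

`t0sz_for(48)` gives 16, and `region_size(16)` is 2^48 bytes, i.e. the 256TB quoted above; the previous value of zero would correspond to a 64-bit region and is rejected by the model.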

8 years ago QEMU_TAG update
Ian Jackson [Tue, 29 Nov 2016 18:38:11 +0000 (18:38 +0000)]
QEMU_TAG update

8 years ago arm32: handle async aborts delivered while at HYP
Wei Chen [Tue, 29 Nov 2016 15:12:39 +0000 (16:12 +0100)]
arm32: handle async aborts delivered while at HYP

If a guest generates an asynchronous abort and then traps into HYP
(by HVC or IRQ) before the abort has been delivered, the hypervisor
cannot catch it, because the PSTATE.A bit is masked all the time
in the hypervisor. The asynchronous abort may therefore slip through
to the next running guest that has the PSTATE.A bit unmasked.

To avoid this, it is necessary to take the abort at HYP by
clearing the PSTATE.A bit. This patch unmasks the PSTATE.A bit
to open a window for catching guest-generated asynchronous aborts in
all Guest -> HYP switch paths. If such an asynchronous abort is caught
in the checking window, the HYP data abort exception is triggered and
the guest that caused the abort is crashed.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 6aaff7b407ca76dcfc4fe81f2afe9d1594cb0d6b
master date: 2016-11-29 15:59:55 +0100

8 years ago arm: crash the guest when it traps on external abort
Wei Chen [Tue, 29 Nov 2016 15:12:20 +0000 (16:12 +0100)]
arm: crash the guest when it traps on external abort

If we spot a data or prefetch abort bearing the ESR_EL2.EA bit set, we
know that this is an external abort, and that should crash the guest.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Julien Grall <Julien.Grall@arm.com>
master commit: f8c6a9334b251d2e78b0873a71b4d369908fb123
master date: 2016-11-29 15:59:26 +0100

8 years ago arm64: handle async aborts delivered while at EL2
Wei Chen [Tue, 29 Nov 2016 15:12:05 +0000 (16:12 +0100)]
arm64: handle async aborts delivered while at EL2

If EL1 generates an asynchronous abort and then traps into EL2
(by HVC or IRQ) before the abort has been delivered, the hypervisor
cannot catch it, because the PSTATE.A bit is masked all the time
in the hypervisor. The asynchronous abort may therefore slip through
to the next running guest that has the PSTATE.A bit unmasked.

To avoid this, it is necessary to take the abort at EL2 by
clearing the PSTATE.A bit. This patch unmasks the PSTATE.A bit
to open a window for catching guest-generated asynchronous aborts in
all EL1 -> EL2 switch paths. If such an asynchronous abort is caught
in the checking window, the hyp_error exception is triggered and the
guest that caused the abort is crashed.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 36008800e81bc061cce1fd204a0b638f9dc61c70
master date: 2016-11-29 15:58:57 +0100

8 years ago arm64: handle guest-generated EL1 asynchronous abort
Wei Chen [Tue, 29 Nov 2016 15:11:39 +0000 (16:11 +0100)]
arm64: handle guest-generated EL1 asynchronous abort

In the current code, when the hypervisor receives an asynchronous abort
from a guest, the hypervisor panics and the host goes down.
To prevent this security issue, this patch crashes the guest instead
when the hypervisor receives an asynchronous abort from it.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Julien Grall <Julien.Grall@arm.com>
master commit: 2cf7d2bafb9b68add1710b8c3f7ecad58e53a9db
master date: 2016-11-29 15:57:52 +0100

8 years ago pygrub: Properly quote results, when returning them to the caller:
Ian Jackson [Tue, 22 Nov 2016 13:23:54 +0000 (14:23 +0100)]
pygrub: Properly quote results, when returning them to the caller:

* When the caller wants sexpr output, use `repr()'
  This is what Xend expects.

  The returned S-expressions are now escaped and quoted by Python,
  generally using '...'.  Previously kernel and ramdisk were unquoted
  and args was quoted with "..." but without proper escaping.  This
  change may break toolstacks which do not properly dequote the
  returned S-expressions.

* When the caller wants "simple" output, crash if the delimiter is
  contained in the returned value.

  With --output-format=simple it does not seem like this could ever
  happen, because the bootloader config parsers all take line-based
  input from the various bootloader config files.

  With --output-format=simple0, this can happen if the bootloader
  config file contains nul bytes.

This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100
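
The two output rules listed in this message can be sketched as below. This is not pygrub's actual code: both helper names and the `key value` record layout are assumptions made for illustration. Only the use of `repr()` for S-expression values and the refusal to emit a value containing the delimiter come from the message.

```python
def quote_sexpr_value(value):
    """Quote and escape a string for S-expression output via repr()."""
    return repr(value)


def format_simple(fields, delimiter="\n"):
    """Join 'key value' records, refusing values that contain the delimiter.

    Use delimiter="\0" to model the simple0 output format.
    """
    out = []
    for key, value in fields:
        if delimiter in value:
            raise ValueError(f"delimiter found in {key} value")
        out.append(f"{key} {value}{delimiter}")
    return "".join(out)
```

A value that embeds the delimiter would otherwise let a guest-controlled config file inject extra records into the output, which is why the model raises instead of emitting it.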

8 years ago x86/svm: fix injection of software interrupts
Andrew Cooper [Tue, 22 Nov 2016 13:23:29 +0000 (14:23 +0100)]
x86/svm: fix injection of software interrupts

The non-NextRip logic in c/s 36ebf14eb "x86/emulate: support for emulating
software event injection" was based on an older version of the AMD software
manual.  The manual was later corrected, following findings from that series.

I took the original wording of "not supported without NextRIP" to mean that
X86_EVENTTYPE_SW_INTERRUPT was not eligible for use.  It turns out that this
is not the case, and the new wording is clearer on the matter.

Despite testing the original patch series on non-NRip hardware, the
swint-emulation XTF test case focuses on the debug vectors; it never ended up
executing an `int $n` instruction for a vector which wasn't also an exception.

During a vmentry, the use of X86_EVENTTYPE_HW_EXCEPTION comes with a vector
check to ensure that it is only used with exception vectors.  Xen's use of
X86_EVENTTYPE_HW_EXCEPTION for `int $n` injection has always been buggy on AMD
hardware.

Fix this by always using X86_EVENTTYPE_SW_INTERRUPT.

Print and decode the eventinj information in svm_vmcb_dump(), as it has
several invalid combinations which cause vmentry failures.

This is CVE-2016-9378 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 920edccd41db6cb0145545afa1850edf5e7d098e
master date: 2016-11-22 13:51:16 +0100

8 years ago x86/emul: correct the IDT entry calculation in inject_swint()
Andrew Cooper [Tue, 22 Nov 2016 13:23:05 +0000 (14:23 +0100)]
x86/emul: correct the IDT entry calculation in inject_swint()

The logic, as introduced in c/s 36ebf14ebe "x86/emulate: support for emulating
software event injection" is buggy.  The size of an IDT entry depends on long
mode being active, not the width of the code segment currently in use.

In particular, this means that a compatibility code segment which hits
emulation for software event injection will end up using an incorrect offset
in the IDT for DPL/Presence checking.  In practice, this only occurs on old
AMD hardware lacking NRip support; all newer AMD hardware, and all Intel
hardware bypass this path in the emulator.

While here, fix a minor issue with reading the IDT entry.  The return value
from ops->read() wasn't checked, but in reality the only failure case is if a
pagefault occurs.  This is not a realistic problem as the kernel will almost
certainly crash with a double fault if this setup actually occurred.

This is CVE-2016-9377 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 255e8fe95f22ded5186fd75244ffcfb9d5dbc855
master date: 2016-11-22 13:50:49 +0100
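
The corrected calculation depends on a single input. A minimal Python sketch with a hypothetical function name: the helper deliberately takes no code segment width, since IDT entries are 16 bytes whenever long mode is active and 8 bytes otherwise.

```python
def idt_entry_offset(vector, long_mode_active):
    """Byte offset of a vector's gate within the IDT."""
    entry_size = 16 if long_mode_active else 8
    return vector * entry_size
```

For a compatibility-mode code segment running with long mode active, vector `0x80` lives at offset `0x800`; keying the size off the segment width would wrongly give `0x400`.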

8 years ago x86emul: fix huge bit offset handling
Jan Beulich [Tue, 22 Nov 2016 13:22:37 +0000 (14:22 +0100)]
x86emul: fix huge bit offset handling

We must never chop off the high 32 bits.

This is CVE-2016-9383 / XSA-195.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100

8 years ago x86/PV: writes of %fs and %gs base MSRs require canonical addresses
Jan Beulich [Tue, 22 Nov 2016 13:22:09 +0000 (14:22 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses

Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.

Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, otoh, already get properly checked in set_debugreg() (albeit
we discard the error there).

The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.

This is CVE-2016-9385 / XSA-193.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100

8 years ago x86/HVM: don't load LDTR with VM86 mode attrs during task switch
Jan Beulich [Tue, 22 Nov 2016 13:21:34 +0000 (14:21 +0100)]
x86/HVM: don't load LDTR with VM86 mode attrs during task switch

Just like TR, LDTR is purely a protected mode facility and hence needs
to be loaded accordingly. Also move its loading to where it
architecturally belongs.

This is CVE-2016-9382 / XSA-192.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93aa42b85ae0084ba7b749d0e990c94fbf0c17e3
master date: 2016-11-22 13:45:44 +0100

8 years ago x86/hvm: Fix the handling of non-present segments
Andrew Cooper [Tue, 22 Nov 2016 13:20:58 +0000 (14:20 +0100)]
x86/hvm: Fix the handling of non-present segments

In 32bit, the data segments may be NULL to indicate that the segment is
ineligible for use.  In both 32bit and 64bit, the LDT selector may be NULL to
indicate that the entire LDT is ineligible for use.  However, nothing in Xen
actually checks for this condition when performing other segmentation
checks.  (Note however that limit and writeability checks are correctly
performed).

Neither Intel nor AMD specify the exact behaviour of loading a NULL segment.
Experimentally, AMD zeroes all attributes but leaves the base and limit
unmodified.  Intel zeroes the base, sets the limit to 0xfffffff and resets the
attributes to just .G and .D/B.

The use of the segment information in the VMCB/VMCS is equivalent to a native
pipeline interacting with the segment cache.  The present bit can therefore
have a subtly different meaning, and it is now cooked to uniformly indicate
whether the segment is usable or not.

GDTR and IDTR don't have access rights like the other segments, but for
consistency, they are treated as being present so no special casing is needed
elsewhere in the segmentation logic.

AMD hardware does not consider the present bit for %cs and %tr, and will
function as if they were present.  They are therefore unconditionally set to
present when reading information from the VMCB, to maintain the new meaning of
usability.

Intel hardware has a separate unusable bit in the VMCS segment attributes.
This bit is inverted and stored in the present field, so the hvm code can work
with architecturally-common state.

This is CVE-2016-9386 / XSA-191.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 04beafa8e6c66f5cd814c00e2d2b51cfbc41cb8a
master date: 2016-11-22 13:44:50 +0100

8 years ago update Xen version to 4.6.5-pre
Jan Beulich [Tue, 22 Nov 2016 13:11:27 +0000 (14:11 +0100)]
update Xen version to 4.6.5-pre

8 years ago update Xen version to 4.6.4 (tag: RELEASE-4.6.4)
Jan Beulich [Fri, 4 Nov 2016 11:33:10 +0000 (12:33 +0100)]
update Xen version to 4.6.4

8 years ago vscsiif.h: replace PAGE_SIZE with VSCSIIF_PAGE_SIZE
Stefano Stabellini [Tue, 25 Oct 2016 15:19:36 +0000 (17:19 +0200)]
vscsiif.h: replace PAGE_SIZE with VSCSIIF_PAGE_SIZE

Do not reference PAGE_SIZE directly: it could be undefined, or it could
have different values in the frontend or in the backend.

Define VSCSIIF_PAGE_SIZE as 4096, assuming all users of vscsiif.h have
4K page granularity. Replace PAGE_SIZE with VSCSIIF_PAGE_SIZE.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: d93539cc486aa6022195305dbea5fe12f90b69fe
master date: 2016-10-21 09:58:16 +0100

8 years ago usbif.h: replace PAGE_SIZE with USBIF_RING_SIZE
Stefano Stabellini [Tue, 25 Oct 2016 15:19:12 +0000 (17:19 +0200)]
usbif.h: replace PAGE_SIZE with USBIF_RING_SIZE

Do not reference PAGE_SIZE directly: it could be undefined, or it could
have different values in the frontend or in the backend.

Define USBIF_RING_SIZE as 4096, assuming all users of usbif.h have 4K
page granularity. Replace PAGE_SIZE with USBIF_RING_SIZE.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 04535bee06858fd949c743cfecc4d7b96333a16c
master date: 2016-10-21 09:57:55 +0100

8 years ago x86/Viridian: don't depend on undefined register state
Jan Beulich [Tue, 25 Oct 2016 15:18:41 +0000 (17:18 +0200)]
x86/Viridian: don't depend on undefined register state

The high halves of all GPRs are undefined in 32-bit and compat modes,
and the dependency is being obfuscated by our structure field names not
matching architectural register names (it was actually while putting
together a patch to correct this when I noticed the issue here).

For consistency also use the architecturally correct names on the
output side.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: a709a3a646302e95ba42beac89264f6cdacd0c64
master date: 2016-10-14 14:09:42 +0200

8 years ago x86emul: fix pushing of selector registers
Jan Beulich [Tue, 25 Oct 2016 15:18:16 +0000 (17:18 +0200)]
x86emul: fix pushing of selector registers

Both explicit PUSH and far CALL currently push unrelated data (the
segment attributes word) in the high half (attributes and limit in the
64-bit case in the high 48 bits) instead of zero. To avoid having to
apply this and further changes in multiple places, also fold the two
(respectively) far call/jmp instances into one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 373923ed9c2ed36925f574387db2be2ebe5ce45a
master date: 2016-10-14 14:09:16 +0200

8 years ago x86/hvm: Clobber %cs.L when LME becomes set
Andrew Cooper [Tue, 25 Oct 2016 15:17:45 +0000 (17:17 +0200)]
x86/hvm: Clobber %cs.L when LME becomes set

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ed00f1761689ac7b9c074e9084c81e47c25d460c
master date: 2016-10-14 12:44:29 +0100

8 years ago xen/trace: Fix trace metadata page count calculation (revert fbf96e6)
George Dunlap [Tue, 25 Oct 2016 15:17:03 +0000 (17:17 +0200)]
xen/trace: Fix trace metadata page count calculation (revert fbf96e6)

Changeset fbf96e6, "xentrace: correct formula to calculate
t_info_pages", broke the trace metadata page count calculation, by
mistaking t_info_first_offset as denominated in bytes, when in fact it
is denominated in words (uint32_t).

Effectively revert that change, and put a comment there to reduce the
chance that someone will make that mistake in the future.

Reviewed-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Tested-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
master commit: 71b8b46111219a2f83f4f9ae06ac5409744ea86e
master date: 2016-10-11 14:23:52 +0100

8 years ago x86: defer not-present segment checks
Jan Beulich [Tue, 25 Oct 2016 15:16:30 +0000 (17:16 +0200)]
x86: defer not-present segment checks

Following on from commits 5602e74c60 ("x86emul: correct loading of
%ss") and bdb860d01c ("x86/HVM: correct segment register loading during
task switch") the point of the non-.present checks needs to be refined:
#NP (and its #SS companion), other than suggested by the various
instruction pages in Intel's SDM, gets checked for only after all type
and permission checks. The only checks getting done even later are the
long mode specific ones for system descriptors (which we don't support
yet) and 64-bit code segments (i.e. anything touching other than the
attribute byte).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86/hvm: Correct the position of the %cs L/D checks

Contrary to the description in the software manuals, in Long Mode, attempts to
load %cs check that D is not set in combination with L before the present flag
is checked.

This can be observed because the L/D check fails with #GP before the presence
check fails with #NP.

This change partially reverts c/s 78ff18c90 "x86: defer not-present segment
checks", taking it back to how it was in the v1 submission.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 78ff18c905318a9b1e5dd32662986f03b10a4e56
master date: 2016-10-10 12:16:49 +0200
master commit: 13b9f31751a55a96e86bd7e64b433a62a4a5b71e
master date: 2016-10-14 12:43:17 +0100

8 years ago xen: credit1: return the 'time remaining to the limit' as next timeslice.
Dario Faggioli [Tue, 25 Oct 2016 15:15:35 +0000 (17:15 +0200)]
xen: credit1: return the 'time remaining to the limit' as next timeslice.

If vcpu x has run for 200us, and sched_ratelimit_us is
1000us, continue running x _but_ return 1000us-200us as
the next time slice. This way, next scheduling point will
happen in 800us, i.e., exactly at the point when x crosses
the threshold, and can be descheduled (if appropriate).

Right now (without this patch), we're always returning
sched_ratelimit_us (1000us, in the example above), which
means we're (potentially) allowing x to run more than
it should have been able to.

Note that, however, in order to avoid setting timers to very
short intervals, which is part of the purpose of rate limiting,
we never use a time slice smaller than a well defined threshold.
Such a threshold (CSCHED_MIN_TIMER, defined in this patch) is, in
general, independent from rate limiting, but it looks like a good idea
to set it to the minimum possible ratelimiting value.
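
The arithmetic described above can be sketched as follows. This is only an illustration: ratelimit_timeslice_us() and MIN_TIMER_US are hypothetical names, and the 100us floor is an assumed value standing in for CSCHED_MIN_TIMER.

```c
#include <stdint.h>

/* Assumed floor in microseconds, standing in for CSCHED_MIN_TIMER. */
#define MIN_TIMER_US 100

/*
 * Hypothetical helper: given how long a vcpu has already run and the
 * rate limit (both in microseconds), return the next time slice.
 * The slice is the time remaining to the limit, but never shorter
 * than MIN_TIMER_US.
 */
static int64_t ratelimit_timeslice_us(int64_t runtime_us, int64_t ratelimit_us)
{
    int64_t remaining = ratelimit_us - runtime_us;

    return remaining < MIN_TIMER_US ? MIN_TIMER_US : remaining;
}
```

With the numbers from the example (200us run, 1000us limit) this yields 800us; a vcpu that has already run 950us would get the assumed 100us floor rather than 50us.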

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 0053127890ebe9cafbd232752636a15881e4915a
master date: 2016-09-30 14:46:36 +0100

8 years agox86emul: honor guest CR0.TS and CR0.EM
Jan Beulich [Tue, 4 Oct 2016 13:05:41 +0000 (14:05 +0100)]
x86emul: honor guest CR0.TS and CR0.EM

We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.

Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.

This is XSA-190.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

8 years agox86/AMD: apply erratum 665 workaround
Emanuel Czirai [Wed, 28 Sep 2016 15:04:13 +0000 (17:04 +0200)]
x86/AMD: apply erratum 665 workaround

AMD F12h machines have an erratum which can cause DIV/IDIV to behave
unpredictably. The workaround is to set MSRC001_1029[31] but sometimes
there is no BIOS update containing that workaround so let's do it
ourselves unconditionally. It is simple enough.

[ Borislav: Wrote commit message. ]

Signed-off-by: Emanuel Czirai <icanrealizeum@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
[Linux commit: d1992996753132e2dafe955cccb2fb0714d3cfc4]

Make applicable to Xen.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6bfee2038565a208f4ecef0911087ca10eecf25b
master date: 2016-09-26 17:28:09 +0200

8 years agox86emul: don't allow null selector for LTR
Jan Beulich [Wed, 28 Sep 2016 15:03:38 +0000 (17:03 +0200)]
x86emul: don't allow null selector for LTR

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: dcfd9a5eadedc71d8546286b881bba7db152207a
master date: 2016-09-26 17:27:06 +0200

8 years agox86emul: correct loading of %ss
Jan Beulich [Wed, 28 Sep 2016 15:03:13 +0000 (17:03 +0200)]
x86emul: correct loading of %ss

- Instead of #NP, #SS needs to be raised for non-present descriptors.
- Loading a null selector is fine in 64-bit mode at CPL != 3, as long
  as RPL == CPL.
- Don't lose the low two selector bits on null selector loads (also
  applies to %ds, %es, %fs, %gs, and LDTR).
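
The null selector rule in the list above can be sketched as below. The function names are hypothetical and this is not the emulator's code; it only encodes the stated conditions (64-bit mode, CPL != 3, RPL == CPL).

```c
#include <stdbool.h>
#include <stdint.h>

/* A selector is null when everything except the two RPL bits is zero. */
static bool is_null_selector(uint16_t sel)
{
    return (sel & 0xfffc) == 0;
}

/*
 * Hypothetical check: may this null selector be loaded into %ss?
 * The caller is assumed to have established that sel is null.
 */
static bool null_ss_permitted(uint16_t sel, unsigned int cpl, bool in_64bit_mode)
{
    unsigned int rpl = sel & 3;   /* low two bits are kept, not dropped */

    return in_64bit_mode && cpl != 3 && rpl == cpl;
}
```

Keeping the RPL bits in the comparison is what the third bullet is about: a null selector of 0x0002 and one of 0x0000 are distinct values.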

Since we need CPL earlier now, also switch to using get_cpl() (instead
of open coding it).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5602e74c60c1ec2feef4cdd75376e4b1a1d7e681
master date: 2016-09-26 17:26:21 +0200

8 years agox86/Intel: hide CPUID faulting capability from guests
Jan Beulich [Wed, 28 Sep 2016 15:02:47 +0000 (17:02 +0200)]
x86/Intel: hide CPUID faulting capability from guests

We don't currently emulate it, so guests should not be misguided to
believe they can (try to) use it.

For now, simply return zero to guests for platform MSR reads, and only
accept (by discarding) writes of zero. If there are ever bits we
can safely expose to guests, let's handle them by whitelisting.
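
A minimal sketch of that policy, with hypothetical handler names and a bool return standing in for "access allowed" versus "raise a fault"; it is not the actual MSR intercept code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical read handler: the guest always sees zero. */
static bool guest_read_platform_msr(uint64_t *val)
{
    *val = 0;
    return true;
}

/*
 * Hypothetical write handler: a write of zero is accepted and
 * discarded; anything else is refused so the caller can fault.
 */
static bool guest_write_platform_msr(uint64_t val)
{
    return val == 0;
}
```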

(As a side note - according to SDM version 059 bit 31 is reserved on
all known families.)

Reported-by: Kyle Huey <me@kylehuey.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: b982a5bea4273a4b9fc007d5046bed8d1669c07f
master date: 2016-09-19 11:37:09 +0200

8 years agoxen: credit2: properly schedule migration of a running vcpu.
Dario Faggioli [Wed, 28 Sep 2016 15:02:11 +0000 (17:02 +0200)]
xen: credit2: properly schedule migration of a running vcpu.

If wanting to migrate a vcpu that is actually running,
we need to ask the scheduler to chime in as soon as
possible, to have the vcpu itself stopped and actually
moved.

Make sure this happens by, after setting all the relevant
flags, raising the scheduler softirq.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 115e4c5e52c14c126cd8ae0dfe0322c95b65e3c8
master date: 2016-09-15 12:39:47 +0100

8 years agoxen: credit1: fix mask to be used for tickling in Credit1
Dario Faggioli [Wed, 28 Sep 2016 15:01:49 +0000 (17:01 +0200)]
xen: credit1: fix mask to be used for tickling in Credit1

If there are idle pcpus inside the waking vcpu's
soft-affinity mask, we should really tickle one
of them (this is one of the purposes of the
__runq_tickle() function itself!), not just
any idle pcpu.

The issue has been introduced in 02ea5031825d
("credit1: properly deal with pCPUs not in any cpupool"),
where the usage of idle_mask is changed, without
updating the bottom of the function, where it
is also referenced.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: f83fc393b2bb0a8b97bca07d810684a2c709aaa8
master date: 2016-09-15 12:39:47 +0100

8 years agox86/domctl: Fix TOCTOU race with the use of XEN_DOMCTL_getvcpuextstate
Andrew Cooper [Wed, 28 Sep 2016 15:01:09 +0000 (17:01 +0200)]
x86/domctl: Fix TOCTOU race with the use of XEN_DOMCTL_getvcpuextstate

A toolstack must call XEN_DOMCTL_getvcpuextstate twice; first to find the size
of the buffer to use, and a second time to get the actual content.

The reported size was based on v->arch.xcr0_accum, but a guest which extends
its xcr0_accum between the two hypercalls will cause the toolstack to fail the
evc->size != size check, as the provided buffer is now too small.  This causes
a hard error during the final phase of migration.

Instead, return a size based on xfeature_mask, which is the maximum size Xen
will ever permit.  The hypercall must now tolerate a toolstack-provided buffer
which is overly large (for the case where a guest isn't using all available
xsave states), and should write back how much data was actually written into
the buffer.
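
The buffer contract described above could look roughly like this. copy_xsave_state() and its parameters are hypothetical and only model the "tolerate an oversized buffer, report bytes written" behaviour; they are not the hypercall interface.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical model of the second call: 'size' holds the caller's
 * buffer size on entry and the number of bytes written on return.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int copy_xsave_state(void *buf, size_t *size,
                            const void *state, size_t state_len)
{
    if (*size < state_len)
        return -1;              /* only a too-small buffer is an error */

    memcpy(buf, state, state_len);
    *size = state_len;          /* write back how much was actually used */
    return 0;
}
```

Because the first call reports the maximum possible size, a guest enabling more xsave states between the two calls can no longer make the buffer too small.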

As the query for size now has no dependence on vcpu state, the vcpu_pause()
can be omitted for a small performance improvement.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d4a322557ae98cccdf90a0f442a29e1f5d76378a
master date: 2016-09-13 10:43:59 +0100

8 years agoQEMU_TAG update
Ian Jackson [Tue, 20 Sep 2016 15:37:08 +0000 (16:37 +0100)]
QEMU_TAG update

8 years agolibxl: do not assume Dom0 backend while getting nic info
Marek Marczykowski-Górecki [Mon, 5 Sep 2016 10:15:26 +0000 (12:15 +0200)]
libxl: do not assume Dom0 backend while getting nic info

Fill backend_domid field based on backend path.

[ Corresponding xen.git#master commit is 7539772a65b0 -iwj ]

Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit 317eb71bc3b0830338601fb14d1f49546a1c1e35)

8 years agotools/migrate: Prevent PTE truncation from being fatal duing the live phase
Andrew Cooper [Thu, 1 Sep 2016 09:45:03 +0000 (10:45 +0100)]
tools/migrate: Prevent PTE truncation from being fatal duing the live phase

It is possible, when normalising a PV pagetable, that the table has been freed
and reused for something else by the guest.

In such a case, data read might no longer be a pagetable, and fail the
truncation check.  However, this should only be fatal if we encounter such a
page in the paused phase.

This check is now consistent with all other checks in the same area.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit a335a58e757232249d7f33503ab5d8f849518a47)
(cherry picked from commit 7e17174735034803c55a4650e42fbe657846d318)

8 years agoRevert "x86/hvm: Perform a user instruction fetch for a FEP in userspace"
Jan Beulich [Mon, 12 Sep 2016 15:46:39 +0000 (17:46 +0200)]
Revert "x86/hvm: Perform a user instruction fetch for a FEP in userspace"

This reverts commit c3b06b0a7523374822e2eec789178444eefc5710,
which doesn't build and would, in its current form, use
uninitialized data if it did.

8 years agox86/segment: Bounds check accesses to emulation ctxt->seg_reg[]
Andrew Cooper [Mon, 12 Sep 2016 14:00:56 +0000 (16:00 +0200)]
x86/segment: Bounds check accesses to emulation ctxt->seg_reg[]

HVM HAP codepaths have space for all segment registers in the seg_reg[]
cache (with x86_seg_none still risking an array overrun), while the shadow
codepaths only have space for the user segments.

Range check the input segment of *_get_seg_reg() against the size of the array
used to cache the results, to avoid overruns in the case that the callers
don't filter their input suitably.

Subsume the is_x86_user_segment(seg) checks from the shadow code, which were
an incomplete attempt at range checking, and are now superseded.  Make
hvm_get_seg_reg() static, as it is not used outside of shadow/common.c.

No functional change, but far easier to reason that no overflow is possible.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
xen/x86: Fix build with clang following c/s 4fa0105

https://travis-ci.org/xen-project/xen/jobs/158494027#L2344

Clang complains:

  emulate.c:2016:14: error: comparison of unsigned enum expression < 0
  is always false [-Werror,-Wtautological-compare]
      if ( seg < 0 || seg >= ARRAY_SIZE(hvmemul_ctxt->seg_reg) )
           ~~~ ^ ~

Clang is wrong to raise a warning like this.  The signed-ness of an enum is
implementation defined in C, and robust code must not assume the choices made
by the compiler.

In this case, dropping the < 0 check creates a latent bug which would result
in an array underflow when compiled with a compiler which chooses a signed
enum.

Work around the bug by explicitly pulling seg into an unsigned integer, and
only perform the upper bounds check.
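
The workaround can be sketched as below, using hypothetical stand-ins for the segment enum and the cache structure (the real types live in the emulator headers):

```c
#include <stdbool.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-ins for the segment enum and register cache. */
enum seg { SEG_CS, SEG_SS, SEG_DS, SEG_ES, SEG_FS, SEG_GS, SEG_NONE };

struct ctxt {
    int seg_reg[6];
};

static bool seg_in_range(const struct ctxt *c, enum seg seg)
{
    /*
     * Pull the enum into an unsigned integer first.  If the compiler
     * chose a signed enum, a negative value becomes a very large index
     * and fails the single upper bound check.
     */
    unsigned int idx = seg;

    return idx < ARRAY_SIZE(c->seg_reg);
}
```

One comparison then covers both directions, and there is no "< 0" test on the enum for the compiler to object to.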

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4fa0105d95be6e7145a1f6fd1036ccd43976228c
master date: 2016-09-08 16:39:46 +0100
master commit: 4c47c47938ea24c73d9459f9f0b6923513772b5d
master date: 2016-09-09 15:31:01 +0100

8 years agox86/hvm: Perform a user instruction fetch for a FEP in userspace
Andrew Cooper [Mon, 12 Sep 2016 14:00:30 +0000 (16:00 +0200)]
x86/hvm: Perform a user instruction fetch for a FEP in userspace

This matches hardware behaviour, and prevents erroneous failures when a guest
has SMEP/SMAP active and issues a FEP from userspace.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0831e99446121636045cf6f616a1991d6ef22071
master date: 2016-09-08 16:39:46 +0100

8 years agohvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary
Andrew Cooper [Mon, 12 Sep 2016 14:00:08 +0000 (16:00 +0200)]
hvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary

The Force Emulation Prefix is named to follow its PV counterpart for cpuid or
rdtsc, but isn't really an instruction prefix.  It behaves as a break-out into
Xen, with the purpose of emulating the next instruction in the current state.

It is important to be able to test legal situations which occur in real
hardware, including instructions which cross certain boundaries, and
instructions starting at 0.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5cee79dad24e7006059667b02bd7de685d8ee5
master date: 2016-09-08 16:39:46 +0100

8 years agoVMX: correct feature checks for MPX
Jan Beulich [Mon, 12 Sep 2016 13:59:29 +0000 (15:59 +0200)]
VMX: correct feature checks for MPX

Its VMCS field isn't tied to the respective base CPU feature flag but
instead to a VMX specific one.

Note that while the VMCS GUEST_BNDCFGS field exists if either of the
two respective features is available, MPX continues to get exposed to
guests only with both features present.

Also add the so far missing handling of
- GUEST_BNDCFGS in construct_vmcs()
- MSR_IA32_BNDCFGS in vmx_msr_{read,write}_intercept()
and mirror the extra correctness checks during MSR write to
vmx_load_msr().

Reported-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 68eb1a4d92be58e26bd11d02b8e0317bd56294ac
master date: 2016-09-07 12:34:43 +0200

8 years agox86/shadow: Avoid overflowing sh_ctxt->seg_reg[]
Andrew Cooper [Thu, 8 Sep 2016 12:26:01 +0000 (14:26 +0200)]
x86/shadow: Avoid overflowing sh_ctxt->seg_reg[]

hvm_get_seg_reg() does not perform a range check on its input segment, calls
hvm_get_segment_register() and writes straight into sh_ctxt->seg_reg[].

x86_seg_none is outside the bounds of sh_ctxt->seg_reg[], and will hit a BUG()
in {vmx,svm}_get_segment_register().

HVM guests running with shadow paging can end up performing a virtual to
linear translation with x86_seg_none.  This is used for addresses which are
already linear.  However, none of this is a legitimate pagetable update, so
fail the emulation in such a case.

This is XSA-187 / CVE-2016-7094.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: a9f3b3bad17d91e2067fc00d51b0302349570d08
master date: 2016-09-08 14:16:26 +0200

8 years agox86/emulate: Correct boundary interactions of emulated instructions
Andrew Cooper [Thu, 8 Sep 2016 12:25:28 +0000 (14:25 +0200)]
x86/emulate: Correct boundary interactions of emulated instructions

This reverts most of c/s 0640ffb6 "x86emul: fix rIP handling".

Experimentally, in long mode processors will execute an instruction stream
which crosses the 64bit -1 -> 0 virtual boundary, whether the instruction
boundary is aligned on the virtual boundary, or is misaligned.

In compatibility mode, Intel processors will execute an instruction stream
which crosses the 32bit -1 -> 0 virtual boundary, while AMD processors raise a
segmentation fault.  Xen's segmentation behaviour matches AMD.

For 16bit code, hardware does not ever truncate %ip.  %eip is always used and
behaves normally as a 32bit register, including in 16bit protected mode
segments, as well as in Real and Unreal mode.

This is XSA-186 / CVE-2016-7093.

Reported-by: Brian Marcotte <marcotte@panix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e9575f980df81aeb0e5b6139f485fd6f7bb7f5b6
master date: 2016-09-08 14:15:53 +0200

8 years agox86/32on64: don't allow recursive page tables from L3
Jan Beulich [Thu, 8 Sep 2016 12:24:45 +0000 (14:24 +0200)]
x86/32on64: don't allow recursive page tables from L3

L3 entries are special in PAE mode, and hence can't reasonably be used
for setting up recursive (and hence linear) page table mappings. Since
abuse is possible when the guest in fact gets run on 4-level page
tables, this needs to be excluded explicitly.

This is XSA-185 / CVE-2016-7092.

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Reported-by: "栾尚聪(好风)" <shangcong.lsc@alibaba-inc.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c844d637d92a75854ea5c8d4e5ca34302a9f623c
master date: 2016-09-08 14:14:53 +0200

8 years agomemory: fix compat handling of XENMEM_access_op
Jan Beulich [Tue, 6 Sep 2016 09:52:22 +0000 (11:52 +0200)]
memory: fix compat handling of XENMEM_access_op

Within compat_memory_op() this needs to be placed in the first switch()
statement, or it ends up being dead code (as that first switch() has a
default case chaining to compat_arch_memory_op()).
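
The control flow problem can be illustrated with a small model. The operation numbers and function names are hypothetical; arch_op() stands in for compat_arch_memory_op().

```c
/* Hypothetical operation numbers; not the real XENMEM_* values. */
enum { OP_HANDLED = 1, OP_ACCESS = 2 };

/* Stand-in for compat_arch_memory_op(): rejects whatever it is given. */
static int arch_op(int cmd)
{
    (void)cmd;
    return -1;
}

/* Buggy layout: OP_ACCESS is only named in the second switch. */
static int memory_op_buggy(int cmd)
{
    switch (cmd) {
    case OP_HANDLED:
        break;
    default:
        return arch_op(cmd);    /* OP_ACCESS leaves here */
    }

    switch (cmd) {
    case OP_HANDLED:
        return 0;
    case OP_ACCESS:             /* dead code: never reached */
        return 0;
    }
    return -1;
}

/* Fixed layout: OP_ACCESS is listed in the first switch as well. */
static int memory_op_fixed(int cmd)
{
    switch (cmd) {
    case OP_HANDLED:
    case OP_ACCESS:
        break;
    default:
        return arch_op(cmd);
    }

    switch (cmd) {
    case OP_HANDLED:
    case OP_ACCESS:
        return 0;
    }
    return -1;
}
```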

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8d6af808a7e9d9ae1d129e1e5a0def7f8b2333ee
master date: 2016-09-02 14:19:51 +0200

8 years agox86/PV: make PMU MSR handling consistent
Jan Beulich [Tue, 6 Sep 2016 09:51:59 +0000 (11:51 +0200)]
x86/PV: make PMU MSR handling consistent

So far accesses to Intel MSRs on an AMD system fall through to the
default case, while accesses to AMD MSRs on an Intel system bail (in
the RDMSR case without updating EAX and EDX). Make the "AMD MSRs on
Intel" case match the "Intel MSR on AMD" one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: bea64b3ed25864b90a41e1ca6eeb5a58895bb751
master date: 2016-09-02 14:19:29 +0200

8 years agocredit1: fix a race when picking initial pCPU for a vCPU
Dario Faggioli [Tue, 6 Sep 2016 09:51:34 +0000 (11:51 +0200)]
credit1: fix a race when picking initial pCPU for a vCPU

In the Credit1 hunk of 9f358ddd69463 ("xen: Have
schedulers revise initial placement") csched_cpu_pick()
is called without taking the runqueue lock of the
(temporary) pCPU that the vCPU has been assigned to
(e.g., in XEN_DOMCTL_max_vcpus).

However, although 'hidden' in the IS_RUNQ_IDLE() macro,
that function does access the runq (for doing load
balancing calculations). Two scenarios are possible:
 1) we are on cpu X, and IS_RUNQ_IDLE() peeks at cpu
    X's own runq;
 2) we are on cpu X, but IS_RUNQ_IDLE() peeks at some
    other cpu's runq.

Scenario 2) absolutely requires that the appropriate
runq lock is taken. Scenario 1) works even without
taking the cpu's own runq lock. That is actually what
happens when _csched_cpu_pick() is called from
csched_vcpu_acct() (in turn, called by csched_tick()).

Races have been observed and reported (by both XenServer's
own testing and OSSTest [1]), in the form of
IS_RUNQ_IDLE() falling over LIST_POISON, because we're
not currently holding the proper lock, in
csched_vcpu_insert(), when scenario 1) occurs.

However, for better robustness, from now on we always
ask for the proper runq lock to be held when calling
IS_RUNQ_IDLE() (which is also becoming a static inline
function instead of macro).

In order to comply with that, we take the lock around
the call to _csched_cpu_pick() in csched_vcpu_acct().

[1] https://lists.xen.org/archives/html/xen-devel/2016-08/msg02144.html

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9109bf55084398c4547b8956906410c158eb9a17
master date: 2016-09-02 14:17:55 +0200

8 years agox86/32on64: misc adjustments to call gate emulation
Jan Beulich [Tue, 6 Sep 2016 09:51:05 +0000 (11:51 +0200)]
x86/32on64: misc adjustments to call gate emulation

- There's no 32-bit displacement in 16-bit addressing mode.
- It is wrong to ASSERT() anything on parts of an instruction fetched
  from guest memory.
- The two scaling bits of a SIB byte don't affect whether there is a
  scaled index register or not.
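
The third point can be made concrete with a small decoder sketch. The helper names are hypothetical; the field layout (scale in bits 7:6, index in bits 5:3, base in bits 2:0) is the standard SIB encoding, and REX prefixes are ignored since this concerns 32-bit guests.

```c
#include <stdbool.h>
#include <stdint.h>

/* SIB byte layout: scale in bits 7:6, index in bits 5:3, base in bits 2:0. */
static unsigned int sib_scale(uint8_t sib) { return sib >> 6; }
static unsigned int sib_index(uint8_t sib) { return (sib >> 3) & 7; }
static unsigned int sib_base(uint8_t sib)  { return sib & 7; }

/*
 * Whether an index register is present depends on the index field
 * alone: encoding 4 means "no index", whatever the scale bits say.
 */
static bool sib_has_index(uint8_t sib)
{
    return sib_index(sib) != 4;
}
```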

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ee1cc4bfdca84d526805c4c72302c026f5e9cd94
master date: 2016-09-01 15:23:46 +0200

8 years agoxen: Remove buggy initial placement algorithm
George Dunlap [Tue, 6 Sep 2016 09:49:37 +0000 (11:49 +0200)]
xen: Remove buggy initial placement algorithm

The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities.  Just get rid of it and rely on the schedulers to do
initial placement.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100

8 years agoxen: Have schedulers revise initial placement
George Dunlap [Tue, 6 Sep 2016 09:49:13 +0000 (11:49 +0200)]
xen: Have schedulers revise initial placement

The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts to do
initial placement load-balancing by placing vcpu 0 on the least-busy
non-primary hyperthread available.  Unfortunately, the logic can end
up picking a pcpu that's not in the online mask.  When this is passed
to a scheduler which assumes that the initial assignment is
valid, it causes a null pointer dereference looking up the runqueue.

Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).

To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement.  This will
automatically take all knowledge the scheduler has into account.

csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken.  Grab and release the lock to minimize time spent with irqs
disabled.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9f358ddd69463fa8fb65cf67beb5f6f0d3350e32
master date: 2016-07-26 10:42:49 +0100

8 years agosched: better handle (not) inserting idle vCPUs in runqueues
Dario Faggioli [Tue, 6 Sep 2016 09:48:07 +0000 (11:48 +0200)]
sched: better handle (not) inserting idle vCPUs in runqueues

Idle vCPUs are set to run immediately, as a part of their
own initialization, so we shouldn't even try to put them
in a runqueue. In fact, no scheduler does that, even when
asked to (that is rather explicit in Credit2 and RTDS, a
bit less evident in Credit1).

Let's make things look as follows:
 - in generic code, explicitly avoid even trying to
   insert idle vCPUs in runqueues;
 - in specific schedulers' code, enforce that.

Note that, as csched_vcpu_insert() is no longer being
called, during boot (from sched_init_vcpu()) we can
safely avoid saving the flags when taking the runqueue
lock.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 6b53bb4ab3c9bd5eccde88a5175cf72589ba6d52
master date: 2015-11-24 14:49:47 +0100

8 years agoxen/physmap: Do not permit a guest to populate PoD pages for itself
Andrew Cooper [Fri, 26 Aug 2016 08:28:48 +0000 (10:28 +0200)]
xen/physmap: Do not permit a guest to populate PoD pages for itself

PoD is supposed to be entirely transparent to the guest, but this interface has
been left exposed for a long time.

The use of PoD requires careful co-ordination by the toolstack with the
XENMEM_{get,set}_pod_target hypercalls, and xenstore ballooning target.  The
best a guest can do without toolstack cooperation is crash.

Furthermore, there are combinations of features (e.g. c/s c63868ff "libxl:
disallow PCI device assignment for HVM guest when PoD is enabled") which a
toolstack might wish to explicitly prohibit (in this case, because the two
simply don't function in combination).  In such cases, the guest mustn't be
able to subvert the configuration chosen by the toolstack.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2a99aa99fc84a45f505f84802af56b006d14c52e
master date: 2016-08-19 18:40:11 +0100

8 years agopage-alloc/x86: don't restrict DMA heap to node 0
Jan Beulich [Fri, 26 Aug 2016 08:28:08 +0000 (10:28 +0200)]
page-alloc/x86: don't restrict DMA heap to node 0

When node zero has no memory, the DMA bit width will end up getting set
to 9, which is obviously not helpful to hold back a reasonable amount
of low enough memory for Dom0 to use for DMA purposes. Find the lowest
node with memory below 4Gb instead.

Introduce arch_get_dma_bitsize() to keep this arch-specific logic out
of common code.

Also adjust the original calculation: I think the subtraction of 1
should have been part of the flsl() argument rather than getting
applied to its result. And while previously the division by 4 was valid
to be done on the flsl() result, this now also needs to be converted,
as it should only be applied to the spanned pages value.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d0d6597d3d682f324b6a79e3278e6f5bb6bad153
master date: 2016-08-11 13:35:50 +0200

8 years agolibxl: return any serial tty path in libxl_console_get_tty
Bob Liu [Thu, 4 Aug 2016 01:07:56 +0000 (09:07 +0800)]
libxl: return any serial tty path in libxl_console_get_tty

When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 35dbf099ac18924d40533c9d1b9bfbf1ecb818c9)
(cherry picked from commit 822464961ae1bac44dcabb049255d61d5511e368)

8 years agotools/libxc: Properly increment ApicIdCoreSize field on AMD
Boris Ostrovsky [Fri, 22 Jul 2016 17:14:01 +0000 (13:14 -0400)]
tools/libxc: Properly increment ApicIdCoreSize field on AMD

Current code incorrectly adds 1 to full register instead of
incrementing the field in bits 15:12.
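
The difference between the two operations can be sketched as follows. bump_field_15_12() is a hypothetical helper, and wrapping within the four-bit field is an assumption of this sketch rather than something the commit states.

```c
#include <stdint.h>

#define FIELD_SHIFT 12
#define FIELD_MASK  (0xfu << FIELD_SHIFT)   /* bits 15:12 */

/*
 * Hypothetical helper: increment only the four-bit field in bits
 * 15:12, leaving every other bit of the register untouched.
 */
static uint32_t bump_field_15_12(uint32_t reg)
{
    uint32_t field = (reg & FIELD_MASK) >> FIELD_SHIFT;

    field = (field + 1) & 0xfu;             /* assumed: wrap within the field */
    return (reg & ~FIELD_MASK) | (field << FIELD_SHIFT);
}
```

For a register value of 0x3007, adding 1 to the full register gives 0x3008 and corrupts the low bits, whereas incrementing the field gives 0x4007.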

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit a3336a507519c1d28db3bbff8e439aa3811733f3)
(cherry picked from commit de781b47f447fcdd1c556543fcf0ef1d654d2d4d)

8 years agolibxenstat: honour XEN_RUN_DIR
Wei Liu [Mon, 11 Jul 2016 17:28:08 +0000 (18:28 +0100)]
libxenstat: honour XEN_RUN_DIR

This is because libxl uses XEN_RUN_DIR to generate the socket path for
libxenstat while libxenstat itself uses hard-coded path, which is not
necessarily the same path as XEN_RUN_DIR.  The default configuration
happened to work because XEN_RUN_DIR defaulted to /var/run/xen, which
matched the hard-coded path.

We should make libxenstat use XEN_RUN_DIR so that it works with
non-default configuration.

Generate a _paths.h because it is required to make this change work.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit dedb221889dbdd96f1d3c1155c3eb492d329bb53)

(cherry picked from commit ab75cdf635da31012f822260c607f5b4ceaf7668)
Conflicts:
tools/xenstat/libxenstat/src/xenstat_qmp.c
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

8 years agolibxenvchan: Change license of header from Lesser GPL v2.1 to BSD
Konrad Rzeszutek Wilk [Mon, 13 Jun 2016 09:28:57 +0000 (05:28 -0400)]
libxenvchan: Change license of header from Lesser GPL v2.1 to BSD

As the xen/COPYING file says:
"A few files are licensed under both GPL and a weaker BSD-style
license. This includes all files within the subdirectory
include/public, as described in include/public/COPYING. All such files
include the non-GPL license text as a source-code comment. Although
the license text refers generically to "the software", the non-GPL
license applies *only* to those source files that explicitly include
the non-GPL license text."

The libxenvchan.h is under xen/include/public/io directory
and the xen/include/public/COPYING says:

"XEN NOTICE
==========

This copyright applies to all files within this subdirectory and its
subdirectories:
  include/public/*.h
  include/public/hvm/*.h
  include/public/io/*.h

The intention is that these files can be freely copied into the source
tree of an operating system when porting that OS to run on Xen. Doing
so does *not* cause the OS to become subject to the terms of the GPL.

All other files in the Xen source distribution are covered by version
2 of the GNU General Public License except where explicitly stated
otherwise within individual source files.
"
Having the libxenvchan.h as Lesser GPL v2.1 where the COPYING file
says otherwise is confusing, to say the least.

Upon consulting with the authors of libxenvchan they said:
"FWIW Neither I, nor ITL staff (as author of original libvchan library)
have anything against converting it to the BSD-style licence."
(Marek Marczykowski-Górecki,
http://lists.xen.org/archives/html/xen-devel/2016-06/msg00995.html)
so as such let's change it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Anil Madhavapeddy <anil@recoil.org>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <George.Dunlap@eu.citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Jason Andryuk <andryuk@aero.org>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Matthew Daley <mattjd@gmail.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Roger Pau Monne <roger.pau@entel.upc.edu>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
["I have spoken to my line manager.  I can confirm that Citrix is happy
 with this proposed change.  So:

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
 This view from Citrix covers all contributions made to these files in
 the course of Citrix's employees' employment, which I think is:

 > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
 > cc: George Dunlap <George.Dunlap@eu.citrix.com>
 > Cc: Ian Campbell <ian.campbell@citrix.com>
 > Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
 > Cc: Roger Pau Monne <roger.pau@entel.upc.edu>
 > Cc: Stefano Stabellini <sstabellini@kernel.org>
 > Cc: Tim Deegan <tim@xen.org>
 > Cc: Wei Liu <wei.liu2@citrix.com>

 ..
 [in subsequent email]:
 Wei points out that this ought also to include Keir Fraser's
 contribution, which was (only) in 2012.
 " (from Ian's email)

 In a subsequent mail, Wei also points out that David Scott's
 contribution is covered by Ian's ack.
]

master commit: 937324f032f4f77866e80e39de0d697fa5131df1

(cherry picked from commit b5e124661c175c88512c81acdeb06992259361b7)

8 years agoxl: correct xl cpupool-numa-split with vcpu limited dom0
Juergen Gross [Tue, 14 Jun 2016 04:30:58 +0000 (06:30 +0200)]
xl: correct xl cpupool-numa-split with vcpu limited dom0

When trying to use xl cpupool-numa-split while dom0 is limited to fewer
vcpus than one NUMA node, the operation will fail.

Correct this by allowing this configuration.

Reported-by: Glenn Enright <glenn@rimuhosting.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c256d2afc1cad0cca912492e338d6ff97e477c4f)
(cherry picked from commit 78a30103813a7929e922de7414cc56fa2ae52984)

8 years agoconfigure: Fix when no libsystemd compat lib are available
Anthony PERARD [Tue, 3 May 2016 15:59:49 +0000 (16:59 +0100)]
configure: Fix when no libsystemd compat lib are available

According to the systemd change log, since version 209 libsystemd.so
contains everything, including libsystemd-daemon.so. Distros may or may
not provide the compatibility libraries, of which libsystemd-daemon is
a part.

So, if libsystemd-daemon is not available, check for the presence of
a recent enough libsystemd.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
[ wei: run autogen.sh ]
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7dec5b0c658bea9c16a0e3c051e64d2abf57be48)

8 years agoupdate Xen version to 4.6.4-pre
Jan Beulich [Mon, 8 Aug 2016 07:59:12 +0000 (09:59 +0200)]
update Xen version to 4.6.4-pre

8 years agoRevert "xen: Have schedulers revise initial placement"
Jan Beulich [Mon, 8 Aug 2016 07:55:11 +0000 (09:55 +0200)]
Revert "xen: Have schedulers revise initial placement"

This reverts commit 477080fc560a3025d451175b69995e62a2ce1a8d,
as it has further (so far unidentified) dependencies.

8 years agoRevert "xen: Remove buggy initial placement algorithm"
Jan Beulich [Mon, 8 Aug 2016 07:53:44 +0000 (09:53 +0200)]
Revert "xen: Remove buggy initial placement algorithm"

This reverts commit 715242a2764570680c4f9f5b039e390a8a78a642,
as its prereq had further (so far unidentified) dependencies.

8 years agox86/mmcfg: Fix initalisation of variables in pci_mmcfg_nvidia_mcp55()
Andrew Cooper [Fri, 5 Aug 2016 11:47:57 +0000 (13:47 +0200)]
x86/mmcfg: Fix initalisation of variables in pci_mmcfg_nvidia_mcp55()

Shifting into the sign bit of an integer is undefined behaviour.

Only the first integer is actually undefined, but switch all the shifts
for consistency.
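
A minimal sketch of the pattern involved, not the actual
pci_mmcfg_nvidia_mcp55() code: with a 32-bit int, (1 << 31) shifts into
the sign bit, while an unsigned operand keeps the shift well defined.
The helper names bit32() and top_bit_set() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build a single-bit mask for n in 0..31. */
static inline uint32_t bit32(unsigned int n)
{
    /* The unsigned operand makes the shift well defined even for n == 31;
     * the signed form (1 << 31) would shift into the sign bit of int. */
    return (uint32_t)1 << n;
}

/* Hypothetical helper: test whether the top bit of a register value is set. */
static inline int top_bit_set(uint32_t reg)
{
    return (reg & bit32(31)) != 0;
}
```

Switching every shift in a group of related masks to the unsigned form,
as the commit does, avoids having to reason about which ones reach bit 31.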

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
master commit: ab8fc3937eeb9332b83d7e14d81e37f0b0ef1841
master date: 2016-08-03 18:46:59 +0100

8 years agoxen: Remove buggy initial placement algorithm
George Dunlap [Fri, 5 Aug 2016 11:47:31 +0000 (13:47 +0200)]
xen: Remove buggy initial placement algorithm

The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities.  Just get rid of it and rely on the schedulers to do
initial placement.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100

8 years agoxen: Have schedulers revise initial placement
George Dunlap [Fri, 5 Aug 2016 11:47:04 +0000 (13:47 +0200)]
xen: Have schedulers revise initial placement

The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts initial placement
load-balancing by placing vcpu 0 on the least-busy non-primary
hyperthread available.  Unfortunately, the logic can end up picking a
pcpu that's not in the online mask.  When this is passed to a scheduler
which assumes that the initial assignment is valid, it causes a null
pointer dereference looking up the runqueue.

Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).

To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement.  This will
automatically take all knowledge the scheduler has into account.

csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken.  Grab and release the lock to minimize time spent with irqs
disabled.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9f358ddd69463fa8fb65cf67beb5f6f0d3350e32
master date: 2016-07-26 10:42:49 +0100

8 years agonested vmx: Validate host VMX MSRs before accessing them
Euan Harris [Fri, 5 Aug 2016 11:46:29 +0000 (13:46 +0200)]
nested vmx: Validate host VMX MSRs before accessing them

Some VMX MSRs may not exist on certain processor models, or may
be disabled because of configuration settings.   It is only safe to
access these MSRs if configuration flags in other MSRs are set.  These
prerequisites are listed in the Intel 64 and IA-32 Architectures
Software Developer’s Manual, Vol 3, Appendix A.

nvmx_msr_read_intercept() does not check the prerequisites before
accessing MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
MSR_IA32_VMX_VMFUNC on the host.   Accessing these MSRs from a nested
VMX guest running on a host which does not support them will cause
Xen to crash with a GPF.
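
A rough sketch of the prerequisite checks described above, assuming the
caller has already read the parent MSR values; it is not the code in
nvmx_msr_read_intercept().  The function names are hypothetical, and the
bit positions follow the control-field layout in the SDM, where the
allowed-1 settings occupy the upper 32 bits of each capability MSR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions within the 32-bit execution control fields. */
#define CPU_BASED_ACTIVATE_SECONDARY_CONTROLS (UINT32_C(1) << 31)
#define SECONDARY_EXEC_ENABLE_EPT             (UINT32_C(1) << 1)
#define SECONDARY_EXEC_ENABLE_VPID            (UINT32_C(1) << 5)
#define SECONDARY_EXEC_ENABLE_VM_FUNCTIONS    (UINT32_C(1) << 13)

/* The allowed-1 settings live in the upper half of a VMX control MSR. */
static inline uint32_t allowed1(uint64_t msr_val)
{
    return (uint32_t)(msr_val >> 32);
}

/* PROCBASED_CTLS2 exists only if secondary controls can be activated. */
static bool procbased_ctls2_available(uint64_t procbased_ctls)
{
    return allowed1(procbased_ctls) & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
}

/* EPT_VPID_CAP exists only if EPT or VPID can be enabled. */
static bool ept_vpid_cap_available(uint64_t procbased_ctls,
                                   uint64_t procbased_ctls2)
{
    return procbased_ctls2_available(procbased_ctls) &&
           (allowed1(procbased_ctls2) &
            (SECONDARY_EXEC_ENABLE_EPT | SECONDARY_EXEC_ENABLE_VPID));
}

/* VMFUNC exists only if VM functions can be enabled. */
static bool vmfunc_available(uint64_t procbased_ctls,
                             uint64_t procbased_ctls2)
{
    return procbased_ctls2_available(procbased_ctls) &&
           (allowed1(procbased_ctls2) & SECONDARY_EXEC_ENABLE_VM_FUNCTIONS);
}
```

In this sketch the caller passes 0 for procbased_ctls2 when that MSR is
itself unavailable, so the dependent checks fail closed.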

Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5e02972646132ad98c365ebfcfcb43b40a0dde36
master date: 2016-06-13 12:44:32 +0100

8 years agonested vmx: intercept guest rdmsr for MSR_IA32_VMX_VMFUNC
Euan Harris [Fri, 5 Aug 2016 11:46:02 +0000 (13:46 +0200)]
nested vmx: intercept guest rdmsr for MSR_IA32_VMX_VMFUNC

Guest reads of MSR_IA32_VMX_VMFUNC should be handled by
the logic in vmx_msr_read_intercept().   Otherwise a guest
can read the raw host value of this MSR, even if nested
vmx is disabled.

Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6439d23319986d37a6ea843c98b329218c3ac231
master date: 2016-06-08 14:14:33 +0200

8 years agoserial: fix incorrect length of strncmp for dtuart
Jiandi An [Fri, 5 Aug 2016 11:45:21 +0000 (13:45 +0200)]
serial: fix incorrect length of strncmp for dtuart

In serial_parse_handler(), length of strncmp for dtuart should have been
6, not 5.
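
To illustrate why the length matters (a sketch with hypothetical helper
names, not the serial_parse_handler() code): comparing only 5 characters
of "dtuart" accepts any option that starts with "dtuar".

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Buggy shape: only "dtuar" is compared, so "dtuarx" also matches. */
static bool is_dtuart_len5(const char *opt)
{
    return strncmp(opt, "dtuart", 5) == 0;
}

/* Fixed shape: derive the length from the literal itself (6 here). */
static bool is_dtuart_len6(const char *opt)
{
    return strncmp(opt, "dtuart", sizeof("dtuart") - 1) == 0;
}
```

Deriving the length with sizeof is one way to keep the literal and the
count from drifting apart; the commit itself simply corrects the number.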

Signed-off-by: Jiandi An <anjiandi@codeaurora.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: ba98196b54b27262ffe3d3463358eb4cff18b28d
master date: 2016-06-08 11:10:23 +0200

8 years agoxen/arm: p2m: Restrict usage of get_page_from_gva to the current vCPU
Julien Grall [Wed, 20 Jul 2016 16:10:45 +0000 (17:10 +0100)]
xen/arm: p2m: Restrict usage of get_page_from_gva to the current vCPU

The function get_page_from_gva translates a guest virtual address to a
machine address. The translation involves the registers VTTBR_EL2,
TTBR0_EL1, TTBR1_EL1 and SCTLR_EL1.

Currently, only the first register is context switched if the current
domain is not the same. This results in the wrong TTBR*_EL1 and
SCTLR_EL1 being used for the translation.

To fix the code properly, we would have to context switch all the
registers mentioned above when the vCPU passed as a parameter is not the
current one. Similar things would need to be done in the callee
p2m_mem_check_and_get_page.

Given that the only caller of this function with the vCPU that may not
be current is a guest debugging function (show_guest_stack), restrict
the usage to the current vCPU for the time being.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

8 years agoxen/arm: p2m: Pass the vCPU in parameter to get_page_from_gva
Julien Grall [Wed, 20 Jul 2016 16:10:44 +0000 (17:10 +0100)]
xen/arm: p2m: Pass the vCPU in parameter to get_page_from_gva

The function get_page_from_gva translates a guest virtual address to a
machine address. The translation involves the registers VTTBR_EL2,
TTBR0_EL1, TTBR1_EL1 and SCTLR_EL1. Whilst the first register is per
domain (the p2m is common to all vCPUs), the last 3 are per-vCPU.

Therefore, the function should take the vCPU as a parameter and not the
domain. Fixing the actual code path will be done in a separate patch.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

8 years agoxen/arm: system: Use the correct parameter name in local_irq_restore
Julien Grall [Wed, 20 Jul 2016 16:10:43 +0000 (17:10 +0100)]
xen/arm: system: Use the correct parameter name in local_irq_restore

The parameter to store the flags is called 'x' and not 'flags'.
Thankfully, all users of the macro pass a variable named 'flags'.
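
A reduced sketch of this class of bug, using hypothetical macros and a
stand-in variable rather than the real local_irq_restore(): when the
macro body names 'flags' instead of its parameter 'x', it only compiles
for callers whose variable happens to be called 'flags'.

```c
#include <assert.h>

/* Hypothetical stand-in for the restored interrupt state. */
static unsigned long fake_irq_state;

/* Buggy shape: the parameter is 'x' but the body refers to 'flags'. */
#define irq_restore_buggy(x) do { fake_irq_state = (flags); } while (0)

/* Fixed shape: the body uses the macro's own parameter. */
#define irq_restore_fixed(x) do { fake_irq_state = (x); } while (0)

/* Works only because the local variable is named 'flags'. */
static unsigned long restore_from_flags(unsigned long flags)
{
    irq_restore_buggy(flags);
    return fake_irq_state;
}

/* Works for any variable name. */
static unsigned long restore_from(unsigned long saved)
{
    irq_restore_fixed(saved);
    return fake_irq_state;
}
```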

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>

8 years agox86/entry: Avoid SMAP violation in compat_create_bounce_frame()
Andrew Cooper [Wed, 15 Jun 2016 17:32:14 +0000 (18:32 +0100)]
x86/entry: Avoid SMAP violation in compat_create_bounce_frame()

A 32bit guest kernel might be running on user mappings.
compat_create_bounce_frame() must whitelist its guest accesses to avoid
risking a SMAP violation.

For both variants of create_bounce_frame(), re-blacklist user accesses if
execution exits via an exception table redirection.

This is XSA-183 / CVE-2016-6259

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

8 years agox86/pv: Remove unsafe bits from the mod_l?_entry() fastpath
Andrew Cooper [Mon, 11 Jul 2016 13:32:03 +0000 (14:32 +0100)]
x86/pv: Remove unsafe bits from the mod_l?_entry() fastpath

All changes in writeability and cacheability must go through full
re-validation.

Rework the logic as a whitelist, to make it clearer to follow.
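
The whitelist idea can be sketched as follows.  This is an illustration
under assumed names, not the mod_l?_entry() code: the flag macros and
can_use_fastpath() are hypothetical, and the set of whitelisted bits is
chosen only to make the example concrete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative page-table entry flag bits (hypothetical names). */
#define PTE_PRESENT  (UINT64_C(1) << 0)
#define PTE_RW       (UINT64_C(1) << 1)
#define PTE_PCD      (UINT64_C(1) << 4)
#define PTE_ACCESSED (UINT64_C(1) << 5)
#define PTE_DIRTY    (UINT64_C(1) << 6)

/* Only changes confined to these bits may skip full re-validation. */
#define FASTPATH_WHITELIST (PTE_ACCESSED | PTE_DIRTY)

/* True if old and new entries differ only in whitelisted bits. */
static bool can_use_fastpath(uint64_t old_pte, uint64_t new_pte)
{
    return ((old_pte ^ new_pte) & ~FASTPATH_WHITELIST) == 0;
}
```

With a whitelist, any bit not explicitly listed (writeability,
cacheability, or a bit added later) falls back to the slow path by
default.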

This is XSA-182

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

8 years agoupdate Xen version to 4.6.3 RELEASE-4.6.3
Jan Beulich [Mon, 20 Jun 2016 12:08:22 +0000 (14:08 +0200)]
update Xen version to 4.6.3

Skipping 4.6.2 due to qemu changes having become necessary after that
tree had got tagged already.

8 years agoREADME: Change to say `Xen 4.6'
Ian Jackson [Mon, 20 Jun 2016 12:06:10 +0000 (13:06 +0100)]
README: Change to say `Xen 4.6'

This is more accurate, since we are past 4.6.0 now.  We don't want to
update the README for each point release.  (This change should have
been done shortly after the release of 4.6.0.)

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

8 years agoQEMU_UPSTREAM_REVISION update.
Ian Jackson [Tue, 14 Jun 2016 17:34:13 +0000 (18:34 +0100)]
QEMU_UPSTREAM_REVISION update.

Update to what we hope to make a stable release of soon.  Includes
XSA-180 fix and Ubuntu build fix.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

8 years agoQEMU_UPSTREAM_REVISION update
Ian Jackson [Fri, 10 Jun 2016 16:32:12 +0000 (17:32 +0100)]
QEMU_UPSTREAM_REVISION update

This commit id is equivalent to qemu-xen-4.6.2

8 years agopublic: typo: use ' as apostrophe in grant_table.h
Dario Faggioli [Fri, 10 Jun 2016 15:41:32 +0000 (17:41 +0200)]
public: typo: use ' as apostrophe in grant_table.h

If grep 2.23 is installed, build fails like this:
...
mkdir -p compat
grep -v 'DEFINE_XEN_GUEST_HANDLE(long)' public/grant_table.h | \
python /home/SOURCES/xen/xen/xen.git/xen/tools/compat-build-source.py >compat/grant_table.c.new
mv -f compat/grant_table.c.new compat/grant_table.c
gcc  ... -o compat/grant_table.i compat/grant_table.c
compat/grant_table.c:33:1: error: unterminated comment
 /*
 ^
compat/grant_table.c:28:0: error: unterminated #ifndef
 #ifndef __XEN_PUBLIC_GRANT_TABLE_H__
 ^
Makefile:62: recipe for target 'compat/grant_table.i' failed
make[3]: *** [compat/grant_table.i] Error 1
rm compat/grant_table.c
make[3]: Leaving directory '/home/SOURCES/xen/xen/xen.git/xen/include'
...

This is because grant_table.h contains this (note the
apostrophe): "granter’s memory", and `grep -v', in version
2.23, stops processing the file at that point (up to and
including 2.22, this did not happen).

Although the above behavior is likely an issue in grep
(https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22461),
I think we had better switch to using " ' " in that line
anyway, as we do basically everywhere else (even in
the same file).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 3f293c7caaefc2c37b61e44e8ebd5a7f1c554afb
master date: 2016-02-25 13:03:04 +0100

8 years agoQEMU_TAG update
Ian Jackson [Fri, 10 Jun 2016 10:53:51 +0000 (11:53 +0100)]
QEMU_TAG update

8 years agolibxl: set XEN_QEMU_CONSOLE_LIMIT for QEMU
Wei Liu [Thu, 26 May 2016 15:11:42 +0000 (16:11 +0100)]
libxl: set XEN_QEMU_CONSOLE_LIMIT for QEMU

XSA-180 provides a patch to QEMU to bodge QEMU logging issue. We
explicitly set the limit in libxl for 4.7.

Introduce a function for setting the environment variable and call it in
the right places.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b0d409d9c4944ed29d29457fea4ad6b41d698eca)

8 years agolibxl: Fix NULL pointer due to XSA-178 fix wrong XS nodename
Ian Jackson [Wed, 8 Jun 2016 14:42:19 +0000 (15:42 +0100)]
libxl: Fix NULL pointer due to XSA-178 fix wrong XS nodename

In "libxl: Do not trust backend for disk eject vdev" (c69871a2fb26 on
xen.git#staging) we changed libxl_evenable_disk_eject to read the
device vdev out of xenstore from the /libxl path, rather than the
backend path, and to read it during setup rather than on each event.

However, the patch has a mistake:
    -        GCSPRINTF("%s/dev", backend), NULL);
    +        GCSPRINTF("%s/vdev", libxl_path), &configured_vdev);
                           ^
Spot the extra "v".  This causes configured_vdev always to be NULL.
configured_vdev is passed to [libxl__]strdup.

In Xen 4.6 and later libxl__strdup is used and tolerates NULL.
evg->vdev is set to NULL.  This propagates to the `vdev' field in the
generated event.  This may or may not cause further trouble, depending
on the calling application.  In our osstest test cases it does not
cause any trouble, so the bug goes undetected.

In Xen 4.5 and earlier, the strdup does not tolerate NULL, and libxl
crashes immediately.  This has been detected by osstest as a
regression in Xen 4.5.
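
For illustration only, a NULL-tolerant duplicate function could look
like the sketch below.  strdup_or_null() is a hypothetical name and this
is not libxl__strdup's implementation (which allocates from a gc); it
only shows the difference in behaviour on NULL input described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: returns NULL for NULL input instead of crashing. */
static char *strdup_or_null(const char *s)
{
    size_t len;
    char *copy;

    if (s == NULL)
        return NULL;

    len = strlen(s) + 1;          /* include the terminating NUL */
    copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;                  /* caller frees; NULL on allocation failure */
}
```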

IMO this patch should be applied immediately to
  xen.git#staging-4.5 (to check that it fixes the osstest regression)
  xen.git#staging     (to check that it does not break master)

Subject to passes, it should then be propagated to all supported
stable trees and also be mentioned in an update to XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
CC: security@xenproject.org
CC: Jan Beulich <jbeulich@suse.com>
CC: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 62b4d4769ca39fd5263da20d786a7b9a80a22d9a)

8 years agoQEMU_TAG update
Ian Jackson [Tue, 7 Jun 2016 16:05:33 +0000 (17:05 +0100)]
QEMU_TAG update