xenbits.xensource.com Git - xen.git/log
xen.git
8 years ago libxl: Revert 3658f7a0bdd8 "libxl: fix libxl_set_memory_target"
Ian Jackson [Sat, 21 Jan 2017 19:03:23 +0000 (19:03 +0000)]
libxl: Revert 3658f7a0bdd8 "libxl: fix libxl_set_memory_target"

I backported
  3658f7a0bdd8fcda217559927e25263a07398c27
  libxl: fix libxl_set_memory_target
but this was not necessary, because the commit it fixes is not in
Xen 4.6.

So the backport introduces a duplicate call to xc_domain_getinfolist
and also breaks the build with
    libxl.c:4908:5: error: ‘r’ undeclared (first use in this function)

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 09f521a077024d5955d766eef7a040d2af928ec2)

8 years ago libxl: fix libxl_set_memory_target
Wei Liu [Thu, 29 Dec 2016 16:36:31 +0000 (16:36 +0000)]
libxl: fix libxl_set_memory_target

Commit 26dbc93a ("libxl: Remove pointless hypercall from
libxl_set_memory_target") removed the call to xc_domain_getinfolist, but
it failed to notice that "info" was actually needed later.

Put that back. While at it, make the code conform to the coding style
requirements.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit ed5f19aea66fe5a72060d6a795ffcd23b7643ee3)
(cherry picked from commit e1cefedd80f9972854769bfc6e32e23b56cd0712)
(cherry picked from commit 013ee593ca0456e1adfdf80ef7e44b151cdb1545)
(cherry picked from commit 3658f7a0bdd8fcda217559927e25263a07398c27)

8 years ago x86: force EFLAGS.IF on when exiting to PV guests
Jan Beulich [Wed, 21 Dec 2016 16:47:08 +0000 (17:47 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests

Guest kernels modifying instructions in the process of being emulated
for another of their vCPUs may cause EFLAGS.IF to be cleared upon
next exiting to guest context, by converting the instruction being
emulated to CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.

This is CVE-2016-10024 / XSA-202.
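The flag forcing described in this commit can be sketched in C as follows. This is only an illustration under stated assumptions: the macro and function names are hypothetical and not taken from the Xen source; the bit positions are the architectural EFLAGS ones.

```c
#include <stdint.h>

/* Architectural EFLAGS bit positions. */
#define EFLAGS_IF   0x00000200u  /* interrupt enable, bit 9 */
#define EFLAGS_IOPL 0x00003000u  /* I/O privilege level, bits 13:12 */
#define EFLAGS_NT   0x00004000u  /* nested task, bit 14 */
#define EFLAGS_VM   0x00020000u  /* virtual-8086 mode, bit 17 */

/*
 * Hypothetical helper: sanitise the flags image used when exiting to a
 * PV guest.  IF is forced on; IOPL, NT and VM are forced to zero.
 */
static inline uint64_t sanitise_pv_eflags(uint64_t eflags)
{
    eflags &= ~(uint64_t)(EFLAGS_IOPL | EFLAGS_NT | EFLAGS_VM);
    eflags |= EFLAGS_IF;
    return eflags;
}
```

Whatever value the emulator left behind, the guest resumes with interrupts enabled in its flags image and with none of the three privileged bits set.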

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100

8 years ago x86/emul: Correct the handling of eflags with SYSCALL
Andrew Cooper [Sun, 18 Dec 2016 15:42:59 +0000 (15:42 +0000)]
x86/emul: Correct the handling of eflags with SYSCALL

A singlestep #DB is determined by the resulting eflags value from the
execution of SYSCALL, not the original eflags value.

By using the original eflags value, we negate the guest kernel's attempt to
protect itself from a privilege escalation by masking TF.

Introduce a tf boolean and have the SYSCALL emulation recalculate it
after the instruction is complete.

This is XSA-204
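A rough C model of the point made above: the singlestep decision has to be taken from the flags value that results from SYSCALL, after the guest's flag mask has been applied. The function names and the `sfmask` parameter are illustrative assumptions, not the emulator's actual interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_TF 0x00000100u  /* trap flag, bit 8 */

/* Flags as left by a 64-bit SYSCALL: every bit set in the mask is cleared. */
static inline uint64_t syscall_result_eflags(uint64_t old_eflags, uint64_t sfmask)
{
    return old_eflags & ~sfmask;
}

/* A singlestep #DB is due only if TF survives in the resulting flags. */
static inline bool singlestep_pending(uint64_t resulting_eflags)
{
    return (resulting_eflags & EFLAGS_TF) != 0;
}
```

With TF set before the instruction and TF included in the mask, `singlestep_pending(syscall_result_eflags(...))` is false, which is the behaviour the guest kernel relies on.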

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

8 years ago x86emul: CMPXCHG8B ignores operand size prefix
Jan Beulich [Tue, 13 Dec 2016 13:29:02 +0000 (14:29 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix

Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).

This is CVE-2016-9932 / XSA-200.
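As a hedged illustration, CMPXCHG8B can be modelled in C as an operation that always works on a full 8-byte quantity, so both result registers are always written from memory on comparison failure. The struct and function below are hypothetical and ignore atomicity; they are not the emulator's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* State after the instruction: memory operand, EDX:EAX and ZF. */
struct cx8_result {
    uint64_t mem;
    uint32_t eax, edx;
    bool zf;
};

/* Non-atomic model; the operand is 8 bytes regardless of any prefix. */
static inline struct cx8_result cmpxchg8b_model(uint64_t mem,
                                                uint32_t eax, uint32_t edx,
                                                uint32_t ebx, uint32_t ecx)
{
    struct cx8_result r = { mem, eax, edx, false };
    uint64_t expected = ((uint64_t)edx << 32) | eax;

    if (mem == expected) {
        /* Match: store ECX:EBX and set ZF. */
        r.mem = ((uint64_t)ecx << 32) | ebx;
        r.zf = true;
    } else {
        /* Mismatch: load all 8 bytes of the operand into EDX:EAX. */
        r.eax = (uint32_t)mem;
        r.edx = (uint32_t)(mem >> 32);
    }
    return r;
}
```

Because the full 64-bit value is read before being handed back, no part of EDX:EAX is sourced from an uninitialized temporary in this model.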

Signed-off-by: Jan Beulich <jbeulich@suse.com>

8 years ago QEMU_TAG update
Ian Jackson [Wed, 7 Dec 2016 16:54:35 +0000 (16:54 +0000)]
QEMU_TAG update

8 years ago arm64: fix incorrect memory region size in TCR_EL2
Shanker Donthineni [Thu, 17 Mar 2016 12:46:58 +0000 (13:46 +0100)]
arm64: fix incorrect memory region size in TCR_EL2

The maximum and minimum values for TxSZ depend on level of
translation as per AArch64 Virtual Memory System Architecture.
According to ARM specification DDI0487A_h (sec D4.2.2, page 1752),
the minimum TxSZ value is 16. If TxSZ is programmed to a value
smaller than 16 then it is IMPLEMENTATION DEFINED.

This patch sets T0SZ to (64 - 48) bits, since Xen uses all 4 levels
to cover 48bit (256TB) virtual address instead of value zero.
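The arithmetic above can be checked with a small C sketch. The helper names are made up for illustration; the only architectural facts assumed are that T0SZ occupies bits [5:0] of the register and that the region size is 2^(64 - T0SZ) bytes.

```c
#include <stdint.h>

#define TCR_T0SZ_MASK 0x3fULL  /* T0SZ field, bits [5:0] */

/* T0SZ value for a virtual address space of va_bits bits. */
static inline uint64_t tcr_t0sz(unsigned int va_bits)
{
    return (64u - va_bits) & TCR_T0SZ_MASK;
}

/* Size in bytes of the region selected by a given (non-zero) T0SZ. */
static inline uint64_t va_region_bytes(unsigned int t0sz)
{
    return 1ULL << (64u - t0sz);
}
```

For a 48-bit address space this yields T0SZ = 16, and a T0SZ of 16 corresponds to 256TB, matching the figures in the commit message.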

Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
Acked-by: Julien Grall <julien.grall@arm.com>

8 years ago QEMU_TAG update
Ian Jackson [Tue, 29 Nov 2016 18:38:34 +0000 (18:38 +0000)]
QEMU_TAG update

8 years ago pygrub: Properly quote results, when returning them to the caller:
Ian Jackson [Tue, 22 Nov 2016 13:30:27 +0000 (14:30 +0100)]
pygrub: Properly quote results, when returning them to the caller:

* When the caller wants sexpr output, use `repr()'
  This is what Xend expects.

  The returned S-expressions are now escaped and quoted by Python,
  generally using '...'.  Previously kernel and ramdisk were unquoted
  and args was quoted with "..." but without proper escaping.  This
  change may break toolstacks which do not properly dequote the
  returned S-expressions.

* When the caller wants "simple" output, crash if the delimiter is
  contained in the returned value.

  With --output-format=simple it does not seem like this could ever
  happen, because the bootloader config parsers all take line-based
  input from the various bootloader config files.

  With --output-format=simple0, this can happen if the bootloader
  config file contains nul bytes.
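The "simple" output rule above amounts to refusing any value that contains the delimiter. pygrub itself is written in Python; the following is merely a C sketch of the same check, with a hypothetical function name, to show why a nul byte matters for the simple0 format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Returns true if the value can be emitted safely, i.e. the delimiter
 * does not occur anywhere in its len bytes.  The caller is expected to
 * abort instead of emitting the value when this returns false.
 */
static inline bool value_safe_for_delimiter(const char *val, size_t len,
                                            char delim)
{
    return memchr(val, delim, len) == NULL;
}
```

With a newline delimiter an ordinary kernel path passes; with a nul delimiter, a value carrying an embedded nul byte is rejected rather than being split into two fields by the consumer.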

This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100

8 years ago x86/svm: fix injection of software interrupts
Andrew Cooper [Tue, 22 Nov 2016 13:29:50 +0000 (14:29 +0100)]
x86/svm: fix injection of software interrupts

The non-NextRip logic in c/s 36ebf14eb "x86/emulate: support for emulating
software event injection" was based on an older version of the AMD software
manual.  The manual was later corrected, following findings from that series.

I took the original wording of "not supported without NextRIP" to mean that
X86_EVENTTYPE_SW_INTERRUPT was not eligible for use.  It turns out that this
is not the case, and the new wording is clearer on the matter.

Despite testing the original patch series on non-NRip hardware, the
swint-emulation XTF test case focuses on the debug vectors; it never ended up
executing an `int $n` instruction for a vector which wasn't also an exception.

During a vmentry, the use of X86_EVENTTYPE_HW_EXCEPTION comes with a vector
check to ensure that it is only used with exception vectors.  Xen's use of
X86_EVENTTYPE_HW_EXCEPTION for `int $n` injection has always been buggy on AMD
hardware.

Fix this by always using X86_EVENTTYPE_SW_INTERRUPT.

Print and decode the eventinj information in svm_vmcb_dump(), as it has
several invalid combinations which cause vmentry failures.

This is CVE-2016-9378 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 920edccd41db6cb0145545afa1850edf5e7d098e
master date: 2016-11-22 13:51:16 +0100

8 years ago x86/emul: correct the IDT entry calculation in inject_swint()
Andrew Cooper [Tue, 22 Nov 2016 13:29:26 +0000 (14:29 +0100)]
x86/emul: correct the IDT entry calculation in inject_swint()

The logic, as introduced in c/s 36ebf14ebe "x86/emulate: support for emulating
software event injection" is buggy.  The size of an IDT entry depends on long
mode being active, not the width of the code segment currently in use.

In particular, this means that a compatibility code segment which hits
emulation for software event injection will end up using an incorrect offset
in the IDT for DPL/Presence checking.  In practice, this only occurs on old
AMD hardware lacking NRip support; all newer AMD hardware, and all Intel
hardware bypass this path in the emulator.

While here, fix a minor issue with reading the IDT entry.  The return value
from ops->read() wasn't checked, but in reality the only failure case is if a
pagefault occurs.  This is not a realistic problem as the kernel will almost
certainly crash with a double fault if this setup actually occurred.

This is CVE-2016-9377 / part of XSA-196.
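A small C sketch of the corrected calculation: the IDT entry size is chosen by whether long mode is active (16-byte gates) or not (8-byte gates), never by the width of the current code segment. The function name is illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Byte offset of a vector's gate descriptor within the IDT. */
static inline uint64_t idt_entry_offset(uint8_t vector, bool long_mode_active)
{
    /* Long mode uses 16-byte gates; legacy protected mode uses 8-byte ones. */
    return (uint64_t)vector * (long_mode_active ? 16u : 8u);
}
```

A compatibility-mode code segment runs with long mode active, so vector 0x80 must be looked up at offset 0x800, not at the 0x400 that a 32-bit entry size would give.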

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 255e8fe95f22ded5186fd75244ffcfb9d5dbc855
master date: 2016-11-22 13:50:49 +0100

8 years ago x86emul: fix huge bit offset handling
Jan Beulich [Tue, 22 Nov 2016 13:29:03 +0000 (14:29 +0100)]
x86emul: fix huge bit offset handling

We must never chop off the high 32 bits.

This is CVE-2016-9383 / XSA-195.
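To illustrate why the high 32 bits matter, here is a hedged C sketch of how a signed register bit offset, as used by bit-test style instructions with a memory operand, turns into a byte displacement. The helper is hypothetical and assumes the fix concerns this calculation; it keeps everything in 64-bit arithmetic.

```c
#include <stdint.h>

/*
 * Byte displacement selected by a signed bit offset for an operand of
 * op_bytes bytes: floor(bit_offset / operand bits) * op_bytes.
 */
static inline int64_t bit_offset_to_byte_disp(int64_t bit_offset,
                                              unsigned int op_bytes)
{
    int64_t op_bits = (int64_t)op_bytes * 8;
    int64_t q = bit_offset / op_bits;

    /* C division truncates towards zero; adjust to round downwards. */
    if ((bit_offset % op_bits) < 0)
        q--;

    return q * (int64_t)op_bytes;
}
```

A bit offset of 2^35 with an 8-byte operand selects a displacement of 2^32 bytes. If the offset were first narrowed to 32 bits, its value would become 0 and the access would land 4GiB away from where the instruction actually points.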

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100

8 years ago x86/PV: writes of %fs and %gs base MSRs require canonical addresses
Jan Beulich [Tue, 22 Nov 2016 13:28:40 +0000 (14:28 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses

Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.

Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, otoh, already get properly checked in set_debugreg() (albeit
we discard the error there).

The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.

This is CVE-2016-9385 / XSA-193.
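For reference, a canonical-address check for 48-bit virtual addressing can be sketched in C as below. The function name is illustrative and this is not necessarily how Xen spells the check; the rule itself (bits 63:47 must all be equal) is architectural.

```c
#include <stdbool.h>
#include <stdint.h>

/* With 48-bit virtual addresses, bits 63:47 must be all zeros or all ones. */
static inline bool is_canonical_48(uint64_t addr)
{
    uint64_t top = addr >> 47;  /* the 17 most significant bits */

    return top == 0 || top == 0x1ffff;
}
```

A value failing this check would make WRMSR or WR{F,G}SBASE raise #GP, which is why it has to be rejected before it reaches those paths.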

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100

8 years ago x86/HVM: don't load LDTR with VM86 mode attrs during task switch
Jan Beulich [Tue, 22 Nov 2016 13:28:12 +0000 (14:28 +0100)]
x86/HVM: don't load LDTR with VM86 mode attrs during task switch

Just like TR, LDTR is purely a protected mode facility and hence needs
to be loaded accordingly. Also move its loading to where it
architecturally belongs.

This is CVE-2016-9382 / XSA-192.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93aa42b85ae0084ba7b749d0e990c94fbf0c17e3
master date: 2016-11-22 13:45:44 +0100

8 years ago x86/hvm: Fix the handling of non-present segments
Andrew Cooper [Tue, 22 Nov 2016 13:27:40 +0000 (14:27 +0100)]
x86/hvm: Fix the handling of non-present segments

In 32bit, the data segments may be NULL to indicate that the segment is
ineligible for use.  In both 32bit and 64bit, the LDT selector may be NULL to
indicate that the entire LDT is ineligible for use.  However, nothing in Xen
actually checks for this condition when performing other segmentation
checks.  (Note however that limit and writeability checks are correctly
performed).

Neither Intel nor AMD specify the exact behaviour of loading a NULL segment.
Experimentally, AMD zeroes all attributes but leaves the base and limit
unmodified.  Intel zeroes the base, sets the limit to 0xfffffff and resets the
attributes to just .G and .D/B.

The use of the segment information in the VMCB/VMCS is equivalent to a native
pipeline interacting with the segment cache.  The present bit can therefore
have a subtly different meaning, and it is now cooked to uniformly indicate
whether the segment is usable or not.

GDTR and IDTR don't have access rights like the other segments, but for
consistency, they are treated as being present so no special casing is needed
elsewhere in the segmentation logic.

AMD hardware does not consider the present bit for %cs and %tr, and will
function as if they were present.  They are therefore unconditionally set to
present when reading information from the VMCB, to maintain the new meaning of
usability.

Intel hardware has a separate unusable bit in the VMCS segment attributes.
This bit is inverted and stored in the present field, so the hvm code can work
with architecturally-common state.

This is CVE-2016-9386 / XSA-191.
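The Intel side of the change can be pictured with a short C sketch: the VMCS access-rights "unusable" bit (bit 16) is inverted to produce the value kept in the present field. The macro and function names are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define VMX_AR_UNUSABLE (1u << 16)  /* "segment unusable" in VMCS access rights */

/* Value to store in the common "present" field for a VMCS segment. */
static inline bool vmx_segment_present(uint32_t vmcs_access_rights)
{
    return !(vmcs_access_rights & VMX_AR_UNUSABLE);
}
```

Common code can then test a single present flag on both vendors instead of knowing about the VMCS-specific bit.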

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 04beafa8e6c66f5cd814c00e2d2b51cfbc41cb8a
master date: 2016-11-22 13:44:50 +0100

8 years ago x86emul: honor guest CR0.TS and CR0.EM
Jan Beulich [Tue, 4 Oct 2016 13:06:11 +0000 (14:06 +0100)]
x86emul: honor guest CR0.TS and CR0.EM

We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.

Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.

This is XSA-190.
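A simplified C sketch of the architectural rule being honoured: before touching FPU or SIMD state, the guest's view of CR0 decides whether the instruction faults instead. The enum and function names are hypothetical, and only two instruction classes are modelled.

```c
#include <stdint.h>

#define CR0_EM (1u << 2)  /* emulation */
#define CR0_TS (1u << 3)  /* task switched */

enum fpu_class { FPU_X87, FPU_SSE };
enum fpu_fault { FAULT_NONE, FAULT_NM, FAULT_UD };

/* Fault (if any) to raise instead of emulating the instruction. */
static inline enum fpu_fault fpu_access_fault(uint32_t guest_cr0,
                                              enum fpu_class cls)
{
    /* CR0.EM: x87 instructions raise #NM, SSE instructions raise #UD. */
    if (guest_cr0 & CR0_EM)
        return cls == FPU_X87 ? FAULT_NM : FAULT_UD;

    /* CR0.TS: state may belong to another task, so raise #NM. */
    if (guest_cr0 & CR0_TS)
        return FAULT_NM;

    return FAULT_NONE;
}
```

Only when the result is `FAULT_NONE` would the emulator go on to acquire the FPU and operate on register contents.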

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

8 years ago QEMU_TAG update
Ian Jackson [Tue, 20 Sep 2016 15:52:52 +0000 (16:52 +0100)]
QEMU_TAG update

8 years ago update Xen version to 4.5.5 (RELEASE-4.5.5)
Jan Beulich [Tue, 20 Sep 2016 05:59:24 +0000 (07:59 +0200)]
update Xen version to 4.5.5

One of the qemu tags did get created the wrong way, so we need to skip
4.5.4.

8 years ago update Xen version to 4.5.4 (RELEASE-4.5.4)
Jan Beulich [Mon, 19 Sep 2016 15:50:59 +0000 (17:50 +0200)]
update Xen version to 4.5.4

8 years ago Revert "x86/hvm: Perform a user instruction fetch for a FEP in userspace"
Jan Beulich [Mon, 12 Sep 2016 15:50:13 +0000 (17:50 +0200)]
Revert "x86/hvm: Perform a user instruction fetch for a FEP in userspace"

This reverts commit 95559492c958e45fa7c01b1b3e0fb704e5b8b9eb,
which doesn't build and would, in its current form, use
uninitialized data if it did.

8 years ago x86/segment: Bounds check accesses to emulation ctxt->seg_reg[]
Andrew Cooper [Mon, 12 Sep 2016 14:05:09 +0000 (16:05 +0200)]
x86/segment: Bounds check accesses to emulation ctxt->seg_reg[]

HVM HAP codepaths have space for all segment registers in the seg_reg[]
cache (with x86_seg_none still risking an array overrun), while the shadow
codepaths only have space for the user segments.

Range check the input segment of *_get_seg_reg() against the size of the array
used to cache the results, to avoid overruns in the case that the callers
don't filter their input suitably.

Subsume the is_x86_user_segment(seg) checks from the shadow code, which were
an incomplete attempt at range checking, and are now superseded.  Make
hvm_get_seg_reg() static, as it is not used outside of shadow/common.c.

No functional change, but far easier to reason that no overflow is possible.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>

xen/x86: Fix build with clang following c/s 4fa0105

https://travis-ci.org/xen-project/xen/jobs/158494027#L2344

Clang complains:

  emulate.c:2016:14: error: comparison of unsigned enum expression < 0
  is always false [-Werror,-Wtautological-compare]
      if ( seg < 0 || seg >= ARRAY_SIZE(hvmemul_ctxt->seg_reg) )
           ~~~ ^ ~

Clang is wrong to raise a warning like this.  The signed-ness of an enum is
implementation defined in C, and robust code must not assume the choices made
by the compiler.

In this case, dropping the < 0 check creates a latent bug which would result
in an array underflow when compiled with a compiler which chooses a signed
enum.

Work around the bug by explicitly pulling seg into an unsigned integer, and
only perform the upper bounds check.

No functional change.
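The workaround can be sketched in C as follows. The enum values and the six-entry cache are stand-ins chosen for illustration, not Xen's real definitions; the point is that copying the enum into an unsigned integer makes one upper-bound comparison cover both directions.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Illustrative subset of segment identifiers; x86_seg_none is out of range. */
enum x86_segment {
    x86_seg_cs, x86_seg_ss, x86_seg_ds, x86_seg_es, x86_seg_fs, x86_seg_gs,
    x86_seg_none
};

/* Stand-in for a per-context cache holding the user segments only. */
static int seg_reg_cache[6];

static inline int *lookup_seg(enum x86_segment seg)
{
    /* A negative value, if the enum is signed, becomes a huge index here. */
    unsigned int idx = seg;

    if (idx >= ARRAY_SIZE(seg_reg_cache))
        return NULL;

    return &seg_reg_cache[idx];
}
```

No `< 0` comparison is needed, so the tautological-compare diagnostic does not trigger, yet an underflowing index is still rejected.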

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4fa0105d95be6e7145a1f6fd1036ccd43976228c
master date: 2016-09-08 16:39:46 +0100
master commit: 4c47c47938ea24c73d9459f9f0b6923513772b5d
master date: 2016-09-09 15:31:01 +0100

8 years ago x86/hvm: Perform a user instruction fetch for a FEP in userspace
Andrew Cooper [Mon, 12 Sep 2016 14:04:35 +0000 (16:04 +0200)]
x86/hvm: Perform a user instruction fetch for a FEP in userspace

This matches hardware behaviour, and prevents erroneous failures when a guest
has SMEP/SMAP active and issues a FEP from userspace.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0831e99446121636045cf6f616a1991d6ef22071
master date: 2016-09-08 16:39:46 +0100

8 years ago hvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary
Andrew Cooper [Mon, 12 Sep 2016 14:04:08 +0000 (16:04 +0200)]
hvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary

The Force Emulation Prefix is named to follow its PV counterpart for cpuid or
rdtsc, but isn't really an instruction prefix.  It behaves as a break-out into
Xen, with the purpose of emulating the next instruction in the current state.

It is important to be able to test legal situations which occur in real
hardware, including instructions which cross certain boundaries, and
instructions starting at 0.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5cee79dad24e7006059667b02bd7de685d8ee5
master date: 2016-09-08 16:39:46 +0100

8 years ago VMX: correct feature checks for MPX
Jan Beulich [Mon, 12 Sep 2016 14:03:27 +0000 (16:03 +0200)]
VMX: correct feature checks for MPX

Its VMCS field isn't tied to the respective base CPU feature flag but
instead to a VMX specific one.

Note that while the VMCS GUEST_BNDCFGS field exists if either of the
two respective features is available, MPX continues to get exposed to
guests only with both features present.

Also add the so far missing handling of
- GUEST_BNDCFGS in construct_vmcs()
- MSR_IA32_BNDCFGS in vmx_msr_{read,write}_intercept()
and mirror the extra correctness checks during MSR write to
vmx_load_msr().

Reported-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 68eb1a4d92be58e26bd11d02b8e0317bd56294ac
master date: 2016-09-07 12:34:43 +0200

8 years ago x86/shadow: Avoid overflowing sh_ctxt->seg_reg[]
Andrew Cooper [Thu, 8 Sep 2016 12:29:05 +0000 (14:29 +0200)]
x86/shadow: Avoid overflowing sh_ctxt->seg_reg[]

hvm_get_seg_reg() does not perform a range check on its input segment, calls
hvm_get_segment_register() and writes straight into sh_ctxt->seg_reg[].

x86_seg_none is outside the bounds of sh_ctxt->seg_reg[], and will hit a BUG()
in {vmx,svm}_get_segment_register().

HVM guests running with shadow paging can end up performing a virtual to
linear translation with x86_seg_none.  This is used for addresses which are
already linear.  However, none of this is a legitimate pagetable update, so
fail the emulation in such a case.

This is XSA-187 / CVE-2016-7094.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: a9f3b3bad17d91e2067fc00d51b0302349570d08
master date: 2016-09-08 14:16:26 +0200

8 years ago x86/emulate: Correct boundary interactions of emulated instructions
Andrew Cooper [Thu, 8 Sep 2016 12:28:11 +0000 (14:28 +0200)]
x86/emulate: Correct boundary interactions of emulated instructions

This reverts most of c/s 0640ffb6 "x86emul: fix rIP handling".

Experimentally, in long mode processors will execute an instruction stream
which crosses the 64bit -1 -> 0 virtual boundary, whether the instruction
boundary is aligned on the virtual boundary, or is misaligned.

In compatibility mode, Intel processors will execute an instruction stream
which crosses the 32bit -1 -> 0 virtual boundary, while AMD processors raise a
segmentation fault.  Xen's segmentation behaviour matches AMD.

For 16bit code, hardware does not ever truncate %ip.  %eip is always used and
behaves normally as a 32bit register, including in 16bit protected mode
segments, as well as in Real and Unreal mode.

This is XSA-186 / CVE-2016-7093.

Reported-by: Brian Marcotte <marcotte@panix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e9575f980df81aeb0e5b6139f485fd6f7bb7f5b6
master date: 2016-09-08 14:15:53 +0200

8 years ago x86/32on64: don't allow recursive page tables from L3
Jan Beulich [Thu, 8 Sep 2016 12:27:34 +0000 (14:27 +0200)]
x86/32on64: don't allow recursive page tables from L3

L3 entries are special in PAE mode, and hence can't reasonably be used
for setting up recursive (and hence linear) page table mappings. Since
abuse is possible when the guest in fact gets run on 4-level page
tables, this needs to be excluded explicitly.

This is XSA-185 / CVE-2016-7092.

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Reported-by: "栾尚聪(好风)" <shangcong.lsc@alibaba-inc.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c844d637d92a75854ea5c8d4e5ca34302a9f623c
master date: 2016-09-08 14:14:53 +0200

8 years ago memory: fix compat handling of XENMEM_access_op
Jan Beulich [Tue, 6 Sep 2016 10:13:43 +0000 (12:13 +0200)]
memory: fix compat handling of XENMEM_access_op

Within compat_memory_op() this needs to be placed in the first switch()
statement, or it ends up being dead code (as that first switch() has a
default case chaining to compat_arch_memory_op()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8d6af808a7e9d9ae1d129e1e5a0def7f8b2333ee
master date: 2016-09-02 14:19:51 +0200

8 years ago credit1: fix a race when picking initial pCPU for a vCPU
Dario Faggioli [Tue, 6 Sep 2016 10:13:18 +0000 (12:13 +0200)]
credit1: fix a race when picking initial pCPU for a vCPU

In the Credit1 hunk of 9f358ddd69463 ("xen: Have
schedulers revise initial placement") csched_cpu_pick()
is called without taking the runqueue lock of the
(temporary) pCPU that the vCPU has been assigned to
(e.g., in XEN_DOMCTL_max_vcpus).

However, although 'hidden' in the IS_RUNQ_IDLE() macro,
that function does access the runq (for doing load
balancing calculations). Two scenarios are possible:
 1) we are on cpu X, and IS_RUNQ_IDLE() peeks at cpu X's
    own runq;
 2) we are on cpu X, but IS_RUNQ_IDLE() peeks at some
    other cpu's runq.

Scenario 2) absolutely requires that the appropriate
runq lock is taken. Scenario 1) works even without
taking the cpu's own runq lock. That is actually what
happens when _csched_cpu_pick() is called from
csched_vcpu_acct() (in turn, called by csched_tick()).

Races have been observed and reported (by both XenServer's
own testing and OSSTest [1]), in the form of
IS_RUNQ_IDLE() falling over LIST_POISON, because we're
not currently holding the proper lock, in
csched_vcpu_insert(), when scenario 1) occurs.

However, for better robustness, from now on we always
ask for the proper runq lock to be held when calling
IS_RUNQ_IDLE() (which is also becoming a static inline
function instead of macro).

In order to comply with that, we take the lock around
the call to _csched_cpu_pick() in csched_vcpu_acct().

[1] https://lists.xen.org/archives/html/xen-devel/2016-08/msg02144.html

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9109bf55084398c4547b8956906410c158eb9a17
master date: 2016-09-02 14:17:55 +0200

8 years ago x86/32on64: misc adjustments to call gate emulation
Jan Beulich [Tue, 6 Sep 2016 10:12:49 +0000 (12:12 +0200)]
x86/32on64: misc adjustments to call gate emulation

- There's no 32-bit displacement in 16-bit addressing mode.
- It is wrong to ASSERT() anything on parts of an instruction fetched
  from guest memory.
- The two scaling bits of a SIB byte don't affect whether there is a
  scaled index register or not.
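Two of the decoding rules listed above can be sketched in C. The helpers are hypothetical and cover 32-bit guests only (no REX prefix): in 16-bit addressing a displacement is at most 2 bytes, and whether a SIB byte names an index register depends solely on its index field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Displacement size in bytes for a ModRM byte under 16-bit addressing. */
static inline unsigned int disp_bytes_16bit(uint8_t modrm)
{
    unsigned int mod = modrm >> 6, rm = modrm & 7;

    switch (mod) {
    case 0:  return rm == 6 ? 2 : 0;  /* bare disp16, or none */
    case 1:  return 1;                /* disp8 */
    case 2:  return 2;                /* disp16 */
    default: return 0;                /* mod == 3: register operand */
    }
}

/* A SIB byte has an index register unless its index field is 4. */
static inline bool sib_has_index(uint8_t sib)
{
    /* Bits 5:3 hold the index; the scale in bits 7:6 is irrelevant here. */
    return ((sib >> 3) & 7) != 4;
}
```

A SIB byte of 0xE4 (scale 3, index 4) therefore has no index register, exactly like 0x24 (scale 0, index 4).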

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ee1cc4bfdca84d526805c4c72302c026f5e9cd94
master date: 2016-09-01 15:23:46 +0200

8 years ago xen: Remove buggy initial placement algorithm
George Dunlap [Tue, 6 Sep 2016 10:11:53 +0000 (12:11 +0200)]
xen: Remove buggy initial placement algorithm

The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities.  Just get rid of it and rely on the schedulers to do
initial placement.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100

8 years ago xen: Have schedulers revise initial placement
George Dunlap [Tue, 6 Sep 2016 10:11:28 +0000 (12:11 +0200)]
xen: Have schedulers revise initial placement

The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts to do
initial placement load-balancing by placing vcpu 0 on the least-busy
non-primary hyperthread available.  Unfortunately, the logic can end
up picking a pcpu that's not in the online mask.  When this is passed
to a scheduler which assumes that the initial assignment is
valid, it causes a null pointer dereference looking up the runqueue.

Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).

To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement.  This will
automatically take all knowledge the scheduler has into account.

csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken.  Grab and release the lock to minimize time spent with irqs
disabled.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9f358ddd69463fa8fb65cf67beb5f6f0d3350e32
master date: 2016-07-26 10:42:49 +0100

8 years ago sched: better handle (not) inserting idle vCPUs in runqueues
Dario Faggioli [Tue, 6 Sep 2016 10:10:40 +0000 (12:10 +0200)]
sched: better handle (not) inserting idle vCPUs in runqueues

Idle vCPUs are set to run immediately, as a part of their
own initialization, so we shouldn't even try to put them
in a runqueue. In fact, no scheduler does that, even when
asked to (that is rather explicit in Credit2 and RTDS, a
bit less evident in Credit1).

Let's make things look as follows:
 - in generic code, explicitly avoid even trying to
   insert idle vCPUs in runqueues;
 - in specific schedulers' code, enforce that.

Note that, as csched_vcpu_insert() is no longer being
called, during boot (from sched_init_vcpu()) we can
safely avoid saving the flags when taking the runqueue
lock.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 6b53bb4ab3c9bd5eccde88a5175cf72589ba6d52
master date: 2015-11-24 14:49:47 +0100

8 years ago xen/physmap: Do not permit a guest to populate PoD pages for itself
Andrew Cooper [Fri, 26 Aug 2016 08:32:01 +0000 (10:32 +0200)]
xen/physmap: Do not permit a guest to populate PoD pages for itself

PoD is supposed to be entirely transparent to the guest, but this interface has
been left exposed for a long time.

The use of PoD requires careful co-ordination by the toolstack with the
XENMEM_{get,set}_pod_target hypercalls, and xenstore ballooning target.  The
best a guest can do without toolstack cooperation is crash.

Furthermore, there are combinations of features (e.g. c/s c63868ff "libxl:
disallow PCI device assignment for HVM guest when PoD is enabled") which a
toolstack might wish to explicitly prohibit (in this case, because the two
simply don't function in combination).  In such cases, the guest mustn't be
able to subvert the configuration chosen by the toolstack.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2a99aa99fc84a45f505f84802af56b006d14c52e
master date: 2016-08-19 18:40:11 +0100

8 years ago page-alloc/x86: don't restrict DMA heap to node 0
Jan Beulich [Fri, 26 Aug 2016 08:31:25 +0000 (10:31 +0200)]
page-alloc/x86: don't restrict DMA heap to node 0

When node zero has no memory, the DMA bit width will end up getting set
to 9, which is obviously not helpful to hold back a reasonable amount
of low enough memory for Dom0 to use for DMA purposes. Find the lowest
node with memory below 4Gb instead.

Introduce arch_get_dma_bitsize() to keep this arch-specific logic out
of common code.

Also adjust the original calculation: I think the subtraction of 1
should have been part of the flsl() argument rather than getting
applied to its result. And while previously the division by 4 was valid
to be done on the flsl() result, this now also needs to be converted,
as it should only be applied to the spanned pages value.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d0d6597d3d682f324b6a79e3278e6f5bb6bad153
master date: 2016-08-11 13:35:50 +0200
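The bit-width arithmetic discussed in this commit can be tried out with the C sketch below. It is a reconstruction from the description only: the function names, the 4KiB page size, the 32-bit cap and the inclusion of the node's start frame in the new formula are assumptions, not quotations from the patch.

```c
#define PAGE_SHIFT 12  /* assumed 4KiB pages */

/* Position of the most significant set bit, counting from 1; 0 for x == 0. */
static inline unsigned int flsl_model(unsigned long x)
{
    unsigned int n = 0;

    while (x) {
        n++;
        x >>= 1;
    }
    return n;
}

/* Old shape: "- 1" and the divide-by-4 ("- 2") applied to the flsl() result. */
static inline unsigned int dma_bits_old(unsigned long spanned_pages)
{
    return flsl_model(spanned_pages) - 1 + PAGE_SHIFT - 2;
}

/* Adjusted shape: both corrections moved into the flsl() argument. */
static inline unsigned int dma_bits_new(unsigned long start_pfn,
                                        unsigned long spanned_pages)
{
    unsigned int bits = flsl_model(start_pfn + spanned_pages / 4 - 1) + PAGE_SHIFT;

    return bits < 32 ? bits : 32;
}
```

With a node spanning zero pages the old shape evaluates to 9, the unhelpful width mentioned in the message. For a node starting at frame 0 and spanning 2^20 pages, both shapes give 30.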

8 years ago libxl: return any serial tty path in libxl_console_get_tty
Bob Liu [Thu, 4 Aug 2016 01:07:56 +0000 (09:07 +0800)]
libxl: return any serial tty path in libxl_console_get_tty

When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 35dbf099ac18924d40533c9d1b9bfbf1ecb818c9)
(cherry picked from commit 822464961ae1bac44dcabb049255d61d5511e368)
(cherry picked from commit e06d2bae53cb1a3542e7269fd35bf3885dd2e244)

8 years ago tools/libxc: Properly increment ApicIdCoreSize field on AMD
Boris Ostrovsky [Fri, 22 Jul 2016 17:14:01 +0000 (13:14 -0400)]
tools/libxc: Properly increment ApicIdCoreSize field on AMD

Current code incorrectly adds 1 to full register instead of
incrementing the field in bits 15:12.
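The difference between the two behaviours is easy to show in C. This is a sketch with made-up names; only the field position (bits 15:12) comes from the commit message, and the wrap at four bits is a choice made for this example.

```c
#include <stdint.h>

#define CORE_SIZE_SHIFT 12
#define CORE_SIZE_MASK  (0xfu << CORE_SIZE_SHIFT)  /* bits 15:12 */

/* Increment only the 4-bit field, leaving the other register bits alone. */
static inline uint32_t inc_apic_id_core_size(uint32_t reg)
{
    uint32_t field = (reg & CORE_SIZE_MASK) >> CORE_SIZE_SHIFT;

    field = (field + 1) & 0xfu;  /* wraps within the field in this sketch */

    return (reg & ~CORE_SIZE_MASK) | (field << CORE_SIZE_SHIFT);
}
```

For a register value of 0x3005 this returns 0x4005, whereas adding 1 to the whole register would give 0x3006 and disturb the low bits instead.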

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit a3336a507519c1d28db3bbff8e439aa3811733f3)
(cherry picked from commit de781b47f447fcdd1c556543fcf0ef1d654d2d4d)
(cherry picked from commit 0e944364767cb198ae81c4326ed24ea3941ad8e7)

8 years ago libxenvchan: Change license of header from Lesser GPL v2.1 to BSD
Konrad Rzeszutek Wilk [Mon, 13 Jun 2016 09:28:57 +0000 (05:28 -0400)]
libxenvchan: Change license of header from Lesser GPL v2.1 to BSD

As the xen/COPYING file says:
"A few files are licensed under both GPL and a weaker BSD-style
license. This includes all files within the subdirectory
include/public, as described in include/public/COPYING. All such files
include the non-GPL license text as a source-code comment. Although
the license text refers generically to "the software", the non-GPL
license applies *only* to those source files that explicitly include
the non-GPL license text."

The libxenvchan.h is under xen/include/public/io directory
and the xen/include/public/COPYING says:

"XEN NOTICE
==========

This copyright applies to all files within this subdirectory and its
subdirectories:
  include/public/*.h
  include/public/hvm/*.h
  include/public/io/*.h

The intention is that these files can be freely copied into the source
tree of an operating system when porting that OS to run on Xen. Doing
so does *not* cause the OS to become subject to the terms of the GPL.

All other files in the Xen source distribution are covered by version
2 of the GNU General Public License except where explicitly stated
otherwise within individual source files.
"
Having the libxenvchan.h as Lesser GPL v2.1 where the COPYING file
says otherwise is confusing to say the least.

Upon consulting with the authors of libxenvchan they said:
"FWIW Neither I, nor ITL staff (as author of original libvchan library)
have anything against converting it to the BSD-style licence."
(Marek Marczykowski-Górecki,
http://lists.xen.org/archives/html/xen-devel/2016-06/msg00995.html)
so as such let's change it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Anil Madhavapeddy <anil@recoil.org>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <George.Dunlap@eu.citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Jason Andryuk <andryuk@aero.org>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Matthew Daley <mattjd@gmail.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Roger Pau Monne <roger.pau@entel.upc.edu>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
["I have spoken to my line manager.  I can confirm that Citrix is happy
 with this proposed change.  So:

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
 This view from Citrix covers all contributions made to these files in
 the course of Citrix's employees' employment, which I think is:

 > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
 > cc: George Dunlap <George.Dunlap@eu.citrix.com>
 > Cc: Ian Campbell <ian.campbell@citrix.com>
 > Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
 > Cc: Roger Pau Monne <roger.pau@entel.upc.edu>
 > Cc: Stefano Stabellini <sstabellini@kernel.org>
 > Cc: Tim Deegan <tim@xen.org>
 > Cc: Wei Liu <wei.liu2@citrix.com>

 ..
 [in subsequent email]:
 Wei points out that this ought also to include Keir Fraser's
 contribution, which was (only) in 2012.
 " (from Ian's email)

 In a subsequent mail, Wei also points out that David Scott's
 contribution is covered by Ian's ack.
]

master commit: 937324f032f4f77866e80e39de0d697fa5131df1

(cherry picked from commit b5e124661c175c88512c81acdeb06992259361b7)

(cherry picked from commit 29e5892cbfd01b65db013a9bda99fa52781eed67)
Conflicts:
xen/include/public/io/libxenvchan.h
The conflict was due to the FSF address still being in this version.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

8 years ago xl: correct xl cpupool-numa-split with vcpu limited dom0
Juergen Gross [Tue, 14 Jun 2016 04:30:58 +0000 (06:30 +0200)]
xl: correct xl cpupool-numa-split with vcpu limited dom0

When xl cpupool-numa-split is used and dom0 is limited to fewer
vcpus than one NUMA node has, the operation fails.

Correct this by allowing this configuration.

Reported-by: Glenn Enright <glenn@rimuhosting.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c256d2afc1cad0cca912492e338d6ff97e477c4f)
(cherry picked from commit 78a30103813a7929e922de7414cc56fa2ae52984)
(cherry picked from commit f8972b4223ee76800afa13432403daf248b46805)

8 years ago configure: Fix when no libsystemd compat lib are available
Anthony PERARD [Tue, 3 May 2016 15:59:49 +0000 (16:59 +0100)]
configure: Fix when no libsystemd compat lib are available

According to the systemd change log, since version 209 libsystemd.so contains
everything, including libsystemd-daemon.so. Distros may or may not provide
the compatibility libraries, of which libsystemd-daemon is one.

So, if libsystemd-daemon is not available, check for the presence of
a recent enough libsystemd.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
[ wei: run autogen.sh ]
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7dec5b0c658bea9c16a0e3c051e64d2abf57be48)
(cherry picked from commit 2c1122947e6866ad31b6cc33792178d9baa84430)

8 years ago Revert "xen: Have schedulers revise initial placement"
Jan Beulich [Fri, 5 Aug 2016 13:43:54 +0000 (15:43 +0200)]
Revert "xen: Have schedulers revise initial placement"

This reverts commit c421378a8d14c811e5467d535bc71adc0328a316,
as it needs further so far unidentified prereqs.

8 years ago Revert "xen: Remove buggy initial placement algorithm"
Jan Beulich [Fri, 5 Aug 2016 13:42:52 +0000 (15:42 +0200)]
Revert "xen: Remove buggy initial placement algorithm"

This reverts commit 505ad3a8b7fd3b91ab39c829ca6636cd264198c7,
as its prereq needs further so far unidentified prereqs.

8 years ago x86/mmcfg: Fix initalisation of variables in pci_mmcfg_nvidia_mcp55()
Andrew Cooper [Fri, 5 Aug 2016 12:08:33 +0000 (14:08 +0200)]
x86/mmcfg: Fix initalisation of variables in pci_mmcfg_nvidia_mcp55()

Shifting into the sign bit of an integer is undefined behaviour.

Only the first integer is actually undefined, but switch all the shifts
for consistency.
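
As a hedged illustration (the helper name bit32 is hypothetical and not taken from the patch): with a 32-bit int, the expression 1 << 31 shifts into the sign bit, while the same shift on an unsigned constant is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a 32-bit value with only bit n set.  The constant is unsigned, so
 * n == 31 is well defined; with a plain int constant, 1 << 31 would shift
 * into the sign bit of a 32-bit int, which is undefined behaviour. */
static uint32_t bit32(unsigned int n)
{
    assert(n < 32);   /* shifting by the type width or more is also undefined */
    return UINT32_C(1) << n;
}
```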

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
master commit: ab8fc3937eeb9332b83d7e14d81e37f0b0ef1841
master date: 2016-08-03 18:46:59 +0100

8 years ago xen: Remove buggy initial placement algorithm
George Dunlap [Fri, 5 Aug 2016 12:07:57 +0000 (14:07 +0200)]
xen: Remove buggy initial placement algorithm

The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities.  Just get rid of it and rely on the schedulers to do
initial placement.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100

8 years ago xen: Have schedulers revise initial placement
George Dunlap [Fri, 5 Aug 2016 12:07:27 +0000 (14:07 +0200)]
xen: Have schedulers revise initial placement

The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts to do
initial placement load-balancing by placing vcpu 0 on the least-busy
non-primary hyperthread available.  Unfortunately, the logic can end
up picking a pcpu that's not in the online mask.  When this is passed
to a scheduler which assumes that the initial assignment is
valid, it causes a null pointer dereference looking up the runqueue.

Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).

To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement.  This will
automatically take all knowledge the scheduler has into account.

csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken.  Grab and release the lock to minimize time spent with irqs
disabled.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9f358ddd69463fa8fb65cf67beb5f6f0d3350e32
master date: 2016-07-26 10:42:49 +0100

8 years ago nested vmx: Validate host VMX MSRs before accessing them
Euan Harris [Fri, 5 Aug 2016 11:50:47 +0000 (13:50 +0200)]
nested vmx: Validate host VMX MSRs before accessing them

Some VMX MSRs may not exist on certain processor models, or may
be disabled because of configuration settings.   It is only safe to
access these MSRs if configuration flags in other MSRs are set.  These
prerequisites are listed in the Intel 64 and IA-32 Architectures
Software Developer’s Manual, Vol 3, Appendix A.

nvmx_msr_read_intercept() does not check the prerequisites before
accessing MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
MSR_IA32_VMX_VMFUNC on the host.   Accessing these MSRs from a nested
VMX guest running on a host which does not support them will cause
Xen to crash with a GPF.
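
A rough sketch of the kind of prerequisite check described above, not the code in nvmx_msr_read_intercept(). The function names are hypothetical, the MSR values are passed in as plain integers instead of being read from hardware, and the bit positions are quoted from memory of the Intel manual, so they should be checked against Appendix A before being relied on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Control bit positions (assumed from the Intel SDM, to be verified). */
#define PROC_CTLS_ACTIVATE_SECONDARY  (UINT64_C(1) << 31)
#define PROC_CTLS2_ENABLE_EPT         (UINT64_C(1) << 1)
#define PROC_CTLS2_ENABLE_VPID        (UINT64_C(1) << 5)
#define PROC_CTLS2_ENABLE_VMFUNC      (UINT64_C(1) << 13)

/* In a VMX capability MSR, bits 63:32 report which controls may be set to 1. */
static bool allowed1(uint64_t cap_msr, uint64_t ctl_bit)
{
    return ((cap_msr >> 32) & ctl_bit) != 0;
}

/* PROCBASED_CTLS2 may be read only if secondary controls can be activated. */
static bool has_procbased_ctls2(uint64_t procbased_ctls)
{
    return allowed1(procbased_ctls, PROC_CTLS_ACTIVATE_SECONDARY);
}

/* EPT_VPID_CAP may be read only if EPT or VPID can be enabled.  Callers pass
 * 0 for procbased_ctls2 when that MSR itself could not be read. */
static bool has_ept_vpid_cap(uint64_t procbased_ctls, uint64_t procbased_ctls2)
{
    return has_procbased_ctls2(procbased_ctls) &&
           (allowed1(procbased_ctls2, PROC_CTLS2_ENABLE_EPT) ||
            allowed1(procbased_ctls2, PROC_CTLS2_ENABLE_VPID));
}

/* VMFUNC may be read only if VM functions can be enabled. */
static bool has_vmfunc(uint64_t procbased_ctls, uint64_t procbased_ctls2)
{
    return has_procbased_ctls2(procbased_ctls) &&
           allowed1(procbased_ctls2, PROC_CTLS2_ENABLE_VMFUNC);
}
```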

Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5e02972646132ad98c365ebfcfcb43b40a0dde36
master date: 2016-06-13 12:44:32 +0100

8 years ago serial: fix incorrect length of strncmp for dtuart
Jiandi An [Fri, 5 Aug 2016 11:50:07 +0000 (13:50 +0200)]
serial: fix incorrect length of strncmp for dtuart

In serial_parse_handler(), length of strncmp for dtuart should have been
6, not 5.
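
To illustrate why the length matters (a sketch with hypothetical helper names, not the code in serial_parse_handler()): "dtuart" has six characters, so a length of 5 compares only "dtuar" and accepts other strings that share that prefix.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Compares only the first 5 characters ("dtuar"), so "dtuarx" also matches. */
static bool matches_dtuart_len5(const char *s)
{
    assert(s != NULL);
    return strncmp(s, "dtuart", 5) == 0;
}

/* Compares all 6 characters; sizeof counts the trailing NUL, hence the -1. */
static bool matches_dtuart(const char *s)
{
    assert(s != NULL);
    return strncmp(s, "dtuart", sizeof("dtuart") - 1) == 0;
}
```

Deriving the length from the literal with sizeof is one way to keep the two from drifting apart.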

Signed-off-by: Jiandi An <anjiandi@codeaurora.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: ba98196b54b27262ffe3d3463358eb4cff18b28d
master date: 2016-06-08 11:10:23 +0200

8 years ago x86/entry: Avoid SMAP violation in compat_create_bounce_frame()
Andrew Cooper [Wed, 15 Jun 2016 17:32:14 +0000 (18:32 +0100)]
x86/entry: Avoid SMAP violation in compat_create_bounce_frame()

A 32bit guest kernel might be running on user mappings.
compat_create_bounce_frame() must whitelist its guest accesses to avoid
risking a SMAP violation.

For both variants of create_bounce_frame(), re-blacklist user accesses if
execution exits via an exception table redirection.

This is XSA-183 / CVE-2016-6259

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

8 years ago x86/pv: Remove unsafe bits from the mod_l?_entry() fastpath
Andrew Cooper [Mon, 11 Jul 2016 13:32:03 +0000 (14:32 +0100)]
x86/pv: Remove unsafe bits from the mod_l?_entry() fastpath

All changes in writeability and cacheability must go through full
re-validation.

Rework the logic as a whitelist, to make it clearer to follow.

This is XSA-182

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>

8 years ago QEMU_UPSTREAM_REVISION update
Ian Jackson [Tue, 14 Jun 2016 17:37:35 +0000 (18:37 +0100)]
QEMU_UPSTREAM_REVISION update

Includes XSA-180 fix and Ubuntu build fix.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

8 years ago public: typo: use ' as apostrophe in grant_table.h
Dario Faggioli [Fri, 10 Jun 2016 15:42:21 +0000 (17:42 +0200)]
public: typo: use ' as apostrophe in grant_table.h

If grep 2.23 is installed, build fails like this:
...
mkdir -p compat
grep -v 'DEFINE_XEN_GUEST_HANDLE(long)' public/grant_table.h | \
python /home/SOURCES/xen/xen/xen.git/xen/tools/compat-build-source.py >compat/grant_table.c.new
mv -f compat/grant_table.c.new compat/grant_table.c
gcc  ... -o compat/grant_table.i compat/grant_table.c
compat/grant_table.c:33:1: error: unterminated comment
 /*
 ^
compat/grant_table.c:28:0: error: unterminated #ifndef
 #ifndef __XEN_PUBLIC_GRANT_TABLE_H__
 ^
Makefile:62: recipe for target 'compat/grant_table.i' failed
make[3]: *** [compat/grant_table.i] Error 1
rm compat/grant_table.c
make[3]: Leaving directory '/home/SOURCES/xen/xen/xen.git/xen/include'
...

This is because grant_table.h contains this (note the
apostrophe): "granter’s memory", and `grep -v', in version
2.23, stops processing the file (while, for instance,
until 2.22, this was not happening).

Although the above behavior is likely an issue in grep,
(https://debbugs.gnu.org/cgi/bugreport.cgi?bug=22461)
I think we better switch to using " ' " in that line
anyway, as we do basically everywhere else (even in
the same file).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 3f293c7caaefc2c37b61e44e8ebd5a7f1c554afb
master date: 2016-02-25 13:03:04 +0100

8 years ago QEMU_TAG update
Ian Jackson [Fri, 10 Jun 2016 12:10:47 +0000 (13:10 +0100)]
QEMU_TAG update

8 years ago libxl: set XEN_QEMU_CONSOLE_LIMIT for QEMU
Wei Liu [Thu, 26 May 2016 15:11:42 +0000 (16:11 +0100)]
libxl: set XEN_QEMU_CONSOLE_LIMIT for QEMU

XSA-180 provides a patch to QEMU to bodge around a QEMU logging issue. We
explicitly set the limit in libxl for 4.7.

Introduce a function for setting the environment variable and call it in
the right places.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b0d409d9c4944ed29d29457fea4ad6b41d698eca)

(cherry picked from commit fe82a96a657f36b08ade60ec4f3a53e67a4ee314)
Conflicts:
tools/libxl/libxl_dm.c

This version of libxl does not pass a dm_envs to the
*build_device_model_args* functions.  Instead, call
libxl__set_qemu_env_for_xsa_180 in libxl__spawn_local_dm.

The other ultimate call site of *build_device_model_args* (ie of
libxl__build_device_model_args) is in libxl__spawn_stub_dm, where we
don't need to set the env var.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

8 years ago libxl: Fix NULL pointer due to XSA-178 fix wrong XS nodename
Ian Jackson [Wed, 8 Jun 2016 14:42:19 +0000 (15:42 +0100)]
libxl: Fix NULL pointer due to XSA-178 fix wrong XS nodename

In "libxl: Do not trust backend for disk eject vdev" (c69871a2fb26 on
xen.git#staging) we changed libxl_evenable_disk_eject to read the
device vdev out of xenstore from the /libxl path, rather than the
backend path, and to read it during setup rather than on each event.

However, the patch has a mistake:
    -        GCSPRINTF("%s/dev", backend), NULL);
    +        GCSPRINTF("%s/vdev", libxl_path), &configured_vdev);
                           ^
Spot the extra "v".  This causes configured_vdev always to be NULL.
configured_vdev is passed to [libxl__]strdup.

In Xen 4.6 and later libxl__strdup is used and tolerates NULL.
evg->vdev is set to NULL.  This propagates to the `vdev' field in the
generated event.  This may or may not cause further trouble, depending
on the calling application.  In our osstest test cases it does not
cause any trouble, so the bug goes undetected.

In Xen 4.5 and earlier, the strdup does not tolerate NULL, and libxl
crashes immediately.  This has been detected by osstest as a
regression in Xen 4.5.

IMO this patch should be applied immediately to
  xen.git#staging-4.5 (to check that it fixes the osstest regression)
  xen.git#staging     (to check that it does not break master)

Subject to passes, it should then be propagated to all supported
stable trees and also be mentioned in an update to XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
CC: security@xenproject.org
CC: Jan Beulich <jbeulich@suse.com>
CC: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 62b4d4769ca39fd5263da20d786a7b9a80a22d9a)
(cherry picked from commit 8b7a356409023f60f80e9f4b00bba16ad56cd77b)

8 years ago QEMU_TAG update
Ian Jackson [Tue, 7 Jun 2016 16:05:48 +0000 (17:05 +0100)]
QEMU_TAG update

8 years ago libxl: keep PoD target adjustment by memory fudge after reload_domain_config()
Vitaly Kuznetsov [Wed, 3 Feb 2016 15:53:03 +0000 (16:53 +0100)]
libxl: keep PoD target adjustment by memory fudge after reload_domain_config()

Commit 56fb5fd623 ("libxl: adjust PoD target by memory fudge, too")
introduced target_memkb adjustment for HVM PoD domains on create,
wherein the value it wrote to target is always 1MiB lower than the
actual target_memkb.  Unfortunately, on reboot, it is this value which
is read *unmodified* to feed into the next domain creation; from which
1MiB is subtracted *again*.  This means that any guest which reboots
with memory < maxmem will have its memory target decreased by 1MiB on
every boot.

This patch makes it so that when reading target on reboot, we adjust the
value we read *up* by 1MiB, so that the domain will be built with the
appropriate amount of memory and the target will remain the same after
reboot.
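
The drift and its correction can be modelled with a few lines of arithmetic. This is only a sketch: the function names and FUDGE_KB are hypothetical, and the real logic lives in libxl's domain creation and config reload paths.

```c
#include <assert.h>
#include <stdint.h>

#define FUDGE_KB UINT64_C(1024)   /* 1MiB expressed in KiB; name is hypothetical */

/* Domain creation records a target 1MiB below the requested target_memkb. */
static uint64_t recorded_target(uint64_t target_memkb)
{
    assert(target_memkb >= FUDGE_KB);
    return target_memkb - FUDGE_KB;
}

/* Before the fix: the recorded value is fed back unmodified on reboot,
 * so another 1MiB is subtracted each time. */
static uint64_t reboot_unadjusted(uint64_t recorded)
{
    return recorded_target(recorded);
}

/* After the fix: the value read back is adjusted up by 1MiB first,
 * so the recorded target stays the same across reboots. */
static uint64_t reboot_adjusted(uint64_t recorded)
{
    return recorded_target(recorded + FUDGE_KB);
}
```

Starting from a 2GiB target, three reboots leave the unadjusted value 4MiB below the original request, while the adjusted value stays 1MiB below it.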

This is still not quite a complete fix, as the 1MiB offset is only
subtracted when creating or rebooting; it is not subtracted when 'xl
set-memory' is called.  But it will prevent any situations where memory
is continually increased or decreased.  A better fix will have to wait
until after the release.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7e81d992b183c47a74eea5ecd613d27950b5cdc3)
(cherry picked from commit 7c5c20d129b1a333a3f85a2d04ecde48b5fd7589)

8 years ago libxl: Document ~/serial/ correctly
Ian Jackson [Wed, 4 May 2016 14:17:45 +0000 (15:17 +0100)]
libxl: Document ~/serial/ correctly

xenstore-paths.markdown talked about ~/device/serial/, but that's not
used.

(It is very wrong for this value, which contains a driver domain
filesystem path, to be in the guest's area of xenstore.  However, it
is only ever created by libxl and read by xenconsoled.  When it is
created, it inherits the read-only permissions of /local/domain/DOMID.
So there is no security bug.)

This is a followup to XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Cleanup: use libxl__backendpath_parse_domid in libxl__device_disk_from_xs_be
Ian Jackson [Fri, 29 Apr 2016 15:08:19 +0000 (16:08 +0100)]
libxl: Cleanup: use libxl__backendpath_parse_domid in libxl__device_disk_from_xs_be

Rather than an open-coded sscanf.  No functional change with correct
input.

This is a followup to XSA-175 and XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Cleanup: Have libxl__alloc_vdev use /libxl
Ian Jackson [Tue, 3 May 2016 14:25:19 +0000 (15:25 +0100)]
libxl: Cleanup: Have libxl__alloc_vdev use /libxl

When allocating a vdev for a new disk, look in /libxl/device, rather
than the frontends directory in xenstore.

This is more in line with the other parts of libxl, which ought not to
trust frontends.  In this case, though, there is no security bug prior
to this patch because the frontend is the toolstack domain itself.

If libxl__alloc_vdev were ever changed to take a frontend domain
argument, this patch will fix a latent security bug.

This is a followup to XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend in channel list
Ian Jackson [Wed, 4 May 2016 15:59:38 +0000 (16:59 +0100)]
libxl: Do not trust backend in channel list

Read the name from /libxl/device.  Pass the /libxl path to
libxl__device_channel_from_xenstore.

This removes the final route by which READ_LIBXLDEV might receive a
backend path.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v2: Remove be_path variable which is now no longer used.

8 years ago libxl: Do not trust backend for nic in list
Ian Jackson [Wed, 4 May 2016 15:23:57 +0000 (16:23 +0100)]
libxl: Do not trust backend for nic in list

libxl_device_nic_list should use the /libxl path to search for
devices, and for obtaining the device information.

The "type" parameter was always "vif".  Abolish it.  (In any case,
paths in /libxl/device are named after the frontend type which is
constant, not the backend type which might in future vary.)

Abolish a redundant store to pnic->backend_domid.  Before this commit,
that store was not needed because libxl_device_nic_init (called by
libxl__device_nic_from_xenstore) would zero it.  Now it overwrites the
correct backend domid with zero; so remove it.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for nic in devid_to_device
Ian Jackson [Wed, 4 May 2016 15:20:05 +0000 (16:20 +0100)]
libxl: Do not trust backend for nic in devid_to_device

libxl_devid_to_device_nic should read the information it needs from
the /libxl/device path, not the backend.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend in nic getinfo
Ian Jackson [Tue, 3 May 2016 15:35:21 +0000 (16:35 +0100)]
libxl: Do not trust backend in nic getinfo

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Have READ_LIBXLDEV use libxl_path rather than be_path
Ian Jackson [Tue, 3 May 2016 14:40:18 +0000 (15:40 +0100)]
libxl: Have READ_LIBXLDEV use libxl_path rather than be_path

Fix the just-introduced bug in this macro: now it reads the
trustworthy libxl_path.  Change the variable name in the two functions
(nic and channel) which use it.

Shuffling the bump in the carpet along, we now introduce three new
bugs: the three call sites pass a backend path where a frontend path
is expected.

No functional change.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Rename READ_BACKEND to READ_LIBXLDEV
Ian Jackson [Wed, 4 May 2016 15:07:02 +0000 (16:07 +0100)]
libxl: Rename READ_BACKEND to READ_LIBXLDEV

We are going to want to change all the functions that use READ_BACKEND
to get untrustworthy information from the backend, to use trustworthy
information from /libxl.

This will involve replacing READ_BACKEND, which reads from be_path,
with a similar macro READ_LIBXLDEV, which reads from libxl_path.

The macro name change generates a lot of clutter in the diff.  So we
break it out into this separate patch.  Here, we rename the macro, but
the implementation does not really match the new name.

So, another way to look at this, is that we have transformed the bug:
 * All of the backends use READ_BACKEND, which is unsafe
into the new bug:
 * READ_LIBXLDEV actually reads be_path, which is unsafe.

There is no functional change as yet.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Rename libxl__device_{nic,channel}_from_xs_be to _from_xenstore
Ian Jackson [Wed, 4 May 2016 15:18:36 +0000 (16:18 +0100)]
libxl: Rename libxl__device_{nic,channel}_from_xs_be to _from_xenstore

We are going to change these functions to expect, and be passed, a
/libxl path.  So it is wrong that they are called _from_xs_be.

Neither function reads anything which isn't found in both places, so
we can and will change the call sites later.

The only remaining function in libxl called *_from_xs_be relates to
PCI devices, for which the backend domain is hardcoded to 0 throughout
the libxl_pci.c.

No functional change.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for channel in getinfo
Ian Jackson [Wed, 4 May 2016 14:57:10 +0000 (15:57 +0100)]
libxl: Do not trust backend for channel in getinfo

Do not read the frontend path out of the backend.  We have it in our
hand.  Likewise the guest (frontend) domid was one of our parameters (!)

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for cdrom insert
Ian Jackson [Fri, 29 Apr 2016 18:13:17 +0000 (19:13 +0100)]
libxl: Do not trust backend for cdrom insert

Use the /libxl path where appropriate.  Rename `path' variable to
`be_path' to make sure we caught all the occurrences.

Specifically, when checking that the device still exists, check the
`frontend' value in /libxl, rather than anything in the backend
directory.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for disk in getinfo
Ian Jackson [Fri, 29 Apr 2016 18:10:45 +0000 (19:10 +0100)]
libxl: Do not trust backend for disk in getinfo

Do not read the frontend path out of the backend.  We have it in our
hand.  Likewise the guest (frontend) domid was one of our parameters (!)

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for disk; fix driver domain disks list
Ian Jackson [Fri, 29 Apr 2016 17:29:45 +0000 (18:29 +0100)]
libxl: Do not trust backend for disk; fix driver domain disks list

Rework libxl__device_disk_from_xs_be (which takes a backend path) into
libxl__device_disk_from_xenstore (which takes a libxl path).

libxl__device_disk_from_xenstore now finds the backend path itself,
although it doesn't use it any more for most of its functions.  We
rename the variable from be_path to backend_path to make sure we
didn't miss any cases.

All the data collection is now done by reading from the copy in
/libxl.

libxl_device_disk_list and its helper libxl__append_disk_list (which
used to be libxl__append_disk_list_of_type) need extensive rework,
because they now need to specify the /libxl path rather than the
backend path.

To do that they enumerate disks by looking in the appropriate area in
/libxl.  Previously they scanned various of the backend directories in
dom0 (which was broken for driver domains).  It is no longer necessary
to enumerate the various disk backends, because they all use the same
paths in /devices.  libxl__device_disk_from_xenstore will parse the
type out of the backend path, for itself.  (Indeed, it did so before -
the now-gone type parameter to libxl__append_disk_list_of_type wasn't
used other than to construct the directory to list.)

Finally, remove a redundant store to pdisk->backend_domid in
libxl__append_disk_list[_of_type].  Even before this commit, that
store was not needed because libxl_device_disk_init (called by
libxl__device_disk_from_xenstore) would zero it.  Now it overwrites
the correct backend domid with zero; so remove it.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v2: Also fix up COLO reads, following rebase

8 years ago libxl: Do not trust backend for disk eject vdev
Ian Jackson [Fri, 29 Apr 2016 15:23:35 +0000 (16:23 +0100)]
libxl: Do not trust backend for disk eject vdev

For disk eject, use configured vdev from /libxl, not backend.

The backend directory is writeable by driver domains.  This means that
a malicious driver domain could cause libxl to see a wrong vdev,
confusing the user or the toolstack.

Use the vdev from the /libxl space, rather than the backend.

For convenience, we read the vdev from the /libxl space into the evg
during setup and copy it on each event, rather than reading it afresh
each time (which would in any case involve generating or saving a copy
of the relevant /libxl path).

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: cdrom eject and insert: write to /libxl
Ian Jackson [Fri, 29 Apr 2016 18:15:13 +0000 (19:15 +0100)]
libxl: cdrom eject and insert: write to /libxl

Copy the new type and params values to /libxl, so that the information
in /libxl is kept up to date.

This is needed so that we can return this trustworthy information,
rather than trusting the backend-writeable parts of xenstore.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for vtpm in getinfo (uuid)
Ian Jackson [Fri, 29 Apr 2016 15:57:14 +0000 (16:57 +0100)]
libxl: Do not trust backend for vtpm in getinfo (uuid)

Use uuid from /libxl, rather than from backend.  I think the backend
is not supposed to change the uuid, since it seems to be set by libxl
during setup.

If in fact the backend is supposed to be able to change the uuid, this
patch needs to be dropped and replaced by a patch which makes the vtpm
uuid lookup tolerate bad or missing data.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend for vtpm in getinfo (except uuid)
Ian Jackson [Fri, 29 Apr 2016 16:18:44 +0000 (17:18 +0100)]
libxl: Do not trust backend for vtpm in getinfo (except uuid)

* Do not check the backend for existence.  We have already read the
  /libxl path so know that the vtpm exists (or is supposed to); if the
  backend doesn't exist then that must be the backend's doing.
* Get the frontend path from the /libxl directory.
* The frontend domid is the guest domid, and does not need to be read
  from xenstore (!)

We still attempt to read the uuid from the backend.  This will be
fixed in the next patch.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust backend in libxl__device_exists
Ian Jackson [Wed, 4 May 2016 14:04:35 +0000 (15:04 +0100)]
libxl: Do not trust backend in libxl__device_exists

To determine whether a device is supposed to exist, look in /libxl,
rather than the backend.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Make copy of every xs backend in /libxl in _generic_add
Ian Jackson [Fri, 29 Apr 2016 15:19:28 +0000 (16:19 +0100)]
libxl: Make copy of every xs backend in /libxl in _generic_add

We want to stop libxl trustingly reading information from the backend
directory (since this is, of course, writeable by the backend, which
might be a semi-trusted driver domain).

In principle it is wrong in current libxl for anything to try to
divine virtual device configuration from xenstore: the JSON domain
config ought to supply that, and xenstore should only tell us which
devices actually exist.

However:

Firstly, there are several existing places where configuration
information is retrieved from xenstore rather than JSON.  We do not
want to reengineer this in a security patch.

Secondly, we want to make a security patch which can be backported to
versions of libxl without the JSON configuration machinery.

So we take the expedient approach of keeping a copy of the
configuration somewhere we trust, namely /libxl.  This is obviously
fairly low-risk, although it does write significantly more keys in
xenstore.

In this patch we make this change in libxl__device_generic_add.  This
is responsible for actually writing the vast majority of device
information to xenstore.  There are a few loose ends which will be
dealt with in a moment.

Likewise, changes to readers to use the new location will appear in
further patches.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>

8 years ago libxl: Do not trust frontend for channel in getinfo
Ian Jackson [Tue, 3 May 2016 16:24:32 +0000 (17:24 +0100)]
libxl: Do not trust frontend for channel in getinfo

libxl_device_channel_getinfo needs to examine devices without trusting
frontend-controlled data.  So:

* Use /libxl to find the backend path.
* Parse the backend path to find the backend domid, rather than
  reading it from the frontend.
* Tolerate FRONTEND/tty vanishing.

Note that there is a strange off-by-one error in the computation of
both fe_path and libxl_path in libxl_device_channel_getinfo: the
incoming channel->devid, which is copied to channelinfo->devid, has +1
applied to calculate the frontend path (and, after this patch, the
libxl path).  I.e., the devid passed to libxl_device_channel_getinfo
must be one less than the actual devid for the device being asked
about.

This is actually a bug which mirrors a bug in
libxl__append_channel_list, which fills in the devids of the channel
devices it finds with sequentially increasing numbers starting at 0.

In the usual case channels have real devids starting at 1 (because
there is the console, which is devid 0, but not a channel).  So these
bugs usually cancel out.
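
A minimal sketch of why the two off-by-one bugs usually cancel, and when they do not.  The function names are hypothetical stand-ins, not the libxl functions themselves, and the model only tracks devid numbers:

```python
def list_reported_devids(real_devids):
    """Mimic the list-side bug: report 0, 1, 2, ... instead of the real devids."""
    return list(range(len(real_devids)))


def getinfo_lookup_devid(reported_devid):
    """Mimic the getinfo-side bug: +1 is applied before building the paths."""
    return reported_devid + 1


def bugs_cancel(real_devids):
    """True if feeding the listed devids into getinfo hits the real devices."""
    looked_up = [getinfo_lookup_devid(d) for d in list_reported_devids(real_devids)]
    return looked_up == list(real_devids)
```

With channels at real devids 1, 2, 3 (console at devid 0) the lookups land on the right devices; with a gap in the devids, for example 1 and 3, they do not.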

We do not address this problem at this time.  This bug does not have
any security implications.

This patch is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for channel in list
Ian Jackson [Tue, 3 May 2016 16:01:56 +0000 (17:01 +0100)]
libxl: Do not trust frontend for channel in list

libxl_device_channel_list should not trust frontend-provided data.

So it needs to iterate using the /libxl paths, and read the backend
path out of /libxl.

However, it also filters out pure "consoles", which are channels
without a "name".  But the name was stored only in the frontend
directory, which the frontend can delete.

So store the name in the backend too.  (Ideally we would store it in
/libxl, where the backend can't write to it either, but
libxl__device_console_add does not currently have access to the xenstore
transaction used by libxl__device_generic_add.  Protection against the
backend will come later, in XSA-178.)

Because the libxl paths are defined to be in terms of the frontend
device types, not the backend device types, it is no longer correct
for libxl__append_channel_list to take a type argument.  Abolish this
(with no functional effect).

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for nic in getinfo
Ian Jackson [Tue, 3 May 2016 15:31:07 +0000 (16:31 +0100)]
libxl: Do not trust frontend for nic in getinfo

libxl_device_nic_getinfo needs to examine devices without trusting
frontend-controlled data.  So:

* Use /libxl to find the backend path.
* Parse the backend path to find the backend domid, rather than
  reading it from the frontend.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for nic in libxl_devid_to_device_nic
Ian Jackson [Tue, 3 May 2016 14:52:53 +0000 (15:52 +0100)]
libxl: Do not trust frontend for nic in libxl_devid_to_device_nic

Find the backend by reading the pointer in /libxl rather than in the
guest's frontend area.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for vtpm in getinfo
Ian Jackson [Tue, 3 May 2016 15:00:20 +0000 (16:00 +0100)]
libxl: Do not trust frontend for vtpm in getinfo

libxl_device_vtpm_getinfo needs to examine devices without trusting
frontend-controlled data.  So:

* Use /libxl to find the backend path.
* Parse the backend path to find the backend domid, rather than
  reading it from the frontend.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for vtpm list
Ian Jackson [Tue, 3 May 2016 14:58:32 +0000 (15:58 +0100)]
libxl: Do not trust frontend for vtpm list

libxl_device_vtpm_list needs to enumerate and identify devices without
trusting frontend-controlled data.  So:

* Use the /libxl path to enumerate vtpms.
* Use the /libxl path to find the corresponding backends.
* Parse the backend path to find the backend domid.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for disk in getinfo
Ian Jackson [Fri, 29 Apr 2016 18:21:51 +0000 (19:21 +0100)]
libxl: Do not trust frontend for disk in getinfo

* Rename the frontend variable to `fe_path' to check we caught them all
* Read the backend path from /libxl, rather than from the frontend
* Parse the backend domid from the backend path, rather than reading it
  from the frontend (and add the appropriate error path and initialisation)

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for disk eject event
Ian Jackson [Wed, 27 Apr 2016 15:08:49 +0000 (16:08 +0100)]
libxl: Do not trust frontend for disk eject event

Use the /libxl path for interpreting disk eject watch events: do not
read the backend path out of the frontend.  Instead, use the version
in /libxl.  That avoids us relying on the guest-modifiable
$frontend/backend pointer.

To implement this we store the path
  /libxl/$guest/device/vbd/$devid/backend
in the evgen structure.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend in libxl__device_nextid
Ian Jackson [Wed, 4 May 2016 14:30:32 +0000 (15:30 +0100)]
libxl: Do not trust frontend in libxl__device_nextid

When selecting the devid for a new device, we should look in
/libxl/device for existing devices, not in the frontend area.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend in libxl__devices_destroy
Ian Jackson [Tue, 3 May 2016 17:39:36 +0000 (18:39 +0100)]
libxl: Do not trust frontend in libxl__devices_destroy

We need to enumerate the devices we have provided to a domain, without
trusting the guest-writeable (or, at least, guest-deletable) frontend
paths.

Instead, enumerate via, and read the backend path from, /libxl.

The console /libxl path is regular, so the special case for console 0
is not relevant any more: /libxl/GUEST/device/console/0 will be found,
and then libxl__device_destroy will DTRT to the right frontend path.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Provide libxl__backendpath_parse_domid
Ian Jackson [Wed, 27 Apr 2016 15:34:19 +0000 (16:34 +0100)]
libxl: Provide libxl__backendpath_parse_domid

Multiple places in libxl need to figure out the backend domid of a
device.  This can be discovered easily by looking at the backend path,
which always starts /local/domain/$backend_domid/.
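
As an illustration only, the parsing idea can be sketched as follows.  The function name, the use of None for malformed input, and the requirement that a further path component follows the domid are assumptions of this sketch, not a description of the actual libxl helper:

```python
import re

# Backend paths start with /local/domain/<domid>/ followed by more components.
_BACKEND_PATH_RE = re.compile(r"/local/domain/([0-9]+)/")


def backendpath_parse_domid(be_path):
    """Return the backend domid encoded in be_path, or None if it is malformed."""
    m = _BACKEND_PATH_RE.match(be_path)
    if m is None:
        return None
    return int(m.group(1))
```

For example, "/local/domain/0/backend/vbd/5/51712" yields 0, while a path outside /local/domain yields None.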

There are no call sites yet.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Record backend/frontend paths in /libxl/$DOMID
Ian Jackson [Mon, 16 May 2016 13:56:57 +0000 (14:56 +0100)]
libxl: Record backend/frontend paths in /libxl/$DOMID

This gives us a record of all the backends we have set up for a
domain, which is separate from the frontends in
  /local/domain/$DOMID/device.

In particular:

1. A guest has write permission for the frontend path:
  /local/domain/$DOMID/device/$KIND/$DEVID
which means that the guest can completely delete the frontend.
(They can't recreate it because they don't have write permission
on the containing directory.)

2. A guest has write permission for the backend path recorded in the
frontend, ie, it can write to
  /local/domain/$DOMID/device/$KIND/$DEVID/backend
which means that the guest can break the association between
frontend and backend.

So we can't rely on iterating over the frontends to find all the
backends, or examining a frontend to discover how a device is
configured.

So, have libxl__device_generic_add record the frontend and backend
paths in /libxl/$DOMID/device, and have libxl__device_destroy remove
them again.
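
A toy model of the resulting layout, with a plain dict standing in for xenstore.  All function names are hypothetical, the backend path layout (/local/domain/$BACKEND/backend/$KIND/$DOMID/$DEVID) is an assumption of this sketch, and using one kind for both frontend and backend is a simplification:

```python
def frontend_path(domid, kind, devid):
    return f"/local/domain/{domid}/device/{kind}/{devid}"


def libxl_path(domid, kind, devid):
    return f"/libxl/{domid}/device/{kind}/{devid}"


def device_generic_add(store, domid, kind, devid, backend_domid):
    """Record a device in the guest-visible frontend area and under /libxl."""
    fe = frontend_path(domid, kind, devid)
    be = f"/local/domain/{backend_domid}/backend/{kind}/{domid}/{devid}"
    store[f"{fe}/backend"] = be      # guest-writable pointer, not to be trusted
    lp = libxl_path(domid, kind, devid)
    store[f"{lp}/frontend"] = fe     # trusted copies, out of the guest's reach
    store[f"{lp}/backend"] = be


def device_destroy(store, domid, kind, devid):
    """Remove the /libxl record again (frontend/backend cleanup omitted)."""
    lp = libxl_path(domid, kind, devid) + "/"
    for key in [k for k in store if k.startswith(lp)]:
        del store[key]


def trusted_backends(store, domid):
    """Enumerate a domain's backends via /libxl only, ignoring the frontends."""
    prefix = f"/libxl/{domid}/device/"
    return sorted(v for k, v in store.items()
                  if k.startswith(prefix) and k.endswith("/backend"))
```

In this model, a guest overwriting or deleting its $frontend/backend keys has no effect on what trusted_backends returns.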

Create the containing directory /libxl/GUEST/device in
libxl__domain_make.  The already existing xs_rm in devices_destroy_cb
will take care of removing it.

This is part of XSA-175.

Backport note: Backported over 7472ced, which fixes a bug in driver
domain teardown.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agoxen/arm: Don't free p2m->root in p2m_teardown() before it has been allocated
Andrew Cooper [Thu, 2 Jun 2016 13:19:00 +0000 (14:19 +0100)]
xen/arm: Don't free p2m->root in p2m_teardown() before it has been allocated

If p2m_init() didn't complete successfully, (e.g. due to VMID
exhaustion), p2m_teardown() is called and unconditionally tries to free
p2m->root before it has been allocated.  free_domheap_pages() doesn't
tolerate NULL pointers.

This is XSA-181

Reported-by: Aaron Cornelius <Aaron.Cornelius@dornerworks.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agosched: avoid races on time values read from NOW()
Dario Faggioli [Fri, 27 May 2016 12:50:19 +0000 (14:50 +0200)]
sched: avoid races on time values read from NOW()

or (even in cases where there is no race, e.g., outside
of Credit2) avoid using a time sample which may be rather
old, and hence stale.

In fact, we should only sample NOW() from _inside_
the critical region within which the value we read is
used. If we don't, in case we have to spin for a while
before entering the region, when actually using it:

 1) we will use something that, at the very least, is
    not really "now", because of the spinning,

 2) if someone else sampled NOW() during a critical
    region protected by the lock we are spinning on,
    and if we compare the two samples when we get
    inside our region, our one will be 'earlier',
    even if we actually arrived later, which is a
    race.
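
Case 2) can be replayed deterministically in a single thread.  This is only a sketch: now(), RunQueue and the replay functions are hypothetical names, the clock is a simple tick counter rather than real time, and the two "CPUs" are just consecutive code sections:

```python
import itertools
import threading

_ticks = itertools.count(1)


def now():
    """Deterministic stand-in for NOW(): every call returns a later tick."""
    return next(_ticks)


class RunQueue:
    def __init__(self):
        self.lock = threading.Lock()
        self.last_update = 0

    def update(self, t):
        """Call with self.lock held; True means time appears to go backwards."""
        went_backwards = t < self.last_update
        self.last_update = max(self.last_update, t)
        return went_backwards


def replay_racy(rq):
    early = now()          # CPU A samples, then has to spin on the lock
    with rq.lock:          # CPU B gets the lock first and samples inside it
        rq.update(now())
    with rq.lock:          # CPU A enters at last, carrying its stale sample
        return rq.update(early)


def replay_fixed(rq):
    with rq.lock:          # CPU B, as before
        rq.update(now())
    with rq.lock:          # CPU A samples only once it is inside the region
        return rq.update(now())
```

replay_racy reports time going backwards, while replay_fixed does not, because every sample is ordered by the lock that protects its use.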

In Credit2, we see an instance of 2), in runq_tickle(),
when it is called by csched2_context_saved() as it samples
NOW() before acquiring the runq lock. This makes things
look like the time went backwards, and it confuses the
algorithm (there's even a d2printk() about it, which would
trigger all the time, if enabled).

In RTDS, something similar happens in repl_timer_handler(),
and there's another instance in schedule() (in generic code),
so fix these cases too.

While there, improve csched2_vcpu_wake() and rt_vcpu_wake()
a little as well (removing a pointless initialization, and
moving the sampling a bit closer to its use). These two hunks
entail no further functional changes.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
RTDS: fix another instance of the 'read NOW()' race

which was overlooked in 779511f4bf5ae ("sched: avoid
races on time values read from NOW()").

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
master commit: 779511f4bf5ae34820a85e4eb20d50c60f69e977
master date: 2016-05-23 14:39:51 +0200
master commit: 4074e4ebe9115ac4986f963a13feada3e0560460
master date: 2016-05-25 14:33:57 +0200

8 years agox86emul: suppress writeback upon unsuccessful MMX/SSE/AVX insn emulation
Jan Beulich [Fri, 27 May 2016 12:49:52 +0000 (14:49 +0200)]
x86emul: suppress writeback upon unsuccessful MMX/SSE/AVX insn emulation

This in particular prevents updating guest IP when handling the retry
needed to forward the memory access to qemu.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2bb230972c5ddb1ca823f47750b5d46a9d302d0e
master date: 2016-05-19 12:06:33 +0200

8 years agoxen/nested_p2m: Don't walk EPT tables with a regular PT walker
Andrew Cooper [Fri, 27 May 2016 12:49:28 +0000 (14:49 +0200)]
xen/nested_p2m: Don't walk EPT tables with a regular PT walker

hostmode->p2m_ga_to_gfn() is a plain PT walker, and is not appropriate for a
general L1 p2m walk.  It is fine for AMD as NPT share the same format as
normal pagetables.  For Intel EPT however, it is wrong.

The translation ends up correct (as the formats are sufficiently similar), but
the control bits in lower 12 bits differ in meaning.  A plain PT walker sets
A/D bits (bits 5 and 6) as it walks, but in EPT tables, these are the IPAT and
top bit of EMT (caching type).  This in turn causes problems when the EPT
tables are subsequently used.
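
The bit clash can be illustrated with a small sketch.  The constant and function names are hypothetical; the layout assumed here is the architectural one (EMT in bits 5:3 and IPAT in bit 6 of an EPT leaf entry, Accessed/Dirty in bits 5/6 of a normal PTE), and only the low control bits are modelled:

```python
PTE_ACCESSED = 1 << 5        # normal pagetable entry: Accessed
PTE_DIRTY = 1 << 6           # normal pagetable entry: Dirty

EPT_EMT_SHIFT = 3            # EPT leaf entry: memory type lives in bits 5:3
EPT_EMT_MASK = 0x7 << EPT_EMT_SHIFT
EPT_IPAT = 1 << 6            # EPT leaf entry: ignore-PAT

EPT_MEMORY_TYPES = {0: "UC", 1: "WC", 4: "WT", 5: "WP", 6: "WB"}


def decode_ept_control(entry):
    """Return (memory type name, IPAT flag) for an EPT leaf entry."""
    emt = (entry & EPT_EMT_MASK) >> EPT_EMT_SHIFT
    return EPT_MEMORY_TYPES.get(emt, "reserved"), bool(entry & EPT_IPAT)


def plain_pt_walker_touch(entry):
    """What a plain PT walker does to an entry it walks: set A and D."""
    return entry | PTE_ACCESSED | PTE_DIRTY
```

For instance, an EPT entry with read/write/execute permission and type UC (0x7) decodes as ("UC", False) before the walk and as ("WT", True) afterwards.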

Replace hostmode->p2m_ga_to_gfn() with nestedhap_walk_L1_p2m() in
paging_gva_to_gfn(), which is the correct function for the task.  This
involves making nestedhap_walk_L1_p2m() non-static, and adding
vmx_vmcs_enter/exit() pairs to nvmx_hap_walk_L1_p2m() as it is now reachable
from contexts other than v == current.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: bab2bd8e222de9e596699ac080ea985af828c4c4
master date: 2016-05-18 18:22:06 +0100

8 years agox86/PoD: skip eager reclaim when possible
Jan Beulich [Fri, 27 May 2016 12:48:58 +0000 (14:48 +0200)]
x86/PoD: skip eager reclaim when possible

Reclaiming pages is pointless when the cache can already satisfy all
outstanding PoD entries, and doing reclaims in that case can be very
harmful to performance when that memory gets used by the guest, but
only to store zeroes there.
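
The condition amounts to a single comparison.  A sketch, with hypothetical names for the two counters (pages held in the PoD cache and outstanding PoD entries); the real check may differ in detail:

```python
def needs_eager_reclaim(cache_pages, outstanding_entries):
    """True only if the PoD cache cannot already back every outstanding entry."""
    return outstanding_entries > cache_pages
```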

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 556c69f4efb09dd06e6bce4cbb0455287f19d02e
master date: 2016-05-12 18:02:21 +0200

8 years agoIOMMU/x86: per-domain control structure is not HVM-specific
Jan Beulich [Fri, 27 May 2016 12:48:00 +0000 (14:48 +0200)]
IOMMU/x86: per-domain control structure is not HVM-specific

... and hence should not live in the HVM part of the PV/HVM union. In
fact it's not even architecture specific (there already is a per-arch
extension type to it), so it gets moved out right to common struct
domain.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: af07377007d595b5d6422291bb1c932c16d1036f
master date: 2016-05-04 09:44:32 +0200

8 years agox86: use optimal NOPs to fill the SMEP/SMAP placeholders
Jan Beulich [Fri, 27 May 2016 12:47:08 +0000 (14:47 +0200)]
x86: use optimal NOPs to fill the SMEP/SMAP placeholders

Alternatives patching code picks the most suitable NOPs for the
running system, so simply use it to replace the pre-populated ones.

Use an arbitrary, always available feature to key off from, but
hide this behind the new X86_FEATURE_ALWAYS.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/compat: correct SMEP/SMAP NOPs patching

Correct the number of single byte NOPs we want to be replaced in case
neither SMEP nor SMAP are available.

Also simplify the expression adding these NOPs - at that location .
equals .Lcr4_orig, and removing that part of the expression fixes a
bogus ".space or fill with negative value, ignored" warning by very old
gas (which actually is what made me look at those constructs again).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 01a0bd0a7d72be638a359db3f8cf551123467d29
master date: 2016-05-13 18:15:55 +0100
master commit: f5610009529628314c9d1d52b00715fe855fcf06
master date: 2016-05-26 17:26:24 +0100

8 years agox86: suppress SMEP and SMAP while running 32-bit PV guest code
Jan Beulich [Fri, 27 May 2016 12:46:31 +0000 (14:46 +0200)]
x86: suppress SMEP and SMAP while running 32-bit PV guest code

Since such guests' kernel code runs in ring 1, their memory accesses,
at the paging layer, are supervisor mode ones, and hence subject to
SMAP/SMEP checks. Such guests cannot be expected to be aware of those
two features though (and so far we also don't expose the respective
feature flags), and hence may suffer page faults they cannot deal with.

While the placement of the re-enabling slightly weakens the intended
protection, it was selected such that 64-bit paths would remain
unaffected where possible. At the expense of a further performance hit
the re-enabling could be put right next to the CLACs.

Note that this introduces a number of extra TLB flushes - CR4.SMEP
transitioning from 0 to 1 always causes a flush, and it transitioning
from 1 to 0 may also do.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/compat: Cleanup and further debugging of SMAP/SMEP fixup

 * Abstract (X86_CR4_SMEP | X86_CR4_SMAP) behind XEN_CR4_PV32_BITS to avoid
   open-coding the individual bits which are fixed up behind a 32bit PV guest's
   back.
 * Show cr4_pv32_mask in the BUG register dump

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
x86: refine debugging of SMEP/SMAP fix

Instead of just latching cr4_pv32_mask into %rdx, correct the found
wrong value in %cr4 (to avoid triggering another BUG).

Also there is one more place for XEN_CR4_PV32_BITS to be used.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86: make SMEP/SMAP suppression tolerate NMI/MCE at the "wrong" time

There is one instruction boundary where any kind of interruption would
break the assumptions cr4_pv32_restore's debug mode checking makes on
the correlation between the CR4 register value and its in-memory cache.
Correct this (see the code comment) even in non-debug mode, or else
a subsequent cr4_pv32_restore would also be misguided into thinking the
features are enabled when they really aren't.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ea3e8edfdbabfb17f0d39ed128716ec464f348b8
master date: 2016-05-13 18:15:45 +0100
master commit: ad4aa3619f436e3ed79eea8498ac18aa8d5e6b83
master date: 2016-05-16 13:11:05 +0100
master commit: e5e73163ec40b409151f2170d8e406a72b515ff2
master date: 2016-05-17 16:41:35 +0200
master commit: 9e28baf22ec98a64f68757eff39df72173d5f1bb
master date: 2016-05-17 16:42:15 +0200

8 years agox86: move cached CR4 value to struct cpu_info
Jan Beulich [Fri, 27 May 2016 12:45:36 +0000 (14:45 +0200)]
x86: move cached CR4 value to struct cpu_info

This not only eases using the cached value in assembly code, but also
improves the generated code resulting from such reads in C.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5d93f1d8ca7b62e85c8b98ed9c45b6cef89d17b8
master date: 2016-03-18 09:49:47 +0100

8 years agox86/alternatives: correct near branch check
Jan Beulich [Fri, 27 May 2016 12:44:46 +0000 (14:44 +0200)]
x86/alternatives: correct near branch check

Make sure the near JMP/CALL check doesn't consume uninitialized
data, not even in a benign way. And relax the length check at once.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cd29140ef0e65a33d62e7f5ee843077e51913f01
master date: 2016-03-09 16:51:16 +0100

8 years agox86/P2M: consolidate handling of types not requiring a valid MFN
Jan Beulich [Fri, 27 May 2016 12:44:09 +0000 (14:44 +0200)]
x86/P2M: consolidate handling of types not requiring a valid MFN

As noted regarding the mixture of checks in p2m_pt_set_entry(),
introduce a new P2M type group allowing to be used everywhere we
just care about accepting operations with either a valid MFN or a type
permitting to be used without (valid) MFN.

Note that p2m_mmio_dm is not included in P2M_NO_MFN_TYPES, as for the
intended purpose that one ought to be treated similar to p2m_invalid
(perhaps the two should ultimately get folded anyway).

Note further that PoD superpages now get INVALID_MFN used when creating
page table entries (was _mfn(0) before).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: c35eefded2992fc9b979f99190422527650872fd
master date: 2015-11-20 12:38:33 +0100

8 years agoxen/arm: p2m: Release the p2m lock before undoing the mappings
Julien Grall [Fri, 20 May 2016 13:37:42 +0000 (14:37 +0100)]
xen/arm: p2m: Release the p2m lock before undoing the mappings

Since commit 4b25423a "arch/arm: unmap partially-mapped memory regions",
Xen has been undoing the P2M mappings when an error occurred during
insertion or memory allocation.

This is done by calling apply_p2m_changes recursively.  However, the
second call is made with the p2m lock held, which results in a
deadlock on the current processor.

The p2m lock is there to prevent two threads from modifying the page
tables concurrently.  However, it does not guarantee the ordering of the
changes, i.e. if two threads request changes on overlapping regions,
the result is undefined.

Therefore it is fine to move the recursive call to undo the changes
after the lock is released.
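
The shape of the fix can be sketched with a non-reentrant lock.  The names below are hypothetical and the sketch models only the locking pattern, not the real apply_p2m_changes:

```python
import threading

p2m_lock = threading.Lock()   # non-reentrant: re-taking it while held blocks


def recursive_call_would_block():
    """Show that taking the lock again while holding it cannot succeed."""
    with p2m_lock:
        got_it = p2m_lock.acquire(blocking=False)
        if got_it:            # not expected for a non-reentrant lock
            p2m_lock.release()
        return not got_it


def apply_changes(op, log, fail=False):
    """Apply 'op' under the lock; on failure, undo only after unlocking."""
    p2m_lock.acquire()
    try:
        log.append(f"{op}: lock held")
        rc = -1 if fail else 0
    finally:
        p2m_lock.release()
    if rc != 0:
        # The recursive undo takes the lock itself, so it must run unlocked.
        apply_changes("undo", log)
    return rc
```

A failing "insert" therefore logs the insert and then the undo, each under its own lock acquisition, and leaves the lock released.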

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Tested-by: Wei Chen <Wei.Chen@arm.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>