xenbits.xensource.com Git - xen.git/log

7 years ago  grant_table: fix GNTTABOP_cache_flush handling
Andrew Cooper [Tue, 12 Sep 2017 13:17:22 +0000 (15:17 +0200)]
grant_table: fix GNTTABOP_cache_flush handling

Don't fall over a NULL grant_table pointer when the owner of the domain
is a system domain (DOMID_{XEN,IO} etc).

This is CVE-2017-14318 / XSA-232.

Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c3d830b244998b3686e2eb64db95996be5eb5e5c
master date: 2017-09-12 14:44:11 +0200

7 years ago  xen/mm: make sure node is less than MAX_NUMNODES
George Dunlap [Tue, 12 Sep 2017 13:16:48 +0000 (15:16 +0200)]
xen/mm: make sure node is less than MAX_NUMNODES

The output of MEMF_get_node(memflags) can be as large as nodeid_t can
hold (currently 255).  This is then used as an index into arrays of
size MAX_NUMNODES, which is 64 on x86 and 1 on ARM.  The value can be
passed in by an untrusted guest (via memory_exchange and
increase_reservation) and is not currently bounds-checked.

Check the value in page_alloc.c before using it, and also check the
value in the hypercall call sites and return -EINVAL if appropriate.
Don't permit domains other than the hardware or control domain to
allocate node-constrained memory.
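
As an illustration of the bounds check described above, here is a hedged
standalone C sketch; the bit layout, MAX_NUMNODES_EX limit and NO_NODE_EX
sentinel are assumptions of this example, not the real Xen definitions.

    #define MAX_NUMNODES_EX  64      /* stand-in for the build-time limit */
    #define NO_NODE_EX       0xff    /* stand-in "no node requested" value */
    #define EINVAL_EX        22

    /* Extract the (guest-influenced) node id from memflags; the field
     * position here is a placeholder, not the real MEMF_* encoding. */
    static unsigned int node_from_memflags(unsigned int memflags)
    {
        return (memflags >> 8) & 0xff;
    }

    /* Reject out-of-range node ids before they are ever used to index
     * MAX_NUMNODES-sized arrays. */
    static int check_node(unsigned int memflags)
    {
        unsigned int node = node_from_memflags(memflags);

        if ( node != NO_NODE_EX && node >= MAX_NUMNODES_EX )
            return -EINVAL_EX;

        return 0;
    }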

This is CVE-2017-14316 / XSA-231.

Reported-by: Matthew Daley <mattd@bugfuzz.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fece35303529395bfea6b03d2268380ef682c93
master date: 2017-09-12 14:43:16 +0200

7 years ago  arm/mm: release grant lock on xenmem_add_to_physmap_one() error paths
Jan Beulich [Wed, 23 Aug 2017 15:55:38 +0000 (17:55 +0200)]
arm/mm: release grant lock on xenmem_add_to_physmap_one() error paths

Commit 55021ff9ab ("xen/arm: add_to_physmap_one: Avoid to map mfn 0 if
an error occurs") introduced error paths not releasing the grant table
lock. Replace them by a suitable check after the lock was dropped.

This is XSA-235.

Reported-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
master commit: 59546c1897a90fe9af5ebbbb05ead8d98b4d17b9
master date: 2017-08-23 17:45:45 +0200

7 years ago  gnttab: fix "don't use possibly unbounded tail calls"
Jan Beulich [Mon, 21 Aug 2017 14:00:02 +0000 (16:00 +0200)]
gnttab: fix "don't use possibly unbounded tail calls"

The compat mode code also needs adjustment to deal with the changed
return value from gnttab_copy().

This is part of XSA-226.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ca617570542e1d7d8de636d5396959bbf1dabab7
master date: 2017-08-21 15:43:36 +0200

7 years ago  gnttab: fix transitive grant handling
Jan Beulich [Thu, 17 Aug 2017 13:15:55 +0000 (15:15 +0200)]
gnttab: fix transitive grant handling

Processing of transitive grants must not use the fast path, or else
reference counting breaks due to the skipped recursive call to
__acquire_grant_for_copy() (its __release_grant_for_copy()
counterpart occurs independently of the original pin count).
Furthermore, after re-acquiring temporarily dropped locks we need to
verify that no grant properties changed if the original pin count was
non-zero; checking just the pin counts is sufficient only for
well-behaved guests. As a result, __release_grant_for_copy() needs to
mirror that new behavior.

Furthermore a __release_grant_for_copy() invocation was missing on the
retry path of __acquire_grant_for_copy(), and gnttab_set_version() also
needs to bail out upon encountering a transitive grant.

This is part of XSA-226.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad48fb963dbff02762d2db5396fa655ac0c432c7
master date: 2017-08-17 14:40:31 +0200

7 years ago  gnttab: don't use possibly unbounded tail calls
Jan Beulich [Thu, 17 Aug 2017 13:15:35 +0000 (15:15 +0200)]
gnttab: don't use possibly unbounded tail calls

There is no guarantee that the compiler would actually translate them
to branches instead of calls, so only ones with a known recursion limit
are okay:
- __release_grant_for_copy() can call itself only once, as
  __acquire_grant_for_copy() won't permit use of multi-level transitive
  grants,
- __acquire_grant_for_copy() is fine to call itself with the last
  argument false, as that prevents further recursion,
- __acquire_grant_for_copy() must not call itself to recover from an
  observed change to the active entry's pin count

This is part of XSA-226.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 999d2ccb7f73408aa22656e1ba2f98b077eaa1c2
master date: 2017-08-17 14:39:18 +0200

7 years ago  gnttab: correct pin status fixup for copy
Jan Beulich [Tue, 15 Aug 2017 13:35:46 +0000 (15:35 +0200)]
gnttab: correct pin status fixup for copy

Even though copy operations only set GNTPIN_hst*, GNTPIN_dev* also
needs to be taken into account when deciding whether to clear
_GTF_{read,writ}ing. At least for consistency with code elsewhere, the
read part is better off not using any mask at all.
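
A hedged sketch of the intended checks, in standalone C; the GNTPIN-style
masks below are illustrative placeholders rather than the real Xen
encodings.

    #include <stdbool.h>
    #include <stdint.h>

    /* Placeholder pin-count layout: host/device, read/write fields. */
    #define PIN_HST_WR_MASK  0x000000ffu
    #define PIN_HST_RD_MASK  0x0000ff00u
    #define PIN_DEV_WR_MASK  0x00ff0000u
    #define PIN_DEV_RD_MASK  0xff000000u

    /* The shared entry's "writing" hint may only be dropped once no
     * writable pin of either kind (host or device) remains. */
    static bool may_clear_writing(uint32_t pin)
    {
        return !(pin & (PIN_HST_WR_MASK | PIN_DEV_WR_MASK));
    }

    /* The "reading" hint may only be dropped once no pin at all remains. */
    static bool may_clear_reading(uint32_t pin)
    {
        return pin == 0;
    }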

This is XSA-230.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6e2a4c73564ab907b732059adb317d6ca2d138a2
master date: 2017-08-15 15:08:03 +0200

7 years ago  x86/grant: disallow misaligned PTEs
Andrew Cooper [Tue, 15 Aug 2017 13:33:09 +0000 (15:33 +0200)]
x86/grant: disallow misaligned PTEs

Pagetable entries must be aligned to function correctly.  Disallow attempts
from the guest to have a grant PTE created at a misaligned address, which
would result in corruption of the L1 table with largely-guest-controlled
values.
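
For illustration, a minimal hedged sketch of the alignment requirement in
standalone C, assuming 8-byte pagetable entries; this is not the actual
Xen check.

    #include <stdbool.h>
    #include <stdint.h>

    /* A 64-bit pagetable entry must sit at an 8-byte aligned address;
     * anything else straddles two slots and corrupts the L1 table. */
    static bool grant_pte_addr_ok(uint64_t addr)
    {
        return (addr & (sizeof(uint64_t) - 1)) == 0;
    }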

This is CVE-2017-12137 / XSA-227.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ce442926c2530da9376199dcc769436376ad2386
master date: 2017-08-15 15:06:45 +0200

7 years ago  gnttab: __gnttab_unmap_common_complete() is all-or-nothing
Jan Beulich [Tue, 20 Jun 2017 14:59:16 +0000 (16:59 +0200)]
gnttab: __gnttab_unmap_common_complete() is all-or-nothing

All failures have to be detected in __gnttab_unmap_common(), the
completion function must not skip part of its processing. In particular
the GNTMAP_device_map related putting of page references and adjustment
of pin count must not occur if __gnttab_unmap_common() signaled an
error. Furthermore the function must not make adjustments to global
state (here: clearing GNTTAB_device_map) before all possibly failing
operations have been performed.

There's one exception for IOMMU related failures: As IOMMU manipulation
occurs after GNTMAP_*_map have been cleared already, the related page
reference and pin count adjustments need to be done nevertheless. A
fundamental requirement for the correctness of this is that
iommu_{,un}map_page() crash any affected DomU in case of failure.

The version check appears to be pointless (or could perhaps be a
BUG_ON() or ASSERT()), but for the moment also move it.

This is part of XSA-224.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 11fc7ccb7217ccfb79edb727d1349618bcb0602d
master date: 2017-06-20 14:46:47 +0200

7 years ago  gnttab: correct logic to get page references during map requests
George Dunlap [Tue, 20 Jun 2017 14:58:54 +0000 (16:58 +0200)]
gnttab: correct logic to get page references during map requests

The rules for reference counting are somewhat complicated:

* Each of GNTTAB_host_map and GNTTAB_device_map needs its own
reference count

* If the mapping is writeable:
 - GNTTAB_host_map needs a type count under only some conditions
 - GNTTAB_device_map always needs a type count

If the mapping succeeds, we need to keep all of these; if the mapping
fails, we need to release whatever references we have acquired so far.

Additionally, the code that does a lot of this calculation "inherits"
a reference as part of the process of finding out who the owner is.

Finally, if the grant is mapped as writeable (without the
GNTMAP_readonly flag), but the hypervisor cannot grab a
PGT_writeable_page type, the entire operation should fail.

Unfortunately, the current code has several logic holes:

* If a grant is mapped only GNTTAB_device_map, and with a writeable
  mapping, but in conditions where a *host* type count is not
  necessary, the code will fail to grab the necessary type count.

* If a grant is mapped both GNTTAB_device_map and GNTTAB_host_map,
  with a writeable mapping, in conditions where the host type count is
  not necessary, *and* where the page cannot be changed to type
  PGT_writeable, the condition will not be detected.

In both cases, this means that on success, the type count will be
erroneously reduced when the grant is unmapped.  In the second case,
the type count will be erroneously reduced on the failure path as
well.  (In the first case the failure path logic has the same hole
as the reference grabbing logic.)

Additionally, the return value of get_page() is not checked; but this
may fail even if the first get_page() succeeded due to a reference
counting overflow.

First of all, simplify the restoration logic by explicitly counting
the reference and type references acquired.

Consider each mapping type separately, explicitly marking the
'incoming' reference as used so we know when we need to grab a second
one.

Finally, always check the return value of get_page[_type]() and go to
the failure path if appropriate.
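
A minimal hedged sketch of the "count what was acquired, release exactly
that on failure" pattern; every identifier here (acquire_ref,
acquire_type_ref, map_refs, the host_type_needed condition) is a
hypothetical stand-in rather than a real Xen helper, and the rules are
deliberately simplified.

    #include <stdbool.h>

    struct page;                                  /* opaque in this sketch */

    /* Hypothetical stand-ins for the real reference helpers; these stubs
     * merely model success and would be replaced by real logic. */
    static bool acquire_ref(struct page *pg)      { (void)pg; return true; }
    static bool acquire_type_ref(struct page *pg) { (void)pg; return true; }
    static void release_ref(struct page *pg)      { (void)pg; }
    static void release_type_ref(struct page *pg) { (void)pg; }

    static int map_refs(struct page *pg, bool host_map, bool dev_map,
                        bool writable, bool host_type_needed)
    {
        unsigned int refs = 0, type_refs = 0;

        if ( host_map )
        {
            if ( !acquire_ref(pg) )
                goto fail;
            refs++;
            if ( writable && host_type_needed )
            {
                if ( !acquire_type_ref(pg) )
                    goto fail;
                type_refs++;
            }
        }

        if ( dev_map )
        {
            if ( !acquire_ref(pg) )
                goto fail;
            refs++;
            if ( writable )        /* a writable device map always needs one */
            {
                if ( !acquire_type_ref(pg) )
                    goto fail;
                type_refs++;
            }
        }

        return 0;

     fail:
        /* Undo exactly what was acquired, no more and no less. */
        while ( type_refs-- )
            release_type_ref(pg);
        while ( refs-- )
            release_ref(pg);
        return -1;
    }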

This is part of XSA-224.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 75b384ece635adc55c2bafbdc2d8959c10542c31
master date: 2017-06-20 14:46:21 +0200

7 years ago  gnttab: never create host mapping unless asked to
Jan Beulich [Tue, 20 Jun 2017 14:58:35 +0000 (16:58 +0200)]
gnttab: never create host mapping unless asked to

We shouldn't create a host mapping unless asked to even in the case of
mapping a granted MMIO page. In particular the mapping wouldn't be torn
down when processing the matching unmap request.

This is part of XSA-224.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 56f2ab5b970f1b18cf2019df4bf27db544cda6ea
master date: 2017-06-20 14:46:01 +0200

7 years ago  gnttab: fix handling of dev_bus_addr during unmap
George Dunlap [Tue, 20 Jun 2017 14:58:12 +0000 (16:58 +0200)]
gnttab: fix handling of dev_bus_addr during unmap

If a grant has been mapped with the GNTTAB_device_map flag, calling
grant_unmap_ref() with dev_bus_addr set to zero should cause the
GNTTAB_device_map part of the mapping to be left alone.

Unfortunately, at the moment, op->dev_bus_addr is implicitly checked
before clearing the map and adjusting the pin count, but only the bits
above 12; and it is not checked at all before dropping page
references.  This means a guest can repeatedly make such a call to
cause the reference count to drop to zero, causing the page to be
freed and re-used, even though it's still mapped in its pagetables.

To fix this, always check op->dev_bus_addr explicitly for being
non-zero, as well as op->flag & GNTMAP_device_map, before doing
operations on the device_map.

While we're here, make the logic a bit cleaner:

* Always initialize op->frame to zero and set it from act->frame, to reduce the
chance of untrusted input being used

* Explicitly check the full dev_bus_addr against act->frame <<
  PAGE_SHIFT, rather than ignoring the lower 12 bits
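
A hedged sketch of the resulting guard in standalone C; the structure,
flag bit and shift below are illustrative assumptions, not the actual
grant-table definitions.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGE_SHIFT_EX      12
    #define MAP_DEVICE_MAP_EX  (1u << 0) /* placeholder for GNTMAP_device_map */

    struct unmap_op_ex {
        uint64_t dev_bus_addr;           /* 0: leave the device map alone */
        uint32_t flags;
    };

    /* Only touch the device mapping when the caller actually asked for it
     * and named the frame that is really mapped. */
    static bool should_unmap_device(const struct unmap_op_ex *op,
                                    uint64_t act_frame)
    {
        return op->dev_bus_addr != 0 &&
               (op->flags & MAP_DEVICE_MAP_EX) &&
               op->dev_bus_addr == (act_frame << PAGE_SHIFT_EX);
    }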

This is part of XSA-224.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8fdfcb2b6bcd074776560e76843815f124d587f1
master date: 2017-06-20 14:45:33 +0200

7 years ago  arm: vgic: Don't update the LR when the IRQ is not enabled
Julien Grall [Tue, 20 Jun 2017 14:57:41 +0000 (16:57 +0200)]
arm: vgic: Don't update the LR when the IRQ is not enabled

gic_raise_inflight_irq will be called if the IRQ is already inflight
(i.e. the IRQ has been injected to the guest). If the IRQ is already in
the LRs, then the associated LR will be updated.

To know whether the interrupt is already in the LR, the function checks
whether the interrupt is queued. However, if the interrupt is not
enabled then the interrupt may be neither queued nor in the LR. So
gic_update_one_lr may be called (if we inject on the current vCPU) and
read the LR.

Because the interrupt is not in the LR, Xen will either read:
    * LR 0 if the interrupt was never injected before
    * LR 255 (GIC_INVALID_LR) if the interrupt was injected once. This
    is because gic_update_one_lr will reset p->lr.

Reading LR 0 will potentially update the wrong interrupt and leave the
LRs out of sync with Xen.

Reading LR 255 will result in:
    * Crashing Xen on GICv3, as the LR index is bigger than supported
    (see gicv3_ich_read_lr).
    * Always reading/writing GICH_LR + 255 * 4, which is not part of
    the memory-mapped region.

The problem can be prevented by checking whether the interrupt is
enabled in gic_raise_inflight_irq before calling gic_update_one_lr.

A follow-up of this patch is expected to mitigate the issue in the
future.

This is XSA-223.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: c84e4b2dd4050ef3eecc13fcfa6842373ba4519c
master date: 2017-06-20 14:41:55 +0200

7 years ago  guest_physmap_remove_page() needs its return value checked
Jan Beulich [Tue, 20 Jun 2017 14:57:14 +0000 (16:57 +0200)]
guest_physmap_remove_page() needs its return value checked

Callers, in particular those subsequently freeing the page, must not
blindly assume success - the function may fail when it needs to shatter
a superpage, but no memory is available for the then-needed
intermediate page table.

As it happens, guest_remove_page() callers now also all check the
return value.

Furthermore a missed put_gfn() on an error path in gnttab_transfer() is
also being taken care of.

This is part of XSA-222.

Reported-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0cce6048d010a30ac82f8db7787bbf9aada64f4
master date: 2017-06-20 14:41:16 +0200

7 years ago  memory: fix return value handling of guest_remove_page()
Andrew Cooper [Tue, 20 Jun 2017 14:56:28 +0000 (16:56 +0200)]
memory: fix return value handling of guest_remove_page()

Despite the description in mm.h, guest_remove_page() previously returned 0 for
paging errors.

Switch guest_remove_page() to having regular 0/-error semantics, and propagate
the return values from clear_mmio_p2m_entry() and mem_sharing_unshare_page()
to the callers (although decrease_reservation() is the only caller which
currently cares).

This is part of XSA-222.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b614f642c35da5184416787352f51a6379a92628
master date: 2017-06-20 14:39:56 +0200

7 years ago  evtchn: avoid NULL derefs
Jan Beulich [Tue, 20 Jun 2017 14:56:04 +0000 (16:56 +0200)]
evtchn: avoid NULL derefs

Commit fbbd5009e6 ("evtchn: refactor low-level event channel port ops")
added a de-reference of the struct evtchn pointer for a port without
first making sure the bucket pointer is non-NULL. This de-reference is
actually entirely unnecessary, as all relevant callers (beyond the
problematic do_poll()) already hold the port number in their hands, and
the actual leaf functions need nothing else.

For FIFO event channels there's a second problem in that the ordering
of reads and updates to ->num_evtchns and ->event_array[] was so far
undefined (the read side isn't always holding the domain's event lock).
Add respective barriers.

This is XSA-221.

Reported-by: Ankur Arora <ankur.a.arora@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: e7719a0dfac7a20cb7da5529e09773d8271bb78b
master date: 2017-06-20 14:37:47 +0200

7 years ago  x86: avoid leaking BND* between vCPU-s
Jan Beulich [Tue, 20 Jun 2017 14:55:09 +0000 (16:55 +0200)]
x86: avoid leaking BND* between vCPU-s

For MPX (BND<n>, BNDCFGU, and BNDSTATUS) the situation is less clear,
as the SDM's information for that case is not entirely consistent.
While experimentally the instructions don't change register state as
long as the two XCR0 bits aren't both 1, be on the safe side and enable
both if BNDCFGS.EN is being set for the first time.

This is XSA-220.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: de20bb6c4f65c4161e0931402613f9ffac86302d
master date: 2017-06-20 14:36:51 +0200

7 years ago  x86/shadow: hold references for the duration of emulated writes
Andrew Cooper [Tue, 20 Jun 2017 14:54:16 +0000 (16:54 +0200)]
x86/shadow: hold references for the duration of emulated writes

The (misnamed) emulate_gva_to_mfn() function translates a linear address to an
mfn, but releases its page reference before returning the mfn to its caller.

sh_emulate_map_dest() uses the results of one or two translations to construct
a virtual mapping to the underlying frames, completes an emulated
write/cmpxchg, then unmaps the virtual mappings.

The page references need holding until the mappings are unmapped, or
the frames can change ownership before the writes occur.

This is XSA-219.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 26217aff67ae1538d4e1b2226afab6993cdbe772
master date: 2017-06-20 14:36:11 +0200

7 years ago  gnttab: correct maptrack table access
Jan Beulich [Tue, 20 Jun 2017 14:52:39 +0000 (16:52 +0200)]
gnttab: correct maptrack table access

In order to observe a consistent (limit,pointer-table) pair, the reader
needs to either hold the grant table lock or both sides need to order
their accesses suitably (the writer side barrier is already there). Add
the missing barrier.
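
A hedged sketch of the required pairing, using C11 release/acquire
atomics in place of Xen's barrier primitives; the structure is a
simplified stand-in, not the real maptrack layout.

    #include <stdatomic.h>
    #include <stddef.h>

    struct maptrack_ex {
        void **pages;                    /* pointer table (simplified) */
        _Atomic unsigned int limit;      /* number of valid entries */
    };

    /* Writer: make the enlarged table visible before raising the limit. */
    static void maptrack_publish(struct maptrack_ex *mt, void **pages,
                                 unsigned int new_limit)
    {
        mt->pages = pages;
        atomic_store_explicit(&mt->limit, new_limit, memory_order_release);
    }

    /* Reader: observe the limit first, then the table that it covers. */
    static void *maptrack_entry(struct maptrack_ex *mt, unsigned int idx)
    {
        unsigned int limit = atomic_load_explicit(&mt->limit,
                                                  memory_order_acquire);

        return idx < limit ? mt->pages[idx] : NULL;
    }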

This is part of XSA-218.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4b78efa91c8ae3c42e14b8eaeaad773c5eb3b71a
master date: 2017-06-20 14:34:34 +0200

7 years ago  gnttab: Avoid potential double-put of maptrack entry
George Dunlap [Tue, 20 Jun 2017 14:52:18 +0000 (16:52 +0200)]
gnttab: Avoid potential double-put of maptrack entry

Each grant mapping for a particular domain is tracked by an in-Xen
"maptrack" entry.  This entry is referenced by a "handle", which is
given to the guest when it calls gnttab_map_grant_ref().

There are two types of mapping a particular handle can refer to:
GNTMAP_host_map and GNTMAP_device_map.  A given
gnttab_unmap_grant_ref() call can remove either only one or both of
these entries.  When a particular handle has no entries left, it must
be freed.

gnttab_unmap_grant_ref() loops through its grant unmap request list
twice.  It first removes entries from any host pagetables and (if
appropriate) IOMMUs; then it does a single domain TLB flush; then it
does the clean-up, including telling the granter that entries are no
longer being used (if appropriate).

At the moment, it's during the first pass that the maptrack flags are
cleared, but the second pass that the maptrack entry is freed.

Unfortunately this allows the following race, which results in a
double-free:

 A: (pass 1) clear host_map
 B: (pass 1) clear device_map
 A: (pass 2) See that maptrack entry has no mappings, free it
 B: (pass 2) See that maptrack entry has no mappings, free it #

Unfortunately, unlike the active entry pinning update, we can't simply
move the maptrack flag changes to the second half, because the
maptrack flags are used to determine if iommu entries need to be
added: a domain's iommu must never have fewer permissions than the
maptrack flags indicate, or a subsequent map_grant_ref() might fail to
add the necessary iommu entries.

Instead, free the maptrack entry in the first pass if there are no
further mappings.

This is part of XSA-218.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: b7f6cbb9d43f7384e1f38f8764b9a48216c8a525
master date: 2017-06-20 14:33:13 +0200

7 years ago  gnttab: fix unmap pin accounting race
Jan Beulich [Tue, 20 Jun 2017 14:51:58 +0000 (16:51 +0200)]
gnttab: fix unmap pin accounting race

Once all {writable} mappings of a grant entry have been unmapped, the
hypervisor informs the guest that the grant entry has been released by
clearing the _GTF_{reading,writing} usage flags in the guest's grant
table as appropriate.

Unfortunately, at the moment, the code that updates the accounting
happens in a different critical section than the one which updates the
usage flags; this means that under the right circumstances, there may be
a window in time after the hypervisor reported the grant as being free
during which the grantee still had access to the page.

Move the grant accounting code into the same critical section as the
reporting code to make sure this kind of race can't happen.

This is part of XSA-218.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 9a0bd460cfc28564d39fa23541bb872b13e7f7ea
master date: 2017-06-20 14:32:03 +0200

7 years ago  IOMMU: handle IOMMU mapping and unmapping failures
Quan Xu [Tue, 20 Jun 2017 14:51:32 +0000 (16:51 +0200)]
IOMMU: handle IOMMU mapping and unmapping failures

Treat IOMMU mapping and unmapping failures as fatal to the DomU: if an
IOMMU mapping or unmapping operation fails, crash the DomU and
propagate the error up the call trees.

No spamming of the log can occur. For DomU, we avoid logging any
message for already dying domains. For Dom0, that'll still be more
verbose than we'd really like, but it at least wouldn't outright
flood the console.

Signed-off-by: Quan Xu <quan.xu@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 834c97baebb3743c54bcae228e984ae1b9692e6a
master date: 2016-06-14 15:10:57 +0200

7 years ago  x86/mm: disallow page stealing from HVM domains
Jan Beulich [Tue, 20 Jun 2017 14:50:57 +0000 (16:50 +0200)]
x86/mm: disallow page stealing from HVM domains

The operation's success can't be controlled by the guest, as the device
model may have an active mapping of the page. If we nevertheless
permitted this operation, we'd have to add further TLB flushing to
prevent scenarios like

"Domains A (HVM), B (PV), C (PV); B->target==A
 Steps:
 1. B maps page X from A as writable
 2. B unmaps page X without a TLB flush
 3. A sends page X to C via GNTTABOP_transfer
 4. C maps page X as pagetable (potentially causing a TLB flush in C,
 but not in B)

 At this point, X would be mapped as a pagetable in C while being
 writable through a stale TLB entry in B."

A similar scenario could be constructed for A using XENMEM_exchange and
some arbitrary PV domain C then having this page allocated.

This is XSA-217.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: fae7d5be8bb8b7a5b7005c4f3b812a47661a721e
master date: 2017-06-20 14:29:51 +0200

8 years ago  x86: correct create_bounce_frame
Jan Beulich [Tue, 2 May 2017 13:08:11 +0000 (15:08 +0200)]
x86: correct create_bounce_frame

We may push up to 96 bytes on the guest (kernel) stack, so we should
also cover as much in the early range check. Note that this is the
simplest possible patch, which has the theoretical potential of
breaking a guest: We only really push 96 bytes when invoking the
failsafe callback, ordinary exceptions only have 56 or 64 bytes pushed
(without / with error code respectively). There is, however, no PV OS
known to place a kernel stack there.

This is XSA-215.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

8 years ago  x86: discard type information when stealing pages
Jan Beulich [Tue, 2 May 2017 13:07:30 +0000 (15:07 +0200)]
x86: discard type information when stealing pages

While a page having just a single general reference left necessarily
has a zero type reference count too, its type may still be valid (and
in validated state; at present this is only possible and relevant for
PGT_seg_desc_page, as page tables have their type forcibly zapped when
their type reference count drops to zero, and
PGT_{writable,shared}_page pages don't require any validation). In
such a case when the page is being re-used with the same type again,
validation is being skipped. As validation criteria differ between
32- and 64-bit guests, pages to be transferred between guests need to
have their validation indicator zapped (and with it we zap all other
type information at once).

This is XSA-214.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eaf537342c909875c10f49b06e17493655410681
master date: 2017-05-02 14:46:58 +0200

8 years ago  multicall: deal with early exit conditions
Jan Beulich [Tue, 2 May 2017 13:06:54 +0000 (15:06 +0200)]
multicall: deal with early exit conditions

In particular changes to guest privilege level require the multicall
sequence to be aborted, as hypercalls are permitted from kernel mode
only. While likely not very useful in a multicall, also properly handle
the return value in the HYPERVISOR_iret case (which should be the guest
specified value).

This is XSA-213.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 22c096c99d8c05833c3c19870e36efb2dd4e8013
master date: 2017-05-02 14:45:02 +0200

8 years ago  oxenstored: trim history in the frequent_ops function
Thomas Sanders [Tue, 28 Mar 2017 17:57:52 +0000 (18:57 +0100)]
oxenstored: trim history in the frequent_ops function

We were trimming the history of commits only at the end of each
transaction (regardless of how it ended).

Therefore if non-transactional writes were being made but no
transactions were being ended, the history would grow
indefinitely. Now we trim the history at regular intervals.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>

8 years ago  oxenstored transaction conflicts: improve logging
Thomas Sanders [Mon, 27 Mar 2017 13:36:34 +0000 (14:36 +0100)]
oxenstored transaction conflicts: improve logging

For information related to transaction conflicts, potentially frequent
logging at "info" priority has been changed to "debug" priority, and
once per two minutes there is an "info" priority summary.

Additional detailed logging has been added at "debug" priority.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>

8 years ago  oxenstored: don't wake to issue no conflict-credit
Thomas Sanders [Fri, 24 Mar 2017 19:55:03 +0000 (19:55 +0000)]
oxenstored: don't wake to issue no conflict-credit

In the main loop, when choosing the timeout for the select function
call, we were setting it so as to wake up to issue conflict-credit to
any domains that could accept it. When xenstore is idle, this would
mean waking up every 50ms (by default) to do no work. With this
commit, we check whether any domain is below its cap, and if not then
we set the timeout for longer (the same timeout as before the
conflict-protection feature was added).

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>

8 years ago  oxenstored: do not commit read-only transactions
Thomas Sanders [Fri, 24 Mar 2017 16:16:10 +0000 (16:16 +0000)]
oxenstored: do not commit read-only transactions

The packet telling us to end the transaction has always carried an
argument telling us whether to commit.

If the transaction made no modifications to the tree, now we ignore
that argument and do not commit: it is just a waste of effort.

This makes read-only transactions immune to conflicts, and means that
we do not need to store any of their details in the history that is
used for assigning blame for conflicts.

We count a transaction as a read-only transaction only if it contains
no operations that modified the tree.

This means that (for example) a transaction that creates a new node
then deletes it would NOT count as read-only, even though it makes no
change overall. A more sophisticated algorithm could judge the
transaction based on comparison of its initial and final states, but
this would add complexity and computational cost.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>

8 years ago  oxenstored: allow self-conflicts
Thomas Sanders [Thu, 23 Mar 2017 19:06:54 +0000 (19:06 +0000)]
oxenstored: allow self-conflicts

We already avoid inter-domain conflicts but now allow intra-domain
conflicts.  Although there are no known practical examples of a domain
that might perform operations that conflict with its own transactions,
this is conceivable, so here we avoid changing those semantics
unnecessarily.

When a transaction commit fails with a conflict and we look through
the history of commits to see which connection(s) to blame, ignore
historical commits that were made by the same connection as the
failing commit.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>

8 years ago  oxenstored: blame the connection that caused a transaction conflict
Jonathan Davies [Thu, 23 Mar 2017 14:28:16 +0000 (14:28 +0000)]
oxenstored: blame the connection that caused a transaction conflict

Blame each connection found to have made a commit that would cause this
transaction to fail. Each blamed connection is penalised by having its
conflict-credit decremented.

Note the change in semantics for the replay function: we no longer stop after
finding the first operation that can't be replayed. This allows us to identify
all operations that conflicted with this transaction, not just the one that
conflicted first.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
v1 Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

Changes since v1:
 * use correct log levels for informational messages
Changes since v2:
 * fix the blame algorithm and improve logging
   (fix was reviewed by Jonathan Davies)

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>

8 years ago  oxenstored: track commit history
Jonathan Davies [Mon, 27 Mar 2017 08:58:29 +0000 (08:58 +0000)]
oxenstored: track commit history

Since the list of historic activity cannot grow without bound, it is safe to use
this to track commits.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>

8 years ago  oxenstored: discard old commit-history on txn end
Thomas Sanders [Thu, 23 Mar 2017 14:25:16 +0000 (14:25 +0000)]
oxenstored: discard old commit-history on txn end

The history of commits is to be used for working out which historical
commit(s) (including atomic writes) caused conflicts with a
currently-failing commit of a transaction. Any commit that was made
before the current transaction started cannot be relevant. Therefore
we never need to keep history from before the start of the
longest-running transaction that is open at any given time: whenever a
transaction ends (with or without a commit) then if it was the
longest-running open transaction we can delete history up until the
start of the next-longest-running open transaction.

Some transactions might stay open for a very long time, so if any
transaction exceeds conflict_max_history_seconds then we remove it
from consideration in this context, and will not guarantee to keep
remembering about historical commits made during such a transaction.

We implement this by keeping a list of all open transactions that have
not been open too long. When a transaction ends, we remove it from the
list, along with any that have been open longer than the maximum; then
we delete any history from before the start of the longest-running
transaction remaining in the list.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

8 years ago  oxenstored: only record operations with side-effects in history
Jonathan Davies [Thu, 23 Mar 2017 14:20:33 +0000 (14:20 +0000)]
oxenstored: only record operations with side-effects in history

There is no need to record "read" operations as they will never cause another
transaction to fail.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Backport 4.6 -> 4.5 by removing reference to XS_RESET_WATCHES.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

8 years ago  oxenstored: support commit history tracking
Jonathan Davies [Tue, 14 Mar 2017 13:20:07 +0000 (13:20 +0000)]
oxenstored: support commit history tracking

Add ability to track xenstore tree operations -- either non-transactional
operations or committed transactions.

For now, the call to actually retain commits is commented out because history
can grow without bound.

For now, we call record_commit for all non-transactional operations. A
subsequent patch will make it retain only the ones with side-effects.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

8 years ago  oxenstored: add transaction info relevant to history-tracking
Jonathan Davies [Tue, 14 Mar 2017 12:17:38 +0000 (12:17 +0000)]
oxenstored: add transaction info relevant to history-tracking

Specifically:
 * retain the original store (not just the root) in full transactions
 * store commit count at the time of the start of the transaction

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

8 years ago  oxenstored: ignore domains with no conflict-credit
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: ignore domains with no conflict-credit

When processing connections, skip those from domains with no remaining
conflict-credit.

Also, issue a point of conflict-credit at regular intervals, the
period being set by the configuration option "conflict-max-history-
seconds".  When issuing conflict-credit, we give a point either to
every domain at once (one each) or only to the single domain at the
front of the queue, depending on the configuration option
"conflict-rate-limit-is-aggregate".

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

8 years ago  oxenstored: handling of domain conflict-credit
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: handling of domain conflict-credit

This commit gives each domain a conflict-credit variable, which will
later be used for limiting how often a domain can cause other domains'
transaction-commits to fail.

This commit also provides functions and data for manipulating domains
and their conflict-credit, and checking whether they have credit.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

8 years ago  oxenstored: comments explaining some variables
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: comments explaining some variables

It took a while of reading and reasoning to work out what these are
for, so here are comments to make life easier for everyone reading
this code in future.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

8 years ago  oxenstored: allow compilation prior to OCaml 3.12.0
Jonathan Davies [Thu, 23 Mar 2017 17:30:58 +0000 (17:30 +0000)]
oxenstored: allow compilation prior to OCaml 3.12.0

Commit 363ae55c8 used an OCaml feature called record field punning. This broke
the build on compilers prior to OCaml 3.12.0.

This patch makes no semantic change but now uses backwards-compatible syntax.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

8 years ago  oxenstored: log request and response during transaction replay
Jonathan Davies [Thu, 23 Mar 2017 17:30:49 +0000 (17:30 +0000)]
oxenstored: log request and response during transaction replay

During a transaction replay, the replayed requests and the new responses are
logged in the same way as the original requests and the original responses.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>

8 years ago  oxenstored: replay transaction upon conflict
Jonathan Davies [Thu, 23 Mar 2017 17:30:42 +0000 (17:30 +0000)]
oxenstored: replay transaction upon conflict

The existing transaction merge algorithm keeps track of the least upper bound
(longest common prefix) of all the nodes which have been read and written, and
will re-combine two stores which have disjoint upper bounds. This works well for
small transactions but causes unnecessary conflicts for ones that span a large
subtree, such as the following ones used by the xapi toolstack:

 * VM start: creates /vm/... /vss/... /local/domain/...
   The least upper bound of this transaction is / and so all
   these transactions conflict with everything.

 * Device hotplug: creates /local/domain/0/... /local/domain/n/...
   The least upper bound of this transaction is /local/domain so
   all these transactions conflict with each other.

If the existing merge algorithm cannot merge and commit, we attempt
a /replay/ of the failed transaction against the new store.

When we replay the requests we check whether the response sent to the client is
the same as during the first attempt at the transaction. If the responses are
all the same then the transaction replay can be committed. If any differ then
the transaction replay must be aborted and the client must retry.

This algorithm uses the intuition that the transactions made by the toolstack
are designed to be for separate domains, and should fundamentally not conflict
in the sense that they don't read or write any shared keys. By replaying the
transaction on the server side we do what the client would have to do anyway,
only we can do it quickly without allowing any other requests to interfere.

Performing 300 parallel simulated VM start and shutdowns without this code:

300 parallel starts and shutdowns: 268.92

Performing 300 parallel simulated VM start and shutdowns with this code:

300 parallel starts and shutdowns: 3.80

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dave Scott <dave@recoil.org>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>

8 years ago  oxenstored: move functions that process simple operations
Jonathan Davies [Thu, 23 Mar 2017 17:13:03 +0000 (17:13 +0000)]
oxenstored: move functions that process simple operations

Separate the functions which process operations that can be done as part of a
transaction. Specifically, these operations are: read, write, rm, getperms,
setperms, getdomainpath, directory, mkdir.

Also split function_of_type into two functions: one for processing the simple
operations and one for processing the rest.

This will help allow replay of transactions, allowing us to invoke the functions
that process the simple operations as part of the processing of transaction_end.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
Backporting to 4.5:

- Removed references to Reset_watches, which was introduced in 4.6.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>

8 years ago  oxenstored: keep track of each transaction's operations
Jonathan Davies [Thu, 23 Mar 2017 17:12:56 +0000 (17:12 +0000)]
oxenstored: keep track of each transaction's operations

A list of (request, response) pairs from the operations performed within the
transaction will be useful to support transaction replay.

Since this consumes memory, the number of requests per transaction must not be
left unbounded. Hence a new quota for this is introduced. This quota, configured
via the configuration key 'quota-maxrequests', limits the size of transactions
initiated by domUs.

After the maximum number of requests has been exhausted, any further requests
will result in EQUOTA errors. The client may then choose to end the transaction;
a successful commit will result in the retention of only the prior requests.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>

8 years ago  oxenstored: refactor request processing
Jonathan Davies [Thu, 23 Mar 2017 17:12:50 +0000 (17:12 +0000)]
oxenstored: refactor request processing

Encapsulate the request in a record that is passed from do_input to
process_packet and input_handle_error.

This will be helpful when keeping track of the requests made as part of a
transaction.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>

8 years ago  oxenstored: remove some unused parameters
Jonathan Davies [Thu, 23 Mar 2017 17:12:41 +0000 (17:12 +0000)]
oxenstored: remove some unused parameters

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>

8 years ago  oxenstored: refactor putting response on wire
Jonathan Davies [Thu, 23 Mar 2017 17:05:22 +0000 (17:05 +0000)]
oxenstored: refactor putting response on wire

Previously, the functions reply_{ack,data,data_or_ack} and input_handle_error
put the response on the wire by invoking Connection.send_{ack,reply,error}.

Instead, these functions now return a value indicating what needs to be put on
the wire, and that action is done by a send_response function called
afterwards.

This refactoring gives us a chance to store the value of the response, useful
for replaying transactions.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>

8 years ago  xenstored: Log when the write transaction rate limit bites
Ian Jackson [Thu, 23 Mar 2017 17:05:11 +0000 (17:05 +0000)]
xenstored: Log when the write transaction rate limit bites

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
plus:

xenstore: dont increment bool variable
Instead of incrementing a bool variable just set it to true.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

8 years ago  xenstored: apply a write transaction rate limit
Ian Jackson [Sat, 18 Mar 2017 16:44:46 +0000 (16:44 +0000)]
xenstored: apply a write transaction rate limit

This avoids a rogue client being able to stall another client (e.g.
the toolstack) indefinitely.

This is XSA-206.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

8 years ago  memory: properly check guest memory ranges in XENMEM_exchange handling
Jan Beulich [Tue, 4 Apr 2017 13:05:00 +0000 (15:05 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling

The use of guest_handle_okay() here (as introduced by the XSA-29 fix)
is insufficient; guest_handle_subrange_okay() needs to be used
instead.

Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
  16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
  number-of-entries input (leaving aside the fact that this can be
  called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
  index zero.
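
For illustration, a hedged standalone sketch of such a subrange check
(not the Xen macro): a check anchored at the start of the array is only
safe when the array is then walked sequentially from index zero, so an
operation that resumes at a guest-controlled offset has to validate the
subrange it will actually touch.

    #include <stdbool.h>
    #include <stdint.h>

    /* Is [base + first*elem_size, base + (first+nr)*elem_size) entirely
     * below a hypothetical guest address limit? */
    static bool subrange_ok(uint64_t base, uint64_t first, uint64_t nr,
                            uint64_t elem_size, uint64_t addr_limit)
    {
        uint64_t start, len;

        /* Guard the arithmetic itself against overflow. */
        if ( elem_size == 0 ||
             first > UINT64_MAX / elem_size ||
             nr > UINT64_MAX / elem_size )
            return false;

        len   = nr * elem_size;
        start = base + first * elem_size;
        if ( start < base )              /* wrapped around */
            return false;

        return start <= addr_limit && len <= addr_limit - start;
    }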

This is CVE-2017-7228 / XSA-212.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 938fd2586eb081bcbd694f4c1f09ae6a263b0d90
master date: 2017-04-04 14:47:46 +0200

8 years ago  QEMU_TAG update
Ian Jackson [Tue, 21 Mar 2017 18:45:00 +0000 (18:45 +0000)]
QEMU_TAG update

8 years ago  QEMU_TAG update
Ian Jackson [Wed, 22 Feb 2017 16:43:23 +0000 (16:43 +0000)]
QEMU_TAG update

8 years ago  IOMMU: always call teardown callback
Oleksandr Tyshchenko [Wed, 15 Feb 2017 12:21:03 +0000 (12:21 +0000)]
IOMMU: always call teardown callback

There is a possible scenario in which (d)->need_iommu remains unset
during guest domain execution, for example when no devices were
assigned to it. Taking into account that the teardown callback is not
called when (d)->need_iommu is unset, we might be left with unreleased
resources after destroying the domain.

So, always call the teardown callback to roll back actions that were
performed in the init callback.

This is XSA-207.

Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <julien.grall@arm.com>

8 years ago  libxl: Revert 3658f7a0bdd8 "libxl: fix libxl_set_memory_target"
Ian Jackson [Sat, 21 Jan 2017 19:03:23 +0000 (19:03 +0000)]
libxl: Revert 3658f7a0bdd8 "libxl: fix libxl_set_memory_target"

I backported
  3658f7a0bdd8fcda217559927e25263a07398c27
  libxl: fix libxl_set_memory_target
but this was not necessary, because the commit it fixes is not in
Xen 4.6.

So the backport introduces a duplicate call to xc_domain_getinfolist
and also breaks the build with
    libxl.c:4908:5: error: ‘r’ undeclared (first use in this function)

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 09f521a077024d5955d766eef7a040d2af928ec2)

8 years ago  libxl: fix libxl_set_memory_target
Wei Liu [Thu, 29 Dec 2016 16:36:31 +0000 (16:36 +0000)]
libxl: fix libxl_set_memory_target

Commit 26dbc93a ("libxl: Remove pointless hypercall from
libxl_set_memory_target") removed the call to xc_domain_getinfolist, but
it failed to notice that "info" was actually needed later.

Put that back. While at it, make the code conform to coding style
requirement.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit ed5f19aea66fe5a72060d6a795ffcd23b7643ee3)
(cherry picked from commit e1cefedd80f9972854769bfc6e32e23b56cd0712)
(cherry picked from commit 013ee593ca0456e1adfdf80ef7e44b151cdb1545)
(cherry picked from commit 3658f7a0bdd8fcda217559927e25263a07398c27)

8 years ago  x86: force EFLAGS.IF on when exiting to PV guests
Jan Beulich [Wed, 21 Dec 2016 16:47:08 +0000 (17:47 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests

Guest kernels modifying instructions in the process of being emulated
for another of their vCPU-s can cause EFLAGS.IF to be cleared upon
next exiting to guest context, by converting the instruction being
emulated to CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.
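
As a hedged illustration in standalone C: the flag bit positions are
architectural constants, but the helper itself is only a sketch, not the
actual exit-path code.

    #include <stdint.h>

    #define EFLAGS_IF_EX    (1u << 9)
    #define EFLAGS_IOPL_EX  (3u << 12)
    #define EFLAGS_NT_EX    (1u << 14)
    #define EFLAGS_VM_EX    (1u << 17)

    /* Sanitise the flags image used when returning to a PV guest: keep
     * interrupts enabled and the privileged bits clear, no matter what the
     * (possibly rewritten) emulated instruction asked for. */
    static uint64_t sanitise_pv_eflags(uint64_t eflags)
    {
        eflags &= ~(uint64_t)(EFLAGS_IOPL_EX | EFLAGS_NT_EX | EFLAGS_VM_EX);
        eflags |= EFLAGS_IF_EX;

        return eflags;
    }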

This is CVE-2016-10024 / XSA-202.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100

8 years ago  x86/emul: Correct the handling of eflags with SYSCALL
Andrew Cooper [Sun, 18 Dec 2016 15:42:59 +0000 (15:42 +0000)]
x86/emul: Correct the handling of eflags with SYSCALL

A singlestep #DB is determined by the resulting eflags value from the
execution of SYSCALL, not the original eflags value.

By using the original eflags value, we negate the guest kernel's
attempt to protect itself from a privilege escalation by masking TF.

Introduce a tf boolean and have the SYSCALL emulation recalculate it
after the instruction is complete.
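
A hedged sketch of that recalculation in standalone C; the MSR handling
is simplified and the helper is illustrative, not the emulator's actual
code.

    #include <stdbool.h>
    #include <stdint.h>

    #define EFLAGS_TF_EX  (1u << 8)

    /* SYSCALL replaces RFLAGS with the old value masked by SFMASK
     * (MSR_SYSCALL_MASK).  Whether a single-step #DB is pending afterwards
     * therefore depends on TF in the resulting flags, not the original
     * ones. */
    static bool syscall_singlestep_pending(uint64_t old_eflags,
                                           uint64_t sfmask)
    {
        uint64_t new_eflags = old_eflags & ~sfmask;

        return (new_eflags & EFLAGS_TF_EX) != 0;
    }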

This is XSA-204

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

8 years ago  x86emul: CMPXCHG8B ignores operand size prefix
Jan Beulich [Tue, 13 Dec 2016 13:29:02 +0000 (14:29 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix

Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).

This is CVE-2016-9932 / XSA-200.

Signed-off-by: Jan Beulich <jbeulich@suse.com>

8 years ago  QEMU_TAG update
Ian Jackson [Wed, 7 Dec 2016 16:54:35 +0000 (16:54 +0000)]
QEMU_TAG update

8 years ago  arm64: fix incorrect memory region size in TCR_EL2
Shanker Donthineni [Thu, 17 Mar 2016 12:46:58 +0000 (13:46 +0100)]
arm64: fix incorrect memory region size in TCR_EL2

The maximum and minimum values for TxSZ depend on the level of
translation as per the AArch64 Virtual Memory System Architecture.
According to ARM specification DDI0487A_h (sec D4.2.2, page 1752),
the minimum TxSZ value is 16. If TxSZ is programmed to a value
smaller than 16 then the behaviour is IMPLEMENTATION DEFINED.

This patch sets T0SZ to (64-48) bits, instead of the value zero, since
Xen uses all 4 levels to cover a 48-bit (256TB) virtual address space.

Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
Acked-by: Julien Grall <julien.grall@arm.com>

8 years ago  QEMU_TAG update
Ian Jackson [Tue, 29 Nov 2016 18:38:34 +0000 (18:38 +0000)]
QEMU_TAG update

8 years ago  pygrub: Properly quote results, when returning them to the caller:
Ian Jackson [Tue, 22 Nov 2016 13:30:27 +0000 (14:30 +0100)]
pygrub: Properly quote results, when returning them to the caller:

* When the caller wants sexpr output, use `repr()'
  This is what Xend expects.

  The returned S-expressions are now escaped and quoted by Python,
  generally using '...'.  Previously kernel and ramdisk were unquoted
  and args was quoted with "..." but without proper escaping.  This
  change may break toolstacks which do not properly dequote the
  returned S-expressions.

* When the caller wants "simple" output, crash if the delimiter is
  contained in the returned value.

  With --output-format=simple it does not seem like this could ever
  happen, because the bootloader config parsers all take line-based
  input from the various bootloader config files.

  With --output-format=simple0, this can happen if the bootloader
  config file contains nul bytes.

This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100

8 years ago  x86/svm: fix injection of software interrupts
Andrew Cooper [Tue, 22 Nov 2016 13:29:50 +0000 (14:29 +0100)]
x86/svm: fix injection of software interrupts

The non-NextRip logic in c/s 36ebf14eb "x86/emulate: support for emulating
software event injection" was based on an older version of the AMD software
manual.  The manual was later corrected, following findings from that series.

I took the original wording of "not supported without NextRIP" to mean that
X86_EVENTTYPE_SW_INTERRUPT was not eligible for use.  It turns out that this
is not the case, and the new wording is clearer on the matter.

Despite testing the original patch series on non-NRip hardware, the
swint-emulation XTF test case focuses on the debug vectors; it never ended up
executing an `int $n` instruction for a vector which wasn't also an exception.

During a vmentry, the use of X86_EVENTTYPE_HW_EXCEPTION comes with a vector
check to ensure that it is only used with exception vectors.  Xen's use of
X86_EVENTTYPE_HW_EXCEPTION for `int $n` injection has always been buggy on AMD
hardware.

Fix this by always using X86_EVENTTYPE_SW_INTERRUPT.

Print and decode the eventinj information in svm_vmcb_dump(), as it has
several invalid combinations which cause vmentry failures.

This is CVE-2016-9378 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 920edccd41db6cb0145545afa1850edf5e7d098e
master date: 2016-11-22 13:51:16 +0100

8 years ago  x86/emul: correct the IDT entry calculation in inject_swint()
Andrew Cooper [Tue, 22 Nov 2016 13:29:26 +0000 (14:29 +0100)]
x86/emul: correct the IDT entry calculation in inject_swint()

The logic, as introduced in c/s 36ebf14ebe "x86/emulate: support for emulating
software event injection" is buggy.  The size of an IDT entry depends on long
mode being active, not the width of the code segment currently in use.

In particular, this means that a compatibility code segment which hits
emulation for software event injection will end up using an incorrect offset
in the IDT for DPL/Presence checking.  In practice, this only occurs on old
AMD hardware lacking NRip support; all newer AMD hardware, and all Intel
hardware bypass this path in the emulator.

While here, fix a minor issue with reading the IDT entry.  The return value
from ops->read() wasn't checked, but in reality the only failure case is if a
pagefault occurs.  This is not a realistic problem as the kernel will almost
certainly crash with a double fault if this setup actually occurred.

This is CVE-2016-9377 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 255e8fe95f22ded5186fd75244ffcfb9d5dbc855
master date: 2016-11-22 13:50:49 +0100

8 years ago  x86emul: fix huge bit offset handling
Jan Beulich [Tue, 22 Nov 2016 13:29:03 +0000 (14:29 +0100)]
x86emul: fix huge bit offset handling

We must never chop off the high 32 bits.
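
As a hedged illustration of why, in standalone C (a simplified view of
the bit-test address calculation, not the emulator's actual code):

    #include <stdint.h>

    /* For the bit-test family with a register operand, the bit offset is a
     * signed value of the operand's width.  The byte actually accessed is
     * the effective address plus the offset arithmetically shifted right
     * by three; doing that adjustment in a 32-bit quantity silently
     * discards the upper half of a 64-bit offset and addresses the wrong
     * memory. */
    static uint64_t bit_test_byte(uint64_t ea, int64_t bit_offset)
    {
        return ea + (uint64_t)(bit_offset >> 3);   /* keep all 64 bits */
    }

    static unsigned int bit_test_bit(int64_t bit_offset)
    {
        return (unsigned int)(bit_offset & 7);
    }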

This is CVE-2016-9383 / XSA-195.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100

8 years agox86/PV: writes of %fs and %gs base MSRs require canonical addresses
Jan Beulich [Tue, 22 Nov 2016 13:28:40 +0000 (14:28 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses

Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.
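
For illustration, a sketch of the kind of check involved (the
is_canonical_address() name exists in Xen; the helper below is a
simplified stand-in assuming 48-bit virtual addresses):

    #include <stdbool.h>
    #include <stdint.h>

    /* Reject non-canonical bases before they reach WRMSR/WRFSBASE,
     * both of which raise #GP(0) for such input. */
    static bool is_canonical_address(uint64_t addr)
    {
        /* Bits 63:47 must all equal bit 47. */
        uint64_t top = addr >> 47;

        return top == 0 || top == 0x1ffff;
    }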

Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, otoh, already get properly checked in set_debugreg() (albeit
we discard the error there).

The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.

This is CVE-2016-9385 / XSA-193.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100

8 years agox86/HVM: don't load LDTR with VM86 mode attrs during task switch
Jan Beulich [Tue, 22 Nov 2016 13:28:12 +0000 (14:28 +0100)]
x86/HVM: don't load LDTR with VM86 mode attrs during task switch

Just like TR, LDTR is purely a protected mode facility and hence needs
to be loaded accordingly. Also move its loading to where it
architecturally belongs.

This is CVE-2016-9382 / XSA-192.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93aa42b85ae0084ba7b749d0e990c94fbf0c17e3
master date: 2016-11-22 13:45:44 +0100

8 years agox86/hvm: Fix the handling of non-present segments
Andrew Cooper [Tue, 22 Nov 2016 13:27:40 +0000 (14:27 +0100)]
x86/hvm: Fix the handling of non-present segments

In 32bit, the data segments may be NULL to indicate that the segment is
ineligible for use.  In both 32bit and 64bit, the LDT selector may be NULL to
indicate that the entire LDT is ineligible for use.  However, nothing in Xen
actually checks for this condition when performing other segmentation
checks.  (Note however that limit and writeability checks are correctly
performed).

Neither Intel nor AMD specify the exact behaviour of loading a NULL segment.
Experimentally, AMD zeroes all attributes but leaves the base and limit
unmodified.  Intel zeroes the base, sets the limit to 0xfffffff and resets the
attributes to just .G and .D/B.

The use of the segment information in the VMCB/VMCS is equivalent to a native
pipeline interacting with the segment cache.  The present bit can therefore
have a subtly different meaning, and it is now cooked to uniformly indicate
whether the segment is usable or not.

GDTR and IDTR don't have access rights like the other segments, but for
consistency, they are treated as being present so no special casing is needed
elsewhere in the segmentation logic.

AMD hardware does not consider the present bit for %cs and %tr, and will
function as if they were present.  They are therefore unconditionally set to
present when reading information from the VMCB, to maintain the new meaning of
usability.

Intel hardware has a separate unusable bit in the VMCS segment attributes.
This bit is inverted and stored in the present field, so the hvm code can work
with architecturally-common state.
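
A sketch of that cooking step (the bit position is the VMCS "unusable"
attribute; the structure and function names here are stand-ins rather
than Xen's hvm/vmx/svm code):

    #include <stdbool.h>
    #include <stdint.h>

    struct seg_reg_sketch { uint64_t base; uint32_t limit; bool present; };

    /* Intel: access-rights bit 16 is "segment unusable"; invert it so
     * "present" uniformly means "usable" across VMX and SVM. */
    static void cook_vmx_present(struct seg_reg_sketch *reg, uint32_t ar)
    {
        reg->present = !(ar & (1u << 16));
    }

    /* AMD: %cs and %tr behave as present whatever the stored bit says,
     * so force them to present when reading the VMCB. */
    static void cook_svm_present(struct seg_reg_sketch *reg,
                                 bool is_cs_or_tr, bool vmcb_p_bit)
    {
        reg->present = is_cs_or_tr || vmcb_p_bit;
    }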

This is CVE-2016-9386 / XSA-191.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 04beafa8e6c66f5cd814c00e2d2b51cfbc41cb8a
master date: 2016-11-22 13:44:50 +0100

8 years agox86emul: honor guest CR0.TS and CR0.EM
Jan Beulich [Tue, 4 Oct 2016 13:06:11 +0000 (14:06 +0100)]
x86emul: honor guest CR0.TS and CR0.EM

We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.
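
A rough sketch of the architectural rules being enforced (constants
redefined locally for the sketch; the hook into the emulator's get_fpu()
path is not shown):

    #include <stdbool.h>
    #include <stdint.h>

    #define X86_CR0_EM (1u << 2)
    #define X86_CR0_TS (1u << 3)

    enum fpu_fault { FPU_OK, FPU_RAISE_UD, FPU_RAISE_NM };

    static enum fpu_fault check_fpu_access(uint32_t guest_cr0, bool is_sse)
    {
        /* CR0.EM: x87 insns raise #NM, SSE insns raise #UD. */
        if ( guest_cr0 & X86_CR0_EM )
            return is_sse ? FPU_RAISE_UD : FPU_RAISE_NM;

        /* CR0.TS: lazy FPU switching; raise #NM so the guest kernel
         * can restore the current task's FPU state before retrying. */
        if ( guest_cr0 & X86_CR0_TS )
            return FPU_RAISE_NM;

        return FPU_OK;
    }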

Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.

This is XSA-190.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

8 years agoQEMU_TAG update
Ian Jackson [Tue, 20 Sep 2016 15:52:52 +0000 (16:52 +0100)]
QEMU_TAG update

8 years agoupdate Xen version to 4.5.5 RELEASE-4.5.5
Jan Beulich [Tue, 20 Sep 2016 05:59:24 +0000 (07:59 +0200)]
update Xen version to 4.5.5

One of the qemu tags did get created the wrong way, so we need to skip
4.5.4.

8 years agoupdate Xen version to 4.5.4 RELEASE-4.5.4
Jan Beulich [Mon, 19 Sep 2016 15:50:59 +0000 (17:50 +0200)]
update Xen version to 4.5.4

8 years agoRevert "x86/hvm: Perform a user instruction fetch for a FEP in userspace"
Jan Beulich [Mon, 12 Sep 2016 15:50:13 +0000 (17:50 +0200)]
Revert "x86/hvm: Perform a user instruction fetch for a FEP in userspace"

This reverts commit 95559492c958e45fa7c01b1b3e0fb704e5b8b9eb,
which doesn't build and would, in its current form, use
uninitialized data if it did.

8 years agox86/segment: Bounds check accesses to emulation ctxt->seg_reg[]
Andrew Cooper [Mon, 12 Sep 2016 14:05:09 +0000 (16:05 +0200)]
x86/segment: Bounds check accesses to emulation ctxt->seg_reg[]

HVM HAP codepaths have space for all segment registers in the seg_reg[]
cache (with x86_seg_none still risking an array overrun), while the shadow
codepaths only have space for the user segments.

Range check the input segment of *_get_seg_reg() against the size of the array
used to cache the results, to avoid overruns in the case that the callers
don't filter their input suitably.

Subsume the is_x86_user_segment(seg) checks from the shadow code, which were
an incomplete attempt at range checking, and are now superseded.  Make
hvm_get_seg_reg() static, as it is not used outside of shadow/common.c.

No functional change, but far easier to reason that no overflow is possible.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Tim Deegan <tim@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
xen/x86: Fix build with clang following c/s 4fa0105

https://travis-ci.org/xen-project/xen/jobs/158494027#L2344

Clang complains:

  emulate.c:2016:14: error: comparison of unsigned enum expression < 0
  is always false [-Werror,-Wtautological-compare]
      if ( seg < 0 || seg >= ARRAY_SIZE(hvmemul_ctxt->seg_reg) )
           ~~~ ^ ~

Clang is wrong to raise a warning like this.  The signed-ness of an enum is
implementation defined in C, and robust code must not assume the choices made
by the compiler.

In this case, dropping the < 0 check creates a latent bug which would result
in an array underflow when compiled with a compiler which chooses a signed
enum.

Work around the bug by explicitly pulling seg into an unsigned integer, and
only perform the upper bounds check.
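
A hedged sketch of that workaround, with the types simplified (the real
code operates on enum x86_segment and the hvmemul_ctxt->seg_reg[] cache):

    #include <stddef.h>

    /* Force the (possibly signed) enum value into an unsigned index so
     * that a single upper-bound comparison also rejects "negative"
     * values, keeping clang's tautological-compare warning away. */
    static int *checked_entry(int table[], size_t nr, int seg /* enum */)
    {
        const unsigned int idx = seg;

        return idx >= nr ? NULL : &table[idx];
    }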

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4fa0105d95be6e7145a1f6fd1036ccd43976228c
master date: 2016-09-08 16:39:46 +0100
master commit: 4c47c47938ea24c73d9459f9f0b6923513772b5d
master date: 2016-09-09 15:31:01 +0100

8 years agox86/hvm: Perform a user instruction fetch for a FEP in userspace
Andrew Cooper [Mon, 12 Sep 2016 14:04:35 +0000 (16:04 +0200)]
x86/hvm: Perform a user instruction fetch for a FEP in userspace

This matches hardware behaviour, and prevents erroneous failures when a guest
has SMEP/SMAP active and issues a FEP from userspace.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0831e99446121636045cf6f616a1991d6ef22071
master date: 2016-09-08 16:39:46 +0100

8 years agohvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary
Andrew Cooper [Mon, 12 Sep 2016 14:04:08 +0000 (16:04 +0200)]
hvm/fep: Allow testing of instructions crossing the -1 -> 0 virtual boundary

The Force Emulation Prefix is named to follow its PV counterpart for cpuid or
rdtsc, but isn't really an instruction prefix.  It behaves as a break-out into
Xen, with the purpose of emulating the next instruction in the current state.

It is important to be able to test legal situations which occur in real
hardware, including instructions which cross certain boundaries, and
instructions starting at 0.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5cee79dad24e7006059667b02bd7de685d8ee5
master date: 2016-09-08 16:39:46 +0100

8 years agoVMX: correct feature checks for MPX
Jan Beulich [Mon, 12 Sep 2016 14:03:27 +0000 (16:03 +0200)]
VMX: correct feature checks for MPX

Its VMCS field isn't tied to the respective base CPU feature flag but
instead to a VMX specific one.

Note that while the VMCS GUEST_BNDCFGS field exists if either of the
two respective features is available, MPX continues to get exposed to
guests only with both features present.

Also add the so far missing handling of
- GUEST_BNDCFGS in construct_vmcs()
- MSR_IA32_BNDCFGS in vmx_msr_{read,write}_intercept()
and mirror the extra correctness checks during MSR write to
vmx_load_msr().

Reported-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: "Rockosov, Dmitry" <dmitry.rockosov@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 68eb1a4d92be58e26bd11d02b8e0317bd56294ac
master date: 2016-09-07 12:34:43 +0200

8 years agox86/shadow: Avoid overflowing sh_ctxt->seg_reg[]
Andrew Cooper [Thu, 8 Sep 2016 12:29:05 +0000 (14:29 +0200)]
x86/shadow: Avoid overflowing sh_ctxt->seg_reg[]

hvm_get_seg_reg() does not perform a range check on its input segment, calls
hvm_get_segment_register() and writes straight into sh_ctxt->seg_reg[].

x86_seg_none is outside the bounds of sh_ctxt->seg_reg[], and will hit a BUG()
in {vmx,svm}_get_segment_register().

HVM guests running with shadow paging can end up performing a virtual to
linear translation with x86_seg_none.  This is used for addresses which are
already linear.  However, none of this is a legitimate pagetable update, so
fail the emulation in such a case.

This is XSA-187 / CVE-2016-7094.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: a9f3b3bad17d91e2067fc00d51b0302349570d08
master date: 2016-09-08 14:16:26 +0200

8 years agox86/emulate: Correct boundary interactions of emulated instructions
Andrew Cooper [Thu, 8 Sep 2016 12:28:11 +0000 (14:28 +0200)]
x86/emulate: Correct boundary interactions of emulated instructions

This reverts most of c/s 0640ffb6 "x86emul: fix rIP handling".

Experimentally, in long mode processors will execute an instruction stream
which crosses the 64bit -1 -> 0 virtual boundary, whether the instruction
boundary is aligned on the virtual boundary, or is misaligned.

In compatibility mode, Intel processors will execute an instruction stream
which crosses the 32bit -1 -> 0 virtual boundary, while AMD processors raise a
segmentation fault.  Xen's segmentation behaviour matches AMD.

For 16bit code, hardware does not ever truncate %ip.  %eip is always used and
behaves normally as a 32bit register, including in 16bit protected mode
segments, as well as in Real and Unreal mode.

This is XSA-186 / CVE-2016-7093.

Reported-by: Brian Marcotte <marcotte@panix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e9575f980df81aeb0e5b6139f485fd6f7bb7f5b6
master date: 2016-09-08 14:15:53 +0200

8 years agox86/32on64: don't allow recursive page tables from L3
Jan Beulich [Thu, 8 Sep 2016 12:27:34 +0000 (14:27 +0200)]
x86/32on64: don't allow recursive page tables from L3

L3 entries are special in PAE mode, and hence can't reasonably be used
for setting up recursive (and hence linear) page table mappings. Since
abuse is possible when the guest in fact gets run on 4-level page
tables, this needs to be excluded explicitly.

This is XSA-185 / CVE-2016-7092.

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Reported-by: "栾尚聪(好风)" <shangcong.lsc@alibaba-inc.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c844d637d92a75854ea5c8d4e5ca34302a9f623c
master date: 2016-09-08 14:14:53 +0200

8 years agomemory: fix compat handling of XENMEM_access_op
Jan Beulich [Tue, 6 Sep 2016 10:13:43 +0000 (12:13 +0200)]
memory: fix compat handling of XENMEM_access_op

Within compat_memory_op() this needs to be placed in the first switch()
statement, or it ends up being dead code (as that first switch() has a
default case chaining to compat_arch_memory_op()).
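
A greatly simplified sketch of the structural issue (handler names and
signatures are hypothetical stand-ins; compat_memory_op() itself is far
larger):

    #define MEMOP_ACCESS_OP 1   /* stand-in for XENMEM_access_op */

    long compat_arch_memory_op(unsigned int cmd, void *arg); /* stand-in */
    long handle_access_op(void *arg);                        /* stand-in */

    /* The first switch()'s default already forwards the call, so a
     * case added anywhere after it can never be reached -- the new
     * handling therefore has to live inside this switch(). */
    static long compat_memory_op_sketch(unsigned int cmd, void *arg)
    {
        switch ( cmd )
        {
        case MEMOP_ACCESS_OP:
            return handle_access_op(arg);
        default:
            return compat_arch_memory_op(cmd, arg);
        }
    }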

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8d6af808a7e9d9ae1d129e1e5a0def7f8b2333ee
master date: 2016-09-02 14:19:51 +0200

8 years agocredit1: fix a race when picking initial pCPU for a vCPU
Dario Faggioli [Tue, 6 Sep 2016 10:13:18 +0000 (12:13 +0200)]
credit1: fix a race when picking initial pCPU for a vCPU

In the Credit1 hunk of 9f358ddd69463 ("xen: Have
schedulers revise initial placement") csched_cpu_pick()
is called without taking the runqueue lock of the
(temporary) pCPU that the vCPU has been assigned to
(e.g., in XEN_DOMCTL_max_vcpus).

However, although 'hidden' in the IS_RUNQ_IDLE() macro,
that function does access the runq (for doing load
balancing calculations). Two scenarios are possible:
 1) we are on cpu X, and IS_RUNQ_IDLE() peeks at cpu's
    X own runq;
 2) we are on cpu X, but IS_RUNQ_IDLE() peeks at some
    other cpu's runq.

Scenario 2) absolutely requires that the appropriate
runq lock is taken. Scenario 1) works even without
taking the cpu's own runq lock. That is actually what
happens when _csched_pick_cpu() is called from
csched_vcpu_acct() (in turn, called by csched_tick()).

Races have been observed and reported (by both XenServer
own testing and OSSTest [1]), in the form of
IS_RUNQ_IDLE() falling over LIST_POISON, because we're
not currently holding the proper lock, in
csched_vcpu_insert(), when scenario 1) occurs.

However, for better robustness, from now on we always
ask for the proper runq lock to be held when calling
IS_RUNQ_IDLE() (which is also becoming a static inline
function instead of macro).

In order to comply with that, we take the lock around
the call to _csched_cpu_pick() in csched_vcpu_acct().
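
A sketch of the resulting locking pattern (the schedule-lock helpers and
the pick function are declared here only as stand-ins; argument lists
and context are not Xen's actual credit1 code):

    struct vcpu;                                      /* opaque */
    typedef struct spinlock spinlock_t;
    spinlock_t *vcpu_schedule_lock_irq(struct vcpu *v);
    void vcpu_schedule_unlock_irq(spinlock_t *lock, struct vcpu *v);
    int _csched_cpu_pick(struct vcpu *v);             /* stand-in proto */

    static int pick_cpu_locked(struct vcpu *vc)
    {
        /* IS_RUNQ_IDLE(), reached from _csched_cpu_pick(), reads a
         * runqueue, so that runqueue's lock must be held here. */
        spinlock_t *lock = vcpu_schedule_lock_irq(vc);
        int cpu = _csched_cpu_pick(vc);

        vcpu_schedule_unlock_irq(lock, vc);
        return cpu;
    }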

[1] https://lists.xen.org/archives/html/xen-devel/2016-08/msg02144.html

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 9109bf55084398c4547b8956906410c158eb9a17
master date: 2016-09-02 14:17:55 +0200

8 years agox86/32on64: misc adjustments to call gate emulation
Jan Beulich [Tue, 6 Sep 2016 10:12:49 +0000 (12:12 +0200)]
x86/32on64: misc adjustments to call gate emulation

- There's no 32-bit displacement in 16-bit addressing mode.
- It is wrong to ASSERT() anything on parts of an instruction fetched
  from guest memory.
- The two scaling bits of a SIB byte don't affect whether there is a
  scaled index register or not.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ee1cc4bfdca84d526805c4c72302c026f5e9cd94
master date: 2016-09-01 15:23:46 +0200

8 years agoxen: Remove buggy initial placement algorithm
George Dunlap [Tue, 6 Sep 2016 10:11:53 +0000 (12:11 +0200)]
xen: Remove buggy initial placement algorithm

The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities.  Just get rid of it and rely on the schedulers to do
initial placement.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100

8 years agoxen: Have schedulers revise initial placement
George Dunlap [Tue, 6 Sep 2016 10:11:28 +0000 (12:11 +0200)]
xen: Have schedulers revise initial placement

The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts to do initial
placement load-balancing by placing vcpu 0 on the least-busy
non-primary hyperthread available.  Unfortunately, the logic can end
up picking a pcpu that's not in the online mask.  When this is passed
to a scheduler which assumes that the initial assignment is
valid, it causes a null pointer dereference looking up the runqueue.

Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).

To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement.  This will
automatically take all knowledge the scheduler has into account.

csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken.  Grab and release the lock to minimize time spent with irqs
disabled.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9f358ddd69463fa8fb65cf67beb5f6f0d3350e32
master date: 2016-07-26 10:42:49 +0100

8 years agosched: better handle (not) inserting idle vCPUs in runqueues
Dario Faggioli [Tue, 6 Sep 2016 10:10:40 +0000 (12:10 +0200)]
sched: better handle (not) inserting idle vCPUs in runqueues

Idle vCPUs are set to run immediately, as a part of their
own initialization, so we shouldn't even try to put them
in a runqueue. In fact, no scheduler does that, even when
asked to (that is rather explicit in Credit2 and RTDS, a
bit less evident in Credit1).

Let's make things look as follows:
 - in generic code, explicitly avoid even trying to
   insert idle vCPUs in runqueues;
 - in specific schedulers' code, enforce that.

Note that, as csched_vcpu_insert() is no longer being
called, during boot (from sched_init_vcpu()) we can
safely avoid saving the flags when taking the runqueue
lock.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 6b53bb4ab3c9bd5eccde88a5175cf72589ba6d52
master date: 2015-11-24 14:49:47 +0100

8 years agoxen/physmap: Do not permit a guest to populate PoD pages for itself
Andrew Cooper [Fri, 26 Aug 2016 08:32:01 +0000 (10:32 +0200)]
xen/physmap: Do not permit a guest to populate PoD pages for itself

PoD is supposed to be entirely transparent to guest, but this interface has
been left exposed for a long time.

The use of PoD requires careful co-ordination by the toolstack with the
XENMEM_{get,set}_pod_target hypercalls, and the xenstore ballooning target.
The best a guest can do without toolstack cooperation is crash.

Furthermore, there are combinations of features (e.g. c/s c63868ff "libxl:
disallow PCI device assignment for HVM guest when PoD is enabled") which a
toolstack might wish to explicitly prohibit (in this case, because the two
simply don't function in combination).  In such cases, the guest mustn't be
able to subvert the configuration chosen by the toolstack.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 2a99aa99fc84a45f505f84802af56b006d14c52e
master date: 2016-08-19 18:40:11 +0100

8 years agopage-alloc/x86: don't restrict DMA heap to node 0
Jan Beulich [Fri, 26 Aug 2016 08:31:25 +0000 (10:31 +0200)]
page-alloc/x86: don't restrict DMA heap to node 0

When node zero has no memory, the DMA bit width will end up getting set
to 9, which is obviously not helpful for holding back a reasonable amount
of low enough memory for Dom0 to use for DMA purposes. Find the lowest
node with memory below 4Gb instead.

Introduce arch_get_dma_bitsize() to keep this arch-specific logic out
of common code.

Also adjust the original calculation: I think the subtraction of 1
should have been part of the flsl() argument rather than getting
applied to its result. And while previously the division by 4 was valid
to be done on the flsl() result, this now also needs to be converted,
as it should only be applied to the spanned pages value.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d0d6597d3d682f324b6a79e3278e6f5bb6bad153
master date: 2016-08-11 13:35:50 +0200

8 years agolibxl: return any serial tty path in libxl_console_get_tty
Bob Liu [Thu, 4 Aug 2016 01:07:56 +0000 (09:07 +0800)]
libxl: return any serial tty path in libxl_console_get_tty

When specifying a serial list in domain config, users of
libxl_console_get_tty cannot get the tty path of a second specified pty serial,
since right now it always returns the tty path of serial 0.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 35dbf099ac18924d40533c9d1b9bfbf1ecb818c9)
(cherry picked from commit 822464961ae1bac44dcabb049255d61d5511e368)
(cherry picked from commit e06d2bae53cb1a3542e7269fd35bf3885dd2e244)

8 years agotools/libxc: Properly increment ApicIdCoreSize field on AMD
Boris Ostrovsky [Fri, 22 Jul 2016 17:14:01 +0000 (13:14 -0400)]
tools/libxc: Properly increment ApicIdCoreSize field on AMD

Current code incorrectly adds 1 to full register instead of
incrementing the field in bits 15:12.
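
A hedged sketch of the intended operation (CPUID leaf 0x80000008 ECX,
ApicIdCoreSize in bits 15:12; the helper name is illustrative):

    #include <stdint.h>

    static uint32_t bump_apicid_core_size(uint32_t ecx)
    {
        uint32_t field = (ecx >> 12) & 0xf;           /* bits 15:12 */

        field = (field + 1) & 0xf;                    /* bump the field */
        return (ecx & ~(0xfu << 12)) | (field << 12);
    }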

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit a3336a507519c1d28db3bbff8e439aa3811733f3)
(cherry picked from commit de781b47f447fcdd1c556543fcf0ef1d654d2d4d)
(cherry picked from commit 0e944364767cb198ae81c4326ed24ea3941ad8e7)

8 years agolibxenvchan: Change license of header from Lesser GPL v2.1 to BSD
Konrad Rzeszutek Wilk [Mon, 13 Jun 2016 09:28:57 +0000 (05:28 -0400)]
libxenvchan: Change license of header from Lesser GPL v2.1 to BSD

As the xen/COPYING file says:
"A few files are licensed under both GPL and a weaker BSD-style
license. This includes all files within the subdirectory
include/public, as described in include/public/COPYING. All such files
include the non-GPL license text as a source-code comment. Although
the license text refers generically to "the software", the non-GPL
license applies *only* to those source files that explicitly include
the non-GPL license text."

The libxenvchan.h is under xen/include/public/io directory
and the xen/include/public/COPYING says:

"XEN NOTICE
==========

This copyright applies to all files within this subdirectory and its
subdirectories:
  include/public/*.h
  include/public/hvm/*.h
  include/public/io/*.h

The intention is that these files can be freely copied into the source
tree of an operating system when porting that OS to run on Xen. Doing
so does *not* cause the OS to become subject to the terms of the GPL.

All other files in the Xen source distribution are covered by version
2 of the GNU General Public License except where explicitly stated
otherwise within individual source files.
"
Having libxenvchan.h as Lesser GPL v2.1 where the COPYING file
says otherwise is confusing, to say the least.

Upon consulting with the authors of libxenvchan they said:
"FWIW Neither I, nor ITL staff (as author of original libvchan library)
have anything against converting it to the BSD-style licence."
(Marek Marczykowski-Górecki,
http://lists.xen.org/archives/html/xen-devel/2016-06/msg00995.html)
so as such lets change it.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Anil Madhavapeddy <anil@recoil.org>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Acked-by: George Dunlap <George.Dunlap@eu.citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
Acked-by: Jason Andryuk <andryuk@aero.org>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Acked-by: Matthew Daley <mattjd@gmail.com>
Acked-by: Olaf Hering <olaf@aepfle.de>
Acked-by: Roger Pau Monne <roger.pau@entel.upc.edu>
Acked-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
["I have spoken to my line manager.  I can confirm that Citrix is happy
 with this proposed change.  So:

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
 This view from Citrix covers all contributions made to these files in
 the course of Citrix's employees' employment, which I think is:

 > Cc: Andrew Cooper <andrew.cooper3@citrix.com>
 > cc: George Dunlap <George.Dunlap@eu.citrix.com>
 > Cc: Ian Campbell <ian.campbell@citrix.com>
 > Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>
 > Cc: Roger Pau Monne <roger.pau@entel.upc.edu>
 > Cc: Stefano Stabellini <sstabellini@kernel.org>
 > Cc: Tim Deegan <tim@xen.org>
 > Cc: Wei Liu <wei.liu2@citrix.com>

 ..
 [in subsequent email]:
 Wei points out that this ought also to include Keir Fraser's
 contribution, which was (only) in 2012.
 " (from Ian's email)

 In a subsequent mail, Wei also points out that David Scott's
 contribution is covered by Ian's ack.
]

master commit: 937324f032f4f77866e80e39de0d697fa5131df1

(cherry picked from commit b5e124661c175c88512c81acdeb06992259361b7)

(cherry picked from commit 29e5892cbfd01b65db013a9bda99fa52781eed67)
Conflicts:
xen/include/public/io/libxenvchan.h
The conflict was due to the FSF address still being in this version.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>

8 years agoxl: correct xl cpupool-numa-split with vcpu limited dom0
Juergen Gross [Tue, 14 Jun 2016 04:30:58 +0000 (06:30 +0200)]
xl: correct xl cpupool-numa-split with vcpu limited dom0

When trying to use xl cpupool-numa-split while dom0 is limited to fewer
vcpus than one numa node has, the operation will fail.

Correct this by allowing this configuration.

Reported-by: Glenn Enright <glenn@rimuhosting.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c256d2afc1cad0cca912492e338d6ff97e477c4f)
(cherry picked from commit 78a30103813a7929e922de7414cc56fa2ae52984)
(cherry picked from commit f8972b4223ee76800afa13432403daf248b46805)

8 years agoconfigure: Fix when no libsystemd compat lib are available
Anthony PERARD [Tue, 3 May 2016 15:59:49 +0000 (16:59 +0100)]
configure: Fix when no libsystemd compat lib are available

According to the systemd change log, since version 209 libsystemd.so
contains everything, including libsystemd-daemon.so. Distros may or may
not provide the compatibility libraries which libsystemd-daemon is part of.

So, if libsystemd-daemon is not available, check for the presence of
a recent enough libsystemd.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
[ wei: run autogen.sh ]
Acked-by: Wei Liu <wei.liu2@citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 7dec5b0c658bea9c16a0e3c051e64d2abf57be48)
(cherry picked from commit 2c1122947e6866ad31b6cc33792178d9baa84430)

8 years agoRevert "xen: Have schedulers revise initial placement"
Jan Beulich [Fri, 5 Aug 2016 13:43:54 +0000 (15:43 +0200)]
Revert "xen: Have schedulers revise initial placement"

This reverts commit c421378a8d14c811e5467d535bc71adc0328a316,
as it needs further so far unidentified prereqs.

8 years agoRevert "xen: Remove buggy initial placement algorithm"
Jan Beulich [Fri, 5 Aug 2016 13:42:52 +0000 (15:42 +0200)]
Revert "xen: Remove buggy initial placement algorithm"

This reverts commit 505ad3a8b7fd3b91ab39c829ca6636cd264198c7,
as its prereq needs further so far unidentified prereqs.

8 years agox86/mmcfg: Fix initialisation of variables in pci_mmcfg_nvidia_mcp55()
Andrew Cooper [Fri, 5 Aug 2016 12:08:33 +0000 (14:08 +0200)]
x86/mmcfg: Fix initialisation of variables in pci_mmcfg_nvidia_mcp55()

Shifting into the sign bit of an integer is undefined behaviour.

Only the first integer is actually undefined, but switch all the shifts
for consistency.
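
A minimal illustration of the class of fix (not the mmcfg code itself):

    #include <stdint.h>

    static inline uint32_t mask_bit(unsigned int n)
    {
        /* Use an unsigned constant: `1 << 31` would shift a signed int
         * into its sign bit, which is undefined behaviour. */
        return 1u << n;
    }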

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
master commit: ab8fc3937eeb9332b83d7e14d81e37f0b0ef1841
master date: 2016-08-03 18:46:59 +0100

8 years agoxen: Remove buggy initial placement algorithm
George Dunlap [Fri, 5 Aug 2016 12:07:57 +0000 (14:07 +0200)]
xen: Remove buggy initial placement algorithm

The initial placement algorithm sometimes picks cpus outside of the
mask it's given, does a lot of unnecessary bitmasking, does its own
separate load calculation, and completely ignores vcpu hard and soft
affinities.  Just get rid of it and rely on the schedulers to do
initial placement.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5438accceecc8172db2d37d98b695eb8bc43afc
master date: 2016-07-26 10:44:06 +0100

8 years agoxen: Have schedulers revise initial placement
George Dunlap [Fri, 5 Aug 2016 12:07:27 +0000 (14:07 +0200)]
xen: Have schedulers revise initial placement

The generic domain creation logic in
xen/common/domctl.c:default_vcpu0_location() attempts to do initial
placement load-balancing by placing vcpu 0 on the least-busy
non-primary hyperthread available.  Unfortunately, the logic can end
up picking a pcpu that's not in the online mask.  When this is passed
to a scheduler which assumes that the initial assignment is
valid, it causes a null pointer dereference looking up the runqueue.

Furthermore, this initial placement doesn't take into account hard or
soft affinity, or any scheduler-specific knowledge (such as historic
runqueue load, as in credit2).

To solve this, when inserting a vcpu, always call the per-scheduler
"pick" function to revise the initial placement.  This will
automatically take all knowledge the scheduler has into account.

csched2_cpu_pick ASSERTs that the vcpu's pcpu scheduler lock has been
taken.  Grab and release the lock to minimize time spent with irqs
disabled.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
master commit: 9f358ddd69463fa8fb65cf67beb5f6f0d3350e32
master date: 2016-07-26 10:42:49 +0100

8 years agonested vmx: Validate host VMX MSRs before accessing them
Euan Harris [Fri, 5 Aug 2016 11:50:47 +0000 (13:50 +0200)]
nested vmx: Validate host VMX MSRs before accessing them

Some VMX MSRs may not exist on certain processor models, or may
be disabled because of configuration settings.   It is only safe to
access these MSRs if configuration flags in other MSRs are set.  These
prerequisites are listed in the Intel 64 and IA-32 Architectures
Software Developer’s Manual, Vol 3, Appendix A.

nvmx_msr_read_intercept() does not check the prerequisites before
accessing MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
MSR_IA32_VMX_VMFUNC on the host.   Accessing these MSRs from a nested
VMX guest running on a host which does not support them will cause
Xen to crash with a GPF.
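
A hedged sketch of the kind of prerequisite check involved (the MSR and
bit layout follow the SDM's description of IA32_VMX_PROCBASED_CTLS; the
helper itself is illustrative, not nvmx_msr_read_intercept()):

    #include <stdbool.h>
    #include <stdint.h>

    /* IA32_VMX_PROCBASED_CTLS: bits 63:32 are the allowed-1 settings.
     * "Activate secondary controls" is control bit 31, so allowed-1
     * bit 63 indicates whether IA32_VMX_PROCBASED_CTLS2 exists and is
     * safe to read at all. */
    static bool secondary_ctls_available(uint64_t procbased_ctls_msr)
    {
        return procbased_ctls_msr & (1ull << 63);
    }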

Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5e02972646132ad98c365ebfcfcb43b40a0dde36
master date: 2016-06-13 12:44:32 +0100