xenbits.xensource.com Git - xen.git/log
Jonathan Davies [Thu, 23 Mar 2017 17:40:35 +0000 (17:40 +0000)]
oxenstored: keep track of each transaction's operations

A list of (request, response) pairs from the operations performed within the
transaction will be useful to support transaction replay.

Since this consumes memory, the number of requests per transaction must not be
left unbounded. Hence a new quota for this is introduced. This quota, configured
via the configuration key 'quota-maxrequests', limits the size of transactions
initiated by domUs.

After the maximum number of requests has been exhausted, any further requests
will result in EQUOTA errors. The client may then choose to end the transaction;
a successful commit will result in the retention of only the prior requests.
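
A minimal Python sketch of the per-transaction request quota described above. It is not the oxenstored (OCaml) code; the class and exception names are hypothetical, and `max_requests` stands in for the 'quota-maxrequests' setting.

```python
class QuotaExceeded(Exception):
    """Stands in for the EQUOTA error returned to the client."""


class Transaction:
    def __init__(self, max_requests):
        self.max_requests = max_requests
        self.operations = []  # (request, response) pairs, oldest first

    def record(self, request, response):
        # Refuse to grow the history beyond the configured quota.
        if len(self.operations) >= self.max_requests:
            raise QuotaExceeded("EQUOTA")
        self.operations.append((request, response))


txn = Transaction(max_requests=2)
txn.record("write /a 1", "OK")
txn.record("write /b 2", "OK")
try:
    txn.record("write /c 3", "OK")
    rejected = False
except QuotaExceeded:
    rejected = True
```

After the rejection, the transaction still holds only the two earlier requests, which is what a later commit would retain.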

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:40:27 +0000 (17:40 +0000)]
oxenstored: refactor request processing

Encapsulate the request in a record that is passed from do_input to
process_packet and input_handle_error.

This will be helpful when keeping track of the requests made as part of a
transaction.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:40:08 +0000 (17:40 +0000)]
oxenstored: remove some unused parameters

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
Jonathan Davies [Thu, 23 Mar 2017 17:40:00 +0000 (17:40 +0000)]
oxenstored: refactor putting response on wire

Previously, the functions reply_{ack,data,data_or_ack} and input_handle_error
put the response on the wire by invoking Connection.send_{ack,reply,error}.

Instead, these functions now return a value indicating what needs to be put on
the wire, and that action is done by a send_response function called
afterwards.

This refactoring gives us a chance to store the value of the response, useful
for replaying transactions.
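
The shape of this refactoring can be sketched in Python as follows. This is an illustration only, not the OCaml code: the `Ack`, `Reply` and `Error` types, the toy request syntax and the `handle` helper are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ack:
    pass


@dataclass(frozen=True)
class Reply:
    data: str


@dataclass(frozen=True)
class Error:
    name: str


def process(store, request):
    """Compute the response as a value; nothing is sent from here."""
    op, _, path = request.partition(" ")
    if op == "read":
        return Reply(store[path]) if path in store else Error("ENOENT")
    if op == "rm":
        store.pop(path, None)
        return Ack()
    return Error("EINVAL")


def send_response(wire, response):
    """The only place that puts a response on the (simulated) wire."""
    wire.append(response)


def handle(store, wire, history, request):
    response = process(store, request)
    history.append((request, response))  # kept for a possible replay
    send_response(wire, response)
```

Because `process` returns a value instead of sending it, the caller gets a natural point at which to store the (request, response) pair.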

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
Zheng Li [Fri, 24 Mar 2017 17:04:23 +0000 (17:04 +0000)]
oxenstored: add a safe net mechanism for existing ill-behaved clients

In the previous commit, we moved from exhaustively scanning all domain connections
to only processing those that have correctly notified us by events. The benefits
are not only efficiency but also correctness, because an ill-behaved client can
now end up blocked, waiting on its own mistake. Anyone who makes this mistake
while developing a piece of code will notice the problem immediately (the
process blocks) and can fix it right away. Note that such mistakes are rare in
practice, because most client code uses the libxenstore library (which has all
the notification logic built in correctly) instead of implementing raw ring
access from scratch.

On the other hand, we did notice that there was some legacy code that didn't do
the notification correctly. As some of this code might still be running in the
wild, it would be bad if this change broke it (e.g. after an upgrade). This patch
introduces a safety net mechanism to ensure ill-behaved clients continue to work,
while still retaining most of the performance benefits.

  * We add a checker to still scan all the rings periodically, so that we can
    still pick up these messages at an acceptable frequency.

  * Internally, we introduce an io_credit concept for domain connections. It
    represents the number of ring-scan rounds we are going to perform on a
    domain connection. For well-behaved connections, this value alternates
    between 0 and 1; but for connections detected as ill-behaved, we bump the
    credit to a high value so that we unconditionally scan the ring for the
    next n rounds. This way, the client won't be held up by the interval
    between checker runs (especially during periods when it continuously
    interacts with oxenstored); and oxenstored doesn't have to keep scanning
    these rings indefinitely (the credit runs out), as they are usually quiet
    most of the time.

  * We log a message when a domain connection is suspected to be ill-behaved.
    Enable [info] level logging if you want/need to see it in action. Note that
    this information won't be accurate, as false positives are possible due to
    timing windows (e.g. we detect that a client has written to the ring and we
    have had no notification from it so far, but the notification could still
    arrive some time later). There is no harm in giving a domain connection
    extra credit, though.
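
The io_credit idea in the list above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the oxenstored implementation: the boost value `BOOST`, the class and the function names are hypothetical, and a ring is modelled as a plain list.

```python
BOOST = 5  # hypothetical credit granted to a suspected ill-behaved client


class DomainConnection:
    def __init__(self, domid):
        self.domid = domid
        self.io_credit = 0
        self.ring = []  # pending requests written by the client


def on_event(conn):
    # A well-behaved client notifies us, which earns it one scan round.
    conn.io_credit = max(conn.io_credit, 1)


def periodic_check(conns):
    # Safety net: data in the ring but no credit means nobody notified us.
    for conn in conns:
        if conn.ring and conn.io_credit == 0:
            conn.io_credit = BOOST


def scan_round(conns):
    processed = []
    for conn in conns:
        if conn.io_credit > 0:
            conn.io_credit -= 1
            while conn.ring:
                processed.append((conn.domid, conn.ring.pop(0)))
    return processed
```

A boosted connection keeps being scanned for the next few rounds even without events, and drops back to event-driven handling once its credit runs out.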

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Zheng Li <dev@zheng.li>
Reviewed-by: David Scott <dave.scott@citrix.com>
Zheng Li [Fri, 24 Mar 2017 17:03:31 +0000 (17:03 +0000)]
oxenstored: only process domain connections that notify us by events

Currently, upon receiving an event, oxenstored will always scan/process all
the domain connections (xs rings), regardless of which domain sent that event.
This is rather costly and inefficient. It also masks, and so tolerates, clients
that do not correctly communicate with us about message/space availability.

With this patch, oxenstored will only scan/process the domain connections
that have correctly notified us by events or have IO actions left over from
previous communication.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Zheng Li <dev@zheng.li>
Reviewed-by: David Scott <dave.scott@citrix.com>
Zheng Li [Fri, 24 Mar 2017 17:02:55 +0000 (17:02 +0000)]
oxenstored: enable domain connection indexing based on eventchn port

Currently, in the xenstore connection database, we use a hash table of
(domid -> connection) to store domain connections. This allows fast indexing
based on domids.

This patch adds another dimension of fast indexing, based on the eventchn
port number. This is useful when doing selective connection processing
based on the port numbers of incoming events.
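
A small Python sketch of a connection database with two indexes, one by domid and one by event channel port. The class and method names are hypothetical and connections are plain dicts; the real code uses OCaml hash tables.

```python
class DomainConnections:
    def __init__(self):
        self.by_domid = {}  # domid -> connection
        self.by_port = {}   # eventchn port -> connection

    def add(self, domid, port):
        conn = {"domid": domid, "port": port}
        self.by_domid[domid] = conn
        self.by_port[port] = conn
        return conn

    def remove(self, domid):
        # Both indexes must be updated together to stay consistent.
        conn = self.by_domid.pop(domid)
        del self.by_port[conn["port"]]

    def for_events(self, ports):
        # Select only the connections whose ports actually fired.
        return [self.by_port[p] for p in ports if p in self.by_port]
```

With the second index, mapping a batch of incoming event ports to connections costs one lookup per port rather than a scan of every domain.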

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Zheng Li <dev@zheng.li>
Reviewed-by: David Scott <dave.scott@citrix.com>
Zheng Li [Fri, 24 Mar 2017 17:02:08 +0000 (17:02 +0000)]
oxenstored: use hash table to store socket connections

Currently we use a list to store socket connections. This is fine for a small
number of connections. But when we scale up, traversing a list of hundreds or
thousands of connections just to find a single one of them is very
inefficient.

This patch replaces the list with a (Unix.file_descr -> Connection.t) hash table.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Zheng Li <dev@zheng.li>
Reviewed-by: David Scott <dave.scott@citrix.com>
Zheng Li [Fri, 24 Mar 2017 17:01:08 +0000 (17:01 +0000)]
oxenstored: catch the error when a connection is already deleted

The function process_fdset_with is called on the read set connections first.
During processing, it might destroy a connection and remove it from the
connections database if some error occurs. However, a reference to the same
connection might still exist in the write set, which is waiting to be
processed next. In this case, a Not_found error will be raised and the
processing is aborted.

This patch changes the logic to ignore connections that are missing from the
connection database and continue with the rest of the work.
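
The behaviour described above can be sketched in Python like this. The function name `process_fdset` and the handlers are hypothetical stand-ins; the point is only that a descriptor whose connection has already been removed is skipped instead of aborting the loop.

```python
def process_fdset(connections, fds, handler):
    handled = []
    for fd in fds:
        conn = connections.get(fd)
        if conn is None:
            # Already deleted while an earlier set was processed: skip it.
            continue
        handler(connections, fd, conn)
        handled.append(fd)
    return handled


def failing_read(connections, fd, conn):
    # Simulate an error on a broken connection that destroys it.
    if conn == "broken":
        del connections[fd]


def write(connections, fd, conn):
    pass


conns = {3: "broken", 4: "ok"}
read_handled = process_fdset(conns, [3, 4], failing_read)
write_handled = process_fdset(conns, [3, 4], write)
```

Here descriptor 3 is destroyed during the read pass, so the write pass handles only descriptor 4 and still completes.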

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Zheng Li <dev@zheng.li>
Reviewed-by: David Scott <dave.scott@citrix.com>
Jerome Maloberti [Fri, 24 Mar 2017 16:57:40 +0000 (16:57 +0000)]
oxenstored: perform a 3-way merge of the quota after a transaction

At the beginning of a transaction, the quotas from the global store
are duplicated and modified by the transaction. If, during the
transaction, an action that is not part of any transaction is executed
concurrently, the quotas of the global store are updated, and those
updates are then lost when the transaction merges.

We fix this problem by keeping another copy of the quota at the
beginning of the transaction, and performing a 3-way merge between
the quotas from the transaction and the "original" copy of the quota
onto the quota of the global store.
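
A worked Python sketch of such a 3-way merge, assuming a quota is just a mapping from domid to entry count (the function name and the data layout are illustrative, not the OCaml ones). The transaction's delta relative to the original copy is applied on top of the current global quota.

```python
def merge_quota(original, modified, current):
    """Apply (modified - original) on top of current, per domain."""
    merged = dict(current)
    for domid in set(original) | set(modified):
        delta = modified.get(domid, 0) - original.get(domid, 0)
        merged[domid] = merged.get(domid, 0) + delta
    return merged


global_quota = {1: 10}
original = dict(global_quota)     # snapshot taken at transaction start
txn_quota = dict(global_quota)    # working copy used by the transaction

txn_quota[1] += 2                 # the transaction creates 2 nodes
global_quota[1] += 3              # a non-transactional write adds 3 nodes

naive = dict(txn_quota)           # overwriting loses the concurrent update
merged = merge_quota(original, txn_quota, global_quota)
```

Overwriting the global quota with the transaction's copy would give 12 entries for domain 1, whereas the merge gives 15, keeping both sets of updates.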

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jerome Maloberti <jerome.maloberti@citrix.com>
Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave.scott@citrix.com>
Vincent Bernardoff [Fri, 24 Mar 2017 16:57:02 +0000 (16:57 +0000)]
oxenstored: exempt dom0 from domU node quotas

If a domU has exhausted its quota we still want the toolstack in dom0 to
be able to create new nodes in places like
  /local/domain/%d/control/shutdown

Without this patch, a domU which has exhausted its quota can only be
powered off, which is not as good as being able to request a clean
shutdown.
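
As a rough Python illustration of the exemption (the function and its parameters are hypothetical, and the real check lives in the OCaml store code): the entry-count limit is enforced against the node owner unless the request comes from dom0.

```python
DOM0 = 0


def may_create_node(requester_domid, owner_entries, max_entries):
    # Requests from dom0 bypass the per-domain entry limit.
    if requester_domid == DOM0:
        return True
    return owner_entries < max_entries
```

So the toolstack can still write a shutdown request into a guest's directory even when that guest has used up its own quota.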

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Signed-off-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave.scott@citrix.com>
Ian Jackson [Sat, 18 Mar 2017 17:13:27 +0000 (17:13 +0000)]
xenstored: Log when the write transaction rate limit bites

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
plus:

xenstore: dont increment bool variable
Instead of incrementing a bool variable just set it to true.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson [Sat, 18 Mar 2017 17:12:39 +0000 (17:12 +0000)]
xenstored: apply a write transaction rate limit

This avoids a rogue client being able to stall another client (eg the
toolstack) indefinitely.

This is XSA-206.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Jan Beulich [Tue, 4 Apr 2017 13:06:29 +0000 (15:06 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling

The use of guest_handle_okay() here (as introduced by the XSA-29 fix)
is insufficient; guest_handle_subrange_okay() needs to be used
instead.

Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
  16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
  number-of-entries input (leaving aside the fact that this can be
  called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
  index zero.

This is CVE-2017-7228 / XSA-212.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 938fd2586eb081bcbd694f4c1f09ae6a263b0d90
master date: 2017-04-04 14:47:46 +0200

Ian Jackson [Tue, 21 Mar 2017 18:45:09 +0000 (18:45 +0000)]
QEMU_TAG update

Ian Jackson [Wed, 22 Feb 2017 16:46:48 +0000 (16:46 +0000)]
QEMU_TAG update

Oleksandr Tyshchenko [Wed, 15 Feb 2017 12:21:10 +0000 (12:21 +0000)]
IOMMU: always call teardown callback

It is possible for (d)->need_iommu to remain unset during guest domain
execution, for example when no devices were assigned to the domain. Because
the teardown callback is not called when (d)->need_iommu is unset, we might
be left with unreleased resources after destroying the domain.

So, always call teardown callback to roll back actions
that were performed in init callback.

This is XSA-207.

Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Wed, 21 Dec 2016 16:48:43 +0000 (17:48 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests

Guest kernels modifying instructions in the process of being emulated
for another of their vCPU-s may cause EFLAGS.IF to be cleared upon
next exiting to guest context, by converting the instruction being
emulated to CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.
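
The flag manipulation described above amounts to a mask operation, sketched here in Python. The bit positions are the architectural EFLAGS ones; the function name is hypothetical, and the actual fix is made in Xen's exit-to-guest path, not in code like this.

```python
X86_EFLAGS_IF = 1 << 9     # interrupt enable flag
X86_EFLAGS_IOPL = 3 << 12  # I/O privilege level (two bits)
X86_EFLAGS_NT = 1 << 14    # nested task
X86_EFLAGS_VM = 1 << 17    # virtual-8086 mode


def sanitise_eflags_for_pv_exit(eflags):
    # Clear IOPL, NT and VM, then force IF on.
    eflags &= ~(X86_EFLAGS_IOPL | X86_EFLAGS_NT | X86_EFLAGS_VM)
    return eflags | X86_EFLAGS_IF
```

For example, a value with IF clear and IOPL, NT and VM all set comes back with only IF (and any unrelated bits) set.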

This is CVE-2016-10024 / XSA-202.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100

Andrew Cooper [Sun, 18 Dec 2016 15:42:59 +0000 (15:42 +0000)]
x86/emul: Correct the handling of eflags with SYSCALL

A singlestep #DB is determined by the resulting eflags value from the
execution of SYSCALL, not the original eflags value.

By using the original eflags value, we negate the guest kernel's attempt to
protect itself from a privilege escalation by masking TF.

Introduce a tf boolean and have the SYSCALL emulation recalculate it
after the instruction is complete.
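
A simplified Python sketch of the recalculation, covering only the flag masking part of SYSCALL. The function name is hypothetical, and `sfmask` stands for the flag mask the guest kernel has programmed for SYSCALL.

```python
X86_EFLAGS_TF = 1 << 8  # trap flag (single-step)


def syscall_flags(eflags, sfmask):
    # SYSCALL clears every flag that is set in the mask.
    new_eflags = eflags & ~sfmask
    # Single-stepping is decided by the resulting flags, not the old ones.
    singlestep = bool(new_eflags & X86_EFLAGS_TF)
    return new_eflags, singlestep
```

If the caller had TF set but the kernel masks TF, no single-step trap should be raised after the instruction.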

This is XSA-204

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 13 Dec 2016 13:30:14 +0000 (14:30 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix

Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).

This is CVE-2016-9932 / XSA-200.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Ian Jackson [Wed, 7 Dec 2016 16:54:47 +0000 (16:54 +0000)]
QEMU_TAG update

Ian Jackson [Tue, 29 Nov 2016 18:40:26 +0000 (18:40 +0000)]
QEMU_TAG update

Ian Jackson [Tue, 22 Nov 2016 13:35:31 +0000 (14:35 +0100)]
pygrub: Properly quote results, when returning them to the caller:

* When the caller wants sexpr output, use `repr()'
  This is what Xend expects.

  The returned S-expressions are now escaped and quoted by Python,
  generally using '...'.  Previously kernel and ramdisk were unquoted
  and args was quoted with "..." but without proper escaping.  This
  change may break toolstacks which do not properly dequote the
  returned S-expressions.

* When the caller wants "simple" output, crash if the delimiter is
  contained in the returned value.

  With --output-format=simple it does not seem like this could ever
  happen, because the bootloader config parsers all take line-based
  input from the various bootloader config files.

  With --output-format=simple0, this can happen if the bootloader
  config file contains nul bytes.
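
The two output rules above can be sketched in Python as follows. This is not the pygrub source: the helper names and the exact field layout are assumptions, and only the use of `repr()` and the delimiter check come from the description.

```python
def sexpr_field(name, value):
    # repr() quotes the value and escapes anything awkward inside it.
    return "(%s %s)" % (name, repr(value))


def format_simple(fields, delimiter):
    out = ""
    for name, value in fields:
        # Refuse to emit a value that would be confused with the delimiter.
        if delimiter in value:
            raise ValueError("delimiter found in %s value" % name)
        out += "%s %s%s" % (name, value, delimiter)
    return out
```

With these helpers, `sexpr_field("kernel", "/boot/vmlinuz")` yields a single-quoted value, while `format_simple` raises if a value contains the delimiter, for instance a nul byte in the simple0 case.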

This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100

Jan Beulich [Tue, 22 Nov 2016 13:34:56 +0000 (14:34 +0100)]
x86emul: fix huge bit offset handling

We must never chop off the high 32 bits.

This is CVE-2016-9383 / XSA-195.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100

Jan Beulich [Tue, 22 Nov 2016 13:34:25 +0000 (14:34 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses

Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.

Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, otoh, already get properly checked in set_debugreg() (albeit
we discard the error there).

The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.
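
For reference, a canonical-address test can be sketched in Python as below. It assumes 48-bit virtual addresses (4-level paging); the function name is hypothetical and this is not Xen's own helper.

```python
def is_canonical(addr, va_bits=48):
    addr &= (1 << 64) - 1
    # Bits 63..(va_bits - 1) must be all zeros or all ones.
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones
```

A base value that fails this test would fault with #GP when written to the FS/GS base registers, so it is cheaper to reject it up front.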

This is CVE-2016-9385 / XSA-193.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100

Jan Beulich [Tue, 22 Nov 2016 13:33:54 +0000 (14:33 +0100)]
x86/HVM: don't load LDTR with VM86 mode attrs during task switch

Just like TR, LDTR is purely a protected mode facility and hence needs
to be loaded accordingly. Also move its loading to where it
architecturally belongs.

This is CVE-2016-9382 / XSA-192.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93aa42b85ae0084ba7b749d0e990c94fbf0c17e3
master date: 2016-11-22 13:45:44 +0100

Andrew Cooper [Tue, 22 Nov 2016 13:32:39 +0000 (14:32 +0100)]
x86/hvm: Fix the handling of non-present segments

In 32bit, the data segments may be NULL to indicate that the segment is
ineligible for use.  In both 32bit and 64bit, the LDT selector may be NULL to
indicate that the entire LDT is ineligible for use.  However, nothing in Xen
actually checks for this condition when performing other segmentation
checks.  (Note however that limit and writeability checks are correctly
performed).

Neither Intel nor AMD specify the exact behaviour of loading a NULL segment.
Experimentally, AMD zeroes all attributes but leaves the base and limit
unmodified.  Intel zeroes the base, sets the limit to 0xfffffff and resets the
attributes to just .G and .D/B.

The use of the segment information in the VMCB/VMCS is equivalent to a native
pipeline interacting with the segment cache.  The present bit can therefore
have a subtly different meaning, and it is now cooked to uniformly indicate
whether the segment is usable or not.

GDTR and IDTR don't have access rights like the other segments, but for
consistency, they are treated as being present so no special casing is needed
elsewhere in the segmentation logic.

AMD hardware does not consider the present bit for %cs and %tr, and will
function as if they were present.  They are therefore unconditionally set to
present when reading information from the VMCB, to maintain the new meaning of
usability.

Intel hardware has a separate unusable bit in the VMCS segment attributes.
This bit is inverted and stored in the present field, so the hvm code can work
with architecturally-common state.
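
The "cooking" of the present bit described in the paragraphs above can be sketched in Python as follows. This is a simplified model, not the Xen code: the function and the string register names are hypothetical, and the bit positions used (bit 16 for the VMCS unusable flag, bit 7 for the VMCB present flag) are assumptions about the raw attribute layouts.

```python
VMCS_AR_UNUSABLE = 1 << 16  # assumed position of the VMCS "unusable" flag
VMCB_ATTR_P = 1 << 7        # assumed position of the VMCB present flag


def cooked_present(vendor, seg, raw_attr):
    """Return True if the segment should be treated as usable."""
    if seg in ("gdtr", "idtr"):
        # Table registers have no access rights; treat them as present.
        return True
    if vendor == "intel":
        # The unusable flag is inverted to form the present bit.
        return not (raw_attr & VMCS_AR_UNUSABLE)
    if vendor == "amd":
        # Hardware ignores the present bit for %cs and %tr.
        return seg in ("cs", "tr") or bool(raw_attr & VMCB_ATTR_P)
    raise ValueError("unknown vendor")
```

Code that consumes the cooked value can then test one bit to decide usability, regardless of vendor.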

This is CVE-2016-9386 / XSA-191.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 04beafa8e6c66f5cd814c00e2d2b51cfbc41cb8a
master date: 2016-11-22 13:44:50 +0100

Jan Beulich [Tue, 4 Oct 2016 13:06:16 +0000 (14:06 +0100)]
x86emul: honor guest CR0.TS and CR0.EM

We must not emulate any instructions accessing respective registers
when either of these flags is set in the guest view of the register, or
else we may do so on data not belonging to the guest's current task.

Being architecturally required behavior, the logic gets placed in the
instruction emulator instead of hvmemul_get_fpu(). It should be noted,
though, that hvmemul_get_fpu() being the only current handler for the
get_fpu() callback, we don't have an active problem with CR4: Both
CR4.OSFXSR and CR4.OSXSAVE get handled as necessary by that function.

This is XSA-190.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 8 Sep 2016 12:32:51 +0000 (14:32 +0200)]
evtchn-fifo: prevent use after free

evtchn_fifo_init_control() calls evtchn_fifo_destroy() on an error
path, leading to cleanup_event_array() which frees d->evtchn_fifo
without also clearing the pointer. Otoh the bulk of
evtchn_fifo_init_control() is dependent on d->evtchn_fifo being NULL.

This is XSA-188 / CVE-2016-7154.

Reported-by: Mikhail V Gorobets <mikhail.v.gorobets@intel.com>
Suggested-by: Mikhail V Gorobets <mikhail.v.gorobets@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Thu, 8 Sep 2016 12:32:16 +0000 (14:32 +0200)]
x86/shadow: Avoid overflowing sh_ctxt->seg_reg[]

hvm_get_seg_reg() does not perform a range check on its input segment, calls
hvm_get_segment_register() and writes straight into sh_ctxt->seg_reg[].

x86_seg_none is outside the bounds of sh_ctxt->seg_reg[], and will hit a BUG()
in {vmx,svm}_get_segment_register().

HVM guests running with shadow paging can end up performing a virtual to
linear translation with x86_seg_none.  This is used for addresses which are
already linear.  However, none of this is a legitimate pagetable update, so
fail the emulation in such a case.

This is XSA-187 / CVE-2016-7094.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: a9f3b3bad17d91e2067fc00d51b0302349570d08
master date: 2016-09-08 14:16:26 +0200

Jan Beulich [Thu, 8 Sep 2016 12:30:11 +0000 (14:30 +0200)]
x86/32on64: don't allow recursive page tables from L3

L3 entries are special in PAE mode, and hence can't reasonably be used
for setting up recursive (and hence linear) page table mappings. Since
abuse is possible when the guest in fact gets run on 4-level page
tables, this needs to be excluded explicitly.

This is XSA-185 / CVE-2016-7092.

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Reported-by: "栾尚聪(好风)" <shangcong.lsc@alibaba-inc.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c844d637d92a75854ea5c8d4e5ca34302a9f623c
master date: 2016-09-08 14:14:53 +0200

Andrew Cooper [Mon, 11 Jul 2016 13:32:03 +0000 (14:32 +0100)]
x86/pv: Remove unsafe bits from the mod_l?_entry() fastpath

All changes in writeability and cacheability must go through full
re-validation.

Rework the logic as a whitelist, to make it clearer to follow.

This is XSA-182

Reported-by: Jérémie Boutoille <jboutoille@ext.quarkslab.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Ian Jackson [Fri, 10 Jun 2016 12:11:08 +0000 (13:11 +0100)]
QEMU_TAG update

Wei Liu [Thu, 26 May 2016 15:11:42 +0000 (16:11 +0100)]
libxl: set XEN_QEMU_CONSOLE_LIMIT for QEMU

XSA-180 provides a patch to QEMU to bodge around a QEMU logging issue. We
explicitly set the limit in libxl for 4.7.

Introduce a function for setting the environment variable and call it in
the right places.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Release-acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit b0d409d9c4944ed29d29457fea4ad6b41d698eca)

(cherry picked from commit fe82a96a657f36b08ade60ec4f3a53e67a4ee314)
Conflicts:
tools/libxl/libxl_dm.c

This version of libxl does not pass a dm_envs to the
*build_device_model_args* functions.  Instead, call
libxl__set_qemu_env_for_xsa_180 in libxl__spawn_local_dm.

The other ultimate call site of *build_device_model_args* (ie of
libxl__build_device_model_args) is in libxl__spawn_stub_dm, where we
don't need to set the env var.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 44a703dfbe3cb6e6ec6d2699b631807137dfdc1e)

Ian Jackson [Wed, 8 Jun 2016 14:42:19 +0000 (15:42 +0100)]
libxl: Fix NULL pointer due to XSA-178 fix wrong XS nodename

In "libxl: Do not trust backend for disk eject vdev" (c69871a2fb26 on
xen.git#staging) we changed libxl_evenable_disk_eject to read the
device vdev out of xenstore from the /libxl path, rather than the
backend path, and to read it during setup rather than on each event.

However, the patch has a mistake:
    -        GCSPRINTF("%s/dev", backend), NULL);
    +        GCSPRINTF("%s/vdev", libxl_path), &configured_vdev);
                           ^
Spot the extra "v".  This causes configured_vdev always to be NULL.
configured_vdev is passed to [libxl__]strdup.

In Xen 4.6 and later libxl__strdup is used and tolerates NULL.
evg->vdev is set to NULL.  This propagates to the `vdev' field in the
generated event.  This may or may not cause further trouble, depending
on the calling application.  In our osstest test cases it does not
cause any trouble, so the bug goes undetected.

In Xen 4.5 and earlier, the strdup does not tolerate NULL, and libxl
crashes immediately.  This has been detected by osstest as a
regression in Xen 4.5.

IMO this patch should be applied immediately to
  xen.git#staging-4.5 (to check that it fixes the osstest regression)
  xen.git#staging     (to check that it does not break master)

Subject to passes, it should then be propagated to all supported
stable trees and also be mentioned in an update to XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
CC: security@xenproject.org
CC: Jan Beulich <jbeulich@suse.com>
CC: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 62b4d4769ca39fd5263da20d786a7b9a80a22d9a)
(cherry picked from commit 8b7a356409023f60f80e9f4b00bba16ad56cd77b)
(cherry picked from commit 6d272988fb3bdcc2578c5ee6d8a408628d1f0c0e)

Ian Jackson [Wed, 4 May 2016 14:17:45 +0000 (15:17 +0100)]
libxl: Document ~/serial/ correctly

xenstore-paths.markdown talked about ~/device/serial/, but that's not
used.

(It is very wrong for this value, which contains a driver domain
filesystem path, to be in the guest's area of xenstore.  However, it
is only ever created by libxl and read by xenconsoled.  When it is
created, it inherits the read-only permissions of /local/domain/DOMID.
So there is no security bug.)

This is a followup to XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Fri, 29 Apr 2016 15:08:19 +0000 (16:08 +0100)]
libxl: Cleanup: use libxl__backendpath_parse_domid in libxl__device_disk_from_xs_be

Rather than an open-coded sscanf.  No functional change with correct
input.
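
For illustration, the kind of parsing such a helper performs can be sketched in Python. The function name is a hypothetical analogue, not the libxl C function, and the assumed path shape is /local/domain/<domid>/... as used for backend directories.

```python
import re

_BACKEND_PREFIX = re.compile(r"/local/domain/(\d+)/")


def parse_backend_domid(be_path):
    """Extract the backend domid from a backend xenstore path."""
    match = _BACKEND_PREFIX.match(be_path)
    if match is None:
        raise ValueError("not a backend path: %r" % be_path)
    return int(match.group(1))
```

Keeping this in one helper means every caller rejects malformed paths in the same way, instead of each one carrying its own scanning code.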

This is a followup to XSA-175 and XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 14:25:19 +0000 (15:25 +0100)]
libxl: Cleanup: Have libxl__alloc_vdev use /libxl

When allocating a vdev for a new disk, look in /libxl/device, rather
than the frontends directory in xenstore.

This is more in line with the other parts of libxl, which ought not to
trust frontends.  In this case, though, there is no security bug prior
to this patch because the frontend is the toolstack domain itself.

If libxl__alloc_vdev were ever changed to take a frontend domain
argument, this patch will fix a latent security bug.

This is a followup to XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 4 May 2016 15:23:57 +0000 (16:23 +0100)]
libxl: Do not trust backend for nic in list

libxl_device_nic_list should use the /libxl path to search for
devices, and for obtaining the device information.

The "type" parameter was always "vif".  Abolish it.  (In any case,
paths in /libxl/device are named after the frontend type which is
constant, not the backend type which might in future vary.)

Abolish a redundant store to pnic->backend_domid.  Before this commit,
that store was not needed because libxl_device_nic_init (called by
libxl__device_nic_from_xenstore) would zero it.  Now it overwrites the
correct backend domid with zero; so remove it.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Wed, 4 May 2016 15:20:05 +0000 (16:20 +0100)]
libxl: Do not trust backend for nic in devid_to_device

libxl_devid_to_device_nic should read the information it needs from
the /libxl/device path, not the backend.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 15:35:21 +0000 (16:35 +0100)]
libxl: Do not trust backend in nic getinfo

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Ian Jackson [Tue, 3 May 2016 14:40:18 +0000 (15:40 +0100)]
libxl: Have READ_LIBXLDEV use libxl_path rather than be_path

Fix the just-introduced bug in this macro: now it reads the
trustworthy libxl_path.  Change the variable name in the two functions
(nic and channel) which use it.

Shuffling the bump in the carpet along, we now introduce three new
bugs: the three call sites pass a backend path where a frontend path
is expected.

No functional change.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Rename READ_BACKEND to READ_LIBXLDEV
Ian Jackson [Wed, 4 May 2016 15:07:02 +0000 (16:07 +0100)]
libxl: Rename READ_BACKEND to READ_LIBXLDEV

We are going to want to change all the functions that use READ_BACKEND
to get untrustworthy information from the backend, to use trustworthy
information from /libxl.

This will involve replacing READ_BACKEND, which reads from be_path,
with a similar macro READ_LIBXLDEV, which reads from libxl_path.

The macro name change generates a lot of clutter in the diff.  So we
break it out into this separate patch.  Here, we rename the macro, but
the implementation does not really match the new name.

So, another way to look at this, is that we have transformed the bug:
 * All of the backends use READ_BACKEND, which is unsafe
into the new bug:
 * READ_LIBXLDEV actually reads be_path, which is unsafe.

There is no functional change as yet.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Rename libxl__device_nic_from_xs_be to _from_xenstore
Ian Jackson [Wed, 4 May 2016 15:18:36 +0000 (16:18 +0100)]
libxl: Rename libxl__device_nic_from_xs_be to _from_xenstore

We are going to change these functions to expect, and be passed, a
/libxl path.  So it is wrong that they are called _from_xs_be.

Neither function reads anything which isn't found in both places, so
we can and will change the call sites later.

The only remaining function in libxl called *_from_xs_be relates to
PCI devices, for which the backend domain is hardcoded to 0 throughout
libxl_pci.c.

No functional change.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust backend for cdrom insert
Ian Jackson [Fri, 29 Apr 2016 18:13:17 +0000 (19:13 +0100)]
libxl: Do not trust backend for cdrom insert

Use the /libxl path where appropriate.  Rename `path' variable to
`be_path' to make sure we caught all the occurrences.

Specifically, when checking that the device still exists, check the
`frontend' value in /libxl, rather than anything in the backend
directory.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust backend for disk in getinfo
Ian Jackson [Fri, 29 Apr 2016 18:10:45 +0000 (19:10 +0100)]
libxl: Do not trust backend for disk in getinfo

Do not read the frontend path out of the backend.  We have it in our
hand.  Likewise the guest (frontend) domid was one of our parameters (!)

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust backend for disk; fix driver domain disks list
Ian Jackson [Fri, 29 Apr 2016 17:29:45 +0000 (18:29 +0100)]
libxl: Do not trust backend for disk; fix driver domain disks list

Rework libxl__device_disk_from_xs_be (which takes a backend path) into
libxl__device_disk_from_xenstore (which takes a libxl path).

libxl__device_disk_from_xenstore now finds the backend path itself,
although it doesn't use it any more for most of its functions.  We
rename the variable from be_path to backend_path to make sure we
didn't miss any cases.

All the data collection is now done by reading from the copy in
/libxl.

libxl_device_disk_list and its helper libxl__append_disk_list (which
used to be libxl__append_disk_list_of_type) need extensive rework,
because they now need to specify the /libxl path rather than the
backend path.

To do that they enumerate disks by looking in the appropriate area in
/libxl.  Previously they scanned various of the backend directories in
dom0 (which was broken for driver domains).  It is no longer necessary
to enumerate the various disk backends, because they all use the same
paths in /devices.  libxl__device_disk_from_xenstore will parse the
type out of the backend path, for itself.  (Indeed, it did so before -
the now-gone type parameter to libxl__append_disk_list_of_type wasn't
used other than to construct the directory to list.)

Finally, remove a redundant store to pdisk->backend_domid in
libxl__append_disk_list[_of_type].  Even before this commit, that
store was not needed because libxl_device_disk_init (called by
libxl__device_disk_from_xenstore) would zero it.  Now it overwrites
the correct backend domid with zero; so remove it.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
---
v2: Also fix up COLO reads, following rebase

8 years agolibxl: Do not trust backend for disk eject vdev
Ian Jackson [Fri, 29 Apr 2016 15:23:35 +0000 (16:23 +0100)]
libxl: Do not trust backend for disk eject vdev

For disk eject, use configured vdev from /libxl, not backend.

The backend directory is writeable by driver domains.  This means that
a malicious driver domain could cause libxl to see a wrong vdev,
confusing the user or the toolstack.

Use the vdev from the /libxl space, rather than the backend.

For convenience, we read the vdev from the /libxl space into the evg
during setup and copy it on each event, rather than reading it afresh
each time (which would in any case involve generating or saving a copy
of the relevant /libxl path).

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: cdrom eject and insert: write to /libxl
Ian Jackson [Fri, 29 Apr 2016 18:15:13 +0000 (19:15 +0100)]
libxl: cdrom eject and insert: write to /libxl

Copy the new type and params values to /libxl, so that the information
in /libxl is kept up to date.

This is needed so that we can return this trustworthy information,
rather than trusting the backend-writeable parts of xenstore.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust backend for vtpm in getinfo (uuid)
Ian Jackson [Fri, 29 Apr 2016 15:57:14 +0000 (16:57 +0100)]
libxl: Do not trust backend for vtpm in getinfo (uuid)

Use uuid from /libxl, rather than from backend.  I think the backend
is not supposed to change the uuid, since it seems to be set by libxl
during setup.

If in fact the backend is supposed to be able to change the uuid, this
patch needs to be dropped and replaced by a patch which makes the vtpm
uuid lookup tolerate bad or missing data.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust backend for vtpm in getinfo (except uuid)
Ian Jackson [Fri, 29 Apr 2016 16:18:44 +0000 (17:18 +0100)]
libxl: Do not trust backend for vtpm in getinfo (except uuid)

* Do not check the backend for existence.  We have already read the
  /libxl path so know that the vtpm exists (or is supposed to); if the
  backend doesn't exist then that must be the backend's doing.
* Get the frontend path from the /libxl directory.
* The frontend domid is the guest domid, and does not need to be read
  from xenstore (!)

We still attempt to read the uuid from the backend.  This will be
fixed in the next patch.

This is part of XSA-178.

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Make copy of every xs backend in /libxl in _generic_add
Ian Jackson [Fri, 29 Apr 2016 15:19:28 +0000 (16:19 +0100)]
libxl: Make copy of every xs backend in /libxl in _generic_add

We want to stop libxl trustingly reading information from the backend
directory (since this is, of course, writeable by the backend, which
might be a semi-trusted driver domain).

In principle it is wrong in current libxl for anything to try to
divine virtual device configuration from xenstore: the JSON domain
config ought to supply that, and xenstore should only tell us which
devices actually exist.

However:

Firstly, there are several existing places where configuration
information is retrieved from xenstore rather than JSON.  We do not
want to reengineer this in a security patch.

Secondly, we want to make a security patch which can be backported to
versions of libxl without the JSON configuration machinery.

So we take the expedient approach of keeping a copy of the
configuration somewhere we trust, namely /libxl.  This is obviously
fairly low-risk, although it does write significantly more keys in
xenstore.

In this patch we make this change in libxl__device_generic_add.  This
is responsible for actually writing the vast majority of device
information to xenstore.  There are a few loose ends which will be
dealt with in a moment.

Likewise, changes to readers to use the new location will appear in
further patches.

This is part of XSA-178.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for nic in getinfo
Ian Jackson [Tue, 3 May 2016 15:31:07 +0000 (16:31 +0100)]
libxl: Do not trust frontend for nic in getinfo

libxl_device_nic_getinfo needs to examine devices without trusting
frontend-controlled data.  So:

* Use /libxl to find the backend path.
* Parse the backend path to find the backend domid, rather than
  reading it from the frontend.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for nic in libxl_devid_to_device_nic
Ian Jackson [Tue, 3 May 2016 14:52:53 +0000 (15:52 +0100)]
libxl: Do not trust frontend for nic in libxl_devid_to_device_nic

Find the backend by reading the pointer in /libxl rather than in the
guest's frontend area.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for vtpm in getinfo
Ian Jackson [Tue, 3 May 2016 15:00:20 +0000 (16:00 +0100)]
libxl: Do not trust frontend for vtpm in getinfo

libxl_device_vtpm_getinfo needs to examine devices without trusting
frontend-controlled data.  So:

* Use /libxl to find the backend path.
* Parse the backend path to find the backend domid, rather than
  reading it from the frontend.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for vtpm list
Ian Jackson [Tue, 3 May 2016 14:58:32 +0000 (15:58 +0100)]
libxl: Do not trust frontend for vtpm list

libxl_device_vtpm_list needs to enumerate and identify devices without
trusting frontend-controlled data.  So:

* Use the /libxl path to enumerate vtpms.
* Use the /libxl path to find the corresponding backends.
* Parse the backend path to find the backend domid.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for disk in getinfo
Ian Jackson [Fri, 29 Apr 2016 18:21:51 +0000 (19:21 +0100)]
libxl: Do not trust frontend for disk in getinfo

* Rename the frontend variable to `fe_path' to check we caught them all
* Read the backend path from /libxl, rather than from the frontend
* Parse the backend domid from the backend path, rather than reading it
  from the frontend (and add the appropriate error path and initialisation)

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend for disk eject event
Ian Jackson [Wed, 27 Apr 2016 15:08:49 +0000 (16:08 +0100)]
libxl: Do not trust frontend for disk eject event

Use the /libxl path for interpreting disk eject watch events: do not
read the backend path out of the frontend.  Instead, use the version
in /libxl.  That avoids us relying on the guest-modifiable
$frontend/backend pointer.

To implement this we store the path
  /libxl/$guest/device/vbd/$devid/backend
in the evgen structure.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend in libxl__device_nextid
Ian Jackson [Wed, 4 May 2016 14:30:32 +0000 (15:30 +0100)]
libxl: Do not trust frontend in libxl__device_nextid

When selecting the devid for a new device, we should look in
/libxl/device for existing devices, not in the frontend area.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Do not trust frontend in libxl__devices_destroy
Ian Jackson [Tue, 3 May 2016 17:39:36 +0000 (18:39 +0100)]
libxl: Do not trust frontend in libxl__devices_destroy

We need to enumerate the devices we have provided to a domain, without
trusting the guest-writeable (or, at least, guest-deletable) frontend
paths.

Instead, enumerate via, and read the backend path from, /libxl.

The console /libxl path is regular, so the special case for console 0
is not relevant any more: /libxl/GUEST/device/console/0 will be found,
and then libxl__device_destroy will DTRT to the right frontend path.

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Provide libxl__backendpath_parse_domid
Ian Jackson [Wed, 27 Apr 2016 15:34:19 +0000 (16:34 +0100)]
libxl: Provide libxl__backendpath_parse_domid

Multiple places in libxl need to figure out the backend domid of a
device.  This can be discovered easily by looking at the backend path,
which always starts /local/domain/$backend_domid/.

There are no call sites yet.
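
For illustration only (this is not the libxl implementation), the parsing
described above can be sketched in C as follows.  The function name
parse_backend_domid, its return convention and the 16-bit upper bound are
assumptions made for this sketch; the only thing taken from the commit
message is that a backend path starts with /local/domain/$backend_domid/.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not libxl's actual code: extract the backend
 * domid from a path of the form /local/domain/$backend_domid/...
 * Returns 0 and stores the domid on success, -1 on a malformed path.
 */
static int parse_backend_domid(const char *be_path, unsigned int *domid_out)
{
    static const char prefix[] = "/local/domain/";
    const size_t plen = sizeof(prefix) - 1;
    const char *p;
    char *end;
    unsigned long v;

    assert(be_path != NULL && domid_out != NULL);

    if (strncmp(be_path, prefix, plen) != 0)
        return -1;

    p = be_path + plen;
    if (!isdigit((unsigned char)*p))
        return -1;                  /* rejects "", "-1", " 5", "x" */

    errno = 0;
    v = strtoul(p, &end, 10);
    /* Assumption: domids fit in 16 bits; the number must be followed by '/'. */
    if (errno != 0 || v > 0xffffUL || *end != '/')
        return -1;

    *domid_out = (unsigned int)v;
    return 0;
}
```

The point of deriving the domid this way is that the path itself comes
from /libxl, so nothing written by the guest is consulted.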

This is part of XSA-175.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agolibxl: Record backend/frontend paths in /libxl/$DOMID
Ian Jackson [Mon, 16 May 2016 15:44:31 +0000 (16:44 +0100)]
libxl: Record backend/frontend paths in /libxl/$DOMID

This gives us a record of all the backends we have set up for a
domain, which is separate from the frontends in
  /local/domain/$DOMID/device.

In particular:

1. A guest has write permission for the frontend path:
  /local/domain/$DOMID/device/$KIND/$DEVID
which means that the guest can completely delete the frontend.
(They can't recreate it because they don't have write permission
on the containing directory.)

2. A guest has write permission for the backend path recorded in the
frontend, ie, it can write to
  /local/domain/$DOMID/device/$KIND/$DEVID/backend
which means that the guest can break the association between
frontend and backend.

So we can't rely on iterating over the frontends to find all the
backends, or examining a frontend to discover how a device is
configured.

So, have libxl__device_generic_add record the frontend and backend
paths in /libxl/$DOMID/device, and have libxl__device_destroy remove
them again.

Create the containing directory /libxl/GUEST/device in
libxl__domain_make.  The already existing xs_rm in devices_destroy_cb
will take care of removing it.
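
As a rough sketch of the two namespaces involved (not the libxl code
itself), the paths can be formatted as below.  The function names are
hypothetical; the layouts /libxl/$DOMID/device/$KIND/$DEVID and
/local/domain/$DOMID/device/$KIND/$DEVID follow the paths quoted in this
series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toolstack-owned record: /libxl/$DOMID/device/$KIND/$DEVID.
 * Returns 0 on success, -1 if the buffer is too small or on error. */
static int format_libxl_device_path(char *buf, size_t len,
                                    unsigned int domid,
                                    const char *kind, int devid)
{
    int n;

    assert(buf != NULL && kind != NULL);
    n = snprintf(buf, len, "/libxl/%u/device/%s/%d", domid, kind, devid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Guest-writeable frontend: /local/domain/$DOMID/device/$KIND/$DEVID. */
static int format_frontend_path(char *buf, size_t len,
                                unsigned int domid,
                                const char *kind, int devid)
{
    int n;

    assert(buf != NULL && kind != NULL);
    n = snprintf(buf, len, "/local/domain/%u/device/%s/%d",
                 domid, kind, devid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Nonzero if the path lies in the toolstack-owned /libxl area. */
static int path_is_under_libxl(const char *path)
{
    return strncmp(path, "/libxl/", 7) == 0;
}
```

A reader that enumerates devices would start from the first path and
treat the second only as a pointer target, never as a source of truth.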

This is part of XSA-175.

Backport note: Backported over 7472ced, which fixes a bug in driver
domain teardown.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
8 years agoxen/arm: Don't free p2m->first_level in p2m_teardown() before it has been allocated
Andrew Cooper [Thu, 2 Jun 2016 13:19:00 +0000 (14:19 +0100)]
xen/arm: Don't free p2m->first_level in p2m_teardown() before it has been allocated

If p2m_init() didn't complete successfully, (e.g. due to VMID
exhaustion), p2m_teardown() is called and unconditionally tries to free
p2m->first_level before it has been allocated.  free_domheap_pages() doesn't
tolerate NULL pointers.
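
The shape of such a fix can be sketched in plain C as below.  This is an
illustration under stated assumptions, not the Xen code: struct
p2m_sketch, strict_free() and teardown_sketch() are hypothetical
stand-ins, with strict_free() modelling a free routine that must not be
handed NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the structure being torn down. */
struct p2m_sketch {
    void *first_level;          /* NULL until init has allocated it */
};

static int strict_free_calls;   /* counts calls, for demonstration only */

/* Models a free routine that does not tolerate NULL. */
static void strict_free(void *p)
{
    assert(p != NULL);
    free(p);
    strict_free_calls++;
}

/* Teardown that is safe to call after a partially completed init. */
static void teardown_sketch(struct p2m_sketch *p2m)
{
    if (p2m->first_level != NULL) {
        strict_free(p2m->first_level);
        p2m->first_level = NULL;    /* makes a repeated teardown harmless */
    }
}
```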

This is XSA-181.

Reported-by: Aaron Cornelius <Aaron.Cornelius@dornerworks.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agox86/mm: fully honor PS bits in guest page table walks
Jan Beulich [Tue, 17 May 2016 12:56:46 +0000 (14:56 +0200)]
x86/mm: fully honor PS bits in guest page table walks

In L4 entries the PS bit is currently unconditionally reserved (and hence
should, when set, always result in a reserved bit page fault), and in L3
entries it is reserved on hardware not supporting 1Gb pages (and hence
should, when set, similarly cause a reserved bit page fault on such
hardware).

This is CVE-2016-4480 / XSA-176.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 46699c7393bd991234b5642763c5c24b6b39a6c4
master date: 2016-05-17 14:41:14 +0200

8 years agoUpdate QEMU_UPSTREAM_REVISION
Ian Jackson [Tue, 10 May 2016 18:20:49 +0000 (19:20 +0100)]
Update QEMU_UPSTREAM_REVISION

8 years agoQEMU_TAG update
Ian Jackson [Tue, 10 May 2016 18:16:37 +0000 (19:16 +0100)]
QEMU_TAG update

8 years agoQEMU_TAG update
Ian Jackson [Tue, 10 May 2016 18:10:57 +0000 (19:10 +0100)]
QEMU_TAG update

9 years agox86: limit GFNs to 32 bits for shadowed superpages.
Tim Deegan [Wed, 16 Mar 2016 17:05:25 +0000 (17:05 +0000)]
x86: limit GFNs to 32 bits for shadowed superpages.

Superpage shadows store the shadowed GFN in the backpointer field,
which for non-BIGMEM builds is 32 bits wide.  Shadowing a superpage
mapping of a guest-physical address above 2^44 would lead to the GFN
being truncated there, and a crash when we come to remove the shadow
from the hash table.

Track the valid width of a GFN for each guest, including reporting it
through CPUID, and enforce it in the shadow pagetables.  Set the
maximum width to 32 for guests where this truncation could occur.
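
The arithmetic behind the 2^44 figure can be checked with a small C
sketch.  It assumes 4KiB pages (a page shift of 12), so a guest-physical
address of 2^44 corresponds to GFN 2^32, the first value that no longer
fits in a 32-bit field.  The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumption: 4KiB pages */

/* Guest frame number of a guest-physical address. */
static uint64_t gpa_to_gfn(uint64_t gpa)
{
    return gpa >> PAGE_SHIFT;
}

/* What a 32-bit backpointer field would retain of a GFN. */
static uint32_t truncate_to_backpointer(uint64_t gfn)
{
    return (uint32_t)gfn;
}

/* Nonzero if gfn is representable in 'width' bits (1..64). */
static int gfn_fits_width(uint64_t gfn, unsigned int width)
{
    assert(width >= 1 && width <= 64);
    return width == 64 || (gfn >> width) == 0;
}
```

With these helpers, the GFN for address 2^44 truncates to 0, which is
the kind of mismatch that a width check rejects up front.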

This is XSA-173.

Reported-by: Ling Liu <liuling-it@360.cn>
Signed-off-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
9 years agox86: fix information leak on AMD CPUs
Jan Beulich [Tue, 29 Mar 2016 13:20:58 +0000 (15:20 +0200)]
x86: fix information leak on AMD CPUs

The fix for XSA-52 was wrong, and so was the change synchronizing that
new behavior to the FXRSTOR logic: AMD's manuals explicitly state that
writes to the ES bit are ignored, and it instead gets calculated from
the exception and mask bits (it gets set whenever there is an unmasked
exception, and cleared otherwise). Hence we need to follow that model
in our workaround.
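
The model described above (ES set whenever an unmasked exception is
pending, cleared otherwise) can be written out as a short C sketch.  The
bit positions are the standard x87 layout: exception flags in FSW bits
0..5, the matching masks in FCW bits 0..5, and ES in FSW bit 7.  The
function names are hypothetical and this is not the code from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define X87_EXC_FLAGS 0x003f    /* FSW/FCW bits 0..5: IE DE ZE OE UE PE */
#define X87_FSW_ES    0x0080    /* FSW bit 7: exception summary */

/* Nonzero if some exception flag is set in fsw and not masked in fcw. */
static int x87_unmasked_pending(uint16_t fsw, uint16_t fcw)
{
    return (fsw & ~fcw & X87_EXC_FLAGS) != 0;
}

/* FSW with ES recomputed from the flags and masks, ignoring the ES
 * value that was written. */
static uint16_t x87_effective_fsw(uint16_t fsw, uint16_t fcw)
{
    uint16_t out = (uint16_t)(fsw & ~X87_FSW_ES);

    if (x87_unmasked_pending(fsw, fcw))
        out |= X87_FSW_ES;

    assert(((out & X87_FSW_ES) != 0) == x87_unmasked_pending(fsw, fcw));
    return out;
}
```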

This is CVE-2016-3158 / CVE-2016-3159 / XSA-172.
[xen/arch/x86/xstate.c:xrstor: CVE-2016-3158]
[xen/arch/x86/i387.c:fpu_fxrstor: CVE-2016-3159]

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7bd9dc3adfbb014c55f0928ebb3b20950ca9c019
master date: 2016-03-29 14:24:26 +0200

9 years agolibvchan: Read prod/cons only once.
Konrad Rzeszutek Wilk [Thu, 10 Mar 2016 07:26:42 +0000 (08:26 +0100)]
libvchan: Read prod/cons only once.

We must ensure that prod and cons are each read only once, and that
the compiler cannot turn one read into several, that is, split the
read across multiple instructions which then influence later branch
code. To that end, insert barriers when fetching the cons and prod
indexes.

This is part of XSA-155.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 7d66a4ba695ab8d13b214fb816dd59e443ae1ec9
master date: 2015-12-18 09:50:02 -0500

9 years agotools: pygrub: if partition table is empty, try treating as a whole disk
Ian Campbell [Thu, 5 Nov 2015 14:46:12 +0000 (14:46 +0000)]
tools: pygrub: if partition table is empty, try treating as a whole disk

pygrub (in identify_disk_image()) detects a DOS style partition table
via the presence of the 0xaa55 signature at the end of the first
sector of the disk.

However this signature is also present in whole-disk configurations
when there is an MBR on the disk. Many filesystems (e.g. ext[234])
include leading padding in their on disk format specifically to enable
this.

So if we think we have a DOS partition table but do not find any
actual partition table entries we may as well try looking at it as a
whole disk image. Worst case is we probe and find there isn't anything
there.

This was reported by Sjors Gielen in Debian bug #745419. The fix was
inspired by a patch by Adi Kriegisch in
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745419#27

Tested by genext2fs'ing my /boot into a new raw image (works) and
then:
   dd if=/usr/lib/grub/i386-pc/g2ldr.mbr of=img conv=notrunc bs=512 count=1

to add an MBR (with 0xaa55 signature) to it, which after this patch
also works.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: 745419-forwarded@bugs.debian.org
(cherry picked from commit fb31b1475f1bf179f033b8de3f0e173006fd77e9)
(cherry picked from commit 6c9b1bcce4fcc872edddd44f88390a67d5954069)
(cherry picked from commit 812406cf2b6731d07f0f840d799fcfa5917dbaf4)

9 years agox86: fix unintended fallthrough case from XSA-154
Andrew Cooper [Thu, 18 Feb 2016 14:28:25 +0000 (15:28 +0100)]
x86: fix unintended fallthrough case from XSA-154

... and annotate the other deliberate one: Coverity objects otherwise.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
One of the two instances was actually a bug.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 8dd6d1c099865ee5f5916616a0ca79cd943c46f9
master date: 2016-02-18 15:10:07 +0100

9 years agox86/VMX: sanitize rIP before re-entering guest
Jan Beulich [Wed, 17 Feb 2016 15:52:25 +0000 (16:52 +0100)]
x86/VMX: sanitize rIP before re-entering guest

... to prevent guest user mode arranging for a guest crash (due to
failed VM entry). (On the AMD system I checked, hardware is doing
exactly the canonicalization being added here.)
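
For readers unfamiliar with the term, canonicalization can be sketched
as follows.  The sketch assumes 48 implemented virtual address bits
(4-level paging), in which case an address is canonical when bits 63..48
are copies of bit 47.  The helper names are hypothetical, and this is
not the code added by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits. */
#define VADDR_BITS 48

/* Nonzero if bits 63..47 of addr are all equal. */
static int is_canonical(uint64_t addr)
{
    const uint64_t top = addr >> (VADDR_BITS - 1);      /* bits 63..47 */

    return top == 0 ||
           top == (UINT64_C(1) << (64 - VADDR_BITS + 1)) - 1;
}

/* Replace bits 63..48 with copies of bit 47. */
static uint64_t canonicalize(uint64_t addr)
{
    const uint64_t upper = ~UINT64_C(0) << VADDR_BITS;  /* bits 63..48 */
    const uint64_t out = (addr & (UINT64_C(1) << (VADDR_BITS - 1)))
                         ? (addr | upper) : (addr & ~upper);

    assert(is_canonical(out));
    return out;
}
```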

Note that fixing this in an architecturally correct way would be quite
a bit more involved: Making the x86 instruction emulator check all
branch targets for validity, plus dealing with invalid rIP resulting
from update_guest_eip() or incoming directly during a VM exit. The only
way to get the latter right would be by not having hardware do the
injection.

Note further that there are two early returns from
vmx_vmexit_handler(): One (through vmx_failed_vmentry()) leads to
domain_crash() anyway, and the other covers real mode only and can
neither occur with a non-canonical rIP nor result in an altered rIP,
so we don't need to force those paths through the checking logic.

This is CVE-2016-2271 / XSA-170.

Reported-by: 刘令 <liuling-it@360.cn>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ffbbfda37782a2408953af1a3e00ada80bb141bc
master date: 2016-02-17 16:18:08 +0100

9 years agox86: enforce consistent cachability of MMIO mappings
Jan Beulich [Wed, 17 Feb 2016 15:51:44 +0000 (16:51 +0100)]
x86: enforce consistent cachability of MMIO mappings

We've been told by Intel that inconsistent cachability between
multiple mappings of the same page can affect system stability only
when the affected page is an MMIO one. Since the stale data issue is
of no relevance to the hypervisor (since all guest memory accesses go
through proper accessors and validation), handling of RAM pages
remains unchanged here. Any MMIO mapped by domains however needs to be
done consistently (all cachable mappings or all uncachable ones), in
order to avoid Machine Check exceptions. Since converting existing
cachable mappings to uncachable (at the time an uncachable mapping
gets established) would in the PV case require tracking all mappings,
allow MMIO to only get mapped uncachable (UC, UC-, or WC).

This also implies that in the PV case we mustn't use the L1 PTE update
fast path when cachability flags get altered.

Since in the HVM case at least for now we want to continue honoring
pinned cachability attributes for pages not mapped by the hypervisor,
special case handling of r/o MMIO pages (forcing UC) gets added there.
Arguably the counterpart change to p2m-pt.c may not be necessary, since
UC- (which already gets enforced there) is probably strict enough.

Note that the shadow code changes include fixing the write protection
of r/o MMIO ranges: shadow_l1e_remove_flags() and its siblings, unlike
l1e_remove_flags() and the like, return the new PTE (and hence
ignoring their return values makes them no-ops).

This is CVE-2016-2270 / XSA-154.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c61a6f74f80eb36ed83a82f713db3143159b9009
master date: 2016-02-17 16:16:53 +0100

9 years agoupdate Xen version to 4.4.4 RELEASE-4.4.4
Jan Beulich [Thu, 21 Jan 2016 12:38:50 +0000 (13:38 +0100)]
update Xen version to 4.4.4

9 years agox86/vmx: Fix injection of #DB traps following XSA-156
Andrew Cooper [Wed, 20 Jan 2016 13:10:12 +0000 (14:10 +0100)]
x86/vmx: Fix injection of #DB traps following XSA-156

Most #DB exceptions are traps rather than faults, meaning that the instruction
pointer in the exception frame points after the instruction rather than at it.

However, VMX intercepts all have fault semantics, even when intercepting a
trap.  Re-injecting an intercepted trap as a fault causes an infinite loop in
the guest, by re-executing the same trapping instruction repeatedly.  This
breaks debugging inside the guest.

Introduce a helper which copies VM_EXIT_INTR_INFO to VM_ENTRY_INTR_INFO, and
use it to mirror the intercepted interrupt back to the guest.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 0747bc8b4d85f3fc0ee1e58418418fa0229e8ff8
master date: 2016-01-05 11:28:56 +0000

9 years agox86/VMX: prevent INVVPID failure due to non-canonical guest address
Jan Beulich [Wed, 20 Jan 2016 13:09:21 +0000 (14:09 +0100)]
x86/VMX: prevent INVVPID failure due to non-canonical guest address

While INVLPG (and on SVM INVLPGA) don't fault on non-canonical
addresses, INVVPID fails (in the "individual address" case) when passed
such an address.

Since such intercepted INVLPG are effectively no-ops anyway, don't fix
this in vmx_invlpg_intercept(), but instead have paging_invlpg() never
return true in such a case.

This is CVE-2016-1571 / XSA-168.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: bf05e88ed7342a91cceba050b6c622accb809842
master date: 2016-01-20 13:50:10 +0100

9 years agox86/mm: PV superpage handling lacks sanity checks
Jan Beulich [Wed, 20 Jan 2016 13:08:27 +0000 (14:08 +0100)]
x86/mm: PV superpage handling lacks sanity checks

MMUEXT_{,UN}MARK_SUPER fail to check the input MFN for validity before
dereferencing pointers into the superpage frame table.

Reported-by: Qinghao Tang <luodalongde@gmail.com>
get_superpage() has a similar issue.

This is CVE-2016-1570 / XSA-167.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: 47abf29a9255b2e7b94e56d66b455d0a584b68b8
master date: 2016-01-20 13:49:23 +0100

9 years agoConfig.mk: update OVMF changeset
Wei Liu [Wed, 14 Oct 2015 11:41:13 +0000 (12:41 +0100)]
Config.mk: update OVMF changeset

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
(cherry picked from commits 04c5efb0a141fa53e805e396970419436e74ce67
 and f046e501bbca1c8a46853b2e1f1b587e228c73de)

Apropos of discussion in
 "OVMF related osstest failures on multiple branches"
 http://lists.xenproject.org/archives/html/xen-devel/2016-01/msg00442.html

We believe the older ovmf.git does not work when built with the gcc in
Debian jessie.  We do not know where this bug lies but we are fixing
it by updating ovmf.

We have decided that we are not in a position to review the changes to
OVMF upstream, and ourselves decide what to cherry pick.  Instead we
will update the revision wholesale and use the xen.git stable
branches' push gate.

Conflicts:
Config.mk

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commits 6c3c6ff9ecaa5ee0be8b535d36fdcd12380564a1
 and 1d3cc6e62c4d2fc3dd9251d4921881425c9d27bd)

Conflicts:
Config.mk
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit ee576d71103e6795dc0add91db1b0d281eab1caf)

Conflicts:
Config.mk
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
9 years agoblktap: Fix two 'maybe uninitialized' variables
Dario Faggioli [Fri, 20 Jun 2014 14:09:00 +0000 (16:09 +0200)]
blktap: Fix two 'maybe uninitialized' variables

[ Cross-ported to blktap1 from 345e44a85d71a
  "blktap2: Fix two 'maybe uninitialized' variables" -iwj;
  Remainder of commit message is from blktap2's version. ]

about which gcc 4.9.0 complains, like this:

block-qcow.c: In function `get_cluster_offset':
block-qcow.c:431:3: error: `tmp_ptr' may be used uninitialized in this function
[-Werror=maybe-uninitialized]
   memcpy(tmp_ptr, l1_ptr, 4096);
   ^
block-qcow.c:606:7: error: `tmp_ptr2' may be used uninitialized in this
function [-Werror=maybe-uninitialized]
   if (write(s->fd, tmp_ptr2, 4096) != 4096) {
       ^
cc1: all warnings being treated as errors
/home/dario/Sources/xen/xen/xen.git/tools/blktap2/drivers/../../../tools/Rules.mk:89:
 recipe for target 'block-qcow.o' failed
make[5]: *** [block-qcow.o] Error 1

The proper behavior is to return upon allocation failure.
About what to return, 0 seems the best option, looking
at both the function and the call sites.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport-requested-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 845e8c1653242bbd9b9de5a081182db0f3f39054)

9 years agoQEMU_TAG update
Ian Jackson [Mon, 4 Jan 2016 15:36:32 +0000 (15:36 +0000)]
QEMU_TAG update

9 years agolibxl: Fix building libxlu_cfg_y.y with bison 3.0
Ed Swierk [Tue, 6 Jan 2015 15:21:07 +0000 (15:21 +0000)]
libxl: Fix building libxlu_cfg_y.y with bison 3.0

- Use %lex-param instead of obsolete YYLEX_PARAM to override lex scanner
  parameter
- Change deprecated %name-prefix= to %name-prefix

Tested against bison 2.4.1 and 3.0.2.

This is expected to sometimes (depending on timestamps and whether the
bison input files are edited) break building on systems with ancient
versions of bison.  Bison 2.4.1 is known to work and was released in
December 2008.

Also, consequently, regenerate bison output files with bison
1:2.5.dfsg-2.1 from Debian wheezy.

Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Release-Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 7ba4cdfadd4f3c45d65ffe50e621759f458fedc0)

[ I have checked that rebuilding the bison and flex input produces no
  further changes. -iwj ]

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
9 years agolibxl: Rerun bison and flex
Ian Jackson [Mon, 4 Jan 2016 14:45:12 +0000 (14:45 +0000)]
libxl: Rerun bison and flex

We are going to want to cherry pick a change to the bison input, which
will involve rerunning bison.

So firstly, update the bison and flex output to that from current
Debian wheezy (i386; 1:2.5.dfsg-2.1 and 2.5.35-10.1 respectively).

There should be no functional change since there is no change to the
source file, but we will inherit bugfixes and behavioural changes from
the new version of bison.  So this is more a matter of hope than
knowledge.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>

9 years agoQEMU_TAG update
Ian Jackson [Fri, 18 Dec 2015 14:58:40 +0000 (14:58 +0000)]
QEMU_TAG update

9 years agox86/HVM: avoid reading ioreq state more than once
Jan Beulich [Thu, 17 Dec 2015 13:31:28 +0000 (14:31 +0100)]
x86/HVM: avoid reading ioreq state more than once

Otherwise, especially when the compiler chooses to translate the
switch() to a jump table, unpredictable behavior (and in the jump table
case arbitrary code execution) can result.
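
A minimal sketch of the single-read pattern, assuming a hypothetical
shared request structure (the type, field, and state names below are
illustrative and are not taken from the Xen source). The shared field is
copied into a local variable once, and only the local copy is checked
and dispatched on, so a concurrent writer cannot change the value
between the check and the switch():

```c
#include <stdint.h>

/* Hypothetical request states; not the real ioreq constants. */
enum req_state { REQ_NONE = 0, REQ_READY = 1, REQ_INPROCESS = 2, REQ_DONE = 3 };

/* Hypothetical structure living in memory shared with an untrusted peer. */
struct shared_req {
    volatile uint8_t state;   /* may change at any time */
};

/*
 * Returns 0 if idle, 1 if a request is still pending, 2 if a response is
 * ready, and -1 if the peer supplied a state outside the known set.
 */
int handle_req(const struct shared_req *req)
{
    /* Read the shared field exactly once; use only the local copy below. */
    uint8_t state = req->state;

    switch (state) {
    case REQ_NONE:
        return 0;
    case REQ_READY:
    case REQ_INPROCESS:
        return 1;
    case REQ_DONE:
        return 2;
    default:
        /* Unknown value: reject it instead of indexing anything with it. */
        return -1;
    }
}
```

With this shape, even if the compiler lowers the switch() to a jump
table, the index it uses is the same value that was range-checked.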

This is XSA-166.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: b452430a4cdfc801fa4bc391aed7522365e1deb6
master date: 2015-12-17 14:22:46 +0100

9 years agox86: don't leak ST(n)/XMMn values to domains first using them
Jan Beulich [Thu, 17 Dec 2015 13:30:57 +0000 (14:30 +0100)]
x86: don't leak ST(n)/XMMn values to domains first using them

FNINIT doesn't alter these registers, and hence using it is
insufficient to initialize a guest's initial state.

This is CVE-2015-8555 / XSA-165.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 81818b3f277544535974204f8d840da86fa8a44f
master date: 2015-12-17 14:22:13 +0100

9 years agox86/time: fix domain type check in tsc_set_info()
Haozhong Zhang [Tue, 15 Dec 2015 14:56:02 +0000 (15:56 +0100)]
x86/time: fix domain type check in tsc_set_info()

Replace is_hvm_domain() in tsc_set_info() by has_hvm_container_domain()
to keep consistent with other domain type checks in tsc_set_info().

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 3c80d6f3c61eb0f8072f70b0a9a8c8c7adf17572
master date: 2015-12-08 09:46:30 +0100

9 years agoevtchn: don't reuse ports that are still "busy"
David Vrabel [Tue, 15 Dec 2015 14:55:23 +0000 (15:55 +0100)]
evtchn: don't reuse ports that are still "busy"

When using the FIFO ABI a guest may close an event channel that is
still LINKED.  If this port is reused, subsequent events may be lost
because they may become pending on the wrong queue.

This could be fixed by requiring guests to only close event channels
that are not linked.  This is difficult since: a) irq cleanup in the
guest may be done in a context that cannot wait for the event to be
unlinked; b) the guest may attempt to rebind a PIRQ whose previous
close is still pending; and c) existing guests already have the
problematic behaviour.

Instead, simply check a port is not "busy" (i.e., it's not linked)
before reusing it.
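
The check can be sketched as follows, assuming a hypothetical port
table (struct port, its fields, and alloc_port() are illustrative names,
not the hypervisor's own). A closed port that is still linked on an
event queue is skipped by the allocator until it has been unlinked:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port bookkeeping. */
struct port {
    bool in_use;   /* currently bound by the guest */
    bool linked;   /* still linked on an event queue, i.e. "busy" */
};

/*
 * Return the index of the first port that is neither in use nor busy,
 * marking it in use, or -1 if no such port exists.
 */
int alloc_port(struct port *ports, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ports[i].in_use)
            continue;
        if (ports[i].linked)
            continue;   /* closed but still linked: do not reuse yet */
        ports[i].in_use = true;
        return (int)i;
    }
    return -1;
}
```

A busy port becomes allocatable again as soon as its linked flag is
cleared, which in this sketch stands in for the queue being drained.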

Guests should still drain any queues for VCPUs that are being
offlined, or the port will become unusable until the VCPU is onlined
and starts processing events again.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 78e24c269b0a4a8b864ece725e6d4209ed95dfa7
master date: 2015-12-02 15:21:46 +0100

9 years agox86/boot: check for not allowed sections before linking
Daniel Kiper [Tue, 15 Dec 2015 14:55:00 +0000 (15:55 +0100)]
x86/boot: check for not allowed sections before linking

Currently the check for not allowed sections is performed just after
compilation. However, if compilation succeeds and the check fails, then
a second build will create xen.gz/xen.efi without any visible error.
This happens because the %.o: %.c recipe created the object file during
the first run and make does not execute this recipe during the second
run. So, look for not allowed sections before linking. This way the
check will be executed every time.

Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d380b3559734739ae009cd3c0e9aabb5602863e2
master date: 2015-11-25 17:24:36 +0100

9 years agox86/vPMU: document as unsupported
Jan Beulich [Tue, 15 Dec 2015 14:53:49 +0000 (15:53 +0100)]
x86/vPMU: document as unsupported

This is XSA-163.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: c03480cf5c4e96fb4afb2237ad0a3cac7162564a
master date: 2015-11-24 18:32:20 +0100

9 years agoVMX: fix/adjust trap injection
Jan Beulich [Tue, 15 Dec 2015 14:53:12 +0000 (15:53 +0100)]
VMX: fix/adjust trap injection

In the course of investigating the 4.1.6 backport issue of the XSA-156
patch I realized that #DB injection has always been broken, but with it
now getting always intercepted the problem has got worse: Documentation
clearly states that neither DR7.GD nor DebugCtl.LBR get cleared before
the intercept, so this is something we need to do before reflecting the
intercepted exception.

While adjusting this (and also with 4.1.6's strange use of
X86_EVENTTYPE_SW_EXCEPTION for #DB in mind) I further realized that
the special casing of individual vectors shouldn't be done for
software interrupts (resulting from INT $nn).

And then some code movement: Setting of CR2 for #PF can be done in the
same switch() statement (no need for a separate if()), and reading of
intr_info is better done close to the consumption of the variable
(allowing the compiler to generate better code / use fewer registers
for variables).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 81a28f14009f4d8577a81b28dd06f6828112054b
master date: 2015-11-24 12:30:31 +0100

9 years agosched: fix locking for insert_vcpu() in credit1 and RTDS
Dario Faggioli [Tue, 15 Dec 2015 14:52:51 +0000 (15:52 +0100)]
sched: fix locking for insert_vcpu() in credit1 and RTDS

The insert_vcpu() hook is handled with inconsistent locking.
In fact, schedule_cpu_switch() calls the hook with runqueue
lock held, while sched_move_domain() relies on the hook
implementations to take the lock themselves (and, since that
is not done in Credit1 and RTDS, such operation is not safe
in those cases).

This is fixed as follows:
 - take the lock in the hook implementations, in specific
   schedulers' code;
 - avoid calling insert_vcpu(), for the idle vCPU, in
   schedule_cpu_switch(). In fact, idle vCPUs are set to run
   immediately, and the various schedulers won't insert them
   in their runqueues anyway, even when explicitly asked to.

While there, still in schedule_cpu_switch(), locking with
_irq() is enough (there's no need to do *_irqsave()).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Meng Xu <mengxu@cis.upenn.edu>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ae2f41e3d7e7798537b7ea6dbb9a0c6aeb1179e3
master date: 2015-11-24 14:48:34 +0100

9 years agox86/HVM: don't inject #DB with error code
Jan Beulich [Tue, 15 Dec 2015 14:51:44 +0000 (15:51 +0100)]
x86/HVM: don't inject #DB with error code

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com>
master commit: 057e0e72d2a5d598087c5f167ec6a13203a3cf65
master date: 2015-11-12 16:59:18 +0100

9 years agox86/vmx: improvements to vmentry failure handling
Andrew Cooper [Tue, 15 Dec 2015 14:50:47 +0000 (15:50 +0100)]
x86/vmx: improvements to vmentry failure handling

Combine the almost identical vm_launch_fail() and vm_resume_fail() into a
single vmx_vmentry_failure().

Re-save all GPRs so that domain_crash() prints the real register values,
rather than the stack frame of the vmx_vmentry_failure() call.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: bbcf0b218f64b1e3e2b66b0fbb623f51d9014e81
master date: 2015-11-03 18:14:02 +0100

9 years agox86/PoD: Make p2m_pod_empty_cache() restartable
Andrew Cooper [Tue, 15 Dec 2015 14:50:21 +0000 (15:50 +0100)]
x86/PoD: Make p2m_pod_empty_cache() restartable

This avoids a long running operation when destroying a domain with a
large PoD cache.
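
A minimal sketch of a restartable teardown loop, assuming a hypothetical
cache structure and return code (struct cache, WORK_AGAIN, and the
budget parameter are illustrative, not the actual PoD interface). The
function does a bounded amount of work per call and asks the caller to
invoke it again if entries remain:

```c
#include <stddef.h>

/* Hypothetical "call me again" return code. */
#define WORK_AGAIN 1

/* Hypothetical cache; only the number of remaining entries is modelled. */
struct cache {
    size_t remaining;
};

/*
 * Free at most 'budget' entries. Returns 0 once the cache is empty, or
 * WORK_AGAIN if the budget ran out first. Progress is kept in the cache
 * itself, so a later call resumes where this one stopped.
 */
int empty_cache(struct cache *c, size_t budget)
{
    while (c->remaining > 0) {
        if (budget == 0)
            return WORK_AGAIN;
        c->remaining--;   /* stands in for freeing one cached page */
        budget--;
    }
    return 0;
}
```

The caller loops until 0 is returned, and can yield between calls so
that no single invocation runs for long.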

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 59a5061723ba47c0028cf48487e5de551c42a378
master date: 2015-11-02 15:33:38 +0100

9 years agox86/NUMA: fix SRAT table processor entry parsing and consumption
Jan Beulich [Tue, 15 Dec 2015 14:48:44 +0000 (15:48 +0100)]
x86/NUMA: fix SRAT table processor entry parsing and consumption

- don't overrun apicid_to_node[] (possible in the x2APIC case)
- don't limit number of processor related SRAT entries we can consume
- make acpi_numa_{processor,x2apic}_affinity_init() as similar to one
  another as possible
- print APIC IDs in hex (to ease matching with other log messages), at
  once making legacy and x2APIC ones distinguishable (by width)
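
The first and last points can be sketched as below. This assumes a
hypothetical table size and helper names (MAX_APICID, record_affinity(),
format_apic_id()), and the two- and eight-digit widths are an assumption
chosen for illustration rather than the widths used by the patch:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical table size; the real limit may differ. */
#define MAX_APICID 256

static uint8_t apicid_to_node_tbl[MAX_APICID];

/* Record a node for an APIC ID, refusing IDs that would overrun the table. */
int record_affinity(uint32_t apic_id, uint8_t node)
{
    if (apic_id >= MAX_APICID)
        return -1;   /* possible with 32-bit x2APIC IDs */
    apicid_to_node_tbl[apic_id] = node;
    return 0;
}

/* Format an APIC ID in hex; the field width tells the two kinds apart. */
int format_apic_id(char *buf, size_t len, uint32_t apic_id, int is_x2apic)
{
    if (is_x2apic)
        return snprintf(buf, len, "%08x", (unsigned int)apic_id);
    return snprintf(buf, len, "%02x", (unsigned int)apic_id);
}
```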

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 83281fc9b31396e94c0bfb6550b75c165037a0ad
master date: 2015-10-14 12:46:27 +0200

9 years agox86: hide MWAITX from PV domains
Jan Beulich [Tue, 15 Dec 2015 14:47:59 +0000 (15:47 +0100)]
x86: hide MWAITX from PV domains

Since MWAIT is hidden too. (Linux starting with 4.3 is making use of
that feature, and is checking for it without looking at the MWAIT one.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 941cd44324db7eddc46cba4596fa13d505066ccf
master date: 2015-10-13 17:17:52 +0200

9 years agoVT-d: don't suppress invalidation address write when it is zero
Jan Beulich [Tue, 15 Dec 2015 14:47:26 +0000 (15:47 +0100)]
VT-d: don't suppress invalidation address write when it is zero

GFN zero is a valid address, and hence may need invalidation done for
it just like for any other GFN.
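
A minimal sketch of the distinction, assuming a hypothetical register
model (struct fake_iommu and queue_invalidation() are illustrative and
do not reflect the real VT-d driver code). Whether the address register
is written depends on the kind of flush requested, not on whether the
address happens to be zero:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the IOMMU's invalidation address register. */
struct fake_iommu {
    uint64_t iva_reg;
    unsigned int writes;   /* number of address writes performed */
};

/*
 * For a page-selective flush the address is always written, even when
 * it is 0. The problematic pattern would test 'addr != 0' here instead.
 */
void queue_invalidation(struct fake_iommu *iommu, bool page_selective,
                        uint64_t addr)
{
    if (page_selective) {
        iommu->iva_reg = addr;
        iommu->writes++;
    }
}
```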

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Yang Zhang <yang.z.zhang@intel.com>
master commit: 710942e57fb42ff8f344ca82f6b678f67e38ae63
master date: 2015-10-12 15:58:35 +0200

9 years agomemory: fix XSA-158 fix
Jan Beulich [Wed, 9 Dec 2015 12:56:40 +0000 (13:56 +0100)]
memory: fix XSA-158 fix

For one, the uses of domu_max_order and ptdom_max_order were swapped.

And then gcc warns about an unused result of a __must_check function
in the control part of a conditional expression when both other
expressions can be determined by the compiler to produce the same value
(see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68039), which happens
when HAS_PASSTHROUGH is undefined (i.e. for ARM on 4.4 and older).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
master commit: ff841cead287d7913901ba5c4e7628a6958b5bea
master date: 2015-12-09 13:53:13 +0100

9 years agoQEMU_TAG update
Ian Jackson [Wed, 9 Dec 2015 11:50:24 +0000 (11:50 +0000)]
QEMU_TAG update