]> xenbits.xensource.com Git - xen.git/log
xen.git
8 years agox86: correct create_bounce_frame
Jan Beulich [Tue, 2 May 2017 13:02:38 +0000 (15:02 +0200)]
x86: correct create_bounce_frame

We may push up to 96 bytes on the guest (kernel) stack, so we should
also cover as much in the early range check. Note that this is the
simplest possible patch, which has the theoretical potential of
breaking a guest: We only really push 96 bytes when invoking the
failsafe callback, ordinary exceptions only have 56 or 64 bytes pushed
(without / with error code respectively). There is, however, no PV OS
known to place a kernel stack there.

This is XSA-215.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
8 years agox86: discard type information when stealing pages
Jan Beulich [Tue, 2 May 2017 13:02:02 +0000 (15:02 +0200)]
x86: discard type information when stealing pages

While a page having just a single general reference left necessarily
has a zero type reference count too, its type may still be valid (and
in validated state; at present this is only possible and relevant for
PGT_seg_desc_page, as page tables have their type forcibly zapped when
their type reference count drops to zero, and
PGT_{writable,shared}_page pages don't require any validation). In
such a case when the page is being re-used with the same type again,
validation is being skipped. As validation criteria differ between
32- and 64-bit guests, pages to be transferred between guests need to
have their validation indicator zapped (and with it we zap all other
type information at once).

This is XSA-214.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eaf537342c909875c10f49b06e17493655410681
master date: 2017-05-02 14:46:58 +0200

8 years agomulticall: deal with early exit conditions
Jan Beulich [Tue, 2 May 2017 13:01:12 +0000 (15:01 +0200)]
multicall: deal with early exit conditions

In particular changes to guest privilege level require the multicall
sequence to be aborted, as hypercalls are permitted from kernel mode
only. While likely not very useful in a multicall, also properly handle
the return value in the HYPERVISOR_iret case (which should be the guest
specified value).

This is XSA-213.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 22c096c99d8c05833c3c19870e36efb2dd4e8013
master date: 2017-05-02 14:45:02 +0200

8 years agosetup vwfi correctly on cpu0
Stefano Stabellini [Fri, 31 Mar 2017 22:37:07 +0000 (15:37 -0700)]
setup vwfi correctly on cpu0

parse_vwfi runs after init_traps on cpu0, potentially resulting in the
wrong HCR_EL2 for it. Secondary cpus boot after parse_vwfi, so in their
case init_traps will write the correct set of flags to HCR_EL2.

For cpu0, fix the issue by changing HCR_EL2 setting from a new
presmp_initcall.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agooxenstored: trim history in the frequent_ops function
Thomas Sanders [Tue, 28 Mar 2017 17:57:52 +0000 (18:57 +0100)]
oxenstored: trim history in the frequent_ops function

We were trimming the history of commits only at the end of each
transaction (regardless of how it ended).

Therefore if non-transactional writes were being made but no
transactions were being ended, the history would grow
indefinitely. Now we trim the history at regular intervals.

Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored transaction conflicts: improve logging
Thomas Sanders [Mon, 27 Mar 2017 13:36:34 +0000 (14:36 +0100)]
oxenstored transaction conflicts: improve logging

For information related to transaction conflicts, potentially frequent
logging at "info" priority has been changed to "debug" priority, and
once per two minutes there is an "info" priority summary.

Additional detailed logging has been added at "debug" priority.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: don't wake to issue no conflict-credit
Thomas Sanders [Fri, 24 Mar 2017 19:55:03 +0000 (19:55 +0000)]
oxenstored: don't wake to issue no conflict-credit

In the main loop, when choosing the timeout for the select function
call, we were setting it so as to wake up to issue conflict-credit to
any domains that could accept it. When xenstore is idle, this would
mean waking up every 50ms (by default) to do no work. With this
commit, we check whether any domain is below its cap, and if not then
we set the timeout for longer (the same timeout as before the
conflict-protection feature was added).

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
8 years agooxenstored: do not commit read-only transactions
Thomas Sanders [Fri, 24 Mar 2017 16:16:10 +0000 (16:16 +0000)]
oxenstored: do not commit read-only transactions

The packet telling us to end the transaction has always carried an
argument telling us whether to commit.

If the transaction made no modifications to the tree, now we ignore
that argument and do not commit: it is just a waste of effort.

This makes read-only transactions immune to conflicts, and means that
we do not need to store any of their details in the history that is
used for assigning blame for conflicts.

We count a transaction as a read-only transaction only if it contains
no operations that modified the tree.

This means that (for example) a transaction that creates a new node
then deletes it would NOT count as read-only, even though it makes no
change overall. A more sophisticated algorithm could judge the
transaction based on comparison of its initial and final states, but
this would add complexity and computational cost.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
8 years agooxenstored: allow self-conflicts
Thomas Sanders [Thu, 23 Mar 2017 19:06:54 +0000 (19:06 +0000)]
oxenstored: allow self-conflicts

We already avoid inter-domain conflicts but now allow intra-domain
conflicts.  Although there are no known practical examples of a domain
that might perform operations that conflict with its own transactions,
this is conceivable, so here we avoid changing those semantics
unnecessarily.

When a transaction commit fails with a conflict and we look through
the history of commits to see which connection(s) to blame, ignore
historical commits that were made by the same connection as the
failing commit.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
8 years agooxenstored: blame the connection that caused a transaction conflict
Jonathan Davies [Thu, 23 Mar 2017 14:28:16 +0000 (14:28 +0000)]
oxenstored: blame the connection that caused a transaction conflict

Blame each connection found to have made a commit that would cause this
transaction to fail. Each blamed connection is penalised by having its
conflict-credit decremented.

Note the change in semantics for the replay function: we no longer stop after
finding the first operation that can't be replayed. This allows us to identify
all operations that conflicted with this transaction, not just the one that
conflicted first.

Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
v1 Reviewed-by: Christian Lindig <christian.lindig@citrix.com>

Changes since v1:
 * use correct log levels for informational messages
Changes since v2:
 * fix the blame algorithm and improve logging
   (fix was reviewed by Jonathan Davies)

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: track commit history
Jonathan Davies [Mon, 27 Mar 2017 08:58:29 +0000 (08:58 +0000)]
oxenstored: track commit history

Since the list of historic activity cannot grow without bound, it is safe to use
this to track commits.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: discard old commit-history on txn end
Thomas Sanders [Thu, 23 Mar 2017 14:25:16 +0000 (14:25 +0000)]
oxenstored: discard old commit-history on txn end

The history of commits is to be used for working out which historical
commit(s) (including atomic writes) caused conflicts with a
currently-failing commit of a transaction. Any commit that was made
before the current transaction started cannot be relevant. Therefore
we never need to keep history from before the start of the
longest-running transaction that is open at any given time: whenever a
transaction ends (with or without a commit) then if it was the
longest-running open transaction we can delete history up until start
of the the next-longest-running open transaction.

Some transactions might stay open for a very long time, so if any
transaction exceeds conflict_max_history_seconds then we remove it
from consideration in this context, and will not guarantee to keep
remembering about historical commits made during such a transaction.

We implement this by keeping a list of all open transactions that have
not been open too long. When a transaction ends, we remove it from the
list, along with any that have been open longer than the maximum; then
we delete any history from before the start of the longest-running
transaction remaining in the list.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: only record operations with side-effects in history
Jonathan Davies [Thu, 23 Mar 2017 14:20:33 +0000 (14:20 +0000)]
oxenstored: only record operations with side-effects in history

There is no need to record "read" operations as they will never cause another
transaction to fail.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
8 years agooxenstored: support commit history tracking
Jonathan Davies [Tue, 14 Mar 2017 13:20:07 +0000 (13:20 +0000)]
oxenstored: support commit history tracking

Add ability to track xenstore tree operations -- either non-transactional
operations or committed transactions.

For now, the call to actually retain commits is commented out because history
can grow without bound.

For now, we call record_commit for all non-transactional operations. A
subsequent patch will make it retain only the ones with side-effects.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: add transaction info relevant to history-tracking
Jonathan Davies [Tue, 14 Mar 2017 12:17:38 +0000 (12:17 +0000)]
oxenstored: add transaction info relevant to history-tracking

Specifically:
 * retain the original store (not just the root) in full transactions
 * store commit count at the time of the start of the transaction

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: ignore domains with no conflict-credit
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: ignore domains with no conflict-credit

When processing connections, skip those from domains with no remaining
conflict-credit.

Also, issue a point of conflict-credit at regular intervals, the
period being set by the configuration option "conflict-max-history-
seconds".  When issuing conflict-credit, we give a point either to
every domain at once (one each) or only to the single domain at the
front of the queue, depending on the configuration option
"conflict-rate-limit-is-aggregate".

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: handling of domain conflict-credit
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: handling of domain conflict-credit

This commit gives each domain a conflict-credit variable, which will
later be used for limiting how often a domain can cause other domain's
transaction-commits to fail.

This commit also provides functions and data for manipulating domains
and their conflict-credit, and checking whether they have credit.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: comments explaining some variables
Thomas Sanders [Tue, 14 Mar 2017 12:15:52 +0000 (12:15 +0000)]
oxenstored: comments explaining some variables

It took a while of reading and reasoning to work out what these are
for, so here are comments to make life easier for everyone reading
this code in future.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Sanders <thomas.sanders@citrix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
8 years agooxenstored: allow compilation prior to OCaml 3.12.0
Jonathan Davies [Thu, 23 Mar 2017 16:28:45 +0000 (16:28 +0000)]
oxenstored: allow compilation prior to OCaml 3.12.0

Commit 363ae55c8 used an OCaml feature called record field punning. This broke
the build on compilers prior to OCaml 3.12.0.

This patch makes no semantic change but now uses backwards-compatible syntax.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
8 years agooxenstored: log request and response during transaction replay
Jonathan Davies [Thu, 23 Mar 2017 16:28:34 +0000 (16:28 +0000)]
oxenstored: log request and response during transaction replay

During a transaction replay, the replayed requests and the new responses are
logged in the same way as the original requests and the original responses.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agooxenstored: replay transaction upon conflict
Jonathan Davies [Thu, 23 Mar 2017 16:28:19 +0000 (16:28 +0000)]
oxenstored: replay transaction upon conflict

The existing transaction merge algorithm keeps track of the least upper bound
(longest common prefix) of all the nodes which have been read and written, and
will re-combine two stores which have disjoint upper bounds. This works well for
small transactions but causes unnecessary conflicts for ones that span a large
subtree, such as the following ones used by the xapi toolstack:

 * VM start: creates /vm/... /vss/... /local/domain/...
   The least upper bound of this transaction is / and so all
   these transactions conflict with everything.

 * Device hotplug: creates /local/domain/0/... /local/domain/n/...
   The least upper bound of this transaction is /local/domain so
   all these transactions conflict with each other.

If the existing merge algorithm cannot merge and commit, we attempt
a /replay/ of the failed transaction against the new store.

When we replay the requests we check whether the response sent to the client is
the same as during the first attempt at the transaction. If the responses are
all the same then the transaction replay can be committed. If any differ then
the transaction replay must be aborted and the client must retry.

This algorithm uses the intuition that the transactions made by the toolstack
are designed to be for separate domains, and should fundamentally not conflict
in the sense that they don't read or write any shared keys. By replaying the
transaction on the server side we do what the client would have to do anyway,
only we can do it quickly without allowing any other requests to interfere.

Performing 300 parallel simulated VM start and shutdowns without this code:

300 parallel starts and shutdowns: 268.92

Performing 300 parallel simulated VM start and shutdowns with this code:

300 parallel starts and shutdowns: 3.80

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Dave Scott <dave@recoil.org>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agooxenstored: move functions that process simple operations
Jonathan Davies [Thu, 23 Mar 2017 16:28:08 +0000 (16:28 +0000)]
oxenstored: move functions that process simple operations

Separate the functions which process operations that can be done as part of a
transaction. Specifically, these operations are: read, write, rm, getperms,
setperms, getdomainpath, directory, mkdir.

Also split function_of_type into two functions: one for processing the simple
operations and one for processing the rest.

This will help allow replay of transactions, allowing us to invoke the functions
that process the simple operations as part of the processing of transaction_end.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agooxenstored: keep track of each transaction's operations
Jonathan Davies [Thu, 23 Mar 2017 16:27:58 +0000 (16:27 +0000)]
oxenstored: keep track of each transaction's operations

A list of (request, response) pairs from the operations performed within the
transaction will be useful to support transaction replay.

Since this consumes memory, the number of requests per transaction must not be
left unbounded. Hence a new quota for this is introduced. This quota, configured
via the configuration key 'quota-maxrequests', limits the size of transactions
initiated by domUs.

After the maximum number of requests has been exhausted, any further requests
will result in EQUOTA errors. The client may then choose to end the transaction;
a successful commit will result in the retention of only the prior requests.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agooxenstored: refactor request processing
Jonathan Davies [Thu, 23 Mar 2017 16:27:50 +0000 (16:27 +0000)]
oxenstored: refactor request processing

Encapsulate the request in a record that is passed from do_input to
process_packet and input_handle_error.

This will be helpful when keeping track of the requests made as part of a
transaction.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agooxenstored: remove some unused parameters
Jonathan Davies [Thu, 23 Mar 2017 16:27:39 +0000 (16:27 +0000)]
oxenstored: remove some unused parameters

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agooxenstored: refactor putting response on wire
Jonathan Davies [Thu, 23 Mar 2017 16:27:23 +0000 (16:27 +0000)]
oxenstored: refactor putting response on wire

Previously, the functions reply_{ack,data,data_or_ack} and input_handle_error
put the response on the wire by invoking Connection.send_{ack,reply,error}.

Instead, these functions now return a value indicating what needs to be put on
the wire, and that action is done by a send_response function called
afterwards.

This refactoring gives us a chance to store the value of the response, useful
for replaying transactions.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jonathan Davies <jonathan.davies@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jon Ludlam <jonathan.ludlam@citrix.com>
Reviewed-by: Euan Harris <euan.harris@citrix.com>
Acked-by: David Scott <dave@recoil.org>
8 years agoxenstored: Log when the write transaction rate limit bites
Ian Jackson [Sat, 18 Mar 2017 16:45:27 +0000 (16:45 +0000)]
xenstored: Log when the write transaction rate limit bites

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
plus:

xenstore: dont increment bool variable
Instead of incrementing a bool variable just set it to true.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
8 years agoxenstored: apply a write transaction rate limit
Ian Jackson [Sat, 18 Mar 2017 16:44:46 +0000 (16:44 +0000)]
xenstored: apply a write transaction rate limit

This avoids a rogue client being about to stall another client (eg the
toolstack) indefinitely.

This is XSA-206.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
8 years agox86: use 64 bit mask when masking away mfn bits
Juergen Gross [Tue, 4 Apr 2017 13:03:14 +0000 (15:03 +0200)]
x86: use 64 bit mask when masking away mfn bits

When using _PAGE_PSE_PAT as base for a negated bit mask make sure it is
propagated to 64 bits when applied to a 64 bit value.

There seems to be only one place where this is a problem, so fix this
by casting _PAGE_PSE_PAT to 64 bits there.

Not doing so will probably lead to problems on hosts with more than
16 TB of memory.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4edb1a42e3320757e3559f17edf6903bc1777de3
master date: 2017-03-30 15:11:24 +0200

8 years agomemory: properly check guest memory ranges in XENMEM_exchange handling
Jan Beulich [Tue, 4 Apr 2017 13:02:33 +0000 (15:02 +0200)]
memory: properly check guest memory ranges in XENMEM_exchange handling

The use of guest_handle_okay() here (as introduced by the XSA-29 fix)
is insufficient here, guest_handle_subrange_okay() needs to be used
instead.

Note that the uses are okay in
- XENMEM_add_to_physmap_batch handling due to the size field being only
  16 bits wide,
- livepatch_list() due to the limit of 1024 enforced on the
  number-of-entries input (leaving aside the fact that this can be
  called by a privileged domain only anyway),
- compat mode handling due to counts there being limited to 32 bits,
- everywhere else due to guest arrays being accessed sequentially from
  index zero.

This is CVE-2017-7228 / XSA-212.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 938fd2586eb081bcbd694f4c1f09ae6a263b0d90
master date: 2017-04-04 14:47:46 +0200

8 years agoRevert "xen: sched: don't call hooks of the wrong scheduler via VCPU2OP"
Jan Beulich [Mon, 3 Apr 2017 07:58:53 +0000 (09:58 +0200)]
Revert "xen: sched: don't call hooks of the wrong scheduler via VCPU2OP"

This reverts commit 7ff6d9fc19ef88d2ca3304a312c0e8a46b61f546,
which osstest has found to be severely broken.

8 years agoxen: sched: don't call hooks of the wrong scheduler via VCPU2OP
Dario Faggioli [Fri, 31 Mar 2017 07:03:32 +0000 (09:03 +0200)]
xen: sched: don't call hooks of the wrong scheduler via VCPU2OP

Within context_saved(), we call the context_saved hook,
and we use VCPU2OP() to determine from what scheduler.
VCPU2OP uses DOM2OP, which uses d->cpupool, which is
NULL when d is the idle domain. And in that case,
DOM2OP just returns ops, the scheduler of cpupool0.

Therefore, if:
- cpupool0's scheduler defines context_saved (like
  Credit2 and RTDS do),
- we are not in cpupool0 (i.e., our scheduler is
  not ops),
- we are context switching from idle,

we call VCPU2OP(idle_vcpu), which means
DOM2OP(idle->cpupool), which is ops.

Therefore, we both:
- check if context_saved is defined in the wrong
  scheduler;
- if yes, call the wrong one.

When using Credit2 at boot, and also Credit2 in
the other cpupool, this is wrong but innocuous,
because it only involves the idle vcpus.

When using Credit2 at boot, and Credit1 in the
other cpupool, this is *totally* wrong, and
it's by chance it does not explode!

When using Credit2 and other schedulers I'm
developping, I hit the following assert (in
sched_credit2.c, on a CPU inside a cpupool that
does not use Credit2):

csched2_context_saved()
{
 ...
 ASSERT(!vcpu_on_runq(svc));
 ...
}

Fix this by dealing explicitly, in VCPU2OP, with
idle vcpus, returning the scheduler of the pCPU
they (always) run on.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a3653e6a279213ba4e883b2252415dc98633106a
master date: 2017-03-27 14:28:05 +0100

8 years agox86/EFI: avoid Xen image when looking for module/kexec position
Jan Beulich [Fri, 31 Mar 2017 07:03:01 +0000 (09:03 +0200)]
x86/EFI: avoid Xen image when looking for module/kexec position

When booting straight from EFI, we don't further try to relocate Xen.
As a result, so far we also didn't avoid the area Xen uses when looking
for a location to put modules or the kexec area. Introduce a fake
module slot to deal with that without having to fiddle with a lot of
code.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e22e1c47958a4778cd7baa3980f74e52f525ba28
master date: 2017-03-20 09:27:12 +0100

8 years agox86/EFI: avoid overrunning mb_modules[]
Jan Beulich [Fri, 31 Mar 2017 07:01:29 +0000 (09:01 +0200)]
x86/EFI: avoid overrunning mb_modules[]

Commit 436fb462ab ("x86/microcode: enable boot time (pre-Dom0)
loading") added a 4th module without providing an array slot for it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 02b37b7eff39e40828041b2fe480725ab8443258
master date: 2017-03-17 15:45:22 +0100

8 years agobuild/clang: fix XSM dummy policy when using clang 4.0
Roger Pau Monné [Fri, 31 Mar 2017 07:00:50 +0000 (09:00 +0200)]
build/clang: fix XSM dummy policy when using clang 4.0

There seems to be some weird bug in clang 4.0 that prevents xsm_pmu_op from
working as expected, and vpmu.o ends up with a reference to
__xsm_action_mismatch_detected which makes the build fail:

[...]
ld    -melf_x86_64_fbsd  -T xen.lds -N prelink.o  \
    xen/common/symbols-dummy.o -o xen/.xen-syms.0
prelink.o: In function `xsm_default_action':
xen/include/xsm/dummy.h:80: undefined reference to `__xsm_action_mismatch_detected'
xen/xen/include/xsm/dummy.h:80: relocation truncated to fit: R_X86_64_PC32 against undefined symbol `__xsm_action_mismatch_detected'
ld: xen/xen/.xen-syms.0: hidden symbol `__xsm_action_mismatch_detected' isn't defined

Then doing a search in the objects files:

# find xen/ -type f -name '*.o' -print0 | xargs -0 bash -c \
  'for filename; do nm "$filename" | \
  grep -q __xsm_action_mismatch_detected && echo "$filename"; done' bash
xen/arch/x86/prelink.o
xen/arch/x86/cpu/vpmu.o
xen/arch/x86/cpu/built_in.o
xen/arch/x86/built_in.o

The current patch is the only way I've found to fix this so far, by simply
moving the XSM_PRIV check into the default case in xsm_pmu_op. This also fixes
the behavior of do_xenpmu_op, which will now return -EINVAL for unknown
XENPMU_* operations, instead of -EPERM when called by a privileged domain.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
master commit: 9e4d116faff4545a7f21c2b01008e94d68e6db58
master date: 2017-03-14 18:19:29 +0100

8 years agox86: drop unneeded __packed attributes
Roger Pau Monné [Fri, 31 Mar 2017 07:00:13 +0000 (09:00 +0200)]
x86: drop unneeded __packed attributes

There where a couple of unneeded packed attributes in several x86-specific
structures, that are obviously aligned. The only non-trivial one is
vmcb_struct, which has been checked to have the same layout with and without
the packed attribute using pahole. In that case add a build-time size check to
be on the safe side.

No functional change is expected as a result of this commit.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 4036e7c592905c2292cdeba8269e969959427237
master date: 2017-03-07 17:11:06 +0100

8 years agoQEMU_TAG update
Ian Jackson [Tue, 21 Mar 2017 18:44:50 +0000 (18:44 +0000)]
QEMU_TAG update

8 years agox86/emul: Correct the decoding of mov to/from cr/dr
Andrew Cooper [Tue, 14 Mar 2017 13:06:03 +0000 (14:06 +0100)]
x86/emul: Correct the decoding of mov to/from cr/dr

The mov to/from cr/dr behave as if they were encoded with Mod = 3.  When
encoded with Mod != 3, no displacement or SIB bytes are fetched.

Add a test with a deliberately malformed ModRM byte.  (Also add the
automatically-generated simd.h to .gitignore.)

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c2e316b2f220af06dab76b1219e61441c31f6ff9
master date: 2017-03-07 17:29:16 +0000

8 years agoxen: credit2: don't miss accounting while doing a credit reset.
Dario Faggioli [Tue, 14 Mar 2017 13:05:41 +0000 (14:05 +0100)]
xen: credit2: don't miss accounting while doing a credit reset.

A credit reset basically means going through all the
vCPUs of a runqueue and altering their credits, as a
consequence of a 'scheduling epoch' having come to an
end.

Blocked or runnable vCPUs are fine, all the credits
they've spent running so far have been accounted to
them when they were scheduled out.

But if a vCPU is running on a pCPU, when a reset event
occurs (on another pCPU), that does not get properly
accounted. Let's therefore begin to do so, for better
accuracy and fairness.

In fact, after this patch, we see this in a trace:

 csched2:schedule cpu 10, rq# 1, busy, not tickled
 csched2:burn_credits d1v5, credit = 9998353, delta = 202996
 runstate_continue d1v5 running->running
 ...
 csched2:schedule cpu 12, rq# 1, busy, not tickled
 csched2:burn_credits d1v6, credit = -1327, delta = 9999544
 csched2:reset_credits d0v13, credit_start = 10500000, credit_end = 10500000, mult = 1
 csched2:reset_credits d0v14, credit_start = 10500000, credit_end = 10500000, mult = 1
 csched2:reset_credits d0v7, credit_start = 10500000, credit_end = 10500000, mult = 1
 csched2:burn_credits d1v5, credit = 201805, delta = 9796548
 csched2:reset_credits d1v5, credit_start = 201805, credit_end = 10201805, mult = 1
 csched2:burn_credits d1v6, credit = -1327, delta = 0
 csched2:reset_credits d1v6, credit_start = -1327, credit_end = 9998673, mult = 1

Which shows how d1v5 actually executed for ~9.796 ms,
on pCPU 10, when reset_credit() is executed, on pCPU
12, because of d1v6's credits going below 0.

Without this patch, this 9.796ms are not accounted
to anyone. With this patch, d1v5 is charged for that,
and its credits drop down from 9796548 to 201805.

And this is important, as it means that it will
begin the new epoch with 10201805 credits, instead
of 10500000 (which he would have, before this patch).

Basically, we were forgetting one round of accounting
in epoch x, for the vCPUs that are running at the time
the epoch ends. And this meant favouring a little bit
these same vCPUs, in epoch x+1, providing them with
the chance of execute longer than their fair share.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4fa4f8a3cd5afd4980ad9517755d002dc316abdc
master date: 2017-03-01 16:56:34 +0000

8 years agoxen: credit2: always mark a tickled pCPU as... tickled!
Dario Faggioli [Tue, 14 Mar 2017 13:05:14 +0000 (14:05 +0100)]
xen: credit2: always mark a tickled pCPU as... tickled!

In fact, whether or not a pCPU has been tickled, and is
therefore about to re-schedule, is something we look at
and base decisions on in various places.

So, let's make sure that we do that basing on accurate
information.

While there, also tweak a little bit smt_idle_mask_clear()
(used for implementing SMT support), so that it only alter
the relevant cpumask when there is the actual need for this.
(This is only for reduced overhead, behavior remains the
same).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
master commit: a76645240bd14e964e85dbc975a8989edea6aa27
master date: 2017-03-01 16:56:34 +0000

8 years agox86/vmx: Don't leak host syscall MSR state into HVM guests
Andrew Cooper [Tue, 14 Mar 2017 13:04:37 +0000 (14:04 +0100)]
x86/vmx: Don't leak host syscall MSR state into HVM guests

hvm_hw_cpu->msr_flags is in fact the VMX dirty bitmap of MSRs needing to be
restored when switching into guest context.  It should never have been part of
the migration state to start with, and Xen must not make any decisions based
on the value seen during restore.

Identify it as obsolete in the header files, consistently save it as zero and
ignore it on restore.

The MSRs must be considered dirty during VMCS creation to cause the proper
defaults of 0 to be visible to the guest.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 2f1add6e1c8789d979daaafa3d80ddc1bc375783
master date: 2017-02-21 11:06:39 +0000

8 years agoupdate Xen version to 4.6.6-pre
Jan Beulich [Tue, 14 Mar 2017 13:03:50 +0000 (14:03 +0100)]
update Xen version to 4.6.6-pre

8 years agoxen/arm: fix affected memory range by dcache clean functions
Stefano Stabellini [Fri, 3 Mar 2017 01:15:26 +0000 (17:15 -0800)]
xen/arm: fix affected memory range by dcache clean functions

clean_dcache_va_range and clean_and_invalidate_dcache_va_range don't
calculate the range correctly when "end" is not cacheline aligned. As a
result, the last cacheline is not skipped. Fix the issue by aligning the
start address to the cacheline size.

In addition, make the code simpler and faster in
invalidate_dcache_va_range, by removing the module operation and using
bitmasks instead. Also remove the size adjustments in
invalidate_dcache_va_range, because the size variable is not used later
on.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
8 years agoxen/arm: introduce vwfi parameter
Stefano Stabellini [Wed, 1 Mar 2017 19:43:15 +0000 (11:43 -0800)]
xen/arm: introduce vwfi parameter

Introduce new Xen command line parameter called "vwfi", which stands for
virtual wfi. The default is "trap": Xen traps guest wfi and wfe
instructions. In the case of wfi, Xen calls vcpu_block on the guest
vcpu; in the case of guest wfe, Xen calls vcpu_yield on the guest vcpu.
The behavior can be changed by setting vwfi to "native", in that case
Xen doesn't trap neither wfi nor wfe, running them in guest context.

The result is strong reduction in irq latency (from 5000ns to 2000ns,
measured using https://github.com/edgarigl/tbm, the physical timer, and
1 pcpu dedicated to 1 vcpu). The downside is that the scheduler thinks
that the guest is busy when actually is sleeping, leading to suboptimal
scheduling decisions.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoupdate Xen version to 4.6.5 RELEASE-4.6.5
Jan Beulich [Tue, 7 Mar 2017 16:19:05 +0000 (17:19 +0100)]
update Xen version to 4.6.5

8 years agoQEMU_TAG update
Ian Jackson [Wed, 22 Feb 2017 16:39:13 +0000 (16:39 +0000)]
QEMU_TAG update

8 years agoVMX: fix VMCS race on context-switch paths
Jan Beulich [Mon, 20 Feb 2017 15:06:20 +0000 (16:06 +0100)]
VMX: fix VMCS race on context-switch paths

When __context_switch() is being bypassed during original context
switch handling, the vCPU "owning" the VMCS partially loses control of
it: It will appear non-running to remote CPUs, and hence their attempt
to pause the owning vCPU will have no effect on it (as it already
looks to be paused). At the same time the "owning" CPU will re-enable
interrupts eventually (the lastest when entering the idle loop) and
hence becomes subject to IPIs from other CPUs requesting access to the
VMCS. As a result, when __context_switch() finally gets run, the CPU
may no longer have the VMCS loaded, and hence any accesses to it would
fail. Hence we may need to re-load the VMCS in vmx_ctxt_switch_from().

For consistency use the new function also in vmx_do_resume(), to
avoid leaving an open-coded incarnation of it around.

Reported-by: Kevin Mayer <Kevin.Mayer@gdata.de>
Reported-by: Anshul Makkar <anshul.makkar@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
master commit: 2f4d2198a9b3ba94c959330b5c94fe95917c364c
master date: 2017-02-17 15:49:56 +0100

8 years agoxen/p2m: Fix p2m_flush_table for non-nested cases
George Dunlap [Mon, 20 Feb 2017 15:05:52 +0000 (16:05 +0100)]
xen/p2m: Fix p2m_flush_table for non-nested cases

Commit 71bb7304e7a7a35ea6df4b0cedebc35028e4c159 added flushing of
nested p2m tables whenever the host p2m table changed.  Unfortunately
in the process, it added a filter to p2m_flush_table() function so
that the p2m would only be flushed if it was being used as a nested
p2m.  This meant that the p2m was not being flushed at all for altp2m
callers.

Only check np2m_base if p2m_class for nested p2m's.

NB that this is not a security issue: The only time this codepath is
called is in cases where either nestedp2m or altp2m is enabled, and
neither of them are in security support.

Reported-by: Matt Leinhos <matt@starlab.io>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 6192e6378e094094906950120470a621d5b2977c
master date: 2017-02-15 17:15:56 +0000

8 years agox86/ept: allow write-combining on !mfn_valid() MMIO mappings again
David Woodhouse [Mon, 20 Feb 2017 15:05:29 +0000 (16:05 +0100)]
x86/ept: allow write-combining on !mfn_valid() MMIO mappings again

For some MMIO regions, such as those high above RAM, mfn_valid() will
return false.

Since the fix for XSA-154 in commit c61a6f74f80e ("x86: enforce
consistent cachability of MMIO mappings"), guests have no longer been
able to use PAT to obtain write-combining on such regions because the
'ignore PAT' bit is set in EPT.

We probably want to err on the side of caution and preserve that
behaviour for addresses in mmio_ro_ranges, but not for normal MMIO
mappings. That necessitates a slight refactoring to check mfn_valid()
later, and let the MMIO case get through to the right code path.

Since we're not bailing out for !mfn_valid() immediately, the range
checks need to be adjusted to cope \97 simply by masking in the low bits
to account for 'order' instead of adding, to avoid overflow when the mfn
is INVALID_MFN (which happens on unmap, since we carefully call this
function to fill in the EMT even though the PTE won't be valid).

The range checks are also slightly refactored to put only one of them in
the fast path in the common case. If it doesn't overlap, then it
*definitely* isn't contained, so we don't need both checks. And if it
overlaps and is only one page, then it definitely *is* contained.

Finally, add a comment clarifying how that 'return -1' works \97 it isn't
returning an error and causing the mapping to fail; it relies on
resolve_misconfig() being able to split the mapping later. So it's
*only* sane to do it where order>0 and the 'problem' will be solved by
splitting the large page. Not for blindly returning 'error', which I was
tempted to do in my first attempt.

Signed-off-by: David Woodhouse <dwmw@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 30921dc2df3665ca1b2593595aa6725ff013d386
master date: 2017-02-07 14:30:01 +0100

8 years agox86/VT-x: Dump VMCS on VMLAUNCH/VMRESUME failure
Andrew Cooper [Mon, 20 Feb 2017 15:04:52 +0000 (16:04 +0100)]
x86/VT-x: Dump VMCS on VMLAUNCH/VMRESUME failure

If a VMLAUNCH/VMRESUME fails due to invalid control or host state, dump the
VMCS before crashing the domain.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d0fd9ae54491328b10dee4003656c14b3bf3d3e9
master date: 2016-07-04 10:51:48 +0100

8 years agoIOMMU: always call teardown callback
Oleksandr Tyshchenko [Wed, 15 Feb 2017 12:20:59 +0000 (12:20 +0000)]
IOMMU: always call teardown callback

There is a possible scenario when (d)->need_iommu remains unset
during guest domain execution. For example, when no devices
were assigned to it. Taking into account that teardown callback
is not called when (d)->need_iommu is unset we might have unreleased
resourses after destroying domain.

So, always call teardown callback to roll back actions
that were performed in init callback.

This is XSA-207.

Signed-off-by: Oleksandr Tyshchenko <olekstysh@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <julien.grall@arm.com>
8 years agox86/emulate: don't assume that addr_size == 32 implies protected mode
George Dunlap [Thu, 9 Feb 2017 09:37:08 +0000 (10:37 +0100)]
x86/emulate: don't assume that addr_size == 32 implies protected mode

Callers of x86_emulate() generally define addr_size based on the code
segment.  In vm86 mode, the code segment is set by the hardware to be
16-bits; but it is entirely possible to enable protected mode, set the
CS to 32-bits, and then disable protected mode.  (This is commonly
called "unreal mode".)

But the instruction decoder only checks for protected mode when
addr_size == 16.  So in unreal mode, hardware will throw a #UD for VEX
prefixes, but our instruction decoder will decode them, triggering an
ASSERT() further on in _get_fpu().  (With debug=n the emulator will
incorrectly emulate the instruction rather than throwing a #UD, but
this is only a bug, not a crash, so it's not a security issue.)

Teach the instruction decoder to check that we're in protected mode,
even if addr_size is 32.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Split real mode and VM86 mode handling, as VM86 mode is strictly 16-bit
at all times. Re-base.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 05118b1596ffe4559549edbb28bd0124a7316123
master date: 2017-01-25 15:09:55 +0100

8 years agox86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed
Joao Martins [Thu, 9 Feb 2017 09:35:58 +0000 (10:35 +0100)]
x86/hvm: do not set msr_tsc_adjust on hvm_set_guest_tsc_fixed

Commit 6e03363 ("x86: Implement TSC adjust feature for HVM guest")
implemented TSC_ADJUST MSR for hvm guests. Though while booting
an HVM guest the boot CPU would have a value set with delta_tsc -
guest tsc while secondary CPUS would have 0. For example one can
observe:
 $ xen-hvmctx 17 | grep tsc_adjust
 TSC_ADJUST: tsc_adjust ff9377dfef47fe66
 TSC_ADJUST: tsc_adjust 0
 TSC_ADJUST: tsc_adjust 0
 TSC_ADJUST: tsc_adjust 0

Upcoming Linux 4.10 now validates whether this MSR is correct and
adjusts them accordingly under the following conditions: values of < 0
(our case for CPU 0) or != 0 or values > 7FFFFFFF. In this conditions it
will force set to 0 and for the CPUs that the value doesn't match all
together. If this msr is not correct we would see messages such as:

[Firmware Bug]: TSC ADJUST: CPU0: -30517044286984129 force to 0

And on HVM guests supporting TSC_ADJUST (requiring at least Haswell
Intel) it won't boot.

Our current vCPU 0 values are incorrect and according to Intel SDM which on
section "Time-Stamp Counter Adjustment" states that "On RESET, the value
of the IA32_TSC_ADJUST MSR is 0." hence we should set it 0 and be
consistent across multiple vCPUs. Perhaps this MSR should be only
changed by the guest which already happens through
hvm_set_guest_tsc_adjust(..) routines (see below). After this patch
guests running Linux 4.10 will see a valid IA32_TSC_ADJUST msr of value
 0 for all CPUs and are able to boot.

On the same section of the spec ("Time-Stamp Counter Adjustment") it is
also stated:
"If an execution of WRMSR to the IA32_TIME_STAMP_COUNTER MSR
 adds (or subtracts) value X from the TSC, the logical processor also
 adds (or subtracts) value X from the IA32_TSC_ADJUST MSR.

 Unlike the TSC, the value of the IA32_TSC_ADJUST MSR changes only in
 response to WRMSR (either to the MSR itself, or to the
 IA32_TIME_STAMP_COUNTER MSR). Its value does not otherwise change as
 time elapses. Software seeking to adjust the TSC can do so by using
 WRMSR to write the same value to the IA32_TSC_ADJUST MSR on each logical
 processor."

This suggests these MSRs values should only be changed through guest i.e.
throught write intercept msrs. We keep IA32_TSC MSR logic such that writes
accomodate adjustments to TSC_ADJUST, hence no functional change in the
msr_tsc_adjust for IA32_TSC msr. Though, we do that in a separate routine
namely hvm_set_guest_tsc_msr instead of through hvm_set_guest_tsc(...).

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 98297f09bd07bb63407909aae1d309d8adeb572e
master date: 2017-01-24 12:37:36 +0100

8 years agox86: segment attribute handling adjustments
Jan Beulich [Thu, 9 Feb 2017 09:35:24 +0000 (10:35 +0100)]
x86: segment attribute handling adjustments

Null selector loads into SS (possible in 64-bit mode only, and only in
rings other than ring 3) must not alter SS.DPL. (This was found to be
an issue on KVM, and fixed in Linux commit 33ab91103b.)

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 366ff5f1b3252f9069d5aedb2ffc2567bb0a37c9
master date: 2017-01-20 14:39:12 +0100

8 years agox86emul: LOCK check adjustments
Jan Beulich [Thu, 9 Feb 2017 09:35:02 +0000 (10:35 +0100)]
x86emul: LOCK check adjustments

BT, being encoded as DstBitBase just like BT{C,R,S}, nevertheless does
not write its (register or memory) operand and hence also doesn't allow
a LOCK prefix to be used.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f2d4f4ba80de8a03a1b0f300d271715a88a8433d
master date: 2017-01-20 14:37:33 +0100

8 years agox86emul: VEX.B is ignored in compatibility mode
Jan Beulich [Thu, 9 Feb 2017 09:34:28 +0000 (10:34 +0100)]
x86emul: VEX.B is ignored in compatibility mode

While VEX.R and VEX.X are guaranteed to be 1 in compatibility mode
(and hence a respective mode_64bit() check can be dropped), VEX.B can
be encoded as zero, but would be ignored by the processor. Since we
emulate instructions in 64-bit mode (except possibly in the test
harness), we need to force the bit to 1 in order to not act on the
wrong {X,Y,Z}MM register (which has no bad effect on 32-bit test
harness builds, as there the bit would again be ignored by the
hardware, and would by default be expected to be 1 anyway).

We must not, however, fiddle with the high bit of VEX.VVVV in the
decode phase, as that would undermine the checking of instructions
requiring the field to be all ones independent of mode. This is
being enforced in copy_REX_VEX() instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86emul: correct VEX/XOP/EVEX operand size handling for 16-bit code

Operand size defaults to 32 bits in that case, but would not have been
set that way in the absence of an operand size override.

Reported-by: Wei Liu <wei.liu2@citrix.com> (by AFL fuzzing)
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 89c76ee7f60777b81c8fd0475a6af7c84e72a791
master date: 2017-01-17 10:32:25 +0100
master commit: beb82042447c5d6e7073d816d6afc25c5a423cde
master date: 2017-01-25 15:08:59 +0100

8 years agolibxl: Revert 3658f7a0bdd8 "libxl: fix libxl_set_memory_target"
Ian Jackson [Sat, 21 Jan 2017 19:03:23 +0000 (19:03 +0000)]
libxl: Revert 3658f7a0bdd8 "libxl: fix libxl_set_memory_target"

I backported
  3658f7a0bdd8fcda217559927e25263a07398c27
  libxl: fix libxl_set_memory_target
but this was not necessary, because the commit it fixes is not in
Xen 4.6.

So the backport introduces a duplicate cal to xc_domain_getinfolist
and also breaks the build with
    libxl.c:4908:5: error: â€˜r’ undeclared (first use in this function)

Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
8 years agolibxl: fix libxl_set_memory_target
Wei Liu [Thu, 29 Dec 2016 16:36:31 +0000 (16:36 +0000)]
libxl: fix libxl_set_memory_target

Commit 26dbc93a ("libxl: Remove pointless hypercall from
libxl_set_memory_target") removed the call to xc_domain_getinfolist, but
it failed to notice that "info" was actually needed later.

Put that back. While at it, make the code conform to coding style
requirement.

Reported-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit ed5f19aea66fe5a72060d6a795ffcd23b7643ee3)
(cherry picked from commit e1cefedd80f9972854769bfc6e32e23b56cd0712)
(cherry picked from commit 013ee593ca0456e1adfdf80ef7e44b151cdb1545)

8 years agoinit/FreeBSD: fix incorrect usage of $rc_pids in xendriverdomain
Roger Pau Monne [Wed, 21 Dec 2016 16:47:26 +0000 (16:47 +0000)]
init/FreeBSD: fix incorrect usage of $rc_pids in xendriverdomain

It should be rc_pid.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Nathan Friess <nathan.friess@gmail.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit fb4c92ffa661516e41d24974d3d0a2a3608caf68)
(cherry picked from commit c5efe958ca4b86779dc7564bf2682f3df71202e7)
(cherry picked from commit 287553f8d108f3660ec7eefb9a5b139d4a6faf30)

8 years agoinit/FreeBSD: add rc control variables
Roger Pau Monne [Mon, 19 Dec 2016 15:02:04 +0000 (15:02 +0000)]
init/FreeBSD: add rc control variables

Those are used in order to decide which scripts are executed at init.

Ref: https://www.freebsd.org/doc/en/articles/rc-scripting/article.html#rcng-confdummy

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
[ wei: fix up conflict ]
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 4d362ce02aaf1699957fb7c0edc6ae5839ccb30e)
(cherry picked from commit 63c68c7ec5b926d218d4d81b96b4352d30a59c7b)
(cherry picked from commit 675aa124b414d6411f7a661ef638e73e45645b61)

8 years agoinit/FreeBSD: fix xencommons so it can only be launched by Dom0
Roger Pau Monne [Mon, 19 Dec 2016 15:02:03 +0000 (15:02 +0000)]
init/FreeBSD: fix xencommons so it can only be launched by Dom0

At the moment the execution of xencommons is gated on the presence of the
privcmd device, but that's not correct, since privcmd is available to all Xen
domains (privileged or unprivileged). Instead of using privcmd use the
xenstored device, which will only be available to the domain that's in charge
of running xenstored, and thus xencommons.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit c875b9778da0c56a0c118626771465b87df31fe8)
(cherry picked from commit 3667bc0394743085548c85632b8dc5c3d77483d7)
(cherry picked from commit aa804864f151352ba83292e8918637d0bb319491)

8 years agoinit/FreeBSD: remove xendriverdomain_precmd
Roger Pau Monne [Mon, 19 Dec 2016 15:02:02 +0000 (15:02 +0000)]
init/FreeBSD: remove xendriverdomain_precmd

...because it's empty. While there also rename xendriverdomain_startcmd to
xendriverdomain_start in order to match the nomenclature of the file.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
[ wei: fix up minor error ]
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 29b968e46b215bea8881abdfd06a046417b83006)
(cherry picked from commit 86e54bea2bbaa01cbb4b04ec73dee459b89734f2)
(cherry picked from commit 57a037ec287cd6069c6e38df247d6316e00cb9db)

8 years agoinit/FreeBSD: set correct PATH for xl devd
Roger Pau Monne [Mon, 19 Dec 2016 15:02:01 +0000 (15:02 +0000)]
init/FreeBSD: set correct PATH for xl devd

FreeBSD init scripts don't have /usr/local/{bin/sbin} in it's PATH, which
prevents `xl devd` from working properly since hotplug scripts require the set
of xenstore cli tools to be in PATH.

While there also fix the usage of --pidfile, which according to the xl help
doesn't use "=", and add braces around XLDEVD_PIDFILE.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 1d52073334d7615934fe804bc656b7aab0e92ebd)
(cherry picked from commit e7ad85ef7ecd64cb80705d438bc1b041e3605310)
(cherry picked from commit 36095b27368479056b56f5f0e53c084129fc9c66)

8 years agoxen/arm: gic-v3: Make sure read from ICC_IAR1_EL1 is visible on the redistributor
Julien Grall [Wed, 18 Jan 2017 18:54:08 +0000 (18:54 +0000)]
xen/arm: gic-v3: Make sure read from ICC_IAR1_EL1 is visible on the redistributor

"The effects of reading ICC_IAR0_EL1 and ICC_IAR1_EL1 on the state of a
returned INTID are not guaranteed to be visible until after the execution
of a DSB".

Because of the GIC is an external component, a dsb sy is required.
Without it the sysreg read may not have been made visible on the
redistributor.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
8 years agox86/emul: Correct the return value handling of VMFUNC
Andrew Cooper [Wed, 18 Jan 2017 09:25:49 +0000 (10:25 +0100)]
x86/emul: Correct the return value handling of VMFUNC

The bracketing of x86_emulate() calling the ops->vmfunc() hook is wrong with
respect to the assignment to rc, which can trip the new assertions in
x86_emulate_wrapper().

The hvmemul_vmfunc() hook should only raise #UD if X86EMUL_EXCEPTION is
returned.  This is only a latent bug at the moment.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3ab1876504d409689824e161a8b04e57e1e5dd46
master date: 2016-12-22 13:32:46 +0000

8 years agox86emul: CMPXCHG16B requires an aligned operand
Jan Beulich [Wed, 18 Jan 2017 09:24:51 +0000 (10:24 +0100)]
x86emul: CMPXCHG16B requires an aligned operand

This distinguishes it from CMPXCHG8B.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d7d3a82c28a1102ee6c9707071a946164bae0d57
master date: 2016-12-16 14:37:11 +0100

8 years agoVT-d: correct dma_msi_set_affinity()
Jan Beulich [Wed, 18 Jan 2017 09:24:22 +0000 (10:24 +0100)]
VT-d: correct dma_msi_set_affinity()

Commit 83cd2038fe ("VT-d: use msi_compose_msg()) together with
15aa6c6748 ("amd iommu: use base platform MSI implementation"),
introducing the use of a per-CPU scratch CPU mask, went too far:
dma_msi_set_affinity() may, at least in theory, be called in
interrupt context, and hence the use of that scratch variable is not
correct.

Since the function overwrites the destination information anyway,
allow msi_compose_msg() to be called with a NULL CPU mask, avoiding
the use of that scratch variable.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7f885a1f49a75c770360b030666a5c1545156e5c
master date: 2016-12-16 14:33:43 +0100

8 years agox86emul: MOVNTI does not allow REP prefixes
Jan Beulich [Wed, 18 Jan 2017 09:23:57 +0000 (10:23 +0100)]
x86emul: MOVNTI does not allow REP prefixes

Just like 66, prefixes F3 and F2 cause #UD.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 96a7cb37b921d2b320183d194d143262e1dd5b53
master date: 2016-12-14 10:11:08 +0100

8 years agox86/VPMU: clear the overflow status of which counter happened to overflow
Luwei Kang [Wed, 18 Jan 2017 09:23:33 +0000 (10:23 +0100)]
x86/VPMU: clear the overflow status of which counter happened to overflow

Just set the corresponding bits of counters which happened to overflow,
rather than setting all the available bits of IA32_PERF_GLOBAL_OVF_CTRL
when pmu interrupt happened.

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7a0c70482580234868fcc53b8d72e31966dc7c52
master date: 2016-12-13 14:21:26 +0100

8 years agox86emul: correct PUSHF/POPF
Jan Beulich [Wed, 18 Jan 2017 09:23:10 +0000 (10:23 +0100)]
x86emul: correct PUSHF/POPF

Both need to raise #GP(0) when in VM86 mode with IOPL < 3.

Additionally PUSHF is documented to clear VM and RF from the value
placed onto the stack.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e5c1b8145bccb7fc587ee5b0c95ace6c5e0c7ffd
master date: 2016-12-07 13:55:42 +0100

8 years agolibelf: section index 0 is special
Jan Beulich [Wed, 18 Jan 2017 09:22:43 +0000 (10:22 +0100)]
libelf: section index 0 is special

When iterating over sections, table entry zero needs to be ignored.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 41fe9cabf29ea15c1f8edee49326dfde705013d3
master date: 2016-12-07 13:52:35 +0100

8 years agox86emul: CMOVcc always writes its destination
Jan Beulich [Wed, 18 Jan 2017 09:22:14 +0000 (10:22 +0100)]
x86emul: CMOVcc always writes its destination

This would be benign if there wasn't the zero-extending side effect of
32-bit operations in 64-bit mode.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cc53a74291ea5dd5b2c9a327dc386c0e5f859237
master date: 2016-11-25 14:31:50 +0100

8 years agox86/vmx: Don't deliver #MC with an error code
Andrew Cooper [Wed, 18 Jan 2017 09:21:34 +0000 (10:21 +0100)]
x86/vmx: Don't deliver #MC with an error code

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 892d191df60806ea63f9d3c4fc37615bee028812
master date: 2016-11-25 10:48:20 +0000

8 years agox86/emul: Don't deliver #UD with an error code
Andrew Cooper [Wed, 18 Jan 2017 09:20:58 +0000 (10:20 +0100)]
x86/emul: Don't deliver #UD with an error code

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9586cba3383021bb4bd57f3fa33e87cc64b4c74a
master date: 2016-11-25 10:48:10 +0000

8 years agox86/SVM: don't deliver #GP without error code
Jan Beulich [Wed, 18 Jan 2017 09:19:32 +0000 (10:19 +0100)]
x86/SVM: don't deliver #GP without error code

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 647c7bd4453c224d9ccdfdb37491544f797fdc48
master date: 2016-11-25 09:46:32 +0100

8 years agox86/EFI: meet further spec requirements for runtime calls
Jan Beulich [Wed, 18 Jan 2017 09:19:03 +0000 (10:19 +0100)]
x86/EFI: meet further spec requirements for runtime calls

So far we didn't guarantee 16-byte alignment of the stack: While (so
far) we don't tell the compiler to use smaller alignment, we also don't
guarantee 16-byte alignment when establishing stack pointers for new
vCPU-s. Runtime service functions using SSE instructions may end with
#GP(0) without that.

Note that making use of -mpreferred-stack-boundary=3, as mentioned in
the comment, wouldn't help to reduce the needed alignment: The compiler
would then be free to align the stack of the function with the aligned
object, but would be permitted to place an odd number of 8-byte objects
there, resulting in the callee to still run on an unaligned stack.

(The only working alternative to the approach chosen here would be to
use -mincoming-stack-boundary=3, but that would affect all functions in
runtime.c, not just the ones actually making runtime services calls.
And it would still require the manual alignment logic here to be used
with gcc 5.2 and earlier - not permitting that command line option -,
just that then the alignment amount would become conditional.)

Hence enforce the needed alignment by making efi_rs_enter() return a
suitably aligned structure, which the caller then necessarily has to
store in a suitably aligned local variable, the address of which then
gets passed to efi_rs_leave(). Also (to limit exposure) move the
function declarations to where they belong: They're local to runtime.c,
and shared only with compat.c (by the latter including the former).

Furthermore we should avoid #MF to be raised on the FLDCW we do.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f6b7fedc896250028cb81dafe9a3f6773aaf1da2
master date: 2016-11-22 13:52:53 +0100

8 years agox86/svm: Fix svm_nextrip_insn_length() when crossing the virtual boundary to 0
Andrew Cooper [Wed, 18 Jan 2017 09:18:10 +0000 (10:18 +0100)]
x86/svm: Fix svm_nextrip_insn_length() when crossing the virtual boundary to 0

vmcb->nextrip can legitimately be less than vmcb->rip when execution wraps
back around to 0.  Instead, complain if the reported length is greater than 15
and use x86_decode_insn() as a fallback.

While making changes here, fix two whitespace issues with the case labels.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
x86/hvm: Fix non-debug build folling c/s 0745f665a5

The variable is named inst_len, not insn_len.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 0745f665a575bdb6724f6ec1ab767cd71ba8c253
master date: 2016-11-21 14:01:45 +0000
master commit: f678e2c78110e73431217306bbd33c736802d700
master date: 2016-11-21 17:17:51 +0000

8 years agox86/traps: Don't call hvm_hypervisor_cpuid_leaf() for PV guests
Andrew Cooper [Wed, 18 Jan 2017 09:17:43 +0000 (10:17 +0100)]
x86/traps: Don't call hvm_hypervisor_cpuid_leaf() for PV guests

Luckily, hvm_hypervisor_cpuid_leaf() and vmx_hypervisor_cpuid_leaf() are safe
to execute in the context of a PV guest, but HVM-specific feature flags
shouldn't be visible to PV guests.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0f43883193da76fc928e836e319c3172f394e0f3
master date: 2016-11-16 10:33:18 +0000

8 years agox86/vmx: Correct the long mode check in vmx_cpuid_intercept()
Andrew Cooper [Wed, 18 Jan 2017 09:17:21 +0000 (10:17 +0100)]
x86/vmx: Correct the long mode check in vmx_cpuid_intercept()

%cs.L may be set in a legacy mode segment, or clear in a compatibility mode
segment; it is not the correct way to check for long mode being active.

Both of these situations result in incorrect visibility of the SYSCALL feature
in CPUID, and by extension, incorrect behaviour in hvm_efer_valid().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: fcb618c025f9251d7e22138f6528595037252c21
master date: 2016-11-16 10:32:54 +0000

8 years agox86/svm: Don't clobber eax and edx if an RDMSR intercept fails
Andrew Cooper [Wed, 18 Jan 2017 09:16:52 +0000 (10:16 +0100)]
x86/svm: Don't clobber eax and edx if an RDMSR intercept fails

The original code has a bug; eax and edx get unconditionally updated even when
hvm_msr_read_intercept() doesn't return X86EMUL_OKAY.

It is only by blind luck (vmce_rdmsr() eagerly initialising its msr_content
pointer) that this isn't an information leak into guests.

While fixing this bug, reduce the scope of msr_content and initialise it to 0.
This makes it obvious that a stack leak won't occur, even if there were to be
a buggy codepath in hvm_msr_read_intercept().

Also make some non-functional improvements.  Make the insn_len calculation
common, and reduce the quantity of explicit casting by making better use of
the existing register names.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: a0b4e3c0681a11b765fe218fba0ba4ebb9fa56c5
master date: 2016-11-10 15:34:42 +0000

8 years agox86emul: {L,S}{G,I}DT ignore operand size overrides in 64-bit mode
Jan Beulich [Wed, 18 Jan 2017 09:16:20 +0000 (10:16 +0100)]
x86emul: {L,S}{G,I}DT ignore operand size overrides in 64-bit mode

This affects not only the layout of the data (always 2+8 bytes), but
also the contents (no truncation to 24 bits occurs).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4ccb2adb96042e0d1e334c01fe260b32e6001db9
master date: 2016-11-03 17:23:22 +0100

8 years agox86/emul: Reject LGDT/LIDT attempts with non-canonical base addresses
Andrew Cooper [Wed, 18 Jan 2017 09:15:56 +0000 (10:15 +0100)]
x86/emul: Reject LGDT/LIDT attempts with non-canonical base addresses

No sane OS would deliberately try this, but make Xen's emulation match real
hardware by delivering #GP(0), rather than suffering a VMEntry failure.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
master commit: 12bc22f79117dfae5e59382cdda6b8b6b70a7554
master date: 2016-11-03 12:23:23 +0000

8 years agox86/emul: Correct the decoding of SReg3 operands
Andrew Cooper [Wed, 18 Jan 2017 09:15:22 +0000 (10:15 +0100)]
x86/emul: Correct the decoding of SReg3 operands

REX.R is ignored when considering segment register operands, and needs masking
out first.

While fixing this, reorder the user segments in x86_segment to match SReg3
encoding.  This avoids needing a translation table between hardware ordering
and Xen's ordering.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
VMX: fix realmode emulation SReg handling

Commit 0888d36bb2 ("x86/emul: Correct the decoding of SReg3 operands")
overlooked three places where x86_seg_cs was assumed to be zero.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0888d36bb23f7365ce12b03127fd0fb2661ec90e
master date: 2016-10-26 14:04:12 +0100
master commit: a62511bf14971ff581212decbbf57fc11b967840
master date: 2016-10-31 08:57:47 +0100

8 years agox86/HVM: add missing NULL check before using VMFUNC hook
Jan Beulich [Wed, 21 Dec 2016 16:45:42 +0000 (17:45 +0100)]
x86/HVM: add missing NULL check before using VMFUNC hook

This is CVE-2016-10025 / XSA-203.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 74dcd0ce6f4fadce8093e54f0fc1a45426577e13
master date: 2016-12-21 16:47:19 +0100

8 years agox86: force EFLAGS.IF on when exiting to PV guests
Jan Beulich [Wed, 21 Dec 2016 16:45:05 +0000 (17:45 +0100)]
x86: force EFLAGS.IF on when exiting to PV guests

Guest kernels modifying instructions in the process of being emulated
for another of their vCPU-s may effect EFLAGS.IF to be cleared upon
next exiting to guest context, by converting the being emulated
instruction to CLI (at the right point in time). Prevent any such bad
effects by always forcing EFLAGS.IF on. And to cover hypothetical other
similar issues, also force EFLAGS.{IOPL,NT,VM} to zero.

This is CVE-2016-10024 / XSA-202.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0e47f92b072548800223f9a21ea051a017173915
master date: 2016-12-21 16:46:13 +0100

8 years agox86/emul: Correct the handling of eflags with SYSCALL
Andrew Cooper [Sun, 18 Dec 2016 15:42:59 +0000 (15:42 +0000)]
x86/emul: Correct the handling of eflags with SYSCALL

A singlestep #DB is determined by the resulting eflags value from the
execution of SYSCALL, not the original eflags value.

By using the original eflags value, we negate the guest kernels attempt to
protect itself from a privilege escalation by masking TF.

Introduce a tf boolean and have the SYSCALL emulation recalculate it
after the instruction is complete.

This is XSA-204

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
8 years agox86emul: CMPXCHG8B ignores operand size prefix
Jan Beulich [Tue, 13 Dec 2016 13:27:24 +0000 (14:27 +0100)]
x86emul: CMPXCHG8B ignores operand size prefix

Otherwise besides mis-handling the instruction, the comparison failure
case would result in uninitialized stack data being handed back to the
guest in rDX:rAX (32 bits leaked for 32-bit guests, 96 bits for 64-bit
ones).

This is CVE-2016-9932 / XSA-200.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
8 years agomissing vgic_unlock_rank in gic_remove_irq_from_guest
Stefano Stabellini [Fri, 9 Dec 2016 00:59:28 +0000 (16:59 -0800)]
missing vgic_unlock_rank in gic_remove_irq_from_guest

Add missing vgic_unlock_rank on the error path in
gic_remove_irq_from_guest.

Coverity-ID: 1381843

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
8 years agoQEMU_TAG update
Ian Jackson [Wed, 7 Dec 2016 16:54:20 +0000 (16:54 +0000)]
QEMU_TAG update

8 years agoarm64: fix incorrect memory region size in TCR_EL2
Shanker Donthineni [Thu, 17 Mar 2016 12:46:58 +0000 (13:46 +0100)]
arm64: fix incorrect memory region size in TCR_EL2

The maximum and minimum values for TxSZ depend on level of
translation as per AArch64 Virtual Memory System Architecture.
According to ARM specification DDI0487A_h (sec D4.2.2, page 1752),
the minimum TxSZ value is 16. If TxSZ is programmed to a value
smaller than 16 then it is IMPLEMENTATION DEFINED.

This patch sets T0SZ to (64-48)bits since XEN uses all 4 levels
to cover 48bit (256TB) virtual address instead of value zero.

Signed-off-by: Shanker Donthineni <shankerd@codeaurora.org>
Acked-by: Julien Grall <julien.grall@arm.com>
8 years agoQEMU_TAG update
Ian Jackson [Tue, 29 Nov 2016 18:38:11 +0000 (18:38 +0000)]
QEMU_TAG update

8 years agoarm32: handle async aborts delivered while at HYP
Wei Chen [Tue, 29 Nov 2016 15:12:39 +0000 (16:12 +0100)]
arm32: handle async aborts delivered while at HYP

If guest generates an asynchronous abort and then traps into HYP
(by HVC or IRQ) before the abort has been delivered, the hypervisor
could not catch it, because the PSTATE.A bit is masked all the time
in hypervisor. So this asynchronous abort may be slipped to next
running guest with PSTATE.A bit unmasked.

In order to avoid this, it is necessary to take the abort at HYP, by
clearing the PSTATE.A bit. In this patch, we unmask the PSTATE.A bit
to open a window to catch guest-generated asynchronous abort in all
Guest -> HYP switch paths. If we caught such asynchronous abort in
checking window, the HYP data abort exception will be triggered and
the abort source guest will be crashed.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 6aaff7b407ca76dcfc4fe81f2afe9d1594cb0d6b
master date: 2016-11-29 15:59:55 +0100

8 years agoarm: crash the guest when it traps on external abort
Wei Chen [Tue, 29 Nov 2016 15:12:20 +0000 (16:12 +0100)]
arm: crash the guest when it traps on external abort

If we spot a data or prefetch abort bearing the ESR_EL2.EA bit set, we
know that this is an external abort, and that should crash the guest.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Julien Grall <Julien.Grall@arm.com>
master commit: f8c6a9334b251d2e78b0873a71b4d369908fb123
master date: 2016-11-29 15:59:26 +0100

8 years agoarm64: handle async aborts delivered while at EL2
Wei Chen [Tue, 29 Nov 2016 15:12:05 +0000 (16:12 +0100)]
arm64: handle async aborts delivered while at EL2

If EL1 generates an asynchronous abort and then traps into EL2
(by HVC or IRQ) before the abort has been delivered, the hypervisor
could not catch it, because the PSTATE.A bit is masked all the time
in hypervisor. So this asynchronous abort may be slipped to next
running guest with PSTATE.A bit unmasked.

In order to avoid this, it is necessary to take the abort at EL2, by
clearing the PSTATE.A bit. In this patch, we unmask the PSTATE.A bit
to open a window to catch guest-generated asynchronous abort in all
EL1 -> EL2 swich paths. If we catched such asynchronous abort in
checking window, the hyp_error exception will be triggered and the
abort source guest will be crashed.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 36008800e81bc061cce1fd204a0b638f9dc61c70
master date: 2016-11-29 15:58:57 +0100

8 years agoarm64: handle guest-generated EL1 asynchronous abort
Wei Chen [Tue, 29 Nov 2016 15:11:39 +0000 (16:11 +0100)]
arm64: handle guest-generated EL1 asynchronous abort

In current code, when the hypervisor receives an asynchronous abort
from a guest, the hypervisor will do panic, the host will be down.
We have to prevent such security issue, so, in this patch we crash
the guest, when the hypervisor receives an asynchronous abort from
the guest.

This is part of XSA-201.

Signed-off-by: Wei Chen <Wei.Chen@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Julien Grall <Julien.Grall@arm.com>
master commit: 2cf7d2bafb9b68add1710b8c3f7ecad58e53a9db
master date: 2016-11-29 15:57:52 +0100

8 years agopygrub: Properly quote results, when returning them to the caller:
Ian Jackson [Tue, 22 Nov 2016 13:23:54 +0000 (14:23 +0100)]
pygrub: Properly quote results, when returning them to the caller:

* When the caller wants sexpr output, use `repr()'
  This is what Xend expects.

  The returned S-expressions are now escaped and quoted by Python,
  generally using '...'.  Previously kernel and ramdisk were unquoted
  and args was quoted with "..." but without proper escaping.  This
  change may break toolstacks which do not properly dequote the
  returned S-expressions.

* When the caller wants "simple" output, crash if the delimiter is
  contained in the returned value.

  With --output-format=simple it does not seem like this could ever
  happen, because the bootloader config parsers all take line-based
  input from the various bootloader config files.

  With --output-format=simple0, this can happen if the bootloader
  config file contains nul bytes.

This is CVE-2016-9379 and CVE-2016-9380 / XSA-198.

Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Tested-by: Ian Jackson <Ian.Jackson@eu.citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 27e14d346ed6ff1c3a3cfc479507e62d133e92a9
master date: 2016-11-22 13:52:09 +0100

8 years agox86/svm: fix injection of software interrupts
Andrew Cooper [Tue, 22 Nov 2016 13:23:29 +0000 (14:23 +0100)]
x86/svm: fix injection of software interrupts

The non-NextRip logic in c/s 36ebf14eb "x86/emulate: support for emulating
software event injection" was based on an older version of the AMD software
manual.  The manual was later corrected, following findings from that series.

I took the original wording of "not supported without NextRIP" to mean that
X86_EVENTTYPE_SW_INTERRUPT was not eligible for use.  It turns out that this
is not the case, and the new wording is clearer on the matter.

Despite testing the original patch series on non-NRip hardware, the
swint-emulation XTF test case focuses on the debug vectors; it never ended up
executing an `int $n` instruction for a vector which wasn't also an exception.

During a vmentry, the use of X86_EVENTTYPE_HW_EXCEPTION comes with a vector
check to ensure that it is only used with exception vectors.  Xen's use of
X86_EVENTTYPE_HW_EXCEPTION for `int $n` injection has always been buggy on AMD
hardware.

Fix this by always using X86_EVENTTYPE_SW_INTERRUPT.

Print and decode the eventinj information in svm_vmcb_dump(), as it has
several invalid combinations which cause vmentry failures.

This is CVE-2016-9378 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 920edccd41db6cb0145545afa1850edf5e7d098e
master date: 2016-11-22 13:51:16 +0100

8 years agox86/emul: correct the IDT entry calculation in inject_swint()
Andrew Cooper [Tue, 22 Nov 2016 13:23:05 +0000 (14:23 +0100)]
x86/emul: correct the IDT entry calculation in inject_swint()

The logic, as introduced in c/s 36ebf14ebe "x86/emulate: support for emulating
software event injection" is buggy.  The size of an IDT entry depends on long
mode being active, not the width of the code segment currently in use.

In particular, this means that a compatibility code segment which hits
emulation for software event injection will end up using an incorrect offset
in the IDT for DPL/Presence checking.  In practice, this only occurs on old
AMD hardware lacking NRip support; all newer AMD hardware, and all Intel
hardware bypass this path in the emulator.

While here, fix a minor issue with reading the IDT entry.  The return value
from ops->read() wasn't checked, but in reality the only failure case is if a
pagefault occurs.  This is not a realistic problem as the kernel will almost
certainly crash with a double fault if this setup actually occured.

This is CVE-2016-9377 / part of XSA-196.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 255e8fe95f22ded5186fd75244ffcfb9d5dbc855
master date: 2016-11-22 13:50:49 +0100

8 years agox86emul: fix huge bit offset handling
Jan Beulich [Tue, 22 Nov 2016 13:22:37 +0000 (14:22 +0100)]
x86emul: fix huge bit offset handling

We must never chop off the high 32 bits.

This is CVE-2016-9383 / XSA-195.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1c6c2d60d205f71ede0fbbd9047e459112f576db
master date: 2016-11-22 13:49:06 +0100

8 years agox86/PV: writes of %fs and %gs base MSRs require canonical addresses
Jan Beulich [Tue, 22 Nov 2016 13:22:09 +0000 (14:22 +0100)]
x86/PV: writes of %fs and %gs base MSRs require canonical addresses

Commit c42494acb2 ("x86: fix FS/GS base handling when using the
fsgsbase feature") replaced the use of wrmsr_safe() on these paths
without recognizing that wr{f,g}sbase() use just wrmsrl() and that the
WR{F,G}SBASE instructions also raise #GP for non-canonical input.

Similarly arch_set_info_guest() needs to prevent non-canonical
addresses from getting stored into state later to be loaded by context
switch code. For consistency also check stack pointers and LDT base.
DR0..3, otoh, already get properly checked in set_debugreg() (albeit
we discard the error there).

The SHADOW_GS_BASE check isn't strictly necessary, but I think we
better avoid trying the WRMSR if we know it's going to fail.

This is CVE-2016-9385 / XSA-193.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f3fa3abf3e61fb1f25ce721e14ac324dda67311f
master date: 2016-11-22 13:46:28 +0100