Andrew Cooper [Fri, 2 Dec 2016 15:00:41 +0000 (15:00 +0000)]
xen/x86: Correct mandatory and SMP barrier definitions
Barriers are a complicated topic, a source of confusion, and their incorrect
use is a common cause of bugs. It really doesn't help when Xen's API is the
same as Linux's, but its ABI is different.
Bring the two back in line, so programmers stand a chance of actually getting
their usage correct.
Drop the links in the comment, both of which are now stale. Instead, refer to
the vendor system manuals in a generic way.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 2 Dec 2016 15:00:41 +0000 (15:00 +0000)]
xen/x86: Drop unnecessary barriers
x86's current implementation of wmb() is a compiler barrier. As a result, the
only change in this patch is to remove an mfence instruction from
cpuidle_disable_deep_cstate().
None of these barriers serve any purpose. They are not synchronising with
remote cpus, and their compiler-barrier properties are not needed for
correctness purposes.
Furthermore, these wmb()s specifically must not turn into sfence
instructions in future changes which correct wmb()'s implementation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Various minor optimizations in rb_erase():
- Avoid multiple loading of node->__rb_parent_color when computing parent
and color information (possibly not in close sequence, as there might
be further branches in the algorithm)
- In the 1-child subcase of case 1, copy the __rb_parent_color field from
the erased node to the child instead of recomputing it from the desired
parent and color
- When searching for the erased node's successor, differentiate between
cases 2 and 3 based on whether any left links were followed. This avoids
a condition later down.
- In case 3, keep a pointer to the erased node's right child so we don't
have to refetch it later to adjust its parent.
- In the no-children subcase of cases 2 and 3, place the rebalance assignment
last so that the compiler can remove the following if(rebalance) test.
Also, added some comments to illustrate cases 2 and 3.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 4f035ad67f4633c233cb3642711d49b4efc9c82d]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: handle 1-child recoloring in rb_erase() instead of rb_erase_color()
An interesting observation for rb_erase() is that when a node has
exactly one child, the node must be black and the child must be red.
An interesting consequence is that removing such a node can be done by
simply replacing it with its child and making the child black,
which we can do efficiently in rb_erase(). __rb_erase_color() then
only needs to handle the no-children case and can be modified accordingly.
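For illustration (a simplified sketch, not the actual rbtree.c code; the
helper name is made up and the struct only mirrors the rbtree layout):

    struct rb_node {
        unsigned long __rb_parent_color;   /* parent pointer | colour bit */
        struct rb_node *rb_right;
        struct rb_node *rb_left;
    };

    /*
     * Erase a node known to have exactly one child: the node must be black
     * and the child red, so the child can take the node's place, inheriting
     * the node's parent pointer and (black) colour.  No rebalancing needed.
     */
    static void erase_single_child(struct rb_node **link, struct rb_node *node,
                                   struct rb_node *child)
    {
        child->__rb_parent_color = node->__rb_parent_color;
        *link = child;   /* the parent's (or root's) pointer to 'node' */
    }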
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 46b6135a7402ac23c5b25f2bd79b03bab8f98278]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
When looking to fetch a node's sibling, we went through a sequence of:
- check if node is the parent's left child
- if it is, then fetch the parent's right child
This can be replaced with:
- fetch the parent's right child as an assumed sibling
- check that node is NOT the fetched child
This avoids fetching the parent's left child when node is actually
that child. Saves a bit on code size, though it doesn't seem to make
a large difference in speed.
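In code, the change looks roughly like this (illustrative fragment only):

    /* Before: test which child 'node' is, then fetch the other one. */
    sibling = (node == parent->rb_left) ? parent->rb_right : parent->rb_left;

    /* After: assume the right child, and only re-fetch if we guessed wrong. */
    sibling = parent->rb_right;
    if (node == sibling)            /* 'node' was the right child after all */
        sibling = parent->rb_left;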
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 59633abf34e2f44b8e772a2c12a92132aa7c2220]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 3 Jan 2018 10:05:05 +0000 (11:05 +0100)]
simplify xenmem_add_to_physmap_batch()
There's no need for
- advancing the handles and at the same time using
__copy_{from,to}_guest_offset(),
- an "out" label,
- local variables "done" and (function scope) "rc".
To better reflect its resulting use also rename the function's "start"
parameter to "extent".
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 3 Jan 2018 10:04:26 +0000 (11:04 +0100)]
x86/E820: improve insn selection
..., largely to shrink code size a little:
- use TEST instead of CMP with zero immediate
- use MOVZWL instead of AND with 0xffff immediate
- compute final highmem_kb value in registers, accessing memory just
once
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 3 Jan 2018 10:03:56 +0000 (11:03 +0100)]
x86/E820: don't overrun array
The bounds check needs to be done after the increment, not before, or
else it needs to use a one lower immediate. Also use word operations
rather than byte ones for both the increment and the compare (allowing
E820_BIOS_MAX to be more easily bumped, should the need ever arise).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 3 Jan 2018 10:02:10 +0000 (11:02 +0100)]
x86/Intel: drop another 32-bit leftover
None of the models for which MISC_ENABLE MSR access is excluded support
64-bit mode - drop the conditional from early_init_intel(). Also convert a
pointless rdmsr_safe() elsewhere to rdmsrl().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Wed, 5 Oct 2016 11:42:15 +0000 (12:42 +0100)]
xen/x86: Replace appropriate mandatory barriers with SMP barriers
There is no functional change. Xen currently assigns smp_* meaning to
the non-smp_* barriers.
All of these uses are just to deal with shared memory between multiple
processors, which means that the smp_*() variants are the correct ones to use.
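A typical CPU-to-CPU pattern covered by this change might look as follows
(illustrative sketch only, assuming an arbitrary shared structure); the
mandatory mb()/rmb()/wmb() forms remain for ordering against device/MMIO
accesses:

    struct ring {
        int data;
        volatile int ready;
    };

    /* Producer, running on one CPU. */
    static void publish(struct ring *r, int value)
    {
        r->data = value;
        smp_wmb();              /* make the data visible before the flag */
        r->ready = 1;
    }

    /* Consumer, running on another CPU. */
    static int consume(struct ring *r)
    {
        while ( !r->ready )
            cpu_relax();
        smp_rmb();              /* order the flag read before the data read */
        return r->data;
    }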
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Dario Faggioli <dario.faggioli@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
rbtree: low level optimizations in __rb_erase_color()
In __rb_erase_color(), we often already have pointers to the nodes being
rotated and/or know what their colors must be, so we can generate more
efficient code than the generic __rb_rotate_left() and __rb_rotate_right()
functions.
Also when the current node is red or when flipping the sibling's color,
the parent is already known so we can use the more efficient
rb_set_parent_color() function to set the desired color.
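For reference, the helper mentioned above is essentially (as in the Linux
rbtree code this is ported from):

    /* Write parent pointer and colour in one store when both are known. */
    static inline void rb_set_parent_color(struct rb_node *rb,
                                           struct rb_node *p, int color)
    {
        rb->__rb_parent_color = (unsigned long)p | color;
    }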
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 6280d2356fd8ad0936a63c10dc1e6accf48d0c61]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: optimize case selection logic in __rb_erase_color()
In __rb_erase_color(), we have to select one of 3 cases depending on the
colors of the 'other' node's children. If both children are black, we flip a
few node colors and iterate. Otherwise, we do either one or two tree
rotations, depending on the color of the 'other' child opposite to 'node',
and then we are done.
The corresponding logic had duplicate checks for the color of the 'other'
child opposite to 'node'. It was checking it first to determine if both
children are black, and then to determine how many tree rotations are
required. Rearrange the logic to avoid that extra check.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit e125d1471a4f8f1bf7ea9a83deb8d23cb40bd712]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: adjust node color in __rb_erase_color() only when necessary
In __rb_erase_color(), we were always setting a node to black after
exiting the main loop. And in one case, after fixing up the tree to
satisfy all rbtree invariants, we were setting the current node to root
just to guarantee a loop exit, at which point the root would be set to
black. However this is not necessary, as the root of an rbtree is already
known to be black. The only case where the color flip is required is when
we exit the loop due to the current node being red, and it's easiest to
just do the flip at that point instead of doing it after the loop.
[adrian.hunter@intel.com: perf tools: fix build for another rbtree.c change] Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit d6ff1273928ebf15466a85b7e1810cd00e72998b]
Ported only rbtree.c to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: low level optimizations in rb_insert_color()
- Use the newly introduced rb_set_parent_color() function to flip the color
of nodes whose parent is already known.
- Optimize rb_parent() when the node is known to be red - there is no need
to mask out the color in that case.
- Flipping gparent's color to red requires us to fetch its rb_parent_color
field, so we can reuse it as the parent value for the next loop iteration.
- Do not use __rb_rotate_left() and __rb_rotate_right() to handle tree
rotations: we already have pointers to all relevant nodes, and know their
colors (either because we want to adjust it, or because we've tested it,
or we can deduce it as black due to the node proximity to a known red node).
So we can generate more efficient code by making use of the node pointers
we already have, and setting both the parent and color attributes for
nodes all at once. Also in Case 2, some node attributes don't have to
be set because we know another tree rotation (Case 3) will always follow
and override them.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 5bc9188aa207dafd47eab57df7c4fe5b3d3f636a]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: adjust root color in rb_insert_color() only when necessary
The root node of an rbtree must always be black. However,
rb_insert_color() only needs to maintain this invariant when it has been
broken - that is, when it exits the loop due to the current (red) node
being the root. In all other cases (exiting after tree rotations, or
exiting due to an existing black parent) the invariant is already
satisfied, so there is no need to adjust the root node color.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 6d58452dc066db61acdff7b84671db1b11a3de1c]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: break out of rb_insert_color loop after tree rotation
It is a well known property of rbtrees that insertion never requires more
than two tree rotations. In our implementation, after one loop iteration
identified one or two necessary tree rotations, we would iterate and look
for more. However at that point the node's parent would always be black,
which would cause us to exit the loop.
We can make the code flow more obvious by just adding a break statement
after the tree rotations, where we know we are done. Additionally, in the
cases where two tree rotations are necessary, we don't have to update the
'node' pointer as it wouldn't be used until the next loop iteration, which
we now avoid due to this break statement.
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 1f0528653e41ec230c60f5738820e8a544731399]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
rbtree: move some implementation details from rbtree.h to rbtree.c
rbtree users must use the documented APIs to manipulate the tree
structure. Low-level helpers to manipulate node colors and parenthood are
not part of that API, so move them to lib/rbtree.c
Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit bf7ad8eeab995710c766df49c9c69a8592ca0216]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Empty nodes have no color. We can make use of this property to simplify
the code emitted by the RB_EMPTY_NODE and RB_CLEAR_NODE macros. Also,
we can get rid of the rb_init_node function which had been introduced by
commit 88d19cf37952 ("timers: Add rb_init_node() to allow for stack
allocated rb nodes") to avoid some issue with the empty node's color not
being initialized.
I'm not sure what the RB_EMPTY_NODE checks in rb_prev() / rb_next() are
doing there, though. axboe introduced them in commit 10fd48f2376d
("rbtree: fixed reversed RB_EMPTY_NODE and rb_next/prev"). The way I
see it, the 'empty node' abstraction is only used by rbtree users to
flag nodes that they haven't inserted in any rbtree, so asking the
predecessor or successor of such nodes doesn't make any sense.
One final rb_init_node() caller was recently added in sysctl code to
implement faster sysctl name lookups. This code doesn't make use of
RB_EMPTY_NODE at all, and from what I could see it only called
rb_init_node() under the mistaken assumption that such initialization was
required before node insertion.
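The resulting definitions are essentially (following the Linux commit being
ported): an "empty" node simply points to itself, so no colour has to be
stored or initialised.

    #define RB_EMPTY_NODE(node)  \
        ((node)->__rb_parent_color == (unsigned long)(node))
    #define RB_CLEAR_NODE(node)  \
        ((node)->__rb_parent_color = (unsigned long)(node))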
[sfr@canb.auug.org.au: fix net/ceph/osd_client.c build] Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 4c199a93a2d36b277a9fd209a0f2793f8460a215]
Ported rbtree.h and rbtree.c changes which are relevant to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Wolfram Strepp [Wed, 20 Dec 2017 17:00:49 +0000 (18:00 +0100)]
rbtree: remove redundant if()-condition in rb_erase()
Furthermore, notice that the initial checks:
if (!node->rb_left)
child = node->rb_right;
else if (!node->rb_right)
child = node->rb_left;
else
{
...
}
guarantee that old->rb_right is set in the final else branch, therefore
we can omit checking that again.
Signed-off-by: Wolfram Strepp <wstrepp@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 4b324126e0c6c3a5080ca3ec0981e8766ed6f1ee]
Ported to Xen.
Signed-off-by: Praveen Kumar <kpraveen.lkml@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 20 Dec 2017 11:52:15 +0000 (11:52 +0000)]
x86/nops: Switch to the P6 nops as a compile-time default
Along with c/s d7128e735031 switching the runtime choice of best nops, switch
the compile-time default to P6 nops. This is more efficient on most
processors for alternative points which add/remove code, rather than
switching between two different pieces of code.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 20 Dec 2017 09:12:11 +0000 (10:12 +0100)]
x86/shadow: make 1-bit-disable match 1-bit-enable
shadow_one_bit_enable() sets PG_SH_enable (if not already set of course)
in addition to the bit being requested. Make shadow_one_bit_disable()
behave similarly - clear PG_SH_enable if that's the only bit remaining.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Wed, 20 Dec 2017 09:05:16 +0000 (10:05 +0100)]
x86/shadow: ignore sh_pin() failure in one more case
Following what we've already done in the XSA-250 fix, convert another
sh_pin() caller to no longer fail the higher level operation if pinning
fails, as pinning is a performance optimization only in those places.
Suggested-by: Tim Deegan <tim@xen.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Wed, 20 Dec 2017 09:04:16 +0000 (10:04 +0100)]
x86/shadow: drop further 32-bit relics
PV guests don't ever get shadowed in other than 4-level mode anymore;
commit 5a3ce8f85e ("x86/shadow: drop stray name tags from
sh_{guest_get,map}_eff_l1e()") didn't go quite far enough (and there's
a good chance that further cleanup opportunity exists, which I simply
didn't notice).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Jan Beulich [Wed, 20 Dec 2017 09:03:20 +0000 (10:03 +0100)]
x86: introduce NOP9 forms
Both Intel and AMD recommend an operand-size-override-prefixed long NOP
form for covering 9 bytes, so introduce this and use it in p6_nops[] to
allow further reducing the number of NOPs needed when covering larger
ranges.
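For illustration, the 9-byte form is the 8-byte long NOP with an
operand-size override prefix prepended (the array name below is made up):

    /* 66 0f 1f 84 00 00 00 00 00: nopw 0x0(%rax,%rax,1) */
    static const unsigned char nop9[] = {
        0x66, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00
    };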
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Tue, 19 Dec 2017 14:17:52 +0000 (14:17 +0000)]
libxl/pvh: force PVH guests to use the xenstore shutdown
PVH guests are all required to support the xenstore-based shutdown
signalling, since there is no other way for a PVH guest to be
requested to shut down.
For HVM guests we check whether the guest has installed a PV-on-HVM
interrupt callback; that does not make sense for PVH guests.
So for PVH guests, take the PV path: assume that all PVH guests have
suitable xenstore drivers.
Jan Beulich [Fri, 15 Dec 2017 10:17:19 +0000 (11:17 +0100)]
domctl: improve locking during domain destruction
There is no need to hold the global domctl lock across domain_kill() -
the domain lock is fully sufficient here, and parallel cleanup after
multiple domains performs quite a bit better this way.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 15 Dec 2017 10:14:31 +0000 (11:14 +0100)]
x86: remove _PAGE_PSE check from get_page_from_l2e()
With L2_DISALLOW_MASK containing _PAGE_PSE unconditionally as of commit 56fff3e5e9 ("x86: nuke PV superpage option and code") there's no point
anymore in separately checking for the bit.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 15 Dec 2017 10:11:36 +0000 (11:11 +0100)]
x86/HVM: fix hvmemul_rep_outs_set_context()
There were two issues with this function: Its use of
hvmemul_do_pio_buffer() was wrong (the function deals only with
individual port accesses, not repeated ones, i.e. passing it
"*reps * bytes_per_rep" does not have the intended effect). And it
could have processed a larger set of operations in one go than was
probably intended (limited just by the size that xmalloc() can hand
back).
By converting to proper use of hvmemul_do_pio_buffer(), no intermediate
buffer is needed at all. As a result a preemption check is being added.
Also drop unused parameters from the function.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Yi Sun [Fri, 20 Oct 2017 08:50:00 +0000 (10:50 +0200)]
x86: implement data structure and CPU init flow for MBA
This patch implements the main data structures of MBA.
Like the CAT features, the MBA HW info has cos_max, which is the maximum
thrtl register number, and thrtl_max, which is the maximum throttle value
(delay value). It also has a flag to indicate whether the throttle value
is linear or non-linear.
One thrtl register of MBA stores a throttle value for one or more
domains. The throttle value is the delay applied to traffic between the
L2 cache and the next cache level.
This patch also implements the init flow for MBA and registers stub
callback functions.
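A hypothetical sketch of the per-feature HW info described above (struct
and field names are invented for illustration, not Xen's actual
declarations):

    struct mba_hw_info {
        unsigned int cos_max;      /* highest COS / thrtl register index */
        unsigned int thrtl_max;    /* maximum throttle (delay) value */
        bool linear;               /* linear vs non-linear throttle value */
    };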
Yi Sun [Fri, 20 Oct 2017 08:50:00 +0000 (10:50 +0200)]
x86: a few optimizations to psr codes
This patch refines the PSR code:
1. Change type of 'cat_init_feature' to 'bool' to remove the pointless
returning of error code.
2. Move printk in 'cat_init_feature' to reduce a return path.
3. Define a local variable 'feat_mask' in 'psr_cpu_init' to reduce calling of
'cpuid_count_leaf()'.
4. Change 'PSR_INFO_IDX_CAT_FLAG' to 'PSR_INFO_IDX_CAT_FLAGS'.
Yi Sun [Tue, 24 Oct 2017 09:33:00 +0000 (11:33 +0200)]
Rename PSR sysctl/domctl interfaces and xsm policy to make them be general
This patch renames PSR sysctl/domctl interfaces and related xsm policy to
make them general for all resource allocation features, not only
for CAT. Then we can reuse the interfaces for all allocation features.
Basically, it changes 'psr_cat_op' to 'psr_alloc', and removes 'CAT_' from some
macros. E.g.:
1. psr_cat_op -> psr_alloc
2. XEN_DOMCTL_psr_cat_op -> XEN_DOMCTL_psr_alloc
3. XEN_SYSCTL_psr_cat_op -> XEN_SYSCTL_psr_alloc
4. XEN_DOMCTL_PSR_CAT_SET_L3_CBM -> XEN_DOMCTL_PSR_SET_L3_CBM
5. XEN_SYSCTL_PSR_CAT_get_l3_info -> XEN_SYSCTL_PSR_get_l3_info
This patch creates an MBA feature document in doc/features/. It describes
the key points of implementing MBA, which is described in detail in the
Intel SDM section "Introduction to Memory Bandwidth Allocation".
Andrew Cooper [Wed, 6 Dec 2017 17:46:20 +0000 (17:46 +0000)]
x86/vmx: Don't use hvm_inject_hw_exception() in long_mode_do_msr_write()
Since c/s 49de10f3c1718 "x86/hvm: Don't raise #GP behind the emulators back
for MSR accesses", returning X86EMUL_EXCEPTION has pushed the exception
generation to the top of the call tree.
Using hvm_inject_hw_exception() and returning X86EMUL_EXCEPTION causes a
double #GP injection, which combines to #DF.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 13 Dec 2017 16:55:38 +0000 (16:55 +0000)]
xen/efi: Fix build with clang-5.0
The clang-5.0 build is reliably failing with:
Error: size of boot.o:.text is 0x01
which is because efi_arch_flush_dcache_area() exists as a single ret
instruction. Mark it as __init like everything else in the files.
Spotted by Travis.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Jan Beulich <jbeulich@suse.com>
Tom Lendacky [Thu, 30 Nov 2017 22:46:40 +0000 (16:46 -0600)]
x86/microcode: Add support for fam17h microcode loading
The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes. Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Linux commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf]
Ported to Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 12 Dec 2017 19:02:12 +0000 (19:02 +0000)]
xen/arm: traps: Merge do_trap_instr_abort_guest and do_trap_data_abort_guest
The two helpers do_trap_instr_abort_guest and do_trap_data_abort_guest
are used to handle stage-2 aborts. While the former only handles prefetch
aborts and the latter data aborts, they are very similar and do not
warrant separate helpers.
For instance, merging the two will make stage-2 abort handling easier to
maintain. So consolidate the two helpers into a new helper
do_trap_stage2_abort.
Julien Grall [Tue, 12 Dec 2017 19:02:11 +0000 (19:02 +0000)]
xen/arm: traps: Move the definition of mmio_info_t in try_handle_mmio
mmio_info_t is currently filled by do_trap_data_abort_guest but is only
relevant when emulating an MMIO region.
A follow-up patch will merge stage-2 prefetch abort and stage-2 data abort
in a single helper. To prepare that, mmio_info_t is now filled by
try_handle_mmio.
Julien Grall [Tue, 12 Dec 2017 19:02:10 +0000 (19:02 +0000)]
xen/arm: traps: Remove the field gva from mmio_info_t
mmio_info_t is used to gather information in order to emulate a region.
The guest virtual address is unlikely to be useful information and is not
currently used. So remove the field gva from mmio_info_t and replace it
with a local variable.
Julien Grall [Tue, 12 Dec 2017 19:02:07 +0000 (19:02 +0000)]
xen/arm: p2m: Rename p2m_flush_tlb and p2m_flush_tlb_sync
Rename p2m_flush_tlb and p2m_flush_tlb_sync to respectively
p2m_tlb_flush and p2m_force_tlb_flush_sync.
At first glance, inverting 'flush' and 'tlb' might seem pointless, but it
will be helpful in the future to more easily port code from the x86 P2M,
or even to share code with it.
For p2m_flush_tlb_sync, the 'force' was added because the TLBs are
flushed unconditionally. A follow-up patch will add a helper to flush
TLBs only in certain cases.
Julien Grall [Tue, 12 Dec 2017 19:02:06 +0000 (19:02 +0000)]
xen/arm: domain_build: Use copy_to_guest_phys_flush_dcache in dtb_load
The function dtb_load is dealing with IPA but uses gvirt_to_maddr to do
the translation. This is currently working fine because the stage-1 MMU
is disabled.
Rather than relying on such an assumption, use the new
copy_to_guest_phys_flush_dcache. This also results in slightly more
comprehensible code.
Julien Grall [Tue, 12 Dec 2017 19:02:05 +0000 (19:02 +0000)]
xen/arm: domain_build: Rework initrd_load to use the generic copy helper
The function initrd_load is dealing with IPA but uses gvirt_to_maddr to
do the translation. This is currently working fine because the stage-1 MMU
is disabled.
Furthermore, the function implements its own copy to guest, resulting
in code duplication and making it more difficult to update the page-table
logic (such as support for Populate On Demand).
The new copy_to_guest_phys_flush_dcache could be used here by temporarily
mapping the full initrd in the virtual space.
Julien Grall [Tue, 12 Dec 2017 19:02:04 +0000 (19:02 +0000)]
xen/arm: kernel: Rework kernel_zimage_load to use the generic copy helper
The function kernel_zimage_load is dealing with IPA but uses gvirt_to_maddr
to do the translation. This is currently working fine because the stage-1
MMU is disabled.
Furthermore, the function implements its own copy to guest, resulting
in code duplication and making it more difficult to update the page-table
logic (such as support for Populate On Demand).
The new copy_to_guest_phys_flush_dcache could be used here by
temporarily mapping the full kernel in the virtual space.
Julien Grall [Tue, 12 Dec 2017 19:02:02 +0000 (19:02 +0000)]
xen/arm: Extend copy_to_guest to support copying from/to guest physical address
The only differences between copy_to_guest and access_guest_memory_by_ipa are:
- The latter does not support copying data crossing page boundary
- The former is copying from/to guest VA whilst the latter from
guest PA
copy_to_guest can easily be extended to support copying from/to guest
physical address. For that, a new bit is used to tell whether a linear
address or an IPA is being used.
Lastly, access_guest_memory_by_ipa is reimplemented using copy_to_guest.
This also has the benefit of extending its use: it is now possible
to copy data crossing a page boundary.
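A hypothetical sketch of the resulting interface (flag and parameter names
are invented for illustration; see guestcopy.c for the real ones):

    #define COPY_to_guest  (1u << 0)   /* otherwise: copy from the guest */
    #define COPY_ipa       (1u << 1)   /* address is a guest physical address */

    /* One generic routine covering every direction/address-type combination. */
    unsigned long copy_guest(void *buf, uint64_t addr, unsigned int len,
                             struct vcpu *v, unsigned int flags);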
Julien Grall [Tue, 12 Dec 2017 19:02:01 +0000 (19:02 +0000)]
xen/arm: guest_copy: Extend the prototype to pass the vCPU
Currently, guest_copy assumes the copy will only be done for the current
vCPU. copy_guest is meant to be vCPU agnostic, so extend the prototype
to pass the vCPU.
At the same time, encapsulate the vCPU in a union to allow extension
for copying from a guest domain (the IPA case) in the future.
Julien Grall [Tue, 12 Dec 2017 19:02:00 +0000 (19:02 +0000)]
xen/arm: Extend copy_to_guest to support zeroing guest VA and use it
The function copy_to_guest can easily be extended to support zeroing
guest VA. To avoid using a new bit, a NULL buffer (i.e. buf == NULL) is
taken to mean that the guest memory will be zeroed.
Lastly, reimplement raw_clear_guest using copy_to_guest.
Julien Grall [Tue, 12 Dec 2017 19:01:59 +0000 (19:01 +0000)]
xen/arm: Extend copy_to_guest to support copying from guest VA and use it
The only differences between copy_to_guest (formerly called
raw_copy_to_guest_helper) and raw_copy_from_guest are:
- The direction of the memcpy
- The permissions used for translating the address
Extend copy_to_guest to support copying from guest VA by using a bit in
the flags to tell the direction of the copy.
Lastly, reimplement raw_copy_from_guest using copy_to_guest.
Julien Grall [Tue, 12 Dec 2017 19:01:58 +0000 (19:01 +0000)]
xen/arm: raw_copy_to_guest_helper: Rework the prototype and rename it
All the helpers within arch/arm/guestcopy.c are doing the same things:
copy data from/to the guest.
At the moment, the logic is duplicated in each helper, making it more
difficult to implement new variants.
The first step for the consolidation is to get a common prototype and a
base. For convenience (it is at the beginning of the file!),
raw_copy_to_guest_helper is chosen.
The function is now renamed copy_guest to show it will be a
generic function to copy data from/to the guest. Note that for now, only
copying to guest virtual address is supported. Follow-up patches will
extend the support.
Release builds work fine, which is a first indication that the assertion
isn't really needed.
What's worse though - there appears to be a timing window where the
guest runs in shadow mode, but not in log-dirty mode, and that is what
triggers the assertion (the same could, afaict, be achieved by test-
enabling shadow mode on a PV guest). This is because turning off log-
dirty mode is being performed in two steps: First the log-dirty bit gets
cleared (paging_log_dirty_disable() [having paused the domain] ->
sh_disable_log_dirty() -> shadow_one_bit_disable()), followed by
unpausing the domain and only then clearing shadow mode (via
shadow_test_disable(), which pauses the domain a second time).
Hence besides removing the ASSERT() here (or optionally replacing it by
explicit translate and refcounts mode checks, but this seems rather
pointless now that the three are tied together) I wonder whether either
shadow_one_bit_disable() should turn off shadow mode if no other bit
besides PG_SH_enable remains set (just like shadow_one_bit_enable()
enables it if not already set), or the domain pausing scope should be
extended so that both steps occur without the domain getting a chance to
run in between.
Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 12 Dec 2017 13:31:55 +0000 (14:31 +0100)]
x86emul: build SIMD tests with -Os
Specifically in the context of putting together subsequent patches I've
noticed that together with the touch() macro using -Os further
increases the chances of the compiler using memory operands for the
instructions we actually care to test.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Daniel Kiper [Tue, 12 Dec 2017 13:30:53 +0000 (14:30 +0100)]
x86/mb2: avoid Xen image when looking for module/crashkernel position
Commit e22e1c4 (x86/EFI: avoid Xen image when looking for module/kexec
position) added the relevant check for the EFI case. However, since commit
f75a304 (x86: add multiboot2 protocol support for relocatable images)
Multiboot2-compatible bootloaders are able to relocate the Xen image too.
So we have to avoid the Xen image region in such cases as well.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 12 Dec 2017 13:29:45 +0000 (14:29 +0100)]
x86/shadow: fix ref-counting error handling
The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.
In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.
This is XSA-250.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Tue, 12 Dec 2017 13:29:13 +0000 (14:29 +0100)]
x86/shadow: fix refcount overflow check
Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.
Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
This is XSA-249.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Tue, 12 Dec 2017 13:28:36 +0000 (14:28 +0100)]
x86/mm: don't wrongly set page ownership
PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally. At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.
While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.
There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.
This is XSA-248.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Tue, 12 Dec 2017 13:27:34 +0000 (14:27 +0100)]
x86: don't wrongly trigger linear page table assertion (2)
_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the type which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Julien Grall [Wed, 1 Nov 2017 14:03:14 +0000 (14:03 +0000)]
xen/arm32: mm: Rework is_xen_heap_page to avoid nameclash
The arm32 version of the function is_xen_heap_page currently defines a
variable _mfn. This will lead to a compiler error when using the typesafe
MFN in a follow-up patch:
called object '_mfn' is not a function or function pointer
Fix it by renaming the local variable _mfn to mfn_.
Andre Przywara [Thu, 7 Dec 2017 16:14:08 +0000 (16:14 +0000)]
ARM: VGIC: move gic_remove_irq_from_queues()
gic_remove_irq_from_queues() was not only misnamed, it also had the wrong
abstraction, as it should not live in gic.c.
Move it into vgic.c and vgic.h, where it belongs, and rename it on
the way.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Julien Grall [Wed, 6 Dec 2017 14:51:37 +0000 (14:51 +0000)]
xen/arm: gic-v3: Bail out if gicv3_cpu_init fail
When system registers are not enabled, all accesses to them will trap
to EL2. In Xen, system registers will be enabled by gicv3_cpu_init only
on success. As the rest of the code (e.g. gicv3_hyp_init) relies on
system registers, it is better to bail out directly.
This will save time on debugging early boot issue on GICv3 platform.
Andre Przywara [Thu, 19 Oct 2017 12:48:37 +0000 (13:48 +0100)]
ARM: vGIC: fix nr_irq definition
The global variable "nr_irqs" is used for x86 and some common Xen code.
To make the latter work easily for ARM, it was #defined to NR_IRQS.
This not only violated the common habit of capitalizing macros, but
also caused issues if one wanted to use a rather innocent "nr_irqs" as
a local variable name or as a function parameter.
Drop the optimization and make nr_irqs a normal variable for ARM also.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Andre Przywara [Thu, 19 Oct 2017 12:48:36 +0000 (13:48 +0100)]
ARM: remove unneeded gic.h inclusions
gic.h is supposed to hold defines and prototypes for the hardware side
of the GIC interrupt controller. A lot of parts in Xen should not be
bothered with that, as they either only care about the VGIC or use
more generic interfaces.
Remove unneeded inclusions of gic.h from files where they are actually
not needed.
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Julien Grall [Thu, 7 Dec 2017 17:18:46 +0000 (17:18 +0000)]
xen/arm64: head.S: Introduce macro to load the physical address of a symbol
A lot of places in the ARM64 assembly code require loading the
physical address of a symbol. Rather than open-coding the translation,
introduce a new macro that will load the physical address of a symbol.
Lastly, use this new macro to replace all the current open-coded versions.
Note that most of the comments associated with the changed code have been
removed because the code is now self-explanatory.
Jan Beulich [Thu, 7 Dec 2017 10:09:31 +0000 (11:09 +0100)]
mm: don't use domain_shutdown() when re-offlining a page
It goes all silent, leaving open what has actually caused the crash.
Use domain_crash() instead, which leaves a log message before calling
domain_shutdown(..., SHUTDOWN_crash).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 6 Dec 2017 11:50:23 +0000 (12:50 +0100)]
x86/HVM: don't retain emulated insn cache when exiting back to guest
vio->mmio_retry is being set when a repeated string insn is being split
up. In that case we'll exit to the guest, expecting immediate re-entry.
Interruptions, however, may be serviced by the guest before re-entry
from the repeated string insn. Any emulation needed in the course of
handling the interruption must not fetch from the internally maintained
cache.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
Jan Beulich [Tue, 5 Dec 2017 16:23:53 +0000 (17:23 +0100)]
x86: don't ignore foreigndom on L2/L3/L4 page table updates
Silently assuming DOMID_SELF is unlikely to be a good idea for page
table updates. For PGT_writable pages, though, it seems better to allow
the writes, so the same check isn't being applied there.
Also add blank lines between the individual case blocks.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>