Andrew Cooper [Mon, 26 Jun 2017 11:58:25 +0000 (12:58 +0100)]
x86/mm: Fix infinite loop in get_spage_pages()
c/s 2b8eb37 switched int i to being unsigned, but the undo logic on failure
relied in i being signed. As i being unsigned in still preforable, adjust the
undo logic to work with an unsigned i.
Coverity-ID: 1413017 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Will <konrad.wilk@oracle.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Bhupinder Thakur [Thu, 22 Jun 2017 07:38:37 +0000 (13:08 +0530)]
xen/arm: Rename vgic_reg* functions definitions and calls to vreg_reg*
This patch renames the vgic_reg* access functions defined in vreg.h to vreg_reg*
and replaces all calls to vgic_reg* functions in vgic/its emulation code to vreg_reg*.
vreg_reg* are generic functions, which can be used to operate on 32/64-bit registers.
SBSA UART emulation code will also use vreg_reg* access functions for
accessing emulated pl011 registers.
Bhupinder Thakur [Thu, 22 Jun 2017 07:38:36 +0000 (13:08 +0530)]
xen/arm: vpl011: Move vgic register access functions to vreg.h
These functions are generic in nature and can be reused by other emulation
code in Xen. vGICv3 ITS and SBSA UART emulation code, would use these
functions to operate on their registers.
This patch moves the register access function definitions from vgic.h to
vreg.h.
Wei Liu [Mon, 5 Jun 2017 14:16:17 +0000 (15:16 +0100)]
x86: clean up PV emulation code
Replace bool_t with bool. Fix coding style issues. Add spaces around
binary ops. Use 1U for shifting. Eliminate TOGGLE_MODE.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 13 Jun 2017 20:36:58 +0000 (21:36 +0100)]
xen/livepatch: Don't crash on encountering STN_UNDEF relocations
A symndx of STN_UNDEF is special, and means a symbol value of 0. While
legitimate in the ELF standard, its existance in a livepatch is questionable
at best. Until a plausible usecase presents itself, reject such a relocation
with -EOPNOTSUPP.
Additionally, fix an off-by-one error while range checking symndx, and perform
a safety check on elf->sym[symndx].sym before derefencing it, to avoid
tripping over a NULL pointer when calculating val.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [x86 and arm32] Reviewed-by: Jan Beulich <JBeulich@suse.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Andrew Cooper [Thu, 22 Jun 2017 17:55:31 +0000 (18:55 +0100)]
xen/livepatch: Use zeroed memory allocations for arrays
Each of these arrays is sparse. Use zeroed allocations to cause uninitialised
array elements to contain deterministic values, most importantly for the
embedded pointers.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [x86 and arm32] Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Andrew Cooper [Tue, 13 Jun 2017 20:17:47 +0000 (21:17 +0100)]
xen/livepatch: Clean up arch relocation handling
* Reduce symbol scope and initalisation as much as possible
* Annotate a fallthrough case in arm64
* Fix switch statement style in arm32
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [x86 and arm32]
Andrew Cooper [Wed, 21 Jun 2017 11:36:18 +0000 (12:36 +0100)]
xen: Replace ASSERT(0) with ASSERT_UNREACHABLE()
No functional change, but the result is more informative both in the code and
error messages if the assertions do get hit.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <julien.gralL@arm.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andrew Cooper [Tue, 18 Apr 2017 14:41:16 +0000 (15:41 +0100)]
x86/mm: Misc nonfunctional cleanup
* Drop trailing whitespace
* Apply Xen comment and space style
* Switch bool_t to bool
* Drop TOGGLE_MODE() macro
* Replace erroneous mandatory barriers with smp barriers
* Switch boolean ints for real bools
No (intended) functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Fixed an issue where the maximum index allowed (31) goes beyond the
actual number of array elements (4) of ad->monitor.write_ctrlreg_mask.
Coverity-ID: 1412966
Signed-off-by: Razvan Cojocaru <rcojocaru@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 23 Jun 2017 13:49:41 +0000 (15:49 +0200)]
make steal_page() return a proper error value
... and use it where suitable (the tmem caller doesn't propagate an
error code). While it doesn't matter as much, also make donate_page()
follow suit on x86 (on ARM it already returns -ENOSYS).
Also move their declarations to common code and add __must_check.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Fri, 23 Jun 2017 13:48:28 +0000 (15:48 +0200)]
x86emul: simplify SHLD/SHRD handling
First of all there's no point considering the "shift == width" case,
when immediately before that check we mask "shift" by "width - 1". And
then truncate_word() use can be reduced too: dst.val, as obtained by
generic operand fetching code, is already suitably truncated, and its
use can also be made symmetric in the main conditional expression (on
only left shift results). Finally masking the result of a right shift
is not necessary when the left hand operand doesn't have more than
"width" significant bits.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 22 Jun 2017 07:58:07 +0000 (09:58 +0200)]
x86/mm: drop redundant domain parameter from get_page_from_gfn_p2m()
It can always be read from the passed p2m. Take the opportunity and
also rename the function, making the "p2m" suffix a prefix, to match
other p2m functions, and convert the "gfn" parameter's type.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 22 Jun 2017 07:55:08 +0000 (09:55 +0200)]
gnttab: limit mapkind()'s iteration count
There's no need for the function to observe increases of the maptrack
table (which can occur as the maptrack lock isn't being held) - actual
population of maptrack entries is excluded while we're here (by way of
holding the respective grant table lock for writing, while code
populating entries acquires it for reading). Latch the limit ahead of
the loop, allowing for the barrier to move out, too.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
George Dunlap [Thu, 22 Jun 2017 07:53:18 +0000 (09:53 +0200)]
gnttab: remove host map in the event of a grant_map failure
The current code appropriately removes the reference and type counts
on failure, but leaves the mapping set up. As the only path which can
trigger this is failure from IOMMU manipulation, and as unprivileged
domains are being crashed in that case, this is not by itself a
security issue.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 22 Jun 2017 07:51:29 +0000 (09:51 +0200)]
ARM: simplify page type handling
There's no need to have anything here on ARM other than the distinction
between writable and non-writable pages (and even that could likely be
eliminated, but with a more intrusive change). Limit type to a single
bit and drop pinned and validated flags altogether.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Dario Faggioli [Thu, 22 Jun 2017 07:45:37 +0000 (09:45 +0200)]
idle_loop: either deal with tasklets or go idle
In fact, there are two kinds of tasklets: vCPU and
softirq context. When we want to do vCPU context tasklet
work, we force the idle vCPU (of a particular pCPU) into
execution, and run it from there.
This means there are two possible reasons for choosing
to run the idle vCPU:
1) we want a pCPU to go idle,
2) we want to run some vCPU context tasklet work.
If we're in case 2), it does not make sense to even
try to go idle (as the check will _always_ fail).
This patch rearranges the code of the body of idle
vCPUs, so that we actually check whether we are in
case 1) or 2), and act accordingly.
As a matter of fact, this also means that we do not
check if there's any tasklet work to do after waking
up from idle. This is not a problem, because:
a) for softirq context tasklets, if any is queued
"during" wakeup from idle, TASKLET_SOFTIRQ is
raised, and the call to do_softirq() (which is still
happening *after* the wakeup) will take care of it;
b) for vCPU context tasklets, if any is queued "during"
wakeup from idle, SCHEDULE_SOFTIRQ is raised and
do_softirq() (happening after the wakeup) calls
the scheduler. The scheduler sees that there is
tasklet work pending and confirms the idle vCPU
in execution, which then will get to execute
do_tasklet().
Add a pointer to the gic device tree bindings. Add an explanation on how
to calculate irq numbers from device tree.
Add a brief explanation of the reg property and a pointer to the xl docs
for a description of the iomem property. Add a note that in the example
we are using different memory addresses for guests and host.
Juergen Gross [Thu, 15 Jun 2017 09:58:27 +0000 (11:58 +0200)]
tools/xen-detect: try sysfs node for obtaining guest type
Instead of relying on cpuid instruction behaviour to tell which domain
type we are just try asking the kernel via the appropriate sysfs node
(added in Linux kernel 4.13).
Keep the old detection logic as a fallback for older kernels.
Petre Pircalabu [Tue, 20 Jun 2017 15:13:20 +0000 (17:13 +0200)]
x86/monitor: add masking support for write_ctrlreg events
Add support for filtering out the write_ctrlreg monitor events if they
are generated only by changing certains bits.
A new parameter (bitmask) was added to the xc_monitor_write_ctrlreg
function in order to mask the event generation if the changed bits are
set.
Signed-off-by: Petre Pircalabu <ppircalabu@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Ross Lagerwall [Tue, 20 Jun 2017 15:13:02 +0000 (17:13 +0200)]
rombios/ata: wait for BSY to clear after write
After rombios transfers the data for a write, it checks the status and
fails if BSY is set. qemu-trad doesn't set BSY for PIO writes, but QEMU
upstream does, and this causes rombios to fail writes because they are
marked as BSY. Instead, wait for BSY to clear after a write.
INT 13 writes are probably rarely used these days, but they are used by
GRUB 2 to write to its environment file which happens by default on
Ubuntu.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ian Jackson [Mon, 19 Jun 2017 14:04:08 +0000 (15:04 +0100)]
xen/test/Makefile: Fix clean target, broken by pattern rule
In "xen/test/livepatch: Regularise Makefiles" we reworked
xen/test/Makefile to use a pattern rule. However, there are two
problems with this. Both are related to the way that xen/Rules.mk is
implicitly part of this Makefile because of the way that Makefiles
under xen/ are invoked by their parent directory Makefiles.
Firstly, the Rules.mk `clean' target overrides the pattern rule in
xen/test/Makefile. The result is that `make -C xen clean' does not
actually run the livepatch clean target.
The Rules.mk clean target does have provision for recursing into
subdirectories, but that feature is tangled up with complex object
file iteration machinery which is not desirable here. However, we can
extend the Rules.mk clean target since it is a double-colon rule.
Sadly this involves duplicating the SUBDIR iteration boilerplate. (A
make function could be used but the cure would be worse than the
disease.)
Secondly, Rules.mk has a number of -include directives. make likes to
try to (re)build files mentioned in includes. With the % pattern
rule, this applies to those files too.
As a result, make -C xen clean would try to build `.*.d' (for example)
in xen/test. This would fail with an error message. The error would
be ignored because of the `-', but it's annoying and ugly.
Solve this by limiting the % pattern rule to the targets we expect it
to handle. These are those listed in the top-level Makefile help
message, apart from: those which are subdir- or component-qualified;
clean targets (which are handled specially, even distclean); and dist,
src-tarball-*, etc. (which are converted to install by an earlier
Makefile).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Release-acked-by: Julien Grall <julien.grall@arm.com>
Jan Beulich [Tue, 20 Jun 2017 12:51:53 +0000 (14:51 +0200)]
memory: don't suppress P2M update in populate_physmap()
Commit d18627583d ("memory: don't hand MFN info to translated guests")
wrongly added a null-handle check there - just like stated in its
description for memory_exchange(), the array is also an input for
populate_physmap() (and hence can't reasonably be null). I have no idea
how I've managed to overlook this.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Wolfram Strepp [Tue, 20 Jun 2017 12:51:07 +0000 (14:51 +0200)]
rb_tree: make clear distinction between two different cases in rb_erase()
There are two cases when a node, having 2 childs, is erased:
'normal case': the successor is not the right-hand-child of the node to be
erased
'special case': the successor is the right-hand child of the node to be erased
Here some ascii-art, with following symbols (referring to the code):
O: node to be deleted
N: the successor of O
P: parent of N
C: child of N
L: some other node
normal case:
O N
/ \ / \
/ \ / \
L \ L \
/ \ P ----> / \ P
/ \ / \
/ /
N C
\ / \
\
C
/ \
special case:
O|P N
/ \ / \
/ \ / \
L \ L \
/ \ N ----> / C
\ / \
\
C
/ \
Notice that for the special case we don't have to reconnect C to N.
Signed-off-by: Wolfram Strepp <wstrepp@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 4c60117811171d867d4f27f17ea07d7419d45dae]
Wolfram Strepp [Tue, 20 Jun 2017 12:50:13 +0000 (14:50 +0200)]
rbtree: optimize rb_erase()
Tfour 4 redundant if-conditions in function __rb_erase_color() in
lib/rbtree.c are removed.
In pseudo-source-code, the structure of the code is as follows:
if ((!A || B) && (!C || D)) {
.
.
.
} else {
if (!C || D) {//if this is true, it implies: (A == true) && (B == false)
if (A) {//hence this always evaluates to 'true'...
.
}
.
//at this point, C always becomes true, because of:
__rb_rotate_right/left();
//and:
other = parent->rb_right/left;
}
.
.
if (C) {//...and this too !
.
}
}
Signed-off-by: Wolfram Strepp <wstrepp@gmx.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Linux commit 55a63998b8967615a15e2211ba0ff3a84a565824]
Artem Bityutskiy [Tue, 20 Jun 2017 12:49:40 +0000 (14:49 +0200)]
rbtree: add const qualifier to some functions
The 'rb_first()', 'rb_last()', 'rb_next()' and 'rb_prev()' calls
take a pointer to an RB node or RB root. They do not change the
pointed objects, so add a 'const' qualifier in order to make life
of the users of these functions easier.
Indeed, if I have my own constant pointer &const struct my_type *p,
and I call 'rb_next(&p->rb)', I get a GCC warning:
warning: passing argument 1 of \91rb_next\92 discards qualifiers from pointer target
type
Julien Grall [Tue, 6 Jun 2017 14:35:42 +0000 (15:35 +0100)]
xen/arm: vgic: Sanitize target mask used to send SGI
The current function vgic_to_sgi does not sanitize the target mask and
may therefore get an invalid vCPU ID. This will result to an out of
bound access of d->vcpu[...] as there is no check whether the vCPU ID is
within the maximum supported by the guest.
This was introduced by commit ea37fd2111 "xen/arm: split vgic driver
into generic and vgic-v2 driver".
Jan Beulich [Tue, 20 Jun 2017 12:46:47 +0000 (14:46 +0200)]
gnttab: __gnttab_unmap_common_complete() is all-or-nothing
All failures have to be detected in __gnttab_unmap_common(), the
completion function must not skip part of its processing. In particular
the GNTMAP_device_map related putting of page references and adjustment
of pin count must not occur if __gnttab_unmap_common() signaled an
error. Furthermore the function must not make adjustments to global
state (here: clearing GNTTAB_device_map) before all possibly failing
operations have been performed.
There's one exception for IOMMU related failures: As IOMMU manipulation
occurs after GNTMAP_*_map have been cleared already, the related page
reference and pin count adjustments need to be done nevertheless. A
fundamental requirement for the correctness of this is that
iommu_{,un}map_page() crash any affected DomU in case of failure.
The version check appears to be pointless (or could perhaps be a
BUG_ON() or ASSERT()), but for the moment also move it.
George Dunlap [Tue, 20 Jun 2017 12:46:21 +0000 (14:46 +0200)]
gnttab: correct logic to get page references during map requests
The rules for reference counting are somewhat complicated:
* Each of GNTTAB_host_map and GNTTAB_device_map need their own
reference count
* If the mapping is writeable:
- GNTTAB_host_map needs a type count under only some conditions
- GNTTAB_device_map always needs a type count
If the mapping succeeds, we need to keep all of these; if the mapping
fails, we need to release whatever references we have acquired so far.
Additionally, the code that does a lot of this calculation "inherits"
a reference as part of the process of finding out who the owner is.
Finally, if the grant is mapped as writeable (without the
GNTMAP_readonly flag), but the hypervisor cannot grab a
PGT_writeable_page type, the entire operation should fail.
Unfortunately, the current code has several logic holes:
* If a grant is mapped only GNTTAB_device_map, and with a writeable
mapping, but in conditions where a *host* type count is not
necessary, the code will fail to grab the necessary type count.
* If a grant is mapped both GNTTAB_device_map and GNTTAB_host_map,
with a writeable mapping, in conditions where the host type count is
not necessary, *and* where the page cannot be changed to type
PGT_writeable, the condition will not be detected.
In both cases, this means that on success, the type count will be
erroneously reduced when the grant is unmapped. In the second case,
the type count will be erroneously reduced on the failure path as
well. (In the first case the failure path logic has the same hole
as the reference grabbing logic.)
Additionally, the return value of get_page() is not checked; but this
may fail even if the first get_page() succeeded due to a reference
counting overflow.
First of all, simplify the restoration logic by explicitly counting
the reference and type references acquired.
Consider each mapping type separately, explicitly marking the
'incoming' reference as used so we know when we need to grab a second
one.
Finally, always check the return value of get_page[_type]() and go to
the failure path if appropriate.
This is part of XSA-224.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 20 Jun 2017 12:46:01 +0000 (14:46 +0200)]
gnttab: never create host mapping unless asked to
We shouldn't create a host mapping unless asked to even in the case of
mapping a granted MMIO page. In particular the mapping wouldn't be torn
down when processing the matching unmap request.
George Dunlap [Tue, 20 Jun 2017 12:45:33 +0000 (14:45 +0200)]
gnttab: fix handling of dev_bus_addr during unmap
If a grant has been mapped with the GNTTAB_device_map flag, calling
grant_unmap_ref() with dev_bus_addr set to zero should cause the
GNTTAB_device_map part of the mapping to be left alone.
Unfortunately, at the moment, op->dev_bus_addr is implicitly checked
before clearing the map and adjusting the pin count, but only the bits
above 12; and it is not checked at all before dropping page
references. This means a guest can repeatedly make such a call to
cause the reference count to drop to zero, causing the page to be
freed and re-used, even though it's still mapped in its pagetables.
To fix this, always check op->dev_bus_addr explicitly for being
non-zero, as well as op->flag & GNTMAP_device_map, before doing
operations on the device_map.
While we're here, make the logic a bit cleaner:
* Always initialize op->frame to zero and set it from act->frame, to reduce the
chance of untrusted input being used
* Explicitly check the full dev_bus_addr against act->frame <<
PAGE_SHIFT, rather than ignoring the lower 12 bits
This is part of XSA-224.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 20 Jun 2017 12:41:55 +0000 (14:41 +0200)]
arm: vgic: Don't update the LR when the IRQ is not enabled
gic_raise_inflight_irq will be called if the IRQ is already inflight
(i.e the IRQ is injected to the guest). If the IRQ is already already in
the LRs, then the associated LR will be updated.
To know if the interrupt is already in the LR, the function check if the
interrupt is queued. However, if the interrupt is not enabled then the
interrupt may not be queued nor in the LR. So gic_update_one_lr may be
called (if we inject on the current vCPU) and read the LR.
Because the interrupt is not in the LR, Xen will either read:
* LR 0 if the interrupt was never injected before
* LR 255 (GIC_INVALID_LR) if the interrupt was injected once. This
is because gic_update_one_lr will reset p->lr.
Reading LR 0 will result to potentially update the wrong interrupt and
not keep the LRs in sync with Xen.
Reading LR 255 will result to:
* Crash Xen on GICv3 as the LR index is bigger than supported (see
gicv3_ich_read_lr).
* Read/write always GICH_LR + 255 * 4 that is not part of the memory
mapped.
The problem can be prevented by checking whether the interrupt is
enabled in gic_raise_inflight_irq before calling gic_update_one_lr.
A follow-up of this patch is expected to mitigate the issue in the
future.
Jan Beulich [Tue, 20 Jun 2017 12:41:16 +0000 (14:41 +0200)]
guest_physmap_remove_page() needs its return value checked
Callers, namely such subsequently freeing the page, must not blindly
assume success - the function may namely fail when needing to shatter a
super page, but there not being memory available for the then needed
intermediate page table.
As it happens, guest_remove_page() callers now also all check the
return value.
Furthermore a missed put_gfn() on an error path in gnttab_transfer() is
also being taken care of.
This is part of XSA-222.
Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 20 Jun 2017 12:39:56 +0000 (14:39 +0200)]
memory: fix return value handing of guest_remove_page()
Despite the description in mm.h, guest_remove_page() previously returned 0 for
paging errors.
Switch guest_remove_page() to having regular 0/-error semantics, and propagate
the return values from clear_mmio_p2m_entry() and mem_sharing_unshare_page()
to the callers (although decrease_reservation() is the only caller which
currently cares).
This is part of XSA-222.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 20 Jun 2017 12:37:47 +0000 (14:37 +0200)]
evtchn: avoid NULL derefs
Commit fbbd5009e6 ("evtchn: refactor low-level event channel port ops")
added a de-reference of the struct evtchn pointer for a port without
first making sure the bucket pointer is non-NULL. This de-reference is
actually entirely unnecessary, as all relevant callers (beyond the
problematic do_poll()) already hold the port number in their hands, and
the actual leaf functions need nothing else.
For FIFO event channels there's a second problem in that the ordering
of reads and updates to ->num_evtchns and ->event_array[] was so far
undefined (the read side isn't always holding the domain's event lock).
Add respective barriers.
This is XSA-221.
Reported-by: Ankur Arora <ankur.a.arora@oracle.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 20 Jun 2017 12:36:51 +0000 (14:36 +0200)]
x86: avoid leaking PKRU and BND* between vCPU-s
PKRU is explicitly "XSAVE-managed but not XSAVE-enabled", so guests
might access the register (via {RD,WR}PKRU) without setting XCR0.PKRU.
Force context switching as well as migrating the register as soon as
CR4.PKE is being set the first time.
For MPX (BND<n>, BNDCFGU, and BNDSTATUS) the situation is less clear,
and the SDM has not entirely consistent information for that case.
While experimentally the instructions don't change register state as
long as the two XCR0 bits aren't both 1, be on the safe side and enable
both if BNDCFGS.EN is being set the first time.
This is XSA-220.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 20 Jun 2017 12:36:11 +0000 (14:36 +0200)]
x86/shadow: hold references for the duration of emulated writes
The (misnamed) emulate_gva_to_mfn() function translates a linear address to an
mfn, but releases its page reference before returning the mfn to its caller.
sh_emulate_map_dest() uses the results of one or two translations to construct
a virtual mapping to the underlying frames, completes an emulated
write/cmpxchg, then unmaps the virtual mappings.
The page references need holding until the mappings are unmapped, or the
frames can change ownership before the writes occurs.
This is XSA-219.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Tue, 20 Jun 2017 12:34:34 +0000 (14:34 +0200)]
gnttab: correct maptrack table accesses
In order to observe a consistent (limit,pointer-table) pair, the reader
needs to either hold the maptrack lock (in line with documentation) or
both sides need to order their accesses suitably (the writer side
barrier was removed by commit dff515dfea ["gnttab: use per-VCPU
maptrack free lists"], and a read side barrier has never been there).
Make the writer publish a new table page before limit (for bounds
checks to work), and new list head last (for racing maptrack_entry()
invocations to work). At the same time add read barriers to lockless
readers.
Additionally get_maptrack_handle() must not assume ->maptrack_head to
not change behind its back: Another handle may be put (updating only
->maptrack_tail) and then got or stolen (updating ->maptrack_head).
This is part of XSA-218.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
George Dunlap [Tue, 20 Jun 2017 12:33:13 +0000 (14:33 +0200)]
gnttab: Avoid potential double-put of maptrack entry
Each grant mapping for a particular domain is tracked by an in-Xen
"maptrack" entry. This entry is is referenced by a "handle", which is
given to the guest when it calls gnttab_map_grant_ref().
There are two types of mapping a particular handle can refer to:
GNTMAP_host_map and GNTMAP_device_map. A given
gnttab_unmap_grant_ref() call can remove either only one or both of
these entries. When a particular handle has no entries left, it must
be freed.
gnttab_unmap_grant_ref() loops through its grant unmap request list
twice. It first removes entries from any host pagetables and (if
appropraite) iommus; then it does a single domain TLB flush; then it
does the clean-up, including telling the granter that entries are no
longer being used (if appropriate).
At the moment, it's during the first pass that the maptrack flags are
cleared, but the second pass that the maptrack entry is freed.
Unfortunately this allows the following race, which results in a
double-free:
A: (pass 1) clear host_map
B: (pass 1) clear device_map
A: (pass 2) See that maptrack entry has no mappings, free it
B: (pass 2) See that maptrack entry has no mappings, free it #
Unfortunately, unlike the active entry pinning update, we can't simply
move the maptrack flag changes to the second half, because the
maptrack flags are used to determine if iommu entries need to be
added: a domain's iommu must never have fewer permissions than the
maptrack flags indicate, or a subsequent map_grant_ref() might fail to
add the necessary iommu entries.
Instead, free the maptrack entry in the first pass if there are no
further mappings.
This is part of XSA-218.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 20 Jun 2017 12:32:03 +0000 (14:32 +0200)]
gnttab: fix unmap pin accounting race
Once all {writable} mappings of a grant entry have been unmapped, the
hypervisor informs the guest that the grant entry has been released by
clearing the _GTF_{reading,writing} usage flags in the guest's grant
table as appropriate.
Unfortunately, at the moment, the code that updates the accounting
happens in a different critical section than the one which updates the
usage flags; this means that under the right circumstances, there may be
a window in time after the hypervisor reported the grant as being free
during which the grant referee still had access to the page.
Move the grant accounting code into the same critical section as the
reporting code to make sure this kind of race can't happen.
This is part of XSA-218.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 20 Jun 2017 12:29:51 +0000 (14:29 +0200)]
x86/mm: disallow page stealing from HVM domains
The operation's success can't be controlled by the guest, as the device
model may have an active mapping of the page. If we nevertheless
permitted this operation, we'd have to add further TLB flushing to
prevent scenarios like
"Domains A (HVM), B (PV), C (PV); B->target==A
Steps:
1. B maps page X from A as writable
2. B unmaps page X without a TLB flush
3. A sends page X to C via GNTTABOP_transfer
4. C maps page X as pagetable (potentially causing a TLB flush in C,
but not in B)
At this point, X would be mapped as a pagetable in C while being
writable through a stale TLB entry in B."
A similar scenario could be constructed for A using XENMEM_exchange and
some arbitrary PV domain C then having this page allocated.
This is XSA-217.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Zhongze Liu [Wed, 14 Jun 2017 01:11:48 +0000 (09:11 +0800)]
tools: fix several "format-truncation" warnings with GCC 7
GCC 7.1.1 complains that several buffers passed to snprintf() in xenpmd
and tools/ocmal/xc are too small to hold the largest possible resulting string,
which is calculated by adding up the maximum length of all the substrings.
The warnings are treated as errors by -Werror, and goes like this (abbreviated):
xenpmd.c:94:36: error: ‘%s’ directive output may be truncated writing up to
255 bytes into a region of size 13 [-Werror=format-truncation=]
#define BATTERY_INFO_FILE_PATH "/proc/acpi/battery/%s/info"
^
xenpmd.c:113:13: note: ‘snprintf’ output between 25 and 280 bytes into a
destination of size 32
xenpmd.c:95:37: error: ‘%s’ directive output may be truncated writing up to
255 bytes into a region of size 13 [-Werror=format-truncation=]
#define BATTERY_STATE_FILE_PATH "/proc/acpi/battery/%s/state"
^
xenpmd.c:116:13: note: ‘snprintf’ output between 26 and 281 bytes into a
destination of size 32
xenctrl_stubs.c:65:15: error: ‘%s’ directive output may be truncated writing
up to 1023 bytes into a region of size 252 [-Werror=format-truncation=]
"%d: %s: %s", error->code,
^~
xenctrl_stubs.c:64:4: note: ‘snprintf’ output 5 or more bytes (assuming 1028)
into a destination of size 256
Enlarge the size of these buffers as suggested by the complier
(and slightly rounded) to fix the warnings.
No functional changes.
Signed-off-by: Zhongze Liu <blackskygg@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Dushyant Behl [Fri, 16 Jun 2017 14:18:10 +0000 (16:18 +0200)]
x86/SVM: correct comments in vmcb.h
The VMEXIT codes listed from EXCEPTION_PF to EXCEPTION_XF had comments
describe the exitcodes slightly shifted than the expected value.
The expected exitcode value for page-fault is 78 which should be 0x4E
and so on till exception XF.
Signed-off-by: Dushyant Behl <myselfdushyantbehl@gmail.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Julien Grall [Tue, 13 Jun 2017 16:13:13 +0000 (17:13 +0100)]
xen/arm: Introduce wrappers for MFN <-> MADDR and GFN <-> GADDR
The new wrappers will add more safety when converting an address to a
frame number (either machine or guest). A follow-up patch will use them
to simplify the code.
Julien Grall [Tue, 13 Jun 2017 16:13:06 +0000 (17:13 +0100)]
xen/arm: mm: Clean-up mfn_to_xen_entry
The physical address is computed from the machine frame number, so
checking if the physical address is page aligned is pointless.
Furthermore, directly assigned the MFN to the corresponding field in the
entry rather than converting to a physical address and orring the value.
It will avoid to rely on the field position and make the code clearer.
Andre Przywara [Thu, 18 Aug 2016 14:40:55 +0000 (15:40 +0100)]
ARM: vITS: create ITS subnodes for Dom0 DT
Dom0 expects all ITSes in the system to be propagated to be able to
use MSIs.
Create Dom0 DT nodes for each hardware ITS, keeping the register frame
address the same, as the doorbell address that the Dom0 drivers program
into the BARs has to match the hardware.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Wed, 7 Sep 2016 00:44:05 +0000 (01:44 +0100)]
ARM: vITS: create and initialize virtual ITSes for Dom0
For each hardware ITS create and initialize a virtual ITS for Dom0.
We use the same memory mapped address to keep the doorbell working.
This introduces a function to initialize a virtual ITS.
We maintain a list of virtual ITSes, at the moment for the only
purpose of later being able to free them again.
We configure the virtual ITSes to match the hardware ones, that is we
keep the number of device ID bits and event ID bits the same as the host
ITS.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Acked-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Mon, 10 Apr 2017 19:13:47 +0000 (20:13 +0100)]
ARM: vITS: increase mmio_count for each ITS
Increase the count of MMIO regions needed by one for each ITS Dom0 has
to emulate. We emulate the ITSes 1:1 from the hardware, so the number
is the number of host ITSes.
Andre Przywara [Tue, 6 Sep 2016 16:20:50 +0000 (17:20 +0100)]
ARM: vITS: handle INVALL command
The INVALL command instructs an ITS to invalidate the configuration
data for all LPIs associated with a given redistributor (read: VCPU).
This is nasty to emulate exactly with our architecture, so we just
iterate over all mapped LPIs and filter for those from that particular
VCPU.
Andre Przywara [Tue, 6 Sep 2016 16:20:17 +0000 (17:20 +0100)]
ARM: vITS: handle INV command
The INV command instructs the ITS to update the configuration data for
a given LPI by re-reading its entry from the property table.
We don't need to care so much about the priority value, but enabling
or disabling an LPI has some effect: We remove or push virtual LPIs
to their VCPUs, also check the virtual pending bit if an LPI gets enabled.
Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Wed, 7 Sep 2016 00:51:41 +0000 (01:51 +0100)]
ARM: vITS: handle DISCARD command
The DISCARD command drops the connection between a DeviceID/EventID
and an LPI/collection pair.
We mark the respective structure entries as not allocated and make
sure that any queued IRQs are removed.
Andre Przywara [Wed, 7 Sep 2016 00:50:55 +0000 (01:50 +0100)]
ARM: vITS: handle MOVI command
The MOVI command moves the interrupt affinity from one redistributor
(read: VCPU) to another.
For now migration of "live" LPIs is not yet implemented, but we store
the changed affinity in our virtual ITTE and the pending_irq.
Andre Przywara [Wed, 7 Sep 2016 00:49:37 +0000 (01:49 +0100)]
ARM: vITS: handle MAPTI/MAPI command
The MAPTI commands associates a DeviceID/EventID pair with a LPI/CPU
pair and actually instantiates LPI interrupts. MAPI is just a variant
of this comment, where the LPI ID is the same as the event ID.
We connect the already allocated host LPI to this virtual LPI, so that
any triggering LPI on the host can be quickly forwarded to a guest.
Beside entering the domain and the virtual LPI number in the respective
host LPI entry, we also initialize and add the already allocated
struct pending_irq to our radix tree, so that we can now easily find it
by its virtual LPI number.
We also read the property table to update the enabled bit and the
priority for our new LPI, as we might have missed this during an earlier
INVALL call (which only checks mapped LPIs). But we make sure that the
property table is actually valid, as all redistributors might still
be disabled at this point.
Since write_itte() now sees its first usage, we change the declaration
to static.