Uwe Dannowski [Fri, 16 Feb 2018 13:19:54 +0000 (13:19 +0000)]
x86/microcode: Propagate microcode update errors
Errors on updating the microcode in the processor were silently
dropped when invoked via the microcode_update hypercall. Also, the log
message was misleading.
Signed-off-by: Uwe Dannowski <uwed@amazon.de> Reviewed-by: Stefan Nuernberger <snu@amazon.de> Reviewed-by: Martin Pohlack <mpohlack@amazon.de> Reviewed-by: Amit Shah <aams@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 15 Feb 2018 17:17:32 +0000 (18:17 +0100)]
x86/srat: fix end calculation in nodes_cover_memory()
Along the lines of commit 7226486767 ("x86/srat: fix the end pfn check
in valid_numa_range()") nodes_cover_memory() also doesn't consistently
use "end": It's set to an inclusive value initially, but then compared
to the exclusive "end" field of struct node and also possibly set to
nodes[j].start, making it exclusive too. Change the initialization to
make the variable consistently exclusive.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Ross Lagerwall [Thu, 15 Feb 2018 17:16:17 +0000 (18:16 +0100)]
x86/hvm/dmop: only copy what is needed to/from the guest
dm_op() fails with -EFAULT if the struct xen_dm_op given by the guest is
smaller than Xen's struct xen_dm_op. This is a problem because DMOP is
meant to be a stable ABI but it breaks whenever the size of struct
xen_dm_op changes.
To fix this, change how the copying to and from the guest is done. When
copying from the guest, first copy the header and inspect the op. Then,
only copy the correct amount needed for that op. When copying to the
guest, don't copy the header. Rather, copy only the correct amount
needed for that particular op.
So now the dm_op() will fail if the guest does not supply enough bytes
for the specific op. It will not fail if the guest supplies too many
bytes for the specific op, but Xen will not copy the extra bytes.
Remove some now unused macros and helper functions.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Alexandru Isaila [Thu, 15 Feb 2018 10:22:26 +0000 (12:22 +0200)]
hvm/svm: Enable CR events
The CR_INTERCEPT_CR3_WRITE intercept is out of the vmcb->_cr_intercepts
so the AMD arch can't intercept CR events.
This patch implements the CR intercept by adding the flag on a
write_ctrlreg event. The monitor write ctrlreg event is moved from the
Intel side to the common capabilities side.
We just need to enable the SVM intercept and then hvm_mov_to_cr() will
forward the event on to the monitor when appropriate.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Alexandru Isaila [Thu, 15 Feb 2018 10:22:25 +0000 (12:22 +0200)]
hvm/svm: Enable MSR events
At this moment there is no function to enable msr interception on svm.
This patch implements this function and moves the mov to msr monitor
event
form the Intel arch side to the common capabilities.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Alexandru Isaila [Thu, 15 Feb 2018 10:22:24 +0000 (12:22 +0200)]
hvm/svm: Enable Breakpoint events
This commit implements the breakpoint events for svm.
At the moment, the Breakpoint vmexit is not forwarded to the monitor
layer.
This patch adds the hvm_monitor_debug call to the VMEXIT_EXCEPTION_BP.
Also, the Software Breakpoint cap is moved from the Intel arch to the
common part of the code.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Andrew Cooper [Mon, 12 Feb 2018 16:06:00 +0000 (16:06 +0000)]
x86/xpti: Hide almost all of .text and all .data/.rodata/.bss mappings
The current XPTI implementation isolates the directmap (and therefore a lot of
guest data), but a large quantity of CPU0's state (including its stack)
remains visible.
Furthermore, an attacker able to read .text is in a vastly superior position
to normal when it comes to fingerprinting Xen for known vulnerabilities, or
scanning for ROP/Spectre gadgets.
Collect together the entrypoints in .text.entry (currently 3x4k frames, but
can almost certainly be slimmed down), and create a common mapping which is
inserted into each per-cpu shadow. The stubs are also inserted into this
mapping by pointing at the in-use L2. This allows stubs allocated later (SMP
boot, or CPU hotplug) to work without further changes to the common mappings.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs
The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPUs (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always use the boot CPU MIDR.
Fix is_affected_midr_range to use the current CPU MIDR.
Julien Grall [Wed, 14 Feb 2018 15:30:44 +0000 (15:30 +0000)]
xen/arm: Extend the number of memory banks supported
When booting using Grub on Thunder-X, the number of memory available is
greater than 64. Bump the number to 128, so we can take advantage of all
the memory.
Alexandru Isaila [Mon, 12 Feb 2018 15:08:15 +0000 (17:08 +0200)]
asm-x86/monitor: Fix monitor capability reporting on SVM systems
No monitor features are available on AMD and all
capabilities are passed only to the Intel processor architecture.
This means that the arch_monitor_get_capabilities returns
capabilities = 0.
This patch is separating out features which are implemented on both
systems from those implemented only on Intel, so that we advertize the
working capabilities on AMD.
Signed-off-by: Alexandru Isaila <aisaila@bitdefender.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Acked-by: Razvan Cojocaru <rcojocaru@bitdefender.com>
Andrew Cooper [Wed, 14 Feb 2018 10:38:34 +0000 (10:38 +0000)]
x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST
DO_OVERWRITE_RSB clobbers %rax, meaning in practice that the bti_ist_info
field gets zeroed. Older versions of this code had the DO_OVERWRITE_RSB
register selectable, so reintroduce this ability and use it to cause the
INTR_IST path to use %rdx instead.
The use of %dl for the %cs.rpl check means that when an IST interrupt hits
Xen, we try to load 1 into the high 32 bits of MSR_SPEC_CTRL, suffering a #GP
fault instead.
Also, drop an unused label which was a copy/paste mistake.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Wed, 14 Feb 2018 07:16:00 +0000 (08:16 +0100)]
firmware/shim: avoid mkdir error during Xen tree setup
"mkdir -p" reports a missing operand, as config/ has no subdirs. Oddly
enough this doesn't cause the whole command (and hence the build to
fail), despite the "set -e" now covering the entire set of commands -
perhaps a quirk of the relatively old bash I've seen this with (a few
simple experiments suggest that commands inside () producing a non-
success status would exit the inner shell, but not the outer one).
Add a dummy . argument to the invocation.
Suggested-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Sameer Goel [Tue, 13 Feb 2018 16:56:42 +0000 (17:56 +0100)]
bitops: rename LOG_2 to ilog2
Changing the name of the macro from LOG_2 to ilog2.This makes the function name
similar to its Linux counterpart. Since, this is not used in multiple places,
the code churn is minimal.
This change helps in porting unchanged code from Linux.
Signed-off-by: Sameer Goel <sameer.goel@linaro.org> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Roger Pau Monné [Tue, 13 Feb 2018 16:55:43 +0000 (17:55 +0100)]
xsm: add bodge when compiling with llvm coverage support
llvm coverage support seems to disable some of the optimizations
needed in order to compile xsm, and the end result is that references
to __xsm_action_mismatch_detected are left in the object files.
Since coverage support cannot be used in production, introduce
__xsm_action_mismatch_detected for llvm coverage builds.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
Roger Pau Monné [Tue, 13 Feb 2018 16:54:09 +0000 (17:54 +0100)]
coverage: introduce support for llvm profiling
Introduce the functionality in order to fill the hooks of the
cov_sysctl_ops struct. Note that the functionality is still not wired
into the build system.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 13 Feb 2018 16:29:50 +0000 (17:29 +0100)]
x86: use paging_mark_pfn_dirty()
... in preference over paging_mark_dirty(), when the PFN is known
anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Tue, 13 Feb 2018 16:28:36 +0000 (17:28 +0100)]
x86/mm: clean up SHARED_M2P{,_ENTRY} uses
Stop open-coding SHARED_M2P() and drop a pointless use of it from
paging_mfn_is_dirty() (!VALID_M2P() is a superset of SHARED_M2P()) and
another one from free_page_type() (prior assertions render this
redundant).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Juergen Gross [Tue, 21 Nov 2017 11:06:06 +0000 (12:06 +0100)]
tools/libxl: mark special pages as reserved in e820 map for PVH
The "special pages" for PVH guests include the frames for console and
Xenstore ring buffers. Those have to be marked as "Reserved" in the
guest's E820 map, as otherwise conflicts might arise later e.g. when
hotplugging memory into the guest.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Simon Gaiser [Thu, 8 Feb 2018 21:49:10 +0000 (22:49 +0100)]
libxc: xc_dom_parse_elf_kernel: Return error for invalid kernel images
Commit 96edb111dd ("libxc: panic when trying to create a PVH guest
without kernel support") already improved the handling of non PVH
capable kernels. But xc_dom_parse_elf_kernel() still returned success on
invalid elf images and the domain build only failed later. Now the build
process will fail immediately on detecting the error.
Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Paul Semel [Mon, 12 Feb 2018 12:09:15 +0000 (13:09 +0100)]
libxc: check for null size file mapping
Changed the error message when trying to map a null size file.
When doing `xl create` command, we get an Invalid Kernel error
when the file size is greater than zero. For zero length files, we are
falling in the mmap error, and we get an `Invalid parameter` error,
which is not explicit. With this change, we get a `zero length file`
error.
Signed-off-by: Paul Semel <semelpaul@gmail.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Haozhong Zhang [Mon, 12 Feb 2018 01:44:23 +0000 (09:44 +0800)]
x86/srat: fix the end pfn check in valid_numa_range()
... and fix the coding style on fly.
valid_numa_range(..., epfn << PAGE_SHIFT, ...) and its only caller
memory_add(..., epfn, pxm) interpret epfn inconsistently. The former
interprets epfn as the last pfn, while the latter interprets it as the
last pfn plus one. Fix this inconsistency in valid_numa_range(), since
most of other places use the latter interpretation.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Tue, 6 Feb 2018 15:53:25 +0000 (15:53 +0000)]
xen/arm: vpsci: Move PSCI function dispatching from vsmc.c to vpsci.c
At the moment PSCI function dispatching is done in vsmc.c and the
function implementation in vpsci.c. Some bits of the implementation is
even done in vsmc.c (see PSCI_SYSTEM_RESET).
This means that it is difficult to follow the implementation and also
it requires to export functions for each PSCI function.
Therefore move PSCI dispatching in two new functions do_vpsci_0_1_call
and do_vpsci_0_2_call. The former will handle PSCI 0.1 calls while the
latter 0.2 or later calls.
At the same time, a new header vpsci.h was created to contain all
definitions for virtual PSCI and avoid confusion with the host PSCI.
Julien Grall [Tue, 6 Feb 2018 15:53:24 +0000 (15:53 +0000)]
xen/arm: vsmc: Don't implement function IDs that don't exist
The current implementation of SMCCC relies on the fact only the function
number (bits [15:0]) is enough to identify what to implement.
However, PSCI call are only available in the range 0x84000000-0x8400001F
and 0xC4000000-0xC400001F. Furthermore, not all SMC32 functions have
equivalent in the SMC64. This is the case of:
* PSCI_VERSION
* CPU_OFF
* MIGRATE_INFO_TYPE
* SYSTEM_OFF
* SYSTEM_RESET
Similarly call count, call uid, revision can only be query using smc32/hvc32
fast calls (See 6.2 in ARM DEN 0028B).
Xen should only implement identifier existing in the specification in
order to avoid potential clashes with later revision. Therefore rework the
vsmc code to use the whole function identifier rather than only the
function number.
At the same time, the new macros for call count, call uid, revision are
renamed to better suit the spec.
Lastly, update SSSC_SMCCC_FUNCTION_COUNT to match the correct number of
funtions. Note that version is not updated because the number has always
been wrong, and nobody could properly use it.
Andre Przywara [Tue, 6 Feb 2018 17:09:03 +0000 (17:09 +0000)]
ARM: make nr_irqs a constant
On ARM the maximum number of IRQs is a constant, but we share it being
a variable to match x86. Since we are not supposed to alter it, let's
mark it as "const" to avoid accidental change.
Andre Przywara [Tue, 6 Feb 2018 17:09:02 +0000 (17:09 +0000)]
ARM: VGIC: rework gicv[23]_update_lr to not use pending_irq
The functions to actually populate a list register were accessing
the VGIC internal pending_irq struct, although they should be abstracting
from that.
Break the needed information down to remove the reference to pending_irq
from gic-v[23].c.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Tue, 6 Feb 2018 17:09:01 +0000 (17:09 +0000)]
ARM: VGIC: factor out vgic_get_hw_irq_desc()
At the moment we happily access the VGIC internal struct pending_irq
(which describes a virtual IRQ) in irq.c.
Factor out the actually needed functionality to learn the associated
hardware IRQ and move that into gic-vgic.c to improve abstraction.
Andre Przywara [Tue, 6 Feb 2018 17:09:00 +0000 (17:09 +0000)]
ARM: VGIC: factor out vgic_connect_hw_irq()
At the moment we happily access VGIC internal data structures like
the rank and struct pending_irq in gic.c, which should be VGIC agnostic.
Factor out a new function vgic_connect_hw_irq(), which allows a virtual
IRQ to be connected to a hardware IRQ (using the hw bit in the LR).
This removes said accesses to VGIC data structures and improves abstraction.
One thing to note is that this changes the locking scheme slightly:
we hold the rank lock for a shorter period of time, not covering some
of the later lines, which deal with the "irq_desc" structure only. This
should not have any adverse effect, but is a change in locking anyway.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Tue, 6 Feb 2018 17:08:59 +0000 (17:08 +0000)]
ARM: VGIC: rework events_need_delivery()
In event.h we very deeply dive into the VGIC to learn if an event for
a guest is pending.
Rework that function to abstract the VGIC specific part out. Also
reorder the queries there, as we only actually need to check for the
event channel if there are no other pending IRQs.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Tue, 6 Feb 2018 17:08:58 +0000 (17:08 +0000)]
ARM: VGIC: split up gic_dump_info() to cover virtual part separately
Currently gic_dump_info() not only dumps the hardware state of the GIC,
but also the VGIC internal virtual IRQ lists.
Split the latter off and move it into gic-vgic.c to observe the abstraction.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Andre Przywara [Tue, 6 Feb 2018 17:08:57 +0000 (17:08 +0000)]
ARM: VGIC: split gic.c to observe hardware/virtual GIC separation
Currently gic.c holds code to handle hardware IRQs as well as code to
bridge VGIC requests to the GIC virtualization hardware.
Despite being named gic.c, this file reaches into the VGIC and uses data
structures describing virtual IRQs.
To improve abstraction, move the VGIC functions into a separate file,
so that gic.c does what it says on the tin.
Signed-off-by: Andre Przywara <andre.przywara@linaro.org> Acked-by: Julien Grall <julien.grall@arm.com>
Andre Przywara [Tue, 6 Feb 2018 17:08:56 +0000 (17:08 +0000)]
ARM: VGIC: drop unneeded gic_restore_pending_irqs()
In gic_restore_pending_irqs() we push our pending virtual IRQs into the
list registers. This function is called once from gic_inject(), just
before we return to the guest, but also in gic_restore_state(), when
we context-switch a VCPU. Having a closer look it turns out that the
later call is not needed, since we will always call gic_inject() anyway.
So remove that call (and the forward declaration) to streamline this
interface and make separating the GIC from the VGIC world later.
George Dunlap [Thu, 8 Feb 2018 16:23:50 +0000 (16:23 +0000)]
xen: Disable ARINC653 scheduler by default for non-DEBUG builds
The ARINC653 scheduler is targeted at a very specific niche; typical
users cannot benefit from using it. Disable it by default for
non-DEBUG builds. (Enable it for DEBUG builds so that we catch any
build breakages sooner rather than later.)
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Dario Faggioli <dfaggioli@suse.com>
Roger Pau Monné [Wed, 7 Feb 2018 15:32:18 +0000 (16:32 +0100)]
kconfig/gcov: rename to coverage
So it can be used by both gcc and clang. Just add the Kconfig option
and modify the makefiles so the llvm coverage specific code can be
added in a follow up patch.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
[jb: also change the shim config]
Jan Beulich [Wed, 7 Feb 2018 15:31:41 +0000 (16:31 +0100)]
x86: reduce Meltdown band-aid IPI overhead
In case we can detect single-threaded guest processes (by checking
whether we can account for all root page table uses locally on the vCPU
that's running), there's no point in issuing a sync IPI upon an L4 entry
update, as no other vCPU of the guest will have that page table loaded.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 7 Feb 2018 15:30:24 +0000 (16:30 +0100)]
PCI/passthrough: don't discard Dom0 provided information
Instead of giving, to subsequent code, the appearance of there not
having been any "info" data provided, adjust the conditional guarding
SR-IOV handling.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Wed, 31 Jan 2018 16:09:39 +0000 (16:09 +0000)]
x86/boot: Make alternative patching NMI-safe
During patching, there is a very slim risk that an NMI or MCE interrupt in the
middle of altering the code in the NMI/MCE paths, in which case bad things
will happen.
The NMI risk can be eliminated by running the patching loop in NMI context, at
which point the CPU will defer further NMIs until patching is complete.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
George Dunlap [Wed, 24 Jan 2018 11:56:31 +0000 (11:56 +0000)]
x86/mm: Add debug code to detect illegal page_lock and put_page_type ordering
The fix for XSA-242 depends on the same cpu never calling
_put_page_type() while holding a page_lock() for that page; doing so
may cause a deadlock under the right conditions.
Furthermore, even before that, there was never any discipline for the
order in which page locks are grabbed; if there are any paths that
grab the locks for two different pages at once, we risk creating the
conditions for a deadlock to occur.
These are believed to be safe, because it is believed that:
1. No hypervisor paths ever lock two pages at once, and
2. We never call _put_page_type() on a page while holding its page lock.
Add a check to debug builds to catch any violations of these
assumpitons.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Michael Young [Tue, 6 Feb 2018 21:27:23 +0000 (21:27 +0000)]
make xen ocaml safe-strings compliant
Xen built with ocaml 4.06 gives errors such as
Error: This expression has type bytes but an expression was
expected of type string
as Byte and safe-strings which were introduced in 4.02 are the
default in 4.06.
This patch which is mostly by Richard W.M. Jones of Red Hat
from https://bugzilla.redhat.com/show_bug.cgi?id=1526703
fixes these issues.
v2: drop tools/ocaml/libs/xc/xenctrl.ml from the patch as the
affected code was removed by commit d933f1a53c06002351c1e36d40615e40bd4bf6af
tools/ocaml: Drop coredump infrastructure
Signed-off-by: Michael Young <m.a.young@durham.ac.uk> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
[ wei: remove trailing whitespaces ] Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Tue, 6 Feb 2018 13:45:17 +0000 (13:45 +0000)]
x86/spec_ctrl: Fix determination of when to use IBRS
The original version of this logic was:
/*
* On Intel hardware, we'd like to use retpoline in preference to
* IBRS, but only if it is safe on this hardware.
*/
else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
{
if ( retpoline_safe() )
thunk = THUNK_RETPOLINE;
else
ibrs = true;
}
but it was changed by a request during review. Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.
This in practice means that on repoline-unsafe hardware without IBRS, we
select THUNK_JUMP despite intending to select THUNK_RETPOLINE.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Then plan is to use XENMEM_add_to_physmap_batch to map the shared pages from
one domU to another and use XENMEM_remove_from_physmap to cancel the sharing.
A wrapper to XENMEM_add_to_physmap_batch was added in the following commit:
Jan Beulich [Tue, 6 Feb 2018 16:29:33 +0000 (17:29 +0100)]
libxc: don't fail domain creation when unpacking initrd fails
At least Linux kernels have been able to work with gzip-ed initrd for
quite some time; initrd compressed with other methods aren't even being
attempted to unpack. Furthermore the unzip-ing routine used here isn't
capable of dealing with various forms of concatenated files, each of
which was gzip-ed separately (it is this particular case which has been
the source of observed VM creation failures).
Hence, if unpacking fails, simply hand the compressed blob to the guest
as is.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
Andrew Cooper [Fri, 2 Feb 2018 16:10:17 +0000 (16:10 +0000)]
x86/emul: Fix the emulation of invlpga
The instruction requires EFER.SVME set to be usable in the first place.
Furthermore, the emulation doesn't handle ASIDs, so avoid giving the
impression that they work. Permit ASID 0 which is reserved for non-root
mode (in which case the instruction is identical to invlpg), but raise #UD for
any other ASID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Brian Woods [Mon, 5 Feb 2018 09:15:25 +0000 (10:15 +0100)]
x86/svm: correct EFER.SVME intercept checks
Corrects some EFER.SVME checks in intercepts. See AMD APM vol2 section
15.4 for more details. VMMCALL isn't checked due to guests needing it
to boot.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Brian Woods [Mon, 5 Feb 2018 09:14:48 +0000 (10:14 +0100)]
x86/svm: update VGIF support
There are places where the GIF value is checked. A guest with VGIF
enabled can change the GIF value without the host being involved,
therefore it needs to check the GIF value in the VMCB rather the one in
the nestedsvm struct.
Signed-off-by: Brian Woods <brian.woods@amd.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Jan Beulich [Mon, 5 Feb 2018 09:14:15 +0000 (10:14 +0100)]
x86emul: add missing suffixes in test harness
I'm in the process of putting together a gas change issuing at least
warnings when the intended size of a memory operation can't be deduced
from another (register) operand. Add missing suffixes to silence such
future diagnostics.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Julien Grall [Fri, 2 Feb 2018 10:14:44 +0000 (10:14 +0000)]
xen/arm: Don't crash the domain on invalid HVC immediate
domain_crash_synchronous() should only be used when something went wrong
in Xen. It is better to inject to the guest as it will be in a better
position to provide helpful information (stack trace...).
Julien Grall [Fri, 2 Feb 2018 10:14:43 +0000 (10:14 +0000)]
xen/arm: Don't crash domain on bad MMIO emulation
Now the MMIO emulation is able to distinguish unhandled IO from aborted
one, there are no need to crash the domain when the region is access
with a bad width.
Instead let Xen inject a data abort to the guest and decide what to do.
Julien Grall [Fri, 2 Feb 2018 10:14:42 +0000 (10:14 +0000)]
xen/arm: io: Distinguish unhandled IO from aborted one
Currently, Xen is considering that an IO could either be handled or
unhandled. When unhandled, the stage-2 abort function will try another
way to resolve the abort.
However, the MMIO emulation may return unhandled when the address
belongs to an emulated range but was not correct. In that case, Xen
should avoid to try another way and directly inject a guest data abort.
Introduce a tri-state return to distinguish the following state:
* IO_ABORT: The IO was handled but resulted in an abort
* IO_HANDLED: The IO was handled
* IO_UNHANDLED: The IO was unhandled
For now, it is considered that an IO belonging to an emulated range
could either be handled or inject an abort. This could be revisit in the
future if overlapped region exist (or we want to try another way to
resolve the abort).
Julien Grall [Fri, 2 Feb 2018 10:14:41 +0000 (10:14 +0000)]
xen/arm: traps: Merge try_handle_mmio() and handle_mmio()
At the moment, try_handle_mmio() will do check on the HSR and bail out
if one check fail. This means that another method will be tried to
handle the fault even for bad access on emulated region. While this
should not be an issue, this is not future proof.
Move the checks of try_handle_mmio() in handle_mmio() after we identified
the fault to target an emulated MMIO. While this does not fix the potential
fall-through, a follow-up patch will do by distinguish the potential error.
Note that the handle_mmio() was renamed to try_handle_mmio() and the
prototype adapted.
While merging the 2 functions, remove the check whether the fault is
stage-2 abort on stage-1 translation walk because the instruction
syndrome will always be invalid (see B3-1433 in DDI 0406C.c and
D10-2460 in DDI 0487C.a).
Julien Grall [Fri, 2 Feb 2018 14:19:25 +0000 (14:19 +0000)]
xen/arm32: entry: Document the purpose of r11 in the traps handler
It took me a bit of time to understand why __DEFINE_TRAP_ENTRY is
storing the original stack pointer in r11. It is working in pair with
return_traps_entry where sp will be restored from r11.
This is fine because per the AAPCS r11 must be preserved by the
subroutine. So in return_from_trap, r11 will still contain the original
stack pointer.
Add some documentation in the code to point the 2 sides to each other.
Julien Grall [Fri, 2 Feb 2018 14:19:24 +0000 (14:19 +0000)]
xen/arm32: Invalidate icache on guest exist for Cortex-A15
In order to avoid aliasing attacks against the branch predictor on
Cortex A-15, let's invalidate the BTB on guest exit, which can only be
done by invalidating the icache (with ACTLR[0] being set).
We use the same hack as for A12/A17 to perform the vector decoding.
This is based on Linux patch from the kpti branch in [1].
Julien Grall [Fri, 2 Feb 2018 14:19:23 +0000 (14:19 +0000)]
xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12
In order to avoid aliasing attackes agains the branch predictor, let's
invalidate the BTB on guest exist. This is made complicated by the fact
that we cannot take a branch invalidating the BTB.
This is based on the fourth version posted by Marc Zyngier on Linux-arm
mailing list (see [1]).
Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Aliasing attacked against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initiatial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.
Most of mitigations will have to be applied when entering to the
hypervisor from the guest context.
Because the attack is against branch predictor, it is not possible to
safely use branch instruction before the mitigation is applied.
Therefore this has to be done in the vector entry before jump to the
helper handling a given exception.
However, on arm32, each vector contain a single instruction. This means
that the hardened vector tables may rely on the state of registers that
does not hold when in the hypervisor (e.g SP is 8 bytes aligned).
Therefore hypervisor code running with guest vectors table should be
minimized and always have IRQs and SErrors masked to reduce the risk to
use them.
This patch provides an infrastructure to switch vector tables before
entering to the guest and when leaving it.
Note that alternative could have been used, but older Xen (4.8 or
earlier) doesn't have support. So avoid using alternative to ease
backporting.
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry
At the moment, the reset vector is defined as .word 0 (e.g andeq r0, r0,
r0).
This is rather unintuitive and will result to execute the trap
undefined. Instead introduce trap helpers for reset and will generate an
error message in the unlikely case that reset will be called.
Andrew Cooper [Tue, 30 Jan 2018 11:08:45 +0000 (11:08 +0000)]
arm/alternatives: Drop the !HAS_ALTERNATIVE infrastructure
ARM now unconditionally selects HAS_ALTERNATIVE, which has caused the
!HAS_ALTERNATIVE code in include/asm-arm/alternative.h to bitrot to the point
of failing to compile.
Expand all the CONFIG_HAS_ALTERNATIVE references in ARM code.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Andrew Cooper [Thu, 11 Jan 2018 12:42:59 +0000 (12:42 +0000)]
x86/ioemul: Misc improvements to ioport_emulate.c
Put the opcode into an array and use memcpy. This allows the compiled code to
be written with two movs, rather than 10 mov $imm8's. Also, drop trailing
whitespace in the file.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Thu, 1 Feb 2018 10:32:45 +0000 (11:32 +0100)]
x86/shim: don't use 32-bit compare on boolean variable
Current upstream gas silently assumes 32-bit operand size for most
operations where the size can't be inferred from an involved register
(my own one doesn't anymore, which is how I've noticed this). It is pure
luck that the 3 bytes following pvh_boot are currently padding ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Thu, 25 Jan 2018 12:16:12 +0000 (12:16 +0000)]
x86/emul: Improvements to internal users of decode_register()
Most users of decode_register() can be replaced with decode_gpr() right away.
For the few sites which do care about possibly using the legacy byteop
encoding, rename decode_register() to _decode_gpr() (to match its non-legacy
counterpart), and adjust its 'int highbyte_regs' parameter to the more correct
'bool legacy'.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 25 Jan 2018 12:16:12 +0000 (12:16 +0000)]
x86/hvm: Improvements to external users of decode_register()
* Rename to decode_gpr() to be more specific as to its purpose
* Drop the highbyte encoding handling, as no users currently care, and it
unlikely that future users would care.
* Change to a static inline, returning an unsigned long pointer.
Doing so highlights that the "invalid gpr" paths in hvm_mov_{to,from}_cr()
were actually unreachable. All callers already passed in-range GPRs, and
out-of-range GPRs would have hit the BUG() previously.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 25 Jan 2018 12:16:12 +0000 (12:16 +0000)]
x86/emul: Optimise decode_register() somewhat
The positions of GPRs inside struct cpu_user_regs doesn't follow any
particular order, so as compiled, decode_register() becomes a jump table to 16
blocks which calculate the appropriate offset, at a total of 207 bytes.
Instead, pre-compute the offsets at build time and use pointer arithmetic to
calculate the result. By observation, most callers in x86_emulate() inline
and constant-propagate the highbyte_regs value of 0.
The splitting of the general and legacy byte-op cases means that we will now
hit an ASSERT if any code path tries to use the legacy byte-op encoding with a
REX prefix.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Wed, 31 Jan 2018 11:36:38 +0000 (12:36 +0100)]
x86: move declaration of the exception_table to C
This makes the code cleaner because there's no need to declare the
exception_table in assembly, and also fixes the following error when
using clang's integrated assembler: