Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.
Most of the mitigations will have to be applied when entering the
hypervisor from the guest context.
Because the attack is against the branch predictor, it is not possible to
safely use a branch instruction before the mitigation is applied.
Therefore this has to be done in the vector entry before jumping to the
helper handling a given exception.
However, on arm32, each vector contains a single instruction. This means
that the hardened vector tables may rely on register state that
does not hold when in the hypervisor (e.g. SP being 8-byte aligned).
Therefore hypervisor code running with the guest vector table should be
minimized and should always have IRQs and SErrors masked to reduce the risk
of using them.
This patch provides an infrastructure to switch vector tables before
entering the guest and when leaving it.
Note that alternatives could have been used, but older Xen (4.8 or
earlier) doesn't support them. So avoid using alternatives to ease
backporting.
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry
At the moment, the reset vector is defined as .word 0 (i.e. andeq r0, r0,
r0).
This is rather unintuitive and will result in executing the
undefined-instruction trap. Instead, introduce a trap helper for reset that
will generate an error message in the unlikely case that reset is called.
Julien Grall [Tue, 16 Jan 2018 14:23:37 +0000 (14:23 +0000)]
xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Cortex-A57, A72, A73 and A75 are susceptible to branch predictor
aliasing and can theoretically be attacked by malicious code.
This patch implements a PSCI-based mitigation for these CPUs when
available. The call into firmware will invalidate the branch predictor
state, preventing any malicious entries from affecting other victim
contexts.
Ported from Linux git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
This is part of XSA-254.
Julien Grall [Tue, 16 Jan 2018 14:23:36 +0000 (14:23 +0000)]
xen/arm64: Add skeleton to harden the branch predictor aliasing attacks
Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initial skeleton code behind a new Kconfig option to
enable implementation-specific mitigations against these attacks for
CPUs that are affected.
Most of the mitigations will have to be applied when entering the
hypervisor from the guest context. For safety, they are applied at every
exception entry, so there is potential for optimization when receiving
an exception at the same level.
Because the attack is against the branch predictor, it is not possible to
safely use a branch instruction before the mitigation is applied.
Therefore, this has to be done in the vector entry before jumping to the
helper handling a given exception.
On Arm64, each vector can hold 32 instructions. This leaves us 31
instructions for the mitigation. The last one is the branch instruction
to the helper.
Because a platform may have CPUs with different micro-architectures,
a per-CPU vector table needs to be provided. Realistically, only a few
different mitigations will be necessary, so provide a small set of
vector tables. They will be re-used and patched with the mitigations
on demand.
This is based on the work done in Linux (see [1]).
Julien Grall [Tue, 16 Jan 2018 14:23:33 +0000 (14:23 +0000)]
xen/arm: Introduce an enable callback to enable a capability on each online CPU
Once Xen knows what features/workarounds are present on the platform, it
might be necessary to configure each online CPU.
Introduce a new callback "enable" that will be called on each online CPU to
configure the "capability".
The code is based on Linux v4.14 (where cpufeature.c comes from); the
explanation of why stop_machine_run is used is kept, as we may face a similar
problem in the future.
Lastly, introduce enable_errata_workaround, which will be called once CPUs
have booted and before the hardware domain is created.
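As a rough sketch of the shape this takes (based on the Linux-derived cpufeature code; the exact field layout here is an assumption), the capability descriptor gains an enable hook that is invoked on each online CPU:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch of a capability descriptor with the new "enable" hook. */
    struct arm_cpu_capabilities {
        const char *desc;        /* human-readable description, printed when set */
        uint16_t capability;     /* capability bit recorded in the hwcaps bitmap */
        /* Is this particular CPU affected / does it have the feature? */
        bool (*matches)(const struct arm_cpu_capabilities *entry);
        /* Called on every online CPU to configure/enable the capability. */
        void (*enable)(void *data);
    };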
xen/arm: Detect silicon revision and set cap bits accordingly
After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need hypervisor code patches.
For each bug there is a function which checks whether a particular CPU is
affected. This needs to be done on every CPU to cover heterogeneous
systems properly.
If a certain erratum has been detected, the capability bit will be set.
In the case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.
The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.
xen/arm: cpufeature: Provide a helper to check if a capability is supported
The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic way to go through a set of capabilities
and find which ones should be enabled.
The parameter "info" is used to display the kind of capability updated (e.g
workaround, feature...).
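A minimal sketch of such a generic walker, using the capability descriptor sketched earlier and assuming helpers like cpus_have_cap()/cpus_set_cap() for the capability bitmap (names modelled on the Linux-derived code, not guaranteed to match exactly):

    /* Walk a matches()-terminated capability array and record which apply. */
    void update_cpu_capabilities(const struct arm_cpu_capabilities *caps,
                                 const char *info)
    {
        for ( ; caps->matches; caps++ )
        {
            if ( !caps->matches(caps) )
                continue;

            /* Print the description only the first time the bit is set. */
            if ( !cpus_have_cap(caps->capability) && caps->desc )
                printk("%s: %s\n", info, caps->desc);

            cpus_set_cap(caps->capability);
        }
    }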
Julien Grall [Wed, 22 Jun 2016 11:15:18 +0000 (12:15 +0100)]
xen/arm: Add macros to handle the MIDR
Add new macros to easily get different parts of the register and to
check if a given MIDR matches a CPU model range. The latter will be really
useful to handle errata later.
The macros have been imported from the header
arch/arm64/include/asm/cputype.h in Linux v4.6-rc3.
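For illustration, a sketch of what such MIDR helpers look like, following the architectural MIDR layout (implementer in bits [31:24], variant in [23:20], part number in [15:4], revision in [3:0]); the macro names follow the Linux header but may differ slightly from the imported code:

    #include <stdbool.h>
    #include <stdint.h>

    #define MIDR_REVISION_MASK      0xf
    #define MIDR_REVISION(midr)     ((midr) & MIDR_REVISION_MASK)
    #define MIDR_PARTNUM_SHIFT      4
    #define MIDR_PARTNUM_MASK       (0xfff << MIDR_PARTNUM_SHIFT)
    #define MIDR_PARTNUM(midr)      (((midr) & MIDR_PARTNUM_MASK) >> MIDR_PARTNUM_SHIFT)
    #define MIDR_VARIANT_SHIFT      20
    #define MIDR_VARIANT_MASK       (0xf << MIDR_VARIANT_SHIFT)
    #define MIDR_VARIANT(midr)      (((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)
    #define MIDR_IMPLEMENTOR_SHIFT  24
    #define MIDR_IMPLEMENTOR_MASK   (0xffU << MIDR_IMPLEMENTOR_SHIFT)
    #define MIDR_IMPLEMENTOR(midr)  (((midr) & MIDR_IMPLEMENTOR_MASK) >> MIDR_IMPLEMENTOR_SHIFT)

    /* True if 'midr' is the given model (implementer + part number) and its
     * variant:revision value lies within [rv_min, rv_max]. */
    static inline bool midr_is_cpu_model_range(uint32_t midr, uint32_t model,
                                               uint32_t rv_min, uint32_t rv_max)
    {
        uint32_t _model = midr & (MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK);
        uint32_t rv = midr & (MIDR_VARIANT_MASK | MIDR_REVISION_MASK);

        return _model == model && rv >= rv_min && rv <= rv_max;
    }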
Jan Beulich [Wed, 17 Jan 2018 16:24:59 +0000 (17:24 +0100)]
x86: allow Meltdown band-aid to be disabled
First of all, we don't need it on AMD systems. Additionally, allow its use
to be controlled by a command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve
the intended effect - while we likely want it, this will be a later
follow-up.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e871e80c38547d9faefc6604532ba3e985e65873
master date: 2018-01-16 17:50:59 +0100
Jan Beulich [Wed, 17 Jan 2018 16:24:12 +0000 (17:24 +0100)]
x86: Meltdown band-aid against malicious 64-bit PV guests
This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.
The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.
This isn't providing full isolation yet, but it should be covering all
pieces of information, exposure of which would otherwise require an XSA.
There is certainly much room for improvement, especially of performance,
here - first and foremost suppressing all the negative effects on AMD
systems. But in the interest of backportability (including to really old
hypervisors, which may not even have alternative patching) any such is
being left out here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
master date: 2018-01-16 17:49:03 +0100
Jan H. Schönherr [Wed, 17 Jan 2018 16:23:08 +0000 (17:23 +0100)]
x86: Don't use potentially incorrect CPUID values for topology information
Intel says for CPUID leaf 0Bh:
"Software must not use EBX[15:0] to enumerate processor
topology of the system. This value in this field
(EBX[15:0]) is only intended for display/diagnostic
purposes. The actual number of logical processors
available to BIOS/OS/Applications may be different from
the value of EBX[15:0], depending on software and platform
hardware configurations."
And yet, we're using them to derive the number of cores in a package
and the number of siblings in a core.
Derive the number of siblings and cores from EAX instead, which is
intended for that.
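As a hedged illustration of the approach (not the actual Xen code), the per-level x2APIC ID shift widths from EAX[4:0] of leaf 0xB can be combined to yield the thread and core counts; level types and field positions follow the SDM:

    #include <stdint.h>

    /* Assumes CPUID leaf 0xB is supported on this CPU. */
    static void cpuid_count(uint32_t leaf, uint32_t subleaf,
                            uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
    {
        asm volatile ( "cpuid"
                       : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
                       : "0" (leaf), "2" (subleaf) );
    }

    static void derive_topology(unsigned int *threads_per_core,
                                unsigned int *cores_per_package)
    {
        uint32_t eax, ebx, ecx, edx, subleaf = 0;
        unsigned int ht_width = 0, core_width = 0;

        do {
            cpuid_count(0xb, subleaf++, &eax, &ebx, &ecx, &edx);
            switch ( (ecx >> 8) & 0xff )            /* level type */
            {
            case 1: ht_width = eax & 0x1f; break;   /* SMT level  */
            case 2: core_width = eax & 0x1f; break; /* Core level */
            }
        } while ( ((ecx >> 8) & 0xff) != 0 );       /* type 0 => no more levels */

        /* Derived from the shift widths, not from EBX[15:0]. */
        *threads_per_core  = 1u << ht_width;
        *cores_per_package = 1u << (core_width - ht_width);
    }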
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d51baf310e530659f73e714acf575555bdc46303
master date: 2018-01-08 10:48:24 +0000
Andrew Cooper [Wed, 17 Jan 2018 16:22:34 +0000 (17:22 +0100)]
x86/entry: Remove support for partial cpu_user_regs frames
Save all GPRs on entry to Xen.
The entry_int82() path is via a DPL1 gate, only usable by 32bit PV guests, so
can get away with only saving the 32bit registers. All other entrypoints can
be reached from 32 or 64bit contexts.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f9eb74789af77e985ae653193f3622263499f674
master date: 2018-01-05 19:57:07 +0000
Roger Pau Monné [Wed, 17 Jan 2018 16:22:03 +0000 (17:22 +0100)]
x86/upcall: inject a spurious event after setting upcall vector
In case the vCPU has pending events to inject. This fixes a bug that
happened if the guest mapped the vcpu info area using
VCPUOP_register_vcpu_info without having set up the event channel
upcall, and then set up the upcall vector.
In this scenario the guest would not receive any upcalls, because the
call to VCPUOP_register_vcpu_info would have marked the vCPU as having
pending events, but the vector could not be injected because it was
not yet set up.
This has not caused issues so far because all the consumers first
set up the vector callback and then map the vcpu info page, but there's
no limitation that prevents doing it in the inverse order.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5b8ca7dffde866d851f0b87b994e0b13e5b867
master date: 2018-01-04 14:29:16 +0100
Jan Beulich [Wed, 17 Jan 2018 16:21:21 +0000 (17:21 +0100)]
x86/E820: don't overrun array
The bounds check needs to be done after the increment, not before, or
else it needs to use a one lower immediate. Also use word operations
rather than byte ones for both the increment and the compare (allowing
E820_BIOS_MAX to be more easily bumped, should the need ever arise).
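In C terms the pattern being fixed looks roughly like this (the real code is boot-time assembly; the function, names and the E820_BIOS_MAX value here are purely illustrative):

    #include <stdint.h>

    #define E820_BIOS_MAX 128                 /* illustrative bound */

    struct e820entry { uint64_t addr, size; uint32_t type; };

    /* Store one entry, bump the word-sized count, and only then compare
     * against the bound, so the last slot can never be overrun. */
    static int e820_store(struct e820entry *tbl, uint16_t *nr,
                          struct e820entry ent)
    {
        tbl[*nr] = ent;
        if ( ++*nr >= E820_BIOS_MAX )
            return 0;                         /* table full: stop querying */
        return 1;                             /* room for more entries */
    }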
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0036c9dbcd8b52316aeebb475929d3a36cf5e514
master date: 2018-01-03 11:03:56 +0100
Remove useless smp_wmb() barrier after cpumask_set_cpu(cpuid,
&cpu_online_map), which is not synchronizing against anything.
Keep the other smp_wmb(), before the cpumask_set_cpu call, to ensure
that all writes before setting the cpu online are visible to other cpus.
For that to work properly, we need a corresponding smp_rmb() barrier,
after reading the online cpumask from other processors, which is
currently missing. Add it.
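The intended pairing, sketched in C with Xen's cpumask and barrier primitives (a sketch of the pattern described above, not the exact code):

    /* Booting CPU: publish per-CPU state, then mark itself online. */
    void announce_cpu_online(unsigned int cpu)
    {
        /* ... initialise per-CPU data ... */
        smp_wmb();                              /* order the writes above ...  */
        cpumask_set_cpu(cpu, &cpu_online_map);  /* ... before the online bit   */
    }

    /* Any other CPU: observe the online bit, then read the published state. */
    bool cpu_state_visible(unsigned int cpu)
    {
        if ( !cpumask_test_cpu(cpu, &cpu_online_map) )
            return false;
        smp_rmb();                              /* pairs with the smp_wmb() above */
        /* ... data written before the online bit is now guaranteed visible ...  */
        return true;
    }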
arm: configure interrupts to be in non-secure group1
Xen uses non-secure group1 interrupts, however it doesn't configure the
GICv3 accordingly. Xen needs to set GICD_IGROUPR for SPIs and
GICR_IGROUPR0 for local interrupts to "1" to specify that interrupts
belong to group1. This is particularly important if the system has
GICD_CTLR.DS set, also see commit 7c9b973061b03af62734f613f6abec46c0dd4a88 in Linux.
Julien Grall [Wed, 6 Dec 2017 14:51:37 +0000 (14:51 +0000)]
xen/arm: gic-v3: Bail out if gicv3_cpu_init fails
When system registers are not enabled, all accesses to them will trap
in EL2. In Xen, system registers will be enabled by gicv3_cpu_init only
on success. As the rest of the code (e.g. gicv3_hyp_init) relies on
system registers, it is better to bail out directly.
This will save time when debugging early boot issues on GICv3 platforms.
Tom Lendacky [Wed, 20 Dec 2017 15:24:23 +0000 (16:24 +0100)]
x86/microcode: Add support for fam17h microcode loading
The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes. Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
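The gist of the change, sketched (constant names follow the Linux patch; the surrounding function here is illustrative):

    #define F1XH_MPB_MAX_SIZE 2048      /* default maximum patch block size */
    #define F17H_MPB_MAX_SIZE 3200      /* family 17h maximum patch block size */

    static unsigned int amd_mpb_max_size(unsigned int family)
    {
        switch ( family )
        {
        /* ... other families elided ... */
        case 0x17:
            return F17H_MPB_MAX_SIZE;
        default:
            return F1XH_MPB_MAX_SIZE;
        }
    }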
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Linux commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf]
Ported to Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d458ba8c171809e8dd9abd19339c87f3f934ca
master date: 2017-12-13 14:30:10 +0000
Jan Beulich [Wed, 20 Dec 2017 15:23:52 +0000 (16:23 +0100)]
gnttab: improve GNTTABOP_cache_flush locking
Dropping the lock before returning from grant_map_exists() means handing
possibly stale information back to the caller. Instead return the pointer
to the active entry, for the caller to release the lock once
done.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 553ac37137c2d1c03bf1b69cfb192ffbfe29daa4
master date: 2017-12-04 11:04:18 +0100
Jann validly points out that with a caller bogusly requesting a zero-
element batch with non-zero high command bits (the ones used for
continuation encoding), the assertion right before the call to
hypercall_create_continuation() would trigger. A similar situation would
arise afaict for non-empty batches with op and/or length zero in every
element.
While we want the former to succeed (as we do elsewhere for similar
no-op requests), the latter can clearly be converted to an error, as
this is a state that can't be the result of a prior operation.
Take the opportunity and also correct the order of argument checks:
We shouldn't accept zero-length elements with unknown bits set in "op".
Also constify cache_flush()'s first parameter.
Sergey Dyasli [Wed, 20 Dec 2017 15:22:58 +0000 (16:22 +0100)]
x86/vvmx: don't enable vmcs shadowing for nested guests
Running "./xtf_runner vvmx" in L1 Xen under L0 Xen produces the
following result on H/W with VMCS shadowing:
Test: vmxon
Failure in test_vmxon_in_root_cpl0()
Expected 0x8200000f: VMfailValid(15) VMXON_IN_ROOT
Got 0x82004400: VMfailValid(17408) <unknown>
Test result: FAILURE
This happens because the SDM allows VM entries with the VMCS shadowing
VM-execution control enabled and a VMCS link pointer value of ~0ull. But the
results of a nested VMREAD are undefined in such cases.
Fix this by not copying the value of VMCS shadowing control from vmcs01
to vmcs02.
Andrew Cooper [Wed, 20 Dec 2017 15:22:27 +0000 (16:22 +0100)]
xen/pv: Construct d0v0's GDT properly
c/s cf6d39f8199 "x86/PV: properly populate descriptor tables" changed the GDT
to reference zero_page for intermediate frames between the guest and Xen
frames.
Because dom0_construct_pv() doesn't call arch_set_info_guest(), some bits of
initialisation are missed, including the pv_destroy_gdt() which initially
fills the references to zero_page.
In practice, this means there is a window between starting and the first call
to HYPERCALL_set_gdt() where lar/lsl/verr/verw suffer non-architectural
behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 08f27f4468eedbeccaac9fdda4ef732247efd74e
master date: 2017-12-01 19:03:26 +0000
Paul Durrant [Wed, 20 Dec 2017 15:21:59 +0000 (16:21 +0100)]
x86/hvm: fix interaction between internal and external emulation
A call to handle_hvm_io_completion() is needed for completing I/O
that requires external emulation. Such completion should be requested when
hvm_vcpu_io_need_completion() returns true after hvm_emulate_once() has
completed. This is indicative of the underlying I/O emulation having
returned X86EMUL_RETRY and hence a re-emulation of the instruction is
needed to pick up the result of the I/O.
A call to handle_hvm_io_completion() is NOT needed when the underlying
I/O has not returned X86EMUL_RETRY since there will be no result to pick
up. Hence it is bogus to request such completion when mmio_retry is set,
since this can only happen if the underlying I/O emulation has returned
X86EMUL_OKAY (meaning the I/O has completed successfully).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/HVM: don't retain emulated insn cache when exiting back to guest
vio->mmio_retry is being set when a repeated string insn is being split
up. In that case we'll exit to the guest, expecting immediate re-entry.
Interruptions, however, may be serviced by the guest before re-entry
from the repeated string insn. Any emulation needed in the course of
handling the interruption must not fetch from the internally maintained
cache.
As a follow-up to XSA-212 we should have addressed a similar issue here:
The handles being advanced at the top of xenmem_add_to_physmap_batch()
means we allow hypervisor space accesses (in particular, for "errs",
writes) with suitably crafted input arguments. This isn't a security
issue in this case because of the limited width of struct
xen_add_to_physmap_batch's size field: It being 16-bits wide, only the
r/o M2P area can be accessed. Still we can and should do better.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7f080956e9eed821fd42013bef11c1a2873fbeba
master date: 2017-11-28 13:15:12 +0100
Jan Beulich [Wed, 20 Dec 2017 15:21:03 +0000 (16:21 +0100)]
x86: check paging mode earlier in xenmem_add_to_physmap_one()
There's no point in deferring this until after some initial processing,
and it's actively wrong for the XENMAPSPACE_gmfn_foreign handling to not
have such a check at all.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f38f3dccf1e1a8aabcf57364326fc8f44cddc41a
master date: 2017-11-28 13:14:43 +0100
Jan Beulich [Wed, 20 Dec 2017 15:20:31 +0000 (16:20 +0100)]
x86: replace bad ASSERT() in xenmem_add_to_physmap_one()
There are no locks being held, i.e. it is possible to be triggered by
racy hypercall invocations. Subsequent code doesn't really depend on the
checked values, so this is not a security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: f33d653f46f5889db7be4fef31d71bc871834c10
master date: 2017-11-28 13:14:10 +0100
Jan Beulich [Wed, 20 Dec 2017 15:19:12 +0000 (16:19 +0100)]
sync CPU state upon final domain destruction
See the code comment being added for why we need this.
This is being placed here to balance between the desire to prevent
future similar issues (the risk of which would grow if it was put
further down the call stack, e.g. in vmx_vcpu_destroy()) and the
intention to limit the performance impact (otherwise it could also go
into rcu_do_batch(), paralleling the use in do_tasklet_work()).
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 24246e1fb7496b830aca8a6a1fd3064ca1e3ebf9
master date: 2017-11-23 11:38:22 +0100
Andrew Cooper [Wed, 20 Dec 2017 15:18:40 +0000 (16:18 +0100)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.
_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.
To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
A typical failure on the destination side looks like:
(XEN) HVM4 restore: CPU_MSR 0
(XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
(XEN) HVM4 restore: failed to load entry 20/0
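A self-contained illustration of the bug class and the fix, with hypothetical names (the real record layout is Xen's hvm_save_descriptor together with HVM_CPU_MSR_SIZE()):

    #include <stdint.h>
    #include <string.h>

    struct save_descriptor { uint16_t typecode, instance; uint32_t length; };
    struct msr_entry { uint32_t index, flags; uint64_t val; };

    /* The bug was leaving 'length' sized for the maximum number of MSRs while
     * only advancing the cursor by the space actually used, so the next
     * descriptor landed inside this record's unused tail.  The fix: report
     * only the bytes actually written.  (typecode/instance setup elided.) */
    static size_t save_msrs(uint8_t *buf, const struct msr_entry *msrs,
                            unsigned int count)
    {
        struct save_descriptor *desc = (void *)buf;

        memcpy(buf + sizeof(*desc), msrs, count * sizeof(*msrs));
        desc->length = count * sizeof(*msrs);          /* actual bytes used */

        return sizeof(*desc) + count * sizeof(*msrs);  /* cursor advance */
    }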
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d2f86bf604698806d311cc251c1b66fbb752673c
master date: 2017-11-21 11:19:02 +0000
The altp2m_vcpu_enable_notify subop handler might skip calling
rcu_unlock_domain() after rcu_lock_current_domain(). Albeit since both
rcu functions are no-ops when run on the current domain, this doesn't
really have repercussions.
The second change is adding a missing break that would have potentially
enabled #VE for the current domain even if it had intended to enable it
for another one (not a supported functionality).
Signed-off-by: Adrian Pop <apop@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: eb0660c6950e08e44fdfeca3e29320382e2a1554
master date: 2017-11-16 17:25:59 +0000
Andrew Cooper [Wed, 20 Dec 2017 15:17:26 +0000 (16:17 +0100)]
common/gnttab: Correct error handling for gnttab_setup_table()
Simplify the error labels to just "unlock" and "out". This fixes an erroneous
path where a failure of rcu_lock_domain_by_any_id() still results in
rcu_unlock_domain() being called.
This is only not an XSA by luck. rcu_unlock_domain() is a nop other than
decrementing the preempt count, and nothing reads the preempt count outside of
a debug build.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e436e7a45082ea2cadc176c19e1df46c178448f
master date: 2017-08-15 15:08:57 +0100
Jan Beulich [Tue, 12 Dec 2017 14:04:28 +0000 (15:04 +0100)]
x86/shadow: fix ref-counting error handling
The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.
In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.
This is XSA-250.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 10be8001de7d87be1f0ccdda75cc70e922e56d03
master date: 2017-12-12 14:29:45 +0100
Jan Beulich [Tue, 12 Dec 2017 14:04:00 +0000 (15:04 +0100)]
x86/shadow: fix refcount overflow check
Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.
Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
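The shape of the fix, sketched with an illustrative constant name:

    #include <stdbool.h>
    #include <stdint.h>

    /* Single manifest constant describing the shadow refcount field width,
     * so the field and its overflow check can no longer drift apart. */
    #define PAGE_SH_REFCOUNT_WIDTH 25

    static bool sh_refcount_inc(uint32_t *count)
    {
        uint32_t nx = *count + 1;

        if ( nx >= (1u << PAGE_SH_REFCOUNT_WIDTH) )
            return false;                /* would overflow the 25-bit field */

        *count = nx;
        return true;
    }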
This is XSA-249.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 54e2292e8df7a1a7b041192be9d6d797b6d00869
master date: 2017-12-12 14:29:13 +0100
Jan Beulich [Tue, 12 Dec 2017 14:03:34 +0000 (15:03 +0100)]
x86/mm: don't wrongly set page ownership
PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally. At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.
While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.
There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.
This is XSA-248.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ff2a793e15bb0b6254bc849ef8e83e1c284c3583
master date: 2017-12-12 14:28:36 +0100
Jan Beulich [Tue, 12 Dec 2017 14:03:00 +0000 (15:03 +0100)]
x86: don't wrongly trigger linear page table assertion (2)
_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the type which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap.com>
master commit: e40b0219a8c77741ae48989efb520f4a762a5be3
master date: 2017-12-12 14:27:34 +0100
George Dunlap [Tue, 28 Nov 2017 12:45:13 +0000 (13:45 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation
If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.
Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail. In that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.
Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.
Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.
Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order. Nonetheless,
check the return value and crash if our assumption turns out to be
wrong.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a3d64de8e86f5812917d2d0af28298f80debdf9a
master date: 2017-11-28 13:13:26 +0100
George Dunlap [Tue, 28 Nov 2017 12:44:42 +0000 (13:44 +0100)]
p2m: Always check to see if removing a p2m entry actually worked
The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.
Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.
The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m. If the guest
afterwards balloons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.
There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before. Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.
While we're here, use PAGE_ORDER_2M rather than a magic constant.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 92790672dedf2eab042e04ecc277c19d40fd348a
master date: 2017-11-28 13:13:03 +0100
Julien Grall [Tue, 28 Nov 2017 12:44:09 +0000 (13:44 +0100)]
x86/pod: prevent infinite loop when shattering large pages
When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).
p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated,
resulting in the callers retrying the PoD infinitely.
Prevent the infinite loop by returning false when it is not possible to
shatter the large mapping.
This is XSA-246.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a1c6c6768971ea387d7eba0803908ef0928b43ac
master date: 2017-11-28 13:11:55 +0100
Andrew Cooper [Thu, 16 Nov 2017 11:03:26 +0000 (12:03 +0100)]
x86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()
The fix for XSA-243 / CVE-2017-15592 (c/s bf2b4eadcf379) introduced a change
in behaviour for sh_guess_wrmap(), where it had to cope with no shadow linear
mapping being present.
As the name suggests, guest_vtable is a mapping of the guest's pagetable, not
Xen's pagetable, meaning that it isn't the pagetable we need to check for the
shadow linear slot in.
The practical upshot is that a shadow HVM vcpu which switches into 4-level
paging mode, with an L4 pagetable that contains a mapping which aliases Xen's
SH_LINEAR_PT_VIRT_START will fool the safety check for whether a SHADOW_LINEAR
mapping is present. As the check passes (when it should have failed), Xen
subsequently falls over the missing mapping with a pagefault such as:
Jan Beulich [Thu, 16 Nov 2017 11:02:55 +0000 (12:02 +0100)]
x86: don't wrongly trigger linear page table assertion
_put_page_type() may do multiple iterations until its cmpxchg()
succeeds. It invokes set_tlbflush_timestamp() on the first
iteration, however. Code inside the function takes care of this, but
- the assertion in _put_final_page_type() would trigger on the second
iteration if time stamps in a debug build are permitted to be
sufficiently much wider than the default 6 bits (see WRAP_MASK in
flushtlb.c),
- it returning -EINTR (for a continuation to be scheduled) would leave
the page in an inconsistent state (until the re-invocation completes).
Make the set_tlbflush_timestamp() invocation conditional, bypassing it
(for now) only in the case we really can't tolerate the stamp to be
stored.
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 2c458dfcb59f3d9d8a35fc5ffbf780b6ed7a26a6
master date: 2017-11-16 10:37:29 +0100
Yu Zhang [Thu, 16 Nov 2017 11:02:24 +0000 (12:02 +0100)]
x86/mm: fix race condition in modify_xen_mappings()
In modify_xen_mappings(), an L1/L2 page table shall be freed
if all entries of this page table are empty. The corresponding
L2/L3 PTE will need to be cleared in such a scenario.
However, concurrent paging structure modifications on different
CPUs may cause the L2/L3 PTEs to already be cleared or set
to reference a superpage.
Therefore the logic to enumerate the L1/L2 page table and to
reset the corresponding L2/L3 PTE needs to be protected with a
spinlock, and the _PAGE_PRESENT and _PAGE_PSE flags need to be
checked after the lock is obtained.
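The resulting pattern looks roughly like this (a sketch using Xen's pagetable accessors; the lock name and the surrounding helper are assumptions):

    static void maybe_free_l1_table(l2_pgentry_t *pl2e)
    {
        spin_lock(&map_pgdir_lock);
        /* Re-check under the lock: still a present L1 table, not a superpage. */
        if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) &&
             !(l2e_get_flags(*pl2e) & _PAGE_PSE) )
        {
            l1_pgentry_t *pl1e = l2e_to_l1e(*pl2e);

            /* ... verify all L1 entries are empty, then ... */
            l2e_write(pl2e, l2e_empty());
            free_xen_pagetable(pl1e);
        }
        spin_unlock(&map_pgdir_lock);
    }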
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b9ee1fd7b98064cf27d0f8f1adf1f5359b72c97f
master date: 2017-11-14 17:11:26 +0100
Min He [Thu, 16 Nov 2017 11:01:57 +0000 (12:01 +0100)]
x86/mm: fix race conditions in map_pages_to_xen()
In map_pages_to_xen(), a L2 page table entry may be reset to point to
a superpage, and its corresponding L1 page table need be freed in such
scenario, when these L1 page table entries are mapping to consecutive
page frames and having the same mapping flags.
However, the variable `pl1e` is not protected by the lock before the L1 page
table is enumerated. A race condition may happen if this code path is invoked
simultaneously on different CPUs.
For example, `pl1e` value on CPU0 may hold an obsolete value, pointing
to a page which has just been freed on CPU1. Besides, before this page
is reused, it will still be holding the old PTEs, referencing consecutive
page frames. Consequently `free_xen_pagetable(l2e_to_l1e(ol2e))` will
be triggered on CPU0, resulting in the unexpected freeing of a normal page.
This patch fixes the above problem by protecting the `pl1e` with the lock.
Also, there're other potential race conditions. For instance, the L2/L3
entry may be modified concurrently on different CPUs, by routines such as
map_pages_to_xen(), modify_xen_mappings() etc. To fix this, this patch will
check the _PAGE_PRESENT and _PAGE_PSE flags, after the spinlock is obtained,
for the corresponding L2/L3 entry.
Signed-off-by: Min He <min.he@intel.com> Signed-off-by: Yi Zhang <yi.z.zhang@intel.com> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a5114662297ad03efc36b52ad365ffa05fb357b7
master date: 2017-11-14 17:10:56 +0100
Eric Chanudet [Thu, 16 Nov 2017 11:01:28 +0000 (12:01 +0100)]
x86/hvm: do not register hpet mmio during s3 cycle
Do it once at domain creation (hpet_init).
Sleep -> Resume cycles will end up crashing an HVM guest with hpet as
the sequence during resume takes the path:
-> hvm_s3_suspend
-> hpet_reset
-> hpet_deinit
-> hpet_init
-> register_mmio_handler
-> hvm_next_io_handler
register_mmio_handler will use a new io handler each time, until
eventually it reaches NR_IO_HANDLERS, then hvm_next_io_handler calls
domain_crash.
Signed-off-by: Eric Chanudet <chanudete@ainfosec.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 015d6738ddff4074668c1d4887bbffd507ed1a7f
master date: 2017-11-14 17:09:50 +0100
George Dunlap [Thu, 16 Nov 2017 11:00:28 +0000 (12:00 +0100)]
x86/mm: Make PV linear pagetables optional
Allowing pagetables to point to other pagetables of the same level
(often called 'linear pagetables') has been included in Xen since its
inception; but recently it has been the source of a number of subtle
reference-counting bugs.
It is not used by Linux or MiniOS; but it is used by NetBSD and Novell
Netware. There are significant numbers of people who are never going
to use the feature, along with significant numbers who need the
feature.
Add a Kconfig option for the feature (default to 'y'). Also add a
command-line option to control whether PV linear pagetables are
allowed (default to 'true').
NB that we leave linear_pt_count in the page struct. It's in a union,
so its presence doesn't increase the size of the data struct.
Changing the layout of the other elements based on configuration
options is asking for trouble however; so we'll just leave it there
and ASSERT that it's zero.
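The command-line side can be pictured like this (boolean_param() is Xen's standard mechanism for boolean options; the option and variable names follow the description above and are not guaranteed to match the patch exactly):

    /* Build-time default 'y' (Kconfig); runtime control via "pv-linear-pt=<bool>". */
    static bool __read_mostly opt_pv_linear_pt = true;
    boolean_param("pv-linear-pt", opt_pv_linear_pt);

When the option is off, attempts to make a pagetable entry reference another pagetable of the same level would then be rejected during validation.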
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3285e75dea89afb0ef5b3ee39bd15194bd7cc110
master date: 2017-10-27 14:36:45 +0100
Jan Beulich [Thu, 16 Nov 2017 10:59:21 +0000 (11:59 +0100)]
x86: don't latch wrong (stale) GS base addresses
load_segments() writes selector registers before doing any of the base
address updates. Any of these selector loads can cause a page fault in
case it references the LDT, and the LDT page accessed was only recently
installed. Therefore the call tree map_ldt_shadow_page() ->
guest_get_eff_kern_l1e() -> toggle_guest_mode() would in such a case
wrongly latch the outgoing vCPU's GS.base into the incoming vCPU's
recorded state.
Split page table toggling from GS handling - neither
guest_get_eff_kern_l1e() nor guest_io_okay() need more than the page
tables being the kernel ones for the memory access they want to do.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a711f6f24a7157ae70d1cc32e61b98f23dc0c584
master date: 2017-10-27 13:49:10 +0100
Jan Beulich [Thu, 16 Nov 2017 10:58:47 +0000 (11:58 +0100)]
x86: also show FS/GS base addresses when dumping registers
Their state may be important to figure the reason for a crash. To not
further grow duplicate code, break out a helper function.
I realize that (ab)using the control register array here may not be
considered the nicest solution, but it seems easier (and less overall
overhead) to do so compared to the alternative of introducing another
helper structure.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by Andrew Cooper <andrew.cooper3@citrix.com>
master commit: be7f60b5a39741eab0a8fea0324f7be0cb724cfb
master date: 2017-10-24 18:13:13 +0200
Jan Beulich [Thu, 16 Nov 2017 10:58:10 +0000 (11:58 +0100)]
x86: fix GS-base-dirty determination
load_segments() writes the two MSRs in their "canonical" positions
(GS_BASE for the user base, SHADOW_GS_BASE for the kernel one) and uses
SWAPGS to switch them around if the incoming vCPU is in kernel mode. In
order to not leave a stale kernel address in GS_BASE when the incoming
guest is in user mode, the check on the outgoing vCPU needs to be
dependent upon the mode it is currently in, rather than blindly looking
at the user base.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 91f85280b9b80852352fcad73d94ed29fafb88da
master date: 2017-10-24 18:12:31 +0200
Jan Beulich [Tue, 24 Oct 2017 14:50:14 +0000 (16:50 +0200)]
x86emul: handle address wrapping
This is just the emulator part of commit 7869e2bafe
("x86emul/fuzz: add rudimentary limit checking"):
Several adjustments to the emulator's address calculations are needed:
While the DstBitBase one is really mandatory, the specification allows
for either original or new behavior for two-part accesses. Observed
behavior on real hardware, however, is for such accesses to silently
wrap at the 2^^32 boundary in other than 64-bit mode, just like they do
at the 2^^64 boundary in 64-bit mode, which our code is now being
brought in line with. While adding truncate_ea() invocations there,
also convert open coded instances of it.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 24 Oct 2017 14:48:50 +0000 (16:48 +0200)]
x86: avoid #GP for PV guest MSR accesses
Halfway recent Linux kernels probe MISC_FEATURES_ENABLES on all CPUs,
leading to ugly recovered #GP fault messages with debug builds on older
systems. We can do better, so introduce synthetic feature flags for
both this and PLATFORM_INFO to avoid the rdmsr_safe() altogether.
Note that the r/o nature of PLATFORM_INFO is now also being enforced.
The rdmsr_safe() uses for MISC_ENABLE are left in place as benign - it
exists for all 64-bit capable Intel CPUs (see e.g. early_init_intel()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 24 Oct 2017 14:43:43 +0000 (16:43 +0200)]
x86/vvmx: Fix WRMSR interception of VMX MSRs
FEATURE_CONTROL is already read with LOCK bit set (so is unmodifiable), and
all VMX MSRs are read-only. Also, fix the MSR_IA32_VMX_TRUE_ENTRY_CTLS bound
to be MSR_IA32_VMX_VMFUNC, rather than having the intervening MSRs falling
into the default case.
Raise #GP faults if the guest tries to modify any of them.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 46c3acb308bf0cd044b114e637aacaf18b957618
master date: 2017-06-30 11:27:50 +0100
While I can't seem to find any users of this hypercall (being a likely
explanation of why the problem wasn't noticed so far), just like for
do_mmu_update() paged-out and shared page handling is needed here. Move
all this logic into mod_l1_entry(), which then also results in no
longer
- doing any of this handling for non-present PTEs,
- acquiring two temporary page references when one is already more than
enough.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 46aaf41ee099a048d7a554c03ae01bcdaa07f776
master date: 2017-10-13 12:43:41 +0200
Andrew Cooper [Tue, 24 Oct 2017 14:42:13 +0000 (16:42 +0200)]
xen/domctl: Fix Xen heap leak via XEN_DOMCTL_getvcpucontext
The backing structure for XEN_DOMCTL_getvcpucontext is only zeroed in the x86
HVM case. At the very least, this means that ARM returns junk through its
flags field (as it is only ever conditionally or'd into), and x86 PV leaks
data through gdt_frames[14...15]. (An exhaustive search for other leaks
hasn't been performed).
Unconditionally zero the memory upon allocation, and forgo the double clear
for x86 HVM. These hypercalls are not on hotpaths.
Note that this does not qualify for an XSA. Per XSA-77,
XEN_DOMCTL_getvcpucontext is unsafe for disaggregation, meaning that only the
control domain can use this hypercall.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3b2eeb7412e529f38d1e8b872ba0bc6ab09a7008
master date: 2017-10-09 12:43:21 +0100
Segment bases (and limits) aren't being cleared by the loading of a nul
selector into a segment register on AMD CPUs. Therefore, if an
outgoing vCPU has a non-zero base in FS or GS and the subsequent
incoming vCPU has a non-zero but nul selector in the respective
register(s), the selector value(s) would be loaded without clearing the
segment base(s) in the hidden register portion.
Since the ABI states "zero" in its description of the fs and gs fields,
it is worth noting that the chosen approach to fix this alters the
written down ABI. I consider this preferable over enforcing the
previously written down behavior, as nul selectors are far more likely
to be what was meant from the beginning.
The adjustments also eliminate an inconsistency between FS and GS
handling: Old code had an extra pointless (gs_base_user was always zero
when DIRTY_GS was set) conditional for GS. The old bitkeeper changeset
has no explanation for this asymmetry.
Additionally for DS and ES a flat selector is being loaded prior to the
loading of a nul one on AMD CPUs, just as a precautionary measure
(we're not currently aware of ways for a guest to deduce the base of a
segment register which has a nul selector loaded).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 4e383df8650d72e47e2ca4ebfc4f6986f791d2f2
master date: 2017-10-04 14:17:08 +0200
Andrew Cooper [Tue, 24 Oct 2017 14:40:23 +0000 (16:40 +0200)]
x86/svm: Fix a livelock when trying to run shadowed unpaged guests
On AMD processors which support SMEP (some Fam16h processors) and SMAP (Zen,
Fam17h), a guest which is running with shadow paging and clears CR0.PG while
keeping CR4.{SMEP,SMAP} set will livelock, as hardware raises #PF which the
shadow pagetable concludes shouldn't happen.
This occurs because hardware is running with host paging settings, which
causes the guest's choice of SMEP/SMAP to actually take effect, even though
they shouldn't from the guest's point of view.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 3164f2f9db1e63ea64c3f9520d40cb09920d2b35
master date: 2017-10-02 13:57:34 +0100
Jan Beulich [Tue, 24 Oct 2017 14:39:33 +0000 (16:39 +0200)]
gnttab: fix pin count / page reference race
Dropping page references before decrementing pin counts is a bad idea
if assumptions are being made that a non-zero pin count implies a valid
page. Fix the order of operations in gnttab_copy_release_buf(), but at
the same time also remove the assertion that was found to trigger:
map_grant_ref() also has the potential of causing a race here, and
changing the order of operations there would likely be quite a bit more
involved.
This is CVE-2017-15597 / XSA-236.
Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e008f7619dcd6d549727c9635b3f9f3c7ee483ed
master date: 2017-10-24 16:01:33 +0200
The variable domctl.u.address_size.size may remain uninitialized if
guest_type is not one of xen-3.0-aarch64 or xen-3.0-armv7l. And the
code precisely checks if this variable is still 0 to decide if the
guest type is supported or not.
This fixes the following build failure with gcc 7.x:
xc_dom_arm.c:229:31: error: 'domctl.u.address_size.size' may be used uninitialized in this function [-Werror=maybe-uninitialized]
if ( domctl.u.address_size.size == 0 )
Patch originally taken from
https://www.mail-archive.com/xen-devel@lists.xen.org/msg109313.html.
Signed-off-by: Bernd Kuhls <bernd.kuhls@t-online.de> Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
(cherry picked from commit 88bfbf90e35f1213f9967a97dee0b2039f9998a4)
Andrew Cooper [Thu, 12 Oct 2017 13:30:21 +0000 (15:30 +0200)]
x86/cpu: Fix IST handling during PCPU bringup
Clear IST references in newly allocated IDTs. Nothing good will come of
having them set before the TSS is suitably constructed (although the chances
of the CPU surviving such an IST interrupt/exception are extremely slim).
Uniformly set the IST references after the TSS is in place. This fixes an
issue on AMD hardware, where onlining a PCPU while PCPU0 is in HVM context
will cause IST_NONE to be copied into the new IDT, making that PCPU vulnerable
to privilege escalation from PV guests until it subsequently schedules an HVM
guest.
This is XSA-244.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: cc08c73c8c1f5ba5ed0f8274548db6725e1c3157
master date: 2017-10-12 14:50:31 +0200
Andrew Cooper [Thu, 12 Oct 2017 13:29:52 +0000 (15:29 +0200)]
x86/shadow: Don't create self-linear shadow mappings for 4-level translated guests
When initially creating a monitor table for 4-level translated guests, don't
install a shadow-linear mapping. This mapping is actually self-linear, and
trips up the writeable heuristic logic into following Xen's mappings, not the
guests' shadows it was expecting to follow.
A consequence of this is that sh_guess_wrmap() needs to cope with there being
no shadow-linear mapping present, which in practice occurs once each time a
vcpu switches to 4-level paging from a different paging mode.
An appropriate shadow-linear slot will be inserted into the monitor table
either while constructing lower level monitor tables, or by sh_update_cr3().
While fixing this, clarify the safety of the other mappings. Despite
appearing unsafe, it is correct to create a guest-linear mapping for
translated domains; this is self-linear and doesn't point into the translated
domain. Drop a dead clause for translate != external guests.
This is XSA-243.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: bf2b4eadcf379d0361b38de9725ea5f7a18a5205
master date: 2017-10-12 14:50:07 +0200
Jan Beulich [Thu, 12 Oct 2017 13:29:21 +0000 (15:29 +0200)]
x86: don't allow page_unlock() to drop the last type reference
Only _put_page_type() does the necessary cleanup, and hence not all
domain pages can be released during guest cleanup (leaving around
zombie domains) if we get this wrong.
Jan Beulich [Thu, 12 Oct 2017 13:28:44 +0000 (15:28 +0200)]
x86: don't store possibly stale TLB flush time stamp
While the timing window is extremely narrow, it is theoretically
possible for an update to the TLB flush clock and a subsequent flush
IPI to happen between the read and write parts of the update of the
per-page stamp. Exclude this possibility by disabling interrupts
across the update, preventing the IPI to be serviced in the middle.
This is XSA-241.
Reported-by: Jann Horn <jannh@google.com> Suggested-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 23a183607a427572185fc51c76cc5ab11c00c4cc
master date: 2017-10-12 14:48:25 +0200
Jan Beulich [Thu, 12 Oct 2017 13:27:37 +0000 (15:27 +0200)]
x86: limit linear page table use to a single level
That's the only way that they're meant to be used. Without such a
restriction arbitrarily long chains of same-level page tables can be
built, tearing down of which may then cause arbitrarily deep recursion,
causing a stack overflow. To facilitate this restriction, a counter is
being introduced to track both the number of same-level entries in a
page table as well as the number of uses of a page table in another
same-level one (counting into positive and negative direction
respectively, utilizing the fact that both counts can't be non-zero at
the same time).
Note that the added accounting introduces a restriction on the number
of times a page can be used in other same-level page tables - more than
32k of such uses are no longer possible.
Note also that some put_page_and_type[_preemptible]() calls are
replaced with open-coded equivalents. This seemed preferable to
adding "parent_table" to the matrix of functions.
Note further that cross-domain same-level page table references are no
longer permitted (they probably never should have been).
This is XSA-240.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com>
master commit: 6987fc7558bdbab8119eabf026e3cdad1053f0e5
master date: 2017-10-12 14:44:34 +0200
Jan Beulich [Thu, 12 Oct 2017 13:27:07 +0000 (15:27 +0200)]
x86/HVM: prefill partially used variable on emulation paths
Certain handlers ignore the access size (vioapic_write() being the
example this was found with), perhaps leading to subsequent reads
seeing data that wasn't actually written by the guest. For
consistency and extra safety also do this on the read path of
hvm_process_io_intercept(), even if this doesn't directly affect what
guests get to see, as we've supposedly already dealt with read handlers
leaving data completely uninitialized.
This is XSA-239.
Reported-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 0d4732ac29b63063764c29fa3bd8946daf67d6f3
master date: 2017-10-12 14:43:26 +0200
A misbehaving device model can pass incorrect XEN_DMOP_map/
unmap_io_range_to_ioreq_server arguments, namely end < start when
specifying address range. When this happens we hit ASSERT(s <= e) in
rangeset_contains_range()/rangeset_overlaps_range() with debug builds.
Production builds will not trap right away but may misbehave later
while handling such bogus ranges.
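The minimal guard implied above, sketched with illustrative names:

    #include <errno.h>
    #include <stdint.h>

    /* Reject inverted ranges before they ever reach the rangeset code, whose
     * debug builds ASSERT(s <= e). */
    static int check_ioreq_server_range(uint64_t start, uint64_t end)
    {
        if ( start > end )
            return -EINVAL;
        return 0;
    }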
Jan Beulich [Thu, 12 Oct 2017 13:25:57 +0000 (15:25 +0200)]
x86/FLASK: fix unmap-domain-IRQ XSM hook
The caller and the FLASK implementation of xsm_unmap_domain_irq()
disagreed about what the "data" argument points to in the MSI case:
Change both sides to pass/take a PCI device.
This is part of XSA-237.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6f17f5c43a3bd28d27ed8133b2bf513e2eab7d59
master date: 2017-10-12 14:37:56 +0200
Jan Beulich [Thu, 12 Oct 2017 13:24:35 +0000 (15:24 +0200)]
x86/MSI: disallow redundant enabling
At the moment, Xen attempts to allow redundant enabling of MSI by
having pci_enable_msi() return 0, and point to the existing MSI
descriptor, when the msi already exists.
Unfortunately, if subsequent errors are encountered, the cleanup
paths assume pci_enable_msi() had done full initialization, and
hence undo everything that was assumed to be done by that
function without also undoing other setup that would normally
occur only after that function was called (in map_domain_pirq()
itself).
Rather than try to make the redundant enabling case work properly, just
forbid it entirely by having pci_enable_msi() return -EEXIST when MSI
is already set up.
This is part of XSA-237.
Reported-by: HW42 <hw42@ipsumj.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a46126fec20e0cf4f5442352ef45efaea8c89646
master date: 2017-10-12 14:36:58 +0200
Jan Beulich [Thu, 12 Oct 2017 13:24:03 +0000 (15:24 +0200)]
x86: enforce proper privilege when (un)mapping pIRQ-s
(Un)mapping of IRQs, just like other RESOURCE__ADD* / RESOURCE__REMOVE*
actions (in FLASK terms) should be XSM_DM_PRIV rather than XSM_TARGET.
This in turn requires bypassing the XSM check in physdev_unmap_pirq()
for the HVM emuirq case just like is being done in physdev_map_pirq().
The primary goal security wise, however, is to no longer allow HVM
guests, by specifying their own domain ID instead of DOMID_SELF, to
enter code paths intended for PV guest and the control domains of HVM
guests only.
This is part of XSA-237.
Reported-by: HW42 <hw42@ipsumj.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: db72faf69c94513e180568006a9d899ed422ff90
master date: 2017-10-12 14:36:30 +0200
Jan Beulich [Thu, 12 Oct 2017 13:23:20 +0000 (15:23 +0200)]
x86: don't allow MSI pIRQ mapping on unowned device
MSI setup should be permitted only for existing devices owned by the
respective guest (the operation may still be carried out by the domain
controlling that guest).
This is part of XSA-237.
Reported-by: HW42 <hw42@ipsumj.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3308374b1be7d43e23bd2e9eaf23ec06d7959882
master date: 2017-10-12 14:35:14 +0200
Julien Grall [Fri, 6 Oct 2017 13:16:50 +0000 (15:16 +0200)]
xen/arm: Correctly report the memory region in the dummy NUMA helpers
NUMA is currently not supported on Arm. Because common code is
NUMA-aware, dummy helpers are instead provided to expose a single node.
Those helpers are for instance used to know the region to scrub.
However the memory region is not reported correctly. Indeed, the
frametable may not be at the beginning of the memory and there might be
multiple memory banks. This will lead to some parts of the memory not being
scrubbed.
The memory information can be found using:
* first_valid_mfn as the start of the memory
* max_page - first_valid_mfn as the spanned pages
Note that first_valid_mfn is now being exported. The prototype has been
added in asm-arm/numa.h and not in a common header because I would
expect the variable to become static once NUMA is fully supported on
Arm.
Julien Grall [Fri, 6 Oct 2017 13:16:12 +0000 (15:16 +0200)]
xen/page_alloc: Cover memory unreserved after boot in first_valid_mfn
On Arm, some regions (e.g Initramfs, Dom0 Kernel...) are marked as
reserved until the hardware domain is built and they are copied into its
memory. Therefore, they will not be added in the boot allocator via
init_boot_pages.
Instead, init_xenheap_pages will be called once the regions are not used
anymore.
Update first_valid_mfn in both init_heap_pages and init_boot_pages
(where the update already exists) to cover all the cases.
Chao Gao [Fri, 6 Oct 2017 13:14:47 +0000 (15:14 +0200)]
VT-d: use correct BDF for VF to search VT-d unit
When SR-IOV is enabled, 'Virtual Functions' of a 'Physical Function'
are under the scope of the same VT-d unit as the 'Physical Function'.
A 'Physical Function' can be a 'Traditional Function' or an ARI
'Extended Function'. And furthermore, 'Extended Functions' on an
endpoint are under the scope of the same VT-d unit as the 'Traditional
Functions' on the endpoint. To search the VT-d unit for a VF: if its PF
isn't an extended function, the BDF of the PF should be used; otherwise
the BDF of a traditional function in the same device as the PF
should be used.
Current code uses PCI_SLOT() to recognize an ARI 'Extended Function'.
But this is conceptually wrong without checking whether the PF is an extended
function, and would lead to matching VFs of an RC-integrated PF to a wrong
VT-d unit.
This patch overrides VF 'is_extfn' field and uses this field to
indicate whether the PF of this VF is an extended function. The field
helps to use correct BDF to search VT-d unit.
Reported-by: Crawford, Eric R <Eric.R.Crawford@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com> Tested-by: Crawford, Eric R <Eric.R.Crawford@intel.com>
master commit: c286af54c7177c14180121b422d8df7281e547cb
master date: 2017-09-01 11:02:23 +0200
Xiong Zhang [Fri, 6 Oct 2017 13:14:20 +0000 (15:14 +0200)]
hvmloader: use base instead of pci_mem_start for find_next_rmrr()
find_next_rmrr(base) is used to find the lowest RMRR ending above base
but below 4G. The current method doesn't cover the following situation:
a. two RMRRs exist, with a small gap between them
b. pci_mem_start and mem_resource.base are below the first rmrr.base
c. find_next_rmrr(pci_mem_start) will find the first RMRR
d. after aligning mem_resource.base to the BAR size,
first_rmrr.end < new_base < second_rmrr.base and
new_base + bar_sz > second_rmrr.base.
So the new BAR overlaps with the second RMRR but not with the first
one. Yet next_rmrr still points to the first RMRR, so check_overlap()
cannot find the overlap and a wrong address is finally assigned to the
BAR.
This patch uses the aligned new base to find the next RMRR, which fixes
the above case and finds all RMRRs overlapping with the new base.
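A hedged sketch of the call-site change being described; the helper and
variable names follow the commit message, but the fragment is
illustrative rather than the actual hvmloader code:

    #include <stdint.h>

    extern unsigned int find_next_rmrr(uint64_t base);  /* lowest RMRR ending above base */

    static uint64_t place_bar(uint64_t base, uint64_t bar_sz)
    {
        unsigned int next_rmrr;

        base = (base + bar_sz - 1) & ~(bar_sz - 1);  /* align to the BAR size */
        next_rmrr = find_next_rmrr(base);            /* search from the aligned base,
                                                        not from pci_mem_start */
        (void)next_rmrr;                             /* overlap handling elided */
        return base;
    }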
Jan Beulich [Tue, 12 Sep 2017 13:11:07 +0000 (15:11 +0200)]
gnttab: also validate PTE permissions upon destroy/replace
In order for PTE handling to match up with the reference counting done
by common code, presence and writability of grant mapping PTEs must
also be taken into account; validating just the frame number is not
enough. This is particularly relevant if a guest fiddles with grant
PTEs via non-grant hypercalls.
Note that the flags being passed to replace_grant_host_mapping()
already happen to be those of the existing mapping, so no new function
parameter is needed.
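A hedged sketch of the kind of validation meant here, using x86 PTE
accessor names for illustration only; this fragment is not the actual
replace_grant_host_mapping() code:

    /* Illustrative: the old PTE must be present, writable if the grant
     * mapping was writable, and point at the granted frame. */
    unsigned int pte_flags = l1e_get_flags(ol1e);

    if ( !(pte_flags & _PAGE_PRESENT) ||
         ((grant_pte_flags & _PAGE_RW) && !(pte_flags & _PAGE_RW)) ||
         l1e_get_pfn(ol1e) != frame )
        return GNTST_general_error;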
This is CVE-2017-14319 / XSA-234.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 16b1414de91b5a82a0996c67f6db3af7d7e32873
master date: 2017-09-12 14:45:13 +0200
tools/xenstore: don't unlink connection object twice
A connection object of a domain with associated stubdom has two
parents: the domain and the stubdom. When cleaning up the list of
active domains in domain_cleanup(), make sure not to unlink the
connection twice from the same domain. This could happen when the
domain and its stubdom are being destroyed at the same time, leading
to the domain loop being entered twice.
Additionally, don't use talloc_free() in this case, as it will remove
a random parent link, eventually leading to a memory leak. Use
talloc_unlink() instead, specifying the context from which the
connection object should be removed.
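A minimal sketch of the distinction, using talloc's standard API
(talloc_free() drops an arbitrary parent reference, talloc_unlink()
drops a named one); the structure and field names are hypothetical:

    #include <talloc.h>

    /* Hypothetical per-domain state; assumed to be talloc-allocated so it
     * is a valid parent context for the connection. */
    struct domain_state {
        void *conn;                  /* connection object, talloc-parented here */
    };

    static void detach_conn(struct domain_state *d)
    {
        if ( d->conn )
        {
            /* Drop only this domain's parent link; another parent
             * (e.g. the stubdom) keeps its own reference. */
            talloc_unlink(d, d->conn);
            d->conn = NULL;          /* make a second cleanup pass a no-op */
        }
    }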
This is CVE-2017-14317 / XSA-233.
Reported-by: Eric Chanudet <chanudete@ainfosec.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 562a1c0f7ef3fbf3c122c3dfa4f2ad9dd51da9fe
master date: 2017-09-12 14:44:56 +0200
Andrew Cooper [Tue, 12 Sep 2017 13:10:11 +0000 (15:10 +0200)]
grant_table: fix GNTTABOP_cache_flush handling
Don't fall over a NULL grant_table pointer when the owner of the domain
is a system domain (DOMID_{XEN,IO} etc).
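A hedged sketch of the sort of check implied; the surrounding
cache-flush code and the exact error value are not quoted from the
patch:

    /* Illustrative: system domains (DOMID_XEN, DOMID_IO, ...) own no
     * grant table, so bail out before dereferencing it. */
    if ( d->grant_table == NULL )
        return -EOPNOTSUPP;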
This is CVE-2017-14318 / XSA-232.
Reported-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c3d830b244998b3686e2eb64db95996be5eb5e5c
master date: 2017-09-12 14:44:11 +0200
George Dunlap [Tue, 12 Sep 2017 13:09:28 +0000 (15:09 +0200)]
xen/mm: make sure node is less than MAX_NUMNODES
The output of MEMF_get_node(memflags) can be as large as nodeid_t can
hold (currently 255). It is then used as an index into arrays of size
MAX_NUMNODES (which is 64 on x86 and 1 on ARM), can be passed in by an
untrusted guest (via memory_exchange and increase_reservation), and is
not currently bounds-checked.
Check the value in page_alloc.c before using it, and also check the
value in the hypercall call sites and return -EINVAL if appropriate.
Don't permit domains other than the hardware or control domain to
allocate node-constrained memory.
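A hedged sketch of the checks described, assuming the usual Xen helpers
exist under these names (MEMF_get_node(), NUMA_NO_NODE,
is_hardware_domain(), is_control_domain()); the exact error code per
call site may differ:

    static int check_node_request(struct domain *d, unsigned int memflags)
    {
        nodeid_t node = MEMF_get_node(memflags);

        if ( node == NUMA_NO_NODE )
            return 0;
        if ( node >= MAX_NUMNODES )
            return -EINVAL;          /* would index past the per-node arrays */
        if ( !is_hardware_domain(d) && !is_control_domain(d) )
            return -EINVAL;          /* node-constrained allocation restricted */
        return 0;
    }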
This is CVE-2017-14316 / XSA-231.
Reported-by: Matthew Daley <mattd@bugfuzz.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2fece35303529395bfea6b03d2268380ef682c93
master date: 2017-09-12 14:43:16 +0200
When no memory is available in the hypervisor, rather than immediately
failing the request, try to steal a handle from another vCPU.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d02f1a0b7576bafb2fba903c7e6e7221ab0d2847
master date: 2017-08-17 14:41:01 +0200
cpufreq: only stop ondemand governor if already started
On CPUFREQ_GOV_STOP in cpufreq_governor_dbs, shortcut to
return success if the governor is already stopped.
Avoid executing dbs_timer_exit, to prevent tripping an assertion
within a call to kill_timer on a timer that has not been prepared
with init_timer, if the CPUFREQ_GOV_START case has not
run beforehand.
kill_timer validates timer state:
* itself, via BUG_ON(this_cpu(timers).running == timer);
* within active_timer, ASSERTing timer->status is within bounds;
* within list_del, which ASSERTs timer inactive list membership.
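A hedged sketch of the shortcut, assuming a per-CPU flag records
whether the CPUFREQ_GOV_START case has run (the field name below is a
guess, not quoted from the patch):

    case CPUFREQ_GOV_STOP:
        if ( !this_dbs_info->enable )        /* never started: nothing to stop */
            return 0;
        dbs_timer_exit(this_dbs_info);       /* safe: timer was set up at START */
        break;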
This patch is equivalent to an OpenXT patch produced at Citrix prior to
June 2014.
Signed-off-by: Christopher Clark <christopher.clark6@baesystems.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e7ec6f5f32cd2d0f723083cde3d7761c4e675f2c
master date: 2017-08-10 12:35:50 +0200
Chao Gao [Mon, 28 Aug 2017 11:02:53 +0000 (13:02 +0200)]
VT-d PI: disable VT-d PI when CPU-side PI isn't enabled
From the context calling pi_desc_init(), we can conclude the current
implementation of VT-d PI depends on CPU-side PI. If we enable VT-d PI
and disable CPU-side PI by disabling APICv explicitly in xen boot
command line, we would get an assertion failure.
This patch clears iommu_intpost once it is found that CPU-side PI won't
be enabled. This is safe because it is done before the flag starts
taking effect. Also take this chance to remove the useless check of
"acknowledge interrupt on exit", a minimal requirement which has
already been checked earlier.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: e489eb6138e7efe4214a7e9ba0d21f54fc5b7d35
master date: 2017-08-10 12:32:16 +0200
Rusty Bird [Mon, 28 Aug 2017 11:02:24 +0000 (13:02 +0200)]
VT-d: don't panic/warn on iommu=no-igfx
When operating on an Intel graphics device, iommu_enable_translation()
panicked (force_iommu==1) or warned (force_iommu==0) about the BIOS if
is_igd_vt_enabled_quirk() returned 0. That's good if the actual BIOS
problem has been detected. But since commit 1463411, returning 0 could
also happen if the user simply passed "iommu=no-igfx", in which case
bailing out with an info message (instead of a panic/warning) would be
more appropriate.
The panic broke the combination "iommu=force,no-igfx", and also the case
where "iommu=no-igfx" is passed but force_iommu=1 is set automatically
by x2apic_bsp_setup().
Move the iommu_igfx check from is_igd_vt_enabled_quirk() into its only
caller iommu_enable_translation(), and tweak the logic.
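A hedged sketch of the tweaked logic in iommu_enable_translation(); the
IGD-detection helper and the message text below are placeholders:

    /* Illustrative: honour the user's "iommu=no-igfx" request with an
     * informational message instead of a BIOS panic/warning. */
    if ( is_igd_iommu(iommu) && !iommu_igfx )          /* placeholder helper */
    {
        printk(XENLOG_INFO "IGD VT-d disabled by \"iommu=no-igfx\"\n");
        return;
    }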
Olaf Hering [Mon, 28 Aug 2017 11:02:06 +0000 (13:02 +0200)]
docs: replace xm with xl in xen-tscmode
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 763267e315a93e2b6d66a0afdcda96db939e09b6
master date: 2017-07-24 10:17:21 +0100
Olaf Hering [Mon, 28 Aug 2017 11:01:50 +0000 (13:01 +0200)]
rombios: prevent building with PIC/PIE
If the default compiler silently defaults to -fPIC/-fPIE, building
rombios fails:
ld -melf_i386 -s -r 32bitbios.o tcgbios/tcgbiosext.o util.o pmm.o -o 32bitbios_all.o
There are undefined symbols in the BIOS:
U _GLOBAL_OFFSET_TABLE_
make[10]: *** [Makefile:26: 32bitbios_all.o] Error 11
Prevent the failure by enforcing non-PIC/PIE mode.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 672949d6c61d9cba01c5b414eed9d522082f04d3
master date: 2017-06-26 14:32:46 +0100
Andrew Cooper [Mon, 28 Aug 2017 11:01:23 +0000 (13:01 +0200)]
xen/livepatch: Don't crash on encountering STN_UNDEF relocations
A symndx of STN_UNDEF is special, and means a symbol value of 0. While
legitimate in the ELF standard, its existence in a livepatch is questionable
at best. Until a plausible use case presents itself, reject such a relocation
with -EOPNOTSUPP.
Additionally, fix an off-by-one error while range checking symndx, and perform
a safety check on elf->sym[symndx].sym before dereferencing it, to avoid
tripping over a NULL pointer when calculating val.
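A hedged sketch of the checks described; the elf->nsym bound and the
sym field layout are assumptions, not quoted from the patch:

    if ( symndx == STN_UNDEF )
        return -EOPNOTSUPP;                  /* no plausible livepatch use */
    if ( symndx >= elf->nsym )               /* '>=', fixing the off-by-one */
        return -EINVAL;
    if ( !elf->sym[symndx].sym )             /* guard the dereference below */
        return -EINVAL;
    val = elf->sym[symndx].sym->st_value + rela->r_addend;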
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> [x86 and arm32] Reviewed-by: Jan Beulich <JBeulich@suse.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: 2ff229643b739e2fd0cd0536ee9fca506cfa92f8
master date: 2017-06-23 15:00:37 +0100