Jan Beulich [Fri, 16 Mar 2018 16:20:59 +0000 (17:20 +0100)]
x86: ignore guest microcode loading attempts
The respective MSRs are write-only, and hence attempts by guests to
write to these are - as of 1f1d183d49 ("x86/HVM: don't give the wrong
impression of WRMSR succeeding") - no longer ignored. Restore the
original behavior for the two affected MSRs.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 59c0983e10d70ea2368085271b75fb007811fe52
master date: 2018-03-15 12:44:24 +0100
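As a hedged illustration of the restored behaviour (MSR constants per the usual Intel/AMD definitions; the dispatch wrapper is a sketch, not Xen's actual wrmsr path):

    #include <stdint.h>
    #include <stdbool.h>

    #define MSR_IA32_UCODE_WRITE 0x00000079   /* write-only microcode trigger */
    #define MSR_AMD_PATCHLOADER  0xc0010020   /* AMD counterpart */

    /* Returns true if the write was handled (here: silently dropped). */
    static bool guest_wrmsr_sketch(uint32_t msr, uint64_t val)
    {
        (void)val;
        switch ( msr )
        {
        case MSR_IA32_UCODE_WRITE:
        case MSR_AMD_PATCHLOADER:
            /* Guests may not load microcode: swallow the write rather
             * than raising #GP, restoring the pre-1f1d183d49 behaviour. */
            return true;
        default:
            return false;   /* fall back to probing; #GP if non-existent */
        }
    }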
Jan Beulich [Tue, 6 Mar 2018 15:24:41 +0000 (16:24 +0100)]
x86/HVM: don't give the wrong impression of WRMSR succeeding
... for non-existent MSRs: wrmsr_hypervisor_regs()'s comment clearly
says that the function returns 0 for unrecognized MSRs, so
{svm,vmx}_msr_write_intercept() should not convert this into success. We
don't want to unconditionally fail the access though, as we can't be
certain the list of handled MSRs is complete enough for the guest types
we care about, so instead mirror what we do on the read paths and probe
the MSR to decide whether to raise #GP.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 1f1d183d49008794b087cf043fc77f724a45af98
master date: 2018-02-27 15:12:23 +0100
Jan Beulich [Tue, 6 Mar 2018 15:24:01 +0000 (16:24 +0100)]
x86/PV: fix off-by-one in I/O bitmap limit check
With everyone having their tags below agreeing that putting things the
other way around in the comparison makes things easier to understand, do
that rearrangement while changing the line anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c6527bc66b6dd7a8dadaebb1047c8e52c6c5793c
master date: 2018-02-27 14:10:00 +0100
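The shape of the fix, as a minimal sketch (names illustrative, not Xen's exact fields): an access of 'bytes' ports starting at 'port' is only covered if it ends at or before the bitmap limit, and putting the sum on the left of the comparison is the rearrangement referred to above:

    #include <stdbool.h>

    /* Is [port, port + bytes) within an I/O bitmap covering 'limit' ports?
     * The off-by-one is the difference between '<' and '<=' here. */
    static bool ioports_covered(unsigned int port, unsigned int bytes,
                                unsigned int limit)
    {
        return port + bytes <= limit;
    }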
Andrew Cooper [Tue, 6 Mar 2018 15:22:54 +0000 (16:22 +0100)]
x86/pv: Avoid leaking other guests' MSR_TSC_AUX values into PV context
If the CPU pipeline supports RDTSCP or RDPID, a guest can observe the value in
MSR_TSC_AUX, irrespective of whether the relevant CPUID features are
advertised/hidden.
At the moment, paravirt_ctxt_switch_to() only writes to MSR_TSC_AUX if
TSC_MODE_PVRDTSCP mode is enabled, but this is not the default mode.
Therefore, default PV guests can read the value from a previously scheduled
HVM vcpu, or TSC_MODE_PVRDTSCP-enabled PV guest.
Alter the PV path to always write to MSR_TSC_AUX, using 0 in the common case.
To amortise overhead cost, introduce wrmsr_tsc_aux() which performs a lazy
update of the MSR, and use this function consistently across the codebase.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: cc0e45db277922b5723a7b1d9657d6f744230cf1
master date: 2018-02-27 10:47:23 +0000
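A minimal sketch of the lazy-update idea (per-CPU storage and wrmsrl() are assumed helpers; the real Xen function differs in detail):

    #include <stdint.h>

    #define MSR_TSC_AUX 0xc0000103

    extern void wrmsrl(uint32_t msr, uint64_t val);   /* assumed MSR write */

    static uint64_t this_cpu_tsc_aux;   /* per-CPU in real code */

    /* Skip the costly WRMSR when the register already holds 'val'. */
    static void wrmsr_tsc_aux_sketch(uint64_t val)
    {
        if ( this_cpu_tsc_aux != val )
        {
            wrmsrl(MSR_TSC_AUX, val);
            this_cpu_tsc_aux = val;
        }
    }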
Igor Druzhinin [Tue, 6 Mar 2018 15:22:09 +0000 (16:22 +0100)]
x86/nmi: start NMI watchdog on CPU0 after SMP bootstrap
We're noticing a reproducible system boot hang on certain
Skylake platforms where the BIOS is configured in legacy
boot mode with x2APIC disabled. The system stalls immediately
after writing the first SMP initialization sequence into APIC ICR.
The cause of the problem is watchdog NMI handler execution -
somewhere near the end of NMI handling (after it's already
rescheduled the next NMI) it tries to access IO port 0x61
to get the actual NMI reason on CPU0. Unfortunately, this
port is emulated by BIOS using SMIs and this emulation for
some reason takes more time than we expect during the INIT-SIPI-SIPI
sequence. As a result, the system is constantly moving between the
NMI and SMI handlers and not making any progress.
To avoid this, initialize the watchdog after SMP bootstrap on
CPU0 and, additionally, protect the NMI handler by moving
IO port access before NMI re-scheduling. The latter should also
help in case of post-boot CPU onlining. Although we're running the
watchdog at a much lower frequency at this point, it's nevertheless
possible we may trigger the issue anyway.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a44f1697968e04fcc6145e3bd51c748b57047240
master date: 2018-02-20 10:16:56 +0100
Jan Beulich [Tue, 6 Mar 2018 15:21:26 +0000 (16:21 +0100)]
x86/srat: fix end calculation in nodes_cover_memory()
Along the lines of commit 7226486767 ("x86/srat: fix the end pfn check
in valid_numa_range()") nodes_cover_memory() also doesn't consistently
use "end": It's set to an inclusive value initially, but then compared
to the exclusive "end" field of struct node and also possibly set to
nodes[j].start, making it exclusive too. Change the initialization to
make the variable consistently exclusive.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fdbed42649eb064e7c6d1bae2bdd4f46e7b2a160
master date: 2018-02-15 18:17:32 +0100
Andrew Cooper [Tue, 6 Mar 2018 15:19:35 +0000 (16:19 +0100)]
x86/spec_ctrl: Fix several bugs in SPEC_CTRL_ENTRY_FROM_INTR_IST
DO_OVERWRITE_RSB clobbers %rax, meaning in practice that the bti_ist_info
field gets zeroed. Older versions of this code had the DO_OVERWRITE_RSB
register selectable, so reintroduce this ability and use it to cause the
INTR_IST path to use %rdx instead.
The use of %dl for the %cs.rpl check means that when an IST interrupt hits
Xen, we try to load 1 into the high 32 bits of MSR_SPEC_CTRL, suffering a #GP
fault instead.
Also, drop an unused label which was a copy/paste mistake.
Reported-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a2b08fbed388f18235fda5ba1655c1483ef3e215
master date: 2018-02-14 13:22:15 +0000
Haozhong Zhang [Tue, 6 Mar 2018 15:19:03 +0000 (16:19 +0100)]
x86/srat: fix the end pfn check in valid_numa_range()
... and fix the coding style on the fly.
valid_numa_range(..., epfn << PAGE_SHIFT, ...) and its only caller
memory_add(..., epfn, pxm) interpret epfn inconsistently. The former
interprets epfn as the last pfn, while the latter interprets it as the
last pfn plus one. Fix this inconsistency in valid_numa_range(), since
most other places use the latter interpretation.
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 722648676751fda39086f54d961640f88174360b
master date: 2018-02-12 11:08:33 +0000
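A sketch of the end-exclusive convention the fix settles on (illustrative, simplified from the real node lookup):

    #include <stdbool.h>

    /* With epfn interpreted as 'last pfn plus one', a range lies within a
     * node covering [start, end) exactly when: */
    static bool range_within_node(unsigned long spfn, unsigned long epfn,
                                  unsigned long start, unsigned long end)
    {
        return spfn >= start && epfn <= end;
    }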
Jan Beulich [Tue, 6 Mar 2018 15:18:28 +0000 (16:18 +0100)]
x86: reduce Meltdown band-aid IPI overhead
In case we can detect single-threaded guest processes (by checking
whether we can account for all root page table uses locally on the vCPU
that's running), there's no point in issuing a sync IPI upon an L4 entry
update, as no other vCPU of the guest will have that page table loaded.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a22320e32dca0918ed23799583f470afe4c24330
master date: 2018-02-07 16:31:41 +0100
Andrew Cooper [Tue, 6 Mar 2018 15:17:48 +0000 (16:17 +0100)]
x86/emul: Fix the emulation of invlpga
The instruction requires EFER.SVME set to be usable in the first place.
Furthermore, the emulation doesn't handle ASIDs, so avoid giving the
impression that they work. Permit ASID 0 which is reserved for non-root
mode (in which case the instruction is identical to invlpg), but raise #UD for
any other ASID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a91b2ec337a45d5d98e5a4387aa6563bc5cdc4c9
master date: 2018-02-05 18:17:22 +0000
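A hedged sketch of the emulation rules described above (constants and the invlpg helper are illustrative):

    #include <stdint.h>

    #define EFER_SVME (1u << 12)
    enum { X86EMUL_OKAY, X86EMUL_EXCEPTION };

    extern void invlpg_sketch(unsigned long linear);   /* assumed helper */

    static int emulate_invlpga(uint64_t efer, uint32_t asid, unsigned long linear)
    {
        if ( !(efer & EFER_SVME) )
            return X86EMUL_EXCEPTION;      /* #UD: SVME not enabled */
        if ( asid != 0 )
            return X86EMUL_EXCEPTION;      /* #UD: ASIDs aren't emulated */
        invlpg_sketch(linear);             /* ASID 0: behaves like invlpg */
        return X86EMUL_OKAY;
    }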
Andrew Cooper [Thu, 16 Nov 2017 21:10:00 +0000 (21:10 +0000)]
tools/libxc: Fix restoration of PV MSRs after migrate
There are two bugs in process_vcpu_msrs() which clearly demonstrate that I
didn't test this bit of Migration v2 very well when writing it...
vcpu->msrsz is always expected to be a multiple of xen_domctl_vcpu_msr_t
records in a spec-compliant stream, so the modulo yields 0 for the msr_count,
rather than the actual number sent in the stream.
Passing 0 for the msr_count causes the hypercall to exit early, and hides the
fact that the guest handle is inserted into the wrong field in the domctl
union.
The reason that these bugs have gone unnoticed for so long is that the only
MSRs passed like this for PV guests are the AMD DBGEXT MSRs, which only exist
in fairly modern hardware, and whose use doesn't appear to be implemented in
any contemporary PV guests.
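The arithmetic at fault, as a sketch (record layout per the public domctl interface; names simplified):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint32_t index, reserved; uint64_t value; } msr_record_t;

    /* For a buffer holding an exact multiple of fixed-size records,
     * '%' always yields 0 where '/' gives the intended count. */
    static unsigned int msr_count(size_t msrsz)
    {
        return msrsz / sizeof(msr_record_t);   /* the bug used '%' */
    }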
Andrew Cooper [Thu, 30 Mar 2017 16:32:32 +0000 (17:32 +0100)]
tools/libxc: Avoid generating inappropriate zero-content records
The code as written attempted to elide zero-content records, as such records
serve no purpose but come with a performance hit. Unfortunately, in the case
where the hypervisor reported max size is non-zero, but the actual size is
zero, the record is not elided.
This previously tripped up the sanity checks in the restore side of migration,
but as the underlying reasons for eliding the records in the first place are
still valid, fix the elision logic.
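The corrected decision, sketched (sizes stand in for the hypervisor-reported maximum and the bytes actually produced):

    #include <stdbool.h>
    #include <stddef.h>

    /* Elide on actual content, not on the advertised maximum. */
    static bool should_write_record(size_t max_size, size_t actual_size)
    {
        (void)max_size;            /* the buggy logic keyed off this */
        return actual_size != 0;
    }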
Jan Beulich [Tue, 27 Feb 2018 13:32:32 +0000 (14:32 +0100)]
gnttab: don't blindly free status pages upon version change
There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
the second (when the first fully passed) freeing the pages. Return an
error if any pages still have pending references.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38bfcc165dda5f4284d7c218b91df9e144ddd88d
master date: 2018-02-27 14:07:12 +0100
Jan Beulich [Tue, 27 Feb 2018 13:32:14 +0000 (14:32 +0100)]
gnttab/ARM: don't corrupt shared GFN array
... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.
While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d2f8f9c65d4da35437f50ed9e812a2c5ab313e2
master date: 2018-02-27 14:04:44 +0100
Jan Beulich [Tue, 27 Feb 2018 13:31:30 +0000 (14:31 +0100)]
memory: don't implicitly unpin for decrease-reservation
It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.
This is XSA-252.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d798a0952903db9d8ee0a580e03f214d2b49b7d7
master date: 2018-02-27 14:03:27 +0100
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs
The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPUs (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always uses the boot CPU MIDR.
Fix is_affected_midr_range to use the current CPU MIDR.
Tim Deegan [Thu, 15 Feb 2018 12:10:54 +0000 (13:10 +0100)]
tools/kdd: don't use a pointer to an unaligned field.
The 'val' field in the packet is byte-aligned (because it is part of a
packed struct), but the pointer argument to kdd_rdmsr() has the normal
alignment constraints for a uint64_t *. Use a local variable to make sure
the passed pointer has the correct alignment.
Reported-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Tim Deegan <tim@xen.org> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: bfd9a2095f1882e8c074b2d911bcb07d12cf6cf5
master date: 2017-03-15 10:57:00 +0000
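The pattern, sketched in C (struct layout and the kdd_rdmsr signature are simplified assumptions):

    #include <stdint.h>
    #include <string.h>

    #pragma pack(push, 1)
    typedef struct { uint8_t pad[3]; uint64_t val; } packed_pkt_t;
    #pragma pack(pop)

    extern int kdd_rdmsr_sketch(uint32_t msr, uint64_t *valp); /* wants an aligned pointer */

    static int read_msr_into_packet(uint32_t msr, packed_pkt_t *pkt)
    {
        uint64_t val;   /* properly aligned local */
        int rc = kdd_rdmsr_sketch(msr, &val);

        memcpy(&pkt->val, &val, sizeof(val));   /* bytewise copy into the
                                                   unaligned packed field */
        return rc;
    }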
Andrew Cooper [Wed, 14 Feb 2018 10:45:01 +0000 (11:45 +0100)]
x86/idle: Clear SPEC_CTRL while idle
On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads. It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.
Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline. Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4c7e478d597b0346eef3a256cfd6794ac778b608
master date: 2018-01-26 14:10:21 +0000
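A hedged sketch of the enter/exit pairing (MSR constant per the architectural definition; the saved value and wrmsrl() are assumptions standing in for Xen's per-CPU state):

    #include <stdint.h>

    #define MSR_SPEC_CTRL 0x00000048

    extern void wrmsrl(uint32_t msr, uint64_t val);   /* assumed MSR write */

    /* Must be inlined in real code: no ret/indirect branch may occur
     * between these two calls. */
    static inline void spec_ctrl_enter_idle_sketch(uint64_t cur, uint64_t *saved)
    {
        *saved = cur;
        wrmsrl(MSR_SPEC_CTRL, 0);   /* stop penalising the sibling thread */
    }

    static inline void spec_ctrl_exit_idle_sketch(uint64_t saved)
    {
        wrmsrl(MSR_SPEC_CTRL, saved);
    }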
Andrew Cooper [Wed, 14 Feb 2018 10:44:23 +0000 (11:44 +0100)]
x86/cpuid: Offer Indirect Branch Controls to guests
With all infrastructure in place, it is now safe to let guests see and use
these features.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 67c6838ddacfa646f9d1ae802bd0f16a935665b8
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:43:57 +0000 (11:43 +0100)]
x86/ctxt: Issue a speculation barrier between vcpu contexts
Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.
The cost of IBPB is substantial, and skipped on transition to idle, as Xen's
idle code is robust already. All transitions into vcpu context are fully
serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try and skip the flush.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a2ed643ed783020f885035432e9c0919756921d1
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:43:28 +0000 (11:43 +0100)]
x86/boot: Calculate the most appropriate BTI mitigation to use
See the logic and comments in init_speculation_mitigations() for further
details.
There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix determination of when to use IBRS
The original version of this logic was:
    /*
     * On Intel hardware, we'd like to use retpoline in preference to
     * IBRS, but only if it is safe on this hardware.
     */
    else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
    {
        if ( retpoline_safe() )
            thunk = THUNK_RETPOLINE;
        else
            ibrs = true;
    }
but it was changed by a request during review. Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.
This in practice means that on retpoline-unsafe hardware without IBRS, we
select THUNK_JMP despite intending to select THUNK_RETPOLINE.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2713715305ca516f698d58cec5e0b322c3b2c4eb
master date: 2018-01-26 14:10:21 +0000
master commit: 30cbd0c83ef3d0edac2d5bcc41a9a2b7a843ae58
master date: 2018-02-06 18:32:58 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:42:51 +0000 (11:42 +0100)]
x86/entry: Avoid using alternatives in NMI/#MC paths
This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.
For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.
Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing. This also covers
the safety of the RSB conditional, as execution is guaranteed to hit either
the WRMSR or the LFENCE.
One downside of not using alternatives is that there is unconditionally an
LFENCE in the IST path in cases where we are not using the features from
IBRS-capable microcode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3fffaf9c13e9502f09ad4ab1aac3f8b7b9398f6f
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:42:12 +0000 (11:42 +0100)]
x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code. Guests can take advantage of this in two ways:
1) If they can find a path in Xen which executes more ret instructions than
call instructions. (At least one in the waitqueue infrastructure,
probably others.)
2) Use the fact that the RSB/RAS in hardware is actually a circular stack
without a concept of empty. (When it logically empties, stale values
will start being used.)
To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e6c0128e9ab25bf66df11377a33ee5584d7f99e3
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:39:21 +0000 (11:39 +0100)]
x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context. See the documentation in
asm-x86/spec_ctrl_asm.h for details.
With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set. Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.
Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.
Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.
There is a semi-unrelated bugfix, where various asm_defns.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e7962901131186d3514528ed57c7a9901a15a3e
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:38:30 +0000 (11:38 +0100)]
x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
For performance reasons, HVM guests should have direct access to these MSRs
when possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5a2fe171144ebcc908ea1fca45058d6010f6a286
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:38:00 +0000 (11:38 +0100)]
x86/migrate: Move MSR_SPEC_CTRL on migrate
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0cf2a4eb769302b7d7d7835540e7b2f15006df30
master date: 2018-01-26 14:10:21 +0000
MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea58a679a6190e714a592f1369b660769a48a80c
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:36:48 +0000 (11:36 +0100)]
x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit. AMD specifies IBPB alone in a 3rd bit.
AMD's IBPB is a subset of Intel's combined IBRS/IBPB. For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.
The behaviour of STIBP is more complicated.
It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware. However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.
For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available. This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.
Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d297b56682e730d598e2529cc6998151d3b6f6f8
master date: 2018-01-26 14:10:21 +0000
Wei Liu [Wed, 14 Feb 2018 10:36:09 +0000 (11:36 +0100)]
x86: fix GET_STACK_END
AIUI the purpose of having the .if directive is to make GET_STACK_END
work with any general purpose register. The code as-is would produce
the wrong result for r8. Fix it.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8155476765a5bdecea1534b46562cf28e0113a9a
master date: 2018-01-25 11:34:17 +0000
Roger Pau Monné [Wed, 14 Feb 2018 10:35:42 +0000 (11:35 +0100)]
x86/acpi: process softirqs while printing CPU ACPI data
Or else the watchdog triggers on boxes with a huge number of CPUs.
Reported-by: Simon Crowe <simon.crowe@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: a5579ee79ef8546dd47abe34d73dc9a69a14bbda
master date: 2018-01-24 18:02:14 +0100
Andrew Cooper [Wed, 14 Feb 2018 10:35:00 +0000 (11:35 +0100)]
x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB
Instead of gaining yet another top level boolean, introduce a more generic
cpuid= option. Also introduce a helper function to parse a generic boolean
value.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cmdline: Fix parse_boolean() for unadorned values
A command line such as "cpuid=no-ibrsb,no-stibp" tickles a bug in
parse_boolean() because the separating comma fails the NUL case.
Instead, check for slen == nlen which accounts for the boundary (if any)
passed via the 'e' parameter.
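A sketch of the corrected parser (simplified from Xen's parse_boolean(): the '=value' form is omitted; 's'..'e' delimits one sub-option, with e == NULL meaning NUL-terminated):

    #include <stddef.h>
    #include <string.h>

    /* Returns 1 for "name", 0 for "no-name", -1 for no match.  Comparing
     * slen == nlen handles a comma (or any boundary in 'e') as well as NUL. */
    static int parse_boolean_sketch(const char *name, const char *s, const char *e)
    {
        size_t slen, nlen = strlen(name);
        int val = 1;

        if ( !strncmp(s, "no-", 3) )
        {
            val = 0;
            s += 3;
        }
        slen = e ? (size_t)(e - s) : strlen(s);

        return (slen == nlen && !strncmp(s, name, nlen)) ? val : -1;
    }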
Andrew Cooper [Wed, 14 Feb 2018 10:34:14 +0000 (11:34 +0100)]
x86/feature: Definitions for Indirect Branch Controls
Contemporary processors are gaining Indirect Branch Controls via microcode
updates. Intel are introducing one bit to indicate IBRS and IBPB support, and
a second bit for STIBP. AMD are introducing IBPB only, so enumerate it with a
separate bit.
Furthermore, depending on compiler and microcode availability, we may want to
run Xen with IBRS set, or clear.
To use these facilities, we synthesise separate IBRS and IBPB bits for
internal use. A lot of infrastructure is required before these features are
safe to offer to guests.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 0d703a701cc4bc47773986b2796eebd28b1439b5
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:30:08 +0000 (11:30 +0100)]
x86/amd: Try to set lfence as being Dispatch Serialising
This property is required for AMD's recommended mitigation for Branch
Target Injection, but Xen needs to cope with being unable to detect or modify
the MSR.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe3ee5530a8d0d0b6a478167125d00c40f294a86
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:20:45 +0000 (11:20 +0100)]
x86: Support indirect thunks from assembly code
Introduce INDIRECT_CALL and INDIRECT_JMP which either degrade to a normal
indirect branch, or dispatch to the __x86_indirect_thunk_* symbols.
Update all the manual indirect branches to use the new thunks. The
indirect branches in the early boot and kexec path are left intact as we can't
use the compiled-in thunks at those points.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7c508612f7a5096b4819d4ef2ce566e01bd66c0c
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:19:35 +0000 (11:19 +0100)]
x86: Support compiling with indirect branch thunks
Use -mindirect-branch=thunk-extern/-mindirect-branch-register when available.
To begin with, use the retpoline thunk. Later work will add alternative
thunks which can be selected at boot time.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 3659f0f4bcc6ca08103d1a7ae4e97535ecc978be
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:17:57 +0000 (11:17 +0100)]
x86/entry: Erase guest GPR state on entry to Xen
This reduces the number of code gadgets which can be attacked with arbitrary
guest-controlled GPR values.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 03bd8c3a70d101fc2f8f36f1e171b7594462a4cd
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:17:09 +0000 (11:17 +0100)]
x86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit
No practical change.
One side effect in debug builds is that %rbp is inverted in the manner
expected by the stack unwinder to indicate an interrupt frame.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 13682ca8c94bd5612a44f7f1edc1fd8ff675dacb
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:16:32 +0000 (11:16 +0100)]
x86/entry: Rearrange RESTORE_ALL to restore registers in stack order
Results in a more predictable (i.e. linear) memory access pattern.
No functional change.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: f85d105e27735f0e20aa30d77f03774f3ed55ae5
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:15:46 +0000 (11:15 +0100)]
x86: Introduce a common cpuid_policy_updated()
No practical change at the moment, but future changes will need to react
irrespective of guest type.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b357546b43ab87dfb10d740ae637a685134d5e32
master date: 2018-01-05 19:57:07 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:14:54 +0000 (11:14 +0100)]
x86/hvm: Rename update_guest_vendor() callback to cpuid_policy_changed()
It will shortly be used for more than just changing the vendor.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3bea00966eb6680410c89df764d075a8fbacc3cc
master date: 2018-01-05 19:57:07 +0000
Andrew Cooper [Wed, 14 Feb 2018 10:13:00 +0000 (11:13 +0100)]
x86/alt: Break out alternative-asm into a separate header file
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 9d7b4351d3bb5c744db311cffa57ba3ebb583327
master date: 2018-01-05 19:57:07 +0000
Julien Grall [Fri, 2 Feb 2018 14:19:25 +0000 (14:19 +0000)]
xen/arm32: entry: Document the purpose of r11 in the traps handler
It took me a bit of time to understand why __DEFINE_TRAP_ENTRY is
storing the original stack pointer in r11. It works in tandem with
return_traps_entry, where sp will be restored from r11.
This is fine because per the AAPCS r11 must be preserved by the
subroutine. So in return_from_trap, r11 will still contain the original
stack pointer.
Add some documentation in the code to point the 2 sides to each other.
Julien Grall [Fri, 2 Feb 2018 14:19:24 +0000 (14:19 +0000)]
xen/arm32: Invalidate icache on guest exit for Cortex-A15
In order to avoid aliasing attacks against the branch predictor on
Cortex A-15, let's invalidate the BTB on guest exit, which can only be
done by invalidating the icache (with ACTLR[0] being set).
We use the same hack as for A12/A17 to perform the vector decoding.
This is based on Linux patch from the kpti branch in [1].
Julien Grall [Fri, 2 Feb 2018 14:19:23 +0000 (14:19 +0000)]
xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12
In order to avoid aliasing attacks against the branch predictor, let's
invalidate the BTB on guest exit. This is made complicated by the fact
that we cannot take a branch invalidating the BTB.
This is based on the fourth version posted by Marc Zyngier on Linux-arm
mailing list (see [1]).
Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.
Most of the mitigations will have to be applied when entering the
hypervisor from the guest context.
Because the attack is against the branch predictor, it is not possible
to safely use branch instructions before the mitigation is applied.
Therefore this has to be done in the vector entry, before jumping to the
helper handling a given exception.
However, on arm32, each vector contains a single instruction. This means
that the hardened vector tables may rely on register state that does
not hold when in the hypervisor (e.g. SP being 8-byte aligned).
Therefore hypervisor code running with the guest vector table should be
minimized and always have IRQs and SErrors masked, to reduce the risk of
using them.
This patch provides an infrastructure to switch vector tables before
entering the guest and when leaving it.
Note that alternatives could have been used, but older Xen (4.8 or
earlier) doesn't have support for them. So avoid using alternatives to
ease backporting.
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry
At the moment, the reset vector is defined as .word 0 (i.e. andeq r0, r0,
r0).
This is rather unintuitive and will result in executing the undefined
trap. Instead introduce trap helpers for reset, which will generate an
error message in the unlikely case that reset is called.
Julien Grall [Tue, 16 Jan 2018 14:23:37 +0000 (14:23 +0000)]
xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Cortex-A57, A72, A73 and A75 are susceptible to branch predictor
aliasing and can theoretically be attacked by malicious code.
This patch implements a PSCI-based mitigation for these CPUs when
available. The call into firmware will invalidate the branch predictor
state, preventing any malicious entries from affecting other victim
contexts.
Ported from Linux git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
This is part of XSA-254.
Julien Grall [Tue, 16 Jan 2018 14:23:36 +0000 (14:23 +0000)]
xen/arm64: Add skeleton to harden the branch predictor aliasing attacks
Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initial skeleton code behind a new Kconfig option to
enable implementation-specific mitigations against these attacks for
CPUs that are affected.
Most of the mitigations will have to be applied when entering the
hypervisor from the guest context. For safety, the mitigation is applied
at every exception entry, so there is potential for optimizing the case
of receiving an exception at the same level.
Because the attack is against the branch predictor, it is not possible
to safely use branch instructions before the mitigation is applied.
Therefore, this has to be done in the vector entry, before jumping to the
helper handling a given exception.
On Arm64, each vector can hold 32 instructions. This leaves us 31
instructions for the mitigation; the last one is the branch instruction
to the helper.
Because a platform may have CPUs with different micro-architectures, a
per-CPU vector table needs to be provided. Realistically, only a few
different mitigations will be necessary, so provide a small set of
vector tables. They will be re-used and patched with the mitigations
on demand.
This is based on the work done in Linux (see [1]).
Julien Grall [Tue, 16 Jan 2018 14:23:33 +0000 (14:23 +0000)]
xen/arm: Introduce enable callback to enable a capability on each online CPU
Once Xen knows what features/workarounds are present on the platform, it
might be necessary to configure each online CPU.
Introduce a new callback "enable" that will be called on each online CPU to
configure the "capability".
The code is based on Linux v4.14 (where cpufeature.c comes from); the
explanation of why stop_machine_run is used is kept, as we may face a
similar problem in the future.
Lastly introduce enable_errata_workaround that will be called once CPUs
have booted and before the hardware domain is created.
xen/arm: Detect silicon revision and set cap bits accordingly
After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need hypervisor code patches.
For each bug there is a function which checks whether a particular CPU is
affected. This needs to be done on every CPU to cover heterogeneous
systems properly.
If a certain erratum has been detected, the capability bit will be set.
In the case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.
The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.
xen/arm: cpufeature: Provide a helper to check if a capability is supported
The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic helper to go through a set of
capabilities and find which ones should be enabled.
The parameter "info" is used to display the kind of capability updated (e.g.
workaround, feature, ...).
Julien Grall [Wed, 22 Jun 2016 11:15:18 +0000 (12:15 +0100)]
xen/arm: Add macros to handle the MIDR
Add new macros to easily get different parts of the register and to
check if a given MIDR matches a CPU model range. The latter will be really
useful for handling errata later.
The macros have been imported from the header
arch/arm64/include/asm/cputype.h in Linux v4.6-rc3.
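A sketch of what such macros look like (field layout per the ARM ARM: implementer [31:24], variant [23:20], architecture [19:16], part number [15:4], revision [3:0]; names illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define MIDR_REVISION(m)   ((m) & 0xf)
    #define MIDR_VARIANT(m)    (((m) >> 20) & 0xf)
    #define MIDR_MODEL_MASK    0xff0ffff0u   /* implementer|arch|partnum */

    /* Does 'midr' name the given model, with variant:revision in range? */
    static bool midr_in_range(uint32_t midr, uint32_t model,
                              uint32_t rv_min, uint32_t rv_max)
    {
        uint32_t rv = (MIDR_VARIANT(midr) << 4) | MIDR_REVISION(midr);

        return (midr & MIDR_MODEL_MASK) == (model & MIDR_MODEL_MASK) &&
               rv >= rv_min && rv <= rv_max;
    }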
Jan Beulich [Wed, 17 Jan 2018 16:24:59 +0000 (17:24 +0100)]
x86: allow Meltdown band-aid to be disabled
First of all we don't need it on AMD systems. Additionally allow its use
to be controlled by a command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve
the intended effect - while we likely want that, it will be a later
follow-up.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e871e80c38547d9faefc6604532ba3e985e65873
master date: 2018-01-16 17:50:59 +0100
Jan Beulich [Wed, 17 Jan 2018 16:24:12 +0000 (17:24 +0100)]
x86: Meltdown band-aid against malicious 64-bit PV guests
This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.
The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.
This isn't providing full isolation yet, but it should be covering all
pieces of information exposure of which would otherwise require an XSA.
There is certainly much room for improvement, especially of performance,
here - first and foremost suppressing all the negative effects on AMD
systems. But in the interest of backportability (including to really old
hypervisors, which may not even have alternative patching) any such is
being left out here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
master date: 2018-01-16 17:49:03 +0100
Jan H. Schönherr [Wed, 17 Jan 2018 16:23:08 +0000 (17:23 +0100)]
x86: Don't use potentially incorrect CPUID values for topology information
Intel says for CPUID leaf 0Bh:
"Software must not use EBX[15:0] to enumerate processor
topology of the system. This value in this field
(EBX[15:0]) is only intended for display/diagnostic
purposes. The actual number of logical processors
available to BIOS/OS/Applications may be different from
the value of EBX[15:0], depending on software and platform
hardware configurations."
And yet, we're using them to derive the number of cores in a package
and the number of siblings in a core.
Derive the number of siblings and cores from EAX instead, which is
intended for that.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d51baf310e530659f73e714acf575555bdc46303
master date: 2018-01-08 10:48:24 +0000
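A sketch of the EAX-based derivation (cpuid_count() is an assumed wrapper around CPUID with a subleaf; bits 4:0 of EAX give the shift to the next topology level):

    #include <stdint.h>

    extern void cpuid_count(uint32_t leaf, uint32_t sub, uint32_t *a,
                            uint32_t *b, uint32_t *c, uint32_t *d);

    static void topology_from_leaf_b(unsigned int *threads_per_core,
                                     unsigned int *logical_per_package)
    {
        uint32_t a, b, c, d;

        cpuid_count(0xb, 0, &a, &b, &c, &d);          /* SMT level */
        *threads_per_core = 1u << (a & 0x1f);
        cpuid_count(0xb, 1, &a, &b, &c, &d);          /* core level */
        *logical_per_package = 1u << (a & 0x1f);      /* an upper bound,
                                                         unlike EBX[15:0] */
    }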
Andrew Cooper [Wed, 17 Jan 2018 16:22:34 +0000 (17:22 +0100)]
x86/entry: Remove support for partial cpu_user_regs frames
Save all GPRs on entry to Xen.
The entry_int82() path is via a DPL1 gate, only usable by 32bit PV guests, so
can get away with only saving the 32bit registers. All other entrypoints can
be reached from 32 or 64bit contexts.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f9eb74789af77e985ae653193f3622263499f674
master date: 2018-01-05 19:57:07 +0000
Roger Pau Monné [Wed, 17 Jan 2018 16:22:03 +0000 (17:22 +0100)]
x86/upcall: inject a spurious event after setting upcall vector
In case the vCPU has pending events to inject. This fixes a bug that
happened if the guest mapped the vcpu info area using
VCPUOP_register_vcpu_info without having setup the event channel
upcall, and then setup the upcall vector.
In this scenario the guest would not receive any upcalls, because the
call to VCPUOP_register_vcpu_info would have marked the vCPU as having
pending events, but the vector could not be injected because it was
not yet setup.
This has not caused issues so far because all the consumers first
setup the vector callback and then map the vcpu info page, but there's
no limitation that prevents doing it in the inverse order.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5b8ca7dffde866d851f0b87b994e0b13e5b867
master date: 2018-01-04 14:29:16 +0100
Jan Beulich [Wed, 17 Jan 2018 16:21:21 +0000 (17:21 +0100)]
x86/E820: don't overrun array
The bounds check needs to be done after the increment, not before, or
else it needs to use a one lower immediate. Also use word operations
rather than byte ones for both the increment and the compare (allowing
E820_BIOS_MAX to be more easily bumped, should the need ever arise).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0036c9dbcd8b52316aeebb475929d3a36cf5e514
master date: 2018-01-03 11:03:56 +0100
Remove useless smp_wmb() barrier after cpumask_set_cpu(cpuid,
&cpu_online_map), which is not synchronizing against anything.
Keep the other smp_wmb(), before the cpumask_set_cpu call, to ensure
that all writes before setting the cpu online are visible to other cpus.
For that to work properly, we need a corresponding smp_rmb() barrier,
after reading the online cpumask from other processors, which is
currently missing. Add it.
arm: configure interrupts to be in non-secure group1
Xen uses non-secure group1 interrupts, however it doesn't configure the
GICv3 accordingly. Xen needs to set GICD_IGROUPR for SPIs and
GICR_IGROUPR0 for local interrupts to "1" to specify that interrupts
belong to group1. This is particularly important if the system has
GICD_CTLR.DS set, also see commit 7c9b973061b03af62734f613f6abec46c0dd4a88 in Linux.
Julien Grall [Wed, 6 Dec 2017 14:51:37 +0000 (14:51 +0000)]
xen/arm: gic-v3: Bail out if gicv3_cpu_init fail
When system registers are not enabled, all accesses to them will trap
to EL2. In Xen, system registers will be enabled by gicv3_cpu_init only
on success. As the rest of the code (e.g. gicv3_hyp_init) relies on
system registers, it is better to bail out directly.
This will save time on debugging early boot issue on GICv3 platform.
Tom Lendacky [Wed, 20 Dec 2017 15:24:23 +0000 (16:24 +0100)]
x86/microcode: Add support for fam17h microcode loading
The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes. Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Linux commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf]
Ported to Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d458ba8c171809e8dd9abd19339c87f3f934ca
master date: 2017-12-13 14:30:10 +0000
Jan Beulich [Wed, 20 Dec 2017 15:23:52 +0000 (16:23 +0100)]
gnttab: improve GNTTABOP_cache_flush locking
Dropping the lock before returning from grant_map_exists() means handing
possibly stale information back to the caller. Instead, return the
pointer to the active entry, for the caller to release the lock once
done.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 553ac37137c2d1c03bf1b69cfb192ffbfe29daa4
master date: 2017-12-04 11:04:18 +0100
Jann validly points out that with a caller bogusly requesting a zero-
element batch with non-zero high command bits (the ones used for
continuation encoding), the assertion right before the call to
hypercall_create_continuation() would trigger. A similar situation would
arise afaict for non-empty batches with op and/or length zero in every
element.
While we want the former to succeed (as we do elsewhere for similar
no-op requests), the latter can clearly be converted to an error, as
this is a state that can't be the result of a prior operation.
Take the opportunity and also correct the order of argument checks:
We shouldn't accept zero-length elements with unknown bits set in "op".
Also constify cache_flush()'s first parameter.
Sergey Dyasli [Wed, 20 Dec 2017 15:22:58 +0000 (16:22 +0100)]
x86/vvmx: don't enable vmcs shadowing for nested guests
Running "./xtf_runner vvmx" in L1 Xen under L0 Xen produces the
following result on H/W with VMCS shadowing:
  Test: vmxon
    Failure in test_vmxon_in_root_cpl0()
      Expected 0x8200000f: VMfailValid(15) VMXON_IN_ROOT
      Got 0x82004400: VMfailValid(17408) <unknown>
  Test result: FAILURE
This happens because the SDM allows vmentries with the VMCS shadowing
VM-execution control enabled and a VMCS link pointer value of ~0ull. But
the results of a nested VMREAD are undefined in such cases.
Fix this by not copying the value of VMCS shadowing control from vmcs01
to vmcs02.
Andrew Cooper [Wed, 20 Dec 2017 15:22:27 +0000 (16:22 +0100)]
xen/pv: Construct d0v0's GDT properly
c/s cf6d39f8199 "x86/PV: properly populate descriptor tables" changed the GDT
to reference zero_page for intermediate frames between the guest and Xen
frames.
Because dom0_construct_pv() doesn't call arch_set_info_guest(), some bits of
initialisation are missed, including the pv_destroy_gdt() which initially
fills the references to zero_page.
In practice, this means there is a window between starting and the first call
to HYPERCALL_set_gdt() where lar/lsl/verr/verw suffer non-architectural
behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 08f27f4468eedbeccaac9fdda4ef732247efd74e
master date: 2017-12-01 19:03:26 +0000
Paul Durrant [Wed, 20 Dec 2017 15:21:59 +0000 (16:21 +0100)]
x86/hvm: fix interaction between internal and external emulation
A call to handle_hvm_io_completion() is needed for completing I/O
that requires external emulation. Such completion should be requested when
hvm_vcpu_io_need_completion() returns true after hvm_emulate_once() has
completed. This is indicative of the underlying I/O emulation having
returned X86EMUL_RETRY and hence a re-emulation of the instruction is
needed to pick up the result of the I/O.
A call to handle_hvm_io_completion() is NOT needed when the underlying
I/O has not returned X86EMUL_RETRY since there will be no result to pick
up. Hence it is bogus to request such completion when mmio_retry is set,
since this can only happen if the underlying I/O emulation has returned
X86EMUL_OKAY (meaning the I/O has completed successfully).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/HVM: don't retain emulated insn cache when exiting back to guest
vio->mmio_retry is being set when a repeated string insn is being split
up. In that case we'll exit to the guest, expecting immediate re-entry.
Interruptions, however, may be serviced by the guest before re-entry
from the repeated string insn. Any emulation needed in the course of
handling the interruption must not fetch from the internally maintained
cache.
As a follow-up to XSA-212 we should have addressed a similar issue here:
The handles being advanced at the top of xenmem_add_to_physmap_batch()
means we allow hypervisor space accesses (in particular, for "errs",
writes) with suitably crafted input arguments. This isn't a security
issue in this case because of the limited width of struct
xen_add_to_physmap_batch's size field: It being 16 bits wide, only the
r/o M2P area can be accessed. Still we can and should do better.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7f080956e9eed821fd42013bef11c1a2873fbeba
master date: 2017-11-28 13:15:12 +0100
Jan Beulich [Wed, 20 Dec 2017 15:21:03 +0000 (16:21 +0100)]
x86: check paging mode earlier in xenmem_add_to_physmap_one()
There's no point in deferring this until after some initial processing,
and it's actively wrong for the XENMAPSPACE_gmfn_foreign handling to not
have such a check at all.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f38f3dccf1e1a8aabcf57364326fc8f44cddc41a
master date: 2017-11-28 13:14:43 +0100
Jan Beulich [Wed, 20 Dec 2017 15:20:31 +0000 (16:20 +0100)]
x86: replace bad ASSERT() in xenmem_add_to_physmap_one()
There are no locks being held, i.e. it is possible to be triggered by
racy hypercall invocations. Subsequent code doesn't really depend on the
checked values, so this is not a security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: f33d653f46f5889db7be4fef31d71bc871834c10
master date: 2017-11-28 13:14:10 +0100
Jan Beulich [Wed, 20 Dec 2017 15:19:12 +0000 (16:19 +0100)]
sync CPU state upon final domain destruction
See the code comment being added for why we need this.
This is being placed here to balance between the desire to prevent
future similar issues (the risk of which would grow if it was put
further down the call stack, e.g. in vmx_vcpu_destroy()) and the
intention to limit the performance impact (otherwise it could also go
into rcu_do_batch(), paralleling the use in do_tasklet_work()).
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 24246e1fb7496b830aca8a6a1fd3064ca1e3ebf9
master date: 2017-11-23 11:38:22 +0100
Andrew Cooper [Wed, 20 Dec 2017 15:18:40 +0000 (16:18 +0100)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.
_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.
To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
A typical failure on the destination side looks like:
(XEN) HVM4 restore: CPU_MSR 0
(XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
(XEN) HVM4 restore: failed to load entry 20/0
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d2f86bf604698806d311cc251c1b66fbb752673c
master date: 2017-11-21 11:19:02 +0000
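The shape of the fix, sketched (descriptor layout per the HVM save format; offsets and names otherwise illustrative):

    #include <stddef.h>
    #include <stdint.h>

    struct hvm_save_descriptor { uint16_t typecode, instance; uint32_t length; };

    /* After serialising 'written' (<= max) MSR records, shrink the
     * descriptor emitted earlier so h->cur and the recorded length agree. */
    static void shrink_msr_record(uint8_t *buf, size_t desc_off,
                                  size_t hdr_size, unsigned int written,
                                  size_t rec_size)
    {
        struct hvm_save_descriptor *d =
            (struct hvm_save_descriptor *)(buf + desc_off);

        d->length = hdr_size + written * rec_size;
    }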
The altp2m_vcpu_enable_notify subop handler might skip calling
rcu_unlock_domain() after rcu_lock_current_domain(). Albeit since both
rcu functions are no-ops when run on the current domain, this doesn't
really have repercussions.
The second change is adding a missing break that would have potentially
enabled #VE for the current domain even if it had intended to enable it
for another one (not a supported functionality).
Signed-off-by: Adrian Pop <apop@bitdefender.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: eb0660c6950e08e44fdfeca3e29320382e2a1554
master date: 2017-11-16 17:25:59 +0000
Andrew Cooper [Wed, 20 Dec 2017 15:17:26 +0000 (16:17 +0100)]
common/gnttab: Correct error handling for gnttab_setup_table()
Simplify the error labels to just "unlock" and "out". This fixes an erroneous
path where a failure of rcu_lock_domain_by_any_id() still results in
rcu_unlock_domain() being called.
This is only not an XSA by luck. rcu_unlock_domain() is a nop other than
decrementing the preempt count, and nothing reads the preempt count outside of
a debug build.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e436e7a45082ea2cadc176c19e1df46c178448f
master date: 2017-08-15 15:08:57 +0100
Jan Beulich [Tue, 12 Dec 2017 14:04:28 +0000 (15:04 +0100)]
x86/shadow: fix ref-counting error handling
The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.
In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.
This is XSA-250.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 10be8001de7d87be1f0ccdda75cc70e922e56d03
master date: 2017-12-12 14:29:45 +0100
Jan Beulich [Tue, 12 Dec 2017 14:04:00 +0000 (15:04 +0100)]
x86/shadow: fix refcount overflow check
Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.
Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
This is XSA-249.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 54e2292e8df7a1a7b041192be9d6d797b6d00869
master date: 2017-12-12 14:29:13 +0100
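The manifest-constant idea, sketched (the 25-bit width is from the commit; the rest is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define SH_REFCOUNT_BITS 25
    #define SH_REFCOUNT_MAX  ((1u << SH_REFCOUNT_BITS) - 1)

    /* Deriving the limit from the same constant as the field width keeps
     * the check from silently drifting if the width changes again. */
    static bool sh_ref_would_overflow(uint32_t count)
    {
        return count >= SH_REFCOUNT_MAX;
    }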
Jan Beulich [Tue, 12 Dec 2017 14:03:34 +0000 (15:03 +0100)]
x86/mm: don't wrongly set page ownership
PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally. At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.
While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.
There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.
This is XSA-248.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ff2a793e15bb0b6254bc849ef8e83e1c284c3583
master date: 2017-12-12 14:28:36 +0100
Jan Beulich [Tue, 12 Dec 2017 14:03:00 +0000 (15:03 +0100)]
x86: don't wrongly trigger linear page table assertion (2)
_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the type which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).
This is part of XSA-240.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: e40b0219a8c77741ae48989efb520f4a762a5be3
master date: 2017-12-12 14:27:34 +0100