Andrew Cooper [Tue, 6 Mar 2018 14:46:56 +0000 (15:46 +0100)]
x86/emul: Fix the emulation of invlpga
The instruction requires EFER.SVME set to be usable in the first place.
Furthermore, the emulation doesn't handle ASIDs, so avoid giving the
impression that they work. Permit ASID 0 which is reserved for non-root
mode (in which case the instruction is identical to invlpg), but raise #UD for
any other ASID.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a91b2ec337a45d5d98e5a4387aa6563bc5cdc4c9
master date: 2018-02-05 18:17:22 +0000
Julien Grall [Fri, 16 Feb 2018 14:59:56 +0000 (14:59 +0000)]
xen/arm: vgic: Make sure the number of SPIs is a multiple of 32
The vGIC relies on having a pending_irq available for every IRQs
described in the ranks. As each rank describes 32 interrupts, we need to
make sure the number of SPIs is a multiple of 32.
Andrew Cooper [Tue, 27 Feb 2018 13:19:50 +0000 (14:19 +0100)]
x86/hvm: Disallow the creation of HVM domains without Local APIC emulation
There are multiple problems, not necesserily limited to:
* Guests which configure event channels via hvmop_set_evtchn_upcall_vector(),
or which hit %cr8 emulation will cause Xen to fall over a NULL vlapic->regs
pointer.
* On Intel hardware, disabling the TPR_SHADOW execution control without
reenabling CR8_{LOAD,STORE} interception means that the guests %cr8
accesses interact with the real TPR. Amongst other things, setting the
real TPR to 0xf blocks even IPIs from interrupting this CPU.
* On hardware which sets up the use of Interrupt Posting, including
IOMMU-Posting, guests run without the appropriate non-root configuration,
which at a minimum will result in dropped interrupts.
Whether no-LAPIC mode is of any use at all remains to be seen.
This is XSA-256.
Reported-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0aa6158b674c5d083b75ac8fcd1e7ae92d0c39ae
master date: 2018-02-27 14:08:36 +0100
Jan Beulich [Tue, 27 Feb 2018 13:19:19 +0000 (14:19 +0100)]
gnttab: don't blindly free status pages upon version change
There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into one dealing with the page attributes and
the second (when the first fully passed) freeing the pages. Return an
error if any pages still have pending references.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38bfcc165dda5f4284d7c218b91df9e144ddd88d
master date: 2018-02-27 14:07:12 +0100
Jan Beulich [Tue, 27 Feb 2018 13:18:34 +0000 (14:18 +0100)]
gnttab/ARM: don't corrupt shared GFN array
... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.
While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.
This is part of XSA-255.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d2f8f9c65d4da35437f50ed9e812a2c5ab313e2
master date: 2018-02-27 14:04:44 +0100
Jan Beulich [Tue, 27 Feb 2018 13:17:36 +0000 (14:17 +0100)]
memory: don't implicitly unpin for decrease-reservation
It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.
This is XSA-252.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d798a0952903db9d8ee0a580e03f214d2b49b7d7
master date: 2018-02-27 14:03:27 +0100
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs
The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPUs (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always use the boot CPU MIDR.
Fix is_affected_midr_range to use the current CPU MIDR.
Julien Grall [Tue, 6 Feb 2018 15:53:24 +0000 (15:53 +0000)]
xen/arm: vsmc: Don't implement function IDs that don't exist
The current implementation of SMCCC relies on the fact only the function
number (bits [15:0]) is enough to identify what to implement.
However, PSCI call are only available in the range 0x84000000-0x8400001F
and 0xC4000000-0xC400001F. Furthermore, not all SMC32 functions have
equivalent in the SMC64. This is the case of:
* PSCI_VERSION
* CPU_OFF
* MIGRATE_INFO_TYPE
* SYSTEM_OFF
* SYSTEM_RESET
Similarly call count, call uid, revision can only be query using smc32/hvc32
fast calls (See 6.2 in ARM DEN 0028B).
Xen should only implement identifier existing in the specification in
order to avoid potential clashes with later revision. Therefore rework the
vsmc code to use the whole function identifier rather than only the
function number.
At the same time, the new macros for call count, call uid, revision are
renamed to better suit the spec.
Lastly, update SSSC_SMCCC_FUNCTION_COUNT to match the correct number of
funtions. Note that version is not updated because the number has always
been wrong, and nobody could properly use it.
Andrew Cooper [Thu, 8 Feb 2018 10:54:52 +0000 (11:54 +0100)]
x86/idle: Clear SPEC_CTRL while idle
On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads. It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.
Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline. Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4c7e478d597b0346eef3a256cfd6794ac778b608
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:54:12 +0000 (11:54 +0100)]
x86/cpuid: Offer Indirect Branch Controls to guests
With all infrastructure in place, it is now safe to let guests see and use
these features.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 67c6838ddacfa646f9d1ae802bd0f16a935665b8
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:53:40 +0000 (11:53 +0100)]
x86/ctxt: Issue a speculation barrier between vcpu contexts
Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.
The cost of IBPB is substantial, and skipped on transition to idle, as Xen's
idle code is robust already. All transitions into vcpu context are fully
serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try and skip the flush.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a2ed643ed783020f885035432e9c0919756921d1
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:53:10 +0000 (11:53 +0100)]
x86/boot: Calculate the most appropriate BTI mitigation to use
See the logic and comments in init_speculation_mitigations() for further
details.
There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix determination of when to use IBRS
The original version of this logic was:
/*
* On Intel hardware, we'd like to use retpoline in preference to
* IBRS, but only if it is safe on this hardware.
*/
else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
{
if ( retpoline_safe() )
thunk = THUNK_RETPOLINE;
else
ibrs = true;
}
but it was changed by a request during review. Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.
This in practice means that on repoline-unsafe hardware without IBRS, we
select THUNK_JUMP despite intending to select THUNK_RETPOLINE.
Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2713715305ca516f698d58cec5e0b322c3b2c4eb
master date: 2018-01-26 14:10:21 +0000
master commit: 30cbd0c83ef3d0edac2d5bcc41a9a2b7a843ae58
master date: 2018-02-06 18:32:58 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:52:28 +0000 (11:52 +0100)]
x86/entry: Avoid using alternatives in NMI/#MC paths
This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.
For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.
Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing. This also covers
the safety of RSB conditional, as execution it is guaranteed to either hit the
WRMSR or LFENCE.
One downside of not using alternatives is that there unconditionally an LFENCE
in the IST path in cases where we are not using the features from IBRS-capable
microcode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3fffaf9c13e9502f09ad4ab1aac3f8b7b9398f6f
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:51:38 +0000 (11:51 +0100)]
x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code. Guests can take advantage of this in two ways:
1) If they can find a path in Xen which executes more ret instructions than
call instructions. (At least one in the waitqueue infrastructure,
probably others.)
2) Use the fact that the RSB/RAS in hardware is actually a circular stack
without a concept of empty. (When it logically empties, stale values
will start being used.)
To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e6c0128e9ab25bf66df11377a33ee5584d7f99e3
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:50:40 +0000 (11:50 +0100)]
x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context. See the documentation in
asm-x86/spec_ctrl_asm.h for details.
With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set. Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.
Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.
Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.
There is a semi-unrelated bugfix, where various asm_defn.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e7962901131186d3514528ed57c7a9901a15a3e
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:49:32 +0000 (11:49 +0100)]
x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
For performance reasons, HVM guests should have direct access to these MSRs
when possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5a2fe171144ebcc908ea1fca45058d6010f6a286
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:49:02 +0000 (11:49 +0100)]
x86/migrate: Move MSR_SPEC_CTRL on migrate
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0cf2a4eb769302b7d7d7835540e7b2f15006df30
master date: 2018-01-26 14:10:21 +0000
MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea58a679a6190e714a592f1369b660769a48a80c
master date: 2018-01-26 14:10:21 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:47:41 +0000 (11:47 +0100)]
x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit. AMD specifies IBPB alone in a 3rd bit.
AMD's IBPB is a subset of Intel's combined IBRS/IBPB. For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.
The behaviour of STIBP is more complicated.
It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware. However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.
For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available. This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.
Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d297b56682e730d598e2529cc6998151d3b6f6f8
master date: 2018-01-26 14:10:21 +0000
Wei Liu [Thu, 8 Feb 2018 10:45:19 +0000 (11:45 +0100)]
x86: fix GET_STACK_END
AIUI the purpose of having the .if directive is to make GET_STACK_END
work with any general purpose registers. The code as-is would produce
the wrong result for r8. Fix it.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8155476765a5bdecea1534b46562cf28e0113a9a
master date: 2018-01-25 11:34:17 +0000
Roger Pau Monné [Thu, 8 Feb 2018 10:44:49 +0000 (11:44 +0100)]
x86/acpi: process softirqs while printing CPU ACPI data
Or else the watchdog triggers on boxes with a huge number of CPUs
Reported-by: Simon Crowe <simon.crowe@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: a5579ee79ef8546dd47abe34d73dc9a69a14bbda
master date: 2018-01-24 18:02:14 +0100
Roger Pau Monné [Thu, 8 Feb 2018 10:44:15 +0000 (11:44 +0100)]
xen/x86: report domain id on cpuid
Use the ECX register of the hypervisor leaf 5. The EAX register on
this leaf is a flags field that can be used to notice the presence of
the domain id in ECX. Note that this is only available to HVM guests.
Andrew Cooper [Thu, 8 Feb 2018 10:42:36 +0000 (11:42 +0100)]
x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB
Instead of gaining yet another top level boolean, introduce a more generic
cpuid= option. Also introduce a helper function to parse a generic boolean
value.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cmdline: Fix parse_boolean() for unadorned values
A command line such as "cpuid=no-ibrsb,no-stibp" tickles a bug in
parse_boolean() because the separating comma fails the NUL case.
Instead, check for slen == nlen which accounts for the boundary (if any)
passed via the 'e' parameter.
Andrew Cooper [Thu, 8 Feb 2018 10:41:14 +0000 (11:41 +0100)]
x86/feature: Definitions for Indirect Branch Controls
Contemporary processors are gaining Indirect Branch Controls via microcode
updates. Intel are introducing one bit to indicate IBRS and IBPB support, and
a second bit for STIBP. AMD are introducing IBPB only, so enumerate it with a
separate bit.
Furthermore, depending on compiler and microcode availability, we may want to
run Xen with IBRS set, or clear.
To use these facilities, we synthesise separate IBRS and IBPB bits for
internal use. A lot of infrastructure is required before these features are
safe to offer to guests.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 0d703a701cc4bc47773986b2796eebd28b1439b5
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:07:02 +0000 (11:07 +0100)]
x86/amd: Try to set lfence as being Dispatch Serialising
This property is required for the AMD's recommended mitigation for Branch
Target Injection, but Xen needs to cope with being unable to detect or modify
the MSR.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe3ee5530a8d0d0b6a478167125d00c40f294a86
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:04:53 +0000 (11:04 +0100)]
x86: Support indirect thunks from assembly code
Introduce INDIRECT_CALL and INDIRECT_JMP which either degrade to a normal
indirect branch, or dispatch to the __x86_indirect_thunk_* symbols.
Update all the manual indirect branches in to use the new thunks. The
indirect branches in the early boot and kexec path are left intact as we can't
use the compiled-in thunks at those points.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7c508612f7a5096b4819d4ef2ce566e01bd66c0c
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:03:45 +0000 (11:03 +0100)]
x86: Support compiling with indirect branch thunks
Use -mindirect-branch=thunk-extern/-mindirect-branch-register when available.
To begin with, use the retpoline thunk. Later work will add alternative
thunks which can be selected at boot time.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 3659f0f4bcc6ca08103d1a7ae4e97535ecc978be
master date: 2018-01-16 17:45:50 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:01:29 +0000 (11:01 +0100)]
x86/entry: Erase guest GPR state on entry to Xen
This reduces the number of code gadgets which can be attacked with arbitrary
guest-controlled GPR values.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 03bd8c3a70d101fc2f8f36f1e171b7594462a4cd
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Thu, 8 Feb 2018 10:00:45 +0000 (11:00 +0100)]
x86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit
No practical change.
One side effect in debug builds is that %rbp is inverted in the manner
expected by the stack unwinder to indicate a interrupt frame.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 13682ca8c94bd5612a44f7f1edc1fd8ff675dacb
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Thu, 8 Feb 2018 09:57:44 +0000 (10:57 +0100)]
x86/entry: Rearrange RESTORE_ALL to restore register in stack order
Results in a more predictable (i.e. linear) memory access pattern.
No functional change.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: f85d105e27735f0e20aa30d77f03774f3ed55ae5
master date: 2018-01-05 19:57:08 +0000
Andrew Cooper [Thu, 8 Feb 2018 09:56:51 +0000 (10:56 +0100)]
x86: Introduce a common cpuid_policy_updated()
No practical change at the moment, but future changes will need to react
irrespective of guest type.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b357546b43ab87dfb10d740ae637a685134d5e32
master date: 2018-01-05 19:57:07 +0000
Andrew Cooper [Thu, 8 Feb 2018 09:56:01 +0000 (10:56 +0100)]
x86/hvm: Rename update_guest_vendor() callback to cpuid_policy_changed()
It will shortly be used for more than just changing the vendor.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3bea00966eb6680410c89df764d075a8fbacc3cc
master date: 2018-01-05 19:57:07 +0000
Andrew Cooper [Thu, 8 Feb 2018 09:54:11 +0000 (10:54 +0100)]
x86/alt: Break out alternative-asm into a separate header file
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 9d7b4351d3bb5c744db311cffa57ba3ebb583327
master date: 2018-01-05 19:57:07 +0000
Julien Grall [Fri, 2 Feb 2018 14:19:25 +0000 (14:19 +0000)]
xen/arm32: entry: Document the purpose of r11 in the traps handler
It took me a bit of time to understand why __DEFINE_TRAP_ENTRY is
storing the original stack pointer in r11. It is working in pair with
return_traps_entry where sp will be restored from r11.
This is fine because per the AAPCS r11 must be preserved by the
subroutine. So in return_from_trap, r11 will still contain the original
stack pointer.
Add some documentation in the code to point the 2 sides to each other.
Julien Grall [Fri, 2 Feb 2018 14:19:24 +0000 (14:19 +0000)]
xen/arm32: Invalidate icache on guest exist for Cortex-A15
In order to avoid aliasing attacks against the branch predictor on
Cortex A-15, let's invalidate the BTB on guest exit, which can only be
done by invalidating the icache (with ACTLR[0] being set).
We use the same hack as for A12/A17 to perform the vector decoding.
This is based on Linux patch from the kpti branch in [1].
Julien Grall [Fri, 2 Feb 2018 14:19:23 +0000 (14:19 +0000)]
xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12
In order to avoid aliasing attackes agains the branch predictor, let's
invalidate the BTB on guest exist. This is made complicated by the fact
that we cannot take a branch invalidating the BTB.
This is based on the fourth version posted by Marc Zyngier on Linux-arm
mailing list (see [1]).
Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Aliasing attacked against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initiatial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.
Most of mitigations will have to be applied when entering to the
hypervisor from the guest context.
Because the attack is against branch predictor, it is not possible to
safely use branch instruction before the mitigation is applied.
Therefore this has to be done in the vector entry before jump to the
helper handling a given exception.
However, on arm32, each vector contain a single instruction. This means
that the hardened vector tables may rely on the state of registers that
does not hold when in the hypervisor (e.g SP is 8 bytes aligned).
Therefore hypervisor code running with guest vectors table should be
minimized and always have IRQs and SErrors masked to reduce the risk to
use them.
This patch provides an infrastructure to switch vector tables before
entering to the guest and when leaving it.
Note that alternative could have been used, but older Xen (4.8 or
earlier) doesn't have support. So avoid using alternative to ease
backporting.
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry
At the moment, the reset vector is defined as .word 0 (e.g andeq r0, r0,
r0).
This is rather unintuitive and will result to execute the trap
undefined. Instead introduce trap helpers for reset and will generate an
error message in the unlikely case that reset will be called.
Ian Jackson [Wed, 31 Jan 2018 13:05:08 +0000 (13:05 +0000)]
SUPPORT.md: Fix version and Initial-Release
Security-Support-Until should be `TBD'. We need to answer these
questions properly, but let's not block fixing the obvious bugs here
for that policy discussion.
CC: Andrew Cooper <andrew.cooper3@citrix.com> CC: Lars Kurth <lars.kurth@citrix.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Mon, 22 Jan 2018 14:35:42 +0000 (14:35 +0000)]
xen/arm: cpuerrata: Remove percpu.h include
The include percpu.h was added by mistake in cpuerrata.h (see commit 4c4fddc166 "xen/arm64: Add skeleton to harden the branch aliasing
attacks"). So remove it.
Julien Grall [Tue, 16 Jan 2018 14:23:37 +0000 (14:23 +0000)]
xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Cortex-A57, A72, A73 and A75 are susceptible to branch predictor
aliasing and can theoritically be attacked by malicious code.
This patch implements a PSCI-based mitigation for these CPUs when
available. The call into firmware will invalidate the branch predictor
state, preventing any malicious entries from affection other victim
contexts.
Ported from Linux git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti.
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com>
This is part of XSA-254.
Julien Grall [Tue, 16 Jan 2018 14:23:36 +0000 (14:23 +0000)]
xen/arm64: Add skeleton to harden the branch predictor aliasing attacks
Aliasing attacked against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.
This patch adds initial skeleton code behind a new Kconfig option to
enable implementation-specific mitigations against these attacks for
CPUs that are affected.
Most of the mitigations will have to be applied when entering to the
hypervisor from the guest context. For safety, it is applied at every
exception entry. So there are potential for optimizing when receiving
an exception at the same level.
Because the attack is against branch predictor, it is not possible to
safely use branch instruction before the mitigation is applied.
Therefore, this has to be done in the vector entry before jump to the
helper handling a given exception.
On Arm64, each vector can hold 32 instructions. This leave us 31
instructions for the mitigation. The last one is the branch instruction
to the helper.
Because a platform may have CPUs with different micro-architectures,
per-CPU vector table needs to be provided. Realistically, only a few
different mitigations will be necessary. So provide a small set of
vector tables. They will be re-used and patch with the mitigations
on-demand.
This is based on the work done in Linux (see [1]).
Julien Grall [Tue, 16 Jan 2018 14:23:33 +0000 (14:23 +0000)]
xen/arm: Introduce enable callback to enable a capabilities on each online CPU
Once Xen knows what features/workarounds present on the platform, it
might be necessary to configure each online CPU.
Introduce a new callback "enable" that will be called on each online CPU to
configure the "capability".
The code is based on Linux v4.14 (where cpufeature.c comes from), the
explanation of why using stop_machine_run is kept as we have similar
problem in the future.
Lastly introduce enable_errata_workaround that will be called once CPUs
have booted and before the hardware domain is created.
Jan Beulich [Wed, 17 Jan 2018 15:57:33 +0000 (16:57 +0100)]
x86: allow Meltdown band-aid to be disabled
First of all we don't need it on AMD systems. Additionally allow its use
to be controlled by command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve
the intended effect - while we likely want it, this will be later
follow-up.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e871e80c38547d9faefc6604532ba3e985e65873
master date: 2018-01-16 17:50:59 +0100
Jan Beulich [Wed, 17 Jan 2018 15:56:57 +0000 (16:56 +0100)]
x86: Meltdown band-aid against malicious 64-bit PV guests
This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.
The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.
This isn't providing full isolation yet, but it should be covering all
pieces of information exposure of which would otherwise require an XSA.
There is certainly much room for improvement, especially of performance,
here - first and foremost suppressing all the negative effects on AMD
systems. But in the interest of backportability (including to really old
hypervisors, which may not even have alternative patching) any such is
being left out here.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
master date: 2018-01-16 17:49:03 +0100
Jan H. Schönherr [Wed, 17 Jan 2018 15:55:26 +0000 (16:55 +0100)]
x86: Don't use potentially incorrect CPUID values for topology information
Intel says for CPUID leaf 0Bh:
"Software must not use EBX[15:0] to enumerate processor
topology of the system. This value in this field
(EBX[15:0]) is only intended for display/diagnostic
purposes. The actual number of logical processors
available to BIOS/OS/Applications may be different from
the value of EBX[15:0], depending on software and platform
hardware configurations."
And yet, we're using them to derive the number cores in a package
and the number of siblings in a core.
Derive the number of siblings and cores from EAX instead, which is
intended for that.
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d51baf310e530659f73e714acf575555bdc46303
master date: 2018-01-08 10:48:24 +0000
Andrew Cooper [Wed, 17 Jan 2018 15:54:44 +0000 (16:54 +0100)]
x86/entry: Remove support for partial cpu_user_regs frames
Save all GPRs on entry to Xen.
The entry_int82() path is via a DPL1 gate, only usable by 32bit PV guests, so
can get away with only saving the 32bit registers. All other entrypoints can
be reached from 32 or 64bit contexts.
This is part of XSA-254.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wei.liu2@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f9eb74789af77e985ae653193f3622263499f674
master date: 2018-01-05 19:57:07 +0000
Roger Pau Monné [Wed, 17 Jan 2018 15:54:10 +0000 (16:54 +0100)]
x86/upcall: inject a spurious event after setting upcall vector
In case the vCPU has pending events to inject. This fixes a bug that
happened if the guest mapped the vcpu info area using
VCPUOP_register_vcpu_info without having setup the event channel
upcall, and then setup the upcall vector.
In this scenario the guest would not receive any upcalls, because the
call to VCPUOP_register_vcpu_info would have marked the vCPU as having
pending events, but the vector could not be injected because it was
not yet setup.
This has not caused issues so far because all the consumers first
setup the vector callback and then map the vcpu info page, but there's
no limitation that prevents doing it in the inverse order.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5b8ca7dffde866d851f0b87b994e0b13e5b867
master date: 2018-01-04 14:29:16 +0100
Jan Beulich [Wed, 17 Jan 2018 15:53:23 +0000 (16:53 +0100)]
x86/E820: don't overrun array
The bounds check needs to be done after the increment, not before, or
else it needs to use a one lower immediate. Also use word operations
rather than byte ones for both the increment and the compare (allowing
E820_BIOS_MAX to be more easily bumped, should the need ever arise).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0036c9dbcd8b52316aeebb475929d3a36cf5e514
master date: 2018-01-03 11:03:56 +0100
Julien Grall [Wed, 6 Dec 2017 14:51:37 +0000 (14:51 +0000)]
xen/arm: gic-v3: Bail out if gicv3_cpu_init fail
When system registers are not enabled, all the access to them will trap
in EL2. In Xen, system registers will be enabled by gicv3_cpu_init only
on success. As the rest of the code (e.g gicv3_hyp_init) relies on
system register, it is better to bail out directly.
This will save time on debugging early boot issue on GICv3 platform.
Andrew Cooper [Thu, 4 Jan 2018 13:32:01 +0000 (14:32 +0100)]
x86/msr: Free msr_vcpu_policy during vcpu destruction
c/s 4187f79dc7 "x86/msr: introduce struct msr_vcpu_policy" introduced a
per-vcpu memory allocation, but failed to free it in the clean vcpu
destruction case.
This is XSA-253.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e204e60f77702bf5c884dd37c3f1b01f14e396ae
master date: 2018-01-04 14:27:38 +0100
Andrew Cooper [Wed, 20 Dec 2017 14:45:32 +0000 (15:45 +0100)]
x86/vmx: Don't use hvm_inject_hw_exception() in long_mode_do_msr_write()
Since c/s 49de10f3c1718 "x86/hvm: Don't raise #GP behind the emulators back
for MSR accesses", returning X86EMUL_EXCEPTION has pushed the exception
generation to the top of the call tree.
Using hvm_inject_hw_exception() and returning X86EMUL_EXCEPTION causes a
double #GP injection, which combines to #DF.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 896ee3980e72866b602e743396751384de301fb0
master date: 2017-12-14 18:05:45 +0000
Jan Beulich [Wed, 20 Dec 2017 14:44:20 +0000 (15:44 +0100)]
gnttab: improve GNTTABOP_cache_flush locking
Dropping the lock before returning from grant_map_exists() means handing
possibly stale information back to the caller. Return back the pointer
to the active entry instead, for the caller to release the lock once
done.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andre Przywara <andre.przywara@linaro.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 553ac37137c2d1c03bf1b69cfb192ffbfe29daa4
master date: 2017-12-04 11:04:18 +0100
Jann validly points out that with a caller bogusly requesting a zero-
element batch with non-zero high command bits (the ones used for
continuation encoding), the assertion right before the call to
hypercall_create_continuation() would trigger. A similar situation would
arise afaict for non-empty batches with op and/or length zero in every
element.
While we want the former to succeed (as we do elsewhere for similar
no-op requests), the latter can clearly be converted to an error, as
this is a state that can't be the result of a prior operation.
Take the opportunity and also correct the order of argument checks:
We shouldn't accept zero-length elements with unknown bits set in "op".
Also constify cache_flush()'s first parameter.
Tom Lendacky [Wed, 20 Dec 2017 14:43:14 +0000 (15:43 +0100)]
x86/microcode: Add support for fam17h microcode loading
The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes. Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Linux commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf]
Ported to Xen.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d458ba8c171809e8dd9abd19339c87f3f934ca
master date: 2017-12-13 14:30:10 +0000
Release builds work fine, which is a first indication that the assertion
isn't really needed.
What's worse though - there appears to be a timing window where the
guest runs in shadow mode, but not in log-dirty mode, and that is what
triggers the assertion (the same could, afaict, be achieved by test-
enabling shadow mode on a PV guest). This is because turing off log-
dirty mode is being performed in two steps: First the log-dirty bit gets
cleared (paging_log_dirty_disable() [having paused the domain] ->
sh_disable_log_dirty() -> shadow_one_bit_disable()), followed by
unpausing the domain and only then clearing shadow mode (via
shadow_test_disable(), which pauses the domain a second time).
Hence besides removing the ASSERT() here (or optionally replacing it by
explicit translate and refcounts mode checks, but this seems rather
pointless now that the three are tied together) I wonder whether either
shadow_one_bit_disable() should turn off shadow mode if no other bit
besides PG_SH_enable remains set (just like shadow_one_bit_enable()
enables it if not already set), or the domain pausing scope should be
extended so that both steps occur without the domain getting a chance to
run in between.
Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b95f7be32d668fa4b09300892ebe19636ecebe36
master date: 2017-12-12 16:56:15 +0100
Daniel Kiper [Wed, 20 Dec 2017 14:42:13 +0000 (15:42 +0100)]
x86/mb2: avoid Xen image when looking for module/crashkernel position
Commit e22e1c4 (x86/EFI: avoid Xen image when looking for module/kexec
position) added relevant check for EFI case. However, since commit f75a304 (x86: add multiboot2 protocol support for relocatable images)
Multiboot2 compatible bootloaders are able to relocate Xen image too.
So, we have to avoid also Xen image region in such cases.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9589927e5bf9e123ec42b6e0b0809f153bd92732
master date: 2017-12-12 14:30:53 +0100
Sergey Dyasli [Wed, 20 Dec 2017 14:41:33 +0000 (15:41 +0100)]
x86/vvmx: don't enable vmcs shadowing for nested guests
Running "./xtf_runner vvmx" in L1 Xen under L0 Xen produces the
following result on H/W with VMCS shadowing:
Test: vmxon
Failure in test_vmxon_in_root_cpl0()
Expected 0x8200000f: VMfailValid(15) VMXON_IN_ROOT
Got 0x82004400: VMfailValid(17408) <unknown>
Test result: FAILURE
This happens because SDM allows vmentries with enabled VMCS shadowing
VM-execution control and VMCS link pointer value of ~0ull. But results
of a nested VMREAD are undefined in such cases.
Fix this by not copying the value of VMCS shadowing control from vmcs01
to vmcs02.
Andrew Cooper [Wed, 20 Dec 2017 14:40:58 +0000 (15:40 +0100)]
xen/pv: Construct d0v0's GDT properly
c/s cf6d39f8199 "x86/PV: properly populate descriptor tables" changed the GDT
to reference zero_page for intermediate frames between the guest and Xen
frames.
Because dom0_construct_pv() doesn't call arch_set_info_guest(), some bits of
initialisation are missed, including the pv_destroy_gdt() which initially
fills the references to zero_page.
In practice, this means there is a window between starting and the first call
to HYPERCALL_set_gdt() were lar/lsl/verr/verw suffer non-architectural
behaviour.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 08f27f4468eedbeccaac9fdda4ef732247efd74e
master date: 2017-12-01 19:03:26 +0000
Jan Beulich [Fri, 8 Dec 2017 15:32:05 +0000 (15:32 +0000)]
x86: don't wrongly trigger linear page table assertion (2)
_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the typ which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).
Jan Beulich [Fri, 8 Dec 2017 15:27:14 +0000 (15:27 +0000)]
x86/shadow: fix ref-counting error handling
The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.
In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.
This is XSA-250.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Fri, 8 Dec 2017 15:27:14 +0000 (15:27 +0000)]
x86/shadow: fix refcount overflow check
Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.
Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
This is XSA-249.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Tim Deegan <tim@xen.org>
Jan Beulich [Fri, 8 Dec 2017 15:27:14 +0000 (15:27 +0000)]
x86/mm: don't wrongly set page ownership
PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally. At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.
While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.
There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.
This is XSA-248.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Tim Deegan <tim@xen.org> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Jan Beulich [Thu, 7 Dec 2017 09:59:22 +0000 (10:59 +0100)]
x86/HVM: don't retain emulated insn cache when exiting back to guest
vio->mmio_retry is being set when a repeated string insn is being split
up. In that case we'll exit to the guest, expecting immediate re-entry.
Interruptions, however, may be serviced by the guest before re-entry
from the repeated string insn. Any emulation needed in the course of
handling the interruption must not fetch from the internally maintained
cache.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
master commit: 5fcb26e69e8089e20c9168774bee681b8f5a3187
master date: 2017-12-06 12:50:23 +0100
Paul Durrant [Tue, 28 Nov 2017 14:05:19 +0000 (14:05 +0000)]
x86/hvm: fix interaction between internal and external emulation
A call to handle_hvm_io_completion() is needed for completing I/O
that requires external emulation. Such completion should be requested when
hvm_vcpu_io_need_completion() returns true after hvm_emulate_once() has
completed. This is indicative of the underlying I/O emulation having
returned X86EMUL_RETRY and hence a re-emulation of the instruction is
needed to pick up the result of the I/O.
A call to handle_hvm_io_completion() is NOT needed when the underlying
I/O has not returned X86EMUL_RETRY since there will be no result to pick
up. Hence it bogus to request such completion when mmio_retry is set,
since this can only happen if the underlying I/O emulation has returned
X86EMUL_OKAY (meaning the I/O has completed successfully).
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit 9c9384d6d8184ca6d21975ccf4e4f72b560540cc)
Andrew Cooper [Sat, 25 Nov 2017 15:17:14 +0000 (15:17 +0000)]
x86: Avoid corruption on migrate for vcpus using CPUID Faulting
Xen 4.8 and later virtualises CPUID Faulting support for guests. However, the
value of MSR_MISC_FEATURES_ENABLES is omitted from the vcpu state, meaning
that the current cpuid faulting setting is lost on migrate/suspend/resume.
Instead of following the MSR status quo, take the opportunity to make the
logic more generic, and in particular, trivial to extend for future MSRs.
This is done by discarding the notion of optional MSRs, and requiring the
toolstack to be prepared to move all of the MSRs, although only a subset will
typically need to move.
This allows for the use of guest_{rd,wr}msr() alone to evaluate whether an MSR
needs moving. This is a benefit because it means there is a single piece of
logic responsible for evaluating whether a guest can use an MSR, and which
values are acceptable.
One small adjustment to guest_wrmsr() is required to cope with being called in
toolstack context.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit b90f86be161c74df8cb69c98d9f22885d9d87114)
Andre Przywara [Thu, 16 Nov 2017 12:02:35 +0000 (12:02 +0000)]
arm64: ITS: fix cacheability adjustment
If the host GICv3 redistributor reports that the pending table cannot
use shareable memory, we try to drop the cacheability attributes as
well. However we fail horribly in doing computer science 101 bit
masking, effectively clearing the whole register instead of just a few
bits.
Fix this by removing the one redundant masking operation and adding the
magic negation for the actually needed other operation.
Ian Jackson [Tue, 14 Nov 2017 12:15:42 +0000 (12:15 +0000)]
tools: xentoolcore_restrict_all: Do deregistration before close
Closing the fd before unhooking it from the list runs the risk that a
concurrent thread calls xentoolcore_restrict_all will operate on the
old fd value, which might refer to a new fd by then. So we need to do
it in the other order.
Sadly this weakens the guarantee provided by xentoolcore_restrict_all
slightly, but not (I think) in a problematic way. It would be
possible to implement the previous guarantee, but it would involve
replacing all of the close() calls in all of the individual osdep
parts of all of the individual libraries with calls to a new function
which does
dup2("/dev/null", thing->fd);
pthread_mutex_lock(&handles_lock);
thing->fd = -1;
pthread_mutex_unlock(&handles_lock);
close(fd);
which would be terribly tedious.
Signed-off-by: Ian Jackson <Ian.Jackson@eu.citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
As a follow-up to XSA-212 we should have addressed a similar issue here:
The handles being advanced at the top of xenmem_add_to_physmap_batch()
means we allow hypervisor space accesses (in particular, for "errs",
writes) with suitably crafted input arguments. This isn't a security
issue in this case because of the limited width of struct
xen_add_to_physmap_batch's size field: It being 16-bits wide, only the
r/o M2P area can be accessed. Still we can and should do better.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 28 Nov 2017 12:14:43 +0000 (13:14 +0100)]
x86: check paging mode earlier in xenmem_add_to_physmap_one()
There's no point in deferring this until after some initial processing,
and it's actively wrong for the XENMAPSPACE_gmfn_foreign handling to not
have such a check at all.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
Jan Beulich [Tue, 28 Nov 2017 12:14:10 +0000 (13:14 +0100)]
x86: replace bad ASSERT() in xenmem_add_to_physmap_one()
There are no locks being held, i.e. it is possible to be triggered by
racy hypercall invocations. Subsequent code doesn't really depend on the
checked values, so this is not a security issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Julien Grall <julien.grall@linaro.org>
George Dunlap [Tue, 28 Nov 2017 12:13:26 +0000 (13:13 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation
If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.
Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail. It that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.
Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.
Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.
Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order. Nonetheless,
check the return value and crash if our assumption turns otu to be
wrong.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
George Dunlap [Tue, 28 Nov 2017 12:13:03 +0000 (13:13 +0100)]
p2m: Always check to see if removing a p2m entry actually worked
The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.
Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.
The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m. If the guest
afterwards ballons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.
There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before. Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.
While we're here, use PAGE_ORDER_2M rather than a magic constant.
This is part of XSA-247.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Tue, 28 Nov 2017 12:11:55 +0000 (13:11 +0100)]
x86/pod: prevent infinite loop when shattering large pages
When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).
p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated
resulting to the callers to retry infinitely the PoD.
Prevent the infinite loop by return false when it is not possible to
shatter the large mapping.
This is XSA-246.
Signed-off-by: Julien Grall <julien.grall@linaro.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>