xenbits.xensource.com Git - xen.git/log
7 years ago  gnttab: don't blindly free status pages upon version change
Jan Beulich [Tue, 27 Feb 2018 13:32:32 +0000 (14:32 +0100)]
gnttab: don't blindly free status pages upon version change

There may still be active mappings, which would trigger the respective
BUG_ON(). Split the loop into two: the first deals with the page
attributes, and the second (run only once the first has fully passed)
frees the pages. Return an error if any pages still have pending
references.

This is part of XSA-255.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 38bfcc165dda5f4284d7c218b91df9e144ddd88d
master date: 2018-02-27 14:07:12 +0100

7 years ago  gnttab/ARM: don't corrupt shared GFN array
Jan Beulich [Tue, 27 Feb 2018 13:32:14 +0000 (14:32 +0100)]
gnttab/ARM: don't corrupt shared GFN array

... by writing status GFNs to it. Introduce a second array instead.
Also implement gnttab_status_gmfn() properly now that the information is
suitably being tracked.

While touching it anyway, remove a misguided (but luckily benign) upper
bound check from gnttab_shared_gmfn(): We should never access beyond the
bounds of that array.

This is part of XSA-255.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9d2f8f9c65d4da35437f50ed9e812a2c5ab313e2
master date: 2018-02-27 14:04:44 +0100

7 years ago  memory: don't implicitly unpin for decrease-reservation
Jan Beulich [Tue, 27 Feb 2018 13:31:30 +0000 (14:31 +0100)]
memory: don't implicitly unpin for decrease-reservation

It very likely was a mistake (copy-and-paste from domain cleanup code)
to implicitly unpin here: The caller should really unpin itself before
(or after, if they so wish) requesting the page to be removed.

This is XSA-252.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d798a0952903db9d8ee0a580e03f214d2b49b7d7
master date: 2018-02-27 14:03:27 +0100

7 years ago  x86/PV: correctly count MSRs to migrate
Jan Beulich [Fri, 23 Feb 2018 09:20:10 +0000 (10:20 +0100)]
x86/PV: correctly count MSRs to migrate

Signed-off-by: Jan Beulich <jbeulich@suse.com>
7 years ago  xen/arm: cpuerrata: Actually check errata on non-boot CPUs
Julien Grall [Wed, 14 Feb 2018 12:22:23 +0000 (12:22 +0000)]
xen/arm: cpuerrata: Actually check errata on non-boot CPUs

The cpu errata framework was introduced in commit 8b01f6364f "xen/arm:
Detect silicon revision and set cap bits accordingly" and was meant to
detect errata present on any CPU (via check_local_cpu_errata). However,
the function to check the MIDR (is_affected_midr_range) mistakenly
always uses the boot CPU MIDR.

Fix is_affected_midr_range to use the current CPU MIDR.

Reported-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 27196d4cc917d91b5b5daee50173565139ca9c9d)

7 years ago  tools/kdd: don't use a pointer to an unaligned field.
Tim Deegan [Thu, 15 Feb 2018 12:10:54 +0000 (13:10 +0100)]
tools/kdd: don't use a pointer to an unaligned field.

The 'val' field in the packet is byte-aligned (because it is part of a
packed struct), but the pointer argument to kdd_rdmsr() has the normal
alignment constraints for a uint64_t *.  Use a local variable to make sure
the passed pointer has the correct alignment.
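
As a rough sketch of the pattern (types and prototype simplified and
hypothetical; the real kdd code differs), the compiler handles the packed,
byte-aligned field while the callee only ever sees a properly aligned pointer:

    #include <stdint.h>

    struct __attribute__((packed)) kdd_pkt_sketch {
        uint8_t  type;
        uint64_t val;                    /* only byte-aligned due to the packing */
    };

    int kdd_rdmsr(uint32_t msr, uint64_t *val);   /* simplified prototype */

    static int read_msr_into_packet(struct kdd_pkt_sketch *pkt, uint32_t msr)
    {
        uint64_t v;                      /* local with natural uint64_t alignment */
        int rc = kdd_rdmsr(msr, &v);     /* &v satisfies the alignment constraint */

        pkt->val = v;                    /* compiler emits an unaligned-safe store */
        return rc;
    }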

Reported-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Tim Deegan <tim@xen.org>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: bfd9a2095f1882e8c074b2d911bcb07d12cf6cf5
master date: 2017-03-15 10:57:00 +0000

7 years ago  libxc: fix build (introduce _AC())
Jan Beulich [Thu, 15 Feb 2018 09:15:45 +0000 (10:15 +0100)]
libxc: fix build (introduce _AC())

Signed-off-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86: fix build with older tool chain
Jan Beulich [Wed, 14 Feb 2018 11:06:22 +0000 (12:06 +0100)]
x86: fix build with older tool chain

Signed-off-by: Jan Beulich <jbeulich@suse.com>
7 years ago  x86/idle: Clear SPEC_CTRL while idle
Andrew Cooper [Wed, 14 Feb 2018 10:45:01 +0000 (11:45 +0100)]
x86/idle: Clear SPEC_CTRL while idle

On contemporary hardware, setting IBRS/STIBP has a performance impact on
adjacent hyperthreads.  It is therefore recommended to clear the setting
before becoming idle, to avoid an idle core preventing adjacent userspace
execution from running at full performance.

Care must be taken to ensure there are no ret or indirect branch instructions
between spec_ctrl_{enter,exit}_idle() invocations, which are forced always
inline.  Care must also be taken to avoid using spec_ctrl_enter_idle() between
flushing caches and becoming idle, in cases where that matters.
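
A minimal sketch of the shape of such helpers, assuming Xen's wrmsrl() and
barrier() primitives (the real spec_ctrl_{enter,exit}_idle() differ in detail):

    #define MSR_SPEC_CTRL_SKETCH  0x00000048      /* IA32_SPEC_CTRL */

    static inline void spec_ctrl_enter_idle_sketch(void)
    {
        wrmsrl(MSR_SPEC_CTRL_SKETCH, 0);          /* stop penalising the sibling */
        barrier();                                /* no reordering by the compiler */
    }

    static inline void spec_ctrl_exit_idle_sketch(uint64_t xen_spec_ctrl)
    {
        wrmsrl(MSR_SPEC_CTRL_SKETCH, xen_spec_ctrl);  /* restore Xen's value */
        barrier();
    }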

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4c7e478d597b0346eef3a256cfd6794ac778b608
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/cpuid: Offer Indirect Branch Controls to guests
Andrew Cooper [Wed, 14 Feb 2018 10:44:23 +0000 (11:44 +0100)]
x86/cpuid: Offer Indirect Branch Controls to guests

With all infrastructure in place, it is now safe to let guests see and use
these features.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 67c6838ddacfa646f9d1ae802bd0f16a935665b8
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/ctxt: Issue a speculation barrier between vcpu contexts
Andrew Cooper [Wed, 14 Feb 2018 10:43:57 +0000 (11:43 +0100)]
x86/ctxt: Issue a speculation barrier between vcpu contexts

Issuing an IBPB command flushes the Branch Target Buffer, so that any poison
left by one vcpu won't remain when beginning to execute the next.

The cost of IBPB is substantial, and skipped on transition to idle, as Xen's
idle code is robust already.  All transitions into vcpu context are fully
serialising in practice (and under consideration for being retroactively
declared architecturally serialising), so a cunning attacker cannot use SP1 to
try and skip the flush.
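
To illustrate the mechanism (a sketch only, with an invented helper name and
assuming Xen's wrmsrl(); the real logic lives in the context switch path):
IBPB is a write of bit 0 to MSR_PRED_CMD (0x49).

    #define MSR_PRED_CMD_SKETCH   0x00000049
    #define PRED_CMD_IBPB_SKETCH  (1u << 0)

    static void maybe_flush_branch_predictor(bool next_is_idle)
    {
        if ( next_is_idle )
            return;                      /* Xen's idle context is considered safe */

        /* Flush the BTB so the previous vcpu's poison cannot be consumed. */
        wrmsrl(MSR_PRED_CMD_SKETCH, PRED_CMD_IBPB_SKETCH);
    }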

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a2ed643ed783020f885035432e9c0919756921d1
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/boot: Calculate the most appropriate BTI mitigation to use
Andrew Cooper [Wed, 14 Feb 2018 10:43:28 +0000 (11:43 +0100)]
x86/boot: Calculate the most appropriate BTI mitigation to use

See the logic and comments in init_speculation_mitigations() for further
details.

There are two controls for RSB overwriting, because in principle there are
cases where it might be safe to forego rsb_native (Off the top of my head,
SMEP active, no 32bit PV guests at all, no use of vmevent/paging subsystems
for HVM guests, but I make no guarantees that this list of restrictions is
exhaustive).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/spec_ctrl: Fix determination of when to use IBRS

The original version of this logic was:

    /*
     * On Intel hardware, we'd like to use retpoline in preference to
     * IBRS, but only if it is safe on this hardware.
     */
    else if ( boot_cpu_has(X86_FEATURE_IBRSB) )
    {
        if ( retpoline_safe() )
            thunk = THUNK_RETPOLINE;
        else
            ibrs = true;
    }

but it was changed by a request during review.  Sadly, the result is buggy as
it breaks the later fallback logic by allowing IBRS to appear as available
when in fact it isn't.

This in practice means that on retpoline-unsafe hardware without IBRS, we
select THUNK_JUMP despite intending to select THUNK_RETPOLINE.

Reported-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2713715305ca516f698d58cec5e0b322c3b2c4eb
master date: 2018-01-26 14:10:21 +0000
master commit: 30cbd0c83ef3d0edac2d5bcc41a9a2b7a843ae58
master date: 2018-02-06 18:32:58 +0000

7 years ago  x86/entry: Avoid using alternatives in NMI/#MC paths
Andrew Cooper [Wed, 14 Feb 2018 10:42:51 +0000 (11:42 +0100)]
x86/entry: Avoid using alternatives in NMI/#MC paths

This patch is deliberately arranged to be easy to revert if/when alternatives
patching becomes NMI/#MC safe.

For safety, there must be a dispatch serialising instruction in (what is
logically) DO_SPEC_CTRL_ENTRY so that, in the case that Xen needs IBRS set in
context, an attacker can't speculate around the WRMSR and reach an indirect
branch within the speculation window.

Using conditionals opens this attack vector up, so the else clause gets an
LFENCE to force the pipeline to catch up before continuing.  This also covers
the safety of the RSB conditional, as execution is guaranteed to either hit the
WRMSR or LFENCE.

One downside of not using alternatives is that there is unconditionally an LFENCE
in the IST path in cases where we are not using the features from IBRS-capable
microcode.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3fffaf9c13e9502f09ad4ab1aac3f8b7b9398f6f
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen
Andrew Cooper [Wed, 14 Feb 2018 10:42:12 +0000 (11:42 +0100)]
x86/entry: Organise the clobbering of the RSB/RAS on entry to Xen

ret instructions are speculated directly to values recorded in the Return
Stack Buffer/Return Address Stack, as there is no uncertainty in well-formed
code.  Guests can take advantage of this in two ways:

  1) If they can find a path in Xen which executes more ret instructions than
     call instructions.  (At least one in the waitqueue infrastructure,
     probably others.)

  2) Use the fact that the RSB/RAS in hardware is actually a circular stack
     without a concept of empty.  (When it logically empties, stale values
     will start being used.)

To mitigate, overwrite the RSB on entry to Xen with gadgets which will capture
and contain rogue speculation.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e6c0128e9ab25bf66df11377a33ee5584d7f99e3
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point
Andrew Cooper [Wed, 14 Feb 2018 10:39:21 +0000 (11:39 +0100)]
x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point

We need to be able to either set or clear IBRS in Xen context, as well as
restore appropriate guest values in guest context.  See the documentation in
asm-x86/spec_ctrl_asm.h for details.

With the contemporary microcode, writes to %cr3 are slower when SPEC_CTRL.IBRS
is set.  Therefore, the positioning of SPEC_CTRL_{ENTRY/EXIT}* is important.

Ideally, the IBRS_SET/IBRS_CLEAR hunks might be positioned either side of the
%cr3 change, but that is rather more complicated to arrange, and could still
result in a guest controlled value in SPEC_CTRL during the %cr3 change,
negating the saving if the guest chose to have IBRS set.

Therefore, we optimise for the pre-Skylake case (being far more common in the
field than Skylake and later, at the moment), where we have a Xen-preferred
value of IBRS clear when switching %cr3.

There is a semi-unrelated bugfix, where various asm_defns.h macros have a
hidden dependency on PAGE_SIZE, which results in an assembler error if used in
a .macro definition.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e7962901131186d3514528ed57c7a9901a15a3e
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}
Andrew Cooper [Wed, 14 Feb 2018 10:38:30 +0000 (11:38 +0100)]
x86/hvm: Permit guests direct access to MSR_{SPEC_CTRL,PRED_CMD}

For performance reasons, HVM guests should have direct access to these MSRs
when possible.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 5a2fe171144ebcc908ea1fca45058d6010f6a286
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/migrate: Move MSR_SPEC_CTRL on migrate
Andrew Cooper [Wed, 14 Feb 2018 10:38:00 +0000 (11:38 +0100)]
x86/migrate: Move MSR_SPEC_CTRL on migrate

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0cf2a4eb769302b7d7d7835540e7b2f15006df30
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/msr: Emulation of MSR_{SPEC_CTRL,PRED_CMD} for guests
Andrew Cooper [Wed, 14 Feb 2018 10:37:28 +0000 (11:37 +0100)]
x86/msr: Emulation of MSR_{SPEC_CTRL,PRED_CMD} for guests

As per the spec currently available here:

https://software.intel.com/sites/default/files/managed/c5/63/336996-Speculative-Execution-Side-Channel-Mitigations.pdf

MSR_ARCH_CAPABILITIES will only come into existence on new hardware, but is
implemented as a straight #GP for now to avoid being leaky when new hardware
arrives.
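
A simplified sketch of that behaviour (not Xen's actual MSR dispatcher, and
assuming Xen's X86EMUL_* return codes), where a failed read is reported to the
guest as #GP:

    #define MSR_ARCH_CAPABILITIES_SKETCH  0x0000010a

    static int guest_rdmsr_sketch(uint32_t msr, uint64_t *val)
    {
        (void)val;                        /* nothing is returned on failure */

        switch ( msr )
        {
        case MSR_ARCH_CAPABILITIES_SKETCH:
            return X86EMUL_EXCEPTION;     /* surfaces as #GP in the guest */

        default:
            return X86EMUL_UNHANDLEABLE;  /* fall back to legacy handling */
        }
    }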

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea58a679a6190e714a592f1369b660769a48a80c
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests
Andrew Cooper [Wed, 14 Feb 2018 10:36:48 +0000 (11:36 +0100)]
x86/cpuid: Handling of IBRS/IBPB, STIBP and IBRS for guests

Intel specifies IBRS/IBPB (combined, in a single bit) and STIBP as a separate
bit.  AMD specifies IBPB alone in a 3rd bit.

AMD's IBPB is a subset of Intel's combined IBRS/IBPB.  For performance
reasons, administrators might wish to express "IBPB only" even on Intel
hardware, so we allow the AMD bit to be used for this purpose.

The behaviour of STIBP is more complicated.

It is our current understanding that STIBP will be advertised on HT-capable
hardware irrespective of whether HT is enabled, but not advertised on
HT-incapable hardware.  However, for ease of virtualisation, STIBP's
functionality is ignored rather than reserved by microcode/hardware on
HT-incapable hardware.

For guest safety, we treat STIBP as special, always override the toolstack
choice, and always advertise STIBP if IBRS is available.  This removes the
corner case where STIBP is not advertised, but the guest is running on
HT-capable hardware where it does matter.

Finally as a bugfix, update the libxc CPUID logic to understand the e8b
feature leaf, which has the side effect of also offering CLZERO to guests on
applicable hardware.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d297b56682e730d598e2529cc6998151d3b6f6f8
master date: 2018-01-26 14:10:21 +0000

7 years ago  x86: fix GET_STACK_END
Wei Liu [Wed, 14 Feb 2018 10:36:09 +0000 (11:36 +0100)]
x86: fix GET_STACK_END

AIUI the purpose of having the .if directive is to make GET_STACK_END
work with any general purpose register. The code as-is would produce
the wrong result for r8. Fix it.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 8155476765a5bdecea1534b46562cf28e0113a9a
master date: 2018-01-25 11:34:17 +0000

7 years ago  x86/acpi: process softirqs while printing CPU ACPI data
Roger Pau Monné [Wed, 14 Feb 2018 10:35:42 +0000 (11:35 +0100)]
x86/acpi: process softirqs while printing CPU ACPI data

Or else the watchdog triggers on boxes with a huge number of CPUs.
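
The pattern is simply to poll for softirqs inside the long printing loop; a
sketch (loop body invented, assuming Xen's process_pending_softirqs()):

    static void dump_cx_info_sketch(unsigned int nr_cpus)
    {
        unsigned int cpu;

        for ( cpu = 0; cpu < nr_cpus; cpu++ )
        {
            /* ... print this CPU's ACPI C-state data here ... */
            process_pending_softirqs();   /* keep the watchdog from firing */
        }
    }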

Reported-by: Simon Crowe <simon.crowe@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: a5579ee79ef8546dd47abe34d73dc9a69a14bbda
master date: 2018-01-24 18:02:14 +0100

7 years ago  x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB
Andrew Cooper [Wed, 14 Feb 2018 10:35:00 +0000 (11:35 +0100)]
x86/cmdline: Introduce a command line option to disable IBRS/IBPB, STIBP and IBPB

Instead of gaining yet another top level boolean, introduce a more generic
cpuid= option.  Also introduce a helper function to parse a generic boolean
value.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
xen/cmdline: Fix parse_boolean() for unadorned values

A command line such as "cpuid=no-ibrsb,no-stibp" tickles a bug in
parse_boolean() because the separating comma fails the NUL case.

Instead, check for slen == nlen which accounts for the boundary (if any)
passed via the 'e' parameter.
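
A rough sketch of the corrected check (illustrative only; Xen's real
parse_boolean() handles more spellings):

    #include <stdbool.h>
    #include <string.h>

    /* Returns 1/0 for a recognised (possibly "no-"-prefixed) name, -1 otherwise.
     * 's' points at the option, 'e' at its boundary (e.g. a ',') or is NULL. */
    static int parse_boolean_sketch(const char *name, const char *s, const char *e)
    {
        size_t slen = e ? (size_t)(e - s) : strlen(s);
        size_t nlen = strlen(name);
        bool val = true;

        if ( slen > 3 && !strncmp(s, "no-", 3) )
        {
            val = false;
            s += 3;
            slen -= 3;
        }

        /* slen == nlen copes with the ',' (or end-of-string) boundary. */
        return slen == nlen && !strncmp(s, name, nlen) ? val : -1;
    }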

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7850b1c00749df834ea2ad0c1f5d9364c4838795
master date: 2018-01-16 17:45:50 +0000
master commit: ac37ec1ddef234eeba6f438c29ff687c64962ebd
master date: 2018-01-31 10:47:12 +0000

7 years ago  x86/feature: Definitions for Indirect Branch Controls
Andrew Cooper [Wed, 14 Feb 2018 10:34:14 +0000 (11:34 +0100)]
x86/feature: Definitions for Indirect Branch Controls

Contemporary processors are gaining Indirect Branch Controls via microcode
updates.  Intel are introducing one bit to indicate IBRS and IBPB support, and
a second bit for STIBP.  AMD are introducing IBPB only, so enumerate it with a
separate bit.

Furthermore, depending on compiler and microcode availability, we may want to
run Xen with IBRS set, or clear.

To use these facilities, we synthesise separate IBRS and IBPB bits for
internal use.  A lot of infrastructure is required before these features are
safe to offer to guests.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
master commit: 0d703a701cc4bc47773986b2796eebd28b1439b5
master date: 2018-01-16 17:45:50 +0000

7 years ago  x86: Introduce alternative indirect thunks
Andrew Cooper [Wed, 14 Feb 2018 10:32:55 +0000 (11:32 +0100)]
x86: Introduce alternative indirect thunks

Depending on hardware and microcode availability, we will want to replace
IND_THUNK_RETPOLINE with other implementations.

For AMD hardware, choose IND_THUNK_LFENCE in preference to retpoline if lfence
is known to be (or was successfully made) dispatch serialising.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 858cba0d4c6b6b45180afcb41561fd6585ad51a3
master date: 2018-01-16 17:45:50 +0000

7 years ago  x86/amd: Try to set lfence as being Dispatch Serialising
Andrew Cooper [Wed, 14 Feb 2018 10:30:08 +0000 (11:30 +0100)]
x86/amd: Try to set lfence as being Dispatch Serialising

This property is required for AMD's recommended mitigation for Branch
Target Injection, but Xen needs to cope with being unable to detect or modify
the MSR.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fe3ee5530a8d0d0b6a478167125d00c40f294a86
master date: 2018-01-16 17:45:50 +0000

7 years ago  x86/boot: Report details of speculative mitigations
Andrew Cooper [Wed, 14 Feb 2018 10:22:59 +0000 (11:22 +0100)]
x86/boot: Report details of speculative mitigations

Nothing very interesting at the moment, but the logic will grow as new
mitigations are added.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 31d6c53adf6417bf449ca50e8416e41b64d46803
master date: 2018-01-16 17:45:50 +0000

7 years ago  x86: Support indirect thunks from assembly code
Andrew Cooper [Wed, 14 Feb 2018 10:20:45 +0000 (11:20 +0100)]
x86: Support indirect thunks from assembly code

Introduce INDIRECT_CALL and INDIRECT_JMP which either degrade to a normal
indirect branch, or dispatch to the __x86_indirect_thunk_* symbols.

Update all the manual indirect branches to use the new thunks.  The
indirect branches in the early boot and kexec path are left intact as we can't
use the compiled-in thunks at those points.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7c508612f7a5096b4819d4ef2ce566e01bd66c0c
master date: 2018-01-16 17:45:50 +0000

7 years ago  x86: Support compiling with indirect branch thunks
Andrew Cooper [Wed, 14 Feb 2018 10:19:35 +0000 (11:19 +0100)]
x86: Support compiling with indirect branch thunks

Use -mindirect-branch=thunk-extern/-mindirect-branch-register when available.
To begin with, use the retpoline thunk.  Later work will add alternative
thunks which can be selected at boot time.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 3659f0f4bcc6ca08103d1a7ae4e97535ecc978be
master date: 2018-01-16 17:45:50 +0000

7 years ago  common/wait: Clarifications to wait infrastructure
Andrew Cooper [Wed, 14 Feb 2018 10:18:32 +0000 (11:18 +0100)]
common/wait: Clarifications to wait infrastructure

This logic is not as clear as it could be.  Add some comments to help.

Rearrange the asm block in __prepare_to_wait() to separate the GPR
saving/restoring from the internal logic.

While tweaking, add an unreachable() following the jmp in
check_wakeup_from_wait().

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2d1c82261d966735e82e5971eddb63ba3c565a37
master date: 2018-01-05 19:57:08 +0000

7 years ago  x86/entry: Erase guest GPR state on entry to Xen
Andrew Cooper [Wed, 14 Feb 2018 10:17:57 +0000 (11:17 +0100)]
x86/entry: Erase guest GPR state on entry to Xen

This reduces the number of code gadgets which can be attacked with arbitrary
guest-controlled GPR values.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 03bd8c3a70d101fc2f8f36f1e171b7594462a4cd
master date: 2018-01-05 19:57:08 +0000

7 years ago  x86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit
Andrew Cooper [Wed, 14 Feb 2018 10:17:09 +0000 (11:17 +0100)]
x86/hvm: Use SAVE_ALL to construct the cpu_user_regs frame after VMExit

No practical change.

One side effect in debug builds is that %rbp is inverted in the manner
expected by the stack unwinder to indicate an interrupt frame.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: 13682ca8c94bd5612a44f7f1edc1fd8ff675dacb
master date: 2018-01-05 19:57:08 +0000

7 years ago  x86/entry: Rearrange RESTORE_ALL to restore registers in stack order
Andrew Cooper [Wed, 14 Feb 2018 10:16:32 +0000 (11:16 +0100)]
x86/entry: Rearrange RESTORE_ALL to restore registers in stack order

Results in a more predictable (i.e. linear) memory access pattern.

No functional change.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
master commit: f85d105e27735f0e20aa30d77f03774f3ed55ae5
master date: 2018-01-05 19:57:08 +0000

7 years ago  x86: Introduce a common cpuid_policy_updated()
Andrew Cooper [Wed, 14 Feb 2018 10:15:46 +0000 (11:15 +0100)]
x86: Introduce a common cpuid_policy_updated()

No practical change at the moment, but future changes will need to react
irrespective of guest type.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b357546b43ab87dfb10d740ae637a685134d5e32
master date: 2018-01-05 19:57:07 +0000

7 years ago  x86/hvm: Rename update_guest_vendor() callback to cpuid_policy_changed()
Andrew Cooper [Wed, 14 Feb 2018 10:14:54 +0000 (11:14 +0100)]
x86/hvm: Rename update_guest_vendor() callback to cpuid_policy_changed()

It will shortly be used for more than just changing the vendor.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3bea00966eb6680410c89df764d075a8fbacc3cc
master date: 2018-01-05 19:57:07 +0000

7 years ago  x86/alt: Introduce ALTERNATIVE{,_2} macros
Andrew Cooper [Wed, 14 Feb 2018 10:14:01 +0000 (11:14 +0100)]
x86/alt: Introduce ALTERNATIVE{,_2} macros

To help with creating alternative frames in assembly.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4711428f5e2a9bfff9f8d75b6a696072118c19a4
master date: 2018-01-05 19:57:07 +0000

7 years ago  x86/alt: Break out alternative-asm into a separate header file
Andrew Cooper [Wed, 14 Feb 2018 10:13:00 +0000 (11:13 +0100)]
x86/alt: Break out alternative-asm into a separate header file

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 9d7b4351d3bb5c744db311cffa57ba3ebb583327
master date: 2018-01-05 19:57:07 +0000

7 years ago  xen/arm32: entry: Document the purpose of r11 in the traps handler
Julien Grall [Fri, 2 Feb 2018 14:19:25 +0000 (14:19 +0000)]
xen/arm32: entry: Document the purpose of r11 in the traps handler

It took me a bit of time to understand why __DEFINE_TRAP_ENTRY is
storing the original stack pointer in r11. It is working in pair with
return_traps_entry where sp will be restored from r11.

This is fine because per the AAPCS r11 must be preserved by the
subroutine. So in return_from_trap, r11 will still contain the original
stack pointer.

Add some documentation in the code to point the 2 sides to each other.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit dd855aa430f2da9b677c145f0c625a82aaa97110)

7 years ago  xen/arm32: Invalidate icache on guest exit for Cortex-A15
Julien Grall [Fri, 2 Feb 2018 14:19:24 +0000 (14:19 +0000)]
xen/arm32: Invalidate icache on guest exit for Cortex-A15

In order to avoid aliasing attacks against the branch predictor on
Cortex-A15, let's invalidate the BTB on guest exit, which can only be
done by invalidating the icache (with ACTLR[0] being set).

We use the same hack as for A12/A17 to perform the vector decoding.

This is based on a Linux patch from the kpti branch in [1].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 665c4b6aa79eb21b1aada9f7f98fb5cb7f03743a)

7 years ago  xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12
Julien Grall [Fri, 2 Feb 2018 14:19:23 +0000 (14:19 +0000)]
xen/arm32: Invalidate BTB on guest exit for Cortex A17 and 12

In order to avoid aliasing attacks against the branch predictor, let's
invalidate the BTB on guest exit. This is made complicated by the fact
that we cannot take a branch before invalidating the BTB.

This is based on the fourth version posted by Marc Zyngier on the
linux-arm mailing list (see [1]).

This is part of XSA-254.

[1] https://www.spinics.net/lists/arm-kernel/msg632062.html

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 05e0690d03dc6177e614e060ae78001d4f2abde2)

Renamed trap_hypervisor_call to trap_supervisor_call

7 years ago  xen/arm32: Add skeleton to harden branch predictor aliasing attacks
Julien Grall [Fri, 2 Feb 2018 14:19:22 +0000 (14:19 +0000)]
xen/arm32: Add skeleton to harden branch predictor aliasing attacks

Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.

This patch adds initial skeleton code behind a new Kconfig option
to enable implementation-specific mitigations against these attacks
for CPUs that are affected.

Most of the mitigations will have to be applied when entering the
hypervisor from the guest context.

Because the attack is against the branch predictor, it is not possible to
safely use a branch instruction before the mitigation is applied.
Therefore this has to be done in the vector entry before jumping to the
helper handling a given exception.

However, on arm32, each vector contains a single instruction. This means
that the hardened vector tables may rely on register state that does
not hold in the hypervisor (e.g. SP being 8-byte aligned). Therefore
hypervisor code running with the guest vector tables should be
minimized and always have IRQs and SErrors masked to reduce the risk of
using them.

This patch provides an infrastructure to switch vector tables before
entering the guest and when leaving it.

Note that alternatives could have been used, but older Xen (4.8 or
earlier) doesn't have support for them. So avoid using alternatives to
ease backporting.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9bd4463b5c7cc026a07b9bbd41a6a7122a95647e)

7 years ago  xen/arm32: entry: Add missing trap_reset entry
Julien Grall [Fri, 2 Feb 2018 14:19:21 +0000 (14:19 +0000)]
xen/arm32: entry: Add missing trap_reset entry

At the moment, the reset vector is defined as .word 0 (e.g. andeq r0, r0,
r0).

This is rather unintuitive and will result in executing the undefined
trap. Instead, introduce trap helpers for reset that will generate an
error message in the unlikely case that reset is called.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 00268cc91270c7b0aa3a1906bf7e7702db9c61c1)

Conflicts:
xen/arch/arm/arm32/traps.c

7 years ago  xen/arm32: Add missing MIDR values for Cortex-A17 and A12
Julien Grall [Fri, 2 Feb 2018 14:19:20 +0000 (14:19 +0000)]
xen/arm32: Add missing MIDR values for Cortex-A17 and A12

Cortex-A17 and A12 MIDR will be used in a follow-up patch for hardening
the branch predictor.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 340367bca5360f3e3d263341b58234d0efe5ced2)

7 years ago  xen/arm32: entry: Consolidate DEFINE_TRAP_ENTRY_* macros
Julien Grall [Wed, 7 Feb 2018 16:52:44 +0000 (08:52 -0800)]
xen/arm32: entry: Consolidate DEFINE_TRAP_ENTRY_* macros

The only difference between all the DEFINE_TRAP_ENTRY_* macros is which
interrupts (Asynchronous Abort, IRQ, FIQ) are unmasked.

Rather than duplicating the code, introduce __DEFINE_TRAP_ENTRY macro
that will take the list of interrupts to unmask.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 3bd8fd751e50dd981b7055fb33cdc8aa29537673)

7 years ago  xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Julien Grall [Tue, 16 Jan 2018 14:23:37 +0000 (14:23 +0000)]
xen/arm64: Implement branch predictor hardening for affected Cortex-A CPUs

Cortex-A57, A72, A73 and A75 are susceptible to branch predictor
aliasing and can theoretically be attacked by malicious code.

This patch implements a PSCI-based mitigation for these CPUs when
available. The call into firmware will invalidate the branch predictor
state, preventing any malicious entries from affecting other victim
contexts.

Ported from Linux git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit e730f8e41e8537f1db9770b9464f9523c28857b9)

Conflicts:
xen/arch/arm/cpuerrata.c

7 years ago  xen/arm64: Add skeleton to harden the branch predictor aliasing attacks
Julien Grall [Tue, 16 Jan 2018 14:23:36 +0000 (14:23 +0000)]
xen/arm64: Add skeleton to harden the branch predictor aliasing attacks

Aliasing attacks against CPU branch predictors can allow an attacker to
redirect speculative control flow on some CPUs and potentially divulge
information from one context to another.

This patch adds initial skeleton code behind a new Kconfig option to
enable implementation-specific mitigations against these attacks for
CPUs that are affected.

Most of the mitigations will have to be applied when entering the
hypervisor from the guest context. For safety, they are applied at every
exception entry. So there is potential for optimization when receiving
an exception at the same level.

Because the attack is against the branch predictor, it is not possible to
safely use a branch instruction before the mitigation is applied.
Therefore, this has to be done in the vector entry before jumping to the
helper handling a given exception.

On Arm64, each vector can hold 32 instructions. This leaves us 31
instructions for the mitigation. The last one is the branch instruction
to the helper.

Because a platform may have CPUs with different micro-architectures, a
per-CPU vector table needs to be provided. Realistically, only a few
different mitigations will be necessary. So provide a small set of
vector tables. They will be re-used and patched with the mitigations
on demand.

This is based on the work done in Linux (see [1]).

This is part of XSA-254.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git
branch kpti

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 4c4fddc166cf528aca49540bcc9ee4f196b01dac)

Conflicts:
xen/arch/arm/cpuerrata.c
xen/include/asm-arm/cpuerrata.h
xen/include/asm-arm/cpufeature.h
xen/arch/arm/Kconfig
xen/arch/arm/arm64/Makefile

7 years ago  xen/arm: cpuerrata: Add MIDR_ALL_VERSIONS
Julien Grall [Tue, 16 Jan 2018 14:23:35 +0000 (14:23 +0000)]
xen/arm: cpuerrata: Add MIDR_ALL_VERSIONS

Introduce a new macro MIDR_ALL_VERSIONS to match all variants/revisions of a
given CPU model.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit ba73070af43a38d200413f446d6a718e108867b6)

Conflicts:
xen/arch/arm/cpuerrata.c

7 years ago  xen/arm64: Add missing MIDR values for Cortex-A72, A73 and A75
Julien Grall [Tue, 16 Jan 2018 14:23:34 +0000 (14:23 +0000)]
xen/arm64: Add missing MIDR values for Cortex-A72, A73 and A75

Cortex-A72, A73 and A75 MIDR will be used in a follow-up patch for
hardening the branch predictor.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7975bff524c4e2c30efbf144de753f151d974e53)

Conflicts:
xen/include/asm-arm/processor.h

Add missing definitions of A15, A53, and A57.

7 years ago  xen/arm: Introduce enable callback to enable a capability on each online CPU
Julien Grall [Tue, 16 Jan 2018 14:23:33 +0000 (14:23 +0000)]
xen/arm: Introduce enable callback to enable a capability on each online CPU

Once Xen knows what features/workarounds are present on the platform, it
might be necessary to configure each online CPU.

Introduce a new callback "enable" that will be called on each online CPU to
configure the "capability".

The code is based on Linux v4.14 (where cpufeature.c comes from); the
explanation of why stop_machine_run is used is kept, as we may have a
similar problem in the future.

Lastly, introduce enable_errata_workaround that will be called once CPUs
have booted and before the hardware domain is created.

This is part of XSA-254.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7500495155aacce437878cb576f45224ae984f40)

Conflicts:
xen/include/asm-arm/cpufeature.h
xen/arch/arm/setup.c

7 years ago  xen/arm: Detect silicon revision and set cap bits accordingly
Julien Grall [Wed, 27 Jul 2016 16:37:07 +0000 (17:37 +0100)]
xen/arm: Detect silicon revision and set cap bits accordingly

After each CPU has been started, we iterate through a list of CPU
errata to detect CPUs which need hypervisor code patches.

For each bug there is a function which checks if a particular CPU is
affected. This needs to be done on every CPU to cover heterogeneous
systems properly.

If a certain erratum has been detected, the capability bit will be set.
In case the erratum requires code patching, this will be triggered
by the call to apply_alternatives.

The code is based on the file arch/arm64/kernel/cpu_errata.c in Linux
v4.6-rc3.
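
The shape of that framework can be sketched as follows (field and helper names
are illustrative, and cpus_set_cap() is assumed from the cpufeature code):

    #include <stdbool.h>

    struct cpu_erratum_sketch {
        const char *desc;                 /* printed when the erratum matches */
        unsigned int capability;          /* cap bit to set when affected */
        bool (*matches)(const struct cpu_erratum_sketch *entry);
    };

    static void check_local_cpu_errata_sketch(const struct cpu_erratum_sketch *wa)
    {
        for ( ; wa->matches; wa++ )
            if ( wa->matches(wa) )
                cpus_set_cap(wa->capability);  /* later acted on by
                                                  apply_alternatives */
    }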

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 8b01f6364f50f3b416105cc5f1ee2ca4f13d43eb)

7 years ago  xen/arm: cpufeature: Provide a helper to check if a capability is supported
Julien Grall [Wed, 27 Jul 2016 16:37:06 +0000 (17:37 +0100)]
xen/arm: cpufeature: Provide a helper to check if a capability is supported

The CPU capabilities will be set depending on the value found in the CPU
registers. This patch provides a generic helper to go through a set of
capabilities and find which ones should be enabled.

The parameter "info" is used to display the kind of capability updated (e.g.
workaround, feature...).

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 64eb606000f1258267f27e6819c59848e2226773)

7 years ago  xen/arm: Add cpu_hwcap bitmap
Julien Grall [Wed, 22 Jun 2016 11:15:19 +0000 (12:15 +0100)]
xen/arm: Add cpu_hwcap bitmap

This will be used to know if a feature which Xen cares about is available
across all the CPUs.

This code is a light version of arch/arm64/kernel/cpufeature.c from
Linux v4.6-rc3.

Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit bfb489629c0c5f36c60dec879383cbed4080a509)

Conflicts:
xen/arch/arm/Makefile

7 years ago  xen/arm: Add macros to handle the MIDR
Julien Grall [Wed, 22 Jun 2016 11:15:18 +0000 (12:15 +0100)]
xen/arm: Add macros to handle the MIDR

Add new macros to easily get different parts of the register and to
check if a given MIDR matches a CPU model range. The latter will be really
useful for handling errata later.

The macros have been imported from the header
arch/arm64/include/asm/cputype.h in Linux v4.6-rc3.

Also remove MIDR_MASK which is unused.
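
For reference, MIDR is laid out as Implementer[31:24], Variant[23:20],
Architecture[19:16], PartNum[15:4], Revision[3:0]; a hedged sketch of such
helpers (names illustrative, not necessarily the imported ones):

    #include <stdint.h>

    #define MIDR_REVISION(midr)     ((midr) & 0xf)
    #define MIDR_PARTNUM(midr)      (((midr) >> 4) & 0xfff)
    #define MIDR_VARIANT(midr)      (((midr) >> 20) & 0xf)
    #define MIDR_IMPLEMENTOR(midr)  (((midr) >> 24) & 0xff)

    /* Model = implementer + part number; range check on (variant << 4 | rev). */
    #define MIDR_MODEL_MASK         0xff00fff0u
    #define MIDR_IS_CPU_MODEL_RANGE(midr, model, rv_min, rv_max)              \
        ({                                                                     \
            uint32_t _rv = (((midr) >> 16) & 0xf0) | MIDR_REVISION(midr);      \
            ((midr) & MIDR_MODEL_MASK) == (model) &&                           \
                _rv >= (rv_min) && _rv <= (rv_max);                            \
        })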

Signed-off-by: Julien Grall <julien.grall@arm.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 96c53eaa8cd32f86700b749065eaa37bf4cdc24c)

7 years ago  x86: allow Meltdown band-aid to be disabled
Jan Beulich [Wed, 17 Jan 2018 16:24:59 +0000 (17:24 +0100)]
x86: allow Meltdown band-aid to be disabled

First of all we don't need it on AMD systems. Additionally allow its use
to be controlled by a command line option. For best backportability, this
intentionally doesn't use alternative instruction patching to achieve
the intended effect - while we likely want it, this will be a later
follow-up.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e871e80c38547d9faefc6604532ba3e985e65873
master date: 2018-01-16 17:50:59 +0100

7 years ago  x86: Meltdown band-aid against malicious 64-bit PV guests
Jan Beulich [Wed, 17 Jan 2018 16:24:12 +0000 (17:24 +0100)]
x86: Meltdown band-aid against malicious 64-bit PV guests

This is a very simplistic change limiting the amount of memory a running
64-bit PV guest has mapped (and hence available for attacking): Only the
mappings of stack, IDT, and TSS are being cloned from the direct map
into per-CPU page tables. Guest controlled parts of the page tables are
being copied into those per-CPU page tables upon entry into the guest.
Cross-vCPU synchronization of top level page table entry changes is
being effected by forcing other active vCPU-s of the guest into the
hypervisor.

The change to context_switch() isn't strictly necessary, but there's no
reason to keep switching page tables once a PV guest is being scheduled
out.

This isn't providing full isolation yet, but it should be covering all
pieces of information exposure of which would otherwise require an XSA.

There is certainly much room for improvement, especially of performance,
here - first and foremost suppressing all the negative effects on AMD
systems. But in the interest of backportability (including to really old
hypervisors, which may not even have alternative patching) any such is
being left out here.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5784de3e2067ed73efc2fe42e62831e8ae7f46c4
master date: 2018-01-16 17:49:03 +0100

7 years ago  x86/mm: Always set _PAGE_ACCESSED on L4e updates
Andrew Cooper [Wed, 17 Jan 2018 16:23:37 +0000 (17:23 +0100)]
x86/mm: Always set _PAGE_ACCESSED on L4e updates

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bd61fe94bee0556bc2f64999a4a8315b93f90f21
master date: 2018-01-15 13:53:16 +0000

7 years ago  x86: Don't use potentially incorrect CPUID values for topology information
Jan H. Schönherr [Wed, 17 Jan 2018 16:23:08 +0000 (17:23 +0100)]
x86: Don't use potentially incorrect CPUID values for topology information

Intel says for CPUID leaf 0Bh:

  "Software must not use EBX[15:0] to enumerate processor
   topology of the system. This value in this field
   (EBX[15:0]) is only intended for display/diagnostic
   purposes. The actual number of logical processors
   available to BIOS/OS/Applications may be different from
   the value of EBX[15:0], depending on software and platform
   hardware configurations."

And yet, we're using them to derive the number of cores in a package
and the number of siblings in a core.

Derive the number of siblings and cores from EAX instead, which is
intended for that.
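
A hedged sketch of the EAX-based derivation (simplified, assuming Xen's
cpuid_count() helper; real code also validates the leaf and level types, and
the results are power-of-two upper bounds):

    static void derive_topology_sketch(unsigned int *threads_per_core,
                                       unsigned int *cores_per_package)
    {
        unsigned int eax, ebx, ecx, edx, smt_shift, pkg_shift;

        cpuid_count(0xb, 0, &eax, &ebx, &ecx, &edx);   /* SMT level  */
        smt_shift = eax & 0x1f;                        /* EAX[4:0]   */

        cpuid_count(0xb, 1, &eax, &ebx, &ecx, &edx);   /* core level */
        pkg_shift = eax & 0x1f;

        *threads_per_core  = 1u << smt_shift;
        *cores_per_package = 1u << (pkg_shift - smt_shift);
    }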

Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d51baf310e530659f73e714acf575555bdc46303
master date: 2018-01-08 10:48:24 +0000

7 years ago  x86/entry: Remove support for partial cpu_user_regs frames
Andrew Cooper [Wed, 17 Jan 2018 16:22:34 +0000 (17:22 +0100)]
x86/entry: Remove support for partial cpu_user_regs frames

Save all GPRs on entry to Xen.

The entry_int82() path is via a DPL1 gate, only usable by 32bit PV guests, so
can get away with only saving the 32bit registers.  All other entrypoints can
be reached from 32 or 64bit contexts.

This is part of XSA-254.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: f9eb74789af77e985ae653193f3622263499f674
master date: 2018-01-05 19:57:07 +0000

7 years ago  x86/upcall: inject a spurious event after setting upcall vector
Roger Pau Monné [Wed, 17 Jan 2018 16:22:03 +0000 (17:22 +0100)]
x86/upcall: inject a spurious event after setting upcall vector

In case the vCPU has pending events to inject. This fixes a bug that
happened if the guest mapped the vcpu info area using
VCPUOP_register_vcpu_info without having set up the event channel
upcall, and then set up the upcall vector.

In this scenario the guest would not receive any upcalls, because the
call to VCPUOP_register_vcpu_info would have marked the vCPU as having
pending events, but the vector could not be injected because it was
not yet set up.

This has not caused issues so far because all the consumers first
set up the vector callback and then map the vcpu info page, but there's
no limitation that prevents doing it in the inverse order.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7b5b8ca7dffde866d851f0b87b994e0b13e5b867
master date: 2018-01-04 14:29:16 +0100

7 years ago  x86/E820: don't overrun array
Jan Beulich [Wed, 17 Jan 2018 16:21:21 +0000 (17:21 +0100)]
x86/E820: don't overrun array

The bounds check needs to be done after the increment, not before, or
else it needs to use a one lower immediate. Also use word operations
rather than byte ones for both the increment and the compare (allowing
E820_BIOS_MAX to be more easily bumped, should the need ever arise).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 0036c9dbcd8b52316aeebb475929d3a36cf5e514
master date: 2018-01-03 11:03:56 +0100

7 years ago  x86/IRQ: conditionally preserve access permission on map error paths
Jan Beulich [Wed, 17 Jan 2018 16:20:45 +0000 (17:20 +0100)]
x86/IRQ: conditionally preserve access permission on map error paths

Permissions that had been granted before should not be revoked when
handling unrelated errors.

Reported-by: HW42 <hw42@ipsumj.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3443e68a778572a6e082d7dfcf9ce794eca62f5f
master date: 2018-01-03 11:03:10 +0100

7 years ago  xen/arm: fix smpboot barriers
Stefano Stabellini [Wed, 7 Dec 2016 19:13:05 +0000 (11:13 -0800)]
xen/arm: fix smpboot barriers

Remove useless smp_wmb() barrier after cpumask_set_cpu(cpuid,
&cpu_online_map), which is not synchronizing against anything.

Keep the other smp_wmb(), before the cpumask_set_cpu call, to ensure
that all writes before setting the cpu online are visible to other cpus.
For that to work properly, we need a corresponding smp_rmb() barrier,
after reading the online cpumask from other processors, which is
currently missing. Add it.

See: http://marc.info/?l=xen-devel&m=148093236307211
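
A sketch of the pairing (illustrative wrappers around Xen's cpumask and
barrier primitives; the real code is open-coded in the smpboot paths):

    static void publish_cpu_online_sketch(unsigned int cpu)
    {
        /* ... all per-CPU state is written before this point ... */
        smp_wmb();                               /* order the state before the flag */
        cpumask_set_cpu(cpu, &cpu_online_map);   /* make the CPU visible */
    }

    static bool observe_cpu_online_sketch(unsigned int cpu)
    {
        if ( !cpumask_test_cpu(cpu, &cpu_online_map) )
            return false;

        smp_rmb();    /* pairs with smp_wmb(); later reads see the published state */
        return true;
    }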

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@arm.com>
(cherry picked from commit 779a0e15ca0d9d5dbcbdee29b1dad9faf73bfc77)

7 years ago  arm: configure interrupts to be in non-secure group1
Stefano Stabellini [Wed, 18 Oct 2017 21:29:58 +0000 (14:29 -0700)]
arm: configure interrupts to be in non-secure group1

Xen uses non-secure group1 interrupts; however, it doesn't configure the
GICv3 accordingly. Xen needs to set GICD_IGROUPR for SPIs and
GICR_IGROUPR0 for local interrupts to "1" to specify that interrupts
belong to group1. This is particularly important if the system has
GICD_CTLR.DS set; also see commit
7c9b973061b03af62734f613f6abec46c0dd4a88 in Linux.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Julien Grall <julien.grall@linaro.org>
Release-acked-by: Julien Grall <julien.grall@linaro.org>
(cherry picked from commit 0c8055c2f45f489aff67f4d362f3fdc192cc2d94)

7 years ago  xen/arm: bootfdt: Use proper default for #address-cells and #size-cells
Julien Grall [Wed, 29 Nov 2017 17:57:32 +0000 (17:57 +0000)]
xen/arm: bootfdt: Use proper default for #address-cells and #size-cells

Per the device-tree specification [1], when the properties #address-cells
and #size-cells are not present, the default values should be resp. 2
and 1.

[1] https://www.devicetree.org/downloads/devicetree-specification-v0.1-20160524.pdf
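
A sketch of applying those defaults with libfdt (helper name invented; Xen's
bootfdt code is structured differently):

    #include <libfdt.h>

    static void get_cells_sketch(const void *fdt, int node,
                                 uint32_t *addr_cells, uint32_t *size_cells)
    {
        const fdt32_t *prop;

        *addr_cells = 2;    /* spec default when #address-cells is absent */
        *size_cells = 1;    /* spec default when #size-cells is absent */

        if ( (prop = fdt_getprop(fdt, node, "#address-cells", NULL)) != NULL )
            *addr_cells = fdt32_to_cpu(*prop);
        if ( (prop = fdt_getprop(fdt, node, "#size-cells", NULL)) != NULL )
            *size_cells = fdt32_to_cpu(*prop);
    }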

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit c05aa4afac64ea687c1a2bf9277ba6552809495b)

7 years ago  xen/arm: gic-v3: Bail out if gicv3_cpu_init fails
Julien Grall [Wed, 6 Dec 2017 14:51:37 +0000 (14:51 +0000)]
xen/arm: gic-v3: Bail out if gicv3_cpu_init fails

When system registers are not enabled, all accesses to them will trap
to EL2. In Xen, system registers will be enabled by gicv3_cpu_init only
on success. As the rest of the code (e.g. gicv3_hyp_init) relies on
system registers, it is better to bail out directly.

This will save time when debugging early boot issues on GICv3 platforms.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9630c5ae363b4cbf8eb61366530f40c80680af4d)

7 years ago  xen/efi: Fix build with clang-5.0
Andrew Cooper [Wed, 20 Dec 2017 15:24:52 +0000 (16:24 +0100)]
xen/efi: Fix build with clang-5.0

The clang-5.0 build is reliably failing with:

  Error: size of boot.o:.text is 0x01

which is because efi_arch_flush_dcache_area() exists as a single ret
instruction.  Mark it as __init like everything else in the files.

Spotted by Travis.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: c4f6ad4c5fd25cb0ccc0cdbe711db97e097f0407
master date: 2017-12-14 10:59:26 +0000

7 years ago  x86/microcode: Add support for fam17h microcode loading
Tom Lendacky [Wed, 20 Dec 2017 15:24:23 +0000 (16:24 +0100)]
x86/microcode: Add support for fam17h microcode loading

The size for the Microcode Patch Block (MPB) for an AMD family 17h
processor is 3200 bytes.  Add a #define for fam17h so that it does
not default to 2048 bytes and fail a microcode load/update.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[Linux commit f4e9b7af0cd58dd039a0fb2cd67d57cea4889abf]

Ported to Xen.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 61d458ba8c171809e8dd9abd19339c87f3f934ca
master date: 2017-12-13 14:30:10 +0000

7 years ago  gnttab: improve GNTTABOP_cache_flush locking
Jan Beulich [Wed, 20 Dec 2017 15:23:52 +0000 (16:23 +0100)]
gnttab: improve GNTTABOP_cache_flush locking

Dropping the lock before returning from grant_map_exists() means handing
possibly stale information back to the caller. Instead, return the pointer
to the active entry, for the caller to release the lock once
done.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andre Przywara <andre.przywara@linaro.org>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 553ac37137c2d1c03bf1b69cfb192ffbfe29daa4
master date: 2017-12-04 11:04:18 +0100

7 years ago  gnttab: correct GNTTABOP_cache_flush empty batch handling
Jan Beulich [Wed, 20 Dec 2017 15:23:26 +0000 (16:23 +0100)]
gnttab: correct GNTTABOP_cache_flush empty batch handling

Jann validly points out that with a caller bogusly requesting a zero-
element batch with non-zero high command bits (the ones used for
continuation encoding), the assertion right before the call to
hypercall_create_continuation() would trigger. A similar situation would
arise afaict for non-empty batches with op and/or length zero in every
element.

While we want the former to succeed (as we do elsewhere for similar
no-op requests), the latter can clearly be converted to an error, as
this is a state that can't be the result of a prior operation.

Take the opportunity and also correct the order of argument checks:
We shouldn't accept zero-length elements with unknown bits set in "op".
Also constify cache_flush()'s first parameter.

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andre Przywara <andre.przywara@linaro.org>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 9c22e4d67f5552c7c896ed83bd95d5d4c5837a9d
master date: 2017-12-04 11:03:32 +0100

7 years ago  x86/vvmx: don't enable vmcs shadowing for nested guests
Sergey Dyasli [Wed, 20 Dec 2017 15:22:58 +0000 (16:22 +0100)]
x86/vvmx: don't enable vmcs shadowing for nested guests

Running "./xtf_runner vvmx" in L1 Xen under L0 Xen produces the
following result on H/W with VMCS shadowing:

    Test: vmxon
    Failure in test_vmxon_in_root_cpl0()
      Expected 0x8200000f: VMfailValid(15) VMXON_IN_ROOT
           Got 0x82004400: VMfailValid(17408) <unknown>
    Test result: FAILURE

This happens because the SDM allows VM entries with the VMCS shadowing
VM-execution control enabled and a VMCS link pointer value of ~0ull. But
results of a nested VMREAD are undefined in such cases.

Fix this by not copying the value of VMCS shadowing control from vmcs01
to vmcs02.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 19fdb8e258619aea265af9c183e035e545cbc2d2
master date: 2017-12-01 19:03:27 +0000

7 years ago  xen/pv: Construct d0v0's GDT properly
Andrew Cooper [Wed, 20 Dec 2017 15:22:27 +0000 (16:22 +0100)]
xen/pv: Construct d0v0's GDT properly

c/s cf6d39f8199 "x86/PV: properly populate descriptor tables" changed the GDT
to reference zero_page for intermediate frames between the guest and Xen
frames.

Because dom0_construct_pv() doesn't call arch_set_info_guest(), some bits of
initialisation are missed, including the pv_destroy_gdt() which initially
fills the references to zero_page.

In practice, this means there is a window between starting and the first call
to HYPERCALL_set_gdt() where lar/lsl/verr/verw suffer non-architectural
behaviour.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 08f27f4468eedbeccaac9fdda4ef732247efd74e
master date: 2017-12-01 19:03:26 +0000

7 years ago  x86/hvm: fix interaction between internal and external emulation
Paul Durrant [Wed, 20 Dec 2017 15:21:59 +0000 (16:21 +0100)]
x86/hvm: fix interaction between internal and external emulation

A call to handle_hvm_io_completion() is needed for completing I/O
that requires external emulation. Such completion should be requested when
hvm_vcpu_io_need_completion() returns true after hvm_emulate_once() has
completed. This is indicative of the underlying I/O emulation having
returned X86EMUL_RETRY and hence a re-emulation of the instruction is
needed to pick up the result of the I/O.

A call to handle_hvm_io_completion() is NOT needed when the underlying
I/O has not returned X86EMUL_RETRY since there will be no result to pick
up. Hence it is bogus to request such completion when mmio_retry is set,
since this can only happen if the underlying I/O emulation has returned
X86EMUL_OKAY (meaning the I/O has completed successfully).

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/HVM: don't retain emulated insn cache when exiting back to guest

vio->mmio_retry is being set when a repeated string insn is being split
up. In that case we'll exit to the guest, expecting immediate re-entry.
Interruptions, however, may be serviced by the guest before re-entry
from the repeated string insn. Any emulation needed in the course of
handling the interruption must not fetch from the internally maintained
cache.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 9c9384d6d8184ca6d21975ccf4e4f72b560540cc
master date: 2017-12-01 18:09:48 +0000
master commit: 5fcb26e69e8089e20c9168774bee681b8f5a3187
master date: 2017-12-06 12:50:23 +0100

7 years ago  improve XENMEM_add_to_physmap_batch address checking
Jan Beulich [Wed, 20 Dec 2017 15:21:31 +0000 (16:21 +0100)]
improve XENMEM_add_to_physmap_batch address checking

As a follow-up to XSA-212 we should have addressed a similar issue here:
The handles being advanced at the top of xenmem_add_to_physmap_batch()
means we allow hypervisor space accesses (in particular, for "errs",
writes) with suitably crafted input arguments. This isn't a security
issue in this case because of the limited width of struct
xen_add_to_physmap_batch's size field: it being 16 bits wide, only the
r/o M2P area can be accessed. Still, we can and should do better.
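
A sketch of the tightened checking (variable names such as "start" are
illustrative; guest_handle_subrange_okay() is an existing Xen helper):

    /* Sketch: refuse the hypercall unless all three arrays, from the
     * continuation point to the end of the batch, lie within guest
     * address space - before any handle is advanced. */
    if ( !guest_handle_subrange_okay(xatpb->idxs, start, xatpb->size - 1) ||
         !guest_handle_subrange_okay(xatpb->gpfns, start, xatpb->size - 1) ||
         !guest_handle_subrange_okay(xatpb->errs, start, xatpb->size - 1) )
        return -EFAULT;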

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7f080956e9eed821fd42013bef11c1a2873fbeba
master date: 2017-11-28 13:15:12 +0100

7 years agox86: check paging mode earlier in xenmem_add_to_physmap_one()
Jan Beulich [Wed, 20 Dec 2017 15:21:03 +0000 (16:21 +0100)]
x86: check paging mode earlier in xenmem_add_to_physmap_one()

There's no point in deferring this until after some initial processing,
and it's actively wrong for the XENMAPSPACE_gmfn_foreign handling to not
have such a check at all.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f38f3dccf1e1a8aabcf57364326fc8f44cddc41a
master date: 2017-11-28 13:14:43 +0100

7 years agox86: replace bad ASSERT() in xenmem_add_to_physmap_one()
Jan Beulich [Wed, 20 Dec 2017 15:20:31 +0000 (16:20 +0100)]
x86: replace bad ASSERT() in xenmem_add_to_physmap_one()

There are no locks being held, i.e. it is possible to be triggered by
racy hypercall invocations. Subsequent code doesn't really depend on the
checked values, so this is not a security issue.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: f33d653f46f5889db7be4fef31d71bc871834c10
master date: 2017-11-28 13:14:10 +0100

7 years agosync CPU state upon final domain destruction
Jan Beulich [Wed, 20 Dec 2017 15:19:12 +0000 (16:19 +0100)]
sync CPU state upon final domain destruction

See the code comment being added for why we need this.

This is being placed here to balance between the desire to prevent
future similar issues (the risk of which would grow if it was put
further down the call stack, e.g. in vmx_vcpu_destroy()) and the
intention to limit the performance impact (otherwise it could also go
into rcu_do_batch(), paralleling the use in do_tasklet_work()).

Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 24246e1fb7496b830aca8a6a1fd3064ca1e3ebf9
master date: 2017-11-23 11:38:22 +0100

7 years agox86/hvm: Don't corrupt the HVM context stream when writing the MSR record
Andrew Cooper [Wed, 20 Dec 2017 15:18:40 +0000 (16:18 +0100)]
x86/hvm: Don't corrupt the HVM context stream when writing the MSR record

Ever since it was introduced in c/s bd1f0b45ff, hvm_save_cpu_msrs() has had a
bug whereby it corrupts the HVM context stream if some, but fewer than the
maximum number of MSRs are written.

_hvm_init_entry() creates an hvm_save_descriptor with length for
msr_count_max, but in the case that we write fewer than max, h->cur only moves
forward by the amount of space used, causing the subsequent
hvm_save_descriptor to be written within the bounds of the previous one.

To resolve this, reduce the length reported by the descriptor to match the
actual number of bytes used.
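
A condensed sketch of that adjustment (descriptor/offset bookkeeping is
simplified; HVM_CPU_MSR_SIZE() is the existing size helper, "desc" stands for
the descriptor written by _hvm_init_entry()):

    if ( ctxt->count )
    {
        /* Sketch: shrink the descriptor from the msr_count_max worst case
         * to the bytes actually written, and pull h->cur back in step, so
         * the next hvm_save_descriptor starts right after this record. */
        desc->length = HVM_CPU_MSR_SIZE(ctxt->count);
        h->cur -= (msr_count_max - ctxt->count) * sizeof(ctxt->msr[0]);
    }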

A typical failure on the destination side looks like:

    (XEN) HVM4 restore: CPU_MSR 0
    (XEN) HVM4.0 restore: not enough data left to read 56 MSR bytes
    (XEN) HVM4 restore: failed to load entry 20/0

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d2f86bf604698806d311cc251c1b66fbb752673c
master date: 2017-11-21 11:19:02 +0000

7 years agox86/hvm: Fix altp2m_vcpu_enable_notify error handling
Adrian Pop [Wed, 20 Dec 2017 15:18:09 +0000 (16:18 +0100)]
x86/hvm: Fix altp2m_vcpu_enable_notify error handling

The altp2m_vcpu_enable_notify subop handler might skip calling
rcu_unlock_domain() after rcu_lock_current_domain().  Albeit since both
rcu functions are no-ops when run on the current domain, this doesn't
really have repercussions.

The second change is adding a missing break that would have potentially
enabled #VE for the current domain even if it had intended to enable it
for another one (not a supported functionality).
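
Schematically, with the subop dispatch detail omitted and an illustrative
helper name, the two marked lines correspond to the two fixes described above:

    case HVMOP_altp2m_vcpu_enable_notify:
        rc = altp2m_enable_notify(v, &a);   /* illustrative helper name */
        break;                              /* the missing break */

    /* ... remaining subops ... */
 out:
    rcu_unlock_domain(d);                   /* now reached on every path */
    return rc;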

Signed-off-by: Adrian Pop <apop@bitdefender.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: eb0660c6950e08e44fdfeca3e29320382e2a1554
master date: 2017-11-16 17:25:59 +0000

7 years agocommon/gnttab: Correct error handling for gnttab_setup_table()
Andrew Cooper [Wed, 20 Dec 2017 15:17:26 +0000 (16:17 +0100)]
common/gnttab: Correct error handling for gnttab_setup_table()

Simplify the error labels to just "unlock" and "out".  This fixes an erroneous
path where a failure of rcu_lock_domain_by_any_id() still results in
rcu_unlock_domain() being called.

This is only not an XSA by luck.  rcu_unlock_domain() is a nop other than
decrementing the preempt count, and nothing reads the preempt count outside of
a debug build.
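
Schematically, the corrected label structure looks like this (copy-back of the
status field and the actual table population are elided):

    d = rcu_lock_domain_by_any_id(op.dom);
    if ( d == NULL )
        goto out;           /* lookup failed: nothing to unlock */

    /* ... validate the request and populate the grant table ... */

 unlock:
    rcu_unlock_domain(d);   /* only reached once the lock was taken */
 out:
    /* copy results back to the guest and return */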

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e436e7a45082ea2cadc176c19e1df46c178448f
master date: 2017-08-15 15:08:57 +0100

7 years agox86/paging: don't unconditionally BUG() on finding SHARED_M2P_ENTRY
Jan Beulich [Tue, 12 Dec 2017 14:05:09 +0000 (15:05 +0100)]
x86/paging: don't unconditionally BUG() on finding SHARED_M2P_ENTRY

PV guests can fully control the values written into the P2M.

This is XSA-251.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b4d0218cff66b7eaa9c9b8dc9bd71e7b089b016d
master date: 2017-12-12 14:30:17 +0100

7 years agox86/shadow: fix ref-counting error handling
Jan Beulich [Tue, 12 Dec 2017 14:04:28 +0000 (15:04 +0100)]
x86/shadow: fix ref-counting error handling

The old-Linux handling in shadow_set_l4e() mistakenly ORed together the
results of sh_get_ref() and sh_pin(). As the latter failing is not a
correctness problem, simply ignore its return value.

In sh_set_toplevel_shadow() a failing sh_get_ref() must not be
accompanied by installing the entry, despite the domain being crashed.
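
Roughly, for the shadow_set_l4e() part (argument lists abbreviated to the
relevant pieces):

    /* Sketch: only sh_get_ref() failure is fatal; sh_pin() failing is
     * tolerable, so its return value is deliberately not folded in. */
    if ( !sh_get_ref(d, sl3mfn, paddr) )
    {
        domain_crash(d);
        return SHADOW_SET_ERROR;
    }
    sh_pin(d, sl3mfn);   /* previously OR-ed into the sh_get_ref() result */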

This is XSA-250.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 10be8001de7d87be1f0ccdda75cc70e922e56d03
master date: 2017-12-12 14:29:45 +0100

7 years agox86/shadow: fix refcount overflow check
Jan Beulich [Tue, 12 Dec 2017 14:04:00 +0000 (15:04 +0100)]
x86/shadow: fix refcount overflow check

Commit c385d27079 ("x86 shadow: for multi-page shadows, explicitly track
the first page") reduced the refcount width to 25, without adjusting the
overflow check. Eliminate the disconnect by using a manifest constant.

Interestingly, up to commit 047782fa01 ("Out-of-sync L1 shadows: OOS
snapshot") the refcount was 27 bits wide, yet the check was already
using 26.
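
A sketch of the manifest-constant form (the macro name is illustrative; the
point is that the check and the field width now share a single definition):

    #define PAGE_SH_REFCOUNT_WIDTH 25          /* width of the count field */

        /* overflow check when taking a new shadow reference: */
        if ( unlikely(nx >= (1UL << PAGE_SH_REFCOUNT_WIDTH)) )
            return 0;                          /* refuse the new reference */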

This is XSA-249.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: 54e2292e8df7a1a7b041192be9d6d797b6d00869
master date: 2017-12-12 14:29:13 +0100

7 years agox86/mm: don't wrongly set page ownership
Jan Beulich [Tue, 12 Dec 2017 14:03:34 +0000 (15:03 +0100)]
x86/mm: don't wrongly set page ownership

PV domains can obtain mappings of any pages owned by the correct domain,
including ones that aren't actually assigned as "normal" RAM, but used
by Xen internally.  At the moment such "internal" pages marked as owned
by a guest include pages used to track logdirty bits, as well as p2m
pages and the "unpaged pagetable" for HVM guests. Since the PV memory
management and shadow code conflict in their use of struct page_info
fields, and since shadow code is being used for log-dirty handling for
PV domains, pages coming from the shadow pool must, for PV domains, not
have the domain set as their owner.

While the change could be done conditionally for just the PV case in
shadow code, do it unconditionally (and for consistency also for HAP),
just to be on the safe side.

There's one special case though for shadow code: The page table used for
running a HVM guest in unpaged mode is subject to get_page() (in
set_shadow_status()) and hence must have its owner set.

This is XSA-248.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Tim Deegan <tim@xen.org>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ff2a793e15bb0b6254bc849ef8e83e1c284c3583
master date: 2017-12-12 14:28:36 +0100

7 years agox86: don't wrongly trigger linear page table assertion (2)
Jan Beulich [Tue, 12 Dec 2017 14:03:00 +0000 (15:03 +0100)]
x86: don't wrongly trigger linear page table assertion (2)

_put_final_page_type(), when free_page_type() has exited early to allow
for preemption, should not update the time stamp, as the page continues
to retain the type which is in the process of being unvalidated. I can't
see why the time stamp update was put on that path in the first place
(albeit it may well have been me who had put it there years ago).

This is part of XSA-240.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: e40b0219a8c77741ae48989efb520f4a762a5be3
master date: 2017-12-12 14:27:34 +0100

7 years agop2m: Check return value of p2m_set_entry() when decreasing reservation
George Dunlap [Tue, 28 Nov 2017 12:45:13 +0000 (13:45 +0100)]
p2m: Check return value of p2m_set_entry() when decreasing reservation

If the entire range specified to p2m_pod_decrease_reservation() is marked
populate-on-demand, then it will make a single p2m_set_entry() call,
reducing its PoD entry count.

Unfortunately, in the right circumstances, this p2m_set_entry() call
may fail.  In that case, repeated calls to decrease_reservation() may
cause p2m->pod.entry_count to fall below zero, potentially tripping
over BUG_ON()s to the contrary.

Instead, check to see if the entry succeeded, and return false if not.
The caller will then call guest_remove_page() on the gfns, which will
return -EINVAL upon finding no valid memory there to return.

Unfortunately if the order > 0, the entry may have partially changed.
A domain_crash() is probably the safest thing in that case.

Other p2m_set_entry() calls in the same function should be fine,
because they are writing the entry at its current order.  Nonetheless,
check the return value and crash if our assumption turns out to be
wrong.
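
A condensed sketch of the new check (locking, continuation and type handling
are omitted, and the argument spelling may differ slightly in this tree):

    if ( p2m_set_entry(p2m, gpfn, INVALID_MFN, order,
                       p2m_invalid, p2m->default_access) )
    {
        /* A partially-updated superpage cannot be recovered cleanly. */
        if ( order != 0 )
            domain_crash(d);
        goto out;   /* report failure; caller falls back to guest_remove_page() */
    }
    p2m->pod.entry_count -= 1UL << order;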

This is part of XSA-247.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a3d64de8e86f5812917d2d0af28298f80debdf9a
master date: 2017-11-28 13:13:26 +0100

7 years agop2m: Always check to see if removing a p2m entry actually worked
George Dunlap [Tue, 28 Nov 2017 12:44:42 +0000 (13:44 +0100)]
p2m: Always check to see if removing a p2m entry actually worked

The PoD zero-check functions speculatively remove memory from the p2m,
then check to see if it's completely zeroed, before putting it in the
cache.

Unfortunately, the p2m_set_entry() calls may fail if the underlying
pagetable structure needs to change and the domain has exhausted its
p2m memory pool: for instance, if we're removing a 2MiB region out of
a 1GiB entry (in the p2m_pod_zero_check_superpage() case), or a 4k
region out of a 2MiB or larger entry (in the p2m_pod_zero_check()
case); and the return value is not checked.

The underlying mfn will then be added into the PoD cache, and at some
point mapped into another location in the p2m.  If the guest
afterwards balloons out this memory, it will be freed to the hypervisor
and potentially reused by another domain, in spite of the fact that
the original domain still has writable mappings to it.

There are several places where p2m_set_entry() shouldn't be able to
fail, as it is guaranteed to write an entry of the same order that
succeeded before.  Add a backstop of crashing the domain just in case,
and an ASSERT_UNREACHABLE() to flag up the broken assumption on debug
builds.

While we're here, use PAGE_ORDER_2M rather than a magic constant.
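
The backstop, in sketch form (surrounding zero-check logic elided):

    /* Re-inserting an entry of the same order that was just removed should
     * not be able to fail; make any such failure loud rather than silent. */
    if ( p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_2M,
                       type, p2m->default_access) )
    {
        ASSERT_UNREACHABLE();
        domain_crash(d);
    }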

This is part of XSA-247.

Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 92790672dedf2eab042e04ecc277c19d40fd348a
master date: 2017-11-28 13:13:03 +0100

7 years agox86/pod: prevent infinite loop when shattering large pages
Julien Grall [Tue, 28 Nov 2017 12:44:09 +0000 (13:44 +0100)]
x86/pod: prevent infinite loop when shattering large pages

When populating pages, the PoD may need to split large ones using
p2m_set_entry and request the caller to retry (see ept_get_entry for
instance).

p2m_set_entry may fail to shatter if it is not possible to allocate
memory for the new page table. However, the error is not propagated,
so the callers keep retrying the PoD population indefinitely.

Prevent the infinite loop by returning false when it is not possible to
shatter the large mapping.
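
Sketched, the change amounts to checking the shattering call and bailing out
(retry/continuation handling and exact argument types are elided):

    if ( p2m_set_entry(p2m, gfn_aligned, INVALID_MFN, PAGE_ORDER_2M,
                       p2m_populate_on_demand, p2m->default_access) )
        goto out_fail;   /* tell the caller population failed for good */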

This is XSA-246.

Signed-off-by: Julien Grall <julien.grall@linaro.org>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: a1c6c6768971ea387d7eba0803908ef0928b43ac
master date: 2017-11-28 13:11:55 +0100

7 years agoupdate Xen version to 4.7.5-pre
Jan Beulich [Tue, 28 Nov 2017 12:43:28 +0000 (13:43 +0100)]
update Xen version to 4.7.5-pre

7 years agoupdate Xen version to 4.7.4 RELEASE-4.7.4
Jan Beulich [Mon, 20 Nov 2017 15:01:11 +0000 (16:01 +0100)]
update Xen version to 4.7.4

7 years agox86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()
Andrew Cooper [Thu, 16 Nov 2017 11:03:26 +0000 (12:03 +0100)]
x86/shadow: correct SH_LINEAR mapping detection in sh_guess_wrmap()

The fix for XSA-243 / CVE-2017-15592 (c/s bf2b4eadcf379) introduced a change
in behaviour for sh_guess_wrmap(), where it had to cope with no shadow linear
mapping being present.

As the name suggests, guest_vtable is a mapping of the guest's pagetable, not
Xen's pagetable, meaning that it isn't the pagetable we need to check for the
shadow linear slot in.

The practical upshot is that a shadow HVM vcpu which switches into 4-level
paging mode, with an L4 pagetable that contains a mapping which aliases Xen's
SH_LINEAR_PT_VIRT_START will fool the safety check for whether a SHADOW_LINEAR
mapping is present.  As the check passes (when it should have failed), Xen
subsequently falls over the missing mapping with a pagefault such as:

    (XEN) Pagetable walk from ffff8140a0503880:
    (XEN)  L4[0x102] = 000000046c218063 ffffffffffffffff
    (XEN)  L3[0x102] = 000000046c218063 ffffffffffffffff
    (XEN)  L2[0x102] = 000000046c218063 ffffffffffffffff
    (XEN)  L1[0x103] = 0000000000000000 ffffffffffffffff

This is part of XSA-243.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Tim Deegan <tim@xen.org>
master commit: d20daf4294adbdb9316850566013edb98db7bfbc
master date: 2017-11-16 10:38:14 +0100

7 years agox86: don't wrongly trigger linear page table assertion
Jan Beulich [Thu, 16 Nov 2017 11:02:55 +0000 (12:02 +0100)]
x86: don't wrongly trigger linear page table assertion

_put_page_type() may do multiple iterations until its cmpxchg()
succeeds. It invokes set_tlbflush_timestamp() on the first
iteration, however. Code inside the function takes care of this, but
- the assertion in _put_final_page_type() would trigger on the second
  iteration if time stamps in a debug build are permitted to be
  sufficiently much wider than the default 6 bits (see WRAP_MASK in
  flushtlb.c),
- it returning -EINTR (for a continuation to be scheduled) would leave
  the page in an inconsistent state (until the re-invocation completes).
Make the set_tlbflush_timestamp() invocation conditional, bypassing it
(for now) only in the case we really can't tolerate the stamp to be
stored.

This is part of XSA-240.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 2c458dfcb59f3d9d8a35fc5ffbf780b6ed7a26a6
master date: 2017-11-16 10:37:29 +0100

7 years agox86/mm: fix race condition in modify_xen_mappings()
Yu Zhang [Thu, 16 Nov 2017 11:02:24 +0000 (12:02 +0100)]
x86/mm: fix race condition in modify_xen_mappings()

In modify_xen_mappings(), an L1/L2 page table is freed if all of its
entries are empty, and the corresponding L2/L3 PTE then needs to be
cleared in such a scenario.

However, concurrent paging structure modifications on different CPUs
may cause the L2/L3 PTEs to already be cleared, or to be set to
reference a superpage.

Therefore the logic that enumerates the L1/L2 page table and resets the
corresponding L2/L3 PTE needs to be protected with a spinlock, and the
_PAGE_PRESENT and _PAGE_PSE flags need to be checked after the lock is
obtained.
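
The resulting pattern, in sketch form (the lock name follows the upstream
change; error and TLB-flush handling are omitted):

    spin_lock(&map_pgdir_lock);
    /* Re-check under the lock: another CPU may have cleared the entry or
     * turned it into a superpage since we last looked. */
    if ( (l2e_get_flags(*pl2e) & _PAGE_PRESENT) &&
         !(l2e_get_flags(*pl2e) & _PAGE_PSE) )
    {
        /* safe to scan the L1 table and, if empty, clear *pl2e here */
    }
    spin_unlock(&map_pgdir_lock);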

Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b9ee1fd7b98064cf27d0f8f1adf1f5359b72c97f
master date: 2017-11-14 17:11:26 +0100

7 years agox86/mm: fix race conditions in map_pages_to_xen()
Min He [Thu, 16 Nov 2017 11:01:57 +0000 (12:01 +0100)]
x86/mm: fix race conditions in map_pages_to_xen()

In map_pages_to_xen(), an L2 page table entry may be reset to point to
a superpage, and its corresponding L1 page table then needs to be freed,
when the L1 entries map consecutive page frames with the same mapping
flags.

However, the variable `pl1e` is not protected by the lock before the L1
page table is enumerated, so a race condition may occur if this code
path is invoked simultaneously on different CPUs.

For example, `pl1e` on CPU0 may hold an obsolete value, pointing to a
page which has just been freed on CPU1. Until that page is reused it
still holds the old PTEs, referencing consecutive page frames, so
`free_xen_pagetable(l2e_to_l1e(ol2e))` is triggered on CPU0, resulting
in the unexpected freeing of a normal page.

This patch fixes the above problem by protecting the `pl1e` with the lock.

There are also other potential race conditions: for instance, the L2/L3
entry may be modified concurrently on different CPUs by routines such as
map_pages_to_xen() and modify_xen_mappings(). To address this, the patch
additionally checks the _PAGE_PRESENT and _PAGE_PSE flags of the
corresponding L2/L3 entry after the spinlock is obtained.
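
The pl1e part of the fix, sketched (same lock as in the previous entry; the
superpage-merge logic itself is unchanged and elided):

    spin_lock(&map_pgdir_lock);
    ol2e = *pl2e;
    if ( (l2e_get_flags(ol2e) & _PAGE_PRESENT) &&
         !(l2e_get_flags(ol2e) & _PAGE_PSE) )
    {
        /* Derive pl1e only now, under the lock, so it cannot point at an
         * L1 table another CPU has just freed and reused. */
        pl1e = l2e_to_l1e(ol2e);
        /* ... check the L1 entries for a contiguous, same-flag run ... */
    }
    spin_unlock(&map_pgdir_lock);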

Signed-off-by: Min He <min.he@intel.com>
Signed-off-by: Yi Zhang <yi.z.zhang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a5114662297ad03efc36b52ad365ffa05fb357b7
master date: 2017-11-14 17:10:56 +0100

7 years agox86/hvm: do not register hpet mmio during s3 cycle
Eric Chanudet [Thu, 16 Nov 2017 11:01:28 +0000 (12:01 +0100)]
x86/hvm: do not register hpet mmio during s3 cycle

Do it once at domain creation (hpet_init).

Sleep -> Resume cycles will end up crashing an HVM guest with an HPET,
as the resume sequence takes the path:
-> hvm_s3_suspend
  -> hpet_reset
    -> hpet_deinit
    -> hpet_init
      -> register_mmio_handler
        -> hvm_next_io_handler

register_mmio_handler will use a new io handler each time, until
eventually it reaches NR_IO_HANDLERS, then hvm_next_io_handler calls
domain_crash.

Signed-off-by: Eric Chanudet <chanudete@ainfosec.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 015d6738ddff4074668c1d4887bbffd507ed1a7f
master date: 2017-11-14 17:09:50 +0100

7 years agox86/mm: Make PV linear pagetables optional
George Dunlap [Thu, 16 Nov 2017 11:00:28 +0000 (12:00 +0100)]
x86/mm: Make PV linear pagetables optional

Allowing pagetables to point to other pagetables of the same level
(often called 'linear pagetables') has been included in Xen since its
inception; but recently it has been the source of a number of subtle
reference-counting bugs.

It is not used by Linux or MiniOS; but it is used by NetBSD and Novell
Netware.  There are significant numbers of people who are never going
to use the feature, along with significant numbers who need the
feature.

Add a Kconfig option for the feature (default to 'y').  Also add a
command-line option to control whether PV linear pagetables are
allowed (default to 'true').

NB that we leave linear_pt_count in the page struct.  It's in a union,
so its presence doesn't increase the size of the data struct.
Changing the layout of the other elements based on configuration
options is asking for trouble however; so we'll just leave it there
and ASSERT that it's zero.
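
A sketch of the command-line side (the option and variable names follow the
description above; the Kconfig wiring and the exact validation site are not
shown):

    /* "pv-linear-pt" defaults to true; when clear, attempts to install a
     * linear pagetable entry are rejected during validation. */
    static bool_t __read_mostly opt_pv_linear_pt = 1;
    boolean_param("pv-linear-pt", opt_pv_linear_pt);

        /* in the pagetable validation path: */
        if ( !opt_pv_linear_pt )
        {
            gdprintk(XENLOG_WARNING,
                     "Attempt to create linear p.t. (feature disabled)\n");
            return -EINVAL;
        }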

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3285e75dea89afb0ef5b3ee39bd15194bd7cc110
master date: 2017-10-27 14:36:45 +0100

7 years agox86: fix asm() constraint for GS selector update
Jan Beulich [Thu, 16 Nov 2017 10:59:55 +0000 (11:59 +0100)]
x86: fix asm() constraint for GS selector update

Exception fixup code may alter the operand, which ought to be reflected
in the constraint.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 65ab53de34851243fb7793ebf12fd92a65f84ddd
master date: 2017-10-27 13:49:10 +0100

7 years agox86: don't latch wrong (stale) GS base addresses
Jan Beulich [Thu, 16 Nov 2017 10:59:21 +0000 (11:59 +0100)]
x86: don't latch wrong (stale) GS base addresses

load_segments() writes selector registers before doing any of the base
address updates. Any of these selector loads can cause a page fault in
case it references the LDT, and the LDT page accessed was only recently
installed. Therefore the call tree map_ldt_shadow_page() ->
guest_get_eff_kern_l1e() -> toggle_guest_mode() would in such a case
wrongly latch the outgoing vCPU's GS.base into the incoming vCPU's
recorded state.

Split page table toggling from GS handling - neither
guest_get_eff_kern_l1e() nor guest_io_okay() need more than the page
tables being the kernel ones for the memory access they want to do.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a711f6f24a7157ae70d1cc32e61b98f23dc0c584
master date: 2017-10-27 13:49:10 +0100

7 years agox86: also show FS/GS base addresses when dumping registers
Jan Beulich [Thu, 16 Nov 2017 10:58:47 +0000 (11:58 +0100)]
x86: also show FS/GS base addresses when dumping registers

Their state may be important to figure the reason for a crash. To not
further grow duplicate code, break out a helper function.

I realize that (ab)using the control register array here may not be
considered the nicest solution, but it seems easier (and less overall
overhead) to do so compared to the alternative of introducing another
helper structure.
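
A sketch of such a helper (the function name and the slot assignment within
the crs[] array are illustrative):

    static void read_crs_and_bases(unsigned long crs[8])
    {
        crs[0] = read_cr0();
        crs[2] = read_cr2();
        crs[3] = read_cr3();
        crs[4] = read_cr4();
        /* (Ab)use the spare slots of the array for the segment bases. */
        rdmsrl(MSR_FS_BASE, crs[5]);
        rdmsrl(MSR_GS_BASE, crs[6]);
        rdmsrl(MSR_SHADOW_GS_BASE, crs[7]);
    }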

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: be7f60b5a39741eab0a8fea0324f7be0cb724cfb
master date: 2017-10-24 18:13:13 +0200

7 years agox86: fix GS-base-dirty determination
Jan Beulich [Thu, 16 Nov 2017 10:58:10 +0000 (11:58 +0100)]
x86: fix GS-base-dirty determination

load_segments() writes the two MSRs in their "canonical" positions
(GS_BASE for the user base, SHADOW_GS_BASE for the kernel one) and uses
SWAPGS to switch them around if the incoming vCPU is in kernel mode. In
order to not leave a stale kernel address in GS_BASE when the incoming
guest is in user mode, the check on the outgoing vCPU needs to be
dependent upon the mode it is currently in, rather than blindly looking
at the user base.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 91f85280b9b80852352fcad73d94ed29fafb88da
master date: 2017-10-24 18:12:31 +0200

7 years agox86emul: handle address wrapping
Jan Beulich [Tue, 24 Oct 2017 14:50:14 +0000 (16:50 +0200)]
x86emul: handle address wrapping

This is just the emulator part of commit 7869e2bafe
("x86emul/fuzz: add rudimentary limit checking"):

Several adjustments to the emulator's address calculations are needed:
While the DstBitBase one is really mandatory, the specification allows
for either original or new behavior for two-part accesses. Observed
behavior on real hardware, however, is for such accesses to silently
wrap at the 2^^32 boundary in other than 64-bit mode, just like they do
at the 2^^64 boundary in 64-bit mode, which our code is now being
brought in line with. While adding truncate_ea() invocations there,
also convert open coded instances of it.
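
A self-contained sketch of the truncation rule (the emulator's real helper is
a macro keyed off its own internal state):

    static inline unsigned long truncate_ea(unsigned long ea,
                                            unsigned int ad_bytes)
    {
        /* Wrap at 2^16 or 2^32 for 16-/32-bit address sizes; 64-bit
         * effective addresses wrap naturally at 2^64. */
        return ad_bytes < sizeof(ea) ? ea & ((1UL << (ad_bytes * 8)) - 1)
                                     : ea;
    }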

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
7 years agoVMX: PLATFORM_INFO MSR is r/o
Jan Beulich [Tue, 24 Oct 2017 14:49:44 +0000 (16:49 +0200)]
VMX: PLATFORM_INFO MSR is r/o

Therefore all write attempts should produce #GP, just like on real
hardware.
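
Sketched against an MSR-write intercept (the gp_fault label stands for however
the handler raises #GP):

    case MSR_INTEL_PLATFORM_INFO:
        /* Read-only on real hardware: any WRMSR attempt raises #GP. */
        goto gp_fault;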

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Kevin Tian <kevin.tian@intel.com>