xenbits.xensource.com Git - xen.git/log
4 years ago x86/ept: atomically modify entries in ept_next_level
Roger Pau Monné [Tue, 7 Jul 2020 13:10:14 +0000 (15:10 +0200)]
x86/ept: atomically modify entries in ept_next_level

ept_next_level was passing a live PTE pointer to ept_set_middle_entry,
which was then modified without taking into account that the PTE could
be part of a live EPT table. This wasn't a security issue because the
pages returned by p2m_alloc_ptp are zeroed, so adding such an entry
before actually initializing it didn't allow a guest to access
physical memory addresses it wasn't supposed to access.

This is part of XSA-328.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bc3d9f95d661372b059a5539ae6cb1e79435bb95
master date: 2020-07-07 14:37:12 +0200
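The principle behind the ept_next_level change can be sketched in plain C11: compose the complete entry in a local variable and publish it with a single atomic store, so that a concurrent walker sees either the old slot value or the finished entry. This is a minimal sketch under stated assumptions, not Xen's code: `pte_slot_t`, the `PTE_*` masks and both helper functions are hypothetical names chosen for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical 64-bit page-table slot; NOT Xen's ept_entry_t. */
typedef _Atomic(uint64_t) pte_slot_t;

/* Illustrative bit layout, assumed for this sketch only. */
#define PTE_PRESENT   (UINT64_C(1) << 0)
#define PTE_WRITABLE  (UINT64_C(1) << 1)
#define PTE_ADDR_MASK UINT64_C(0x000ffffffffff000)

/* Build the whole entry in a local value; nothing live is touched here. */
static uint64_t make_middle_entry(uint64_t table_paddr)
{
    uint64_t e = 0;

    e |= table_paddr & PTE_ADDR_MASK;
    e |= PTE_PRESENT | PTE_WRITABLE;
    return e;
}

/* Publish the finished entry into the live slot with one atomic store. */
static void install_middle_entry(pte_slot_t *slot, uint64_t table_paddr)
{
    uint64_t e = make_middle_entry(table_paddr);

    atomic_store_explicit(slot, e, memory_order_release);
}
```

The release ordering is a choice made for the sketch: it keeps earlier initialization of the new table visible before the entry that points at it.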

4 years ago x86/EPT: ept_set_middle_entry() related adjustments
Jan Beulich [Tue, 7 Jul 2020 13:09:50 +0000 (15:09 +0200)]
x86/EPT: ept_set_middle_entry() related adjustments

ept_split_super_page() wants to further modify the newly allocated
table, so have ept_set_middle_entry() return the mapped pointer rather
than tearing the mapping down only for it to be re-established right away.

Similarly ept_next_level() wants to hand back a mapped pointer of
the next level page, so re-use the one established by
ept_set_middle_entry() in case that path was taken.

Pull the setting of suppress_ve ahead of insertion into the higher level
table, and don't have ept_split_super_page() set the field a 2nd time.

This is part of XSA-328.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1104288186ee73a7f9bfa41cbaa5bb7611521028
master date: 2020-07-07 14:36:52 +0200

4 years ago x86/shadow: correct an inverted conditional in dirty VRAM tracking
Jan Beulich [Tue, 7 Jul 2020 13:09:25 +0000 (15:09 +0200)]
x86/shadow: correct an inverted conditional in dirty VRAM tracking

This originally was "mfn_x(mfn) == INVALID_MFN". Make it like this
again, taking the opportunity to also drop the unnecessary nearby
braces.

This is XSA-319.

Fixes: 246a5a3377c2 ("xen: Use a typesafe to define INVALID_MFN")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23a216f99d40fbfbc2318ade89d8213eea6ba1f8
master date: 2020-07-07 14:36:24 +0200
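As an illustration of how a typesafe conversion can silently invert a test, the sketch below wraps a frame number in a struct and compares it through a helper. It is a hedged example only: `mfn_sketch_t`, `INVALID_MFN_SKETCH` and `mfn_is_invalid()` are hypothetical stand-ins, and the exact expression used in the Xen fix is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical typesafe wrapper, modelled loosely on the idea of mfn_t. */
typedef struct { uint64_t m; } mfn_sketch_t;

#define INVALID_MFN_SKETCH ((mfn_sketch_t){ ~UINT64_C(0) })

/* Unwrap the raw frame number. */
static uint64_t mfn_x(mfn_sketch_t mfn)
{
    return mfn.m;
}

/* Structs cannot be compared with ==, so equality goes through a helper. */
static bool mfn_eq(mfn_sketch_t a, mfn_sketch_t b)
{
    return mfn_x(a) == mfn_x(b);
}

/*
 * The intended sense of the check: true when the frame is INVALID.
 * Writing !mfn_eq(...) here would be the inverted form.
 */
static bool mfn_is_invalid(mfn_sketch_t mfn)
{
    return mfn_eq(mfn, INVALID_MFN_SKETCH);
}
```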

4 years ago xen/common: event_channel: Don't ignore error in get_free_port()
Julien Grall [Tue, 7 Jul 2020 13:08:59 +0000 (15:08 +0200)]
xen/common: event_channel: Don't ignore error in get_free_port()

Currently, get_free_port() assumes that the port has been allocated
when evtchn_allocate_port() does not return -EBUSY.

However, the function may return an error when:
    - We exhausted all the event channels. This can happen if the limit
    configured by the administrator for the guest ('max_event_channels'
    in xl cfg) is higher than the ABI used by the guest. For instance,
    if the guest is using 2L, the limit should not be higher than 4095.
    - We cannot allocate memory (e.g. Xen has no more memory).

Users of get_free_port() (such as EVTCHNOP_alloc_unbound) will validly
assume the port is valid and will next call evtchn_from_port(). This
will result in a crash as the memory backing the event channel structure
is not present.

Fixes: 368ae9a05fe ("xen/pvshim: forward evtchn ops between L0 Xen and L2 DomU")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2e9c2bc292231823a3a021d2e0a9f1956bf00b3c
master date: 2020-07-07 14:35:36 +0200
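The error-handling pattern described above can be sketched as follows. This is a self-contained illustration, not the Xen implementation: `allocate_port()`, `get_free_port_sketch()`, `MAX_PORTS` and the `port_in_use` table are hypothetical, and the choice of -ENOSPC for exhaustion is an assumption of the sketch.

```c
#include <errno.h>

#define MAX_PORTS 8   /* hypothetical limit, for illustration only */

static unsigned char port_in_use[MAX_PORTS];

/*
 * Stand-in for an allocator: 0 on success, -EBUSY if the port is taken,
 * -ENOSPC once the limit is exceeded.
 */
static int allocate_port(unsigned int port)
{
    if (port >= MAX_PORTS)
        return -ENOSPC;
    if (port_in_use[port])
        return -EBUSY;
    port_in_use[port] = 1;
    return 0;
}

/* Returns a port number >= 0, or a negative errno value. */
static int get_free_port_sketch(void)
{
    for (unsigned int port = 0; ; port++) {
        int rc = allocate_port(port);

        if (rc == 0)
            return (int)port;
        if (rc != -EBUSY)
            return rc;   /* propagate every error other than "in use" */
    }
}
```

Only -EBUSY means "try the next port"; any other failure is handed back to the caller instead of being treated as a successful allocation.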

4 years ago libacpi: widen TPM detection
Jason Andryuk [Wed, 24 Jun 2020 15:17:38 +0000 (17:17 +0200)]
libacpi: widen TPM detection

The hardcoded tpm_signature is too restrictive to detect many TPMs.  For
instance, it doesn't accept a QEMU emulated TPM (VID 0x1014 DID 0x0001).
Make the TPM detection match that in rombios which accepts a wider
range.

With this change, the TPM's TCPA ACPI table is generated and the guest
OS can automatically load the tpm_tis driver.  It also allows seabios to
detect and use the TPM.  However, seabios skips some TPM initialization
when running under Xen, so it will not populate any PCRs unless modified
to run the initialization under Xen.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3db7e043cddd7e939195e014241ce2c5d436179
master date: 2020-06-16 10:31:08 +0200

4 years ago ioreq: handle pending emulation racing with ioreq server destruction
Paul Durrant [Wed, 24 Jun 2020 15:17:05 +0000 (17:17 +0200)]
ioreq: handle pending emulation racing with ioreq server destruction

When an emulation request is initiated in hvm_send_ioreq() the guest vcpu is
blocked on an event channel until that request is completed. If, however,
the emulator is killed whilst that emulation is pending then the ioreq
server may be destroyed. Thus when the vcpu is awoken the code in
handle_hvm_io_completion() will find no pending request to wait for, but will
leave the internal vcpu io_req.state set to IOREQ_READY and the vcpu shutdown
deferral flag in place (because hvm_io_assist() will never be called). The
emulation request is then completed anyway. This means that any subsequent call
to hvmemul_do_io() will find an unexpected value in io_req.state and will
return X86EMUL_UNHANDLEABLE, which in some cases will result in continuous
re-tries.

This patch fixes the issue by moving the setting of io_req.state and clearing
of shutdown deferral (as well as MSI-X write completion) out of hvm_io_assist()
and directly into handle_hvm_io_completion().

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f7039ee41b3d3448775a1623f230037fd0455104
master date: 2020-06-09 12:56:24 +0200

4 years ago x86/Intel: insert Ice Lake and Comet Lake model numbers
Jan Beulich [Wed, 24 Jun 2020 15:16:30 +0000 (17:16 +0200)]
x86/Intel: insert Ice Lake and Comet Lake model numbers

Both match prior generation processors as far as LBR and C-state MSRs
go (SDM rev 072) as well as applicability of the if_pschange_mc erratum
(recent spec updates).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1fe406685cb19e9544681c6243e7d376deb0297e
master date: 2020-06-09 12:55:53 +0200

4 years ago build: fix dependency tracking for preprocessed files
Jan Beulich [Wed, 24 Jun 2020 15:15:53 +0000 (17:15 +0200)]
build: fix dependency tracking for preprocessed files

While the issue is more general, I noticed asm-macros.i not getting
re-generated as needed. This was due to its .*.d file mentioning
asm-macros.o instead of asm-macros.i. Use -MQ here as well, and while at
it also use -MQ to avoid the somewhat fragile sed-ary on the *.lds
dependency tracking files. While there, further avoid open-coding $(CPP)
and drop the bogus (Arm) / stale (x86) -Ui386.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 75131ad75bb3c91717b5dfda6881e61c52bfd22e
master date: 2020-06-08 10:25:40 +0200

4 years ago x86/svm: do not try to handle recalc NPT faults immediately
Igor Druzhinin [Wed, 24 Jun 2020 15:15:23 +0000 (17:15 +0200)]
x86/svm: do not try to handle recalc NPT faults immediately

A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(). Moreover, in the general case, if no
explicit handling is done there, the fault is wrongly considered fatal.

This covers a specific case of migration with a vGPU assigned which
uses direct MMIO mappings made by the XEN_DOMCTL_memory_mapping hypercall:
at the moment log-dirty is enabled globally, recalculation is requested
for the whole guest memory including those mapped MMIO regions,
which causes a page fault to be raised at the first access to them;
but due to the MMIO P2M type not having any explicit handling in
hvm_hap_nested_page_fault(), the domain is erroneously crashed with an
unhandled SVM violation.

Instead of trying to be opportunistic, use a safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying do_recalc return code with Intel implementation
where returning 1 means P2M was actually changed.

Since there was previously no case where p2m_pt_handle_deferred_changes()
could return a positive value, it's safe to replace ">= 0" with just "== 0"
in the VMEXIT_NPF handler. finish_type_change() is also not affected by the
change, as it is already able to deal with a >0 return value of p2m->recalc
from the EPT implementation.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 51ca66c37371b10b378513af126646de22eddb17
master date: 2020-06-05 17:12:11 +0200

4 years ago build32: don't discard .shstrtab in linker script
Roger Pau Monné [Wed, 24 Jun 2020 15:14:40 +0000 (17:14 +0200)]
build32: don't discard .shstrtab in linker script

LLVM linker doesn't support discarding .shstrtab, and complains with:

ld -melf_i386_fbsd -N -T build32.lds -o reloc.lnk reloc.o
ld: error: discarding .shstrtab section is not allowed

Add explicit .shstrtab, .strtab and .symtab sections to the linker
script after the text section in order to make LLVM LD happy and match
the behavior of GNU LD.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 10d27b48b5b4dfbead2d9bf03290984bba4806e4
master date: 2020-06-02 13:37:53 +0200

4 years ago x86/mm: do not attempt to convert _PAGE_GNTTAB to a boolean
Roger Pau Monné [Wed, 24 Jun 2020 15:14:11 +0000 (17:14 +0200)]
x86/mm: do not attempt to convert _PAGE_GNTTAB to a boolean

Clang 10 complains with:

mm.c:1239:10: error: converting the result of '<<' to a boolean always evaluates to true
      [-Werror,-Wtautological-constant-compare]
    if ( _PAGE_GNTTAB && (l1e_get_flags(l1e) & _PAGE_GNTTAB) &&
         ^
xen/include/asm/x86_64/page.h:161:25: note: expanded from macro '_PAGE_GNTTAB'
#define _PAGE_GNTTAB (1U<<22)
                        ^

Remove the conversion of _PAGE_GNTTAB to a boolean and instead use a
preprocessor conditional to check if _PAGE_GNTTAB is defined.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6eb61b1a9dfe23ca443f977799cafb22770708a0
master date: 2020-06-02 13:36:41 +0200
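The approach described above, replacing a run-time test of a constant with a preprocessor conditional, might look like the following. This is a hedged sketch rather than the code in mm.c: `PAGE_GNTTAB_SKETCH` and `is_grant_mapping()` are hypothetical names, and the use of `#ifdef` assumes the flag is simply left undefined on configurations that lack it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag; remove this definition to model a build without it. */
#define PAGE_GNTTAB_SKETCH (UINT32_C(1) << 22)

/*
 * The flag's existence is decided at preprocessing time, so the compiler
 * never sees a constant shift result being converted to a boolean.
 */
static bool is_grant_mapping(uint32_t flags)
{
#ifdef PAGE_GNTTAB_SKETCH
    return (flags & PAGE_GNTTAB_SKETCH) != 0;
#else
    (void)flags;
    return false;
#endif
}
```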

4 years ago x86emul: rework CMP and TEST emulation
Jan Beulich [Wed, 24 Jun 2020 15:13:29 +0000 (17:13 +0200)]
x86emul: rework CMP and TEST emulation

Unlike similarly encoded insns these don't write their memory operands,
and hence x86_is_mem_write() should return false for them. However,
rather than adding special logic there, rework how their emulation gets
done, by making decoding attributes properly describe the r/o nature of
their memory operands:
-  change the table entries for opcodes 0x38 and 0x39, with no other
   adjustments to the attributes later on,
-  for the other opcodes, leave the table entries as they are, and
   override the attributes for the specific sub-cases (identified by
   ModRM.reg).

For opcodes 0x38 and 0x39 the change of the table entries implies
changing the order of operands as passed to emulate_2op_SrcV(), hence
the splitting of the cases in the main switch().

Note how this also allows dropping custom LOCK prefix checks.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 20bc1b9cc99b70b17757e1903f629c7a26584790
master date: 2020-05-29 17:28:45 +0200

4 years ago x86emul: address x86_insn_is_mem_{access,write}() omissions
Jan Beulich [Wed, 24 Jun 2020 15:12:53 +0000 (17:12 +0200)]
x86emul: address x86_insn_is_mem_{access,write}() omissions

First of all explain in comments what the functions' purposes are. Then
make them actually match their comments.

Note that fc6fa977be54 ("x86emul: extend x86_insn_is_mem_write()
coverage") didn't actually fix the function's behavior for {,V}STMXCSR:
Both are covered by generic code higher up in the function, due to
x86_decode_twobyte() already doing suitable adjustments. And VSTMXCSR
wouldn't have been covered anyway without a further X86EMUL_OPC_VEX()
case label. Keep the inner case label in a comment for reference.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e28d13eeb65c25c0bd56e8bfa83c7473047d778d
master date: 2020-05-29 17:28:04 +0200

4 years ago x86/hvm: Improve error information in handle_pio()
Andrew Cooper [Wed, 24 Jun 2020 15:12:20 +0000 (17:12 +0200)]
x86/hvm: Improve error information in handle_pio()

domain_crash() should always have a message which is emitted even in release
builds, so something more useful than this is presented to the user.

  (XEN) domain_crash called from io.c:171
  (XEN) domain_crash called from io.c:171
  (XEN) domain_crash called from io.c:171
  ...

To avoid possibly printing stack rubble, initialise data to ~0 right away.
Furthermore, the maximum access size is 4, so drop data from long to int.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 4609fc8eb04e6af531d86923c9d057f32a96b7d8
master date: 2020-05-29 16:25:05 +0100

4 years ago VT-x: extend LBR Broadwell errata coverage
Jan Beulich [Wed, 24 Jun 2020 15:11:44 +0000 (17:11 +0200)]
VT-x: extend LBR Broadwell errata coverage

For lbr_tsx_fixup_check() simply name a few more specific erratum
numbers.

For bdf93_fixup_check(), however, more models are affected. Oddly enough
despite being the same model and stepping, the erratum is listed for
Xeon E3 but not its Core counterpart. Apply the workaround uniformly,
and also for Xeon D, which only has the LBR-from one listed in its spec
update.

Seeing this broader applicability, rename anything BDF93-related to more
generic names.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 724913de8ac8426d313a4645741d86c1169ae406
master date: 2020-05-28 12:03:25 +0200

4 years ago x86/boot: Fix load_system_tables() to be NMI/#MC-safe
Andrew Cooper [Wed, 24 Jun 2020 15:11:08 +0000 (17:11 +0200)]
x86/boot: Fix load_system_tables() to be NMI/#MC-safe

During boot, load_system_tables() is used in reinit_bsp_stack() to switch the
virtual addresses used from their .data/.bss alias, to their directmap alias.

The structure assignment is implemented as a memset() to zero first, then a
copy-in of the new data.  This causes the NMI/#MC stack pointers to
transiently become 0, at a point where we may have an NMI watchdog running.

Rewrite the logic using a volatile tss pointer (equivalent to, but more
readable than, using ACCESS_ONCE() for all writes).

This does drop the zeroing side effect for holes in the structure, but the
backing memory for the TSS is fully zeroed anyway, and architecturally, they
are all reserved.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9f3e9139fa6c3d620eb08dff927518fc88200b8d
master date: 2020-05-27 16:44:04 +0100
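The difference between a whole-structure assignment and field-by-field writes through a volatile pointer can be sketched like this. It is an illustration under stated assumptions only: `struct tss_sketch` is a simplified, hypothetical layout (not Xen's TSS definition) and `update_tss()` is an invented helper.

```c
#include <stdint.h>

/* Simplified, hypothetical layout; the real TSS has more fields. */
struct tss_sketch {
    uint64_t rsp0;
    uint64_t ist[7];
};

/*
 * Write each field individually through a volatile pointer. Every store
 * replaces an old valid value with a new valid value, and the compiler may
 * not merge the stores into "zero everything, then copy", so no field is
 * transiently zero.
 */
static void update_tss(volatile struct tss_sketch *tss, uint64_t stack_top,
                       const uint64_t new_ist[7])
{
    tss->rsp0 = stack_top;
    for (unsigned int i = 0; i < 7; i++)
        tss->ist[i] = new_ist[i];
}
```

As the commit message notes, this style leaves any padding or reserved holes untouched, which is acceptable when the backing memory is already zeroed.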

4 years ago x86: clear RDRAND CPUID bit on AMD family 15h/16h
Jan Beulich [Wed, 24 Jun 2020 15:10:22 +0000 (17:10 +0200)]
x86: clear RDRAND CPUID bit on AMD family 15h/16h

Inspired by Linux commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24:

    There have been reports of RDRAND issues after resuming from suspend on
    some AMD family 15h and family 16h systems. This issue stems from a BIOS
    not performing the proper steps during resume to ensure RDRAND continues
    to function properly.

    Update the CPU initialization to clear the RDRAND CPUID bit for any family
    15h and 16h processor that supports RDRAND. If it is known that the family
    15h or family 16h system does not have an RDRAND resume issue or that the
    system will not be placed in suspend, the "cpuid=rdrand" kernel parameter
    can be used to stop the clearing of the RDRAND CPUID bit.

    Note, that clearing the RDRAND CPUID bit does not prevent a processor
    that normally supports the RDRAND instruction from executing it. So any
    code that determined the support based on family and model won't #UD.

Warn if no explicit choice was given on affected hardware.

Check RDRAND functions at boot as well as after S3 resume (the retry
limit chosen is entirely arbitrary).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93401e28a84b9dc5945f5d0bf5bce68e9d5ee121
master date: 2020-05-27 09:49:37 +0200

4 years ago x86/idle: Extend ISR/C6 erratum workaround to Haswell
Andrew Cooper [Wed, 24 Jun 2020 15:03:30 +0000 (17:03 +0200)]
x86/idle: Extend ISR/C6 erratum workaround to Haswell

This bug was first discovered against Haswell.  It is definitely affected.

(The XenServer ticket for this bug was opened on 2013-05-30 which is coming up
on 7 years old, and predates Broadwell).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b72d8870b5f68f06b083e6bfdb28f081bcb6ab3b
master date: 2020-05-22 20:04:23 +0100

4 years ago x86/idle: prevent entering C3/C6 on some Intel CPUs due to errata
Roger Pau Monné [Wed, 24 Jun 2020 15:02:55 +0000 (17:02 +0200)]
x86/idle: prevent entering C3/C6 on some Intel CPUs due to errata

Apply a workaround for errata BA80, AAK120, AAM108, AAO67, BD59,
AAY54: Rapid Core C3/C6 Transition May Cause Unpredictable System
Behavior.

Limit maximum C state to C1 when SMT is enabled on the affected CPUs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b2d502466547e6782ccadd501b8ef1482c391f2c
master date: 2020-05-22 16:08:54 +0200

4 years ago x86/idle: prevent entering C6 with in service interrupts on Intel
Roger Pau Monné [Wed, 24 Jun 2020 15:02:23 +0000 (17:02 +0200)]
x86/idle: prevent entering C6 with in service interrupts on Intel

Apply a workaround for Intel errata BDX99, CLX30, SKX100, CFW125,
BDF104, BDH85, BDM135, KWB131: "A Pending Fixed Interrupt May Be
Dispatched Before an Interrupt of The Same Priority Completes".

Apply the errata to all server and client models (big cores) from
Broadwell to Cascade Lake. The workaround is grouped together with the
existing fix for errata AAJ72, and the eoi from the function name is
removed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fc44a7014cafe28b8c53eeaf6ac2a71f5bc8b815
master date: 2020-05-22 16:07:38 +0200

4 years ago x86/idle: rework C6 EOI workaround
Roger Pau Monné [Wed, 24 Jun 2020 15:01:48 +0000 (17:01 +0200)]
x86/idle: rework C6 EOI workaround

Change the C6 EOI workaround (errata AAJ72) to use x86_match_cpu. Also
call the workaround from mwait_idle; previously it was only used by
the ACPI idle driver. Finally, make sure the routine is called for all
states equal to or greater than ACPI_STATE_C3. Note that the ACPI driver
doesn't currently handle them, but the errata condition shouldn't be
limited by that.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5fef1fd713660406a6187ef352fbf79986abfe43
master date: 2020-05-20 12:48:37 +0200

4 years ago x86: determine MXCSR mask in all cases
Jan Beulich [Wed, 24 Jun 2020 15:01:10 +0000 (17:01 +0200)]
x86: determine MXCSR mask in all cases

For its use(s) by the emulator to be correct in all cases, the filling
of the variable needs to be independent of XSAVE availability. As
there's no suitable function in i387.c to put the logic in, keep it in
xstate_init(), arrange for the function to be called unconditionally,
and pull the logic ahead of all return paths there.

Fixes: 9a4496a35b20 ("x86emul: support {,V}{LD,ST}MXCSR")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2b532519d64e653a6bbfd9eefed6040a09c8876d
master date: 2020-05-18 17:18:56 +0200

4 years ago x86/hvm: Fix shifting in stdvga_mem_read()
Andrew Cooper [Wed, 24 Jun 2020 15:00:33 +0000 (17:00 +0200)]
x86/hvm: Fix shifting in stdvga_mem_read()

stdvga_mem_read() has a return type of uint8_t, which promotes to int rather
than unsigned int.  Shifting by 24 may hit the sign bit.

Spotted by Coverity.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 97fb0253e6c2f2221bfd0895b7ffe3a99330d847
master date: 2020-05-18 15:22:53 +0100
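The integer-promotion issue can be shown in a few lines of standalone C. In this sketch `read_byte()` and `read_dword()` are hypothetical helpers, not the stdvga code; the point is only that a `uint8_t` operand promotes to `int`, so it is cast to an unsigned 32-bit type before being shifted by up to 24 bits.

```c
#include <stdint.h>

/* Hypothetical byte source standing in for a uint8_t-returning reader. */
static uint8_t read_byte(unsigned int idx)
{
    static const uint8_t mem[4] = { 0x11, 0x22, 0x33, 0xff };

    return mem[idx & 3];
}

/*
 * Assemble a little-endian 32-bit value. Without the cast, 0xff << 24
 * would be evaluated in (signed) int and overflow into the sign bit.
 */
static uint32_t read_dword(void)
{
    uint32_t data = 0;

    for (unsigned int i = 0; i < 4; i++)
        data |= (uint32_t)read_byte(i) << (8 * i);

    return data;
}
```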

4 years ago x86/build: Unilaterally disable -fcf-protection
Andrew Cooper [Wed, 24 Jun 2020 14:59:49 +0000 (16:59 +0200)]
x86/build: Unilaterally disable -fcf-protection

Xen doesn't support CET-IBT yet.  At a minimum, logic is required to enable it
for supervisor use, but the livepatch functionality needs to learn not to
overwrite ENDBR64 instructions.

Furthermore, Ubuntu enables -fcf-protection by default, along with a buggy
version of GCC-9 which objects to it in combination with
-mindirect-branch=thunk-extern (Fixed in GCC 10, 9.4).

Various objects (Xen boot path, Rombios 32 stubs) require .text to be at the
beginning of the object.  These paths explode when .note.gnu.properties gets
put ahead of .text and we end up executing the notes data.

Disable -fcf-protection for all embedded objects.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

4 years ago x86/build: move -fno-asynchronous-unwind-tables into EMBEDDED_EXTRA_CFLAGS
Andrew Cooper [Wed, 24 Jun 2020 14:59:21 +0000 (16:59 +0200)]
x86/build: move -fno-asynchronous-unwind-tables into EMBEDDED_EXTRA_CFLAGS

Users of EMBEDDED_EXTRA_CFLAGS already use -fno-asynchronous-unwind-tables, or
ought to.  This shrinks the size of the rombios 32bit stubs in guest memory.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

4 years ago x86/build32: Discard all orphaned sections
Andrew Cooper [Wed, 24 Jun 2020 14:58:51 +0000 (16:58 +0200)]
x86/build32: Discard all orphaned sections

Linkers may put orphaned sections ahead of .text, which breaks the calling
requirements.  A concrete example is Ubuntu's GCC-9 default of enabling
-fcf-protection which causes us to try and execute .note.gnu.properties during
Xen's boot.

Put .got.plt in its own section as it specifically needs preserving from the
linker's point of view, and discard everything else.  This will hopefully be
more robust to other unexpected toolchain properties.

Fixes boot from an Ubuntu build of Xen.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

4 years ago x86/guest: Fix assembler warnings with newer binutils
Andrew Cooper [Wed, 24 Jun 2020 14:58:22 +0000 (16:58 +0200)]
x86/guest: Fix assembler warnings with newer binutils

GAS of at least version 2.34 complains:

  hypercall_page.S: Assembler messages:
  hypercall_page.S:24: Warning: symbol 'HYPERCALL_set_trap_table' already has its type set
  ...
  hypercall_page.S:71: Warning: symbol 'HYPERCALL_arch_7' already has its type set

which is because the whole page is declared as STT_OBJECT already.  Rearrange
.set with respect to .type in DECLARE_HYPERCALL() so STT_FUNC is already in
place.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

4 years ago x86/cpuidle: correct Cannon Lake residency MSRs
Jan Beulich [Wed, 24 Jun 2020 14:57:39 +0000 (16:57 +0200)]
x86/cpuidle: correct Cannon Lake residency MSRs

As per SDM rev 071 Cannon Lake has
- no CC3 residency MSR at 3FC,
- a CC1 residency MSR at 660 (like various Atoms),
- a useless (always zero) CC3 residency MSR at 662.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9ff09aefc46385dc04c38b6dd1f1ac25f784f482
master date: 2020-04-03 17:15:58 +0200

4 years ago update Xen version to 4.12.4-pre
Jan Beulich [Wed, 24 Jun 2020 14:55:04 +0000 (16:55 +0200)]
update Xen version to 4.12.4-pre

4 years ago tools/libxl: Fix memory leak in libxl_cpuid_set()
Andrew Cooper [Fri, 12 Jun 2020 17:32:27 +0000 (18:32 +0100)]
tools/libxl: Fix memory leak in libxl_cpuid_set()

xc_cpuid_set() returns allocated memory via cpuid_res, which libxl needs to
free() seeing as it discards the results.

This is logically a backport of c/s b91825f628 "tools/libxc: Drop
config_transformed parameter from xc_cpuid_set()" but rewritten as one caller
of xc_cpuid_set() does use returned values.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit c54de7d9df7718ea53bf21e1ff5bbd339602a704)

4 years ago x86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden
Andrew Cooper [Wed, 10 Jun 2020 17:57:00 +0000 (18:57 +0100)]
x86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden

RDRAND/RDSEED can be hidden using cpuid= to mitigate SRBDS if microcode
isn't available.

This is part of XSA-320 / CVE-2020-0543.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 7028534d8482d25860c4d1aa8e45f0b911abfc5a)

4 years ago x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel

See patch documentation and comments.

This is part of XSA-320 / CVE-2020-0543

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a49b9a7920c82015381740905582b666160d955)

4 years ago x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling

This is part of XSA-320 / CVE-2020-0543

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit caab85ab58c0cdf74ab070a5de5c4df89f509ff3)

4 years ago update Xen version to 4.12.3 (tag RELEASE-4.12.3)
Jan Beulich [Thu, 14 May 2020 12:21:48 +0000 (14:21 +0200)]
update Xen version to 4.12.3

5 years ago x86/ucode/intel: Writeback and invalidate caches before updating microcode
Ashok Raj [Thu, 7 May 2020 12:58:16 +0000 (14:58 +0200)]
x86/ucode/intel: Writeback and invalidate caches before updating microcode

Updating microcode is less error prone when caches have been flushed,
depending on what exactly the microcode is updating. For example, some of the
issues around certain Broadwell parts can be addressed by doing a full cache
flush.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Linux commit 91df9fdf51492aec9fed6b4cbd33160886740f47, ported to Xen]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 77c82949990edaf21130be842a289a7fb7a439e1
master date: 2020-05-05 20:18:19 +0100

5 years ago x86/traps: fix an off-by-one error
Hongyan Xia [Thu, 7 May 2020 12:57:35 +0000 (14:57 +0200)]
x86/traps: fix an off-by-one error

stack++ can go into the next page and unmap_domain_page() will unmap the
wrong one, causing mapcache and memory corruption. Fix.

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2e3d87cc734a895ef5b486926274a178836b67a9
master date: 2020-05-05 16:13:44 +0100
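The hazard described above, an incremented pointer ending up in the following page, can be illustrated with plain address arithmetic. This sketch does not reproduce the Xen fix: `PAGE_SIZE_SKETCH`, `page_base()` and `words_until_page_end()` are hypothetical, and the start address is assumed to be 8-byte aligned.

```c
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096u   /* assumed page size for the sketch */

/* Round an address down to the start of its page. */
static uintptr_t page_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(PAGE_SIZE_SKETCH - 1);
}

/*
 * Count how many 8-byte words can be read starting at 'start' without
 * leaving its page, capped at max_words. After the last word of a page
 * has been consumed, the advanced cursor lies in the NEXT page, so any
 * cleanup must be keyed on 'start', not on the cursor.
 */
static unsigned int words_until_page_end(uintptr_t start,
                                         unsigned int max_words)
{
    unsigned int n = 0;
    uintptr_t cursor = start;

    while (n < max_words && page_base(cursor) == page_base(start)) {
        cursor += sizeof(uint64_t);
        n++;
    }

    return n;
}
```

Keeping the originally mapped address around for the unmap step, rather than reusing the walking pointer, is one way to avoid unmapping the wrong page.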

5 years ago x86/hvm: simplify hvm_physdev_op allowance control
Roger Pau Monné [Thu, 7 May 2020 12:56:56 +0000 (14:56 +0200)]
x86/hvm: simplify hvm_physdev_op allowance control

PVHv1 dom0 was given access to all PHYSDEVOP hypercalls, and such
restriction was not removed when PVHv1 code was removed. As a result
the switch in hvm_physdev_op was more complicated than required, and
relied on PVHv2 dom0 not having PIRQ support in order to prevent
access to some PV specific PHYSDEVOPs.

Fix this by moving the default case to the bottom of the switch, since
there's no need for any fall through now. Also remove the hardware
domain check, as all the not explicitly listed PHYSDEVOPs are
forbidden for HVM domains.

Finally tighten the condition to allow usage of
PHYSDEVOP_pci_mmcfg_reserved: apart from having vPCI enabled it should
only be used by the hardware domain. Note that the code in
do_physdev_op is already restricting the call to privileged domains
only, but it can be further restricted to the hardware domain only, as
other privileged domains don't have access to MMCFG regions anyway.

Overall no functional change should arise from this change.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a00e3737e085ebc1f313e36b188d4958e939e531
master date: 2020-05-05 09:52:28 +0200

5 years ago x86emul: extend x86_insn_is_mem_write() coverage
Jan Beulich [Thu, 7 May 2020 12:56:03 +0000 (14:56 +0200)]
x86emul: extend x86_insn_is_mem_write() coverage

Several insns were missed when this function was first added. As far as
insns already supported by the emulator go - SMSW and {,V}STMXCSR were
wrongly considered r/o insns so far.

Insns like the VMX, SVM, or CET-SS ones, PTWRITE, or AMD's new SNP ones
are intentionally not covered just yet. VMPTRST is put there just to
complete the respective group.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc6fa977be54a24a1325e3f2d08b1b1dcb675f44
master date: 2020-05-05 09:50:54 +0200

5 years ago x86/pass-through: avoid double IRQ unbind during domain cleanup
Jan Beulich [Thu, 7 May 2020 12:54:39 +0000 (14:54 +0200)]
x86/pass-through: avoid double IRQ unbind during domain cleanup

XEN_DOMCTL_destroydomain creates a continuation if domain_kill -ERESTARTs.
In that scenario, it is possible to receive multiple _pirq_guest_unbind
calls for the same pirq from domain_kill, if the pirq has not yet been
removed from the domain's pirq_tree, as:
  domain_kill()
    -> domain_relinquish_resources()
      -> pci_release_devices()
        -> pci_clean_dpci_irq()
          -> pirq_guest_unbind()
            -> __pirq_guest_unbind()

Avoid recurring invocations of pirq_guest_unbind() by removing the pIRQ
from the tree being iterated after the first call there. In case such a
removed entry still has a softirq outstanding, record it and re-check
upon re-invocation.

Note that pirq_cleanup_check() gets relaxed beyond what's strictly
needed here, to avoid introducing an asymmetry there between HVM and PV
guests.

Reported-by: Varad Gautam <vrd@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Varad Gautam <vrd@amazon.de>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5b58dad089880127674d460494d1a9d68109b3d7
master date: 2020-04-30 10:40:59 +0200

5 years ago xen/grants: fix hypercall continuation for GNTTABOP_cache_flush
Juergen Gross [Thu, 7 May 2020 12:53:13 +0000 (14:53 +0200)]
xen/grants: fix hypercall continuation for GNTTABOP_cache_flush

The GNTTABOP_cache_flush hypercall has a wrong test for hypercall
continuation. The test today is:

    if ( rc > 0 || opaque_out != 0 )

Unfortunately this will be true even in case of an error (rc < 0),
possibly leading to very long lasting hypercalls (times of more
than an hour have been observed in a test case).

Correct the test condition to result in false with rc < 0 and set
opaque_out only if no error occurred, to be on the safe side.
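
As a rough illustration of the difference, the two predicates can be
compared in a stand-alone C sketch. The helper names continue_old() and
continue_new() are hypothetical, and the second expression is only one
way to make the test false for rc < 0; it is not claimed to be the exact
condition used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Test quoted above: also true when rc reports an error (rc < 0). */
static bool continue_old(int rc, uint64_t opaque_out)
{
    return rc > 0 || opaque_out != 0;
}

/* Hypothetical corrected test: an error never requests a continuation. */
static bool continue_new(int rc, uint64_t opaque_out)
{
    return rc > 0 || (rc == 0 && opaque_out != 0);
}

/* Small self-check contrasting the two predicates. */
static void demo(void)
{
    assert(continue_old(-1, 42));   /* error, yet a continuation is set up */
    assert(!continue_new(-1, 42));  /* error goes back to the caller */
    assert(continue_new(1, 0));     /* preempted: continue */
    assert(continue_new(0, 42));    /* progress recorded: continue */
    assert(!continue_new(0, 0));    /* finished: no continuation */
}
```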

Partially-suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 46d8f69d466a05863737fb81d8c9ef39c3be8b45
master date: 2020-04-29 14:12:50 +0100

5 years ago libxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)
Andrew Cooper [Tue, 4 Feb 2020 20:29:38 +0000 (20:29 +0000)]
libxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)

It turns out that a bug (since forever) in Xen causes XSAVE records to have
non-architectural behaviour on xsave-capable hardware, when a PV guest has not
touched the state.

In such a case, the data record returned from Xen is 2*uint64_t, both claiming
the (illegitimate) state of %xcr0 and %xcr0_accum being 0.

Adjust the bound in handle_x86_pv_vcpu_blob() to cope with this.

Fixes: 2a62c22715b "libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()"
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0729830cc425a8ff27a3137e87b93768ae3c853c)
(cherry picked from commit d2aecd86c4481291b260869c47cf0a9a02321564)

5 years ago libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()
Andrew Cooper [Thu, 19 Dec 2019 20:32:20 +0000 (20:32 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()

The current logic only works by chance, in that XSAVE records also tend to be
a multiple of 128.  Implement the missing logic for XSAVE.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 2a62c22715bf81c5695ae0511f89a940c7c6a492)
(cherry picked from commit 0e2bbcf8b4fe6f5fd23a341848f3785c213b26bb)

5 years ago libxc/restore: Fix data auditing in handle_x86_pv_info()
Andrew Cooper [Wed, 18 Dec 2019 20:17:42 +0000 (20:17 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_info()

handle_x86_pv_info() has a subtle bug.  It uses an 'else if' chain with a
clause in the middle which doesn't exit unconditionally.  In practice, this
means that when restoring a 32bit PV guest, later sanity checks are skipped.

Rework the logic a little to be simpler.  There are exactly two valid
combinations of fields in X86_PV_INFO, so factor this out and check them all
in one go, before making adjustments to the current domain.

Once adjustments have been completed successfully, sanity check the result
against the X86_PV_INFO settings in one go, rather than piece-wise.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit aafae0e800e9936b9eb6566e5fcdbe823625a7d1)
(cherry picked from commit 5932ee1e06047d71bcf6975e1a631e31afaf5fe2)

5 years ago libxc/restore: Fix error message for unrecognised stream version
Andrew Cooper [Tue, 17 Dec 2019 13:49:56 +0000 (13:49 +0000)]
libxc/restore: Fix error message for unrecognised stream version

The Expected and Got values are rendered in the wrong order.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit f50a4f6e244cfc8e773300c03aaf4db391f3028a)
(cherry picked from commit 7b2225078b4b91044c365b2276c8897c46241c79)

5 years ago tools/xenstore: fix a use after free problem in xenstored
Juergen Gross [Fri, 3 Apr 2020 12:03:40 +0000 (13:03 +0100)]
tools/xenstore: fix a use after free problem in xenstored

Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object
twice") introduced a potential use after free problem in
domain_cleanup(): after calling talloc_unlink() for domain->conn
domain->conn is set to NULL. The problem is that domain is registered
as talloc child of domain->conn, so it might be freed by the
talloc_unlink() call.

With Xenstore being single threaded there are normally no concurrent
memory allocations running and freeing a virtual memory area normally
doesn't result in that area no longer being accessible. A problem
could occur only in case either a signal received results in some
memory allocation done in the signal handler (SIGHUP is a primary
candidate leading to reopening the log file), or in case the talloc
framework would do some internal memory allocation during freeing of
the memory (which would lead to clobbering of the freed domain
structure).

Fixes: 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object twice")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit bb2a34fd740e9a26be9e2244f1a5b4cef439e5a8)
(cherry picked from commit dc5176d0f9434e275e0be1df8d0518e243798beb)

5 years ago libxl: Fix comment about dcs.sdss
Anthony PERARD [Thu, 23 Jan 2020 16:56:46 +0000 (16:56 +0000)]
libxl: Fix comment about dcs.sdss

The field 'sdss' was named 'dmss' before; commit 3148bebbf0ab did the
rename but didn't update the comment.

Fixes: 3148bebbf0ab ("libxl: rename a field in libxl__domain_create_state")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 035c4d771600f300382a1637f2da33023f76b4c1)
(cherry picked from commit 5351a0a170fc7f6290d7d3d8be302d53d2426a87)

5 years ago docs/misc: pvcalls: Verbatim block should be indented with 4 spaces
Julien Grall [Sat, 11 Jan 2020 00:03:44 +0000 (00:03 +0000)]
docs/misc: pvcalls: Verbatim block should be indented with 4 spaces

At the moment, the diagram is only indented with 2 spaces, so pandoc
misinterprets it and does not display it correctly.

Fix it by indenting the whole block by 4 spaces (i.e. an extra 2 spaces).

Fixes: d661611d08 ("docs/markdown: Switch to using pandoc, and fix underscore escaping")
Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9c8705f8fe5bfb75a6a00163308d297059b61f6a)
(cherry picked from commit 8b60270731eabe7a7dfd41bd625338505829a617)

5 years ago docs: document CONTROL command of xenstore protocol
Juergen Gross [Tue, 28 Jan 2020 06:21:07 +0000 (06:21 +0000)]
docs: document CONTROL command of xenstore protocol

The CONTROL command (former DEBUG command) isn't specified in the
xenstore protocol doc. Add it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport: 4.9+
(cherry picked from commit f910c3ebc6a178c5cbbc0868134be536fae7f7cf)

5 years ago docs: add DIRECTORY_PART specification do xenstore protocol doc
Juergen Gross [Mon, 27 Jan 2020 16:50:50 +0000 (17:50 +0100)]
docs: add DIRECTORY_PART specification do xenstore protocol doc

DIRECTORY_PART was missing in docs/misc/xenstore.txt. Add it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Wei Liu <wl@xen.org>
Backport: 4.9+
(cherry picked from commit 94a0252c10cb9938bdee98cc456c23e17b28eafb)

5 years ago build,xsm: fix multiple call
Anthony PERARD [Mon, 27 Apr 2020 13:58:42 +0000 (15:58 +0200)]
build,xsm: fix multiple call

Both scripts, mkflask.sh and mkaccess_vector.sh, generate multiple
files. Exploit the 'multi-target pattern rule' trick to call each
script only once.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 52f3f319851e40892fbafeae53e512c7d61f03d0
master date: 2020-04-23 09:59:05 +0200

5 years ago x86: validate VM assist value in arch_set_info_guest()
Jan Beulich [Mon, 27 Apr 2020 13:57:13 +0000 (15:57 +0200)]
x86: validate VM assist value in arch_set_info_guest()

While I can't spot anything that would go wrong, just like the
respective hypercall only permits applicable bits to be set, we should
also do so when loading guest context.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a62c6fe05c4ae905b7d4cb0ca946508b7f96d522
master date: 2020-04-22 13:01:10 +0200

5 years ago x86/HVM: expose VM assist hypercall
Jan Beulich [Mon, 27 Apr 2020 13:55:51 +0000 (15:55 +0200)]
x86/HVM: expose VM assist hypercall

In preparation for the addition of VMASST_TYPE_runstate_update_flag
commit 72c538cca957 ("arm: add support for vm_assist hypercall") enabled
the hypercall for Arm. I consider it not logical that it then isn't also
exposed to x86 HVM guests (with the same single feature permitted to be
enabled as Arm has); Linux actually tries to use it afaict.

Rather than introducing yet another thin wrapper around vm_assist(),
make that function the main handler, requiring a per-arch
arch_vm_assist_valid_mask() definition instead.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: f13404d57f55a97838f1c16a366fbc3231ec21f1
master date: 2020-04-22 12:58:25 +0200

5 years ago x86: Enumeration for Control-flow Enforcement Technology
Andrew Cooper [Mon, 27 Apr 2020 13:54:14 +0000 (15:54 +0200)]
x86: Enumeration for Control-flow Enforcement Technology

The CET spec has been published and guest kernels are starting to get support.
Introduce the CPUID and MSRs, and fully block the MSRs from guest use.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>
master commit: 4803a67114279a656a54a23cebed646da32efeb6
master date: 2020-04-21 16:52:03 +0100

5 years ago x86/vtd: relax EPT page table sharing check
Roger Pau Monné [Mon, 27 Apr 2020 13:53:26 +0000 (15:53 +0200)]
x86/vtd: relax EPT page table sharing check

The EPT page tables can be shared with the IOMMU as long as the page
sizes supported by EPT are also supported by the IOMMU.

Current code checks that both the IOMMU and EPT support the same page
sizes, but this is not strictly required, the IOMMU supporting more
page sizes than EPT is fine and shouldn't block page table sharing.

This is likely not a common case (IOMMU supporting more page sizes
than EPT), but should still be fixed for correctness.
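
The relaxation amounts to replacing an equality test with a subset
test. A minimal sketch, assuming page-size support is represented as a
bitmask; the DEMO_PAGE_* bits and both helper names are hypothetical and
do not reflect the real EPT or VT-d capability encodings:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits; not the real VT-d or EPT encodings. */
#define DEMO_PAGE_2M (1u << 0)
#define DEMO_PAGE_1G (1u << 1)

/* Old behaviour as described above: both sides must match exactly. */
static bool can_share_strict(unsigned int ept, unsigned int iommu)
{
    return ept == iommu;
}

/* Relaxed check: every size EPT may use must also be known to the IOMMU. */
static bool can_share_relaxed(unsigned int ept, unsigned int iommu)
{
    return (ept & ~iommu) == 0;
}

/* Small self-check covering the case the commit message talks about. */
static void demo(void)
{
    unsigned int both = DEMO_PAGE_2M | DEMO_PAGE_1G;

    assert(!can_share_strict(DEMO_PAGE_2M, both));   /* rejected before */
    assert(can_share_relaxed(DEMO_PAGE_2M, both));   /* IOMMU superset: ok */
    assert(!can_share_relaxed(both, DEMO_PAGE_2M));  /* EPT superset: no */
    assert(can_share_relaxed(both, both));           /* identical: ok */
}
```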

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3957e12c02670b97855ef0933b373f99993fa598
master date: 2020-04-21 10:54:56 +0200

5 years ago hvmloader: enable MMIO and I/O decode, after all resource allocation
Harsha Shamsundara Havanur [Mon, 27 Apr 2020 13:52:45 +0000 (15:52 +0200)]
hvmloader: enable MMIO and I/O decode, after all resource allocation

It was observed that PCI MMIO and/or I/O BARs were programmed with
memory and I/O decodes (bits 0 and 1 of the PCI COMMAND register) enabled
during the PCI setup phase. This resulted in an incorrect memory mapping
as soon as the lower half of a 64-bit BAR was programmed, which displaced
any RAM mappings under 4G. After the upper half is programmed, the PCI
memory mapping is restored to its intended high mem location, but the
displaced RAM is not restored. The OS then continues to boot and function
until it tries to access the displaced RAM, at which point it suffers a
page fault and crashes.

This patch addresses the issue by deferring enablement of memory and
I/O decode in the command register until all the resources, like
interrupts, I/O and/or MMIO BARs for all the PCI device functions, are
programmed in the descending order of memory requested.

Signed-off-by: Harsha Shamsundara Havanur <havanur@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a8e0c228c79f3a000e19183090eb41fca173b034
master date: 2020-04-16 10:58:46 +0200

5 years ago x86/boot: Fix early exception handling with CONFIG_PERF_COUNTERS
Andrew Cooper [Mon, 27 Apr 2020 13:51:14 +0000 (15:51 +0200)]
x86/boot: Fix early exception handling with CONFIG_PERF_COUNTERS

The PERFC_INCR() macro uses current->processor, but current is not valid
during early boot.  This causes the following crash to occur if
e.g. rdmsr_safe() has to recover from a #GP fault.

  (XEN) Early fatal page fault at e008:ffff82d0803b1a39 (cr2=0000000000000004, ec=0000)
  (XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d0803b1a39>] x86_64/entry.S#handle_exception_saved+0x64/0xb8
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d0803b1a39>] R x86_64/entry.S#handle_exception_saved+0x64/0xb8
  (XEN)    [<ffff82d0806394fe>] F __start_xen+0x2cd/0x2980
  (XEN)    [<ffff82d0802000ec>] F __high_start+0x4c/0x4e

Furthermore, the PERFC_INCR() macro is wildly inefficient.  There has been a
single caller for many releases now, so inline it and delete the macro
completely.

There is no need to reference current at all.  What is actually needed is the
per_cpu_offset which can be obtained directly from the top-of-stack block.
This simplifies the counter handling to 3 instructions and no spilling to the
stack at all.

The same breakage from above is now handled properly:

  (XEN) traps.c:1591: GPF (0000): ffff82d0806394fe [__start_xen+0x2cd/0x2980] -> ffff82d0803b3bfb

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <jgrall@amazon.com>
master commit: 615bfe42c6d183a0e54a0525ef82b58580d01619
master date: 2020-04-16 09:48:38 +0100

5 years ago x86/EFI: also fill boot_tsc_stamp on the xen.efi boot path
Jan Beulich [Mon, 27 Apr 2020 13:49:38 +0000 (15:49 +0200)]
x86/EFI: also fill boot_tsc_stamp on the xen.efi boot path

Commit e3a379c35eff ("x86/time: always count s_time from Xen boot")
introducing this missed adjusting this path as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0dbc112e727f6c17f306c864950bdf83dece5cd5
master date: 2020-04-14 11:42:11 +0200

5 years ago gnttab: fix GNTTABOP_copy continuation handling
Jan Beulich [Tue, 14 Apr 2020 13:00:18 +0000 (15:00 +0200)]
gnttab: fix GNTTABOP_copy continuation handling

The XSA-226 fix was flawed - the backwards transformation on rc was done
too early, causing a continuation to not get invoked when the need for
preemption was determined at the very first iteration of the request.
This in particular means that all of the status fields of the individual
operations would be left untouched, i.e. set to whatever the caller may
or may not have initialized them to.

This is part of XSA-318.

Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
master commit: d6f22d5d9e8d6848ec229083ac9fb044f0adea93
master date: 2020-04-14 14:42:32 +0200

5 years ago xen/gnttab: Fix error path in map_grant_ref()
Ross Lagerwall [Tue, 14 Apr 2020 12:58:48 +0000 (14:58 +0200)]
xen/gnttab: Fix error path in map_grant_ref()

Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets,
changing the logic.  If the _set_status() call fails, the grant_map hypercall
would fail with a status of 1 (rc != GNTST_okay) instead of the expected
negative GNTST_* error.
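
The effect of the misplaced brackets can be reproduced in isolation.
This is only a sketch: fake_set_status(), map_buggy() and map_fixed()
are hypothetical stand-ins, and the two GNTST_* values are redefined
locally so that the example compiles on its own.

```c
#include <assert.h>

/* Local stand-ins so the sketch is self-contained. */
#define GNTST_okay            0
#define GNTST_general_error (-1)

/* Hypothetical callee that reports a negative GNTST_* style error. */
static int fake_set_status(void)
{
    return GNTST_general_error;
}

/* Misplaced brackets: rc receives the result of the comparison (0 or 1). */
static int map_buggy(void)
{
    int rc;

    if ( (rc = (fake_set_status() != GNTST_okay)) )
        return rc;

    return GNTST_okay;
}

/* Intended form: rc receives the callee's value, which is then compared. */
static int map_fixed(void)
{
    int rc;

    if ( (rc = fake_set_status()) != GNTST_okay )
        return rc;

    return GNTST_okay;
}

/* Small self-check: only the fixed form propagates the negative error. */
static void demo(void)
{
    assert(map_buggy() == 1);
    assert(map_fixed() == GNTST_general_error);
}
```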

This error path can be taken due to bad guest state, and causes net/blk-back
in Linux to crash.

This is XSA-316.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: da0c66c8f48042a0186799014af69db0303b1da5
master date: 2020-04-14 14:41:02 +0200

5 years ago xen/rwlock: Add missing memory barrier in the unlock path of rwlock
Julien Grall [Tue, 14 Apr 2020 12:56:58 +0000 (14:56 +0200)]
xen/rwlock: Add missing memory barrier in the unlock path of rwlock

The rwlock unlock paths are using atomic_sub() to release the lock.
However the implementation of atomic_sub() rightfully doesn't contain a
memory barrier. On Arm, this means a processor is allowed to re-order
the memory access with the preceding access.

In other words, the unlock may be seen by another processor before all
the memory accesses within the "critical" section.

The rwlock paths already contain barriers indirectly, but they are not
very useful without the counterpart in the unlock paths.

The memory barriers are not necessary on x86 because loads/stores are
not re-ordered with lock instructions.

So add arch_lock_release_barrier() in the unlock paths, which will only
add a memory barrier on Arm.

Take the opportunity to document each lock path, explaining why a
barrier is not necessary.
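
The ordering requirement can be sketched with portable C11 atomics
rather than Xen's own primitives. Everything here (demo_rwlock_t,
demo_read_lock(), demo_read_unlock()) is hypothetical, writers are left
out entirely, and the release fence merely stands in for the role that
arch_lock_release_barrier() plays in the patch:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical reader-count-only lock; writers are left out on purpose. */
typedef struct {
    atomic_uint readers;
} demo_rwlock_t;

static void demo_read_lock(demo_rwlock_t *l)
{
    /* Acquire: accesses in the critical section cannot move above this. */
    atomic_fetch_add_explicit(&l->readers, 1, memory_order_acquire);
}

static void demo_read_unlock(demo_rwlock_t *l)
{
    /*
     * The release fence takes the place of the barrier added before the
     * decrement: accesses made inside the critical section become visible
     * before the unlock itself can be observed.
     */
    atomic_thread_fence(memory_order_release);
    atomic_fetch_sub_explicit(&l->readers, 1, memory_order_relaxed);
}

/* Single-threaded self-check of the counter bookkeeping only. */
static void demo(void)
{
    demo_rwlock_t l;

    atomic_init(&l.readers, 0);
    demo_read_lock(&l);
    assert(atomic_load(&l.readers) == 1);
    demo_read_unlock(&l);
    assert(atomic_load(&l.readers) == 0);
}
```

A single-threaded check like demo() can only confirm the counting; it
cannot demonstrate the re-ordering itself, which needs weakly ordered
hardware and concurrent observers.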

This is XSA-314.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6890a04072e664c25447a297fe663b45ecfd6398
master date: 2020-04-14 14:37:11 +0200

5 years ago xenoprof: limit consumption of shared buffer data
Jan Beulich [Tue, 14 Apr 2020 12:56:06 +0000 (14:56 +0200)]
xenoprof: limit consumption of shared buffer data

Since a shared buffer can be written to by the guest, we may only read
the head and tail pointers from there (all other fields should only ever
be written to). Furthermore, for any particular operation the two values
must be read exactly once, with both checks and consumption happening
with the thus read values. (The backtrace related xenoprof_buf_space()
use in xenoprof_log_event() is an exception: The values used there get
re-checked by every subsequent xenoprof_add_sample().)
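
The "read exactly once, then check and consume the same values" rule
can be sketched as follows. The structure layout, DEMO_BUF_SIZE and
ring_space() are hypothetical simplifications and not the real xenoprof
definitions; the ring size is assumed to live in private memory that the
guest cannot modify.

```c
#include <assert.h>
#include <stdint.h>

/* Trusted, private copy of the ring size (hypothetical value). */
#define DEMO_BUF_SIZE 8u

/* Hypothetical shared ring: the guest may rewrite both fields any time. */
struct shared_ring {
    volatile uint32_t head;
    volatile uint32_t tail;
};

/* Returns the number of free slots, or -1 if the snapshot is invalid. */
static int ring_space(const struct shared_ring *r)
{
    /* Snapshot each shared field exactly once ... */
    uint32_t head = r->head;
    uint32_t tail = r->tail;

    /* ... then validate and consume only the local copies. */
    if ( head >= DEMO_BUF_SIZE || tail >= DEMO_BUF_SIZE )
        return -1;

    /* One slot stays unused so that "full" and "empty" differ. */
    return (int)((tail + DEMO_BUF_SIZE - head - 1) % DEMO_BUF_SIZE);
}

/* Small self-check of the arithmetic and of the range validation. */
static void demo(void)
{
    struct shared_ring r = { 0, 0 };

    assert(ring_space(&r) == 7);   /* empty ring */
    r.head = 3;
    assert(ring_space(&r) == 4);   /* three slots in use */
    r.head = 9;
    assert(ring_space(&r) == -1);  /* out-of-range index is rejected */
}
```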

Since that code needed touching, also fix the double increment of the
lost samples count in case the backtrace related xenoprof_add_sample()
invocation in xenoprof_log_event() fails.

Where code is being touched anyway, add const as appropriate, but take
the opportunity to entirely drop the now unused domain parameter of
xenoprof_buf_space().

This is part of XSA-313.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 50ef9a3cb26e2f9383f6fdfbed361d8f174bae9f
master date: 2020-04-14 14:33:19 +0200

5 years ago xenoprof: clear buffer intended to be shared with guests
Jan Beulich [Tue, 14 Apr 2020 12:55:05 +0000 (14:55 +0200)]
xenoprof: clear buffer intended to be shared with guests

alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen
internally used allocations, but buffers allocated to be shared with
(unprivileged) guests need to be zapped of their prior content.

This is part of XSA-313.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0763a7ebfcdad66cf9e5475a1301eefb29bae9ed
master date: 2020-04-14 14:32:33 +0200

5 years ago xen/arm: Sign extend TimerValue when computing the CompareValue
Jeff Kubascik [Tue, 21 Jan 2020 15:07:04 +0000 (10:07 -0500)]
xen/arm: Sign extend TimerValue when computing the CompareValue

Xen will only store the CompareValue as it can be derived from the
TimerValue (ARM DDI 0487E.a section D11.2.4):

  CompareValue = (Counter[63:0] + SignExtend(TimerValue))[63:0]

While the TimerValue is a 32-bit signed value, our implementation
assumed it is a 32-bit unsigned value.
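
A small C sketch of the formula above shows why the interpretation of
TimerValue matters. The function names are hypothetical, and the
conversion of an out-of-range uint32_t to int32_t relies on the usual
two's complement behaviour of mainstream compilers:

```c
#include <assert.h>
#include <stdint.h>

/* Treats TimerValue as unsigned, as the old code is described to do. */
static uint64_t cval_zero_extended(uint64_t counter, uint32_t tval)
{
    return counter + tval;
}

/* CompareValue = (Counter[63:0] + SignExtend(TimerValue))[63:0] */
static uint64_t cval_sign_extended(uint64_t counter, uint32_t tval)
{
    return counter + (uint64_t)(int64_t)(int32_t)tval;
}

/* Small self-check: 0xffffffff is -1 once sign-extended. */
static void demo(void)
{
    assert(cval_sign_extended(1000, 0xffffffffu) == 999);
    assert(cval_zero_extended(1000, 0xffffffffu) == 1000 + 0xffffffffull);
    assert(cval_sign_extended(1000, 5) == 1005);
}
```

For non-negative timer values both variants agree; they only diverge
once bit 31 of TimerValue is set.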

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 3c601c5f056fba055b7a1438b84b69fc649275c3)

5 years ago xen/arm: remove physical timer offset
Jeff Kubascik [Tue, 21 Jan 2020 15:07:03 +0000 (10:07 -0500)]
xen/arm: remove physical timer offset

The physical timer traps apply an offset so that time starts at 0 for
the guest. However, this offset is not currently applied to the physical
counter. Per the ARMv8 Reference Manual (ARM DDI 0487E.a), section
D11.2.4 Timers, the "Offset" between the counter and timer should be
zero for a physical timer. This removes the offset to make the timer and
counter consistent.

This also cleans up the physical timer implementation to better match
the virtual timer - both cval's now hold the hardware value.

In the case the guest sets cval to a time before Xen started, the correct
behavior is to expire the timer immediately. To do this, we set the expires
argument of set_timer to zero.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit f14f55b7ee295277c8dd09e37e0fa0902ccf7eb4)

5 years ago xen/arm: during efi boot, improve the check for usable memory
Stefano Stabellini [Tue, 14 Jan 2020 23:31:55 +0000 (15:31 -0800)]
xen/arm: during efi boot, improve the check for usable memory

When booting via EFI, the EFI memory map has information about memory
regions and their type. Improve the check for the type and attribute of
each memory region to figure out whether it is usable memory or not.
This patch brings the check on par with Linux v5.5-rc6 and makes more
memory reusable as normal memory by Xen (except that Linux also reuses
EFI_PERSISTENT_MEMORY, which we do not).

Specifically, this patch also reuses memory marked as
EfiLoaderCode/Data, and it uses both Attribute and Type for the check
(Attribute needs to be EFI_MEMORY_WB).

Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit b31666c8912bf18d9eff963b06d856e7e818ff34)

5 years ago xen/arm: initialize vpl011 flag register
Jeff Kubascik [Mon, 25 Nov 2019 20:58:00 +0000 (15:58 -0500)]
xen/arm: initialize vpl011 flag register

The tx/rx fifo flags were not set when the vpl011 is initialized. This
is a problem for certain guests that are operating in polled mode, as a
guest will generally check the rx fifo empty flag to determine if there
is data before doing a read. The result is a continuous spam of the
message "vpl011: Unexpected IN ring buffer empty" before the first valid
character is received. This initializes the flag status register to the
default specified in the PL011 technical reference manual.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit b4637ed6cd5375f04ac51d6b900a9ccad6c6c03a)

5 years ago xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI
Jeff Kubascik [Tue, 4 Feb 2020 19:51:50 +0000 (14:51 -0500)]
xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI

Per the ARM Generic Interrupt Controller Architecture Specification (ARM
IHI 0069E), reserved registers should generally be treated as RAZ/WI.
To simplify the VGICv3 design and improve guest compatibility, treat the
default case for GICD and GICR registers as read_as_zero/write_ignore.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 69da7d5440c609c57c5bba9a73b91c62ba2852e6)

5 years ago credit2: fix credit reset happening too few times
Dario Faggioli [Thu, 9 Apr 2020 07:36:51 +0000 (09:36 +0200)]
credit2: fix credit reset happening too few times

There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
credit on reset condition"). In fact, the aim of that commit was to
make sure that we do not perform too many credit reset operations
(which are not super cheap, and in a hot path). But the check used
to determine whether a reset is necessary was the wrong one.

In fact, knowing just that some vCPUs have been skipped, while
traversing the runqueue (in runq_candidate()), is not enough. We
need to check explicitly whether the first vCPU in the runqueue
has a negative amount of credit.

Since a trace record is changed, this patch updates the xentrace format
file and xenalyze as well.

This should be backported.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: dae7b62e976b28af9c8efa150618c25501bf1650
master date: 2020-04-03 10:46:53 +0200

5 years ago credit2: avoid vCPUs to ever reach lower credits than idle
Dario Faggioli [Thu, 9 Apr 2020 07:36:08 +0000 (09:36 +0200)]
credit2: avoid vCPUs to ever reach lower credits than idle

There have been reports of stalls of guest vCPUs when Credit2 was used.
It seemed like these vCPUs were not getting scheduled for a very long
time, even under light load conditions (e.g., during dom0 boot).

Investigations led to the discovery that, although rarely, it can
happen that a vCPU manages to run for very long timeslices. In Credit2,
this means that, when runtime accounting happens, the vCPU will lose a
large quantity of credits. This in turn may lead to the vCPU having fewer
credits than the idle vCPUs (-2^30). At this point, the scheduler will
pick the idle vCPU, instead of the ready to run vCPU, for a few
"epochs", which is often enough for the guest kernel to conclude that
the vCPU is not responding and crash.

An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = −1073741824:

    (XEN) Runqueue 0:
    (XEN)   ncpus              = 28
    (XEN)   cpus               = 0-27
    (XEN)   max_weight         = 256
    (XEN)   pick_bias          = 22
    (XEN)   instload           = 1
    (XEN)   aveload            = 293391 (~111%)
    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    [...]
    (XEN) Runqueue 0:
    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
    [...]
    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
    (XEN) RUNQ:
    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)

We certainly don't want, under any circumstance, this to happen.
Let's, therefore, define a minimum amount of credits a vCPU can have.
During accounting, we make sure that, for however long the vCPU has
run, it will never get to have less than such minimum amount of
credits. Then, we set the credits of the idle vCPU to an even
smaller value.
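
The clamping idea can be sketched as below. DEMO_CREDIT_MIN,
DEMO_CREDIT_IDLE and burn_credit() are hypothetical names, and the
numeric values are chosen purely for illustration; they are not the
constants used by Credit2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical floor for runnable vCPUs, and a smaller idle value. */
#define DEMO_CREDIT_MIN  (-(INT32_C(1) << 30))
#define DEMO_CREDIT_IDLE (DEMO_CREDIT_MIN - 1)

/* Charge 'consumed' credits, but never drop below DEMO_CREDIT_MIN. */
static int32_t burn_credit(int32_t credit, int64_t consumed)
{
    /* Credits that can still be taken away before hitting the floor. */
    int64_t room = (int64_t)credit - DEMO_CREDIT_MIN;

    assert(consumed >= 0);

    if ( consumed >= room )
        return DEMO_CREDIT_MIN;

    return (int32_t)(credit - consumed);
}

/* Small self-check: a huge charge still leaves the vCPU above idle. */
static void demo(void)
{
    assert(burn_credit(100, 30) == 70);
    assert(burn_credit(0, INT64_C(5000000000)) == DEMO_CREDIT_MIN);
    assert(DEMO_CREDIT_IDLE < burn_credit(0, INT64_MAX));
}
```

With the floor in place, a runnable vCPU always compares higher than
the idle one, however long it has been running.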

NOTE: investigations have been done about _how_ it is possible for a
vCPU to execute for so much time that its credits become so low. While
still not completely clear, there is evidence that:
- it only happens very rarely,
- it appears to be both machine and workload specific,
- it does not look to be a Credit2 (e.g., as it happens when
  running with Credit1 as well) issue, or a scheduler issue.

This patch makes Credit2 more robust to events like this, whatever
the cause is, and should hence be backported (as far as possible).

Reported-by: Glen <glenbarney@gmail.com>
Reported-by: Tomas Mozes <hydrapolic@gmail.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 36f3662f27dec32d76c0edb4c6b62b9628d6869d
master date: 2020-04-03 10:45:43 +0200

5 years ago x86/ucode/amd: Fix more potential buffer overruns with microcode parsing
Andrew Cooper [Thu, 9 Apr 2020 07:35:24 +0000 (09:35 +0200)]
x86/ucode/amd: Fix more potential buffer overruns with microcode parsing

cpu_request_microcode() doesn't know the buffer is at least 4 bytes long
before inspecting UCODE_MAGIC.

install_equiv_cpu_table() doesn't know the boundary of the buffer it is
interpreting as an equivalency table.  This case was clearly observed at one
point in the past, given the subsequent overrun detection, but without
comprehending that the damage was already done.

Make the logic consistent with container_fast_forward() and pass size_left in
to install_equiv_cpu_table().
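
The first of the two checks, confirming that enough bytes remain before
looking at the magic number, can be sketched like this. has_magic() and
DEMO_UCODE_MAGIC are hypothetical, and the constant's value is an
assumption made for the example only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value, used only for this illustration. */
#define DEMO_UCODE_MAGIC 0x00414d44u

/* Inspect the magic only after confirming the buffer can hold it. */
static bool has_magic(const void *buf, size_t size_left)
{
    uint32_t magic;

    assert(buf);

    if ( size_left < sizeof(magic) )
        return false;

    /* memcpy() avoids alignment assumptions about the caller's buffer. */
    memcpy(&magic, buf, sizeof(magic));

    return magic == DEMO_UCODE_MAGIC;
}

/* Small self-check: a short buffer is rejected without being read. */
static void demo(void)
{
    uint32_t good = DEMO_UCODE_MAGIC, other = 0;

    assert(has_magic(&good, sizeof(good)));
    assert(!has_magic(&good, 3));
    assert(!has_magic(&other, sizeof(other)));
}
```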

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 718d1432000079ea7120f6cb770372afe707ce27
master date: 2020-04-01 14:00:12 +0100

5 years ago x86/HVM: fix AMD ECS handling for Fam10
Jan Beulich [Thu, 9 Apr 2020 07:34:19 +0000 (09:34 +0200)]
x86/HVM: fix AMD ECS handling for Fam10

The involved comparison was, very likely inadvertently, converted from
>= to > when making changes unrelated to the actual family range.

Fixes: 9841eb71ea87 ("x86/cpuid: Drop a guests cached x86 family and model information")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 5d515b1c296ebad6889748ea1e49e063453216a3
master date: 2020-04-01 12:28:30 +0200

5 years ago x86/ucode/amd: Fix potential buffer overrun with equiv table handling
Andrew Cooper [Thu, 9 Apr 2020 07:33:20 +0000 (09:33 +0200)]
x86/ucode/amd: Fix potential buffer overrun with equiv table handling

find_equiv_cpu_id() loops until it finds a 0 installed_cpu entry.  Well formed
AMD microcode containers have this property.

Extend the checking in install_equiv_cpu_table() to reject tables which don't
have a sentinel at the end.
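
A sentinel check of this kind might look as follows. The entry layout
and table_is_terminated() are hypothetical simplifications, not the real
equivalence table format:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified entry; the real table layout differs. */
struct demo_equiv_entry {
    uint32_t installed_cpu;   /* 0 marks the end of the table */
    uint32_t equiv_cpu;
};

/* Accept a table only if it is whole and ends in a zero sentinel. */
static bool table_is_terminated(const struct demo_equiv_entry *tbl,
                                size_t bytes)
{
    size_t nr = bytes / sizeof(*tbl);

    assert(tbl);

    if ( nr == 0 || bytes % sizeof(*tbl) )
        return false;

    return tbl[nr - 1].installed_cpu == 0;
}

/* Small self-check with made-up entry values. */
static void demo(void)
{
    struct demo_equiv_entry good[] = { { 0x1111, 0x11 }, { 0, 0 } };
    struct demo_equiv_entry bad[]  = { { 0x1111, 0x11 }, { 0x2222, 0x22 } };

    assert(table_is_terminated(good, sizeof(good)));
    assert(!table_is_terminated(bad, sizeof(bad)));      /* no sentinel */
    assert(!table_is_terminated(good, sizeof(good) - 1)); /* truncated */
    assert(!table_is_terminated(good, 0));                /* empty */
}
```

With such a check applied up front, a later loop that scans for a zero
installed_cpu entry cannot run past the end of the table.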

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1f97b6b9f1b5978659c5735954c37c130e7bb151
master date: 2020-03-27 13:13:26 +0000

5 years ago libx86/CPUID: fix (not just) leaf 7 processing
Jan Beulich [Thu, 9 Apr 2020 07:32:36 +0000 (09:32 +0200)]
libx86/CPUID: fix (not just) leaf 7 processing

x86_cpuid_policy_fill_native() should, as it did originally, iterate
over all subleaves here as well as over all main leaves. Switch to
using a "<= MIN()"-based approach similar to that used in
x86_cpuid_copy_to_buffer(). Also follow this for the extended main
leaves then.

Fixes: 1bd2b750537b ("libx86: Fix 32bit stubdom build of x86_cpuid_policy_fill_native()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb0bad81fceb3e81df5f73441771b49b732edf56
master date: 2020-03-27 11:40:59 +0100

5 years ago x86/ucode: Fix error paths in apply_microcode()
Andrew Cooper [Thu, 9 Apr 2020 07:31:45 +0000 (09:31 +0200)]
x86/ucode: Fix error paths in apply_microcode()

In the unlikely case that patch application completes, but the resulting
revision isn't expected, sig->rev doesn't get updated to match reality.

It will get adjusted the next time collect_cpu_info() gets called, but in the
meantime Xen might operate on a stale value.  Nothing good will come of this.

Rewrite the logic to always update the stashed revision, before worrying about
whether the attempt was a success or failure.

Take the opportunity to make the printk() messages as consistent as possible.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: d2a0a96cf76603b2e2b87c3ce80c3f9d098327d4
master date: 2020-03-26 18:57:45 +0000

5 years ago x86/shim: fix ballooning up the guest
Igor Druzhinin [Thu, 9 Apr 2020 07:30:58 +0000 (09:30 +0200)]
x86/shim: fix ballooning up the guest

args.preempted is meaningless here as it doesn't signal whether the
hypercall was preempted before. Use start_extent instead which is
correct (as long as the hypercall was invoked in a "normal" way).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 76dbabb59eeaa78e9f57407e5b15a6606488333e
master date: 2020-03-18 12:55:54 +0100

5 years agox86/vPMU: don't blindly assume IA32_PERF_CAPABILITIES MSR exists
Jan Beulich [Thu, 9 Apr 2020 07:30:14 +0000 (09:30 +0200)]
x86/vPMU: don't blindly assume IA32_PERF_CAPABILITIES MSR exists

Just like VMX'es lbr_tsx_fixup_check() the respective CPUID bit should
be consulted first.

Reported-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 15c39c7c913f26fba40231e103ce1ffa6101e7c9
master date: 2020-02-26 17:35:48 +0100

5 years agoAMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers
Jan Beulich [Thu, 9 Apr 2020 07:29:00 +0000 (09:29 +0200)]
AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers

amd_iommu_get_paging_mode() expects a count, not a "maximum possible"
value. Prior to b4f042236ae0 dropping the reference, the use of our mis-
named "max_page" in amd_iommu_domain_init() may have led to such a
misunderstanding. In an attempt to avoid such confusion in the future,
rename the function's parameter and - while at it - convert it to an
inline function.
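
To illustrate why a count and a "maximum possible" value differ here, the
following sketch computes a table height from a frame count. The 512 entries
per table and the name demo_paging_mode() are assumptions for the example,
not the actual inline function.

```c
#define DEMO_PTES_PER_TABLE 512UL   /* assumed: 9 bits per level */

/* Number of levels needed to map 'count' page frames (count > 0). */
static int demo_paging_mode(unsigned long count)
{
    int level = 1;

    while ( count > DEMO_PTES_PER_TABLE )
    {
        /* Round up to whole tables, then count the tables instead. */
        count = (count + DEMO_PTES_PER_TABLE - 1) / DEMO_PTES_PER_TABLE;
        ++level;
    }
    return level;
}
```

With a highest frame number of 512 there are 513 frames: passing 512 yields
one level, which cannot map frame 512, while passing the count 513 yields two.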

Also replace a literal 4 by an expression tying it to a wider use
constant, just like amd_iommu_quarantine_init() does.

Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain")
Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b75b3c62fe4afe381c6f74a07f614c0b39fe2f5d
master date: 2020-03-16 11:24:29 +0100

5 years agox86/msr: Virtualise MSR_PLATFORM_ID properly
Andrew Cooper [Thu, 5 Mar 2020 10:24:09 +0000 (11:24 +0100)]
x86/msr: Virtualise MSR_PLATFORM_ID properly

This is an Intel-only, read-only MSR related to microcode loading.  Expose it
in similar circumstances as the PATCHLEVEL MSR.

This should have been alongside c/s 013896cb8b2 "x86/msr: Fix handling of
MSR_AMD_PATCHLEVEL/MSR_IA32_UCODE_REV"

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 691265f96097d4fe3e46ff4267451d49b30143e6
master date: 2020-02-20 17:29:50 +0000

5 years agoVT-d: check all of an RMRR for being E820-reserved
Jan Beulich [Thu, 5 Mar 2020 10:23:33 +0000 (11:23 +0100)]
VT-d: check all of an RMRR for being E820-reserved

Checking just the first and last page is not sufficient (and redundant
for single-page regions). As we don't need to care about IA64 anymore,
use an x86-specific function to get this done without looping over each
individual page.
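
A whole-range check of this kind might look like the sketch below. The
region table and demo_all_reserved() are hypothetical and only model the
idea; they are not the x86-specific function the change uses.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_region { uint64_t start, end; bool reserved; };  /* [start, end) */

/* True only if every byte of [start, end) lies in reserved regions. */
static bool demo_all_reserved(const struct demo_region *map, size_t nr,
                              uint64_t start, uint64_t end)
{
    size_t i;

    /* Assumes 'map' is sorted by address and non-overlapping. */
    for ( i = 0; i < nr && start < end; ++i )
    {
        if ( !map[i].reserved || map[i].end <= start )
            continue;
        if ( map[i].start > start )
            return false;           /* gap before this region */
        start = map[i].end;         /* covered up to here */
    }
    return start >= end;
}
```

A range whose first and last pages are reserved but whose middle is not is
rejected, which a first/last page test would miss.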

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: d6573bc6e6b7d95bb9de8471a6bfd7048ebc50f3
master date: 2020-02-18 16:21:19 +0100

5 years agox86/time: report correct frequency of Xen PV clocksource
Igor Druzhinin [Thu, 5 Mar 2020 10:22:57 +0000 (11:22 +0100)]
x86/time: report correct frequency of Xen PV clocksource

The value of the counter represents the number of nanoseconds
since host boot. That means the correct frequency is always 1GHz.

This inconsistency caused time to go slower in PV shim on most
platforms.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: c52bd545de461127f3ca67c48e8fef7145402035
master date: 2020-02-14 18:01:52 +0000

5 years agox86/shim: suspend and resume platform time correctly
Igor Druzhinin [Thu, 5 Mar 2020 10:22:20 +0000 (11:22 +0100)]
x86/shim: suspend and resume platform time correctly

Similarly to S3, platform time needs to be saved on guest suspend
and restored on resume respectively. This should account for expected
jumps in PV clock counter value after resume. time_suspend/resume()
are safe to use in PVH setting as is since any existing operations
with PIT/HPET that they do would simply be ignored if PIT/HPET is
not present.

Additionally, add resume callback for Xen PV clocksource to avoid
its breakage on migration.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a7a3ecd82e289a9a2ecc1d3b5128580e0b577cc7
master date: 2020-02-14 18:01:52 +0000

5 years agox86/smp: reset x2apic_enabled in smp_send_stop()
David Woodhouse [Thu, 5 Mar 2020 10:21:47 +0000 (11:21 +0100)]
x86/smp: reset x2apic_enabled in smp_send_stop()

Just before smp_send_stop() re-enables interrupts when shutting down
for reboot or kexec, it calls __stop_this_cpu() which in turn calls
disable_local_APIC(), which puts the APIC back into the mode Xen found
it in at boot.

If that means turning x2APIC off and going back into xAPIC mode, then
a timer interrupt occurring just after interrupts come back on will
lead to a #GP when apic_timer_interrupt() attempts to ack the IRQ
through the EOI register in x2APIC MSR 0x80b:

  (XEN) Executing kexec image on cpu0
  (XEN) ----[ Xen-4.14-unstable  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d08026c139>] apic_timer_interrupt+0x29/0x40
  (XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
  (XEN) rax: 0000000000000000   rbx: 00000000000000fa   rcx: 000000000000080b
  ...
  (XEN) Xen code around <ffff82d08026c139> (apic_timer_interrupt+0x29/0x40):
  (XEN)  c0 b9 0b 08 00 00 89 c2 <0f> 30 31 ff e9 0e c9 fb ff 0f 1f 40 00 66 2e 0f
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d08026c139>] R apic_timer_interrupt+0x29/0x40
  (XEN)    [<ffff82d080283825>] S do_IRQ+0x95/0x750
  ...
  (XEN)    [<ffff82d0802a0ad2>] S smp_send_stop+0x42/0xd0

We can't clear the global x2apic_enabled variable in disable_local_APIC()
itself because that runs on each CPU. Instead, correct it (by using
current_local_apic_mode()) in smp_send_stop() while interrupts are still
disabled immediately after calling __stop_this_cpu() for the boot CPU,
after all other CPUs have been stopped.

cf: d639bdd9bbe ("x86/apic: Disable the LAPIC later in smp_send_stop()")
    ... which didn't quite fix it completely.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8b1002ab037aeacdece7723c07ab35ca16c1e22e
master date: 2020-02-14 18:01:52 +0000

5 years agoxen/pvh: Fix segment selector ABI
Andrew Cooper [Thu, 5 Mar 2020 10:21:09 +0000 (11:21 +0100)]
xen/pvh: Fix segment selector ABI

The written ABI states that %es will be set up, but libxc doesn't do so.  In
practice, it breaks `rep movs` inside guests before they reload %es.

The written ABI doesn't mention %ss, but libxc does set it up.  Having %ds
different to %ss is obnoxious to work with, as different registers have
different implicit segments.

Modify the spec to state that %ss is set up as a flat read/write segment.
This a) matches the Multiboot 1 spec, b) matches what is set up in practice,
and c) is the more sane behaviour for guests to use.

Fixes: 68e1183411b ('libxc: introduce a xc_dom_arch for hvm-3.0-x86_32 guests')
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/pvh: Adjust dom0's starting state

Fixes: b25fb1a04e "xen/pvh: Fix segment selector ABI"
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b25fb1a04e99cc03359eade1affb56ef0eee766f
master date: 2020-02-10 15:26:09 +0000
master commit: 6ee10313623c1f41fc72fe12372e176e744463c1
master date: 2020-02-11 11:04:26 +0000

5 years agoxmalloc: guard against integer overflow
Jan Beulich [Thu, 5 Mar 2020 10:20:12 +0000 (11:20 +0100)]
xmalloc: guard against integer overflow

There are hypercall handling paths (EFI ones are what this was found
with) needing to allocate buffers of a caller specified size. This is
generally fine, as our page allocator enforces an upper bound on all
allocations. However, certain extremely large sizes could, when adding
in allocator overhead, result in an apparently tiny allocation size,
which would typically result in either a successful allocation, but a
severe buffer overrun when using that memory block, or in a crash right
in the allocator code.
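
The wrap-around can be illustrated with a small sketch. DEMO_HDR_SIZE and
demo_padded_size() are made-up names and the 16-byte overhead is an
assumption; the real allocator's bookkeeping differs.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_HDR_SIZE 16u           /* assumed per-allocation overhead */

/*
 * Returns the padded request, or 0 if adding the header would wrap around
 * SIZE_MAX and make a huge request look tiny.
 */
static size_t demo_padded_size(size_t size)
{
    if ( size > SIZE_MAX - DEMO_HDR_SIZE )
        return 0;
    return size + DEMO_HDR_SIZE;
}
```

Without the guard, a request of SIZE_MAX - 4 bytes plus the 16-byte header
wraps to 11, which is the "apparently tiny allocation size" described above.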

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: cf38b4926e2b55d1d7715cff5095a7444f5ed42d
master date: 2020-02-06 09:53:12 +0100

5 years agoEFI: don't leak heap contents through XEN_EFI_get_next_variable_name
Jan Beulich [Thu, 5 Mar 2020 10:19:31 +0000 (11:19 +0100)]
EFI: don't leak heap contents through XEN_EFI_get_next_variable_name

Commit 1f4eb9d27d0e ("EFI: fix getting EFI variable list on some
systems") switched to using the caller provided size for the copy-out
without making sure the copied buffer is properly scrubbed.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4783ee894f6bfb0f4deec9f1fe8e7faceafaa1a2
master date: 2020-02-06 09:52:33 +0100

5 years agoEFI: re-check {get,set}-variable name strings after copying in
Jan Beulich [Thu, 5 Mar 2020 10:19:02 +0000 (11:19 +0100)]
EFI: re-check {get,set}-variable name strings after copying in

A malicious guest given permission to invoke XENPF_efi_runtime_call may
play with the strings underneath Xen sizing them and copying them in.
Guard against this by re-checking the copied in data for consistency
with the initial sizing. At the same time also check that the actual
copy-in is in fact successful, and switch to the lighter weight non-
checking flavor of the function.
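
The re-check after copy-in can be sketched as below. This is a simplified
model using plain char strings and memcpy(); demo_copy_name() is a
hypothetical helper, and the real code deals with guest handles and wide
characters.

```c
#include <stddef.h>
#include <string.h>

/*
 * 'guest' stands in for caller-controlled memory that may change between the
 * sizing pass and the copy.  'sized_len' is the length (excluding the NUL)
 * found during sizing.  Returns 0 on success, -1 if the copy disagrees.
 */
static int demo_copy_name(char *dst, size_t dst_size,
                          const char *guest, size_t sized_len)
{
    if ( sized_len >= dst_size )
        return -1;

    memcpy(dst, guest, sized_len + 1);      /* the copy-in */

    /* Re-check: terminator where expected, none earlier. */
    if ( dst[sized_len] != '\0' || memchr(dst, '\0', sized_len) != NULL )
        return -1;

    return 0;
}
```

The check runs on the private copy, so later changes to the guest's memory
can no longer affect the result.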

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ad38db5852f0e30d90c93c6a62b754f2861549e0
master date: 2020-02-06 09:51:17 +0100

5 years agoxen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext
Julien Grall [Thu, 5 Mar 2020 10:18:24 +0000 (11:18 +0100)]
xen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext

The HVM context may not fill up the full buffer passed by the caller.
While we correctly report the size of the context, we will still be
copying back the full size of the buffer.

As the buffer is allocated through xmalloc(), we will be copying some
bits from the previous allocation.

Only copy back the part of the buffer used by the HVM context to prevent
any leak.
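
As a rough sketch of the idea (demo_copy_back() is a hypothetical helper and
memcpy() stands in for the copy to the caller):

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy back only the 'used' bytes of a scratch buffer of 'buf_size' bytes.
 * Returns the number of bytes copied, or 0 if the arguments are inconsistent.
 */
static size_t demo_copy_back(void *dst, const void *buf,
                             size_t buf_size, size_t used)
{
    if ( used > buf_size )
        return 0;
    memcpy(dst, buf, used);         /* not buf_size: the tail is stale heap */
    return used;
}
```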

Note that per XSA-72, this is not a security issue.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 41d8869003e96d8b7250ad1d0246371d6929aca6
master date: 2020-01-31 18:51:38 +0000

5 years agox86/suspend: disable watchdog before calling console_start_sync()
Igor Druzhinin [Thu, 5 Mar 2020 10:17:53 +0000 (11:17 +0100)]
x86/suspend: disable watchdog before calling console_start_sync()

... and enable it after exiting S-state. Otherwise accumulated
output in serial buffer might easily trigger the watchdog if it's
still enabled after entering sync transmission mode.

The issue is observed on machines which, unfortunately, generate non-0
output in CPU offline callbacks.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e08f5f56c9955d853c26c985b6fb1fb45d0355d
master date: 2020-01-29 15:06:10 +0100

5 years agox86/apic: fix disabling LVT0 in disconnect_bsp_APIC
Roger Pau Monné [Thu, 5 Mar 2020 10:17:22 +0000 (11:17 +0100)]
x86/apic: fix disabling LVT0 in disconnect_bsp_APIC

The Intel SDM states:

"When an illegal vector value (0 to 15) is written to a LVT entry and
the delivery mode is Fixed (bits 8-11 equal 0), the APIC may signal an
illegal vector error, without regard to whether the mask bit is set or
whether an interrupt is actually seen on the input."

And that's exactly what's currently done in disconnect_bsp_APIC when
virt_wire_setup is true and LVT LINT0 is being masked. By writing only
APIC_LVT_MASKED Xen is actually setting the vector to 0 and the
delivery mode to Fixed (0), and hence it triggers an APIC error even
when the LVT entry is masked.
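
The condition can be made concrete with a small decoder. The DEMO_* constants
are assumptions mirroring the usual LVT field layout (mask bit 16, delivery
mode starting at bit 8, vector in bits 0-7), and demo_lvt_illegal() is a
hypothetical helper, not Xen code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field layout, mirroring the usual LVT definitions. */
#define DEMO_LVT_MASKED      (1u << 16)
#define DEMO_LVT_MODE_MASK   0x700u      /* delivery mode field */
#define DEMO_LVT_VECTOR_MASK 0xffu

/* True if the value pairs Fixed delivery with an illegal vector (0 to 15). */
static bool demo_lvt_illegal(uint32_t val)
{
    return (val & DEMO_LVT_MODE_MASK) == 0 &&
           (val & DEMO_LVT_VECTOR_MASK) < 16;
}
```

A write of just the mask bit decodes as vector 0 with Fixed delivery, so it
falls into the illegal case even though the entry is masked.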

This would usually manifest when Xen is being shut down, as that's
where disconnect_bsp_APIC is called:

(XEN) APIC error on CPU0: 40(00)

Fix this by calling clear_local_APIC, which already clears LVT LINT0,
prior to setting the LVT LINT registers, and hence the troublesome
write can be avoided as the register is already cleared.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 782b48b7f7319c07b044606d67a60875e53dd05b
master date: 2020-01-29 14:47:00 +0100

5 years agoVT-d: don't pass bridge devices to domain_context_mapping_one()
Jan Beulich [Thu, 5 Mar 2020 10:16:46 +0000 (11:16 +0100)]
VT-d: don't pass bridge devices to domain_context_mapping_one()

When passed a non-NULL pdev, the function does an owner check when it
finds an already existing context mapping. Bridges, however, don't get
passed through to guests, and hence their owner is always going to be
Dom0, leading to the assignment of all but one of the functions of multi-
function PCI devices behind bridges to fail.

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a4d457fd59f4ebfb524aec82cb6a3030087914ca
master date: 2020-01-22 16:39:58 +0100

5 years agox86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD
Igor Druzhinin [Thu, 5 Mar 2020 10:16:11 +0000 (11:16 +0100)]
x86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD

Due to AMD and Hygon being unable to selectively trap CR4 bit modifications,
running a 32-bit PV guest inside PV shim comes with a significant performance
hit. Moreover, for SMEP in particular every time CR4.SMEP changes on context
switch to/from 32-bit PV guest, it gets trapped by L0 Xen which then
tries to perform global TLB invalidation for PV shim domain. This usually
results in eventual hang of a PV shim with at least several vCPUs.

Since the overall security risk is generally lower for shim Xen, as it is
there more as a defense-in-depth mechanism, choose to disable SMEP/SMAP in
it by default on AMD and Hygon unless a user chose otherwise.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b05ec9263e56ef0784da766e829cfe08569d1d88
master date: 2020-01-17 16:18:20 +0100

5 years agox86/time: update TSC stamp on restore from deep C-state
Igor Druzhinin [Thu, 5 Mar 2020 10:14:38 +0000 (11:14 +0100)]
x86/time: update TSC stamp on restore from deep C-state

If ITSC is not available on CPU (e.g. if running nested as PV shim)
then X86_FEATURE_NONSTOP_TSC is not advertised in certain cases, i.e.
all AMD and some old Intel processors. In which case TSC would need to
be restored on CPU from platform time by Xen upon exiting C-states.

As platform time might be behind the last TSC stamp recorded for the
current CPU, invariant of TSC stamp being always behind local TSC counter
is violated. This has an effect of get_s_time() going negative resulting
in eventual system hang or crash.

Fix this issue by updating local TSC stamp along with TSC counter write.
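
The invariant can be modelled in a few lines. Everything here is a
simplified assumption: struct demo_cpu_time, demo_s_time() and
demo_restore_tsc() are hypothetical, and a 1 GHz TSC is assumed so that no
scaling is needed.

```c
#include <stdint.h>

struct demo_cpu_time {
    uint64_t stamp_tsc;     /* TSC value at the last stamp */
    int64_t  stamp_ns;      /* system time at the last stamp */
};

/* Assumes a 1 GHz TSC so that one tick equals one nanosecond. */
static int64_t demo_s_time(const struct demo_cpu_time *t, uint64_t tsc_now)
{
    return t->stamp_ns + (int64_t)(tsc_now - t->stamp_tsc);
}

/* Writing the TSC and refreshing the stamp together keeps stamp <= counter. */
static void demo_restore_tsc(struct demo_cpu_time *t, uint64_t new_tsc,
                             int64_t now_ns)
{
    t->stamp_tsc = new_tsc;
    t->stamp_ns = now_ns;
}
```

If the counter is rewound to a value below the old stamp without refreshing
the stamp, the delta becomes negative and the computed time can drop below
zero; refreshing both together avoids that.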

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: bbf283f853f8c0e4d29248dd44d3b0e0abc07629
master date: 2020-01-17 16:11:20 +0100

5 years agoIRQ: u16 is too narrow for an event channel number
Jan Beulich [Thu, 5 Mar 2020 10:13:55 +0000 (11:13 +0100)]
IRQ: u16 is too narrow for an event channel number

FIFO event channels allow ports up to 2^17, so we need to use a wider
field in struct pirq. Move "masked" such that it may share the 8-byte
slot with struct arch_pirq on 64-bit arches, rather than leaving a
7-byte hole in all cases.
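
The truncation is easy to demonstrate in isolation; demo_roundtrip_u16() is
a hypothetical helper written only for this illustration.

```c
#include <stdint.h>

/* Store a port number in a 16-bit field and read it back. */
static uint32_t demo_roundtrip_u16(uint32_t port)
{
    uint16_t narrow = (uint16_t)port;   /* high bits are dropped */
    return narrow;
}
```

Any port at or above 2^16 comes back as a different, smaller number, so two
distinct ports can alias once stored in the narrow field.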

Take the opportunity and also add a comment regarding "arch" placement
within the structure.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Arm: fix build after 892b9dcebdb7

"IRQ: u16 is too narrow for an event channel number" introduced a use of
evtchn_port_t, but its typedef apparently surfaces indirectly here only
on x86.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 892b9dcebdb7f646657e11cfdd95a385107bbefa
master date: 2020-01-14 12:03:47 +0100
master commit: b4194711ffaffa5e63d986338fb8d4020fa6bad1
master date: 2020-01-14 16:06:27 +0100

5 years agox86: clear per cpu stub page information in cpu_smpboot_free()
Juergen Gross [Thu, 5 Mar 2020 10:13:21 +0000 (11:13 +0100)]
x86: clear per cpu stub page information in cpu_smpboot_free()

cpu_smpboot_free() removes the stubs for the cpu going offline, but it
isn't clearing the related percpu variables. This will result in
crashes when a stub page is released due to all related cpus gone
offline and one of those cpus going online later.

Fix that by clearing stubs.addr and stubs.mfn in order to allocate a
new stub page when needed, irrespective of whether the CPU gets parked
or removed.

Fixes: 2e6c8f182c9c50 ("x86: distinguish CPU offlining from CPU removal")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Tao Xu <tao3.xu@intel.com>
master commit: 774901788c5614798931a1cb2e20dd8b885f97ab
master date: 2020-01-09 11:07:38 +0100

5 years agoupdate Xen version to 4.12.3-pre
Jan Beulich [Thu, 5 Mar 2020 10:11:54 +0000 (11:11 +0100)]
update Xen version to 4.12.3-pre

5 years agoxen/arm: Place a speculation barrier sequence following an eret instruction
Julien Grall [Thu, 19 Dec 2019 08:12:21 +0000 (08:12 +0000)]
xen/arm: Place a speculation barrier sequence following an eret instruction

Some CPUs can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.

Newer CPUs may implement a new SB barrier instruction which acts
as an architected speculation barrier. For current CPUs, the sequence
DSB; ISB is known to prevent speculation.

The latter sequence is heavier than SB but it would never be executed
(this is speculation after all!).

Introduce a new macro 'sb' that could be used when a speculation barrier
is required. For now it is using dsb; isb but this could easily be
updated to cater for SB in the future.

This is XSA-312.

Signed-off-by: Julien Grall <julien@xen.org>
5 years agoupdate Xen version to 4.12.2 RELEASE-4.12.2
Jan Beulich [Fri, 20 Dec 2019 10:45:51 +0000 (11:45 +0100)]
update Xen version to 4.12.2

5 years agolz4: fix system halt at boot kernel on x86_64
Krzysztof Kolasa [Wed, 11 Dec 2019 14:15:52 +0000 (15:15 +0100)]
lz4: fix system halt at boot kernel on x86_64

Sometimes, on x86_64, decompression fails with the following
error:

Decompressing Linux...

Decoding failed

 -- System halted

This condition is not needed for a 64bit kernel (from commit d5e7caf):

if( ... ||
    (op + COPYLENGTH) > oend)
    goto _output_error

macro LZ4_SECURE_COPY() tests op and does not copy any data
when op exceeds the value.

added by analogy to lz4_uncompress_unknownoutputsize(...)

Signed-off-by: Krzysztof Kolasa <kkolasa@winsoft.pl>
[Linux commit 99b7e93c95c78952724a9783de6c78def8fbfc3f]

The offending commit in our case is fcc17f96c277 ("LZ4 : fix the data
abort issue").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5d90ff79542ab9c6eebe5c315c68c196bcf353b9
master date: 2019-12-09 14:02:35 +0100

5 years agolz4: refine commit 9143a6c55ef7 for the 64-bit case
Jan Beulich [Wed, 11 Dec 2019 14:15:11 +0000 (15:15 +0100)]
lz4: refine commit 9143a6c55ef7 for the 64-bit case

I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to

cpy = op + length - (STEPSIZE - 4);

where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.
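
The arithmetic can be checked with offsets instead of pointers.
demo_cpy_offset() is a hypothetical helper; only the formula and the
STEPSIZE value of 8 come from the text above.

```c
#include <stddef.h>

#define DEMO_STEPSIZE 8     /* 64-bit step size, as stated above */

/* Offset of "cpy" relative to the start of the output buffer. */
static ptrdiff_t demo_cpy_offset(ptrdiff_t op, ptrdiff_t length)
{
    return op + length - (DEMO_STEPSIZE - 4);
}
```

With length 0 the result sits 4 bytes below "op", so a plain "cpy < op" test
would reject valid input in the 64-bit case.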

Reported-by: Mark Pryor <pryorm09@gmail.com>
Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Mark Pryor <pryorm09@gmail.com>
Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2d7572cdfa4d481c1ca246aa1ce5239ccae7eb59
master date: 2019-12-09 14:01:25 +0100

5 years agoAMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables
Andrew Cooper [Wed, 11 Dec 2019 14:14:16 +0000 (15:14 +0100)]
AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables

update_paging_mode() has multiple bugs:

 1) Booting with iommu=debug will cause it to inform you that it was called
    without the pdev_list lock held.
 2) When growing by more than a single level, it leaks the newly allocated
    table(s) in the case of a further error.

Furthermore, the choice of default level for a domain has issues:

 1) All HVM guests grow from 2 to 3 levels during construction because of the
    position of the VRAM just below the 4G boundary, so defaulting to 2 is a
    waste of effort.
 2) The limit for PV guests doesn't take memory hotplug into account, and
    isn't dynamic at runtime like HVM guests.  This means that a PV guest may
    get RAM which it can't map in the IOMMU.

The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement.  Remove
the complexity by removing the dynamic height.

PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.

HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.
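
The 512G and 256TB figures follow from the table geometry. A small sketch,
assuming 4k pages and 9 address bits per level (demo_reach() is a
hypothetical helper):

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12     /* 4k pages */
#define DEMO_LEVEL_BITS   9     /* 512 entries per table, assumed */

/* Bytes of address space reachable with the given number of levels. */
static uint64_t demo_reach(unsigned int levels)
{
    return UINT64_C(1) << (DEMO_PAGE_SHIFT + DEMO_LEVEL_BITS * levels);
}
```

Under these assumptions two levels reach 1G, three reach 512G and four reach
256T, which matches both the HVM growth from 2 to 3 levels for VRAM just
below 4G and the limits quoted above.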

The overhead of this extra level is not expected to be noticeable.  It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.

This is XSA-311.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100