xenbits.xensource.com Git - xen.git/log
4 years ago libx86: introduce a helper to deserialise msr_policy objects
Sergey Dyasli [Fri, 11 Sep 2020 12:56:34 +0000 (14:56 +0200)]
libx86: introduce a helper to deserialise msr_policy objects

As with the serialise side, Xen's copy_from_guest API is used, with a
compatibility wrapper for the userspace build.

Signed-off-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: afec08b92ffe8b85d2bf2e8c7c221b63ba96743e
master date: 2019-03-12 14:12:23 +0000

4 years ago x86/hvm: set 'ipat' in EPT for special pages
Paul Durrant [Fri, 7 Aug 2020 15:33:22 +0000 (17:33 +0200)]
x86/hvm: set 'ipat' in EPT for special pages

All non-MMIO ranges (i.e. those not mapping real device MMIO regions) that
map valid MFNs are normally marked MTRR_TYPE_WRBACK and 'ipat' is set. Hence
when PV drivers running in a guest populate the BAR space of the Xen Platform
PCI Device with pages such as the Shared Info page or Grant Table pages,
accesses to these pages will be cacheable.

However, should IOMMU mappings be enabled for the guest then these
accesses become uncacheable. This has a substantial negative effect on I/O
throughput of PV devices. Arguably PV drivers should not be using BAR space to
host the Shared Info and Grant Table pages but it is currently commonplace for
them to do this and so this problem needs mitigation. Hence this patch makes
sure the 'ipat' bit is set for any special page regardless of where in GFN
space it is mapped.

NOTE: Clearly this mitigation only applies to Intel EPT. It is not obvious
      that there is any similar mitigation possible for AMD NPT. Downstreams
      such as Citrix XenServer have been carrying a patch similar to this for
      several releases though.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ca24b2ffdbd9a25b2d313a547ccbe97baf3e5a8d
master date: 2020-07-31 17:42:47 +0200

4 years ago x86emul: replace UB shifts
Jan Beulich [Fri, 7 Aug 2020 15:32:42 +0000 (17:32 +0200)]
x86emul: replace UB shifts

Displacement values can be negative, hence we shouldn't left-shift them.
Or else we get

(XEN) UBSAN: Undefined behaviour in x86_emulate/x86_emulate.c:3482:55
(XEN) left shift of negative value -2

While auditing shifts, I noticed a pair of missing parentheses, which
also get added right here.
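
As an illustration of the class of fix described above, a minimal sketch
follows. The helper name scale_disp() is hypothetical and this is not the
emulator's actual code; it only shows one common way to scale a possibly
negative displacement without left-shifting it, assuming the shift count
stays well below the width of the type.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the Xen emulator code): scale a signed
 * displacement by 2^shift. Left-shifting a negative value is undefined
 * behaviour in C, so the shift is applied to a positive constant and the
 * signed value is multiplied instead.
 */
static int64_t scale_disp(int32_t disp, unsigned int shift)
{
    /* Assumption: shift is small (well below 63), as for operand scaling. */
    return (int64_t)disp * ((int64_t)1 << shift);
}
```

With this form a displacement of -2 scaled by 8 yields -16 without tripping
UBSAN's "left shift of negative value" check.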

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>

x86emul: replace further UB shifts

I have no explanation how I managed to overlook these while putting
together what is now b6a907f8c83d ("x86emul: replace UB shifts").

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b6a907f8c83d37886d0523f1aeff61b98e133498
master date: 2020-07-31 17:41:58 +0200
master commit: 21de9680eb594a7038d4d4ed78e53ac90a8c5a6e
master date: 2020-08-05 10:19:29 +0200

4 years ago x86/cpuid: Fix APIC bit clearing
Fam Zheng [Fri, 7 Aug 2020 15:32:02 +0000 (17:32 +0200)]
x86/cpuid: Fix APIC bit clearing

The bug is obvious here; other places in this function used
"cpufeat_mask" correctly.

Fixed: b648feff8ea2 ("xen/x86: Improvements to in-hypervisor cpuid sanity checks")
Signed-off-by: Fam Zheng <famzheng@amazon.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 64219fa179c3e48adad12bfce3f6b3f1596cccbf
master date: 2020-07-29 19:03:41 +0100
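
The message above does not spell the bug out. Assuming it was a bit number
used where a single-bit mask was needed, the difference can be sketched as
below. The macro and function names are hypothetical stand-ins, not Xen's
cpufeat_* helpers; the only hard fact used is that APIC is bit 9 of
CPUID leaf 1 EDX.

```c
#include <stdint.h>

/* Hypothetical helpers modelling a bit-position vs. mask distinction. */
#define DEMO_FEAT_APIC     9u                               /* CPUID.1:EDX bit 9 */
#define demo_feat_bit(f)   ((f) % 32u)                      /* bit position only */
#define demo_feat_mask(f)  (UINT32_C(1) << demo_feat_bit(f)) /* single-bit mask */

/* Wrong: ~9 clears bits 0 and 3 and leaves the APIC bit set. */
static uint32_t demo_clear_apic_wrong(uint32_t edx)
{
    return edx & ~demo_feat_bit(DEMO_FEAT_APIC);
}

/* Right: ~(1 << 9) clears exactly the APIC bit. */
static uint32_t demo_clear_apic_right(uint32_t edx)
{
    return edx & ~demo_feat_mask(DEMO_FEAT_APIC);
}
```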

4 years ago x86/S3: put data segment registers into known state upon resume
Jan Beulich [Fri, 7 Aug 2020 15:31:16 +0000 (17:31 +0200)]
x86/S3: put data segment registers into known state upon resume

wakeup_32 sets %ds and %es to BOOT_DS, while leaving %fs at what
wakeup_start did set it to, and %gs at whatever BIOS did load into it.
All of this may end up confusing the first load_segments() to run on
the BSP after resume, in particular allowing a non-null selector value
to be left in %fs.

Alongside %ss, also put all other data segment registers into the same
state that the boot and CPU bringup paths put them in.

Reported-by: M. Vefa Bicakci <m.v.b@runbox.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 55f8c389d4348cc517946fdcb10794112458e81e
master date: 2020-07-24 10:17:26 +0200

4 years ago x86/spec-ctrl: Protect against CALL/JMP straight-line speculation
Andrew Cooper [Fri, 7 Aug 2020 15:30:35 +0000 (17:30 +0200)]
x86/spec-ctrl: Protect against CALL/JMP straight-line speculation

Some x86 CPUs speculatively execute beyond indirect CALL/JMP instructions.

With CONFIG_INDIRECT_THUNK / Retpolines, indirect CALL/JMP instructions are
converted to direct CALL/JMP's to __x86_indirect_thunk_REG(), leaving just a
handful of indirect JMPs implementing those stubs.

There is no architectural execution beyond an indirect JMP, so use INT3 as
recommended by vendors to halt speculative execution.  This is shorter than
LFENCE (which would also work fine), but also shows up in logs if we do
unexpectedly execute them.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3b7dab93f2401b08c673244c9ae0f92e08bd03ba
master date: 2020-07-01 17:01:24 +0100

4 years ago mm: fix public declaration of struct xen_mem_acquire_resource
Roger Pau Monné [Fri, 7 Aug 2020 15:29:41 +0000 (17:29 +0200)]
mm: fix public declaration of struct xen_mem_acquire_resource

XENMEM_acquire_resource and its related structure are currently inside
a __XEN__ or __XEN_TOOLS__ guarded section to limit their scope to the
hypervisor or the toolstack only. This is wrong as the hypercall is
already being used by the Linux kernel at least, and as such needs to
be public.

Also switch the usage of uint64_aligned_t to plain uint64_t, as
uint64_aligned_t is only to be used by the toolstack. Doing such
change will reduce the size of the structure on 32-bit x86 by 4 bytes,
since there will be no padding added after the frame_list handle.

This is fine, as users of the previous layout will allocate 4 bytes of
padding that won't be read by Xen, and users of the new layout won't
allocate those, which is also fine since Xen won't try to access them.

Note that the structure already has compat handling, and such handling
will take care of copying the right size (i.e. minus the padding) when
called from a 32-bit x86 context. This is true for the compat code both
before and after this patch, since the structures in the memory.h
compat header are subject to a pragma pack(4), which already removed
the trailing padding that would otherwise be introduced by the
alignment of the frame field to 8 bytes.
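
The padding effect described above can be sketched with two toy structures.
These are hypothetical stand-ins, not the public Xen header: a 32-bit field
plays the role of a 32-bit guest handle, the typedef mimics an 8-byte aligned
64-bit type, and pragma pack(4) mimics the compat layout. The aligned
attribute and pragma pack are GCC/Clang extensions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an 8-byte aligned 64-bit type (GNU attribute syntax). */
typedef uint64_t __attribute__((aligned(8))) demo_u64_aligned_t;

/* Layout with the aligned type: the structure is padded to a multiple of 8. */
struct demo_tail_aligned {
    demo_u64_aligned_t frame;
    uint32_t handle;          /* stands in for a 32-bit guest handle */
};                            /* 4 bytes of trailing padding follow */

/* Layout under pack(4): no trailing padding is added. */
#pragma pack(push, 4)
struct demo_tail_packed4 {
    uint64_t frame;
    uint32_t handle;
};
#pragma pack(pop)
```

The two layouts agree on every field offset and differ only in the trailing
4 bytes, which is why neither side needs to read or write them.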

Fixes: 3f8f12281dd20 ('x86/mm: add HYPERVISOR_memory_op to acquire guest resources')
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0e2e54966af556f4047c1048855c4a071028a32d
master date: 2020-06-29 18:03:49 +0200

4 years ago x86/msr: Disallow access to Processor Trace MSRs
Andrew Cooper [Fri, 7 Aug 2020 15:28:48 +0000 (17:28 +0200)]
x86/msr: Disallow access to Processor Trace MSRs

We do not expose the feature to guests, so should disallow access to the
respective MSRs.  For simplicity, drop the entire block of MSRs, not just the
subset which have been specified thus far.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bcdfbb70fca579baa04f212c0936b77919bdae11
master date: 2020-06-26 16:34:02 +0100

4 years ago x86/acpi: use FADT flags to determine the PMTMR width
Grzegorz Uriasz [Fri, 7 Aug 2020 15:28:06 +0000 (17:28 +0200)]
x86/acpi: use FADT flags to determine the PMTMR width

On some computers the bit width of the PM Timer as reported
by ACPI is 32 bits when in fact the FADT flags report correctly
that the timer is 24 bits wide. On affected machines such as the
ASUS FX504GM and newer gaming laptops this results in the inability
to resume the machine from suspend. Without this patch suspend is
broken on affected machines and even if a machine manages to resume
correctly then the kernel time and xen timers are trashed.
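
A minimal sketch of deriving the counter width from the FADT flags follows.
The function names are hypothetical and this is not the Xen implementation;
the one fact relied on is that bit 8 of the FADT Flags field (TMR_VAL_EXT in
the ACPI specification) indicates a 32-bit PM timer, with 24 bits otherwise.

```c
#include <stdint.h>

/* FADT Flags bit 8 (TMR_VAL_EXT): set when the PM timer is 32 bits wide. */
#define DEMO_FADT_TMR_VAL_EXT (UINT32_C(1) << 8)

/* Hypothetical helper: choose the counter width from the FADT flags. */
static unsigned int demo_pmtmr_width(uint32_t fadt_flags)
{
    return (fadt_flags & DEMO_FADT_TMR_VAL_EXT) ? 32 : 24;
}

/* Ticks elapsed between two reads, tolerating one wrap at the given width. */
static uint32_t demo_pmtmr_delta(uint32_t then, uint32_t now, unsigned int width)
{
    uint32_t mask = (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);

    return (now - then) & mask;
}
```

Treating a 24-bit counter as 32 bits wide turns a small elapsed interval
across a wrap into a huge one, which is consistent with the trashed timers
the message describes.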

Signed-off-by: Grzegorz Uriasz <gorbak25@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f325d2477eef8229c47d97031d314629521c70ab
master date: 2020-06-25 09:11:09 +0200

4 years ago x86/vmx: use P2M_ALLOC in vmx_load_pdptrs instead of P2M_UNSHARE
Tamas K Lengyel [Fri, 7 Aug 2020 15:27:11 +0000 (17:27 +0200)]
x86/vmx: use P2M_ALLOC in vmx_load_pdptrs instead of P2M_UNSHARE

While forking VMs running a small RTOS system (Zephyr) a Xen crash has been
observed due to a mm-lock order violation while copying the HVM CPU context
from the parent. This issue has been identified to be due to
hap_update_paging_modes first getting a lock on the gfn using get_gfn. This
call also creates a shared entry in the fork's memory map for the cr3 gfn. The
function later calls hap_update_cr3 while holding the paging_lock, which
results in the lock-order violation in vmx_load_pdptrs when it tries to unshare
the above entry when it grabs the page with the P2M_UNSHARE flag set.

Since vmx_load_pdptrs only reads from the page its usage of P2M_UNSHARE was
unnecessary to start with. Using P2M_ALLOC is the appropriate flag to ensure
the p2m is properly populated.

Note that the lock order violation is avoided because before the paging_lock is
taken a lookup is performed with P2M_ALLOC that forks the page, thus the second
lookup in vmx_load_pdptrs succeeds without having to perform the fork. We keep
P2M_ALLOC in vmx_load_pdptrs because there are code-paths leading up to it
which don't take the paging_lock and that have no previous lookup. Currently no
other code-path exists leading there with the paging_lock taken, thus no
further adjustments are necessary.

Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: b67e859d0823f5b450e29379af9142d44a3ea370
master date: 2020-06-19 15:24:55 +0200

4 years ago xen: Check the alignment of the offset passed via VCPUOP_register_vcpu_info
Julien Grall [Tue, 7 Jul 2020 13:13:40 +0000 (15:13 +0200)]
xen: Check the alignment of the offset passed via VCPUOP_register_vcpu_info

Currently a guest is able to register any guest physical address to use
for the vcpu_info structure as long as the structure can fit in the
rest of the frame.

This means a guest can provide an address that is not aligned to the
natural alignment of the structure.

On Arm 32-bit, unaligned accesses are completely forbidden by the
hypervisor. This will result in a data abort which is fatal.

On Arm 64-bit, unaligned accesses are only forbidden when used for atomic
access. As the structure contains fields (such as evtchn_pending_self)
that are updated using atomic operations, any unaligned access will be
fatal as well.

While the misalignment is only fatal on Arm, a generic check is added
as an x86 guest shouldn't sensibly pass an unaligned address (this
would result in a split lock).
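
The kind of generic check described above can be sketched as follows. The
structure, the page-size constant and the function name are hypothetical
stand-ins rather than the hypervisor's definitions; the sketch only shows a
"fits in the frame" test combined with a natural-alignment test.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical stand-in for a guest-visible per-vCPU structure. */
struct demo_vcpu_info {
    uint64_t pending_sel;   /* imagine this being updated atomically */
    uint8_t  pad[56];
};

/* Accept an in-frame offset only if it fits and is naturally aligned. */
static bool demo_offset_ok(size_t offset)
{
    /* The structure must fit in the remainder of the frame ... */
    if ( offset > DEMO_PAGE_SIZE - sizeof(struct demo_vcpu_info) )
        return false;

    /* ... and must start at a multiple of its natural alignment. */
    return (offset % _Alignof(struct demo_vcpu_info)) == 0;
}
```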

This is XSA-327.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 3fdc211b01b29f252166937238efe02d15cb5780
master date: 2020-07-07 14:41:00 +0200

4 years ago x86/ept: flush cache when modifying PTEs and sharing page tables
Roger Pau Monné [Tue, 7 Jul 2020 13:13:07 +0000 (15:13 +0200)]
x86/ept: flush cache when modifying PTEs and sharing page tables

Modifications made to the page tables by EPT code need to be written
to memory when the page tables are shared with the IOMMU, as Intel
IOMMUs can be non-coherent and thus require changes to be written to
memory in order to be visible to the IOMMU.

In order to achieve this make sure data is written back to memory
after writing an EPT entry when the recalc bit is not set in
atomic_write_ept_entry. If such bit is set, the entry will be
adjusted and atomic_write_ept_entry will be called a second time
without the recalc bit set. Note that when splitting a super page the
new tables resulting from the split should also be written back.

Failure to do so can allow devices behind the IOMMU access to the
stale super page, or cause coherency issues as changes made by the
processor to the page tables are not visible to the IOMMU.

This allows removing the VT-d specific iommu_pte_flush helper, since
the cache write back is now performed by atomic_write_ept_entry, and
hence iommu_iotlb_flush can be used to flush the IOMMU TLB. The newly
used method (iommu_iotlb_flush) can result in fewer flushes, since it
might sometimes be called rightly with 0 flags, in which case it
becomes a no-op.

This is part of XSA-321.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c23274fd0412381bd75068ebc9f8f8c90a4be748
master date: 2020-07-07 14:40:11 +0200

4 years ago vtd: optimize CPU cache sync
Roger Pau Monné [Tue, 7 Jul 2020 13:12:46 +0000 (15:12 +0200)]
vtd: optimize CPU cache sync

Some VT-d IOMMUs are non-coherent, which requires a cache write back
in order for the changes made by the CPU to be visible to the IOMMU.
This cache write back was unconditionally done using clflush, but there are
other more efficient instructions to do so, hence implement support
for them using the alternative framework.

This is part of XSA-321.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a64ea16522a73a13a0d66cfa4b66a9d3b95dd9d6
master date: 2020-07-07 14:39:54 +0200

4 years ago x86/alternative: introduce alternative_2
Roger Pau Monné [Tue, 7 Jul 2020 13:12:16 +0000 (15:12 +0200)]
x86/alternative: introduce alternative_2

It's based on alternative_io_2 without inputs or outputs but with an
added memory clobber.

This is part of XSA-321.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 23570bce00ee6ba2139ece978ab6f03ff166e21d
master date: 2020-07-07 14:39:25 +0200

4 years ago vtd: don't assume addresses are aligned in sync_cache
Roger Pau Monné [Tue, 7 Jul 2020 13:11:55 +0000 (15:11 +0200)]
vtd: don't assume addresses are aligned in sync_cache

Current code in sync_cache assumes that the address passed in is
aligned to a cache line size. Fix the code to support passing in
arbitrary addresses not necessarily aligned to a cache line size.
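
To illustrate why alignment matters here, the sketch below counts the cache
lines covering an arbitrary byte range. The function name is hypothetical and
the line size is assumed to be a power of two; it is not the VT-d code, only
the rounding arithmetic such a loop needs.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of cache lines covering [addr, addr + size).
 * Assumes 'line' is a power of two.
 */
static size_t demo_lines_to_sync(uintptr_t addr, size_t size, size_t line)
{
    uintptr_t start, end;

    if ( size == 0 )
        return 0;

    start = addr & ~(uintptr_t)(line - 1);   /* round down to a line start */
    end = addr + size;                        /* one past the last byte */

    return (end - start + line - 1) / line;  /* round up to whole lines */
}
```

A range of 8 bytes starting at offset 60 of a 64-byte line touches two
lines; code that assumes an aligned start would write back only the first.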

This is part of XSA-321.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b6d9398144f21718d25daaf8d72669a75592abc5
master date: 2020-07-07 14:39:05 +0200

4 years ago x86/iommu: introduce a cache sync hook
Roger Pau Monné [Tue, 7 Jul 2020 13:11:18 +0000 (15:11 +0200)]
x86/iommu: introduce a cache sync hook

The hook is only implemented for VT-d and it uses the already existing
iommu_sync_cache function present in VT-d code. The new hook is
added so that the cache can be flushed by code outside of VT-d when
using shared page tables.

Note that alloc_pgtable_maddr must use the now locally defined
sync_cache function, because IOMMU ops are not yet set up the first
time the function gets called during IOMMU initialization.

No functional change intended.

This is part of XSA-321.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 91526b460e5009fc56edbd6809e66c327281faba
master date: 2020-07-07 14:38:34 +0200

4 years ago vtd: prune (and rename) cache flush functions
Roger Pau Monné [Tue, 7 Jul 2020 13:10:57 +0000 (15:10 +0200)]
vtd: prune (and rename) cache flush functions

Rename __iommu_flush_cache to iommu_sync_cache and remove
iommu_flush_cache_page. Also remove the iommu_flush_cache_entry
wrapper and just use iommu_sync_cache instead. Note the _entry suffix
was meaningless as the wrapper was already taking a size parameter in
bytes. While there also constify the addr parameter.

No functional change intended.

This is part of XSA-321.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62298825b9a44f45761acbd758138b5ba059ebd1
master date: 2020-07-07 14:38:13 +0200

4 years ago vtd: improve IOMMU TLB flush
Jan Beulich [Tue, 7 Jul 2020 13:10:34 +0000 (15:10 +0200)]
vtd: improve IOMMU TLB flush

Do not limit PSI flushes to order 0 pages, in order to avoid doing a
full TLB flush if the passed in page has an order greater than 0 and
is aligned. Should increase the performance of IOMMU TLB flushes when
dealing with page orders greater than 0.
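
The alignment condition mentioned above can be sketched as a simple
predicate. The name demo_can_flush_psi() is hypothetical, and the real VT-d
code may apply further conditions not shown here; the sketch only checks that
a frame number is aligned to 2^order, assuming order is below 64.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: a page-selective flush of 2^order pages is only
 * considered when the starting frame number is aligned to that size.
 */
static bool demo_can_flush_psi(uint64_t dfn, unsigned int order)
{
    uint64_t mask = (UINT64_C(1) << order) - 1;   /* low bits that must be 0 */

    return (dfn & mask) == 0;
}
```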

This is part of XSA-321.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5fe515a0fede07543f2a3b049167b1fd8b873caf
master date: 2020-07-07 14:37:46 +0200

4 years ago x86/ept: atomically modify entries in ept_next_level
Roger Pau Monné [Tue, 7 Jul 2020 13:10:14 +0000 (15:10 +0200)]
x86/ept: atomically modify entries in ept_next_level

ept_next_level was passing a live PTE pointer to ept_set_middle_entry,
which was then modified without taking into account that the PTE could
be part of a live EPT table. This wasn't a security issue because the
pages returned by p2m_alloc_ptp are zeroed, so adding such an entry
before actually initializing it didn't allow a guest to access
physical memory addresses it wasn't supposed to access.

This is part of XSA-328.

Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bc3d9f95d661372b059a5539ae6cb1e79435bb95
master date: 2020-07-07 14:37:12 +0200

4 years ago x86/EPT: ept_set_middle_entry() related adjustments
Jan Beulich [Tue, 7 Jul 2020 13:09:50 +0000 (15:09 +0200)]
x86/EPT: ept_set_middle_entry() related adjustments

ept_split_super_page() wants to further modify the newly allocated
table, so have ept_set_middle_entry() return the mapped pointer rather
than tearing it down only for it to be re-established right away.

Similarly ept_next_level() wants to hand back a mapped pointer of
the next level page, so re-use the one established by
ept_set_middle_entry() in case that path was taken.

Pull the setting of suppress_ve ahead of insertion into the higher level
table, and don't have ept_split_super_page() set the field a 2nd time.

This is part of XSA-328.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1104288186ee73a7f9bfa41cbaa5bb7611521028
master date: 2020-07-07 14:36:52 +0200

4 years ago x86/shadow: correct an inverted conditional in dirty VRAM tracking
Jan Beulich [Tue, 7 Jul 2020 13:09:25 +0000 (15:09 +0200)]
x86/shadow: correct an inverted conditional in dirty VRAM tracking

This originally was "mfn_x(mfn) == INVALID_MFN". Make it like this
again, taking the opportunity to also drop the unnecessary nearby
braces.

This is XSA-319.

Fixes: 246a5a3377c2 ("xen: Use a typesafe to define INVALID_MFN")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23a216f99d40fbfbc2318ade89d8213eea6ba1f8
master date: 2020-07-07 14:36:24 +0200

4 years ago xen/common: event_channel: Don't ignore error in get_free_port()
Julien Grall [Tue, 7 Jul 2020 13:08:59 +0000 (15:08 +0200)]
xen/common: event_channel: Don't ignore error in get_free_port()

Currently, get_free_port() is assuming that the port has been allocated
when evtchn_allocate_port() does not return -EBUSY.

However, the function may return an error when:
    - We exhausted all the event channels. This can happen if the limit
    configured by the administrator for the guest ('max_event_channels'
    in xl cfg) is higher than the ABI used by the guest. For instance,
    if the guest is using 2L, the limit should not be higher than 4095.
    - We cannot allocate memory (e.g. Xen has no more memory).

Users of get_free_port() (such as EVTCHNOP_alloc_unbound) will validly
assume the port is valid and will next call evtchn_from_port(). This
will result in a crash as the memory backing the event channel structure
is not present.
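
The error-handling pattern at issue can be sketched with a toy allocator.
All names and the table of return codes below are hypothetical and exist
only to make the control flow testable; this is not the event channel code.
The point is that any result other than "busy" or success must be propagated
rather than treated as a successful allocation.

```c
#include <errno.h>

#define DEMO_MAX_PORTS 4

/* Hypothetical per-port results: 0 = allocated, -EBUSY = in use, else error. */
static int demo_rc[DEMO_MAX_PORTS] = { -EBUSY, -EBUSY, -ENOMEM, 0 };

static int demo_allocate_port(unsigned int port)
{
    return demo_rc[port];
}

/* Return a free port number, or a negative errno value on failure. */
static int demo_get_free_port(void)
{
    for ( unsigned int port = 0; port < DEMO_MAX_PORTS; port++ )
    {
        int rc = demo_allocate_port(port);

        if ( rc == 0 )
            return (int)port;   /* genuinely allocated */
        if ( rc != -EBUSY )
            return rc;          /* propagate; do not assume success */
    }

    return -ENOSPC;             /* every port was busy */
}
```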

Fixes: 368ae9a05fe ("xen/pvshim: forward evtchn ops between L0 Xen and L2 DomU")
Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2e9c2bc292231823a3a021d2e0a9f1956bf00b3c
master date: 2020-07-07 14:35:36 +0200

4 years ago libacpi: widen TPM detection
Jason Andryuk [Wed, 24 Jun 2020 15:17:38 +0000 (17:17 +0200)]
libacpi: widen TPM detection

The hardcoded tpm_signature is too restrictive to detect many TPMs.  For
instance, it doesn't accept a QEMU emulated TPM (VID 0x1014 DID 0x0001).
Make the TPM detection match that in rombios which accepts a wider
range.

With this change, the TPM's TCPA ACPI table is generated and the guest
OS can automatically load the tpm_tis driver.  It also allows seabios to
detect and use the TPM.  However, seabios skips some TPM initialization
when running under Xen, so it will not populate any PCRs unless modified
to run the initialization under Xen.

Signed-off-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3db7e043cddd7e939195e014241ce2c5d436179
master date: 2020-06-16 10:31:08 +0200

4 years ago ioreq: handle pending emulation racing with ioreq server destruction
Paul Durrant [Wed, 24 Jun 2020 15:17:05 +0000 (17:17 +0200)]
ioreq: handle pending emulation racing with ioreq server destruction

When an emulation request is initiated in hvm_send_ioreq() the guest vcpu is
blocked on an event channel until that request is completed. If, however,
the emulator is killed whilst that emulation is pending then the ioreq
server may be destroyed. Thus when the vcpu is awoken the code in
handle_hvm_io_completion() will find no pending request to wait for, but will
leave the internal vcpu io_req.state set to IOREQ_READY and the vcpu shutdown
deferral flag in place (because hvm_io_assist() will never be called). The
emulation request is then completed anyway. This means that any subsequent call
to hvmemul_do_io() will find an unexpected value in io_req.state and will
return X86EMUL_UNHANDLEABLE, which in some cases will result in continuous
re-tries.

This patch fixes the issue by moving the setting of io_req.state and clearing
of shutdown deferral (as well as MSI-X write completion) out of hvm_io_assist()
and directly into handle_hvm_io_completion().

Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f7039ee41b3d3448775a1623f230037fd0455104
master date: 2020-06-09 12:56:24 +0200

4 years ago x86/Intel: insert Ice Lake and Comet Lake model numbers
Jan Beulich [Wed, 24 Jun 2020 15:16:30 +0000 (17:16 +0200)]
x86/Intel: insert Ice Lake and Comet Lake model numbers

Both match prior generation processors as far as LBR and C-state MSRs
go (SDM rev 072) as well as applicability of the if_pschange_mc erratum
(recent spec updates).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1fe406685cb19e9544681c6243e7d376deb0297e
master date: 2020-06-09 12:55:53 +0200

4 years ago build: fix dependency tracking for preprocessed files
Jan Beulich [Wed, 24 Jun 2020 15:15:53 +0000 (17:15 +0200)]
build: fix dependency tracking for preprocessed files

While the issue is more general, I noticed asm-macros.i not getting
re-generated as needed. This was due to its .*.d file mentioning
asm-macros.o instead of asm-macros.i. Use -MQ here as well, and while at
it also use -MQ to avoid the somewhat fragile sed-ary on the *.lds
dependency tracking files. While there, further avoid open-coding $(CPP)
and drop the bogus (Arm) / stale (x86) -Ui386.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 75131ad75bb3c91717b5dfda6881e61c52bfd22e
master date: 2020-06-08 10:25:40 +0200

4 years ago x86/svm: do not try to handle recalc NPT faults immediately
Igor Druzhinin [Wed, 24 Jun 2020 15:15:23 +0000 (17:15 +0200)]
x86/svm: do not try to handle recalc NPT faults immediately

A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(); moreover, in the general case, if no
explicit handling is done there, the fault is wrongly considered fatal.

This covers a specific case of migration with a vGPU assigned which
uses direct MMIO mappings made by the XEN_DOMCTL_memory_mapping hypercall:
at the moment log-dirty is enabled globally, recalculation is requested
for the whole guest memory, including those mapped MMIO regions,
which causes a page fault to be raised at the first access to them;
but due to the MMIO P2M type not having any explicit handling in
hvm_hap_nested_page_fault(), the domain is erroneously crashed with an
unhandled SVM violation.

Instead of trying to be opportunistic, use a safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying the do_recalc return code with the Intel
implementation, where returning 1 means the P2M was actually changed.

Since there was previously no case where p2m_pt_handle_deferred_changes()
could return a positive value, it's safe to replace ">= 0" with just "== 0"
in the VMEXIT_NPF handler. finish_type_change() is also not affected by the
change, as it is able to deal with a >0 return value of p2m->recalc from
the EPT implementation.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 51ca66c37371b10b378513af126646de22eddb17
master date: 2020-06-05 17:12:11 +0200

4 years ago build32: don't discard .shstrtab in linker script
Roger Pau Monné [Wed, 24 Jun 2020 15:14:40 +0000 (17:14 +0200)]
build32: don't discard .shstrtab in linker script

LLVM linker doesn't support discarding .shstrtab, and complains with:

ld -melf_i386_fbsd -N -T build32.lds -o reloc.lnk reloc.o
ld: error: discarding .shstrtab section is not allowed

Add explicit .shstrtab, .strtab and .symtab sections to the linker
script after the text section in order to make LLVM LD happy and match
the behavior of GNU LD.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 10d27b48b5b4dfbead2d9bf03290984bba4806e4
master date: 2020-06-02 13:37:53 +0200

4 years ago x86/mm: do not attempt to convert _PAGE_GNTTAB to a boolean
Roger Pau Monné [Wed, 24 Jun 2020 15:14:11 +0000 (17:14 +0200)]
x86/mm: do not attempt to convert _PAGE_GNTTAB to a boolean

Clang 10 complains with:

mm.c:1239:10: error: converting the result of '<<' to a boolean always evaluates to true
      [-Werror,-Wtautological-constant-compare]
    if ( _PAGE_GNTTAB && (l1e_get_flags(l1e) & _PAGE_GNTTAB) &&
         ^
xen/include/asm/x86_64/page.h:161:25: note: expanded from macro '_PAGE_GNTTAB'
#define _PAGE_GNTTAB (1U<<22)
                        ^

Remove the conversion of _PAGE_GNTTAB to a boolean and instead use a
preprocessor conditional to check if _PAGE_GNTTAB is defined.
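
The shape of such a change can be sketched as below. DEMO_PAGE_GNTTAB and
the function name are hypothetical stand-ins, and the surrounding condition
in mm.c is not reproduced; the sketch only shows moving the "is this flag
available" test from a run-time boolean conversion to the preprocessor.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag; remove the definition to model a build without it. */
#define DEMO_PAGE_GNTTAB (UINT32_C(1) << 22)

/* True if the (hypothetical) grant-table flag is defined and set in 'flags'. */
static bool demo_is_gnttab_mapping(uint32_t flags)
{
#ifdef DEMO_PAGE_GNTTAB
    /* Flag exists in this build: test it directly, no constant-to-bool. */
    return (flags & DEMO_PAGE_GNTTAB) != 0;
#else
    /* Flag absent in this build: nothing can carry it. */
    (void)flags;
    return false;
#endif
}
```

Because the constant never appears as a bare operand of a logical operator,
the compiler has no tautological comparison to warn about.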

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6eb61b1a9dfe23ca443f977799cafb22770708a0
master date: 2020-06-02 13:36:41 +0200

4 years ago x86emul: rework CMP and TEST emulation
Jan Beulich [Wed, 24 Jun 2020 15:13:29 +0000 (17:13 +0200)]
x86emul: rework CMP and TEST emulation

Unlike similarly encoded insns these don't write their memory operands,
and hence x86_is_mem_write() should return false for them. However,
rather than adding special logic there, rework how their emulation gets
done, by making decoding attributes properly describe the r/o nature of
their memory operands:
-  change the table entries for opcodes 0x38 and 0x39, with no other
   adjustments to the attributes later on,
-  for the other opcodes, leave the table entries as they are, and
   override the attributes for the specific sub-cases (identified by
   ModRM.reg).

For opcodes 0x38 and 0x39 the change of the table entries implies
changing the order of operands as passed to emulate_2op_SrcV(), hence
the splitting of the cases in the main switch().

Note how this also allows dropping custom LOCK prefix checks.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 20bc1b9cc99b70b17757e1903f629c7a26584790
master date: 2020-05-29 17:28:45 +0200

4 years ago x86emul: address x86_insn_is_mem_{access,write}() omissions
Jan Beulich [Wed, 24 Jun 2020 15:12:53 +0000 (17:12 +0200)]
x86emul: address x86_insn_is_mem_{access,write}() omissions

First of all explain in comments what the functions' purposes are. Then
make them actually match their comments.

Note that fc6fa977be54 ("x86emul: extend x86_insn_is_mem_write()
coverage") didn't actually fix the function's behavior for {,V}STMXCSR:
Both are covered by generic code higher up in the function, due to
x86_decode_twobyte() already doing suitable adjustments. And VSTMXCSR
wouldn't have been covered anyway without a further X86EMUL_OPC_VEX()
case label. Keep the inner case label in a comment for reference.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e28d13eeb65c25c0bd56e8bfa83c7473047d778d
master date: 2020-05-29 17:28:04 +0200

4 years ago x86/hvm: Improve error information in handle_pio()
Andrew Cooper [Wed, 24 Jun 2020 15:12:20 +0000 (17:12 +0200)]
x86/hvm: Improve error information in handle_pio()

domain_crash() should always have a message which is emitted even in release
builds, so something more useful than this is presented to the user.

  (XEN) domain_crash called from io.c:171
  (XEN) domain_crash called from io.c:171
  (XEN) domain_crash called from io.c:171
  ...

To avoid possibly printing stack rubble, initialise data to ~0 right away.
Furthermore, the maximum access size is 4, so drop data from long to int.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 4609fc8eb04e6af531d86923c9d057f32a96b7d8
master date: 2020-05-29 16:25:05 +0100

4 years ago VT-x: extend LBR Broadwell errata coverage
Jan Beulich [Wed, 24 Jun 2020 15:11:44 +0000 (17:11 +0200)]
VT-x: extend LBR Broadwell errata coverage

For lbr_tsx_fixup_check() simply name a few more specific erratum
numbers.

For bdf93_fixup_check(), however, more models are affected. Oddly enough
despite being the same model and stepping, the erratum is listed for
Xeon E3 but not its Core counterpart. Apply the workaround uniformly,
and also for Xeon D, which only has the LBR-from one listed in its spec
update.

Seeing this broader applicability, rename anything BDF93-related to more
generic names.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 724913de8ac8426d313a4645741d86c1169ae406
master date: 2020-05-28 12:03:25 +0200

4 years agox86/boot: Fix load_system_tables() to be NMI/#MC-safe
Andrew Cooper [Wed, 24 Jun 2020 15:11:08 +0000 (17:11 +0200)]
x86/boot: Fix load_system_tables() to be NMI/#MC-safe

During boot, load_system_tables() is used in reinit_bsp_stack() to switch the
virtual addresses used from their .data/.bss alias, to their directmap alias.

The structure assignment is implemented as a memset() to zero first, then a
copy-in of the new data.  This causes the NMI/#MC stack pointers to
transiently become 0, at a point where we may have an NMI watchdog running.

Rewrite the logic using a volatile tss pointer (equivalent to, but more
readable than, using ACCESS_ONCE() for all writes).

This does drop the zeroing side effect for holes in the structure, but the
backing memory for the TSS is fully zeroed anyway, and architecturally, they
are all reserved.
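
The difference between the two update styles can be sketched as follows. This
is only an illustration under stated assumptions: struct demo_tss, its fields
and both function names are hypothetical and do not reflect Xen's real TSS
layout or code. The point is that individual stores through a volatile pointer
take each field straight from its old value to its new one, with no zeroed
intermediate state for an NMI or #MC to observe.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the TSS; not Xen's actual layout. */
struct demo_tss {
    uint64_t rsp0;
    uint64_t ist[3];   /* e.g. the NMI, #MC and #DF stack pointers */
};

/*
 * Hazardous pattern: a whole-structure assignment.  The compiler is free to
 * implement this as a memset() to zero followed by a copy-in, which leaves a
 * window in which ist[] reads as 0.
 */
void demo_tss_update_by_assignment(struct demo_tss *tss, uint64_t stack_top)
{
    assert(tss);

    *tss = (struct demo_tss){
        .rsp0 = stack_top,
        .ist  = { stack_top - 0x1000, stack_top - 0x2000, stack_top - 0x3000 },
    };
}

/*
 * Safer pattern: every store goes through a volatile pointer, so each field
 * is written individually and is never transiently zeroed.
 */
void demo_tss_update_volatile(struct demo_tss *ptr, uint64_t stack_top)
{
    volatile struct demo_tss *tss = ptr;

    assert(ptr);

    tss->rsp0   = stack_top;
    tss->ist[0] = stack_top - 0x1000;
    tss->ist[1] = stack_top - 0x2000;
    tss->ist[2] = stack_top - 0x3000;
}
```

Both functions leave the structure in the same final state; only the
intermediate states visible to an asynchronous handler differ.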

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9f3e9139fa6c3d620eb08dff927518fc88200b8d
master date: 2020-05-27 16:44:04 +0100

4 years agox86: clear RDRAND CPUID bit on AMD family 15h/16h
Jan Beulich [Wed, 24 Jun 2020 15:10:22 +0000 (17:10 +0200)]
x86: clear RDRAND CPUID bit on AMD family 15h/16h

Inspired by Linux commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24:

    There have been reports of RDRAND issues after resuming from suspend on
    some AMD family 15h and family 16h systems. This issue stems from a BIOS
    not performing the proper steps during resume to ensure RDRAND continues
    to function properly.

    Update the CPU initialization to clear the RDRAND CPUID bit for any family
    15h and 16h processor that supports RDRAND. If it is known that the family
    15h or family 16h system does not have an RDRAND resume issue or that the
    system will not be placed in suspend, the "cpuid=rdrand" kernel parameter
    can be used to stop the clearing of the RDRAND CPUID bit.

    Note, that clearing the RDRAND CPUID bit does not prevent a processor
    that normally supports the RDRAND instruction from executing it. So any
    code that determined the support based on family and model won't #UD.

Warn if no explicit choice was given on affected hardware.

Check RDRAND functions at boot as well as after S3 resume (the retry
limit chosen is entirely arbitrary).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93401e28a84b9dc5945f5d0bf5bce68e9d5ee121
master date: 2020-05-27 09:49:37 +0200

4 years agox86/idle: Extend ISR/C6 erratum workaround to Haswell
Andrew Cooper [Wed, 24 Jun 2020 15:03:30 +0000 (17:03 +0200)]
x86/idle: Extend ISR/C6 erratum workaround to Haswell

This bug was first discovered against Haswell.  It is definitely affected.

(The XenServer ticket for this bug was opened on 2013-05-30 which is coming up
on 7 years old, and predates Broadwell).

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b72d8870b5f68f06b083e6bfdb28f081bcb6ab3b
master date: 2020-05-22 20:04:23 +0100

4 years agox86/idle: prevent entering C3/C6 on some Intel CPUs due to errata
Roger Pau Monné [Wed, 24 Jun 2020 15:02:55 +0000 (17:02 +0200)]
x86/idle: prevent entering C3/C6 on some Intel CPUs due to errata

Apply a workaround for errata BA80, AAK120, AAM108, AAO67, BD59,
AAY54: Rapid Core C3/C6 Transition May Cause Unpredictable System
Behavior.

Limit maximum C state to C1 when SMT is enabled on the affected CPUs.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b2d502466547e6782ccadd501b8ef1482c391f2c
master date: 2020-05-22 16:08:54 +0200

4 years agox86/idle: prevent entering C6 with in service interrupts on Intel
Roger Pau Monné [Wed, 24 Jun 2020 15:02:23 +0000 (17:02 +0200)]
x86/idle: prevent entering C6 with in service interrupts on Intel

Apply a workaround for Intel errata BDX99, CLX30, SKX100, CFW125,
BDF104, BDH85, BDM135, KWB131: "A Pending Fixed Interrupt May Be
Dispatched Before an Interrupt of The Same Priority Completes".

Apply the errata to all server and client models (big cores) from
Broadwell to Cascade Lake. The workaround is grouped together with the
existing fix for errata AAJ72, and the eoi from the function name is
removed.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fc44a7014cafe28b8c53eeaf6ac2a71f5bc8b815
master date: 2020-05-22 16:07:38 +0200

4 years agox86/idle: rework C6 EOI workaround
Roger Pau Monné [Wed, 24 Jun 2020 15:01:48 +0000 (17:01 +0200)]
x86/idle: rework C6 EOI workaround

Change the C6 EOI workaround (errata AAJ72) to use x86_match_cpu. Also
call the workaround from mwait_idle; previously it was only used by
the ACPI idle driver. Finally, make sure the routine is called for all
states equal to or greater than ACPI_STATE_C3. Note that the ACPI driver
doesn't currently handle them, but the errata condition shouldn't be
limited by that.

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5fef1fd713660406a6187ef352fbf79986abfe43
master date: 2020-05-20 12:48:37 +0200

4 years agox86: determine MXCSR mask in all cases
Jan Beulich [Wed, 24 Jun 2020 15:01:10 +0000 (17:01 +0200)]
x86: determine MXCSR mask in all cases

For its use(s) by the emulator to be correct in all cases, the filling
of the variable needs to be independent of XSAVE availability. As
there's no suitable function in i387.c to put the logic in, keep it in
xstate_init(), arrange for the function to be called unconditionally,
and pull the logic ahead of all return paths there.

Fixes: 9a4496a35b20 ("x86emul: support {,V}{LD,ST}MXCSR")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2b532519d64e653a6bbfd9eefed6040a09c8876d
master date: 2020-05-18 17:18:56 +0200

4 years agox86/hvm: Fix shifting in stdvga_mem_read()
Andrew Cooper [Wed, 24 Jun 2020 15:00:33 +0000 (17:00 +0200)]
x86/hvm: Fix shifting in stdvga_mem_read()

stdvga_mem_read() has a return type of uint8_t, which promotes to int rather
than unsigned int.  Shifting by 24 may hit the sign bit.

Spotted by Coverity.
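
The promotion problem can be illustrated with a small sketch. The helper names
below are hypothetical stand-ins, not the actual stdvga code; the cast before
the shift is one common way to avoid shifting into the sign bit of an int.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical byte source standing in for a function returning uint8_t. */
static uint8_t demo_read_byte(const uint8_t *mem, unsigned int idx)
{
    return mem[idx];
}

/* Assemble up to four bytes in little-endian order. */
uint32_t demo_read_le(const uint8_t *mem, unsigned int size)
{
    uint32_t data = 0;
    unsigned int i;

    assert(size <= 4);

    for ( i = 0; i < size; i++ )
        /*
         * Without the cast, the uint8_t result promotes to (signed) int.
         * Shifting a value >= 0x80 left by 24 would then overflow into the
         * sign bit, which is undefined behaviour.  Casting to uint32_t first
         * keeps the shift well defined.
         */
        data |= (uint32_t)demo_read_byte(mem, i) << (8 * i);

    return data;
}
```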

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 97fb0253e6c2f2221bfd0895b7ffe3a99330d847
master date: 2020-05-18 15:22:53 +0100

4 years agox86/build: Unilaterally disable -fcf-protection
Andrew Cooper [Wed, 24 Jun 2020 14:59:49 +0000 (16:59 +0200)]
x86/build: Unilaterally disable -fcf-protection

Xen doesn't support CET-IBT yet.  At a minimum, logic is required to enable it
for supervisor use, but the livepatch functionality needs to learn not to
overwrite ENDBR64 instructions.

Furthermore, Ubuntu enables -fcf-protection by default, along with a buggy
version of GCC-9 which objects to it in combination with
-mindirect-branch=thunk-extern (Fixed in GCC 10, 9.4).

Various objects (Xen boot path, Rombios 32 stubs) require .text to be at the
beginning of the object.  These paths explode when .note.gnu.properties gets
put ahead of .text and we end up executing the notes data.

Disable -fcf-protection for all embedded objects.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

4 years agox86/build: move -fno-asynchronous-unwind-tables into EMBEDDED_EXTRA_CFLAGS
Andrew Cooper [Wed, 24 Jun 2020 14:59:21 +0000 (16:59 +0200)]
x86/build: move -fno-asynchronous-unwind-tables into EMBEDDED_EXTRA_CFLAGS

Users of EMBEDDED_EXTRA_CFLAGS already use -fno-asynchronous-unwind-tables, or
ought to.  This shrinks the size of the rombios 32bit stubs in guest memory.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

4 years agox86/build32: Discard all orphaned sections
Andrew Cooper [Wed, 24 Jun 2020 14:58:51 +0000 (16:58 +0200)]
x86/build32: Discard all orphaned sections

Linkers may put orphaned sections ahead of .text, which breaks the calling
requirements.  A concrete example is Ubuntu's GCC-9 default of enabling
-fcf-protection which causes us to try and execute .note.gnu.properties during
Xen's boot.

Put .got.plt in its own section as it specifically needs preserving from the
linker's point of view, and discard everything else.  This will hopefully be
more robust to other unexpected toolchain properties.

Fixes boot from an Ubuntu build of Xen.

Reported-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Acked-by: Jan Beulich <jbeulich@suse.com>

4 years agox86/guest: Fix assembler warnings with newer binutils
Andrew Cooper [Wed, 24 Jun 2020 14:58:22 +0000 (16:58 +0200)]
x86/guest: Fix assembler warnings with newer binutils

GAS of at least version 2.34 complains:

  hypercall_page.S: Assembler messages:
  hypercall_page.S:24: Warning: symbol 'HYPERCALL_set_trap_table' already has its type set
  ...
  hypercall_page.S:71: Warning: symbol 'HYPERCALL_arch_7' already has its type set

which is because the whole page is declared as STT_OBJECT already.  Rearrange
.set with respect to .type in DECLARE_HYPERCALL() so STT_FUNC is already in
place.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

4 years agox86/cpuidle: correct Cannon Lake residency MSRs
Jan Beulich [Wed, 24 Jun 2020 14:57:39 +0000 (16:57 +0200)]
x86/cpuidle: correct Cannon Lake residency MSRs

As per SDM rev 071 Cannon Lake has
- no CC3 residency MSR at 3FC,
- a CC1 residency MSR at 660 (like various Atoms),
- a useless (always zero) CC3 residency MSR at 662.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9ff09aefc46385dc04c38b6dd1f1ac25f784f482
master date: 2020-04-03 17:15:58 +0200

4 years agoupdate Xen version to 4.12.4-pre
Jan Beulich [Wed, 24 Jun 2020 14:55:04 +0000 (16:55 +0200)]
update Xen version to 4.12.4-pre

4 years agotools/libxl: Fix memory leak in libxl_cpuid_set()
Andrew Cooper [Fri, 12 Jun 2020 17:32:27 +0000 (18:32 +0100)]
tools/libxl: Fix memory leak in libxl_cpuid_set()

xc_cpuid_set() returns allocated memory via cpuid_res, which libxl needs to
free() seeing as it discards the results.

This is logically a backport of c/s b91825f628 "tools/libxc: Drop
config_transformed parameter from xc_cpuid_set()" but rewritten as one caller
of xc_cpuid_set() does use returned values.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit c54de7d9df7718ea53bf21e1ff5bbd339602a704)

4 years agox86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden
Andrew Cooper [Wed, 10 Jun 2020 17:57:00 +0000 (18:57 +0100)]
x86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden

RDRAND/RDSEED can be hidden using cpuid= to mitigate SRBDS if microcode
isn't available.

This is part of XSA-320 / CVE-2020-0543.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 7028534d8482d25860c4d1aa8e45f0b911abfc5a)

4 years agox86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel

See patch documentation and comments.

This is part of XSA-320 / CVE-2020-0543

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a49b9a7920c82015381740905582b666160d955)

4 years agox86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling

This is part of XSA-320 / CVE-2020-0543

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit caab85ab58c0cdf74ab070a5de5c4df89f509ff3)

4 years agoupdate Xen version to 4.12.3 RELEASE-4.12.3
Jan Beulich [Thu, 14 May 2020 12:21:48 +0000 (14:21 +0200)]
update Xen version to 4.12.3

5 years agox86/ucode/intel: Writeback and invalidate caches before updating microcode
Ashok Raj [Thu, 7 May 2020 12:58:16 +0000 (14:58 +0200)]
x86/ucode/intel: Writeback and invalidate caches before updating microcode

Updating microcode is less error prone when caches have been flushed,
depending on what exactly the microcode is updating. For example, some of the
issues around certain Broadwell parts can be addressed by doing a full cache
flush.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Linux commit 91df9fdf51492aec9fed6b4cbd33160886740f47, ported to Xen]
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 77c82949990edaf21130be842a289a7fb7a439e1
master date: 2020-05-05 20:18:19 +0100

5 years agox86/traps: fix an off-by-one error
Hongyan Xia [Thu, 7 May 2020 12:57:35 +0000 (14:57 +0200)]
x86/traps: fix an off-by-one error

stack++ can go into the next page and unmap_domain_page() will unmap the
wrong one, causing mapcache and memory corruption. Fix.
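
A rough illustration of why the walked pointer must not be used for the unmap,
assuming an unmap routine that derives the page from whatever pointer it is
given. All names here are hypothetical and the page size is fixed at 4096 for
the sketch; this is not the traps.c code.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_PAGE_MASK (~(uintptr_t)(DEMO_PAGE_SIZE - 1))

/* Page base as an unmap routine might derive it from any pointer. */
uintptr_t demo_page_base(const void *p)
{
    return (uintptr_t)p & DEMO_PAGE_MASK;
}

/*
 * Walk word by word from 'start' until the cursor leaves the starting page,
 * then report which page the cursor now refers to.  The caller must supply a
 * buffer that extends into the following page.
 */
uintptr_t demo_base_after_walk(const unsigned long *start)
{
    const unsigned long *stack = start;

    assert(start);

    while ( demo_page_base(stack) == demo_page_base(start) )
        stack++;   /* the final increment lands on the next page */

    /* Unmapping via 'stack' here would release the wrong page. */
    return demo_page_base(stack);
}
```

Keeping the originally mapped pointer in a separate variable, and unmapping
through that, avoids the problem.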

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2e3d87cc734a895ef5b486926274a178836b67a9
master date: 2020-05-05 16:13:44 +0100

5 years agox86/hvm: simplify hvm_physdev_op allowance control
Roger Pau Monné [Thu, 7 May 2020 12:56:56 +0000 (14:56 +0200)]
x86/hvm: simplify hvm_physdev_op allowance control

PVHv1 dom0 was given access to all PHYSDEVOP hypercalls, and such
restriction was not removed when PVHv1 code was removed. As a result
the switch in hvm_physdev_op was more complicated than required, and
relied on PVHv2 dom0 not having PIRQ support in order to prevent
access to some PV specific PHYSDEVOPs.

Fix this by moving the default case to the bottom of the switch, since
there's no need for any fall through now. Also remove the hardware
domain check, as all the not explicitly listed PHYSDEVOPs are
forbidden for HVM domains.

Finally tighten the condition to allow usage of
PHYSDEVOP_pci_mmcfg_reserved: apart from having vPCI enabled it should
only be used by the hardware domain. Note that the code in
do_physdev_op is already restricting the call to privileged domains
only, but it can be further restricted to the hardware domain only, as
other privileged domains don't have access to MMCFG regions anyway.

Overall no functional change should arise from this change.

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a00e3737e085ebc1f313e36b188d4958e939e531
master date: 2020-05-05 09:52:28 +0200

5 years agox86emul: extend x86_insn_is_mem_write() coverage
Jan Beulich [Thu, 7 May 2020 12:56:03 +0000 (14:56 +0200)]
x86emul: extend x86_insn_is_mem_write() coverage

Several insns were missed when this function was first added. As far as
insns already supported by the emulator go - SMSW and {,V}STMXCSR were
wrongly considered r/o insns so far.

Insns like the VMX, SVM, or CET-SS ones, PTWRITE, or AMD's new SNP ones
are intentionally not covered just yet. VMPTRST is put there just to
complete the respective group.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc6fa977be54a24a1325e3f2d08b1b1dcb675f44
master date: 2020-05-05 09:50:54 +0200

5 years agox86/pass-through: avoid double IRQ unbind during domain cleanup
Jan Beulich [Thu, 7 May 2020 12:54:39 +0000 (14:54 +0200)]
x86/pass-through: avoid double IRQ unbind during domain cleanup

XEN_DOMCTL_destroydomain creates a continuation if domain_kill -ERESTARTs.
In that scenario, it is possible to receive multiple _pirq_guest_unbind
calls for the same pirq from domain_kill, if the pirq has not yet been
removed from the domain's pirq_tree, as:
  domain_kill()
    -> domain_relinquish_resources()
      -> pci_release_devices()
        -> pci_clean_dpci_irq()
          -> pirq_guest_unbind()
            -> __pirq_guest_unbind()

Avoid recurring invocations of pirq_guest_unbind() by removing the pIRQ
from the tree being iterated after the first call there. In case such a
removed entry still has a softirq outstanding, record it and re-check
upon re-invocation.

Note that pirq_cleanup_check() gets relaxed beyond what's strictly
needed here, to avoid introducing an asymmetry there between HVM and PV
guests.

Reported-by: Varad Gautam <vrd@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Varad Gautam <vrd@amazon.de>
Reviewed-by: Paul Durrant <paul@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5b58dad089880127674d460494d1a9d68109b3d7
master date: 2020-04-30 10:40:59 +0200

5 years agoxen/grants: fix hypercall continuation for GNTTABOP_cache_flush
Juergen Gross [Thu, 7 May 2020 12:53:13 +0000 (14:53 +0200)]
xen/grants: fix hypercall continuation for GNTTABOP_cache_flush

The GNTTABOP_cache_flush hypercall has a wrong test for hypercall
continuation. The test today is:

    if ( rc > 0 || opaque_out != 0 )

Unfortunately this will be true even in case of an error (rc < 0),
possibly leading to very long lasting hypercalls (times of more
than an hour have been observed in a test case).

Correct the test condition to result in false with rc < 0 and set
opaque_out only if no error occurred, to be on the safe side.
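
The effect of the two conditions can be compared with a small sketch. The
first predicate is the test quoted above; the second is only one possible
corrected form, written for illustration, and is not claimed to be the exact
code of the fix. Function names and the type of opaque_out are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Test as quoted above: also true for rc < 0 whenever opaque_out != 0. */
bool demo_continue_old(int rc, uint64_t opaque_out)
{
    return rc > 0 || opaque_out != 0;
}

/* One possible corrected form: an error never requests a continuation. */
bool demo_continue_fixed(int rc, uint64_t opaque_out)
{
    return rc >= 0 && (rc > 0 || opaque_out != 0);
}
```

With rc = -22 and a non-zero opaque_out, the old predicate asks for another
continuation while the corrected one does not.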

Partially-suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 46d8f69d466a05863737fb81d8c9ef39c3be8b45
master date: 2020-04-29 14:12:50 +0100

5 years agolibxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)
Andrew Cooper [Tue, 4 Feb 2020 20:29:38 +0000 (20:29 +0000)]
libxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)

It turns out that a bug (since forever) in Xen causes XSAVE records to have
non-architectural behaviour on xsave-capable hardware, when a PV guest has not
touched the state.

In such a case, the data record returned from Xen is 2*uint64_t, both claiming
the (illegitimate) state of %xcr0 and %xcr0_accum being 0.

Adjust the bound in handle_x86_pv_vcpu_blob() to cope with this.

Fixes: 2a62c22715b "libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()"
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0729830cc425a8ff27a3137e87b93768ae3c853c)
(cherry picked from commit d2aecd86c4481291b260869c47cf0a9a02321564)

5 years agolibxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()
Andrew Cooper [Thu, 19 Dec 2019 20:32:20 +0000 (20:32 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()

The current logic only works by chance, in that XSAVE records also tend to be
a multiple of 128.  Implement the missing logic for XSAVE.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 2a62c22715bf81c5695ae0511f89a940c7c6a492)
(cherry picked from commit 0e2bbcf8b4fe6f5fd23a341848f3785c213b26bb)

5 years agolibxc/restore: Fix data auditing in handle_x86_pv_info()
Andrew Cooper [Wed, 18 Dec 2019 20:17:42 +0000 (20:17 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_info()

handle_x86_pv_info() has a subtle bug.  It uses an 'else if' chain with a
clause in the middle which doesn't exit unconditionally.  In practice, this
means that when restoring a 32bit PV guest, later sanity checks are skipped.

Rework the logic a little to be simpler.  There are exactly two valid
combinations of fields in X86_PV_INFO, so factor this out and check them all
in one go, before making adjustments to the current domain.

Once adjustments have been completed successfully, sanity check the result
against the X86_PV_INFO settings in one go, rather than piece-wise.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit aafae0e800e9936b9eb6566e5fcdbe823625a7d1)
(cherry picked from commit 5932ee1e06047d71bcf6975e1a631e31afaf5fe2)

5 years agolibxc/restore: Fix error message for unrecognised stream version
Andrew Cooper [Tue, 17 Dec 2019 13:49:56 +0000 (13:49 +0000)]
libxc/restore: Fix error message for unrecognised stream version

The Expected and Got values are rendered in the wrong order.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit f50a4f6e244cfc8e773300c03aaf4db391f3028a)
(cherry picked from commit 7b2225078b4b91044c365b2276c8897c46241c79)

5 years agotools/xenstore: fix a use after free problem in xenstored
Juergen Gross [Fri, 3 Apr 2020 12:03:40 +0000 (13:03 +0100)]
tools/xenstore: fix a use after free problem in xenstored

Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object
twice") introduced a potential use after free problem in
domain_cleanup(): after calling talloc_unlink() for domain->conn
domain->conn is set to NULL. The problem is that domain is registered
as talloc child of domain->conn, so it might be freed by the
talloc_unlink() call.

With Xenstore being single threaded there are normally no concurrent
memory allocations running and freeing a virtual memory area normally
doesn't result in that area no longer being accessible. A problem
could occur only in case either a signal received results in some
memory allocation done in the signal handler (SIGHUP is a primary
candidate leading to reopening the log file), or in case the talloc
framework would do some internal memory allocation during freeing of
the memory (which would lead to clobbering of the freed domain
structure).

Fixes: 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object twice")
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit bb2a34fd740e9a26be9e2244f1a5b4cef439e5a8)
(cherry picked from commit dc5176d0f9434e275e0be1df8d0518e243798beb)

5 years agolibxl: Fix comment about dcs.sdss
Anthony PERARD [Thu, 23 Jan 2020 16:56:46 +0000 (16:56 +0000)]
libxl: Fix comment about dcs.sdss

The field 'sdss' was named 'dmss' before; commit 3148bebbf0ab did the
rename but didn't update the comment.

Fixes: 3148bebbf0ab ("libxl: rename a field in libxl__domain_create_state")
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 035c4d771600f300382a1637f2da33023f76b4c1)
(cherry picked from commit 5351a0a170fc7f6290d7d3d8be302d53d2426a87)

5 years agodocs/misc: pvcalls: Verbatim block should be indented with 4 spaces
Julien Grall [Sat, 11 Jan 2020 00:03:44 +0000 (00:03 +0000)]
docs/misc: pvcalls: Verbatim block should be indented with 4 spaces

At the moment, the diagram is only indented with 2 spaces, so pandoc
misinterprets it and does not display it correctly.

Fix it by indenting all the block by 4 spaces (i.e. an extra 2 spaces).

Fixes: d661611d08 ("docs/markdown: Switch to using pandoc, and fix underscore escaping")
Signed-off-by: Julien Grall <julien@xen.org>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9c8705f8fe5bfb75a6a00163308d297059b61f6a)
(cherry picked from commit 8b60270731eabe7a7dfd41bd625338505829a617)

5 years agodocs: document CONTROL command of xenstore protocol
Juergen Gross [Tue, 28 Jan 2020 06:21:07 +0000 (06:21 +0000)]
docs: document CONTROL command of xenstore protocol

The CONTROL command (former DEBUG command) isn't specified in the
xenstore protocol doc. Add it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport: 4.9+
(cherry picked from commit f910c3ebc6a178c5cbbc0868134be536fae7f7cf)

5 years agodocs: add DIRECTORY_PART specification do xenstore protocol doc
Juergen Gross [Mon, 27 Jan 2020 16:50:50 +0000 (17:50 +0100)]
docs: add DIRECTORY_PART specification do xenstore protocol doc

DIRECTORY_PART was missing in docs/misc/xenstore.txt. Add it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
Acked-by: Wei Liu <wl@xen.org>
Backport: 4.9+
(cherry picked from commit 94a0252c10cb9938bdee98cc456c23e17b28eafb)

5 years agobuild,xsm: fix multiple call
Anthony PERARD [Mon, 27 Apr 2020 13:58:42 +0000 (15:58 +0200)]
build,xsm: fix multiple call

Both scripts mkflask.sh and mkaccess_vector.sh generate multiple
files. Exploit the 'multi-target pattern rule' trick to call each
script only once.

Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 52f3f319851e40892fbafeae53e512c7d61f03d0
master date: 2020-04-23 09:59:05 +0200

5 years agox86: validate VM assist value in arch_set_info_guest()
Jan Beulich [Mon, 27 Apr 2020 13:57:13 +0000 (15:57 +0200)]
x86: validate VM assist value in arch_set_info_guest()

While I can't spot anything that would go wrong, just like the
respective hypercall only permits applicable bits to be set, we should
also do so when loading guest context.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a62c6fe05c4ae905b7d4cb0ca946508b7f96d522
master date: 2020-04-22 13:01:10 +0200

5 years agox86/HVM: expose VM assist hypercall
Jan Beulich [Mon, 27 Apr 2020 13:55:51 +0000 (15:55 +0200)]
x86/HVM: expose VM assist hypercall

In preparation for the addition of VMASST_TYPE_runstate_update_flag
commit 72c538cca957 ("arm: add support for vm_assist hypercall") enabled
the hypercall for Arm. I consider it not logical that it then isn't also
exposed to x86 HVM guests (with the same single feature permitted to be
enabled as Arm has); Linux actually tries to use it afaict.

Rather than introducing yet another thin wrapper around vm_assist(),
make that function the main handler, requiring a per-arch
arch_vm_assist_valid_mask() definition instead.
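
A minimal sketch of a common handler driven by a per-arch validity mask. The
structure, command values and function name are hypothetical, chosen only for
this illustration; the valid_mask field stands in for whatever
arch_vm_assist_valid_mask() would return for the domain.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Hypothetical command values for this sketch. */
enum { DEMO_VMASST_CMD_enable, DEMO_VMASST_CMD_disable };

struct demo_domain {
    unsigned long vm_assist;    /* assists currently enabled */
    unsigned long valid_mask;   /* stand-in for the per-arch valid mask */
};

int demo_vm_assist(struct demo_domain *d, unsigned int cmd, unsigned int type)
{
    assert(d);

    /* Reject bits outside the word, and bits the architecture does not permit. */
    if ( type >= sizeof(d->vm_assist) * CHAR_BIT ||
         !(d->valid_mask & (1UL << type)) )
        return -EINVAL;

    switch ( cmd )
    {
    case DEMO_VMASST_CMD_enable:
        d->vm_assist |= 1UL << type;
        return 0;

    case DEMO_VMASST_CMD_disable:
        d->vm_assist &= ~(1UL << type);
        return 0;
    }

    return -ENOSYS;
}
```

An architecture permitting a single feature would simply supply a mask with
one bit set.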

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: f13404d57f55a97838f1c16a366fbc3231ec21f1
master date: 2020-04-22 12:58:25 +0200

5 years agox86: Enumeration for Control-flow Enforcement Technology
Andrew Cooper [Mon, 27 Apr 2020 13:54:14 +0000 (15:54 +0200)]
x86: Enumeration for Control-flow Enforcement Technology

The CET spec has been published and guest kernels are starting to get support.
Introduce the CPUID and MSRs, and fully block the MSRs from guest use.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>
master commit: 4803a67114279a656a54a23cebed646da32efeb6
master date: 2020-04-21 16:52:03 +0100

5 years agox86/vtd: relax EPT page table sharing check
Roger Pau Monné [Mon, 27 Apr 2020 13:53:26 +0000 (15:53 +0200)]
x86/vtd: relax EPT page table sharing check

The EPT page tables can be shared with the IOMMU as long as the page
sizes supported by EPT are also supported by the IOMMU.

Current code checks that both the IOMMU and EPT support the same page
sizes, but this is not strictly required, the IOMMU supporting more
page sizes than EPT is fine and shouldn't block page table sharing.

This is likely not a common case (IOMMU supporting more page sizes
than EPT), but should still be fixed for correctness.
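
The relaxation amounts to replacing an equality test with a subset test. A
sketch, assuming page-size capabilities are represented as bitmasks (the
macro and function names are hypothetical, not the VT-d code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical capability bits for this sketch. */
#define DEMO_PAGE_4K (1u << 0)
#define DEMO_PAGE_2M (1u << 1)
#define DEMO_PAGE_1G (1u << 2)

/* Strict check: both sides must support exactly the same sizes. */
bool demo_can_share_strict(unsigned int ept, unsigned int iommu)
{
    return ept == iommu;
}

/* Relaxed check: every size EPT may use must also be supported by the IOMMU. */
bool demo_can_share_relaxed(unsigned int ept, unsigned int iommu)
{
    return (ept & ~iommu) == 0;
}
```

An IOMMU supporting 4K, 2M and 1G pages alongside an EPT supporting only 4K
and 2M fails the strict check but passes the relaxed one.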

Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3957e12c02670b97855ef0933b373f99993fa598
master date: 2020-04-21 10:54:56 +0200

5 years agohvmloader: enable MMIO and I/O decode, after all resource allocation
Harsha Shamsundara Havanur [Mon, 27 Apr 2020 13:52:45 +0000 (15:52 +0200)]
hvmloader: enable MMIO and I/O decode, after all resource allocation

It was observed that PCI MMIO and/or IO BARs were programmed with
memory and I/O decodes (bits 0 and 1 of PCI COMMAND register) enabled,
during the PCI setup phase. This resulted in incorrect memory mapping as
soon as the lower half of the 64-bit BAR was programmed.
This displaced any RAM mappings under 4G. After the
upper half is programmed PCI memory mapping is restored to its
intended high mem location, but the RAM displaced is not restored.
The OS then continues to boot and function until it tries to access
the displaced RAM at which point it suffers a page fault and crashes.

This patch addresses the issue by deferring enablement of memory and
I/O decode in the command register until all the resources, like interrupts,
I/O and/or MMIO BARs for all the PCI device functions, are programmed,
in the descending order of memory requested.
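
The two-phase ordering can be sketched against a simulated device. Everything
below is hypothetical (the structure, names and flat BAR allocation); it
ignores 64-bit BARs and the size-ordered allocation, and only shows that no
BAR is written while decode is enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PCI_COMMAND_IO     0x1u   /* bit 0: I/O decode */
#define DEMO_PCI_COMMAND_MEMORY 0x2u   /* bit 1: memory decode */
#define DEMO_DECODE (DEMO_PCI_COMMAND_IO | DEMO_PCI_COMMAND_MEMORY)
#define DEMO_NR_BARS 6

/* Simulated device: a command register, its BARs, and a hazard flag. */
struct demo_pci_dev {
    uint16_t cmd;
    uint32_t bar[DEMO_NR_BARS];
    bool bar_written_while_decoding;
};

static void demo_write_bar(struct demo_pci_dev *dev, unsigned int idx,
                           uint32_t val)
{
    assert(idx < DEMO_NR_BARS);

    /* Record the hazardous case: a BAR changing while decode is active. */
    if ( dev->cmd & DEMO_DECODE )
        dev->bar_written_while_decoding = true;

    dev->bar[idx] = val;
}

void demo_pci_setup(struct demo_pci_dev *devs, size_t nr, uint32_t base)
{
    size_t i;
    unsigned int b;

    /* Phase 1: disable decode, then program every BAR of every device. */
    for ( i = 0; i < nr; i++ )
    {
        devs[i].cmd = (uint16_t)(devs[i].cmd & ~DEMO_DECODE);

        for ( b = 0; b < DEMO_NR_BARS; b++, base += 0x1000 )
            demo_write_bar(&devs[i], b, base);
    }

    /* Phase 2: only once everything is programmed, enable decode. */
    for ( i = 0; i < nr; i++ )
        devs[i].cmd |= DEMO_DECODE;
}
```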

Signed-off-by: Harsha Shamsundara Havanur <havanur@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a8e0c228c79f3a000e19183090eb41fca173b034
master date: 2020-04-16 10:58:46 +0200

5 years agox86/boot: Fix early exception handling with CONFIG_PERF_COUNTERS
Andrew Cooper [Mon, 27 Apr 2020 13:51:14 +0000 (15:51 +0200)]
x86/boot: Fix early exception handling with CONFIG_PERF_COUNTERS

The PERFC_INCR() macro uses current->processor, but current is not valid
during early boot.  This causes the following crash to occur if
e.g. rdmsr_safe() has to recover from a #GP fault.

  (XEN) Early fatal page fault at e008:ffff82d0803b1a39 (cr2=0000000000000004, ec=0000)
  (XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d0803b1a39>] x86_64/entry.S#handle_exception_saved+0x64/0xb8
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d0803b1a39>] R x86_64/entry.S#handle_exception_saved+0x64/0xb8
  (XEN)    [<ffff82d0806394fe>] F __start_xen+0x2cd/0x2980
  (XEN)    [<ffff82d0802000ec>] F __high_start+0x4c/0x4e

Furthermore, the PERFC_INCR() macro is wildly inefficient.  There has been a
single caller for many releases now, so inline it and delete the macro
completely.

There is no need to reference current at all.  What is actually needed is the
per_cpu_offset which can be obtained directly from the top-of-stack block.
This simplifies the counter handling to 3 instructions and no spilling to the
stack at all.

The same breakage from above is now handled properly:

  (XEN) traps.c:1591: GPF (0000): ffff82d0806394fe [__start_xen+0x2cd/0x2980] -> ffff82d0803b3bfb

Reported-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Julien Grall <jgrall@amazon.com>
master commit: 615bfe42c6d183a0e54a0525ef82b58580d01619
master date: 2020-04-16 09:48:38 +0100

5 years agox86/EFI: also fill boot_tsc_stamp on the xen.efi boot path
Jan Beulich [Mon, 27 Apr 2020 13:49:38 +0000 (15:49 +0200)]
x86/EFI: also fill boot_tsc_stamp on the xen.efi boot path

Commit e3a379c35eff ("x86/time: always count s_time from Xen boot")
introducing this missed adjusting this path as well.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0dbc112e727f6c17f306c864950bdf83dece5cd5
master date: 2020-04-14 11:42:11 +0200

5 years agognttab: fix GNTTABOP_copy continuation handling
Jan Beulich [Tue, 14 Apr 2020 13:00:18 +0000 (15:00 +0200)]
gnttab: fix GNTTABOP_copy continuation handling

The XSA-226 fix was flawed - the backwards transformation on rc was done
too early, causing a continuation to not get invoked when the need for
preemption was determined at the very first iteration of the request.
This in particular means that all of the status fields of the individual
operations would be left untouched, i.e. set to whatever the caller may
or may not have initialized them to.

This is part of XSA-318.

Reported-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Tested-by: Pawel Wieczorkiewicz <wipawel@amazon.de>
master commit: d6f22d5d9e8d6848ec229083ac9fb044f0adea93
master date: 2020-04-14 14:42:32 +0200

5 years agoxen/gnttab: Fix error path in map_grant_ref()
Ross Lagerwall [Tue, 14 Apr 2020 12:58:48 +0000 (14:58 +0200)]
xen/gnttab: Fix error path in map_grant_ref()

Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets,
changing the logic.  If the _set_status() call fails, the grant_map hypercall
would fail with a status of 1 (rc != GNTST_okay) instead of the expected
negative GNTST_* error.
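
The effect of the misplaced brackets can be reproduced in a minimal Python sketch. The stub set_status() and the two map_* helpers are hypothetical names for illustration; only GNTST_okay and the "negative GNTST_* error" convention come from the text above.

```python
GNTST_okay = 0
GNTST_general_error = -1  # stand-in for any negative GNTST_* error


def set_status():
    """Hypothetical stub that always fails, as with bad guest state."""
    return GNTST_general_error


def map_buggy():
    # Brackets wrap the comparison, so rc holds its boolean result.
    if (rc := (set_status() != GNTST_okay)):
        return int(rc)
    return GNTST_okay


def map_fixed():
    # Brackets wrap the assignment, so rc holds the real error code.
    if (rc := set_status()) != GNTST_okay:
        return rc
    return GNTST_okay
```

In this sketch map_buggy() reports 1 while map_fixed() reports the negative error, which is the difference a caller checking for negative codes would trip over.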

This error path can be taken due to bad guest state, and causes net/blk-back
in Linux to crash.

This is XSA-316.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: da0c66c8f48042a0186799014af69db0303b1da5
master date: 2020-04-14 14:41:02 +0200

5 years agoxen/rwlock: Add missing memory barrier in the unlock path of rwlock
Julien Grall [Tue, 14 Apr 2020 12:56:58 +0000 (14:56 +0200)]
xen/rwlock: Add missing memory barrier in the unlock path of rwlock

The rwlock unlock paths are using atomic_sub() to release the lock.
However the implementation of atomic_sub() rightfully doesn't contain a
memory barrier. On Arm, this means a processor is allowed to re-order
the memory access with the preceding access.

In other words, the unlock may be seen by another processor before all
the memory accesses within the "critical" section.

The rwlock paths already contain barriers indirectly, but they are not
very useful without the counterpart in the unlock paths.

The memory barriers are not necessary on x86 because loads/stores are
not re-ordered with lock instructions.

So add arch_lock_release_barrier() in the unlock paths, which will only
add a memory barrier on Arm.

Take the opportunity to document each lock path, explaining why a
barrier is not necessary.

This is XSA-314.

Signed-off-by: Julien Grall <jgrall@amazon.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6890a04072e664c25447a297fe663b45ecfd6398
master date: 2020-04-14 14:37:11 +0200

5 years agoxenoprof: limit consumption of shared buffer data
Jan Beulich [Tue, 14 Apr 2020 12:56:06 +0000 (14:56 +0200)]
xenoprof: limit consumption of shared buffer data

Since a shared buffer can be written to by the guest, we may only read
the head and tail pointers from there (all other fields should only ever
be written to). Furthermore, for any particular operation the two values
must be read exactly once, with both checks and consumption happening
with the thus read values. (The backtrace related xenoprof_buf_space()
use in xenoprof_log_event() is an exception: The values used there get
re-checked by every subsequent xenoprof_add_sample().)
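
The "read exactly once" rule can be sketched as follows. This is an illustration only: the dictionary standing in for the shared buffer, the field names event_head and event_tail, and the helper ring_space() are assumptions, not the hypervisor's code.

```python
def ring_space(shared, size):
    """Return the free slots in a ring of `size` entries, or None if invalid."""
    # Snapshot each guest-writable index exactly once.
    head = shared["event_head"]
    tail = shared["event_tail"]
    # Validate and consume only the local copies from here on, so a
    # concurrent guest write cannot change the values between the two.
    if not (0 <= head < size and 0 <= tail < size):
        return None
    return (tail - head - 1) % size
```

Because the checks and the arithmetic use the same local copies, a guest that rewrites the indices afterwards cannot make the result disagree with what was validated.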

Since that code needed touching, also fix the double increment of the
lost samples count in case the backtrace related xenoprof_add_sample()
invocation in xenoprof_log_event() fails.

Where code is being touched anyway, add const as appropriate, but take
the opportunity to entirely drop the now unused domain parameter of
xenoprof_buf_space().

This is part of XSA-313.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 50ef9a3cb26e2f9383f6fdfbed361d8f174bae9f
master date: 2020-04-14 14:33:19 +0200

5 years agoxenoprof: clear buffer intended to be shared with guests
Jan Beulich [Tue, 14 Apr 2020 12:55:05 +0000 (14:55 +0200)]
xenoprof: clear buffer intended to be shared with guests

alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen
internally used allocations, but buffers allocated to be shared with
(unprivileged) guests need to be zapped of their prior content.

This is part of XSA-313.

Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0763a7ebfcdad66cf9e5475a1301eefb29bae9ed
master date: 2020-04-14 14:32:33 +0200

5 years agoxen/arm: Sign extend TimerValue when computing the CompareValue
Jeff Kubascik [Tue, 21 Jan 2020 15:07:04 +0000 (10:07 -0500)]
xen/arm: Sign extend TimerValue when computing the CompareValue

Xen will only store the CompareValue as it can be derived from the
TimerValue (ARM DDI 0487E.a section D11.2.4):

  CompareValue = (Counter[63:0] + SignExtend(TimerValue))[63:0]

While the TimerValue is a 32-bit signed value, our implementation
assumed it is a 32-bit unsigned value.
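
The formula above can be checked with a small Python sketch. The helper names sign_extend32() and compare_value() are made up for illustration and model only the arithmetic, not the emulation code.

```python
MASK64 = (1 << 64) - 1


def sign_extend32(value):
    """Interpret the low 32 bits of `value` as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def compare_value(counter, timer_value):
    """CompareValue = (Counter + SignExtend(TimerValue))[63:0]."""
    return (counter + sign_extend32(timer_value)) & MASK64
```

With a counter of 1000 and a TimerValue of 0xFFFFFFFF (that is, -1), the signed reading gives 999, whereas treating the value as unsigned would place the deadline roughly 2^32 ticks in the future.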

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 3c601c5f056fba055b7a1438b84b69fc649275c3)

5 years agoxen/arm: remove physical timer offset
Jeff Kubascik [Tue, 21 Jan 2020 15:07:03 +0000 (10:07 -0500)]
xen/arm: remove physical timer offset

The physical timer traps apply an offset so that time starts at 0 for
the guest. However, this offset is not currently applied to the physical
counter. Per the ARMv8 Reference Manual (ARM DDI 0487E.a), section
D11.2.4 Timers, the "Offset" between the counter and timer should be
zero for a physical timer. This removes the offset to make the timer and
counter consistent.

This also cleans up the physical timer implementation to better match
the virtual timer - both cval's now hold the hardware value.

In the case the guest sets cval to a time before Xen started, the correct
behavior is to expire the timer immediately. To do this, we set the expires
argument of set_timer to zero.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit f14f55b7ee295277c8dd09e37e0fa0902ccf7eb4)

5 years agoxen/arm: during efi boot, improve the check for usable memory
Stefano Stabellini [Tue, 14 Jan 2020 23:31:55 +0000 (15:31 -0800)]
xen/arm: during efi boot, improve the check for usable memory

When booting via EFI, the EFI memory map has information about memory
regions and their type. Improve the check for the type and attribute of
each memory region to figure out whether it is usable memory or not.
This patch brings the check on par with Linux v5.5-rc6 and makes more
memory reusable as normal memory by Xen (except that Linux also reuses
EFI_PERSISTENT_MEMORY, which we do not).

Specifically, this patch also reuses memory marked as
EfiLoaderCode/Data, and it uses both Attribute and Type for the check
(Attribute needs to be EFI_MEMORY_WB).
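
A rough Python sketch of a check that combines Type and Attribute is shown below. The numeric values follow the UEFI specification; including EfiConventionalMemory and leaving out the boot services types are assumptions of this sketch, and is_usable() is a hypothetical helper rather than the function in the patch.

```python
EFI_MEMORY_WB = 0x8  # write-back cacheable attribute bit

# Memory types treated as reusable in this sketch.
EFI_LOADER_CODE = 1
EFI_LOADER_DATA = 2
EFI_CONVENTIONAL_MEMORY = 7
USABLE_TYPES = {EFI_LOADER_CODE, EFI_LOADER_DATA, EFI_CONVENTIONAL_MEMORY}


def is_usable(mem_type, attribute):
    """A region is usable only if both its type and its attribute qualify."""
    return mem_type in USABLE_TYPES and bool(attribute & EFI_MEMORY_WB)
```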

Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit b31666c8912bf18d9eff963b06d856e7e818ff34)

5 years agoxen/arm: initialize vpl011 flag register
Jeff Kubascik [Mon, 25 Nov 2019 20:58:00 +0000 (15:58 -0500)]
xen/arm: initialize vpl011 flag register

The tx/rx fifo flags were not set when the vpl011 is initialized. This
is a problem for certain guests that are operating in polled mode, as a
guest will generally check the rx fifo empty flag to determine if there
is data before doing a read. The result is a continuous spam of the
message "vpl011: Unexpected IN ring buffer empty" before the first valid
character is received. This initializes the flag status register to the
default specified in the PL011 technical reference manual.

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit b4637ed6cd5375f04ac51d6b900a9ccad6c6c03a)

5 years agoxen/arm: Handle unimplemented VGICv3 registers as RAZ/WI
Jeff Kubascik [Tue, 4 Feb 2020 19:51:50 +0000 (14:51 -0500)]
xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI

Per the ARM Generic Interrupt Controller Architecture Specification (ARM
IHI 0069E), reserved registers should generally be treated as RAZ/WI.
To simplify the VGICv3 design and improve guest compatibility, treat the
default case for GICD and GICR registers as read_as_zero/write_ignore.
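
The RAZ/WI default can be pictured with a toy register bank in Python. The class RegisterBank and its methods are hypothetical and say nothing about how the VGICv3 code is organised; the sketch only shows the read-as-zero, write-ignored behaviour for offsets without a handler.

```python
class RegisterBank:
    """Toy MMIO bank: unimplemented offsets read as zero, ignore writes."""

    def __init__(self):
        self.regs = {}  # offset -> value, implemented registers only

    def implement(self, offset, value=0):
        self.regs[offset] = value

    def read(self, offset):
        # Default case: read-as-zero instead of reporting a fault.
        return self.regs.get(offset, 0)

    def write(self, offset, value):
        # Default case: the write is silently ignored.
        if offset in self.regs:
            self.regs[offset] = value
```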

Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com>
Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 69da7d5440c609c57c5bba9a73b91c62ba2852e6)

5 years agocredit2: fix credit reset happening too few times
Dario Faggioli [Thu, 9 Apr 2020 07:36:51 +0000 (09:36 +0200)]
credit2: fix credit reset happening too few times

There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
credit on reset condition"). In fact, the aim of that commit was to
make sure that we do not perform too many credit reset operations
(which are not super cheap, and in a hot path). But the check used
to determine whether a reset is necessary was the wrong one.

In fact, knowing just that some vCPUs have been skipped, while
traversing the runqueue (in runq_candidate()), is not enough. We
need to check explicitly whether the first vCPU in the runqueue
has a negative amount of credit.

Since a trace record is changed, this patch updates the xentrace format
file and xenalyze as well.

This should be backported.

Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: dae7b62e976b28af9c8efa150618c25501bf1650
master date: 2020-04-03 10:46:53 +0200

5 years agocredit2: avoid vCPUs to ever reach lower credits than idle
Dario Faggioli [Thu, 9 Apr 2020 07:36:08 +0000 (09:36 +0200)]
credit2: avoid vCPUs to ever reach lower credits than idle

There have been reports of stalls of guest vCPUs when Credit2 was used.
It seemed like these vCPUs were not getting scheduled for a very long
time, even under light load conditions (e.g., during dom0 boot).

Investigations led to the discovery that, although rarely, it can
happen that a vCPU manages to run for very long timeslices. In Credit2,
this means that, when runtime accounting happens, the vCPU will lose a
large quantity of credits. This in turn may lead to the vCPU having fewer
credits than the idle vCPUs (-2^30). At this point, the scheduler will
pick the idle vCPU, instead of the ready to run vCPU, for a few
"epochs", which is often enough for the guest kernel to think the
vCPU is not responding and crash.

An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = -1073741824:

    (XEN) Runqueue 0:
    (XEN)   ncpus              = 28
    (XEN)   cpus               = 0-27
    (XEN)   max_weight         = 256
    (XEN)   pick_bias          = 22
    (XEN)   instload           = 1
    (XEN)   aveload            = 293391 (~111%)
    (XEN)   idlers: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    (XEN)   tickled: 00,00000000,00000000,00000000,00000000,00000000,00000000
    (XEN)   fully idle cores: 00,00000000,00000000,00000000,00000000,00000000,0fffffff
    [...]
    (XEN) Runqueue 0:
    (XEN) CPU[00] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[01] runq=0, sibling=00,..., core=00,...
    [...]
    (XEN) CPU[26] runq=0, sibling=00,..., core=00,...
    (XEN) CPU[27] runq=0, sibling=00,..., core=00,...
    (XEN) RUNQ:
    (XEN)     0: [0.1] flags=0 cpu=5 credit=-1254238270 [w=256] load=262144 (~100%)

We certainly don't want this to happen under any circumstances.
Let's, therefore, define a minimum amount of credits a vCPU can have.
During accounting, we make sure that, for however long the vCPU has
run, it will never get to have less than such minimum amount of
credits. Then, we set the credits of the idle vCPU to an even
smaller value.
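
The idea can be sketched in a few lines of Python. The constants below are illustrative only and are not the values chosen by the patch; burn_credits() is a hypothetical helper.

```python
MIN_CREDIT = -(1 << 30)       # illustrative floor for runnable vCPUs
IDLE_CREDIT = MIN_CREDIT - 1  # the idle vCPU sits strictly below the floor


def burn_credits(credit, ran_for):
    """Charge `ran_for` credits but never drop below MIN_CREDIT."""
    return max(credit - ran_for, MIN_CREDIT)
```

However long a vCPU has run, its credits stay above IDLE_CREDIT, so a ready vCPU always compares higher than the idle vCPU.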

NOTE: investigations have been done about _how_ it is possible for a
vCPU to execute for so much time that its credits become so low. While
still not completely clear, there is evidence that:
- it only happens very rarely,
- it appears to be both machine and workload specific,
- it does not look to be a Credit2 issue (e.g., as it happens when
  running with Credit1 as well), or a scheduler issue.

This patch makes Credit2 more robust to events like this, whatever
the cause is, and should hence be backported (as far as possible).

Reported-by: Glen <glenbarney@gmail.com>
Reported-by: Tomas Mozes <hydrapolic@gmail.com>
Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 36f3662f27dec32d76c0edb4c6b62b9628d6869d
master date: 2020-04-03 10:45:43 +0200

5 years agox86/ucode/amd: Fix more potential buffer overruns with microcode parsing
Andrew Cooper [Thu, 9 Apr 2020 07:35:24 +0000 (09:35 +0200)]
x86/ucode/amd: Fix more potential buffer overruns with microcode parsing

cpu_request_microcode() doesn't know the buffer is at least 4 bytes long
before inspecting UCODE_MAGIC.

install_equiv_cpu_table() doesn't know the boundary of the buffer it is
interpreting as an equivalency table.  This case was clearly observed at one
point in the past, given the subsequent overrun detection, but without
comprehending that the damage was already done.

Make the logic consistent with container_fast_forward() and pass size_left in
to install_equiv_cpu_table().

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 718d1432000079ea7120f6cb770372afe707ce27
master date: 2020-04-01 14:00:12 +0100

5 years agox86/HVM: fix AMD ECS handling for Fam10
Jan Beulich [Thu, 9 Apr 2020 07:34:19 +0000 (09:34 +0200)]
x86/HVM: fix AMD ECS handling for Fam10

The involved comparison was, very likely inadvertently, converted from
>= to > when making changes unrelated to the actual family range.

Fixes: 9841eb71ea87 ("x86/cpuid: Drop a guests cached x86 family and model information")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 5d515b1c296ebad6889748ea1e49e063453216a3
master date: 2020-04-01 12:28:30 +0200

5 years agox86/ucode/amd: Fix potential buffer overrun with equiv table handling
Andrew Cooper [Thu, 9 Apr 2020 07:33:20 +0000 (09:33 +0200)]
x86/ucode/amd: Fix potential buffer overrun with equiv table handling

find_equiv_cpu_id() loops until it finds a 0 installed_cpu entry.  Well formed
AMD microcode containers have this property.

Extend the checking in install_equiv_cpu_table() to reject tables which don't
have a sentinel at the end.
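
Why the sentinel matters can be shown with a minimal Python model. A plain list of installed_cpu values stands in for the table here, and both helpers are hypothetical names, not the functions in the patch.

```python
def table_is_terminated(installed_cpus):
    """Accept only tables whose final entry is the 0 sentinel."""
    return bool(installed_cpus) and installed_cpus[-1] == 0


def find_entry(installed_cpus, cpu_sig):
    """Scan until the 0 sentinel; only safe on validated tables."""
    if not table_is_terminated(installed_cpus):
        raise ValueError("equivalence table lacks a terminating sentinel")
    i = 0
    while installed_cpus[i] != 0:
        if installed_cpus[i] == cpu_sig:
            return i
        i += 1
    return None
```

Without the up-front check, the while loop would run past the end of an unterminated table, which is the overrun the validation is meant to rule out.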

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1f97b6b9f1b5978659c5735954c37c130e7bb151
master date: 2020-03-27 13:13:26 +0000

5 years agolibx86/CPUID: fix (not just) leaf 7 processing
Jan Beulich [Thu, 9 Apr 2020 07:32:36 +0000 (09:32 +0200)]
libx86/CPUID: fix (not just) leaf 7 processing

x86_cpuid_policy_fill_native() should, as it did originally, iterate
over all subleaves here as well as over all main leaves. Switch to
using a "<= MIN()"-based approach similar to that used in
x86_cpuid_copy_to_buffer(). Also follow this for the extended main
leaves then.
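
The "<= MIN()" style of bound can be illustrated in Python. The callback read_leaf and the parameters max_leaf and capacity are assumptions of this sketch, not the library's interface.

```python
def fill_leaves(read_leaf, max_leaf, capacity):
    """Read leaves 0..min(max_leaf, capacity - 1), inclusive."""
    last = min(max_leaf, capacity - 1)
    return [read_leaf(i) for i in range(last + 1)]
```

The inclusive bound visits every leaf the hardware reports while never indexing beyond the storage available for them.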

Fixes: 1bd2b750537b ("libx86: Fix 32bit stubdom build of x86_cpuid_policy_fill_native()")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb0bad81fceb3e81df5f73441771b49b732edf56
master date: 2020-03-27 11:40:59 +0100

5 years agox86/ucode: Fix error paths in apply_microcode()
Andrew Cooper [Thu, 9 Apr 2020 07:31:45 +0000 (09:31 +0200)]
x86/ucode: Fix error paths in apply_microcode()

In the unlikely case that patch application completes, but the resulting
revision isn't expected, sig->rev doesn't get updated to match reality.

It will get adjusted the next time collect_cpu_info() gets called, but in the
meantime Xen might operate on a stale value.  Nothing good will come of this.

Rewrite the logic to always update the stashed revision, before worrying about
whether the attempt was a success or failure.
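
The reordering can be sketched in Python as follows. The dictionary sig and the callback load_and_read_rev are hypothetical stand-ins for the per-CPU signature and the hardware access.

```python
def apply_patch(sig, patch_rev, load_and_read_rev):
    """Apply a patch, always recording the revision the hardware reports."""
    rev = load_and_read_rev()
    sig["rev"] = rev  # update the stashed revision first
    return rev == patch_rev  # only then decide success or failure
```

Even when the return value is False, sig["rev"] already reflects what the hardware reported, so later code does not act on a stale value.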

Take the opportunity to make the printk() messages as consistent as possible.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: d2a0a96cf76603b2e2b87c3ce80c3f9d098327d4
master date: 2020-03-26 18:57:45 +0000

5 years agox86/shim: fix ballooning up the guest
Igor Druzhinin [Thu, 9 Apr 2020 07:30:58 +0000 (09:30 +0200)]
x86/shim: fix ballooning up the guest

args.preempted is meaningless here as it doesn't signal whether the
hypercall was preempted before. Use start_extent instead which is
correct (as long as the hypercall was invoked in a "normal" way).

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 76dbabb59eeaa78e9f57407e5b15a6606488333e
master date: 2020-03-18 12:55:54 +0100

5 years agox86/vPMU: don't blindly assume IA32_PERF_CAPABILITIES MSR exists
Jan Beulich [Thu, 9 Apr 2020 07:30:14 +0000 (09:30 +0200)]
x86/vPMU: don't blindly assume IA32_PERF_CAPABILITIES MSR exists

Just like in VMX's lbr_tsx_fixup_check(), the respective CPUID bit should
be consulted first.

Reported-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 15c39c7c913f26fba40231e103ce1ffa6101e7c9
master date: 2020-02-26 17:35:48 +0100

5 years agoAMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers
Jan Beulich [Thu, 9 Apr 2020 07:29:00 +0000 (09:29 +0200)]
AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers

amd_iommu_get_paging_mode() expects a count, not a "maximum possible"
value. Prior to b4f042236ae0 dropping the reference, the use of our
misnamed "max_page" in amd_iommu_domain_init() may have led to such a
misunderstanding. In an attempt to avoid such confusion in the future,
rename the function's parameter and - while at it - convert it to an
inline function.
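
The count versus maximum distinction can be made concrete with a small Python model. It assumes 512 entries per page table level, and paging_levels() is a hypothetical helper that mirrors the idea rather than the driver code.

```python
ENTRIES_PER_TABLE = 512  # assumption: 9 address bits per level


def paging_levels(frame_count):
    """Number of page table levels needed to map `frame_count` frames."""
    level = 1
    while frame_count > ENTRIES_PER_TABLE:
        # Round up to whole tables, then count tables at the next level.
        frame_count = (frame_count + ENTRIES_PER_TABLE - 1) // ENTRIES_PER_TABLE
        level += 1
    return level
```

If the highest frame number is 512, there are 513 frames: passing the maximum (512) yields one level, while passing the count (513) correctly yields two.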

Also replace a literal 4 with an expression tying it to a more widely used
constant, just like amd_iommu_quarantine_init() does.

Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain")
Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables")
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b75b3c62fe4afe381c6f74a07f614c0b39fe2f5d
master date: 2020-03-16 11:24:29 +0100

5 years agox86/msr: Virtualise MSR_PLATFORM_ID properly
Andrew Cooper [Thu, 5 Mar 2020 10:24:09 +0000 (11:24 +0100)]
x86/msr: Virtualise MSR_PLATFORM_ID properly

This is an Intel-only, read-only MSR related to microcode loading.  Expose it
in similar circumstances as the PATCHLEVEL MSR.

This should have been alongside c/s 013896cb8b2 "x86/msr: Fix handling of
MSR_AMD_PATCHLEVEL/MSR_IA32_UCODE_REV"

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 691265f96097d4fe3e46ff4267451d49b30143e6
master date: 2020-02-20 17:29:50 +0000

5 years agoVT-d: check all of an RMRR for being E820-reserved
Jan Beulich [Thu, 5 Mar 2020 10:23:33 +0000 (11:23 +0100)]
VT-d: check all of an RMRR for being E820-reserved

Checking just the first and last page is not sufficient (and redundant
for single-page regions). As we don't need to care about IA64 anymore,
use an x86-specific function to get this done without looping over each
individual page.
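
A whole-range check that walks map entries instead of pages might look like the following Python sketch. The list of half-open (base, limit) reserved ranges and the helper range_is_reserved() are assumptions for illustration, not the x86 function the patch uses.

```python
def range_is_reserved(reserved, start, end):
    """True if [start, end) is fully covered by the reserved ranges.

    `reserved` is a list of half-open (base, limit) address ranges.
    """
    pos = start  # lowest address not yet known to be covered
    for base, limit in sorted(reserved):
        if pos >= end:
            break
        if limit <= pos:
            continue  # entry lies entirely below the uncovered part
        if base > pos:
            return False  # gap before the next reserved entry
        pos = limit
    return pos >= end
```

With reserved ranges 0x1000-0x2000 and 0x3000-0x4000, a region spanning 0x1000-0x4000 has reserved first and last pages but is rejected because of the hole in the middle.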

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: d6573bc6e6b7d95bb9de8471a6bfd7048ebc50f3
master date: 2020-02-18 16:21:19 +0100

5 years agox86/time: report correct frequency of Xen PV clocksource
Igor Druzhinin [Thu, 5 Mar 2020 10:22:57 +0000 (11:22 +0100)]
x86/time: report correct frequency of Xen PV clocksource

The value of the counter represents the number of nanoseconds
since host boot. That means the correct frequency is always 1GHz.

This inconsistency caused time to go slower in PV shim on most
platforms.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: c52bd545de461127f3ca67c48e8fef7145402035
master date: 2020-02-14 18:01:52 +0000

5 years agox86/shim: suspend and resume platform time correctly
Igor Druzhinin [Thu, 5 Mar 2020 10:22:20 +0000 (11:22 +0100)]
x86/shim: suspend and resume platform time correctly

Similarly to S3, platform time needs to be saved on guest suspend
and restored on resume respectively. This should account for expected
jumps in PV clock counter value after resume. time_suspend/resume()
are safe to use in PVH setting as is since any existing operations
with PIT/HPET that they do would simply be ignored if PIT/HPET is
not present.

Additionally, add resume callback for Xen PV clocksource to avoid
its breakage on migration.

Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a7a3ecd82e289a9a2ecc1d3b5128580e0b577cc7
master date: 2020-02-14 18:01:52 +0000

5 years agox86/smp: reset x2apic_enabled in smp_send_stop()
David Woodhouse [Thu, 5 Mar 2020 10:21:47 +0000 (11:21 +0100)]
x86/smp: reset x2apic_enabled in smp_send_stop()

Just before smp_send_stop() re-enables interrupts when shutting down
for reboot or kexec, it calls __stop_this_cpu() which in turn calls
disable_local_APIC(), which puts the APIC back into the mode Xen found
it in at boot.

If that means turning x2APIC off and going back into xAPIC mode, then
a timer interrupt occurring just after interrupts come back on will
lead to a #GP when apic_timer_interrupt() attempts to ack the IRQ
through the EOI register in x2APIC MSR 0x80b:

  (XEN) Executing kexec image on cpu0
  (XEN) ----[ Xen-4.14-unstable  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e008:[<ffff82d08026c139>] apic_timer_interrupt+0x29/0x40
  (XEN) RFLAGS: 0000000000010046   CONTEXT: hypervisor
  (XEN) rax: 0000000000000000   rbx: 00000000000000fa   rcx: 000000000000080b
  ...
  (XEN) Xen code around <ffff82d08026c139> (apic_timer_interrupt+0x29/0x40):
  (XEN)  c0 b9 0b 08 00 00 89 c2 <0f> 30 31 ff e9 0e c9 fb ff 0f 1f 40 00 66 2e 0f
  ...
  (XEN) Xen call trace:
  (XEN)    [<ffff82d08026c139>] R apic_timer_interrupt+0x29/0x40
  (XEN)    [<ffff82d080283825>] S do_IRQ+0x95/0x750
  ...
  (XEN)    [<ffff82d0802a0ad2>] S smp_send_stop+0x42/0xd0

We can't clear the global x2apic_enabled variable in disable_local_APIC()
itself because that runs on each CPU. Instead, correct it (by using
current_local_apic_mode()) in smp_send_stop() while interrupts are still
disabled immediately after calling __stop_this_cpu() for the boot CPU,
after all other CPUs have been stopped.

cf: d639bdd9bbe ("x86/apic: Disable the LAPIC later in smp_send_stop()")
    ... which didn't quite fix it completely.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8b1002ab037aeacdece7723c07ab35ca16c1e22e
master date: 2020-02-14 18:01:52 +0000