Paul Durrant [Fri, 7 Aug 2020 15:33:22 +0000 (17:33 +0200)]
x86/hvm: set 'ipat' in EPT for special pages
All non-MMIO ranges (i.e. those not mapping real device MMIO regions) that
map valid MFNs are normally marked MTRR_TYPE_WRBACK and 'ipat' is set. Hence
when PV drivers running in a guest populate the BAR space of the Xen Platform
PCI Device with pages such as the Shared Info page or Grant Table pages,
accesses to these pages will be cachable.
However, should IOMMU mappings be enabled for the guest then these
accesses become uncachable. This has a substantial negative effect on I/O
throughput of PV devices. Arguably PV drivers should not be using BAR space to
host the Shared Info and Grant Table pages but it is currently commonplace for
them to do this and so this problem needs mitigation. Hence this patch makes
sure the 'ipat' bit is set for any special page regardless of where in GFN
space it is mapped.
NOTE: Clearly this mitigation only applies to Intel EPT. It is not obvious
that there is any similar mitigation possible for AMD NPT. Downstreams
such as Citrix XenServer have been carrying a patch similar to this for
several releases though.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ca24b2ffdbd9a25b2d313a547ccbe97baf3e5a8d
master date: 2020-07-31 17:42:47 +0200
Jan Beulich [Fri, 7 Aug 2020 15:32:42 +0000 (17:32 +0200)]
x86emul: replace UB shifts
Displacement values can be negative, hence we shouldn't left-shift them.
Or else we get
(XEN) UBSAN: Undefined behaviour in x86_emulate/x86_emulate.c:3482:55
(XEN) left shift of negative value -2
While auditing shifts, I noticed a pair of missing parentheses, which
also get added right here.
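As a minimal sketch of the kind of replacement described above (this is not the emulator's code; scale_disp is a hypothetical helper), a possibly negative displacement can be scaled by multiplication, which is well defined as long as the result fits, instead of by a left shift:

```c
#include <stdint.h>

/*
 * Hypothetical helper, for illustration only: scale a signed displacement
 * by 2^shift. "disp << shift" is undefined behaviour in C when disp is
 * negative, so multiply by a power of two instead. The shift below is
 * applied to the non-negative constant 1, which is well defined for the
 * small shift counts assumed here (shift < 31).
 */
static int64_t scale_disp(int32_t disp, unsigned int shift)
{
    return (int64_t)disp * ((int64_t)1 << shift);
}
```

With this shape, scale_disp(-2, 2) yields -8 without tripping UBSAN, whereas the shift form would be flagged.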
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86emul: replace further UB shifts
I have no explanation how I managed to overlook these while putting
together what is now b6a907f8c83d ("x86emul: replace UB shifts").
Jan Beulich [Fri, 7 Aug 2020 15:31:16 +0000 (17:31 +0200)]
x86/S3: put data segment registers into known state upon resume
wakeup_32 sets %ds and %es to BOOT_DS, while leaving %fs at what
wakeup_start did set it to, and %gs at whatever BIOS did load into it.
All of this may end up confusing the first load_segments() to run on
the BSP after resume, in particular allowing a non-nul selector value
to be left in %fs.
Alongside %ss, also put all other data segment registers into the same
state that the boot and CPU bringup paths put them in.
Reported-by: M. Vefa Bicakci <m.v.b@runbox.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 55f8c389d4348cc517946fdcb10794112458e81e
master date: 2020-07-24 10:17:26 +0200
Andrew Cooper [Fri, 7 Aug 2020 15:30:35 +0000 (17:30 +0200)]
x86/spec-ctrl: Protect against CALL/JMP straight-line speculation
Some x86 CPUs speculatively execute beyond indirect CALL/JMP instructions.
With CONFIG_INDIRECT_THUNK / Retpolines, indirect CALL/JMP instructions are
converted to direct CALL/JMP's to __x86_indirect_thunk_REG(), leaving just a
handful of indirect JMPs implementing those stubs.
There is no architectural execution beyond an indirect JMP, so use INT3 as
recommended by vendors to halt speculative execution. This is shorter than
LFENCE (which would also work fine), but also shows up in logs if we do
unexpectedly execute them.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3b7dab93f2401b08c673244c9ae0f92e08bd03ba
master date: 2020-07-01 17:01:24 +0100
Roger Pau Monné [Fri, 7 Aug 2020 15:29:41 +0000 (17:29 +0200)]
mm: fix public declaration of struct xen_mem_acquire_resource
XENMEM_acquire_resource and its related structure is currently inside
a __XEN__ or __XEN_TOOLS__ guarded section to limit its scope to the
hypervisor or the toolstack only. This is wrong as the hypercall is
already being used by the Linux kernel at least, and as such needs to
be public.
Also switch the usage of uint64_aligned_t to plain uint64_t, as
uint64_aligned_t is only to be used by the toolstack. Doing such
change will reduce the size of the structure on 32bit x86 by 4bytes,
since there will be no padding added after the frame_list handle.
This is fine, as users of the previous layout will allocate 4bytes of
padding that won't be read by Xen, and users of the new layout won't
allocate those, which is also fine since Xen won't try to access them.
Note that the structure already has compat handling, and such handling
will take care of copying the right size (ie: minus the padding) when
called from a 32bit x86 context. This is true for the compat code both
before and after this patch, since the structures in the memory.h
compat header are subject to a pragma pack(4), which already removed
the trailing padding that would otherwise be introduced by the
alignment of the frame field to 8 bytes.
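The effect of the packing on trailing padding can be sketched with a simplified layout. The two structures below are hypothetical stand-ins (not the real xen_mem_acquire_resource): a 64-bit field followed by a 32-bit one, once with natural alignment and once under pragma pack(4).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with natural alignment. */
struct demo_natural {
    uint64_t frame;   /* 8-byte aligned on common 64-bit ABIs */
    uint32_t handle;  /* stands in for a 32-bit guest handle */
    /* the compiler may add trailing padding here to keep arrays aligned */
};

/* Same fields, but with alignment capped at 4 bytes. */
#pragma pack(push, 4)
struct demo_packed4 {
    uint64_t frame;
    uint32_t handle;
    /* no trailing padding: the structure only needs 4-byte alignment */
};
#pragma pack(pop)

static size_t demo_natural_size(void)
{
    return sizeof(struct demo_natural);
}

static size_t demo_packed4_size(void)
{
    return sizeof(struct demo_packed4);
}
```

On an ABI where uint64_t is 8-byte aligned the first structure is typically 16 bytes, while the packed one is 12, which is the 4-byte difference the message refers to.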
Fixes: 3f8f12281dd20 ('x86/mm: add HYPERVISOR_memory_op to acquire guest resources') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0e2e54966af556f4047c1048855c4a071028a32d
master date: 2020-06-29 18:03:49 +0200
Andrew Cooper [Fri, 7 Aug 2020 15:28:48 +0000 (17:28 +0200)]
x86/msr: Disallow access to Processor Trace MSRs
We do not expose the feature to guests, so should disallow access to the
respective MSRs. For simplicity, drop the entire block of MSRs, not just the
subset which have been specified thus far.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bcdfbb70fca579baa04f212c0936b77919bdae11
master date: 2020-06-26 16:34:02 +0100
Grzegorz Uriasz [Fri, 7 Aug 2020 15:28:06 +0000 (17:28 +0200)]
x86/acpi: use FADT flags to determine the PMTMR width
On some computers the bit width of the PM Timer as reported
by ACPI is 32 bits when in fact the FADT flags report correctly
that the timer is 24 bits wide. On affected machines such as the
ASUS FX504GM and newer gaming laptops this results in the inability
to resume the machine from suspend. Without this patch suspend is
broken on affected machines and even if a machine manages to resume
correctly then the kernel time and xen timers are trashed.
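Why the width matters can be sketched as follows. This is an illustration only, not the patch itself: the DEMO_ names and helpers are hypothetical, and the flag position (bit 8, TMR_VAL_EXT in the ACPI FADT fixed feature flags) is stated here as an assumption.

```c
#include <stdint.h>

/* Assumption: bit 8 of the FADT flags (TMR_VAL_EXT) is set for a 32-bit timer. */
#define DEMO_FADT_TMR_VAL_EXT (UINT32_C(1) << 8)

/* Derive the counter width from the FADT flags rather than a reported width. */
static unsigned int demo_pmtmr_width(uint32_t fadt_flags)
{
    return (fadt_flags & DEMO_FADT_TMR_VAL_EXT) ? 32 : 24;
}

/* Mask covering the valid bits of a counter of the given width (24 or 32). */
static uint32_t demo_pmtmr_mask(unsigned int width)
{
    return (width == 32) ? UINT32_MAX : ((UINT32_C(1) << width) - 1);
}

/* Ticks elapsed between two raw readings, tolerating one counter wrap. */
static uint32_t demo_pmtmr_delta(uint32_t then, uint32_t now,
                                 unsigned int width)
{
    return (now - then) & demo_pmtmr_mask(width);
}
```

If a 24-bit counter wraps from 0xFFFFF0 to 0x10, the 24-bit mask gives the correct delta of 0x20, while treating the counter as 32 bits wide gives a huge bogus interval, which is consistent with the trashed timers described above.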
Signed-off-by: Grzegorz Uriasz <gorbak25@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f325d2477eef8229c47d97031d314629521c70ab
master date: 2020-06-25 09:11:09 +0200
Tamas K Lengyel [Fri, 7 Aug 2020 15:27:11 +0000 (17:27 +0200)]
x86/vmx: use P2M_ALLOC in vmx_load_pdptrs instead of P2M_UNSHARE
While forking VMs running a small RTOS system (Zephyr) a Xen crash has been
observed due to a mm-lock order violation while copying the HVM CPU context
from the parent. This issue has been identified to be due to
hap_update_paging_modes first getting a lock on the gfn using get_gfn. This
call also creates a shared entry in the fork's memory map for the cr3 gfn. The
function later calls hap_update_cr3 while holding the paging_lock, which
results in the lock-order violation in vmx_load_pdptrs when it tries to unshare
the above entry when it grabs the page with the P2M_UNSHARE flag set.
Since vmx_load_pdptrs only reads from the page its usage of P2M_UNSHARE was
unnecessary to start with. Using P2M_ALLOC is the appropriate flag to ensure
the p2m is properly populated.
Note that the lock order violation is avoided because before the paging_lock is
taken a lookup is performed with P2M_ALLOC that forks the page, thus the second
lookup in vmx_load_pdptrs succeeds without having to perform the fork. We keep
P2M_ALLOC in vmx_load_pdptrs because there are code-paths leading up to it
which don't take the paging_lock and that have no previous lookup. Currently no
other code-path exists leading there with the paging_lock taken, thus no
further adjustments are necessary.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: b67e859d0823f5b450e29379af9142d44a3ea370
master date: 2020-06-19 15:24:55 +0200
xen: Check the alignment of the offset passed via VCPUOP_register_vcpu_info
Currently a guest is able to register any guest physical address to use
for the vcpu_info structure as long as the structure can fit in the
rest of the frame.
This means a guest can provide an address that is not aligned to the
natural alignment of the structure.
On Arm 32-bit, unaligned accesses are completely forbidden by the
hypervisor. This will result in a data abort, which is fatal.
On Arm 64-bit, unaligned accesses are only forbidden when used for atomic
access. As the structure contains fields (such as evtchn_pending_self)
that are updated using atomic operations, any unaligned access will be
fatal as well.
While the misalignment is only fatal on Arm, a generic check is added
as an x86 guest shouldn't sensibly pass an unaligned address (this
would result in a split lock).
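A generic check of this kind might look like the sketch below. The structure, page size constant, and function name are hypothetical stand-ins chosen for illustration; only the two conditions (fits in the frame, naturally aligned) come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical stand-in for the per-vCPU info structure. */
struct demo_vcpu_info {
    uint8_t  upcall_pending;
    uint8_t  upcall_mask;
    uint64_t pending_sel;   /* updated atomically, so must be aligned */
    uint8_t  pad[48];
};

/* Validate a guest-supplied offset into a frame. */
static bool demo_offset_ok(size_t offset)
{
    /* Existing requirement: the structure must fit in the rest of the frame. */
    if ( offset > DEMO_PAGE_SIZE - sizeof(struct demo_vcpu_info) )
        return false;

    /* Added requirement: respect the structure's natural alignment. */
    if ( offset % _Alignof(struct demo_vcpu_info) )
        return false;

    return true;
}
```

In this sketch an offset of 4 passes the size check but is rejected by the alignment check, since the structure contains a 64-bit field.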
x86/ept: flush cache when modifying PTEs and sharing page tables
Modifications made to the page tables by EPT code need to be written
to memory when the page tables are shared with the IOMMU, as Intel
IOMMUs can be non-coherent and thus require changes to be written to
memory in order to be visible to the IOMMU.
In order to achieve this make sure data is written back to memory
after writing an EPT entry when the recalc bit is not set in
atomic_write_ept_entry. If such bit is set, the entry will be
adjusted and atomic_write_ept_entry will be called a second time
without the recalc bit set. Note that when splitting a super page the
new tables resulting of the split should also be written back.
Failure to do so can allow devices behind the IOMMU access to the
stale super page, or cause coherency issues as changes made by the
processor to the page tables are not visible to the IOMMU.
This allows removing the VT-d specific iommu_pte_flush helper, since
the cache write back is now performed by atomic_write_ept_entry, and
hence iommu_iotlb_flush can be used to flush the IOMMU TLB. The newly
used method (iommu_iotlb_flush) can result in fewer flushes, since it
might sometimes be called rightly with 0 flags, in which case it
becomes a no-op.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c23274fd0412381bd75068ebc9f8f8c90a4be748
master date: 2020-07-07 14:40:11 +0200
Some VT-d IOMMUs are non-coherent, which requires a cache write back
in order for the changes made by the CPU to be visible to the IOMMU.
This cache write back was unconditionally done using clflush, but there are
other more efficient instructions to do so, hence implement support
for them using the alternative framework.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a64ea16522a73a13a0d66cfa4b66a9d3b95dd9d6
master date: 2020-07-07 14:39:54 +0200
vtd: don't assume addresses are aligned in sync_cache
Current code in sync_cache assumes that the address passed in is
aligned to a cache line size. Fix the code to support passing in
arbitrary addresses not necessarily aligned to a cache line size.
This is part of XSA-321.
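The alignment handling can be sketched as below. This is not the VT-d code: demo_lines_to_flush is a hypothetical helper that only counts the cache lines a flush loop would have to visit, assuming a power-of-two line size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Count the cache lines touched by [addr, addr + size). line_size is
 * assumed to be a power of two. A real implementation would issue a
 * cache write-back instruction for each line instead of counting.
 */
static size_t demo_lines_to_flush(uintptr_t addr, size_t size,
                                  size_t line_size)
{
    uintptr_t p, end;
    size_t lines = 0;

    if ( size == 0 )
        return 0;

    end = addr + size;

    /* Round the start down so a partially covered first line is included. */
    for ( p = addr & ~(uintptr_t)(line_size - 1); p < end; p += line_size )
        lines++;

    return lines;
}
```

For an 8-byte range starting at offset 60 with 64-byte lines, the loop visits two lines. Code that assumed an aligned start and derived the count from the size alone would visit only one.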
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b6d9398144f21718d25daaf8d72669a75592abc5
master date: 2020-07-07 14:39:05 +0200
The hook is only implemented for VT-d and it uses the already existing
iommu_sync_cache function present in VT-d code. The new hook is
added so that the cache can be flushed by code outside of VT-d when
using shared page tables.
Note that alloc_pgtable_maddr must use the now locally defined
sync_cache function, because IOMMU ops are not yet setup the first
time the function gets called during IOMMU initialization.
No functional change intended.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 91526b460e5009fc56edbd6809e66c327281faba
master date: 2020-07-07 14:38:34 +0200
Rename __iommu_flush_cache to iommu_sync_cache and remove
iommu_flush_cache_page. Also remove the iommu_flush_cache_entry
wrapper and just use iommu_sync_cache instead. Note the _entry suffix
was meaningless as the wrapper was already taking a size parameter in
bytes. While there also constify the addr parameter.
No functional change intended.
This is part of XSA-321.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 62298825b9a44f45761acbd758138b5ba059ebd1
master date: 2020-07-07 14:38:13 +0200
Jan Beulich [Tue, 7 Jul 2020 13:10:34 +0000 (15:10 +0200)]
vtd: improve IOMMU TLB flush
Do not limit PSI flushes to order 0 pages, in order to avoid doing a
full TLB flush if the passed in page has an order greater than 0 and
is aligned. Should increase the performance of IOMMU TLB flushes when
dealing with page orders greater than 0.
This is part of XSA-321.
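The condition for using a page-selective invalidation (PSI) on a higher-order range can be sketched as follows. The function and its max_order parameter (standing in for whatever limit the hardware imposes) are hypothetical and not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: can a single page-selective invalidation cover
 * the 2^order pages starting at frame number dfn? The range must be
 * naturally aligned, and the order must not exceed the assumed hardware
 * limit max_order (max_order < 64).
 */
static bool demo_can_use_psi(uint64_t dfn, unsigned int order,
                             unsigned int max_order)
{
    if ( order > max_order )
        return false;

    /* Aligned means the low 'order' bits of the frame number are clear. */
    return (dfn & ((UINT64_C(1) << order) - 1)) == 0;
}
```

When the predicate fails, a caller would fall back to a wider flush, which is the case the change tries to make less frequent.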
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5fe515a0fede07543f2a3b049167b1fd8b873caf
master date: 2020-07-07 14:37:46 +0200
x86/ept: atomically modify entries in ept_next_level
ept_next_level was passing a live PTE pointer to ept_set_middle_entry,
which was then modified without taking into account that the PTE could
be part of a live EPT table. This wasn't a security issue because the
pages returned by p2m_alloc_ptp are zeroed, so adding such an entry
before actually initializing it didn't allow a guest to access
physical memory addresses it wasn't supposed to access.
This is part of XSA-328.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bc3d9f95d661372b059a5539ae6cb1e79435bb95
master date: 2020-07-07 14:37:12 +0200
Jan Beulich [Tue, 7 Jul 2020 13:09:50 +0000 (15:09 +0200)]
x86/EPT: ept_set_middle_entry() related adjustments
ept_split_super_page() wants to further modify the newly allocated
table, so have ept_set_middle_entry() return the mapped pointer rather
than tearing it down only to have it re-established right away.
Similarly ept_next_level() wants to hand back a mapped pointer of
the next level page, so re-use the one established by
ept_set_middle_entry() in case that path was taken.
Pull the setting of suppress_ve ahead of insertion into the higher level
table, and don't have ept_split_super_page() set the field a 2nd time.
This is part of XSA-328.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 1104288186ee73a7f9bfa41cbaa5bb7611521028
master date: 2020-07-07 14:36:52 +0200
Jan Beulich [Tue, 7 Jul 2020 13:09:25 +0000 (15:09 +0200)]
x86/shadow: correct an inverted conditional in dirty VRAM tracking
This originally was "mfn_x(mfn) == INVALID_MFN". Make it like this
again, taking the opportunity to also drop the unnecessary nearby
braces.
This is XSA-319.
Fixes: 246a5a3377c2 ("xen: Use a typesafe to define INVALID_MFN") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 23a216f99d40fbfbc2318ade89d8213eea6ba1f8
master date: 2020-07-07 14:36:24 +0200
xen/common: event_channel: Don't ignore error in get_free_port()
Currently, get_free_port() is assuming that the port has been allocated
when evtchn_allocate_port() does not return -EBUSY.
However, the function may return an error when:
- We exhausted all the event channels. This can happen if the limit
configured by the administrator for the guest ('max_event_channels'
in xl cfg) is higher than the ABI used by the guest. For instance,
if the guest is using 2L, the limit should not be higher than 4095.
- We cannot allocate memory (e.g. Xen has no more memory).
Users of get_free_port() (such as EVTCHNOP_alloc_unbound) will validly
assume the port was valid and will next call evtchn_from_port(). This
will result in a crash as the memory backing the event channel structure
is not present.
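The corrected error handling can be sketched with a toy allocator. Everything here is a hypothetical stand-in (a fixed table of 8 ports, -ENOSPC for exhaustion); the point is only that -EBUSY means "try the next port" while any other error must be propagated.

```c
#include <errno.h>

#define DEMO_MAX_PORTS 8u

static unsigned char demo_in_use[DEMO_MAX_PORTS];

/* Hypothetical allocator: 0 on success, -EBUSY if taken, -ENOSPC if exhausted. */
static int demo_allocate_port(unsigned int port)
{
    if ( port >= DEMO_MAX_PORTS )
        return -ENOSPC;
    if ( demo_in_use[port] )
        return -EBUSY;

    demo_in_use[port] = 1;
    return 0;
}

/* Return the first free port, or a negative error code. */
static int demo_get_free_port(void)
{
    unsigned int port;

    for ( port = 0; ; port++ )
    {
        int rc = demo_allocate_port(port);

        if ( rc == 0 )
            return (int)port;

        /* Only -EBUSY means "keep looking"; anything else is a real error. */
        if ( rc != -EBUSY )
            return rc;
    }
}
```

A loop that treated every result other than -EBUSY as success would hand back a port number whose backing structure was never allocated.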
Fixes: 368ae9a05fe ("xen/pvshim: forward evtchn ops between L0 Xen and L2 DomU") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2e9c2bc292231823a3a021d2e0a9f1956bf00b3c
master date: 2020-07-07 14:35:36 +0200
Jason Andryuk [Wed, 24 Jun 2020 15:17:38 +0000 (17:17 +0200)]
libacpi: widen TPM detection
The hardcoded tpm_signature is too restrictive to detect many TPMs. For
instance, it doesn't accept a QEMU emulated TPM (VID 0x1014 DID 0x0001).
Make the TPM detection match that in rombios which accepts a wider
range.
With this change, the TPM's TCPA ACPI table is generated and the guest
OS can automatically load the tpm_tis driver. It also allows seabios to
detect and use the TPM. However, seabios skips some TPM initialization
when running under Xen, so it will not populate any PCRs unless modified
to run the initialization under Xen.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d3db7e043cddd7e939195e014241ce2c5d436179
master date: 2020-06-16 10:31:08 +0200
Paul Durrant [Wed, 24 Jun 2020 15:17:05 +0000 (17:17 +0200)]
ioreq: handle pending emulation racing with ioreq server destruction
When an emulation request is initiated in hvm_send_ioreq() the guest vcpu is
blocked on an event channel until that request is completed. If, however,
the emulator is killed whilst that emulation is pending then the ioreq
server may be destroyed. Thus when the vcpu is awoken the code in
handle_hvm_io_completion() will find no pending request to wait for, but will
leave the internal vcpu io_req.state set to IOREQ_READY and the vcpu shutdown
deferall flag in place (because hvm_io_assist() will never be called). The
emulation request is then completed anyway. This means that any subsequent call
to hvmemul_do_io() will find an unexpected value in io_req.state and will
return X86EMUL_UNHANDLEABLE, which in some cases will result in continuous
re-tries.
This patch fixes the issue by moving the setting of io_req.state and clearing
of shutdown deferral (as well as MSI-X write completion) out of hvm_io_assist()
and directly into handle_hvm_io_completion().
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f7039ee41b3d3448775a1623f230037fd0455104
master date: 2020-06-09 12:56:24 +0200
Jan Beulich [Wed, 24 Jun 2020 15:16:30 +0000 (17:16 +0200)]
x86/Intel: insert Ice Lake and Comet Lake model numbers
Both match prior generation processors as far as LBR and C-state MSRs
go (SDM rev 072) as well as applicability of the if_pschange_mc erratum
(recent spec updates).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1fe406685cb19e9544681c6243e7d376deb0297e
master date: 2020-06-09 12:55:53 +0200
Jan Beulich [Wed, 24 Jun 2020 15:15:53 +0000 (17:15 +0200)]
build: fix dependency tracking for preprocessed files
While the issue is more general, I noticed asm-macros.i not getting
re-generated as needed. This was due to its .*.d file mentioning
asm-macros.o instead of asm-macros.i. Use -MQ here as well, and while at
it also use -MQ to avoid the somewhat fragile sed-ary on the *.lds
dependency tracking files. While there, further avoid open-coding $(CPP)
and drop the bogus (Arm) / stale (x86) -Ui386.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
master commit: 75131ad75bb3c91717b5dfda6881e61c52bfd22e
master date: 2020-06-08 10:25:40 +0200
Igor Druzhinin [Wed, 24 Jun 2020 15:15:23 +0000 (17:15 +0200)]
x86/svm: do not try to handle recalc NPT faults immediately
A recalculation NPT fault doesn't always require additional handling
in hvm_hap_nested_page_fault(), moreover in general case if there is no
explicit handling done there - the fault is wrongly considered fatal.
This covers a specific case of migration with vGPU assigned which
uses direct MMIO mappings made by XEN_DOMCTL_memory_mapping hypercall:
at a moment log-dirty is enabled globally, recalculation is requested
for the whole guest memory including those mapped MMIO regions
which causes a page fault being raised at the first access to them;
but due to MMIO P2M type not having any explicit handling in
hvm_hap_nested_page_fault() a domain is erroneously crashed with unhandled
SVM violation.
Instead of trying to be opportunistic - use safer approach and handle
P2M recalculation in a separate NPT fault by attempting to retry after
making the necessary adjustments. This is aligned with Intel behavior
where there are separate VMEXITs for recalculation and EPT violations
(faults) and only faults are handled in hvm_hap_nested_page_fault().
Do it by also unifying do_recalc return code with Intel implementation
where returning 1 means P2M was actually changed.
Since there was no case previously where p2m_pt_handle_deferred_changes()
could return a positive value - it's safe to replace ">= 0" with just "== 0"
in VMEXIT_NPF handler. finish_type_change() is also not affected by the
change as being able to deal with >0 return value of p2m->recalc from
EPT implementation.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 51ca66c37371b10b378513af126646de22eddb17
master date: 2020-06-05 17:12:11 +0200
Roger Pau Monné [Wed, 24 Jun 2020 15:14:40 +0000 (17:14 +0200)]
build32: don't discard .shstrtab in linker script
LLVM linker doesn't support discarding .shstrtab, and complains with:
ld -melf_i386_fbsd -N -T build32.lds -o reloc.lnk reloc.o
ld: error: discarding .shstrtab section is not allowed
Add explicit .shstrtab, .strtab and .symtab sections to the linker
script after the text section in order to make LLVM LD happy and match
the behavior of GNU LD.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 10d27b48b5b4dfbead2d9bf03290984bba4806e4
master date: 2020-06-02 13:37:53 +0200
Roger Pau Monné [Wed, 24 Jun 2020 15:14:11 +0000 (17:14 +0200)]
x86/mm: do not attempt to convert _PAGE_GNTTAB to a boolean
Clang 10 complains with:
mm.c:1239:10: error: converting the result of '<<' to a boolean always evaluates to true
[-Werror,-Wtautological-constant-compare]
if ( _PAGE_GNTTAB && (l1e_get_flags(l1e) & _PAGE_GNTTAB) &&
^
xen/include/asm/x86_64/page.h:161:25: note: expanded from macro '_PAGE_GNTTAB'
#define _PAGE_GNTTAB (1U<<22)
^
Remove the conversion of _PAGE_GNTTAB to a boolean and instead use a
preprocessor conditional to check if _PAGE_GNTTAB is defined.
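The pattern can be sketched with a hypothetical flag macro (DEMO_PAGE_GNTTAB below is a stand-in, not the real definition): instead of testing the constant at run time, the code that uses it is compiled only when the macro is defined.

```c
/* Hypothetical flag; a configuration without the feature would not define it. */
#define DEMO_PAGE_GNTTAB (1U << 22)

/* Return 1 if the grant-table flag is set in flags, 0 otherwise. */
static int demo_is_gnttab(unsigned int flags)
{
#ifdef DEMO_PAGE_GNTTAB
    /* No "DEMO_PAGE_GNTTAB && ..." test, so nothing is converted to a boolean. */
    return (flags & DEMO_PAGE_GNTTAB) != 0;
#else
    (void)flags;
    return 0;
#endif
}
```

This keeps the check out of configurations without the flag while avoiding the tautological comparison the compiler complains about.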
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 6eb61b1a9dfe23ca443f977799cafb22770708a0
master date: 2020-06-02 13:36:41 +0200
Jan Beulich [Wed, 24 Jun 2020 15:13:29 +0000 (17:13 +0200)]
x86emul: rework CMP and TEST emulation
Unlike similarly encoded insns these don't write their memory operands,
and hence x86_is_mem_write() should return false for them. However,
rather than adding special logic there, rework how their emulation gets
done, by making decoding attributes properly describe the r/o nature of
their memory operands:
- change the table entries for opcodes 0x38 and 0x39, with no other
adjustments to the attributes later on,
- for the other opcodes, leave the table entries as they are, and
override the attributes for the specific sub-cases (identified by
ModRM.reg).
For opcodes 0x38 and 0x39 the change of the table entries implies
changing the order of operands as passed to emulate_2op_SrcV(), hence
the splitting of the cases in the main switch().
Note how this also allows dropping custom LOCK prefix checks.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 20bc1b9cc99b70b17757e1903f629c7a26584790
master date: 2020-05-29 17:28:45 +0200
First of all explain in comments what the functions' purposes are. Then
make them actually match their comments.
Note that fc6fa977be54 ("x86emul: extend x86_insn_is_mem_write()
coverage") didn't actually fix the function's behavior for {,V}STMXCSR:
Both are covered by generic code higher up in the function, due to
x86_decode_twobyte() already doing suitable adjustments. And VSTMXCSR
wouldn't have been covered anyway without a further X86EMUL_OPC_VEX()
case label. Keep the inner case label in a comment for reference.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e28d13eeb65c25c0bd56e8bfa83c7473047d778d
master date: 2020-05-29 17:28:04 +0200
Jan Beulich [Wed, 24 Jun 2020 15:11:44 +0000 (17:11 +0200)]
VT-x: extend LBR Broadwell errata coverage
For lbr_tsx_fixup_check() simply name a few more specific erratum
numbers.
For bdf93_fixup_check(), however, more models are affected. Oddly enough
despite being the same model and stepping, the erratum is listed for
Xeon E3 but not its Core counterpart. Apply the workaround uniformly,
and also for Xeon D, which only has the LBR-from one listed in its spec
update.
Seeing this broader applicability, rename anything BDF93-related to more
generic names.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 724913de8ac8426d313a4645741d86c1169ae406
master date: 2020-05-28 12:03:25 +0200
Andrew Cooper [Wed, 24 Jun 2020 15:11:08 +0000 (17:11 +0200)]
x86/boot: Fix load_system_tables() to be NMI/#MC-safe
During boot, load_system_tables() is used in reinit_bsp_stack() to switch the
virtual addresses used from their .data/.bss alias, to their directmap alias.
The structure assignment is implemented as a memset() to zero first, then a
copy-in of the new data. This causes the NMI/#MC stack pointers to
transiently become 0, at a point where we may have an NMI watchdog running.
Rewrite the logic using a volatile tss pointer (equivalent to, but more
readable than, using ACCESS_ONCE() for all writes).
This does drop the zeroing side effect for holes in the structure, but the
backing memory for the TSS is fully zeroed anyway, and architecturally, they
are all reserved.
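The approach can be sketched as below. The structure is a simplified, hypothetical stand-in for a TSS, and demo_update_tss is not the real function; the sketch only shows field-by-field stores through a volatile pointer.

```c
#include <stdint.h>

/* Simplified, hypothetical stand-in for a TSS. */
struct demo_tss {
    uint64_t rsp0;
    uint64_t ist[7];
};

/*
 * Update the live structure one field at a time. A plain structure
 * assignment may be compiled as "zero everything, then copy in", leaving
 * a window in which the stack pointers are 0. Stores through a volatile
 * pointer are emitted individually, so each field goes directly from its
 * old value to its new one.
 */
static void demo_update_tss(struct demo_tss *live, uint64_t stack_top,
                            uint64_t nmi_stack, uint64_t mce_stack)
{
    volatile struct demo_tss *tss = live;

    tss->rsp0   = stack_top;
    tss->ist[0] = nmi_stack;
    tss->ist[1] = mce_stack;
    /* Fields not written here keep their previous contents. */
}
```

As the message notes, the price is that untouched fields are no longer zeroed as a side effect, which is acceptable when the backing memory starts out zeroed.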
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 9f3e9139fa6c3d620eb08dff927518fc88200b8d
master date: 2020-05-27 16:44:04 +0100
There have been reports of RDRAND issues after resuming from suspend on
some AMD family 15h and family 16h systems. This issue stems from a BIOS
not performing the proper steps during resume to ensure RDRAND continues
to function properly.
Update the CPU initialization to clear the RDRAND CPUID bit for any family
15h and 16h processor that supports RDRAND. If it is known that the family
15h or family 16h system does not have an RDRAND resume issue or that the
system will not be placed in suspend, the "cpuid=rdrand" kernel parameter
can be used to stop the clearing of the RDRAND CPUID bit.
Note, that clearing the RDRAND CPUID bit does not prevent a processor
that normally supports the RDRAND instruction from executing it. So any
code that determined the support based on family and model won't #UD.
Warn if no explicit choice was given on affected hardware.
Check RDRAND functions at boot as well as after S3 resume (the retry
limit chosen is entirely arbitrary).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93401e28a84b9dc5945f5d0bf5bce68e9d5ee121
master date: 2020-05-27 09:49:37 +0200
Roger Pau Monné [Wed, 24 Jun 2020 15:02:23 +0000 (17:02 +0200)]
x86/idle: prevent entering C6 with in service interrupts on Intel
Apply a workaround for Intel errata BDX99, CLX30, SKX100, CFW125,
BDF104, BDH85, BDM135, KWB131: "A Pending Fixed Interrupt May Be
Dispatched Before an Interrupt of The Same Priority Completes".
Apply the errata to all server and client models (big cores) from
Broadwell to Cascade Lake. The workaround is grouped together with the
existing fix for errata AAJ72, and the eoi from the function name is
removed.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fc44a7014cafe28b8c53eeaf6ac2a71f5bc8b815
master date: 2020-05-22 16:07:38 +0200
Roger Pau Monné [Wed, 24 Jun 2020 15:01:48 +0000 (17:01 +0200)]
x86/idle: rework C6 EOI workaround
Change the C6 EOI workaround (errata AAJ72) to use x86_match_cpu. Also
call the workaround from mwait_idle, previously it was only used by
the ACPI idle driver. Finally make sure the routine is called for all
states equal or greater than ACPI_STATE_C3, note that the ACPI driver
doesn't currently handle them, but the errata condition shouldn't be
limited by that.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5fef1fd713660406a6187ef352fbf79986abfe43
master date: 2020-05-20 12:48:37 +0200
Jan Beulich [Wed, 24 Jun 2020 15:01:10 +0000 (17:01 +0200)]
x86: determine MXCSR mask in all cases
For its use(s) by the emulator to be correct in all cases, the filling
of the variable needs to be independent of XSAVE availability. As
there's no suitable function in i387.c to put the logic in, keep it in
xstate_init(), arrange for the function to be called unconditionally,
and pull the logic ahead of all return paths there.
Fixes: 9a4496a35b20 ("x86emul: support {,V}{LD,ST}MXCSR") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 2b532519d64e653a6bbfd9eefed6040a09c8876d
master date: 2020-05-18 17:18:56 +0200
Andrew Cooper [Wed, 24 Jun 2020 14:59:49 +0000 (16:59 +0200)]
x86/build: Unilaterally disable -fcf-protection
Xen doesn't support CET-IBT yet. At a minimum, logic is required to enable it
for supervisor use, but the livepatch functionality needs to learn not to
overwrite ENDBR64 instructions.
Furthermore, Ubuntu enables -fcf-protection by default, along with a buggy
version of GCC-9 which objects to it in combination with
-mindirect-branch=thunk-extern (Fixed in GCC 10, 9.4).
Various objects (Xen boot path, Rombios 32 stubs) require .text to be at the
beginning of the object. These paths explode when .note.gnu.properties gets
put ahead of .text and we end up executing the notes data.
Disable -fcf-protection for all embedded objects.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 24 Jun 2020 14:59:21 +0000 (16:59 +0200)]
x86/build: move -fno-asynchronous-unwind-tables into EMBEDDED_EXTRA_CFLAGS
Users of EMBEDDED_EXTRA_CFLAGS already use -fno-asynchronous-unwind-tables, or
ought to. This shrinks the size of the rombios 32bit stubs in guest memory.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 24 Jun 2020 14:58:51 +0000 (16:58 +0200)]
x86/build32: Discard all orphaned sections
Linkers may put orphaned sections ahead of .text, which breaks the calling
requirements. A concrete example is Ubuntu's GCC-9 default of enabling
-fcf-protection which causes us to try and execute .note.gnu.properties during
Xen's boot.
Put .got.plt in its own section as it specifically needs preserving from the
linker's point of view, and discard everything else. This will hopefully be
more robust to other unexpected toolchain properties.
Fixes boot from an Ubuntu build of Xen.
Reported-by: Jason Andryuk <jandryuk@gmail.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 24 Jun 2020 14:58:22 +0000 (16:58 +0200)]
x86/guest: Fix assembler warnings with newer binutils
GAS of at least version 2.34 complains:
hypercall_page.S: Assembler messages:
hypercall_page.S:24: Warning: symbol 'HYPERCALL_set_trap_table' already has its type set
...
hypercall_page.S:71: Warning: symbol 'HYPERCALL_arch_7' already has its type set
which is because the whole page is declared as STT_OBJECT already. Rearrange
.set with respect to .type in DECLARE_HYPERCALL() so STT_FUNC is already in
place.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Wed, 24 Jun 2020 14:57:39 +0000 (16:57 +0200)]
x86/cpuidle: correct Cannon Lake residency MSRs
As per SDM rev 071 Cannon Lake has
- no CC3 residency MSR at 3FC,
- a CC1 residency MSR at 660 (like various Atoms),
- a useless (always zero) CC3 residency MSR at 662.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9ff09aefc46385dc04c38b6dd1f1ac25f784f482
master date: 2020-04-03 17:15:58 +0200
Andrew Cooper [Fri, 12 Jun 2020 17:32:27 +0000 (18:32 +0100)]
tools/libxl: Fix memory leak in libxl_cpuid_set()
xc_cpuid_set() returns allocated memory via cpuid_res, which libxl needs to
free() seeing as it discards the results.
This is logically a backport of c/s b91825f628 "tools/libxc: Drop
config_transformed parameter from xc_cpuid_set()" but rewritten as one caller
of xc_cpuid_set() does use returned values.
Andrew Cooper [Wed, 10 Jun 2020 17:57:00 +0000 (18:57 +0100)]
x86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden
RDRAND/RDSEED can be hidden using cpuid= to mitigate SRBDS if microcode
isn't available.
This is part of XSA-320 / CVE-2020-0543.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 7028534d8482d25860c4d1aa8e45f0b911abfc5a)
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel
See patch documentation and comments.
This is part of XSA-320 / CVE-2020-0543
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 6a49b9a7920c82015381740905582b666160d955)
Andrew Cooper [Wed, 8 Jan 2020 19:47:46 +0000 (19:47 +0000)]
x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling
This is part of XSA-320 / CVE-2020-0543
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit caab85ab58c0cdf74ab070a5de5c4df89f509ff3)
Ashok Raj [Thu, 7 May 2020 12:58:16 +0000 (14:58 +0200)]
x86/ucode/intel: Writeback and invalidate caches before updating microcode
Updating microcode is less error prone when caches have been flushed and
depending on what exactly the microcode is updating. For example, some of the
issues around certain Broadwell parts can be addressed by doing a full cache
flush.
Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[Linux commit 91df9fdf51492aec9fed6b4cbd33160886740f47, ported to Xen] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 77c82949990edaf21130be842a289a7fb7a439e1
master date: 2020-05-05 20:18:19 +0100
Roger Pau Monné [Thu, 7 May 2020 12:56:56 +0000 (14:56 +0200)]
x86/hvm: simplify hvm_physdev_op allowance control
PVHv1 dom0 was given access to all PHYSDEVOP hypercalls, and such
restriction was not removed when PVHv1 code was removed. As a result
the switch in hvm_physdev_op was more complicated than required, and
relied on PVHv2 dom0 not having PIRQ support in order to prevent
access to some PV specific PHYSDEVOPs.
Fix this by moving the default case to the bottom of the switch, since
there's no need for any fall through now. Also remove the hardware
domain check, as all the not explicitly listed PHYSDEVOPs are
forbidden for HVM domains.
Finally tighten the condition to allow usage of
PHYSDEVOP_pci_mmcfg_reserved: apart from having vPCI enabled it should
only be used by the hardware domain. Note that the code in
do_physdev_op is already restricting the call to privileged domains
only, but it can be further restricted to the hardware domain only, as
other privileged domains don't have access to MMCFG regions anyway.
Overall no functional change should arise from this change.
Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a00e3737e085ebc1f313e36b188d4958e939e531
master date: 2020-05-05 09:52:28 +0200
Jan Beulich [Thu, 7 May 2020 12:56:03 +0000 (14:56 +0200)]
x86emul: extend x86_insn_is_mem_write() coverage
Several insns were missed when this function was first added. As far as
insns already supported by the emulator go - SMSW and {,V}STMXCSR were
wrongly considered r/o insns so far.
Insns like the VMX, SVM, or CET-SS ones, PTWRITE, or AMD's new SNP ones
are intentionally not covered just yet. VMPTRST is put there just to
complete the respective group.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fc6fa977be54a24a1325e3f2d08b1b1dcb675f44
master date: 2020-05-05 09:50:54 +0200
Jan Beulich [Thu, 7 May 2020 12:54:39 +0000 (14:54 +0200)]
x86/pass-through: avoid double IRQ unbind during domain cleanup
XEN_DOMCTL_destroydomain creates a continuation if domain_kill -ERESTARTs.
In that scenario, it is possible to receive multiple _pirq_guest_unbind
calls for the same pirq from domain_kill, if the pirq has not yet been
removed from the domain's pirq_tree, as:
domain_kill()
-> domain_relinquish_resources()
-> pci_release_devices()
-> pci_clean_dpci_irq()
-> pirq_guest_unbind()
-> __pirq_guest_unbind()
Avoid recurring invocations of pirq_guest_unbind() by removing the pIRQ
from the tree being iterated after the first call there. In case such a
removed entry still has a softirq outstanding, record it and re-check
upon re-invocation.
Note that pirq_cleanup_check() gets relaxed beyond what's strictly
needed here, to avoid introducing an asymmetry there between HVM and PV
guests.
Reported-by: Varad Gautam <vrd@amazon.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Varad Gautam <vrd@amazon.de> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5b58dad089880127674d460494d1a9d68109b3d7
master date: 2020-04-30 10:40:59 +0200
Juergen Gross [Thu, 7 May 2020 12:53:13 +0000 (14:53 +0200)]
xen/grants: fix hypercall continuation for GNTTABOP_cache_flush
The GNTTABOP_cache_flush hypercall has a wrong test for hypercall
continuation, the test today is:
if ( rc > 0 || opaque_out != 0 )
Unfortunately this will be true even in case of an error (rc < 0),
possibly leading to very long lasting hypercalls (times of more
than an hour have been observed in a test case).
Correct the test condition to result in false with rc < 0 and set
opaque_out only if no error occurred, to be on the safe side.
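The difference between the two tests can be sketched as below. This is only an illustration: continue_old() copies the condition quoted above, while continue_fixed() is one hypothetical condition that matches the description (false whenever rc < 0) and is not necessarily the exact expression used in the patch.

```c
#include <stdbool.h>

/* Condition quoted in the commit message: also true when rc < 0. */
static bool continue_old(int rc, unsigned int opaque_out)
{
    return rc > 0 || opaque_out != 0;
}

/* Hypothetical corrected condition: never continue after an error. */
static bool continue_fixed(int rc, unsigned int opaque_out)
{
    return rc > 0 || (rc == 0 && opaque_out != 0);
}
```

With rc = -22 and a non-zero opaque_out, the old test still requests a continuation, while the sketched replacement does not.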
Andrew Cooper [Tue, 4 Feb 2020 20:29:38 +0000 (20:29 +0000)]
libxc/restore: Fix REC_TYPE_X86_PV_VCPU_XSAVE data auditing (take 2)
It turns out that a bug (since forever) in Xen causes XSAVE records to have
non-architectural behaviour on xsave-capable hardware, when a PV guest has not
touched the state.
In such a case, the data record returned from Xen is 2*uint64_t, both claiming
the (illegitimate) state of %xcr0 and %xcr0_accum being 0.
Adjust the bound in handle_x86_pv_vcpu_blob() to cope with this.
Fixes: 2a62c22715b "libxc/restore: Fix data auditing in handle_x86_pv_vcpu_blob()" Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0729830cc425a8ff27a3137e87b93768ae3c853c)
(cherry picked from commit d2aecd86c4481291b260869c47cf0a9a02321564)
Andrew Cooper [Wed, 18 Dec 2019 20:17:42 +0000 (20:17 +0000)]
libxc/restore: Fix data auditing in handle_x86_pv_info()
handle_x86_pv_info() has a subtle bug. It uses an 'else if' chain with a
clause in the middle which doesn't exit unconditionally. In practice, this
means that when restoring a 32bit PV guest, later sanity checks are skipped.
Rework the logic a little to be simpler. There are exactly two valid
combinations of fields in X86_PV_INFO, so factor this out and check them all
in one go, before making adjustments to the current domain.
Once adjustments have been completed successfully, sanity check the result
against the X86_PV_INFO settings in one go, rather than piece-wise.
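Checking the two valid combinations in one go might look like the sketch below. The struct is a hypothetical stand-in for the record, and the concrete values (8 bytes with 4 levels, 4 bytes with 3 levels) are an assumption about what the two combinations are; the message above does not spell them out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two fields the check needs. */
struct sketch_pv_info {
    uint8_t guest_width; /* bytes per guest word */
    uint8_t pt_levels;   /* number of pagetable levels */
};

/* Accept only the two assumed valid combinations, tested together. */
static bool sketch_pv_info_valid(const struct sketch_pv_info *info)
{
    return (info->guest_width == 8 && info->pt_levels == 4) ||
           (info->guest_width == 4 && info->pt_levels == 3);
}
```

A single predicate like this cannot skip later checks the way an 'else if' chain with a non-exiting middle clause can.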
tools/xenstore: fix a use after free problem in xenstored
Commit 562a1c0f7ef3fb ("tools/xenstore: dont unlink connection object
twice") introduced a potential use after free problem in
domain_cleanup(): after calling talloc_unlink() for domain->conn
domain->conn is set to NULL. The problem is that domain is registered
as talloc child of domain->conn, so it might be freed by the
talloc_unlink() call.
With Xenstore being single threaded there are normally no concurrent
memory allocations running and freeing a virtual memory area normally
doesn't result in that area no longer being accessible. A problem
could occur only in case either a signal received results in some
memory allocation done in the signal handler (SIGHUP is a primary
candidate leading to reopening the log file), or in case the talloc
framework would do some internal memory allocation during freeing of
the memory (which would lead to clobbering of the freed domain
structure).
Anthony PERARD [Thu, 23 Jan 2020 16:56:46 +0000 (16:56 +0000)]
libxl: Fix comment about dcs.sdss
The field 'sdss' was previously named 'dmss'; commit 3148bebbf0ab did the
rename but didn't update the comment.
Fixes: 3148bebbf0ab ("libxl: rename a field in libxl__domain_create_state") Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 035c4d771600f300382a1637f2da33023f76b4c1)
(cherry picked from commit 5351a0a170fc7f6290d7d3d8be302d53d2426a87)
Julien Grall [Sat, 11 Jan 2020 00:03:44 +0000 (00:03 +0000)]
docs/misc: pvcalls: Verbatim block should be indented with 4 spaces
At the moment, the diagram is only indented with 2 spaces, so pandoc
misinterprets it and does not display it correctly.
Fix it by indenting the whole block by 4 spaces (i.e. an extra 2 spaces).
Fixes: d661611d08 ("docs/markdown: Switch to using pandoc, and fix underscore escaping") Signed-off-by: Julien Grall <julien@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 9c8705f8fe5bfb75a6a00163308d297059b61f6a)
(cherry picked from commit 8b60270731eabe7a7dfd41bd625338505829a617)
Juergen Gross [Tue, 28 Jan 2020 06:21:07 +0000 (06:21 +0000)]
docs: document CONTROL command of xenstore protocol
The CONTROL command (former DEBUG command) isn't specified in the
xenstore protocol doc. Add it.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Backport: 4.9+
(cherry picked from commit f910c3ebc6a178c5cbbc0868134be536fae7f7cf)
Jan Beulich [Mon, 27 Apr 2020 13:57:13 +0000 (15:57 +0200)]
x86: validate VM assist value in arch_set_info_guest()
While I can't spot anything that would go wrong, just like the
respective hypercall only permits applicable bits to be set, we should
also do so when loading guest context.
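A minimal sketch of such a validation follows. The function name and the way the mask is passed in are hypothetical; the point is only that bits outside the permitted set are rejected before the value is stored.

```c
#include <errno.h>

/*
 * Hypothetical helper: store the requested VM assist bits only if every
 * set bit is within valid_mask, otherwise leave *dst untouched.
 */
static int sketch_load_vm_assist(unsigned long *dst, unsigned long requested,
                                 unsigned long valid_mask)
{
    if ( requested & ~valid_mask )
        return -EINVAL;

    *dst = requested;
    return 0;
}
```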
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a62c6fe05c4ae905b7d4cb0ca946508b7f96d522
master date: 2020-04-22 13:01:10 +0200
Jan Beulich [Mon, 27 Apr 2020 13:55:51 +0000 (15:55 +0200)]
x86/HVM: expose VM assist hypercall
In preparation for the addition of VMASST_TYPE_runstate_update_flag
commit 72c538cca957 ("arm: add support for vm_assist hypercall") enabled
the hypercall for Arm. I consider it not logical that it then isn't also
exposed to x86 HVM guests (with the same single feature permitted to be
enabled as Arm has); Linux actually tries to use it afaict.
Rather than introducing yet another thin wrapper around vm_assist(),
make that function the main handler, requiring a per-arch
arch_vm_assist_valid_mask() definition instead.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: f13404d57f55a97838f1c16a366fbc3231ec21f1
master date: 2020-04-22 12:58:25 +0200
Andrew Cooper [Mon, 27 Apr 2020 13:54:14 +0000 (15:54 +0200)]
x86: Enumeration for Control-flow Enforcement Technology
The CET spec has been published and guest kernels are starting to get support.
Introduce the CPUID and MSRs, and fully block the MSRs from guest use.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org>
master commit: 4803a67114279a656a54a23cebed646da32efeb6
master date: 2020-04-21 16:52:03 +0100
The EPT page tables can be shared with the IOMMU as long as the page
sizes supported by EPT are also supported by the IOMMU.
Current code checks that both the IOMMU and EPT support the same page
sizes, but this is not strictly required: the IOMMU supporting more
page sizes than EPT is fine and shouldn't block page table sharing.
This is likely not a common case (IOMMU supporting more page sizes
than EPT), but should still be fixed for correctness.
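The relaxed condition amounts to a subset test on the supported page sizes. A sketch with hypothetical bit names (these are not the flags Xen uses):

```c
#include <stdbool.h>

/* Hypothetical page size capability bits. */
#define SKETCH_PG_4K (1u << 0)
#define SKETCH_PG_2M (1u << 1)
#define SKETCH_PG_1G (1u << 2)

/* Strict check: both sides must support exactly the same sizes. */
static bool can_share_strict(unsigned int ept, unsigned int iommu)
{
    return ept == iommu;
}

/* Relaxed check: every size EPT may use must also be usable by the IOMMU. */
static bool can_share_relaxed(unsigned int ept, unsigned int iommu)
{
    return (ept & ~iommu) == 0;
}
```

An IOMMU with an extra page size passes the relaxed test, while an IOMMU lacking a size that EPT uses still fails it.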
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 3957e12c02670b97855ef0933b373f99993fa598
master date: 2020-04-21 10:54:56 +0200
hvmloader: enable MMIO and I/O decode, after all resource allocation
It was observed that PCI MMIO and/or IO BARs were programmed with
memory and I/O decodes (bits 0 and 1 of PCI COMMAND register) enabled,
during PCI setup phase. This resulted in incorrect memory mapping as
soon as the lower half of the 64-bit BAR is programmed.
This displaced any RAM mappings under 4G. After the
upper half is programmed PCI memory mapping is restored to its
intended high mem location, but the RAM displaced is not restored.
The OS then continues to boot and function until it tries to access
the displaced RAM at which point it suffers a page fault and crashes.
This patch addresses the issue by deferring enablement of memory and
I/O decode in the command register until all the resources, like interrupts,
I/O and/or MMIO BARs for all the PCI device functions, are programmed,
in descending order of memory requested.
Andrew Cooper [Mon, 27 Apr 2020 13:51:14 +0000 (15:51 +0200)]
x86/boot: Fix early exception handling with CONFIG_PERF_COUNTERS
The PERFC_INCR() macro uses current->processor, but current is not valid
during early boot. This causes the following crash to occur if
e.g. rdmsr_safe() has to recover from a #GP fault.
(XEN) Early fatal page fault at e008:ffff82d0803b1a39 (cr2=0000000000000004, ec=0000)
(XEN) ----[ Xen-4.14-unstable x86_64 debug=y Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e008:[<ffff82d0803b1a39>] x86_64/entry.S#handle_exception_saved+0x64/0xb8
...
(XEN) Xen call trace:
(XEN) [<ffff82d0803b1a39>] R x86_64/entry.S#handle_exception_saved+0x64/0xb8
(XEN) [<ffff82d0806394fe>] F __start_xen+0x2cd/0x2980
(XEN) [<ffff82d0802000ec>] F __high_start+0x4c/0x4e
Furthermore, the PERFC_INCR() macro is wildly inefficient. There has been a
single caller for many releases now, so inline it and delete the macro
completely.
There is no need to reference current at all. What is actually needed is the
per_cpu_offset which can be obtained directly from the top-of-stack block.
This simplifies the counter handling to 3 instructions and no spilling to the
stack at all.
The same breakage from above is now handled properly:
Jan Beulich [Tue, 14 Apr 2020 13:00:18 +0000 (15:00 +0200)]
gnttab: fix GNTTABOP_copy continuation handling
The XSA-226 fix was flawed - the backwards transformation on rc was done
too early, causing a continuation to not get invoked when the need for
preemption was determined at the very first iteration of the request.
This in particular means that all of the status fields of the individual
operations would be left untouched, i.e. set to whatever the caller may
or may not have initialized them to.
Ross Lagerwall [Tue, 14 Apr 2020 12:58:48 +0000 (14:58 +0200)]
xen/gnttab: Fix error path in map_grant_ref()
Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets,
changing the logic. If the _set_status() call fails, the grant_map hypercall
would fail with a status of 1 (rc != GNTST_okay) instead of the expected
negative GNTST_* error.
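The effect of the misplaced brackets can be reproduced in isolation. In this sketch set_status_stub() is a hypothetical stand-in for the failing call; the two GNTST_* values are taken to be 0 and -1 as in the public grant table header.

```c
#define GNTST_okay            0
#define GNTST_general_error (-1)

/* Hypothetical stand-in for a call that fails. */
static int set_status_stub(void)
{
    return GNTST_general_error;
}

static int rc_misplaced(void)
{
    int rc;

    /* rc receives the result of the comparison (1), not the error code. */
    if ( (rc = set_status_stub() != GNTST_okay) )
        return rc;

    return GNTST_okay;
}

static int rc_intended(void)
{
    int rc;

    /* rc receives the error code, which is then compared. */
    if ( (rc = set_status_stub()) != GNTST_okay )
        return rc;

    return GNTST_okay;
}
```

rc_misplaced() returns 1, which is not a valid negative GNTST_* value, whereas rc_intended() propagates the error.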
This error path can be taken due to bad guest state, and causes net/blk-back
in Linux to crash.
This is XSA-316.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: da0c66c8f48042a0186799014af69db0303b1da5
master date: 2020-04-14 14:41:02 +0200
xen/rwlock: Add missing memory barrier in the unlock path of rwlock
The rwlock unlock paths are using atomic_sub() to release the lock.
However the implementation of atomic_sub() rightfully doesn't contain a
memory barrier. On Arm, this means a processor is allowed to re-order
the memory access with the preceding access.
In other words, the unlock may be seen by another processor before all
the memory accesses within the "critical" section.
The rwlock lock paths already contain barriers indirectly, but these are not
very useful without the counterpart in the unlock paths.
The memory barriers are not necessary on x86 because loads/stores are
not re-ordered with lock instructions.
So add arch_lock_release_barrier() in the unlock paths that will only
add memory barrier on Arm.
Take the opportunity to document each lock path, explaining why a
barrier is not necessary.
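The ordering requirement can be illustrated with C11 atomics. This is not Xen code: atomic_thread_fence() and atomic_fetch_sub_explicit() stand in for arch_lock_release_barrier() and atomic_sub(), and the function name is hypothetical.

```c
#include <stdatomic.h>

/*
 * Sketch of an unlock path: the release fence keeps accesses made inside
 * the critical section from being reordered after the relaxed decrement
 * that makes the lock visible as free.
 */
static void sketch_write_unlock(atomic_int *cnts, int writer_bit)
{
    atomic_thread_fence(memory_order_release);
    atomic_fetch_sub_explicit(cnts, writer_bit, memory_order_relaxed);
}
```

Without the fence, a weakly ordered CPU could let another processor observe the decrement before the stores that the lock was protecting.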
Jan Beulich [Tue, 14 Apr 2020 12:56:06 +0000 (14:56 +0200)]
xenoprof: limit consumption of shared buffer data
Since a shared buffer can be written to by the guest, we may only read
the head and tail pointers from there (all other fields should only ever
be written to). Furthermore, for any particular operation the two values
must be read exactly once, with both checks and consumption happening
with the thus read values. (The backtrace related xenoprof_buf_space()
use in xenoprof_log_event() is an exception: The values used there get
re-checked by every subsequent xenoprof_add_sample().)
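The "read exactly once" rule can be sketched as follows. The struct and the range check are hypothetical simplifications; what matters is that both indices are copied into locals a single time and that every check and use operates on those copies.

```c
#include <stdint.h>

/* Hypothetical guest-writable buffer header. */
struct sketch_shared_buf {
    volatile uint32_t head;
    volatile uint32_t tail;
};

/* Returns the number of free slots, or -1 if the snapshot is out of range. */
static int sketch_buf_space(const struct sketch_shared_buf *buf, uint32_t size)
{
    /* Snapshot each index once; the guest may change them at any time. */
    uint32_t head = buf->head;
    uint32_t tail = buf->tail;

    if ( head >= size || tail >= size )
        return -1;

    return (int)((tail > head ? 0 : size) + tail - head - 1);
}
```

Re-reading buf->head after the range check would reopen the window in which the guest can swap in an out-of-range value.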
Since that code needed touching, also fix the double increment of the
lost samples count in case the backtrace related xenoprof_add_sample()
invocation in xenoprof_log_event() fails.
Where code is being touched anyway, add const as appropriate, but take
the opportunity to entirely drop the now unused domain parameter of
xenoprof_buf_space().
This is part of XSA-313.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 50ef9a3cb26e2f9383f6fdfbed361d8f174bae9f
master date: 2020-04-14 14:33:19 +0200
Jan Beulich [Tue, 14 Apr 2020 12:55:05 +0000 (14:55 +0200)]
xenoprof: clear buffer intended to be shared with guests
alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen
internally used allocations, but buffers allocated to be shared with
(unprivileged) guests need to be zapped of their prior content.
This is part of XSA-313.
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 0763a7ebfcdad66cf9e5475a1301eefb29bae9ed
master date: 2020-04-14 14:32:33 +0200
Jeff Kubascik [Tue, 21 Jan 2020 15:07:03 +0000 (10:07 -0500)]
xen/arm: remove physical timer offset
The physical timer traps apply an offset so that time starts at 0 for
the guest. However, this offset is not currently applied to the physical
counter. Per the ARMv8 Reference Manual (ARM DDI 0487E.a), section
D11.2.4 Timers, the "Offset" between the counter and timer should be
zero for a physical timer. This removes the offset to make the timer and
counter consistent.
This also cleans up the physical timer implementation to better match
the virtual timer - both cval's now hold the hardware value.
In the case the guest sets cval to a time before Xen started, the correct
behavior is to expire the timer immediately. To do this, we set the expires
argument of set_timer to zero.
Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com> Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit f14f55b7ee295277c8dd09e37e0fa0902ccf7eb4)
xen/arm: during efi boot, improve the check for usable memory
When booting via EFI, the EFI memory map has information about memory
regions and their type. Improve the check for the type and attribute of
each memory region to figure out whether it is usable memory or not.
This patch brings the check on par with Linux v5.5-rc6 and makes more
memory reusable as normal memory by Xen (except that Linux also reuses
EFI_PERSISTENT_MEMORY, which we do not).
Specifically, this patch also reuses memory marked as
EfiLoaderCode/Data, and it uses both Attribute and Type for the check
(Attribute needs to be EFI_MEMORY_WB).
Reported-by: Roman Shaposhnik <roman@zededa.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit b31666c8912bf18d9eff963b06d856e7e818ff34)
Jeff Kubascik [Mon, 25 Nov 2019 20:58:00 +0000 (15:58 -0500)]
xen/arm: initialize vpl011 flag register
The tx/rx fifo flags were not set when the vpl011 is initialized. This
is a problem for certain guests that are operating in polled mode, as a
guest will generally check the rx fifo empty flag to determine if there
is data before doing a read. The result is a continuous spam of the
message "vpl011: Unexpected IN ring buffer empty" before the first valid
character is received. This initializes the flag status register to the
default specified in the PL011 technical reference manual.
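A sketch of the initialisation, assuming the flag register layout from the PL011 TRM (RXFE in bit 4, TXFE in bit 7, both set after reset). The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_UARTFR_RXFE (1u << 4) /* receive FIFO empty */
#define SKETCH_UARTFR_TXFE (1u << 7) /* transmit FIFO empty */

struct sketch_vpl011 {
    uint32_t uartfr;
};

/* Start with both FIFOs reported as empty. */
static void sketch_vpl011_init(struct sketch_vpl011 *v)
{
    v->uartfr = SKETCH_UARTFR_TXFE | SKETCH_UARTFR_RXFE;
}

/* What a polling guest typically tests before reading the data register. */
static bool sketch_rx_has_data(const struct sketch_vpl011 *v)
{
    return !(v->uartfr & SKETCH_UARTFR_RXFE);
}
```

With a zeroed register the polling guest believes data is pending and reads an empty ring, which is the condition that produces the repeated message.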
Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com> Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit b4637ed6cd5375f04ac51d6b900a9ccad6c6c03a)
Jeff Kubascik [Tue, 4 Feb 2020 19:51:50 +0000 (14:51 -0500)]
xen/arm: Handle unimplemented VGICv3 registers as RAZ/WI
Per the ARM Generic Interrupt Controller Architecture Specification (ARM
IHI 0069E), reserved registers should generally be treated as RAZ/WI.
To simplify the VGICv3 design and improve guest compatibility, treat the
default case for GICD and GICR registers as read_as_zero/write_ignore.
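The shape of a RAZ/WI default case, reduced to one implemented register. The register offset, struct and handler names are hypothetical and much simpler than the real VGICv3 handlers.

```c
#include <stdint.h>

#define SKETCH_REG_CTLR 0x0000u

struct sketch_dist {
    uint32_t ctlr;
};

/* Returns 1 to signal that the access was handled. */
static int sketch_mmio_read(const struct sketch_dist *d, uint32_t offset,
                            uint32_t *val)
{
    switch ( offset )
    {
    case SKETCH_REG_CTLR:
        *val = d->ctlr;
        return 1;

    default:
        *val = 0; /* read-as-zero */
        return 1;
    }
}

static int sketch_mmio_write(struct sketch_dist *d, uint32_t offset,
                             uint32_t val)
{
    switch ( offset )
    {
    case SKETCH_REG_CTLR:
        d->ctlr = val;
        return 1;

    default:
        return 1; /* write-ignored */
    }
}
```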
Signed-off-by: Jeff Kubascik <jeff.kubascik@dornerworks.com> Acked-by: Julien Grall <julien@xen.org>
(cherry picked from commit 69da7d5440c609c57c5bba9a73b91c62ba2852e6)
There is a bug in commit 5e4b4199667b9 ("xen: credit2: only reset
credit on reset condition"). In fact, the aim of that commit was to
make sure that we do not perform too many credit reset operations
(which are not cheap, and are in a hot path). But the check used
to determine whether a reset is necessary was the wrong one.
In fact, knowing just that some vCPUs have been skipped, while
traversing the runqueue (in runq_candidate()), is not enough. We
need to check explicitly whether the first vCPU in the runqueue
has a negative amount of credit.
Since a trace record is changed, this patch updates the xentrace format file
and xenalyze as well.
credit2: avoid vCPUs to ever reach lower credits than idle
There have been reports of stalls of guest vCPUs, when Credit2 was used.
It seemed like these vCPUs were not getting scheduled for very long
time, even under light load conditions (e.g., during dom0 boot).
Investigations led to the discovery that --although rarely-- it can
happen that a vCPU manages to run for very long timeslices. In Credit2,
this means that, when runtime accounting happens, the vCPU will lose a
large quantity of credits. This in turn may lead to the vCPU having fewer
credits than the idle vCPUs (-2^30). At this point, the scheduler will
pick the idle vCPU, instead of the ready to run vCPU, for a few
"epochs", which is often enough for the guest kernel to think the
vCPU is not responding and crash.
An example of this situation is shown here. In fact, we can see d0v1
sitting in the runqueue while all the CPUs are idle, as it has
-1254238270 credits, which is smaller than -2^30 = −1073741824:
We certainly don't want, under any circumstance, this to happen.
Let's, therefore, define a minimum amount of credits a vCPU can have.
During accounting, we make sure that, for however long the vCPU has
run, it will never get to have less than such minimum amount of
credits. Then, we set the credits of the idle vCPU to an even
smaller value.
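The clamping idea can be sketched as below. The constant names and values are hypothetical (the floor is placed at -2^30 and the idle value one below it purely for illustration); the patch's actual constants may differ.

```c
#include <stdint.h>

/* Hypothetical floor for a runnable vCPU, and an idle value below it. */
#define SKETCH_CREDIT_MIN  (-(1 << 30))
#define SKETCH_IDLE_CREDIT (SKETCH_CREDIT_MIN - 1)

/* Charge 'consumed' credits, never letting the result drop below the floor. */
static int32_t sketch_burn_credit(int32_t credit, int64_t consumed)
{
    int64_t left = (int64_t)credit - consumed;

    return left < SKETCH_CREDIT_MIN ? SKETCH_CREDIT_MIN : (int32_t)left;
}
```

However long the vCPU ran, its credit then stays strictly above the idle vCPU's, so a runnable vCPU is always preferred over idle.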
NOTE: investigations have been done about _how_ it is possible for a
vCPU to execute for so much time that its credits become so low. While
this is still not completely clear, there is evidence that:
- it only happens very rarely,
- it appears to be both machine and workload specific,
- it does not look to be a Credit2 issue (e.g., it happens when
running with Credit1 as well), or a scheduler issue.
This patch makes Credit2 more robust to events like this, whatever
the cause is, and should hence be backported (as far as possible).
Andrew Cooper [Thu, 9 Apr 2020 07:35:24 +0000 (09:35 +0200)]
x86/ucode/amd: Fix more potential buffer overruns with microcode parsing
cpu_request_microcode() doesn't know the buffer is at least 4 bytes long
before inspecting UCODE_MAGIC.
install_equiv_cpu_table() doesn't know the boundary of the buffer it is
interpreting as an equivalency table. This case was clearly observed at one
point in the past, given the subsequent overrun detection, but without
comprehending that the damage was already done.
Make the logic consistent with container_fast_forward() and pass size_left in
to install_equiv_cpu_table().
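Both checks follow the same pattern: know how many bytes remain before interpreting any of them. In this sketch the magic value, the 8-byte section header (type, length) and the function names are illustrative assumptions rather than the real container format handling.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_UCODE_MAGIC 0x00414d44u /* illustrative value */

/* Only look at the magic once the buffer is known to hold 4 bytes. */
static bool sketch_has_magic(const void *buf, size_t size)
{
    uint32_t magic;

    if ( size < sizeof(magic) )
        return false;

    memcpy(&magic, buf, sizeof(magic));
    return magic == SKETCH_UCODE_MAGIC;
}

/* A section (8-byte header, then 'len' bytes) must fit in what is left. */
static bool sketch_section_fits(const uint8_t *buf, size_t size_left)
{
    uint32_t len;

    if ( size_left < 8 )
        return false;

    memcpy(&len, buf + 4, sizeof(len));
    return len <= size_left - 8;
}
```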
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 718d1432000079ea7120f6cb770372afe707ce27
master date: 2020-04-01 14:00:12 +0100
Jan Beulich [Thu, 9 Apr 2020 07:34:19 +0000 (09:34 +0200)]
x86/HVM: fix AMD ECS handling for Fam10
The involved comparison was, very likely inadvertently, converted from
>= to > when making changes unrelated to the actual family range.
Fixes: 9841eb71ea87 ("x86/cpuid: Drop a guests cached x86 family and model information") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 5d515b1c296ebad6889748ea1e49e063453216a3
master date: 2020-04-01 12:28:30 +0200
Jan Beulich [Thu, 9 Apr 2020 07:32:36 +0000 (09:32 +0200)]
libx86/CPUID: fix (not just) leaf 7 processing
x86_cpuid_policy_fill_native() should, as it did originally, iterate
over all subleaves here as well as over all main leaves. Switch to
using a "<= MIN()"-based approach similar to that used in
x86_cpuid_copy_to_buffer(). Also follow this for the extended main
leaves then.
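A "<= MIN()"-based loop of the kind described, on a hypothetical policy structure. The assignment inside the loop merely stands in for a real CPUID query.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef MIN
#define MIN(a, b) ((a) < (b) ? (a) : (b))
#endif
#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical policy: a reported maximum leaf and fixed-size storage. */
struct sketch_policy {
    uint32_t max_leaf;
    uint32_t leaf[8];
};

/* Visit every leaf up to max_leaf, but never past the end of the array. */
static unsigned int sketch_fill_leaves(struct sketch_policy *p)
{
    unsigned int i, filled = 0;

    for ( i = 0; i <= MIN(p->max_leaf, SKETCH_ARRAY_SIZE(p->leaf) - 1); ++i )
    {
        p->leaf[i] = i + 1; /* stand-in for querying leaf i */
        ++filled;
    }

    return filled;
}
```

The inclusive bound covers leaf max_leaf itself, and the MIN() keeps an oversized max_leaf from indexing beyond the storage.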
Fixes: 1bd2b750537b ("libx86: Fix 32bit stubdom build of x86_cpuid_policy_fill_native()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb0bad81fceb3e81df5f73441771b49b732edf56
master date: 2020-03-27 11:40:59 +0100
Andrew Cooper [Thu, 9 Apr 2020 07:31:45 +0000 (09:31 +0200)]
x86/ucode: Fix error paths in apply_microcode()
In the unlikely case that patch application completes, but the resulting
revision isn't expected, sig->rev doesn't get updated to match reality.
It will get adjusted the next time collect_cpu_info() gets called, but in the
meantime Xen might operate on a stale value. Nothing good will come of this.
Rewrite the logic to always update the stashed revision, before worrying about
whether the attempt was a success or failure.
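The reordering can be sketched as follows, with a hypothetical signature struct and with the revision read back from hardware passed in as a parameter.

```c
#include <errno.h>
#include <stdint.h>

struct sketch_sig {
    uint32_t rev;
};

/*
 * Record the revision the CPU reports first, then decide whether the
 * update succeeded. The cached value matches reality on both paths.
 */
static int sketch_apply(struct sketch_sig *sig, uint32_t hw_rev_after,
                        uint32_t expected_rev)
{
    sig->rev = hw_rev_after;

    if ( hw_rev_after != expected_rev )
        return -EIO;

    return 0;
}
```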
Take the opportunity to make the printk() messages as consistent as possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: d2a0a96cf76603b2e2b87c3ce80c3f9d098327d4
master date: 2020-03-26 18:57:45 +0000
Igor Druzhinin [Thu, 9 Apr 2020 07:30:58 +0000 (09:30 +0200)]
x86/shim: fix ballooning up the guest
args.preempted is meaningless here as it doesn't signal whether the
hypercall was preempted before. Use start_extent instead which is
correct (as long as the hypercall was invoked in a "normal" way).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 76dbabb59eeaa78e9f57407e5b15a6606488333e
master date: 2020-03-18 12:55:54 +0100
Jan Beulich [Thu, 9 Apr 2020 07:29:00 +0000 (09:29 +0200)]
AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() callers
amd_iommu_get_paging_mode() expects a count, not a "maximum possible"
value. Prior to b4f042236ae0 dropping the reference, the use of our mis-
named "max_page" in amd_iommu_domain_init() may have led to such a
misunderstanding. In an attempt to avoid such confusion in the future,
rename the function's parameter and - while at it - convert it to an
inline function.
Also replace a literal 4 by an expression tying it to a wider use
constant, just like amd_iommu_quarantine_init() does.
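Why a count and a maximum differ here can be seen from a simplified level calculation. The sketch assumes 512 entries per table and uses hypothetical names; it omits the error handling a real implementation needs.

```c
#define SKETCH_PTE_SHIFT 9
#define SKETCH_PTES      (1UL << SKETCH_PTE_SHIFT)

/* Number of pagetable levels needed to map 'frame_count' frames. */
static int sketch_paging_levels(unsigned long frame_count)
{
    int level = 1;

    while ( frame_count > SKETCH_PTES )
    {
        /* Round up to whole tables, then count tables at the next level. */
        frame_count = (frame_count + SKETCH_PTES - 1) >> SKETCH_PTE_SHIFT;
        ++level;
    }

    return level;
}
```

If the highest frame number is 512, there are 513 frames: passing the maximum (512) yields one level, which cannot address frame 512, while passing the count (513) yields two.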
Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain") Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: b75b3c62fe4afe381c6f74a07f614c0b39fe2f5d
master date: 2020-03-16 11:24:29 +0100
Jan Beulich [Thu, 5 Mar 2020 10:23:33 +0000 (11:23 +0100)]
VT-d: check all of an RMRR for being E820-reserved
Checking just the first and last page is not sufficient (and redundant
for single-page regions). As we don't need to care about IA64 anymore,
use an x86-specific function to get this done without looping over each
individual page.
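The weakness of an endpoints-only check shows up when a non-reserved hole sits in the middle of the range. The sketch uses a hypothetical sorted, non-overlapping memory map with [start, end) regions; it is not the x86-specific function the patch switches to.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_region {
    uint64_t start, end; /* [start, end) */
    bool reserved;
};

static bool sketch_addr_reserved(const struct sketch_region *map, size_t n,
                                 uint64_t addr)
{
    size_t i;

    for ( i = 0; i < n; ++i )
        if ( addr >= map[i].start && addr < map[i].end )
            return map[i].reserved;

    return false;
}

/* Endpoints-only check: looks at the first and last byte only. */
static bool sketch_endpoints_reserved(const struct sketch_region *map,
                                      size_t n, uint64_t start, uint64_t end)
{
    return sketch_addr_reserved(map, n, start) &&
           sketch_addr_reserved(map, n, end - 1);
}

/* Whole-range check: walks the sorted map once, without visiting pages. */
static bool sketch_range_reserved(const struct sketch_region *map, size_t n,
                                  uint64_t start, uint64_t end)
{
    size_t i;

    for ( i = 0; i < n && start < end; ++i )
    {
        if ( !map[i].reserved || map[i].end <= start )
            continue;
        if ( map[i].start > start )
            return false;   /* gap before this reserved region */
        start = map[i].end; /* covered up to here */
    }

    return start >= end;
}
```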
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: d6573bc6e6b7d95bb9de8471a6bfd7048ebc50f3
master date: 2020-02-18 16:21:19 +0100
Igor Druzhinin [Thu, 5 Mar 2020 10:22:20 +0000 (11:22 +0100)]
x86/shim: suspend and resume platform time correctly
Similarly to S3, platform time needs to be saved on guest suspend
and restored on resume respectively. This should account for expected
jumps in PV clock counter value after resume. time_suspend/resume()
are safe to use in PVH setting as is since any existing operations
with PIT/HPET that they do would simply be ignored if PIT/HPET is
not present.
Additionally, add resume callback for Xen PV clocksource to avoid
its breakage on migration.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a7a3ecd82e289a9a2ecc1d3b5128580e0b577cc7
master date: 2020-02-14 18:01:52 +0000
David Woodhouse [Thu, 5 Mar 2020 10:21:47 +0000 (11:21 +0100)]
x86/smp: reset x2apic_enabled in smp_send_stop()
Just before smp_send_stop() re-enables interrupts when shutting down
for reboot or kexec, it calls __stop_this_cpu() which in turn calls
disable_local_APIC(), which puts the APIC back into the mode Xen found
it in at boot.
If that means turning x2APIC off and going back into xAPIC mode, then
a timer interrupt occurring just after interrupts come back on will
lead to a #GP when apic_timer_interrupt() attempts to ack the IRQ
through the EOI register in x2APIC MSR 0x80b:
We can't clear the global x2apic_enabled variable in disable_local_APIC()
itself because that runs on each CPU. Instead, correct it (by using
current_local_apic_mode()) in smp_send_stop() while interrupts are still
disabled immediately after calling __stop_this_cpu() for the boot CPU,
after all other CPUs have been stopped.
cf: d639bdd9bbe ("x86/apic: Disable the LAPIC later in smp_send_stop()")
... which didn't quite fix it completely.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8b1002ab037aeacdece7723c07ab35ca16c1e22e
master date: 2020-02-14 18:01:52 +0000