Jan Beulich [Thu, 18 Mar 2021 14:01:04 +0000 (15:01 +0100)]
crypto: adjust rijndaelEncrypt() prototype for gcc11
The upcoming release complains, not entirely unreasonably:
In file included from rijndael.c:33:
.../xen/include/crypto/rijndael.h:55:53: note: previously declared as 'const unsigned char[]'
55 | void rijndaelEncrypt(const unsigned int [], int, const unsigned char [],
| ^~~~~~~~~~~~~~~~~~~~~~
rijndael.c:865:8: error: argument 4 of type 'u8[16]' {aka 'unsigned char[16]'} with mismatched bound [-Werror=array-parameter=]
865 | u8 ct[16])
| ~~~^~~~~~
In file included from rijndael.c:33:
.../xen/include/crypto/rijndael.h:56:13: note: previously declared as 'unsigned char[]'
56 | unsigned char []);
| ^~~~~~~~~~~~~~~~
Simply declare the correct array dimensions right away. This then allows
compilers to apply checking at call sites, which seems desirable anyway.
For the moment I'm leaving untouched the disagreement between u8/u32
used in the function definition and unsigned {char,int} used in the
declaration, as making this consistent would call for touching further
functions.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: c6ad5a701b9a6df443a6c98d9e7201c958bbcafc
master date: 2021-03-04 16:47:51 +0100
Julien Grall [Fri, 5 Mar 2021 14:35:54 +0000 (15:35 +0100)]
xen/sched: Add missing memory barrier in vcpu_block()
The comment in vcpu_block() states that the events should be checked
/after/ blocking to avoids wakeup waiting race. However, from a generic
perspective, set_bit() doesn't prevent re-ordering. So the following
could happen:
CPU0 (blocking vCPU A) | CPU1 ( unblock vCPU A)
|
A <- read local events |
| set local events
| test_and_clear_bit(_VPF_blocked)
| -> Bail out as the bit if not set
|
set_bit(_VFP_blocked) |
|
check A |
The variable A will be 0 and therefore the vCPU will be blocked when it
should continue running.
vcpu_block() is now gaining an smp_mb__after_atomic() to prevent the CPU
to read any information about local events before the flag _VPF_blocked
is set.
Jan Beulich [Fri, 5 Mar 2021 14:35:19 +0000 (15:35 +0100)]
x86/EFI: suppress GNU ld 2.36'es creation of base relocs
All of the sudden ld creates base relocations itself, for PE
executables - as a result we now have two of them for every entity to
be relocated. While we will likely want to use this down the road, it
doesn't work quite right yet in corner cases, so rather than suppressing
our own way of creating the relocations we need to tell ld to avoid
doing so.
Probe whether --disable-reloc-section (which was introduced by the same
commit making relocation generation the default) is recognized by ld's PE
emulation, and use the option if so. (To limit redundancy, move the first
part of setting EFI_LDFLAGS earlier, and use it already while probing.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 76cbb9c3f4dd9ab6aa44eeacab84fb88b2e8bfc1
master date: 2021-02-25 15:11:58 +0100
Jan Beulich [Fri, 5 Mar 2021 14:34:53 +0000 (15:34 +0100)]
gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
Mappings for a domain's own pages should already be present in the
IOMMU. While installing the same mapping again is merely redundant (and
inefficient), removing the mapping when the grant mapping gets removed
is outright wrong in this case: The mapping was there before the map, so
should remain in place after unmapping.
This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" case,
- all x86 PV DomU-s, including driver domains.
See the code comment for why it's the original domain and not the page
owner that gets compared against.
Jan Beulich [Fri, 5 Mar 2021 14:34:07 +0000 (15:34 +0100)]
gnttab: never permit mapping transitive grants
Transitive grants allow an intermediate domain I to grant a target
domain T access to a page which origin domain O did grant I access to.
As an implementation restriction, T is not allowed to map such a grant.
This restriction is currently tried to be enforced by marking active
entries resulting from transitive grants as is-sub-page; sub-page grants
for obvious reasons don't allow mapping. However, marking (and checking)
only active entries is insufficient, as a map attempt may also occur on
a grant not otherwise in use. When not presently in use (pin count zero)
the grant type itself needs checking. Otherwise T may be able to map an
unrelated page owned by I. This is because the "transitive" sub-
structure of the v2 union would end up being interpreted as "full_page"
sub-structure instead. The low 32 bits of the GFN used would match the
grant reference specified in I's transitive grant entry, while the upper
32 bits could be random (depending on how exactly I sets up its grant
table entries).
Note that if one mapping already exists and the granting domain _then_
changes the grant to GTF_transitive (which the domain is not supposed to
do), the changed type will only be honored after the pin count has gone
back to zero. This is no different from e.g. GTF_readonly or
GTF_sub_page becoming set when a grant is already in use.
While adjusting the implementation, also adjust commentary in the public
header to better reflect reality.
Jan Beulich [Fri, 5 Mar 2021 14:32:24 +0000 (15:32 +0100)]
x86emul: fix SYSENTER/SYSCALL switching into 64-bit mode
When invoked by compat mode, mode_64bit() will be false at the start of
emulation. The logic after complete_insn, however, needs to consider the
mode switched into, in particular to avoid truncating RIP.
Inspired by / paralleling and extending Linux commit 943dea8af21b ("KVM:
x86: Update emulator context mode if SYSENTER xfers to 64-bit mode").
While there, tighten a related assertion in x86_emulate_wrapper() - we
want to be sure to not switch into an impossible mode when the code gets
built for 32-bit only (as is possible for the test harness).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citirix.com>
master commit: f3e1eb2f0234c955243a915d69ebd84f26eec130
master date: 2021-02-11 17:53:10 +0100
Andrew Cooper [Fri, 5 Mar 2021 14:31:50 +0000 (15:31 +0100)]
x86/ucode/amd: Fix OoB read in cpu_request_microcode()
verify_patch_size() is a maximum size check, and doesn't have a minimum bound.
If the microcode container encodes a blob with a length less than 64 bytes,
the subsequent calls to microcode_fits()/compare_header() may read off the end
of the buffer.
Fixes: 4de936a38a ("x86/ucode/amd: Rework parsing logic in cpu_request_microcode()") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1cbc4d89c45cba3929f1c0cb4bca0b000c4f174b
master date: 2021-02-10 13:23:51 +0000
Jan Beulich [Fri, 5 Mar 2021 14:31:12 +0000 (15:31 +0100)]
x86/EFI: work around GNU ld 2.36 issue
Our linker capability check fails with the recent binutils release's ld:
.../check.o:(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
.../check.o:(.debug_info+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_abbrev'
.../check.o:(.debug_info+0xc): relocation truncated to fit: R_X86_64_32 against `.debug_str'+76
.../check.o:(.debug_info+0x11): relocation truncated to fit: R_X86_64_32 against `.debug_str'+d
.../check.o:(.debug_info+0x15): relocation truncated to fit: R_X86_64_32 against `.debug_str'+2b
.../check.o:(.debug_info+0x29): relocation truncated to fit: R_X86_64_32 against `.debug_line'
.../check.o:(.debug_info+0x30): relocation truncated to fit: R_X86_64_32 against `.debug_str'+19
.../check.o:(.debug_info+0x37): relocation truncated to fit: R_X86_64_32 against `.debug_str'+71
.../check.o:(.debug_info+0x3e): relocation truncated to fit: R_X86_64_32 against `.debug_str'
.../check.o:(.debug_info+0x45): relocation truncated to fit: R_X86_64_32 against `.debug_str'+5e
.../check.o:(.debug_info+0x4c): additional relocation overflows omitted from the output
Tell the linker to strip debug info as a workaround. Debug info has been
getting stripped already anyway when linking the actual xen.efi.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f4318db940c39cc656128fcf72df3e79d2e55bc1
master date: 2021-02-05 14:09:42 +0100
Roger Pau Monné [Fri, 5 Mar 2021 14:30:23 +0000 (15:30 +0100)]
x86/efi: enable MS ABI attribute on clang
Or else the EFI service calls will use the wrong calling convention.
The __ms_abi__ attribute is available on all supported versions of
clang. Add a specific Clang check because the GCC version reported by
Clang is below the required 4.4 to use the __ms_abi__ attribute.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
master commit: 92f5ffa58d188c9f9a9f1bcdccb6d6348d9df612
master date: 2021-02-04 14:02:32 +0100
Jan Beulich [Fri, 5 Mar 2021 14:29:52 +0000 (15:29 +0100)]
x86/string: correct memmove()'s forwarding to memcpy()
With memcpy() expanding to the compiler builtin, we may not hand it
overlapping source and destination. We strictly mean to forward to our
own implementation (a few lines up in the same source file).
Fixes: 78825e1c60fa ("x86/string: Clean up x86/string.h") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7b93d92a35dc7c0a6e5f1f79b3c887aa3e66ddc0
master date: 2021-02-04 13:59:56 +0100
Tamas K Lengyel [Fri, 5 Mar 2021 14:29:31 +0000 (15:29 +0100)]
x86/debug: fix page-overflow bug in dbg_rw_guest_mem
When using gdbsx dbg_rw_guest_mem is used to read/write guest memory. When the
buffer being accessed is on a page-boundary, the next page needs to be grabbed
to access the correct memory for the buffer's overflown parts. While
dbg_rw_guest_mem has logic to handle that, it broke with 229492e210a. Instead
of grabbing the next page the code right now is looping back to the
start of the first page. This results in errors like the following while trying
to use gdb with Linux' lx-dmesg:
[ 0.114457] PM: hibernation: Registered nosave memory: [mem
0xfdfff000-0xffffffff]
[ 0.114460] [mem 0x90000000-0xfbffffff] available for PCI demem 0
[ 0.114462] f]f]
Python Exception <class 'ValueError'> embedded null character:
Error occurred in Python: embedded null character
Fixing this bug by taking the variable assignment outside the loop.
Fixes: 229492e210a ("x86/debugger: use copy_to/from_guest() in dbg_rw_guest_mem()") Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9dc687f155a57216b83b17f9cde55dd43e06b0cd
master date: 2021-01-30 03:21:33 +0000
Jan Beulich [Fri, 5 Mar 2021 14:28:57 +0000 (15:28 +0100)]
x86/HVM: re-order error path of hvm_domain_initialise()
hvm_destroy_all_ioreq_servers(), called from
hvm_domain_relinquish_resources(), invokes relocate_portio_handler(),
which uses d->arch.hvm.io_handler. Defer freeing of this array
accordingly on the error path of hvm_domain_initialise().
Similarly rtc_deinit() requires d->arch.hvm.pl_time to still be around,
or else an armed timer structure would get freed, and that timer never
get killed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fbb3bf002b42803ef289ea2a649ebd5f25d22236
master date: 2021-01-29 11:36:54 +0100
Jan Beulich [Fri, 5 Mar 2021 14:28:30 +0000 (15:28 +0100)]
memory: bail from page scrubbing when CPU is no longer online
Scrubbing can significantly delay the offlining (parking) of a CPU (e.g.
because of booting into in smt=0 mode), to a degree that the "CPU <n>
still not dead..." messages logged on x86 in 1s intervals can be seen
multiple times. There are no softirqs involved in this process, so
extend the existing preemption check in the scrubbing logic to also exit
when the CPU is no longer observed online.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c9fd69416f8ffc611705fb24dfb383203ddc84f
master date: 2021-01-29 11:34:37 +0100
Andrew Cooper [Fri, 5 Mar 2021 14:27:22 +0000 (15:27 +0100)]
x86/timer: Fix boot on Intel systems using ITSSPRC static PIT clock gating
Recent Intel client devices have disabled the legacy PIT for powersaving
reasons, breaking compatibility with a traditional IBM PC. Xen depends on a
legacy timer interrupt to check that the IO-APIC/PIC routing is configured
correctly, and fails to boot with:
(XEN) *******************************
(XEN) Panic on CPU 0:
(XEN) IO-APIC + timer doesn't work! Boot with apic_verbosity=debug and send report. Then try booting with the `noapic` option
(XEN) *******************************
While this setting can be undone by Xen, the details of how to differ by
chipset, and would be very short sighted for battery based devices. See bit 2
"8254 Static Clock Gating Enable" in:
All impacted systems have an HPET, but there is no indication of the absence
of PIT functionality, nor a suitable way to probe for its absence. As a short
term fix, reconfigure the HPET into legacy replacement mode. A better
longterm fix would be to avoid the reliance on the timer interrupt entirely.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: e1de4c196a2eb4fd5063c30a2e115adf144bdeef
master date: 2021-01-27 19:15:19 +0000
Jan Beulich [Fri, 5 Mar 2021 14:26:51 +0000 (15:26 +0100)]
xen/include: compat/xlat.h may change with .config changes
$(xlat-y) getting derived from $(headers-y) means its contents may
change with changes to .config. The individual files $(xlat-y) refers
to, otoh, may not change, and hence not trigger rebuilding of xlat.h.
(Note that the issue was already present before the commit referred to
below, but it was far more limited in affecting only changes to
CONFIG_XSM_FLASK.)
Fixes: 2c8fabb2232d ("x86: only generate compat headers actually needed") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c7b0f25e8f86373ed54e1c446f8e67ce25ac6819
master date: 2021-01-26 14:42:23 +0100
Roger Pau Monné [Fri, 5 Mar 2021 14:26:05 +0000 (15:26 +0100)]
x86/vioapic: check IRR before attempting to inject interrupt after EOI
In vioapic_update_EOI the irq_lock will be dropped in order to forward
the EOI to the dpci handler, so there's a window between clearing IRR
and checking if the line is asserted where IRR can change behind our
back.
Fix this by checking whether IRR is set before attempting to inject a
new interrupt.
Fixes: 06e3f8f2766 ('vt-d: Do dpci eoi outside of irq_lock.') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ba584fb1a26c058ebd0e6a2779287b3e4400415c
master date: 2021-01-22 12:13:05 +0100
On ARM we need gnttab_need_iommu_mapping to be true for dom0 when it is
directly mapped and IOMMU is enabled for the domain, like the old check
did, but the new check is always false.
In fact, need_iommu_pt_sync is defined as dom_iommu(d)->need_sync and
need_sync is set as:
if ( !is_hardware_domain(d) || iommu_hwdom_strict )
hd->need_sync = !iommu_use_hap_pt(d);
iommu_use_hap_pt(d) means that the page-table used by the IOMMU is the
P2M. It is true on ARM. need_sync means that you have a separate IOMMU
page-table and it needs to be updated for every change. need_sync is set
to false on ARM. Hence, gnttab_need_iommu_mapping(d) is false too,
which is wrong.
As a consequence, when using PV network from a domU on a system where
IOMMU is on from Dom0, I get:
(XEN) smmu: /smmu@fd800000: Unhandled context fault: fsr=0x402, iova=0x8424cb148, fsynr=0xb0001, cb=0
[ 68.290307] macb ff0e0000.ethernet eth0: DMA bus error: HRESP not OK
The fix is to go back to something along the lines of the old
implementation of gnttab_need_iommu_mapping.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Fixes: 91d4eca7add ("mm / iommu: split need_iommu() into has_iommu_pt() and need_iommu_pt_sync()")
Backport: 4.13+
(cherry picked from commit 04085ec1ac05a362812e9b0c6b5a8713d7dc88ad)
Julien Grall [Tue, 16 Feb 2021 14:36:16 +0000 (15:36 +0100)]
xen/page_alloc: Only flush the page to RAM once we know they are scrubbed
At the moment, each page are flushed to RAM just after the allocator
found some free pages. However, this is happening before check if the
page was scrubbed.
As a consequence, on Arm, a guest may be able to access the old content
of the scrubbed pages if it has cache disabled (default at boot) and
the content didn't reach the Point of Coherency.
The flush is now moved after we know the content of the page will not
change. This also has the benefit to reduce the amount of work happening
with the heap_lock held.
This is XSA-364.
Fixes: 307c3be3ccb2 ("mm: Don't scrub pages while holding heap lock in alloc_heap_pages()") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3b1cc15f1931ba56d0ee256fe9bfe65509733b27
master date: 2021-02-16 15:32:08 +0100
Roger Pau Monné [Thu, 21 Jan 2021 15:21:04 +0000 (16:21 +0100)]
x86/dpci: do not remove pirqs from domain tree on unbind
A fix for a previous issue removed the pirqs from the domain tree when
they are unbound in order to prevent shared pirqs from triggering a
BUG_ON in __pirq_guest_unbind if they are unbound multiple times. That
caused free_domain_pirqs to no longer unmap the pirqs because they
are gone from the domain pirq tree, thus leaving stale unbound pirqs
after domain destruction if the domain had mapped dpci pirqs after
shutdown.
Take a different approach to fix the original issue, instead of
removing the pirq from d->pirq_tree clear the flags of the dpci pirq
struct to signal that the pirq is now unbound. This prevents calling
pirq_guest_unbind multiple times for the same pirq without having to
remove it from the domain pirq tree.
This is XSA-360.
Fixes: 5b58dad089 ('x86/pass-through: avoid double IRQ unbind during domain cleanup') Reported-by: Samuel Verschelde <samuel.verschelde@vates.fr> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 58427889f5a420cc5226f88524b3228f90b72a58
master date: 2021-01-21 16:11:41 +0100
UBSAN catches an uninitialized use of the 'preempted' variable in
fork_hap_allocation when there is no preemption.
Fixes: 41548c5472a ("mem_sharing: VM forking") Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: cb34a2fa162184b150d48a3b6f385bacbec22ce7
master date: 2021-01-18 17:50:11 +0000
Jan Beulich [Thu, 21 Jan 2021 15:19:34 +0000 (16:19 +0100)]
x86/ACPI: don't overwrite FADT
When marking fields invalid for our own purposes, we should do so in our
local copy (so we will notice later on), not in the firmware provided
one (which another entity may want to look at again, e.g. after kexec).
Also mark the function parameter const to notice such issues right away.
Instead use the pointer at the firmware copy for specifying an adjacent
printk()'s arguments. If nothing else this at least reduces the number
of relocations the assembler hasto emit and the linker has to process.
Fixes: 62d1a69a4e9f ("ACPI: support v5 (reduced HW) sleep interface") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 654c917d94d24587bb6b4a8d862baf04b25cbe33
master date: 2021-01-11 14:55:52 +0100
Roger Pau Monné [Thu, 21 Jan 2021 15:19:09 +0000 (16:19 +0100)]
x86/hypercall: fix gnttab hypercall args conditional build on pvshim
A pvshim build doesn't require the grant table functionality built in,
but it does require knowing the number of arguments the hypercall has
so the hypercall parameter clobbering works properly.
Instead of also setting the argument count for the gnttab case if PV
shim functionality is enabled, just drop all of the conditionals from
hypercall_args_table, as a hypercall having a NULL handler won't get
to use that information anyway.
Note this hasn't been detected by osstest because the tools pvshim
build is done without debug enabled, so the hypercall parameter
clobbering doesn't happen.
Fixes: d2151152dd2 ('xen: make grant table support configurable') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b468b464c89e92629bd52cec58e9f51eee2e950a
master date: 2021-01-08 16:51:52 +0100
Roger Pau Monné [Thu, 21 Jan 2021 15:18:32 +0000 (16:18 +0100)]
x86/dpci: EOI interrupt regardless of its masking status
Modify hvm_pirq_eoi to always EOI the interrupt if required, instead
of not doing such EOI if the interrupt is routed through the vIO-APIC
and the entry is masked at the time the EOI is performed.
Further unmask of the vIO-APIC pin won't EOI the interrupt, and thus
the guest OS has to wait for the timeout to expire and the automatic
EOI to be performed.
This allows to simplify the helpers and drop the vioapic_redir_entry
parameter from all of them.
Fixes: ccfe4e08455 ('Intel vt-d specific changes in arch/x86/hvm/vmx/vtd.') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: eb298f32fac5ac362eef30a66a9c9c42724d4ce6
master date: 2021-01-07 15:10:29 +0100
Jan Beulich [Thu, 21 Jan 2021 15:18:06 +0000 (16:18 +0100)]
x86/vPCI: tolerate (un)masking a disabled MSI-X entry
None of the four reasons causing vpci_msix_arch_mask_entry() to get
called (there's just a single call site) are impossible or illegal prior
to an entry actually having got set up:
- the entry may remain masked (in this case, however, a prior masked ->
unmasked transition would already not have worked),
- MSI-X may not be enabled,
- the global mask bit may be set,
- the entry may not otherwise have been updated.
Hence the function asserting that the entry was previously set up was
simply wrong. Since the caller tracks the masked state (and setting up
of an entry would only be effected when that software bit is clear),
it's okay to skip both masking and unmasking requests in this case.
Fixes: d6281be9d0145 ('vpci/msix: add MSI-X handlers') Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
master commit: 04b090366ca59e8a75837c822df261a8d0bd1a30
master date: 2021-01-05 13:17:54 +0100
Andrew Cooper [Thu, 21 Jan 2021 15:17:21 +0000 (16:17 +0100)]
x86/hpet: Fix return value of hpet_setup()
hpet_setup() is idempotent if the rate has already been calculated, and
returns the cached value. However, this only works correctly when the return
statements are identical.
Use a sensibly named local variable, rather than a dead one with a bad name.
Fixes: a60bb68219 ("x86/time: reduce rounding errors in calculations") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 83736c567d6b64dbce98f251ca72e7870f556421
master date: 2020-12-31 16:19:00 +0000
Jan Beulich [Tue, 15 Dec 2020 13:15:13 +0000 (14:15 +0100)]
evtchn/FIFO: add 2nd smp_rmb() to evtchn_fifo_word_from_port()
Besides with add_page_to_event_array() the function also needs to
synchronize with evtchn_fifo_init_control() setting both d->evtchn_fifo
and (subsequently) d->evtchn_port_ops.
Jan Beulich [Tue, 15 Dec 2020 13:14:54 +0000 (14:14 +0100)]
evtchn/FIFO: re-order and synchronize (with) map_control_block()
For evtchn_fifo_set_pending()'s check of the control block having been
set to be effective, ordering of respective reads and writes needs to be
ensured: The control block pointer needs to be recorded strictly after
the setting of all the queue heads, and it needs checking strictly
before any uses of them (this latter aspect was already guaranteed).
Roger Pau Monné [Tue, 15 Dec 2020 13:14:34 +0000 (14:14 +0100)]
x86/irq: fix infinite loop in irq_move_cleanup_interrupt
If Xen enters irq_move_cleanup_interrupt with a dynamic vector below
IRQ_MOVE_CLEANUP_VECTOR pending in IRR (0x20 or 0x21) that's also
designated for a cleanup it will enter a loop where
irq_move_cleanup_interrupt continuously sends a cleanup IPI (vector
0x22) to itself while waiting for the vector with lower priority to be
injected - which will never happen because IRQ_MOVE_CLEANUP_VECTOR
takes precedence and it's always injected first.
Fix this by making sure vectors below IRQ_MOVE_CLEANUP_VECTOR are
marked as used and thus not available for APs. Also add some logic to
assert and prevent irq_move_cleanup_interrupt from entering such an
infinite loop, albeit that should never happen given the current code.
This is XSA-356 / CVE-2020-29567.
Fixes: 3fba06ba9f8 ('x86/IRQ: re-use legacy vector ranges on APs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ca85682e8c16361fdf3814c9b25a2ec3ff4f8bed
master date: 2020-12-15 13:42:16 +0100
Jan Beulich [Tue, 15 Dec 2020 13:13:56 +0000 (14:13 +0100)]
x86: avoid calling {svm,vmx}_do_resume()
These functions follow the following path: hvm_do_resume() ->
handle_hvm_io_completion() -> hvm_wait_for_io() ->
wait_on_xen_event_channel() -> do_softirq() -> schedule() ->
sched_context_switch() -> continue_running() and hence may
recursively invoke themselves. If this ends up happening a couple of
times, a stack overflow would result.
Prevent this by also resetting the stack at the
->arch.ctxt_switch->tail() invocations (in both places for consistency)
and thus jumping to the functions instead of calling them.
Jan Beulich [Tue, 15 Dec 2020 13:12:47 +0000 (14:12 +0100)]
x86: replace reset_stack_and_jump_nolp()
Move the necessary check into check_for_livepatch_work(), rather than
mostly duplicating reset_stack_and_jump() for this purpose. This is to
prevent an inflation of reset_stack_and_jump() flavors.
Edwin Török [Tue, 15 Dec 2020 13:09:16 +0000 (14:09 +0100)]
tools/ocaml/xenstored: only Dom0 can change node owner
Otherwise we can give quota away to another domain, either causing it to run
out of quota, or in case of Dom0 use unbounded amounts of memory and bypass
the quota system entirely.
This was fixed in the C version of xenstored in 2006 (c/s db34d2aaa5f5,
predating the XSA process by 5 years).
It was also fixed in the mirage version of xenstore in 2012, with a unit test
demonstrating the vulnerability:
but possibly without realising that the vulnerability still affected the
in-tree oxenstored (added c/s f44af660412 in 2010).
This is XSA-352.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:09:10 +0000 (14:09 +0100)]
tools/ocaml/xenstored: delete watch from trie too when resetting watches
c/s f8c72b526129 "oxenstored: implement XS_RESET_WATCHES" from Xen 4.6
introduced reset watches support in oxenstored by mirroring the change
in cxenstored.
However the OCaml version has some additional data structures to
optimize watch firing, and just resetting the watches in one of the data
structures creates a security bug where a malicious guest kernel can
exceed its watch quota, driving oxenstored into OOM:
* create watches
* reset watches (this still keeps the watches lingering in another data
structure, using memory)
* create some more watches
* loop until oxenstored dies
The guest kernel doesn't necessarily have to be malicious to trigger
this:
* if control/platform-feature-xs_reset_watches is set
* the guest kexecs (e.g. because it crashes)
* on boot more watches are set up
* this will slowly "leak" memory for watches in oxenstored, driving it
towards OOM.
This is XSA-330.
Fixes: f8c72b526129 ("oxenstored: implement XS_RESET_WATCHES") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools/xenstore: Preserve bad client until they are destroyed
XenStored will kill any connection that it thinks has misbehaved,
this is currently happening in two places:
* In `handle_input()` if the sanity check on the ring and the message
fails.
* In `handle_output()` when failing to write the response in the ring.
As the domain structure is a child of the connection, XenStored will
destroy its view of the domain when killing the connection. This will
result in sending @releaseDomain event to all the watchers.
As the watch event doesn't carry which domain has been released,
the watcher (such as XenStored) will generally go through the list of
domains registers and check if one of them is shutting down/dying.
In the case of a client misbehaving, the domain will likely to be
running, so no action will be performed.
When the domain is effectively destroyed, XenStored will not be aware of
the domain anymore. So the watch event is not going to be sent.
By consequence, the watchers of the event will not release mappings
they may have on the domain. This will result in a zombie domain.
In order to send @releaseDomain event at the correct time, we want
to keep the domain structure until the domain is effectively
shutting-down/dying.
We also want to keep the connection around so we could possibly revive
the connection in the future.
A new flag 'is_ignored' is added to mark whether a connection should be
ignored when checking if there are work to do. Additionally any
transactions, watches, buffers associated to the connection will be
freed as you can't do much with them (restarting the connection will
likely need a reset).
As a side note, when the device model were running in a stubdomain, a
guest would have been able to introduce a use-after-free because there
is two parents for a guest connection.
Juergen Gross [Tue, 15 Dec 2020 13:08:21 +0000 (14:08 +0100)]
tools/xenstore: drop watch event messages exceeding maximum size
By setting a watch with a very large tag it is possible to trick
xenstored to send watch event messages exceeding the maximum allowed
payload size. This might in turn lead to a crash of xenstored as the
resulting error can cause dereferencing a NULL pointer in case there
is no active request being handled by the guest the watch event is
being sent to.
Fix that by just dropping such watch events. Additionally modify the
error handling to test the pointer to be not NULL before dereferencing
it.
Edwin Török [Tue, 15 Dec 2020 13:08:17 +0000 (14:08 +0100)]
tools/ocaml/xenstored: Fix path length validation
Currently, oxenstored checks the length of paths against 1024, then
prepends "/local/domain/$DOMID/" to relative paths. This allows a domU
to create paths which can't subsequently be read by anyone, even dom0.
This also interferes with listing directories, etc.
Define a new oxenstored.conf entry: quota-path-max, defaulting to 1024
as before. For paths that begin with "/local/domain/$DOMID/" check the
relative path length against this quota. For all other paths check the
entire path length.
This ensures that if the domid changes (and thus the length of a prefix
changes) a path that used to be valid stays valid (e.g. after a
live-migration). It also ensures that regardless how the client tries
to access a path (domid-relative or absolute) it will get consistent
results, since the limit is always applied on the final canonicalized
path.
Delete the unused Domain.get_path to avoid it being confused with
Connection.get_path (which differs by a trailing slash only).
Rewrite Util.path_validate to apply the appropriate length restriction
based on whether the path is relative or not. Remove the check for
connection_path being absolute, because it is not guest controlled data.
This is part of XSA-323.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:08:03 +0000 (14:08 +0100)]
tools/ocaml/xenstored: clean up permissions for dead domains
domain ids are prone to wrapping (15-bits), and with sufficient number
of VMs in a reboot loop it is possible to trigger it. Xenstore entries
may linger after a domain dies, until a toolstack cleans it up. During
this time there is a window where a wrapped domid could access these
xenstore keys (that belonged to another VM).
To prevent this do a cleanup when a domain dies:
* walk the entire xenstore tree and update permissions for all nodes
* if the dead domain had an ACL entry: remove it
* if the dead domain was the owner: change the owner to Dom0
This is done without quota checks or a transaction. Quota checks would
be a no-op (either the domain is dead, or it is Dom0 where they are not
enforced). Transactions are not needed, because this is all done
atomically by oxenstored's single thread.
The xenstore entries owned by the dead domain are not deleted, because
that could confuse a toolstack / backends that are still bound to it
(or generate unexpected watch events). It is the responsibility of a
toolstack to remove the xenstore entries themselves.
This is part of XSA-322.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Juergen Gross [Tue, 15 Dec 2020 13:07:52 +0000 (14:07 +0100)]
tools/xenstore: revoke access rights for removed domains
Access rights of Xenstore nodes are per domid. Unfortunately existing
granted access rights are not removed when a domain is being destroyed.
This means that a new domain created with the same domid will inherit
the access rights to Xenstore nodes from the previous domain(s) with
the same domid.
This can be avoided by adding a generation counter to each domain.
The generation counter of the domain is set to the global generation
counter when a domain structure is being allocated. When reading or
writing a node all permissions of domains which are younger than the
node itself are dropped. This is done by flagging the related entry
as invalid in order to avoid modifying permissions in a way the user
could detect.
A special case has to be considered: for a new domain the first
Xenstore entries are already written before the domain is officially
introduced in Xenstore. In order not to drop the permissions for the
new domain a domain struct is allocated even before introduction if
the hypervisor is aware of the domain. This requires adding another
bool "introduced" to struct domain in xenstored. In order to avoid
additional padding holes convert the shutdown flag to bool, too.
As verifying permissions has its price regarding runtime add a new
quota for limiting the number of permissions an unprivileged domain
can set for a node. The default for that new quota is 5.
Edwin Török [Tue, 15 Dec 2020 13:07:03 +0000 (14:07 +0100)]
tools/ocaml/xenstored: add xenstored.conf flag to turn off watch permission checks
There are flags to turn off quotas and the permission system, so add one
that turns off the newly introduced watch permission checks as well.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:06:58 +0000 (14:06 +0100)]
tools/ocaml/xenstored: avoid watch events for nodes without access
Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. setup a
watch for "/" in order to have a detailed record of all Xenstore
modifications.
Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).
Permissions for nodes are looked up either in the old (pre
transaction/command) or current trees (post transaction). If
permissions are changed multiple times in a transaction only the final
version is checked, because considering a transaction atomic the
individual permission changes would not be noticable to an outside
observer.
Two trees are only needed for set_perms: here we can either notice the
node disappearing (if we loose permission), appearing
(if we gain permission), or changing (if we preserve permission).
RM needs to only look at the old tree: in the new tree the node would be
gone, or could have different permissions if it was recreated (the
recreation would get its own watch fired).
Inside a tree we lookup the watch path's parent, and then the watch path
child itself. This gets us 4 sets of permissions in worst case, and if
either of these allows a watch, then we permit it to fire. The
permission lookups are done without logging the failures, otherwise we'd
get confusing errors about permission denied for some paths, but a watch
still firing. The actual result is logged in xenstored-access log:
'w event ...' as usual if watch was fired
'w notfired...' if the watch was not fired, together with path and
permission set to help in troubleshooting
Adding a watch bypasses permission checks and always fires the watch
once immediately. This is consistent with the specification, and no
information is gained (the watch is fired both if the path exists or
doesn't, and both if you have or don't have access, i.e. it reflects the
path a domain gave it back to that domain).
There are some semantic changes here:
* Write+rm in a single transaction of the same path is unobservable
now via watches: both before and after a transaction the path
doesn't exist, thus both tree lookups come up with the empty
permission set, and noone, not even Dom0 can see this. This is
consistent with transaction atomicity though.
* Similar to above if we temporarily grant and then revoke permission
on a path any watches fired inbetween are ignored as well
* There is a new log event (w notfired) which shows the permission set
of the path, and the path.
* Watches on paths that a domain doesn't have access to are now not
seen, which is the purpose of the security fix.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:06:53 +0000 (14:06 +0100)]
tools/ocaml/xenstored: introduce permissions for special watches
The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow to gain information
about presence of other guests on the host. So send watch events for
those watches via privileged connections only.
Start to address this by treating the special watches as regular nodes
in the tree, which gives them normal semantics for permissions. A later
change will restrict the handling, so that they can't be listed, etc.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:06:48 +0000 (14:06 +0100)]
tools/ocaml/xenstored: unify watch firing
This will make it easier insert additional checks in a follow-up patch.
All watches are now fired from a single function.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:06:43 +0000 (14:06 +0100)]
tools/ocaml/xenstored: check privilege for XS_IS_DOMAIN_INTRODUCED
The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for privileged
domains only (the only user in the tree is the xenpaging daemon).
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 13:06:38 +0000 (14:06 +0100)]
tools/ocaml/xenstored: ignore transaction id for [un]watch
Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH
commands as it is documented in docs/misc/xenstore.txt, it is tested
for validity today.
Really ignore the transaction id for XS_WATCH and XS_UNWATCH.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Thu, 11 Jun 2020 14:12:46 +0000 (16:12 +0200)]
tools/xenstore: avoid watch events for nodes without access
Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. setup a
watch for "/" in order to have a detailed record of all Xenstore
modifications.
Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).
Juergen Gross [Thu, 11 Jun 2020 14:12:45 +0000 (16:12 +0200)]
tools/xenstore: allow special watches for privileged callers only
The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow to gain information
about presence of other guests on the host. So send watch events for
those watches via privileged connections only.
In order to allow for disaggregated setups where e.g. driver domains
need to make use of those special watches add support for calling
"set permissions" for those special nodes, too.
Juergen Gross [Thu, 11 Jun 2020 14:12:43 +0000 (16:12 +0200)]
tools/xenstore: fire watches only when removing a specific node
Instead of firing all watches for removing a subtree in one go, do so
only when the related node is being removed.
The watches for the top-most node being removed include all watches
including that node, while watches for nodes below that are only fired
if they are matching exactly. This avoids firing any watch more than
once when removing a subtree.
Juergen Gross [Thu, 11 Jun 2020 14:12:42 +0000 (16:12 +0200)]
tools/xenstore: rework node removal
Today a Xenstore node is being removed by deleting it from the parent
first and then deleting itself and all its children. This results in
stale entries remaining in the data base in case e.g. a memory
allocation is failing during processing. This would result in the
rather strange behavior to be able to read a node (as its still in the
data base) while not being visible in the tree view of Xenstore.
Fix that by deleting the nodes from the leaf side instead of starting
at the root.
As fire_watches() is now called from _rm() the ctx parameter needs a
const attribute.
Juergen Gross [Thu, 11 Jun 2020 14:12:40 +0000 (16:12 +0200)]
tools/xenstore: simplify and rename check_event_node()
There is no path which allows to call check_event_node() without a
event name. So don't let the result depend on the name being NULL and
add an assert() covering that case.
Rename the function to check_special_event() to better match the
semantics.
Juergen Gross [Thu, 11 Jun 2020 14:12:39 +0000 (16:12 +0200)]
tools/xenstore: fix node accounting after failed node creation
When a node creation fails the number of nodes of the domain should be
the same as before the failed node creation. In case of failure when
trying to create a node requiring to create one or more intermediate
nodes as well (e.g. when /a/b/c/d is to be created, but /a/b isn't
existing yet) it might happen that the number of nodes of the creating
domain is not reset to the value it had before.
So move the quota accounting out of construct_node() and into the node
write loop in create_node() in order to be able to undo the accounting
in case of an error in the intermediate node destructor.
Juergen Gross [Thu, 11 Jun 2020 14:12:38 +0000 (16:12 +0200)]
tools/xenstore: ignore transaction id for [un]watch
Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH
commands as it is documented in docs/misc/xenstore.txt, it is tested
for validity today.
Really ignore the transaction id for XS_WATCH and XS_UNWATCH.
Juergen Gross [Thu, 11 Jun 2020 14:12:37 +0000 (16:12 +0200)]
tools/xenstore: allow removing child of a node exceeding quota
An unprivileged user of Xenstore is not allowed to write nodes with a
size exceeding a global quota, while privileged users like dom0 are
allowed to write such nodes. The size of a node is the needed space
to store all node specific data, this includes the names of all
children of the node.
When deleting a node its parent has to be modified by removing the
name of the to be deleted child from it.
This results in the strange situation that an unprivileged owner of a
node might not succeed in deleting that node in case its parent is
exceeding the quota of that unprivileged user (it might have been
written by dom0), as the user is not allowed to write the updated
parent node.
Fix that by not checking the quota when writing a node for the
purpose of removing a child's name only.
The same applies to transaction handling: a node being read during a
transaction is written to the transaction specific area and it should
not be tested for exceeding the quota, as it might not be owned by
the reader and presumably the original write would have failed if the
node is owned by the reader.
Edwin Török [Tue, 15 Dec 2020 13:05:16 +0000 (14:05 +0100)]
tools/ocaml/xenstored: do permission checks on xenstore root
This was lacking in a disappointing number of places.
The xenstore root node is treated differently from all other nodes, because it
doesn't have a parent, and mutation requires changing the parent.
Unfortunately this lead to open-coding the special case for root into every
single xenstore operation, and out of all the xenstore operations only read
did a permission check when handling the root node.
This means that an unprivileged guest can:
* xenstore-chmod / to its liking and subsequently write new arbitrary nodes
there (subject to quota)
* xenstore-rm -r / deletes almost the entire xenstore tree (xenopsd quickly
refills some, but you are left with a broken system)
* DIRECTORY on / lists all children when called through python
bindings (xenstore-ls stops at /local because it tries to list recursively)
* get-perms on / works too, but that is just a minor information leak
Add the missing permission checks, but this should really be refactored to do
the root handling and permission checks on the node only once from a single
function, instead of getting it wrong nearly everywhere.
This is XSA-353.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monné [Tue, 1 Dec 2020 14:34:55 +0000 (15:34 +0100)]
x86/vioapic: fix usage of index in place of GSI in vioapic_write_redirent
The usage of idx instead of the GSI in vioapic_write_redirent when
accessing gsi_assert_count can cause a PVH dom0 with multiple
vIO-APICs to lose interrupts in case a pin of a IO-APIC different than
the first one is unmasked with pending interrupts.
Switch to use gsi instead to fix the issue.
Fixes: 9f44b08f7d0e4 ('x86/vioapic: introduce support for multiple vIO APICS') Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Roger Pau Monné <roge.rpau@citrix.com> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3ae469af8e680df31eecd0a2ac6a83b58ad7ce53
master date: 2020-11-30 14:06:38 +0100
Juergen Gross [Tue, 1 Dec 2020 14:34:31 +0000 (15:34 +0100)]
xen/events: rework fifo queue locking
Two cpus entering evtchn_fifo_set_pending() for the same event channel
can race in case the first one gets interrupted after setting
EVTCHN_FIFO_PENDING and when the other one manages to set
EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
lead to evtchn_check_pollers() being called before the event is put
properly into the queue, resulting eventually in the guest not seeing
the event pending and thus blocking forever afterwards.
Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
lock") made the race just more obvious, while the fifo event channel
implementation had this race forever since the introduction and use of
per-channel locks, when an unmask operation was running in parallel with
an event channel send operation.
Using a spinlock for the per event channel lock had turned out
problematic due to some paths needing to take the lock are called with
interrupts off, so the lock would need to disable interrupts, which in
turn broke some use cases related to vm events.
For avoiding this race the queue locking in evtchn_fifo_set_pending()
needs to be reworked to cover the test of EVTCHN_FIFO_PENDING,
EVTCHN_FIFO_MASKED and EVTCHN_FIFO_LINKED, too. Additionally when an
event channel needs to change queues both queues need to be locked
initially, in order to avoid having a window with no lock held at all.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock") Fixes: de6acb78bf0e137c ("evtchn: use a per-event channel lock for sending events") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 71ac522909e9302350a88bc378be99affa87067c
master date: 2020-11-30 14:05:39 +0100
Juergen Gross [Tue, 1 Dec 2020 14:33:19 +0000 (15:33 +0100)]
xen/events: access last_priority and last_vcpu_id together
The queue for a fifo event is depending on the vcpu_id and the
priority of the event. When sending an event it might happen the
event needs to change queues and the old queue needs to be kept for
keeping the links between queue elements intact. For this purpose
the event channel contains last_priority and last_vcpu_id values
elements for being able to identify the old queue.
In order to avoid races always access last_priority and last_vcpu_id
with a single atomic operation avoiding any inconsistencies.
Jan Beulich [Tue, 1 Dec 2020 14:32:51 +0000 (15:32 +0100)]
x86/vpt: fix build with old gcc
I believe it was the XSA-336 fix (42fcdd42328f "x86/vpt: fix race when
migrating timers between vCPUs") which has unmasked a bogus
uninitialized variable warning. This is observable with gcc 4.3.4, but
only on 4.13 and older; it's hidden on newer versions apparently due to
the addition to _read_unlock() done by 12509bbeb9e3 ("rwlocks: call
preempt_disable() when taking a rwlock").
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f2c620aa062767b318267d678ae249dcb637b870
master date: 2020-11-18 12:38:01 +0100
With the event channel lock no longer disabling interrupts commit 52e1fc47abc3a0123 ("evtchn/Flask: pre-allocate node on send path") can
be reverted again.
Juergen Gross [Tue, 1 Dec 2020 14:31:01 +0000 (15:31 +0100)]
xen/evtchn: rework per event channel lock
Currently the lock for a single event channel needs to be taken with
interrupts off, which causes deadlocks in some cases.
Rework the per event channel lock to be non-blocking for the case of
sending an event and removing the need for disabling interrupts for
taking the lock.
The lock is needed for avoiding races between event channel state
changes (creation, closing, binding) against normal operations (set
pending, [un]masking, priority changes).
Use a rwlock, but with some restrictions:
- Changing the state of an event channel (creation, closing, binding)
needs to use write_lock(), with ASSERT()ing that the lock is taken as
writer only when the state of the event channel is either before or
after the locked region appropriate (either free or unbound).
- Sending an event needs to use read_trylock() mostly, in case of not
obtaining the lock the operation is omitted. This is needed as
sending an event can happen with interrupts off (at least in some
cases).
- Dumping the event channel state for debug purposes is using
read_trylock(), too, in order to avoid blocking in case the lock is
taken as writer for a long time.
Andrew Cooper [Thu, 23 Apr 2020 12:05:46 +0000 (13:05 +0100)]
x86/msr: Disallow guest access to the RAPL MSRs
Researchers have demonstrated using the RAPL interface to perform a
differential power analysis attack to recover AES keys used by other cores in
the system.
Furthermore, even privileged guests cannot use this interface correctly, due
to MSR scope and vcpu scheduling issues. The interface would want to be
paravirtualised to be used sensibly.
Disallow access to the RAPL MSRs completely, as well as other MSRs which
potentially access fine grain power information.
This is part of XSA-351.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 6 Oct 2020 16:23:27 +0000 (18:23 +0200)]
x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL}
Currently a PV hardware domain can also be given control over the CPU
frequency, and such guest is allowed to write to MSR_IA32_PERF_CTL.
However since commit 322ec7c89f6 the default behavior has been changed
to reject accesses to not explicitly handled MSRs, preventing PV
guests that manage CPU frequency from reading
MSR_IA32_PERF_{STATUS/CTL}.
Additionally some HVM guests (Windows at least) will attempt to read
MSR_IA32_PERF_CTL and will panic if given back a #GP fault:
Move the handling of MSR_IA32_PERF_{STATUS/CTL} to the common MSR
handling shared between HVM and PV guests, and add an explicit case
for reads to MSR_IA32_PERF_{STATUS/CTL}.
Restore previous behavior and allow PV guests with the required
permissions to read the contents of the mentioned MSRs. Non privileged
guests will get 0 when trying to read those registers, as writes to
MSR_IA32_PERF_CTL by such guest will already be silently dropped.
Fixes: 322ec7c89f6 ('x86/pv: disallow access to unknown MSRs') Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3059178798a23ba870ff86ff54d442a07e6651fc)
Julien Grall [Mon, 9 Nov 2020 20:28:59 +0000 (20:28 +0000)]
xen/arm: Always trap AMU system registers
The Activity Monitors Unit (AMU) has been introduced by ARMv8.4. It is
considered to be unsafe to be expose to guests as they might expose
information about code executed by other guests or the host.
Arm provided a way to trap all the AMU system registers by setting
CPTR_EL2.TAM to 1.
Unfortunately, on older revision of the specification, the bit 30 (now
CPTR_EL1.TAM) was RES0. Because of that, Xen is setting it to 0 and
therefore the system registers would be exposed to the guest when it is
run on processors with AMU.
As the bit is mark as UNKNOWN at boot in Armv8.4, the only safe solution
for us is to always set CPTR_EL1.TAM to 1.
Guest trying to access the AMU system registers will now receive an
undefined instruction. Unfortunately, this means that even well-behaved
guest may fail to boot because we don't sanitize the ID registers.
This is a known issues with other Armv8.0+ features (e.g. SVE, Pointer
Auth). This will taken care separately.
tools/libs/stat: use memcpy instead of strncpy in getBridge
Use memcpy in getBridge to prevent gcc warnings about truncated
strings. We know that we might truncate it, so the gcc warning
here is wrong.
Revert previous change changing buffer sizes as bigger buffers
are not needed.
Olaf Hering [Wed, 23 Sep 2020 06:48:40 +0000 (08:48 +0200)]
tools/libxc: report malloc errors in writev_exact
The caller of writev_exact should be notified about malloc errors
when dealing with partial writes.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit 0d8d289af7a679c028462c4ed5d98586f9ef9648)
David Woodhouse [Thu, 19 Mar 2020 20:40:24 +0000 (20:40 +0000)]
tools/xenstore: Do not abort xenstore-ls if a node disappears while iterating
The do_ls() function has somewhat inconsistent handling of errors.
If reading the node's contents with xs_read() fails, then do_ls() will
just quietly not display the contents.
If reading the node's permissions with xs_get_permissions() fails, then
do_ls() will print a warning, continue, and ultimately won't exit with
an error code (unless another error happens).
If recursing into the node with xs_directory() fails, then do_ls() will
abort immediately, not printing any further nodes.
For persistent failure modes — such as ENOENT because a node has been
removed, or EACCES because it has had its permisions changed since the
xs_directory() on the parent directory returned its name — it's
obviously quite likely that if either of the first two errors occur for
a given node, then so will the third and thus xenstore-ls will abort.
The ENOENT one is actually a fairly common case, and has caused tools to
fail to clean up a network device because it *apparently* already
doesn't exist in xenstore.
There is a school of thought that says, "Well, xenstore-ls returned an
error. So the tools should not trust its output."
The natural corollary of this would surely be that the tools must re-run
xenstore-ls as many times as is necessary until its manages to exit
without hitting the race condition. I am not keen on that conclusion.
For the specific case of ENOENT it seems reasonable to declare that,
but for the timing, we might as well just not have seen that node at
all when calling xs_directory() for the parent. By ignoring the error,
we give acceptable output.
The issue can be reproduced as follows:
(dom0) # for a in `seq 1 1000` ; do
xenstore-write /local/domain/2/foo/$a $a ;
done
Now simultaneously:
(dom0) # for a in `seq 1 999` ; do
xenstore-rm /local/domain/2/foo/$a ;
done
(dom2) # while true ; do
./xenstore-ls -p /local/domain/2/foo | grep -c 1000 ;
done
We should expect to see node 1000 in the output, every time.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit beb105af19cc3e58e2ed49224cfe190a736e5fa0)
Bertrand Marquis [Thu, 15 Oct 2020 09:16:09 +0000 (10:16 +0100)]
tools/xenpmd: Fix gcc10 snprintf warning
Add a check for snprintf return code and ignore the entry if we get an
error. This should in fact never happen and is more a trick to make gcc
happy and prevent compilation errors.
This is solving the following gcc warning when compiling for arm32 host
platforms with optimization activated:
xenpmd.c:92:37: error: '%s' directive output may be truncated writing
between 4 and 2147483645 bytes into a region of size 271
[-Werror=format-truncation=]
This is also solving the following Debian bug:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970802
libxl: fix -Werror=stringop-truncation in libxl__prepare_sockaddr_un
In file included from /usr/include/string.h:495,
from libxl_internal.h:38,
from libxl_utils.c:20:
In function 'strncpy',
inlined from 'libxl__prepare_sockaddr_un' at libxl_utils.c:1262:5:
/usr/include/bits/string_fortified.h:106:10: error: '__builtin_strncpy' specified bound 108 equals destination size [-Werror=stringop-truncation]
106 | return __builtin___strncpy_chk (__dest, __src, __len, __bos (__dest));
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit fff1b7f50e75ad9535c86f3fcf425b4945c50a1c)
It seems xlu_pci_parse_bdf has a state machine that is too complex for
gcc to understand. The build fails with:
libxlu_pci.c: In function 'xlu_pci_parse_bdf':
libxlu_pci.c:32:18: error: 'func' may be used uninitialized in this function [-Werror=maybe-uninitialized]
32 | pcidev->func = func;
| ~~~~~~~~~~~~~^~~~~~
libxlu_pci.c:51:29: note: 'func' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~~
libxlu_pci.c:31:17: error: 'dev' may be used uninitialized in this function [-Werror=maybe-uninitialized]
31 | pcidev->dev = dev;
| ~~~~~~~~~~~~^~~~~
libxlu_pci.c:51:24: note: 'dev' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~
libxlu_pci.c:30:17: error: 'bus' may be used uninitialized in this function [-Werror=maybe-uninitialized]
30 | pcidev->bus = bus;
| ~~~~~~~~~~~~^~~~~
libxlu_pci.c:51:19: note: 'bus' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~
libxlu_pci.c:29:20: error: 'dom' may be used uninitialized in this function [-Werror=maybe-uninitialized]
29 | pcidev->domain = domain;
| ~~~~~~~~~~~~~~~^~~~~~~~
libxlu_pci.c:51:14: note: 'dom' was declared here
51 | unsigned dom, bus, dev, func, vslot = 0;
| ^~~
cc1: all warnings being treated as errors
Workaround it by setting the initial value to invalid value (0xffffffff)
and then assert on each value being set. This way we mute the gcc
warning, while still detecting bugs in the parse code.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit d25cc3ec93ebda030349045d2c7fa14ffde07ed7)
Jason Andryuk [Fri, 6 Nov 2020 08:57:11 +0000 (09:57 +0100)]
SUPPORT: Add linux device model stubdom to Toolstack
Add qemu-xen linux device model stubdomain to the Toolstack section as a
Tech Preview.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 4ddd6499d999a7d08cabfda5b0262e473dd5beed
master date: 2020-10-23 17:14:08 +0100
Laurentiu Tudor [Fri, 2 Oct 2020 10:33:44 +0000 (13:33 +0300)]
arm,smmu: match start level of page table walk with P2M
Don't hardcode the lookup start level of the page table walk to 1
and instead match the one used in P2M. This should fix scenarios
involving SMMU where the start level is different than 1.
In order for the SMMU driver to also compile on arm32 move the
P2M_ROOT_LEVEL in the p2m header file (while at it, for
consistency also P2M_ROOT_ORDER) and use the macro in the smmu
driver.
xen/arm: sched: Ensure the vCPU context is seen before vcpu_pause() returns
Some callers of vcpu_pause() will expect to access the latest vcpu
context when the function returns (see XENDOMCTL_{set,get}vcpucontext}.
However, the latest vCPU context can only be observed after
v->is_running has been observed to be false.
As there is no memory barrier instruction generated, a processor could
try to speculatively access the vCPU context before it was observed.
To prevent the corruption of the vCPU context, we need to insert a
memory barrier instruction after v->is_running is observed and before
the context is accessed. This barrier is added in sync_vcpu_execstate()
as it seems to be the place where we expect the synchronization to
happen.
Jan Beulich [Fri, 11 Sep 2020 10:45:33 +0000 (12:45 +0200)]
xen/arm64: force gcc 10+ to always inline generic atomics helpers
Recent versions of gcc (at least 10.x) will not inline generic atomics
helpers by default. Instead they will expect the software to either link
with libatomic.so or implement the helpers, which would result in
undefined reference to `__aarch64_ldadd4_acq_rel'
for us (not having any local implementation).
To keep the previous behavior, force gcc to always inline the generic
atomics helpers.
Long term we probably want to avoid relying on gcc atomics helpers as
this doesn't allow us to switch between LSE and LL/SC atomics.
Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 5d45ecabe3c0b2097df623ab7b471f8915cfdde6)
Jan Beulich [Wed, 4 Nov 2020 10:02:30 +0000 (11:02 +0100)]
x86emul: fix PINSRW and adjust other {,V}PINSR*
The use of simd_packed_int together with no further update to op_bytes
has lead to wrong signaling of #GP(0) for PINSRW without a 16-byte
aligned memory operand. Use simd_none instead and override it after
general decoding with simd_other, like is done for the B/D/Q siblings.
While benign, for consistency also use DstImplicit instead of DstReg
in x86_decode_twobyte().
PINSR{B,D,Q} also had a stray (redundant) get_fpu() invocation, which
gets dropped.
For further consistency also
- use src.bytes instead of op_bytes in relevant memcpy() invocations,
- avoid the pointless updating of op_bytes (all we care about later is
that the value be less than 16).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 06f0598b41f23c9e4cf7d8c5a05b282de92f3a35
master date: 2020-10-23 18:03:18 +0200
Roger Pau Monné [Wed, 4 Nov 2020 10:01:27 +0000 (11:01 +0100)]
pci: cleanup MSI interrupts before removing device from IOMMU
Doing the MSI cleanup after removing the device from the IOMMU leads
to the following panic on AMD hardware:
Assertion 'table.ptr && (index < intremap_table_entries(table.ptr, iommu))' failed at iommu_intr.c:172
----[ Xen-4.13.1-10.0.3-d x86_64 debug=y Not tainted ]----
CPU: 3
RIP: e008:[<ffff82d08026ae3c>] drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
[...]
Xen call trace:
[<ffff82d08026ae3c>] R drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
[<ffff82d08026af25>] F drivers/passthrough/amd/iommu_intr.c#update_intremap_entry_from_msi_msg+0xc0/0x342
[<ffff82d08026ba65>] F amd_iommu_msi_msg_update_ire+0x98/0x129
[<ffff82d08025dd36>] F iommu_update_ire_from_msi+0x1e/0x21
[<ffff82d080286862>] F msi_free_irq+0x55/0x1a0
[<ffff82d080286f25>] F pci_cleanup_msi+0x8c/0xb0
[<ffff82d08025cf52>] F pci_remove_device+0x1af/0x2da
[<ffff82d0802a42d1>] F do_physdev_op+0xd18/0x1187
[<ffff82d080383925>] F pv_hypercall+0x1f5/0x567
[<ffff82d08038a432>] F lstar_enter+0x112/0x120
That's because the call to iommu_remove_device on AMD hardware will
remove the per-device interrupt remapping table, and hence the call to
pci_cleanup_msi done afterwards will find a null intremap table and
crash.
Reorder the calls so that MSI interrupts are torn down before removing
the device from the IOMMU.
Jan Beulich [Wed, 4 Nov 2020 10:01:02 +0000 (11:01 +0100)]
build: use if_changed more consistently (and correctly) for prelink*.o
Switch to $(call if_changed,ld) where possible; presumably not doing so
in e321576f4047 ("xen/build: start using if_changed") right away was an
oversight, as it did for Arm in (just) one case. It failed to add
prelink.o to $(targets), though, causing - judging from the observed
behavior on x86 - undue rebuilds of the final binary (because of
prelink.o getting rebuild for $(cmd_prelink.o) being empty, in turn
because of .prelink.o.cmd not getting read) during "make install-xen".
Ian Jackson [Wed, 4 Nov 2020 08:35:16 +0000 (09:35 +0100)]
SUPPORT.md: Desupport qemu trad except stub dm
While investigating XSA-335 we discovered that many upstream security
fixes were missing. It is not practical to backport them. There is
no good reason to be running this very ancient version of qemu, except
that it is the only way to run a stub dm which is currently supported
by upstream.
Andrew Cooper [Mon, 19 Oct 2020 14:51:22 +0000 (15:51 +0100)]
x86/pv: Flush TLB in response to paging structure changes
With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.
However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.
Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings. Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.
For all L2 and higher modifications, flush the TLB. PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global. (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)
Express the necessary flushes as a set of booleans which accumulate across the
operation. Comment the flushing logic extensively.
This is XSA-286.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
Andrew Cooper [Thu, 22 Oct 2020 10:28:58 +0000 (11:28 +0100)]
x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI
c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().
However, this was unnecessary.
It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way. In this
case, an MMUEXT_OP hypercall will follow. The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.
There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.
Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed. Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes. Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.
Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not. Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.
This is (not really) XSA-286 (but necessary to simplify the logic).
Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
Igor Druzhinin [Tue, 20 Oct 2020 12:46:20 +0000 (14:46 +0200)]
hvmloader: flip "ACPI data" to "ACPI NVS" type for ACPI table region
ACPI specification contains statements describing memory marked with regular
"ACPI data" type as reclaimable by the guest. Although the guest shouldn't
really do it if it wants kexec or similar functionality to work, there
could still be ambiguities in treating these regions as potentially regular
RAM.
One such example is SeaBIOS which currently reports "ACPI data" regions as
RAM to the guest in its e801 call. Which it might have the right to do as any
user of this is expected to be ACPI unaware. But a QEMU bootloader later seems
to ignore that fact and is instead using e801 to find a place for initrd which
causes the tables to be erased. While arguably QEMU bootloader or SeaBIOS need
to be fixed / improved here, that is just one example of the potential problems
from using a reclaimable memory type.
Flip the type to "ACPI NVS" which doesn't have this ambiguity in it and is
described by the spec as non-reclaimable (so cannot ever be treated like RAM).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: de6d188a519f9e3b7a1acc7784adf4c243865f9a
master date: 2020-10-20 08:54:23 +0200
Andrew Cooper [Tue, 20 Oct 2020 12:45:56 +0000 (14:45 +0200)]
x86/smpboot: Don't unconditionally call memguard_guard_stack() in cpu_smpboot_alloc()
cpu_smpboot_alloc() is designed to be idempotent with respect to partially
initialised state. This occurs for S3 and CPU parking, where enough state to
handle NMIs/#MCs needs to remain valid for the entire lifetime of Xen, even
when we otherwise want to offline the CPU.
For simplicity between various configuration, Xen always uses shadow stack
mappings (Read-only + Dirty) for the guard page, irrespective of whether
CET-SS is enabled.
Unfortunately, the CET-SS changes in memguard_guard_stack() broke idempotency
by first writing out the supervisor shadow stack tokens with plain writes,
then changing the mapping to being read-only.
This ordering is strictly necessary to configure the BSP, which cannot have
the supervisor tokens be written with WRSS.
Instead of calling memguard_guard_stack() unconditionally, call it only when
actually allocating a new stack. Xenheap allocates are guaranteed to be
writeable, and the net result is idempotency WRT configuring stack_base[].
Fixes: 91d26ed304f ("x86/shstk: Create shadow stacks") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a7952a320c1e202a218702bfdb14f75132f04894
master date: 2020-10-16 11:56:59 +0100
Andrew Cooper [Tue, 20 Oct 2020 12:45:31 +0000 (14:45 +0200)]
x86/traps: 'Fix' safety of read_registers() in #DF path
All interrupts and exceptions pass a struct cpu_user_regs up into C. This
contains the legacy vm86 fields from 32bit days, which are beyond the
hardware-pushed frame.
Accessing these fields is generally illegal, as they are logically out of
bounds for anything other than an interrupt/exception hitting ring1/3 code.
show_registers() unconditionally reads these fields, but the content is
discarded before use. This is benign right now, as all parts of the stack are
readable, including the guard pages.
However, read_registers() in the #DF handler writes to these fields as part of
preparing the state dump, and being IST, hits the adjacent stack frame.
This has been broken forever, but c/s 6001660473 "x86/shstk: Rework the stack
layout to support shadow stacks" repositioned the #DF stack to be adjacent to
the guard page, which turns this OoB write into a fatal pagefault:
(XEN) *** DOUBLE FAULT ***
(XEN) ----[ Xen-4.15-unstable x86_64 debug=y Tainted: C ]----
(XEN) ----[ Xen-4.15-unstable x86_64 debug=y Tainted: C ]----
(XEN) CPU: 4
(XEN) RIP: e008:[<ffff82d04031fd4f>] traps.c#read_registers+0x29/0xc1
(XEN) RFLAGS: 0000000000050086 CONTEXT: hypervisor (d1v0)
...
(XEN) Xen call trace:
(XEN) [<ffff82d04031fd4f>] R traps.c#read_registers+0x29/0xc1
(XEN) [<ffff82d0403207b3>] F do_double_fault+0x3d/0x7e
(XEN) [<ffff82d04039acd7>] F double_fault+0x107/0x110
(XEN)
(XEN) Pagetable walk from ffff830236f6d008:
(XEN) L4[0x106] = 80000000bfa9b063ffffffffffffffff
(XEN) L3[0x008] = 0000000236ffd063ffffffffffffffff
(XEN) L2[0x1b7] = 0000000236ffc063ffffffffffffffff
(XEN) L1[0x16d] = 8000000236f6d161ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 4:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0003]
(XEN) Faulting linear address: ffff830236f6d008
(XEN) ****************************************
(XEN)
and rendering the main #DF analysis broken.
The proper fix is to delete cpu_user_regs.es and later, so no
interrupt/exception path can access OoB, but this needs disentangling from the
PV ABI first.
Not-really-fixes: 6001660473 ("x86/shstk: Rework the stack layout to support shadow stacks") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6065a05adf152a556fb9f11a5218c89e41b62893
master date: 2020-10-16 11:55:33 +0100
Chen Yu [Tue, 20 Oct 2020 12:44:59 +0000 (14:44 +0200)]
x86/mwait-idle: customize IceLake server support
On ICX platform, the C1E auto-promotion is enabled by default.
As a result, the CPU might fall into C1E more offen than previous
platforms. So disable C1E auto-promotion and expose C1E as a separate
idle state.
Beside C1 and C1E, the exit latency of C6 was measured
by a dedicated tool. However the exit latency(41us) exposed
by _CST is much smaller than the one we measured(128us). This
is probably due to the _CST uses the exit latency when woken
up from PC0+C6, rather than PC6+C6 when C6 was measured. Choose
the latter as we need the longest latency in theory.
Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
[Linux commit a472ad2bcea479ba068880125d7273fc95c14b70] Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 44ac57af81ff8097e228895738b911ca819bda19
master date: 2020-10-15 12:29:11 +0200
Jan Beulich [Tue, 20 Oct 2020 12:44:36 +0000 (14:44 +0200)]
x86: fix resource leaks on arch_vcpu_create() error path
{hvm,pv}_vcpu_initialise() have always kind of been meant to be the
final possible source of errors in arch_vcpu_create(), hence not
requiring any unrolling of what they've done on the error path. (Of
course this may change once the various involved paths all have become
idempotent.)
But even beyond this aspect I think it is more logical to do policy
initialization ahead of the calling of these two functions, as they may
in principle want to access it.
Fixes: 4187f79dc718 ("x86/msr: introduce struct msr_vcpu_policy") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6a34e67c118408ebdd62bfa7be76598ca040f170
master date: 2020-10-14 14:03:38 +0200
Andrew Cooper [Tue, 20 Oct 2020 12:43:40 +0000 (14:43 +0200)]
x86/S3: Restore CR4 earlier during resume
c/s 4304ff420e5 "x86/S3: Drop {save,restore}_rest_processor_state()
completely" moved CR4 restoration up into C, to account for the fact that MCE
was explicitly handled later.
However, time_resume() ends up making an EFI Runtime Service call, and EFI
explodes without OSFXSR, presumably when trying to spill %xmm registers onto
the stack.
Given this codepath, and the potential for other issues of a similar kind (TLB
flushing vs INVPCID, HVM logic vs VMXE, etc), restore CR4 in asm before
entering C.
Ignore the previous MCE special case, because its not actually necessary. The
handler is already suitably configured from before suspend.
Fixes: 4304ff420e5 ("x86/S3: Drop {save,restore}_rest_processor_state() completely") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: 7f66c0dc41ae5f770c614e516810eb1f336e2470
master date: 2020-10-06 12:28:37 +0100
Jan Beulich [Tue, 20 Oct 2020 12:42:52 +0000 (14:42 +0200)]
evtchn/fifo: use stable fields when recording "last queue" information
Both evtchn->priority and evtchn->notify_vcpu_id could change behind the
back of evtchn_fifo_set_pending(), as for it - in the case of
interdomain channels - only the remote side's per-channel lock is held.
Neither the queue's priority nor the vCPU's vcpu_id fields have similar
properties, so they seem better suited for the purpose. In particular
they reflect the respective evtchn fields' values at the time they were
used to determine queue and vCPU.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 6f6f07b64cbe90e54f8e62b4d6f2404cf5306536
master date: 2020-10-02 08:37:35 +0200