Roger Pau Monné [Mon, 25 Nov 2024 11:18:38 +0000 (12:18 +0100)]
x86/mm: skip super-page alignment checks for non-present entries
INVALID_MFN is ~0, so by it having all bits as 1s it doesn't fulfill the
super-page address alignment checks for L3 and L2 entries. Skip the alignment
checks if the new entry is a non-present one.
This fixes a regression introduced by 0b6b51a69f4d, where the switch from 0 to
INVALID_MFN caused all super-pages to be shattered when attempting to remove
mappings by passing INVALID_MFN instead of 0.
Fixes: 0b6b51a69f4d ('xen/mm: Switch map_pages_to_xen to use MFN typesafe') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/mm: fix alignment check for non-present entries
While the alignment of the mfn is not relevant for non-present entries, the
alignment of the linear address is. Commit 5b52e1b0436f introduced a
regression by not checking the alignment of the linear address when the new
entry was a non-present one.
Fix by always checking the alignment of the linear address, non-present entries
must just skip the alignment check of the physical address.
Fixes: 5b52e1b0436f ('x86/mm: skip super-page alignment checks for non-present entries') Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5b52e1b0436f4adb784562f4d05ae67605ce8ce7
master date: 2024-11-14 16:12:35 +0100
master commit: b1ebb6461a027f07e4a844cae348fbd9cfabe984
master date: 2024-11-15 14:14:12 +0100
Jan Beulich [Mon, 25 Nov 2024 11:18:09 +0000 (12:18 +0100)]
x86emul: avoid double memory read for RORX
Originally only twobyte_table[0x3a] determined what part of generic
operand fetching (near the top of x86_emulate()) comes into play. When
ext0f3a_table[] was added, ->desc was updated to properly describe the
ModR/M byte's function. With that generic source operand fetching came
into play for RORX, rendering the explicit fetching in the respective
case block redundant (and wrong at the very least when MMIO with side
effects is accessed).
While there also make a purely cosmetic / documentary adjustment to
ext0f3a_table[]: RORX really is a 2-operand insn, MOV-like in that it
only writes its destination register.
Fixes: 9f7f5f6bc95b ("x86emul: add tables for 0f38 and 0f3a extension space") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 939a9e800c4156677c10c6cf08fde071e9b86eaf
master date: 2024-11-14 13:03:18 +0100
Jan Beulich [Mon, 25 Nov 2024 11:17:49 +0000 (12:17 +0100)]
x86emul: ignore VEX.W for BMI{1,2} insns in 32-bit mode
While result values and other status flags are unaffected as long as we
can ignore the case of registers having their upper 32 bits non-zero
outside of 64-bit mode, EFLAGS.SF may obtain a wrong value when we
mistakenly re-execute the original insn with VEX.W set.
Note that guest the memory access, if any, is correctly carried out as
32-bit regardless of VEX.W. The emulator-local memory operand will be
accessed as a 64-bit quantity, but it is pre-initialised to zero so no
internal state can leak.
Fixes: 771daacd197a ("x86emul: support BMI1 insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1179d51dcb7d93111bfb35172c75eb5a73fe6a43
master date: 2024-11-14 13:00:57 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:16:34 +0000 (12:16 +0100)]
x86/cpu-policy: Extend the guest max policy max leaf/subleaves
We already have one migration case opencoded (feat.max_subleaf). A more
recent discovery is that we advertise x2APIC to guests without ensuring that
we provide max_leaf >= 0xb.
In general, any leaf known to Xen can be safely configured by the toolstack if
it doesn't violate other constraints.
Therefore, introduce guest_common_{max,default}_leaves() to generalise the
special case we currently have for feat.max_subleaf, in preparation to be able
to provide x2APIC topology in leaf 0xb even on older hardware.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: fa2d8318033e468a4ded1fc3d721dc3e019e449b
master date: 2024-10-30 17:34:32 +0000
Roger Pau Monné [Mon, 25 Nov 2024 11:16:09 +0000 (12:16 +0100)]
x86/alternatives: do not BUG during apply
alternatives is used both at boot time, and when loading livepatch payloads.
While for the former it makes sense to panic, it's not useful for the later, as
for livepatches it's possible to fail to load the livepatch if alternatives
cannot be resolved and continue operating normally.
Relax the BUGs in _apply_alternatives() to instead return an error code. The
caller will figure out whether the failures are fatal and panic.
Print an error message to provide some user-readable information about what
went wrong.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: aa5a06d5d6eda291afa3849ab143ffcd69b0d4e6
master date: 2024-09-26 14:18:03 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:15:56 +0000 (12:15 +0100)]
xen/livepatch: do Xen build-id check earlier
The check against the expected Xen build ID should be done ahead of attempting
to apply the alternatives contained in the livepatch.
If the CPUID in the alternatives patching data is out of the scope of the
running Xen featureset the BUG() in _apply_alternatives() will trigger thus
bringing the system down. Note the layout of struct alt_instr could also
change between versions. It's also possible for struct exception_table_entry
to have changed format, hence leading to other kind of errors if parsing of the
payload is done ahead of checking if the Xen build-id matches.
Move the Xen build ID check as early as possible. To do so introduce a new
check_xen_buildid() function that parses and checks the Xen build-id before
moving the payload. Since the expected Xen build-id is used early to
detect whether the livepatch payload could be loaded, there's no reason to
store it in the payload struct, as a non-matching Xen build-id won't get the
payload populated in the first place.
Note printing the expected Xen build ID has part of dumping the payload
information is no longer done: all loaded payloads would have Xen build IDs
matching the running Xen, otherwise they would have failed to load.
Fixes: 879615f5db1d ('livepatch: Always check hypervisor build ID upon livepatch upload') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fa49f4be413cdabde5b2264fc85d2710b15ea691
master date: 2024-09-26 14:18:03 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:15:29 +0000 (12:15 +0100)]
xen/livepatch: simplify and unify logic in prepare_payload()
The following sections: .note.gnu.build-id, .livepatch.xen_depends and
.livepatch.depends are mandatory and ensured to be present by
check_special_sections() before prepare_payload() is called.
Simplify the logic in prepare_payload() by introducing a generic function to
parse the sections that contain a buildid. Note the function assumes the
buildid related section to always be present.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 86d09d16dd74298b19a03df492d9503f20cfc17c
master date: 2024-09-26 14:18:03 +0100
Roger Pau Monné [Mon, 25 Nov 2024 11:15:00 +0000 (12:15 +0100)]
xen/livepatch: drop load_addr Elf section field
The Elf loading logic will initially use the `data` section field to stash a
pointer to the temporary loaded data (from the buffer allocated in
livepatch_upload(), which is later relocated and the new pointer stashed in
`load_addr`.
Remove this dual field usage and use an `addr` uniformly. Initially data will
point to the temporary buffer, until relocation happens, at which point the
pointer will be updated to the relocated address.
This avoids leaving a dangling pointer in the `data` field once the temporary
buffer is freed by livepatch_upload().
Note the `addr` field cannot retain the const attribute from the previous
`data`field, as there's logic that performs manipulations against the loaded
sections, like applying relocations or sorting the exception table.
No functional change intended.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 8c81423038f17f6cbc853dd35d69d50a4458f764
master date: 2024-09-26 14:18:03 +0100
Frediano Ziglio [Mon, 25 Nov 2024 11:14:13 +0000 (12:14 +0100)]
x86/boot: Preserve the value clobbered by the load-base calculation
Right now, Xen clobbers the value at 0xffc when performing it's load-base
calculation. We've got plenty of free registers at this point, so the value
can be preserved easily.
This fixes a real bug booting under Coreboot+SeaBIOS, where 0xffc happens to
be the cbmem pointer (e.g. Coreboot's dmesg ring, among other things).
However, there's also a better choice of memory location to use than 0xffc, as
all our supported boot protocols have a pointer to an info structure in %ebx.
Update the documentation to match.
Fixes: 1695e53851e5 ("x86/boot: Fix the boot time relocation calculations") Fixes: d96bb172e8c9 ("x86/entry: Early PVH boot code") Signed-off-by: Frediano Ziglio <frediano.ziglio@cloud.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e58a2858d588ed57ca13200f3d3148d78ad0e491
master date: 2024-08-27 18:08:19 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:13:47 +0000 (12:13 +0100)]
tools/ocaml: Fix the version embedded in META files
Xen 4.1 is more than a decade stale now. Use the same mechanism as elsewhere
in the tree to get the current version number.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Edwin Török <edwin.torok@cloud.com>
master commit: 1965e9a930740b37637ac450f4752fd53edf63c4
master date: 2024-08-23 15:02:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:13:28 +0000 (12:13 +0100)]
tools/ocaml: Drop the OCAMLOPTFLAG_G invocation
These days, `ocamlopt -h` asks you whether you meant --help instead, meaning
that the $(shell ) invocation here isn't going end up containing '-g'.
Make it unconditional, like it is in OCAMLCFLAGS already.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@cloud.com> Reviewed-by: Edwin Török <edwin.torok@cloud.com>
master commit: 126293eae6485089471ebdfd91fe944a0274e613
master date: 2024-08-23 15:02:27 +0100
The current logic to chose the preferred reboot method is based on the mode Xen
has been booted into, so if the box is booted from UEFI, the preferred reboot
method will be to use the ResetSystem() run time service call.
However, that method seems to be widely untested, and quite often leads to a
result similar to:
****************************************
Panic on CPU 0:
FATAL TRAP: vector = 6 (invalid opcode)
****************************************
Which in most cases does lead to a reboot, however that's unreliable.
Change the default reboot preference to prefer ACPI over UEFI if available and
not in reduced hardware mode.
This is in line to what Linux does, so it's unlikely to cause issues on current
and future hardware, since there's a much higher chance of vendors testing
hardware with Linux rather than Xen.
Add a special case for one Acer model that does require being rebooted using
ResetSystem(). See Linux commit 0082517fa4bce for rationale.
x86/viridian: Clarify some viridian logging strings
It's sadically misleading to show an error without letters and expect
the dmesg reader to understand it's in hex. The patch adds a 0x prefix
to all hex numbers that don't already have it.
On the one instance in which a boolean is printed as an integer, print
it as a decimal integer instead so it's 0/1 in the common case and not
misleading if it's ever not just that due to a bug.
While at it, rename VIRIDIAN CRASH to VIRIDIAN GUEST_CRASH. Every member
of a support team that looks at the message systematically believes
"viridian" crashed, which is absolutely not what goes on. It's the guest
asking the hypervisor for a sudden shutdown because it crashed, and
stating why.
Andrew Cooper [Mon, 25 Nov 2024 11:11:04 +0000 (12:11 +0100)]
tools/libxs: Stop playing with SIGPIPE
It's very rude for a library to play with signals behind the back of the
application, no matter ones views on the default behaviour of SIGPIPE under
POSIX. Even if the application doesn't care about the xenstored socket, it my
care about others.
This logic has existed since xenstore/xenstored was originally added in commit 29c9e570b1ed ("Add xenstore daemon and library") in 2005.
It's also unnecessary. Pass MSG_NOSIGNAL when talking to xenstored over a
pipe (to avoid sucumbing to SIGPIPE if xenstored has crashed), and forgo any
playing with the signal disposition.
This has a side benefit of saving 2 syscalls per xenstore request.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: a17b6db9b00784b409c35e3017dc45aed1ec2bfb
master date: 2024-07-23 15:11:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:10:34 +0000 (12:10 +0100)]
tools/libxs: Use writev()/sendmsg() instead of write()
With the input data now conveniently arranged, use writev()/sendmsg() instead
of decomposing it into write() calls.
This causes all requests to be submitted with a single system call, rather
than at least two. While in principle short writes can occur, the chances of
it happening are slim given that most xenbus comms are only a handful of
bytes.
Nevertheless, provide {writev,sendmsg}_exact() wrappers which take care of
resubmitting on EINTR or short write.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: ebaeb0c64a6d363313e213eb9995f48307604ebb
master date: 2024-07-23 15:11:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:10:07 +0000 (12:10 +0100)]
tools/libxs: Rework xs_talkv() to take xsd_sockmsg within the iovec
We would like to writev() the whole outgoing message, but this is hard given
the current need to prepend the locally-constructed xsd_sockmsg.
Instead, have the caller provide xsd_sockmsg in iovec[0]. This in turn drops
the 't' and 'type' parameters from xs_talkv().
Note that xs_talkv() may alter the iovec structure. This may happen when
writev() is really used under the covers, and it's preferable to having the
lower levels need to duplicate the iovec to edit it upon encountering a short
write. xs_directory_part() is the only function impacted by this, and it's
easy to rearrange to be compatible.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com>
master commit: e2a93bed8b9e0f0c4779dcd4b10dc7ba2a959fbc
master date: 2024-07-23 15:11:27 +0100
Andrew Cooper [Mon, 25 Nov 2024 11:09:51 +0000 (12:09 +0100)]
tools/libxs: Fix length check in xs_talkv()
If the sum of iov element lengths overflows, the XENSTORE_PAYLOAD_MAX can
pass, after which we'll write 4G of data with a good-looking length field, and
the remainder of the payload will be interpreted as subsequent commands.
Check each iov element length for XENSTORE_PAYLOAD_MAX before accmulating it.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jason.andryuk@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 42db2deb5e7617f0459b68cd73ab503938356186
master date: 2024-07-23 15:11:27 +0100
Matthew Barnes [Mon, 25 Nov 2024 11:09:37 +0000 (12:09 +0100)]
tools/misc: xen-hvmcrash: Inject #DF instead of overwriting RIP
xen-hvmcrash would previously save records, overwrite the instruction
pointer with a bogus value, and then restore them to crash a domain
just enough to cause the guest OS to memdump.
This approach is found to be unreliable when tested on a guest running
Windows 10 x64, with some executions doing nothing at all.
Another approach would be to trigger NMIs. This approach is found to be
unreliable when tested on Linux (Ubuntu 22.04), as Linux will ignore
NMIs if it is not configured to handle such.
Injecting a double fault abort to all vCPUs is found to be more
reliable at crashing and invoking memdumps from Windows and Linux
domains.
This patch modifies the xen-hvmcrash tool to inject #DF to all vCPUs
belonging to the specified domain, instead of overwriting RIP.
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e42e4d8c3e2c391e8ce990e0a2c76d6e9a17aad6
master date: 2024-07-22 10:50:03 +0100
Roger Pau Monné [Tue, 12 Nov 2024 12:54:56 +0000 (13:54 +0100)]
xen/x86: prevent addition of .note.gnu.property if livepatch is enabled
GNU assembly that supports such feature will unconditionally add a
.note.gnu.property section to object files. The content of that section can
change depending on the generated instructions. The current logic in
livepatch-build-tools doesn't know how to deal with such section changing
as a result of applying a patch and rebuilding.
Since .note.gnu.property is not consumed by the Xen build, suppress its
addition when livepatch support is enabled.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 718400a54dcfcc8a11958a6d953168f50944f002
master date: 2024-11-11 13:19:45 +0100
Roger Pau Monné [Tue, 12 Nov 2024 12:54:41 +0000 (13:54 +0100)]
x86/io-apic: fix directed EOI when using AMD-Vi interrupt remapping
When using AMD-Vi interrupt remapping the vector field in the IO-APIC RTE is
repurposed to contain part of the offset into the remapping table. Previous to 2ca9fbd739b8 Xen had logic so that the offset into the interrupt remapping
table would match the vector. Such logic was mandatory for end of interrupt to
work, since the vector field (even when not containing a vector) is used by the
IO-APIC to find for which pin the EOI must be performed.
A simple solution wold be to read the IO-APIC RTE each time an EOI is to be
performed, so the raw value of the vector field can be obtained. However
that's likely to perform poorly. Instead introduce a cache to store the
EOI handles when using interrupt remapping, so that the IO-APIC driver can
translate pins into EOI handles without having to read the IO-APIC RTE entry.
Note that to simplify the logic such cache is used unconditionally when
interrupt remapping is enabled, even if strictly it would only be required
for AMD-Vi.
Reported-by: Willi Junga <xenproject@ymy.be> Suggested-by: David Woodhouse <dwmw@amazon.co.uk> Fixes: 2ca9fbd739b8 ('AMD IOMMU: allocate IRTE entries instead of using a static mapping') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 86001b3970fea4536048607ea6e12541736c48e1
master date: 2024-11-05 10:36:53 +0000
Andrew Cooper [Tue, 12 Nov 2024 12:53:40 +0000 (13:53 +0100)]
x86/hvm: Simplify stdvga_mem_accept() further
stdvga_mem_accept() is called on almost all IO emulations, and the
overwhelming likely answer is to reject the ioreq. Simply rearranging the
expression yields an improvement:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-57 (-57)
Function old new delta
stdvga_mem_accept 109 52 -57
which is best explained looking at the disassembly:
Before: After:
f3 0f 1e fa endbr64 f3 0f 1e fa endbr64
0f b6 4e 1e movzbl 0x1e(%rsi),%ecx | 0f b6 46 1e movzbl 0x1e(%rsi),%eax
48 8b 16 mov (%rsi),%rdx | 31 d2 xor %edx,%edx
f6 c1 40 test $0x40,%cl | a8 30 test $0x30,%al
75 38 jne <stdvga_mem_accept+0x48> | 75 23 jne <stdvga_mem_accept+0x31>
31 c0 xor %eax,%eax <
48 81 fa ff ff 09 00 cmp $0x9ffff,%rdx <
76 26 jbe <stdvga_mem_accept+0x41> <
8b 46 14 mov 0x14(%rsi),%eax <
8b 7e 10 mov 0x10(%rsi),%edi <
48 0f af c7 imul %rdi,%rax <
48 8d 54 02 ff lea -0x1(%rdx,%rax,1),%rdx <
31 c0 xor %eax,%eax <
48 81 fa ff ff 0b 00 cmp $0xbffff,%rdx <
77 0c ja <stdvga_mem_accept+0x41> <
83 e1 30 and $0x30,%ecx <
75 07 jne <stdvga_mem_accept+0x41> <
83 7e 10 01 cmpl $0x1,0x10(%rsi) 83 7e 10 01 cmpl $0x1,0x10(%rsi)
0f 94 c0 sete %al | 75 1d jne <stdvga_mem_accept+0x31>
c3 ret | 48 8b 0e mov (%rsi),%rcx
66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1) | 48 81 f9 ff ff 09 00 cmp $0x9ffff,%rcx
8b 46 10 mov 0x10(%rsi),%eax | 76 11 jbe <stdvga_mem_accept+0x31>
8b 7e 14 mov 0x14(%rsi),%edi | 8b 46 14 mov 0x14(%rsi),%eax
49 89 d0 mov %rdx,%r8 | 48 8d 44 01 ff lea -0x1(%rcx,%rax,1),%rax
48 83 e8 01 sub $0x1,%rax | 48 3d ff ff 0b 00 cmp $0xbffff,%rax
48 8d 54 3a ff lea -0x1(%rdx,%rdi,1),%rdx | 0f 96 c2 setbe %dl
48 0f af c7 imul %rdi,%rax | 89 d0 mov %edx,%eax
49 29 c0 sub %rax,%r8 <
31 c0 xor %eax,%eax <
49 81 f8 ff ff 09 00 cmp $0x9ffff,%r8 <
77 be ja <stdvga_mem_accept+0x2a> <
c3 ret c3 ret
By moving the "p->count != 1" check ahead of the
ioreq_mmio_{first,last}_byte() calls, both multiplies disappear along with a
lot of surrounding logic.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 08ffd8705d36c7c445df3ecee8ad9b8f8d65fbe0)
Jan Beulich [Tue, 12 Nov 2024 12:53:24 +0000 (13:53 +0100)]
x86/HVM: drop stdvga's "lock" struct member
No state is left to protect. It being the last field, drop the struct
itself as well. Similarly for then ending up empty, drop the .complete
handler.
This is part of XSA-463 / CVE-2024-45818
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b180a50326c8a2c171f37c1940a0fbbdcad4be90)
Jan Beulich [Tue, 12 Nov 2024 12:53:03 +0000 (13:53 +0100)]
x86/HVM: drop stdvga's "vram_page[]" struct member
No uses are left, hence its setup, teardown, and the field itself can
also go away. stdvga_deinit() is then empty and can be dropped as well.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 3beb4baf2a0a2eef40d39eb7e6eecbfd36da5d14)
Jan Beulich [Tue, 12 Nov 2024 12:52:46 +0000 (13:52 +0100)]
x86/HVM: drop stdvga's "{g,s}r_index" struct members
No consumers are left, hence the producer and the fields themselves can
also go away. stdvga_outb() is then useless, rendering stdvga_out()
useless as well. Hence the entire I/O port intercept can go away.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 86c03372e107f5c18266a62281663861b1144929)
Jan Beulich [Tue, 12 Nov 2024 12:52:28 +0000 (13:52 +0100)]
x86/HVM: drop stdvga's "sr[]" struct member
No consumers are left, hence the producer and the array itself can also
go away. The static sr_mask[] is then orphaned and hence needs dropping,
too.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 7aba44bdd78aedb97703811948c3b69ccff85032)
Jan Beulich [Tue, 12 Nov 2024 12:52:08 +0000 (13:52 +0100)]
x86/HVM: drop stdvga's "gr[]" struct member
No consumers are left, hence the producer and the array itself can also
go away. The static gr_mask[] is then orphaned and hence needs dropping,
too.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b16c0966a17f19c0e55ed0b9baa28191d2590178)
Jan Beulich [Tue, 12 Nov 2024 12:51:51 +0000 (13:51 +0100)]
x86/HVM: remove unused MMIO handling code
All read accesses are rejected by the ->accept handler, while writes
bypass the bulk of the function body. Drop the dead code, leaving an
assertion in the read handler.
A number of other static items (and a macro) are then unreferenced and
hence also need (want) dropping. The same applies to the "latch" field
of the state structure.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 89108547af1f230b72893b48351f9c1106189649)
Jan Beulich [Tue, 12 Nov 2024 12:51:30 +0000 (13:51 +0100)]
x86/HVM: drop stdvga's "stdvga" struct member
Two of its consumers are dead (in compile-time constant conditionals)
and the only remaining ones are merely controlling debug logging. Hence
the field is now pointless to set, which in particular allows to get rid
of the questionable conditional from which the field's value was
established (afaict 551ceee97513 ["x86, hvm: stdvga cache always on"]
had dropped too much of the earlier extra check that was there, and
quite likely further checks were missing).
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b740a9369e81bdda675a9780130ce2b9e75d4ec9)
Jan Beulich [Tue, 12 Nov 2024 12:50:54 +0000 (13:50 +0100)]
x86/HVM: drop stdvga's "cache" struct member
Since 68e1183411be ("libxc: introduce a xc_dom_arch for hvm-3.0-x86_32
guests"), HVM guests are built using XEN_DOMCTL_sethvmcontext, which
ends up disabling stdvga caching because of arch_hvm_load() being
involved in the processing of the request. With that the field is
useless, and can be dropped. Drop the helper functions manipulating /
checking as well right away, but leave the use sites of
stdvga_cache_is_enabled() with the hard-coded result the function would
have produced, to aid validation of subsequent dropping of further code.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 53b7246bdfb3c280adcdf714918e4decb7e108f4)
Andrew Cooper [Sat, 13 Jul 2024 16:50:30 +0000 (17:50 +0100)]
CI: Stop building QEMU in general
We spend an awful lot of CI time building QEMU, even though most changes don't
touch the subset of tools/libs/ used by QEMU. Some numbers taken at a time
when CI was otherwise quiet:
With Without
Alpine: 13m38s 6m04s
Debian 12: 10m05s 8m10s
OpenSUSE Tumbleweed: 11m40s 7m54s
Ubuntu 24.04: 14m56s 8m06s
which is a >50% improvement in wallclock time in some cases.
The only build we have that needs QEMU is alpine-3.18-gcc-debug. This is the
build deployed and used by the QubesOS ADL-* and Zen3p-* jobs.
Xilinx-x86_64 deploys it too, but is PVH-only and doesn't use QEMU.
QEMU is also built by CirrusCI for FreeBSD (fully Clang/LLVM toolchain).
This should help quite a lot with Gitlab CI capacity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit e305256e69b1c943db3ca20173da6ded3db2d252)
Andrew Cooper [Wed, 10 Jul 2024 12:38:52 +0000 (13:38 +0100)]
CI: Mark Archlinux/x86 as allowing failures
Archlinux is a rolling distro. As a consequence, rebuilding the container
periodically changes the toolchain, and this affects all stable branches in
one go.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
(cherry picked from commit 5e1773dc863d6e1fb4c1398e380bdfc754342f7b)
Daniel P. Smith [Tue, 29 Oct 2024 15:42:29 +0000 (16:42 +0100)]
x86/boot: Fix XSM module handling during PVH boot
As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.
This pattern has been in use since before PVH support was added. This has
most likely gone unnoticed because no-one's tried using a detached Flask
policy in a PVH VM before.
Plumb the boot_info pointer down, replacing module_map and mbi. Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.
As this is the final non-bi use of mbi in __start_xen(), make the pointer
unusable once bi has been established, to prevent new uses creeping back in.
This is a stopgap until mbi can be fully removed.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6cf0aaeb8df951fb34679f0408461a5c67cb02c6
master date: 2024-10-23 18:14:24 +0100
Daniel P. Smith [Tue, 29 Oct 2024 15:42:16 +0000 (16:42 +0100)]
x86/boot: Fix microcode module handling during PVH boot
As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.
This pattern has been in use since before PVH support was added. Inside a PVH
VM, it will go unnoticed as long as the microcode container parser doesn't
choke on the random data it finds.
The use within early_microcode_init() happens to be safe because it's prior to
move_xen(). microcode_init_cache() is after move_xen(), and therefore unsafe.
Plumb the boot_info pointer down, replacing module_map and mbi. Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.
Note: microcode_scan_module() is still bogusly stashing a bootstrap_map()'d
pointer in ucode_blob.data, which constitutes a different
use-after-free, and only works in general because of a second bug. This
is unrelated to PVH, and needs untangling differently.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ddf63a252a6eae6e619ba2df9ad6b6f82e660c1
master date: 2024-10-23 18:14:24 +0100
Roger Pau Monné [Tue, 29 Oct 2024 15:41:42 +0000 (16:41 +0100)]
iommu/amd-vi: do not error if device referenced in IVMD is not behind any IOMMU
IVMD table contains restrictions about memory which must be mandatory assigned
to devices (and which permissions it should use), or memory that should be
never accessible to devices.
Some hardware however contains ranges in IVMD that reference devices outside of
the IVHD tables (in other words, devices not behind any IOMMU). Such mismatch
will cause Xen to fail in register_range_for_device(), ultimately leading to
the IOMMU being disabled, and Xen crashing as x2APIC support might be already
enabled and relying on the IOMMU functionality.
Relax IVMD parsing: allow IVMD blocks to reference devices not assigned to any
IOMMU. It's impossible for Xen to fulfill the requirement in the IVMD block if
the device is not behind any IOMMU, but it's no worse than booting without
IOMMU support, and thus not parsing ACPI IVRS in the first place.
Andrew Cooper [Tue, 29 Oct 2024 15:41:30 +0000 (16:41 +0100)]
xen/spinlock: Fix UBSAN "load of address with insufficient space" in lock_prof_init()
UBSAN complains:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/spinlock.c:794:10
(XEN) load of address ffff82d040ae24c8 with insufficient space
(XEN) for an object of type 'struct lock_profile *'
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y ubsan=y Tainted: C ]----
This shows up with GCC-14, but not with GCC-12. I have not bisected further.
Either way, the types for __lock_profile_{start,end} are incorrect.
They are an array of struct lock_profile pointers. Correct the extern's
types, and adjust the loop to match.
No practical change.
Reported-by: Andreas Glashauser <ag@andreasglashauser.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 542ac112fc68c66cfafc577e252404c21da4f75b
master date: 2024-10-14 16:14:26 +0100
Roger Pau Monné [Tue, 29 Oct 2024 15:40:58 +0000 (16:40 +0100)]
x86/domctl: fix maximum number of MSRs in XEN_DOMCTL_{get,set}_vcpu_msrs
Since the addition of the MSR_AMD64_DR{1-4}_ADDRESS_MASK MSRs to the
msrs_to_send array, the calculations for the maximum number of MSRs that
the hypercall can handle is off by 4.
Remove the addition of 4 to the maximum number of MSRs that
XEN_DOMCTL_{set,get}_vcpu_msrs supports, as those are already part of the
array.
A further adjustment could be to subtract 4 from the maximum size if the DBEXT
CPUID feature is not exposed to the guest, but guest_{rd,wr}msr() will already
perform that check when fetching or loading the MSRs. The maximum array is
used to indicate the caller of the buffer it needs to allocate in the get case,
and as an early input sanitation in the set case, using a buffer size slightly
lager than required is not an issue.
Fixes: 86d47adcd3c4 ('x86/msr: Handle MSR_AMD64_DR{0-3}_ADDRESS_MASK in the new MSR infrastructure') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c95cd5f9c5a8c1c6ab1b0b366d829fa8561958fd
master date: 2024-10-08 14:37:53 +0200
Jan Beulich [Tue, 29 Oct 2024 15:40:46 +0000 (16:40 +0100)]
ioreq: don't wrongly claim "success" in ioreq_send_buffered()
Returning a literal number is a bad idea anyway when all other returns
use IOREQ_STATUS_* values. The function is dead on Arm, and mapping to
X86EMUL_OKAY is surely wrong on x86.
Fixes: f6bf39f84f82 ("x86/hvm: add support for broadcast of buffered ioreqs...") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 2e0b545b847df7d4feb07308d50bad708bd35a66
master date: 2024-10-08 14:36:27 +0200
Roger Pau Monné [Tue, 29 Oct 2024 15:39:43 +0000 (16:39 +0100)]
x86/dpci: do not leak pending interrupts on CPU offline
The current dpci logic relies on a softirq being executed as a side effect of
the cpu_notifier_call_chain() call in the code path that offlines the target
CPU. However the call to cpu_notifier_call_chain() won't trigger any softirq
processing, and even if it did, such processing should be done after all
interrupts have been migrated off the current CPU, otherwise new pending dpci
interrupts could still appear.
Currently the ASSERT() in the cpu callback notifier is fairly easy to trigger
by doing CPU offline from a PVH dom0.
Solve this by instead moving out any dpci interrupts pending processing once
the CPU is dead. This might introduce more latency than attempting to drain
before the CPU is put offline, but it's less complex, and CPU online/offline is
not a common action. Any extra introduced latency should be tolerable.
Fixes: f6dd295381f4 ('dpci: replace tasklet with softirq') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 29555668b5725b9d5393b72bfe7ff9a3fa606714
master date: 2024-10-07 11:10:21 +0200
Andrew Cooper [Tue, 29 Oct 2024 15:39:22 +0000 (16:39 +0100)]
stubdom: Fix newlib build with GCC-14
Based on a fix from OpenSUSE, but adjusted to be Clang-compatible too. Pass
-Wno-implicit-function-declaration library-wide rather than using local GCC
pragmas.
Fix of copy_past_newline() to avoid triggering -Wstrict-prototypes.
Andrew Cooper [Tue, 29 Oct 2024 15:38:41 +0000 (16:38 +0100)]
x86/pv: Rename pv.iobmp_limit to iobmp_nr and clarify behaviour
Ever since it's introduction in commit 013351bd7ab3 ("Define new event-channel
and physdev hypercalls") in 2006, the public interface was named nr_ports
while the internal field was called iobmp_limit.
Rename the internal field to iobmp_nr to match the public interface, and
clarify that, when nonzero, Xen will read 2 bytes.
There isn't a perfect parallel with a real TSS, but iobmp_nr being 0 is the
paravirt "no IOPB" case, and it is important that no read occurs in this case.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 633ee8b2df963f7e5cb8de1219c1a48bfb4447f6
master date: 2024-10-01 14:58:18 +0100
Andrew Cooper [Tue, 29 Oct 2024 15:38:29 +0000 (16:38 +0100)]
x86/pv: Handle #PF correctly when reading the IO permission bitmap
The switch statement in guest_io_okay() is a very expensive way of
pre-initialising x with ~0, and performing a partial read into it.
However, the logic isn't correct either.
In a real TSS, the CPU always reads two bytes (like here), and any TSS limit
violation turns silently into no-access. But, in-limit accesses trigger #PF
as usual. AMD document this property explicitly, and while Intel don't (so
far as I can tell), they do behave consistently with AMD.
Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like
everything else in this file. This removes code generation setting up
copy_from_user_hvm() (in the likely path even), and safety LFENCEs from
evaluate_nospec().
Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than
disallowing the IO port access. This brings the behaviour better in line with
normal x86.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8a6c495d725408d333c1b47bb8af44615a5bfb18
master date: 2024-10-01 14:58:18 +0100
Andrew Cooper [Tue, 29 Oct 2024 15:38:17 +0000 (16:38 +0100)]
x86/pv: Rework guest_io_okay() to return X86EMUL_*
In order to fix a bug with guest_io_okay() (subsequent patch), rework
guest_io_okay() to take in an emulation context, and return X86EMUL_* rather
than a boolean.
For the failing case, take the opportunity to inject #GP explicitly, rather
than returning X86EMUL_UNHANDLEABLE. There is a logical difference between
"we know what this is, and it's #GP", vs "we don't know what this is".
There is no change in practice as emulation is the final step on general #GP
resolution, but returning X86EMUL_UNHANDLEABLE would be a latent bug if a
subsequent action were to appear.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7429e1cc071b0e20ea9581da4893fb9b2f6d21d4
master date: 2024-10-01 14:58:18 +0100
x86/traps: Re-enable interrupts after reading cr2 in the #PF handler
Hitting a page fault clobbers %cr2, so if a page fault is handled while
handling a previous page fault then %cr2 will hold the address of the
latter fault rather than the former. In particular, if a debug key
handler happens to trigger during #PF and before %cr2 is read, and that
handler itself encounters a #PF, then %cr2 will be corrupt for the outer #PF
handler.
This patch makes the page fault path delay re-enabling IRQs until %cr2
has been read in order to ensure it stays consistent.
A similar argument holds in additional cases, but they happen to be safe:
* %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
* MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.
While in the area, remove redundant q suffix to a movq in entry.S and
the space after the comma.
Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.") Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b06e76db7c35974f1b127762683e7852ca0c8e76
master date: 2024-10-01 09:45:49 +0200
Taking a fault on a non-byte-granular insn means that the "number of
bytes not handled" return value would need extra care in calculating, if
we want callers to be able to derive e.g. exception context (to be
injected to the guest) - CR2 for #PF in particular - from the value. To
simplify things rather than complicating them, reduce inline assembly to
just byte-granular string insns. On recent CPUs that's also supposed to
be more efficient anyway.
For singular element accessors, however, alignment checks are added,
hence slightly complicating the code. Misaligned (user) buffer accesses
will now be forwarded to copy_{from,to}_guest_ll().
Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the
same way, as they're produced by mere re-processing of the same code.
Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their
comments made match reality; down the road we may want to change their
return types, e.g. to bool.
Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 67a8e5721e1ea9c28526883036bf08fb2e8a8c9c
master date: 2024-10-01 09:44:55 +0200
xen/ucode: Fix buffer under-run when parsing AMD containers
The AMD container format has no formal spec. It is, at best, precision
guesswork based on AMD's prior contributions to open source projects. The
Equivalence Table has both an explicit length, and an expectation of having a
NULL entry at the end.
Xen was sanity checking the NULL entry, but without confirming that an entry
was present, resulting in a read off the front of the buffer. With some
manual debugging/annotations this manifests as:
(XEN) *** Buf ffff83204c00b19c, eq ffff83204c00b194
(XEN) *** eq: 0c 00 00 00 44 4d 41 00 00 00 00 00 00 00 00 00 aa aa aa aa
^-Actual buffer-------------------^
(XEN) *** installed_cpu: 000c
(XEN) microcode: Bad equivalent cpu table
(XEN) Parsing microcode blob error -22
When loaded by hypercall, the 4 bytes interpreted as installed_cpu happen to
be the containing struct ucode_buf's len field, and luckily will be nonzero.
When loaded at boot, it's possible for the access to #PF if the module happens
to have been placed on a 2M boundary by the bootloader. Under Linux, it will
commonly be the end of the CPIO header.
Drop the probe of the NULL entry; Nothing else cares. A container without one
is well formed, insofar that we can still parse it correctly. With this
dropped, the same container results in:
(XEN) microcode: couldn't find any matching ucode in the provided blob!
Fixes: 4de936a38aa9 ("x86/ucode/amd: Rework parsing logic in cpu_request_microcode()") Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a8bf14f6f331d4f428010b4277b67c33f561ed19
master date: 2024-09-13 15:23:30 +0100
blkif: reconcile protocol specification with in-use implementations
Current blkif implementations (both backends and frontends) have all slight
differences about how they handle the 'sector-size' xenstore node, and how
other fields are derived from this value or hardcoded to be expressed in units
of 512 bytes.
To give some context, this is an excerpt of how different implementations use
the value in 'sector-size' as the base unit for to other fields rather than
just to set the logical sector size of the block device:
An attempt was made by 67e1c050e36b in order to change the base units of the
request fields and the xenstore 'sectors' node. That however only lead to more
confusion, as the specification now clearly diverged from the reference
implementation in Linux. Such change was only implemented for QEMU Qdisk
and Windows PV blkfront.
Partially revert to the state before 67e1c050e36b while adjusting the
documentation for 'sectors' to match what it used to be previous to 2fa701e5346d:
* Declare 'feature-large-sector-size' deprecated. Frontends should not expose
the node, backends should not make decisions based on its presence.
* Clarify that 'sectors' xenstore node and the requests fields are always in
512-byte units, like it was previous to 2fa701e5346d and 67e1c050e36b.
All base units for the fields used in the protocol are 512-byte based, the
xenbus 'sector-size' field is only used to signal the logic block size. When
'sector-size' is greater than 512, blkfront implementations must make sure that
the offsets and sizes (despite being expressed in 512-byte units) are aligned
to the logical block size specified in 'sector-size', otherwise the backend
will fail to process the requests.
This will require changes to some of the frontends and backends in order to
properly support 'sector-size' nodes greater than 512.
Fixes: 2fa701e5346d ('blkif.h: Provide more complete documentation of the blkif interface') Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 221f2748e8dabe8361b8cdfcffbeab9102c4c899
master date: 2024-09-12 14:04:56 +0200
xen/x86/pvh: handle ACPI RSDT table in PVH Dom0 build
Xen always generates an XSDT table even if the firmware only provided an
RSDT table. Copy the RSDT header from the firmware table, adjusting the
signature, for the XSDT table when not provided by the firmware.
This is necessary to run Xen on QEMU.
Fixes: 1d74282c455f ('x86: setup PVHv2 Dom0 ACPI tables') Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6e7f7a0c16c4d406bda6d4a900252ff63a7c5fad
master date: 2024-09-12 09:18:25 +0200
Jan Beulich [Tue, 24 Sep 2024 13:00:07 +0000 (15:00 +0200)]
x86/HVM: properly reject "indirect" VRAM writes
While ->count will only be different from 1 for "indirect" (data in
guest memory) accesses, it being 1 does not exclude the request being an
"indirect" one. Check both to be on the safe side, and bring the ->count
part also in line with what ioreq_send_buffered() actually refuses to
handle.
Fixes: 3bbaaec09b1b ("x86/hvm: unify stdvga mmio intercept with standard mmio intercept") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb7cd0593d88c4b967a24bca8bd30591966676cd
master date: 2024-09-12 09:13:04 +0200
Jan Beulich [Tue, 24 Sep 2024 12:59:22 +0000 (14:59 +0200)]
x86emul/test: fix build with gas 2.43
Drop explicit {evex} pseudo-prefixes. New gas (validly) complains when
they're used on things other than instructions. Our use was potentially
ahead of macro invocations - see simd.h's "override" macro.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c09288298af881ea1bb568740deb2d2a06bcd41
master date: 2024-09-06 08:41:18 +0200
Jan Beulich [Tue, 24 Sep 2024 12:58:58 +0000 (14:58 +0200)]
x86: fix UP build with gcc14
The complaint is:
In file included from ././include/xen/config.h:17,
from <command-line>:
arch/x86/smpboot.c: In function ‘link_thread_siblings.constprop’:
./include/asm-generic/percpu.h:16:51: error: array subscript [0, 0] is outside array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds=]
16 | (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
./include/xen/compiler.h:140:29: note: in definition of macro ‘RELOC_HIDE’
140 | (typeof(ptr)) (__ptr + (off)); })
| ^~~
arch/x86/smpboot.c:238:27: note: in expansion of macro ‘per_cpu’
238 | cpumask_set_cpu(cpu2, per_cpu(cpu_sibling_mask, cpu1));
| ^~~~~~~
In file included from ./arch/x86/include/generated/asm/percpu.h:1,
from ./include/xen/percpu.h:30,
from ./arch/x86/include/asm/cpuid.h:9,
from ./arch/x86/include/asm/cpufeature.h:11,
from ./arch/x86/include/asm/system.h:6,
from ./include/xen/list.h:11,
from ./include/xen/mm.h:68,
from arch/x86/smpboot.c:12:
./include/asm-generic/percpu.h:12:22: note: while referencing ‘__per_cpu_offset’
12 | extern unsigned long __per_cpu_offset[NR_CPUS];
| ^~~~~~~~~~~~~~~~
Which I consider bogus in the first place ("array subscript [0, 0]" vs a
1-element array). Yet taking the experience from 99f942f3d410 ("Arm64:
adjust __irq_to_desc() to fix build with gcc14") I guessed that
switching function parameters to unsigned int (which they should have
been anyway) might help. And voilà ...
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a2de7dc4d845738e734b10fce6550c89c6b1092c
master date: 2024-09-04 16:09:28 +0200
Jan Beulich [Tue, 24 Sep 2024 12:58:45 +0000 (14:58 +0200)]
SUPPORT.md: split XSM from Flask
XSM is a generic framework, which in particular is also used by SILO.
With this it can't really be experimental: Arm mandates SILO for having
a security supported configuration.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: d7c18b8720824d7efc39ffa7296751e1812865a9
master date: 2024-09-04 16:05:03 +0200
libxl: Fix nul-termination of the return value of libxl_xen_console_read_line()
When built with ASAN, "xl dmesg" crashes in the "printf("%s", line)"
call in main_dmesg(). ASAN reports a heap buffer overflow: an
off-by-one access to cr->buffer.
The readconsole sysctl copies up to count characters into the buffer,
but it does not add a null character at the end. Despite the
documentation of libxl_xen_console_read_line(), line_r is not
nul-terminated if 16384 characters were copied to the buffer.
Fix this by asking xc_readconsolering() to fill the buffer up to size
- 1. As the number of characters in the buffer is only needed in
libxl_xen_console_read_line(), make it a local variable there instead
of part of the libxl__xen_console_reader struct.
Jan Beulich [Tue, 24 Sep 2024 12:57:43 +0000 (14:57 +0200)]
Arm64: adjust __irq_to_desc() to fix build with gcc14
With the original code I observe
In function ‘__irq_to_desc’,
inlined from ‘route_irq_to_guest’ at arch/arm/irq.c:465:12:
arch/arm/irq.c:54:16: error: array subscript -2 is below array bounds of ‘irq_desc_t[32]’ {aka ‘struct irq_desc[32]’} [-Werror=array-bounds=]
54 | return &this_cpu(local_irq_desc)[irq];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
which looks pretty bogus: How in the world does the compiler arrive at
-2 when compiling route_irq_to_guest()? Yet independent of that the
function's parameter wants to be of unsigned type anyway, as shown by
a vast majority of callers (others use plain int when they really mean
non-negative quantities). With that adjustment the code compiles fine
again.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 99f942f3d410059dc223ee0a908827e928ef3592
master date: 2024-08-29 10:03:53 +0200
For partial writes the non-written parts of registers are folded into
the full 64-bit value from what they're presently set to. That's wrong
to do though when the behavior is write-1-to-clear: Writes not
including to low 3 bits would unconditionally clear all ISR bits which
are presently set. Re-calculate the value to use.
Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41d358d2f9607ba37c216effa39b9f1bc58de69d
master date: 2024-08-29 10:02:20 +0200
x86/dom0: disable SMAP for PV domain building only
Move the logic that disables SMAP so it's only performed when building a PV
dom0, PVH dom0 builder doesn't require disabling SMAP.
The fixes tag is to account for the wrong usage of cpu_has_smap in
create_dom0(), it should instead have used
boot_cpu_has(X86_FEATURE_XEN_SMAP). Fix while moving the logic to apply to PV
only.
While there also make cr4_pv32_mask __ro_after_init.
Fixes: 493ab190e5b1 ('xen/sm{e, a}p: allow disabling sm{e, a}p for Xen itself') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fb1658221a31ec1db33253a80001191391e73b17
master date: 2024-08-28 19:59:07 +0100
Jan Beulich [Tue, 24 Sep 2024 12:56:16 +0000 (14:56 +0200)]
x86/x2APIC: correct cluster tracking upon CPUs going down for S3
Downing CPUs for S3 is somewhat special: Since we can expect the system
to come back up in exactly the same hardware configuration, per-CPU data
for the secondary CPUs isn't de-allocated (and then cleared upon re-
allocation when the CPUs are being brought back up). Therefore the
cluster_cpus per-CPU pointer will retain its value for all CPUs other
than the final one in a cluster (i.e. in particular for all CPUs in the
same cluster as CPU0). That, however, is in conflict with the assertion
early in init_apic_ldr_x2apic_cluster().
Note that the issue is avoided on Intel hardware, where we park CPUs
instead of bringing them down.
Extend the bypassing of the freeing to the suspend case, thus making
suspend/resume also a tiny bit faster.
Fixes: 2e6c8f182c9c ("x86: distinguish CPU offlining from CPU removal") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad3ff7b4279d16c91c23cda6e8be5bc670b25c9a
master date: 2024-08-26 10:30:40 +0200
Jan Beulich [Tue, 24 Sep 2024 12:55:48 +0000 (14:55 +0200)]
x86emul: set (fake) operand size for AVX512CD broadcast insns
Back at the time I failed to pay attention to op_bytes still being zero
when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3-prefixed insns.
Fixes: 37ccca740c26 ("x86emul: support AVX512CD insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6fa6b7feaafd622db3a2f3436750cf07782f4c12
master date: 2024-08-23 09:12:24 +0200
Jan Beulich [Tue, 24 Sep 2024 12:55:11 +0000 (14:55 +0200)]
x86emul: always set operand size for AVX-VNNI-INT8 insns
Unlike for AVX-VNNI-INT16 I failed to notice that op_bytes may still be
zero when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3- or F2-prefixed
insns.
Fixes: 842acaa743a5 ("x86emul: support AVX-VNNI-INT8") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d45687cca2450bfebe1dfbddb22f4f03c6fbc9cb
master date: 2024-08-23 09:11:15 +0200
Andrew Cooper [Tue, 24 Sep 2024 12:54:35 +0000 (14:54 +0200)]
x86/pv: Address Coverity complaint in check_guest_io_breakpoint()
Commit 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV
guests") caused a Coverity INTEGER_OVERFLOW complaint based on the reasoning
that width could be 0.
It can't, but digging into the code generation, GCC 8 and later (bisected on
godbolt) choose to emit a CSWITCH lookup table, and because the range (bottom
2 bits clear), it's a 16-entry lookup table.
So Coverity is understandable, given that GCC did emit a (dead) logic path
where width stayed 0.
Rewrite the logic. Introduce x86_bp_width() which compiles to a single basic
block, which replaces the switch() statement. Take the opportunity to also
make start and width be loop-scope variables.
No practical change, but it should compile better and placate Coverity.
Fixes: 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV guests")
Coverity-ID: 1616152 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6d41a9d8a12ff89adabdc286e63e9391a0481699
master date: 2024-08-21 23:59:19 +0100
Andrew Cooper [Tue, 24 Sep 2024 12:53:59 +0000 (14:53 +0200)]
x86/pv: Fix merging of new status bits into %dr6
All #DB exceptions result in an update of %dr6, but this isn't captured in
Xen's handling, and is buggy just about everywhere.
To begin resolving this issue, add a new pending_dbg field to x86_event
(unioned with cr2 to avoid taking any extra space, adjusting users to avoid
old-GCC bugs with anonymous unions), and introduce pv_inject_DB() to replace
the current callers using pv_inject_hw_exception().
Push the adjustment of v->arch.dr6 into pv_inject_event(), and use the new
x86_merge_dr6() rather than the current incorrect logic.
A key property is that pending_dbg is taken with positive polarity to deal
with RTM/BLD sensibly. Most callers pass in a constant, but callers passing
in a hardware %dr6 value need to XOR the value with X86_DR6_DEFAULT to flip to
positive polarity.
This fixes the behaviour of the breakpoint status bits; that any left pending
are generally discarded when a new #DB is raised. In principle it would fix
RTM/BLD too, except PV guests can't turn these capabilities on to start with.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db39fa4b27ea470902d4625567cb6fa24030ddfa
master date: 2024-08-21 23:59:19 +0100
Andrew Cooper [Tue, 24 Sep 2024 12:53:22 +0000 (14:53 +0200)]
x86/pv: Introduce x86_merge_dr6() and fix do_debug()
Pretty much everywhere in Xen the logic to update %dr6 when injecting #DB is
buggy. Introduce a new x86_merge_dr6() helper, and start fixing the mess by
adjusting the dr6 merge in do_debug(). Also correct the comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 54ef601a66e8d812a6a6a308f02524e81201825e
master date: 2024-08-21 23:59:19 +0100
Jan Beulich [Tue, 24 Sep 2024 12:52:15 +0000 (14:52 +0200)]
Arm: correct FIXADDR_TOP
While reviewing a RISC-V patch cloning the Arm code, I noticed an
off-by-1 here: FIX_PMAP_{BEGIN,END} being an inclusive range and
FIX_LAST being the same as FIX_PMAP_END, FIXADDR_TOP cannot derive from
FIX_LAST alone, or else the BUG_ON() in virt_to_fix() would trigger if
FIX_PMAP_END ended up being used.
While touching this area also add a check for fixmap and boot FDT area
to not only not overlap, but to have at least one (unmapped) page in
between.
Jan Beulich [Tue, 24 Sep 2024 12:49:18 +0000 (14:49 +0200)]
x86/vLAPIC: prevent undue recursion of vlapic_error()
With the error vector set to an illegal value, the function invoking
vlapic_set_irq() would bring execution back here, with the non-recursive
lock already held. Avoid the call in this case, merely further updating
ESR (if necessary).
This is XSA-462 / CVE-2024-45817.
Fixes: 5f32d186a8b1 ("x86/vlapic: don't silently accept bad vectors") Reported-by: Federico Serafini <federico.serafini@bugseng.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c42d9ec61f6d11e25fa77bd44dd11dad1edda268
master date: 2024-09-24 14:23:29 +0200
Use expect to invoke QEMU so that we can terminate the test as soon as
we get the right string in the output instead of waiting until the
final timeout.
For timeout, instead of an hardcoding the value, use a Gitlab CI
variable "QEMU_TIMEOUT" that can be changed depending on the latest
status of the Gitlab CI runners.
[This backport skips the PPC and RISC scripts as well as the XTF
scripts on x86]
The Yocto jobs take a long time to run. We are changing Gitlab ARM64
runners and the new runners might not be able to finish the Yocto jobs
in a reasonable time.
For now, disable the Yocto jobs by turning them into "manual" trigger
(they need to be manually executed.)
Jan Beulich [Tue, 13 Aug 2024 14:49:45 +0000 (16:49 +0200)]
x86/pass-through: documents as security-unsupported when sharing resources
When multiple devices share resources and one of them is to be passed
through to a guest, security of the entire system and of respective
guests individually cannot really be guaranteed without knowing
internals of any of the involved guests. Therefore such a configuration
cannot really be security-supported, yet making that explicit was so far
missing.
Teddy Astie [Tue, 13 Aug 2024 14:49:06 +0000 (16:49 +0200)]
x86/IOMMU: move tracking in iommu_identity_mapping()
If for some reason xmalloc() fails after having mapped the reserved
regions, an error is reported, but the regions remain mapped in the P2M.
Similarly if an error occurs during set_identity_p2m_entry() (except on
the first call), the partial mappings of the region would be retained
without being tracked anywhere, and hence without there being a way to
remove them again from the domain's P2M.
Move the setting up of the list entry ahead of trying to map the region.
In cases other than the first mapping failing, keep record of the full
region, such that a subsequent unmapping request can be properly torn
down.
To compensate for the potentially excess unmapping requests, don't log a
warning from p2m_remove_identity_entry() when there really was nothing
mapped at a given GFN.
This is XSA-460 / CVE-2024-31145.
Fixes: 2201b67b9128 ("VT-d: improve RMRR region handling") Fixes: c0e19d7c6c42 ("IOMMU: generalize VT-d's tracking of mapped RMRR regions") Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: beadd68b5490ada053d72f8a9ce6fd696d626596
master date: 2024-08-13 16:36:40 +0200
Matthew Barnes [Thu, 8 Aug 2024 11:53:26 +0000 (13:53 +0200)]
tools/lsevtchn: Use errno macro to handle hypercall error cases
Currently, lsevtchn aborts its event channel enumeration when it hits
an event channel that is owned by Xen.
lsevtchn does not distinguish between different hypercall errors, which
results in lsevtchn missing potential relevant event channels with
higher port numbers.
Use the errno macro to distinguish between hypercall errors, and
continue event channel enumeration if the hypercall error is not
critical to enumeration.
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e92a453c8db8bba62d6be3006079e2b9990c3978
master date: 2024-08-02 08:43:57 +0200
George Dunlap [Thu, 8 Aug 2024 11:53:01 +0000 (13:53 +0200)]
xen/hvm: Don't skip MSR_READ trace record
Commit 37f074a3383 ("x86/msr: introduce guest_rdmsr()") introduced a
function to combine the MSR_READ handling between PV and HVM.
Unfortunately, by returning directly, it skipped the trace generation,
leading to gaps in the trace record, as well as xenalyze errors like
this:
hvm_generic_postprocess: d2v0 Strange, exit 7c(VMEXIT_MSR) missing a handler
Roger Pau Monné [Thu, 8 Aug 2024 11:52:11 +0000 (13:52 +0200)]
x86/altcall: further refine clang workaround
The current code in ALT_CALL_ARG() won't successfully workaround the clang
code-generation issue if the arg parameter has a size that's not a power of 2.
While there are no such sized parameters at the moment, improve the workaround
to also be effective when such sizes are used.
Instead of using a union with a long use an unsigned long that's first
initialized to 0 and afterwards set to the argument value.
Roger Pau Monné [Thu, 8 Aug 2024 11:51:39 +0000 (13:51 +0200)]
x86/dom0: fix restoring %cr3 and the mapcache override on PV build error
One of the error paths in the PV dom0 builder section that runs on the guest
page-tables wasn't restoring the Xen value of %cr3, neither removing the
mapcache override.
Andrew Cooper [Thu, 8 Aug 2024 11:51:09 +0000 (13:51 +0200)]
XSM/domctl: Fix permission checks on XEN_DOMCTL_createdomain
The XSM checks for XEN_DOMCTL_createdomain are problematic. There's a split
between xsm_domctl() called early, and flask_domain_create() called quite late
during domain construction.
All XSM implementations except Flask have a simple IS_PRIV check in
xsm_domctl(), and operate as expected when an unprivileged domain tries to
make a hypercall.
Flask however foregoes any action in xsm_domctl() and defers everything,
including the simple "is the caller permitted to create a domain" check, to
flask_domain_create().
As a consequence, when XSM Flask is active, and irrespective of the policy
loaded, all domains irrespective of privilege can:
* Mutate the global 'rover' variable, used to track the next free domid.
Therefore, all domains can cause a domid wraparound, and combined with a
voluntary reboot, choose their own domid.
* Cause a reasonable amount of a domain to be constructed before ultimately
failing for permission reasons, including the use of settings outside of
supported limits.
In order to remediate this, pass the ssidref into xsm_domctl() and at least
check that the calling domain privileged enough to create domains.
Take the opportunity to also fix the sign of the cmd parameter to be unsigned.
This issue has not been assigned an XSA, because Flask is experimental and not
security supported.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: ee32b9b29af449d38aad0a1b3a81aaae586f5ea7
master date: 2024-07-30 17:42:17 +0100
Ross Lagerwall [Thu, 8 Aug 2024 11:50:36 +0000 (13:50 +0200)]
bunzip2: fix rare decompression failure
The decompression code parses a huffman tree and counts the number of
symbols for a given bit length. In rare cases, there may be >= 256
symbols with a given bit length, causing the unsigned char to overflow.
This causes a decompression failure later when the code tries and fails to
find the bit length for a given symbol.
Since the maximum number of symbols is 258, use unsigned short instead.
Fixes: ab77e81f6521 ("x86/dom0: support bzip2 and lzma compressed bzImage payloads") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 303d3ff85c90ee4af4bad4e3b1d4932fa2634d64
master date: 2024-07-30 11:55:56 +0200
x86/altcall: fix clang code-gen when using altcall in loop constructs
Yet another clang code generation issue when using altcalls.
The issue this time is with using loop constructs around alternative_{,v}call
instances using parameter types smaller than the register size.
Given the following example code:
static void bar(bool b)
{
unsigned int i;
for ( i = 0; i < 10; i++ )
{
int ret_;
register union {
bool e;
unsigned long r;
} di asm("rdi") = { .e = b };
register unsigned long si asm("rsi");
register unsigned long dx asm("rdx");
register unsigned long cx asm("rcx");
register unsigned long r8 asm("r8");
register unsigned long r9 asm("r9");
register unsigned long r10 asm("r10");
register unsigned long r11 asm("r11");
Clang will generate machine code that only resets the low 8 bits of %rdi
between loop calls, leaving the rest of the register possibly containing
garbage from the use of %rdi inside the called function. Note also that clang
doesn't truncate the input parameters at the callee, thus breaking the psABI.
Fix this by turning the `e` element in the anonymous union into an array that
consumes the same space as an unsigned long, as this forces clang to reset the
whole %rdi register instead of just the low 8 bits.
Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function parameters on clang') Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d51b2f5ea1915fe058f730b0ec542cf84254fca0
master date: 2024-07-23 13:59:30 +0200
x86/physdev: Return pirq that irq was already mapped to
Fix bug introduced by 0762e2502f1f ("x86/physdev: factor out the code to allocate and
map a pirq"). After that re-factoring, when pirq<0 and current_pirq>0, it means
caller want to allocate a free pirq for irq but irq already has a mapped pirq, then
it returns the negative pirq, so it fails. However, the logic before that
re-factoring is different, it should return the current_pirq that irq was already
mapped to and make the call success.
Fixes: 0762e2502f1f ("x86/physdev: factor out the code to allocate and map a pirq") Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Jiqian Chen <Jiqian.Chen@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0d2b87b5adfc19e87e9027d996db204c66a47f30
master date: 2024-07-08 14:46:12 +0100
Jan Beulich [Tue, 16 Jul 2024 12:14:43 +0000 (14:14 +0200)]
x86/IRQ: avoid double unlock in map_domain_pirq()
Forever since its introduction the main loop in the function dealing
with multi-vector MSI had error exit points ("break") with different
properties: In one case no IRQ descriptor lock is being held.
Nevertheless the subsequent error cleanup path assumed such a lock would
uniformly need releasing. Identify the case by setting "desc" to NULL,
thus allowing the unlock to be skipped as necessary.
Jan Beulich [Thu, 4 Jul 2024 14:57:29 +0000 (16:57 +0200)]
evtchn: build fix for Arm
When backporting daa90dfea917 ("pirq_cleanup_check() leaks") I neglected
to pay attention to it depending on 13a7b0f9f747 ("restrict concept of
pIRQ to x86"). That one doesn't want backporting imo, so use / adjust
custom #ifdef-ary to address the immediate issue of pirq_cleanup_check()
not being available on Arm.
Jan Beulich [Thu, 4 Jul 2024 12:14:49 +0000 (14:14 +0200)]
x86/entry: don't clear DF when raising #UD for lack of syscall handler
While doing so is intentional when invoking the actual callback, to
mimic a hard-coded SYCALL_MASK / FMASK MSR, the same should not be done
when no handler is available and hence #UD is raised.
Fixes: ca6fcf4321b3 ("x86/pv: Inject #UD for missing SYSCALL callbacks") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d2fe9ab3048d503869ec81bc49db07e55a4a2386
master date: 2024-07-02 12:01:21 +0200
Jan Beulich [Thu, 4 Jul 2024 12:14:16 +0000 (14:14 +0200)]
cmdline: document and enforce "extra_guest_irqs" upper bounds
PHYSDEVOP_pirq_eoi_gmfn_v<N> accepting just a single GFN implies that no
more than 32k pIRQ-s can be used by a domain on x86. Document this upper
bound.
To also enforce the limit, (ab)use both arch_hwdom_irqs() (changing its
parameter type) and setup_system_domains(). This is primarily to avoid
exposing the two static variables or introducing yet further arch hooks.
While touching arch_hwdom_irqs() also mark it hwdom-init.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
amend 'cmdline: document and enforce "extra_guest_irqs" upper bounds'
Address late review comments for what is now commit 17f6d398f765:
- bound max_irqs right away against nr_irqs
- introduce a #define for a constant used twice
Requested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 17f6d398f76597f8009ec0530842fb8705ece7ba
master date: 2024-07-02 12:00:27 +0200
master commit: 1f56accba33ffea0abf7d1c6384710823d10cbd6
master date: 2024-07-03 14:03:27 +0200
Andrew Cooper [Thu, 4 Jul 2024 12:12:31 +0000 (14:12 +0200)]
tools/dombuilder: Correct the length calculation in xc_dom_alloc_segment()
xc_dom_alloc_segment() is passed a size in bytes, calculates a size in pages
from it, then fills in the new segment information with a bytes value
re-calculated from the number of pages.
This causes the module information given to the guest (MB, or PVH) to have
incorrect sizes; specifically, sizes rounded up to the next page.
This in turn is problematic for Xen. When Xen finds a gzipped module, it
peeks at the end metadata to judge the decompressed size, which is a -4
backreference from the reported end of the module.
Fill in seg->vend using the correct number of bytes.
Fixes: ea7c8a3d0e82 ("libxc: reorganize domain builder guest memory allocator") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 4c3a618b0adaa0cd59e0fa0898bb60978b8b3a5f
master date: 2024-07-02 10:50:18 +0100