Jan Beulich [Tue, 12 Nov 2024 12:39:56 +0000 (13:39 +0100)]
x86/HVM: drop stdvga's "{g,s}r_index" struct members
No consumers are left, hence the producer and the fields themselves can
also go away. stdvga_outb() is then useless, rendering stdvga_out()
useless as well. Hence the entire I/O port intercept can go away.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 86c03372e107f5c18266a62281663861b1144929)
Jan Beulich [Tue, 12 Nov 2024 12:39:38 +0000 (13:39 +0100)]
x86/HVM: drop stdvga's "sr[]" struct member
No consumers are left, hence the producer and the array itself can also
go away. The static sr_mask[] is then orphaned and hence needs dropping,
too.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 7aba44bdd78aedb97703811948c3b69ccff85032)
Jan Beulich [Tue, 12 Nov 2024 12:39:19 +0000 (13:39 +0100)]
x86/HVM: drop stdvga's "gr[]" struct member
No consumers are left, hence the producer and the array itself can also
go away. The static gr_mask[] is then orphaned and hence needs dropping,
too.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b16c0966a17f19c0e55ed0b9baa28191d2590178)
Jan Beulich [Tue, 12 Nov 2024 12:38:59 +0000 (13:38 +0100)]
x86/HVM: remove unused MMIO handling code
All read accesses are rejected by the ->accept handler, while writes
bypass the bulk of the function body. Drop the dead code, leaving an
assertion in the read handler.
A number of other static items (and a macro) are then unreferenced and
hence also need (want) dropping. The same applies to the "latch" field
of the state structure.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 89108547af1f230b72893b48351f9c1106189649)
Jan Beulich [Tue, 12 Nov 2024 12:38:35 +0000 (13:38 +0100)]
x86/HVM: drop stdvga's "stdvga" struct member
Two of its consumers are dead (in compile-time constant conditionals)
and the only remaining ones are merely controlling debug logging. Hence
the field is now pointless to set, which in particular allows to get rid
of the questionable conditional from which the field's value was
established (afaict 551ceee97513 ["x86, hvm: stdvga cache always on"]
had dropped too much of the earlier extra check that was there, and
quite likely further checks were missing).
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit b740a9369e81bdda675a9780130ce2b9e75d4ec9)
Jan Beulich [Tue, 12 Nov 2024 12:38:08 +0000 (13:38 +0100)]
x86/HVM: drop stdvga's "cache" struct member
Since 68e1183411be ("libxc: introduce a xc_dom_arch for hvm-3.0-x86_32
guests"), HVM guests are built using XEN_DOMCTL_sethvmcontext, which
ends up disabling stdvga caching because of arch_hvm_load() being
involved in the processing of the request. With that the field is
useless, and can be dropped. Drop the helper functions manipulating /
checking as well right away, but leave the use sites of
stdvga_cache_is_enabled() with the hard-coded result the function would
have produced, to aid validation of subsequent dropping of further code.
This is part of XSA-463 / CVE-2024-45818
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 53b7246bdfb3c280adcdf714918e4decb7e108f4)
Javi Merino [Fri, 18 Oct 2024 09:17:43 +0000 (10:17 +0100)]
CI: Refresh the Debian 12 x86_32 container
Rework the container to be non-root, use heredocs for readability, and
use apt-get --no-install-recommends to keep the size down. Rename the
job to x86_32, to be consistent with XEN_TARGET_ARCH and the
naming scheme of all the other CI jobs:
${VERSION}-${ARCH}-${BUILD_NAME}
Remove build dependencies for building QEMU. The absence of ninja/meson means
that the container hasn't been able to build QEMU in years.
Remove build dependencies for the documentation as we don't have to
build it for every single arch.
This reduces the size of the container from 2.22GB to 1.32Gb.
Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 1ceabff11575e5acb97f29aa9091539dfaf05e3d)
Javi Merino [Mon, 14 Oct 2024 16:53:31 +0000 (17:53 +0100)]
CI: Refresh the Debian 12 x86_64 container
Rework the container to use heredocs for readability, and use
apt-get --no-install-recommends to keep the size down.
This reduces the size of the (uncompressed) container from 3.44GB to
1.97GB.
The container is left running the builds and tests as root. A
subsequent patch will make the necessary changes to the test scripts
to allow test execution as a non-root user.
Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 44b742de09f2fd14f6211a6c7f24c0cba1624c14)
Andrew Cooper [Thu, 31 Oct 2024 18:02:57 +0000 (18:02 +0000)]
CI: Drop alpine-3.18-rootfs-export and use test-artefacts
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit babe11b46c1a97036164099528a308476ea27953)
Andrew Cooper [Mon, 21 Oct 2024 13:17:56 +0000 (14:17 +0100)]
CI: Add {adl,zen3p}-pvshim-* tests
GitlabCI has no testing of Xen's PVH entrypoint. Fix this.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit b837d02163ff19e2440cae766e2bc50956da5410)
Andrew Cooper [Mon, 21 Oct 2024 14:07:54 +0000 (15:07 +0100)]
CI: Rework domU_config generation in qubes-x86-64.sh
Right now, various blocks rewrite domU_config= as a whole, even though it is
largely the same.
* dom0pvh-hvm does nothing but change the domain type to hvm
* *-pci sets the domain type, clears vif=[], appends earlyprintk=xen to the
cmdline, and adds some PCI config.
Refactor this to be domU_type (defaults to pvh), domU_vif (defaults to
xenbr0), and domU_extra_config (defaults to empty) and use these variables to
build domU_config= once.
Of note, the default domU_config= now sets cmdline=, and extra= is intended
for inclusion via domU_extra_config as necessary.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit 3be3ae07705af77ab1c87c2e442c7646c938ad25)
Andrew Cooper [Mon, 21 Oct 2024 13:06:24 +0000 (14:06 +0100)]
CI: Minor cleanup to qubes-x86-64.sh
* List all the test_variants and summerise what's going on
* Use case rather than an if/else chain for $test_variant
* Adjust indentation inside the case block
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit 6685a129c7864e3733afef09a2539ccd722a4380)
Andrew Cooper [Sat, 13 Jul 2024 16:50:30 +0000 (17:50 +0100)]
CI: Stop building QEMU in general
We spend an awful lot of CI time building QEMU, even though most changes don't
touch the subset of tools/libs/ used by QEMU. Some numbers taken at a time
when CI was otherwise quiet:
With Without
Alpine: 13m38s 6m04s
Debian 12: 10m05s 8m10s
OpenSUSE Tumbleweed: 11m40s 7m54s
Ubuntu 24.04: 14m56s 8m06s
which is a >50% improvement in wallclock time in some cases.
The only build we have that needs QEMU is alpine-3.18-gcc-debug. This is the
build deployed and used by the QubesOS ADL-* and Zen3p-* jobs.
Xilinx-x86_64 deploys it too, but is PVH-only and doesn't use QEMU.
QEMU is also built by CirrusCI for FreeBSD (fully Clang/LLVM toolchain).
This should help quite a lot with Gitlab CI capacity.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit e305256e69b1c943db3ca20173da6ded3db2d252)
The smoke tests when successful complete in about 5s. Don't waste
20min+ on failure, shorten the timeout to 120s
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit bcce5a6b62761c8b678aebce33c55ea66f879f66)
Check if xen.efi is bootable with an XTF dom0.
The multiboot2+EFI path is tested on hardware tests already.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 2d1c673baea563bb1af00b1e977b4ff7c213cf7f)
It will be useful for further tests. Deuplicate the collection.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 636e66b143ac1aad2f6a9c2e7166d8ba88f4559a)
TEST_TIMEOUT is set as a CI/CD project variable, as it should be, to
match the capability and speed of the testing infrastructure.
As it turns out, TEST_TIMEOUT defined in test.yaml cannot override
TEST_TIMEOUT defined as CI/CD project variable. As a consequence, today
the TEST_TIMEOUT setting in test.yaml for the Xilinx jobs is ignored.
Instead, rename TEST_TIMEOUT to TEST_TIMEOUT_OVERRIDE in test.yaml and
check for TEST_TIMEOUT_OVERRIDE first in console.exp.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit d82e0e094e7a07353ba0fb35732724316c2ec2f6)
Victor Lira [Thu, 29 Aug 2024 22:34:23 +0000 (15:34 -0700)]
automation: use expect utility in xilinx tests
Fixes: 95764a0817a5 (automation: update xilinx test scripts (tty))
This patch introduced a CI failure due to a timeout in xilinx-x86_64 test.
Change xilinx-x86_64 and xilinx-arm64 scripts to use "expect" utility
to determine test result and allow early exit from tests.
Add "expect" to xilinx container environment (dockerfile).
Rename references to "QEMU" in "qemu-key.exp" expect script to "TEST" to be
used by both QEMU and hardware tests.
Signed-off-by: Victor Lira <victorm.lira@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 5e99a40ea54a6bf0bdc18241992866a642d7782b)
Victor Lira [Thu, 29 Aug 2024 22:34:22 +0000 (15:34 -0700)]
automation: fix false success in qemu tests
Fix flaw in qemu-*.sh tests that producess a false success. The following
lines produces success despite the "expect" script producing nonzero exit
status:
set +e
...
./automation/scripts/qemu-key.exp | sed 's/\r\+$//'
(end of file)
The default exit status for a pipeline using "|" operator is that of the
rightmost command. Fix this by setting the "pipefail" option in the shell,
and removing "set +e" allowing the expect script to determine the result.
Signed-off-by: Victor Lira <victorm.lira@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 740c41ca05a83a2c3629eb2ff323877c37d95c1e)
Victor Lira [Fri, 23 Aug 2024 22:29:04 +0000 (15:29 -0700)]
automation: update xilinx test scripts (tty)
Update serial device names from ttyUSB* to test board specific names.
Update xilinx-smoke-dom0-x86_64 with new Xen command line console options,
which are now set as Gitlab CI/CD variables. Abstract the directory where
binaries are stored. Increase the timeout to match new setup.
Signed-off-by: Victor Lira <victorm.lira@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 95764a0817a51741b7ffb1f78cba2a19b08ab2d1)
After commit c36efb7fcea6 ("automation: use expect to run QEMU") we lost
the \r filtering introduced by b576497e3b7d ("automation: remove CR
characters from serial output"). This patch reintroduced it.
Fixes: c36efb7fcea6 ("automation: use expect to run QEMU") Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com>
(cherry picked from commit aa80a04df488528d90a0d892f0752571b1759e8b)
Daniel P. Smith [Tue, 29 Oct 2024 15:31:38 +0000 (16:31 +0100)]
x86/boot: Fix XSM module handling during PVH boot
As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.
This pattern has been in use since before PVH support was added. This has
most likely gone unnoticed because no-one's tried using a detached Flask
policy in a PVH VM before.
Plumb the boot_info pointer down, replacing module_map and mbi. Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.
As this is the final non-bi use of mbi in __start_xen(), make the pointer
unusable once bi has been established, to prevent new uses creeping back in.
This is a stopgap until mbi can be fully removed.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6cf0aaeb8df951fb34679f0408461a5c67cb02c6
master date: 2024-10-23 18:14:24 +0100
Daniel P. Smith [Tue, 29 Oct 2024 15:31:25 +0000 (16:31 +0100)]
x86/boot: Fix microcode module handling during PVH boot
As detailed in commit 0fe607b2a144 ("x86/boot: Fix PVH boot during boot_info
transition period"), the use of __va(mbi->mods_addr) constitutes a
use-after-free on the PVH boot path.
This pattern has been in use since before PVH support was added. Inside a PVH
VM, it will go unnoticed as long as the microcode container parser doesn't
choke on the random data it finds.
The use within early_microcode_init() happens to be safe because it's prior to
move_xen(). microcode_init_cache() is after move_xen(), and therefore unsafe.
Plumb the boot_info pointer down, replacing module_map and mbi. Importantly,
bi->mods[].mod is a safe way to access the module list during PVH boot.
Note: microcode_scan_module() is still bogusly stashing a bootstrap_map()'d
pointer in ucode_blob.data, which constitutes a different
use-after-free, and only works in general because of a second bug. This
is unrelated to PVH, and needs untangling differently.
Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8ddf63a252a6eae6e619ba2df9ad6b6f82e660c1
master date: 2024-10-23 18:14:24 +0100
Roger Pau Monné [Tue, 29 Oct 2024 15:30:51 +0000 (16:30 +0100)]
iommu/amd-vi: do not error if device referenced in IVMD is not behind any IOMMU
IVMD table contains restrictions about memory which must be mandatory assigned
to devices (and which permissions it should use), or memory that should be
never accessible to devices.
Some hardware however contains ranges in IVMD that reference devices outside of
the IVHD tables (in other words, devices not behind any IOMMU). Such mismatch
will cause Xen to fail in register_range_for_device(), ultimately leading to
the IOMMU being disabled, and Xen crashing as x2APIC support might be already
enabled and relying on the IOMMU functionality.
Relax IVMD parsing: allow IVMD blocks to reference devices not assigned to any
IOMMU. It's impossible for Xen to fulfill the requirement in the IVMD block if
the device is not behind any IOMMU, but it's no worse than booting without
IOMMU support, and thus not parsing ACPI IVRS in the first place.
Andrew Cooper [Tue, 29 Oct 2024 15:30:41 +0000 (16:30 +0100)]
xen/spinlock: Fix UBSAN "load of address with insufficient space" in lock_prof_init()
UBSAN complains:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in common/spinlock.c:794:10
(XEN) load of address ffff82d040ae24c8 with insufficient space
(XEN) for an object of type 'struct lock_profile *'
(XEN) ----[ Xen-4.20-unstable x86_64 debug=y ubsan=y Tainted: C ]----
This shows up with GCC-14, but not with GCC-12. I have not bisected further.
Either way, the types for __lock_profile_{start,end} are incorrect.
They are an array of struct lock_profile pointers. Correct the extern's
types, and adjust the loop to match.
No practical change.
Reported-by: Andreas Glashauser <ag@andreasglashauser.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: 542ac112fc68c66cfafc577e252404c21da4f75b
master date: 2024-10-14 16:14:26 +0100
Roger Pau Monné [Tue, 29 Oct 2024 15:30:04 +0000 (16:30 +0100)]
x86/domctl: fix maximum number of MSRs in XEN_DOMCTL_{get,set}_vcpu_msrs
Since the addition of the MSR_AMD64_DR{1-4}_ADDRESS_MASK MSRs to the
msrs_to_send array, the calculations for the maximum number of MSRs that
the hypercall can handle is off by 4.
Remove the addition of 4 to the maximum number of MSRs that
XEN_DOMCTL_{set,get}_vcpu_msrs supports, as those are already part of the
array.
A further adjustment could be to subtract 4 from the maximum size if the DBEXT
CPUID feature is not exposed to the guest, but guest_{rd,wr}msr() will already
perform that check when fetching or loading the MSRs. The maximum array is
used to indicate the caller of the buffer it needs to allocate in the get case,
and as an early input sanitation in the set case, using a buffer size slightly
lager than required is not an issue.
Fixes: 86d47adcd3c4 ('x86/msr: Handle MSR_AMD64_DR{0-3}_ADDRESS_MASK in the new MSR infrastructure') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c95cd5f9c5a8c1c6ab1b0b366d829fa8561958fd
master date: 2024-10-08 14:37:53 +0200
Jan Beulich [Tue, 29 Oct 2024 15:29:47 +0000 (16:29 +0100)]
ioreq: don't wrongly claim "success" in ioreq_send_buffered()
Returning a literal number is a bad idea anyway when all other returns
use IOREQ_STATUS_* values. The function is dead on Arm, and mapping to
X86EMUL_OKAY is surely wrong on x86.
Fixes: f6bf39f84f82 ("x86/hvm: add support for broadcast of buffered ioreqs...") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: 2e0b545b847df7d4feb07308d50bad708bd35a66
master date: 2024-10-08 14:36:27 +0200
Roger Pau Monné [Tue, 29 Oct 2024 15:29:12 +0000 (16:29 +0100)]
x86/dpci: do not leak pending interrupts on CPU offline
The current dpci logic relies on a softirq being executed as a side effect of
the cpu_notifier_call_chain() call in the code path that offlines the target
CPU. However the call to cpu_notifier_call_chain() won't trigger any softirq
processing, and even if it did, such processing should be done after all
interrupts have been migrated off the current CPU, otherwise new pending dpci
interrupts could still appear.
Currently the ASSERT() in the cpu callback notifier is fairly easy to trigger
by doing CPU offline from a PVH dom0.
Solve this by instead moving out any dpci interrupts pending processing once
the CPU is dead. This might introduce more latency than attempting to drain
before the CPU is put offline, but it's less complex, and CPU online/offline is
not a common action. Any extra introduced latency should be tolerable.
Fixes: f6dd295381f4 ('dpci: replace tasklet with softirq') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 29555668b5725b9d5393b72bfe7ff9a3fa606714
master date: 2024-10-07 11:10:21 +0200
Andrew Cooper [Tue, 29 Oct 2024 15:28:48 +0000 (16:28 +0100)]
stubdom: Fix newlib build with GCC-14
Based on a fix from OpenSUSE, but adjusted to be Clang-compatible too. Pass
-Wno-implicit-function-declaration library-wide rather than using local GCC
pragmas.
Fix of copy_past_newline() to avoid triggering -Wstrict-prototypes.
Andrew Cooper [Tue, 29 Oct 2024 15:27:54 +0000 (16:27 +0100)]
x86/pv: Rename pv.iobmp_limit to iobmp_nr and clarify behaviour
Ever since it's introduction in commit 013351bd7ab3 ("Define new event-channel
and physdev hypercalls") in 2006, the public interface was named nr_ports
while the internal field was called iobmp_limit.
Rename the internal field to iobmp_nr to match the public interface, and
clarify that, when nonzero, Xen will read 2 bytes.
There isn't a perfect parallel with a real TSS, but iobmp_nr being 0 is the
paravirt "no IOPB" case, and it is important that no read occurs in this case.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 633ee8b2df963f7e5cb8de1219c1a48bfb4447f6
master date: 2024-10-01 14:58:18 +0100
Andrew Cooper [Tue, 29 Oct 2024 15:27:41 +0000 (16:27 +0100)]
x86/pv: Handle #PF correctly when reading the IO permission bitmap
The switch statement in guest_io_okay() is a very expensive way of
pre-initialising x with ~0, and performing a partial read into it.
However, the logic isn't correct either.
In a real TSS, the CPU always reads two bytes (like here), and any TSS limit
violation turns silently into no-access. But, in-limit accesses trigger #PF
as usual. AMD document this property explicitly, and while Intel don't (so
far as I can tell), they do behave consistently with AMD.
Switch from __copy_from_guest_offset() to __copy_from_guest_pv(), like
everything else in this file. This removes code generation setting up
copy_from_user_hvm() (in the likely path even), and safety LFENCEs from
evaluate_nospec().
Change the logic to raise #PF if __copy_from_guest_pv() fails, rather than
disallowing the IO port access. This brings the behaviour better in line with
normal x86.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 8a6c495d725408d333c1b47bb8af44615a5bfb18
master date: 2024-10-01 14:58:18 +0100
Andrew Cooper [Tue, 29 Oct 2024 15:27:29 +0000 (16:27 +0100)]
x86/pv: Rework guest_io_okay() to return X86EMUL_*
In order to fix a bug with guest_io_okay() (subsequent patch), rework
guest_io_okay() to take in an emulation context, and return X86EMUL_* rather
than a boolean.
For the failing case, take the opportunity to inject #GP explicitly, rather
than returning X86EMUL_UNHANDLEABLE. There is a logical difference between
"we know what this is, and it's #GP", vs "we don't know what this is".
There is no change in practice as emulation is the final step on general #GP
resolution, but returning X86EMUL_UNHANDLEABLE would be a latent bug if a
subsequent action were to appear.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7429e1cc071b0e20ea9581da4893fb9b2f6d21d4
master date: 2024-10-01 14:58:18 +0100
x86/traps: Re-enable interrupts after reading cr2 in the #PF handler
Hitting a page fault clobbers %cr2, so if a page fault is handled while
handling a previous page fault then %cr2 will hold the address of the
latter fault rather than the former. In particular, if a debug key
handler happens to trigger during #PF and before %cr2 is read, and that
handler itself encounters a #PF, then %cr2 will be corrupt for the outer #PF
handler.
This patch makes the page fault path delay re-enabling IRQs until %cr2
has been read in order to ensure it stays consistent.
A similar argument holds in additional cases, but they happen to be safe:
* %dr6 inside #DB: Safe because IST exceptions don't re-enable IRQs.
* MSR_XFD_ERR inside #NM: Safe because AMX isn't used in #NM handler.
While in the area, remove redundant q suffix to a movq in entry.S and
the space after the comma.
Fixes: a4cd20a19073 ("[XEN] 'd' key dumps both host and guest state.") Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b06e76db7c35974f1b127762683e7852ca0c8e76
master date: 2024-10-01 09:45:49 +0200
Taking a fault on a non-byte-granular insn means that the "number of
bytes not handled" return value would need extra care in calculating, if
we want callers to be able to derive e.g. exception context (to be
injected to the guest) - CR2 for #PF in particular - from the value. To
simplify things rather than complicating them, reduce inline assembly to
just byte-granular string insns. On recent CPUs that's also supposed to
be more efficient anyway.
For singular element accessors, however, alignment checks are added,
hence slightly complicating the code. Misaligned (user) buffer accesses
will now be forwarded to copy_{from,to}_guest_ll().
Naturally copy_{from,to}_unsafe_ll() accessors end up being adjusted the
same way, as they're produced by mere re-processing of the same code.
Otoh copy_{from,to}_unsafe() aren't similarly adjusted, but have their
comments made match reality; down the road we may want to change their
return types, e.g. to bool.
Fixes: 76974398a63c ("Added user-memory accessing functionality for x86_64") Fixes: 7b8c36701d26 ("Introduce clear_user and clear_guest") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 67a8e5721e1ea9c28526883036bf08fb2e8a8c9c
master date: 2024-10-01 09:44:55 +0200
xen/ucode: Fix buffer under-run when parsing AMD containers
The AMD container format has no formal spec. It is, at best, precision
guesswork based on AMD's prior contributions to open source projects. The
Equivalence Table has both an explicit length, and an expectation of having a
NULL entry at the end.
Xen was sanity checking the NULL entry, but without confirming that an entry
was present, resulting in a read off the front of the buffer. With some
manual debugging/annotations this manifests as:
(XEN) *** Buf ffff83204c00b19c, eq ffff83204c00b194
(XEN) *** eq: 0c 00 00 00 44 4d 41 00 00 00 00 00 00 00 00 00 aa aa aa aa
^-Actual buffer-------------------^
(XEN) *** installed_cpu: 000c
(XEN) microcode: Bad equivalent cpu table
(XEN) Parsing microcode blob error -22
When loaded by hypercall, the 4 bytes interpreted as installed_cpu happen to
be the containing struct ucode_buf's len field, and luckily will be nonzero.
When loaded at boot, it's possible for the access to #PF if the module happens
to have been placed on a 2M boundary by the bootloader. Under Linux, it will
commonly be the end of the CPIO header.
Drop the probe of the NULL entry; Nothing else cares. A container without one
is well formed, insofar that we can still parse it correctly. With this
dropped, the same container results in:
(XEN) microcode: couldn't find any matching ucode in the provided blob!
Fixes: 4de936a38aa9 ("x86/ucode/amd: Rework parsing logic in cpu_request_microcode()") Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: a8bf14f6f331d4f428010b4277b67c33f561ed19
master date: 2024-09-13 15:23:30 +0100
blkif: reconcile protocol specification with in-use implementations
Current blkif implementations (both backends and frontends) have all slight
differences about how they handle the 'sector-size' xenstore node, and how
other fields are derived from this value or hardcoded to be expressed in units
of 512 bytes.
To give some context, this is an excerpt of how different implementations use
the value in 'sector-size' as the base unit for to other fields rather than
just to set the logical sector size of the block device:
An attempt was made by 67e1c050e36b in order to change the base units of the
request fields and the xenstore 'sectors' node. That however only lead to more
confusion, as the specification now clearly diverged from the reference
implementation in Linux. Such change was only implemented for QEMU Qdisk
and Windows PV blkfront.
Partially revert to the state before 67e1c050e36b while adjusting the
documentation for 'sectors' to match what it used to be previous to 2fa701e5346d:
* Declare 'feature-large-sector-size' deprecated. Frontends should not expose
the node, backends should not make decisions based on its presence.
* Clarify that 'sectors' xenstore node and the requests fields are always in
512-byte units, like it was previous to 2fa701e5346d and 67e1c050e36b.
All base units for the fields used in the protocol are 512-byte based, the
xenbus 'sector-size' field is only used to signal the logic block size. When
'sector-size' is greater than 512, blkfront implementations must make sure that
the offsets and sizes (despite being expressed in 512-byte units) are aligned
to the logical block size specified in 'sector-size', otherwise the backend
will fail to process the requests.
This will require changes to some of the frontends and backends in order to
properly support 'sector-size' nodes greater than 512.
Fixes: 2fa701e5346d ('blkif.h: Provide more complete documentation of the blkif interface') Fixes: 67e1c050e36b ('public/io/blkif.h: try to fix the semantics of sector based quantities') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: 221f2748e8dabe8361b8cdfcffbeab9102c4c899
master date: 2024-09-12 14:04:56 +0200
xen/x86/pvh: handle ACPI RSDT table in PVH Dom0 build
Xen always generates an XSDT table even if the firmware only provided an
RSDT table. Copy the RSDT header from the firmware table, adjusting the
signature, for the XSDT table when not provided by the firmware.
This is necessary to run Xen on QEMU.
Fixes: 1d74282c455f ('x86: setup PVHv2 Dom0 ACPI tables') Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Signed-off-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 6e7f7a0c16c4d406bda6d4a900252ff63a7c5fad
master date: 2024-09-12 09:18:25 +0200
Jan Beulich [Tue, 24 Sep 2024 12:43:02 +0000 (14:43 +0200)]
x86/HVM: properly reject "indirect" VRAM writes
While ->count will only be different from 1 for "indirect" (data in
guest memory) accesses, it being 1 does not exclude the request being an
"indirect" one. Check both to be on the safe side, and bring the ->count
part also in line with what ioreq_send_buffered() actually refuses to
handle.
Fixes: 3bbaaec09b1b ("x86/hvm: unify stdvga mmio intercept with standard mmio intercept") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb7cd0593d88c4b967a24bca8bd30591966676cd
master date: 2024-09-12 09:13:04 +0200
Jan Beulich [Tue, 24 Sep 2024 12:42:39 +0000 (14:42 +0200)]
x86emul/test: fix build with gas 2.43
Drop explicit {evex} pseudo-prefixes. New gas (validly) complains when
they're used on things other than instructions. Our use was potentially
ahead of macro invocations - see simd.h's "override" macro.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c09288298af881ea1bb568740deb2d2a06bcd41
master date: 2024-09-06 08:41:18 +0200
Jan Beulich [Tue, 24 Sep 2024 12:41:59 +0000 (14:41 +0200)]
x86: fix UP build with gcc14
The complaint is:
In file included from ././include/xen/config.h:17,
from <command-line>:
arch/x86/smpboot.c: In function ‘link_thread_siblings.constprop’:
./include/asm-generic/percpu.h:16:51: error: array subscript [0, 0] is outside array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds=]
16 | (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
./include/xen/compiler.h:140:29: note: in definition of macro ‘RELOC_HIDE’
140 | (typeof(ptr)) (__ptr + (off)); })
| ^~~
arch/x86/smpboot.c:238:27: note: in expansion of macro ‘per_cpu’
238 | cpumask_set_cpu(cpu2, per_cpu(cpu_sibling_mask, cpu1));
| ^~~~~~~
In file included from ./arch/x86/include/generated/asm/percpu.h:1,
from ./include/xen/percpu.h:30,
from ./arch/x86/include/asm/cpuid.h:9,
from ./arch/x86/include/asm/cpufeature.h:11,
from ./arch/x86/include/asm/system.h:6,
from ./include/xen/list.h:11,
from ./include/xen/mm.h:68,
from arch/x86/smpboot.c:12:
./include/asm-generic/percpu.h:12:22: note: while referencing ‘__per_cpu_offset’
12 | extern unsigned long __per_cpu_offset[NR_CPUS];
| ^~~~~~~~~~~~~~~~
Which I consider bogus in the first place ("array subscript [0, 0]" vs a
1-element array). Yet taking the experience from 99f942f3d410 ("Arm64:
adjust __irq_to_desc() to fix build with gcc14") I guessed that
switching function parameters to unsigned int (which they should have
been anyway) might help. And voilà ...
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a2de7dc4d845738e734b10fce6550c89c6b1092c
master date: 2024-09-04 16:09:28 +0200
Jan Beulich [Tue, 24 Sep 2024 12:41:51 +0000 (14:41 +0200)]
SUPPORT.md: split XSM from Flask
XSM is a generic framework, which in particular is also used by SILO.
With this it can't really be experimental: Arm mandates SILO for having
a security supported configuration.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: d7c18b8720824d7efc39ffa7296751e1812865a9
master date: 2024-09-04 16:05:03 +0200
libxl: Fix nul-termination of the return value of libxl_xen_console_read_line()
When built with ASAN, "xl dmesg" crashes in the "printf("%s", line)"
call in main_dmesg(). ASAN reports a heap buffer overflow: an
off-by-one access to cr->buffer.
The readconsole sysctl copies up to count characters into the buffer,
but it does not add a null character at the end. Despite the
documentation of libxl_xen_console_read_line(), line_r is not
nul-terminated if 16384 characters were copied to the buffer.
Fix this by asking xc_readconsolering() to fill the buffer up to size
- 1. As the number of characters in the buffer is only needed in
libxl_xen_console_read_line(), make it a local variable there instead
of part of the libxl__xen_console_reader struct.
Jan Beulich [Tue, 24 Sep 2024 12:40:34 +0000 (14:40 +0200)]
Arm64: adjust __irq_to_desc() to fix build with gcc14
With the original code I observe
In function ‘__irq_to_desc’,
inlined from ‘route_irq_to_guest’ at arch/arm/irq.c:465:12:
arch/arm/irq.c:54:16: error: array subscript -2 is below array bounds of ‘irq_desc_t[32]’ {aka ‘struct irq_desc[32]’} [-Werror=array-bounds=]
54 | return &this_cpu(local_irq_desc)[irq];
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
which looks pretty bogus: How in the world does the compiler arrive at
-2 when compiling route_irq_to_guest()? Yet independent of that the
function's parameter wants to be of unsigned type anyway, as shown by
a vast majority of callers (others use plain int when they really mean
non-negative quantities). With that adjustment the code compiles fine
again.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Michal Orzel <michal.orzel@amd.com>
master commit: 99f942f3d410059dc223ee0a908827e928ef3592
master date: 2024-08-29 10:03:53 +0200
For partial writes the non-written parts of registers are folded into
the full 64-bit value from what they're presently set to. That's wrong
to do though when the behavior is write-1-to-clear: Writes not
including to low 3 bits would unconditionally clear all ISR bits which
are presently set. Re-calculate the value to use.
Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 41d358d2f9607ba37c216effa39b9f1bc58de69d
master date: 2024-08-29 10:02:20 +0200
x86/dom0: disable SMAP for PV domain building only
Move the logic that disables SMAP so it's only performed when building a PV
dom0, PVH dom0 builder doesn't require disabling SMAP.
The fixes tag is to account for the wrong usage of cpu_has_smap in
create_dom0(), it should instead have used
boot_cpu_has(X86_FEATURE_XEN_SMAP). Fix while moving the logic to apply to PV
only.
While there also make cr4_pv32_mask __ro_after_init.
Fixes: 493ab190e5b1 ('xen/sm{e, a}p: allow disabling sm{e, a}p for Xen itself') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fb1658221a31ec1db33253a80001191391e73b17
master date: 2024-08-28 19:59:07 +0100
Jan Beulich [Tue, 24 Sep 2024 12:38:27 +0000 (14:38 +0200)]
x86/x2APIC: correct cluster tracking upon CPUs going down for S3
Downing CPUs for S3 is somewhat special: Since we can expect the system
to come back up in exactly the same hardware configuration, per-CPU data
for the secondary CPUs isn't de-allocated (and then cleared upon re-
allocation when the CPUs are being brought back up). Therefore the
cluster_cpus per-CPU pointer will retain its value for all CPUs other
than the final one in a cluster (i.e. in particular for all CPUs in the
same cluster as CPU0). That, however, is in conflict with the assertion
early in init_apic_ldr_x2apic_cluster().
Note that the issue is avoided on Intel hardware, where we park CPUs
instead of bringing them down.
Extend the bypassing of the freeing to the suspend case, thus making
suspend/resume also a tiny bit faster.
Fixes: 2e6c8f182c9c ("x86: distinguish CPU offlining from CPU removal") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad3ff7b4279d16c91c23cda6e8be5bc670b25c9a
master date: 2024-08-26 10:30:40 +0200
Jan Beulich [Tue, 24 Sep 2024 12:37:52 +0000 (14:37 +0200)]
x86emul: set (fake) operand size for AVX512CD broadcast insns
Back at the time I failed to pay attention to op_bytes still being zero
when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3-prefixed insns.
Fixes: 37ccca740c26 ("x86emul: support AVX512CD insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 6fa6b7feaafd622db3a2f3436750cf07782f4c12
master date: 2024-08-23 09:12:24 +0200
Jan Beulich [Tue, 24 Sep 2024 12:37:08 +0000 (14:37 +0200)]
x86emul: always set operand size for AVX-VNNI-INT8 insns
Unlike for AVX-VNNI-INT16 I failed to notice that op_bytes may still be
zero when reaching the respective case block: With the ext0f38_table[]
entries having simd_packed_int, the defaulting at the bottom of
x86emul_decode() won't set the field to non-zero for F3- or F2-prefixed
insns.
Fixes: 842acaa743a5 ("x86emul: support AVX-VNNI-INT8") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d45687cca2450bfebe1dfbddb22f4f03c6fbc9cb
master date: 2024-08-23 09:11:15 +0200
Andrew Cooper [Tue, 24 Sep 2024 12:36:25 +0000 (14:36 +0200)]
x86/pv: Address Coverity complaint in check_guest_io_breakpoint()
Commit 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV
guests") caused a Coverity INTEGER_OVERFLOW complaint based on the reasoning
that width could be 0.
It can't, but digging into the code generation, GCC 8 and later (bisected on
godbolt) choose to emit a CSWITCH lookup table, and because the range (bottom
2 bits clear), it's a 16-entry lookup table.
So Coverity is understandable, given that GCC did emit a (dead) logic path
where width stayed 0.
Rewrite the logic. Introduce x86_bp_width() which compiles to a single basic
block, which replaces the switch() statement. Take the opportunity to also
make start and width be loop-scope variables.
No practical change, but it should compile better and placate Coverity.
Fixes: 08aacc392d86 ("x86/emul: Fix misaligned IO breakpoint behaviour in PV guests")
Coverity-ID: 1616152 Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 6d41a9d8a12ff89adabdc286e63e9391a0481699
master date: 2024-08-21 23:59:19 +0100
Andrew Cooper [Tue, 24 Sep 2024 12:34:30 +0000 (14:34 +0200)]
x86/pv: Fix merging of new status bits into %dr6
All #DB exceptions result in an update of %dr6, but this isn't captured in
Xen's handling, and is buggy just about everywhere.
To begin resolving this issue, add a new pending_dbg field to x86_event
(unioned with cr2 to avoid taking any extra space, adjusting users to avoid
old-GCC bugs with anonymous unions), and introduce pv_inject_DB() to replace
the current callers using pv_inject_hw_exception().
Push the adjustment of v->arch.dr6 into pv_inject_event(), and use the new
x86_merge_dr6() rather than the current incorrect logic.
A key property is that pending_dbg is taken with positive polarity to deal
with RTM/BLD sensibly. Most callers pass in a constant, but callers passing
in a hardware %dr6 value need to XOR the value with X86_DR6_DEFAULT to flip to
positive polarity.
This fixes the behaviour of the breakpoint status bits; that any left pending
are generally discarded when a new #DB is raised. In principle it would fix
RTM/BLD too, except PV guests can't turn these capabilities on to start with.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: db39fa4b27ea470902d4625567cb6fa24030ddfa
master date: 2024-08-21 23:59:19 +0100
Andrew Cooper [Tue, 24 Sep 2024 12:30:49 +0000 (14:30 +0200)]
x86/pv: Introduce x86_merge_dr6() and fix do_debug()
Pretty much everywhere in Xen the logic to update %dr6 when injecting #DB is
buggy. Introduce a new x86_merge_dr6() helper, and start fixing the mess by
adjusting the dr6 merge in do_debug(). Also correct the comment.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 54ef601a66e8d812a6a6a308f02524e81201825e
master date: 2024-08-21 23:59:19 +0100
Jan Beulich [Tue, 24 Sep 2024 12:30:04 +0000 (14:30 +0200)]
x86emul: correct #UD check for AVX512-FP16 complex multiplications
avx512_vlen_check()'s argument was inverted, while the surrounding
conditional wrongly forced the EVEX.L'L check for the scalar forms when
embedded rounding was in effect.
Fixes: d14c52cba0f5 ("x86emul: handle AVX512-FP16 complex multiplication insns") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a30d438ce58b70c5955f5d37f776086ab8f88623
master date: 2024-08-19 15:32:31 +0200
Jan Beulich [Tue, 24 Sep 2024 12:28:22 +0000 (14:28 +0200)]
Arm: correct FIXADDR_TOP
While reviewing a RISC-V patch cloning the Arm code, I noticed an
off-by-1 here: FIX_PMAP_{BEGIN,END} being an inclusive range and
FIX_LAST being the same as FIX_PMAP_END, FIXADDR_TOP cannot derive from
FIX_LAST alone, or else the BUG_ON() in virt_to_fix() would trigger if
FIX_PMAP_END ended up being used.
While touching this area also add a check for fixmap and boot FDT area
to not only not overlap, but to have at least one (unmapped) page in
between.
Jan Beulich [Tue, 24 Sep 2024 12:27:03 +0000 (14:27 +0200)]
x86/vLAPIC: prevent undue recursion of vlapic_error()
With the error vector set to an illegal value, the function invoking
vlapic_set_irq() would bring execution back here, with the non-recursive
lock already held. Avoid the call in this case, merely further updating
ESR (if necessary).
This is XSA-462 / CVE-2024-45817.
Fixes: 5f32d186a8b1 ("x86/vlapic: don't silently accept bad vectors") Reported-by: Federico Serafini <federico.serafini@bugseng.com> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c42d9ec61f6d11e25fa77bd44dd11dad1edda268
master date: 2024-09-24 14:23:29 +0200
Use expect to invoke QEMU so that we can terminate the test as soon as
we get the right string in the output instead of waiting until the
final timeout.
For timeout, instead of an hardcoding the value, use a Gitlab CI
variable "QEMU_TIMEOUT" that can be changed depending on the latest
status of the Gitlab CI runners.
The Yocto jobs take a long time to run. We are changing Gitlab ARM64
runners and the new runners might not be able to finish the Yocto jobs
in a reasonable time.
For now, disable the Yocto jobs by turning them into "manual" trigger
(they need to be manually executed.)
Jan Beulich [Tue, 13 Aug 2024 14:48:13 +0000 (16:48 +0200)]
x86/pass-through: documents as security-unsupported when sharing resources
When multiple devices share resources and one of them is to be passed
through to a guest, security of the entire system and of respective
guests individually cannot really be guaranteed without knowing
internals of any of the involved guests. Therefore such a configuration
cannot really be security-supported, yet making that explicit was so far
missing.
Teddy Astie [Tue, 13 Aug 2024 14:47:19 +0000 (16:47 +0200)]
x86/IOMMU: move tracking in iommu_identity_mapping()
If for some reason xmalloc() fails after having mapped the reserved
regions, an error is reported, but the regions remain mapped in the P2M.
Similarly if an error occurs during set_identity_p2m_entry() (except on
the first call), the partial mappings of the region would be retained
without being tracked anywhere, and hence without there being a way to
remove them again from the domain's P2M.
Move the setting up of the list entry ahead of trying to map the region.
In cases other than the first mapping failing, keep record of the full
region, such that a subsequent unmapping request can be properly torn
down.
To compensate for the potentially excess unmapping requests, don't log a
warning from p2m_remove_identity_entry() when there really was nothing
mapped at a given GFN.
This is XSA-460 / CVE-2024-31145.
Fixes: 2201b67b9128 ("VT-d: improve RMRR region handling") Fixes: c0e19d7c6c42 ("IOMMU: generalize VT-d's tracking of mapped RMRR regions") Signed-off-by: Teddy Astie <teddy.astie@vates.tech> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: beadd68b5490ada053d72f8a9ce6fd696d626596
master date: 2024-08-13 16:36:40 +0200
Matthew Barnes [Thu, 8 Aug 2024 11:47:30 +0000 (13:47 +0200)]
tools/lsevtchn: Use errno macro to handle hypercall error cases
Currently, lsevtchn aborts its event channel enumeration when it hits
an event channel that is owned by Xen.
lsevtchn does not distinguish between different hypercall errors, which
results in lsevtchn missing potential relevant event channels with
higher port numbers.
Use the errno macro to distinguish between hypercall errors, and
continue event channel enumeration if the hypercall error is not
critical to enumeration.
Signed-off-by: Matthew Barnes <matthew.barnes@cloud.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech>
master commit: e92a453c8db8bba62d6be3006079e2b9990c3978
master date: 2024-08-02 08:43:57 +0200
George Dunlap [Thu, 8 Aug 2024 11:47:02 +0000 (13:47 +0200)]
xen/hvm: Don't skip MSR_READ trace record
Commit 37f074a3383 ("x86/msr: introduce guest_rdmsr()") introduced a
function to combine the MSR_READ handling between PV and HVM.
Unfortunately, by returning directly, it skipped the trace generation,
leading to gaps in the trace record, as well as xenalyze errors like
this:
hvm_generic_postprocess: d2v0 Strange, exit 7c(VMEXIT_MSR) missing a handler
Roger Pau Monné [Thu, 8 Aug 2024 11:45:58 +0000 (13:45 +0200)]
x86/altcall: further refine clang workaround
The current code in ALT_CALL_ARG() won't successfully workaround the clang
code-generation issue if the arg parameter has a size that's not a power of 2.
While there are no such sized parameters at the moment, improve the workaround
to also be effective when such sizes are used.
Instead of using a union with a long use an unsigned long that's first
initialized to 0 and afterwards set to the argument value.
Roger Pau Monné [Thu, 8 Aug 2024 11:45:28 +0000 (13:45 +0200)]
x86/dom0: fix restoring %cr3 and the mapcache override on PV build error
One of the error paths in the PV dom0 builder section that runs on the guest
page-tables wasn't restoring the Xen value of %cr3, neither removing the
mapcache override.
Andrew Cooper [Thu, 8 Aug 2024 11:44:56 +0000 (13:44 +0200)]
XSM/domctl: Fix permission checks on XEN_DOMCTL_createdomain
The XSM checks for XEN_DOMCTL_createdomain are problematic. There's a split
between xsm_domctl() called early, and flask_domain_create() called quite late
during domain construction.
All XSM implementations except Flask have a simple IS_PRIV check in
xsm_domctl(), and operate as expected when an unprivileged domain tries to
make a hypercall.
Flask however foregoes any action in xsm_domctl() and defers everything,
including the simple "is the caller permitted to create a domain" check, to
flask_domain_create().
As a consequence, when XSM Flask is active, and irrespective of the policy
loaded, all domains irrespective of privilege can:
* Mutate the global 'rover' variable, used to track the next free domid.
Therefore, all domains can cause a domid wraparound, and combined with a
voluntary reboot, choose their own domid.
* Cause a reasonable amount of a domain to be constructed before ultimately
failing for permission reasons, including the use of settings outside of
supported limits.
In order to remediate this, pass the ssidref into xsm_domctl() and at least
check that the calling domain privileged enough to create domains.
Take the opportunity to also fix the sign of the cmd parameter to be unsigned.
This issue has not been assigned an XSA, because Flask is experimental and not
security supported.
Reported-by: Ross Lagerwall <ross.lagerwall@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com>
master commit: ee32b9b29af449d38aad0a1b3a81aaae586f5ea7
master date: 2024-07-30 17:42:17 +0100
Ross Lagerwall [Thu, 8 Aug 2024 11:44:26 +0000 (13:44 +0200)]
bunzip2: fix rare decompression failure
The decompression code parses a huffman tree and counts the number of
symbols for a given bit length. In rare cases, there may be >= 256
symbols with a given bit length, causing the unsigned char to overflow.
This causes a decompression failure later when the code tries and fails to
find the bit length for a given symbol.
Since the maximum number of symbols is 258, use unsigned short instead.
Fixes: ab77e81f6521 ("x86/dom0: support bzip2 and lzma compressed bzImage payloads") Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 303d3ff85c90ee4af4bad4e3b1d4932fa2634d64
master date: 2024-07-30 11:55:56 +0200
"$dev" needs to be set correctly for backendtype=phy as well as
backendtype=tap. Move the setting into the conditional, so it can be
handled properly for each.
(dev could be captured during tap-ctl allocate for blktap module, but it
would not be set properly for the find_device case. The backendtype=tap
case would need to be handled regardless.)
x86/altcall: fix clang code-gen when using altcall in loop constructs
Yet another clang code generation issue when using altcalls.
The issue this time is with using loop constructs around alternative_{,v}call
instances using parameter types smaller than the register size.
Given the following example code:
static void bar(bool b)
{
unsigned int i;
for ( i = 0; i < 10; i++ )
{
int ret_;
register union {
bool e;
unsigned long r;
} di asm("rdi") = { .e = b };
register unsigned long si asm("rsi");
register unsigned long dx asm("rdx");
register unsigned long cx asm("rcx");
register unsigned long r8 asm("r8");
register unsigned long r9 asm("r9");
register unsigned long r10 asm("r10");
register unsigned long r11 asm("r11");
Clang will generate machine code that only resets the low 8 bits of %rdi
between loop calls, leaving the rest of the register possibly containing
garbage from the use of %rdi inside the called function. Note also that clang
doesn't truncate the input parameters at the callee, thus breaking the psABI.
Fix this by turning the `e` element in the anonymous union into an array that
consumes the same space as an unsigned long, as this forces clang to reset the
whole %rdi register instead of just the low 8 bits.
Fixes: 2ce562b2a413 ('x86/altcall: use a union as register type for function parameters on clang') Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
master commit: d51b2f5ea1915fe058f730b0ec542cf84254fca0
master date: 2024-07-23 13:59:30 +0200
Jan Beulich [Tue, 16 Jul 2024 12:09:14 +0000 (14:09 +0200)]
x86/IRQ: avoid double unlock in map_domain_pirq()
Forever since its introduction the main loop in the function dealing
with multi-vector MSI had error exit points ("break") with different
properties: In one case no IRQ descriptor lock is being held.
Nevertheless the subsequent error cleanup path assumed such a lock would
uniformly need releasing. Identify the case by setting "desc" to NULL,
thus allowing the unlock to be skipped as necessary.
This is CVE-2024-31143 / XSA-458.
Coverity ID: 1605298 Fixes: d1b6d0a02489 ("x86: enable multi-vector MSI") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Thu, 11 Jul 2024 15:09:58 +0000 (16:09 +0100)]
CI: Add Ubuntu 22.04 (Jammy) and 24.04 (Noble) testing
The containers are exactly as per 20.04 (Focal). However, this now brings us
to 5 releases * 4 build jobs worth of Ubuntu testing, which is overkill.
The oldest and newest toolchains are the most likely to find problems with new
code, so reduce the middle 3 releases (18/20/22) to just a single smoke test
each.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Thu, 11 Jul 2024 15:09:22 +0000 (16:09 +0100)]
CI: Refresh Ubuntu Focal container as 20.04-x86_64
As with 16.04 (Xenial), with python3-setuptools included. Having this package
only in some containers was intentional; see commit bbc72a7877d8 ("automation:
Add python3's setuptools to some containers") for the rational.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Wed, 10 Jul 2024 13:37:53 +0000 (14:37 +0100)]
CI: Refresh OpenSUSE Leap container
See prior patch for most discussion.
Despite appearing to be a fixed release (and therefore not marked as permitted
failure), the dockerfile references the `leap` tag which is rolling in
practice. Switch to 15.6 explicitly, for better test stability.
Vs tumbleweed, use `zypper update` rather than dist-upgrade, and retain the
RomBIOS dependencies; bin86 and dev86.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Wed, 10 Jul 2024 13:40:23 +0000 (14:40 +0100)]
CI: Refresh OpenSUSE Tumbleweed container
Existing as suse:opensuse-tumbleweed is a historical quirk, and adjusted for
consistency with all the other containers.
Make it non-root, use heredocs for legibility, and use the zypper long names
for the benefit of those wondering what was being referenced or duplicated.
Trim the dependencies substantially. Testing docs isn't very interesting and
saves a lot of space. Other savings come from removing a huge pile of
optional QEMU dependencies (QEMU just needs to build the Xen parts to be
useful here, not have a full GUI environment).
Finally, there where some packages such as bc, libssh2-devel, libtasn1-devel
and nasm that I'm not aware of any reason to have had, even historically.
Furthermore, identify which components of the build use which dependencies,
which will help managing them in the future.
Thanks to Olaf Hering for dependency fixes that have been subsumed into this
total overhaul.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Tue, 9 Jul 2024 14:54:52 +0000 (15:54 +0100)]
CI: Refresh and upgrade the GCC-IBT container
Upgrade from Debian buster to bookworm, GCC 11.3 to 11.4 and to be a non-root
container.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Mon, 8 Jul 2024 17:18:22 +0000 (18:18 +0100)]
CI: Refresh bullseye-ppc64le as debian:11-ppc64le
... in the style of debian:12-ppc64le.
Rename the jobs and reposition them later as they're not a dependency for the
smoke testing any more.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Mon, 8 Jul 2024 17:17:25 +0000 (18:17 +0100)]
CI: Use debian:12-ppc64le for smoke testing
qemu-system-ppc64/8.1.0-ppc64 was added because bullseye's QEMU didn't
understand the powernv9 machine. However bookworm's QEMU does and this is
preferable to maintaining a random build of QEMU ourselves.
Use the debian:12-ppc64le container and test the output of that build too.
Remove qemu-system-ppc64-8.1.0-ppc64-export which is unused now.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Mon, 8 Jul 2024 17:00:21 +0000 (18:00 +0100)]
CI: Introduce a debian:12-ppc64le container
... conforming to the new naming scheme; $DISTRO-$VERSION-$ARCH-* so the jobs
sort more coherently.
Make it non-root by default, and set XEN_TARGET_ARCH=ppc64. Include QEMU too,
which will be used subsequently.
Add build jobs too, with debian-12-ppc64le-gcc-debug specifically early as it
will be used for smoke testing shortly.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Wed, 10 Jul 2024 12:38:52 +0000 (13:38 +0100)]
CI: Mark Archlinux/x86 as allowing failures
Archlinux is a rolling distro. As a consequence, rebuilding the container
periodically changes the toolchain, and this affects all stable branches in
one go.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Andrew Cooper [Wed, 10 Jul 2024 00:01:13 +0000 (01:01 +0100)]
CI: Drop Ubuntu Trusty testing
This is also End of Life.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Anthony PERARD <anthony.perard@vates.tech> Release-Acked-By: Oleksii Kurochko <oleksii.kurochko@gmail.com>