libfsimage/xfs: Sanity-check the superblock during mounts
Sanity-check the XFS superblock for well-formedness in the mount handler.
This forces pygrub to abort parsing a potentially malformed filesystem and
ensures the invariants assumed throughout the rest of the code hold.
Also, derive parameters from previously sanitized parameters where possible
(rather than reading them off the superblock).
The code doesn't try to avoid overflowing the end of the disk, because
that's an unlikely and benign error. Parameters used in calculations of
xfs_daddr_t (like the root inode index) aren't in critical need of being
sanitized.
The sanitization of agblklog is basically checking that no obvious
overflows happen on agblklog, and then ensuring agblocks is contained in
the range (2^(sb_agblklog-1), 2^sb_agblklog].
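As a rough sketch (all names and the exact bounds handling here are illustrative, not the actual libfsimage code), the agblklog check amounts to:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: agblocks must lie in (2^(agblklog-1), 2^agblklog],
 * and agblklog itself must be small enough that the shifts cannot
 * overflow a 32-bit value. */
static bool agblklog_sane(uint32_t agblocks, uint8_t agblklog)
{
    if ( agblklog == 0 || agblklog >= 32 )
        return false;   /* shift would overflow, or the range is empty */

    return agblocks > (UINT32_C(1) << (agblklog - 1)) &&
           agblocks <= (UINT32_C(1) << agblklog);
}
```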
This is part of XSA-443 / CVE-2023-34325
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monné [Tue, 13 Jun 2023 13:01:05 +0000 (15:01 +0200)]
iommu/amd-vi: flush IOMMU TLB when flushing the DTE
The cache invalidation guidelines from the AMD-Vi specification (48882-Rev
3.07-PUB-Oct 2022) seem to be misleading on some hardware, as devices will
malfunction (stale DMA mappings are observed) if some fields of the DTE are
updated but the IOMMU TLB is not flushed. This has been observed in practice
on AMD systems. Given the lack of guidance in the currently published
specification, this patch increases the flushing done in order to prevent
device malfunction.
To fix this, issue an INVALIDATE_IOMMU_PAGES command from
amd_iommu_flush_device(), flushing all the address space. Note this requires
callers to be adjusted in order to pass the DomID on the DTE prior to the
modification.
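The command ordering can be sketched as follows; the helper names are hypothetical stand-ins for the real AMD-Vi command-queue routines, and the logging exists only to make the sketch self-contained:

```c
#include <stdint.h>
#include <string.h>

static char cmd_log[128];   /* records the commands issued, for illustration */

/* Hypothetical stand-in for queueing INVALIDATE_DEVTAB_ENTRY. */
static void invalidate_dev_table_entry(uint16_t bdf)
{
    (void)bdf;
    strcat(cmd_log, "DEVTAB;");
}

/* Hypothetical stand-in for queueing INVALIDATE_IOMMU_PAGES covering the
 * whole address space of a domain. */
static void invalidate_iommu_pages_all(uint16_t domid)
{
    (void)domid;
    strcat(cmd_log, "PAGES;");
}

/* After a DTE update, also flush the IOMMU TLB for the DomID the DTE held
 * before the change, since some hardware keeps stale translations. */
static void amd_iommu_flush_device(uint16_t bdf, uint16_t prev_domid)
{
    invalidate_dev_table_entry(bdf);
    invalidate_iommu_pages_all(prev_domid);
}
```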
Some call sites don't provide a valid DomID to amd_iommu_flush_device() in
order to avoid the flush. That's because the device had address translations
disabled and hence the previous DomID on the DTE is not valid. Note the
current logic relies on the entity disabling address translations to also
flush the TLB of the in-use DomID.
Device I/O TLB flushing when ATS is enabled is not covered by the current
change, as ATS usage is not security-supported.
This is XSA-442 / CVE-2023-34326
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Michal Orzel [Fri, 6 Oct 2023 07:51:41 +0000 (09:51 +0200)]
x86: Clarify that only 5 hypercall parameters are supported
The x86 hypercall ABI really used to have 6-argument hypercalls. V4V, the
downstream predecessor to Argo, did take a 6th argument.
However, the 6th arg being %ebp in the 32bit ABI makes it unusable in
practice, because that's the frame pointer in builds with frame pointers
enabled. Therefore Argo was altered to being a 5-arg hypercall when it was
upstreamed.
c/s 2f531c122e95 ("x86: limit number of hypercall parameters to 5") removed
the ability for hypercalls to take 6 arguments.
Update the documentation to match reality.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Fri, 6 Oct 2023 14:50:46 +0000 (16:50 +0200)]
tools/xenpvnetboot: remove as unable to convert to Python 3
The script heavily relies on the urlgrabber python module, which doesn't seem
to be packaged by all distros; it's missing from newer Debian versions at
least.
Also, usage of the commands module has been deprecated since Python 2.6 and
removed in Python 3, so the code would need to be rewritten to not rely on
the urlgrabber and commands modules.
Arguably the purpose of the xenpvnetboot bootloader can also be achieved with
an isolated script that fetches the kernel and ramdisk before attempting to
launch the domain, without having to run in libxl context. There are no xl.cfg
options consumed by the bootloader apart from the base location and the
subpaths of the kernel and ramdisk to fetch.
Any interested parties in keeping such functionality are free to submit an
updated script that works with Python 3.
Resolves: xen-project/xen#172 Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Andrew Cooper [Fri, 6 Oct 2023 13:53:20 +0000 (14:53 +0100)]
x86/memshr: Fix build in copy_vcpu_settings()
The last user of this variable was dropped.
Fixes: 295514ff7550 ("common: convert vCPU info area registration") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Fri, 6 Oct 2023 13:00:59 +0000 (15:00 +0200)]
domain: expose newly introduced hypercalls as XENFEAT
XENFEAT_runstate_phys_area is exposed to all architectures, while
XENFEAT_vcpu_time_phys_area is currently only implemented for x86, and hence
the feature flag is also only exposed on x86.
Additionally add dummy guards with TODOs in the respective hypercall
implementations, to signal the intention to control the availability of those
hypercalls on a guest-by-guest basis from the toolstack.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Fri, 6 Oct 2023 13:00:58 +0000 (15:00 +0200)]
domain: fix misaligned unmap address in {,un}map_guest_area()
unmap_domain_page_global() expects the provided address to be page aligned, or
else some of the called functions will trigger assertions, like
modify_xen_mappings() on x86 or destroy_xen_mappings() on Arm.
The following assert has been reported by osstest arm 32bit tests:
arm/ioreq: guard interaction data on read/write operations
For read operations, there's a potential issue when the data field
of the ioreq struct is partially updated in the response. To address
this, zero the data field during read operations. This modification
serves as a safeguard against implementations that may inadvertently
partially update the data field in response to read requests.
For instance, consider an 8-bit read operation. In such cases, QEMU
returns the same content of the data field with only 8 bits of
updated data. This behavior could potentially result in the
propagation of incorrect or unintended data to ioreq clients.
During a write access, the Device Model only needs to know the content
of the bits associated with the access size (e.g. for 8-bit, the lower
8 bits). During a read access, the Device Model doesn't need to know any
value. So restrict the value it can access.
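The restriction described above can be sketched as a small helper (the name is hypothetical; the real guard lives in the arm ioreq handling):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: on a write, the Device Model only gets the bits
 * covered by the access size; on a read, the data field it sees is
 * zeroed, since it needs no value at all. */
static uint64_t ioreq_visible_data(uint64_t data, unsigned int bytes,
                                   bool is_write)
{
    if ( !is_write )
        return 0;                       /* reads: DM needs no value */

    if ( bytes >= 8 )
        return data;                    /* full-width access */

    return data & ((UINT64_C(1) << (bytes * 8)) - 1);
}
```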
Signed-off-by: Andrii Chepurnyi <andrii_chepurnyi@epam.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:27 +0000 (17:11 +0200)]
common: convert vCPU info area registration
Switch to using map_guest_area(). Noteworthy differences from
map_vcpu_info():
- remote vCPU-s are paused rather than checked for being down (which in
principle can change right after the check),
- the domain lock is taken for a much smaller region,
- the error code for an attempt to re-register the area is now -EBUSY,
- we could in principle permit de-registration when no area was
previously registered (which would permit "probing", if necessary for
anything).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:26 +0000 (17:11 +0200)]
x86: introduce GADDR based secondary time area registration alternative
The registration by virtual/linear address has downsides: The access is
expensive for HVM/PVH domains. Furthermore for 64-bit PV domains the area
is inaccessible (and hence cannot be updated by Xen) when in guest-user
mode.
Introduce a new vCPU operation allowing to register the secondary time
area by guest-physical address.
An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:25 +0000 (17:11 +0200)]
domain: introduce GADDR based runstate area registration alternative
The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the area is inaccessible (and hence cannot be updated by Xen)
when in guest-user mode.
Introduce a new vCPU operation allowing to register the runstate area by
guest-physical address.
An at least theoretical downside to using physically registered areas is
that PV then won't see dirty (and perhaps also accessed) bits set in its
respective page table entries.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:24 +0000 (17:11 +0200)]
domain: map/unmap GADDR based shared guest areas
The registration by virtual/linear address has downsides: At least on
x86 the access is expensive for HVM/PVH domains. Furthermore for 64-bit
PV domains the areas are inaccessible (and hence cannot be updated by
Xen) when in guest-user mode, and for HVM guests they may be
inaccessible when Meltdown mitigations are in place. (There are yet
more issues.)
In preparation of the introduction of new vCPU operations allowing to
register the respective areas (one of the two is x86-specific) by
guest-physical address, flesh out the map/unmap functions.
Noteworthy differences from map_vcpu_info():
- areas can be registered more than once (and de-registered),
- remote vCPU-s are paused rather than checked for being down (which in
principle can change right after the check),
- the domain lock is taken for a much smaller region.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Wed, 4 Oct 2023 13:53:31 +0000 (15:53 +0200)]
x86/mem-sharing: copy GADDR based shared guest areas
In preparation of the introduction of new vCPU operations allowing to
register the respective areas (one of the two is x86-specific) by
guest-physical address, add the necessary fork handling (with the
backing function yet to be filled in).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:22 +0000 (17:11 +0200)]
x86: update GADDR based secondary time area
Before adding a new vCPU operation to register the secondary time area
by guest-physical address, add code to actually keep such areas up-to-
date.
Note that pages aren't marked dirty when written to (matching the
handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:21 +0000 (17:11 +0200)]
domain: update GADDR based runstate guest area
Before adding a new vCPU operation to register the runstate area by
guest-physical address, add code to actually keep such areas up-to-date.
Note that updating of the area will be done exclusively following the
model enabled by VMASST_TYPE_runstate_update_flag for virtual-address
based registered areas.
Note further that pages aren't marked dirty when written to (matching
the handling of space mapped by map_vcpu_info()), on the basis that the
registrations are lost anyway across migration (or would need re-
populating at the target for transparent migration). Plus the contents
of the areas in question have to be deemed volatile in the first place
(so saving a "most recent" value is pretty meaningless even for e.g.
snapshotting).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:20 +0000 (17:11 +0200)]
domain: GADDR based shared guest area registration alternative - teardown
In preparation of the introduction of new vCPU operations allowing to
register the respective areas (one of the two is x86-specific) by
guest-physical address, add the necessary domain cleanup hooks.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 2 Oct 2023 15:11:19 +0000 (17:11 +0200)]
x86/shim: zap runstate and time area handles during shutdown
While likely the guest would just re-register the same areas after
a possible resume, let's not take this for granted and avoid the risk of
otherwise corrupting guest memory.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Mon, 2 Oct 2023 15:11:18 +0000 (17:11 +0200)]
mem_sharing/fork: do not attempt to populate vcpu_info page
Instead let map_vcpu_info() and its call to get_page_from_gfn()
populate the page in the child as needed. Also remove the bogus
copy_domain_page(): it would need to be placed before the call to
map_vcpu_info(), as the latter can update the contents of the vcpu_info page.
Note that this eliminates a bug in copy_vcpu_settings(): The function did
allocate a new page regardless of the GFN already having a mapping, thus in
particular breaking the case of two vCPU-s having their info areas on the same
page.
Fixes: 41548c5472a3 ('mem_sharing: VM forking') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
SUPPORT: downgrade Physical CPU Hotplug to Experimental
The feature is not commonly used, and we don't have hardware to test it,
not in OSSTest, not in Gitlab, and not even ad-hoc manually by community
members. We could use QEMU to test it, but even that is known not to
work on our end.
Also take the opportunity to rename the feature to "ACPI CPU Hotplug"
for clarity.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
During the discussions that led to the acceptance of Rule 2.1, we
decided on a few exceptions that were not properly recorded in
rules.rst. Add them now.
Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Bertrand Marquis <bertrand.marquis@arm.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
The code to set up the stack in head.S erroneously loads the bottom of
the stack (the symbol cpu0_boot_stack) into r1 instead of the top of the
stack (cpu0_boot_stack + STACK_SIZE).
Fixes: 3a4e6f67bc68 ("xen/ppc: Set up a basic C environment") Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
David Kahurani [Fri, 29 Sep 2023 04:57:24 +0000 (07:57 +0300)]
tools/xenstore: Avoid leaking memory in check_store
check_store() will leak the memory from reading the "@introduceDomain" and
"@releaseDomain" nodes.
While this code should not be trigger-able from an unprivileged domain
it is called multiple times when the database gets inconsistent. This
means that a malicious guest able to corrupt the database will trigger
the leaks here.
Fix the leaks so that this code can be safely called from anywhere.
Fixes: 67617067f0b6 ("tools/xenstore: let check_store() check the accounting data") Signed-off-by: David Kahurani <k.kahurani@gmail.com> Reviewed-by: Juergen Gross <jgross@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
xen/common: Add NUMA node id bounds check to page_alloc.c/node_to_scrub
When building for Power with CONFIG_DEBUG unset, a compiler error gets
raised inside page_alloc.c's node_to_scrub function:
common/page_alloc.c: In function 'node_to_scrub.part.0':
common/page_alloc.c:1217:29: error: array subscript 1 is above array
bounds of 'long unsigned int[1]' [-Werror=array-bounds]
1217 | if ( node_need_scrub[node] )
It appears that this is a false positive, given that in practice
cycle_node should never return a node ID >= MAX_NUMNODES as long as the
architecture's node_online_map is properly defined and initialized, so
this additional bounds check is only to satisfy GCC.
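The added check can be sketched as below (the helper is hypothetical and MAX_NUMNODES is shrunk to 1 to mirror the non-NUMA configuration that triggers the warning):

```c
/* Illustrative sketch: bound the node id before indexing
 * node_need_scrub[], purely to satisfy GCC's array-bounds analysis in
 * builds where MAX_NUMNODES is 1. */
#define MAX_NUMNODES 1

static unsigned long node_need_scrub[MAX_NUMNODES];

static unsigned long node_scrub_pending(unsigned int node)
{
    if ( node >= MAX_NUMNODES )   /* unreachable if node maps are sane */
        return 0;

    return node_need_scrub[node];
}
```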
automation: Change build script to use arch defconfig
Change automation build script to call the make defconfig target before
setting CONFIG_DEBUG and extra options. This fixes issues on Power where
the build fails without using the ppc64_defconfig.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
ARM: GICv3 ITS: flush caches for newly allocated ITT
ITS manages Device Tables and Interrupt Translation Tables on its own,
so generally we are not interested in maintaining any coherence with the
CPU's view of those memory regions, except in one case: the ITS requires
that Interrupt Translation Tables be initialized with zeroes. Existing
code already does this, but it does not clean the caches afterwards.
This means the ITS may see an uninitialized ITT, and the CPU can
overwrite portions of the ITT later, when it finally decides to flush
its caches. The visible effect of this issue is that no interrupts are
delivered from a device.
Fix this by calling clean_and_invalidate_dcache_va_range() for newly
allocated ITT.
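In outline the fix looks like this; clean_and_invalidate_dcache_va_range() is stubbed out here so the sketch is self-contained (the real one issues dcache clean/invalidate operations), and the allocator name is illustrative:

```c
#include <stdlib.h>
#include <stddef.h>

/* Stub standing in for Xen's arch cache-maintenance helper; the real
 * implementation performs dcache clean + invalidate over the range. */
static void clean_and_invalidate_dcache_va_range(const void *p, size_t size)
{
    (void)p;
    (void)size;
}

/* Illustrative sketch: allocate the ITT zero-initialised, then clean the
 * cache so the ITS (a non-coherent observer) sees the zeroes rather than
 * whatever was in memory before. */
static void *its_alloc_itt(size_t size)
{
    void *itt = calloc(1, size);    /* zeroed, as the ITS requires */

    if ( itt )
        clean_and_invalidate_dcache_va_range(itt, size);

    return itt;
}
```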
Fixes: 69082e1c210d ("ARM: GICv3 ITS: introduce device mapping") Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Tested-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
drivers/video: make declarations of defined functions available
The declarations for 'vesa_{init,early_init,endboot}' needed by
'xen/drivers/video/vesa.c' and 'fill_console_start_info' in 'vga.c'
are now available by moving the relevant code inside 'vga.h'.
While moving the code, the alternative definitions are now guarded by
CONFIG_VGA. The alternative #define-s for 'vesa_early_init' and 'vesa_endboot'
are dropped, since currently they have no callers when CONFIG_VGA is not defined.
This also resolves violations of MISRA C:2012 Rule 8.4.
Fixes: 6d9199bd0f22 ("x86-64: enable hypervisor output on VESA frame buffer") Signed-off-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
xen/emul-i8254: remove forward declarations and re-order functions
Remove forward declarations, including one that violates MISRA C Rule
8.3 ("All declarations of an object or function shall use the same
names and type qualifiers"), and re-order functions.
No functional change.
Signed-off-by: Federico Serafini <federico.serafini@bugseng.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
xen/pdx: Standardize region validation wrt pdx compression
Regions must be occasionally validated for pdx compression validity. That
is, whether any of the machine addresses spanning the region have a bit set
in the pdx "hole" (which is expected to always contain zeroes). There are
a few such tests through the code, and they all check for different things.
This patch replaces all such occurrences with a call to a centralized
function that checks a region for validity.
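A minimal sketch of such a centralized check, under the assumption (names are illustrative, not Xen's) that the pdx hole is described by a mask of bits that must be zero in every valid machine address:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical hole mask: bits expected to be zero in all valid
 * machine addresses. */
static uint64_t pdx_hole_mask;

/* Smear the top set bit downwards: fill_mask(0x40) == 0x7f. */
static uint64_t fill_mask(uint64_t mask)
{
    while ( mask & (mask + 1) )
        mask |= mask + 1;

    return mask;
}

/* A region [base, base + len) is valid for pdx compression iff no
 * address it spans has a bit set inside the hole.  The bits any such
 * address can set are those of base, plus all bits at or below the
 * highest bit that varies across the region. */
static bool region_is_compressible(uint64_t base, uint64_t len)
{
    uint64_t span = base | fill_mask(base ^ (base + len - 1));

    return !(span & pdx_hole_mask);
}
```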
Signed-off-by: Alejandro Vallejo <alejandro.vallejo@cloud.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Andrew Cooper [Tue, 12 Sep 2023 21:31:43 +0000 (22:31 +0100)]
x86/pv: Fix the determination of whether to inject #DB
We long ago fixed the emulator to not inject exceptions behind our back.
Therefore, assert that a PV event (including interrupts, because that
would be buggy too) isn't pending, rather than skipping the #DB injection if
one is.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Nicola reports that the XSA-438 fix introduced new MISRA violations because of
some incidental tidying it tried to do. The parameter is useless, so resolve
the MISRA regression by removing it.
hap_update_cr3() discards the parameter entirely, while sh_update_cr3() uses
it to distinguish internal and external callers and therefore whether the
paging lock should be taken.
However, we have paging_lock_recursive() for this purpose, which also avoids
the ability for the shadow internal callers to accidentally not hold the lock.
Fixes: fb0ff49fe9f7 ("x86/shadow: defer releasing of PV's top-level shadow reference") Reported-by: Nicola Vetrini <nicola.vetrini@bugseng.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
George Dunlap [Fri, 30 Jun 2023 11:06:32 +0000 (12:06 +0100)]
credit: Don't steal vcpus which have yielded
On large systems with many vcpus yielding due to spinlock priority
inversion, it's not uncommon for a vcpu to yield its timeslice, only
to be immediately stolen by another pcpu looking for higher-priority
work.
To prevent this:
* Keep the YIELD flag until a vcpu is removed from a runqueue
* When looking for work to steal, skip vcpus which have yielded
NB that this does mean that sometimes a VM is inserted into an empty
runqueue; handle that case.
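The steal-side change can be sketched with hypothetical names (the real code walks csched runqueues; this standalone version just scans an array):

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative sketch: when scanning a peer pcpu's runqueue for work to
 * steal, pass over units that yielded their timeslice. */
struct unit {
    bool yielded;   /* YIELD flag, kept until the unit is dequeued */
    int  pri;
};

static struct unit *steal_candidate(struct unit *rq, size_t n, int min_pri)
{
    for ( size_t i = 0; i < n; i++ )
    {
        if ( rq[i].yielded )
            continue;               /* don't immediately re-run a yielder */

        if ( rq[i].pri >= min_pri )
            return &rq[i];
    }

    return NULL;                    /* nothing worth stealing */
}
```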
Signed-off-by: George Dunlap <george.dunlap@cloud.com> Reviewed-by: Juergen Gross <jgross@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
George Dunlap [Mon, 18 Sep 2023 15:46:47 +0000 (16:46 +0100)]
credit: Limit load balancing to once per millisecond
The credit scheduler tries as hard as it can to ensure that it always
runs scheduling units with positive credit (PRI_TS_UNDER) before
running those with negative credit (PRI_TS_OVER). If the next
runnable scheduling unit is of priority OVER, it will always run the
load balancer, which will scour the system looking for another
scheduling unit of the UNDER priority.
Unfortunately, as the number of cores on a system has grown, the cost
of the work-stealing algorithm has dramatically increased; a recent
trace on a system with 128 cores showed this taking over 50
microseconds.
Add a parameter, load_balance_ratelimit, to limit the frequency of
load balance operations on a given pcpu. Default this to 1
millisecond.
Invert the load balancing conditional to make it more clear, and line
up more closely with the comment above it.
Overall it might be cleaner to have the last_load_balance checking
happen inside csched_load_balance(), but that would require either
passing both now and spc into the function, or looking them up again;
both of which seemed to be worse than simply checking and setting the
values before calling it.
On a system with a vcpu:pcpu ratio of 2:1, running Windows guests
(which will end up calling YIELD during spinlock contention), this
patch increased performance significantly.
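The per-pcpu ratelimit described above can be sketched as follows (names are hypothetical; the real check-and-set happens in the caller before invoking csched_load_balance(), as the commit message explains):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: allow at most one load-balance pass per pcpu per
 * ratelimit interval, defaulting to 1ms. */
static uint64_t opt_load_balance_ratelimit_us = 1000;

static bool should_load_balance(uint64_t now_us, uint64_t *last_us)
{
    if ( now_us - *last_us < opt_load_balance_ratelimit_us )
        return false;               /* balanced too recently: skip */

    *last_us = now_us;              /* record this pass */
    return true;
}
```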
Signed-off-by: George Dunlap <george.dunlap@cloud.com> Reviewed-by: Juergen Gross <jgross@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Testing on a Kaby Lake box with 8 CPUs leads to the serial buffer
being filled halfway during dom0 boot, and thus a non-trivial chunk of
Linux boot messages are dropped.
Increasing the buffer to 32K does fix the issue and Linux boot
messages are no longer dropped. There's no justification either for
why 16K was chosen, and hence bumping to 32K in order to cope with
current systems generating output faster seems appropriate to provide
a better user experience with the provided defaults.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com>
Henry Wang [Sat, 16 Sep 2023 04:06:49 +0000 (12:06 +0800)]
xen/arm64: head.S: Fix wrong enable_boot_cpu_mm() code movement
Some addressed comments on enable_boot_cpu_mm() were reintroduced
during the code movement from arm64/head.S to arm64/mmu/head.S.
We should drop the unreachable code, move the 'mov lr, x5' closer to
'b remove_identity_mapping' so it is clearer that it will return,
and update the in-code comment accordingly.
Fixes: 6734327d76be ("xen/arm64: Split and move MMU-specific head.S to mmu/head.S") Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Ayan Kumar Halder <ayan.kumar.halder@amd.com>
tools/light: Revoke permissions when a PCI device is detached from an HVM domain
Currently, libxl will grant IOMEM, I/O port and IRQ permissions when
a PCI is attached (see pci_add_dm_done()) for all domain types. However,
the permissions are only revoked for non-HVM domain (see do_pci_remove()).
This means that HVM domains will be left with extra permissions. While
this looks bad on paper, the IRQ permissions should be revoked
when the Device Model calls xc_physdev_unmap_pirq(), and such a domain
cannot directly map I/O port and IOMEM regions. Instead, this has to
be done by a Device Model.
The Device Model can only run in dom0 or PV stubdomain (upstream libxl
doesn't have support for HVM/PVH stubdomain).
For PV/PVH stubdomain, the permissions are properly revoked, so there is
no security concern.
This leaves dom0. There are two cases:
1) Privileged: Anyone gaining access to the Device Model would already
have large control on the host.
2) Deprivileged: PCI passthrough requires PHYSDEV operations which
are not accessible when the Device Model is restricted.
So overall, it is believed that the extra permissions cannot be exploited.
Rework the code so the permissions are all removed for HVM domains.
This needs to happen after the QEMU has detached the device. So
the revocation is now moved to pci_remove_detached().
Also add a comment on top of the error message when the PIRQ cannot
be unbound, to explain this could be a spurious error as QEMU may have
already done it.
Signed-off-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
README: Remove old note about the build system's python expectation
Changesets 5852ca485263 ("build: fix tools/configure in case only python3
exists") and c8a8645f1efe ("xen/build: Automatically locate a suitable python
interpreter") cause the build to look for python3, python and python2 in that
order.
Remove the outdated note from the README.
Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools: Don't use distutils in configure or Makefile
Python distutils is deprecated and is going to be removed in Python
3.12. distutils.sysconfig is available as the sysconfig module in
stdlib since Python 2.7 and Python 3.2, so use that directly.
Update the README to reflect that we now depend on Python 2.7.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
[Regen ./configure] Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools/python: convert setup.py to use setuptools if available
Python distutils is deprecated and is going to be removed in Python
3.12. Add support for setuptools.
Setuptools in Python 3.11 complains:
SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
Keep using setup.py anyway to build the C extension.
Signed-off-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
automation: Add python3's setuptools to some containers
In preparation of supporting both distutils and setuptools, add the
python3 setuptools module to the containers that have recent python3
installations.
Debian Stretch, Ubuntu trusty (14.04), Ubuntu xenial (16.04) and
Ubuntu bionic (18.04) are kept without setuptools on purpose, to test
installations that don't have it.
Centos 7 in particular is kept with python2 only.
Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Wed, 20 Sep 2023 09:31:42 +0000 (10:31 +0100)]
x86/shadow: defer releasing of PV's top-level shadow reference
sh_set_toplevel_shadow() re-pinning the top-level shadow we may be
running on is not enough (and at the same time unnecessary when the
shadow isn't what we're running on): That shadow becomes eligible for
blowing away (from e.g. shadow_prealloc()) immediately after the
paging lock was dropped. Yet it needs to remain valid until the actual
page table switch occurred.
Propagate up the call chain the shadow entry that needs releasing
eventually, and carry out the release immediately after switching page
tables. Handle update_cr3() failures by switching to idle pagetables.
Note that various further uses of update_cr3() are HVM-only or only act
on paused vCPU-s, in which case sh_set_toplevel_shadow() will not defer
releasing of the reference.
While changing the update_cr3() hook, also convert the "do_locking"
parameter to boolean.
This is CVE-2023-34322 / XSA-438.
Reported-by: Tim Deegan <tim@xen.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@cloud.com>
Andrew Cooper [Tue, 19 Sep 2023 10:23:34 +0000 (11:23 +0100)]
x86/entry: Partially revert IST-exit checks
The patch adding check_ist_exit() didn't account for the fact that
reset_stack_and_jump() is not an ABI-preserving boundary. The IST-ness in
%r12 doesn't survive into the next context, and is a stale value.
There's no straightforward way to reconstruct the IST-exit-ness on the
exit-to-guest path after a context switch. For now, we only need IST-exit on
the return-to-Xen path.
Fixes: 21bdc25b05a0 ("x86/entry: Track the IST-ness of an entry for the exit paths") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Add stub function and symbol definitions required by common code. If the
file that the definition is supposed to be located in doesn't exist yet,
temporarily place its definition in the new stubs.c.
Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Simon Gaiser [Tue, 19 Sep 2023 09:02:13 +0000 (11:02 +0200)]
x86/ACPI: Fix logging of MADT entries
The recent change to ignore MADT entries with invalid APIC IDs also
affected logging of MADT entries. That's not desired [1] [2], so restore
the old behavior.
Andrew Cooper [Wed, 30 Aug 2023 19:24:25 +0000 (20:24 +0100)]
x86/spec-ctrl: Mitigate the Zen1 DIV leakage
In the Zen1 microarchitecture, there is one divider in the pipeline which
services uops from both threads. In the case of #DE, the latched result from
the previous DIV to execute will be forwarded speculatively.
This is an interesting covert channel that allows two threads to communicate
without any system calls. It also allows userspace to obtain the result of
the most recent DIV instruction executed (even speculatively) in the core,
which can be from a higher privilege context.
Scrub the result from the divider by executing a non-faulting divide. This
needs performing on the exit-to-guest paths, and ist_exit-to-Xen.
Alternatives in IST context are believed safe now that it's done in NMI
context.
This is XSA-439 / CVE-2023-20588.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 15 Sep 2023 11:13:51 +0000 (12:13 +0100)]
x86/amd: Introduce is_zen{1,2}_uarch() predicates
We already have 3 cases using STIBP as a Zen1/2 heuristic, and are about to
introduce a 4th. Wrap the heuristic into a pair of predicates rather than
opencoding it, and the explanation of the heuristic, at each usage site.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 13 Sep 2023 12:53:33 +0000 (13:53 +0100)]
x86/spec-ctrl: Issue VERW during IST exit to Xen
There is a corner case where e.g. an NMI hitting an exit-to-guest path after
SPEC_CTRL_EXIT_TO_* would have run the entire NMI handler *after* the VERW
flush to scrub potentially sensitive data from uarch buffers.
In order to compensate, issue VERW when exiting to Xen from an IST entry.
SPEC_CTRL_EXIT_TO_XEN already has two reads of spec_ctrl_flags off the stack,
and we're about to add a third. Load the field into %ebx, and list the
register as clobbered.
%r12 has been arranged to be the ist_exit signal, so add this as an input
dependency and use it to identify when to issue a VERW.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Wed, 13 Sep 2023 11:20:12 +0000 (12:20 +0100)]
x86/entry: Track the IST-ness of an entry for the exit paths
Use %r12 to hold an ist_exit boolean. This register is zero elsewhere in the
entry/exit asm, so it only needs setting in the IST path.
As this is subtle and fragile, add check_ist_exit() to be used in debugging
builds to cross-check that the ist_exit boolean matches the entry vector.
Write check_ist_exit() in C, because it's debug-only and the logic is more
complicated than I care to maintain in asm.
For now, we only need to use this signal in the exit-to-Xen path, but some
exit-to-guest paths happen in IST context too. Check the correctness in all
exit paths to avoid the logic bit-rotting.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
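The cross-check described above can be modelled as follows. The vector numbers are the architectural x86 values, but the function shape, field names, and the exact set of IST vectors are assumptions for illustration, not Xen's exact code:

```c
#include <assert.h>
#include <stdbool.h>

#define X86_EXC_DB  1   /* #DB - delivered on an IST stack */
#define X86_EXC_NMI 2   /* NMI - delivered on an IST stack */
#define X86_EXC_MC  18  /* #MC - delivered on an IST stack */

/* Does this entry vector use an IST stack? (simplified set) */
static bool vector_is_ist(unsigned int ev)
{
    return ev == X86_EXC_DB || ev == X86_EXC_NMI || ev == X86_EXC_MC;
}

/* Debug-build check on exit paths: the ist_exit boolean carried in %r12
 * must agree with the vector recorded at entry. */
static void check_ist_exit(unsigned int entry_vector, bool ist_exit)
{
    assert(vector_is_ist(entry_vector) == ist_exit);
}
```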
Andrew Cooper [Tue, 12 Sep 2023 16:03:16 +0000 (17:03 +0100)]
x86/spec-ctrl: Fold DO_SPEC_CTRL_EXIT_TO_XEN into its single user
With the SPEC_CTRL_EXIT_TO_XEN{,_IST} confusion fixed, it's now obvious that
there's only a single EXIT_TO_XEN path. Fold DO_SPEC_CTRL_EXIT_TO_XEN into
SPEC_CTRL_EXIT_TO_XEN to simplify further fixes.
When merging labels, switch the name to .L\@_skip_sc_msr as "skip" on its own
is going to be too generic shortly.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 12 Sep 2023 14:06:49 +0000 (15:06 +0100)]
x86/spec-ctrl: Fix confusion between SPEC_CTRL_EXIT_TO_XEN{,_IST}
c/s 3fffaf9c13e9 ("x86/entry: Avoid using alternatives in NMI/#MC paths")
dropped the only user, leaving behind the (incorrect) implication that Xen had
split exit paths.
Delete the unused SPEC_CTRL_EXIT_TO_XEN and rename SPEC_CTRL_EXIT_TO_XEN_IST
to SPEC_CTRL_EXIT_TO_XEN for consistency.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Implement bitops.h, based on Linux's implementation as of commit
5321d1b1afb9a17302c6cec79f0cbf823eb0d3fc. Though it is based on Linux's
implementation, this code diverges significantly in a number of ways:
- Bitmap entries changed to 32-bit words to match x86 and Arm on Xen
- PPC32-specific code paths dropped
- Formatting completely re-done to more closely line up with Xen,
  including 4-space indentation
- Use GCC's __builtin_popcount* for the hweight* implementations
Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com>
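The last bullet above amounts to the following. A minimal sketch of hweight ("Hamming weight", i.e. population count) built on GCC's builtins, rather than the open-coded loops Linux carries for older compilers:

```c
#include <assert.h>
#include <stdint.h>

/* hweight* as thin wrappers over the compiler builtins.  The builtins
 * are real GCC/Clang intrinsics; the wrapper shape mirrors the usual
 * bitops.h convention. */
static inline unsigned int hweight32(uint32_t w)
{
    return __builtin_popcount(w);
}

static inline unsigned int hweight64(uint64_t w)
{
    return __builtin_popcountll(w);
}
```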
OpenBSD 7.3 will unconditionally access HWCR if the TSC is reported as
Invariant, and it will then attempt to also unconditionally access PSTATE0 if
HWCR.TscFreqSel is set (currently the case on Xen).
The motivation for exposing HWCR.TscFreqSel was to avoid warning messages from
Linux. It has been agreed that Linux should be changed instead to not
complain about missing HWCR.TscFreqSel when running virtualized.
The relation between HWCR.TscFreqSel and PSTATE0 is not clearly written down in
the PPR, but it's natural for OSes to attempt to fetch the P0 frequency if the
TSC is stated to increment at the P0 frequency.
Exposing PSTATEn (PSTATE0 at least) with all zeroes is not a suitable solution
because the PstateEn bit is read-write, and OSes could legitimately attempt to
set PstateEn=1 which Xen couldn't handle.
Furthermore, the TscFreqSel bit is model specific and was never safe to expose
like this in the first place. At a minimum it should have had a toolstack
adjustment to know not to migrate such a VM.
Therefore, simply remove the bit. Note the HWCR itself is an architectural
register, and does need to be accessible by the guest. Since HWCR contains
both architectural and non-architectural bits, going forward care must be taken
to assert the exposed value is correct on newer CPU families.
Reported-by: Solène Rapenne <solene@openbsd.org> Link: https://github.com/QubesOS/qubes-issues/issues/8502 Fixes: 14b95b3b8546 ('x86/AMD: expose HWCR.TscFreqSel to guests') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Mon, 18 Sep 2023 13:06:59 +0000 (15:06 +0200)]
timer: fix NR_CPUS=1 build with gcc13
Gcc13 apparently infers from "if ( old_cpu < new_cpu )" that "new_cpu"
is >= 1, and then (on x86) complains about "per_cpu(timers, new_cpu)"
exceeding __per_cpu_offset[]'s bounds (being an array of 1 in such a
configuration). Make the code conditional upon there being at least 2
CPUs configured (otherwise there simply is nothing to migrate [to]).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: George Dunlap <george.dunlap@cloud.com>
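The shape of that fix can be modelled as below. NR_CPUS, the function name, and the per-CPU access are all stand-ins for the real code:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 1   /* the configuration that triggered the warning */

/* Toy model: guard the migration path with a compile-time condition so
 * that, with NR_CPUS == 1, the body that would index the per-CPU area
 * with new_cpu is provably dead, and gcc13 cannot flag an out-of-bounds
 * access into a 1-element __per_cpu_offset[]. */
static bool migrate_timers(unsigned int old_cpu, unsigned int new_cpu)
{
    if ( NR_CPUS > 1 && old_cpu != new_cpu )
    {
        /* per_cpu(timers, new_cpu) would be touched here */
        return true;
    }
    return false;   /* single-CPU build: nothing to migrate to */
}
```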
Michal Orzel [Tue, 12 Sep 2023 10:53:41 +0000 (12:53 +0200)]
xen/arm: Skip Xen specific nodes/properties from hwdom /chosen node
Skip the following Xen specific host device tree nodes/properties
from being included into hardware domain /chosen node:
- xen,static-heap: this property informs Xen about memory regions
reserved exclusively as static heap,
- xen,domain-shared-memory-v1: node with this compatible informs Xen
about static shared memory region for a domain. Xen exposes a different
node (under /reserved-memory with compatible "xen,shared-memory-v1") to
let domain know about the shared region,
- xen,evtchn-v1: node with this compatible informs Xen about static
event channel configuration for a domain. Xen does not expose
information about static event channels to domUs and dom0 case was
overlooked (by default nodes from host dt are copied to dom0 fdt unless
explicitly marked to be skipped), since the author's idea was not to
expose it (refer docs/misc/arm/device-tree/booting.txt, "Static Event
Channel"). Even if we wanted to expose the static event channel
information, the current node is in the wrong format (i.e. contains
phandle to domU node not visible by dom0). Lastly, this feature is
marked as tech-preview and there is no Linux dt binding in place.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Julien Grall <jgrall@amazon.com>
Implement atomic.h for PPC, based off of the original Xen 3.2
implementation. This implementation depends on some functions that are
not yet defined (notably __cmpxchg), so it can't yet be used.
Signed-off-by: Shawn Anastasio <sanastasio@raptorengineering.com> Acked-by: Jan Beulich <jbeulich@suse.com>
x86/efi: address violations of MISRA C:2012 Rule 7.2
The xen sources contain violations of MISRA C:2012 Rule 7.2 whose
headline states:
"A 'u' or 'U' suffix shall be applied to all integer constants
that are represented in an unsigned type".
Add the 'U' suffix to integer literals with unsigned type.
x86/mcheck: address violations of MISRA C:2012 Rule 7.2
The xen sources contain violations of MISRA C:2012 Rule 7.2 whose
headline states:
"A 'u' or 'U' suffix shall be applied to all integer constants
that are represented in an unsigned type".
Add the 'U' suffix to integer literals with unsigned type.
For the sake of uniformity, the following change is made:
- add the 'U' suffix to every macro's first argument
xen/lib: address violations of MISRA C:2012 Rule 7.2
The xen sources contain violations of MISRA C:2012 Rule 7.2 whose
headline states:
"A 'u' or 'U' suffix shall be applied to all integer constants
that are represented in an unsigned type".
Add the 'U' suffix to integer literals with unsigned type.
For the sake of uniformity, the following change is made:
- add the 'U' suffix to switch cases in 'cpuid.c'
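What Rule 7.2 flags can be demonstrated concretely. A constant like 0x80000000 does not fit in a 32-bit `int`, so it silently takes the type `unsigned int` despite lacking a suffix; writing 0x80000000U states that explicitly. The IS_UINT helper below is a checking device for this example only, not part of the change:

```c
#include <assert.h>

/* C11 _Generic lets us inspect the type a literal actually has.
 * On the ILP32/LP64 targets Xen supports, 'int' is 32 bits wide. */
#define IS_UINT(x) _Generic((x), unsigned int: 1, default: 0)
```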
Use pdev->sbdf instead of the PCI_SBDF macro in calls to pci_* functions
where appropriate. Move NULL check earlier.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Convert pci_find_*cap* functions and call sites to pci_sbdf_t, and remove some
now unused local variables. Also change to more appropriate types on lines that
are already being modified as a result of the pci_sbdf_t conversion.
Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/hvm: address violations of MISRA C:2012 Rule 7.3
The xen sources contain violations of MISRA C:2012 Rule 7.3 whose headline
states:
"The lowercase character 'l' shall not be used in a literal suffix".
Use the "L" suffix instead of the "l" suffix, to avoid potential ambiguity.
If the "u" suffix is used near "L", use the "U" suffix instead, for consistency.
xen/ioreq: address violations of MISRA C:2012 Rule 7.3
The xen sources contain violations of MISRA C:2012 Rule 7.3 whose headline
states:
"The lowercase character 'l' shall not be used in a literal suffix".
Use the "L" suffix instead of the "l" suffix, to avoid potential ambiguity.
If the "u" suffix is used near "L", use the "U" suffix instead, for consistency.
Michal Orzel [Thu, 24 Aug 2023 09:06:40 +0000 (11:06 +0200)]
xen/arm: Handle empty grant table region in find_unallocated_memory()
When creating dom0 with grant table support disabled in Xen and no IOMMU,
the following assert is triggered (debug build):
"Assertion 's <= e' failed at common/rangeset.c:189"
This is because find_unallocated_memory() (used to find memory holes
for extended regions) calls rangeset_remove_range() for an empty
grant table region. Fix it by checking if the size of region is not 0.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
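The failure mode and the guard can be modelled in a few lines. The names below are illustrative stand-ins, not Xen's actual find_unallocated_memory() code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of rangeset_remove_range(): it requires s <= e, which is
 * exactly the assertion that fired ("Assertion 's <= e' failed"). */
static bool remove_range(uint64_t s, uint64_t e)
{
    assert(s <= e);
    return true;
}

/* The fix: skip empty regions before computing e = s + size - 1, which
 * would otherwise wrap below s for size == 0. */
static bool remove_region(uint64_t start, uint64_t size)
{
    if ( size == 0 )
        return false;   /* nothing to remove */
    return remove_range(start, start + size - 1);
}
```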
x86/viridian: address violations of MISRA C:2012 Rule 7.2
The xen sources contain violations of MISRA C:2012 Rule 7.2 whose
headline states:
"A 'u' or 'U' suffix shall be applied to all integer constants
that are represented in an unsigned type".
Add the 'U' suffix to integer literals with unsigned type and also to other
literals used in the same contexts or near violations, when their positive
nature is immediately clear. The latter changes are done for the sake of
uniformity.
x86/viridian: address violations of MISRA C:2012 Rule 7.3
The xen sources contain violations of MISRA C:2012 Rule 7.3 whose headline
states:
"The lowercase character 'l' shall not be used in a literal suffix".
Use the "L" suffix instead of the "l" suffix, to avoid potential ambiguity.
If the "u" suffix is used near "L", use the "U" suffix instead, for consistency.
Andrew Cooper [Thu, 31 May 2018 15:16:37 +0000 (16:16 +0100)]
x86: Fix calculation of %dr6/dr7 reserved bits
RTM debugging and BusLock Detect have both introduced conditional behaviour
into the %dr6/7 calculations which Xen's existing logic doesn't account for.
Introduce the CPUID bit for BusLock Detect, so we can get the %dr6 behaviour
correct from the outset.
Implement x86_adj_dr{6,7}_rsvd() fully, and use them in place of the plain
bitmasks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jinoh Kang <jinoh.kang.kr@gmail.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
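The conditional behaviour described above can be sketched for %dr6. The reserved bits have fixed read values (%dr6's reset value is 0xffff0ff0), but RTM turns bit 16 and BusLock Detect turns bit 11 into status bits when the features exist. The masks and function shape here are a simplified illustration, not the real x86_adj_dr6_rsvd(), which also consults a CPU policy:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Force %dr6 reserved bits to their architectural read values, with the
 * RTM and BLD bits carved out of the reserved-1 mask when the features
 * make them meaningful status bits. */
static uint32_t adj_dr6_rsvd(uint32_t dr6, bool has_rtm, bool has_bld)
{
    uint32_t res1 = 0xffff0ff0u;   /* bits that read as 1 */
    uint32_t res0 = 0x00001000u;   /* bit 12 reads as 0 */

    if ( has_rtm )
        res1 &= ~(1u << 16);       /* bit 16 becomes RTM status */
    if ( has_bld )
        res1 &= ~(1u << 11);       /* bit 11 becomes BLD status */

    return (dr6 | res1) & ~res0;
}
```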
Andrew Cooper [Tue, 29 Aug 2023 11:01:49 +0000 (12:01 +0100)]
x86: Introduce new debug.c for debug register infrastructure
Broken out of the subsequent patch for clarity.
Add stub x86_adj_dr{6,7}_rsvd() functions which will be extended in the
following patch to fix bugs, and adjust debugreg.h to compile with a more
minimal set of includes.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Tue, 29 Aug 2023 10:16:11 +0000 (11:16 +0100)]
x86: Reject bad %dr6/%dr7 values when loading guest state
Right now, bad PV state is silently dropped and zeroed, while bad HVM state is
passed directly to hardware and can trigger VMEntry/VMRUN failures.
Furthermore, prior to c/s 30f43f4aa81e ("x86: Reorganise and rename debug
register fields in struct vcpu") in Xen 4.11 where v->arch.dr6 was reduced in
width, the toolstack can cause a host crash by loading a bad %dr6 value on
VT-x hardware.
Reject any %dr6/7 values with upper bits set. For PV guests, also audit
%dr0..3 using the same logic as in set_debugreg() so they aren't silently
zeroed later in the function. Leave a comment behind explaining how %dr4/5
handling changed, and why they're ignored now.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 11 Sep 2023 15:30:34 +0000 (17:30 +0200)]
include: make domain_page.h's stubs properly use type-unsafe MFN <-> virt helpers
The first of the commits referenced below didn't go far enough, and the
2nd of them, while trying to close (some of) the gap, wrongly kept using
the potentially type-safe variant. This is getting in the way of new
ports preferably not having any type-unsafe private code (and in
particular not having a need for any overrides in newly introduced
files).
Fixes: 41c48004d1d8 ("xen/mm: Use __virt_to_mfn in map_domain_page instead of virt_to_mfn") Fixes: f46b6197344f ("xen: Convert page_to_mfn and mfn_to_page to use typesafe MFN") Reported-by: Shawn Anastasio <sanastasio@raptorengineering.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
coverage: simplify the logic of choosing the number of gcov counters depending on the gcc version
The current structure of choosing the correct file based on the
compiler version makes us make 33 line files just to define a
constant. The changes after gcc 4.7 are minimal, just the number of
counters.
Fold the changes in gcc_4_9.c, gcc_5.c and gcc_7.c into gcc_4_7.c to
remove a lot of the boilerplate and keep the logic of choosing the
GCOV_COUNTER in gcc_4_7.c.
Signed-off-by: Javi Merino <javi.merino@cloud.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
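The folded version selection has roughly the following shape. The per-version counter counts below are illustrative, not the exact values from the patch:

```c
#include <assert.h>

/* One file picks GCOV_COUNTERS from the compiler version, replacing the
 * 33-line gcc_4_9.c / gcc_5.c / gcc_7.c stubs that each defined a
 * single constant. */
#define GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100)

#if GCC_VERSION >= 70000
# define GCOV_COUNTERS 9
#elif GCC_VERSION >= 50000
# define GCOV_COUNTERS 10
#elif GCC_VERSION >= 40900
# define GCOV_COUNTERS 9
#else
# define GCOV_COUNTERS 8
#endif
```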