Jan Beulich [Tue, 5 Jan 2021 12:11:04 +0000 (13:11 +0100)]
x86/ACPI: don't invalidate S5 data when S3 wakeup vector cannot be determined
We can be more tolerant as long as the data collected from FACS is only
needed to enter S3. A prior change already added suitable checking to
acpi_enter_sleep().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Jan Beulich [Tue, 5 Jan 2021 12:09:55 +0000 (13:09 +0100)]
x86/ACPI: fix S3 wakeup vector mapping
Use of __acpi_map_table() here was at least close to an abuse already
before, but it will now consistently return NULL here. Drop the layering
violation and use set_fixmap() directly. Re-use of the ACPI fixmap area
is hopefully going to remain "fine" for the time being.
Add checks to acpi_enter_sleep(): The vector now needs to be contained
within a single page, but the ACPI spec requires 64-byte alignment of
FACS anyway. Also bail if no wakeup vector was determined in the first
place, in part as preparation for a subsequent relaxation change.
Fixes: 1c4aa69ca1e1 ("xen/acpi: Rework acpi_os_map_memory() and acpi_os_unmap_memory()") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Bertrand Marquis [Thu, 17 Dec 2020 15:38:08 +0000 (15:38 +0000)]
xen/arm: Activate TID3 in HCR_EL2
Activate TID3 bit in HCR register when starting a guest.
This will trap all coprecessor ID registers so that we can give to guest
values corresponding to what they can actually use and mask some
features to guests even though they would be supported by the underlying
hardware (like SVE or MPAM).
Bertrand Marquis [Thu, 17 Dec 2020 15:38:07 +0000 (15:38 +0000)]
xen/arm: Add CP10 exception support to handle MVFR
Add support for cp10 exceptions decoding to be able to emulate the
values for MVFR0, MVFR1 and MVFR2 when TID3 bit of HSR is activated.
This is required for aarch32 guests accessing MVFR registers using
vmrs and vmsr instructions.
Bertrand Marquis [Thu, 17 Dec 2020 15:38:06 +0000 (15:38 +0000)]
xen/arm: Add handler for cp15 ID registers
Add support for emulation of cp15 based ID registers (on arm32 or when
running a 32bit guest on arm64).
The handlers are returning the values stored in the guest_cpuinfo
structure for known registers and RAZ for all reserved registers.
In the current status the MVFR registers are no supported.
Bertrand Marquis [Thu, 17 Dec 2020 15:38:05 +0000 (15:38 +0000)]
xen/arm: Add handler for ID registers on arm64
Add vsysreg emulation for registers trapped when TID3 bit is activated
in HSR.
The emulation is returning the value stored in cpuinfo_guest structure
for know registers and is handling reserved registers as RAZ.
Bertrand Marquis [Thu, 17 Dec 2020 15:38:04 +0000 (15:38 +0000)]
xen/arm: create a cpuinfo structure for guest
Create a cpuinfo structure for guest and mask into it the features that
we do not support in Xen or that we do not want to publish to guests.
Modify some values in the cpuinfo structure for guests to mask some
features which we do not want to allow to guests (like AMU) or we do not
support (like SVE).
Modify some values in the guest cpuinfo structure to guests to hide some
processor features:
- SVE as this is not supported by Xen and guest are not allowed to use
this features (ZEN is set to 0 in CPTR_EL2).
- AMU as HCPTR_TAM is set in CPTR_EL2 so AMU cannot be used by guests
All other bits are left untouched.
- RAS as this is not supported by Xen.
The code is trying to group together registers modifications for the
same feature to be able in the long term to easily enable/disable a
feature depending on user parameters or add other registers modification
in the same place (like enabling/disabling HCR bits).
Bertrand Marquis [Thu, 17 Dec 2020 15:38:03 +0000 (15:38 +0000)]
xen/arm: Add arm64 ID registers definitions
Add coprocessor registers definitions for all ID registers trapped
through the TID3 bit of HSR.
Those are the one that will be emulated in Xen to only publish to guests
the features that are supported by Xen and that are accessible to
guests.
Bertrand Marquis [Thu, 17 Dec 2020 15:38:02 +0000 (15:38 +0000)]
xen/arm: Add ID registers and complete cpuinfo
Add definition and entries in cpuinfo for ID registers introduced in
newer Arm Architecture reference manual:
- ID_PFR2: processor feature register 2
- ID_DFR1: debug feature register 1
- ID_MMFR4 and ID_MMFR5: Memory model feature registers 4 and 5
- ID_ISA6: ISA Feature register 6
Add more bitfield definitions in PFR fields of cpuinfo.
Add MVFR2 register definition for aarch32.
Add MVFRx_EL1 defines for aarch32.
Add mvfr values in cpuinfo.
Add some registers definition for arm64 in sysregs as some are not
always know by compilers.
Initialize the new values added in cpuinfo in identify_cpu during init.
Bertrand Marquis [Thu, 17 Dec 2020 15:38:01 +0000 (15:38 +0000)]
xen/arm: Use READ_SYSREG instead of 32/64 versions
Modify identify_cpu function to use READ_SYSREG instead of READ_SYSREG32
or READ_SYSREG64.
All aarch32 specific registers (for example ID_PFR0_EL1) are 64bit when
accessed from aarch64 with upper bits read as 0, so it is right to
access them as 64bit registers on a 64bit platform.
Andrew Cooper [Thu, 31 Dec 2020 16:55:20 +0000 (16:55 +0000)]
x86/p2m: Fix paging_gva_to_gfn() for nested virt
nestedhap_walk_L1_p2m() takes guest physical addresses, not frame numbers.
This means the l2 input is off-by-PAGE_SHIFT, as is the l1 value eventually
returned to the caller.
Delete the misleading comment as well.
Fixes: bab2bd8e222de ("xen/nested_p2m: Don't walk EPT tables with a regular PT walker") Reported-by: Tamas K Lengyel <tamas@tklengyel.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Tamas K Lengyel <tamas@tklengyel.com>
Roger Pau Monné [Mon, 4 Jan 2021 09:03:23 +0000 (10:03 +0100)]
x86/p2m: fix p2m_add_foreign error path
One of the error paths in p2m_add_foreign could call put_page with a
NULL page, thus triggering a fault.
Split the checks into two different if statements, so the appropriate
error path can be taken.
Fixes: 173ae325026bd ('x86/p2m: tidy p2m_add_foreign() a little') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Roger Pau Monne [Wed, 30 Dec 2020 17:34:46 +0000 (18:34 +0100)]
xen: remove the usage of the P ar option
It's not part of the POSIX standard [0] and as such non GNU ar
implementations don't usually have it.
It's not relevant for the use case here anyway, as the archive file is
recreated every time due to the rm invocation before the ar call. No
file name matching should happen so matching using the full path name
or a relative one should yield the same result.
This fixes the build on FreeBSD.
While there also drop the s option, as ar will already generate a
symbol table by default when creating the archive.
Andrew Cooper [Tue, 29 Dec 2020 17:51:23 +0000 (17:51 +0000)]
x86/hpet: Fix return value of hpet_setup()
hpet_setup() is idempotent if the rate has already been calculated, and
returns the cached value. However, this only works correctly when the return
statements are identical.
Use a sensibly named local variable, rather than a dead one with a bad name.
Fixes: a60bb68219 ("x86/time: reduce rounding errors in calculations") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Andrew Cooper [Mon, 28 Sep 2020 17:14:53 +0000 (18:14 +0100)]
xen/domain: Introduce domain_teardown()
There is no common equivelent of domain_reliquish_resources(), which has
caused various pieces of common cleanup to live in inappropriate
places.
Perhaps most obviously, evtchn_destroy() is called for every continuation of
domain_reliquish_resources(), which can easily be thousands of times.
Create domain_teardown() to be a new top level facility, and call it from the
appropriate positions in domain_kill() and domain_create()'s error path. The
intention is for this to supersede domain_reliquish_resources() in due course.
No change in behaviour yet.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 22 Dec 2020 11:01:12 +0000 (12:01 +0100)]
x86/mm: p2m_add_foreign() is HVM-only
This is the case also for xenmem_add_to_physmap_one(), as is it's only
caller of the function. Move the latter next to p2m_add_foreign(),
allowing it one to become static at the same time. While moving, adjust
indentation of the body of the main switch().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 22 Dec 2020 08:00:03 +0000 (09:00 +0100)]
x86/Intel: insert Tiger Lake model numbers
Both match prior generation processors as far as LBR and C-state MSRs
go (SDM rev 073). The if_pschange_mc erratum, according to the spec
update, is not applicable.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
x86/EFI: don't insert timestamp when SOURCE_DATE_EPOCH is defined
By default a timestamp gets added to the xen efi binary. Unfortunately
ld doesn't seem to provide a way to set a custom date, like from
SOURCE_DATE_EPOCH, so set a zero value for the timestamp (option
--no-insert-timestamp) if SOURCE_DATE_EPOCH is defined. This makes
reproducible builds possible.
This is an alternative to the patch suggested in [1]. This patch only
omits the timestamp when SOURCE_DATE_EPOCH is defined.
Jan Beulich [Tue, 22 Dec 2020 07:57:19 +0000 (08:57 +0100)]
x86: verify function type (and maybe attribute) in switch_stack_and_jump()
It is imperative that the functions passed here are taking no arguments,
return no values, and don't return in the first place. While the type
can be checked uniformly, the attribute check is limited to gcc 9 and
newer (no clang support for this so far afaict).
Note that I didn't want to have the "true" fallback "implementation" of
__builtin_has_attribute(..., __noreturn__) generally available, as
"true" may not be a suitable fallback in other cases.
Note further that the noreturn addition to startup_cpu_idle_loop()'s
declaration requires adding unreachable() to Arm's
switch_stack_and_jump(), or else the build would break. I suppose this
should have been there already.
For vmx_asm_do_vmentry() along with adding the attribute, also restrict
its scope.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Julien Grall <jgrall@amazon.com>
Julien Grall [Fri, 18 Dec 2020 13:30:54 +0000 (13:30 +0000)]
xen: Rework WARN_ON() to return whether a warning was triggered
So far, our implementation of WARN_ON() cannot be used in the following
situation:
if ( WARN_ON() )
...
This is because WARN_ON() doesn't return whether a warning has been
triggered. Such construciton can be handy if you want to print more
information and also dump the stack trace.
Therefore, rework the WARN_ON() implementation to return whether a
warning was triggered. The idea was borrowed from Linux
Andrew Cooper [Mon, 21 Dec 2020 14:52:26 +0000 (14:52 +0000)]
x86/shadow: Fix build with !CONFIG_SHADOW_PAGING
Implement a stub for shadow_vcpu_teardown()
Fixes: d162f36848c4 ("xen/x86: Fix memory leak in vcpu_create() error path") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Mon, 28 Sep 2020 14:25:44 +0000 (15:25 +0100)]
xen/x86: Fix memory leak in vcpu_create() error path
Various paths in vcpu_create() end up calling paging_update_paging_modes(),
which eventually allocate a monitor pagetable if one doesn't exist.
However, an error in vcpu_create() results in the vcpu being cleaned up
locally, and not put onto the domain's vcpu list. Therefore, the monitor
table is not freed by {hap,shadow}_teardown()'s loop. This is caught by
assertions later that we've successfully freed the entire hap/shadow memory
pool.
The per-vcpu loops in domain teardown logic is conceptually wrong, but exist
due to insufficient existing structure in the existing logic.
Break paging_vcpu_teardown() out of paging_teardown(), with mirrored breakouts
in the hap/shadow code, and use it from arch_vcpu_create()'s error path. This
fixes the memory leak.
The new {hap,shadow}_vcpu_teardown() must be idempotent, and are written to be
as tolerable as possible, with the minimum number of safety checks possible.
In particular, drop the mfn_valid() check - if these fields are junk, then Xen
is going to explode anyway.
Reported-by: Michał Leszczyński <michal.leszczynski@cert.pl> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Fri, 18 Dec 2020 12:29:14 +0000 (13:29 +0100)]
x86/mm: p2m_add_foreign() is HVM-only
This is the case also for xenmem_add_to_physmap_one(), as is it's only
caller of the function. Move the latter next to p2m_add_foreign(),
allowing it one to become static at the same time. While moving, adjust
indentation of the body of the main switch().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 18 Dec 2020 12:28:30 +0000 (13:28 +0100)]
x86/p2m: tidy p2m_add_foreign() a little
Drop a bogus ASSERT() - we don't typically assert incoming domain
pointers to be non-NULL, and there's no particular reason to do so here.
Replace the open-coded DOMID_SELF check by use of
rcu_lock_remote_domain_by_id(), at the same time covering the request
being made with the current domain's actual ID.
Move the "both domains same" check into just the path where it really
is meaningful.
Swap the order of the two puts, such that
- the p2m lock isn't needlessly held across put_page(),
- a separate put_page() on an error path can be avoided,
- they're inverse to the order of the respective gets.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 18 Dec 2020 12:23:42 +0000 (13:23 +0100)]
lib: move bsearch code
Convert this code to an inline function (backed by an instance in an
archive in case the compiler decides against inlining), which results
in not having it in x86 final binaries. This saves a little bit of dead
code.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
Jan Beulich [Fri, 18 Dec 2020 12:20:42 +0000 (13:20 +0100)]
lib: move list sorting code
Build the source file always, as by putting it into an archive it still
won't be linked into final binaries when not needed. This way possible
build breakage will be easier to notice, and it's more consistent with
us unconditionally building other library kind of code (e.g. sort() or
bsearch()).
While moving the source file, take the opportunity and drop the
pointless EXPORT_SYMBOL() and an unnecessary #include.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Fri, 18 Dec 2020 12:17:57 +0000 (13:17 +0100)]
lib: collect library files in an archive
In order to (subsequently) drop odd things like CONFIG_NEEDS_LIST_SORT
just to avoid bloating binaries when only some arch-es and/or
configurations need generic library routines, combine objects under lib/
into an archive, which the linker then can pick the necessary objects
out of.
Note that we can't use thin archives just yet, until we've raised the
minimum required binutils version suitably.
Note further that --start-group / --end-group get put in place right
away to allow for symbol resolution across all archives, once we gain
multuiple ones.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Julien Grall <jgrall@amazon.com>
Create a test job that starts Xen and Dom0 on QEMU based on the alpine
linux rootfs. Use the Linux kernel and rootfs from the tests-artifacts
containers. Add the Xen tools binaries from the Alpine Linux build job.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Acked-by: Wei Liu <wl@xen.org>
automation: make available the tests artifacts to the pipeline
In order to make available the pre-built binaries of the
automation/tests-artifacts containers to the gitlab-ci pipeline we need
to export them as gitlab artifacts.
To do that, we create two "fake" jobs that simply export the require
binaries as artifacts and do nothing else.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Acked-by: Wei Liu <wl@xen.org>
Some tests (soon to come) will require pre-built binaries to run, such
as the Linux kernel binary. We don't want to rebuild the Linux kernel
for each gitlab-ci run: these builds should not be added to the current
list of build jobs.
Instead, create additional containers that today are built and uploaded
manually, but could be re-built automatically. The containers build the
required binarires during the "docker build" step and store them inside
the container itself.
gitlab-ci will be able to fetch these pre-built binaries during the
regular test runs, saving cycles.
Add two tests artifacts containers:
- one to build the Linux kernel ARM64
- one to create an Alpine Linux ARM64 rootfs for Dom0
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Acked-by: Wei Liu <wl@xen.org>
automation: add dom0less to the QEMU aarch64 smoke test
Add a trivial dom0less test:
- fetch the Debian arm64 kernel and use it ad dom0/U kernel
- use busybox-static to create a trivial dom0/U ramdisk
- use ImageBuilder to generate the uboot boot script automatically
- install and use u-boot from the Debian package to start the test
- binaries are loaded from uboot via tftp
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Acked-by: Wei Liu <wl@xen.org>
Juergen Gross [Thu, 17 Dec 2020 15:50:21 +0000 (16:50 +0100)]
xen/hypfs: add new enter() and exit() per node callbacks
In order to better support resource allocation and locking for dynamic
hypfs nodes add enter() and exit() callbacks to struct hypfs_funcs.
The enter() callback is called when entering a node during hypfs user
actions (traversing, reading or writing it), while the exit() callback
is called when leaving a node (accessing another node at the same or a
higher directory level, or when returning to the user).
For avoiding recursion this requires a parent pointer in each node.
Let the enter() callback return the entry address which is stored as
the last accessed node in order to be able to use a template entry for
that purpose in case of dynamic entries.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Thu, 17 Dec 2020 15:49:49 +0000 (16:49 +0100)]
xen/hypfs: switch write function handles to const
The node specific write functions take a void user address handle as
parameter. As a write won't change the user memory use a const_void
handle instead.
This requires a new macro for casting a guest handle to a const type.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Luca Fancellu [Thu, 10 Dec 2020 10:42:58 +0000 (10:42 +0000)]
xen/arm: Add workaround for Cortex-A53 erratum #843419
On the Cortex A53, when executing in AArch64 state, a load or store instruction
which uses the result of an ADRP instruction as a base register, or which uses
a base register written by an instruction immediately after an ADRP to the
same register, might access an incorrect address.
The workaround is to enable the linker flag --fix-cortex-a53-843419
if present, to check and fix the affected sequence. Otherwise print a warning
that Xen may be susceptible to this errata
Wei Liu [Wed, 16 Dec 2020 17:48:04 +0000 (17:48 +0000)]
Revert patches that break libxl API
This patch reverts eight patches from staging.
The offending patch is the one that introduced libxl_pci_bdf (last one
in the list). The rest depend on that patch so they are also reverted.
8bf0fab14256 "libxl / libxlu: support 'xl pci-attach/detach' by name" e1141654c374 "docs/man: modify xl-pci-configuration(5) to add 'name' field to PCI_SPEC_STRING" 93c16ae47baf "xl: support naming of assignable devices" 5ab684cb3e4d "libxl: introduce libxl_pci_bdf_assignable_add/remove/list/list_free(), ..." 66c2fbc6e82b "libxl: convert internal functions in libxl_pci.c..." f73c5dd56d78 "docs/man: modify xl(1) in preparation for naming of assignable devices" 96ed6ff29741 "libxlu: introduce xlu_pci_parse_spec_string()" 929f23114061 "libxl: introduce 'libxl_pci_bdf' in the idl..."
Jan Beulich [Wed, 16 Dec 2020 15:42:50 +0000 (16:42 +0100)]
x86/PV: avoid double stack reset during schedule tail handling
Invoking check_wakeup_from_wait() from assembly allows the new
continue_pv_domain() to replace the prior continue_nonidle_domain() as
the tail hook, eliminating an extra reset_stack_and_jump().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:33 +0000 (19:30 +0000)]
libxl / libxlu: support 'xl pci-attach/detach' by name
This patch adds a 'name' field into the idl for 'libxl_device_pci' and
libxlu_pci_parse_spec_string() is modified to parse the new 'name'
parameter of PCI_SPEC_STRING detailed in the updated documention in
xl-pci-configuration(5).
If the 'name' field is non-NULL then both libxl_device_pci_add() and
libxl_device_pci_remove() will use it to look up the device BDF in
the list of assignable devices.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:32 +0000 (19:30 +0000)]
docs/man: modify xl-pci-configuration(5) to add 'name' field to PCI_SPEC_STRING
Since assignable devices can be named, a subsequent patch will support use
of a PCI_SPEC_STRING containing a 'name' parameter instead of a 'bdf'. In
this case the name will be used to look up the 'bdf' in the list of assignable
(or assigned) devices.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:31 +0000 (19:30 +0000)]
xl: support naming of assignable devices
This patch converts libxl to use libxl_pci_bdf_assignable_add/remove/list/
list_free() rather than libxl_device_pci_assignable_add/remove/list/
list_free(), which then allows naming of assignable devices to be supported.
With this patch applied 'xl pci-assignable-add' will take an optional '--name'
parameter, 'xl pci-assignable-remove' can be passed either a BDF or a name and
'xl pci-assignable-list' will take a optional '--show-names' flag which
determines whether names are displayed in its output.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
which support naming and use 'libxl_pci_bdf' rather than 'libxl_device_pci',
as replacements for libxl_device_pci_assignable_add/remove/list/list_free().
libxl_pci_bdf_assignable_add() takes a 'name' parameter which is stored in
xenstore and facilitates two addtional functions added by this patch:
libxl_pci_bdf_assignable_name2bdf() and libxl_pci_bdf_assignable_bdf2name().
Currently there are no callers of these two functions. They will be added in
a subsequent patch.
libxl_device_pci_assignable_add/remove/list/list_free() are left in place
for compatibility but are re-implemented in terms of the newly introduced
functions.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:28 +0000 (19:30 +0000)]
docs/man: modify xl(1) in preparation for naming of assignable devices
A subsequent patch will introduce code to allow a name to be specified to
'xl pci-assignable-add' such that the assignable device may be referred to
by than name in subsequent operations.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:27 +0000 (19:30 +0000)]
libxlu: introduce xlu_pci_parse_spec_string()
This patch largely re-writes the code to parse a PCI_SPEC_STRING and enters
it via the newly introduced function. The new parser also deals with 'bdf'
and 'vslot' as non-positional paramaters, as per the documentation in
xl-pci-configuration(5).
The existing xlu_pci_parse_bdf() function remains, but now strictly parses
BDF values. Some existing callers of xlu_pci_parse_bdf() are
modified to call xlu_pci_parse_spec_string() as per the documentation in xl(1).
NOTE: Usage text in xl_cmdtable.c and error messages are also modified
appropriately.
As a side-effect this patch also fixes a bug where using '*' to specify
all functions would lead to an assertion failure at the end of
xlu_pci_parse_bdf().
Fixes: d25cc3ec93eb ("libxl: workaround gcc 10.2 maybe-uninitialized warning") Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:26 +0000 (19:30 +0000)]
libxl: introduce 'libxl_pci_bdf' in the idl...
... and use in 'libxl_device_pci'
This patch is preparatory work for restricting the type passed to functions
that only require BDF information, rather than passing a 'libxl_device_pci'
structure which is only partially filled. In this patch only the minimal
mechanical changes necessary to deal with the structural changes are made.
Subsequent patches will adjust the code to make better use of the new type.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Paul Durrant [Tue, 8 Dec 2020 19:30:25 +0000 (19:30 +0000)]
docs/man: fix xl(1) documentation for 'pci' operations
Currently the documentation completely fails to mention the existence of
PCI_SPEC_STRING. This patch tidies things up, specifically clarifying that
'pci-assignable-add/remove' take <BDF> arguments where as 'pci-attach/detach'
take <PCI_SPEC_STRING> arguments (which will be enforced in a subsequent
patch).
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:24 +0000 (19:30 +0000)]
docs/man: improve documentation of PCI_SPEC_STRING...
... and prepare for adding support for non-positional parsing of 'bdf' and
'vslot' in a subsequent patch.
Also document 'BDF' as a first-class parameter type and fix the documentation
to state that the default value of 'rdm_policy' is actually 'strict', not
'relaxed', as can be seen in libxl__device_pci_setdefault().
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:22 +0000 (19:30 +0000)]
libxl: use COMPARE_PCI() macro is_pci_in_array()...
... rather than an open-coded equivalent.
This patch tidies up the is_pci_in_array() function, making it take a single
'libxl_device_pci' argument rather than separate domain, bus, device and
function arguments. The already-available COMPARE_PCI() macro can then be
used and it is also modified to return 'bool' rather than 'int'.
The patch also modifies libxl_pci_assignable() to use is_pci_in_array() rather
than a separate open-coded equivalent, and also modifies it to return a
'bool' rather than an 'int'.
NOTE: The COMPARE_PCI() macro is also fixed to include the 'domain' in its
comparison, which should always have been the case.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
... to be used by callers of libxl_device_pci_assignable_list().
Currently there is no API for callers of libxl_device_pci_assignable_list()
to free the list. The xl function pciassignable_list() calls
libxl_device_pci_dispose() on each element of the returned list, but
libxl_pci_assignable() in libxl_pci.c does not. Neither does the implementation
of libxl_device_pci_assignable_list() call libxl_device_pci_init().
This patch adds the new API function, makes sure it is used everywhere and
also modifies libxl_device_pci_assignable_list() to initialize list
entries rather than just zeroing them.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:20 +0000 (19:30 +0000)]
libxl: make sure callers of libxl_device_pci_list() free the list after use
A previous patch introduced libxl_device_pci_list_free() which should be used
by callers of libxl_device_pci_list() to properly dispose of the exported
'libxl_device_pci' types and the free the memory holding them. Whilst all
current callers do ensure the memory is freed, only the code in xl's
pcilist() function actually calls libxl_device_pci_dispose(). As it stands
this laxity does not lead to any memory leaks, but the simple addition of
.e.g. a 'string' into the idl definition of 'libxl_device_pci' would lead
to leaks.
This patch makes sure all callers of libxl_device_pci_list() can call
libxl_device_pci_list_free() by keeping copies of 'libxl_device_pci'
structures inline in 'pci_add_state' and 'pci_remove_state' (and also making
sure these are properly disposed at the end of the operations) rather
than keeping pointers to the structures returned by libxl_device_pci_list().
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:19 +0000 (19:30 +0000)]
libxl: remove get_all_assigned_devices() from libxl_pci.c
Use of this function is a very inefficient way to check whether a device
has already been assigned.
This patch adds code that saves the domain id in xenstore at the point of
assignment, and removes it again when the device id de-assigned (or the
domain is destroyed). It is then straightforward to check whether a device
has been assigned by checking whether a device has a saved domain id.
NOTE: To facilitate the xenstore check it is necessary to move the
pci_info_xs_read() earlier in libxl_pci.c. To keep related functions
together, the rest of the pci_info_xs_XXX() functions are moved too.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:18 +0000 (19:30 +0000)]
libxl: remove unnecessary check from libxl__device_pci_add()
The code currently checks explicitly whether the device is already assigned,
but this is actually unnecessary as assigned devices do not form part of
the list returned by libxl_device_pci_assignable_list() and hence the
libxl_pci_assignable() test would have already failed.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:17 +0000 (19:30 +0000)]
libxl: generalise 'driver_path' xenstore access functions in libxl_pci.c
For the purposes of re-binding a device to its previous driver
libxl__device_pci_assignable_add() writes the driver path into xenstore.
This path is then read back in libxl__device_pci_assignable_remove().
The functions that support this writing to and reading from xenstore are
currently dedicated for this purpose and hence the node name 'driver_path'
is hard-coded. This patch generalizes these utility functions and passes
'driver_path' as an argument. Subsequent patches will invoke them to
access other nodes.
NOTE: Because functions will have a broader use (other than storing a
driver path in lieu of pciback) the base xenstore path is also
changed from '/libxl/pciback' to '/libxl/pci'.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:16 +0000 (19:30 +0000)]
libxl: stop using aodev->device_config in libxl__device_pci_add()...
... to hold a pointer to the device.
There is already a 'pci' field in 'pci_add_state' so simply use that from
the start. This also allows the 'pci' (#3) argument to be dropped from
do_pci_add().
NOTE: This patch also changes the type of the 'pci_domid' field in
'pci_add_state' from 'int' to 'libxl_domid' which is more appropriate
given what the field is used for.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:12 +0000 (19:30 +0000)]
libxl: Make sure devices added by pci-attach are reflected in the config
Currently libxl__device_pci_add_xenstore() is broken in that does not
update the domain's configuration for the first device added (which causes
creation of the overall backend area in xenstore). This can be easily observed
by running 'xl list -l' after adding a single device: the device will be
missing.
This patch fixes the problem and adds a DEBUG log line to allow easy
verification that the domain configuration is being modified. Also, the use
of libxl__device_generic_add() is dropped as it leads to a confusing situation
where only partial backend information is written under the xenstore
'/libxl' path. For LIBXL__DEVICE_KIND_PCI devices the only definitive
information in xenstore is under '/local/domain/0/backend' (the '0' being
hard-coded).
NOTE: This patch includes a whitespace in add_pcis_done().
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:11 +0000 (19:30 +0000)]
libxl: make libxl__device_list() work correctly for LIBXL__DEVICE_KIND_PCI...
... devices.
Currently there is an assumption built into libxl__device_list() that device
backends are fully enumarated under the '/libxl' path in xenstore. This is
not the case for PCI backend devices, which are only properly enumerated
under '/local/domain/0/backend'.
This patch adds a new get_path() method to libxl__device_type to allow a
backend implementation (such as PCI) to specify the xenstore path where
devices are enumerated and modifies libxl__device_list() to use this method
if it is available. Also, if the get_num() method is defined then the
from_xenstore() method expects to be passed the backend path without the device
number concatenated, so this issue is also rectified.
Having made libxl__device_list() work correctly, this patch removes the
open-coded libxl_pci_device_pci_list() in favour of an evaluation of the
LIBXL_DEFINE_DEVICE_LIST() macro. This has the side-effect of also defining
libxl_pci_device_pci_list_free() which will be used in subsequent patches.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:10 +0000 (19:30 +0000)]
xl: s/pcidev/pci where possible
To improve naming consistency, replaces occurrences of 'pcidev' with 'pci'.
The only remaining use of the term should be in relation to
'libxl_domain_config' where there are fields named 'pcidevs' and 'num_pcidevs'.
Purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:09 +0000 (19:30 +0000)]
libxl: s/pcidev/pci and remove DEFINE_DEVICE_TYPE_STRUCT_X
The seemingly arbitrary use of 'pci' and 'pcidev' in the code in libxl_pci.c
is confusing and also compromises use of some macros used for other device
types. Indeed it seems that DEFINE_DEVICE_TYPE_STRUCT_X exists solely because
of this duality.
This patch purges use of 'pcidev' from the libxl internal code, but
unfortunately the 'pcidevs' and 'num_pcidevs' fields in 'libxl_domain_config'
are part of the API and need to be retained to avoid breaking callers,
particularly libvirt.
DEFINE_DEVICE_TYPE_STRUCT_X is still removed to avoid the special case in
libxl_pci.c but DEFINE_DEVICE_TYPE_STRUCT is given an extra 'array' argument
which is used to identify the fields in 'libxl_domain_config' relating to
the device type.
NOTE: Some of the more gross formatting errors (such as lack of spaces after
keywords) that came into context have been fixed in libxl_pci.c.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Juergen Gross [Tue, 15 Dec 2020 15:04:11 +0000 (16:04 +0100)]
tools/xenstore: rework path length check
The different fixed limits for absolute and relative path lengths of
Xenstore nodes make it possible to create per-domain nodes via
absolute paths which are not accessible using relative paths, as the
two limits differ by 1024 characters.
Instead of this weird limits use only one limit, which applies to the
relative path length of per-domain nodes and to the absolute path
length of all other nodes. This means, the path length check is
applied to the path after removing a possible start of
"/local/domain/<n>/" with <n> being a domain id.
There has been the request to be able to limit the path lengths even
more, so an additional quota is added which can be applied to path
lengths. It is XENSTORE_REL_PATH_MAX (2048) per default, but can be
set to lower values. This is done via the new "-M" or "--path-max"
option when invoking xenstored.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
Jan Beulich [Tue, 15 Dec 2020 12:47:45 +0000 (13:47 +0100)]
x86/PV: guest_get_eff_kern_l1e() may still need to switch page tables
While indeed unnecessary for pv_ro_page_fault(), pv_map_ldt_shadow_page()
may run when guest user mode is active, and hence may need to switch to
the kernel page tables in order to retrieve an LDT page mapping.
Fixes: 9ff970564764 ("x86/mm: drop guest_get_eff_l1e()") Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
Jan Beulich [Tue, 15 Dec 2020 12:46:37 +0000 (13:46 +0100)]
evtchn/FIFO: re-order and synchronize (with) map_control_block()
For evtchn_fifo_set_pending()'s check of the control block having been
set to be effective, ordering of respective reads and writes needs to be
ensured: The control block pointer needs to be recorded strictly after
the setting of all the queue heads, and it needs checking strictly
before any uses of them (this latter aspect was already guaranteed).
Jan Beulich [Tue, 15 Dec 2020 12:42:51 +0000 (13:42 +0100)]
evtchn/FIFO: add 2nd smp_rmb() to evtchn_fifo_word_from_port()
Besides with add_page_to_event_array() the function also needs to
synchronize with evtchn_fifo_init_control() setting both d->evtchn_fifo
and (subsequently) d->evtchn_port_ops.
Roger Pau Monné [Tue, 15 Dec 2020 12:42:16 +0000 (13:42 +0100)]
x86/irq: fix infinite loop in irq_move_cleanup_interrupt
If Xen enters irq_move_cleanup_interrupt with a dynamic vector below
IRQ_MOVE_CLEANUP_VECTOR pending in IRR (0x20 or 0x21) that's also
designated for a cleanup it will enter a loop where
irq_move_cleanup_interrupt continuously sends a cleanup IPI (vector
0x22) to itself while waiting for the vector with lower priority to be
injected - which will never happen because IRQ_MOVE_CLEANUP_VECTOR
takes precedence and it's always injected first.
Fix this by making sure vectors below IRQ_MOVE_CLEANUP_VECTOR are
marked as used and thus not available for APs. Also add some logic to
assert and prevent irq_move_cleanup_interrupt from entering such an
infinite loop, albeit that should never happen given the current code.
This is XSA-356 / CVE-2020-29567.
Fixes: 3fba06ba9f8 ('x86/IRQ: re-use legacy vector ranges on APs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 15 Dec 2020 12:41:23 +0000 (13:41 +0100)]
x86: avoid calling {svm,vmx}_do_resume()
These functions follow the following path: hvm_do_resume() ->
handle_hvm_io_completion() -> hvm_wait_for_io() ->
wait_on_xen_event_channel() -> do_softirq() -> schedule() ->
sched_context_switch() -> continue_running() and hence may
recursively invoke themselves. If this ends up happening a couple of
times, a stack overflow would result.
Prevent this by also resetting the stack at the
->arch.ctxt_switch->tail() invocations (in both places for consistency)
and thus jumping to the functions instead of calling them.
Jan Beulich [Tue, 15 Dec 2020 12:40:27 +0000 (13:40 +0100)]
x86: replace reset_stack_and_jump_nolp()
Move the necessary check into check_for_livepatch_work(), rather than
mostly duplicating reset_stack_and_jump() for this purpose. This is to
prevent an inflation of reset_stack_and_jump() flavors.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Edwin Török [Tue, 15 Dec 2020 12:37:33 +0000 (13:37 +0100)]
tools/ocaml/xenstored: only Dom0 can change node owner
Otherwise we can give quota away to another domain, either causing it to run
out of quota, or in case of Dom0 use unbounded amounts of memory and bypass
the quota system entirely.
This was fixed in the C version of xenstored in 2006 (c/s db34d2aaa5f5,
predating the XSA process by 5 years).
It was also fixed in the mirage version of xenstore in 2012, with a unit test
demonstrating the vulnerability:
but possibly without realising that the vulnerability still affected the
in-tree oxenstored (added c/s f44af660412 in 2010).
This is XSA-352.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:37:14 +0000 (13:37 +0100)]
tools/ocaml/xenstored: delete watch from trie too when resetting watches
c/s f8c72b526129 "oxenstored: implement XS_RESET_WATCHES" from Xen 4.6
introduced reset watches support in oxenstored by mirroring the change
in cxenstored.
However the OCaml version has some additional data structures to
optimize watch firing, and just resetting the watches in one of the data
structures creates a security bug where a malicious guest kernel can
exceed its watch quota, driving oxenstored into OOM:
* create watches
* reset watches (this still keeps the watches lingering in another data
structure, using memory)
* create some more watches
* loop until oxenstored dies
The guest kernel doesn't necessarily have to be malicious to trigger
this:
* if control/platform-feature-xs_reset_watches is set
* the guest kexecs (e.g. because it crashes)
* on boot more watches are set up
* this will slowly "leak" memory for watches in oxenstored, driving it
towards OOM.
This is XSA-330.
Fixes: f8c72b526129 ("oxenstored: implement XS_RESET_WATCHES") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools/xenstore: Preserve bad client until they are destroyed
XenStored will kill any connection that it thinks has misbehaved,
this is currently happening in two places:
* In `handle_input()` if the sanity check on the ring and the message
fails.
* In `handle_output()` when failing to write the response in the ring.
As the domain structure is a child of the connection, XenStored will
destroy its view of the domain when killing the connection. This will
result in sending @releaseDomain event to all the watchers.
As the watch event doesn't carry which domain has been released,
the watcher (such as XenStored) will generally go through the list of
domains registers and check if one of them is shutting down/dying.
In the case of a client misbehaving, the domain will likely to be
running, so no action will be performed.
When the domain is effectively destroyed, XenStored will not be aware of
the domain anymore. So the watch event is not going to be sent.
By consequence, the watchers of the event will not release mappings
they may have on the domain. This will result in a zombie domain.
In order to send @releaseDomain event at the correct time, we want
to keep the domain structure until the domain is effectively
shutting-down/dying.
We also want to keep the connection around so we could possibly revive
the connection in the future.
A new flag 'is_ignored' is added to mark whether a connection should be
ignored when checking if there are work to do. Additionally any
transactions, watches, buffers associated to the connection will be
freed as you can't do much with them (restarting the connection will
likely need a reset).
As a side note, when the device model were running in a stubdomain, a
guest would have been able to introduce a use-after-free because there
is two parents for a guest connection.