which support naming and use 'libxl_pci_bdf' rather than 'libxl_device_pci',
as replacements for libxl_device_pci_assignable_add/remove/list/list_free().
libxl_pci_bdf_assignable_add() takes a 'name' parameter which is stored in
xenstore and facilitates two addtional functions added by this patch:
libxl_pci_bdf_assignable_name2bdf() and libxl_pci_bdf_assignable_bdf2name().
Currently there are no callers of these two functions. They will be added in
a subsequent patch.
libxl_device_pci_assignable_add/remove/list/list_free() are left in place
for compatibility but are re-implemented in terms of the newly introduced
functions.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:28 +0000 (19:30 +0000)]
docs/man: modify xl(1) in preparation for naming of assignable devices
A subsequent patch will introduce code to allow a name to be specified to
'xl pci-assignable-add' such that the assignable device may be referred to
by than name in subsequent operations.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:27 +0000 (19:30 +0000)]
libxlu: introduce xlu_pci_parse_spec_string()
This patch largely re-writes the code to parse a PCI_SPEC_STRING and enters
it via the newly introduced function. The new parser also deals with 'bdf'
and 'vslot' as non-positional paramaters, as per the documentation in
xl-pci-configuration(5).
The existing xlu_pci_parse_bdf() function remains, but now strictly parses
BDF values. Some existing callers of xlu_pci_parse_bdf() are
modified to call xlu_pci_parse_spec_string() as per the documentation in xl(1).
NOTE: Usage text in xl_cmdtable.c and error messages are also modified
appropriately.
As a side-effect this patch also fixes a bug where using '*' to specify
all functions would lead to an assertion failure at the end of
xlu_pci_parse_bdf().
Fixes: d25cc3ec93eb ("libxl: workaround gcc 10.2 maybe-uninitialized warning") Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:26 +0000 (19:30 +0000)]
libxl: introduce 'libxl_pci_bdf' in the idl...
... and use in 'libxl_device_pci'
This patch is preparatory work for restricting the type passed to functions
that only require BDF information, rather than passing a 'libxl_device_pci'
structure which is only partially filled. In this patch only the minimal
mechanical changes necessary to deal with the structural changes are made.
Subsequent patches will adjust the code to make better use of the new type.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Nick Rosbrook <rosbrookn@ainfosec.com>
Paul Durrant [Tue, 8 Dec 2020 19:30:25 +0000 (19:30 +0000)]
docs/man: fix xl(1) documentation for 'pci' operations
Currently the documentation completely fails to mention the existence of
PCI_SPEC_STRING. This patch tidies things up, specifically clarifying that
'pci-assignable-add/remove' take <BDF> arguments where as 'pci-attach/detach'
take <PCI_SPEC_STRING> arguments (which will be enforced in a subsequent
patch).
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:24 +0000 (19:30 +0000)]
docs/man: improve documentation of PCI_SPEC_STRING...
... and prepare for adding support for non-positional parsing of 'bdf' and
'vslot' in a subsequent patch.
Also document 'BDF' as a first-class parameter type and fix the documentation
to state that the default value of 'rdm_policy' is actually 'strict', not
'relaxed', as can be seen in libxl__device_pci_setdefault().
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:22 +0000 (19:30 +0000)]
libxl: use COMPARE_PCI() macro is_pci_in_array()...
... rather than an open-coded equivalent.
This patch tidies up the is_pci_in_array() function, making it take a single
'libxl_device_pci' argument rather than separate domain, bus, device and
function arguments. The already-available COMPARE_PCI() macro can then be
used and it is also modified to return 'bool' rather than 'int'.
The patch also modifies libxl_pci_assignable() to use is_pci_in_array() rather
than a separate open-coded equivalent, and also modifies it to return a
'bool' rather than an 'int'.
NOTE: The COMPARE_PCI() macro is also fixed to include the 'domain' in its
comparison, which should always have been the case.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
... to be used by callers of libxl_device_pci_assignable_list().
Currently there is no API for callers of libxl_device_pci_assignable_list()
to free the list. The xl function pciassignable_list() calls
libxl_device_pci_dispose() on each element of the returned list, but
libxl_pci_assignable() in libxl_pci.c does not. Neither does the implementation
of libxl_device_pci_assignable_list() call libxl_device_pci_init().
This patch adds the new API function, makes sure it is used everywhere and
also modifies libxl_device_pci_assignable_list() to initialize list
entries rather than just zeroing them.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:20 +0000 (19:30 +0000)]
libxl: make sure callers of libxl_device_pci_list() free the list after use
A previous patch introduced libxl_device_pci_list_free() which should be used
by callers of libxl_device_pci_list() to properly dispose of the exported
'libxl_device_pci' types and the free the memory holding them. Whilst all
current callers do ensure the memory is freed, only the code in xl's
pcilist() function actually calls libxl_device_pci_dispose(). As it stands
this laxity does not lead to any memory leaks, but the simple addition of
.e.g. a 'string' into the idl definition of 'libxl_device_pci' would lead
to leaks.
This patch makes sure all callers of libxl_device_pci_list() can call
libxl_device_pci_list_free() by keeping copies of 'libxl_device_pci'
structures inline in 'pci_add_state' and 'pci_remove_state' (and also making
sure these are properly disposed at the end of the operations) rather
than keeping pointers to the structures returned by libxl_device_pci_list().
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:19 +0000 (19:30 +0000)]
libxl: remove get_all_assigned_devices() from libxl_pci.c
Use of this function is a very inefficient way to check whether a device
has already been assigned.
This patch adds code that saves the domain id in xenstore at the point of
assignment, and removes it again when the device id de-assigned (or the
domain is destroyed). It is then straightforward to check whether a device
has been assigned by checking whether a device has a saved domain id.
NOTE: To facilitate the xenstore check it is necessary to move the
pci_info_xs_read() earlier in libxl_pci.c. To keep related functions
together, the rest of the pci_info_xs_XXX() functions are moved too.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:18 +0000 (19:30 +0000)]
libxl: remove unnecessary check from libxl__device_pci_add()
The code currently checks explicitly whether the device is already assigned,
but this is actually unnecessary as assigned devices do not form part of
the list returned by libxl_device_pci_assignable_list() and hence the
libxl_pci_assignable() test would have already failed.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:17 +0000 (19:30 +0000)]
libxl: generalise 'driver_path' xenstore access functions in libxl_pci.c
For the purposes of re-binding a device to its previous driver
libxl__device_pci_assignable_add() writes the driver path into xenstore.
This path is then read back in libxl__device_pci_assignable_remove().
The functions that support this writing to and reading from xenstore are
currently dedicated for this purpose and hence the node name 'driver_path'
is hard-coded. This patch generalizes these utility functions and passes
'driver_path' as an argument. Subsequent patches will invoke them to
access other nodes.
NOTE: Because functions will have a broader use (other than storing a
driver path in lieu of pciback) the base xenstore path is also
changed from '/libxl/pciback' to '/libxl/pci'.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:16 +0000 (19:30 +0000)]
libxl: stop using aodev->device_config in libxl__device_pci_add()...
... to hold a pointer to the device.
There is already a 'pci' field in 'pci_add_state' so simply use that from
the start. This also allows the 'pci' (#3) argument to be dropped from
do_pci_add().
NOTE: This patch also changes the type of the 'pci_domid' field in
'pci_add_state' from 'int' to 'libxl_domid' which is more appropriate
given what the field is used for.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:12 +0000 (19:30 +0000)]
libxl: Make sure devices added by pci-attach are reflected in the config
Currently libxl__device_pci_add_xenstore() is broken in that does not
update the domain's configuration for the first device added (which causes
creation of the overall backend area in xenstore). This can be easily observed
by running 'xl list -l' after adding a single device: the device will be
missing.
This patch fixes the problem and adds a DEBUG log line to allow easy
verification that the domain configuration is being modified. Also, the use
of libxl__device_generic_add() is dropped as it leads to a confusing situation
where only partial backend information is written under the xenstore
'/libxl' path. For LIBXL__DEVICE_KIND_PCI devices the only definitive
information in xenstore is under '/local/domain/0/backend' (the '0' being
hard-coded).
NOTE: This patch includes a whitespace in add_pcis_done().
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:11 +0000 (19:30 +0000)]
libxl: make libxl__device_list() work correctly for LIBXL__DEVICE_KIND_PCI...
... devices.
Currently there is an assumption built into libxl__device_list() that device
backends are fully enumarated under the '/libxl' path in xenstore. This is
not the case for PCI backend devices, which are only properly enumerated
under '/local/domain/0/backend'.
This patch adds a new get_path() method to libxl__device_type to allow a
backend implementation (such as PCI) to specify the xenstore path where
devices are enumerated and modifies libxl__device_list() to use this method
if it is available. Also, if the get_num() method is defined then the
from_xenstore() method expects to be passed the backend path without the device
number concatenated, so this issue is also rectified.
Having made libxl__device_list() work correctly, this patch removes the
open-coded libxl_pci_device_pci_list() in favour of an evaluation of the
LIBXL_DEFINE_DEVICE_LIST() macro. This has the side-effect of also defining
libxl_pci_device_pci_list_free() which will be used in subsequent patches.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:10 +0000 (19:30 +0000)]
xl: s/pcidev/pci where possible
To improve naming consistency, replaces occurrences of 'pcidev' with 'pci'.
The only remaining use of the term should be in relation to
'libxl_domain_config' where there are fields named 'pcidevs' and 'num_pcidevs'.
Purely cosmetic. No functional change.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Tue, 8 Dec 2020 19:30:09 +0000 (19:30 +0000)]
libxl: s/pcidev/pci and remove DEFINE_DEVICE_TYPE_STRUCT_X
The seemingly arbitrary use of 'pci' and 'pcidev' in the code in libxl_pci.c
is confusing and also compromises use of some macros used for other device
types. Indeed it seems that DEFINE_DEVICE_TYPE_STRUCT_X exists solely because
of this duality.
This patch purges use of 'pcidev' from the libxl internal code, but
unfortunately the 'pcidevs' and 'num_pcidevs' fields in 'libxl_domain_config'
are part of the API and need to be retained to avoid breaking callers,
particularly libvirt.
DEFINE_DEVICE_TYPE_STRUCT_X is still removed to avoid the special case in
libxl_pci.c but DEFINE_DEVICE_TYPE_STRUCT is given an extra 'array' argument
which is used to identify the fields in 'libxl_domain_config' relating to
the device type.
NOTE: Some of the more gross formatting errors (such as lack of spaces after
keywords) that came into context have been fixed in libxl_pci.c.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Juergen Gross [Tue, 15 Dec 2020 15:04:11 +0000 (16:04 +0100)]
tools/xenstore: rework path length check
The different fixed limits for absolute and relative path lengths of
Xenstore nodes make it possible to create per-domain nodes via
absolute paths which are not accessible using relative paths, as the
two limits differ by 1024 characters.
Instead of this weird limits use only one limit, which applies to the
relative path length of per-domain nodes and to the absolute path
length of all other nodes. This means, the path length check is
applied to the path after removing a possible start of
"/local/domain/<n>/" with <n> being a domain id.
There has been the request to be able to limit the path lengths even
more, so an additional quota is added which can be applied to path
lengths. It is XENSTORE_REL_PATH_MAX (2048) per default, but can be
set to lower values. This is done via the new "-M" or "--path-max"
option when invoking xenstored.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org>
Jan Beulich [Tue, 15 Dec 2020 12:47:45 +0000 (13:47 +0100)]
x86/PV: guest_get_eff_kern_l1e() may still need to switch page tables
While indeed unnecessary for pv_ro_page_fault(), pv_map_ldt_shadow_page()
may run when guest user mode is active, and hence may need to switch to
the kernel page tables in order to retrieve an LDT page mapping.
Fixes: 9ff970564764 ("x86/mm: drop guest_get_eff_l1e()") Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
Jan Beulich [Tue, 15 Dec 2020 12:46:37 +0000 (13:46 +0100)]
evtchn/FIFO: re-order and synchronize (with) map_control_block()
For evtchn_fifo_set_pending()'s check of the control block having been
set to be effective, ordering of respective reads and writes needs to be
ensured: The control block pointer needs to be recorded strictly after
the setting of all the queue heads, and it needs checking strictly
before any uses of them (this latter aspect was already guaranteed).
Jan Beulich [Tue, 15 Dec 2020 12:42:51 +0000 (13:42 +0100)]
evtchn/FIFO: add 2nd smp_rmb() to evtchn_fifo_word_from_port()
Besides with add_page_to_event_array() the function also needs to
synchronize with evtchn_fifo_init_control() setting both d->evtchn_fifo
and (subsequently) d->evtchn_port_ops.
Roger Pau Monné [Tue, 15 Dec 2020 12:42:16 +0000 (13:42 +0100)]
x86/irq: fix infinite loop in irq_move_cleanup_interrupt
If Xen enters irq_move_cleanup_interrupt with a dynamic vector below
IRQ_MOVE_CLEANUP_VECTOR pending in IRR (0x20 or 0x21) that's also
designated for a cleanup it will enter a loop where
irq_move_cleanup_interrupt continuously sends a cleanup IPI (vector
0x22) to itself while waiting for the vector with lower priority to be
injected - which will never happen because IRQ_MOVE_CLEANUP_VECTOR
takes precedence and it's always injected first.
Fix this by making sure vectors below IRQ_MOVE_CLEANUP_VECTOR are
marked as used and thus not available for APs. Also add some logic to
assert and prevent irq_move_cleanup_interrupt from entering such an
infinite loop, albeit that should never happen given the current code.
This is XSA-356 / CVE-2020-29567.
Fixes: 3fba06ba9f8 ('x86/IRQ: re-use legacy vector ranges on APs') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Tue, 15 Dec 2020 12:41:23 +0000 (13:41 +0100)]
x86: avoid calling {svm,vmx}_do_resume()
These functions follow the following path: hvm_do_resume() ->
handle_hvm_io_completion() -> hvm_wait_for_io() ->
wait_on_xen_event_channel() -> do_softirq() -> schedule() ->
sched_context_switch() -> continue_running() and hence may
recursively invoke themselves. If this ends up happening a couple of
times, a stack overflow would result.
Prevent this by also resetting the stack at the
->arch.ctxt_switch->tail() invocations (in both places for consistency)
and thus jumping to the functions instead of calling them.
Jan Beulich [Tue, 15 Dec 2020 12:40:27 +0000 (13:40 +0100)]
x86: replace reset_stack_and_jump_nolp()
Move the necessary check into check_for_livepatch_work(), rather than
mostly duplicating reset_stack_and_jump() for this purpose. This is to
prevent an inflation of reset_stack_and_jump() flavors.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com>
Edwin Török [Tue, 15 Dec 2020 12:37:33 +0000 (13:37 +0100)]
tools/ocaml/xenstored: only Dom0 can change node owner
Otherwise we can give quota away to another domain, either causing it to run
out of quota, or in case of Dom0 use unbounded amounts of memory and bypass
the quota system entirely.
This was fixed in the C version of xenstored in 2006 (c/s db34d2aaa5f5,
predating the XSA process by 5 years).
It was also fixed in the mirage version of xenstore in 2012, with a unit test
demonstrating the vulnerability:
but possibly without realising that the vulnerability still affected the
in-tree oxenstored (added c/s f44af660412 in 2010).
This is XSA-352.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:37:14 +0000 (13:37 +0100)]
tools/ocaml/xenstored: delete watch from trie too when resetting watches
c/s f8c72b526129 "oxenstored: implement XS_RESET_WATCHES" from Xen 4.6
introduced reset watches support in oxenstored by mirroring the change
in cxenstored.
However the OCaml version has some additional data structures to
optimize watch firing, and just resetting the watches in one of the data
structures creates a security bug where a malicious guest kernel can
exceed its watch quota, driving oxenstored into OOM:
* create watches
* reset watches (this still keeps the watches lingering in another data
structure, using memory)
* create some more watches
* loop until oxenstored dies
The guest kernel doesn't necessarily have to be malicious to trigger
this:
* if control/platform-feature-xs_reset_watches is set
* the guest kexecs (e.g. because it crashes)
* on boot more watches are set up
* this will slowly "leak" memory for watches in oxenstored, driving it
towards OOM.
This is XSA-330.
Fixes: f8c72b526129 ("oxenstored: implement XS_RESET_WATCHES") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
tools/xenstore: Preserve bad client until they are destroyed
XenStored will kill any connection that it thinks has misbehaved,
this is currently happening in two places:
* In `handle_input()` if the sanity check on the ring and the message
fails.
* In `handle_output()` when failing to write the response in the ring.
As the domain structure is a child of the connection, XenStored will
destroy its view of the domain when killing the connection. This will
result in sending @releaseDomain event to all the watchers.
As the watch event doesn't carry which domain has been released,
the watcher (such as XenStored) will generally go through the list of
domains registers and check if one of them is shutting down/dying.
In the case of a client misbehaving, the domain will likely to be
running, so no action will be performed.
When the domain is effectively destroyed, XenStored will not be aware of
the domain anymore. So the watch event is not going to be sent.
By consequence, the watchers of the event will not release mappings
they may have on the domain. This will result in a zombie domain.
In order to send @releaseDomain event at the correct time, we want
to keep the domain structure until the domain is effectively
shutting-down/dying.
We also want to keep the connection around so we could possibly revive
the connection in the future.
A new flag 'is_ignored' is added to mark whether a connection should be
ignored when checking if there are work to do. Additionally any
transactions, watches, buffers associated to the connection will be
freed as you can't do much with them (restarting the connection will
likely need a reset).
As a side note, when the device model were running in a stubdomain, a
guest would have been able to introduce a use-after-free because there
is two parents for a guest connection.
Juergen Gross [Tue, 15 Dec 2020 12:36:42 +0000 (13:36 +0100)]
tools/xenstore: drop watch event messages exceeding maximum size
By setting a watch with a very large tag it is possible to trick
xenstored to send watch event messages exceeding the maximum allowed
payload size. This might in turn lead to a crash of xenstored as the
resulting error can cause dereferencing a NULL pointer in case there
is no active request being handled by the guest the watch event is
being sent to.
Fix that by just dropping such watch events. Additionally modify the
error handling to test the pointer to be not NULL before dereferencing
it.
Edwin Török [Tue, 15 Dec 2020 12:36:39 +0000 (13:36 +0100)]
tools/ocaml/xenstored: Fix path length validation
Currently, oxenstored checks the length of paths against 1024, then
prepends "/local/domain/$DOMID/" to relative paths. This allows a domU
to create paths which can't subsequently be read by anyone, even dom0.
This also interferes with listing directories, etc.
Define a new oxenstored.conf entry: quota-path-max, defaulting to 1024
as before. For paths that begin with "/local/domain/$DOMID/" check the
relative path length against this quota. For all other paths check the
entire path length.
This ensures that if the domid changes (and thus the length of a prefix
changes) a path that used to be valid stays valid (e.g. after a
live-migration). It also ensures that regardless how the client tries
to access a path (domid-relative or absolute) it will get consistent
results, since the limit is always applied on the final canonicalized
path.
Delete the unused Domain.get_path to avoid it being confused with
Connection.get_path (which differs by a trailing slash only).
Rewrite Util.path_validate to apply the appropriate length restriction
based on whether the path is relative or not. Remove the check for
connection_path being absolute, because it is not guest controlled data.
This is part of XSA-323.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:36:04 +0000 (13:36 +0100)]
tools/ocaml/xenstored: clean up permissions for dead domains
domain ids are prone to wrapping (15-bits), and with sufficient number
of VMs in a reboot loop it is possible to trigger it. Xenstore entries
may linger after a domain dies, until a toolstack cleans it up. During
this time there is a window where a wrapped domid could access these
xenstore keys (that belonged to another VM).
To prevent this do a cleanup when a domain dies:
* walk the entire xenstore tree and update permissions for all nodes
* if the dead domain had an ACL entry: remove it
* if the dead domain was the owner: change the owner to Dom0
This is done without quota checks or a transaction. Quota checks would
be a no-op (either the domain is dead, or it is Dom0 where they are not
enforced). Transactions are not needed, because this is all done
atomically by oxenstored's single thread.
The xenstore entries owned by the dead domain are not deleted, because
that could confuse a toolstack / backends that are still bound to it
(or generate unexpected watch events). It is the responsibility of a
toolstack to remove the xenstore entries themselves.
This is part of XSA-322.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Juergen Gross [Tue, 15 Dec 2020 12:36:01 +0000 (13:36 +0100)]
tools/xenstore: revoke access rights for removed domains
Access rights of Xenstore nodes are per domid. Unfortunately existing
granted access rights are not removed when a domain is being destroyed.
This means that a new domain created with the same domid will inherit
the access rights to Xenstore nodes from the previous domain(s) with
the same domid.
This can be avoided by adding a generation counter to each domain.
The generation counter of the domain is set to the global generation
counter when a domain structure is being allocated. When reading or
writing a node all permissions of domains which are younger than the
node itself are dropped. This is done by flagging the related entry
as invalid in order to avoid modifying permissions in a way the user
could detect.
A special case has to be considered: for a new domain the first
Xenstore entries are already written before the domain is officially
introduced in Xenstore. In order not to drop the permissions for the
new domain a domain struct is allocated even before introduction if
the hypervisor is aware of the domain. This requires adding another
bool "introduced" to struct domain in xenstored. In order to avoid
additional padding holes convert the shutdown flag to bool, too.
As verifying permissions has its price regarding runtime add a new
quota for limiting the number of permissions an unprivileged domain
can set for a node. The default for that new quota is 5.
Edwin Török [Tue, 15 Dec 2020 12:35:19 +0000 (13:35 +0100)]
tools/ocaml/xenstored: add xenstored.conf flag to turn off watch permission checks
There are flags to turn off quotas and the permission system, so add one
that turns off the newly introduced watch permission checks as well.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:35:16 +0000 (13:35 +0100)]
tools/ocaml/xenstored: avoid watch events for nodes without access
Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. setup a
watch for "/" in order to have a detailed record of all Xenstore
modifications.
Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).
Permissions for nodes are looked up either in the old (pre
transaction/command) or current trees (post transaction). If
permissions are changed multiple times in a transaction only the final
version is checked, because considering a transaction atomic the
individual permission changes would not be noticable to an outside
observer.
Two trees are only needed for set_perms: here we can either notice the
node disappearing (if we loose permission), appearing
(if we gain permission), or changing (if we preserve permission).
RM needs to only look at the old tree: in the new tree the node would be
gone, or could have different permissions if it was recreated (the
recreation would get its own watch fired).
Inside a tree we lookup the watch path's parent, and then the watch path
child itself. This gets us 4 sets of permissions in worst case, and if
either of these allows a watch, then we permit it to fire. The
permission lookups are done without logging the failures, otherwise we'd
get confusing errors about permission denied for some paths, but a watch
still firing. The actual result is logged in xenstored-access log:
'w event ...' as usual if watch was fired
'w notfired...' if the watch was not fired, together with path and
permission set to help in troubleshooting
Adding a watch bypasses permission checks and always fires the watch
once immediately. This is consistent with the specification, and no
information is gained (the watch is fired both if the path exists or
doesn't, and both if you have or don't have access, i.e. it reflects the
path a domain gave it back to that domain).
There are some semantic changes here:
* Write+rm in a single transaction of the same path is unobservable
now via watches: both before and after a transaction the path
doesn't exist, thus both tree lookups come up with the empty
permission set, and noone, not even Dom0 can see this. This is
consistent with transaction atomicity though.
* Similar to above if we temporarily grant and then revoke permission
on a path any watches fired inbetween are ignored as well
* There is a new log event (w notfired) which shows the permission set
of the path, and the path.
* Watches on paths that a domain doesn't have access to are now not
seen, which is the purpose of the security fix.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:35:13 +0000 (13:35 +0100)]
tools/ocaml/xenstored: introduce permissions for special watches
The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow to gain information
about presence of other guests on the host. So send watch events for
those watches via privileged connections only.
Start to address this by treating the special watches as regular nodes
in the tree, which gives them normal semantics for permissions. A later
change will restrict the handling, so that they can't be listed, etc.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:35:11 +0000 (13:35 +0100)]
tools/ocaml/xenstored: unify watch firing
This will make it easier insert additional checks in a follow-up patch.
All watches are now fired from a single function.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:35:08 +0000 (13:35 +0100)]
tools/ocaml/xenstored: check privilege for XS_IS_DOMAIN_INTRODUCED
The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for privileged
domains only (the only user in the tree is the xenpaging daemon).
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Edwin Török [Tue, 15 Dec 2020 12:35:06 +0000 (13:35 +0100)]
tools/ocaml/xenstored: ignore transaction id for [un]watch
Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH
commands as it is documented in docs/misc/xenstore.txt, it is tested
for validity today.
Really ignore the transaction id for XS_WATCH and XS_UNWATCH.
This is part of XSA-115.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Juergen Gross [Tue, 15 Dec 2020 12:34:58 +0000 (13:34 +0100)]
tools/xenstore: avoid watch events for nodes without access
Today watch events are sent regardless of the access rights of the
node the event is sent for. This enables any guest to e.g. setup a
watch for "/" in order to have a detailed record of all Xenstore
modifications.
Modify that by sending only watch events for nodes that the watcher
has a chance to see otherwise (either via direct reads or by querying
the children of a node). This includes cases where the visibility of
a node for a watcher is changing (permissions being removed).
Juergen Gross [Tue, 15 Dec 2020 12:34:56 +0000 (13:34 +0100)]
tools/xenstore: allow special watches for privileged callers only
The special watches "@introduceDomain" and "@releaseDomain" should be
allowed for privileged callers only, as they allow to gain information
about presence of other guests on the host. So send watch events for
those watches via privileged connections only.
In order to allow for disaggregated setups where e.g. driver domains
need to make use of those special watches add support for calling
"set permissions" for those special nodes, too.
Juergen Gross [Tue, 15 Dec 2020 12:34:50 +0000 (13:34 +0100)]
tools/xenstore: fire watches only when removing a specific node
Instead of firing all watches for removing a subtree in one go, do so
only when the related node is being removed.
The watches for the top-most node being removed include all watches
including that node, while watches for nodes below that are only fired
if they are matching exactly. This avoids firing any watch more than
once when removing a subtree.
Juergen Gross [Tue, 15 Dec 2020 12:34:48 +0000 (13:34 +0100)]
tools/xenstore: rework node removal
Today a Xenstore node is being removed by deleting it from the parent
first and then deleting itself and all its children. This results in
stale entries remaining in the data base in case e.g. a memory
allocation is failing during processing. This would result in the
rather strange behavior to be able to read a node (as its still in the
data base) while not being visible in the tree view of Xenstore.
Fix that by deleting the nodes from the leaf side instead of starting
at the root.
As fire_watches() is now called from _rm() the ctx parameter needs a
const attribute.
Juergen Gross [Tue, 15 Dec 2020 12:34:42 +0000 (13:34 +0100)]
tools/xenstore: simplify and rename check_event_node()
There is no path which allows to call check_event_node() without a
event name. So don't let the result depend on the name being NULL and
add an assert() covering that case.
Rename the function to check_special_event() to better match the
semantics.
Juergen Gross [Tue, 15 Dec 2020 12:34:40 +0000 (13:34 +0100)]
tools/xenstore: fix node accounting after failed node creation
When a node creation fails the number of nodes of the domain should be
the same as before the failed node creation. In case of failure when
trying to create a node requiring to create one or more intermediate
nodes as well (e.g. when /a/b/c/d is to be created, but /a/b isn't
existing yet) it might happen that the number of nodes of the creating
domain is not reset to the value it had before.
So move the quota accounting out of construct_node() and into the node
write loop in create_node() in order to be able to undo the accounting
in case of an error in the intermediate node destructor.
Juergen Gross [Tue, 15 Dec 2020 12:34:37 +0000 (13:34 +0100)]
tools/xenstore: ignore transaction id for [un]watch
Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH
commands as it is documented in docs/misc/xenstore.txt, it is tested
for validity today.
Really ignore the transaction id for XS_WATCH and XS_UNWATCH.
Juergen Gross [Tue, 15 Dec 2020 12:34:35 +0000 (13:34 +0100)]
tools/xenstore: allow removing child of a node exceeding quota
An unprivileged user of Xenstore is not allowed to write nodes with a
size exceeding a global quota, while privileged users like dom0 are
allowed to write such nodes. The size of a node is the needed space
to store all node specific data, this includes the names of all
children of the node.
When deleting a node its parent has to be modified by removing the
name of the to be deleted child from it.
This results in the strange situation that an unprivileged owner of a
node might not succeed in deleting that node in case its parent is
exceeding the quota of that unprivileged user (it might have been
written by dom0), as the user is not allowed to write the updated
parent node.
Fix that by not checking the quota when writing a node for the
purpose of removing a child's name only.
The same applies to transaction handling: a node being read during a
transaction is written to the transaction specific area and it should
not be tested for exceeding the quota, as it might not be owned by
the reader and presumably the original write would have failed if the
node is owned by the reader.
Edwin Török [Tue, 15 Dec 2020 12:33:43 +0000 (13:33 +0100)]
tools/ocaml/xenstored: do permission checks on xenstore root
This was lacking in a disappointing number of places.
The xenstore root node is treated differently from all other nodes, because it
doesn't have a parent, and mutation requires changing the parent.
Unfortunately this lead to open-coding the special case for root into every
single xenstore operation, and out of all the xenstore operations only read
did a permission check when handling the root node.
This means that an unprivileged guest can:
* xenstore-chmod / to its liking and subsequently write new arbitrary nodes
there (subject to quota)
* xenstore-rm -r / deletes almost the entire xenstore tree (xenopsd quickly
refills some, but you are left with a broken system)
* DIRECTORY on / lists all children when called through python
bindings (xenstore-ls stops at /local because it tries to list recursively)
* get-perms on / works too, but that is just a minor information leak
Add the missing permission checks, but this should really be refactored to do
the root handling and permission checks on the node only once from a single
function, instead of getting it wrong nearly everywhere.
This is XSA-353.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Fri, 11 Dec 2020 10:53:24 +0000 (11:53 +0100)]
xen: fix build when $(obj-y) consists of just blanks
This case can occur when combining empty lists
obj-y :=
...
obj-y += $(empty)
or
obj-y := $(empty) $(empty)
where (only) blanks would accumulate. This was only a latent issue until
now, but would become an active issue for Arm once lib/ gets populated
with all respective objects going into the to be introduced lib.a.
Also address a related issue at this occasion: When an empty built_in.o
gets created, .built_in.o.d will have its dependencies recorded. If, on
a subsequent incremental build, an actual constituent of built_in.o
appeared, the $(filter-out ) would leave these recorded dependencies in
place. But of course the linker won't know what to do with C header
files. (The apparent alternative of avoiding to pass $(c_flags) or
$(a_flags) would not be reliable afaict, as among these flags there may
be some affecting information conveyed via the object file to the
linker. The linker, finding inconsistent flags across object files, may
then error out.) Using just $(obj-y) won't work either: It breaks when
the same object file is listed more than once.
Reported-by: Julien Grall <julien@xen.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
Jan Beulich [Fri, 11 Dec 2020 10:52:35 +0000 (11:52 +0100)]
evtchn: convert vIRQ lock to an r/w one
There's no need to serialize all sending of vIRQ-s; all that's needed
is serialization against the closing of the respective event channels
(so far by means of a barrier). To facilitate the conversion, switch to
an ordinary write locked region in evtchn_close().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
Jan Beulich [Tue, 8 Dec 2020 07:53:18 +0000 (08:53 +0100)]
libxenstat: avoid build race
Olaf reported observing
xenstat_qmp.c:26:10: fatal error: _paths.h: No such file or directory
.../tools/libs/stat/../../../tools/Rules.mk:153: xenstat_qmp.opic] Error 1
Obviously _paths.h, when included by any of the sources, needs to be
created in advance of compiling any of them, not just when compiling
into non-PIC objects.
Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Acked-by: Wei Liu <wl@xen.org>
Juergen Gross [Tue, 8 Dec 2020 07:52:31 +0000 (08:52 +0100)]
xen/events: do some cleanups in evtchn_fifo_set_pending()
evtchn_fifo_set_pending() can be simplified a little bit. Especially
testing for the fifo control block to exist can be moved out of the
main if clause of the function enabling to avoid testing the LINKED bit
twice.
Suggested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Roger Pau Monne [Tue, 17 Nov 2020 09:32:58 +0000 (10:32 +0100)]
docs: fix documentation about default scheduler
Fix the command line document to account for the default scheduler in
Kconfig being credit2 now, and the fact that it's selectable at build
time and thus different builds could end up with different default
schedulers.
Fixes: dafd936dddbd ('Make credit2 the default scheduler') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Wei Liu <wl@xen.org>
xen/cpupool: switch cpupool list to normal list interface
Instead of open coding something like a linked list just use the
available functionality from list.h.
The allocation of a new cpupool id is not aware of a possible wrap.
Fix that. We don't, however, consider the case of extremely many pools
(beyond 4 billion) as something which needs explicitly handling right
now. First and foremost there would need to be systems with 4 billion
CPUs to make this many pools a sensible thing to have.
While adding the required new include to private.h sort the includes.
Hongyan Xia [Mon, 7 Dec 2020 13:54:44 +0000 (14:54 +0100)]
x86/vmap: handle superpages in vmap_to_mfn()
There is simply no guarantee that vmap won't return superpages to the
caller. It can happen if the list of MFNs are contiguous, or we simply
have a large granularity. Although rare, if such things do happen, we
will simply hit BUG_ON() and crash.
Introduce xen_map_to_mfn() to translate any mapped Xen address to mfn
regardless of page size, and wrap vmap_to_mfn() around it.
Signed-off-by: Hongyan Xia <hongyxia@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Jan Beulich [Mon, 7 Dec 2020 13:53:20 +0000 (14:53 +0100)]
x86/vIO-APIC: make use of xmalloc_flex_struct()
... instead of effectively open-coding it in a type-unsafe way. Drop
hvm_vioapic_size() altogether, folding the other use in a memset()
invocation into the subsequent loop.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Igor Druzhinin [Mon, 7 Dec 2020 13:52:35 +0000 (14:52 +0100)]
x86/IRQ: allocate guest array of max size only for shareable IRQs
... and increase default "irq-max-guests" to 32.
It's not necessary to have an array of a size more than 1 for non-shareable
IRQs and it might impact scalability in case of high "irq-max-guests"
values being used - every IRQ in the system including MSIs would be
supplied with an array of that size.
Since it's now less impactful to use higher "irq-max-guests" value - bump
the default to 32. That should give more headroom for future systems.
Requested-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Igor Druzhinin [Mon, 7 Dec 2020 13:49:30 +0000 (14:49 +0100)]
x86/IRQ: make max number of guests for a shared IRQ configurable
... and increase the default to 16.
Current limit of 7 is too restrictive for modern systems where one GSI
could be shared by potentially many PCI INTx sources where each of them
corresponds to a device passed through to its own guest. Some systems do not
apply due dilligence in swizzling INTx links in case e.g. INTA is declared as
interrupt pin for the majority of PCI devices behind a single router,
resulting in overuse of a GSI.
Introduce a new command line option to configure that limit and dynamically
allocate an array of the necessary size. Set the default size now to 16 which
is higher than 7 but could later be increased even more if necessary.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Wed, 11 Nov 2020 10:01:43 +0000 (11:01 +0100)]
tools/libs/ctrl: fix dumping of ballooned guest
A guest with memory < maxmem often can't be dumped via xl dump-core
without an error message today:
xc: info: exceeded nr_pages (262144) losing pages
In case the last page of the guest isn't allocated the loop in
xc_domain_dumpcore_via_callback() will always spit out this message,
as the number of already dumped pages is tested before the next page
is checked to be valid.
The guest's p2m_size might be lower than expected, so this should be
tested in order to avoid reading past the end of it.
The guest might use high bits in p2m entries to flag special cases like
foreign mappings. Entries with an MFN larger than the highest MFN of
the host should be skipped.
Jan Beulich [Fri, 4 Dec 2020 12:17:24 +0000 (13:17 +0100)]
x86/IRQ: reduce casting involved in guest action retrieval
Introduce a helper function covering both the IRQ_GUEST check and the
cast involved in obtaining the (correctly typed) pointer. Where possible
add const and/or reduce variable scope.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:15:38 +0000 (13:15 +0100)]
viridian: log initial invocation of each type of hypercall
To make is simpler to observe which viridian hypercalls are issued by a
particular Windows guest, this patch adds a per-domain mask to track them.
Each type of hypercall causes a different bit to be set in the mask and
when the bit transitions from clear to set, a log line is emitted showing
the name of the hypercall and the domain that issued it.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:15:21 +0000 (13:15 +0100)]
viridian: add ExProcessorMasks variant of the IPI hypercall
A previous patch introduced variants of the flush hypercalls that take a
'Virtual Processor Set' as an argument rather than a simple 64-bit mask.
This patch introduces a similar variant of the HVCALL_SEND_IPI hypercall
(HVCALL_SEND_IPI_EX).
NOTE: As with HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE/LIST_EX, a guest should
not yet issue the HVCALL_SEND_IPI_EX hypercall as support for
'ExProcessorMasks' is not yet advertised via CPUID.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:14:59 +0000 (13:14 +0100)]
viridian: add ExProcessorMasks variants of the flush hypercalls
The Microsoft Hypervisor TLFS specifies variants of the already implemented
HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE/LIST hypercalls that take a 'Virtual
Processor Set' as an argument rather than a simple 64-bit mask.
This patch adds a new hvcall_flush_ex() function to implement these
(HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE/LIST_EX) hypercalls. This makes use of
new helper functions, hv_vpset_nr_banks() and hv_vpset_to_vpmask(), to
determine the size of the Virtual Processor Set (so it can be copied from
guest memory) and parse it into hypercall_vpmask (respectively).
NOTE: A guest should not yet issue these hypercalls as 'ExProcessorMasks'
support needs to be advertised via CPUID. This will be done in a
subsequent patch.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:14:42 +0000 (13:14 +0100)]
viridian: use softirq batching in hvcall_ipi()
vlapic_ipi() uses a softirq batching mechanism to improve the efficiency of
sending a IPIs to large number of processors. This patch modifies send_ipi()
(the worker function called by hvcall_ipi()) to also make use of the
mechanism when there multiple bits set the hypercall_vpmask.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:14:25 +0000 (13:14 +0100)]
viridian: use hypercall_vpmask in hvcall_ipi()
A subsequent patch will need to IPI a mask of virtual processors potentially
wider than 64 bits. A previous patch introduced per-cpu hypercall_vpmask
to allow hvcall_flush() to deal with such wide masks. This patch modifies
the implementation of hvcall_ipi() to make use of the same mask structures,
introducing a for_each_vp() macro to facilitate traversing a mask.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:14:03 +0000 (13:14 +0100)]
viridian: introduce a per-cpu hypercall_vpmask and accessor functions...
... and make use of them in hvcall_flush()/need_flush().
Subsequent patches will need to deal with virtual processor masks potentially
wider than 64 bits. Thus, to avoid using too much stack, this patch
introduces global per-cpu virtual processor masks and converts the
implementation of hvcall_flush() to use them.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:13:41 +0000 (13:13 +0100)]
viridian: move IPI hypercall implementation into separate function
This patch moves the implementation of HVCALL_SEND_IPI that is currently
inline in viridian_hypercall() into a new hvcall_ipi() function.
The new function returns Xen errno values similarly to hvcall_flush(). Hence
the errno translation code in viridial_hypercall() is generalized, removing
the need for the local 'status' variable.
NOTE: The formatting of the switch statement at the top of
viridial_hypercall() is also adjusted as per CODING_STYLE.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:13:22 +0000 (13:13 +0100)]
viridian: move flush hypercall implementation into separate function
This patch moves the implementation of HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE/LIST
that is currently inline in viridian_hypercall() into a new hvcall_flush()
function.
The new function returns Xen erro values which are then dealt with
appropriately. A return value of -ERESTART translates to viridian_hypercall()
returning HVM_HCALL_preempted. Other return values translate to status codes
and viridian_hypercall() returning HVM_HCALL_completed. Currently the only
values, other than -ERESTART, returned by hvcall_flush() are 0 (indicating
success) or -EINVAL.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Paul Durrant [Fri, 4 Dec 2020 12:12:54 +0000 (13:12 +0100)]
viridian: don't blindly write to 32-bit registers if 'mode' is invalid
If hvm_guest_x86_mode() returns something other than 8 or 4 then
viridian_hypercall() will return immediately but, on the way out, will write
back status as if 'mode' was 4. This patch simply makes it leave the registers
alone.
NOTE: The formatting of the 'out' label and the switch statement are also
adjusted as per CODING_STYLE.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Acked-by: Wei Liu <wl@xen.org>
Olaf Hering [Thu, 3 Dec 2020 06:34:36 +0000 (07:34 +0100)]
tools/hotplug: allow tuning of xenwatchdogd arguments
Currently the arguments for xenwatchdogd are hardcoded with 15s
keep-alive interval and 30s timeout.
It is not possible to tweak these values via
/etc/systemd/system/xen-watchdog.service.d/*.conf because ExecStart
can not be replaced. The only option would be a private copy
/etc/systemd/system/xen-watchdog.service, which may get out of sync
with the Xen provided xen-watchdog.service.
Adjust the service file to recognize XENWATCHDOGD_ARGS= in a
private unit configuration file.
Signed-off-by: Olaf Hering <olaf@aepfle.de> Acked-by: Wei Liu <wl@xen.org>
Juergen Gross [Fri, 4 Dec 2020 07:31:25 +0000 (08:31 +0100)]
xen/hypfs: add getsize() and findentry() callbacks to hypfs_funcs
Add a getsize() function pointer to struct hypfs_funcs for being able
to have dynamically filled entries without the need to take the hypfs
lock each time the contents are being generated.
For directories add a findentry callback to the vector and modify
hypfs_get_entry_rel() to use it. For its non-directory node counterpart
introduce the so far unused and hence missing ENOTDIR error code.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Fri, 4 Dec 2020 07:29:41 +0000 (08:29 +0100)]
xen/hypfs: move per-node function pointers into a dedicated struct
Move the function pointers currently stored in each hypfs node into a
dedicated structure in order to save some space for each node. This
will save even more space with additional callbacks added in future.
Provide some standard function vectors.
Instead of testing the write pointer to be not NULL provide a dummy
function just returning -EACCESS. ASSERT() all vector entries being
populated when adding a node. This avoids any potential problem (e.g.
pv domain privilege escalations) in case of calling a non populated
vector entry.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
In docs/misc/dump-core-format.txt there are a few more instances of
'informations'. I'll leave that up to someone who can properly determine
how those sentences should be constructed.
Signed-off-by: Diederik de Haas <didi.debian@cknow.org> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Jan Beulich <jbeulich@suse.com>
When booting Xen with CONFIG_USBAN=y on Sandy Bridge, UBSAN will throw
the following splat:
(XEN) ================================================================================
(XEN) UBSAN: Undefined behaviour in quirks.c:449:63
(XEN) left shift of 1 by 31 places cannot be represented in type 'int'
(XEN) ----[ Xen-4.11.4 x86_64 debug=y Not tainted ]----
Note that splat is from 4.11.4 and not staging. Although, the problem is
still present.
This can be solved by making the first operand unsigned int.
Signed-off-by: Julien Grall <jgrall@amazon.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Juergen Gross [Wed, 2 Dec 2020 09:12:37 +0000 (10:12 +0100)]
xen/cpupool: add missing bits for per-cpupool scheduling granularity
Even with storing the scheduling granularity in struct cpupool there
are still a few bits missing for being able to have cpupools with
different granularity (apart from the missing interface for setting
the individual granularities): the number of cpus in a scheduling
unit is always taken from the global sched_granularity variable.
So store the value in struct cpupool and use that instead of
sched_granularity.
Juergen Gross [Wed, 2 Dec 2020 09:12:04 +0000 (10:12 +0100)]
xen/cpupool: add cpu to sched_res_mask when removing it from cpupool
When a cpu is removed from a cpupool and added to the free cpus it
should be added to sched_res_mask, too.
The related removal from sched_res_mask in case of core scheduling
is already done in schedule_cpu_add().
As long as all cpupools share the same scheduling granularity there
is nothing going wrong with the missing addition, but this will change
when per-cpupool granularity is fully supported.
Rahul Singh [Wed, 2 Dec 2020 09:09:27 +0000 (10:09 +0100)]
ns16550: gate all PCI code with CONFIG_X86
The NS16550 driver is assuming that NS16550 PCI card are usable if the
architecture supports PCI (i.e. CONFIG_HAS_PCI=y). However, the code is
very x86 focus and will fail to build on Arm (/!\ it is not all the
errors):
ns16550.c: In function ‘ns16550_init_irq’:
ns16550.c:726:21: error: implicit declaration of function ‘create_irq’;
did you mean ‘release_irq’? [-Werror=implicit-function-declaration]
uart->irq = create_irq(0, false);
^~~~~~~~~~
release_irq
ns16550.c:726:21: error: nested extern declaration of ‘create_irq’
[-Werror=nested-externs]
ns16550.c: In function ‘ns16550_init_postirq’:
ns16550.c:768:33: error: ‘mmio_ro_ranges’ undeclared (first use in this
function); did you mean ‘mmio_handler’?
rangeset_add_range(mmio_ro_ranges, uart->io_base,
^~~~~~~~~~~~~~
mmio_handler
ns16550.c:768:33: note: each undeclared identifier is reported only once
for each function it appears in
ns16550.c:780:20: error: variable ‘msi’ has initializer but incomplete
type
struct msi_info msi = {
^~~~~~~~
Enabling support for NS16550 PCI card on Arm would require more plumbing
in addition to fixing the compilation error.
Arm systems tend to have platform UART available such as NS16550, PL011.
So there are limited reasons to get NS16550 PCI support for now on Arm.
Guard all remaining PCI code that is not under x86 flag with CONFIG_X86.
Roger Pau Monné [Mon, 30 Nov 2020 13:06:38 +0000 (14:06 +0100)]
x86/vioapic: fix usage of index in place of GSI in vioapic_write_redirent
The usage of idx instead of the GSI in vioapic_write_redirent when
accessing gsi_assert_count can cause a PVH dom0 with multiple
vIO-APICs to lose interrupts in case a pin of a IO-APIC different than
the first one is unmasked with pending interrupts.
Switch to use gsi instead to fix the issue.
Fixes: 9f44b08f7d0e4 ('x86/vioapic: introduce support for multiple vIO APICS') Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Roger Pau Monné <roge.rpau@citrix.com> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Juergen Gross [Mon, 30 Nov 2020 13:05:39 +0000 (14:05 +0100)]
xen/events: rework fifo queue locking
Two cpus entering evtchn_fifo_set_pending() for the same event channel
can race in case the first one gets interrupted after setting
EVTCHN_FIFO_PENDING and when the other one manages to set
EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
lead to evtchn_check_pollers() being called before the event is put
properly into the queue, resulting eventually in the guest not seeing
the event pending and thus blocking forever afterwards.
Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
lock") made the race just more obvious, while the fifo event channel
implementation had this race forever since the introduction and use of
per-channel locks, when an unmask operation was running in parallel with
an event channel send operation.
Using a spinlock for the per event channel lock had turned out
problematic due to some paths needing to take the lock are called with
interrupts off, so the lock would need to disable interrupts, which in
turn broke some use cases related to vm events.
For avoiding this race the queue locking in evtchn_fifo_set_pending()
needs to be reworked to cover the test of EVTCHN_FIFO_PENDING,
EVTCHN_FIFO_MASKED and EVTCHN_FIFO_LINKED, too. Additionally when an
event channel needs to change queues both queues need to be locked
initially, in order to avoid having a window with no lock held at all.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock") Fixes: de6acb78bf0e137c ("evtchn: use a per-event channel lock for sending events") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>