SUPPORT.md: clarify support of untrusted driver domains with oxenstored
Add a support statement for the scope of support regarding different
Xenstore variants. In particular, oxenstored does not (yet) have security
support for untrusted driver domains, as those might drive oxenstored
out of memory by creating lots of watch events for the guests they are
servicing.
Add a statement regarding Live Update support of oxenstored.
This is part of XSA-326.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Wed, 12 Oct 2022 18:13:04 +0000 (19:13 +0100)]
tools/ocaml: Limit maximum in-flight requests / outstanding replies
Introduce a limit on the number of outstanding reply packets in the xenbus
queue. This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.
To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it. Even though Dom0
is trusted and exempt from quotas, a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.
Therefore the xenbus queue limit must apply to all connections; Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.
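As a rough illustration of the gating described above (types and names here
are made up for the sketch, not taken from the oxenstored sources):

    (* Only process input for a connection while its output queue has room,
       so a client that stops reading replies cannot grow the queue without
       bound; otherwise back-pressure is applied by skipping the connection. *)
    type connection = {
      outgoing : string Queue.t;   (* replies/events not yet consumed *)
      max_outstanding : int;       (* cf. quota-maxoutstanding *)
    }

    let can_input con = Queue.length con.outgoing < con.max_outstanding

    let process_input con handle_packet =
      if can_input con then handle_packet con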
This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring. However, if it
can cause the generation of watch events by other means (e.g. by Dom0, or
another cooperative guest) and stops reading its own ring, then watch events
would queue up without limit.
The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events. In fact, dropping watch events is known to break
some pieces of normal functionality. This leaves little choice in how to safely
implement the xenstore protocol without exposing the xenstore daemon to
out-of-memory attacks.
Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
deliver
Items are queued up on one end and are sent as far along the pipe as possible:
source domain -> watch -> xenbus of target -> xenstore ring/socket of target
If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up. For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.
Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue. This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.
When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).
There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.
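A minimal sketch of the "flush first, then enqueue into a bounded buffer" idea
(illustrative only, not the actual oxenstored types):

    (* A bounded buffer: before accepting a new item, give the caller a chance
       to drain items further down the pipe; refuse the item if still full. *)
    type 'a bounded = { q : 'a Queue.t; capacity : int }

    let push ~flush_downstream item b =
      flush_downstream b.q;          (* send as far along the pipe as possible *)
      if Queue.length b.q < b.capacity then (Queue.push item b.q; true)
      else false                     (* full: apply back-pressure upstream *)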
Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
source will not be allowed to queue up more watch events.
* quota-maxoutstanding which is the number of responses not read from the ring:
once exceeded, no more inputs are processed until all outstanding replies
are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
on destination are queued up here, a single command can trigger multiple
watches (e.g. due to recursion).
The overflow queue currently doesn't have an upper bound; it is difficult to
accurately calculate one, as it depends on whether you are Dom0, how many
watches each path has registered, and how many watch events you can trigger
with a single command (e.g. a commit). However, these events were already
using memory; this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.
Note that Dom0 is not excluded from these checks, it is important that Dom0 is
especially not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.
This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.
The queue limits are per watch and per domain-pair, so even if one
communication channel were "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.
Similarly a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well as long as it follows the PV
protocol. (Although note that protection against malicious backends is only a
best effort at the moment)
This is part of XSA-326 / CVE-2022-42318.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Wed, 12 Oct 2022 18:13:03 +0000 (19:13 +0100)]
tools/ocaml/xb: Add BoundedQueue
Ensures we cannot store more than [capacity] elements in a [Queue]. Replacing
all uses of Queue with this module will then ensure at compile time that all
queues are correctly bounds-checked.
Each element in the queue has a class with its own limits. This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
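A rough sketch of what such an interface could look like (the actual module in
tools/ocaml/xb may differ in naming and detail):

    module BoundedQueue : sig
      type ('a, 'cls) t
      val create : capacity:int -> class_limit:('cls -> int) -> ('a, 'cls) t
      val push : 'a -> ('a, 'cls) t -> 'cls -> bool  (* false when full *)
      val pop : ('a, 'cls) t -> 'a option
    end = struct
      type ('a, 'cls) t = {
        q : ('a * 'cls) Queue.t;
        capacity : int;
        class_limit : 'cls -> int;
        counts : ('cls, int) Hashtbl.t;
      }

      let create ~capacity ~class_limit =
        { q = Queue.create (); capacity; class_limit; counts = Hashtbl.create 7 }

      let count t cls = try Hashtbl.find t.counts cls with Not_found -> 0

      (* refuse the element if either the total or the per-class limit is hit *)
      let push item t cls =
        if Queue.length t.q >= t.capacity || count t cls >= t.class_limit cls
        then false
        else begin
          Queue.push (item, cls) t.q;
          Hashtbl.replace t.counts cls (count t cls + 1);
          true
        end

      let pop t =
        match Queue.pop t.q with
        | item, cls -> Hashtbl.replace t.counts cls (count t cls - 1); Some item
        | exception Queue.Empty -> None
    end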
No functional change.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Wed, 12 Oct 2022 18:13:07 +0000 (19:13 +0100)]
tools/ocaml: GC parameter tuning
By default the OCaml garbage collector returns memory to the OS only
once unused memory reaches 5x live memory. Tweak this to 120% instead,
which matches the major GC speed.
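Assuming the tuning is done through the standard Gc module, it would look
roughly like this (the actual oxenstored change may differ in detail):

    (* Trigger compaction (and return of memory to the OS) once wasted memory
       exceeds 120% of live memory instead of the 500% default, matching the
       major GC's space_overhead. *)
    let tune_gc () =
      Gc.set { (Gc.get ()) with Gc.max_overhead = 120 }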
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Thu, 28 Jul 2022 16:08:15 +0000 (17:08 +0100)]
tools/ocaml/xenstored: Check for maxrequests before performing operations
Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.
If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded
It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there. Then the transaction
record won't grow.
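The reordering can be pictured with a small sketch (types and names here are
illustrative, not the oxenstored ones):

    type tx = { tree : (string * string) list; ops : string list }

    (* Check the maxrequests quota before touching the transaction, so that a
       rejected request leaves neither the tree nor the recorded ops modified. *)
    let perform ~maxrequests apply op tx =
      if List.length tx.ops >= maxrequests then Error "EQUOTA"
      else Ok { tree = apply op tx.tree; ops = op :: tx.ops }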
This is part of XSA-326 / CVE-2022-42317.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
Edwin Török [Wed, 12 Oct 2022 18:13:01 +0000 (19:13 +0100)]
tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
We currently have two different sets of defaults in the upstream Xen git tree:
* defined in the source code, only used if there is no config file
* defined in oxenstored.conf.in in upstream Xen
An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.
Resync the defaults from oxenstored.conf.in into the source code.
This is part of XSA-326 / CVE-2022-42316.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
tools/xenstore: add control command for setting and showing quota
Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain
Note that in case the new quota is lower than the existing one,
Xenstored may continue to handle requests from a domain exceeding the
new limit (depending on which quota has been exceeded) and the amount of
resource used will not change. However, the domain will not be able to
create more of the resource (associated with the quota) until it is back
below the limit.
Add the memory accounting for Xenstore nodes. In order to make this
not too complicated, allow for some sloppiness when writing nodes. Any
hard quota violation will result in no further requests being accepted.
When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().
tools/xenstore: add memory accounting for responses
Add the memory accounting for queued responses.
In case adding a watch event for a guest would cause the hard memory
quota of that guest to be violated, the event is dropped. This ensures
that it is impossible to drive another guest past its memory quota by
generating insane amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
tools/xenstore: add infrastructure to keep track of per domain memory usage
The amount of memory a domain can consume in Xenstore is limited by
various quota today, but even with sane quota a domain can still
consume rather large memory quantities.
Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so any internal administration data
needed by Xenstore only is not being accounted for.
There are two quotas defined: a soft quota which will result in a
warning issued via syslog() when it is exceeded, and a hard quota
resulting in a stop of accepting further requests or watch events as
long as the hard quota would be violated by accepting those.
Setting any of those quotas to 0 will disable it.
As default values use 2MB per domain for the soft limit (this basically
covers the allowed case of creating 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
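Sketched in OCaml for brevity (the real implementation is C; the names and the
logging call are illustrative):

    type dom_acc = { mutable mem : int; soft : int; hard : int }  (* bytes; 0 = disabled *)

    (* Returns false when the hard quota would be violated, i.e. further
       requests or watch events for this domain must not be accepted. *)
    let account_add acc bytes =
      acc.mem <- acc.mem + bytes;
      if acc.soft > 0 && acc.mem > acc.soft then
        prerr_endline "xenstored: domain exceeded soft memory quota";
      not (acc.hard > 0 && acc.mem > acc.hard)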
tools/xenstore: limit max number of nodes accessed in a transaction
Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore
as there is the need to keep track of all nodes having been accessed
during a transaction.
In oxenstored the number of requests in a transaction is being limited
via a quota maxrequests (default is 1024). As multiple accesses of a
node are not problematic in C Xenstore, limit the number of accessed
nodes instead.
In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
tools/xenstore: simplify and fix per domain node accounting
The accounting of nodes can be simplified now that each connection
holds the associated domid.
Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires reacting properly to an allocation
failure inside domain_entry_inc() by returning an error code.
Especially in error paths the node accounting has to be fixed in some
cases.
A guest not reading its Xenstore response buffer fast enough might
cause lots of Xenstore watch events to pile up. Reduce the generated
load by dropping new events which already have an identical copy
pending.
The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
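The dedup logic amounts to something like the following sketch (OCaml used for
illustration; the real code is C and differs in detail):

    let is_special path = String.length path > 0 && path.[0] = '@'

    (* Drop a new event if an identical one is already pending, except for the
       special "@..." events which are always delivered individually. *)
    let queue_event pending (path, token) =
      if (not (is_special path)) && List.mem (path, token) pending
      then pending
      else (path, token) :: pending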
Add another quota for limiting the number of outstanding requests of a
guest. As the way to specify quotas on the command line is becoming
rather nasty, switch to a new scheme using [--quota|-Q] <what>=<val>
allowing more quotas to be added easily in the future.
Set the default value to 20 (basically a random value not seeming to
be too high or too low).
A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
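In other words, the accounting behaves roughly like this sketch (illustrative
names; the real code is C):

    type dom = { mutable outstanding : int; quota_req_outstanding : int }

    (* A request becomes outstanding when accepted and stops being outstanding
       only once its response and all watch events it caused are written out. *)
    let accept_request d = d.outstanding <- d.outstanding + 1
    let all_messages_written d = d.outstanding <- d.outstanding - 1
    let may_process d = d.outstanding < d.quota_req_outstanding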
Note that across a live update the relation to buffered watch events
for other domains is lost.
Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota zeroed initially.
A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response of the
request or any resulting watch event hasn't been consumed yet.
In order to avoid a malicious guest being capable of blocking other guests
by not reading watch events, add a timeout for watch events. In case a
watch event hasn't been consumed after this timeout, it is deleted. Set
the default timeout to 20 seconds (an arbitrary value that is not too
high).
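Conceptually the pruning is just (an illustrative sketch; the actual
implementation is in C and differs in detail):

    (* Drop buffered watch events that the guest has not consumed within the
       timeout (seconds); 'now' and the queueing timestamps use the same clock. *)
    let prune_events ~timeout ~now events =
      List.filter (fun (queued_at, _event) -> now -. queued_at < timeout) events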
In order to support specifying other timeout values in the future, use a
generic command line option for that purpose:
When removing a watched node outside of a transaction, two watch events
are being produced instead of just a single one.
When finalizing a transaction watch events can be generated for each
node which is being modified, even if outside a transaction such
modifications might not have resulted in a watch event.
This happens e.g.:
- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
is implicitly creating a/b, resulting in watch events for a, a/b and
a/b/c instead of a/b/c only)
Avoid these additional watch events, in order to reduce the needed
memory inside Xenstore for queueing them.
This is being achieved by adding event flags to struct accessed_node
specifying whether an event should be triggered, and whether it should
be an exact match of the modified path. Both flags can now be set
explicitly from fire_watches() instead of being implied.
tools/xenstore: add helpers to free struct buffered_data
Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.
This avoids duplicated code and will help later when more actions
are needed when freeing a struct buffered_data.
tools/xenstore: Fail a transaction if it is not possible to create a node
Commit f2bebf72c4d5 "xenstore: rework of transaction handling" moved
away from copying the entire database every time a new transaction is
opened, to tracking the list of changed nodes instead.
The content of all the nodes accessed during a transaction will be
temporarily stored in TDB using a different key.
The function create_node() may write/update multiple nodes if the child
doesn't exist. In case of a failure, the function will revert any
changes (this includes any update to TDB). Unfortunately, the function
which reverts the changes (i.e. destroy_node()) will not use the correct
key to delete any update or even request the transaction to fail.
This means that if a client decides to go ahead with committing the
transaction, orphan nodes will be created because they were not linked
to an existing node (create_node() will write the nodes backwards).
Once some nodes have been partially updated in a transaction, it is not
easily possible to undo any changes. So rather than continuing and hit
weird issue while committing, it is much saner to fail the transaction.
This will have an impact on any client that decides to commit even if it
can't write a node. That said, it is not clear why a normal client would
want to do that...
Lastly, update destroy_node() to use the correct key for deleting the
node. Rather than recreating it (this would allocate memory and could
therefore fail), stash the key in struct node.
tools/xenstore: create_node: Don't defer work to undo any changes on failure
XSA-115 extended destroy_node() to update the node accounting for the
connection. The implementation assumes the connection is the parent
of the node; however, all the nodes are allocated using a separate context
(see process_message()). This can result in crashing (or corrupting)
xenstored as the pointer is wrongly used.
In case of an error, any changes to the database or update to the
accounting will now be reverted in create_node() by calling directly
destroy_node(). This has the nice advantage of removing the loop that
unsets the destructors in case of success.
Take the opportunity to free the nodes right now as they are not
going to be reachable (the function returns NULL) and are just wasting
resources.
Andrew Cooper [Wed, 24 Aug 2022 13:16:44 +0000 (14:16 +0100)]
x86/vmx: Revert "VMX: use a single, global APIC access page"
The claim "No accesses would ever go to this page." is false. A consequence
of how Intel's APIC Acceleration works, and Xen's choice to have per-domain
P2Ms (rather than per-vCPU P2Ms) means that the APIC page is fully read-write
to any vCPU which is not in xAPIC mode.
Igor Druzhinin [Fri, 28 Oct 2022 13:49:33 +0000 (15:49 +0200)]
x86/pv-shim: correct ballooning down for compat guests
From: Igor Druzhinin <igor.druzhinin@citrix.com>
The compat layer for multi-extent memory ops may need to split incoming
requests. Since the guest handles in the interface structures may not be
altered, it does so by leveraging do_memory_op()'s continuation
handling: It hands on non-initial requests with a non-zero start extent,
with the (native) handle suitably adjusted down. As a result
do_memory_op() sees only the first of potentially several requests with
start extent being zero. In order to be usable as overall result, the
function accumulates args.nr_done, i.e. it initializes the field with
the start extent. Therefore non-initial requests resulting from the
split would pass too large a number into pv_shim_offline_memory().
Address that breakage by always calling pv_shim_offline_memory()
regardless of current hypercall preemption status, with a suitably
adjusted first argument. Note that this is correct also for the native
guest case: We now simply "commit" what was completed right away, rather
than at the end of a series of preemption/re-start cycles. In fact this
improves overall preemption behavior: There's no longer a potentially
big chunk of work done non-preemptively at the end of the last
"iteration".
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Igor Druzhinin [Fri, 28 Oct 2022 13:48:50 +0000 (15:48 +0200)]
x86/pv-shim: correct ballooning up for compat guests
From: Igor Druzhinin <igor.druzhinin@citrix.com>
The compat layer for multi-extent memory ops may need to split incoming
requests. Since the guest handles in the interface structures may not be
altered, it does so by leveraging do_memory_op()'s continuation
handling: It hands on non-initial requests with a non-zero start extent,
with the (native) handle suitably adjusted down. As a result
do_memory_op() sees only the first of potentially several requests with
start extent being zero. It's only that case when the function would
issue a call to pv_shim_online_memory(), yet the range then covers only
the first sub-range that results from the split.
Address that breakage by making a complementary call to
pv_shim_online_memory() in compat layer.
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Mem-op requests may have zero extents. Such requests need treating as
no-ops. pv_shim_online_memory(), however, would have tried to take 2³²-1
order-sized pages from its balloon list (to then populate them),
typically ending when the entire set of ballooned pages of this order
was consumed.
Note that pv_shim_offline_memory() does not have such an issue.
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Fri, 28 Oct 2022 09:40:45 +0000 (11:40 +0200)]
vpci: refuse BAR writes only if the BAR is mapped
Writes to the BARs are ignored if memory decoding is enabled for the
device, and the same happens with ROM BARs if the write is an attempt
to change the position of the BAR without disabling it first.
The reason for ignoring such writes is a limitation in Xen, as it would
need to unmap the BAR, change the address, and remap the BAR at the
new position, which the current logic doesn't support.
Some devices however seem to (wrongly) have the memory decoding bit
hardcoded to enabled, and attempts to disable it don't get reflected
on the command register.
This causes issues for well behaved domains that disable memory
decoding and then try to size the BARs, as vPCI will think memory
decoding is still enabled and ignore the write.
Since vPCI doesn't explicitly care about whether the memory decoding
bit is disabled as long as the BAR is not mapped in the domain p2m, use
the information in the vpci_bar to check whether the BAR is mapped,
and refuse writes only based on that information. This works around
the issue, and allows domains to size and reposition the BARs properly.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Fri, 28 Oct 2022 09:40:00 +0000 (11:40 +0200)]
pci: do not disable memory decoding for devices
Commit 75cc460a1b added checks to ensure the position of the BARs from
PCI devices don't overlap with regions defined on the memory map.
When there's a collision memory decoding is left disabled for the
device, assuming that dom0 will reposition the BAR if necessary and
enable memory decoding.
While this would be the case for devices being used by dom0, devices
being used by the firmware itself that have no driver would usually be
left with memory decoding disabled by dom0 if that's the state dom0
found them in, and thus firmware trying to make use of them will not
function correctly.
The initial intent of 75cc460a1b was to prevent vPCI from creating
MMIO mappings on the dom0 p2m over regions that would otherwise
already have mappings established. It's my view now that we likely
went too far with 75cc460a1b, and Xen disabling memory decoding of
devices (as buggy as they might be) is harmful, and reduces the set of
hardware on which Xen works.
This commit reverts most of 75cc460a1b, and instead adds checks to
vPCI in order to prevent misplaced BARs from being added to the
hardware domain p2m. Signaling on whether BARs are mapped is tracked
in the vpci structure, so that misplaced BARs are not mapped, and thus
Xen won't attempt to unmap them when memory decoding is disabled.
This restores the behavior of Xen for PV dom0 to the state it was
previous to 75cc460a1b, while also introducing a more contained fix
for the vPCI BAR mapping issues.
Fixes: 75cc460a1b ('xen/pci: detect when BARs are not suitably positioned') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Fri, 28 Oct 2022 09:38:32 +0000 (11:38 +0200)]
common: map_vcpu_info() wants to unshare the underlying page
Not passing P2M_UNSHARE to get_page_from_gfn() means there won't even be
an attempt to unshare the referenced page, without any indication to the
caller (e.g. -EAGAIN). Note that guests have no direct control over
which of their pages are shared (or paged out), and hence they have no
way to make sure all on their own that the subsequent obtaining of a
writable type reference can actually succeed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Thu, 27 Oct 2022 09:50:47 +0000 (11:50 +0200)]
Arm32: prune (again) ld warning about mismatched wchar_t sizes
The name change (stub.c -> common-stub.c) rendered the earlier
workaround (commit a4d4c541f58b ["xen/arm32: avoid EFI stub wchar_t size
linker warning"]) ineffectual.
Fixes: bfd3e9945d1b ("build: fix x86 out-of-tree build without EFI") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Thu, 27 Oct 2022 09:49:09 +0000 (11:49 +0200)]
x86: also zap secondary time area handles during soft reset
Just like domain_soft_reset() properly zaps runstate area handles, the
secondary time area ones also need discarding to prevent guest memory
corruption once the guest is re-started.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Patch b4f211606011 ("vpci/msix: fix PBA accesses") introduced a call to
iounmap(), but did not add the corresponding include.
Signed-off-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Andrew Cooper [Wed, 26 Oct 2022 12:39:06 +0000 (13:39 +0100)]
CI: Drop more TravisCI remnants
This was missed from previous attempts to remove Travis.
Fixes: e0dc9b095e7c ("CI: Drop TravisCI") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Wed, 26 Oct 2022 12:56:58 +0000 (14:56 +0200)]
vpci/msix: remove from table list on detach
Teardown of MSIX vPCI related data doesn't currently remove the MSIX
device data from the list of MSIX tables handled by the domain,
leading to a use-after-free of the data in the msix structure.
Remove the structure from the list before freeing in order to solve
it.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: d6281be9d0 ('vpci/msix: add MSI-X handlers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Wed, 26 Oct 2022 12:55:30 +0000 (14:55 +0200)]
vpci: don't assume that vpci per-device data exists unconditionally
It's possible for a device to be assigned to a domain but have no
vpci structure if vpci_process_pending() failed and called
vpci_remove_device() as a result. The unconditional accesses done by
vpci_{read,write}() and vpci_remove_device() to pdev->vpci would
then trigger a NULL pointer dereference.
Add checks for pdev->vpci presence in the affected functions.
Fixes: 9c244fdef7 ('vpci: add header handlers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Michal Orzel [Fri, 21 Oct 2022 13:22:38 +0000 (15:22 +0200)]
automation: Build Xen according to the type of the job
All the build jobs exist in two flavors: debug and non-debug, where the
former sets the 'debug' variable to 'y' and the latter to 'n'. This variable
is only recognized by the toolstack, because Xen requires
enabling/disabling debug build via e.g. menuconfig/config file.
As a corollary, we end up building/testing Xen with CONFIG_DEBUG always
set to a default value ('y' for unstable and 'n' for stable branches),
regardless of the type of the build job.
Fix this behavior by setting CONFIG_DEBUG according to the 'debug' value.
Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Michal Orzel [Mon, 24 Oct 2022 12:04:43 +0000 (14:04 +0200)]
automation: Explicitly enable NULL scheduler for boot-cpupools test
NULL scheduler is not enabled by default on non-debug Xen builds. This
causes the boot time cpupools test to fail on such build jobs. Fix the issue
by explicitly specifying the config options required to enable the NULL
scheduler.
Fixes: 36e3f4158778 ("automation: Add a new job for testing boot time cpupools on arm64") Signed-off-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Make may not have copied "_libxl_types_json.h" into $(XEN_INCLUDE)
before starting to build the different objects.
Make sure that the generated headers are copied into $(XEN_INCLUDE)
before using them. This is achieved by telling make about which
headers are needed to use "libxl_internal.h", which uses "libxl_json.h",
which uses "_libxl_types_json.h". "libxl_internal.h" also uses
"libxl.h", so add it to the list.
This also prevents `gcc` from using potentially installed headers
from a previous version of Xen.
Reported-by: Per Bilse <per.bilse@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Mon, 24 Oct 2022 13:46:11 +0000 (15:46 +0200)]
x86/shadow: drop (replace) bogus assertions
The addition of a call to shadow_blow_tables() from shadow_teardown()
has resulted in the "no vcpus" related assertion becoming triggerable:
If domain_create() fails with at least one page successfully allocated
in the course of shadow_enable(), or if domain_create() succeeds and
the domain is then killed without ever invoking XEN_DOMCTL_max_vcpus.
Note that in-tree tests (test-resource and test-tsx) do exactly the
latter of these two.
The assertion's comment was bogus anyway: Shadow mode has been getting
enabled before allocation of vCPU-s for quite some time. Convert the
assertion to a conditional: As long as there are no vCPU-s, there's
nothing to blow away.
Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool preemptively") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
A similar assertion/comment pair exists in _shadow_prealloc(); the
comment is similarly bogus, and the assertion could in principle trigger
e.g. when shadow_alloc_p2m_page() is called early enough. Replace those
at the same time by a similar early return, here indicating failure to
the caller (which will generally lead to the domain being crashed in
shadow_prealloc()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Juergen Gross [Fri, 21 Oct 2022 10:50:26 +0000 (12:50 +0200)]
xen/sched: fix restore_vcpu_affinity() by removing it
When the system is coming up after having been suspended,
restore_vcpu_affinity() is called for each domain in order to adjust
the vcpu's affinity settings in case a cpu didn't come back up again.
The way restore_vcpu_affinity() is doing that is wrong, because the
specific scheduler isn't being informed about a possible migration of
the vcpu to another cpu. Additionally the migration often even
happens if all cpus are running again, as it is done without checking
whether it is really needed.
As cpupool management is already calling cpu_disable_scheduler() for
cpus not having come up again, and cpu_disable_scheduler() is taking
care of eventually needed vcpu migration in the proper way, there is
simply no need for restore_vcpu_affinity().
So just remove restore_vcpu_affinity() completely, together with the
no longer used sched_reset_affinity_broken().
Fixes: 8a04eaa8ea83 ("xen/sched: move some per-vcpu items to struct sched_unit") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Juergen Gross [Fri, 21 Oct 2022 10:32:23 +0000 (12:32 +0200)]
xen/sched: fix race in RTDS scheduler
When a domain gets paused the unit runnable state can change to "not
runnable" without the scheduling lock being involved. This means that
a specific scheduler isn't involved in this change of runnable state.
In the RTDS scheduler this can result in an inconsistency in case a
unit is losing its "runnable" capability while the RTDS scheduler's
scheduling function is active. RTDS will remove the unit from the run
queue, but doesn't do so for the replenish queue, leading to hitting
an ASSERT() in replq_insert() later when the domain is unpaused again.
Fix that by removing the unit from the replenish queue as well in this
case.
Fixes: 7c7b407e7772 ("xen/sched: introduce unit_runnable_state()") Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Fri, 21 Oct 2022 10:30:24 +0000 (12:30 +0200)]
EFI: don't convert memory marked for runtime use to ordinary RAM
efi_init_memory() in both relevant places is treating EFI_MEMORY_RUNTIME
as higher priority than the type of the range. To avoid accessing memory at
runtime which was re-used for other purposes, make
efi_arch_process_memory_map() follow suit. While in theory the same would
apply to EfiACPIReclaimMemory, we don't actually "reclaim" or clobber
that memory (converted to E820_ACPI on x86) there (and it would be a bug
if the Dom0 kernel tried to reclaim the range, bypassing Xen's memory
management, plus it would be at least bogus if it clobbered that space),
hence that type's handling can be left alone.
Fixes: bf6501a62e80 ("x86-64: EFI boot code") Fixes: facac0af87ef ("x86-64: EFI runtime code") Fixes: 6d70ea10d49f ("Add ARM EFI boot support") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Xenia Ragiadakou [Wed, 19 Oct 2022 14:49:13 +0000 (17:49 +0300)]
xen/arm: p2m: fix pa_range_info for 52-bit pa range
Currently, the fields 'root_order' and 'sl0' of the pa_range_info for
the 52-bit pa range have the values 3 and 3, respectively.
This configuration does not match any of the valid root table configurations
for 4KB granule and t0sz 12, described in ARM DDI 0487I.a D8.2.7.
More specifically, according to ARM DDI 0487I.a D8.2.7, in order to support
the 52-bit pa size with 4KB granule, the p2m root table needs to be configured
either as a single table at level -1 or as 16 concatenated tables at level 0.
Since there is currently no support for level -1, set the 'root_order' and
'sl0' fields of the 52-bit pa_range_info according to the second approach.
Note that the values of those fields are not used so far. This patch updates
their values only for the sake of correctness.
Fixes: 407b13a71e32 ("xen/arm: p2m don't fall over on FEAT_LPA enabled hw") Signed-off-by: Xenia Ragiadakou <burzalodowa@gmail.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
All functions in domain_build.c should be marked __init. This was
spotted when building the hypervisor with -Og.
Fixes: 1050a7b91c2e ("xen/arm: add pci-domain for disabled devices") Signed-off-by: Stewart Hildebrand <stewart.hildebrand@amd.com> Acked-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Edwin Török [Fri, 21 Oct 2022 07:59:25 +0000 (08:59 +0100)]
tools/ocaml/xenstored: fix live update exception
During live update we will load the /tool/xenstored path from the previous binary,
and then try to mkdir /tool again which will fail with EEXIST.
Check for existence of the path before creating it.
The write call to /tool/xenstored should not need any changes
(and we do want to overwrite any previous path, in case it changed).
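The check boils down to something like the following sketch (exists/mkdir
stand in for the real store operations, which act on the xenstore tree rather
than the filesystem):

    (* Create the directory node only when it is not already present, so a
       second live update does not fail with EEXIST. *)
    let ensure_dir ~exists ~mkdir path =
      if not (exists path) then mkdir path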
Prior to 7110192b1df6 live update would work only if the binary path was
specified; with 7110192b1df6 and this patch, live update also works when
no binary path is specified in `xenstore-control live-update`.
Fixes: 7110192b1df6 ("tools/oxenstored: Fix Oxenstored Live Update") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Peter Hoyes [Mon, 3 Oct 2022 14:42:16 +0000 (15:42 +0100)]
tools/xendomains: Restrict domid pattern in LIST_GREP
The xendomains script uses the output of `xl list -l` to collect the
id and name of each domain, which is used in the shutdown logic, amongst
other purposes.
The linked commit added a "domid" field to libxl_domain_create_info.
This causes the output of `xl list -l` to contain two "domid"s per
domain, which may not be equal. This in turn causes `xendomains stop` to
issue two shutdown commands per domain, one of which is to a duplicate
and/or invalid domid.
To work around this, make the LIST_GREP pattern more restrictive for
domid, so it only detects the domid at the top level and not the domid
inside c_info.
Fixes: 4a3a25678d92 ("libxl: allow creation of domains with a specified or random domid") Signed-off-by: Peter Hoyes <Peter.Hoyes@arm.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Andrew Cooper [Wed, 19 Oct 2022 17:12:33 +0000 (18:12 +0100)]
tools/oxenstored: Fix Oxenstored Live Update
tl;dr This hunk was part of the patch emailed to xen-devel, but was missing
from what ultimately got committed.
https://lore.kernel.org/xen-devel/4164cb728313c3b9fc38cf5e9ecb790ac93a9600.1610748224.git.edvin.torok@citrix.com/
is the patch in question, but was part of a series that had threading issues.
I have a vague recollection that I sourced the commits from a local branch,
which clearly wasn't as up-to-date as I had thought.
Either way, it's my fault/mistake, and this hunk should have been part of what
got committed.
Fixes: 00c48f57ab36 ("tools/oxenstored: Start live update process") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Thu, 20 Oct 2022 14:37:29 +0000 (16:37 +0200)]
test/vpci: enable by default
CONFIG_HAS_PCI is not defined for the tools build, and as a result the
vpci harness would never get built. Fix this by building it
unconditionally; there's nothing arch-specific in it.
Reported-by: Andrew Cooper <Andrew.Cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Thu, 20 Oct 2022 14:37:15 +0000 (16:37 +0200)]
test/vpci: fix vPCI test harness to provide pci_get_pdev()
Instead of pci_get_pdev_by_domain(), which is no longer present in the
hypervisor.
While there add parentheses around the define value.
Fixes: a37f9ea7a6 ('PCI: fold pci_get_pdev{,_by_domain}()') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Roger Pau Monné [Thu, 20 Oct 2022 14:36:48 +0000 (16:36 +0200)]
test/vpci: add dummy cfcheck define
Some vpci functions got the cfcheck attribute added, but that's not
defined in the user-space test harness, so add a dummy define in order
for the harness to build.
Fixes: 4ed7d5525f ('xen/vpci: CFI hardening') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Henry Wang [Tue, 18 Oct 2022 14:23:46 +0000 (14:23 +0000)]
xen/arm: p2m: Populate pages for GICv2 mapping in p2m_init()
Hardware using GICv2 needs to create a P2M mapping of the 8KB GICv2 area
when the domain is created. Considering the worst case of page tables,
which requires 6 P2M pages (as the two pages will be consecutive but not
necessarily in the same L3 page table), and to keep a buffer, populate 16
pages as the default value for the P2M pages pool in p2m_init() at the
domain creation stage to satisfy the GICv2 requirement. For GICv3, the
above-mentioned P2M mapping is not necessary, but since the allocated
16 pages here would not be lost, populate these pages
unconditionally.
With the default 16 P2M pages populated, failures can happen during
domain creation with P2M pages already in use. To properly free the
P2M for this case, first add optional preemption support to
p2m_teardown(), then call p2m_teardown() and
p2m_set_allocation(d, 0, NULL) non-preemptively in p2m_final_teardown().
As non-preemptive p2m_teardown() should only return 0, use a
BUG_ON to confirm that.
Since p2m_final_teardown() is called either after
domain_relinquish_resources() where relinquish_p2m_mapping() has been
called, or from failure path of domain_create()/arch_domain_create()
where mappings that require p2m_put_l3_page() should never be created,
relinquish_p2m_mapping() is not added in p2m_final_teardown(); add
in-code comments to note this.
Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool") Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Release-acked-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Tue, 18 Oct 2022 14:23:45 +0000 (14:23 +0000)]
arm/p2m: Rework p2m_init()
p2m_init() is mostly trivial initialisation, but has two fallible operations
which are on either side of the backpointer trigger for teardown to take
actions.
p2m_free_vmid() is idempotent with a failed p2m_alloc_vmid(), so rearrange
p2m_init() to perform all trivial setup, then set the backpointer, then
perform all fallible setup.
This will simplify a future bugfix which needs to add a third fallible
operation.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Mon, 17 Oct 2022 10:34:03 +0000 (11:34 +0100)]
tools: Workaround wrong use of tools/Rules.mk by qemu-trad
The qemu-trad build system, when built from xen.git, will make use of
Rules.mk (set up via qemu-trad.git/xen-setup). This means that changes
to Rules.mk have an impact on our ability to build qemu-trad.
Recent commit e4f5949c4466 ("tools: Add -Werror by default to all
tools/") added "-Werror" to the CFLAGS and qemu-trad starts to use
it. But this fails, as there are lots of warnings that are now turned
into errors.
We should teach qemu-trad and xen.git to not have to use Rules.mk when
building qemu-trad, but for now, avoid adding -Werror to CFLAGS when
building qemu-trad.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:13 +0000 (14:05 +0100)]
tools: Rework linking options for ocaml binding libraries
Using a full path to the C libraries when preparing one of the ocaml
bindings for those libraries makes the binding unusable by an external
project. The full path is somehow embedded and reused by the external
project when linking against the binding.
Instead, we will use the proper way to link a library, by using '-l'.
For in-tree build, we also need to provide the search directory via
'-L'.
(The -L search paths are still embedded, but at least that doesn't
prevent the ocaml binding from being used.)
Related-to: xen-project/xen#96 Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:12 +0000 (14:05 +0100)]
tools/golang/xenlight: Rework gengotypes.py and generation of *.gen.go
gengotypes.py creates both "types.gen.go" and "helpers.gen.go", but
make can start gengotypes.py twice. Rework the rules so that
gengotypes.py is executed only once.
Also, add the ability to provide a path to tell gengotypes.py where to
put the files. This doesn't matter yet but it will when for example
the script will be run from tools/ to generate the targets.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:09 +0000 (14:05 +0100)]
libs/light: Rework generation of include/_libxl_*.h
Instead of moving the public "_libxl_*.h" headers, we make a copy to
the destination so that make doesn't try to remake the targets
"_libxl_*.h" in libs/light/ again.
A new .PRECIOUS target is added to tell make not to delete the
intermediate targets generated by "gentypes.py".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:08 +0000 (14:05 +0100)]
libs/light: Rework acpi table build targets
Currently, a rebuild of libxl will always rebuild "build.o". This is because
the target depends on "acpi" which never exist. So instead we will have
"build.o" have as prerequisites targets that are actually generated by "acpi",
that is $(DSDT_FILES-y).
While "dsdt_*.c" isn't really a dependency for "build.o", a side
effect of building that dsdt_*.c is to also generate the "ssdt_*.h"
that "build.o" needs, but I don't want to list all the headers needed
by "build.o" and duplicate the information available in
"libacpi/Makefile" at this time.
Also avoid duplicating the "acpi" target for Arm, and unique one for
both architecture. And move the "acpi" target to be with other targets
rather than in the middle of the source listing. For the same reason,
move the prerequisites listing for both $(DSDT_FILES-y) and "build.o".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:07 +0000 (14:05 +0100)]
tools/include: Rework Makefile
Rework "xen-xsm" rules to not have to change directory to run
mkflask.sh, and store mkflask.sh path in a var, and use a full path
for FLASK_H_DEPEND, and output directory is made relative.
Rename "all-y" target to a more descriptive "xen/lib/x86/all".
Removed the "dist" target which was the only one existing in tools/.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:05 +0000 (14:05 +0100)]
libs: Avoid exposing -Wl,--version-script to other built library
$(SHLIB_LDFLAGS) is used by more targets than the single ones that
expect it (libxenfoo.so.X.Y). There are also some dynamic libraries in
stats/ that use $(SHLIB_LDFLAGS) (even if those are never built), and
there's libxenlight_test.so which doesn't need a version script.
Also, libxenlight_test.so might fail to build if the version script
doesn't exist yet.
For these reasons, avoid changing the generic $(SHLIB_LDFLAGS) flags,
and add the flag directly on the command line.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:04 +0000 (14:05 +0100)]
git-checkout.sh: handle running git-checkout from a different directory
"$DIR" might not be a full path and it might not have `pwd` as ".."
directory. So use `cd -` to undo the first `cd` command.
Also, use `basename` to make a symbolic link with a relative path.
This doesn't matter yet but it will when for example the commands to
clone OVMF are run from tools/ rather than tools/firmware/.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:03 +0000 (14:05 +0100)]
libs/light/gentypes.py: allow to generate headers in subdirectory
This doesn't matter yet but it will when for example the script will
be run from tools/ to generate files in tools/libs/light/.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:02 +0000 (14:05 +0100)]
tools/hotplug: Generate "hotplugpath.sh" with configure
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:01 +0000 (14:05 +0100)]
tools: Remove -Werror everywhere else
The previous changeset, e4f5949c4466 ("tools: Add -Werror by default to all
tools/"), added "-Werror" to CFLAGS in tools/Rules.mk. Remove it from
everywhere else now it is duplicated.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Acked-by: Daniel P. Smith <dpsmith@apertussolutions.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:05:00 +0000 (14:05 +0100)]
tools: Add -Werror by default to all tools/
And provide an option to ./configure to disable it.
A follow-up patch will remove -Werror from every other Makefile in
tools/.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:04:59 +0000 (14:04 +0100)]
tools: Introduce $(xenlibs-ldflags, ) macro
This avoids the need to open-code the list of flags needed to link
with an in-tree Xen library when using -lxen*.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Henry Wang <Henry.Wang@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:04:58 +0000 (14:04 +0100)]
tools/xentrace: rework Makefile
Remove "build" targets.
Use "$(TARGETS)" to list binary to be built.
Cleanup "clean" rule.
Also drop conditional install of $(BIN) and $(LIBBIN) as those two
variables are now always populated.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Anthony PERARD [Thu, 13 Oct 2022 13:04:57 +0000 (14:04 +0100)]
tools/debugger/gdbsx: Fix and cleanup makefiles
gdbsx/:
- Make use of subdir facility for the "clean" target.
- No need to remove the *.a, they aren't in this dir.
- Avoid calling "distclean" in subdirs as "distclean" targets do only
call "clean", and the "clean" also runs "clean" in subdirs.
- Avoid the need to make "gx_all.a" and "xg_all.a" in the "all"
recipe by forcing make to check for update of "xg/xg_all.a" and
"gx/gx_all.a" by having "FORCE" as prerequisite. Now, when making
"gdbsx", make will recurse even when both *.a already exist.
- List target in $(TARGETS).
gdbsx/*/:
- Fix dependency on *.h.
- Remove some dead code.
- List targets in $(TARGETS).
- Remove "build" target.
- Cleanup "clean" targets.
- remove comments about the choice of "ar" instead of "ld"
- Use "$(AR)" instead of plain "ar".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Luca Fancellu <luca.fancellu@arm.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jason Andryuk [Fri, 7 Oct 2022 19:31:24 +0000 (15:31 -0400)]
argo: Remove reachable ASSERT_UNREACHABLE
I observed this ASSERT_UNREACHABLE in partner_rings_remove consistently
trip. It was in OpenXT with the viptables patch applied.
dom10 shuts down.
dom7 is REJECTED sending to dom10.
dom7 shuts down and this ASSERT trips for dom10.
The argo_send_info has a domid, but there is no refcount taken on
the domain. Therefore it's not appropriate to ASSERT that the domain
can be looked up via domid. Replace with a debug message.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
As that commit describes, on early Sapphire Rapids Xeon platforms the C1 and
C1E states were mutually exclusive, so that users could only have either C1 and
C6, or C1E and C6.
However, Intel firmware engineers managed to remove this limitation and make C1
and C1E to be completely independent, just like on previous Xeon platforms.
Therefore, this patch:
* Removes commentary describing the old, and now non-existing SPR C1E
limitation.
* Marks SPR C1E as available by default.
* Removes the 'preferred_cstates' parameter handling for SPR. Both C1 and
C1E will be available regardless of 'preferred_cstates' value.
We expect that all SPR systems are shipping with new firmware, which includes
the C1/C1E improvement.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1548fac47a11 Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Peter Zijlstra [Thu, 13 Oct 2022 15:55:22 +0000 (17:55 +0200)]
x86/mwait-idle: disable IBRS during long idle
Having IBRS enabled while the SMT sibling is idle unnecessarily slows
down the running sibling. OTOH, disabling IBRS around idle takes two
MSR writes, which will increase the idle latency.
Therefore, only disable IBRS around deeper idle states. Shallow idle
states are bounded by the tick in duration, since NOHZ is not allowed
for them by virtue of their short target residency.
Only do this for mwait-driven idle, since that keeps interrupts disabled
across idle, which makes disabling IBRS vs IRQ-entry a non-issue.
Note: C6 is a random threshold, most importantly C1 probably shouldn't
disable IBRS, benchmarking needed.
Suggested-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: Borislav Petkov <bp@suse.de>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git bf5835bcdb96 Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Zhang Rui [Thu, 13 Oct 2022 15:54:23 +0000 (17:54 +0200)]
x86/mwait-idle: add AlderLake support
Similar to SPR, the C1 and C1E states on ADL are mutually exclusive.
Only one of them can be enabled at a time.
But contrast to SPR, which usually has a strong latency requirement
as a Xeon processor, C1E is preferred on ADL for better energy
efficiency.
Add custom C-state tables for ADL with both C1 and C1E, and
1. Enable the "C1E promotion" bit in MSR_IA32_POWER_CTL and mark C1
with the CPUIDLE_FLAG_UNUSABLE flag, so C1 is not available by
default.
2. Add support for the "preferred_cstates" module parameter, so that
users can choose to use C1 instead of C1E by booting with
"intel_idle.preferred_cstates=2".
Separate custom C-state tables are introduced for the ADL mobile and
desktop processors, because of the exit latency differences between
these two variants, especially with respect to PC10.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
[ rjw: Changelog edits, code rearrangement ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git d1cf8bbfed1e Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Add a Sapphire Rapids Xeon C6 optimization, similar to what we have for Sky Lake
Xeon: if package C6 is disabled, adjust C6 exit latency and target residency to
match core C6 values, instead of using the default package C6 values.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 3a9cf77b60dc
Make sure a contradictory "preferred-cstates" wouldn't cause bypassing
of the added logic.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Artem Bityutskiy [Thu, 13 Oct 2022 15:52:36 +0000 (17:52 +0200)]
x86/mwait-idle: add 'preferred-cstates' command line option
On Sapphire Rapids Xeon (SPR) the C1 and C1E states are basically mutually
exclusive - only one of them can be enabled. By default, 'intel_idle' driver
enables C1 and disables C1E. However, some users prefer to use C1E instead of
C1, because it saves more energy.
This patch adds a new module parameter ('preferred_cstates') for enabling C1E
and disabling C1. Here is the idea behind it.
1. This option has effect only for "mutually exclusive" C-states like C1 and
C1E on SPR.
2. It does not have any effect on independent C-states, which do not require
other C-states to be disabled (most states on most platforms as of today).
3. For mutually exclusive C-states, the 'intel_idle' driver always has a
reasonable default, such as enabling C1 on SPR by default. On other
platforms, the default may be different.
4. Users can override the default using the 'preferred_cstates' parameter.
5. The parameter accepts the preferred C-states bit-mask, similarly to the
existing 'states_off' parameter.
6. This parameter is not limited to C1/C1E, and leaves room for supporting
other mutually exclusive C-states, if they come in the future.
Today 'intel_idle' can only be compiled-in, which means that on SPR, in order
to disable C1 and enable C1E, users should boot with the following kernel
argument: intel_idle.preferred_cstates=4
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git da0e58c038e6
Enable C1E (if requested) not only on the BSP's socket / package. Alter
command line option to fit our model, and extend it to also accept
string form arguments.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
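A hedged sketch of how such a bit-mask parameter can be wired up (variable name and bit assignments follow the description above; this is not the literal code):

    /* Sketch: 'preferred_cstates' bit-mask; value 2 selects C1, value 4
     * selects C1E, matching the documented intel_idle.preferred_cstates=4. */
    static unsigned int preferred_states_mask;
    module_param_named(preferred_cstates, preferred_states_mask, uint, 0444);

    static bool prefer_c1e(void)
    {
        return preferred_states_mask & BIT(2);   /* bit for C1E */
    }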
Andrew Cooper [Mon, 25 Jul 2022 17:36:29 +0000 (18:36 +0100)]
tools/ocaml/xc: Address ABI issues with physinfo arch flags
The current bindings function, but the preexisting
type physinfo_arch_cap_flag =
| X86 of x86_physinfo_arch_cap_flag
is a special case in the OCaml type system with an unusual indirection, and
will break when a second option, e.g. `| ARM of ...` is added.
Also, the position of the list is logically wrong. Currently, the types express
a list of elements which might be an x86 flag or an arm flag (and can
intermix), whereas what we actually want is either a list of x86 flags, or a
list of ARM flags (that cannot intermix).
Rework the OCaml types to avoid the ABI special case and move the list
primitive, and adjust the C bindings to match.
Fixes: 2ce11ce249a3 ("x86/HVM: allow per-domain usage of hardware virtualized APIC") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Andrew Cooper [Wed, 12 Oct 2022 10:02:08 +0000 (11:02 +0100)]
tools/ocaml/xc: Fix code legibility in stub_xc_domain_create()
Reposition the defines to match the outer style and to make the logic
half-legible.
No functional change.
Fixes: 0570d7f276dd ("x86/msr: introduce an option for compatible MSR behavior selection") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Wed, 12 Oct 2022 15:57:56 +0000 (17:57 +0200)]
VMX: correct error handling in vmx_create_vmcs()
With the addition of vmx_add_msr() calls to construct_vmcs() there are
now cases where simply freeing the VMCS isn't enough: The MSR bitmap
page as well as one of the MSR area ones (if it's the 2nd vmx_add_msr()
which fails) may also need freeing. Switch to using vmx_destroy_vmcs()
instead.
Fixes: 3bd36952dab6 ("x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests") Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
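A sketch of the corrected error path in vmx_create_vmcs() (simplified; surrounding allocation code omitted):

    rc = construct_vmcs(v);
    if ( rc )
    {
        /* construct_vmcs() may already have allocated the MSR bitmap page
         * and MSR area pages; tear everything down, not just the VMCS. */
        vmx_destroy_vmcs(v);
        return rc;
    }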
Jan Beulich [Tue, 11 Oct 2022 12:30:41 +0000 (14:30 +0200)]
x86emul: respect NSCB
protmode_load_seg() would better adhere to that "feature" of clearing
base (and limit) during NULL selector loads.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Jan Beulich [Tue, 11 Oct 2022 12:29:30 +0000 (14:29 +0200)]
gnttab: correct locking on transitive grant copy error path
While the comment next to the lock dropping in preparation of
recursively calling acquire_grant_for_copy() mistakenly talks about the
rd == td case (excluded a few lines further up), the same concerns apply
to the calling of release_grant_for_copy() on a subsequent error path.
This is CVE-2022-33748 / XSA-411.
Fixes: ad48fb963dbf ("gnttab: fix transitive grant handling") Signed-off-by: Jan Beulich <jbeulich@suse.com>
Henry Wang [Mon, 6 Jun 2022 06:17:30 +0000 (06:17 +0000)]
xen/arm: Allocate and free P2M pages from the P2M pool
This commit sets up / tears down the p2m pages pool for non-privileged Arm
guests by calling `p2m_set_allocation` and `p2m_teardown_allocation`.
- For dom0, P2M pages should come from heap directly instead of p2m
pool, so that the kernel may take advantage of the extended regions.
- For xl guests, the setting of the p2m pool is called in
`XEN_DOMCTL_shadow_op` and the p2m pool is destroyed in
`domain_relinquish_resources`. Note that domctl->u.shadow_op.mb is
updated with the new size when setting the p2m pool.
- For dom0less domUs, the setting of the p2m pool is called before
allocating memory during domain creation. Users can specify the p2m
pool size by `xen,domain-p2m-mem-mb` dts property.
To actually allocate/free pages from the p2m pool, this commit adds
two helper functions namely `p2m_alloc_page` and `p2m_free_page` to
`struct p2m_domain`. By replacing the `alloc_domheap_page` and
`free_domheap_page` with these two helper functions, p2m pages can
be added/removed from the list of p2m pool rather than from the heap.
Since the page returned by `p2m_alloc_page` is already cleaned, take the
opportunity to remove the redundant `clean_page` in `p2m_create_table`.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
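A simplified sketch of the two helpers (field names such as p2m_freelist are assumptions; the real code also keeps the hardware domain on the heap and updates the pool counters):

    /* Sketch: take/return p2m pages from the per-domain pool instead of
     * the heap. */
    static struct page_info *p2m_alloc_page(struct domain *d)
    {
        struct page_info *pg;

        spin_lock(&d->arch.paging.lock);
        pg = page_list_remove_head(&d->arch.paging.p2m_freelist);
        spin_unlock(&d->arch.paging.lock);

        return pg;                       /* NULL if the pool is exhausted */
    }

    static void p2m_free_page(struct domain *d, struct page_info *pg)
    {
        spin_lock(&d->arch.paging.lock);
        page_list_add_tail(pg, &d->arch.paging.p2m_freelist);
        spin_unlock(&d->arch.paging.lock);
    }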
Henry Wang [Mon, 6 Jun 2022 06:17:29 +0000 (06:17 +0000)]
xen/arm, libxl: Implement XEN_DOMCTL_shadow_op for Arm
This commit implements the `XEN_DOMCTL_shadow_op` support in Xen
for Arm. The p2m pages pool size for xl guests is supposed to be
determined by `XEN_DOMCTL_shadow_op`. Hence, this commit:
- Introduces a function `p2m_domctl` and implements the subops
`XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION` and
`XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION` of `XEN_DOMCTL_shadow_op`.
- Adds the `XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION` support in libxl,
thereby enabling setting the shadow memory pool size
when creating a guest from xl and getting the shadow memory pool size
from Xen.
Note that the `XEN_DOMCTL_shadow_op` added in this commit is only
a dummy op, and the functionality of setting/getting p2m memory pool
size for xl guests will be added in following commits.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
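As an illustration only (signatures simplified, continuation support and error handling omitted), the dispatch for the two sub-ops could look roughly like:

    int p2m_domctl(struct domain *d, struct xen_domctl_shadow_op *sc)
    {
        switch ( sc->op )
        {
        case XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION:
            /* sc->mb is in MiB; convert to pages. */
            return p2m_set_allocation(d, sc->mb << (20 - PAGE_SHIFT), NULL);

        case XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION:
            sc->mb = p2m_get_allocation(d);
            return 0;

        default:
            return -EOPNOTSUPP;
        }
    }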
Henry Wang [Mon, 6 Jun 2022 06:17:28 +0000 (06:17 +0000)]
xen/arm: Construct the P2M pages pool for guests
This commit constructs the p2m pages pool for guests from the
data structure and helper perspective.
This is implemented by:
- Adding a `struct paging_domain` which contains a freelist, a
counter variable and a spinlock to `struct arch_domain` to
indicate the free p2m pages and the number of p2m total pages in
the p2m pages pool.
- Adding a helper `p2m_get_allocation` to get the p2m pool size.
- Adding a helper `p2m_set_allocation` to set the p2m pages pool
size. This helper should be called before allocating memory for
a guest.
- Adding a helper `p2m_teardown_allocation` to free the p2m pages
pool. This helper should be called during xl domain destroy.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
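For illustration, the bookkeeping could be shaped roughly like this (field names are assumptions based on the description above):

    /* Sketch: pool state embedded in struct arch_domain. */
    struct paging_domain {
        spinlock_t lock;                      /* protects the pool */
        struct page_list_head p2m_freelist;   /* free pages in the pool */
        unsigned long p2m_total_pages;        /* total pages in the pool */
    };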
Henry Wang [Mon, 6 Jun 2022 06:17:27 +0000 (06:17 +0000)]
libxl, docs: Add per-arch extra default paging memory
This commit adds a per-arch macro `EXTRA_DEFAULT_PAGING_MEM_MB`
to the default paging memory size, in order to cover the p2m
pool for extended regions of a xl-based guest on Arm.
For Arm, the extra default paging memory is 128MB.
For x86, the extra default paging memory is zero, since there
are no extended regions on x86.
Also update the xl.cfg documentation to add Arm documentation
according to code changes.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
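A sketch of the per-arch constant (values taken from the description above; the surrounding libxl calculation that consumes it is not shown):

    #if defined(__aarch64__) || defined(__arm__)
    #define EXTRA_DEFAULT_PAGING_MEM_MB 128   /* cover p2m for extended regions */
    #else
    #define EXTRA_DEFAULT_PAGING_MEM_MB 0     /* no extended regions on x86 */
    #endif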
Julien Grall [Tue, 11 Oct 2022 12:24:48 +0000 (14:24 +0200)]
xen/x86: p2m: Add preemption in p2m_teardown()
The list p2m->pages contains all the pages used by the P2M. On large
instances this can be quite long, and the time spent calling
d->arch.paging.free_page() will exceed 1ms for an 80GB guest
on a Xen running in a nested environment on a c5.metal.
By extrapolation, it would take > 100ms for an 8TB guest (what we
currently security support). So add some preemption in p2m_teardown()
and propagate it to the callers. Note there are 3 places where
the preemption is not enabled:
- hap_final_teardown()/shadow_final_teardown(): We are
preventing updates to the P2M once the domain is dying (so
no more pages can be allocated) and most of the P2M pages
will be freed in a preemptive manner when relinquishing the
resources. So it is fine to disable preemption here.
- shadow_enable(): This is fine because it will undo the allocation
that may have been made by p2m_alloc_table() (so only the root
page table).
The preemption is arbitrarily checked every 1024 iterations.
We now need to include <xen/event.h> in p2m-basic in order to
import the definition for local_events_need_delivery() used by
general_preempt_check(). Ideally, the inclusion should happen in
xen/sched.h but it opened a can of worms.
Note that with the current approach, Xen doesn't keep track of whether
the alt/nested P2Ms have been cleared, so there is some redundant work.
However, this is not expected to incur too much overhead (the P2M lock
shouldn't be contended during teardown), so this optimization is
left outside of the security event.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Jan Beulich <jbeulich@suse.com>
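A rough sketch of the resulting loop shape (simplified; 'preempted' may be NULL for the callers that must not be preemptible):

    unsigned int count = 0;
    struct page_info *pg;

    while ( (pg = page_list_remove_head(&p2m->pages)) )
    {
        d->arch.paging.free_page(d, pg);

        /* Check for preemption every 1024 pages freed. */
        if ( preempted && !(++count & 1023) && general_preempt_check() )
        {
            *preempted = true;
            break;
        }
    }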
----
Changes since v12:
- Correct altp2m preemption check placement.
Changes since v9:
- Integrate patch into series.
Changes since v2:
- Rework the loop doing the preemption
- Add a comment in shadow_enable() to explain why p2m_teardown()
doesn't need to be preemptible.
Changes since v1:
- Update the commit message
- Rebase on top of Roger's v8 series
- Fix preemption check
- Use 'unsigned int' rather than 'unsigned long' for the counter
Roger Pau Monné [Tue, 11 Oct 2022 12:24:21 +0000 (14:24 +0200)]
x86/p2m: free the paging memory pool preemptively
The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the latter doesn't.
Attempt to move as much p2m related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptive way. In order to avoid causing issues to
existing callers leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown(). Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.
Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.
This is part of CVE-2022-33746 / XSA-410.
Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
Roger Pau Monné [Tue, 11 Oct 2022 12:23:51 +0000 (14:23 +0200)]
x86/p2m: truly free paging pool memory for dying domains
Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).
Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
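A sketch of the idea for the HAP side (the shadow side is analogous; simplified, locking asserts omitted):

    static void hap_free(struct domain *d, mfn_t mfn)
    {
        struct page_info *pg = mfn_to_page(mfn);

        /* For a dying domain, hand the page back to Xen right away rather
         * than parking it in the pool for final teardown to free later. */
        if ( unlikely(d->is_dying) )
        {
            free_domheap_page(pg);
            d->arch.paging.hap.total_pages--;
            return;
        }

        d->arch.paging.hap.free_pages++;
        page_list_add_tail(pg, &d->arch.paging.hap.freelist);
    }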
Roger Pau Monné [Tue, 11 Oct 2022 12:22:53 +0000 (14:22 +0200)]
x86/shadow: tolerate failure in shadow_prealloc()
Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false. Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that. Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.
Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure, some of those will return with
null monitor page tables (and the domain likely crashed). This is no
different from the current error paths, but the newly introduced ones are
more likely to trigger.
The newly added failure points in sh_update_paging_modes() also require
that on some error return paths the previous structures are cleared,
and thus the monitor table is null.
While there adjust the 'type' parameter type of shadow_prealloc() to
unsigned int rather than u32.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
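For illustration (close in spirit to the description above, but a sketch rather than the exact hunk):

    /* Sketch: the wrapper now reports failure and crashes the domain
     * (unless it is already dying) instead of calling BUG(). */
    bool shadow_prealloc(struct domain *d, unsigned int type, unsigned int count)
    {
        bool ret = _shadow_prealloc(d, shadow_size(type) * count);

        if ( !ret && !d->is_dying )
            /* Shadow cannot operate normally beyond this point. */
            domain_crash(d);

        return ret;
    }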
Jan Beulich [Tue, 11 Oct 2022 12:22:24 +0000 (14:22 +0200)]
x86/shadow: tolerate failure of sh_set_toplevel_shadow()
Subsequently sh_set_toplevel_shadow() will be adjusted to install a
blank entry in case prealloc fails. There are, in fact, pre-existing
error paths which would put in place a blank entry. The 4- and 2-level
code in sh_update_cr3(), however, assumes the top level entry to be
valid.
Hence bail from the function in the unlikely event that it's not. Note
that 3-level logic works differently: In particular a guest is free to
supply a PDPTR pointing at 4 non-present (or otherwise deemed invalid)
entries. The guest will crash, but we already cope with that.
Really mfn_valid() is likely wrong to use in sh_set_toplevel_shadow(),
and it should instead be !mfn_eq(gmfn, INVALID_MFN). Avoid such a change
in security context, but add a respective assertion.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Tue, 11 Oct 2022 12:21:56 +0000 (14:21 +0200)]
x86/HAP: adjust monitor table related error handling
hap_make_monitor_table() will return INVALID_MFN if it encounters an
error condition, but hap_update_paging_modes() wasn’t handling this
value, resulting in an inappropriate value being stored in
monitor_table. This would subsequently misguide at least
hap_vcpu_teardown(). Avoid this by bailing early.
Further, when a domain has/was already crashed or (perhaps less
important as there's no such path known to lead here) is already dying,
avoid calling domain_crash() on it again - that's at best confusing.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
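A sketch of the early bail (identifiers simplified; the real change sits inside hap_update_paging_modes() with the paging lock held):

    mfn_t mmfn = hap_make_monitor_table(v);

    if ( mfn_eq(mmfn, INVALID_MFN) )
    {
        /* Allocation failed (the domain was crashed, unless it already
         * was); don't store an inappropriate value in monitor_table. */
        paging_unlock(d);
        return;
    }

    v->arch.hvm.monitor_table = pagetable_from_mfn(mmfn);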
Roger Pau Monné [Tue, 11 Oct 2022 12:21:23 +0000 (14:21 +0200)]
x86/p2m: add option to skip root pagetable removal in p2m_teardown()
Add a new parameter to p2m_teardown() in order to select whether the
root page table should also be freed. Note that all users are
adjusted to pass the parameter to remove the root page tables, so
behavior is not modified.
No functional change intended.
This is part of CVE-2022-33746 / XSA-410.
Suggested-by: Julien Grall <julien@xen.org> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
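A sketch of the resulting interface (signature shape only; every existing caller passes true, keeping today's behavior):

    /* remove_root: also free the root page tables, not just the rest. */
    void p2m_teardown(struct p2m_domain *p2m, bool remove_root);

    /* e.g. existing callers simply become: */
    p2m_teardown(p2m, true);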
Julien Grall [Mon, 6 Jun 2022 06:17:26 +0000 (06:17 +0000)]
xen/arm: p2m: Handle preemption when freeing intermediate page tables
At the moment the P2M page tables will be freed when the domain structure
is freed without any preemption. As the P2M is quite large, iterating
through it may take more time than is reasonable without intermediate
preemption (to run softirqs and perhaps the scheduler).
Split p2m_teardown() in two parts: one preemptible and called when
relinquishing the resources, the other one non-preemptible and called
when freeing the domain structure.
As we are now freeing the P2M pages early, we also need to prevent
further allocation if someone calls p2m_set_entry() past p2m_teardown()
(I wasn't able to prove this will never happen). This is done via
the domain->is_dying check added to p2m_set_entry() by the previous patch.
Similarly, we want to make sure that no one can access the freed
pages. Therefore the root is cleared before the pages are freed.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Tested-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Julien Grall [Mon, 6 Jun 2022 06:17:25 +0000 (06:17 +0000)]
xen/arm: p2m: Prevent adding mapping when domain is dying
During the domain destroy process, the domain will still be accessible
until it is fully destroyed. So is the P2M, because we don't bail
out early if is_dying is non-zero. If a domain has permission to
modify the other domain's P2M (i.e. dom0, or a stubdomain), then
foreign mappings can be added past relinquish_p2m_mapping().
Therefore, we need to prevent mappings from being added while the domain
is dying. This commit prevents such additions by adding the
d->is_dying check to p2m_set_entry(). Also this commit enhances the
check in relinquish_p2m_mapping() to make sure that no mappings can
be added in the P2M after the P2M lock is released.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Tested-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
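A sketch of the check (placed near the top of the Arm p2m_set_entry(); the exact error value returned is an assumption):

    /* Refuse to create new mappings once the domain started dying, so
     * nothing can be added behind relinquish_p2m_mapping()'s back. */
    if ( unlikely(p2m->domain->is_dying) )
        return -ENOMEM;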
Jan Beulich [Wed, 5 Oct 2022 08:55:27 +0000 (10:55 +0200)]
x86/NUMA: correct off-by-1 in node map population
As it turns out populate_memnodemap() so far "relied" on
extract_lsb_from_nodes() setting memnodemapsize one too high in edge
cases. Correct the issue there as well, by changing "epdx" to be an
inclusive PDX and adjusting the respective relational operators.
While there also limit the scope of both related variables.
Fixes: b1f4b45d02ca ("x86/NUMA: correct off-by-1 in node map size calculation") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
xen/arm: fix booting ACPI based system after static evtchn series
When ACPI is enabled and the system is booted with ACPI, BUG() is observed
after merging the static event channel series. As there is no DT when
booted with ACPI, there will be no chosen node, and because of that
"BUG_ON(chosen == NULL)" will be hit.
(XEN) Xen BUG at arch/arm/domain_build.c:3578
Move call to alloc_static_evtchn() under acpi_disabled check to fix the
issue.