Andrew Cooper [Fri, 28 Jul 2023 17:42:12 +0000 (18:42 +0100)]
x86/amd: Fix DE_CFG truncation in amd_check_zenbleed()
This line:
val &= ~chickenbit;
ends up truncating val to 32 bits, and turning off various errata workarounds
in Zen2 systems.
Fixes: f91c5ea97067 ("x86/amd: Mitigations for Zenbleed") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit c0dd53b8cbd1e47e9c89873a9265a7170bdc6b4c)
Andrew Cooper [Mon, 22 May 2023 22:03:00 +0000 (23:03 +0100)]
x86/amd: Mitigations for Zenbleed
Zenbleed is a malfunction on AMD Zen2 uarch parts which results in corruption
of the vector registers. An attacker can trigger this bug deliberately in
order to access stale data in the physical vector register file. This can
include data from sibling threads, or a higher-privilege context.
Microcode is the preferred mitigation but in the case that's not available use
the chickenbit as instructed by AMD. Re-evaluate the mitigation on late
microcode load too.
This is XSA-433 / CVE-2023-20593.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit f91c5ea970675637721bb7f18adaa189837eb783)
Andrew Cooper [Fri, 24 Mar 2023 17:59:56 +0000 (17:59 +0000)]
CI: Remove llvm-8 from the Debian Stretch container
For similar reasons to c/s a6b1e2b80fe20. While this container is still
build-able for now, all the other problems with explicitly-versioned compilers
remain.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7a298375721636290a57f31bb0f7c2a5a38956a4)
Anthony PERARD [Fri, 24 Feb 2023 17:29:15 +0000 (17:29 +0000)]
automation: Remove non-debug x86_32 build jobs
In the interest of having less jobs, we remove the x86_32 build jobs
that do release build. Debug build is very likely to be enough to find
32bit build issues.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 7b66792ea7f77fb9e587e1e9c530a7c869eecba1)
Anthony PERARD [Tue, 21 Feb 2023 16:55:36 +0000 (16:55 +0000)]
automation: Remove CentOS 7.2 containers and builds
We already have a container which track the latest CentOS 7, no need
for this one as well.
Also, 7.2 have outdated root certificate which prevent connection to
website which use Let's Encrypt.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit ba512629f76dfddb39ea9133ee51cdd9e392a927)
Andrew Cooper [Thu, 29 Dec 2022 15:39:13 +0000 (15:39 +0000)]
CI: Drop automation/configs/
Having 3 extra hypervisor builds on the end of a full build is deeply
confusing to debug if one of them fails, because the .config file presented in
the artefacts is not the one which caused a build failure. Also, the log
tends to be truncated in the UI.
PV-only is tested as part of PV-Shim in a full build anyway, so doesn't need
repeating. HVM-only and neither appear frequently in randconfig, so drop all
the logic here to simplify things.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Michal Orzel <michal.orzel@amd.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 7b20009a812f26e74bdbde2ab96165376b3dad34)
Anthony PERARD [Thu, 9 Sep 2021 14:33:06 +0000 (15:33 +0100)]
build: add --full to version.sh to guess $(XEN_FULLVERSION)
Running $(MAKE) like that in a $(shell ) while parsing the Makefile
doesn't work reliably. In some case, make will complain with
"jobserver unavailable: using -j1. Add '+' to parent make rule.".
Also, it isn't possible to distinguish between the output produced by
the target "xenversion" and `make`'s own output.
Instead of running make, this patch "improve" `version.sh` to try to
guess the output of `make xenversion`.
In order to have version.sh works in more scenario, it will use
XEN_EXTRAVERSION and XEN_VENDORVERSION from the environment when
present. As for the cases were those two variables are overridden by a
make command line arguments, we export them when invoking version.sh
via a new $(XEN_FULLVERSION) macro.
That should hopefully get us to having ./version.sh returning the same
value that `make xenversion` would.
This fix GitLab CI's build job "debian-unstable-gcc-arm64".
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Daniel P. Smith <dpsmith@apertussolutions.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit ab4a83023eda9f04ad864877c1956b087ec6fc4f)
Andrew Cooper [Wed, 21 Apr 2021 09:16:13 +0000 (10:16 +0100)]
CI: Drop TravisCI
Travis-ci.org is shutting down shortly. The arm cross-compile testing has
been broken for a long time now, and all testing has now been superseded by
our Gitlab infrastructure.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Wei Liu <wl@xen.org>
(cherry picked from commit e0dc9b095e7c73dcf6dbfe5c87c33c4708da4d1f)
CI: Drop more TravisCI remnants
This was missed from previous attempts to remove Travis.
Fixes: e0dc9b095e7c ("CI: Drop TravisCI") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Acked-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit bad4832710c7261fad1abe2d0e8e2e1d259b3e8d)
Andrew Cooper [Fri, 26 Mar 2021 11:25:07 +0000 (11:25 +0000)]
tools: Drop gettext as a build dependency
It has not been a dependency since at least 4.13. Remove its mandatory check
from ./configure.
Annotate the dependency in the CI dockerfiles, and drop them from CirrusCI and
TravisCI.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
(cherry picked from commit e21a6a4f966a7e91cb0bb014dbe15d15cc0502ad)
Andrew Cooper [Fri, 10 Feb 2023 21:11:14 +0000 (21:11 +0000)]
x86/spec-ctrl: Defer CR4_PV32_RESTORE on the cstar_enter path
As stated (correctly) by the comment next to SPEC_CTRL_ENTRY_FROM_PV, between
the two hunks visible in the patch, RET's are not safe prior to this point.
CR4_PV32_RESTORE hides a CALL/RET pair in certain configurations (PV32
compiled in, SMEP or SMAP active), and the RET can be attacked with one of
several known speculative issues.
Furthermore, CR4_PV32_RESTORE also hides a reference to the cr4_pv32_mask
global variable, which is not safe when XPTI is active before restoring Xen's
full pagetables.
This crash has gone unnoticed because it is only AMD CPUs which permit the
SYSCALL instruction in compatibility mode, and these are not vulnerable to
Meltdown so don't activate XPTI by default.
This is XSA-429 / CVE-2022-42331
Fixes: 5e7962901131 ("x86/entry: Organise the use of MSR_SPEC_CTRL at each entry/exit point") Fixes: 5784de3e2067 ("x86: Meltdown band-aid against malicious 64-bit PV guests") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit df5b055b12116d9e63ced59ae5389e69a2a3de48)
Jan Beulich [Tue, 21 Mar 2023 12:01:01 +0000 (12:01 +0000)]
x86/HVM: serialize pinned cache attribute list manipulation
While the RCU variants of list insertion and removal allow lockless list
traversal (with RCU just read-locked), insertions and removals still
need serializing amongst themselves. To keep things simple, use the
domain lock for this purpose.
This is CVE-2022-42334 / part of XSA-428.
Fixes: 642123c5123f ("x86/hvm: provide XEN_DMOP_pin_memory_cacheattr") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 829ec245cf66560e3b50d140ccb3168e7fb7c945)
Jan Beulich [Tue, 21 Mar 2023 12:01:01 +0000 (12:01 +0000)]
x86/HVM: bound number of pinned cache attribute regions
This is exposed via DMOP, i.e. to potentially not fully privileged
device models. With that we may not permit registration of an (almost)
unbounded amount of such regions.
This is CVE-2022-42333 / part of XSA-428.
Fixes: 642123c5123f ("x86/hvm: provide XEN_DMOP_pin_memory_cacheattr") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit a5e768640f786b681063f4e08af45d0c4e91debf)
Jan Beulich [Tue, 21 Mar 2023 12:00:02 +0000 (12:00 +0000)]
x86/shadow: account for log-dirty mode when pre-allocating
Pre-allocation is intended to ensure that in the course of constructing
or updating shadows there won't be any risk of just made shadows or
shadows being acted upon can disappear under our feet. The amount of
pages pre-allocated then, however, needs to account for all possible
subsequent allocations. While the use in sh_page_fault() accounts for
all shadows which may need making, so far it didn't account for
allocations coming from log-dirty tracking (which piggybacks onto the
P2M allocation functions).
Since shadow_prealloc() takes a count of shadows (or other data
structures) rather than a count of pages, putting the adjustment at the
call site of this function won't work very well: We simply can't express
the correct count that way in all cases. Instead take care of this in
the function itself, by "snooping" for L1 type requests. (While not
applicable right now, future new request sites of L1 tables would then
also be covered right away.)
It is relevant to note here that pre-allocations like the one done from
shadow_alloc_p2m_page() are benign when they fall in the "scope" of an
earlier pre-alloc which already included that count: The inner call will
simply find enough pages available then; it'll bail right away.
Anthony PERARD [Tue, 21 Feb 2023 16:55:38 +0000 (16:55 +0000)]
automation: Remove clang-8 from Debian unstable container
First, apt complain that it isn't the right way to add keys anymore,
but hopefully that's just a warning.
Second, we can't install clang-8:
The following packages have unmet dependencies:
clang-8 : Depends: libstdc++-8-dev but it is not installable
Depends: libgcc-8-dev but it is not installable
Depends: libobjc-8-dev but it is not installable
Recommends: llvm-8-dev but it is not going to be installed
Recommends: libomp-8-dev but it is not going to be installed
libllvm8 : Depends: libffi7 (>= 3.3~20180313) but it is not installable
E: Unable to correct problems, you have held broken packages.
clang on Debian unstable is now version 14.0.6.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit a6b1e2b80fe2053b1c9c9843fb086a668513ea36)
Juergen Gross [Wed, 9 Nov 2022 10:02:19 +0000 (11:02 +0100)]
xen/sched: migrate timers to correct cpus after suspend
Today all timers are migrated to cpu 0 when the system is being
suspended. They are not migrated back after resuming the system again.
This results (at least) to visible problems with the credit scheduler,
as the timer isn't handled on the cpu it was expected to occur, which
will result in an ASSERT() triggering. Other more subtle problems, like
uninterrupted elongated time slices, are probable. The least effect
will be worse performance on cpu 0 resulting from most scheduling
related timer interrupts happening there after suspend/resume.
Add migrating the scheduling related timers of a specific cpu from cpu
0 back to its original cpu when that cpu has gone up when resuming the
system.
Fixes: 0763cd268789 ("xen/sched: don't disable scheduler on cpus during suspend") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 37f82facd62f720fdcec104f72f86b8c6c214820
master date: 2022-11-04 09:03:23 +0100
Juergen Gross [Wed, 9 Nov 2022 10:01:56 +0000 (11:01 +0100)]
tools/xenstore: call remove_domid_from_perm() for special nodes
When destroying a domain, any stale permissions of the domain must be
removed from the special nodes "@...", too. This was not done in the
fix for XSA-322.
Andrew Cooper [Tue, 14 Jun 2022 15:18:36 +0000 (16:18 +0100)]
x86/spec-ctrl: Mitigate IBPB not flushing the RSB/RAS
Introduce spec_ctrl_new_guest_context() to encapsulate all logic pertaining to
using MSR_PRED_CMD for a new guest context, even if it only has one user
presently.
Introduce X86_BUG_IBPB_NO_RET, and use it extend spec_ctrl_new_guest_context()
with a manual fixup for hardware which mis-implements IBPB.
This is part of XSA-422 / CVE-2022-23824.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2b27967fb89d7904a1571a2fb963b1c9cac548db)
Andrew Cooper [Tue, 14 Jun 2022 15:18:36 +0000 (16:18 +0100)]
x86/spec-ctrl: Enumeration for IBPB_RET
The IBPB_RET bit indicates that the CPU's implementation of MSR_PRED_CMD.IBPB
does flush the RSB/RAS too.
This is part of XSA-422 / CVE-2022-23824.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 24496558e650535bdbd22cc04731e82276cd1b3f)
tools/xenstore: harden transaction finalization against errors
When finalizing a transaction, any error occurring after checking for
conflicts will result in the transaction being performed only
partially today. Additionally accounting data will not be updated at
the end of the transaction, which might result in further problems
later.
Avoid those problems by multiple modifications:
- free any transaction specific nodes which don't need to be committed
as they haven't been written during the transaction as soon as their
generation count has been verified, this will reduce the risk of
out-of-memory situations
- store the transaction specific node name in struct accessed_node in
order to avoid the need to allocate additional memory for it when
finalizing the transaction
- don't stop the transaction finalization when hitting an error
condition, but try to continue to handle all modified nodes
- in case of a detected error do the accounting update as needed and
call the data base checking only after that
- if writing a node in a transaction is failing (e.g. due to a failed
quota check), fail the transaction, as prior changes to struct
accessed_node can't easily be undone in that case
In case a node has been created in a transaction and it is later
deleted in the same transaction, the transaction will be terminated
with an error.
As this error is encountered only when handling the deleted node at
transaction finalization, the transaction will have been performed
partially and without updating the accounting information. This will
enable a malicious guest to create arbitrary number of nodes.
Edwin Török [Wed, 12 Oct 2022 18:13:05 +0000 (19:13 +0100)]
tools/ocaml: Ensure packet size is never negative
Integers in Ocaml have 63 or 31 bits of signed precision.
On 64-bit builds of Ocaml, this is fine because a C uint32_t always fits
within a 63-bit signed integer.
In 32-bit builds of Ocaml, this goes wrong. The C uint32_t is truncated
first (loses the top bit), then has a unsigned/signed mismatch.
A "negative" value (i.e. a packet on the ring of between 1G and 2G in size)
will trigger an exception later in Bytes.make in xb.ml, and because the packet
is not removed from the ring, the exception re-triggers on every subsequent
query, creating a livelock.
Fix both the source of the exception in Xb, and as defence in depth, mark the
domain as bad for any Invalid_argument exceptions to avoid the risk of
livelock.
This is XSA-420 / CVE-2022-42324.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit ae34df4d82636f4c82700b447ea2c93b9f82b3f3)
Edwin Török [Wed, 12 Oct 2022 18:13:06 +0000 (19:13 +0100)]
tools/ocaml/xenstored: Fix quota bypass on domain shutdown
XSA-322 fixed a domid reuse vulnerability by assigning Dom0 as the owner of
any nodes left after a domain is shutdown (e.g. outside its /local/domain/N
tree).
However Dom0 has no quota on purpose, so this opened up another potential
attack vector. Avoid it by deleting these nodes instead of assigning them to
Dom0.
This is part of XSA-419 / CVE-2022-42323.
Fixes: c46eff921209 ("tools/ocaml/xenstored: clean up permissions for dead domains") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit db471408edd46af403b8bd44d180a928ad7fbb80)
tools/xenstore: make the internal memory data base the default
Having a file backed data base has the only advantage of being capable
to dump the contents of it while Xenstore is running, and potentially
using less swap space in case the data base can't be kept in memory.
It has the major disadvantage of a huge performance overhead: switching
to keep the data base in memory only speeds up live update of xenstored
with 120000 nodes from 20 minutes to 11 seconds. A complete tree walk
of this configuration will be reduced from 7 seconds to 280 msecs
(measured by "xenstore-control check").
So make the internal memory data base the default and enhance the
"--internal-db" command line parameter to take an optional parameter
allowing to switch the internal data base back to the file based one.
tools/xenstore: remove nodes owned by destroyed domain
In case a domain is removed from Xenstore, remove all nodes owned by
it per default.
This tackles the problem that nodes might be created by a domain
outside its home path in Xenstore, leading to Xenstore hogging more
and more memory. Domain quota don't work in this case if the guest is
rebooting in between.
Since XSA-322 ownership of such stale nodes is transferred to dom0,
which is helping against unintended access, but not against OOM of
Xenstore.
As a fallback for weird cases add a Xenstore start parameter for
keeping today's way to handle stale nodes, adding the risk of Xenstore
hitting an OOM situation.
This is part of XSA-419 / CVE-2022-42322.
Fixes: 496306324d8d ("tools/xenstore: revoke access rights for removed domains") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 755d3f9debf8879448211fffb018f556136f6a79)
check_store() is using a hash table for storing all node names it has
found via walking the tree. Additionally it using another hash table
for all children of a node to detect duplicate child names.
Simplify that by dropping the second hash table as the first one is
already holding all the needed information.
Add a generic function to walk the complete node tree. It will start
at "/" and descend recursively into each child, calling a function
specified by the caller. Depending on the return value of the user
specified function the walk will be aborted, continued, or the current
child will be skipped by not descending into its children.
tools/xenstore: don't let remove_child_entry() call corrupt()
In case of write_node() returning an error, remove_child_entry() will
call corrupt() today. This could result in an endless recursion, as
remove_child_entry() is called by corrupt(), too:
Today chk_domain_generation() is being used to check whether a node
permission entry is still valid or whether it is referring to a domain
no longer existing. This is done by comparing the node's and the
domain's generation count.
In case no struct domain is existing for a checked domain, but the
domain itself is valid, chk_domain_generation() assumes it is being
called due to the first node created for a new domain and it will
return success.
This might be wrong in case the checked permission is related to an
old domain, which has just been replaced with a new domain using the
same domid.
Fix that by letting chk_domain_generation() fail in case a struct
domain isn't found. In order to cover the case of the first node for
a new domain try to allocate the needed struct domain explicitly when
processing the related SET_PERMS command. In case a referenced domain
isn't existing, flag the related permission to be ignored right away.
tools/xenstore: don't use conn->in as context for temporary allocations
Using the struct buffered data pointer of the current processed request
for temporary data allocations has a major drawback: the used area (and
with that the temporary data) is freed only after the response of the
request has been written to the ring page or has been read via the
socket. This can happen much later in case a guest isn't reading its
responses fast enough.
As the temporary data can be safely freed after creating the response,
add a temporary context for that purpose and use that for allocating
the temporary memory, as it was already the case before commit cc0612464896 ("xenstore: add small default data buffer to internal
struct").
Some sub-functions need to gain the "const" attribute for the talloc
context.
This is XSA-416 / CVE-2022-42319.
Fixes: cc0612464896 ("xenstore: add small default data buffer to internal struct") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 2a587de219cc0765330fbf9fac6827bfaf29e29b)
SUPPORT.md: clarify support of untrusted driver domains with oxenstored
Add a support statement for the scope of support regarding different
Xenstore variants. Especially oxenstored does not (yet) have security
support of untrusted driver domains, as those might drive oxenstored
out of memory by creating lots of watch events for the guests they are
servicing.
Add a statement regarding Live Update support of oxenstored.
This is part of XSA-326.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit c7bc20d8d123851a468402bbfc9e3330efff21ec)
Edwin Török [Wed, 12 Oct 2022 18:13:04 +0000 (19:13 +0100)]
tools/ocaml: Limit maximum in-flight requests / outstanding replies
Introduce a limit on the number of outstanding reply packets in the xenbus
queue. This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.
To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it. Even though Dom0
is trusted and exempt from quotas a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.
Therefore the xenbus queue limit must apply to all connections, Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.
This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring. However if it
can cause the generation of watch events by other means (e.g. by Dom0, or
another cooperative guest) and stop reading its own ring then watch events
would've queued up without limit.
The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events. In fact, dropping watch events is known to break
some pieces of normal functionality. This leaves little choice to safely
implement the xenstore protocol without exposing the xenstore daemon to
out-of-memory attacks.
Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
deliver
Items are queued up on one end and are sent as far along the pipe as possible:
source domain -> watch -> xenbus of target -> xenstore ring/socket of target
If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up. For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.
Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue. This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.
When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).
There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.
Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
source will not be allowed to queue up more watch events.
* quota-maxoustanding which is the number of responses not read from the ring:
once exceeded, no more inputs are processed until all outstanding replies
are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
on destination are queued up here, a single command can trigger multiple
watches (e.g. due to recursion).
The overflow queue currently doesn't have an upper bound, it is difficult to
accurately calculate one as it depends on whether you are Dom0 and how many
watches each path has registered and how many watch events you can trigger
with a single command (e.g. a commit). However these events were already
using memory, this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.
Note that Dom0 is not excluded from these checks, it is important that Dom0 is
especially not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.
This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.
The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.
Similarly a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well as long as it follows the PV
protocol. (Although note that protection against malicious backends is only a
best effort at the moment)
This is part of XSA-326 / CVE-2022-42318.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 9284ae0c40fb5b9606947eaaec23dc71d0540e96)
Edwin Török [Wed, 12 Oct 2022 18:13:03 +0000 (19:13 +0100)]
tools/ocaml/xb: Add BoundedQueue
Ensures we cannot store more than [capacity] elements in a [Queue]. Replacing
all Queue with this module will then ensure at compile time that all Queues
are correctly bound checked.
Each element in the queue has a class with its own limits. This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
No functional change.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 19171fb5d888b4467a7073e8febc5e05540956e9)
Edwin Török [Wed, 12 Oct 2022 18:13:02 +0000 (19:13 +0100)]
tools/ocaml: Change Xb.input to return Packet.t option
The queue here would only ever hold at most one element. This will simplify
follow-up patches.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit c0a86a462721008eca5ff733660de094d3c34bc7)
Edwin Török [Fri, 29 Jul 2022 17:53:29 +0000 (18:53 +0100)]
tools/ocaml/libs/xb: hide type of Xb.t
Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.
No functional change.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ade30a1451734d041363c750a65d322e25b47ba)
Edwin Török [Wed, 12 Oct 2022 18:13:07 +0000 (19:13 +0100)]
tools/ocaml: GC parameter tuning
By default the OCaml garbage collector would return memory to the OS only
after unused memory is 5x live memory. Tweak this to 120% instead, which
would match the major GC speed.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 4a8bacff20b857ca0d628ef5525877ade11f2a42)
Edwin Török [Thu, 28 Jul 2022 16:08:15 +0000 (17:08 +0100)]
tools/ocaml/xenstored: Check for maxrequests before performing operations
Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.
If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded
It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there. Then the transaction
record won't grow.
This is part of XSA-326 / CVE-2022-42317.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 329f4d1a6535c6c5a34025ca0d03fc5c7228fcff)
Edwin Török [Wed, 12 Oct 2022 18:13:01 +0000 (19:13 +0100)]
tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
We currently have 2 different set of defaults in upstream Xen git tree:
* defined in the source code, only used if there is no config file
* defined in the oxenstored.conf.in upstream Xen
An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.
Resync the defaults from oxenstored.conf.in into the source code.
This is part of XSA-326 / CVE-2022-42316.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 84734955d4bf629ba459a74773afcde50a52236f)
tools/xenstore: add control command for setting and showing quota
Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain
Note that in the case the new quota is lower than existing one,
Xenstored may continue to handle requests from a domain exceeding the
new limit (depends on which one has been broken) and the amount of
resource used will not change. However the domain will not be able to
create more resource (associated to the quota) until it is back to below
the limit.
Add the memory accounting for Xenstore nodes. In order to make this
not too complicated allow for some sloppiness when writing nodes. Any
hard quota violation will result in no further requests to be accepted.
When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().
tools/xenstore: add memory accounting for responses
Add the memory accounting for queued responses.
In case adding a watch event for a guest is causing the hard memory
quota of that guest to be violated, the event is dropped. This will
ensure that it is impossible to drive another guest past its memory
quota by generating insane amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
tools/xenstore: add infrastructure to keep track of per domain memory usage
The amount of memory a domain can consume in Xenstore is limited by
various quota today, but even with sane quota a domain can still
consume rather large memory quantities.
Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so any internal administration data
needed by Xenstore only is not being accounted for.
There are two quotas defined: a soft quota which will result in a
warning issued via syslog() when it is exceeded, and a hard quota
resulting in a stop of accepting further requests or watch events as
long as the hard quota would be violated by accepting those.
Setting any of those quotas to 0 will disable it.
As default values use 2MB per domain for the soft limit (this basically
covers the allowed case to create 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
tools/xenstore: limit max number of nodes accessed in a transaction
Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore
as there is the need to keep track of all nodes having been accessed
during a transaction.
In oxenstored the number of requests in a transaction is being limited
via a quota maxrequests (default is 1024). As multiple accesses of a
node are not problematic in C Xenstore, limit the number of accessed
nodes.
In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
tools/xenstore: simplify and fix per domain node accounting
The accounting of nodes can be simplified now that each connection
holds the associated domid.
Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires to react properly to an allocation
failure inside domain_entry_inc() by returning an error code.
Especially in error paths the node accounting has to be fixed in some
cases.
A guest not reading its Xenstore response buffer fast enough might
pile up lots of Xenstore watch events buffered. Reduce the generated
load by dropping new events which already have an identical copy
pending.
The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
Add another quota for limiting the number of outstanding requests of a
guest. As the way to specify quotas on the command line is becoming
rather nasty, switch to a new scheme using [--quota|-Q] <what>=<val>
allowing to add more quotas in future easily.
Set the default value to 20 (basically a random value not seeming to
be too high or too low).
A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
Note that across a live update the relation to buffered watch events
for other domains is lost.
Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota zeroed initially.
A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response of the
request or any resulting watch event hasn't been consumed yet.
In order to avoid a malicious guest being capable to block other guests
by not reading watch events, add a timeout for watch events. In case a
watch event hasn't been consumed after this timeout, it is being
deleted. Set the default timeout to 20 seconds (a random value being
not too high).
In order to support to specify other timeout values in future, use a
generic command line option for that purpose:
When removing a watched node outside of a transaction, two watch events
are being produced instead of just a single one.
When finalizing a transaction watch events can be generated for each
node which is being modified, even if outside a transaction such
modifications might not have resulted in a watch event.
This happens e.g.:
- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
is implicitly creating a/b, resulting in watch events for a, a/b and
a/b/c instead of a/b/c only)
Avoid these additional watch events, in order to reduce the needed
memory inside Xenstore for queueing them.
This is being achieved by adding event flags to struct accessed_node
specifying whether an event should be triggered, and whether it should
be an exact match of the modified path. Both flags can be set from
fire_watches() instead of implying them only.
tools/xenstore: add helpers to free struct buffered_data
Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.
This is avoiding duplicated code and will help later when more actions
are needed when freeing a struct buffered_data.
tools/xenstore: Fail a transaction if it is not possible to create a node
Commit f2bebf72c4d5 "xenstore: rework of transaction handling" moved
out from copying the entire database everytime a new transaction is
opened to track the list of nodes changed.
The content of all the nodes accessed during a transaction will be
temporarily stored in TDB using a different key.
The function create_node() may write/update multiple nodes if the child
doesn't exist. In case of a failure, the function will revert any
changes (this include any update to TDB). Unfortunately, the function
which reverts the changes (i.e. destroy_node()) will not use the correct
key to delete any update or even request the transaction to fail.
This means that if a client decide to go ahead with committing the
transaction, orphan nodes will be created because they were not linked
to an existing node (create_node() will write the nodes backwards).
Once some nodes have been partially updated in a transaction, it is not
easily possible to undo any changes. So rather than continuing and hit
weird issue while committing, it is much saner to fail the transaction.
This will have an impact on any client that decides to commit even if it
can't write a node. Although, it is not clear why a normal client would
want to do that...
Lastly, update destroy_node() to use the correct key for deleting the
node. Rather than recreating it (this will allocate memory and
therefore fail), stash the key in the structure node.
tools/xenstore: create_node: Don't defer work to undo any changes on failure
XSA-115 extended destroy_node() to update the node accounting for the
connection. The implementation is assuming the connection is the parent
of the node, however all the nodes are allocated using a separate context
(see process_message()). This will result to crash (or corrupt) xenstored
as the pointer is wrongly used.
In case of an error, any changes to the database or update to the
accounting will now be reverted in create_node() by calling directly
destroy_node(). This has the nice advantage to remove the loop to unset
the destructors in case of success.
Take the opportunity to free the nodes right now as they are not
going to be reachable (the function returns NULL) and are just wasting
resources.
Igor Druzhinin [Mon, 31 Oct 2022 12:38:05 +0000 (13:38 +0100)]
x86/pv-shim: correct ballooning down for compat guests
The compat layer for multi-extent memory ops may need to split incoming
requests. Since the guest handles in the interface structures may not be
altered, it does so by leveraging do_memory_op()'s continuation
handling: It hands on non-initial requests with a non-zero start extent,
with the (native) handle suitably adjusted down. As a result
do_memory_op() sees only the first of potentially several requests with
start extent being zero. In order to be usable as overall result, the
function accumulates args.nr_done, i.e. it initialized the field with
the start extent. Therefore non-initial requests resulting from the
split would pass too large a number into pv_shim_offline_memory().
Address that breakage by always calling pv_shim_offline_memory()
regardless of current hypercall preemption status, with a suitably
adjusted first argument. Note that this is correct also for the native
guest case: We now simply "commit" what was completed right away, rather
than at the end of a series of preemption/re-start cycles. In fact this
improves overall preemption behavior: There's no longer a potentially
big chunk of work done non-preemptively at the end of the last
"iteration".
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1d7fbc535d1d37bdc2cc53ede360b0f6651f7de1
master date: 2022-10-28 15:49:33 +0200
Igor Druzhinin [Mon, 31 Oct 2022 12:37:42 +0000 (13:37 +0100)]
x86/pv-shim: correct ballooning up for compat guests
The compat layer for multi-extent memory ops may need to split incoming
requests. Since the guest handles in the interface structures may not be
altered, it does so by leveraging do_memory_op()'s continuation
handling: It hands on non-initial requests with a non-zero start extent,
with the (native) handle suitably adjusted down. As a result
do_memory_op() sees only the first of potentially several requests with
start extent being zero. It's only that case when the function would
issue a call to pv_shim_online_memory(), yet the range then covers only
the first sub-range that results from the split.
Address that breakage by making a complementary call to
pv_shim_online_memory() in compat layer.
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0bfdd201ea12aa5679bb8944d63a4e0d3c23160
master date: 2022-10-28 15:48:50 +0200
Mem-op requests may have zero extents. Such requests need treating as
no-ops. pv_shim_online_memory(), however, would have tried to take 2³²-1
order-sized pages from its balloon list (to then populate them),
typically ending when the entire set of ballooned pages of this order
was consumed.
Note that pv_shim_offline_memory() does not have such an issue.
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9272225ca72801fd9fa5b268a2d1c5adebd19cd9
master date: 2022-10-28 15:47:59 +0200
Jan Beulich [Mon, 31 Oct 2022 12:36:50 +0000 (13:36 +0100)]
common: map_vcpu_info() wants to unshare the underlying page
Not passing P2M_UNSHARE to get_page_from_gfn() means there won't even be
an attempt to unshare the referenced page, without any indication to the
caller (e.g. -EAGAIN). Note that guests have no direct control over
which of their pages are shared (or paged out), and hence they have no
way to make sure all on their own that the subsequent obtaining of a
writable type reference can actually succeed.
Jan Beulich [Mon, 31 Oct 2022 12:36:25 +0000 (13:36 +0100)]
x86: also zap secondary time area handles during soft reset
Just like domain_soft_reset() properly zaps runstate area handles, the
secondary time area ones also need discarding to prevent guest memory
corruption once the guest is re-started.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b80d4f8d2ea6418e32fb4f20d1304ace6d6566e3
master date: 2022-10-27 11:49:09 +0200
Roger Pau Monné [Mon, 31 Oct 2022 12:35:59 +0000 (13:35 +0100)]
vpci/msix: remove from table list on detach
Teardown of MSIX vPCI related data doesn't currently remove the MSIX
device data from the list of MSIX tables handled by the domain,
leading to a use-after-free of the data in the msix structure.
Remove the structure from the list before freeing in order to solve
it.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: d6281be9d0 ('vpci/msix: add MSI-X handlers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c14aea137eab29eb9c30bfad745a00c65ad21066
master date: 2022-10-26 14:56:58 +0200
Roger Pau Monné [Mon, 31 Oct 2022 12:35:33 +0000 (13:35 +0100)]
vpci: don't assume that vpci per-device data exists unconditionally
It's possible for a device to be assigned to a domain but have no
vpci structure if vpci_process_pending() failed and called
vpci_remove_device() as a result. The unconditional accesses done by
vpci_{read,write}() and vpci_remove_device() to pdev->vpci would
then trigger a NULL pointer dereference.
Add checks for pdev->vpci presence in the affected functions.
Jan Beulich [Mon, 31 Oct 2022 12:35:06 +0000 (13:35 +0100)]
x86/shadow: drop (replace) bogus assertions
The addition of a call to shadow_blow_tables() from shadow_teardown()
has resulted in the "no vcpus" related assertion becoming triggerable:
If domain_create() fails with at least one page successfully allocated
in the course of shadow_enable(), or if domain_create() succeeds and
the domain is then killed without ever invoking XEN_DOMCTL_max_vcpus.
Note that in-tree tests (test-resource and test-tsx) do exactly the
latter of these two.
The assertion's comment was bogus anyway: Shadow mode has been getting
enabled before allocation of vCPU-s for quite some time. Convert the
assertion to a conditional: As long as there are no vCPU-s, there's
nothing to blow away.
Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool preemptively") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
A similar assertion/comment pair exists in _shadow_prealloc(); the
comment is similarly bogus, and the assertion could in principle trigger
e.g. when shadow_alloc_p2m_page() is called early enough. Replace those
at the same time by a similar early return, here indicating failure to
the caller (which will generally lead to the domain being crashed in
shadow_prealloc()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a92dc2bb30ba65ae25d2f417677eb7ef9a6a0fef
master date: 2022-10-24 15:46:11 +0200
Juergen Gross [Mon, 31 Oct 2022 12:34:28 +0000 (13:34 +0100)]
xen/sched: fix restore_vcpu_affinity() by removing it
When the system is coming up after having been suspended,
restore_vcpu_affinity() is called for each domain in order to adjust
the vcpu's affinity settings in case a cpu didn't come to live again.
The way restore_vcpu_affinity() is doing that is wrong, because the
specific scheduler isn't being informed about a possible migration of
the vcpu to another cpu. Additionally the migration is often even
happening if all cpus are running again, as it is done without check
whether it is really needed.
As cpupool management is already calling cpu_disable_scheduler() for
cpus not having come up again, and cpu_disable_scheduler() is taking
care of eventually needed vcpu migration in the proper way, there is
simply no need for restore_vcpu_affinity().
So just remove restore_vcpu_affinity() completely, together with the
no longer used sched_reset_affinity_broken().
Fixes: 8a04eaa8ea83 ("xen/sched: move some per-vcpu items to struct sched_unit") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: fce1f381f7388daaa3e96dbb0d67d7a3e4bb2d2d
master date: 2022-10-24 11:16:27 +0100
Juergen Gross [Mon, 31 Oct 2022 12:33:59 +0000 (13:33 +0100)]
xen/sched: fix race in RTDS scheduler
When a domain gets paused the unit runnable state can change to "not
runnable" without the scheduling lock being involved. This means that
a specific scheduler isn't involved in this change of runnable state.
In the RTDS scheduler this can result in an inconsistency in case a
unit is losing its "runnable" capability while the RTDS scheduler's
scheduling function is active. RTDS will remove the unit from the run
queue, but doesn't do so for the replenish queue, leading to hitting
an ASSERT() in replq_insert() later when the domain is unpaused again.
Fix that by removing the unit from the replenish queue as well in this
case.
Jan Beulich [Mon, 31 Oct 2022 12:33:29 +0000 (13:33 +0100)]
EFI: don't convert memory marked for runtime use to ordinary RAM
efi_init_memory() in both relevant places is treating EFI_MEMORY_RUNTIME
higher priority than the type of the range. To avoid accessing memory at
runtime which was re-used for other purposes, make
efi_arch_process_memory_map() follow suit. While in theory the same would
apply to EfiACPIReclaimMemory, we don't actually "reclaim" or clobber
that memory (converted to E820_ACPI on x86) there (and it would be a bug
if the Dom0 kernel tried to reclaim the range, bypassing Xen's memory
management, plus it would be at least bogus if it clobbered that space),
hence that type's handling can be left alone.
Jason Andryuk [Mon, 31 Oct 2022 12:32:59 +0000 (13:32 +0100)]
argo: Remove reachable ASSERT_UNREACHABLE
I observed this ASSERT_UNREACHABLE in partner_rings_remove consistently
trip. It was in OpenXT with the viptables patch applied.
dom10 shuts down.
dom7 is REJECTED sending to dom10.
dom7 shuts down and this ASSERT trips for dom10.
The argo_send_info has a domid, but there is no refcount taken on
the domain. Therefore it's not appropriate to ASSERT that the domain
can be looked up via domid. Replace with a debug message.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
master commit: 197f612b77c5afe04e60df2100a855370d720ad7
master date: 2022-10-14 14:45:41 +0100
Jan Beulich [Mon, 31 Oct 2022 12:31:26 +0000 (13:31 +0100)]
VMX: correct error handling in vmx_create_vmcs()
With the addition of vmx_add_msr() calls to construct_vmcs() there are
now cases where simply freeing the VMCS isn't enough: The MSR bitmap
page as well as one of the MSR area ones (if it's the 2nd vmx_add_msr()
which fails) may also need freeing. Switch to using vmx_destroy_vmcs()
instead.
Fixes: 3bd36952dab6 ("x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests") Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 448d28309f1a966bdc850aff1a637e0b79a03e43
master date: 2022-10-12 17:57:56 +0200
Henry Wang [Tue, 25 Oct 2022 09:21:12 +0000 (09:21 +0000)]
xen/arm: p2m: Populate pages for GICv2 mapping in p2m_init()
Hardware using GICv2 needs to create a P2M mapping of 8KB GICv2 area
when the domain is created. Considering the worst case of page tables
which requires 6 P2M pages as the two pages will be consecutive but not
necessarily in the same L3 page table and keep a buffer, populate 16
pages as the default value to the P2M pages pool in p2m_init() at the
domain creation stage to satisfy the GICv2 requirement. For GICv3, the
above-mentioned P2M mapping is not necessary, but since the allocated
16 pages here would not be lost, hence populate these pages
unconditionally.
With the default 16 P2M pages populated, there would be a case that
failures would happen in the domain creation with P2M pages already in
use. To properly free the P2M for this case, firstly support the
optionally preemption of p2m_teardown(), then call p2m_teardown() and
p2m_set_allocation(d, 0, NULL) non-preemptively in p2m_final_teardown().
As non-preemptive p2m_teardown() should only return 0, use a
BUG_ON to confirm that.
Since p2m_final_teardown() is called either after
domain_relinquish_resources() where relinquish_p2m_mapping() has been
called, or from failure path of domain_create()/arch_domain_create()
where mappings that require p2m_put_l3_page() should never be created,
relinquish_p2m_mapping() is not added in p2m_final_teardown(), add
in-code comments to refer this.
Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool") Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit: c7cff1188802646eaa38e918e5738da0e84949be)
Andrew Cooper [Tue, 25 Oct 2022 09:21:11 +0000 (09:21 +0000)]
arm/p2m: Rework p2m_init()
p2m_init() is mostly trivial initialisation, but has two fallible operations
which are on either side of the backpointer trigger for teardown to take
actions.
p2m_free_vmid() is idempotent with a failed p2m_alloc_vmid(), so rearrange
p2m_init() to perform all trivial setup, then set the backpointer, then
perform all fallible setup.
This will simplify a future bugfix which needs to add a third fallible
operation.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit: 3783e583319fa1ce75e414d851f0fde191a14753)
Jan Beulich [Wed, 12 Oct 2022 15:33:40 +0000 (17:33 +0200)]
libxl/Arm: correct xc_shadow_control() invocation to fix build
The backport didn't adapt to the earlier function prototype taking more
(unused here) arguments.
Fixes: c5215044578e ("xen/arm, libxl: Implement XEN_DOMCTL_shadow_op for Arm") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Henry Wang <Henry.Wang@arm.com> Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Tamas K Lengyel [Tue, 11 Oct 2022 13:17:42 +0000 (15:17 +0200)]
x86/vpmu: Fix race-condition in vpmu_load
The vPMU code-bases attempts to perform an optimization on saving/reloading the
PMU context by keeping track of what vCPU ran on each pCPU. When a pCPU is
getting scheduled, checks if the previous vCPU isn't the current one. If so,
attempts a call to vpmu_save_force. Unfortunately if the previous vCPU is
already getting scheduled to run on another pCPU its state will be already
runnable, which results in an ASSERT failure.
Fix this by always performing a pmu context save in vpmu_save when called from
vpmu_switch_from, and do a vpmu_load when called from vpmu_switch_to.
While this presents a minimal overhead in case the same vCPU is getting
rescheduled on the same pCPU, the ASSERT failure is avoided and the code is a
lot easier to reason about.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: defa4e51d20a143bdd4395a075bf0933bb38a9a4
master date: 2022-09-30 09:53:49 +0200
Jan Beulich [Tue, 11 Oct 2022 13:17:12 +0000 (15:17 +0200)]
x86: wire up VCPUOP_register_vcpu_time_memory_area for 32-bit guests
Forever sinced its introduction VCPUOP_register_vcpu_time_memory_area
was available only to native domains. Linux, for example, would attempt
to use it irrespective of guest bitness (including in its so called
PVHVM mode) as long as it finds XEN_PVCLOCK_TSC_STABLE_BIT set (which we
set only for clocksource=tsc, which in turn needs engaging via command
line option).
Fixes: a5d39947cb89 ("Allow guests to register secondary vcpu_time_info") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b726541d94bd0a80b5864d17a2cd2e6d73a3fe0a
master date: 2022-09-29 14:47:45 +0200
Juergen Gross [Tue, 11 Oct 2022 13:16:53 +0000 (15:16 +0200)]
xen/gnttab: fix gnttab_acquire_resource()
Commit 9dc46386d89d ("gnttab: work around "may be used uninitialized"
warning") was wrong, as vaddrs can legitimately be NULL in case
XENMEM_resource_grant_table_id_status was specified for a grant table
v1. This would result in crashes in debug builds due to
ASSERT_UNREACHABLE() triggering.
Check vaddrs only to be NULL in the rc == 0 case.
Expand the tests in tools/tests/resource to tickle this path, and verify that
using XENMEM_resource_grant_table_id_status on a v1 grant table fails.
Fixes: 9dc46386d89d ("gnttab: work around "may be used uninitialized" warning") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> # xen Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 52daa6a8483e4fbd6757c9d1b791e23931791608
master date: 2022-09-09 16:28:38 +0100
Jan Beulich [Tue, 11 Oct 2022 13:15:55 +0000 (15:15 +0200)]
Config.mk: correct PIE-related option(s) in EMBEDDED_EXTRA_CFLAGS
I haven't been able to find evidence of "-nopie" ever having been a
supported compiler option. The correct spelling is "-no-pie".
Furthermore like "-pie" this is an option which is solely passed to the
linker. The compiler only recognizes "-fpie" / "-fPIE" / "-fno-pie", and
it doesn't infer these options from "-pie" / "-no-pie".
Add the compiler recognized form, but for the possible case of the
variable also being used somewhere for linking keep the linker option as
well (with corrected spelling).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Build: Drop -no-pie from EMBEDDED_EXTRA_CFLAGS
This breaks all Clang builds, as demostrated by Gitlab CI.
Contrary to the description in ecd6b9759919, -no-pie is not even an option
passed to the linker. GCC's actual behaviour is to inhibit the passing of
-pie to the linker, as well as selecting different cr0 artefacts to be linked.
EMBEDDED_EXTRA_CFLAGS is not used for $(CC)-doing-linking, and not liable to
gain such a usecase.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Tested-by: Stefano Stabellini <sstabellini@kernel.org> Fixes: ecd6b9759919 ("Config.mk: correct PIE-related option(s) in EMBEDDED_EXTRA_CFLAGS")
master commit: ecd6b9759919fa6335b0be1b5fc5cce29a30c4f1
master date: 2022-09-08 09:25:26 +0200
master commit: 13a7c0074ac8fb31f6c0485429b7a20a1946cb22
master date: 2022-09-27 15:40:42 -0700
Juergen Gross [Tue, 11 Oct 2022 13:15:41 +0000 (15:15 +0200)]
xen/sched: fix cpu hotplug
Cpu unplugging is calling schedule_cpu_rm() via stop_machine_run() with
interrupts disabled, thus any memory allocation or freeing must be
avoided.
Since commit 5047cd1d5dea ("xen/common: Use enhanced
ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
via an assertion, which will now fail.
Fix this by allocating needed memory before entering stop_machine_run()
and freeing any memory only after having finished stop_machine_run().
Fixes: 1ec410112cdd ("xen/sched: support differing granularity in schedule_cpu_[add/rm]()") Reported-by: Gao Ruifeng <ruifeng.gao@intel.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d84473689611eed32fd90b27e614f28af767fa3f
master date: 2022-09-05 11:42:30 +0100
Juergen Gross [Tue, 11 Oct 2022 13:15:14 +0000 (15:15 +0200)]
xen/sched: carve out memory allocation and freeing from schedule_cpu_rm()
In order to prepare not allocating or freeing memory from
schedule_cpu_rm(), move this functionality to dedicated functions.
For now call those functions from schedule_cpu_rm().
No change of behavior expected.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d42be6f83480b3ada286dc18444331a816be88a3
master date: 2022-09-05 11:42:30 +0100
For updating the node affinities of all domains in a cpupool add a new
function cpupool_update_node_affinity().
In order to avoid multiple allocations of cpumasks carve out memory
allocation and freeing from domain_update_node_affinity() into new
helpers, which can be used by cpupool_update_node_affinity().
Modify domain_update_node_affinity() to take an additional parameter
for passing the allocated memory in and to allocate and free the memory
via the new helpers in case NULL was passed.
This will help later to pre-allocate the cpumasks in order to avoid
allocations in stop-machine context.
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a83fa1e2b96ace65b45dde6954d67012633a082b
master date: 2022-09-05 11:42:30 +0100
Jan Beulich [Tue, 11 Oct 2022 13:14:05 +0000 (15:14 +0200)]
x86/CPUID: surface suitable value in EBX of XSTATE subleaf 1
While the SDM isn't very clear about this, our present behavior make
Linux 5.19 unhappy. As of commit 8ad7e8f69695 ("x86/fpu/xsave: Support
XSAVEC in the kernel") they're using this CPUID output also to size
the compacted area used by XSAVEC. Getting back zero there isn't really
liked, yet for PV that's the default on capable hardware: XSAVES isn't
exposed to PV domains.
Considering that the size reported is that of the compacted save area,
I view Linux'es assumption as appropriate (short of the SDM properly
considering the case). Therefore we need to populate the field also when
only XSAVEC is supported for a guest.
Fixes: 460b9a4b3630 ("x86/xsaves: enable xsaves/xrstors for hvm guest") Fixes: 8d050ed1097c ("x86: don't expose XSAVES capability to PV guests") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c3bd0b83ea5b7c0da6542687436042eeea1e7909
master date: 2022-08-24 14:23:59 +0200
Anthony PERARD [Tue, 11 Oct 2022 13:13:15 +0000 (15:13 +0200)]
tools/libxl: Replace deprecated -soundhw on QEMU command line
-soundhw is deprecated since 825ff02911c9 ("audio: add soundhw
deprecation notice"), QEMU v5.1, and is been remove for upcoming v7.1
by 039a68373c45 ("introduce -audio as a replacement for -soundhw").
Instead we can just add the sound card with "-device", for most option
that "-soundhw" could handle. "-device" is an option that existed
before QEMU 1.0, and could already be used to add audio hardware.
The list of possible option for libxl's "soundhw" is taken the list
from QEMU 7.0.
The list of options for "soundhw" are listed in order of preference in
the manual. The first three (hda, ac97, es1370) are PCI devices and
easy to test on Linux, and the last four are ISA devices which doesn't
seems to work out of the box on linux.
The sound card 'pcspk' isn't listed even if it used to be accepted by
'-soundhw' because QEMU crash when trying to add it to a Xen domain.
Also, it wouldn't work with "-device" might need to be "-machine
pcspk-audiodev=default" instead.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: 62ca138c2c052187783aca3957d3f47c4dcfd683
master date: 2022-08-18 09:25:50 +0200
Jan Beulich [Tue, 11 Oct 2022 13:11:01 +0000 (15:11 +0200)]
gnttab: correct locking on transitive grant copy error path
While the comment next to the lock dropping in preparation of
recursively calling acquire_grant_for_copy() mistakenly talks about the
rd == td case (excluded a few lines further up), the same concerns apply
to the calling of release_grant_for_copy() on a subsequent error path.
This is CVE-2022-33748 / XSA-411.
Fixes: ad48fb963dbf ("gnttab: fix transitive grant handling") Signed-off-by: Jan Beulich <jbeulich@suse.com>
master commit: 6e3aab858eef614a21a782a3b73acc88e74690ea
master date: 2022-10-11 14:29:30 +0200
Henry Wang [Tue, 11 Oct 2022 13:10:34 +0000 (15:10 +0200)]
xen/arm: Allocate and free P2M pages from the P2M pool
This commit sets/tearsdown of p2m pages pool for non-privileged Arm
guests by calling `p2m_set_allocation` and `p2m_teardown_allocation`.
- For dom0, P2M pages should come from heap directly instead of p2m
pool, so that the kernel may take advantage of the extended regions.
- For xl guests, the setting of the p2m pool is called in
`XEN_DOMCTL_shadow_op` and the p2m pool is destroyed in
`domain_relinquish_resources`. Note that domctl->u.shadow_op.mb is
updated with the new size when setting the p2m pool.
- For dom0less domUs, the setting of the p2m pool is called before
allocating memory during domain creation. Users can specify the p2m
pool size by `xen,domain-p2m-mem-mb` dts property.
To actually allocate/free pages from the p2m pool, this commit adds
two helper functions namely `p2m_alloc_page` and `p2m_free_page` to
`struct p2m_domain`. By replacing the `alloc_domheap_page` and
`free_domheap_page` with these two helper functions, p2m pages can
be added/removed from the list of p2m pool rather than from the heap.
Since page from `p2m_alloc_page` is cleaned, take the opportunity
to remove the redundant `clean_page` in `p2m_create_table`.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: cbea5a1149ca7fd4b7cdbfa3ec2e4f109b601ff7
master date: 2022-10-11 14:28:44 +0200
Henry Wang [Tue, 11 Oct 2022 13:10:16 +0000 (15:10 +0200)]
xen/arm, libxl: Implement XEN_DOMCTL_shadow_op for Arm
This commit implements the `XEN_DOMCTL_shadow_op` support in Xen
for Arm. The p2m pages pool size for xl guests is supposed to be
determined by `XEN_DOMCTL_shadow_op`. Hence, this commit:
- Introduces a function `p2m_domctl` and implements the subops
`XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION` and
`XEN_DOMCTL_SHADOW_OP_GET_ALLOCATION` of `XEN_DOMCTL_shadow_op`.
- Adds the `XEN_DOMCTL_SHADOW_OP_SET_ALLOCATION` support in libxl.
Therefore enabling the setting of shadow memory pool size
when creating a guest from xl and getting shadow memory pool size
from Xen.
Note that the `XEN_DOMCTL_shadow_op` added in this commit is only
a dummy op, and the functionality of setting/getting p2m memory pool
size for xl guests will be added in following commits.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: cf2a68d2ffbc3ce95e01449d46180bddb10d24a0
master date: 2022-10-11 14:28:42 +0200
Henry Wang [Tue, 11 Oct 2022 13:09:58 +0000 (15:09 +0200)]
xen/arm: Construct the P2M pages pool for guests
This commit constructs the p2m pages pool for guests from the
data structure and helper perspective.
This is implemented by:
- Adding a `struct paging_domain` which contains a freelist, a
counter variable and a spinlock to `struct arch_domain` to
indicate the free p2m pages and the number of p2m total pages in
the p2m pages pool.
- Adding a helper `p2m_get_allocation` to get the p2m pool size.
- Adding a helper `p2m_set_allocation` to set the p2m pages pool
size. This helper should be called before allocating memory for
a guest.
- Adding a helper `p2m_teardown_allocation` to free the p2m pages
pool. This helper should be called during the xl domain destory.
This is part of CVE-2022-33747 / XSA-409.
Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
master commit: 55914f7fc91a468649b8a3ec3f53ae1c4aca6670
master date: 2022-10-11 14:28:39 +0200
Henry Wang [Tue, 11 Oct 2022 13:09:32 +0000 (15:09 +0200)]
libxl, docs: Use arch-specific default paging memory
The default paging memory (descibed in `shadow_memory` entry in xl
config) in libxl is used to determine the memory pool size for xl
guests. Currently this size is only used for x86, and contains a part
of RAM to shadow the resident processes. Since on Arm there is no
shadow mode guests, so the part of RAM to shadow the resident processes
is not necessary. Therefore, this commit splits the function
`libxl_get_required_shadow_memory()` to arch specific helpers and
renamed the helper to `libxl__arch_get_required_paging_memory()`.
On x86, this helper calls the original value from
`libxl_get_required_shadow_memory()` so no functional change intended.
On Arm, this helper returns 1MB per vcpu plus 4KB per MiB of RAM
for the P2M map and additional 512KB.
Also update the xl.cfg documentation to add Arm documentation
according to code changes and correct the comment style following Xen
coding style.
This is part of CVE-2022-33747 / XSA-409.
Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Anthony PERARD <anthony.perard@citrix.com>
master commit: 156a239ea288972425f967ac807b3cb5b5e14874
master date: 2022-10-11 14:28:37 +0200
Julien Grall [Tue, 11 Oct 2022 13:09:12 +0000 (15:09 +0200)]
xen/x86: p2m: Add preemption in p2m_teardown()
The list p2m->pages contain all the pages used by the P2M. On large
instance this can be quite large and the time spent to call
d->arch.paging.free_page() will take more than 1ms for a 80GB guest
on a Xen running in nested environment on a c5.metal.
By extrapolation, it would take > 100ms for a 8TB guest (what we
current security support). So add some preemption in p2m_teardown()
and propagate to the callers. Note there are 3 places where
the preemption is not enabled:
- hap_final_teardown()/shadow_final_teardown(): We are
preventing update the P2M once the domain is dying (so
no more pages could be allocated) and most of the P2M pages
will be freed in preemptive manneer when relinquishing the
resources. So this is fine to disable preemption.
- shadow_enable(): This is fine because it will undo the allocation
that may have been made by p2m_alloc_table() (so only the root
page table).
The preemption is arbitrarily checked every 1024 iterations.
We now need to include <xen/event.h> in p2m-basic in order to
import the definition for local_events_need_delivery() used by
general_preempt_check(). Ideally, the inclusion should happen in
xen/sched.h but it opened a can of worms.
Note that with the current approach, Xen doesn't keep track on whether
the alt/nested P2Ms have been cleared. So there are some redundant work.
However, this is not expected to incurr too much overhead (the P2M lock
shouldn't be contended during teardown). So this is optimization is
left outside of the security event.
Roger Pau Monné [Tue, 11 Oct 2022 13:08:58 +0000 (15:08 +0200)]
x86/p2m: free the paging memory pool preemptively
The paging memory pool is currently freed in two different places:
from {shadow,hap}_teardown() via domain_relinquish_resources() and
from {shadow,hap}_final_teardown() via complete_domain_destroy().
While the former does handle preemption, the later doesn't.
Attempt to move as much p2m related freeing as possible to happen
before the call to {shadow,hap}_teardown(), so that most memory can be
freed in a preemptive way. In order to avoid causing issues to
existing callers leave the root p2m page tables set and free them in
{hap,shadow}_final_teardown(). Also modify {hap,shadow}_free to free
the page immediately if the domain is dying, so that pages don't
accumulate in the pool when {shadow,hap}_final_teardown() get called.
Move altp2m_vcpu_disable_ve() to be done in hap_teardown(), as that's
the place where altp2m_active gets disabled now.
This is part of CVE-2022-33746 / XSA-410.
Reported-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: e7aa55c0aab36d994bf627c92bd5386ae167e16e
master date: 2022-10-11 14:24:21 +0200
Roger Pau Monné [Tue, 11 Oct 2022 13:08:42 +0000 (15:08 +0200)]
x86/p2m: truly free paging pool memory for dying domains
Modify {hap,shadow}_free to free the page immediately if the domain is
dying, so that pages don't accumulate in the pool when
{shadow,hap}_final_teardown() get called. This is to limit the amount of
work which needs to be done there (in a non-preemptable manner).
Note the call to shadow_free() in shadow_free_p2m_page() is moved after
increasing total_pages, so that the decrease done in shadow_free() in
case the domain is dying doesn't underflow the counter, even if just for
a short interval.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: f50a2c0e1d057c00d6061f40ae24d068226052ad
master date: 2022-10-11 14:23:51 +0200
Roger Pau Monné [Tue, 11 Oct 2022 13:08:14 +0000 (15:08 +0200)]
x86/shadow: tolerate failure in shadow_prealloc()
Prevent _shadow_prealloc() from calling BUG() when unable to fulfill
the pre-allocation and instead return true/false. Modify
shadow_prealloc() to crash the domain on allocation failure (if the
domain is not already dying), as shadow cannot operate normally after
that. Modify callers to also gracefully handle {_,}shadow_prealloc()
failing to fulfill the request.
Note this in turn requires adjusting the callers of
sh_make_monitor_table() also to handle it returning INVALID_MFN.
sh_update_paging_modes() is also modified to add additional error
paths in case of allocation failure, some of those will return with
null monitor page tables (and the domain likely crashed). This is no
different that current error paths, but the newly introduced ones are
more likely to trigger.
The now added failure points in sh_update_paging_modes() also require
that on some error return paths the previous structures are cleared,
and thus monitor table is null.
While there adjust the 'type' parameter type of shadow_prealloc() to
unsigned int rather than u32.
This is part of CVE-2022-33746 / XSA-410.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: b7f93c6afb12b6061e2d19de2f39ea09b569ac68
master date: 2022-10-11 14:22:53 +0200