Avoid incorrectly triggering an error when a broadcast buffered ioreq
is not handled by all registered clients, as long as the failure is
strictly because the client doesn't handle buffered ioreqs.
Signed-off-by: Per Bilse <per.bilse@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: a44734df6c24fadbdb001f051cc5580c467caf7d
master date: 2022-12-07 12:17:30 +0100
Andrew Cooper [Thu, 1 Dec 2022 21:06:25 +0000 (21:06 +0000)]
tools/oxenstored: Render backtraces more nicely in Syslog
fallback_exception_handler feeds a string with embedded newlines directly into
syslog(). While this is an improvement on getting nothing, syslogd escapes
all control characters it gets, and emits one (long) log line.
Fix the problem generally in the syslog stub. As we already have a local copy
of the string, split it in place and emit one syslog() call per line.
Also tweak Logging.msg_of to avoid putting an extra newline on a string which
already ends with one.
Fixes: ee7815f49faf ("tools/oxenstored: Set uncaught exception handler") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit d2162d884cba0ff7b2ac0d832f4e044444bda2e1)
strdup() may return NULL. Check for this before passing to syslog().
Drop const from c_msg. It is bogus, as demonstrated by the need to cast to
void * in order to free the memory.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit acd3fb6d65905f8a185dcb9fe6a330a591b96203)
Edwin Török [Mon, 7 Nov 2022 17:41:36 +0000 (17:41 +0000)]
tools/oxenstored: Set uncaught exception handler
Unhandled exceptions go to stderr by default, but this doesn't typically work
for oxenstored because:
* daemonize reopens stderr as /dev/null
* systemd redirects stderr to /dev/null too
Debugging an unhandled exception requires reproducing the issue locally when
using --no-fork, and is not conducive to figuring out what went wrong on a
remote system.
Install a custom handler which also tries to render the backtrace to the
configured syslog facility, and DAEMON|ERR otherwise.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit ee7815f49faf743e960dac9e72809eb66393bc6d)
Edwin Török [Tue, 8 Nov 2022 08:57:47 +0000 (08:57 +0000)]
tools/oxenstored: Log live update issues at warning level
During live update, oxenstored tries a best effort approach to recover as many
domains and information as possible even if it encounters errors restoring
some domains.
However, logging about misunderstood input is more severe than simply info.
Log it at warning instead.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 3f02e0a70fe9f8143454b742563433958d4a87f8)
Edwin Török [Thu, 3 Nov 2022 15:31:39 +0000 (15:31 +0000)]
tools/oxenstored: Keep /dev/xen/evtchn open across live update
Closing the evtchn handle will unbind and free all local ports. The new
xenstored would need to rebind all evtchns, which is work that we don't want
or need to be doing during the critical handover period.
However, it turns out that the Windows PV drivers also rebind their local port
too across suspend/resume, leaving (o)xenstored with a stale idea of the
remote port to use. In this case, reusing the established connection is the
only robust option.
Therefore:
* Have oxenstored open /dev/xen/evtchn without CLOEXEC at start of day.
* Extend the handover information with the evtchn fd, domexc virq local port,
and the local port number for each domain connection.
* Have (the new) oxenstored recover the open handle using Xeneventchn.fdopen,
and use the provided local ports rather than trying to rebind them.
When this new information isn't present (i.e. live updating from an oxenstored
prior to this change), the best-effort status quo will have to do.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 9b224c25293a53fcbe32da68052d861dda71a6f4)
Andrew Cooper [Wed, 30 Nov 2022 11:59:34 +0000 (11:59 +0000)]
tools/oxenstored: Rework Domain evtchn handling to use port_pair
Inter-domain event channels are always a pair of local and remote ports.
Right now the handling is asymmetric, caused by the fact that the evtchn is
bound after the associated Domain object is constructed.
First, move binding of the event channel into the Domain.make() constructor.
This means the local port no longer needs to be an option. It also removes
the final callers of Domain.bind_interdomain.
Next, introduce a new port_pair type to encapsulate the fact that these two
should be updated together, and replace the previous port and remote_port
fields. This refactoring also changes the Domain.get_port interface (removing
an option) so take the opportunity to name it get_local_port instead.
Also, this fixes a use-after-free risk with Domain.close. Once the evtchn has
been unbound, the same local port number can be reused for a different
purpose, so explicitly invalidate the ports to prevent their accidental misuse
in the future.
This also cleans up some of the debugging, to always print a port pair.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit df2db174b36eba67c218763ef621c67912202fc6)
Andrew Cooper [Wed, 30 Nov 2022 11:55:58 +0000 (11:55 +0000)]
tools/oxenstored: Implement Domain.rebind_evtchn
Generally speaking, the event channel local/remote port is fixed for the
lifetime of the associated domain object. The exception to this is a
secondary XS_INTRODUCE (defined to re-bind to a new event channel) which pokes
around at the domain object's internal state.
We need to refactor the evtchn handling to support live update, so start by
moving the relevant manipulation into Domain.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit aecdc28d9538ca2a1028ef9bc6550cb171dbbed4)
Andrew Cooper [Wed, 30 Nov 2022 03:17:28 +0000 (03:17 +0000)]
tools/oxenstored: Rename some 'port' variables to 'remote_port'
This will make the logic clearer when we plumb local_port through these
functions.
While doing this, rearrange the construct in Domains.create0 to separate the
remote port handling from the interface handling. (The interface logic is
dubious in several ways, but not altered by this cleanup.)
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 31fbee749a75621039ca601eaee7222050a7dd83)
Andrew Cooper [Tue, 29 Nov 2022 21:05:43 +0000 (21:05 +0000)]
tools/oxenstored: Bind the DOM_EXC VIRQ in in Event.init()
Xenstored always needs to bind the DOM_EXC VIRQ.
Instead of doing it shortly after the call to Event.init(), do it in the
constructor directly. This removes the need for the field to be a mutable
option.
It will also simplify a future change to support live update. Rename the
field from virq_port (which could be any VIRQ) to it's proper name.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 9804a5db435fe40c8ded8cf36c2d2b2281c56f1d)
Andrew Cooper [Wed, 30 Nov 2022 14:56:43 +0000 (14:56 +0000)]
tools/oxenstored: Style fixes to Domain
This file has some style problems so severe that they interfere with the
readability of the subsequent bugfix patches.
Fix these issues ahead of time, to make the subsequent changes more readable.
No functional change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit b45bfaf359e4821b1bf98a4fcd194d7fd176f167)
Edwin Török [Thu, 3 Nov 2022 14:50:38 +0000 (14:50 +0000)]
tools/ocaml/evtchn: Extend the init() binding with a cloexec flag
For live update, oxenstored wants to clear CLOEXEC on the evtchn handle, so it
survives the execve() into the new oxenstored.
Have the new interface match how cloexec works in other Ocaml standard
libraries.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 9bafe4a53306e7aa2ce6ffc96f7477c6f329f7a7)
Edwin Török [Mon, 14 Nov 2022 13:36:19 +0000 (13:36 +0000)]
tools/ocaml/evtchn: Add binding for xenevtchn_fdopen()
For live update, the new oxenstored needs to reconstruct an evtchn object
around an existing file descriptor.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ba68a6c558e1fd811c95cb7215a5cd07a3cc2ea)
There is no binding for xenevtchn_close(). In principle, this is a resource
leak, but the typical usage is as a singleton that lives for the lifetime of
the program.
Ocaml 5 no longer permits storing a naked C pointer in an Ocaml value.
Therefore, use a Custom block. This allows us to use the finaliser callback
to call xenevtchn_close(), if the Ocaml object goes out of scope.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 22d5affdf0cecfa6faae46fbaec68b8018835220)
Andrew Cooper [Fri, 11 Nov 2022 18:50:34 +0000 (18:50 +0000)]
tools/oxenstored: Fix incorrect scope after an if statement
A debug statement got inserted into a single-expression if statement.
Insert brackets to give the intended meaning, rather than the actual meaning
where the "let con = Connections..." is outside and executed unconditionally.
This results in some unnecessary ring checks for domains which otherwise have
IO credit.
Fixes: 42f0581a91d4 ("tools/oxenstored: Implement live update for socket connections") Reported-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit ee36179371fd4215a43fb179be2165f65c1cd1cd)
Edwin Török [Wed, 9 Nov 2022 09:48:33 +0000 (10:48 +0100)]
tools/ocaml/xenstored/store.ml: fix build error
Building with Dune in release mode fails with:
```
File "ocaml/xenstored/store.ml", line 464, characters 13-32:
Warning 18: this type-based record disambiguation is not principal.
File "ocaml/xenstored/store.ml", line 1:
Error: Some fatal warnings were triggered (1 occurrences)
```
This is a warning to help keep the code futureproof, quoting from its
documentation:
> Check information path during type-checking, to make sure that all types are
> derived in a principal way. When using labelled arguments and/or polymorphic
> methods, this flag is required to ensure future versions of the compiler will
> be able to infer types correctly, even if internal algorithms change. All
> programs accepted in -principal mode are also accepted in the default mode with
> equivalent types, but different binary signatures, and this may slow down type
> checking; yet it is a good idea to use it once before publishing source code.
Fixes: db471408edd46 "tools/ocaml/xenstored: Fix quota bypass on domain shutdown" Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit 124492eff8e4acdaaed939fa9406b108c55fec73)
Edwin Török [Fri, 21 Oct 2022 07:59:25 +0000 (08:59 +0100)]
tools/ocaml/xenstored: fix live update exception
During live update we will load the /tool/xenstored path from the previous binary,
and then try to mkdir /tool again which will fail with EEXIST.
Check for existence of the path before creating it.
The write call to /tool/xenstored should not need any changes
(and we do want to overwrite any previous path, in case it changed).
Prior to 7110192b1df6 live update would work only if the binary path was
specified, and with 7110192b1df6 and this live update also works when
no binary path is specified in `xenstore-control live-update`.
Fixes: 7110192b1df6 ("tools/oxenstored: Fix Oxenstored Live Update") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit f838b956779ff8a0b94636462f3c6d95c3adeb73)
Andrew Cooper [Wed, 19 Oct 2022 17:12:33 +0000 (18:12 +0100)]
tools/oxenstored: Fix Oxenstored Live Update
tl;dr This hunk was part of the patch emailed to xen-devel, but was missing
from what ultimately got committed.
https://lore.kernel.org/xen-devel/4164cb728313c3b9fc38cf5e9ecb790ac93a9600.1610748224.git.edvin.torok@citrix.com/
is the patch in question, but was part of a series that had threading issues.
I have a vague recollection that I sourced the commits from a local branch,
which clearly wasn't as up-to-date as I had thought.
Either way, it's my fault/mistake, and this hunk should have been part of what
got comitted.
Fixes: 00c48f57ab36 ("tools/oxenstored: Start live update process") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
(cherry picked from commit 7110192b1df697be84a50f741651d4c3cb129504)
Jan Beulich [Thu, 8 Dec 2022 09:12:41 +0000 (10:12 +0100)]
x86/HVM: don't mark evtchn upcall vector as pending when vLAPIC is disabled
Linux'es relatively new use of HVMOP_set_evtchn_upcall_vector has
exposed a problem with the marking of the respective vector as
pending: For quite some time Linux has been checking whether any stale
ISR or IRR bits would still be set while preparing the LAPIC for use.
This check is now triggering on the upcall vector, as the registration,
at least for APs, happens before the LAPIC is actually enabled.
In software-disabled state an LAPIC would not accept any interrupt
requests and hence no IRR bit would newly become set while in this
state. As a result it is also wrong for us to mark the upcall vector as
having a pending request when the vLAPIC is in this state.
To compensate for the "enabled" check added to the assertion logic, add
logic to (conditionally) mark the upcall vector as having a request
pending at the time the LAPIC is being software-enabled by the guest.
Note however that, like for the pt_may_unmask_irq() we already have
there, long term we may need to find a different solution. This will be
especially relevant in case yet better LAPIC acceleration would
eliminate notifications of guest writes to this and other registers.
Fixes: 7b5b8ca7dffd ("x86/upcall: inject a spurious event after setting upcall vector") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Juergen Gross <jgross@suse.com>
master commit: f5d0279839b58cb622f0995dbf9cff056f03082e
master date: 2022-12-06 13:51:49 +0100
Jan Beulich [Thu, 8 Dec 2022 09:12:02 +0000 (10:12 +0100)]
x86/Viridian: don't mark IRQ vectors as pending when vLAPIC is disabled
In software-disabled state an LAPIC does not accept any interrupt
requests and hence no IRR bit would newly become set while in this
state. As a result it is also wrong for us to mark Viridian IPI or timer
vectors as having a pending request when the vLAPIC is in this state.
Such interrupts are simply lost.
Introduce a local variable in send_ipi() to help readability.
Fixes: fda96b7382ea ("viridian: add implementation of the HvSendSyntheticClusterIpi hypercall") Fixes: 26fba3c85571 ("viridian: add implementation of synthetic timers") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 831419f82913417dee4e5b0f80769c5db590540b
master date: 2022-12-02 10:35:32 +0100
Jan Beulich [Thu, 8 Dec 2022 09:11:32 +0000 (10:11 +0100)]
x86/HVM: don't mark external IRQs as pending when vLAPIC is disabled
In software-disabled state an LAPIC does not accept any interrupt
requests and hence no IRR bit would newly become set while in this
state. As a result it is also wrong for us to mark IO-APIC or MSI
originating vectors as having a pending request when the vLAPIC is in
this state. Such interrupts are simply lost.
Introduce (IO-APIC) or re-use (MSI) a local variable to help
readability.
Fixes: 4fe21ad3712e ("This patch add virtual IOAPIC support for VMX guest") Fixes: 85715f4bc7c9 ("MSI 5/6: add MSI support to passthrough HVM domain") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f1d7aac1e3c3cd164e17d41791a575a5c3e87121
master date: 2022-12-02 10:35:01 +0100
Roger Pau Monné [Thu, 8 Dec 2022 09:10:50 +0000 (10:10 +0100)]
x86/pvh: do not forward MADT Local APIC NMI structures to dom0
Currently Xen will passthrough any Local APIC NMI Structure found in
the native ACPI MADT table to a PVH dom0. This is wrong because PVH
doesn't have access to the physical local APIC, and instead gets an
emulated local APIC by Xen, that doesn't have the LINT0 or LINT1
pins wired to anything. Furthermore the ACPI Processor UIDs used in
the APIC NMI Structures are likely to not match the ones generated by
Xen for the Local x2APIC Structures, creating confusion to dom0.
Fix this by removing the logic to passthrough the Local APIC NMI
Structure for PVH dom0.
Roger Pau Monné [Thu, 8 Dec 2022 09:10:00 +0000 (10:10 +0100)]
x86/irq: do not release irq until all cleanup is done
Current code in _clear_irq_vector() will mark the irq as unused before
doing the cleanup required when move_in_progress is true.
This can lead to races in create_irq() if the function picks an irq
desc that's been marked as unused but has move_in_progress set, as the
call to assign_irq_vector() in that function can then fail with
-EAGAIN.
Prevent that by only marking irq descs as unused when all the cleanup
has been done. While there also use write_atomic() when setting
IRQ_UNUSED in _clear_irq_vector() and add a barrier in order to
prevent the setting of IRQ_UNUSED getting reordered by the compiler.
The check for move_in_progress cannot be removed from
_assign_irq_vector(), as other users (io_apic_set_pci_routing() and
ioapic_guest_write()) can still pass active irq descs to
assign_irq_vector().
Note the trace point is not moved and is now set before the irq is
marked as unused. This is done so that the CPU mask provided in the
trace point is the one belonging to the current vector, not the old
one.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: e267d11969a40f0aec33dbf966f5a6490b205f43
master date: 2022-12-02 10:32:21 +0100
Andrew Cooper [Thu, 20 Oct 2022 11:14:30 +0000 (12:14 +0100)]
xen/arm: Correct the p2m pool size calculations
Allocating or freeing p2m pages doesn't alter the size of the mempool; only
the split between free and used pages.
Right now, the hypercalls operate on the free subset of the pool, meaning that
XEN_DOMCTL_get_paging_mempool_size varies with time as the guest shuffles its
physmap, and XEN_DOMCTL_set_paging_mempool_size ignores the used subset of the
pool and lets the guest grow unbounded.
This fixes test-pagign-mempool on ARM so that the behaviour matches x86.
This is part of XSA-409 / CVE-2022-33747.
Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Release-acked-by: Henry Wang <Henry.Wang@arm.com>
Commit 34990446ca91 started to overwrite the `rc` value from
libxl__arch_domain_create(), thus error aren't propagated anymore.
Check `rc` value before doing the next thing.
Fixes: 34990446ca91 ("libxl: don't ignore the return value from xc_cpuid_apply_policy") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jason Andryuk <jandryuk@gmail.com>
master commit: 8cdfbf95b19c01fbb741c41d5ea5a94f8823964c
master date: 2022-11-21 12:23:01 +0100
Roger Pau Monné [Mon, 28 Nov 2022 10:45:45 +0000 (11:45 +0100)]
efifb: ignore frame buffer with invalid configuration
On one of my boxes when the HDMI cable is not plugged in the
FrameBufferBase of the EFI_GRAPHICS_OUTPUT_PROTOCOL_MODE structure is
set to 0 by the firmware (while some of the other fields looking
plausible).
Such (bogus address) ends up mapped in vesa_init(), and since it
overlaps with a RAM region the whole system goes down pretty badly,
see:
(XEN) vesafb: framebuffer at 0x0000000000000000, mapped to 0xffff82c000201000, using 35209k, total 35209k
(XEN) vesafb: mode is 0x37557x32, linelength=960, font 8x16
(XEN) vesafb: Truecolor: size=8:8:8:8, shift=24:0:8:16
(XEN) (XEN) (XEN) (XEN) (XEN) (XEN) (XEN) (XEN) �ERROR: Class:0; Subclass:0; Operation: 0
ERROR: No ConOut
ERROR: No ConIn
Do like Linux and prevent using the EFI Frame Buffer if the base
address is 0. This is inline with the logic in Linuxes
fb_base_is_valid() function at drivers/video/fbdev/efifb.c v6.0.9.
Juergen Gross [Wed, 9 Nov 2022 10:00:04 +0000 (11:00 +0100)]
xen/sched: migrate timers to correct cpus after suspend
Today all timers are migrated to cpu 0 when the system is being
suspended. They are not migrated back after resuming the system again.
This results (at least) to visible problems with the credit scheduler,
as the timer isn't handled on the cpu it was expected to occur, which
will result in an ASSERT() triggering. Other more subtle problems, like
uninterrupted elongated time slices, are probable. The least effect
will be worse performance on cpu 0 resulting from most scheduling
related timer interrupts happening there after suspend/resume.
Add migrating the scheduling related timers of a specific cpu from cpu
0 back to its original cpu when that cpu has gone up when resuming the
system.
Fixes: 0763cd268789 ("xen/sched: don't disable scheduler on cpus during suspend") Signed-off-by: Juergen Gross <jgross@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Acked-by: Dario Faggioli <dfaggioli@suse.com>
master commit: 37f82facd62f720fdcec104f72f86b8c6c214820
master date: 2022-11-04 09:03:23 +0100
Juergen Gross [Wed, 9 Nov 2022 09:59:42 +0000 (10:59 +0100)]
tools/xenstore: call remove_domid_from_perm() for special nodes
When destroying a domain, any stale permissions of the domain must be
removed from the special nodes "@...", too. This was not done in the
fix for XSA-322.
Andrew Cooper [Tue, 14 Jun 2022 15:18:36 +0000 (16:18 +0100)]
x86/spec-ctrl: Mitigate IBPB not flushing the RSB/RAS
Introduce spec_ctrl_new_guest_context() to encapsulate all logic pertaining to
using MSR_PRED_CMD for a new guest context, even if it only has one user
presently.
Introduce X86_BUG_IBPB_NO_RET, and use it extend spec_ctrl_new_guest_context()
with a manual fixup for hardware which mis-implements IBPB.
This is part of XSA-422 / CVE-2022-23824.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 2b27967fb89d7904a1571a2fb963b1c9cac548db)
Andrew Cooper [Tue, 14 Jun 2022 15:18:36 +0000 (16:18 +0100)]
x86/spec-ctrl: Enumeration for IBPB_RET
The IBPB_RET bit indicates that the CPU's implementation of MSR_PRED_CMD.IBPB
does flush the RSB/RAS too.
This is part of XSA-422 / CVE-2022-23824.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 24496558e650535bdbd22cc04731e82276cd1b3f)
tools/xenstore: harden transaction finalization against errors
When finalizing a transaction, any error occurring after checking for
conflicts will result in the transaction being performed only
partially today. Additionally accounting data will not be updated at
the end of the transaction, which might result in further problems
later.
Avoid those problems by multiple modifications:
- free any transaction specific nodes which don't need to be committed
as they haven't been written during the transaction as soon as their
generation count has been verified, this will reduce the risk of
out-of-memory situations
- store the transaction specific node name in struct accessed_node in
order to avoid the need to allocate additional memory for it when
finalizing the transaction
- don't stop the transaction finalization when hitting an error
condition, but try to continue to handle all modified nodes
- in case of a detected error do the accounting update as needed and
call the data base checking only after that
- if writing a node in a transaction is failing (e.g. due to a failed
quota check), fail the transaction, as prior changes to struct
accessed_node can't easily be undone in that case
In case a node has been created in a transaction and it is later
deleted in the same transaction, the transaction will be terminated
with an error.
As this error is encountered only when handling the deleted node at
transaction finalization, the transaction will have been performed
partially and without updating the accounting information. This will
enable a malicious guest to create arbitrary number of nodes.
Edwin Török [Wed, 12 Oct 2022 18:13:05 +0000 (19:13 +0100)]
tools/ocaml: Ensure packet size is never negative
Integers in Ocaml have 63 or 31 bits of signed precision.
On 64-bit builds of Ocaml, this is fine because a C uint32_t always fits
within a 63-bit signed integer.
In 32-bit builds of Ocaml, this goes wrong. The C uint32_t is truncated
first (loses the top bit), then has a unsigned/signed mismatch.
A "negative" value (i.e. a packet on the ring of between 1G and 2G in size)
will trigger an exception later in Bytes.make in xb.ml, and because the packet
is not removed from the ring, the exception re-triggers on every subsequent
query, creating a livelock.
Fix both the source of the exception in Xb, and as defence in depth, mark the
domain as bad for any Invalid_argument exceptions to avoid the risk of
livelock.
This is XSA-420 / CVE-2022-42324.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit ae34df4d82636f4c82700b447ea2c93b9f82b3f3)
Edwin Török [Wed, 12 Oct 2022 18:13:06 +0000 (19:13 +0100)]
tools/ocaml/xenstored: Fix quota bypass on domain shutdown
XSA-322 fixed a domid reuse vulnerability by assigning Dom0 as the owner of
any nodes left after a domain is shutdown (e.g. outside its /local/domain/N
tree).
However Dom0 has no quota on purpose, so this opened up another potential
attack vector. Avoid it by deleting these nodes instead of assigning them to
Dom0.
This is part of XSA-419 / CVE-2022-42323.
Fixes: c46eff921209 ("tools/ocaml/xenstored: clean up permissions for dead domains") Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit db471408edd46af403b8bd44d180a928ad7fbb80)
tools/xenstore: make the internal memory data base the default
Having a file backed data base has the only advantage of being capable
to dump the contents of it while Xenstore is running, and potentially
using less swap space in case the data base can't be kept in memory.
It has the major disadvantage of a huge performance overhead: switching
to keep the data base in memory only speeds up live update of xenstored
with 120000 nodes from 20 minutes to 11 seconds. A complete tree walk
of this configuration will be reduced from 7 seconds to 280 msecs
(measured by "xenstore-control check").
So make the internal memory data base the default and enhance the
"--internal-db" command line parameter to take an optional parameter
allowing to switch the internal data base back to the file based one.
tools/xenstore: remove nodes owned by destroyed domain
In case a domain is removed from Xenstore, remove all nodes owned by
it per default.
This tackles the problem that nodes might be created by a domain
outside its home path in Xenstore, leading to Xenstore hogging more
and more memory. Domain quota don't work in this case if the guest is
rebooting in between.
Since XSA-322 ownership of such stale nodes is transferred to dom0,
which is helping against unintended access, but not against OOM of
Xenstore.
As a fallback for weird cases add a Xenstore start parameter for
keeping today's way to handle stale nodes, adding the risk of Xenstore
hitting an OOM situation.
This is part of XSA-419 / CVE-2022-42322.
Fixes: 496306324d8d ("tools/xenstore: revoke access rights for removed domains") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 755d3f9debf8879448211fffb018f556136f6a79)
check_store() is using a hash table for storing all node names it has
found via walking the tree. Additionally it using another hash table
for all children of a node to detect duplicate child names.
Simplify that by dropping the second hash table as the first one is
already holding all the needed information.
Add a generic function to walk the complete node tree. It will start
at "/" and descend recursively into each child, calling a function
specified by the caller. Depending on the return value of the user
specified function the walk will be aborted, continued, or the current
child will be skipped by not descending into its children.
tools/xenstore: don't let remove_child_entry() call corrupt()
In case of write_node() returning an error, remove_child_entry() will
call corrupt() today. This could result in an endless recursion, as
remove_child_entry() is called by corrupt(), too:
Today chk_domain_generation() is being used to check whether a node
permission entry is still valid or whether it is referring to a domain
no longer existing. This is done by comparing the node's and the
domain's generation count.
In case no struct domain is existing for a checked domain, but the
domain itself is valid, chk_domain_generation() assumes it is being
called due to the first node created for a new domain and it will
return success.
This might be wrong in case the checked permission is related to an
old domain, which has just been replaced with a new domain using the
same domid.
Fix that by letting chk_domain_generation() fail in case a struct
domain isn't found. In order to cover the case of the first node for
a new domain try to allocate the needed struct domain explicitly when
processing the related SET_PERMS command. In case a referenced domain
isn't existing, flag the related permission to be ignored right away.
tools/xenstore: don't use conn->in as context for temporary allocations
Using the struct buffered data pointer of the current processed request
for temporary data allocations has a major drawback: the used area (and
with that the temporary data) is freed only after the response of the
request has been written to the ring page or has been read via the
socket. This can happen much later in case a guest isn't reading its
responses fast enough.
As the temporary data can be safely freed after creating the response,
add a temporary context for that purpose and use that for allocating
the temporary memory, as it was already the case before commit cc0612464896 ("xenstore: add small default data buffer to internal
struct").
Some sub-functions need to gain the "const" attribute for the talloc
context.
This is XSA-416 / CVE-2022-42319.
Fixes: cc0612464896 ("xenstore: add small default data buffer to internal struct") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit 2a587de219cc0765330fbf9fac6827bfaf29e29b)
SUPPORT.md: clarify support of untrusted driver domains with oxenstored
Add a support statement for the scope of support regarding different
Xenstore variants. Especially oxenstored does not (yet) have security
support of untrusted driver domains, as those might drive oxenstored
out of memory by creating lots of watch events for the guests they are
servicing.
Add a statement regarding Live Update support of oxenstored.
This is part of XSA-326.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit c7bc20d8d123851a468402bbfc9e3330efff21ec)
Edwin Török [Wed, 12 Oct 2022 18:13:04 +0000 (19:13 +0100)]
tools/ocaml: Limit maximum in-flight requests / outstanding replies
Introduce a limit on the number of outstanding reply packets in the xenbus
queue. This limits the number of in-flight requests: when the output queue is
full we'll stop processing inputs until the output queue has room again.
To avoid a busy loop on the Unix socket we only add it to the watched input
file descriptor set if we'd be able to call `input` on it. Even though Dom0
is trusted and exempt from quotas a flood of events might cause a backlog
where events are produced faster than daemons in Dom0 can consume them, which
could lead to an unbounded queue size and OOM.
Therefore the xenbus queue limit must apply to all connections, Dom0 is not
exempt from it, although if everything works correctly it will eventually
catch up.
This prevents a malicious guest from sending more commands while it has
outstanding watch events or command replies in its input ring. However if it
can cause the generation of watch events by other means (e.g. by Dom0, or
another cooperative guest) and stop reading its own ring then watch events
would've queued up without limit.
The xenstore protocol doesn't have a back-pressure mechanism, and doesn't
allow dropping watch events. In fact, dropping watch events is known to break
some pieces of normal functionality. This leaves little choice to safely
implement the xenstore protocol without exposing the xenstore daemon to
out-of-memory attacks.
Implement the fix as pipes with bounded buffers:
* Use a bounded buffer for watch events
* The watch structure will have a bounded receiving pipe of watch events
* The source will have an "overflow" pipe of pending watch events it couldn't
deliver
Items are queued up on one end and are sent as far along the pipe as possible:
source domain -> watch -> xenbus of target -> xenstore ring/socket of target
If the pipe is "full" at any point then back-pressure is applied and we prevent
more items from being queued up. For the source domain this means that we'll
stop accepting new commands as long as its pipe buffer is not empty.
Before we try to enqueue an item we first check whether it is possible to send
it further down the pipe, by attempting to recursively flush the pipes. This
ensures that we retain the order of events as much as possible.
We might break causality of watch events if the target domain's queue is full
and we need to start using the watch's queue. This is a breaking change in
the xenstore protocol, but only for domains which are not processing their
incoming ring as expected.
When a watch is deleted its entire pending queue is dropped (no code is needed
for that, because it is part of the 'watch' type).
There is a cache of watches that have pending events that we attempt to flush
at every cycle if possible.
Introduce 3 limits here:
* quota-maxwatchevents on watch event destination: when this is hit the
source will not be allowed to queue up more watch events.
* quota-maxoustanding which is the number of responses not read from the ring:
once exceeded, no more inputs are processed until all outstanding replies
are consumed by the client.
* overflow queue on the watch event source: all watches that cannot be stored
on destination are queued up here, a single command can trigger multiple
watches (e.g. due to recursion).
The overflow queue currently doesn't have an upper bound, it is difficult to
accurately calculate one as it depends on whether you are Dom0 and how many
watches each path has registered and how many watch events you can trigger
with a single command (e.g. a commit). However these events were already
using memory, this just moves them elsewhere, and as long as we correctly
block a domain it shouldn't result in unbounded memory usage.
Note that Dom0 is not excluded from these checks, it is important that Dom0 is
especially not excluded when it is the source, since there are many ways in
which a guest could trigger Dom0 to send it watch events.
This should protect against malicious frontends as long as the backend follows
the PV xenstore protocol and only exposes paths needed by the frontend, and
changes those paths at most once as a reaction to guest events, or protocol
state.
The queue limits are per watch, and per domain-pair, so even if one
communication channel would be "blocked", others would keep working, and the
domain itself won't get blocked as long as it doesn't overflow the queue of
watch events.
Similarly a malicious backend could cause the frontend to get blocked, but
this watch queue protects the frontend as well as long as it follows the PV
protocol. (Although note that protection against malicious backends is only a
best effort at the moment)
This is part of XSA-326 / CVE-2022-42318.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 9284ae0c40fb5b9606947eaaec23dc71d0540e96)
Edwin Török [Wed, 12 Oct 2022 18:13:03 +0000 (19:13 +0100)]
tools/ocaml/xb: Add BoundedQueue
Ensures we cannot store more than [capacity] elements in a [Queue]. Replacing
all Queue with this module will then ensure at compile time that all Queues
are correctly bound checked.
Each element in the queue has a class with its own limits. This, in a
subsequent change, will ensure that command responses can proceed during a
flood of watch events.
No functional change.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 19171fb5d888b4467a7073e8febc5e05540956e9)
Edwin Török [Wed, 12 Oct 2022 18:13:02 +0000 (19:13 +0100)]
tools/ocaml: Change Xb.input to return Packet.t option
The queue here would only ever hold at most one element. This will simplify
follow-up patches.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit c0a86a462721008eca5ff733660de094d3c34bc7)
Edwin Török [Fri, 29 Jul 2022 17:53:29 +0000 (18:53 +0100)]
tools/ocaml/libs/xb: hide type of Xb.t
Hiding the type will make it easier to change the implementation
in the future without breaking code that relies on it.
No functional change.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 7ade30a1451734d041363c750a65d322e25b47ba)
Edwin Török [Wed, 12 Oct 2022 18:13:07 +0000 (19:13 +0100)]
tools/ocaml: GC parameter tuning
By default the OCaml garbage collector would return memory to the OS only
after unused memory is 5x live memory. Tweak this to 120% instead, which
would match the major GC speed.
This is part of XSA-326.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 4a8bacff20b857ca0d628ef5525877ade11f2a42)
Edwin Török [Thu, 28 Jul 2022 16:08:15 +0000 (17:08 +0100)]
tools/ocaml/xenstored: Check for maxrequests before performing operations
Previously we'd perform the operation, record the updated tree in the
transaction record, then try to insert a watchop path and the reply packet.
If we exceeded max requests we would've returned EQUOTA, but still:
* have performed the operation on the transaction's tree
* have recorded the watchop, making this queue effectively unbounded
It is better if we check whether we'd have room to store the operation before
performing the transaction, and raise EQUOTA there. Then the transaction
record won't grow.
This is part of XSA-326 / CVE-2022-42317.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 329f4d1a6535c6c5a34025ca0d03fc5c7228fcff)
Edwin Török [Wed, 12 Oct 2022 18:13:01 +0000 (19:13 +0100)]
tools/ocaml/xenstored: Synchronise defaults with oxenstore.conf.in
We currently have 2 different set of defaults in upstream Xen git tree:
* defined in the source code, only used if there is no config file
* defined in the oxenstored.conf.in upstream Xen
An oxenstored.conf file is not mandatory, and if missing, maxrequests in
particular has an unsafe default.
Resync the defaults from oxenstored.conf.in into the source code.
This is part of XSA-326 / CVE-2022-42316.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 84734955d4bf629ba459a74773afcde50a52236f)
tools/xenstore: add control command for setting and showing quota
Add a xenstore-control command "quota" to:
- show current quota settings
- change quota settings
- show current quota related values of a domain
Note that in the case the new quota is lower than existing one,
Xenstored may continue to handle requests from a domain exceeding the
new limit (depends on which one has been broken) and the amount of
resource used will not change. However the domain will not be able to
create more resource (associated to the quota) until it is back to below
the limit.
Add the memory accounting for Xenstore nodes. In order to make this
not too complicated allow for some sloppiness when writing nodes. Any
hard quota violation will result in no further requests to be accepted.
When a socket connection is destroyed, the associated watches are
removed, too. In order to keep memory accounting correct the watches
must be removed explicitly via a call of conn_delete_all_watches() from
destroy_conn().
tools/xenstore: add memory accounting for responses
Add the memory accounting for queued responses.
In case adding a watch event for a guest is causing the hard memory
quota of that guest to be violated, the event is dropped. This will
ensure that it is impossible to drive another guest past its memory
quota by generating insane amounts of events for that guest. This is
especially important for protecting driver domains from that attack
vector.
tools/xenstore: add infrastructure to keep track of per domain memory usage
The amount of memory a domain can consume in Xenstore is limited by
various quota today, but even with sane quota a domain can still
consume rather large memory quantities.
Add the infrastructure for keeping track of the amount of memory a
domain is consuming in Xenstore. Note that this is only the memory a
domain has direct control over, so any internal administration data
needed by Xenstore only is not being accounted for.
There are two quotas defined: a soft quota which will result in a
warning issued via syslog() when it is exceeded, and a hard quota
resulting in a stop of accepting further requests or watch events as
long as the hard quota would be violated by accepting those.
Setting any of those quotas to 0 will disable it.
As default values use 2MB per domain for the soft limit (this basically
covers the allowed case to create 1000 nodes needing 2kB each), and
2.5MB for the hard limit.
tools/xenstore: limit max number of nodes accessed in a transaction
Today a guest is free to access as many nodes in a single transaction
as it wants. This can lead to unbounded memory consumption in Xenstore
as there is the need to keep track of all nodes having been accessed
during a transaction.
In oxenstored the number of requests in a transaction is being limited
via a quota maxrequests (default is 1024). As multiple accesses of a
node are not problematic in C Xenstore, limit the number of accessed
nodes.
In order to let read_node() detect a quota error in case too many nodes
are being accessed, check the return value of access_node() and return
NULL in case an error has been seen. Introduce __must_check and add it
to the access_node() prototype.
tools/xenstore: simplify and fix per domain node accounting
The accounting of nodes can be simplified now that each connection
holds the associated domid.
Fix the node accounting to cover nodes created for a domain before it
has been introduced. This requires to react properly to an allocation
failure inside domain_entry_inc() by returning an error code.
Especially in error paths the node accounting has to be fixed in some
cases.
A guest not reading its Xenstore response buffer fast enough might
pile up lots of Xenstore watch events buffered. Reduce the generated
load by dropping new events which already have an identical copy
pending.
The special events "@..." are excluded from that handling as there are
known use cases where the handler is relying on each event to be sent
individually.
Add another quota for limiting the number of outstanding requests of a
guest. As the way to specify quotas on the command line is becoming
rather nasty, switch to a new scheme using [--quota|-Q] <what>=<val>
allowing to add more quotas in future easily.
Set the default value to 20 (basically a random value not seeming to
be too high or too low).
A request is said to be outstanding if any message generated by this
request (the direct response plus potential watch events) is not yet
completely stored into a ring buffer. The initial watch event sent as
a result of registering a watch is an exception.
Note that across a live update the relation to buffered watch events
for other domains is lost.
Use talloc_zero() for allocating the domain structure in order to have
all per-domain quota zeroed initially.
A future modification will limit the number of outstanding requests
for a domain, where "outstanding" means that the response of the
request or any resulting watch event hasn't been consumed yet.
In order to avoid a malicious guest being capable to block other guests
by not reading watch events, add a timeout for watch events. In case a
watch event hasn't been consumed after this timeout, it is being
deleted. Set the default timeout to 20 seconds (a random value being
not too high).
In order to support to specify other timeout values in future, use a
generic command line option for that purpose:
When removing a watched node outside of a transaction, two watch events
are being produced instead of just a single one.
When finalizing a transaction watch events can be generated for each
node which is being modified, even if outside a transaction such
modifications might not have resulted in a watch event.
This happens e.g.:
- for nodes which are only modified due to added/removed child entries
- for nodes being removed or created implicitly (e.g. creation of a/b/c
is implicitly creating a/b, resulting in watch events for a, a/b and
a/b/c instead of a/b/c only)
Avoid these additional watch events, in order to reduce the needed
memory inside Xenstore for queueing them.
This is being achieved by adding event flags to struct accessed_node
specifying whether an event should be triggered, and whether it should
be an exact match of the modified path. Both flags can be set from
fire_watches() instead of implying them only.
tools/xenstore: add helpers to free struct buffered_data
Add two helpers for freeing struct buffered_data: free_buffered_data()
for freeing one instance and conn_free_buffered_data() for freeing all
instances for a connection.
This is avoiding duplicated code and will help later when more actions
are needed when freeing a struct buffered_data.
tools/xenstore: Fail a transaction if it is not possible to create a node
Commit f2bebf72c4d5 "xenstore: rework of transaction handling" moved
out from copying the entire database everytime a new transaction is
opened to track the list of nodes changed.
The content of all the nodes accessed during a transaction will be
temporarily stored in TDB using a different key.
The function create_node() may write/update multiple nodes if the child
doesn't exist. In case of a failure, the function will revert any
changes (this include any update to TDB). Unfortunately, the function
which reverts the changes (i.e. destroy_node()) will not use the correct
key to delete any update or even request the transaction to fail.
This means that if a client decide to go ahead with committing the
transaction, orphan nodes will be created because they were not linked
to an existing node (create_node() will write the nodes backwards).
Once some nodes have been partially updated in a transaction, it is not
easily possible to undo any changes. So rather than continuing and hit
weird issue while committing, it is much saner to fail the transaction.
This will have an impact on any client that decides to commit even if it
can't write a node. Although, it is not clear why a normal client would
want to do that...
Lastly, update destroy_node() to use the correct key for deleting the
node. Rather than recreating it (this will allocate memory and
therefore fail), stash the key in the structure node.
tools/xenstore: create_node: Don't defer work to undo any changes on failure
XSA-115 extended destroy_node() to update the node accounting for the
connection. The implementation is assuming the connection is the parent
of the node, however all the nodes are allocated using a separate context
(see process_message()). This will result to crash (or corrupt) xenstored
as the pointer is wrongly used.
In case of an error, any changes to the database or update to the
accounting will now be reverted in create_node() by calling directly
destroy_node(). This has the nice advantage to remove the loop to unset
the destructors in case of success.
Take the opportunity to free the nodes right now as they are not
going to be reachable (the function returns NULL) and are just wasting
resources.
Andrew Cooper [Wed, 24 Aug 2022 13:16:44 +0000 (14:16 +0100)]
x86/vmx: Revert "VMX: use a single, global APIC access page"
The claim "No accesses would ever go to this page." is false. A consequence
of how Intel's APIC Acceleration works, and Xen's choice to have per-domain
P2Ms (rather than per-vCPU P2Ms) means that the APIC page is fully read-write
to any vCPU which is not in xAPIC mode.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 3b5beaf49033cddf4b2cc4e4d391b966f4203471)
Igor Druzhinin [Mon, 31 Oct 2022 12:28:46 +0000 (13:28 +0100)]
x86/pv-shim: correct ballooning down for compat guests
The compat layer for multi-extent memory ops may need to split incoming
requests. Since the guest handles in the interface structures may not be
altered, it does so by leveraging do_memory_op()'s continuation
handling: It hands on non-initial requests with a non-zero start extent,
with the (native) handle suitably adjusted down. As a result
do_memory_op() sees only the first of potentially several requests with
start extent being zero. In order to be usable as overall result, the
function accumulates args.nr_done, i.e. it initialized the field with
the start extent. Therefore non-initial requests resulting from the
split would pass too large a number into pv_shim_offline_memory().
Address that breakage by always calling pv_shim_offline_memory()
regardless of current hypercall preemption status, with a suitably
adjusted first argument. Note that this is correct also for the native
guest case: We now simply "commit" what was completed right away, rather
than at the end of a series of preemption/re-start cycles. In fact this
improves overall preemption behavior: There's no longer a potentially
big chunk of work done non-preemptively at the end of the last
"iteration".
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 1d7fbc535d1d37bdc2cc53ede360b0f6651f7de1
master date: 2022-10-28 15:49:33 +0200
Igor Druzhinin [Mon, 31 Oct 2022 12:28:15 +0000 (13:28 +0100)]
x86/pv-shim: correct ballooning up for compat guests
The compat layer for multi-extent memory ops may need to split incoming
requests. Since the guest handles in the interface structures may not be
altered, it does so by leveraging do_memory_op()'s continuation
handling: It hands on non-initial requests with a non-zero start extent,
with the (native) handle suitably adjusted down. As a result
do_memory_op() sees only the first of potentially several requests with
start extent being zero. It's only that case when the function would
issue a call to pv_shim_online_memory(), yet the range then covers only
the first sub-range that results from the split.
Address that breakage by making a complementary call to
pv_shim_online_memory() in compat layer.
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a0bfdd201ea12aa5679bb8944d63a4e0d3c23160
master date: 2022-10-28 15:48:50 +0200
Mem-op requests may have zero extents. Such requests need treating as
no-ops. pv_shim_online_memory(), however, would have tried to take 2³²-1
order-sized pages from its balloon list (to then populate them),
typically ending when the entire set of ballooned pages of this order
was consumed.
Note that pv_shim_offline_memory() does not have such an issue.
Fixes: b2245acc60c3 ("xen/pvshim: memory hotplug") Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9272225ca72801fd9fa5b268a2d1c5adebd19cd9
master date: 2022-10-28 15:47:59 +0200
Jan Beulich [Mon, 31 Oct 2022 12:26:33 +0000 (13:26 +0100)]
common: map_vcpu_info() wants to unshare the underlying page
Not passing P2M_UNSHARE to get_page_from_gfn() means there won't even be
an attempt to unshare the referenced page, without any indication to the
caller (e.g. -EAGAIN). Note that guests have no direct control over
which of their pages are shared (or paged out), and hence they have no
way to make sure all on their own that the subsequent obtaining of a
writable type reference can actually succeed.
Jan Beulich [Mon, 31 Oct 2022 12:26:08 +0000 (13:26 +0100)]
x86: also zap secondary time area handles during soft reset
Just like domain_soft_reset() properly zaps runstate area handles, the
secondary time area ones also need discarding to prevent guest memory
corruption once the guest is re-started.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b80d4f8d2ea6418e32fb4f20d1304ace6d6566e3
master date: 2022-10-27 11:49:09 +0200
Roger Pau Monné [Mon, 31 Oct 2022 12:25:40 +0000 (13:25 +0100)]
vpci/msix: remove from table list on detach
Teardown of MSIX vPCI related data doesn't currently remove the MSIX
device data from the list of MSIX tables handled by the domain,
leading to a use-after-free of the data in the msix structure.
Remove the structure from the list before freeing in order to solve
it.
Reported-by: Jan Beulich <jbeulich@suse.com> Fixes: d6281be9d0 ('vpci/msix: add MSI-X handlers') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: c14aea137eab29eb9c30bfad745a00c65ad21066
master date: 2022-10-26 14:56:58 +0200
Roger Pau Monné [Mon, 31 Oct 2022 12:25:13 +0000 (13:25 +0100)]
vpci: don't assume that vpci per-device data exists unconditionally
It's possible for a device to be assigned to a domain but have no
vpci structure if vpci_process_pending() failed and called
vpci_remove_device() as a result. The unconditional accesses done by
vpci_{read,write}() and vpci_remove_device() to pdev->vpci would
then trigger a NULL pointer dereference.
Add checks for pdev->vpci presence in the affected functions.
Jan Beulich [Mon, 31 Oct 2022 12:24:33 +0000 (13:24 +0100)]
x86/shadow: drop (replace) bogus assertions
The addition of a call to shadow_blow_tables() from shadow_teardown()
has resulted in the "no vcpus" related assertion becoming triggerable:
If domain_create() fails with at least one page successfully allocated
in the course of shadow_enable(), or if domain_create() succeeds and
the domain is then killed without ever invoking XEN_DOMCTL_max_vcpus.
Note that in-tree tests (test-resource and test-tsx) do exactly the
latter of these two.
The assertion's comment was bogus anyway: Shadow mode has been getting
enabled before allocation of vCPU-s for quite some time. Convert the
assertion to a conditional: As long as there are no vCPU-s, there's
nothing to blow away.
Fixes: e7aa55c0aab3 ("x86/p2m: free the paging memory pool preemptively") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
A similar assertion/comment pair exists in _shadow_prealloc(); the
comment is similarly bogus, and the assertion could in principle trigger
e.g. when shadow_alloc_p2m_page() is called early enough. Replace those
at the same time by a similar early return, here indicating failure to
the caller (which will generally lead to the domain being crashed in
shadow_prealloc()).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: a92dc2bb30ba65ae25d2f417677eb7ef9a6a0fef
master date: 2022-10-24 15:46:11 +0200
Juergen Gross [Mon, 31 Oct 2022 12:23:50 +0000 (13:23 +0100)]
xen/sched: fix restore_vcpu_affinity() by removing it
When the system is coming up after having been suspended,
restore_vcpu_affinity() is called for each domain in order to adjust
the vcpu's affinity settings in case a cpu didn't come to live again.
The way restore_vcpu_affinity() is doing that is wrong, because the
specific scheduler isn't being informed about a possible migration of
the vcpu to another cpu. Additionally the migration is often even
happening if all cpus are running again, as it is done without check
whether it is really needed.
As cpupool management is already calling cpu_disable_scheduler() for
cpus not having come up again, and cpu_disable_scheduler() is taking
care of eventually needed vcpu migration in the proper way, there is
simply no need for restore_vcpu_affinity().
So just remove restore_vcpu_affinity() completely, together with the
no longer used sched_reset_affinity_broken().
Fixes: 8a04eaa8ea83 ("xen/sched: move some per-vcpu items to struct sched_unit") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Dario Faggioli <dfaggioli@suse.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: fce1f381f7388daaa3e96dbb0d67d7a3e4bb2d2d
master date: 2022-10-24 11:16:27 +0100
Juergen Gross [Mon, 31 Oct 2022 12:22:54 +0000 (13:22 +0100)]
xen/sched: fix race in RTDS scheduler
When a domain gets paused the unit runnable state can change to "not
runnable" without the scheduling lock being involved. This means that
a specific scheduler isn't involved in this change of runnable state.
In the RTDS scheduler this can result in an inconsistency in case a
unit is losing its "runnable" capability while the RTDS scheduler's
scheduling function is active. RTDS will remove the unit from the run
queue, but doesn't do so for the replenish queue, leading to hitting
an ASSERT() in replq_insert() later when the domain is unpaused again.
Fix that by removing the unit from the replenish queue as well in this
case.
Jan Beulich [Mon, 31 Oct 2022 12:22:17 +0000 (13:22 +0100)]
EFI: don't convert memory marked for runtime use to ordinary RAM
efi_init_memory() in both relevant places is treating EFI_MEMORY_RUNTIME
higher priority than the type of the range. To avoid accessing memory at
runtime which was re-used for other purposes, make
efi_arch_process_memory_map() follow suit. While in theory the same would
apply to EfiACPIReclaimMemory, we don't actually "reclaim" or clobber
that memory (converted to E820_ACPI on x86) there (and it would be a bug
if the Dom0 kernel tried to reclaim the range, bypassing Xen's memory
management, plus it would be at least bogus if it clobbered that space),
hence that type's handling can be left alone.
Jason Andryuk [Mon, 31 Oct 2022 12:21:31 +0000 (13:21 +0100)]
argo: Remove reachable ASSERT_UNREACHABLE
I observed this ASSERT_UNREACHABLE in partner_rings_remove consistently
trip. It was in OpenXT with the viptables patch applied.
dom10 shuts down.
dom7 is REJECTED sending to dom10.
dom7 shuts down and this ASSERT trips for dom10.
The argo_send_info has a domid, but there is no refcount taken on
the domain. Therefore it's not appropriate to ASSERT that the domain
can be looked up via domid. Replace with a debug message.
Signed-off-by: Jason Andryuk <jandryuk@gmail.com> Reviewed-by: Christopher Clark <christopher.w.clark@gmail.com>
master commit: 197f612b77c5afe04e60df2100a855370d720ad7
master date: 2022-10-14 14:45:41 +0100
Jan Beulich [Mon, 31 Oct 2022 12:20:40 +0000 (13:20 +0100)]
VMX: correct error handling in vmx_create_vmcs()
With the addition of vmx_add_msr() calls to construct_vmcs() there are
now cases where simply freeing the VMCS isn't enough: The MSR bitmap
page as well as one of the MSR area ones (if it's the 2nd vmx_add_msr()
which fails) may also need freeing. Switch to using vmx_destroy_vmcs()
instead.
Fixes: 3bd36952dab6 ("x86/spec-ctrl: Introduce an option to control L1D_FLUSH for HVM HAP guests") Fixes: 53a570b28569 ("x86/spec-ctrl: Support IBPB-on-entry") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 448d28309f1a966bdc850aff1a637e0b79a03e43
master date: 2022-10-12 17:57:56 +0200
Henry Wang [Tue, 25 Oct 2022 09:21:12 +0000 (09:21 +0000)]
xen/arm: p2m: Populate pages for GICv2 mapping in p2m_init()
Hardware using GICv2 needs to create a P2M mapping of 8KB GICv2 area
when the domain is created. Considering the worst case of page tables
which requires 6 P2M pages as the two pages will be consecutive but not
necessarily in the same L3 page table and keep a buffer, populate 16
pages as the default value to the P2M pages pool in p2m_init() at the
domain creation stage to satisfy the GICv2 requirement. For GICv3, the
above-mentioned P2M mapping is not necessary, but since the allocated
16 pages here would not be lost, hence populate these pages
unconditionally.
With the default 16 P2M pages populated, there would be a case that
failures would happen in the domain creation with P2M pages already in
use. To properly free the P2M for this case, firstly support the
optionally preemption of p2m_teardown(), then call p2m_teardown() and
p2m_set_allocation(d, 0, NULL) non-preemptively in p2m_final_teardown().
As non-preemptive p2m_teardown() should only return 0, use a
BUG_ON to confirm that.
Since p2m_final_teardown() is called either after
domain_relinquish_resources() where relinquish_p2m_mapping() has been
called, or from failure path of domain_create()/arch_domain_create()
where mappings that require p2m_put_l3_page() should never be created,
relinquish_p2m_mapping() is not added in p2m_final_teardown(), add
in-code comments to refer this.
Fixes: cbea5a1149ca ("xen/arm: Allocate and free P2M pages from the P2M pool") Suggested-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Henry Wang <Henry.Wang@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit: c7cff1188802646eaa38e918e5738da0e84949be)
Andrew Cooper [Tue, 25 Oct 2022 09:21:11 +0000 (09:21 +0000)]
arm/p2m: Rework p2m_init()
p2m_init() is mostly trivial initialisation, but has two fallible operations
which are on either side of the backpointer trigger for teardown to take
actions.
p2m_free_vmid() is idempotent with a failed p2m_alloc_vmid(), so rearrange
p2m_init() to perform all trivial setup, then set the backpointer, then
perform all fallible setup.
This will simplify a future bugfix which needs to add a third fallible
operation.
No practical change.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit: 3783e583319fa1ce75e414d851f0fde191a14753)
Tamas K Lengyel [Tue, 11 Oct 2022 13:01:48 +0000 (15:01 +0200)]
x86/vpmu: Fix race-condition in vpmu_load
The vPMU code-bases attempts to perform an optimization on saving/reloading the
PMU context by keeping track of what vCPU ran on each pCPU. When a pCPU is
getting scheduled, checks if the previous vCPU isn't the current one. If so,
attempts a call to vpmu_save_force. Unfortunately if the previous vCPU is
already getting scheduled to run on another pCPU its state will be already
runnable, which results in an ASSERT failure.
Fix this by always performing a pmu context save in vpmu_save when called from
vpmu_switch_from, and do a vpmu_load when called from vpmu_switch_to.
While this presents a minimal overhead in case the same vCPU is getting
rescheduled on the same pCPU, the ASSERT failure is avoided and the code is a
lot easier to reason about.
Signed-off-by: Tamas K Lengyel <tamas.lengyel@intel.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: defa4e51d20a143bdd4395a075bf0933bb38a9a4
master date: 2022-09-30 09:53:49 +0200
Jan Beulich [Tue, 11 Oct 2022 13:01:36 +0000 (15:01 +0200)]
x86: wire up VCPUOP_register_vcpu_time_memory_area for 32-bit guests
Forever sinced its introduction VCPUOP_register_vcpu_time_memory_area
was available only to native domains. Linux, for example, would attempt
to use it irrespective of guest bitness (including in its so called
PVHVM mode) as long as it finds XEN_PVCLOCK_TSC_STABLE_BIT set (which we
set only for clocksource=tsc, which in turn needs engaging via command
line option).
Fixes: a5d39947cb89 ("Allow guests to register secondary vcpu_time_info") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: b726541d94bd0a80b5864d17a2cd2e6d73a3fe0a
master date: 2022-09-29 14:47:45 +0200
Juergen Gross [Tue, 11 Oct 2022 13:01:22 +0000 (15:01 +0200)]
xen/gnttab: fix gnttab_acquire_resource()
Commit 9dc46386d89d ("gnttab: work around "may be used uninitialized"
warning") was wrong, as vaddrs can legitimately be NULL in case
XENMEM_resource_grant_table_id_status was specified for a grant table
v1. This would result in crashes in debug builds due to
ASSERT_UNREACHABLE() triggering.
Check vaddrs only to be NULL in the rc == 0 case.
Expand the tests in tools/tests/resource to tickle this path, and verify that
using XENMEM_resource_grant_table_id_status on a v1 grant table fails.
Fixes: 9dc46386d89d ("gnttab: work around "may be used uninitialized" warning") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> # xen Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 52daa6a8483e4fbd6757c9d1b791e23931791608
master date: 2022-09-09 16:28:38 +0100
Jan Beulich [Tue, 11 Oct 2022 13:00:33 +0000 (15:00 +0200)]
Config.mk: correct PIE-related option(s) in EMBEDDED_EXTRA_CFLAGS
I haven't been able to find evidence of "-nopie" ever having been a
supported compiler option. The correct spelling is "-no-pie".
Furthermore like "-pie" this is an option which is solely passed to the
linker. The compiler only recognizes "-fpie" / "-fPIE" / "-fno-pie", and
it doesn't infer these options from "-pie" / "-no-pie".
Add the compiler recognized form, but for the possible case of the
variable also being used somewhere for linking keep the linker option as
well (with corrected spelling).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <jgrall@amazon.com>
Build: Drop -no-pie from EMBEDDED_EXTRA_CFLAGS
This breaks all Clang builds, as demostrated by Gitlab CI.
Contrary to the description in ecd6b9759919, -no-pie is not even an option
passed to the linker. GCC's actual behaviour is to inhibit the passing of
-pie to the linker, as well as selecting different cr0 artefacts to be linked.
EMBEDDED_EXTRA_CFLAGS is not used for $(CC)-doing-linking, and not liable to
gain such a usecase.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Tested-by: Stefano Stabellini <sstabellini@kernel.org> Fixes: ecd6b9759919 ("Config.mk: correct PIE-related option(s) in EMBEDDED_EXTRA_CFLAGS")
master commit: ecd6b9759919fa6335b0be1b5fc5cce29a30c4f1
master date: 2022-09-08 09:25:26 +0200
master commit: 13a7c0074ac8fb31f6c0485429b7a20a1946cb22
master date: 2022-09-27 15:40:42 -0700
Juergen Gross [Tue, 11 Oct 2022 13:00:19 +0000 (15:00 +0200)]
xen/sched: fix cpu hotplug
Cpu unplugging is calling schedule_cpu_rm() via stop_machine_run() with
interrupts disabled, thus any memory allocation or freeing must be
avoided.
Since commit 5047cd1d5dea ("xen/common: Use enhanced
ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
via an assertion, which will now fail.
Fix this by allocating needed memory before entering stop_machine_run()
and freeing any memory only after having finished stop_machine_run().
Fixes: 1ec410112cdd ("xen/sched: support differing granularity in schedule_cpu_[add/rm]()") Reported-by: Gao Ruifeng <ruifeng.gao@intel.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d84473689611eed32fd90b27e614f28af767fa3f
master date: 2022-09-05 11:42:30 +0100
Juergen Gross [Tue, 11 Oct 2022 13:00:05 +0000 (15:00 +0200)]
xen/sched: carve out memory allocation and freeing from schedule_cpu_rm()
In order to prepare not allocating or freeing memory from
schedule_cpu_rm(), move this functionality to dedicated functions.
For now call those functions from schedule_cpu_rm().
No change of behavior expected.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d42be6f83480b3ada286dc18444331a816be88a3
master date: 2022-09-05 11:42:30 +0100