Michal Orzel [Wed, 5 May 2021 07:43:01 +0000 (09:43 +0200)]
arm: Modify type of actlr to register_t
AArch64 registers are 64bit whereas AArch32 registers
are 32bit or 64bit. MSR/MRS are expecting 64bit values thus
we should get rid of helpers READ/WRITE_SYSREG32
in favour of using READ/WRITE_SYSREG.
We should also use register_t type when reading sysregs
which can correspond to uint64_t or uint32_t.
Even though many AArch64 registers have upper 32bit reserved
it does not mean that they can't be widen in the future.
ACTLR_EL1 system register bits are implementation defined
which means it is possibly a latent bug on current HW as the CPU
implementer may already have decided to use the top 32bit.
Signed-off-by: Michal Orzel <michal.orzel@arm.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
(cherry picked from commit b80470c84553808fef3a6803000ceee8a100e63c)
Jan Beulich [Fri, 11 Jun 2021 13:04:24 +0000 (15:04 +0200)]
Arm32: MSR to SPSR needs qualification
The Arm ARM's description of MSR (ARM DDI 0406C.d section B9.3.12)
doesn't even allow for plain "SPSR" here, and while gas accepts this, it
takes it to mean SPSR_cf. Yet surely all of SPSR wants updating on this
path, not just the lowest and highest 8 bits.
Fixes: dfcffb128be4 ("xen/arm32: SPSR_hyp/SPSR") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
(cherry picked from commit 93031fbe9f4c341a2e7950a088025ea550291433)
SPSR_hyp is not meant to be accessed from Hyp mode (EL2); accesses
trigger UNPREDICTABLE behaviour. Xen should read/write SPSR instead.
See: ARM DDI 0487D.b page G8-5993.
This fixes booting Xen/arm32 on QEMU.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com> Tested-by: Edgar E. Iglesias <edgar.iglesias@xilinx.com>
(cherry picked from commit dfcffb128be46a3e413eaa941744536fe53c94b6)
Andrew Cooper [Fri, 16 Jul 2021 06:26:33 +0000 (08:26 +0200)]
x86/tsx: Fix backport of "x86/cpuid: Rework HLE and RTM handling"
The backport dropped the hunk deleting the setup_clear_cpu_cap() for HLE/RTM,
but retained the hunk adding setup_force_cpu_cap().
Calling both force and clear on the same feature elicits an error, and clear
takes precedence, which breaks the part of the bugfix which makes migration
from older versions of Xen function safe for VMs using TSX.
Fixes: f17d848c4caa ("x86/cpuid: Rework HLE and RTM handling") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Fixes: 91c3e3dc91d6 ("tools/xentop: Display '-' when stats are not available.") Signed-off-by: Richard Kojedzinszky <richard@kojedz.in> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 89d57f291e37b4769ab26db919eba46548f2e13e)
Jan Beulich [Thu, 15 Jul 2021 07:44:20 +0000 (09:44 +0200)]
x86/mem-sharing: ensure consistent lock order in get_two_gfns()
While the comment validly says "Sort by domain, if same domain by gfn",
the implementation also included equal domain IDs in the first part of
the check, thus rending the second part entirely dead and leaving
deadlock potential when there's only a single domain involved.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tamas K Lengyel <tamas@tklengyel.com>
master commit: 09af2d01a2fe6a0af08598bdfe12c9707f4d82ba
master date: 2021-07-07 12:35:12 +0200
Jan Beulich [Thu, 15 Jul 2021 07:43:38 +0000 (09:43 +0200)]
IOMMU/PCI: don't let domain cleanup continue when device de-assignment failed
Failure here could in principle mean the device may still be issuing DMA
requests, which would continue to be translated by the page tables the
device entry currently points at. With this we cannot allow the
subsequent cleanup step of freeing the page tables to occur, to prevent
use-after-free issues. We would need to accept, for the time being, that
in such a case the remaining domain resources will all be leaked, and
the domain will continue to exist as a zombie.
However, with flushes no longer timing out (and with proper timeout
detection for device I/O TLB flushing yet to be implemented), there's no
way anymore for failures to occur, except due to bugs elsewhere. Hence
the change here is merely a "just in case" one.
In order to continue the loop in spite of an error, we can't use
pci_get_pdev_by_domain() anymore. I have no idea why it was used here in
the first place, instead of the cheaper list iteration.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: f591755823a7e94fc6b4b8ddce71f0421a94fa09
master date: 2021-06-25 14:06:55 +0200
Jan Beulich [Thu, 15 Jul 2021 07:43:11 +0000 (09:43 +0200)]
VT-d: don't lose errors when flushing TLBs on multiple IOMMUs
While no longer an immediate problem with flushes no longer timing out,
errors (if any) get properly reported by iommu_flush_iotlb_{dsi,psi}().
Overwriting such an error with, perhaps, a success indicator received
from another IOMMU will misguide callers. Record the first error, but
don't bail from the loop (such that further necessary invalidation gets
carried out on a best effort basis).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: e7059776f9755b989a992d229c68c3d7778412be
master date: 2021-06-24 16:30:06 +0200
Jan Beulich [Thu, 15 Jul 2021 07:42:47 +0000 (09:42 +0200)]
VT-d: clear_fault_bits() should clear all fault bits
If there is any way for one fault to be left set in the recording
registers, there's no reason there couldn't also be multiple ones. If
PPF set set (being the OR or all F fields), simply loop over the entire
range of fault recording registers, clearing F everywhere.
Since PPF is a r/o bit, also remove it from DMA_FSTS_FAULTS (arguably
the constant's name is ambiguous as well).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 80589800ae62fce43fd3921e8fbd362465fe5ba3
master date: 2021-06-24 16:29:42 +0200
Jan Beulich [Thu, 15 Jul 2021 07:42:18 +0000 (09:42 +0200)]
VT-d: adjust domid map updating when unmapping context
When an earlier error occurred, cleaning up the domid mapping data is
wrong, as references likely still exist. The only exception to this is
when the actual unmapping worked, but some flush failed (supposedly
impossible after XSA-373). The guest will get crashed in such a case
though, so add fallback cleanup to domain destruction to cover this
case. This in turn makes it desirable to silence the dprintk() in
domain_iommu_domid().
Note that no error will be returned anymore when the lookup fails - in
the common case lookup failure would already have caused
domain_context_unmap_one() to fail, yet even from a more general
perspective it doesn't look right to fail domain_context_unmap() in such
a case when this was the last device, but not when any earlier unmap was
otherwise successful.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 32655880057ce2829f962d46916ea6cec60f98d3
master date: 2021-06-24 16:29:13 +0200
Jan Beulich [Thu, 15 Jul 2021 07:41:48 +0000 (09:41 +0200)]
VT-d: undo device mappings upon error
When
- flushes (supposedly not possible anymore after XSA-373),
- secondary mappings for legacy PCI devices behind bridges,
- secondary mappings for chipset quirks, or
- find_upstream_bridge() invocations
fail, the successfully established device mappings should not be left
around.
Further, when (parts of) unmapping fail, simply returning an error is
typically not enough. Crash the domain instead in such cases, arranging
for domain cleanup to continue in a best effort manner despite such
failures.
Finally make domain_context_unmap()'s error behavior consistent in the
legacy PCI device case: Don't bail from the function in one special
case, but always just exit the switch statement.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: f3401d65d9f0dce508c3d7da55de4a093d748ae1
master date: 2021-06-24 16:28:25 +0200
Commit cf8c4d3d13b8 made some preparation to have one day
variable-length-array argument, but didn't declare the array in the
function prototype the same way as in the function definition. And now
GCC 11 complains about it.
Fixes: cf8c4d3d13b8 ("tools/libs/foreignmemory: pull array length argument to map forward") Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5d3e4ebb5c71477d74a0c503438545a0126d3863
master date: 2021-06-15 18:07:58 +0100
Jan Beulich [Thu, 15 Jul 2021 07:40:29 +0000 (09:40 +0200)]
x86/vpt: fully init timers before putting onto list
With pt_vcpu_lock() no longer acquiring the pt_migrate lock, parties
iterating the list and acting on the timers of the list entries will no
longer be kept from entering their loops by create_periodic_time()'s
holding of that lock. Therefore at least init_timer() needs calling
ahead of list insertion, but keep this and set_timer() together.
Fixes: 8113b02f0bf8 ("x86/vpt: do not take pt_migrate rwlock in some cases") Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
master commit: 6d622f3a96bbd76ce8422c6e3805e6609417ec76
master date: 2021-06-15 15:14:20 +0200
Andrew Cooper [Thu, 15 Jul 2021 07:40:02 +0000 (09:40 +0200)]
x86/cpuid: Fix HLE and RTM handling (again)
For reasons which are my fault, but I don't recall why, the
FDP_EXCP_ONLY/NO_FPU_SEL adjustment uses the whole special_features[] array
element, not the two relevant bits.
HLE and RTM were recently added to the list of special features, causing them
to be always set in guest view, irrespective of the toolstacks choice on the
matter.
Rewrite the logic to refer to the features specifically, rather than relying
on the contents of the special_features[] array.
Fixes: 8fe24090d9 ("x86/cpuid: Rework HLE and RTM handling") Reported-by: Edwin Török <edvin.torok@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 60fa12dbf1d4d2c4ffe1ef34b495b24aa7e41aa0
master date: 2021-06-07 15:43:35 +0100
xen: credit2: fix per-entity load tracking when continuing running
If we schedule, and the current vCPU continues to run, its statistical
load is not properly updated, resulting in something like this, even if
all the 8 vCPUs are 100% busy:
As we can see, the average load of the runqueue as a whole is, instead,
computed properly.
This issue would, in theory, potentially affect Credit2 load balancing
logic. In practice, however, the problem only manifests (at least with
these characteristics) when there is only 1 runqueue active in the
cpupool, which also means there is no need to do any load-balancing.
Hence its real impact is pretty much limited to wrong per-vCPU load
percentages, when looking at the output of the 'r' debug-key.
With this patch, the load is updated and displayed correctly:
credit2: make sure we pick a runnable unit from the runq if there is one
A !runnable unit (temporarily) present in the runq may cause us to
stop scanning the runq itself too early. Of course, we don't run any
non-runnable vCPUs, but we end the scan and we fallback to picking
the idle unit. In other word, this prevent us to find there and pick
the actual unit that we're meant to start running (which might be
further ahead in the runq).
Depending on the vCPU pinning configuration, this may lead to such
unit to be stuck in the runq for long time, causing malfunctioning
inside the guest.
Fix this by checking runnable/non-runnable status up-front, in the runq
scanning function.
Reported-by: Michał Leszczyński <michal.leszczynski@cert.pl> Reported-by: Dion Kant <g.w.kant@hunenet.nl> Signed-off-by: Dario Faggioli <dfaggioli@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 07b0eb5d0ef0be154606aa46b5b4c5c59b158505
master date: 2021-06-07 13:16:36 +0100
Andrew Cooper [Thu, 20 May 2021 00:21:39 +0000 (01:21 +0100)]
x86/spec-ctrl: Mitigate TAA after S3 resume
The user chosen setting for MSR_TSX_CTRL needs restoring after S3.
All APs get the correct setting via start_secondary(), but the BSP was missed
out.
This is XSA-377 / CVE-2021-28690.
Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 8cf276cb2e0b99b96333865873f56b0b31555ff1)
Andrew Cooper [Thu, 11 Mar 2021 14:39:11 +0000 (14:39 +0000)]
x86/spec-ctrl: Protect against Speculative Code Store Bypass
Modern x86 processors have far-better-than-architecturally-guaranteed self
modifying code detection. Typically, when a write hits an instruction in
flight, a Machine Clear occurs to flush stale content in the frontend and
backend.
For self modifying code, before a write which hits an instruction in flight
retires, the frontend can speculatively decode and execute the old instruction
stream. Speculation of this form can suffer from type confusion in registers,
and potentially leak data.
Furthermore, updates are typically byte-wise, rather than atomic. Depending
on timing, speculation can race ahead multiple times between individual
writes, and execute the transiently-malformed instruction stream.
Xen has stubs which are used in certain cases for emulation purposes. Inhibit
speculation between updating the stub and executing it.
This is XSA-375 / CVE-2021-0089.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 45f59ed8865318bb0356954bad067f329677ce9e)
Jan Beulich [Tue, 8 Jun 2021 17:58:24 +0000 (18:58 +0100)]
AMD/IOMMU: drop command completion timeout
First and foremost - such timeouts were not signaled to callers, making
them believe they're fine to e.g. free previously unmapped pages.
Mirror VT-d's behavior: A fixed number of loop iterations is not a
suitable way to detect timeouts in an environment (CPU and bus speeds)
independent manner anyway. Furthermore, leaving an in-progress operation
pending when it appears to take too long is problematic: If a command
completed later, the signaling of its completion may instead be
understood to signal a subsequently started command's completion.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area. Allow callers to specify
a non-default timeout bias for this logging, using the same values as
VT-d does, which in particular means a (by default) much larger value
for device IO TLB invalidation.
Jan Beulich [Tue, 8 Jun 2021 17:58:05 +0000 (18:58 +0100)]
AMD/IOMMU: wait for command slot to be available
No caller cared about send_iommu_command() indicating unavailability of
a slot. Hence if a sufficient number prior commands timed out, we did
blindly assume that the requested command was submitted to the IOMMU
when really it wasn't. This could mean both a hanging system (waiting
for a command to complete that was never seen by the IOMMU) or blindly
propagating success back to callers, making them believe they're fine
to e.g. free previously unmapped pages.
Fold the three involved functions into one, add spin waiting for an
available slot along the lines of VT-d's qinval_next_index(), and as a
consequence drop all error indicator return types/values.
Jan Beulich [Tue, 8 Jun 2021 17:56:39 +0000 (18:56 +0100)]
VT-d: eliminate flush related timeouts
Leaving an in-progress operation pending when it appears to take too
long is problematic: If e.g. a QI command completed later, the write to
the "poll slot" may instead be understood to signal a subsequently
started command's completion. Also our accounting of the timeout period
was actually wrong: We included the time it took for the command to
actually make it to the front of the queue, which could be heavily
affected by guests other than the one for which the flush is being
performed.
Do away with all timeout detection on all flush related code paths.
Log excessively long processing times (with a progressive threshold) to
have some indication of problems in this area.
Additionally log (once) if qinval_next_index() didn't immediately find
an available slot. Together with the earlier change sizing the queue(s)
dynamically, we should now have a guarantee that with our fully
synchronous model any demand for slots can actually be satisfied.
Jan Beulich [Tue, 8 Jun 2021 17:56:24 +0000 (18:56 +0100)]
AMD/IOMMU: size command buffer dynamically
With the present synchronous model, we need two slots for every
operation (the operation itself and a wait command). There can be one
such pair of commands pending per CPU. To ensure that under all normal
circumstances a slot is always available when one is requested, size the
command ring according to the number of present CPUs.
Jan Beulich [Tue, 8 Jun 2021 17:55:50 +0000 (18:55 +0100)]
VT-d: size qinval queue dynamically
With the present synchronous model, we need two slots for every
operation (the operation itself and a wait descriptor). There can be
one such pair of requests pending per CPU. To ensure that under all
normal circumstances a slot is always available when one is requested,
size the queue ring according to the number of present CPUs.
xen/arm: Boot modules should always be scrubbed if bootscrub={on, idle}
The function to initialize the pages (see init_heap_pages()) will request
scrub when the admin request idle bootscrub (default) and state ==
SYS_STATE_active. When bootscrub=on, Xen will scrub any free pages in
heap_init_late().
Currently, the boot modules (e.g. kernels, initramfs) will be discarded/
freed after heap_init_late() is called and system_state switched to
SYS_STATE_active. This means the pages associated with the boot modules
will not get scrubbed before getting re-purposed.
If the memory is assigned to an untrusted domU, it may be able to
retrieve secrets from the modules.
Julien Grall [Mon, 17 May 2021 16:47:13 +0000 (17:47 +0100)]
xen/arm: Create dom0less domUs earlier
In a follow-up patch we will need to unallocate the boot modules
before heap_init_late() is called.
The modules will contain the domUs kernel and initramfs. Therefore Xen
will need to create extra domUs (used by dom0less) before heap_init_late().
This has two consequences on dom0less:
1) Domains will not be unpaused as soon as they are created but
once all have been created. However, Xen doesn't guarantee an order
to unpause, so this is not something one could rely on.
2) The memory allocated for a domU will not be scrubbed anymore when an
admin select bootscrub=on. This is not something we advertised, but if
this is a concern we can introduce either force scrub for all domUs or
a per-domain flag in the DT. The behavior for bootscrub=off and
bootscrub=idle (default) has not changed.
The original commit wasn't quite sufficient: Emptying DEPS is helpful
only when nothing will get added to it subsequently. xen/Rules.mk will,
after including the local Makefile, amend DEPS by dependencies for
objects living in sub-directories though. For the purpose of suppressing
dependencies of the makefiles on the .*.d2 files (and thus to avoid
their re-generation) it is, however, not necessary at all to play with
DEPS. Instead we can override DEPS_INCLUDE (which generally is a late-
expansion variable).
Andrew Cooper [Fri, 4 Jun 2021 13:04:56 +0000 (15:04 +0200)]
x86/cpuid: Rework HLE and RTM handling
The TAA mitigation offered the option to hide the HLE and RTM CPUID bits,
which has caused some migration compatibility problems.
These two bits are special. Annotate them with ! to emphasise this point.
Hardware Lock Elision (HLE) may or may not be visible in CPUID, but is
disabled in microcode on all CPUs, and has been removed from the architecture.
Do not advertise it to VMs by default.
Restricted Transactional Memory (RTM) may or may not be visible in CPUID, and
may or may not be configured in force-abort mode. Have tsx_init() note
whether RTM has been configured into force-abort mode, so
guest_common_feature_adjustments() can conditionally hide it from VMs by
default.
The host policy values for HLE/RTM may or may not be set, depending on any
previous running kernel's choice of visibility, and Xen's choice. TSX is
available on any CPU which enumerates a TSX-hiding mechanism, so instead of
doing a two-step to clobber any hiding, scan CPUID, then set the visibility,
just force visibility of the bits in the first place.
With the HLE/RTM bits now unilaterally visible in the host policy,
xc_cpuid_apply_policy() can construct a more appropriate policy out of thin
air for pre-4.13 VMs with no CPUID data in their migration stream, and
specifically one where HLE/RTM doesn't potentially disappear behind the back
of a running VM.
Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8fe24090d940d760145ccd5e234290be7418b175
master date: 2021-05-27 19:34:00 +0100
Jan Beulich [Fri, 4 Jun 2021 13:02:33 +0000 (15:02 +0200)]
x86: make hypervisor build with gcc11
Gcc 11 looks to make incorrect assumptions about valid ranges that
pointers may be used for addressing when they are derived from e.g. a
plain constant. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100680.
Utilize RELOC_HIDE() to work around the issue, which for x86 manifests
in at least
- mpparse.c:efi_check_config(),
- tboot.c:tboot_probe(),
- tboot.c:tboot_gen_frametable_integrity(),
- x86_emulate.c:x86_emulate() (at -O2 only).
The last case is particularly odd not just because it only triggers at
higher optimization levels, but also because it only affects one of at
least three similar constructs. Various "note" diagnostics claim the
valid index range to be [0, 2⁶³-1].
Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 722f59d38c710a940ab05e542a83020eb5546dea
master date: 2021-05-27 14:40:29 +0200
Jan Beulich [Fri, 4 Jun 2021 13:02:13 +0000 (15:02 +0200)]
x86emul: fix test harness build for gas 2.36
All of the sudden, besides .text and .rodata and alike, an always
present .note.gnu.property section has appeared. This section, when
converting to binary format output, gets placed according to its
linked address, causing the resulting blobs to be about 128Mb in size.
The resulting headers with a C representation of the binary blobs then
are, of course all a multiple of that size (and take accordingly long
to create). I didn't bother waiting to see what size the final
test_x86_emulator binary then would have had.
See also https://sourceware.org/bugzilla/show_bug.cgi?id=27753.
Rather than figuring out whether gas supports -mx86-used-note=, simply
remove the section while creating *.bin.
Roger Pau Monné [Fri, 4 Jun 2021 13:01:33 +0000 (15:01 +0200)]
x86/vhpet: fix RTC special casing
Restore setting the virtual timer callback private data to NULL if the
timer is not level triggered. This fixes the special casing done in
pt_update_irq so that the RTC interrupt when originating from the HPET
is suspended if the interrupt source is masked.
Note the RTC special casing done in pt_update_irq should only apply to
the RTC interrupt originating from the emulated RTC device (which does
set the callback private data), as in that case the callback itself
will destroy the virtual timer if the interrupt is ignored.
While there also use RTC_IRQ instead of 8 when the HPET is configured
in LegacyReplacement Mode.
Fixes: be07023be115 ("x86/vhpet: add support for level triggered interrupts") Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 472a13988a051e5ae84b95815c6caf4378062abe
master date: 2021-05-07 10:43:29 +0200
Chao Gao [Fri, 4 Jun 2021 13:00:37 +0000 (15:00 +0200)]
VT-d: Don't assume register-based invalidation is always supported
According to Intel VT-d SPEC rev3.3 Section 6.5, Register-based Invalidation
isn't supported by Intel VT-d version 6 and beyond.
This hardware change impacts following two scenarios: admin can disable
queued invalidation via 'qinval' cmdline and use register-based interface;
VT-d switches to register-based invalidation when queued invalidation needs
to be disabled, for example, during disabling x2apic or during system
suspension or after enabling queued invalidation fails.
To deal with this hardware change, if register-based invalidation isn't
supported, queued invalidation cannot be disabled through Xen cmdline; and
if queued invalidation has to be disabled temporarily in some scenarios,
VT-d won't switch to register-based interface but use some dummy functions
to catch errors in case there is any invalidation request issued when queued
invalidation is disabled.
Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 6773b1a7584a75a486e9774541ad5bd84c9aa5ee
master date: 2021-04-26 10:16:50 +0200
Ian Jackson [Tue, 9 Mar 2021 15:00:47 +0000 (15:00 +0000)]
SUPPORT.md: Document speculative attacks status of non-shim 32-bit PV
This documents, but does not fix, XSA-370.
Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
Commit e1de4c196a2e ("x86/timer: Fix boot on Intel systems using ITSSPRC
static PIT clock gating") was reported to cause boot failures on certain
AMD Ryzen systems.
Refine the fix to do nothing in the default case, and only attempt to
configure legacy replacement mode if IRQ0 is found to not be working. If
legacy replacement mode doesn't help, undo it before falling back to other IRQ
routing configurations.
In addition, introduce a "hpet" command line option so this heuristic
can be overridden. Since it makes little sense to introduce just
"hpet=legacy-replacement", also allow for a boolean argument as well as
"broadcast" to replace the separate "hpetbroadcast" option.
Reported-by: Frédéric Pierret frederic.pierret@qubes-os.org Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Frédéric Pierret <frederic.pierret@qubes-os.org>
master commit: b53173e7cdafb7a318a239d557478fd73734a86a
master date: 2021-04-15 18:25:40 +0100
Boris Ostrovsky [Tue, 20 Apr 2021 10:15:12 +0000 (12:15 +0200)]
x86/vpt: do not take pt_migrate rwlock in some cases
Commit 8e76aef72820 ("x86/vpt: fix race when migrating timers between
vCPUs") addressed XSA-336 by introducing a per-domain rwlock that was
intended to protect periodic timer during VCPU migration. Since such
migration is an infrequent event no performance impact was expected.
Unfortunately this turned out not to be the case: on a fairly large
guest (92 VCPUs) we've observed as much as 40% TPCC performance
regression with some guest kernels. Further investigation pointed to
pt_migrate read lock taken in pt_update_irq() as the largest contributor
to this regression. With large number of VCPUs and large number of VMEXITs
(from where pt_update_irq() is always called) the update of an atomic in
read_lock() is thought to be the main cause.
Stephen Brennan analyzed locking pattern and classified lock users as
follows:
1. Functions which read (maybe write) all periodic_time instances attached
to a particular vCPU. These are functions which use pt_vcpu_lock() such
as pt_restore_timer(), pt_save_timer(), etc.
2. Functions which want to modify a particular periodic_time object.
These functions lock whichever vCPU the periodic_time is attached to, but
since the vCPU could be modified without holding any lock, they are
vulnerable to XSA-336. Functions in this group use pt_lock(), such as
pt_timer_fn() or destroy_periodic_time().
3. Functions which not only want to modify the periodic_time, but also
would like to modify the =vcpu= fields. These are create_periodic_time()
or pt_adjust_vcpu(). They create XSA-336 conditions for group 2, but we
can't simply hold 2 vcpu locks due to the deadlock risk.
Roger then pointed out that group 1 functions don't really need to hold
the pt_migrate rwlock and that instead groups 2 and 3 should hold per-vcpu
lock whenever they modify per-vcpu timer lists.
Suggested-by: Stephen Brennan <stephen.s.brennan@oracle.com> Suggested-by: Roger Pau Monné <roger.pau@citrix.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Stephen Brennan <stephen.s.brennan@oracle.com>
master commit: 1f3d87c7512975274cc45c40097b05550eba1ac9
master date: 2021-04-09 09:21:27 +0200
Jan Beulich [Tue, 20 Apr 2021 10:14:34 +0000 (12:14 +0200)]
fix for_each_cpu() again for NR_CPUS=1
Unfortunately aa50f45332f1 ("xen: fix for_each_cpu when NR_CPUS=1") has
caused quite a bit of fallout with gcc10, e.g. (there are at least two
more similar ones, and I didn't bother trying to find them all):
In file included from .../xen/include/xen/config.h:13,
from <command-line>:
core_parking.c: In function ‘core_parking_power’:
.../xen/include/asm/percpu.h:12:51: error: array subscript 1 is above array bounds of ‘long unsigned int[1]’ [-Werror=array-bounds]
12 | (*RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]))
.../xen/include/xen/compiler.h:141:29: note: in definition of macro ‘RELOC_HIDE’
141 | (typeof(ptr)) (__ptr + (off)); })
| ^~~
core_parking.c:133:39: note: in expansion of macro ‘per_cpu’
133 | core_tmp = cpumask_weight(per_cpu(cpu_core_mask, cpu));
| ^~~~~~~
In file included from .../xen/include/xen/percpu.h:4,
from .../xen/include/asm/msr.h:7,
from .../xen/include/asm/time.h:5,
from .../xen/include/xen/time.h:76,
from .../xen/include/xen/spinlock.h:4,
from .../xen/include/xen/cpu.h:5,
from core_parking.c:19:
.../xen/include/asm/percpu.h:6:22: note: while referencing ‘__per_cpu_offset’
6 | extern unsigned long __per_cpu_offset[NR_CPUS];
| ^~~~~~~~~~~~~~~~
One of the further errors even went as far as claiming that an array
index (range) of [0, 0] was outside the bounds of a [1] array, so
something fishy is pretty clearly going on there.
The compiler apparently wants to be able to see that the loop isn't
really a loop in order to avoid triggering such warnings, yet what
exactly makes it consider the loop exit condition constant and within
the [0, 1] range isn't obvious - using ((mask)->bits[0] & 1) instead of
cpumask_test_cpu() for example did _not_ help.
Re-instate a special form of for_each_cpu(), experimentally "proven" to
avoid the diagnostics.
Jan Beulich [Tue, 20 Apr 2021 10:13:53 +0000 (12:13 +0200)]
VT-d: restore flush hooks when disabling qinval
Leaving the hooks untouched is at best a latent risk: There may well be
cases where some flush is needed, which then needs carrying out the
"register" way.
Switch from u<N> to uint<N>_t while needing to touch the function
headers anyway.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 9c39dba2b179c0f4c42c98e97ea0878119718530
master date: 2021-03-30 14:40:24 +0200
Jan Beulich [Tue, 20 Apr 2021 10:13:26 +0000 (12:13 +0200)]
VT-d: re-order register restoring in vtd_resume()
For one FECTL must be written last - the interrupt shouldn't be unmasked
without first having written the data and address needed to actually
deliver it. In the common case (when dma_msi_set_affinity() doesn't end
up bailing early) this happens from init_vtd_hw(), but for this to
actually have the intended effect we shouldn't subsequently overwrite
what was written there - this is only benign when old and new settings
match. Instead we should restore the registers ahead of calling
init_vtd_hw(), just for the unlikely case of dma_msi_set_affinity()
bailing early.
In the moved code drop some stray casts as well.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 95e07f8c0e64889ee56015c7f99bdf5309e9e8ef
master date: 2021-03-30 14:39:54 +0200
Andrew Cooper [Tue, 20 Apr 2021 10:11:59 +0000 (12:11 +0200)]
x86/timer: Fix boot on Intel systems using ITSSPRC static PIT clock gating
Recent Intel client devices have disabled the legacy PIT for powersaving
reasons, breaking compatibility with a traditional IBM PC. Xen depends on a
legacy timer interrupt to check that the IO-APIC/PIC routing is configured
correctly, and fails to boot with:
(XEN) *******************************
(XEN) Panic on CPU 0:
(XEN) IO-APIC + timer doesn't work! Boot with apic_verbosity=debug and send report. Then try booting with the `noapic` option
(XEN) *******************************
While this setting can be undone by Xen, the details of how to differ by
chipset, and would be very short sighted for battery based devices. See bit 2
"8254 Static Clock Gating Enable" in:
All impacted systems have an HPET, but there is no indication of the absence
of PIT functionality, nor a suitable way to probe for its absence. As a short
term fix, reconfigure the HPET into legacy replacement mode. A better
longterm fix would be to avoid the reliance on the timer interrupt entirely.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: e1de4c196a2eb4fd5063c30a2e115adf144bdeef
master date: 2021-01-27 19:15:19 +0000
Jan Beulich [Tue, 20 Apr 2021 10:11:11 +0000 (12:11 +0200)]
x86emul: fix PINSRW and adjust other {,V}PINSR*
The use of simd_packed_int together with no further update to op_bytes
has lead to wrong signaling of #GP(0) for PINSRW without a 16-byte
aligned memory operand. Use simd_none instead and override it after
general decoding with simd_other, like is done for the B/D/Q siblings.
While benign, for consistency also use DstImplicit instead of DstReg
in x86_decode_twobyte().
PINSR{B,D,Q} also had a stray (redundant) get_fpu() invocation, which
gets dropped.
For further consistency also
- use src.bytes instead of op_bytes in relevant memcpy() invocations,
- avoid the pointless updating of op_bytes (all we care about later is
that the value be less than 16).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 06f0598b41f23c9e4cf7d8c5a05b282de92f3a35
master date: 2020-10-23 18:03:18 +0200
pci: cleanup MSI interrupts before removing device from IOMMU
Doing the MSI cleanup after removing the device from the IOMMU leads
to the following panic on AMD hardware:
Assertion 'table.ptr && (index < intremap_table_entries(table.ptr, iommu))' failed at iommu_intr.c:172
----[ Xen-4.13.1-10.0.3-d x86_64 debug=y Not tainted ]----
CPU: 3
RIP: e008:[<ffff82d08026ae3c>] drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
[...]
Xen call trace:
[<ffff82d08026ae3c>] R drivers/passthrough/amd/iommu_intr.c#get_intremap_entry+0x52/0x7b
[<ffff82d08026af25>] F drivers/passthrough/amd/iommu_intr.c#update_intremap_entry_from_msi_msg+0xc0/0x342
[<ffff82d08026ba65>] F amd_iommu_msi_msg_update_ire+0x98/0x129
[<ffff82d08025dd36>] F iommu_update_ire_from_msi+0x1e/0x21
[<ffff82d080286862>] F msi_free_irq+0x55/0x1a0
[<ffff82d080286f25>] F pci_cleanup_msi+0x8c/0xb0
[<ffff82d08025cf52>] F pci_remove_device+0x1af/0x2da
[<ffff82d0802a42d1>] F do_physdev_op+0xd18/0x1187
[<ffff82d080383925>] F pv_hypercall+0x1f5/0x567
[<ffff82d08038a432>] F lstar_enter+0x112/0x120
That's because the call to iommu_remove_device on AMD hardware will
remove the per-device interrupt remapping table, and hence the call to
pci_cleanup_msi done afterwards will find a null intremap table and
crash.
Reorder the calls so that MSI interrupts are torn down before removing
the device from the IOMMU.
Julien Grall [Sat, 20 Feb 2021 14:04:12 +0000 (14:04 +0000)]
xen/vgic: Implement write to ISPENDR in vGICv{2, 3}
Currently, Xen will send a data abort to a guest trying to write to the
ISPENDR.
Unfortunately, recent version of Linux (at least 5.9+) will start
writing to the register if the interrupt needs to be re-triggered
(see the callback irq_retrigger). This can happen when a driver (such as
the xgbe network driver on AMD Seattle) re-enable an interrupt:
Implementing the write part of ISPENDR is somewhat easy. For
virtual interrupt, we only need to inject the interrupt again.
For physical interrupt, we need to be more careful as the de-activation
of the virtual interrupt will be propagated to the physical distributor.
For simplicity, the physical interrupt will be set pending so the
workflow will not differ from a "real" interrupt.
Longer term, we could possible directly activate the physical interrupt
and avoid taking an exception to inject the interrupt to the domain.
(This is the approach taken by the new vGIC based on KVM).
Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit 067935804a8e7a33ff7170a2db8ce94bb46d9a63)
xen/arm: call iomem_permit_access for passthrough devices
iomem_permit_access should be called for MMIO regions of devices
assigned to a domain. Currently it is not called for MMIO regions of
passthrough devices of Dom0less guests. This patch fixes it.
Luca Fancellu [Fri, 19 Mar 2021 18:55:03 +0000 (11:55 -0700)]
xen/arm: Add workaround for Cortex-A53 erratum #843419
On the Cortex A53, when executing in AArch64 state, a load or store instruction
which uses the result of an ADRP instruction as a base register, or which uses
a base register written by an instruction immediately after an ADRP to the
same register, might access an incorrect address.
The workaround is to enable the linker flag --fix-cortex-a53-843419
if present, to check and fix the affected sequence. Otherwise print a warning
that Xen may be susceptible to this errata
Bertrand Marquis [Tue, 24 Nov 2020 11:12:15 +0000 (11:12 +0000)]
xen/arm: Add workaround for Cortex-A55 erratum #1530923
On the Cortex A55, TLB entries can be allocated by a speculative AT
instruction. If this is happening during a guest context switch with an
inconsistent page table state in the guest, TLBs with wrong values might
be allocated.
The ARM64_WORKAROUND_AT_SPECULATE workaround is used as for erratum 1165522 on Cortex A76 or Neoverse N1.
This change is also introducing the MIDR identifier for the Cortex-A55.
Penny Zheng [Mon, 9 Nov 2020 08:21:10 +0000 (16:21 +0800)]
xen/arm: Add Cortex-A73 erratum 858921 workaround
CNTVCT_EL0 or CNTPCT_EL0 counter read in Cortex-A73 (all versions)
might return a wrong value when the counter crosses a 32bit boundary.
Until now, there is no case for Xen itself to access CNTVCT_EL0,
and it also should be the Guest OS's responsibility to deal with
this part.
But for CNTPCT, there exists several cases in Xen involving reading
CNTPCT, so a possible workaround is that performing the read twice,
and to return one or the other depending on whether a transition has
taken place.
Michal Orzel [Wed, 14 Oct 2020 10:05:41 +0000 (12:05 +0200)]
xen/arm: Document the erratum #853709 related to Cortex A72
The Cortex-A72 erratum #853709 is the same as the Cortex-A57
erratum #852523. As the latter is already workaround, we only
need to update the documentation.
Signed-off-by: Michal Orzel <michal.orzel@arm.com>
[julieng: Reworded the commit message] Reviewed-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Bertrand Marquis <bertrand.marquis@arm.com>
(cherry picked from commit f776e5fb3ee699745f6442ec8c47d0fa647e0575)
Julien Grall [Sun, 7 Jun 2020 15:51:54 +0000 (16:51 +0100)]
xen/arm: mm: Access a PT entry before the table is unmapped
xen_pt_next_level() will retrieve the MFN from the entry right after the
page-table has been unmapped.
After calling xen_unmap_table(), there is no guarantee the mapping will
still be valid. Depending on the implementation, this may result to a
data abort in Xen.
Re-order the code to retrieve the MFN before the table is unmapped.
Fixes: 53abb9a1dcd9 ("xen/arm: mm: Rework Xen page-tables walk during update") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Release-acked-by: Paul Durrant <paul@xen.org>
(cherry picked from commit 63b4c9bfb788ebfd35d0172f7e8e2e41ef948f70)
xen/arm: sched: Ensure the vCPU context is seen before vcpu_pause() returns
Some callers of vcpu_pause() will expect to access the latest vcpu
context when the function returns (see XENDOMCTL_{set,get}vcpucontext}.
However, the latest vCPU context can only be observed after
v->is_running has been observed to be false.
As there is no memory barrier instruction generated, a processor could
try to speculatively access the vCPU context before it was observed.
To prevent the corruption of the vCPU context, we need to insert a
memory barrier instruction after v->is_running is observed and before
the context is accessed. This barrier is added in sync_vcpu_execstate()
as it seems to be the place where we expect the synchronization to
happen.
Andrew Cooper [Fri, 19 Mar 2021 14:39:42 +0000 (15:39 +0100)]
x86/amd: Initial support for Fam19h processors
Fam19h is very similar to Fam17h in these regards.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/ucode/amd: Fix microcode payload size for Fam19 processors
The original limit provided wasn't accurate. Blobs are in fact rather larger.
Fixes: fe36a173d1 ("x86/amd: Initial support for Fam19h processors") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: fe36a173d110fd792f5e337e208a5ed714df1536
master date: 2020-05-04 11:04:29 +0100
master commit: 90b014a6e6ecad036ec5846426afd19b305dedff
master date: 2021-02-10 13:23:51 +0000
Edwin Török [Fri, 15 Jan 2021 19:38:58 +0000 (19:38 +0000)]
tools/oxenstored: mkdir conflicts were sometimes missed
Due to how set_write_lowpath was used here it didn't detect create/delete
conflicts. When we create an entry we must mark our parent as modified
(this is what creating a new node via write does).
Otherwise we can have 2 transactions one creating, and another deleting a node
both succeeding depending on timing. Or one transaction reading an entry,
concluding it doesn't exist, do some other work based on that information and
successfully commit even if another transaction creates the node via mkdir
meanwhile.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit 45dee7d92b493bb531e7e77a6f9c0180ab152f87)
Edwin Török [Fri, 15 Jan 2021 19:28:37 +0000 (19:28 +0000)]
tools/oxenstored: Reject invalid watch paths early
Watches on invalid paths were accepted, but they would never trigger. The
client also got no notification that its watch is bad and would never trigger.
Found again by the structured fuzzer, due to an error on live update reload:
the invalid watch paths would get rejected during live update and the list of
watches would be different pre/post live update.
The testcase is watch on `//`, which is an invalid path.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit dc8caf214fb882546b0e93317b9828247a7c9da8)
Edwin Török [Fri, 15 Jan 2021 19:11:32 +0000 (19:11 +0000)]
tools/oxenstored: Fix quota calculation for mkdir EEXIST
We increment the domain's quota on mkdir even when the node already exists.
This results in a quota inconsistency after live update, where reconstructing
the tree from scratch results in a different quota.
Not a security issue because the domain uses up quota faster, so it will only
get a Quota error sooner than it should.
Found by the structured fuzzer.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com> Release-Acked-by: Ian Jackson <iwj@xenproject.org>
(cherry picked from commit c8b96708252a436da44005307f7c195d699bd7c5)
Edwin Török [Fri, 8 Jan 2021 11:57:37 +0000 (11:57 +0000)]
tools/oxenstored: Trim txhistory on xenbus reconnect
There is a global history, containing transactions from the past 0.05s, which
get trimmed whenever any transaction commits or aborts. Destroying a domain
will cause xenopsd to perform some transactions deleting the tree, so that is
fine. But I think that a domain can abuse the xenbus reconnect facility to
cause a large history to be recorded - provided that noone does any
transactions on the system inbetween, which may be difficult to achieve given
squeezed's constant pinging.
The theoretical situation is like this:
- a domain starts a transaction, creates as large a tree as it can, commits
it. Then repeatedly:
- start a transaction, do nothing with it, start a transaction, delete
part of the large tree, write some new unique data there, don't commit
- cause a xenbus reconnect (I think this can be done by writing something
to the ring). This causes all transactions/watches for the connection to
be cleared, but NOT the history, there were no commits, so nobody
trimmed the history, i.e. it the history can contain transactions from
more than just 0.05s
- loop back and start more transactions, you can keep this up indefinitely
without hitting quotas
Now there is a periodic History.trim running every 0.05s, so I don't think you
can do much damage with it. But lets be safe an trim the transaction history
anyway on reconnect.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 2a47797d1f3b14aab4f0368ab833abd311f94a70)
Edwin Török [Fri, 15 Jan 2021 18:23:10 +0000 (18:23 +0000)]
tools/ocaml/libs/xb: Do not crash after xenbus is unmapped
Xenmmap.unmap sets the address to MAP_FAILED in xenmmap_stubs.c. If due to a
bug there were still references to the Xenbus and we attempt to use it then we
crash. Raise an exception instead of crashing.
Signed-off-by: Edwin Török <edvin.torok@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 5e317896342d553f0b55f72948bbf93a0f1147d3)
Edwin Török [Wed, 15 Jul 2020 15:10:56 +0000 (16:10 +0100)]
oxenstored: fix ABI breakage introduced in Xen 4.9.0
dbc84d2983969bb47d294131ed9e6bbbdc2aec49 (Xen >= 4.9.0) deleted XS_RESTRICT
from oxenstored, which caused all the following opcodes to be shifted by 1:
reset_watches became off-by-one compared to the C version of xenstored.
Looking at the C code the opcode for reset watches needs:
XS_RESET_WATCHES = XS_SET_TARGET + 2
So add the placeholder `Invalid` in the OCaml<->C mapping list.
(Note that the code here doesn't simply convert the OCaml constructor to
an integer, so we don't need to introduce a dummy constructor).
Igor says that with a suitably patched xenopsd to enable watch reset,
we now see `reset watches` during kdump of a guest in xenstored-access.log.
Julien Grall [Mon, 30 Mar 2020 19:21:52 +0000 (20:21 +0100)]
tools/libxc: misc: Mark const the parameter 'keys' of xc_send_debug_keys()
OCaml is using a string to describe the parameter 'keys' of
xc_send_debug_keys(). Since Ocaml 4.06.01, String_val() will return a
const char * when using -safe-string. This will result to a build
failure because xc_send_debug_keys() expects a char *.
The function should never modify the parameter 'keys' and therefore the
parameter should be const. Unfortunately, this is not directly possible
because DECLARE_HYPERCALL_BOUNCE() is expecting a non-const variable.
A new macro DECLARE_HYPERCALL_BOUNCE_IN() is introduced and will take
care of const parameter. The first user will be xc_send_debug_keys() but
this can be used in more place in the future.
Reported-by: Dario Faggioli <dfaggioli@suse.com> Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 2b8079610ec55413613ad071cc81cd9f97232a7e)
Julien Grall [Mon, 30 Mar 2020 17:50:08 +0000 (18:50 +0100)]
tools/ocaml: libxb: Avoid to use String_val() when value is bytes
Commit ec7d54dd1a "ocaml/libs/xb: Use bytes in place of strings for
mutable buffers" switch mutable buffers from string to bytes. However
the C code were still using String_Val() to access them.
While the underlying structure is the same between string and bytes, a
string is meant to be immutable. OCaml 4.06.1 and later will enforce it.
Therefore, it will not be possible to build the OCaml libs when using
-safe-string. This is because String_val() will return a const value.
To avoid plain cast in the code, the code is now switched to use
Bytes_val(). As the macro is not defined in older OCaml version, we need
to provide a stub.
Take the opportunity to switch to const the buffer in
ml_interface_write() as it should not be modified.
Reported-by: Dario Faggioli <dfaggioli@suse.com> Signed-off-by: Julien Grall <jgrall@amazon.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 78686437e949a85a207ae1a0d637efe2d3778bbe)
Julien Grall [Mon, 30 Mar 2020 14:14:23 +0000 (15:14 +0100)]
tools/ocaml: Fix stubs build when OCaml has been compiled with -safe-string
The OCaml code has been fixed to handle properly -safe-string in Xen
4.11, however the stubs part were missed.
On OCaml newer than 4.06.1, String_Val() will return a const char *
when using -safe-string leading to build failure when this is used
in place where char * is expected.
The main use in Xen code base is when a new string is allocated. The
suggested approach by the OCaml community [1] is to use the helper
caml_alloc_initialized_string() but it was introduced by OCaml 4.06.1.
The next best approach is to cast String_val() to (char *) as the helper
would have done. So use it when we need to update the new string using
memcpy().
Take the opportunity to remove the unnecessary cast of the source as
mempcy() is expecting a void *.
[1] https://github.com/ocaml/ocaml/pull/1274
Reported-by: Dario Faggioli <dfaggioli@suse.com> Signed-off-by: Julien Grall <jgrall@amazon.com> Acked-by: Christian Lindig <christian.lindig@citrix.com>
(cherry picked from commit 59b087e3954402c487e0abb4ad9bd05f43669436)
Anthony PERARD [Thu, 18 Mar 2021 14:07:59 +0000 (15:07 +0100)]
libxl: Fix domain soft reset state handling
In do_domain_soft_reset(), a `libxl__domain_suspend_state' is used
without been properly initialised and disposed of. This lead do a
abort() in libxl due to the `dsps.qmp' state been used before been
initialised:
libxl__ev_qmp_send: Assertion `ev->state == qmp_state_disconnected || ev->state == qmp_state_connected' failed.
Once initialised, `dsps' also needs to be disposed of as the `qmp'
state might still be in the `Connected' state in the callback for
libxl__domain_suspend_device_model(). So this patch adds
libxl__domain_suspend_dispose() which can be called from the two
places where we need to dispose of `dsps'.
This is XSA-368.
Reported-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org> Tested-by: Olaf Hering <olaf@aepfle.de>
master commit: dae3c3e8b257cd27d6b35a467a34bf79a6650340
master date: 2021-03-18 14:56:33 +0100
Dario Faggioli [Thu, 18 Mar 2021 14:07:17 +0000 (15:07 +0100)]
xen: fix for_each_cpu when NR_CPUS=1
When running an hypervisor build with NR_CPUS=1 for_each_cpu does not
take into account whether the bit of the CPU is set or not in the
provided mask.
This means that whatever we have in the bodies of these loops is always
done once, even if the mask was empty and it should never be done. This
is clearly a bug and was in fact causing an assert to trigger in credit2
code.
Removing the special casing of NR_CPUS == 1 makes things work again.
Igor Druzhinin [Thu, 18 Mar 2021 14:06:47 +0000 (15:06 +0100)]
vtd: make sure QI/IR are disabled before initialisation
BIOS might pass control to Xen leaving QI and/or IR in enabled and/or
partially configured state. In case of x2APIC code path where EIM is
enabled early in boot - those are correctly disabled by Xen before any
attempt to configure. But for xAPIC that step is missing which was
proven to cause QI initialization failures on some ICX based platforms
where QI is left pre-enabled and partially configured by BIOS. That
problem becomes hard to avoid since those platforms are shipped with
x2APIC opt out being advertised by default at the same time by firmware.
Unify the behaviour between x2APIC and xAPIC code paths keeping that in
line with what Linux does.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 04181c6fb543db01f635227c7681ced4073109ba
master date: 2021-03-12 17:01:52 +0100
Jan Beulich [Thu, 18 Mar 2021 14:06:12 +0000 (15:06 +0100)]
x86/shadow: suppress "fast fault path" optimization without reserved bits
When none of the physical address bits in PTEs are reserved, we can't
create any 4k (leaf) PTEs which would trigger reserved bit faults. Hence
the present SHOPT_FAST_FAULT_PATH machinery needs to be suppressed in
this case, which is most easily achieved by never creating any magic
entries.
To compensate a little, eliminate sh_write_p2m_entry_post()'s impact on
such hardware.
While at it, also avoid using an MMIO magic entry when that would
truncate the incoming GFN.
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Tim Deegan <tim@xen.org>
x86/shadow: suppress "fast fault path" optimization when running virtualized
We can't make correctness of our own behavior dependent upon a
hypervisor underneath us correctly telling us the true physical address
with hardware uses. Without knowing this, we can't be certain reserved
bit faults can actually be observed. Therefore, besides evaluating the
number of address bits when deciding whether to use the optimization,
also check whether we're running virtualized ourselves. (Note that since
we may get migrated when running virtualized, the number of address bits
may also change.)
Requested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Tim Deegan <tim@xen.org>
master commit: 9318fdf757ec234f0ee6c5cd381326b2f581d065
master date: 2021-03-05 13:29:28 +0100
master commit: 60c0444fae2148452f9ed0b7c49af1fa41f8f522
master date: 2021-03-08 10:41:50 +0100
Jan Beulich [Thu, 18 Mar 2021 14:05:33 +0000 (15:05 +0100)]
crypto: adjust rijndaelEncrypt() prototype for gcc11
The upcoming release complains, not entirely unreasonably:
In file included from rijndael.c:33:
.../xen/include/crypto/rijndael.h:55:53: note: previously declared as 'const unsigned char[]'
55 | void rijndaelEncrypt(const unsigned int [], int, const unsigned char [],
| ^~~~~~~~~~~~~~~~~~~~~~
rijndael.c:865:8: error: argument 4 of type 'u8[16]' {aka 'unsigned char[16]'} with mismatched bound [-Werror=array-parameter=]
865 | u8 ct[16])
| ~~~^~~~~~
In file included from rijndael.c:33:
.../xen/include/crypto/rijndael.h:56:13: note: previously declared as 'unsigned char[]'
56 | unsigned char []);
| ^~~~~~~~~~~~~~~~
Simply declare the correct array dimensions right away. This then allows
compilers to apply checking at call sites, which seems desirable anyway.
For the moment I'm leaving untouched the disagreement between u8/u32
used in the function definition and unsigned {char,int} used in the
declaration, as making this consistent would call for touching further
functions.
Reported-by: Charles Arnold <carnold@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com>
master commit: c6ad5a701b9a6df443a6c98d9e7201c958bbcafc
master date: 2021-03-04 16:47:51 +0100
Julien Grall [Fri, 5 Mar 2021 14:48:45 +0000 (15:48 +0100)]
xen/sched: Add missing memory barrier in vcpu_block()
The comment in vcpu_block() states that the events should be checked
/after/ blocking to avoids wakeup waiting race. However, from a generic
perspective, set_bit() doesn't prevent re-ordering. So the following
could happen:
CPU0 (blocking vCPU A) | CPU1 ( unblock vCPU A)
|
A <- read local events |
| set local events
| test_and_clear_bit(_VPF_blocked)
| -> Bail out as the bit if not set
|
set_bit(_VFP_blocked) |
|
check A |
The variable A will be 0 and therefore the vCPU will be blocked when it
should continue running.
vcpu_block() is now gaining an smp_mb__after_atomic() to prevent the CPU
to read any information about local events before the flag _VPF_blocked
is set.
When using atomic variables for synchronization barriers are needed
to ensure proper data serialization. Introduce smp_mb__before_atomic()
and smp_mb__after_atomic() as in the Linux kernel for that purpose.
Jan Beulich [Fri, 5 Mar 2021 14:48:06 +0000 (15:48 +0100)]
x86/EFI: suppress GNU ld 2.36'es creation of base relocs
All of the sudden ld creates base relocations itself, for PE
executables - as a result we now have two of them for every entity to
be relocated. While we will likely want to use this down the road, it
doesn't work quite right yet in corner cases, so rather than suppressing
our own way of creating the relocations we need to tell ld to avoid
doing so.
Probe whether --disable-reloc-section (which was introduced by the same
commit making relocation generation the default) is recognized by ld's PE
emulation, and use the option if so. (To limit redundancy, move the first
part of setting EFI_LDFLAGS earlier, and use it already while probing.)
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 76cbb9c3f4dd9ab6aa44eeacab84fb88b2e8bfc1
master date: 2021-02-25 15:11:58 +0100
Jan Beulich [Fri, 5 Mar 2021 14:47:35 +0000 (15:47 +0100)]
gnttab: bypass IOMMU (un)mapping when a domain is (un)mapping its own grant
Mappings for a domain's own pages should already be present in the
IOMMU. While installing the same mapping again is merely redundant (and
inefficient), removing the mapping when the grant mapping gets removed
is outright wrong in this case: The mapping was there before the map, so
should remain in place after unmapping.
This affects
- Arm Dom0 in the direct mapped case,
- x86 PV Dom0 in the "iommu=dom0-strict" / "dom0-iommu=strict" case,
- all x86 PV DomU-s, including driver domains.
See the code comment for why it's the original domain and not the page
owner that gets compared against.
Jan Beulich [Fri, 5 Mar 2021 14:46:55 +0000 (15:46 +0100)]
gnttab: never permit mapping transitive grants
Transitive grants allow an intermediate domain I to grant a target
domain T access to a page which origin domain O did grant I access to.
As an implementation restriction, T is not allowed to map such a grant.
This restriction is currently tried to be enforced by marking active
entries resulting from transitive grants as is-sub-page; sub-page grants
for obvious reasons don't allow mapping. However, marking (and checking)
only active entries is insufficient, as a map attempt may also occur on
a grant not otherwise in use. When not presently in use (pin count zero)
the grant type itself needs checking. Otherwise T may be able to map an
unrelated page owned by I. This is because the "transitive" sub-
structure of the v2 union would end up being interpreted as "full_page"
sub-structure instead. The low 32 bits of the GFN used would match the
grant reference specified in I's transitive grant entry, while the upper
32 bits could be random (depending on how exactly I sets up its grant
table entries).
Note that if one mapping already exists and the granting domain _then_
changes the grant to GTF_transitive (which the domain is not supposed to
do), the changed type will only be honored after the pin count has gone
back to zero. This is no different from e.g. GTF_readonly or
GTF_sub_page becoming set when a grant is already in use.
While adjusting the implementation, also adjust commentary in the public
header to better reflect reality.
Jan Beulich [Fri, 5 Mar 2021 14:45:54 +0000 (15:45 +0100)]
x86emul: fix SYSENTER/SYSCALL switching into 64-bit mode
When invoked by compat mode, mode_64bit() will be false at the start of
emulation. The logic after complete_insn, however, needs to consider the
mode switched into, in particular to avoid truncating RIP.
Inspired by / paralleling and extending Linux commit 943dea8af21b ("KVM:
x86: Update emulator context mode if SYSENTER xfers to 64-bit mode").
While there, tighten a related assertion in x86_emulate_wrapper() - we
want to be sure to not switch into an impossible mode when the code gets
built for 32-bit only (as is possible for the test harness).
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citirix.com>
master commit: f3e1eb2f0234c955243a915d69ebd84f26eec130
master date: 2021-02-11 17:53:10 +0100
Andrew Cooper [Fri, 5 Mar 2021 14:45:06 +0000 (15:45 +0100)]
x86/ucode/amd: Fix OoB read in cpu_request_microcode()
verify_patch_size() is a maximum size check, and doesn't have a minimum bound.
If the microcode container encodes a blob with a length less than 64 bytes,
the subsequent calls to microcode_fits()/compare_header() may read off the end
of the buffer.
Fixes: 4de936a38a ("x86/ucode/amd: Rework parsing logic in cpu_request_microcode()") Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1cbc4d89c45cba3929f1c0cb4bca0b000c4f174b
master date: 2021-02-10 13:23:51 +0000
Jan Beulich [Fri, 5 Mar 2021 14:44:47 +0000 (15:44 +0100)]
x86/EFI: work around GNU ld 2.36 issue
Our linker capability check fails with the recent binutils release's ld:
.../check.o:(.debug_aranges+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_info'
.../check.o:(.debug_info+0x6): relocation truncated to fit: R_X86_64_32 against `.debug_abbrev'
.../check.o:(.debug_info+0xc): relocation truncated to fit: R_X86_64_32 against `.debug_str'+76
.../check.o:(.debug_info+0x11): relocation truncated to fit: R_X86_64_32 against `.debug_str'+d
.../check.o:(.debug_info+0x15): relocation truncated to fit: R_X86_64_32 against `.debug_str'+2b
.../check.o:(.debug_info+0x29): relocation truncated to fit: R_X86_64_32 against `.debug_line'
.../check.o:(.debug_info+0x30): relocation truncated to fit: R_X86_64_32 against `.debug_str'+19
.../check.o:(.debug_info+0x37): relocation truncated to fit: R_X86_64_32 against `.debug_str'+71
.../check.o:(.debug_info+0x3e): relocation truncated to fit: R_X86_64_32 against `.debug_str'
.../check.o:(.debug_info+0x45): relocation truncated to fit: R_X86_64_32 against `.debug_str'+5e
.../check.o:(.debug_info+0x4c): additional relocation overflows omitted from the output
Tell the linker to strip debug info as a workaround. Debug info has been
getting stripped already anyway when linking the actual xen.efi.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: f4318db940c39cc656128fcf72df3e79d2e55bc1
master date: 2021-02-05 14:09:42 +0100
Roger Pau Monné [Fri, 5 Mar 2021 14:44:18 +0000 (15:44 +0100)]
x86/efi: enable MS ABI attribute on clang
Or else the EFI service calls will use the wrong calling convention.
The __ms_abi__ attribute is available on all supported versions of
clang. Add a specific Clang check because the GCC version reported by
Clang is below the required 4.4 to use the __ms_abi__ attribute.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Ian Jackson <iwj@xenproject.org>
master commit: 92f5ffa58d188c9f9a9f1bcdccb6d6348d9df612
master date: 2021-02-04 14:02:32 +0100
Jan Beulich [Fri, 5 Mar 2021 14:43:52 +0000 (15:43 +0100)]
x86/string: correct memmove()'s forwarding to memcpy()
With memcpy() expanding to the compiler builtin, we may not hand it
overlapping source and destination. We strictly mean to forward to our
own implementation (a few lines up in the same source file).
Fixes: 78825e1c60fa ("x86/string: Clean up x86/string.h") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7b93d92a35dc7c0a6e5f1f79b3c887aa3e66ddc0
master date: 2021-02-04 13:59:56 +0100
Tamas K Lengyel [Fri, 5 Mar 2021 14:43:31 +0000 (15:43 +0100)]
x86/debug: fix page-overflow bug in dbg_rw_guest_mem
When using gdbsx dbg_rw_guest_mem is used to read/write guest memory. When the
buffer being accessed is on a page-boundary, the next page needs to be grabbed
to access the correct memory for the buffer's overflown parts. While
dbg_rw_guest_mem has logic to handle that, it broke with 229492e210a. Instead
of grabbing the next page the code right now is looping back to the
start of the first page. This results in errors like the following while trying
to use gdb with Linux' lx-dmesg:
[ 0.114457] PM: hibernation: Registered nosave memory: [mem
0xfdfff000-0xffffffff]
[ 0.114460] [mem 0x90000000-0xfbffffff] available for PCI demem 0
[ 0.114462] f]f]
Python Exception <class 'ValueError'> embedded null character:
Error occurred in Python: embedded null character
Fixing this bug by taking the variable assignment outside the loop.
Fixes: 229492e210a ("x86/debugger: use copy_to/from_guest() in dbg_rw_guest_mem()") Signed-off-by: Tamas K Lengyel <tamas@tklengyel.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9dc687f155a57216b83b17f9cde55dd43e06b0cd
master date: 2021-01-30 03:21:33 +0000
Jan Beulich [Fri, 5 Mar 2021 14:42:59 +0000 (15:42 +0100)]
x86/HVM: re-order error path of hvm_domain_initialise()
hvm_destroy_all_ioreq_servers(), called from
hvm_domain_relinquish_resources(), invokes relocate_portio_handler(),
which uses d->arch.hvm.io_handler. Defer freeing of this array
accordingly on the error path of hvm_domain_initialise().
Similarly rtc_deinit() requires d->arch.hvm.pl_time to still be around,
or else an armed timer structure would get freed, and that timer never
get killed.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: fbb3bf002b42803ef289ea2a649ebd5f25d22236
master date: 2021-01-29 11:36:54 +0100
Jan Beulich [Fri, 5 Mar 2021 14:42:31 +0000 (15:42 +0100)]
memory: bail from page scrubbing when CPU is no longer online
Scrubbing can significantly delay the offlining (parking) of a CPU (e.g.
because of booting into in smt=0 mode), to a degree that the "CPU <n>
still not dead..." messages logged on x86 in 1s intervals can be seen
multiple times. There are no softirqs involved in this process, so
extend the existing preemption check in the scrubbing logic to also exit
when the CPU is no longer observed online.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 3c9fd69416f8ffc611705fb24dfb383203ddc84f
master date: 2021-01-29 11:34:37 +0100
Andrew Cooper [Fri, 5 Mar 2021 14:41:35 +0000 (15:41 +0100)]
x86/timer: Fix boot on Intel systems using ITSSPRC static PIT clock gating
Recent Intel client devices have disabled the legacy PIT for powersaving
reasons, breaking compatibility with a traditional IBM PC. Xen depends on a
legacy timer interrupt to check that the IO-APIC/PIC routing is configured
correctly, and fails to boot with:
(XEN) *******************************
(XEN) Panic on CPU 0:
(XEN) IO-APIC + timer doesn't work! Boot with apic_verbosity=debug and send report. Then try booting with the `noapic` option
(XEN) *******************************
While this setting can be undone by Xen, the details of how to differ by
chipset, and would be very short sighted for battery based devices. See bit 2
"8254 Static Clock Gating Enable" in:
All impacted systems have an HPET, but there is no indication of the absence
of PIT functionality, nor a suitable way to probe for its absence. As a short
term fix, reconfigure the HPET into legacy replacement mode. A better
longterm fix would be to avoid the reliance on the timer interrupt entirely.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Jason Andryuk <jandryuk@gmail.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: e1de4c196a2eb4fd5063c30a2e115adf144bdeef
master date: 2021-01-27 19:15:19 +0000
Jan Beulich [Fri, 5 Mar 2021 14:41:11 +0000 (15:41 +0100)]
xen/include: compat/xlat.h may change with .config changes
$(xlat-y) getting derived from $(headers-y) means its contents may
change with changes to .config. The individual files $(xlat-y) refers
to, otoh, may not change, and hence not trigger rebuilding of xlat.h.
(Note that the issue was already present before the commit referred to
below, but it was far more limited in affecting only changes to
CONFIG_XSM_FLASK.)
Fixes: 2c8fabb2232d ("x86: only generate compat headers actually needed") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: c7b0f25e8f86373ed54e1c446f8e67ce25ac6819
master date: 2021-01-26 14:42:23 +0100
Roger Pau Monné [Fri, 5 Mar 2021 14:40:32 +0000 (15:40 +0100)]
x86/vioapic: check IRR before attempting to inject interrupt after EOI
In vioapic_update_EOI the irq_lock will be dropped in order to forward
the EOI to the dpci handler, so there's a window between clearing IRR
and checking if the line is asserted where IRR can change behind our
back.
Fix this by checking whether IRR is set before attempting to inject a
new interrupt.
Fixes: 06e3f8f2766 ('vt-d: Do dpci eoi outside of irq_lock.') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ba584fb1a26c058ebd0e6a2779287b3e4400415c
master date: 2021-01-22 12:13:05 +0100
On ARM we need gnttab_need_iommu_mapping to be true for dom0 when it is
directly mapped and IOMMU is enabled for the domain, like the old check
did, but the new check is always false.
In fact, need_iommu_pt_sync is defined as dom_iommu(d)->need_sync and
need_sync is set as:
if ( !is_hardware_domain(d) || iommu_hwdom_strict )
hd->need_sync = !iommu_use_hap_pt(d);
iommu_use_hap_pt(d) means that the page-table used by the IOMMU is the
P2M. It is true on ARM. need_sync means that you have a separate IOMMU
page-table and it needs to be updated for every change. need_sync is set
to false on ARM. Hence, gnttab_need_iommu_mapping(d) is false too,
which is wrong.
As a consequence, when using PV network from a domU on a system where
IOMMU is on from Dom0, I get:
(XEN) smmu: /smmu@fd800000: Unhandled context fault: fsr=0x402, iova=0x8424cb148, fsynr=0xb0001, cb=0
[ 68.290307] macb ff0e0000.ethernet eth0: DMA bus error: HRESP not OK
The fix is to go back to something along the lines of the old
implementation of gnttab_need_iommu_mapping.
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Fixes: 91d4eca7add ("mm / iommu: split need_iommu() into has_iommu_pt() and need_iommu_pt_sync()")
Backport: 4.13+
(cherry picked from commit 04085ec1ac05a362812e9b0c6b5a8713d7dc88ad)
Julien Grall [Tue, 16 Feb 2021 14:38:11 +0000 (15:38 +0100)]
xen/page_alloc: Only flush the page to RAM once we know they are scrubbed
At the moment, each page are flushed to RAM just after the allocator
found some free pages. However, this is happening before check if the
page was scrubbed.
As a consequence, on Arm, a guest may be able to access the old content
of the scrubbed pages if it has cache disabled (default at boot) and
the content didn't reach the Point of Coherency.
The flush is now moved after we know the content of the page will not
change. This also has the benefit to reduce the amount of work happening
with the heap_lock held.
This is XSA-364.
Fixes: 307c3be3ccb2 ("mm: Don't scrub pages while holding heap lock in alloc_heap_pages()") Signed-off-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3b1cc15f1931ba56d0ee256fe9bfe65509733b27
master date: 2021-02-16 15:32:08 +0100
Roger Pau Monné [Thu, 21 Jan 2021 15:26:16 +0000 (16:26 +0100)]
x86/dpci: do not remove pirqs from domain tree on unbind
A fix for a previous issue removed the pirqs from the domain tree when
they are unbound in order to prevent shared pirqs from triggering a
BUG_ON in __pirq_guest_unbind if they are unbound multiple times. That
caused free_domain_pirqs to no longer unmap the pirqs because they
are gone from the domain pirq tree, thus leaving stale unbound pirqs
after domain destruction if the domain had mapped dpci pirqs after
shutdown.
Take a different approach to fix the original issue, instead of
removing the pirq from d->pirq_tree clear the flags of the dpci pirq
struct to signal that the pirq is now unbound. This prevents calling
pirq_guest_unbind multiple times for the same pirq without having to
remove it from the domain pirq tree.
This is XSA-360.
Fixes: 5b58dad089 ('x86/pass-through: avoid double IRQ unbind during domain cleanup') Reported-by: Samuel Verschelde <samuel.verschelde@vates.fr> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 58427889f5a420cc5226f88524b3228f90b72a58
master date: 2021-01-21 16:11:41 +0100
Jan Beulich [Thu, 21 Jan 2021 15:25:06 +0000 (16:25 +0100)]
x86/ACPI: don't overwrite FADT
When marking fields invalid for our own purposes, we should do so in our
local copy (so we will notice later on), not in the firmware provided
one (which another entity may want to look at again, e.g. after kexec).
Also mark the function parameter const to notice such issues right away.
Instead use the pointer at the firmware copy for specifying an adjacent
printk()'s arguments. If nothing else this at least reduces the number
of relocations the assembler hasto emit and the linker has to process.
Fixes: 62d1a69a4e9f ("ACPI: support v5 (reduced HW) sleep interface") Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 654c917d94d24587bb6b4a8d862baf04b25cbe33
master date: 2021-01-11 14:55:52 +0100
Roger Pau Monné [Thu, 21 Jan 2021 15:24:37 +0000 (16:24 +0100)]
x86/hypercall: fix gnttab hypercall args conditional build on pvshim
A pvshim build doesn't require the grant table functionality built in,
but it does require knowing the number of arguments the hypercall has
so the hypercall parameter clobbering works properly.
Instead of also setting the argument count for the gnttab case if PV
shim functionality is enabled, just drop all of the conditionals from
hypercall_args_table, as a hypercall having a NULL handler won't get
to use that information anyway.
Note this hasn't been detected by osstest because the tools pvshim
build is done without debug enabled, so the hypercall parameter
clobbering doesn't happen.
Fixes: d2151152dd2 ('xen: make grant table support configurable') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b468b464c89e92629bd52cec58e9f51eee2e950a
master date: 2021-01-08 16:51:52 +0100
Roger Pau Monné [Thu, 21 Jan 2021 15:23:58 +0000 (16:23 +0100)]
x86/dpci: EOI interrupt regardless of its masking status
Modify hvm_pirq_eoi to always EOI the interrupt if required, instead
of not doing such EOI if the interrupt is routed through the vIO-APIC
and the entry is masked at the time the EOI is performed.
Further unmask of the vIO-APIC pin won't EOI the interrupt, and thus
the guest OS has to wait for the timeout to expire and the automatic
EOI to be performed.
This allows to simplify the helpers and drop the vioapic_redir_entry
parameter from all of them.
Fixes: ccfe4e08455 ('Intel vt-d specific changes in arch/x86/hvm/vmx/vtd.') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: eb298f32fac5ac362eef30a66a9c9c42724d4ce6
master date: 2021-01-07 15:10:29 +0100
Jan Beulich [Thu, 21 Jan 2021 15:23:23 +0000 (16:23 +0100)]
x86/vPCI: tolerate (un)masking a disabled MSI-X entry
None of the four reasons causing vpci_msix_arch_mask_entry() to get
called (there's just a single call site) are impossible or illegal prior
to an entry actually having got set up:
- the entry may remain masked (in this case, however, a prior masked ->
unmasked transition would already not have worked),
- MSI-X may not be enabled,
- the global mask bit may be set,
- the entry may not otherwise have been updated.
Hence the function asserting that the entry was previously set up was
simply wrong. Since the caller tracks the masked state (and setting up
of an entry would only be effected when that software bit is clear),
it's okay to skip both masking and unmasking requests in this case.
Fixes: d6281be9d0145 ('vpci/msix: add MSI-X handlers') Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Tested-by: Manuel Bouyer <bouyer@antioche.eu.org>
master commit: 04b090366ca59e8a75837c822df261a8d0bd1a30
master date: 2021-01-05 13:17:54 +0100