George Dunlap [Fri, 6 Dec 2019 11:47:08 +0000 (12:47 +0100)]
Rationalize max_grant_frames and max_maptrack_frames handling
Xen used to have single, system-wide limits for the number of grant
frames and maptrack frames a guest was allowed to create. Increasing
or decreasing this single limit on the Xen command-line would change
the limit for all guests on the system.
Later, per-domain limits for these values were created. The system-wide
limits became strict limits: domains could not be created with higher
limits, but could be created with lower limits. However, that change
also introduced a range of different "default" values into various
places in the toolstack:
- The python libxc bindings hard-coded these values to 32 and 1024,
respectively
- The libxl default values are 32 and 1024 respectively.
- xl will use the libxl default for maptrack, but does its own default
calculation for grant frames: either 32 or 64, based on the max
possible mfn.
These defaults interact poorly with the hypervisor command-line limit:
- The hypervisor command-line limit cannot be used to raise the limit
for all guests anymore, as the default in the toolstack will
effectively override this.
- If you use the hypervisor command-line limit to *reduce* the limit,
then the "default" values generated by the toolstack are too high,
and all guest creations will fail.
In other words, the toolstack defaults require any change to be
effected by having the admin explicitly specify a new value in every
guest.
In order to address this, have grant_table_init treat negative values
for max_grant_frames and max_maptrack_frames as instructions to use the
system-wide default, and have all the above toolstacks default to passing
-1 unless a different value is explicitly configured.
This restores the old behavior in that changing the hypervisor command-line
option can change the behavior for all guests, while retaining the ability
to set per-guest values. It also removes the bug that reducing the
system-wide max will cause all domains without explicit limits to fail.
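The selection rule described above can be sketched as follows. This is a simplified model and not the actual grant_table_init() code: the helper name resolve_frames_limit() and the choice of -EINVAL for an over-limit request are assumptions made for illustration.

```c
#include <errno.h>

/*
 * Hypothetical helper: pick the effective per-domain limit.
 *  - requested < 0:          use the system-wide value (Xen command line).
 *  - requested > system_max: reject; the system-wide value is a strict cap.
 *  - otherwise:              honour the explicit per-domain value.
 * Returns the limit, or a negative errno value (assumed: -EINVAL).
 */
int resolve_frames_limit(int requested, unsigned int system_max)
{
    if (requested < 0)
        return (int)system_max;

    if ((unsigned int)requested > system_max)
        return -EINVAL;

    return requested;
}
```

With this rule a toolstack that passes -1 follows whatever the hypervisor command line says, in either direction, while an explicit per-guest value is still honoured as long as it fits under the cap.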
NOTE: - The Ocaml bindings require the caller to always specify a value,
and the code to start a xenstored stubdomain hard-codes these to 4
and 128 respectively; this behaviour will not be modified.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Wei Liu <wl@xen.org>
Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
master commit: f2ae59bc4b9b5c3f12de86aa42cdf413d2c3ffbf
master date: 2019-11-29 21:43:49 +0000
Paul Durrant [Fri, 6 Dec 2019 11:46:24 +0000 (12:46 +0100)]
x86 / iommu: set up a scratch page in the quarantine domain
This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.
The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.
NOTE: These modifications are restricted to x86 implementations only as
the buggy h/w I am aware of is only used with Xen in an x86
environment. ARM may require similar code but, since I am not
aware of the need, this patch does not modify any ARM implementation.
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ea38867831da67eed0e9c61672c8941016b63dd9
master date: 2019-11-29 18:27:54 +0000
Julien Grall [Fri, 6 Dec 2019 11:45:48 +0000 (12:45 +0100)]
xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed
A guest will set up a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.
This means that if the guest fails to invoke XENPMU_finish, e.g. if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.
As Xen should never rely on the guest to correctly clean up any
allocation in the hypervisor, we should also unmap such pages during
domain destruction if there are any left.
We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic into a new function that can also be called
from vpmu_destroy().
NOTE: - The call to vpmu_destroy() must also be moved from
arch_vcpu_destroy() into domain_relinquish_resources() such that
the reference on the mapped page does not prevent domain_destroy()
(which calls arch_vcpu_destroy()) from being called.
- Whilst it appears that vpmu_arch_destroy() is idempotent it is
by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
flag is cleared at the end of vpmu_arch_destroy().
- This is not an XSA because vPMU is not security supported (see
XSA-163).
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be18e39d2f69038804b27c30026754deaeefa543
master date: 2019-11-29 18:23:24 +0000
Andrew Cooper [Fri, 6 Dec 2019 11:45:05 +0000 (12:45 +0100)]
x86/svm: Write the correct %eip into the outgoing task
The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.
This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).
32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.
In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1d758bc6d1a8c0f658a874470c349ee4e27aee46
master date: 2019-11-28 17:14:38 +0000
Andrew Cooper [Fri, 6 Dec 2019 11:44:24 +0000 (12:44 +0100)]
x86/svm: Always intercept ICEBP
ICEBP isn't handled well by SVM.
The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it. As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.
To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.
We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).
Unconditionally intercept ICEBP. This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit. This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.
This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBPs
will now always be submitted for monitoring checks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: e2585f8c2e0d43d350503ff2b2be252adc6b7239
master date: 2019-11-28 17:14:38 +0000
Andrew Cooper [Fri, 6 Dec 2019 11:43:43 +0000 (12:43 +0100)]
x86/vtx: Fix fault semantics for early task switch failures
The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:
1) Early faults (i.e. ones delivered in the context of the old task) get
delivered with trap semantics, and break restartability.
2) The addition isn't truncated to 32 bits. In the corner case of a task
switch instruction crossing the 4G->0 boundary taking an early fault (with
trap semantics), a VMEntry failure will occur due to %rip being out of
range.
Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.
For now, pass 0 on the SVM side. This highlights a separate preexisting bug
which will be addressed in the following patch.
While adjusting call sites, drop the unnecessary uint16_t cast.
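The 32-bit truncation in point 2 can be illustrated with a small standalone sketch. The helper name outgoing_tss_eip() is hypothetical and is not a function in Xen; it only models the arithmetic of the value written into the outgoing TSS, with %rip itself left untouched.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute the %eip to store in the outgoing TSS.
 * insn_len is the length of the instruction that triggered the task
 * switch (0 when no length is supplied). The sum is truncated to
 * 32 bits, so an instruction crossing the 4G->0 boundary wraps instead
 * of producing an out-of-range value.
 */
uint32_t outgoing_tss_eip(uint64_t rip, unsigned int insn_len)
{
    return (uint32_t)(rip + insn_len);
}
```

For example, a 4-byte instruction starting at 0xfffffffe yields a saved %eip of 2 rather than 0x100000002.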
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 943c74bc0ee5044a826e428a3b2ffbdf9a43628d
master date: 2019-11-28 17:14:38 +0000
Jan Beulich [Fri, 6 Dec 2019 11:42:56 +0000 (12:42 +0100)]
x86/IRQ: make internally used IRQs also honor the pending EOI stack
At the time the pending EOI stack was introduced there were no
internally used IRQs which would have the LAPIC EOI issued from the
->end() hook. This had then changed with the introduction of IOMMUs,
but the interaction issue was presumably masked by
irq_guest_eoi_timer_fn() frequently EOI-ing interrupts way too early
(which got fixed by 359cf6f8a0ec ["x86/IRQ: don't keep EOI timer
running without need"]).
The problem is that with us re-enabling interrupts across handler
invocation, a higher priority (guest) interrupt may trigger while
handling a lower priority (internal) one. The EOI issued from
->end() (for ACKTYPE_EOI kind interrupts) would then mistakenly
EOI the higher priority (guest) interrupt, breaking (among other
things) pending EOI stack logic's assumptions.
Notes:
- In principle we could get away without the check_eoi_deferral flag.
I've introduced it just to make sure there's as little change as
possible to unaffected paths.
- Similarly the cpu_has_pending_apic_eoi() check in do_IRQ() isn't
strictly necessary.
- The new function's name isn't very helpful with its use in
end_level_ioapic_irq_new(). I did also consider eoi_APIC_irq() (to
parallel ack_APIC_irq()), but then liked this even less.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 5655ce8b1ec2a82ef080078e41c73bbd536174e1
master date: 2019-11-28 15:14:03 +0100
Roger Pau Monné [Fri, 6 Dec 2019 11:42:13 +0000 (12:42 +0100)]
x86/vmx: always sync PIR to IRR before vmentry
When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.
Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.
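A minimal model of the idea, assuming a 256-vector bitmap for both PIR and IRR. The struct and function names are hypothetical and this is not the Xen implementation; in particular, real code would need an atomic exchange on PIR because hardware can post interrupts concurrently.

```c
#include <stdint.h>

#define NR_VECTORS 256
#define PIR_WORDS  (NR_VECTORS / 32)

/* Hypothetical model of the two bitmaps involved. */
struct vlapic_model {
    uint32_t pir[PIR_WORDS]; /* posted-interrupt requests */
    uint32_t irr[PIR_WORDS]; /* local APIC interrupt request register */
};

/* Move every posted bit into IRR and clear PIR (non-atomic in this sketch). */
void sync_pir_to_irr(struct vlapic_model *v)
{
    for (unsigned int i = 0; i < PIR_WORDS; i++) {
        uint32_t pending = v->pir[i];

        v->pir[i] = 0;
        v->irr[i] |= pending;
    }
}

/*
 * Sync unconditionally before looking at IRR, so the answer never
 * depends on which other interrupt sources happen to be pending.
 * Returns the highest pending vector, or -1 if none.
 */
int highest_pending_vector(struct vlapic_model *v)
{
    sync_pir_to_irr(v);

    for (int i = PIR_WORDS - 1; i >= 0; i--)
        for (int bit = 31; bit >= 0; bit--)
            if (v->irr[i] & (UINT32_C(1) << bit))
                return i * 32 + bit;

    return -1;
}
```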
Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 56348df32bbc782e63b6e3fb978b80e015ae76e7
master date: 2019-11-28 11:58:25 +0100
Jan Beulich [Fri, 6 Dec 2019 11:41:42 +0000 (12:41 +0100)]
EFI: fix "efi=attr=" handling
Commit 633a40947321 ("docs: Improve documentation and parsing for efi=")
failed to honor the strcmp()-like return value convention of
cmdline_strcmp().
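The class of mistake can be shown with the standard strcmp(), used here as a stand-in for cmdline_strcmp(). The function names and the "uc" value are illustrative assumptions, not a copy of the affected code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Buggy: strcmp() returns 0 on a match, so treating its result as a
 * boolean is true for every string EXCEPT the one being looked for.
 */
bool is_uc_buggy(const char *s)
{
    return strcmp(s, "uc");
}

/* Correct: a match is signalled by a return value of 0. */
bool is_uc(const char *s)
{
    return strcmp(s, "uc") == 0;
}
```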
Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 5530782cfe70ed22fe44358f6a10c38916443b42
master date: 2019-11-26 14:17:45 +0100
There are two mappings active in the middle of do_recalc(), and hence
commit 0d0f4d78e5d1 ("p2m: change write_p2m_entry to return an error
code") should have added (or otherwise invoked) unmapping code just
like it did in p2m_next_level(), despite us not expecting any errors
here. Arrange for the existing unmap invocation to take effect in all
cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 3f1a53bef84fca5ffb4178638db14c747231851f
master date: 2019-11-26 14:17:11 +0100
Anthony PERARD [Fri, 6 Dec 2019 11:40:16 +0000 (12:40 +0100)]
x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible
This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.
This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.
While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
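The general shape of a preemptible batch operation can be sketched as below. This is a standalone model, not the domctl code: process_entries(), the callback type, and the placeholder per-entry work are all assumptions, and a real hypercall would resume through Xen's continuation mechanism rather than a plain return value.

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*preempt_check_fn)(void);

/* Example preemption policies for the sketch. */
bool never_preempt(void)  { return false; }
bool always_preempt(void) { return true; }

/*
 * Process entries [start, count). Returns the index of the first entry
 * NOT yet processed; if that is less than count, the caller should call
 * again with start set to the returned value.
 */
size_t process_entries(const unsigned long *gfns, unsigned int *types,
                       size_t start, size_t count,
                       preempt_check_fn need_preempt)
{
    size_t i;

    for (i = start; i < count; i++) {
        /* Always handle at least one entry per call to guarantee progress. */
        if (i != start && need_preempt())
            break;

        /* Placeholder for the real per-entry lookup done under the lock. */
        types[i] = (unsigned int)(gfns[i] & 0xf);
    }

    return i;
}
```

Because progress is recorded in the returned index, the size of a single request no longer bounds how long the operation holds the CPU.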
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 48599114d3ca24157c25f6684bb9322f6dca12bb
master date: 2019-11-26 14:16:09 +0100
George Dunlap [Fri, 6 Dec 2019 11:39:02 +0000 (12:39 +0100)]
x86: Don't increase ApicIdCoreSize past 7
Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads. This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one. In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus, have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.
Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7. (Linux
guests have so far continued to cope.)
A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches. As a short-term fix,
limit this value to 7.
This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change. This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.
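The short-term fix amounts to a clamp, which can be sketched as follows. The function and macro names are hypothetical; only the "+1" from the fake topology and the limit of 7 come from the description above.

```c
/* Upper bound described above; the macro name is illustrative. */
#define APIC_ID_CORE_SIZE_MAX 7u

/*
 * Hypothetical helper: derive the guest-visible ApicIdCoreSize from the
 * host value. The "fake" topology bumps the field by one (doubling the
 * ApicIds per core); the result is then capped at 7.
 */
unsigned int guest_apic_id_core_size(unsigned int host_size)
{
    unsigned int size = host_size + 1;

    return size > APIC_ID_CORE_SIZE_MAX ? APIC_ID_CORE_SIZE_MAX : size;
}
```

A host value of 7 therefore stays at 7 instead of becoming 8, while smaller host values still get the usual increment.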
Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 8c79c129a6db2220c1089e0ce5fa49e7298b1d3e
master date: 2019-11-26 10:33:52 +0000
xen/arm: entry: Ensure the guest state is synced when receiving a vSError
When a SError/Asynchronous Abort generated by the guest has been
consumed, we will skip the handling of the initial exception.
This includes the calls to enter_hypervisor_from_guest{, _noirq}, which
are used to synchronize part of the guest state with the internal
representation and re-enable workarounds (e.g. SSBD). However, we still
call leave_hypervisor_to_guest() which is used for preempting the guest
and synchronizing back part of the guest state.
enter_hypervisor_from_guest{, _noirq} are paired with
leave_hypervisor_to_guest(), so skipping the first two may result
in a loss of some part of guest state.
An example is the new vGIC which will save the state of the LRs on exit
from the guest and rewrite all of them on entry to the guest.
A more worrying example is that the SSBD workaround may not be re-enabled. If
leave_hypervisor_to_guest() is rescheduling the vCPU, then we may end up
running a lot of code with the SSBD workaround disabled.
For now, calling leave_hypervisor_to_guest() is not necessary when
injecting a vSError to the guest. But it would still be good to give an
opportunity to reschedule. So both enter_hypervisor_from_guest() and
leave_hypervisor_to_guest() are called.
Note that on arm64, the return value for check_pending_vserror is now
stored in x19 instead of x0. This is because we want to keep the value
across calls to C functions (x0, unlike x19, will not be saved by the
callee).
Take the opportunity to rename check_pending_vserror() to
check_pending_guest_serror() as the function is dealing with host SError
and *not* virtual SError. The documentation is also updated across
Arm32 and Arm64 to clarify how Xen is dealing with SError generated by
the guest.
Julien Grall [Mon, 7 Oct 2019 12:57:00 +0000 (13:57 +0100)]
xen/arm: Update the ASSERT() in SYNCHRONIZE_SERROR()
The macro SYNCHRONIZE_SERROR() has an assert to check whether it will
be called with Abort interrupt unmasked. However, this is only done if
a given cap is not enabled.
None of the callers will treat the abort interrupt differently
depending on a feature. Furthermore, it makes it more difficult to check
whether SYNCHRONIZE_SERROR() is going to be called with abort interrupt
unmasked.
Therefore, we now require the abort interrupt to be unmasked regardless
of the state of the cap.
Mark Rutland [Tue, 24 Sep 2019 11:25:47 +0000 (12:25 +0100)]
xen/arm: alternative: add auto-nop infrastructure
In some cases, one side of an alternative sequence is simply a number of
NOPs used to balance the other side. Keeping track of this manually is
tedious, and the presence of large chains of NOPs makes the code more
painful to read than necessary.
To ameliorate matters, this patch adds a new alternative_else_nop_endif,
which automatically balances an alternative sequence with a trivial NOP
sled.
In many cases, we would like a NOP-sled in the default case, and
instructions patched in in the presence of a feature. To enable the NOPs
to be generated automatically for this case, this patch also adds a new
alternative_if, and updates alternative_else and alternative_endif to
work with either alternative_if or alternative_endif.
The alternative infrastructure was originally ported from Linux. So this
is pretty much a straight backport from commit 792d47379f4d "arm64:
alternative: add auto-nop infrastructure". The only difference is the
nops macro added as not yet existing in Xen.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
[will: use new nops macro to generate nop sequences]
Signed-off-by: Will Deacon <will.deacon@arm.com>
[julien: Add nops and port to Xen]
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Volodymyr Babchuk <volodymyr_babchuk@epam.com>
Acked-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit f11fda966365db591d280ac1522993409e20fd8c)
A follow-up patch will need to include insn.h from assembly code. So
we need to protect any C-specific definition to avoid compilation
errors when used in assembly code.
xen/arm: alternative: Remove unused parameter for alternative_if_not_cap
The macro alternative_if_not_cap takes two parameters. The second
parameter is never used and it is hard to see how it could be used
correctly as it is only protecting the alternative section magic.
xen/arm: Ensure the SSBD workaround is re-enabled right after exiting a guest
At the moment, the SSBD workaround is re-enabled for Xen after interrupts
are unmasked. This means we may end up executing some part of the
hypervisor if an interrupt is received before the workaround is
re-enabled.
Each trap may require different interrupts to be unmasked.
As the rest of enter_hypervisor_from_guest() does not require
interrupts to be masked, the function is now split in two parts:
1) enter_hypervisor_from_guest_preirq() called with interrupts
masked.
2) enter_hypervisor_from_guest() called with interrupts unmasked.
Note that while it might be possible to avoid splitting the function in
two parts, it requires a bit more work than I can currently invest to
avoid using an indirect branch.
Furthermore, the function name is rather generic as there might be more
work to do before interrupts are unmasked in the future.
Fixes: a7898e4c59 ("xen/arm: Add ARCH_WORKAROUND_2 support for guests")
Reported-by: Andrii Anisov <andrii_anisov@epam.com>
Signed-off-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit efee8ba9bf84d54e752f2a44c510cdfb3cc0c282)
Julien Grall [Wed, 30 Oct 2019 11:24:59 +0000 (11:24 +0000)]
xen/arm32: entry: Rename save_guest_regs()
The function save_guest_regs() is doing more than saving guest
registers. It also restores the vector table and consumes any pending
SErrors generated by the guest. So rename the function to
prepare_context_from_guest().
Take the opportunity to use ENDPROC() for the benefit of static
analyzers and the reader.
Julien Grall [Thu, 31 Oct 2019 15:09:12 +0000 (15:09 +0000)]
xen/arm: traps: Rework entry/exit from the guest path
At the moment, enter_hypervisor_head() and leave_hypervisor_tail() are
used to deal with actions to be done before/after any guest request is
handled.
While they are meant to work as a pair, the former is called for most of
the traps, including traps from the same exception level (i.e.
hypervisor) whilst the latter will only be called when returning to the
guest.
As pointed out, enter_hypervisor_head() is not called from all the
traps, so this makes it potentially difficult to extend it for dealing
with the same exception level.
Furthermore, some assembly-only paths will need to call
enter_hypervisor_head(). So the function is now called directly by
assembly, for guest vectors only. This means that the check whether we
are called in a guest trap can now be removed.
Take the opportunity to rename enter_hypervisor_head() and
leave_hypervisor_tail() to something more meaningful and document them.
This should help everyone to understand the purpose of the two
functions.
Note that enter_hypervisor_head() does not take any parameters anymore
as after the rework, the code does not use them anymore.
Julien Grall [Thu, 31 Oct 2019 15:09:11 +0000 (15:09 +0000)]
xen/arm64: entry: Check if an SError is pending when receiving a vSError
At the moment, when we receive an SError exception from the guest, we
don't check if there are any others pending. For hardening the code, we
should ensure any pending SErrors are accounted to the guest before
executing any code with SError unmasked.
The recently introduced macro 'guest_vector' could be used to generate the
two vectors and therefore take advantage of any change required in the
future.
Julien Grall [Thu, 31 Oct 2019 15:09:10 +0000 (15:09 +0000)]
xen/arm64: entry: Introduce a macro to generate guest vector and use it
Most of the guest vectors are using the same pattern. This makes it fairly
tedious to alter the pattern and risks introducing mistakes when updating
each path.
A new macro is introduced to generate the guest vectors and is now used
in the ones that open-code the pattern.
Julien Grall [Thu, 31 Oct 2019 15:09:08 +0000 (15:09 +0000)]
xen/arm: traps: Update the correct PC when inject a virtual SError to the guest
When injecting a virtual Abort to the guest, we want to update the guest
PC so it can re-execute the HVC/SMC once it has handled the SError.
This is unfortunately not the case when the SError is synchronized on
entry from the guest. As the SError will be received while running in
hypervisor context, we will update the PC of hypervisor context (i.e.
the trap).
Rework inject_vabt_exception so it uses the guest context rather than
the current one.
Julien Grall [Thu, 31 Oct 2019 15:09:07 +0000 (15:09 +0000)]
docs/misc: xen-command-line: Rework documentation of the option 'serrors'
The current documentation is misleading for a few reasons:
1) The synchronization happens on all exit/entry from/to the guest.
This includes from EL0 (i.e. userspace).
2) A trusted guest can also generate SErrors (e.g. memory failure)
3) Without RAS support, SErrors are IMP DEFINED. Unless you have a
complete TRM in hand, you can't really make a decision.
4) The documentation is written around performance when this is not
the first concern.
The documentation is now reworked to focus on the consequences of using
serrors="panic" and avoid going into detail on the exact implementation.
The documentation on top of __do_serror() is trying to describe all the
possibilities to receive an SError.
The description of type#2 is quite misleading because receiving an
SError in EL2 after unmasking SError interrupt ({PSTATE, CPSR}.A) does
not necessarily imply the SError was generated by the guest. You also
need to be in a special window (see abort_guest_exist_{guest, end}).
However, for the context of the function it does not matter how we
categorize the interrupts. What matters is to know whether this is a
guest-generated SError.
All the documentation of __do_serror() is now reworked to avoid
misleading information.
Take the opportunity to simplify the code after the forward option has
been dropped.
Julien Grall [Thu, 31 Oct 2019 15:09:05 +0000 (15:09 +0000)]
xen/arm: Remove serrors=forward
Per the Arm ARM (D4.5 in ARM DDI 0487E.a), SError may be precise or
imprecise.
Imprecise means the state presented to the exception handler is not
guaranteed to be consistent with any point in the execution stream from
which the exception was taken. In other words, they are likely to be
fatal as you can't return safely from them.
Without the RAS extension, the Arm architecture does not provide a way
to differentiate between imprecise and precise SError. Furthermore Xen
has no support for RAS yet. So from a software POV, there is not much
we can do.
More generally, blindly forwarding SErrors to the guest is likely to be
the wrong thing to do. Indeed, Xen is not able to know the
content of the SError. This may be a critical device used by the
hypervisor that is about to fail.
In a nutshell, the option serrors=forward is not safe to use in any
environment with the current state of Xen. Therefore the option and any
code related to it are completely removed.
Take the opportunity to rework the comment in do_trap_data_abort() as
all SErrors/External Abort generated by the hypervisor will result in
a crash of the system no matter what the user passed on the command
line.
Julien Grall [Thu, 31 Oct 2019 15:09:04 +0000 (15:09 +0000)]
docs/misc: xen-command-line: Remove wrong statement from serrors=diverse
When serrors=diverse is selected by the user, we will only synchronize
the pending SErrors on entry to hypervisor from guest context and exit
from guest to hypervisor context.
We don't need to synchronize SErrors between guest context switches as they
would be categorized as Hypervisor generated SErrors in any case.
Jan Beulich [Tue, 26 Nov 2019 13:23:08 +0000 (14:23 +0100)]
IOMMU: default to always quarantining PCI devices
XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.
Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.
Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).
This is XSA-306.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
master commit: ba2ab00bbb8c74e311a252d816d68dee47c779a0
master date: 2019-11-26 14:15:01 +0100
George Dunlap [Mon, 25 Nov 2019 15:07:20 +0000 (16:07 +0100)]
x86/mm: Adjust linear uses / entries when a page loses validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable of the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to or is pointed to by another page
table, to prevent both from happening at the same time.
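One way to picture this bookkeeping is a single signed counter per page. The sketch below is a simplified model under that assumption, with hypothetical names; it is not Xen's actual data structure and it does not model a page pointing at itself.

```c
#include <stdbool.h>

/*
 * Simplified model: linear_count > 0 means the page holds that many
 * linear entries pointing at other same-level pages; linear_count < 0
 * means it is the target of that many linear entries. Zero means neither.
 */
struct pt_page {
    int linear_count;
};

/* Record that pg points to another page; refused if pg is a target. */
bool inc_linear_entries(struct pt_page *pg)
{
    if (pg->linear_count < 0)
        return false;
    pg->linear_count++;
    return true;
}

/* Record that pg is pointed to; refused if pg itself points elsewhere. */
bool inc_linear_uses(struct pt_page *pg)
{
    if (pg->linear_count > 0)
        return false;
    pg->linear_count--;
    return true;
}

/* Create a linear entry src -> dst, rolling back on failure. */
bool link_linear(struct pt_page *src, struct pt_page *dst)
{
    if (!inc_linear_entries(src))
        return false;
    if (!inc_linear_uses(dst)) {
        src->linear_count--;
        return false;
    }
    return true;
}

/* Undo a successful link_linear(src, dst). */
void unlink_linear(struct pt_page *src, struct pt_page *dst)
{
    src->linear_count--;
    dst->linear_count++;
}
```

In this model, forgetting to call unlink_linear() when a page loses validation leaves the counts too high, which only makes later link attempts fail unnecessarily.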
Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.
Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.
This could happen in one of two places:
1. In the case a partially-devalidated page was re-validated by
someone else
2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.
The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.
Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 77beba7c921a286c31a2a76f26500047f353614a
master date: 2019-11-25 10:58:27 +0000
Andrew Cooper [Mon, 25 Nov 2019 15:06:27 +0000 (16:06 +0100)]
x86/vvmx: Fix livelock with XSA-304 fix
It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.
Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.
Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.
When nested virt is enabled on a domain, force it to use executable
superpages and rebuild the p2m.
Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: 183f354e1430087879de071f0c7122e42703916e
master date: 2019-11-23 14:06:24 +0000
Andrew Cooper [Mon, 25 Nov 2019 15:05:48 +0000 (16:05 +0100)]
x86/livepatch: Prevent patching with active waitqueues
The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.
This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.
In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.
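The check described above can be pictured with a minimal sketch. The names toy_domain, nr_waiting_vcpus and toy_livepatch_check are hypothetical and are not Xen's actual data structures or functions; the only point illustrated is that any active waitqueue makes the attempt fail with -EBUSY.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a domain whose vcpus may be parked on a waitqueue. */
struct toy_domain {
    unsigned int nr_waiting_vcpus;
};

/*
 * Return 0 if patching may proceed, or -EBUSY if any domain still has a
 * vcpu sitting on a waitqueue (its copied stack has not been unwound).
 */
int toy_livepatch_check(const struct toy_domain *doms, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        if (doms[i].nr_waiting_vcpus != 0)
            return -EBUSY;
    return 0;
}
```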
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
master commit: ca4cd3668237d50a0b33b48e7de7f93d9475120d
master date: 2019-11-22 17:05:43 +0000
Roger Pau Monné [Mon, 25 Nov 2019 15:04:34 +0000 (16:04 +0100)]
x86/vlapic: allow setting APIC_SPIV_FOCUS_DISABLED in x2APIC mode
Current code unconditionally prevents setting APIC_SPIV_FOCUS_DISABLED
regardless of the processor model, which is not correct according to
the specification.
This issue was discovered while trying to boot a pvshim with x2APIC
enabled.
Always allow setting APIC_SPIV_FOCUS_DISABLED: the local APIC
provided to guests is emulated by Xen, and as such doesn't depend on
the features found on the hardware processor. Note for example that
Xen offers x2APIC support to guests even when the underlying hardware
doesn't have such a feature.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: d7cd999faa1edf745a7597db811956cb882a5436
master date: 2019-11-22 17:52:59 +0100
Julien Grall [Mon, 25 Nov 2019 15:04:02 +0000 (16:04 +0100)]
xen: Add missing va_end() in hypercall_create_continuation()
The documentation requires va_start() to always be matched with a
corresponding va_end(). However, this is not the case in the path used
for bad format.
This was introduced by XSA-296.
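As an illustration of the pattern (not the actual hypercall_create_continuation() code), the sketch below routes the bad-format path through the same va_end() as the success path. The function toy_sum_args and its format characters are hypothetical.

```c
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical helper: adds up one argument per format character.
 * 'i' consumes an int, 'l' consumes a long; anything else is a bad format.
 * Returns 0 on success, -1 on a bad format character.
 */
int toy_sum_args(long *result, const char *fmt, ...)
{
    va_list args;
    long sum = 0;
    int rc = 0;

    va_start(args, fmt);
    for (const char *p = fmt; *p != '\0'; p++) {
        switch (*p) {
        case 'i':
            sum += va_arg(args, int);
            break;
        case 'l':
            sum += va_arg(args, long);
            break;
        default:
            rc = -1;
            goto done;      /* bad format: still reaches va_end() below */
        }
    }
    *result = sum;

 done:
    va_end(args);           /* matches va_start() on every path */
    return rc;
}
```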
Coverity-ID: 1488727 Fixes: 0bf9f8d3e3 ("xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()") Signed-off-by: Julien Grall <julien@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df7a19338a892b5cf585fd2bee8584cb15e0cace
master date: 2019-11-21 15:50:01 +0000
Anthony PERARD [Mon, 25 Nov 2019 15:03:12 +0000 (16:03 +0100)]
x86: fix race to build arch/x86/efi/relocs-dummy.o
With $(TARGET).efi depending on efi/relocs-dummy.o, arch/x86/Makefile
will attempt to build that object. This may result in a dependency file
being generated that has relocs-dummy.o depending on efi/relocs-dummy.S.
Then, when arch/x86/efi/Makefile tries to build relocs-dummy.o, well
efi/relocs-dummy.S doesn't exist.
Have only one makefile responsible for building relocs-dummy.o.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/Makefile: remove $(guard) use from $(TARGET).efi target
Following the patch 65d104984c04 ("x86: fix race to build
arch/x86/efi/relocs-dummy.o"), the error message
nm: 'efi/relocs-dummy.o': No such file
started to appear on systems which can't build the .efi target. This is
because relocs-dummy.o isn't built anymore.
The error is printed by the evaluation of VIRT_BASE and ALT_BASE which
aren't used anyway.
But, we don't need that file as we don't want to build `$(TARGET).efi'
anyway. On such systems, $(guard) evaluates to the shell builtin ':',
which prevents any of the shell commands in `$(TARGET).efi' from being
executed.
Even if $(guard) is evaluated upon use, it depends on $(XEN_BUILD_PE)
which is evaluated at the assignment. So, we can replace $(guard) in
$(TARGET).efi by having two different rules depending on
$(XEN_BUILD_PE) instead.
The change with this patch is that none of the dependencies of
$(TARGET).efi will be built if the linker doesn't support PE
and VIRT_BASE and ALT_BASE don't get evaluated anymore, so nm will not
complain about the missing relocs-dummy.o file anymore.
Since prelink-efi.o isn't built on systems that can't build
$(TARGET).efi anymore, we can remove the $(guard) variable everywhere.
Reported-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 65d104984c04e69234f77bd3b8f8c0ef85b3f7fa
master date: 2019-11-15 14:18:16 +0100
master commit: 7059afb202ff0d82a6fa94f7ef84e4bb3139914e
master date: 2019-11-20 17:12:12 +0100
Jan Beulich [Mon, 25 Nov 2019 15:01:00 +0000 (16:01 +0100)]
AMD/IOMMU: don't needlessly trigger errors/crashes when unmapping a page
Unmapping a page which has never been mapped should be a no-op (note how
it already is in case there was no root page table allocated). There's
in particular no need to grow the number of page table levels in use,
and there's also no need to allocate intermediate page tables except
when needing to split a large page.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Paul Durrant <paul@xen.org> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: ad591454f069647c36a7daaa9ec23384c0263f0b
master date: 2019-11-12 11:08:34 +0100
Roger Pau Monné [Mon, 25 Nov 2019 15:00:26 +0000 (16:00 +0100)]
x86/ioapic: fix clear_IO_APIC_pin write of raw entries
clear_IO_APIC_pin can be called after the iommu has been enabled, and
using raw reads and writes to modify IO-APIC entries that have been
set up to use interrupt remapping can lead to issues as some of the
fields have a different meaning when the IO-APIC entry is set up to point
to an interrupt remapping table entry.
The following ASSERT in AMD IOMMU code triggers afterwards as a result
of the raw changes to IO-APIC entries performed by clear_IO_APIC_pin.
(XEN) [ 10.082154] ENABLING IO-APIC IRQs
(XEN) [ 10.087789] -> Using new ACK method
(XEN) [ 10.093738] Assertion 'get_rte_index(rte) == offset' failed at iommu_intr.c:328
Fix this by making sure that modifications to entries are performed in
non raw mode when fields are affected which may either have changed
meaning with interrupt remapping, or which may need mirroring into
IRTEs.
Andrew Cooper [Mon, 25 Nov 2019 14:58:26 +0000 (15:58 +0100)]
x86/pv: Fix !CONFIG_PV build following XSA-299
PTF_* are declared within CONFIG_PV, and used outside:
mm.c: In function ‘_put_page_type’:
mm.c:2819:32: error: ‘PTF_preemptible’ undeclared (first use in this function)
bool preemptible = flags & PTF_preemptible;
^~~~~~~~~~~~~~~
mm.c:2819:32: note: each undeclared identifier is reported only once for each
function it appears in
mm.c:2842:24: error: ‘PTF_partial_set’ undeclared (first use in this function)
if ( !(flags & PTF_partial_set) )
^~~~~~~~~~~~~~~
mm.c: In function ‘put_page_type_preemptible’:
mm.c:3090:33: error: ‘PTF_preemptible’ undeclared (first use in this function)
return _put_page_type(page, PTF_preemptible, NULL);
^~~~~~~~~~~~~~~
mm.c: In function ‘put_old_guest_table’:
mm.c:3108:25: error: ‘PTF_preemptible’ undeclared (first use in this function)
PTF_preemptible |
^~~~~~~~~~~~~~~
mm.c:3110:27: error: ‘PTF_partial_set’ undeclared (first use in this function)
PTF_partial_set : 0 ),
^~~~~~~~~~~~~~~
mm.c: In function ‘put_page_type_preemptible’:
mm.c:3091:1: error: control reaches end of non-void function
[-Werror=return-type]
}
^
cc1: all warnings being treated as errors
Re-position the definitions to be outside of the #ifdef CONFIG_PV
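A minimal sketch of the resulting layout, assuming illustrative names and bit values (TOY_CONFIG_PV, TOY_PTF_* and the helper functions are hypothetical, not copied from mm.c): the flag definitions sit outside the conditional block so that common code compiles either way.

```c
#include <stdbool.h>

/* Flags used by both PV-only and common code: defined unconditionally. */
#define TOY_PTF_partial_set  (1u << 0)   /* illustrative value */
#define TOY_PTF_preemptible  (1u << 2)   /* illustrative value */

#ifdef TOY_CONFIG_PV
/* PV-only helpers stay inside the conditional block. */
int toy_pv_only_helper(void)
{
    return 1;
}
#endif

/* Common code: builds whether or not TOY_CONFIG_PV is defined. */
bool toy_is_preemptible(unsigned int flags)
{
    return flags & TOY_PTF_preemptible;
}
```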
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 7e4404f8c66f94ac173a3232712074677415d842
master date: 2019-11-01 10:48:04 +0000
Andrew Cooper [Mon, 25 Nov 2019 14:57:39 +0000 (15:57 +0100)]
x86/vtx: Fixes to Haswell/Broadwell LBR TSX errata
Cross reference and list all errata, now that they are published.
These errata are specific to Haswell/Broadwell. They should have model and
vendor checks, as Intel isn't the only vendor to implement VT-x.
All affected models use the same MSR indices, so these can be hard coded
rather than looking up and storing constant values.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: f51d4a19427674491eaecef85c551613450188c5
master date: 2019-10-29 19:27:40 +0000
Andrew Cooper [Mon, 25 Nov 2019 14:56:30 +0000 (15:56 +0100)]
x86/vtx: Corrections to BDF93 errata workaround
At the time of fixing c/s 20f1976b44, no obvious errata had been published,
and BDF14 looked like the most obvious candidate. Subsequently, BDF93 has
been published and it is obviously this.
The erratum states that LER_TO_LIP is the only affected MSR. The provisional
fix in Xen adjusted LER_FROM_LIP, but this is not correct. The FROM MSRs are
intended to have TSX metadata, and for steppings with TSX enabled, it will
corrupt the value the guest sees, while for parts with TSX disabled, it is
redundant with FIXUP_TSX. Drop the LER_FROM_LIP adjustment.
Replace BDF14 references with BDF93, drop the redundant 'bdw_erratum_' prefix,
and use an Intel vendor check, as other vendors implement VT-x.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1a3b393129c1dcfec418f9b0ee92d126c2ae8141
master date: 2019-10-29 19:27:40 +0000
Jan Beulich [Mon, 25 Nov 2019 14:53:11 +0000 (15:53 +0100)]
x86: fix off-by-one in is_xen_fixed_mfn()
__2M_rwdata_end marks the first byte after the Xen image, not its last
byte. Subtract 1 to obtain the upper bound to compare against. (Note
that instead switching from <= to < is less desirable, as in principle
__pa() might return rubbish for addresses outside of the Xen image.)
Since the & needs to be dropped from the line in question, also drop it
from the adjacent one.
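The bound calculation can be sketched as follows. This is a simplified model that takes physical addresses directly rather than going through __pa(); toy_is_image_frame and TOY_PAGE_SHIFT are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* 4k pages */

/*
 * start: address of the first byte of the image.
 * end:   address of the first byte AFTER the image (exclusive).
 * Subtracting 1 from 'end' yields the last byte that belongs to the
 * image, so the frame comparison can stay inclusive (<=).
 */
bool toy_is_image_frame(uint64_t mfn, uint64_t start, uint64_t end)
{
    return mfn >= (start >> TOY_PAGE_SHIFT) &&
           mfn <= ((end - 1) >> TOY_PAGE_SHIFT);
}
```

With a page-aligned end address, omitting the subtraction would wrongly accept the first frame after the image.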
Reported-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 9633929824204ca7a6d60d083466de79993d60f1
master date: 2019-10-25 10:38:58 +0200
Roger Pau Monné [Mon, 25 Nov 2019 14:52:19 +0000 (15:52 +0100)]
x86/tsc: update vcpu time info on guest TSC adjustments
If a HVM/PVH guest writes to MSR_IA32_TSC{_ADJUST} and thus changes
the value of the time stamp counter the vcpu time info must also be
updated, or the time calculated by the guest using the Xen PV clock
interface will be skewed.
Update the vcpu time info when the guest writes to either MSR_IA32_TSC
or MSR_IA32_TSC_ADJUST. This fixes lockups seen when running the
pv-shim on AMD hardware, since the shim will aggressively try to keep
TSCs in sync by periodically writing to MSR_IA32_TSC if the TSC is not
reliable.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7eee9c16d6405a1a1f2e8c6472923db842c90cfb
master date: 2019-10-23 17:01:56 +0100
Andrew Cooper [Mon, 25 Nov 2019 14:50:44 +0000 (15:50 +0100)]
x86/vvmx: Fix the use of RDTSCP when it is intercepted at L0
Linux has started using RDTSCP as of v5.1. This has highlighted a bug in Xen,
where virtual vmexit simply gives up.
(XEN) d1v1 Unhandled nested vmexit: reason 51
(XEN) domain_crash called from vvmx.c:2671
(XEN) Domain 1 (vcpu#1) crashed on cpu#2:
Handle RDTSCP in the virtual vmexit handler in the same way as RDTSC
intercepts.
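In outline, the fix amounts to letting both exit reasons share one case. The sketch below is a toy model rather than the vvmx.c handler: toy_reflect_to_l1 and its boolean parameter are hypothetical, while the exit reason numbers (16 for RDTSC, 51 for RDTSCP) are the ones defined by the Intel SDM and match the "reason 51" in the log above.

```c
#include <stdbool.h>

/* Basic VM exit reasons as numbered in the Intel SDM. */
#define TOY_EXIT_REASON_RDTSC   16
#define TOY_EXIT_REASON_RDTSCP  51

/*
 * Hypothetical policy helper: returns true if the exit should be
 * forwarded to the L1 hypervisor, false if L0 handles it itself.
 */
bool toy_reflect_to_l1(unsigned int reason, bool l1_intercepts_rdtsc)
{
    switch (reason) {
    case TOY_EXIT_REASON_RDTSC:
    case TOY_EXIT_REASON_RDTSCP:    /* shares the RDTSC policy */
        return l1_intercepts_rdtsc;
    default:
        return false;               /* other reasons are not modelled here */
    }
}
```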
Reported-by: Sarah Newman <srn@prgmr.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Tested-by: Chris Brannon <cmb@prgmr.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 9257c218e56e9902b78662e5852d69329b9cc204
master date: 2019-10-23 16:43:48 +0100
Andrew Cooper [Wed, 19 Jun 2019 17:16:03 +0000 (18:16 +0100)]
x86/tsx: Introduce tsx= to use MSR_TSX_CTRL when available
To protect against the TSX Async Abort speculative vulnerability, Intel have
released new microcode for affected parts which introduce the MSR_TSX_CTRL
control, which allows TSX to be turned off. This will be architectural on
future parts.
Introduce tsx= to provide a global on/off for TSX, including its enumeration
via CPUID. Provide stub virtualisation of this MSR, as it is not exposed to
guests at the moment.
VMs may have booted before microcode is loaded, or before hosts have rebooted,
and they still want to migrate freely. A VM which booted seeing TSX can
migrate safely to hosts with TSX disabled - TSX will start unconditionally
aborting, but still behave in a manner compatible with the ABI.
The guest-visible behaviour is equivalent to late loading the microcode and
setting the RTM_DISABLE bit in the course of live patching.
This is part of XSA-305 / CVE-2019-11135
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Fri, 8 Nov 2019 16:36:50 +0000 (16:36 +0000)]
x86/vtx: Allow runtime modification of the exec-sp setting
See patch for details.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
Andrew Cooper [Thu, 20 Dec 2018 17:25:29 +0000 (17:25 +0000)]
x86/vtx: Disable executable EPT superpages to work around CVE-2018-12207
CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation. HVM guest
kernels can trigger this to DoS the host.
To mitigate, in affected hardware, all EPT superpages are marked NX. When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored. This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.
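The mitigation's two steps can be modelled with a toy entry type. This is a sketch under simplifying assumptions: struct toy_ept_entry and both functions are hypothetical, and "shattering" is reduced to clearing a flag rather than building a table of 4k entries.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified EPT entry. */
struct toy_ept_entry {
    bool superpage;   /* maps a large page rather than a 4k page */
    bool exec;        /* execute permission */
};

/* With the mitigation active, superpages are never installed executable. */
void toy_set_superpage(struct toy_ept_entry *e)
{
    e->superpage = true;
    e->exec = false;
}

/*
 * Instruction fetch violation: shatter the superpage to 4k and restore
 * execute permission. Returns true if the mitigation handled the fault.
 */
bool toy_handle_fetch_violation(struct toy_ept_entry *e)
{
    if (!e->superpage || e->exec)
        return false;       /* not a fault caused by the mitigation */

    e->superpage = false;   /* stands in for splitting into 4k entries */
    e->exec = true;
    return true;
}
```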
This does come with a workload-dependent performance overhead, caused by
increased TLB pressure. Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.
This is part of XSA-304 / CVE-2018-12207
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Andrew Cooper [Thu, 24 Oct 2019 13:09:01 +0000 (14:09 +0100)]
x86/vtd: Hide superpage support for SandyBridge IOMMUs
Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered. The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.
This is part of XSA-304 / CVE-2018-12207
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
Julien Grall [Thu, 31 Oct 2019 15:59:17 +0000 (16:59 +0100)]
xen/arm64: Don't blindly unmask interrupts on trap without a change of level
Some of the traps without a change of the level (i.e. hypervisor ->
hypervisor) will unmask interrupts regardless of their state in the
interrupted context.
One of the consequences is that IRQ will be unmasked when receiving a
synchronous exception (used by WARN*()). This could result in unexpected
behavior such as deadlock (if a lock was shared with interrupts).
In a nutshell, interrupts should only be unmasked when it is safe to
do so. Xen only unmasks IRQ and Abort interrupts, so the logic can stay
simple:
- hyp_error: All the interrupts are now kept masked. SError should
be pretty rare and if it ever happens then we most likely want to
avoid any other interrupts being generated. The potential main
"caller" is during virtual SError synchronization on the exit
path from the guest (see check_pending_vserror).
- hyp_sync: The interrupts state is inherited from the interrupted
context.
- hyp_irq: All the interrupts but IRQ state are inherited from the
interrupted context. IRQ is kept masked.
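The three policies listed above can be written down as a small table-like function. This is only a model of the policy, not the entry.S implementation: the names are hypothetical, and only the IRQ and Abort masks are represented since those are the two the text says Xen unmasks. A set bit means "masked".

```c
/* A set bit means the corresponding interrupt is masked. */
#define TOY_MASK_IRQ    (1u << 0)
#define TOY_MASK_ABORT  (1u << 1)
#define TOY_MASK_ALL    (TOY_MASK_IRQ | TOY_MASK_ABORT)

enum toy_vector { TOY_HYP_SYNC, TOY_HYP_IRQ, TOY_HYP_ERROR };

/*
 * Given the vector taken and the mask state of the interrupted
 * (hypervisor) context, return the mask state to run the handler with.
 */
unsigned int toy_entry_mask(enum toy_vector v, unsigned int interrupted_mask)
{
    switch (v) {
    case TOY_HYP_ERROR:
        return TOY_MASK_ALL;                                    /* keep all masked */
    case TOY_HYP_IRQ:
        return (interrupted_mask & TOY_MASK_ALL) | TOY_MASK_IRQ; /* inherit, IRQ masked */
    case TOY_HYP_SYNC:
        return interrupted_mask & TOY_MASK_ALL;                 /* inherit everything */
    }
    return TOY_MASK_ALL;    /* unknown vector: stay conservative */
}
```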
Julien Grall [Thu, 31 Oct 2019 15:58:37 +0000 (16:58 +0100)]
xen/arm32: Don't blindly unmask interrupts on trap without a change of level
Exception vectors will unmask interrupts regardless of their state in
the interrupted context.
One of the consequences is that IRQ will be unmasked when receiving an
undefined instruction exception (used by WARN*) from the hypervisor.
This could result in unexpected behavior such as deadlock (if a lock was
shared with interrupts).
In a nutshell, interrupts should only be unmasked when it is safe to do so.
Xen only unmasks IRQ and Abort interrupts, so the logic can stay simple.
As exception vectors may be shared between guest and hypervisor, we now
need to have a different policy for the interrupts.
On exception from hypervisor, each vector will select the list of
interrupts to inherit from the interrupted context. Any interrupts not
listed will be kept masked.
On exception from the guest, the Abort and IRQ will be unmasked
depending on the exact vector.
The interrupts will be kept unmasked when the vector cannot be used by
either guest or hypervisor.
Note that each vector is no longer preceded by ALIGN. This is fine
because the alignment is already bigger than what we need.
Julien Grall [Thu, 31 Oct 2019 15:58:12 +0000 (16:58 +0100)]
xen/arm32: entry: Fold the macro SAVE_ALL in the macro vector
Follow-up rework will require the macro vector to distinguish between
a trap from a guest vs while in the hypervisor.
The macro SAVE_ALL already has code to distinguish between the two and
it is only called by the vector macro. So fold the former into the
latter. This will help to avoid duplicating the check.
Julien Grall [Thu, 31 Oct 2019 15:57:45 +0000 (16:57 +0100)]
xen/arm32: entry: Split __DEFINE_ENTRY_TRAP in two
The preprocessing macro __DEFINE_ENTRY_TRAP is used to generate trap
entry functions. While the macro is fairly small today, follow-up patches
will increase the size significantly.
In general, assembly macros are more readable as they allow you to name
parameters and avoid '\'. So the actual implementation of the trap is
now switched to an assembly macro.
Paul Durrant [Thu, 31 Oct 2019 15:57:17 +0000 (16:57 +0100)]
passthrough: quarantine PCI devices
When a PCI device is assigned to an untrusted domain, it is possible for
that domain to program the device to DMA to an arbitrary address. The
IOMMU is used to protect the host from malicious DMA by making sure that
the device addresses can only target memory assigned to the guest. However,
when the guest domain is torn down the device is assigned back to dom0,
thus allowing any in-flight DMA to potentially target critical host data.
This patch introduces a 'quarantine' for PCI devices using dom_io. When
the toolstack makes a device assignable (by binding it to pciback), it
will now also assign it to DOMID_IO and the device will only be assigned
back to dom0 when the device is made unassignable again. Whilst a device is
assignable it will only ever transfer between dom_io and guest domains.
dom_io is actually only used as a sentinel domain for quarantining purposes;
it is not configured with any IOMMU mappings. Assignment to dom_io simply
means that the device's initiator (requestor) identifier is not present in
the IOMMU's device table and thus any DMA transactions issued will be
terminated with a fault condition.
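The ownership transitions described above can be sketched as a small state machine. All names here (toy_pci_dev, toy_owner and the four functions) are hypothetical and do not correspond to Xen or toolstack interfaces; the sketch only encodes the rule that an assignable device moves between dom_io and guests, never back to dom0.

```c
#include <stdbool.h>

enum toy_owner { TOY_DOM0, TOY_DOM_IO, TOY_GUEST };

struct toy_pci_dev {
    enum toy_owner owner;
    bool assignable;
};

/* Toolstack makes the device assignable: it goes into quarantine. */
void toy_make_assignable(struct toy_pci_dev *d)
{
    d->assignable = true;
    d->owner = TOY_DOM_IO;
}

/* Only a quarantined, assignable device may be handed to a guest. */
int toy_assign_to_guest(struct toy_pci_dev *d)
{
    if (!d->assignable || d->owner != TOY_DOM_IO)
        return -1;
    d->owner = TOY_GUEST;
    return 0;
}

/* Guest teardown returns the device to quarantine, not to dom0. */
void toy_guest_teardown(struct toy_pci_dev *d)
{
    if (d->owner == TOY_GUEST)
        d->owner = TOY_DOM_IO;
}

/* Only when made unassignable again does the device return to dom0. */
void toy_make_unassignable(struct toy_pci_dev *d)
{
    d->assignable = false;
    if (d->owner == TOY_DOM_IO)
        d->owner = TOY_DOM0;
}
```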
In addition, a fix to assignment handling is made for VT-d. Failure
during the assignment step should not lead to a device still being
associated with its prior owner. Hand the device to DomIO temporarily,
until the assignment step has completed successfully. Remove the PI
hooks from the source domain then earlier as well.
Failure of the recovery reassign_device_ownership() may not go silent:
There e.g. may still be left over RMRR mappings in the domain assignment
to which has failed, and hence we can't allow that domain to continue
executing.
NOTE: This patch also includes one printk() cleanup; the
"XEN_DOMCTL_assign_device: " tag is dropped in iommu_do_pci_domctl(),
since similar printk()-s elsewhere also don't log such a tag.
This is XSA-302.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
master commit: 319f9a0ba94c7db505cd5dd9cb0b037ab1aa8e12
master date: 2019-10-31 16:20:05 +0100
Julien Grall [Thu, 31 Oct 2019 15:56:52 +0000 (16:56 +0100)]
xen/arm: p2m: Don't check the return of p2m_get_root_pointer() with BUG_ON()
It turns out that the BUG_ON() was actually reachable with well-crafted
hypercalls. The BUG_ON() is here to catch logical errors, so
crashing Xen is a bit over the top.
While all the holes should now be fixed, it would be better to downgrade
the BUG_ON() to something less fatal to prevent any more DoS.
The BUG_ON() in p2m_get_entry() is now replaced by ASSERT_UNREACHABLE()
to catch mistakes in debug builds and return INVALID_MFN in production
builds. The interface also requires page_order to be set to give an idea of
the size of the "hole". So 'level' is now set so that we report a hole the
size of an entry of the root page-table. This stays in line with what happens
when the GFN is higher than p2m->max_mapped_gfn.
The BUG_ON() in p2m_resolve_translation_fault() is now replaced by
ASSERT_UNREACHABLE() to catch mistakes in debug builds and just report a
fault in production builds.
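The general shape of this downgrade can be sketched as below. The macro and function names are hypothetical stand-ins (TOY_ASSERT_UNREACHABLE, TOY_INVALID_MFN, toy_get_entry), and TOY_DEBUG is an assumed build switch: in a debug build the assertion fires, in a production build the function returns an error value instead of bringing the system down.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_MFN  (~(uint64_t)0)

#ifdef TOY_DEBUG
#include <assert.h>
/* Debug build: make the "impossible" condition loud. */
#define TOY_ASSERT_UNREACHABLE() assert(!"unreachable")
#else
/* Production build: compile to nothing and let the caller see an error. */
#define TOY_ASSERT_UNREACHABLE() ((void)0)
#endif

/* Look up entry 'idx' in a root table that should never be NULL. */
uint64_t toy_get_entry(const uint64_t *root, size_t idx)
{
    if (root == NULL) {
        TOY_ASSERT_UNREACHABLE();
        return TOY_INVALID_MFN;     /* report a hole rather than crash */
    }
    return root[idx];
}
```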
Julien Grall [Thu, 31 Oct 2019 15:56:34 +0000 (16:56 +0100)]
xen/arm: p2m: Avoid off-by-one check on p2m->max_mapped_gfn
The code base is using the field p2m->max_mapped_gfn inconsistently.
Some of the users expect that p2m->max_mapped_gfn contains the highest
mapped GFN while others expect highest + 1.
p2m->max_mapped_gfn is set as highest + 1; because of that, the sanity
check on the GFN in p2m_resolve_translation_fault() and
p2m_get_entry() can be bypassed when GFN == p2m->max_mapped_gfn.
p2m_get_root_pointer(p2m->max_mapped_gfn) may return NULL if it is
outside of the supported address range and therefore the BUG_ON() could be
hit.
The current value held in p2m->max_mapped_gfn is inconsistent with the
expectation of the common code (see domain_get_maximum_gpfn()) and also
the documentation of the field.
Rather than changing the check in p2m_resolve_translation_fault() and
p2m_get_entry(), p2m->max_mapped_gfn now contains the highest
mapped GFN and the callers assuming "highest + 1" are now adjusted.
Take the opportunity to use 1UL rather than 1 as page_order could
theoretically be big enough to overflow a 32-bit integer.
Lastly, the documentation of the field max_mapped_gfn is updated to reflect
how it is computed.
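A minimal sketch of the "highest mapped GFN, inclusive" convention, using hypothetical names (toy_p2m, toy_gfn_in_range, toy_note_mapping) rather than the real p2m code. Note the 1UL in the shift, so that the count is computed in unsigned long rather than int.

```c
#include <stdbool.h>

struct toy_p2m {
    unsigned long max_mapped_gfn;   /* highest mapped GFN, inclusive */
};

/* Sanity check: with an inclusive bound, the comparison is <=. */
bool toy_gfn_in_range(const struct toy_p2m *p2m, unsigned long gfn)
{
    return gfn <= p2m->max_mapped_gfn;
}

/* Record a mapping of 2^order frames starting at gfn. */
void toy_note_mapping(struct toy_p2m *p2m, unsigned long gfn,
                      unsigned int order)
{
    /* Last frame covered: start + count - 1, not start + count. */
    unsigned long last = gfn + (1UL << order) - 1;

    if (last > p2m->max_mapped_gfn)
        p2m->max_mapped_gfn = last;
}
```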
Julien Grall [Thu, 31 Oct 2019 15:56:04 +0000 (16:56 +0100)]
xen/arm: p2m: Avoid aliasing guest physical frame
The P2M helpers implementation is quite lax and will end up ignoring
the unused top bits of a guest physical frame.
This effectively means that p2m_set_entry() will create a mapping for a
different frame (it is always equal to gfn & (mask unused bits)). Yet
p2m->max_mapped_gfn will be updated using the original frame.
At the moment, p2m_get_entry() and p2m_resolve_translation_fault()
assume that p2m_get_root_pointer() will always return a non-NULL pointer
when the GFN is smaller than p2m->max_mapped_gfn.
Unfortunately, because of the aliasing described above, it would be
possible to set p2m->max_mapped_gfn high enough so it covers a frame that
would lead p2m_get_root_pointer() to return NULL.
As we don't sanity check the guest physical frame provided by a guest, a
malicious guest could craft a series of hypercalls that will hit the
BUG_ON() and therefore DoS Xen.
To prevent aliasing, the function p2m_get_root_pointer() is now reworked
to return NULL if any of the unused top bits are not zero. The caller
can then decide what the appropriate action is. Since the two paths
(i.e. P2M_ROOT_PAGES == 1 and P2M_ROOT_PAGES != 1) are now very
similar, take the opportunity to consolidate them, making the code a
bit simpler.
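The anti-aliasing check can be illustrated with a toy root lookup. The constants and the function name are hypothetical (real values depend on the P2M configuration); what matters is that the root index is computed without masking off the upper bits, so a GFN with any unused top bit set produces an out-of-range index and is rejected.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_ROOT_ENTRIES  8u    /* hypothetical number of root-level slots */
#define TOY_ROOT_SHIFT    27u   /* hypothetical count of GFN bits below the root */

/*
 * Return a pointer to the root slot covering 'gfn', or NULL if the GFN
 * has bits set above what the root level can index.
 */
uint64_t *toy_get_root_pointer(uint64_t *root, uint64_t gfn)
{
    /* Keep the full width: truncating here would reintroduce aliasing. */
    uint64_t idx = gfn >> TOY_ROOT_SHIFT;

    if (idx >= TOY_ROOT_ENTRIES)
        return NULL;

    return &root[idx];
}
```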
With this change, p2m_get_entry() will not try to insert a mapping as
the root pointer is invalid.
Note that root_table is now switched to unsigned long as unsigned int is
not enough to hold part of a GFN.
George Dunlap [Thu, 31 Oct 2019 15:55:31 +0000 (16:55 +0100)]
x86/mm: Don't drop a type ref unless you held a ref to begin with
Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible. This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately. Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PTF_partial_set bit set, and the entry
should hold a general reference count. During de-validation,
put_page_type() is called on partially validated entries.
Unfortunately, there are a number of issues with the current algorithm.
First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page. Some examples are listed in the
appendix.
The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.
What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated
Fix this by telling put_page_type() which of the two activities you
intend.
When cleaning up a partial de/validation, take no action unless you
find a page partially validated.
If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
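The decision logic in the last three paragraphs can be summarised in a toy model. This is a sketch only: struct toy_page, the result enum and toy_put_page_type are hypothetical, the fatal case is returned as a value instead of crashing, and the real _put_page_type() operates on atomically updated type_info rather than plain fields.

```c
#include <stdbool.h>

enum toy_result {
    TOY_DROPPED_REF,       /* a held type ref was dropped */
    TOY_CLEANED_PARTIAL,   /* a partially de/validated page was cleaned up */
    TOY_NOTHING_TO_DO,     /* cleanup requested, but page was not partial */
    TOY_FATAL,             /* misaccounting detected */
};

struct toy_page {
    unsigned int type_count;
    bool partial;          /* models the PGT_partial state */
};

/*
 * 'cleaning_partial' models passing PTF_partial_set: the caller states
 * which of the two activities it intends. When false, the caller is
 * assumed to hold a type ref.
 */
enum toy_result toy_put_page_type(struct toy_page *pg, bool cleaning_partial)
{
    if (cleaning_partial) {
        if (!pg->partial)
            return TOY_NOTHING_TO_DO;   /* someone else finished the job */
        pg->partial = false;
        pg->type_count = 0;
        return TOY_CLEANED_PARTIAL;
    }

    if (pg->partial)
        return TOY_FATAL;               /* the real code crashes the host */

    pg->type_count--;
    return TOY_DROPPED_REF;
}
```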
In put_page_from_lNe, pass partial_flags on to _put_page_type().
old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries). Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
While here, delete stray trailing whitespace.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:
Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.
P1: PIN_L3_TABLE
A -> PGT_l3_table | 1 | valid
B -> PGT_l2_table | 1 | valid
P1: UNPIN_TABLE
> Arrange to interrupt after B has been de-validated
B:
type_info -> PGT_l2_table | 0
A:
type_info -> PGT_l3_table | 1 | partial
nr_validated_entries -> (less than x)
P2: mod_l4_entry to point to A
> Arrange for this to be interrupted while B is being validated
B:
type_info -> PGT_l2_table | 1 | partial
(nr_validated_entries &c set as appropriate)
A:
type_info -> PGT_l3_table | 1 | partial
nr_validated_entries -> x
partial_pte = 1
P3: mod_l3_entry some other unrelated l3 to point to B:
B:
type_info -> PGT_l2_table | 1
P1: Restart UNPIN_TABLE
At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3.
A similar issue arises with old_guest_table. Consider the following
scenario:
Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.
V1: PIN_L2_TABLE(A)
<Validate until we try to validate [x], get -EINVAL>
A -> PGT_l2_table | 1 | PGT_partial
V1 -> old_guest_table = A
<delayed>
V2: PIN_L2_TABLE(A)
<Pick up where V1 left off, try to re-validate [x], get -EINVAL>
A -> PGT_l2_table | 1 | PGT_partial
V2 -> old_guest_table = A
<restart>
put_old_guest_table()
_put_page_type(A)
A -> PGT_l2_table | 0
Indeed, it is possible to engineer for old_guest_table for every vcpu
a guest has to point to the same page.
master commit: c40b33d72630dcfa506d6fd856532d6152cb97dc
master date: 2019-10-31 16:16:37 +0100
George Dunlap [Thu, 31 Oct 2019 15:55:10 +0000 (16:55 +0100)]
x86/mm: Fix nested de-validation on error
If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.
The invariant for partial pages should be:
* Entries [0, nr_validated_ptes) should be completely validated;
put_page_type() will de-validate these.
* If [nr_validated_ptes] is partially validated, partial_flags should
set PTF_partial_set. put_page_type() will be called on this page to
finish off devalidation, and the appropriate refcount adjustments
will be done.
alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.
Unfortunately, this is mishandled.
Take the case where validating lNe[x] returns an error.
First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be. nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1. This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2. (Any other page type will
fail.)
Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1. When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if the put_page_type() is never called, but instead
get_page_type() is called, validation will pick up at [x+1],
neglecting to validate [x]. If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.
Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.
While here, add some safety catches:
- old_guest_table must point to the page contained in
[nr_validated_ptes].
- alloc_l1_page shouldn't set old_guest_table
If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question. If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop. Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3c15a2d8cc1981f369cc9542f028054d0dfb325b
master date: 2019-10-31 16:16:13 +0100
George Dunlap [Thu, 31 Oct 2019 15:54:48 +0000 (16:54 +0100)]
x86/mm: Properly handle linear pagetable promotion failures
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.
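The rule above can be restated as a small predicate. This is only a
sketch: struct pt_state and entry_holds_general_ref() are hypothetical
names, and the value given to PTF_partial_set is a placeholder rather
than the definition Xen uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag value; only the name comes from the text above. */
#define PTF_partial_set (1u << 0)

/* Toy stand-in for the two tracking fields kept in the page struct. */
struct pt_state {
    unsigned int nr_entries_validated;
    unsigned int partial_flags;
};

/* Does entry [idx] hold a general reference, per the stated rule? */
static bool entry_holds_general_ref(const struct pt_state *s,
                                    unsigned int idx)
{
    assert(s);

    if ( idx < s->nr_entries_validated )
        return true;                        /* fully validated entries */

    if ( idx == s->nr_entries_validated )   /* the boundary entry */
        return (s->partial_flags & PTF_partial_set) != 0;

    return false;                           /* not reached yet */
}
```

Dropping PTF_partial_set while the boundary entry still holds its
reference makes this predicate answer "no" for a reference that exists,
which is the leak described below.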
Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped. (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.) This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.
Fix this by setting page->partial_flags to the partial_flags local
variable.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix
Suppose A and B can both be promoted to L2 pages, and A[x] points to B.
V1: MOD_L3_ENTRY pointing something to A.
In the process of validating A[x], grab an extra type / ref on B:
B.type_count = 2 | PGT_validated
B.count = 3 | PGC_allocated
A.type_count = 1 | PGT_validated
A.count = 2 | PGC_allocated
V1: MOD_L3_ENTRY removing the reference to A.
De-validate A, down to A[x], which points to B.
Drop the final type on B. Arrange to be interrupted.
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = -1
V2: MOD_L3_ENTRY adds a reference to A.
At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
master commit: 2f126247ef49c2ba52bae29a2ab371059ede67c0
master date: 2019-10-31 16:15:48 +0100
George Dunlap [Thu, 31 Oct 2019 15:53:16 +0000 (16:53 +0100)]
x86/mm: Always retain a general ref on partial
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.
PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain. A sketch is provided in
the appendix.
Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set. (For clarity of
change, keep two separate flags. These will be collapsed in a
subsequent changeset.)
This has two basic implications. On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.
Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.
(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)
On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.
NB that since l1 promotion can't be preempted, get_page_from_l2e
can't return -ERESTART.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
foreign domain
Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B. B has
PGC_allocated set but no other general references.
V1: PIN_L3 A.
A is validated, B is validated.
A.type_count = 1 | PGT_validated | PGT_pinned
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated (A[x] holds a general ref)
V1: UNPIN A.
A begins de-validation.
Arrange to be interrupted when i < x
V1->old_guest_table = A
V1->old_guest_table_ref_held = false
A.type_count = 1 | PGT_partial
A.nr_validated_entries = i < x
B.type_count = 0
B.count = 1 | PGC_allocated
V2: MOD_L4_ENTRY to point some l4e to A.
Picks up re-validation of A.
Arrange to be interrupted halfway through B's validation
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = PTF_partial_set
V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
Validates B.
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated ("other l3e" holds a general ref)
V3: MOD_L3_ENTRY to clear l3e pointing to B.
Devalidates B.
B.type_count = 0
B.count = 1 | PGC_allocated
V3: decrease_reservation(B)
Clears PGC_allocated
B.count = 0 => B is freed
B gets assigned to a different domain
V1: Restarts UNPIN of A
put_old_guest_table(A)
...
free_l3_table(A)
Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.
If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
master commit: 18b0ab697830a46ce3dacaf9210799322cb3732c
master date: 2019-10-31 16:14:36 +0100
George Dunlap [Thu, 31 Oct 2019 15:52:39 +0000 (16:52 +0100)]
x86/mm: Have alloc_l[23]_table clear partial_flags when preempting
In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.
The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.
PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
Unfortunately, when alloc_l[23]_table check hypercall_preempt_check()
and return -ERESTART, they set nr_entries_validated, but don't clear
partial_flags.
If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated. This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.
Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].
In a sense, the real issue here is code duplication. Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.
Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ff0b9a5d69b744a99e8bbeac820a985db5a3bf8e
master date: 2019-10-31 16:14:14 +0100
Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.
The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.
No functional change intended.
NB that alloc_lN_table() mishandle the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 2aab06d742e13d7a9d248f1fc7f0ec62b295ada1
master date: 2019-10-31 16:13:23 +0100
George Dunlap [Thu, 31 Oct 2019 15:51:49 +0000 (16:51 +0100)]
x86/mm: Use flags for _put_page_type rather than a boolean
This is mainly in preparation for _put_page_type taking the
partial_flags value in the future. It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).
No functional change intended.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 0121588ec0f6950ed65d906d860df49be2c8e655
master date: 2019-10-31 16:12:53 +0100
George Dunlap [Thu, 31 Oct 2019 15:50:43 +0000 (16:50 +0100)]
x86/mm: Separate out partial_pte tristate into individual flags
At the moment, partial_pte is a tri-state that contains two distinct bits
of information:
1. If zero, the pte at index [nr_validated_ptes] is un-validated. If
non-zero, the pte was last seen with PGT_partial set.
2. If positive, the pte at index [nr_validated_ptes] does not hold a
general reference count. If negative, it does.
To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
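As a sketch of the mapping just described, the old tri-state can be
translated into the two flags as follows. The bit positions and the
helper name partial_pte_to_flags() are assumptions made for
illustration; only the flag names come from this message.

```c
#include <assert.h>

/* Placeholder bit positions; only the names are taken from the text. */
#define PTF_partial_set          (1u << 0)
#define PTF_partial_general_ref  (1u << 1)

/* Translate the old tri-state partial_pte into the two named flags. */
static unsigned int partial_pte_to_flags(int partial_pte)
{
    if ( partial_pte == 0 )
        return 0;                       /* entry is un-validated */

    if ( partial_pte > 0 )
        return PTF_partial_set;         /* partial, no general ref held */

    /* Negative: partial, and the entry holds a general ref. */
    return PTF_partial_set | PTF_partial_general_ref;
}
```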
Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`). These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together. In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.
NB that this does mean checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.
Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].
No functional change intended.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 1b6fa638d21006d3c0a3038132c6cb326d8bba08
master date: 2019-10-31 16:12:14 +0100
George Dunlap [Thu, 31 Oct 2019 15:50:19 +0000 (16:50 +0100)]
x86/mm: Don't re-set PGT_pinned on a partially de-validated page
When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will
be picked up again when the hypercall restarts.
This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated). However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.
This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.
Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART
In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
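The two cases can be sketched with a toy model. struct toy_page and
handle_unpin_result() are hypothetical; the model assumes the pin bit
has already been cleared and that the reference belonging to the pin is
still counted in refs when the put attempt returns.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* ERESTART is not provided by every libc; this value is a placeholder. */
#ifndef ERESTART
#define ERESTART 85
#endif

/* Toy page: only the two facts this sketch cares about. */
struct toy_page {
    bool pinned;          /* stands in for PGT_pinned */
    unsigned int refs;    /* stands in for the general reference count */
};

/* React to the result of an interrupted un-pin. */
static void handle_unpin_result(struct toy_page *pg, int rc)
{
    assert(pg);

    if ( rc == -EINTR )
        pg->pinned = true;   /* fully validated again: re-set the pin */
    else if ( rc == -ERESTART )
        pg->refs--;          /* partial: drop the ref the pin was holding */
}
```

In this model a page never ends up both pinned and partial, which is
the "canonical" state the change aims for.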
While here, clean up some trailing whitespace.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: bf656e02d8e7f49b484e2587aef4f18deda6e2ab
master date: 2019-10-31 16:11:46 +0100
George Dunlap [Thu, 31 Oct 2019 15:49:43 +0000 (16:49 +0100)]
x86/mm: L1TF checks don't leave a partial entry
On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.
However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it. This causes problems in several places.
For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page. If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.
Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial. When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page. This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.
For L2+ tables, setting rc to -ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set. This will cause it to set partial_pte = 1. If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.
Rather than indicating -ERESTART, indicate -EINTR. This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).
mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.
This is part of XSA-299.
Reported-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 3165ffef09e89d38f84d26051f606d2c1421aea3
master date: 2019-10-31 16:11:12 +0100
Jan Beulich [Thu, 31 Oct 2019 15:49:20 +0000 (16:49 +0100)]
x86/PV: check GDT/LDT limits during emulation
Accesses beyond the LDT limit originating from emulation would trigger
the ASSERT() in pv_map_ldt_shadow_page(). On production builds such
accesses would cause an attempt to promote the touched page (offset from
the present LDT base address) to a segment descriptor one. If this
happens to succeed, guest user mode would be able to elevate its
privileges to that of the guest kernel. This is particularly easy when
there's no LDT at all, in which case the LDT base stored internally to
Xen is simply zero.
Also adjust the ASSERT() that was triggering: It was off by one to
begin with, and for production builds we also better use
ASSERT_UNREACHABLE() instead with suitable recovery code afterwards.
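A minimal sketch of the kind of limit check involved, assuming the
table size is tracked as a number of 8-byte descriptors. The helper
desc_offset_in_table() and its parameters are hypothetical and do not
reflect the actual emulation code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DESC_SIZE 8u   /* an x86 segment descriptor occupies 8 bytes */

/*
 * Return true only if the descriptor containing byte `offset` is one
 * of the table's `nr_ents` entries.
 */
static bool desc_offset_in_table(uint32_t offset, uint32_t nr_ents)
{
    /* With no table at all (nr_ents == 0) every access is rejected. */
    return (offset / DESC_SIZE) < nr_ents;
}
```

The empty-table case matters most here: without the check, an offset
from a zero base would be treated as a candidate descriptor page.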
This is XSA-298.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 93021cbe880a8013691a48d0febef8ed7d3e3ebd
master date: 2019-10-31 16:08:16 +0100
Andrew Cooper [Thu, 31 Oct 2019 15:48:52 +0000 (16:48 +0100)]
xen/hypercall: Don't use BUG() for parameter checking in hypercall_create_continuation()
Since c/s 1d429034 "hypercall: update vcpu_op to take an unsigned vcpuid",
which incorrectly swapped 'i' for 'u' in the parameter type list, guests have
been able to hit the BUG() in next_args()'s default case.
Correct these back to 'i'.
In addition, make adjustments to prevent this class of issue from occurring in
the future - crashing Xen is not an appropriate form of parameter checking.
Capitalise NEXT_ARG() to catch all uses, to highlight that it is a macro doing
non-function-like things behind the scenes, and undef it when appropriate.
Implement a bad_fmt: block which prints an error, asserts unreachable, and
crashes the guest.
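The idea of failing the request rather than the host can be sketched
with a small format walker. collect_args() and its type codes are
assumptions for illustration only; it is not the Xen implementation,
which also logs the error and crashes the offending guest.

```c
#include <assert.h>
#include <stdarg.h>

/*
 * Walk a string of argument type codes and copy each argument into
 * out[]. Returns the number of arguments stored, or -1 if the format
 * holds an unknown code or more codes than out[] has room for.
 */
static int collect_args(unsigned long *out, unsigned int max,
                        const char *fmt, ...)
{
    va_list ap;
    unsigned int n = 0;
    int rc = 0;

    assert(out && fmt);

    va_start(ap, fmt);
    for ( ; *fmt != '\0' && rc == 0; fmt++ )
    {
        if ( n == max )
        {
            rc = -1;
            break;
        }

        switch ( *fmt )
        {
        case 'i': out[n++] = va_arg(ap, unsigned int);  break;
        case 'l': out[n++] = va_arg(ap, unsigned long); break;
        default:  rc = -1; break;  /* bad format: report, do not abort */
        }
    }
    va_end(ap);

    return rc ? rc : (int)n;
}
```

A caller that sees -1 can then reject the operation for that guest
alone.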
On the ARM side, drop all parameter checking of p. It is asymmetric with the
x86 side, and akin to expecting memcpy() or sprintf() to check their src/fmt
parameter before use. A caller passing "" or something other than a string
literal will be obvious during code review.
This is XSA-296.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Julien Grall <julien.grall@arm.com>
master commit: 0bf9f8d3e399a0e1d2b717f71b4776172446184b
master date: 2019-10-31 16:07:11 +0100
Julien Grall [Mon, 18 Mar 2019 18:01:31 +0000 (18:01 +0000)]
xen/arm: mm: Flush the TLBs even if a mapping failed in create_xen_entries
At the moment, create_xen_entries will only flush the TLBs if the full
range has successfully been updated. This may leave unwanted
entries in the TLBs if we fail to update some entries.
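The intended control flow can be sketched as below. All names are
hypothetical stand-ins (the flush is just a counter), so this shows the
shape of the fix rather than the real create_xen_entries().

```c
#include <assert.h>

static unsigned int flush_count;   /* counts flushes, for illustration */

/* Stand-in for a TLB flush over the affected range. */
static void toy_flush_tlb_range(void)
{
    flush_count++;
}

/* Stand-in for updating one entry; fails when idx == fail_at. */
static int toy_update_entry(unsigned int idx, unsigned int fail_at)
{
    return idx == fail_at ? -1 : 0;
}

static int toy_create_entries(unsigned int nr, unsigned int fail_at)
{
    int rc = 0;

    for ( unsigned int i = 0; i < nr; i++ )
    {
        rc = toy_update_entry(i, fail_at);
        if ( rc )
            break;   /* stop updating, but do not skip the flush */
    }

    /* Earlier entries may already be live, so flush on every path. */
    toy_flush_tlb_range();

    return rc;
}
```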
pfn_to_pdx expects an address, not a size, as a parameter. Specifically,
it expects the end address, then the masks calculations compensate for
any holes between start and end. Thus, we should pass the end address to
pfn_to_pdx.
The initial pdx is stored in frametable_base_pdx, so we can subtract the
result of pfn_to_pdx(start_address) from nr_pdxs; we know that we don't
need to cover any memory in the range 0-start in the frametable.
Remove the variable `nr_pages' because it is unused.
xen/arm64: Correctly compute the virtual address in maddr_to_virt()
The helper maddr_to_virt() is used to translate a machine address to a
virtual address. To save some valuable address space, some part of the
machine address may be compressed.
In theory the PDX code is free to compress any bits, so there is no
guarantee the machine index computed will always be greater than
xenheap_mfn_start. This would result in returning a virtual address that
is not part of the direct map and trigger a crash later on, at least on
debug builds, because of the check in virt_to_page().
A recently reverted patch (see 1191156361 "xen/arm: fix mask calculation
in pdx_init_mask") allows the PDX to compress more bits and triggered a
crash on AMD Seattle Platform.
Avoid the crash by keeping track of the base PDX for the xenheap and use
it for computing the virtual address.
Note that virt_to_maddr() does not need to have similar modification as
it is using the hardware to translate the virtual address to a machine
address.
Take the opportunity to fix the ASSERT() as the direct map base address
corresponds to the start of the RAM (this is not always 0).
Julien Grall [Thu, 16 May 2019 22:31:46 +0000 (23:31 +0100)]
xen/arm: vsmc: The function identifier is always 32-bit
On Arm64, the SMCCC function identifier is always stored in the first 32
bits of the x0 register. The rest of the bits are not defined and should
be ignored.
This means the variable funcid should be an uint32_t rather than
register_t.
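In C terms the change amounts to a narrowing conversion. The helper
name smccc_funcid() is made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits of x0; the upper bits are not defined. */
static uint32_t smccc_funcid(uint64_t x0)
{
    return (uint32_t)x0;
}
```

Comparing the narrowed value against known identifiers then gives the
same answer whatever the guest left in the upper half of x0.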
Julien Grall [Fri, 9 Aug 2019 12:59:15 +0000 (13:59 +0100)]
xen/arm: p2m: Free the p2m entry after flushing the IOMMU TLBs
When freeing a p2m entry, all the sub-tree behind it will also be freed.
This may include intermediate page-tables or any l3 entry requiring to
drop a reference (e.g for foreign pages). As soon as pages are freed,
they may be re-used by Xen or another domain. Therefore it is necessary
to flush *all* the TLBs beforehand.
While CPU TLBs will be flushed before freeing the pages, this is not
the case for IOMMU TLBs. This can be solved by moving the IOMMU TLBs
flush earlier in the code.
This wasn't considered as a security issue as device passthrough on Arm
is not security supported.
Julien Grall [Wed, 16 Oct 2019 10:53:03 +0000 (11:53 +0100)]
xen/arm: Don't use _end in is_xen_fixed_mfn()
virt_to_maddr() is using the hardware page-table walk instructions to
translate a virtual address to a physical address. The function should
only be called on a mapped virtual address.
_end points past the end of Xen binary and may not be mapped when the
binary size is page-aligned. This means virt_to_maddr() will not be able
to do the translation and therefore crash Xen.
Note there is also an off-by-one issue in this code, but the panic will
trump that.
Both issues can be fixed by using _end - 1 in the check.
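A sketch of the bound with plain integers, assuming `end` is the first
byte past the image as _end is. addr_in_image() is a hypothetical
helper and does no address translation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * [start, end) describes the image: `end` is the first byte past it,
 * so the last byte that belongs to the image is end - 1.
 */
static bool addr_in_image(uintptr_t addr, uintptr_t start, uintptr_t end)
{
    uintptr_t last;

    assert(end > start);
    last = end - 1;   /* always inside the image, hence always mapped */

    return addr >= start && addr <= last;
}
```

Using the last byte avoids touching an address that may be unmapped
and also makes the upper bound exclusive of `end` itself.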
xen/arm: Implement workaround for Cortex A-57 and Cortex A72 AT speculate
Both Cortex-A57 (erratum 1319537) and Cortex-A72 (erratum 1319367) can
end up with corrupted TLBs if they speculate an AT instruction while the
S1/S2 system registers are in an inconsistent state.
The workaround is the same as for Cortex-A76 implemented by commit
a18be06aca "xen/arm: Implement workaround for Cortex-A76 erratum 1165522",
so it is only necessary to plumb in the cpuerrata framework.
Julien Grall [Wed, 27 Mar 2019 18:45:23 +0000 (18:45 +0000)]
xen/arm: memaccess: Initialize correctly *access in __p2m_get_mem_access
The commit 8d84e701fd "xen/arm: initialize access" initializes
*access using the wrong enumeration type. This results in a warning
when using clang:
mem_access.c:50:20: error: implicit conversion from enumeration type
'p2m_access_t' to different enumeration type 'xenmem_access_t'
[-Werror,-Wenum-conversion]
*access = p2m->default_access;
~ ~~~~~^~~~~~~~~~~~~~
The correct solution is to use the array memaccess that will do the
conversion between the 2 enums.
Julien Grall [Wed, 15 May 2019 20:17:30 +0000 (21:17 +0100)]
xen/arm: traps: Avoid using BUG_ON() to check guest state in advance_pc()
The condition of the BUG_ON() in advance_pc() is pretty wrong because
the bits [26:25] and [15:10] have a different meaning between AArch32
and AArch64 state.
On AArch32, they are used to store PSTATE.IT. On AArch64, they are RES0
or used for new features (e.g. ARMv8.0-SSBS, ARMv8.5-BTI).
This means a 64-bit guest will hit the BUG_ON() if it is trying to use
any of these features.
More generally, RES0 means that the bits are reserved for future use. So
crashing the host is definitely not the right solution.
In this particular case, we only need to know whether the guest was using
32-bit mode and Thumb instructions. So replace the BUG_ON() by a proper
check.
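A sketch of such a check on a saved program status value. The macro
names and it_bits_apply() are chosen for this example; the bit
positions follow the ranges quoted above plus the architectural T bit
(bit 5) and the AArch32 state bit (bit 4).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PSR_MODE_BIT  0x10u        /* set when the saved state is AArch32 */
#define PSR_THUMB     (1u << 5)    /* T bit */
#define PSR_IT_MASK   0x0600fc00u  /* bits [26:25] and [15:10] */

/* Only a 32-bit guest running Thumb code gives the IT bits a meaning. */
static bool it_bits_apply(uint64_t cpsr)
{
    bool is_thumb = (cpsr & PSR_MODE_BIT) && (cpsr & PSR_THUMB);

    return is_thumb && (cpsr & PSR_IT_MASK) != 0;
}
```

A 64-bit guest with any of those bits set simply takes the "no IT
state" path instead of tripping an assertion.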
On Arm64, system registers are always 64-bit including SCTLR_EL1.
However, Xen is assuming this is 32-bit because earlier revisions of
Armv8 had the top 32 bits RES0 (see ARM DDI0595.b).
From Armv8.5, some bits in [63:32] will be defined and allowed to be
modified by the guest. So we would effectively reset those bits to 0
after each context switch. This means the guest may not function
correctly afterwards.
Rather than resetting to 0 the bits [63:32], preserve them across
context switch.
Note that the corresponding register on Arm32 (i.e SCTLR) is always
32-bit. So we need to use register_t anywhere we deal with SCTLR{,_EL1}.
Outside interface is switched to use 64-bit to allow ABI compatibility
between 32-bit and 64-bit.
Julien Grall [Wed, 15 May 2019 16:16:13 +0000 (17:16 +0100)]
xen/arm: traps: Avoid using BUG_ON() in _show_registers()
At the moment, _show_registers() is using a BUG_ON() to assert only
userspace will run 32-bit code in a 64-bit domain.
Such extra precaution is not necessary and could be avoided by only
checking the CPU mode to decide whether show_registers_64() or
show_registers_32() should be called.
This also has the nice advantage of avoiding nested ifs in the code.
Igor Druzhinin [Fri, 25 Oct 2019 09:43:49 +0000 (11:43 +0200)]
x86/efi: properly handle 0 in pixel reserved bitmask
In some graphics modes firmware is allowed to return 0 in pixel reserved
bitmask which doesn't go against UEFI Spec 2.8 (12.9 Graphics Output Protocol).
Without this change non-TrueColor modes won't work which will cause
GOP init to fail - observed while trying to boot EFI Xen with Cirrus VGA.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 521a1445510a30873aec471194045e7f4b5e8d75
master date: 2019-10-10 16:50:50 +0200
Roger Pau Monné [Fri, 25 Oct 2019 09:43:06 +0000 (11:43 +0200)]
pci: clear {host/guest}_maskall field on assign
The current implementation of host_maskall makes it sticky across
assign and deassign calls, which means that once a guest forces Xen to
set host_maskall the maskall bit is not going to be cleared until a
call to PHYSDEVOP_prepare_msix is performed. Such call however
shouldn't be part of the normal flow when doing PCI passthrough, and
hence the flag needs to be cleared when assigning in order to prevent
host_maskall being carried over from previous assignations.
Note that the entry maskbit is reset when the msix capability is
initialized, and the guest_maskall field is also cleared so that the
hardware value matches Xen's internal state (hardware maskall =
host_maskall | guest_maskall).
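The relation between the three values can be modelled as follows.
struct toy_msix and both helpers are hypothetical and only mirror the
formula given above.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy view of the two software mask-all fields. */
struct toy_msix {
    bool host_maskall;
    bool guest_maskall;
};

/* Value the hardware mask-all bit is expected to carry. */
static bool hw_maskall(const struct toy_msix *m)
{
    assert(m);
    return m->host_maskall || m->guest_maskall;
}

/* On assignment, start clean so nothing carries over. */
static void toy_msix_on_assign(struct toy_msix *m)
{
    assert(m);
    m->host_maskall = false;
    m->guest_maskall = false;
}
```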
Also note that doing the reset of host_maskall there would allow the
guest to reset such field by enabling and disabling MSIX, which is not
intended.
Igor Druzhinin [Fri, 25 Oct 2019 09:42:29 +0000 (11:42 +0200)]
efi/boot: make sure graphics mode is set while booting through MB2
If a bootloader is using native driver instead of EFI GOP it might
reset graphics mode to be different from what has been originally set
by firmware. While booting through MB2 Xen either needs to parse the video
settings passed by MB2 and use them instead of what GOP reports or
reset the mode to synchronise it with firmware - prefer the latter.
Observed while booting Xen using MB2 with EFI GRUB2 compiled with
all possible video drivers where native drivers take priority over firmware.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: af9f357fb8dbceb9c5dd1c5cb8b4e198f6149456
master date: 2019-10-10 10:58:45 +0200
Igor Druzhinin [Fri, 25 Oct 2019 09:40:17 +0000 (11:40 +0200)]
x86/crash: force unlock console before printing on kexec crash
There is a small window where shootdown NMI might come to a CPU
(e.g. in serial interrupt handler) where console lock is taken. In order
not to leave following console prints waiting infinitely for shot down
CPUs to free the lock - force unlock the console.
The race has been frequently observed while crashing nested Xen in
an HVM domain.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 7d5247cee21aa38a16c4b21bc9243eda70c8aebd
master date: 2019-10-02 11:25:05 +0100
Juergen Gross [Fri, 25 Oct 2019 09:39:12 +0000 (11:39 +0200)]
sched: don't let XEN_RUNSTATE_UPDATE leak into vcpu_runstate_get()
vcpu_runstate_get() should never return a state entry time with
XEN_RUNSTATE_UPDATE set. To avoid this let update_runstate_area()
operate on a local runstate copy.
As it is required to first set the XEN_RUNSTATE_UPDATE indicator in
guest memory, then update all the runstate data, and then at last
clear the XEN_RUNSTATE_UPDATE again it is much less effort to have
a local copy of the runstate data instead of keeping only a copy of
state_entry_time.
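The three-step protocol with a local copy might look like this in
outline. struct toy_runstate and toy_update_runstate_area() are
simplified stand-ins, guest memory is modelled as an ordinary struct,
and the memory barriers real code needs between the steps are left out.

```c
#include <assert.h>
#include <stdint.h>

#define XEN_RUNSTATE_UPDATE (1ULL << 63)

struct toy_runstate {
    int state;
    uint64_t state_entry_time;
};

/*
 * Publish *internal to *guest, working on a local copy so the update
 * indicator never becomes visible in *internal.
 */
static void toy_update_runstate_area(const struct toy_runstate *internal,
                                     struct toy_runstate *guest)
{
    struct toy_runstate local;

    assert(internal && guest);
    local = *internal;

    /* 1. Flag the guest-visible copy as being updated. */
    local.state_entry_time |= XEN_RUNSTATE_UPDATE;
    guest->state_entry_time = local.state_entry_time;

    /* 2. Write the rest of the data while the flag is set. */
    guest->state = local.state;

    /* 3. Clear the flag again. */
    local.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
    guest->state_entry_time = local.state_entry_time;
}
```

Because *internal is never modified, a concurrent reader of the
internal state cannot observe the indicator bit.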
This problem was introduced with commit 2529c850ea48f036 ("add update
indicator to vcpu_runstate_info").
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Julien Grall <julien.grall@arm.com>
master commit: f28c4c4c10bdacb1e49cc6e9de57eb1f973cbdf6
master date: 2019-09-26 18:04:09 +0200
Juergen Gross [Fri, 25 Oct 2019 09:38:39 +0000 (11:38 +0200)]
sched: fix freeing per-vcpu data in sched_move_domain()
In case of an allocation error of per-vcpu data in sched_move_domain()
the already allocated data is freed just using xfree(). This is wrong
as some schedulers need to do additional operations (e.g. the arinc653
scheduler needs to remove the vcpu-data from a list).
So instead of xfree() make use of the sched_free_vdata() hook.
Jan Beulich [Fri, 25 Oct 2019 09:37:20 +0000 (11:37 +0200)]
ACPI/cpuidle: bump maximum number of power states we support
Commit 4c6cd64519 ("mwait_idle: Skylake Client Support") added a table
with 8 entries, which - together with C0 - rendered the current limit
too low. It should have been accompanied by an increase of the constant;
do this now. Don't bump by too much though, as there are a number of on-
stack arrays which are dimensioned by this constant.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: ff22a91b4c45f9310d0ec0d7ee070d84a373dd87
master date: 2019-09-25 15:53:35 +0200
Jan Beulich [Fri, 25 Oct 2019 09:36:50 +0000 (11:36 +0200)]
libxc/x86: avoid certain overflows in CPUID APIC ID adjustments
Recent AMD processors may report up to 128 logical processors in CPUID
leaf 1. Doubling this value produces 0 (which OSes sincerely dislike),
as the respective field is only 8 bits wide. Suppress doubling the value
(and its leaf 0x80000008 counterpart) in such a case.
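The arithmetic can be sketched as below, assuming the count sits in
bits [23:16] of the leaf 1 EBX value. double_lpc_if_safe() is a
hypothetical helper, not the libxc code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Double the 8-bit logical processor count held in bits [23:16] of
 * `ebx`, unless the doubled value would no longer fit in 8 bits.
 */
static uint32_t double_lpc_if_safe(uint32_t ebx)
{
    uint32_t count = (ebx >> 16) & 0xffu;

    if ( count >= 0x80u )
        return ebx;   /* 128 * 2 = 256 would wrap to 0: leave as is */

    return (ebx & ~0x00ff0000u) | ((count * 2) << 16);
}
```

Testing the top bit of the field is enough to tell whether doubling
fits, which is why the width of the mask used for that test matters.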
Note that while there's a similar overflow in intel_xc_cpuid_policy(),
that one is being left alone for now.
Note further that while it was considered to suppress the multiplication
by 2 altogether if the host topology already provides at least one bit
of thread ID within APIC IDs, it was decided to avoid more change here
than really needed at this point.
Also zap leaf 4 (and at the same time leaf 2) EDX output for AMD, as it
should have been from the beginning.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
libxc/x86: correct overflow avoidance check in AMD CPUID handling
Commit df29d03f1d ("libxc/x86: avoid certain overflows in CPUID APIC ID
adjustments") introduced a one bit too narrow mask when checking whether
multiplying by 2 (in particular in leaf 1) would result in overflow.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: df29d03f1d97bdde1bc0cea8ef8538d4f524b3ec
master date: 2019-09-24 10:50:33 +0200
master commit: c9c7ac508b3f65f7d5f9685893096a1b22d8b176
master date: 2019-09-25 15:50:58 +0200
The loop in FOR_EACH_IOREQ_SERVER is backwards hence the cleanup on
failure needs to be done forwards.
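A toy version of the pattern, with a fixed-size array standing in for
the ioreq servers. NR_SERVERS, enabled[] and toy_setup_all() are
invented for this sketch.

```c
#include <assert.h>

#define NR_SERVERS 4u

static int enabled[NR_SERVERS];   /* 1 once a slot has been set up */

/* Set up slots from the highest id down; slot fail_id refuses. */
static int toy_setup_all(unsigned int fail_id)
{
    unsigned int id = NR_SERVERS;

    while ( id-- != 0 )
    {
        if ( id == fail_id )
            goto undo;
        enabled[id] = 1;
    }

    return 0;

 undo:
    /* Slots above the failing one were set up first: undo forwards. */
    while ( ++id != NR_SERVERS )
        enabled[id] = 0;

    return -1;
}
```

Walking backwards from the failing id during cleanup would instead
touch slots that were never set up.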
Fixes: 97a5a3e30161 ('x86/hvm/ioreq: maintain an array of ioreq servers rather than a list') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Paul Durrant <paul.durrant@citrix.com>
master commit: 215f2576b0ac1bc18f3ff74e34f0d8379bda9040
master date: 2019-09-10 16:32:47 +0200