credit2: avoid vCPUs to ever reach lower credits than idle
There have been reports of stalls of guest vCPUs when Credit2 was used.
It seemed like these vCPUs were not getting scheduled for a very long
time, even under light load conditions (e.g., during dom0 boot).
Investigations led to the discovery that --although rarely-- it can
happen that a vCPU manages to run for a very long timeslice. In Credit2,
this means that, when runtime accounting happens, the vCPU will lose a
large quantity of credits. This in turn may lead to the vCPU having
fewer credits than the idle vCPUs (-2^30). At this point, the scheduler
will pick the idle vCPU instead of the ready-to-run vCPU for a few
"epochs", which is often enough for the guest kernel to conclude the
vCPU is not responding, and to crash.
An example of this situation is shown here: we can see d0v1 sitting in
the runqueue while all the CPUs are idle, as it has -1254238270
credits, which is smaller than -2^30 = -1073741824:
We certainly don't want this to happen under any circumstance.
Let's, therefore, define a minimum amount of credits a vCPU can have.
During accounting, we make sure that, for however long the vCPU has
run, it never ends up with less than that minimum amount of credits.
Then, we set the credits of the idle vCPUs to an even smaller value.
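The clamping described above can be sketched as follows; the constants and names here are illustrative stand-ins, not Xen's actual values:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants: the floor for runnable vCPUs sits above the
 * value given to idle vCPUs, so a runnable vCPU always outranks idle. */
#define CREDIT_MIN  (-(INT64_C(1) << 30))
#define CREDIT_IDLE (CREDIT_MIN - 1)

/* Burn credits for however long the vCPU ran, but never below the floor. */
static int64_t burn_credits(int64_t credits, int64_t runtime)
{
    credits -= runtime;
    return credits < CREDIT_MIN ? CREDIT_MIN : credits;
}
```

With this, even a pathologically long timeslice leaves the vCPU strictly above the idle vCPUs, so the scheduler still prefers it.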
NOTE: investigations have been done into _how_ it is possible for a
vCPU to execute for so much time that its credits become so low. While
still not completely clear, there is evidence that:
- it only happens very rarely,
- it appears to be both machine and workload specific,
- it does not look to be a Credit2 issue (e.g., it happens when
running with Credit1 as well), nor a scheduler issue in general.
This patch makes Credit2 more robust to events like this, whatever
the cause is, and should hence be backported (as far as possible).
Andrew Cooper [Thu, 9 Apr 2020 07:19:51 +0000 (09:19 +0200)]
x86/ucode/amd: Fix more potential buffer overruns with microcode parsing
cpu_request_microcode() doesn't know the buffer is at least 4 bytes long
before inspecting UCODE_MAGIC.
install_equiv_cpu_table() doesn't know the boundary of the buffer it is
interpreting as an equivalency table. This case was clearly observed at one
point in the past, given the subsequent overrun detection, but without
comprehending that the damage was already done.
Make the logic consistent with container_fast_forward() and pass size_left in
to install_equiv_cpu_table().
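The first of these checks amounts to a bounds check before the header read; a minimal sketch (the real code threads size_left through the container parsing):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UCODE_MAGIC 0x00414d44u  /* AMD microcode container magic ("DMA") */

/* Refuse to inspect the magic unless the buffer is known to hold it. */
static int check_container_magic(const uint8_t *buf, size_t size_left)
{
    uint32_t magic;

    if (size_left < sizeof(magic))
        return -1;               /* truncated input: don't read past the end */
    memcpy(&magic, buf, sizeof(magic));
    return magic == UCODE_MAGIC ? 0 : -1;
}
```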
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 718d1432000079ea7120f6cb770372afe707ce27
master date: 2020-04-01 14:00:12 +0100
Jan Beulich [Thu, 9 Apr 2020 07:18:11 +0000 (09:18 +0200)]
x86/HVM: fix AMD ECS handling for Fam10
The involved comparison was, very likely inadvertently, converted from
>= to > when making changes unrelated to the actual family range.
Fixes: 9841eb71ea87 ("x86/cpuid: Drop a guests cached x86 family and model information") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <paul@xen.org>
master commit: 5d515b1c296ebad6889748ea1e49e063453216a3
master date: 2020-04-01 12:28:30 +0200
Jan Beulich [Thu, 9 Apr 2020 07:15:38 +0000 (09:15 +0200)]
libx86/CPUID: fix (not just) leaf 7 processing
For one, subleaves within the respective union shouldn't live in
separate sub-structures. And then x86_cpuid_policy_fill_native() should,
as it did originally, iterate over all subleaves here as well as over
all main leaves. Switch to using a "<= MIN()"-based approach similar to
that used in x86_cpuid_copy_to_buffer(). Also follow this for the
extended main leaves then.
Fixes: 1bd2b750537b ("libx86: Fix 32bit stubdom build of x86_cpuid_policy_fill_native()") Fixes: 97e4ebdcd765 ("x86/CPUID: support leaf 7 subleaf 1 / AVX512_BF16") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: eb0bad81fceb3e81df5f73441771b49b732edf56
master date: 2020-03-27 11:40:59 +0100
Pu Wen [Thu, 9 Apr 2020 07:14:09 +0000 (09:14 +0200)]
SVM: Add union intstat_t for offset 68h in vmcb struct
According to chapter "Appendix B Layout of VMCB" in the new version
(v3.32) AMD64 APM[1], bit 1 of the VMCB offset 68h is defined as
GUEST_INTERRUPT_MASK.
The current Xen code uses the whole u64 interrupt_shadow to set up the
interrupt shadow, which misuses the other bits in VMCB offset 68h as
part of interrupt_shadow, causing svm_get_interrupt_shadow() to
mistake the guest having interrupts enabled for being in an interrupt
shadow. This has been observed to cause SeaBIOS to hang on boot.
Add union intstat_t for VMCB offset 68h and fix the code to only use
bit 0 as intr_shadow, according to the new APM description.
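A union of this shape captures the fix; field names here are illustrative, and the APM is authoritative for the layout:

```c
#include <assert.h>
#include <stdint.h>

/* VMCB offset 68h: bit 0 is INTERRUPT_SHADOW, bit 1 GUEST_INTERRUPT_MASK. */
typedef union {
    uint64_t raw;
    struct {
        uint64_t intr_shadow:1;
        uint64_t guest_intr_mask:1;
        uint64_t resvd:62;
    } fields;
} intstat_t;

/* Reading only bit 0 no longer mistakes a set GUEST_INTERRUPT_MASK for
 * an interrupt shadow. */
static int get_interrupt_shadow(intstat_t s)
{
    return s.fields.intr_shadow;
}
```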
Andrew Cooper [Thu, 9 Apr 2020 07:13:15 +0000 (09:13 +0200)]
x86/ucode: Fix error paths in apply_microcode()
In the unlikely case that patch application completes, but the
resulting revision isn't the one expected, sig->rev doesn't get
updated to match reality.
It will get adjusted the next time collect_cpu_info() gets called, but in the
meantime Xen might operate on a stale value. Nothing good will come of this.
Rewrite the logic to always update the stashed revision, before worrying about
whether the attempt was a success or failure.
Take the opportunity to make the printk() messages as consistent as possible.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: d2a0a96cf76603b2e2b87c3ce80c3f9d098327d4
master date: 2020-03-26 18:57:45 +0000
Andrew Cooper [Thu, 9 Apr 2020 07:12:26 +0000 (09:12 +0200)]
x86/ucode/amd: Fix assertion in compare_patch()
This is clearly a typo.
Fixes: 9da23943ccd "microcode: introduce a global cache of ucode patch" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org>
master commit: 13ed5d49a4214dc3521d4af7bfcf13fbcf5bfd63
master date: 2020-03-26 18:57:45 +0000
Offlining a cpu with core scheduling active can result in a hanging
system. The reason is that the scheduling resource and unit of the cpu
to be removed need to be split in order to remove the cpu from its
cpupool and move it to the idle scheduler. In case one of the involved
cpus happens to have received a sched slave event, due to a vcpu
formerly running on that cpu being woken up again, it can happen that
this cpu enters sched_wait_rendezvous_in() while its scheduling
resource is just about to be split. It might then wait forever for the
other sibling to join, which will never happen, as the resources are
already being modified.
This can easily be avoided by:
- resetting the rendezvous counters of the idle unit which is kept
- checking for a new scheduling resource in sched_wait_rendezvous_in()
after reacquiring the scheduling lock and resetting the counters in
that case without scheduling another vcpu
- moving schedule resource modifications (in schedule_cpu_rm()) and
retrieving (schedule(), sched_slave() is fine already, others are not
critical) into locked regions
sched: fix onlining cpu with core scheduling active
When onlining a cpu cpupool_cpu_add() checks whether all siblings of
the new cpu are free in order to decide whether to add it to cpupool0.
In case the added cpu is not the last sibling to be onlined, this test
is wrong, as it only checks that all online siblings are free. The
test should also check that the number of siblings has reached the
scheduling granularity of cpupool0.
Igor Druzhinin [Thu, 9 Apr 2020 07:09:40 +0000 (09:09 +0200)]
x86/shim: fix ballooning up the guest
args.preempted is meaningless here as it doesn't signal whether the
hypercall was preempted before. Use start_extent instead which is
correct (as long as the hypercall was invoked in a "normal" way).
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 76dbabb59eeaa78e9f57407e5b15a6606488333e
master date: 2020-03-18 12:55:54 +0100
iommu: fix check for autotranslated hardware domain
The current position of the check_hwdom_reqs() call is wrong, as
there's an is_iommu_enabled check at the top of the function that will
prevent reaching the check on systems without an IOMMU, because the
hardware domain won't have the XEN_DOMCTL_CDF_iommu flag set.
Move the check so it's done before the is_iommu_enabled one, so that
attempts to create a translated hardware domain without an IOMMU can
be detected.
Fixes: f89f555827a ('remove late (on-demand) construction of IOMMU page tables') Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: faf0d606a6eb911449075b83ea0ad085960c9acb
master date: 2020-03-05 10:43:46 +0100
x86/dom0: improve PVH initrd and metadata placement
Don't assume there's going to be enough space at the tail of the
loaded kernel and instead try to find a suitable memory area where the
initrd and metadata can be loaded.
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/dom0: Fix build with clang
find_memory() isn't marked as __init, so if it isn't fully inlined, it ends up
tripping:
Error: size of dom0_build.o:.text is 0x0c1
Fixes: 73b47eea21 "x86/dom0: improve PVH initrd and metadata placement" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 73b47eea21045556dc5334e4f17d0c05c23f3c16
master date: 2020-03-05 10:43:15 +0100
master commit: 40213cd8626bac712fa69c4978993e87b57a7d0c
master date: 2020-03-05 18:11:51 +0000
sched: fix error path in cpupool_unassign_cpu_start()
In case moving all domains away from the cpu to be removed fails in
cpupool_unassign_cpu_start(), the error path fails to release
sched_res_rculock.
The normal exit path releases domlist_read_lock instead (this is
currently not a problem, as the reference to the specific rcu lock is
not used by rcu_read_unlock()).
While at it, indent the present error label by one space.
One of the main design goals of core scheduling is to avoid actions
which are not directly related to the domain currently running on a
given cpu or core. Live patching is one of those actions which are
allowed to take place on a cpu only when the idle scheduling unit is
active on that cpu.
Unfortunately live patching tries to force the cpus into the idle loop
just by raising the schedule softirq, which will no longer be
guaranteed to work with core scheduling active. Additionally there are
still some places in the hypervisor calling check_for_livepatch_work()
without being in the idle loop.
It is easy to force a cpu into the main idle loop by scheduling a
tasklet on it. So switch live patching to use tasklets for switching to
idle and raising scheduling events. Additionally the calls of
check_for_livepatch_work() outside the main idle loop can be dropped.
As tasklets only run on idle vcpus, and stop_machine_run() activates
tasklets on all cpus but the one it has been called on in order to
rendezvous, stop_machine_run() itself must be called on an idle vcpu,
too; otherwise there is no way for the scheduler to activate the idle
vcpu for the tasklet on the sibling of the cpu stop_machine_run() has
been called on.
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Julien Grall <jgrall@amazon.com> Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
master commit: 005de45c887e0fefde59570686877afeda2c7b4e
master date: 2020-03-02 18:36:50 +0000
xen: make sure stop_machine_run() is always called in a tasklet
With core scheduling active it is mandatory for stop_machine_run() to
be called in idle context only (so either during boot or in a tasklet),
as otherwise a scheduling deadlock would occur: stop_machine_run()
does a cpu rendezvous by activating a tasklet on all other cpus. In
case stop_machine_run() was not called in an idle vcpu it would block
scheduling the idle vcpu on its siblings with core scheduling being
active, resulting in a hang.
Put a BUG_ON() into stop_machine_run() to test for being called in an
idle vcpu only and adapt the missing call site (ucode loading) to use a
tasklet for calling stop_machine_run().
Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 188f479de4b77e5493a7df258974a0a9d119fb0c
master date: 2020-03-02 13:08:04 +0000
Andrew Cooper [Thu, 9 Apr 2020 06:59:55 +0000 (08:59 +0200)]
x86/amd: Avoid cpu_has_hypervisor evaluating true on native hardware
Currently when booting native on AMD hardware, cpuidmask_defaults._1cd gets
configured with the HYPERVISOR bit before native CPUID is scanned for feature
bits.
This results in cpu_has_hypervisor becoming set as part of identify_cpu(), and
ends up appearing in the raw and host CPU policies.
A combination of this bug, and c/s bb502a8ca59 "x86: check feature flags after
resume" which checks that feature bits don't go missing, results in broken S3
on AMD hardware.
Alter amd_init_levelling() to exclude the HYPERVISOR bit from
cpuidmask_defaults, and update domain_cpu_policy_changed() to allow it
to be explicitly forwarded.
This also fixes a bug on kexec, where the hypervisor bit is left enabled for
the new kernel to find.
These changes highlight a further bug - dom0 construction is
asymmetric with domU construction, by not having any calls to
domain_cpu_policy_changed(). Extend arch_domain_create() to always
call domain_cpu_policy_changed().
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: e2d1254f5af2a6ff24d009523639b80ccba2c089
master date: 2020-02-14 18:01:52 +0000
Igor Druzhinin [Thu, 9 Apr 2020 06:58:22 +0000 (08:58 +0200)]
x86/time: update vtsc_last with cmpxchg and drop vtsc_lock
Now that vtsc_last is the only entity protected by vtsc_lock we can
simply update it using a single atomic operation and drop the spinlock
entirely. This is extremely important for the case of running nested
(e.g. a shim instance with lots of vCPUs assigned), since if
preemption happens somewhere inside the critical section, other vCPUs
immediately stop progressing (and probably get preempted as well)
while waiting for the spinlock to be freed.
This fixes constant shim guest boot lockups with ~32 vCPUs if there is
vCPU overcommit present (which increases the likelihood of preemption).
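The lock-free update can be sketched with C11 atomics; this is a sketch of the idea (move vtsc_last forward with compare-and-exchange, hand out the next tick if another vCPU got there first), not Xen's exact code, which uses its own cmpxchg primitives:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Lock-free monotonic update of vtsc_last: advance it to "now" via CAS,
 * retrying on races; if another vCPU already moved it past "now", hand
 * out the next tick instead so returned values stay monotonic. */
static uint64_t vtsc_advance(_Atomic uint64_t *vtsc_last, uint64_t now)
{
    uint64_t old = atomic_load(vtsc_last);

    do {
        if (now <= old)
            return atomic_fetch_add(vtsc_last, 1) + 1;
    } while (!atomic_compare_exchange_weak(vtsc_last, &old, now));

    return now;
}
```

No vCPU can now block another by being preempted mid-update: every path through the function is a single atomic operation plus retries.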
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: f9dee1f945ebb6fb5f9df6f5d95b15c25727f48e
master date: 2019-12-20 16:44:38 +0100
x86: do not enable global pages when virtualized on AMD or Hygon hardware
When using global pages a full tlb flush can only be performed by
toggling the PGE bit in CR4, which is usually quite expensive in terms
of performance when running virtualized. This is especially relevant
on AMD or Hygon hardware, which doesn't have the ability to do
selective CR4 trapping, but can also be relevant on e.g. Intel if the
underlying
hypervisor also traps accesses to the PGE CR4 bit.
In order to avoid this performance penalty, do not use global pages
when running virtualized on AMD or Hygon hardware. A command line option
'global-pages' is provided in order to allow the user to select whether
global pages will be enabled for PV guests.
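The resulting default selection might look like this sketch, where opt is -1 when 'global-pages' wasn't specified on the command line; the function name and encoding are illustrative:

```c
#include <assert.h>

/* Decide whether PV guests get CR4.PGE: an explicit command line choice
 * wins; otherwise default off only when virtualized on AMD/Hygon. */
static int use_global_pages(int virtualized, int amd_or_hygon, int opt)
{
    if (opt >= 0)
        return opt;                    /* user's explicit choice */
    return !(virtualized && amd_or_hygon);
}
```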
The above figures are from a PV shim running on AMD hardware with
32 vCPUs:
Average lock time: 240829ns
Average block time: 1453029ns
As seen from the above figures the lock and block time of the flush
lock is reduced to approximately 1/3 of the original value.
Note that XEN_MINIMAL_CR4 and mmu_cr4_features are not modified, and
thus global pages are left enabled for the hypervisor. This is not an
issue because the code to switch the control registers (cr3 and cr4)
already takes into account such situation and performs the necessary
flushes. The same already happens when using XPTI or PCIDE, as the
guest cr4 doesn't have global pages enabled in that case either.
Also note that the suspend and resume code is correct in writing
mmu_cr4_features into cr4 on resume, since that's the cr4 used by the
idle vCPU which is the context used by the suspend and resume routine.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
x86/pv: Fix `global-pages` to match the documentation
c/s 5de961d9c09 "x86: do not enable global pages when virtualized on AMD or
Hygon hardware" in fact does enable them. Fix the calculation in pge_init().
While fixing this, adjust the command line documentation, first to use
the newer style, and second to expand the description to discuss cases
where the option might be useful to use, but which Xen can't account
for by default.
Fixes: 5de961d9c09 ('x86: do not enable global pages when virtualized on AMD or Hygon hardware') Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 5de961d9c0976f0a03d830956a4e7ac3e9d887ff
master date: 2019-12-10 11:34:00 +0100
master commit: b041709c369b36cb17a019a196fba773ec7e77bd
master date: 2019-12-16 16:04:10 +0000
Jan Beulich [Thu, 5 Mar 2020 10:06:57 +0000 (11:06 +0100)]
x86: "spec-ctrl=no-xen" should also disable branch hardening
This is controlling Xen behavior alone, after all.
Reported-by: Jin Nan Wang <jnwang@suse.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: e6ca7afcf2ddeb72beade853ccd6fa3332210014
master date: 2020-02-20 11:37:01 +0100
Juergen Gross [Thu, 5 Mar 2020 10:06:19 +0000 (11:06 +0100)]
sched: fix get_cpu_idle_time() with core scheduling
get_cpu_idle_time() is calling vcpu_runstate_get() for an idle vcpu.
With core scheduling active this is fragile, as idle vcpus are
temporarily assigned to other scheduling units, and in some cases that
assignment is changed without holding the scheduling lock. As
vcpu_runstate_get() uses v->sched_unit as the parameter for
unit_schedule_[un]lock_irq(), an ASSERT() triggers in the unlock in
case v->sched_unit has changed in the meantime.
Fix that by using a local unit variable holding the correct unit.
Jan Beulich [Thu, 5 Mar 2020 10:05:44 +0000 (11:05 +0100)]
VT-d: check all of an RMRR for being E820-reserved
Checking just the first and last page is not sufficient (and redundant
for single-page regions). As we don't need to care about IA64 anymore,
use an x86-specific function to get this done without looping over each
individual page.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: d6573bc6e6b7d95bb9de8471a6bfd7048ebc50f3
master date: 2020-02-18 16:21:19 +0100
Igor Druzhinin [Thu, 5 Mar 2020 10:04:27 +0000 (11:04 +0100)]
x86/shim: suspend and resume platform time correctly
Similarly to S3, platform time needs to be saved on guest suspend and
restored on resume. This should account for expected jumps in the PV
clock counter value after resume. time_suspend/resume() are safe to
use in a PVH setting as-is, since any existing operations with the
PIT/HPET that they do would simply be ignored if the PIT/HPET is not
present.
Additionally, add resume callback for Xen PV clocksource to avoid
its breakage on migration.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: a7a3ecd82e289a9a2ecc1d3b5128580e0b577cc7
master date: 2020-02-14 18:01:52 +0000
David Woodhouse [Thu, 5 Mar 2020 10:02:52 +0000 (11:02 +0100)]
x86/smp: reset x2apic_enabled in smp_send_stop()
Just before smp_send_stop() re-enables interrupts when shutting down
for reboot or kexec, it calls __stop_this_cpu() which in turn calls
disable_local_APIC(), which puts the APIC back in to the mode Xen found
it in at boot.
If that means turning x2APIC off and going back into xAPIC mode, then
a timer interrupt occurring just after interrupts come back on will
lead to a GP# when apic_timer_interrupt() attempts to ack the IRQ
through the EOI register in x2APIC MSR 0x80b:
We can't clear the global x2apic_enabled variable in disable_local_APIC()
itself because that runs on each CPU. Instead, correct it (by using
current_local_apic_mode()) in smp_send_stop() while interrupts are still
disabled immediately after calling __stop_this_cpu() for the boot CPU,
after all other CPUs have been stopped.
cf: d639bdd9bbe ("x86/apic: Disable the LAPIC later in smp_send_stop()")
... which didn't quite fix it completely.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
master commit: 8b1002ab037aeacdece7723c07ab35ca16c1e22e
master date: 2020-02-14 18:01:52 +0000
Andrew Cooper [Thu, 5 Mar 2020 10:02:02 +0000 (11:02 +0100)]
xen/pvh: Fix segment selector ABI
The written ABI states that %es will be set up, but libxc doesn't do so. In
practice, it breaks `rep movs` inside guests before they reload %es.
The written ABI doesn't mention %ss, but libxc does set it up. Having
%ds different to %ss is obnoxious to work with, as different registers
have different implicit segments.
Modify the spec to state that %ss is set up as a flat read/write segment.
This a) matches the Multiboot 1 spec, b) matches what is set up in practice,
and c) is the more sane behaviour for guests to use.
Fixes: 68e1183411b ('libxc: introduce a xc_dom_arch for hvm-3.0-x86_32 guests') Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
x86/pvh: Adjust dom0's starting state
Fixes: b25fb1a04e "xen/pvh: Fix segment selector ABI" Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Wei Liu <wl@xen.org> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b25fb1a04e99cc03359eade1affb56ef0eee766f
master date: 2020-02-10 15:26:09 +0000
master commit: 6ee10313623c1f41fc72fe12372e176e744463c1
master date: 2020-02-11 11:04:26 +0000
Jan Beulich [Thu, 5 Mar 2020 10:01:01 +0000 (11:01 +0100)]
xmalloc: guard against integer overflow
There are hypercall handling paths (EFI ones are what this was found
with) needing to allocate buffers of a caller specified size. This is
generally fine, as our page allocator enforces an upper bound on all
allocations. However, certain extremely large sizes could, when adding
in allocator overhead, result in an apparently tiny allocation size,
which would typically result in either a successful allocation, but a
severe buffer overrun when using that memory block, or in a crash right
in the allocator code.
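The guard amounts to checking the addition before performing it; ALLOC_OVERHEAD here is a hypothetical stand-in for the allocator's real header and padding size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALLOC_OVERHEAD 32u  /* hypothetical per-allocation overhead */

/* Return the padded allocation size, or 0 when adding the overhead
 * would wrap around and yield an apparently tiny allocation. */
static size_t padded_size(size_t requested)
{
    if (requested > SIZE_MAX - ALLOC_OVERHEAD)
        return 0;            /* overflow: refuse instead of wrapping */
    return requested + ALLOC_OVERHEAD;
}
```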
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: cf38b4926e2b55d1d7715cff5095a7444f5ed42d
master date: 2020-02-06 09:53:12 +0100
Jan Beulich [Thu, 5 Mar 2020 10:00:22 +0000 (11:00 +0100)]
EFI: don't leak heap contents through XEN_EFI_get_next_variable_name
Commit 1f4eb9d27d0e ("EFI: fix getting EFI variable list on some
systems") switched to using the caller provided size for the copy-out
without making sure the copied buffer is properly scrubbed.
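The fix amounts to scrubbing up to the caller-provided size before filling in the name; a minimal sketch (the real code deals with UCS-2 EFI variable names):

```c
#include <assert.h>
#include <string.h>

/* Zero the whole caller-sized buffer first, so bytes past the returned
 * name are zeros rather than leftover heap contents. */
static void fill_reply(char *buf, size_t caller_size,
                       const char *name, size_t name_len)
{
    memset(buf, 0, caller_size);
    memcpy(buf, name, name_len < caller_size ? name_len : caller_size);
}
```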
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: 4783ee894f6bfb0f4deec9f1fe8e7faceafaa1a2
master date: 2020-02-06 09:52:33 +0100
Jan Beulich [Thu, 5 Mar 2020 09:59:33 +0000 (10:59 +0100)]
EFI: re-check {get,set}-variable name strings after copying in
A malicious guest given permission to invoke XENPF_efi_runtime_call
may play with the strings underneath Xen, between Xen sizing them and
copying them in. Guard against this by re-checking the copied-in data
for consistency with the initial sizing. At the same time, also check
that the actual copy-in is in fact successful, and switch to the
lighter-weight non-checking flavor of the function.
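The consistency re-check can be sketched like this; names and types are illustrative (the real code operates on guest-supplied UCS-2 strings):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* After copying the string in, verify it still matches the earlier
 * sizing: no NUL terminator before measured_len, and one exactly there. */
static int name_still_consistent(const uint16_t *name, size_t measured_len)
{
    for (size_t i = 0; i < measured_len; i++)
        if (name[i] == 0)
            return 0;            /* guest shrank the string under us */
    return name[measured_len] == 0;
}
```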
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com>
master commit: ad38db5852f0e30d90c93c6a62b754f2861549e0
master date: 2020-02-06 09:51:17 +0100
Julien Grall [Thu, 5 Mar 2020 09:58:58 +0000 (10:58 +0100)]
xen/x86: domctl: Don't leak data via XEN_DOMCTL_gethvmcontext
The HVM context may not fill up the full buffer passed by the caller.
While we correctly report the size of the context, we will still copy
back the full size of the buffer.
As the buffer is allocated through xmalloc(), we would be copying some
bits from a previous allocation.
Only copy back the part of the buffer used by the HVM context, to
prevent any leak.
Note that per XSA-72, this is not a security issue.
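The fix is simply to bound the copy-out by the used length rather than the allocation size; a sketch with illustrative names:

```c
#include <assert.h>
#include <string.h>

/* Copy back only the bytes the context actually used, never the full
 * xmalloc()ed buffer, so stale allocator contents can't leak out. */
static size_t copy_ctx_out(void *dst, const void *ctx,
                           size_t used, size_t buf_size)
{
    size_t n = used < buf_size ? used : buf_size;

    memcpy(dst, ctx, n);
    return n;
}
```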
Jan Beulich [Thu, 5 Mar 2020 09:58:04 +0000 (10:58 +0100)]
x86/HVM: relinquish resources also from hvm_domain_destroy()
Domain creation failure paths don't call domain_relinquish_resources(),
yet allocations and the like done from hvm_domain_initialize() need to
be undone nevertheless. Call the function also from
hvm_domain_destroy(), after making sure all descendants are idempotent.
Note that while viridian_{domain,vcpu}_deinit() were already used in
ways suggesting they're idempotent, viridian_time_vcpu_deinit() actually
wasn't: One can't kill a timer that was never initialized.
For hvm_destroy_all_ioreq_servers()'s purposes make
relocate_portio_handler() return whether the to be relocated port range
was actually found. This seems cheaper than introducing a flag into
struct hvm_domain's ioreq_server sub-structure.
In hvm_domain_initialise() additionally
- use XFREE() also to replace adjacent xfree(),
- use hvm_domain_relinquish_resources() as being idempotent now.
There as well as in hvm_domain_destroy() the explicit call to
rtc_deinit() isn't needed anymore.
In hvm_domain_relinquish_resources() additionally drop a no longer
relevant if().
Fixes: e7a9b5e72f26 ("viridian: separately allocate domain and vcpu structures") Fixes: 26fba3c85571 ("viridian: add implementation of synthetic timers") Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Paul Durrant <pdurrant@amazon.com>
master commit: b3344bb1cae0c9ac22a57db8ecca488ad0e4a66d
master date: 2020-01-31 16:47:29 +0100
Igor Druzhinin [Thu, 5 Mar 2020 09:57:20 +0000 (10:57 +0100)]
x86/suspend: disable watchdog before calling console_start_sync()
... and enable it again after exiting the S-state. Otherwise,
accumulated output in the serial buffer might easily trigger the
watchdog if it's still enabled after entering sync transmission mode.
The issue was observed on machines which, unfortunately, generate
non-0 output in CPU offline callbacks.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 5e08f5f56c9955d853c26c985b6fb1fb45d0355d
master date: 2020-01-29 15:06:10 +0100
Roger Pau Monné [Thu, 5 Mar 2020 09:56:38 +0000 (10:56 +0100)]
x86/apic: fix disabling LVT0 in disconnect_bsp_APIC
The Intel SDM states:
"When an illegal vector value (0 to 15) is written to a LVT entry and
the delivery mode is Fixed (bits 8-11 equal 0), the APIC may signal an
illegal vector error, without regard to whether the mask bit is set or
whether an interrupt is actually seen on the input."
And that's exactly what's currently done in disconnect_bsp_APIC when
virt_wire_setup is true and LVT LINT0 is being masked. By writing only
APIC_LVT_MASKED Xen is actually setting the vector to 0 and the
delivery mode to Fixed (0), and hence it triggers an APIC error even
when the LVT entry is masked.
This would usually manifest when Xen is being shut down, as that's
where disconnect_bsp_APIC is called:
(XEN) APIC error on CPU0: 40(00)
Fix this by calling clear_local_APIC prior to setting the LVT LINT
registers which already clear LVT LINT0, and hence the troublesome
write can be avoided as the register is already cleared.
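The problematic write can be illustrated numerically; the check function below is ours for illustration, using the register layout from the SDM text quoted above:

```c
#include <assert.h>
#include <stdint.h>

#define APIC_LVT_MASKED (1u << 16)

/* Per the quoted SDM text: a vector (bits 0-7) in the 0-15 range with
 * Fixed delivery mode (bits 8-11 zero) may signal an illegal vector
 * error, regardless of the mask bit. */
static int lvt_write_may_fault(uint32_t lvt)
{
    return ((lvt >> 8) & 0xf) == 0 && (lvt & 0xff) < 16;
}
```

Writing APIC_LVT_MASKED alone leaves vector 0 with Fixed delivery, which is exactly the error-signalling case.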
Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 782b48b7f7319c07b044606d67a60875e53dd05b
master date: 2020-01-29 14:47:00 +0100
Jan Beulich [Thu, 5 Mar 2020 09:55:15 +0000 (10:55 +0100)]
VT-d: don't pass bridge devices to domain_context_mapping_one()
When passed a non-NULL pdev, the function does an owner check when it
finds an already existing context mapping. Bridges, however, don't get
passed through to guests, and hence their owner is always going to be
Dom0, leading the assignment of all but one of the functions of multi-
function PCI devices behind bridges to fail.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com>
master commit: a4d457fd59f4ebfb524aec82cb6a3030087914ca
master date: 2020-01-22 16:39:58 +0100
Jan Beulich [Thu, 5 Mar 2020 09:54:33 +0000 (10:54 +0100)]
build: fix dependency file generation with ENFORCE_UNIQUE_SYMBOLS=y
The recorded file, unless overridden by -MQ (or -MT) is that specified
by -o, which doesn't produce correct dependencies and hence will cause
failure to re-build when included files change.
Fixes: 81ecb38b83b0 ("build: provide option to disambiguate symbol names") Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 17a6c03701bf65c0b4e8b5ed5a3970cd0248c47f
master date: 2020-01-17 17:38:19 +0100
Igor Druzhinin [Thu, 5 Mar 2020 09:53:59 +0000 (10:53 +0100)]
x86/sm{e, a}p: do not enable SMEP/SMAP in PV shim by default on AMD
Due to AMD and Hygon being unable to selectively trap CR4 bit
modifications, running a 32-bit PV guest inside a PV shim comes with a
significant performance hit. Moreover, for SMEP in particular, every
time CR4.SMEP changes on context switch to/from the 32-bit PV guest,
the change gets trapped by L0 Xen, which then tries to perform a
global TLB invalidation for the PV shim domain. This usually results
in an eventual hang of a PV shim with at least several vCPUs.
Since the overall security risk is generally lower for shim Xen, it
being more of a defense-in-depth mechanism there, disable SMEP/SMAP in
it by default on AMD and Hygon, unless the user chooses otherwise.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b05ec9263e56ef0784da766e829cfe08569d1d88
master date: 2020-01-17 16:18:20 +0100
Igor Druzhinin [Thu, 5 Mar 2020 09:53:21 +0000 (10:53 +0100)]
x86/time: update TSC stamp on restore from deep C-state
If ITSC is not available on the CPU (e.g. if running nested as PV
shim), then X86_FEATURE_NONSTOP_TSC is not advertised in certain
cases, i.e. on all AMD and some old Intel processors. In that case the
TSC needs to be restored on the CPU from platform time by Xen upon
exiting C-states.
As platform time might be behind the last TSC stamp recorded for the
current CPU, the invariant of the TSC stamp always being behind the
local TSC counter is violated. This has the effect of get_s_time()
going negative, resulting in an eventual system hang or crash.
Fix this issue by updating local TSC stamp along with TSC counter write.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: bbf283f853f8c0e4d29248dd44d3b0e0abc07629
master date: 2020-01-17 16:11:20 +0100
Jan Beulich [Thu, 5 Mar 2020 09:52:31 +0000 (10:52 +0100)]
IRQ: u16 is too narrow for an event channel number
FIFO event channels allow ports up to 2^17, so we need to use a wider
field in struct pirq. Move "masked" such that it may share the 8-byte
slot with struct arch_pirq on 64-bit arches, rather than leaving a
7-byte hole in all cases.
Take the opportunity and also add a comment regarding "arch" placement
within the structure.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Arm: fix build after 892b9dcebdb7
"IRQ: u16 is too narrow for an event channel number" introduced a use of
evetchn_port_t, but its typedef apparently surfaces indirectly here only
on x86.
Juergen Gross [Wed, 15 Jan 2020 13:24:47 +0000 (14:24 +0100)]
x86: clear per cpu stub page information in cpu_smpboot_free()
cpu_smpboot_free() removes the stubs for the CPU going offline, but it
isn't clearing the related per-cpu variables. This will result in
crashes when a stub page is released because all related CPUs have gone
offline, and one of those CPUs goes online again later.
Fix that by clearing stubs.addr and stubs.mfn in order to allocate a
new stub page when needed, irrespective of whether the CPU gets parked
or removed.
Fixes: 2e6c8f182c9c50 ("x86: distinguish CPU offlining from CPU removal") Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Wei Liu <wl@xen.org> Reviewed-by: Jan Beulich <jbeulich@suse.com> Tested-by: Tao Xu <tao3.xu@intel.com>
master commit: 774901788c5614798931a1cb2e20dd8b885f97ab
master date: 2020-01-09 11:07:38 +0100
Juergen Gross [Wed, 15 Jan 2020 13:24:09 +0000 (14:24 +0100)]
sched: fix resuming from S3 with smt=0
When resuming from S3 with smt=0 or maxcpus= specified, we must not
do anything in cpu_schedule_callback(). This is not the case today
when taking down a cpu during resume.
If anything goes wrong during resume all the scheduler related error
handling is in cpupool.c, so we can just bail out early from
cpu_schedule_callback() when suspending or resuming.
This fixes commit 0763cd2687897b55e7 ("xen/sched: don't disable
scheduler on cpus during suspend").
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: George Dunlap <george.dunlap@citrix.com>
master commit: d7f3c76317108ee9989f00545d394fa495fba752
master date: 2020-01-08 14:59:25 +0100
Roger Pau Monné [Wed, 15 Jan 2020 13:23:36 +0000 (14:23 +0100)]
x86/tlbflush: do not toggle the PGE CR4 bit unless necessary
When PCID is not available Xen does a full tlbflush by toggling the
PGE bit in CR4. This is not necessary if PGE is not enabled, since a
flush can be performed by writing to CR3 in that case.
Change the code in do_tlb_flush to only toggle the PGE bit in CR4 if
it's already enabled, otherwise do the tlb flush by writing to CR3.
This is relevant when running virtualized, since hypervisors don't
usually trap accesses to CR3 when using hardware assisted paging, but
do trap accesses to CR4 specially on AMD hardware, which makes such
accesses much more expensive there.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: b5087a31efee7a4e34c958b88671ac6669501b09
master date: 2019-12-03 14:15:35 +0100
"Some Coffee Lake platforms have a skewed HPET timer once the SoCs entered
PC10, which in consequence marks TSC as unstable because HPET is used as
watchdog clocksource for TSC."
Follow this for Xen as well. Looking at its patch context made me notice
they have a pre-existing quirk for Bay Trail as well. The comment there,
however, points at a Cherry Trail document. Looking at the datasheets of
both, there appear to be similar issues, so go beyond Linux'es coverage
and exclude both. Also key the disable on the PCI IDs of the actual
affected devices, rather than those of 00:00.0.
Apply the workarounds only when the use of HPET was not explicitly
requested on the command line and when use of (deep) C-states was not
disabled.
Adjust a few types in touched or nearby code at the same time.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: d5294a302c8441191d47888452958aea25243723
master date: 2019-12-03 14:14:44 +0100
Jan Beulich [Wed, 15 Jan 2020 13:22:22 +0000 (14:22 +0100)]
gnttab: make sure grant map operations don't skip their IOMMU part
Two almost simultaneous mapping requests need to make sure that at the
completion of the earlier one IOMMU mappings (established explicitly
here in the PV case) have been put in place. Forever since the splitting
of the grant table lock a violation of this has been possible (using
simplified pin counts, as it doesn't matter whether we talk about read
or write mappings here):
initial state: act->pin = 0
vCPU A: progress the operation past the dropping of the locks after the
act->pin updates (act->pin = 1, old_pin = 0, act_pin = 1)
vCPU B: progress the operation past the dropping of the locks after the
act->pin updates (act->pin = 2, old_pin = 1, act_pin = 2)
vCPU B: (re-)acquire both gt locks, mapkind() returns 0, but both
iommu_legacy_map() invocations get skipped due to non-zero
old_pin
With the locks dropped intermediately, whether to invoke
iommu_legacy_map() must depend on only the return value of mapkind()
and of course the kind of mapping request being processed, just like
is already the case in unmap_common().
Also fix the style of the adjacent comment, and correct a nearby one
still referring to a prior name of what is now mapkind().
Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
master commit: 921f1f42260c7967bf18f8a143d39511d163c421
master date: 2019-12-03 14:13:40 +0100
Julien Grall [Wed, 15 Jan 2020 13:21:14 +0000 (14:21 +0100)]
xen/x86: vpmu: Unmap per-vCPU PMU page when the domain is destroyed
A guest will setup a shared page with the hypervisor for each vCPU via
XENPMU_init. The page will then get mapped in the hypervisor and only
released when XENPMU_finish is called.
This means that if the guest fails to invoke XENPMU_finish, e.g if it is
destroyed rather than cleanly shut down, the page will stay mapped in the
hypervisor. One of the consequences is the domain can never be fully
destroyed as a page reference is still held.
As Xen should never rely on the guest to correctly clean-up any
allocation in the hypervisor, we should also unmap such pages during the
domain destruction if there are any left.
We can re-use the same logic as in pvpmu_finish(). To avoid
duplication, move the logic in a new function that can also be called
from vpmu_destroy().
NOTE: - The call to vpmu_destroy() must also be moved from
arch_vcpu_destroy() into domain_relinquish_resources() such that
the reference on the mapped page does not prevent domain_destroy()
(which calls arch_vcpu_destroy()) from being called.
- Whilst it appears that vpmu_arch_destroy() is idempotent it is
by no means obvious. Hence make sure the VPMU_CONTEXT_ALLOCATED
flag is cleared at the end of vpmu_arch_destroy().
- This is not an XSA because vPMU is not security supported (see
XSA-163).
Signed-off-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: be18e39d2f69038804b27c30026754deaeefa543
master date: 2019-11-29 18:23:24 +0000
Julien Grall [Thu, 19 Dec 2019 08:12:21 +0000 (08:12 +0000)]
xen/arm: Place a speculation barrier sequence following an eret instruction
Some CPUs can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.
Newer CPUs may implement a new SB barrier instruction which acts
as an architected speculation barrier. For current CPUs, the sequence
DSB; ISB is known to prevent speculation.
The latter sequence is heavier than SB but it would never be executed
(this is speculation after all!).
Introduce a new macro 'sb' that can be used when a speculation barrier
is required. For now it uses dsb; isb, but this could easily be
updated to cater for SB in the future.
Juergen Gross [Wed, 11 Dec 2019 16:56:59 +0000 (17:56 +0100)]
build: fix tools/configure in case only python3 exists
When calling ./configure with python3 present but no python,
tools/configure will fail. Fix that by defaulting to python and
falling back to python3 or python2.
While at it, fix the use of the non-portable "type -p" by replacing it
with AC_PATH_PROG().
Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wl@xen.org>
[ wei: run autogen.sh ] Reviewed-by: Ian Jackson <ian.jackson@eu.citrix.com>
(cherry picked from commit 5852ca48526316918cd82fba1033a6a5379fbc4c)
Andrew Cooper [Tue, 3 Dec 2019 16:59:09 +0000 (16:59 +0000)]
x86/svm: Fix handling of EFLAGS.RF on task switch
VT-x updates RF before vmexit, so the eflags value written into the outgoing TSS happens
to be correct. SVM does not update RF before vmexit, and instead provides it
via a bit in exitinfo2.
In practice, needing RF set in the outgoing state occurs when a task gate is
used to handle faults.
Extend hvm_task_switch() with an extra_eflags parameter which gets fed into
the outgoing TSS, and fill it in suitably from the SVM vmexit information.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Wed, 11 Dec 2019 14:06:08 +0000 (15:06 +0100)]
AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables
update_paging_mode() has multiple bugs:
1) Booting with iommu=debug will cause it to inform you that it was called
without the pdev_list lock held.
2) When growing by more than a single level, it leaks the newly allocated
table(s) in the case of a further error.
Furthermore, the choice of default level for a domain has issues:
1) All HVM guests grow from 2 to 3 levels during construction because of the
position of the VRAM just below the 4G boundary, so defaulting to 2 is a
waste of effort.
2) The limit for PV guests doesn't take memory hotplug into account, and
isn't dynamic at runtime like HVM guests. This means that a PV guest may
get RAM which it can't map in the IOMMU.
The dynamic height is a property unique to AMD, and adds a substantial
quantity of complexity for what is a marginal performance improvement. Remove
the complexity by removing the dynamic height.
PV guests now get 3 or 4 levels based on any hotplug regions in the host.
This only makes a difference for hardware which previously had all RAM below
the 512G boundary, and a hotplug region above.
HVM guests now get 4 levels (which will be sufficient until 256TB guests
become a thing), because we don't currently have the information to know when
3 would be safe to use.
The overhead of this extra level is not expected to be noticeable. It costs
one page (4k) per domain, and one extra IO-TLB paging structure cache entry
which is very hot and less likely to be evicted.
This is XSA-311.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: b4f042236ae0bb6725b3e8dd40af5a2466a6f971
master date: 2019-12-11 14:55:32 +0100
George Dunlap [Wed, 11 Dec 2019 14:05:39 +0000 (15:05 +0100)]
x86/mm: relinquish_memory: Grab an extra type ref when setting PGT_partial
The PGT_partial bit in page->type_info holds both a type count and a
general ref count. During domain tear-down, when free_page_type()
returns -ERESTART, relinquish_memory() correctly handles the general
ref count, but fails to grab an extra type count when setting
PGT_partial. When this bit is eventually cleared, type_count underflows
and triggers the following BUG in page_alloc.c:free_domheap_pages():
As far as we can tell, this page underflow cannot be exploited in any
other way: The page can't be used as a pagetable by the dying domain
because it's dying; it can't be used as a pagetable by any other
domain since it belongs to the dying domain; and ownership can't
transfer to any other domain without hitting the BUG_ON() in
free_domheap_pages().
(steal_page() won't work on a page in this state, since it requires
PGC_allocated to be set, and PGC_allocated will already have been
cleared.)
Fix this by grabbing an extra type ref if setting PGT_partial in
relinquish_memory.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com>
master commit: 66bdc16aeed8ddb2ae724adc5ea6bde0dea78c3d
master date: 2019-12-11 14:55:08 +0100
George Dunlap [Wed, 11 Dec 2019 14:04:53 +0000 (15:04 +0100)]
x86/mm: alloc/free_lN_table: Retain partial_flags on -EINTR
When validating or de-validating pages (in alloc_lN_table and
free_lN_table respectively), the `partial_flags` local variable is
used to keep track of whether the "current" PTE started the entire
operation in a "may be partial" state.
One of the patches in XSA-299 addressed the fact that it is possible
for a previously-partially-validated entry to subsequently be found to
have invalid entries (indicated by returning -EINVAL); in which case
page->partial_flags needs to be set to indicate that the current PTE
may have the partial bit set (and thus _put_page_type() should be
called with PTF_partial_set).
Unfortunately, the patches in XSA-299 assumed that once
put_page_from_lNe() returned -ERESTART on a page, it was not possible
for it to return -EINTR. This turns out to be true for
alloc_lN_table() and free_lN_table, but not for _get_page_type() and
_put_page_type(): both can return -EINTR when called on pages with
PGT_partial set. In these cases, the page's PGT_partial bit will still be
set; failing to set partial_flags appropriately may allow an attacker
to do a privilege escalation similar to those described in XSA-299.
Fix this by always copying the local partial_flags variable into
page->partial_flags when exiting early.
NB that on the "get" side, no adjustment to nr_validated_entries is
needed: whether pte[i] is partially validated or entirely
un-validated, we want nr_validated_entries = i. On the "put" side,
however, we need to adjust nr_validated_entries appropriately: if
pte[i] is entirely validated, we want nr_validated_entries = i + 1; if
pte[i] is partially validated, we want nr_validated_entries = i.
This is part of XSA-310.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 4e70f4476c0c543559f971faecdd5f1300cddb0a
master date: 2019-12-11 14:54:43 +0100
George Dunlap [Wed, 11 Dec 2019 14:04:08 +0000 (15:04 +0100)]
x86/mm: Set old_guest_table when destroying vcpu pagetables
Changeset 6c4efc1eba ("x86/mm: Don't drop a type ref unless you held a
ref to begin with"), part of XSA-299, changed the calling discipline
of put_page_type() such that if put_page_type() returned -ERESTART
(indicating a partially de-validated page), subsequent calls to
put_page_type() must be called with PTF_partial_set. If called on a
partially de-validated page but without PTF_partial_set, Xen will
BUG(), because to do otherwise would risk opening up the kind of
privilege escalation bug described in XSA-299.
One place this was missed was in vcpu_destroy_pagetables().
put_page_and_type_preemptible() is called, but on -ERESTART, the
entire operation is simply restarted, causing put_page_type() to be
called on a partially de-validated page without PTF_partial_set. The
result was that if such an operation were interrupted, Xen would hit a
BUG().
Fix this by having vcpu_destroy_pagetables() consistently pass off
interrupted de-validations to put_old_page_type():
- Unconditionally clear references to the page, even if
put_page_and_type failed
- Set old_guest_table and old_guest_table_partial appropriately
While here, do some refactoring:
- Move clearing of arch.cr3 to the top of the function
- Now that clearing is unconditional, move the unmap to the same
conditional as the l4tab mapping. This also allows us to reduce
the scope of the l4tab variable.
- Avoid code duplication by looping to drop references on
guest_table_user
This is part of XSA-310.
Reported-by: Sarah Newman <srn@prgmr.com> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: ececa12b2c4c8e4433e4f9be83f5c668ae36fe08
master date: 2019-12-11 14:54:13 +0100
George Dunlap [Wed, 11 Dec 2019 14:03:35 +0000 (15:03 +0100)]
x86/mm: Don't reset linear_pt_count on partial validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to, or is pointed to by, another page
table, in order to prevent both from happening at the same time.
Unfortunately, the original commit introducing this restriction reset
this count when resuming validation of a partially-validated pagetable,
dropping some "linear_pt_entry" counts.
On debug builds on systems where guests used this feature, this might
lead to crashes that look like this:
Assertion 'oc > 0' failed at mm.c:874
Worse, if an attacker could engineer such a situation to occur, they
might be able to make loops or other arbitrary chains of linear
pagetables, leading to the denial-of-service situation outlined in
XSA-240.
This is XSA-309.
Reported-by: Manuel Bouyer <bouyer@antioche.eu.org> Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com>
master commit: 7473efd12fb7a6548f5303f1f4c5cb521543a813
master date: 2019-12-11 14:10:27 +0100
Andrew Cooper [Wed, 11 Dec 2019 14:03:00 +0000 (15:03 +0100)]
x86/vtx: Work around SingleStep + STI/MovSS VMEntry failures
See patch comment for technical details.
Concerning the timeline, this was first discovered in the aftermath of
XSA-156 which caused #DB to be intercepted unconditionally, but only in
its SingleStep + STI form which is restricted to privileged software.
After working with Intel and identifying the problematic vmentry check,
this workaround was suggested, and the patch was posted in an RFC
series. Outstanding work for that series (not breaking Introspection)
is still pending, and this fix from it (which wouldn't have been good
enough in its original form) wasn't committed.
A vmentry failure was reported to xen-devel, and debugging identified
this bug in its SingleStep + MovSS form by way of INT1, which does not
involve the use of any privileged instructions, proving this to be a
security issue.
This is XSA-308
Reported-by: Håkon Alstadheim <hakon@alstadheim.priv.no> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Kevin Tian <kevin.tian@intel.com>
master commit: 1d3eb8259804e5bec991a3462d69ba6bd80bb40e
master date: 2019-12-11 14:09:30 +0100
Jan Beulich [Wed, 11 Dec 2019 14:01:22 +0000 (15:01 +0100)]
x86+Arm32: make find_next_{,zero_}bit() have well defined behavior
These functions getting used with the 2nd and 3rd arguments being equal
wasn't well defined: Arm64 reliably returns the value of the 2nd
argument in this case, while on x86 for bitmaps up to 64 bits wide the
return value was undefined (due to the undefined behavior of a shift of
a value by the number of bits it's wide) when the incoming value was 64.
On Arm32 an actual out of bounds access would happen when the
size/offset value is a multiple of 32; if this access doesn't fault, the
return value would have been sufficiently correct afaict.
Make the functions consistently tolerate the last two arguments being
equal (and in fact the 3rd argument being greater or equal to the 2nd),
in favor of finding and fixing all the use sites that violate the
original more strict assumption.
Andrew Cooper [Sat, 7 Dec 2019 15:50:22 +0000 (15:50 +0000)]
xen/build: Automatically locate a suitable python interpreter
Needing to pass PYTHON=python3 into hypervisor builds is irritating and
unnecessary. Locate a suitable interpreter automatically, defaulting to Py3
if it is available.
Reported-by: Steven Haigh <netwiz@crc.id.au> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Sat, 7 Dec 2019 16:20:55 +0000 (16:20 +0000)]
xen/flask: Drop the gen-policy.py script
The script is Python 2 specific, and fails with string/binary issues with
Python 3:
Traceback (most recent call last):
File "gen-policy.py", line 14, in <module>
for char in sys.stdin.read():
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8c in position 0: invalid start byte
Fixing the script to be compatible isn't hard, but using python here is
wasteful. Drop the script entirely, and write an equivalent flask-policy.S
instead. This removes the need for a $(PYTHON) and $(CC) pass.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> Acked-by: Julien Grall <julien@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Igor Druzhinin [Tue, 10 Dec 2019 10:36:20 +0000 (11:36 +0100)]
x86/AMD: unbreak CPU hotplug on AMD systems without RstrFpErrPtrs
If the feature is not present, Xen will try to force the X86_BUG_FPU_PTRS
feature at CPU identification time. This is especially noticeable in the
PV shim, which usually hotplugs its vCPUs. We either need to restrict this
action to the boot CPU only, or allow secondary CPUs to modify
forced CPU capabilities at runtime. Choose the former, since modifying
forced capabilities outside of the boot path leaves the system in a
potentially inconsistent state.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
master commit: 0f3cdbdf7a7f33952121a5bde02cae2b3f703f67
master date: 2019-12-10 11:07:22 +0100
There's no reason to allocate the dec{32,64}table on the stack; it
just wastes a bunch of instructions setting them up and, of course,
also consumes quite a bit of stack. Using size_t for such small
integers is a little excessive.
The now inlined lz4_uncompress functions used to have a stack
footprint of 176 bytes (according to -fstack-usage); their inlinees
have increased their stack use from 32 bytes to 48 and 80 bytes,
respectively.
Jan Beulich [Mon, 9 Dec 2019 14:07:49 +0000 (15:07 +0100)]
lz4: refine commit 9143a6c55ef7 for the 64-bit case
I clearly went too far there: While the LZ4_WILDCOPY() instances indeed
need prior guarding, LZ4_SECURECOPY() needs this only in the 32-bit case
(where it simply aliases LZ4_WILDCOPY()). "cpy" can validly point
(slightly) below "op" in these cases, due to
cpy = op + length - (STEPSIZE - 4);
where length can be as low as 0 and STEPSIZE is 8. However, instead of
removing the check via "#if !LZ4_ARCH64", refine it such that it would
also properly work in the 64-bit case, aborting decompression instead
of continuing on bogus input.
Reported-by: Mark Pryor <pryorm09@gmail.com> Reported-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Tested-by: Mark Pryor <pryorm09@gmail.com> Tested-by: Jeremi Piotrowski <jeremi.piotrowski@gmail.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
master commit: 2d7572cdfa4d481c1ca246aa1ce5239ccae7eb59
master date: 2019-12-09 14:01:25 +0100
Andrew Cooper [Wed, 11 Sep 2019 19:12:31 +0000 (20:12 +0100)]
docs/sphinx: License content with CC-BY-4.0
Creative Commons is a more common license than GPL for documentation purposes.
Switch to using CC-BY-4.0 to explicitly permit re-purposing and remixing of
the content.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Lars Kurth <lars.kurth@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monne [Tue, 3 Dec 2019 10:33:52 +0000 (11:33 +0100)]
automation: increase tests maximum time from 10s to 30s
10s is too low for the clang tests, this is the output from a clang
test:
(XEN) [ 6.512748] ***************************************************
(XEN) [ 6.513323] SELFTEST FAILURE: CORRECT BEHAVIOR CANNOT BE GUARANTEED
(XEN) [ 6.513891] ***************************************************
(XEN) [ 6.514469] 3... 2... 1...
(XEN) [ 9.520011] *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) [ 9.544319] Freed 488kB init memory
--- Xen Test Framework ---
Environment: HVM 32bit (PAE 3 levels)
Hello World
Test result: SUCCESS
(XEN) [ 9.610977] Hardware Dom0 halted: halting machine
As can be seen from the output above booting Xen and the XTF test
takes ~10s, without accounting for the time it takes for QEMU to
initialize.
Increase the timeout to 30s to be on the safe side.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monne [Tue, 3 Dec 2019 10:33:51 +0000 (11:33 +0100)]
automation: add timestamps to Xen tests
Enable Xen timestamps in the automated Xen tests, this is helpful in
order to figure out if Xen is stuck or just slow in the automated
tests.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: Wei Liu <wl@xen.org> Release-acked-by: Juergen Gross <jgross@suse.com>
Yi Sun [Mon, 2 Dec 2019 07:24:48 +0000 (15:24 +0800)]
x86/psr: fix bug which may cause crash
During testing, we found a crash on Xen with the below trace.
(XEN) Xen call trace:
(XEN) [<ffff82d0802a065a>] R psr.c#l3_cdp_write_msr+0x1e/0x22
(XEN) [<ffff82d0802a0858>] F psr.c#do_write_psr_msrs+0x6d/0x109
(XEN) [<ffff82d08023e000>] F smp_call_function_interrupt+0x5a/0xac
(XEN) [<ffff82d0802a2b89>] F call_function_interrupt+0x20/0x34
(XEN) [<ffff82d080282c64>] F do_IRQ+0x175/0x6ae
(XEN) [<ffff82d08038b8ba>] F common_interrupt+0x10a/0x120
(XEN) [<ffff82d0802ec616>] F cpu_idle.c#acpi_idle_do_entry+0x9d/0xb1
(XEN) [<ffff82d0802ecc01>] F cpu_idle.c#acpi_processor_idle+0x41d/0x626
(XEN) [<ffff82d08027353b>] F domain.c#idle_loop+0xa5/0xa7
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 20:
(XEN) GENERAL PROTECTION FAULT
(XEN) [error_code=0000]
(XEN) ****************************************
The bug happens when CDP and MBA co-exist and MBA COS_MAX is bigger
than CDP COS_MAX. E.g. MBA has 8 COS registers but CDP only has 6.
When setting MBA throttling value for the 7th guest, the value array
would be:
+------------------+------------------+--------------+
| Data default val | Code default val | MBA throttle |
+------------------+------------------+--------------+
Then, COS id 7 will be selected for writing the values. We should
avoid writing the CDP data/code values to the COS id 7 MSRs because that
exceeds the CDP COS_MAX.
Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Fri, 29 Nov 2019 17:24:45 +0000 (17:24 +0000)]
Rationalize max_grant_frames and max_maptrack_frames handling
Xen used to have single, system-wide limits for the number of grant
frames and maptrack frames a guest was allowed to create. Increasing
or decreasing this single limit on the Xen command-line would change
the limit for all guests on the system.
Later, per-domain limits for these values were introduced. The system-wide
limits became strict limits: domains could not be created with higher
limits, but could be created with lower limits. However, that change
also introduced a range of different "default" values into various
places in the toolstack:
- The python libxc bindings hard-coded these values to 32 and 1024,
respectively
- The libxl default values are 32 and 1024 respectively.
- xl will use the libxl default for maptrack, but does its own default
calculation for grant frames: either 32 or 64, based on the max
possible mfn.
These defaults interact poorly with the hypervisor command-line limit:
- The hypervisor command-line limit cannot be used to raise the limit
for all guests anymore, as the default in the toolstack will
effectively override this.
- If you use the hypervisor command-line limit to *reduce* the limit,
then the "default" values generated by the toolstack are too high,
and all guest creations will fail.
In other words, the toolstack defaults require any change to be
effected by having the admin explicitly specify a new value in every
guest.
In order to address this, have grant_table_init treat negative values
for max_grant_frames and max_maptrack_frames as instructions to use the
system-wide default, and have all the above toolstacks default to passing
-1 unless a different value is explicitly configured.
This restores the old behavior in that changing the hypervisor command-line
option can change the behavior for all guests, while retaining the ability
to set per-guest values. It also removes the bug that reducing the
system-wide max will cause all domains without explicit limits to fail.
NOTE: - The Ocaml bindings require the caller to always specify a value,
and the code to start a xenstored stubdomain hard-codes these to 4
and 128 respectively; this behaviour will not be modified.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Signed-off-by: Paul Durrant <pdurrant@amazon.com> Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com> Acked-by: Wei Liu <wl@xen.org> Acked-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
(cherry picked from commit f2ae59bc4b9b5c3f12de86aa42cdf413d2c3ffbf)
Paul Durrant [Wed, 27 Nov 2019 17:11:43 +0000 (17:11 +0000)]
x86 / iommu: set up a scratch page in the quarantine domain
This patch introduces a new iommu_op to facilitate a per-implementation
quarantine set up, and then further code for x86 implementations
(amd and vtd) to set up a read-only scratch page to serve as the source
for DMA reads whilst a device is assigned to dom_io. DMA writes will
continue to fault as before.
The reason for doing this is that some hardware may continue to re-try
DMA (despite FLR) in the event of an error, or even BME being cleared, and
will fail to deal with DMA read faults gracefully. Having a scratch page
mapped will allow pending DMA reads to complete and thus such buggy
hardware will eventually be quiesced.
NOTE: These modifications are restricted to x86 implementations only as
the buggy h/w I am aware of is only used with Xen in an x86
environment. ARM may require similar code but, since I am not
aware of the need, this patch does not modify any ARM implementation.
Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Fri, 29 Nov 2019 16:20:06 +0000 (17:20 +0100)]
console: avoid buffer overrun in guest_console_write()
conring_puts() has been requiring a nul-terminated string, which the
local kbuf[] doesn't get set for anymore. Add a length parameter to the
function, just like was done for others, thus allowing embedded nul to
also be read through XEN_SYSCTL_readconsole.
While there drop a stray cast: Both operands of - are already uint32_t.
Jan Beulich [Fri, 29 Nov 2019 16:18:33 +0000 (17:18 +0100)]
console: avoid buffer overflow in guest_console_write()
The switch of guest_console_write()'s second parameter from plain to
unsigned int has caused the function's main loop header to no longer
guard the min_t() use within the function against effectively negative
values, due to the casts hidden inside the macro. Replace by a plain
min(), casting one of the arguments as necessary.
Fixes: ea601ec9995b ("xen/console: Rework HYPERCALL_console_io interface")
Reported-by: Ilja Van Sprundel <ivansprundel@ioactive.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: Julien Grall <julien@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
master commit: aaf8839fdf8b9b1a93a3837b82f680adea1b297c
master date: 2019-11-29 17:08:20 +0100
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/svm: Write the correct %eip into the outgoing task
The TASK_SWITCH vmexit has fault semantics, and doesn't provide any NRIPs
assistance with instruction length. As a result, any instruction-induced task
switch has the outgoing task's %eip pointing at the instruction which caused
the switch, rather than after it.
This causes callers of task gates to livelock (repeatedly execute the call/jmp
to enter the task), and any restartable task to become a nop after its first
use (the (re)entry state points at the iret used to exit the task).
32bit Windows in particular is known to use task gates for NMI handling, and
to use NMI IPIs.
In the task switch handler, distinguish instruction-induced from
interrupt/exception-induced task switches, and decode the instruction under
%rip to calculate its length.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Mon, 25 Nov 2019 19:33:36 +0000 (19:33 +0000)]
x86/svm: Always intercept ICEBP
ICEBP isn't handled well by SVM.
The VMexit state for a #DB-vectored TASK_SWITCH has %rip pointing to the
appropriate instruction boundary (fault or trap, as appropriate), except for
an ICEBP-induced #DB TASK_SWITCH, where %rip points at the ICEBP instruction
rather than after it. As ICEBP isn't distinguished in the vectoring event
type, the state is ambiguous.
To add to the confusion, an ICEBP which occurs due to Introspection
intercepting the instruction, or from x86_emulate() will have %rip updated as
a consequence of partial emulation required to inject an ICEBP event in the
first place.
We could in principle spot the non-injected case in the TASK_SWITCH handler,
but this still results in complexity if the ICEBP instruction also has an
Instruction Breakpoint active on it (which genuinely has fault semantics).
Unconditionally intercept ICEBP. This does have NRIPs support as it is an
instruction intercept, which allows us to move %rip forwards appropriately
before the TASK_SWITCH intercept is hit. This makes #DB-vectored switches
have consistent behaviour however the ICEBP #DB came about, and avoids special
cases in the TASK_SWITCH intercept.
This in turn allows for the removal of the conditional
hvm_set_icebp_interception() logic used by the monitor subsystem, as ICEBP's
will now always be submitted for monitoring checks.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Alexandru Isaila <aisaila@bitdefender.com>
Reviewed-by: Petre Pircalabu <ppircalabu@bitdefender.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 21 Nov 2019 17:22:52 +0000 (17:22 +0000)]
x86/vtx: Fix fault semantics for early task switch failures
The VT-x task switch handler adds inst_len to %rip before calling
hvm_task_switch(), which is problematic in two ways:
1) Early faults (i.e. ones delivered in the context of the old task) get
delivered with trap semantics, and break restartability.
2) The addition isn't truncated to 32 bits. In the corner case of a task
switch instruction crossing the 4G->0 boundary taking an early fault (with
trap semantics), a VMEntry failure will occur due to %rip being out of
range.
Instead, pass the instruction length into hvm_task_switch() and write it into
the outgoing TSS only, leaving %rip in its original location.
For now, pass 0 on the SVM side. This highlights a separate preexisting bug
which will be addressed in the following patch.
While adjusting call sites, drop the unnecessary uint16_t cast.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Thu, 28 Nov 2019 16:47:25 +0000 (17:47 +0100)]
build: provide option to disambiguate symbol names
The .file assembler directives generated by the compiler do not include
any path components (gcc) or just the ones specified on the command line
(clang, at least version 5), and hence multiple identically named source
files (in different directories) may produce identically named static
symbols (in their kallsyms representation). The binary diffing algorithm
used by xen-livepatch, however, depends on having unique symbols.
Make the ENFORCE_UNIQUE_SYMBOLS Kconfig option control the (build)
behavior, and if enabled use objcopy to prepend the (relative to the
xen/ subdirectory) path to the compiler invoked STT_FILE symbols. Note
that this build option is made to no longer depend on LIVEPATCH, but merely
defaults to its setting now.
Conditionalize explicit .file directive insertion in C files where it
exists just to disambiguate names in a less generic manner; note that
at the same time the redundant emission of STT_FILE symbols gets
suppressed for clang. Assembler files as well as multiply compiled C
ones using __OBJECT_FILE__ are left alone for the time being.
Since we now expect there not to be any duplicates anymore, also don't
force the selection of the option to 'n' anymore in allrandom.config.
Similarly COVERAGE no longer suppresses duplicate symbol warnings if
enforcement is in effect, which in turn allows
SUPPRESS_DUPLICATE_SYMBOL_WARNINGS to simply depend on
!ENFORCE_UNIQUE_SYMBOLS.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Jan Beulich [Thu, 28 Nov 2019 14:14:03 +0000 (15:14 +0100)]
x86/IRQ: make internally used IRQs also honor the pending EOI stack
At the time the pending EOI stack was introduced there were no
internally used IRQs which would have the LAPIC EOI issued from the
->end() hook. This had then changed with the introduction of IOMMUs,
but the interaction issue was presumably masked by
irq_guest_eoi_timer_fn() frequently EOI-ing interrupts way too early
(which got fixed by 359cf6f8a0ec ["x86/IRQ: don't keep EOI timer
running without need"]).
The problem is that with us re-enabling interrupts across handler
invocation, a higher priority (guest) interrupt may trigger while
handling a lower priority (internal) one. The EOI issued from
->end() (for ACKTYPE_EOI kind interrupts) would then mistakenly
EOI the higher priority (guest) interrupt, breaking (among other
things) pending EOI stack logic's assumptions.
Notes:
- In principle we could get away without the check_eoi_deferral flag.
I've introduced it just to make sure there's as little change as
possible to unaffected paths.
- Similarly the cpu_has_pending_apic_eoi() check in do_IRQ() isn't
strictly necessary.
- The new function's name isn't very helpful with its use in
end_level_ioapic_irq_new(). I did also consider eoi_APIC_irq() (to
parallel ack_APIC_irq()), but then liked this even less.
Reported-by: Igor Druzhinin <igor.druzhinin@citrix.com>
Diagnosed-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Roger Pau Monné [Thu, 28 Nov 2019 10:58:25 +0000 (11:58 +0100)]
x86/vmx: always sync PIR to IRR before vmentry
When using posted interrupts on Intel hardware it's possible that the
vCPU resumes execution with a stale local APIC IRR register because
depending on the interrupts to be injected vlapic_has_pending_irq
might not be called, and thus PIR won't be synced into IRR.
Fix this by making sure PIR is always synced to IRR in
hvm_vcpu_has_pending_irq regardless of what interrupts are pending.
Reported-by: Joe Jin <joe.jin@oracle.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Tested-by: Joe Jin <joe.jin@oracle.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Sergey Dyasli [Wed, 27 Nov 2019 10:04:30 +0000 (10:04 +0000)]
x86/microcode: refuse to load the same revision ucode
Currently, if a user tries to live-load the same or an older ucode revision
than the CPU already has, they will get a single message in the Xen log like:
(XEN) 128 cores are to update their microcode
No actual ucode loading will happen and this situation can be quite
confusing. Fix this by starting ucode update only when the provided
ucode revision is higher than the currently cached one (if any).
This is based on the property that if microcode_cache exists, all CPUs
in the system should have at least that ucode revision.
Additionally, print a user friendly message if no matching or newer
ucode can be found in the provided blob. This also requires ignoring
-ENODATA in AMD-side code, otherwise the message given to the user is:
(XEN) Parsing microcode blob error -61
This actually means that the ucode blob was parsed fine, but no matching
ucode was found.
Igor Druzhinin [Tue, 26 Nov 2019 17:08:19 +0000 (17:08 +0000)]
AMD/IOMMU: honour IR setting while pre-filling DTEs
The IV bit shouldn't be set in the DTE if interrupt remapping is not
enabled. This is a regression in the behaviour of the "iommu=no-intremap"
option, which otherwise would keep interrupt requests untranslated
for all of the devices in the system, regardless of whether a device is
described as valid in the IVRS or not.
Signed-off-by: Igor Druzhinin <igor.druzhinin@citrix.com> Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Tue, 26 Nov 2019 15:49:20 +0000 (15:49 +0000)]
docs/xl: Document pci-assignable state
Changesets 319f9a0ba9 ("passthrough: quarantine PCI devices") and ba2ab00bbb ("IOMMU: default to always quarantining PCI devices")
introduced PCI device "quarantine" behavior, but did not document how
the pci-assignable-add and -remove functions act in regard to this.
Rectify this.
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Reviewed-by: Paul Durrant <paul@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 26 Nov 2019 13:17:45 +0000 (14:17 +0100)]
EFI: fix "efi=attr=" handling
Commit 633a40947321 ("docs: Improve documentation and parsing for efi=")
failed to honor the strcmp()-like return value convention of
cmdline_strcmp().
Reported-by: Roman Shaposhnik <roman@zededa.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Wei Liu <wl@xen.org>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
There are two mappings active in the middle of do_recalc(), and hence
commit 0d0f4d78e5d1 ("p2m: change write_p2m_entry to return an error
code") should have added (or otherwise invoked) unmapping code just
like it did in p2m_next_level(), despite us not expecting any errors
here. Arrange for the existing unmap invocation to take effect in all
cases.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Anthony PERARD [Tue, 26 Nov 2019 13:16:09 +0000 (14:16 +0100)]
x86/domctl: have XEN_DOMCTL_getpageframeinfo3 preemptible
This hypercall can take a long time to finish because it attempts to
grab the `hostp2m' lock up to 1024 times. The accumulated wait for the
lock can take several seconds.
This can easily happen with a guest with 32 vcpus and plenty of RAM,
during localhost migration.
While the patch doesn't fix the problem with the lock contention and
the fact that the `hostp2m' lock is currently global (and not on a
single page), it is still an improvement to the hypercall. It will in
particular, down the road, allow dropping the arbitrary limit of 1024
entries per request.
Signed-off-by: Anthony PERARD <anthony.perard@citrix.com> Acked-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Jan Beulich [Tue, 26 Nov 2019 13:15:01 +0000 (14:15 +0100)]
IOMMU: default to always quarantining PCI devices
XSA-302 relies on the use of libxl's "assignable-add" feature to prepare
devices to be assigned to untrusted guests.
Unfortunately, this is not considered a strictly required step for
device assignment. The PCI passthrough documentation on the wiki
describes alternate ways of preparing devices for assignment, and
libvirt uses its own ways as well. Hosts where these alternate methods
are used will still leave the system in a vulnerable state after the
device comes back from a guest.
Default to always quarantining PCI devices, but provide a command line
option to revert back to prior behavior (such that people who both
sufficiently trust their guests and want to be able to use devices in
Dom0 again after they had been in use by a guest wouldn't need to
"manually" move such devices back from DomIO to Dom0).
This is XSA-306.
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Jan Beulich <jbeulich@suse.com> Reviewed-by: Wei Liu <wl@xen.org>
George Dunlap [Tue, 26 Nov 2019 10:32:42 +0000 (10:32 +0000)]
x86: Don't increase ApicIdCoreSize past 7
Changeset ca2eee92df44 ("x86, hvm: Expose host core/HT topology to HVM
guests") attempted to "fake up" a topology which would induce guest
operating systems to not treat vcpus as sibling hyperthreads. This
involved actually reporting hyperthreading as available, but giving
vcpus every other ApicId; which in turn led to doubling the ApicIds
per core by bumping the ApicIdCoreSize by one. In particular, Ryzen
3xxx series processors, and reportedly EPYC "Rome" cpus, have an
ApicIdCoreSize of 7; the "fake" topology increases this to 8.
Unfortunately, Windows running on modern AMD hardware -- including
Ryzen 3xxx series processors, and reportedly EPYC "Rome" cpus --
doesn't seem to cope with this value being higher than 7. (Linux
guests have so far continued to cope.)
A "proper" fix is complicated and it's too late to fix it either for
4.13, or to backport to supported branches. As a short-term fix,
limit this value to 7.
This does mean that a Linux guest, booted on such a system without
this change, and then migrating to a system with this change, with
more than 64 vcpus, would see an apparent topology change. This is a
low enough risk in practice that enabling this limit unilaterally, to
allow other guests to boot without manual intervention, is worth it.
Reported-by: Steven Haigh <netwiz@crc.id.au>
Reported-by: Andreas Kinzler <hfp@posteo.de>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
George Dunlap [Fri, 22 Nov 2019 18:52:02 +0000 (18:52 +0000)]
x86/mm: Adjust linear uses / entries when a page loses validation
"Linear pagetables" is a technique which involves either pointing a
pagetable at itself, or to another pagetable the same or higher level.
Xen has limited support for linear pagetables: A page may either point
to itself, or point to another page of the same level (i.e., L2 to L2,
L3 to L3, and so on).
XSA-240 introduced an additional restriction that limited the "depth"
of such chains by allowing pages to either *point to* other pages of
the same level, or *be pointed to* by other pages of the same level,
but not both. To implement this, we keep track of the number of
outstanding times a page points to, or is pointed to by, another page
table, to prevent both from happening at the same time.
Additionally, XSA-299 introduced a mode whereby if a page was known to
have been only partially validated, _put_page_type() would be called
with PTF_partial_set, indicating that if the page had been
de-validated by someone else, the type count should be left alone.
Unfortunately, this change did not account for the required accounting
for linear page table uses and entries; in the case that a previously
partially-devalidated pagetable was fully-devalidated by someone else,
the linear_pt_counts are not updated.
This could happen in one of two places:
1. In the case a partially-devalidated page was re-validated by
someone else
2. During domain tear-down, when pages are force-invalidated while
leaving the type count intact.
The second could be ignored, since at that point the pages can no
longer be abused; but the first requires handling. Note however that
this would not be a security issue: having the counts be too high is
overly strict (i.e., will prevent a page from being used in a way
which is perfectly safe), but shouldn't cause any other issues.
Fix this by adjusting the linear counts when a page loses validation,
regardless of whether the de-validation completed or was only partial.
Signed-off-by: George Dunlap <george.dunlap@citrix.com> Reviewed-by: Jan Beulich <jbeulich@suse.com> Release-acked-by: Juergen Gross <jgross@suse.com>
libxl: make default path to add/remove all PV devices
Adding/removing a device is currently handled for specific devices only:
VBD, VIF, QDISK. This commit adds a default case to handle adding/removing
of all PV devices, except the QDISK device, which requires special handling.
If any other device requires special handling, it should be done by
implementing a separate case (similar to the QDISK device). The default
behaviour on adding a device is to wait until the backend goes to
XenbusStateInitWait, and the default behaviour on removing a device is to
start the generic device removal procedure.
This commit also fixes the guest removal function: previously, the guest
was removed once all VIF and VBD devices had been removed. The fix removes
the guest only when all created devices have been removed. This is done by
checking the guest device list instead of checking num_vifs and num_vbds;
the num_vifs and num_vbds variables are removed as redundant.
Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Wei Liu <wl@xen.org>
Release-acked-by: Juergen Gross <jgross@suse.com>
There are two kinds of VKBD devices: those with a QEMU backend and those
with a user-space PV backend. In the current implementation they can't be
distinguished, as both use the VKBD backend type. As a result, the
user-space PV KBD backend is started and stopped as a QEMU backend. This
commit adds a new device kind, VINPUT, to be used as the backend type for
the user-space PV KBD backend.
Signed-off-by: Oleksandr Grytsov <oleksandr_grytsov@epam.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Thu, 21 Nov 2019 18:21:49 +0000 (18:21 +0000)]
x86/vvmx: Fix livelock with XSA-304 fix
It turns out that the XSA-304 / CVE-2018-12207 fix of disabling executable
superpages doesn't work well with the nested p2m code.
Nested virt is experimental and not security supported, but is useful for
development purposes. In order to not regress the status quo, disable the
XSA-304 workaround until the nested p2m code can be improved.
Introduce a per-domain exec_sp control and set it based on the current
opt_ept_exec_sp setting. Take the opportunity to omit a PVH hardware domain
from the performance hit, because it is already permitted to DoS the system in
such ways as issuing a reboot.
When nested virt is enabled on a domain, force it to use executable
superpages and rebuild the p2m.
Having the setting per-domain involves rearranging the internals of
parse_ept_param_runtime() but it still retains the same overall semantics -
for each applicable domain whose setting needs to change, rebuild the p2m.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> Acked-by: George Dunlap <george.dunlap@citrix.com> Release-acked-by: Juergen Gross <jgross@suse.com>
Andrew Cooper [Tue, 5 Nov 2019 19:08:14 +0000 (19:08 +0000)]
x86/livepatch: Prevent patching with active waitqueues
The safety of livepatching depends on every stack having been unwound, but
there is one corner case where this is not true. The Sharing/Paging/Monitor
infrastructure may use waitqueues, which copy the stack frame sideways and
longjmp() to a different vcpu.
This case is rare, and can be worked around by pausing the offending
domain(s), waiting for their rings to drain, then performing a livepatch.
In the case that there is an active waitqueue, fail the livepatch attempt with
-EBUSY, which is preferable to the fireworks which occur from trying to unwind
the old stack frame at a later point.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Release-acked-by: Juergen Gross <jgross@suse.com>